areusch edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-923290583


   apologies for the delay in reply here.
   
   I agree that allocating on the stack is, in the current GraphExecutor world, 
a codegen-level optimization. In my mind, the introduction of AOTExecutor (and 
actually also VMExecutor) means that this is moving one level up. Here is why:
   - previously, the maximum effective call depth of a TIR function was 1, 
meaning that codegen could decide *then and there* that it made sense to move a 
particular local `tir.allocate` node onto the stack.
   - with the introduction of AOT and control flow VMExecutor, such 
`tir.allocate` may now live while another TIR function is being invoked e.g. 
with `tir.call_packed`.
   
   Now that this is the case, it doesn't make sense to make stack-allocation of 
`tir.allocate` a codegen-level optimization. It must be done prior to codegen 
using graph-level memory analysis e.g. USMP.
   
   The question then is how to model this in the program. I think using 
`storage_scope` does make sense. The definition I've been working with of a 
`storage_scope` is:
   > A contiguous region of memory, potentially with a maximum size, which can 
be used by operator implementations on a set of `tvm.Device`.
   
   Under this definition, a special `<context>.stack` `storage_scope` could be 
created with a target-specific bound on the amount of memory available. I do 
not think we should create a default `global` scope anywhere, as it's unlikely 
a global scope is truly ever global except in the case of single-CPU homogeneous 
model execution. We just haven't modeled this yet in TVM, as we don't have an 
explicit mapping indicating which storage_scopes are accessible by which device. 
The closest thing to a global scope I can think of would be something like 
`executor` scope, which could be used to place `DLTensor` used by the AOT 
executor for intermediate tensors.
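   To make that definition concrete, here is a minimal sketch in plain Python (not an existing TVM API; the scope names, device names, and the 64 KiB bound are all illustrative) of how a `storage_scope` with a size bound and a device set might be modeled:

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class StorageScope:
    """A contiguous region of memory, potentially with a maximum size,
    usable by operator implementations on a set of devices."""
    name: str
    devices: FrozenSet[str]
    max_size_bytes: Optional[int] = None  # None = no modeled bound

# A target-specific stack scope with a bound on available stack RAM
# (the 64 KiB figure is purely illustrative):
dsp_stack = StorageScope("dsp.stack", frozenset({"dsp"}), 64 * 1024)

# An "executor" scope for DLTensors holding the AOT executor's
# intermediate tensors; modeled here without an explicit bound:
executor = StorageScope("executor", frozenset({"always-on-cpu", "dsp"}))
```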
   
   Something we still have yet to address is how to configure the 
`storage_scope` made available for a particular Target. I think the 
Target-specific attributes may not quite cut it, as `llvm` obviously would have 
different amounts of stack RAM available based on the selected CPU. And, the 
user will need to be able to override this as well, since sometimes linking 
with different software libraries reduces the stack RAM available (e.g. some 
libraries require more global memory and therefore take away from that RAM 
available for stack usage; or other times, some libraries require massive 
interrupt stacks and a larger budget must be reserved to mitigate the risk of 
overwriting global memory due to an interrupt).
   
   It's likely that a scalable solution in micro-land is for the Project 
API server to provide a boilerplate set of `storage_scope` in e.g. a 
`server_info_query` call; we will then likely also need a way for 
users to override this.
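   As a sketch of how that could fit together (the schema and field names here are hypothetical; `server_info_query` does not report storage scopes today), with user overrides taking precedence over the server's boilerplate set:

```python
def merge_scopes(server_defaults, user_overrides):
    """Merge scope definitions: user-provided entries take precedence
    over the boilerplate set reported by the Project API server."""
    merged = {s["name"]: s for s in server_defaults}
    for s in user_overrides:
        merged[s["name"]] = s
    return list(merged.values())

# Hypothetical boilerplate scopes a server_info_query call might return:
server_defaults = [
    {"name": "dsp.stack", "max_size_bytes": 64 * 1024},
    {"name": "executor", "max_size_bytes": None},
]

# e.g. linking a library with a large interrupt stack shrinks the
# stack budget, so the user overrides the server-provided bound:
user_overrides = [{"name": "dsp.stack", "max_size_bytes": 32 * 1024}]

scopes = merge_scopes(server_defaults, user_overrides)
```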
   
   Lastly, as to the `<context>` prefix mentioned above: there are many cases 
when multiple stacks exist on a device:
   - multi-core execution
   - parallel thread execution (e.g. stack viewed as a sort of thread-local 
storage)
   
   in the former case, it's likely that `<context>` should be replaced with the 
`name` given to the TargetDevice e.g. in 
https://github.com/apache/tvm/pull/8892. For example `dsp.stack` or 
`always-on-cpu.stack`.
   
   in the latter case, we probably additionally need a thread identifier e.g. 
`dsp.thread0.stack`.
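   A trivial sketch of that naming convention (the helper and its signature are hypothetical, not an existing TVM utility):

```python
from typing import Optional

def stack_scope_name(device: str, thread: Optional[int] = None) -> str:
    """Build a '<context>.stack' scope name: one stack per device,
    optionally qualified by a thread identifier."""
    if thread is None:
        return f"{device}.stack"
    return f"{device}.thread{thread}.stack"

stack_scope_name("dsp")            # -> "dsp.stack"
stack_scope_name("always-on-cpu")  # -> "always-on-cpu.stack"
stack_scope_name("dsp", thread=0)  # -> "dsp.thread0.stack"
```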
   
   Until we have USMP, my thoughts are that the short-term solution should be 
to stick with a codegen-level optimization and add an attribute which can be 
used to disable the stack-allocation optimization. What do you think @tqchen 
@manupa-arm ?

