manupa-arm commented on issue #9022: URL: https://github.com/apache/tvm/issues/9022#issuecomment-923686813
So there are two sub-issues here, if you follow the discussion:

P1: The AoT executor codegen creates tir.allocates that should, more generally, be served by TVMBAW calls.

P2: The CPU PrimFuncs generate stack allocations for some allocates based on a per-allocate heuristic.

For P1, we are working on a PR (to be published soon) to make the AoT executor codegen create tir.allocate nodes with the 'global.workspace' storage scope (as discussed above), so that they get lowered to TVMBAWs, which are device/memory-agnostic calls from a codegen perspective. For P2, we might need a way to control the stack-alloca heuristic for each compilation. I feel that should solve the issues in scope.

**@areusch shall we move the discussion to another issue/discuss post about promoting/introducing a storage_scope that is higher than the current definition of 'global'?**

I agree with you that the current definition of 'global' is a misnomer.

There are certain memories (e.g. stack) that could be 'owned' by a TargetDevice, in which case we could use \<TargetDevice\>.stack. In a way, stack is special because it does not need an opaque call to get hold of the memory, so LLVM (or any other post-TVM compiler) can optimize it. Hence my emphasis on treating it specially with the 'local' storage_scope. However, I think that would be too big a change given the historical precedent in TVM, as @tqchen points out.

Moreover, in shared-memory SoCs, assigning ownership of memories to a TargetDevice will not scale very well. For example, if we have an SRAM that is accessed by the CPU, DSP, and NPU in the same SoC, what should the prefix be?

The current approach we are taking with USMP (and discussed in the RFC) is to use the storage_scope global.\<pool_name\>, where pool_name points to a PoolInfo that describes which TargetDevices have access to it. Therefore, I believe we could create an Arena that is pinned to the memories in the runtime layer to serve each Pool via TVMBAWs.
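To make the pool idea concrete, here is a minimal, self-contained Python sketch of a runtime layer that owns one arena per pool and resolves a global.\<pool_name\> storage scope to it. It is only an illustration under stated assumptions: the `PoolInfo` fields, the `Arena` and `Runtime` classes, and `alloc_workspace` are hypothetical names for this sketch, not TVM's actual API or the USMP implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolInfo:
    """Hypothetical pool description: a name plus the devices allowed to access it."""
    pool_name: str
    targets: frozenset
    size_bytes: int


class Arena:
    """Bump allocator over one pool, owned by the application/runtime layer."""

    def __init__(self, pool, alignment=16):
        self.pool = pool
        self.alignment = alignment
        self.top = 0  # next free byte in the pool

    def alloc(self, nbytes):
        # Round the current top up to the next aligned offset.
        start = -(-self.top // self.alignment) * self.alignment
        if start + nbytes > self.pool.size_bytes:
            raise MemoryError(f"pool '{self.pool.pool_name}' exhausted")
        self.top = start + nbytes
        return start


class Runtime:
    """Stands in for the layer that would serve workspace-allocation calls."""

    def __init__(self, pools):
        self.arenas = {p.pool_name: Arena(p) for p in pools}

    def alloc_workspace(self, storage_scope, nbytes, device):
        # Assumed scope format: "global.<pool_name>".
        prefix, _, pool_name = storage_scope.partition(".")
        if prefix != "global" or pool_name not in self.arenas:
            raise ValueError(f"unknown storage scope: {storage_scope}")
        arena = self.arenas[pool_name]
        if device not in arena.pool.targets:
            raise PermissionError(f"{device} cannot access pool '{pool_name}'")
        return pool_name, arena.alloc(nbytes)


# One SRAM shared by three devices needs no per-device prefix.
sram = PoolInfo("sram", frozenset({"cpu", "dsp", "npu"}), size_bytes=1024)
runtime = Runtime([sram])
first = runtime.alloc_workspace("global.sram", 100, "npu")
second = runtime.alloc_workspace("global.sram", 10, "cpu")
```

The point of the design is that the generated code only names a pool; which devices may use it and where the backing memory lives are decided outside the codegen.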
If USMP is invoked, the static-sized allocations will be translated to fixed offsets within a static buffer pinned to the same memories.

I guess my point is that the provision of Arenas/buffers should be lifted out of the codegen to the application/runtime layer and interfaced by generated code via TVMBAWs (for non-static-sized allocates) or offsets (for static-sized allocates).
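The split between static-sized and non-static-sized allocates can be sketched as follows. This is a deliberately naive illustration, not the USMP algorithm: `plan_static_offsets` and `resolve` are hypothetical helpers, buffers are simply laid out back to back with an assumed 16-byte alignment, and a real planner would likely also reuse space between buffers whose lifetimes do not overlap.

```python
def plan_static_offsets(static_allocs, alignment=16):
    """Give each (name, nbytes) pair a fixed aligned offset in one static buffer.

    Returns the offset table and the total buffer size required.
    """
    offsets = {}
    cursor = 0
    for name, nbytes in static_allocs:
        # Round the cursor up to the next aligned offset.
        cursor = -(-cursor // alignment) * alignment
        offsets[name] = cursor
        cursor += nbytes
    return offsets, cursor


def resolve(name, nbytes, offsets, alloc_workspace):
    """Static-sized allocates use their planned offset; others go through a runtime call."""
    if name in offsets:
        return ("offset", offsets[name])
    return ("workspace", alloc_workspace(nbytes))


# Sizes known at compile time are planned ahead of execution.
offsets, total = plan_static_offsets([("conv_out", 100), ("pool_out", 40)])

# A size only known at run time falls back to a (hypothetical) workspace callback.
dynamic = resolve("nms_out", 24, offsets, alloc_workspace=lambda n: f"handle:{n}")
static = resolve("pool_out", 40, offsets, alloc_workspace=lambda n: f"handle:{n}")
```

Either way, the buffer or arena behind the allocation is provided by the application/runtime layer rather than by the generated code.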
