[GitHub] [tvm] manupa-arm commented on issue #9022: [Bug] BuiltinLower has hard-coded heuristic for alloca which is not appropriate for all kDLCPU target devices

GitBox Mon, 20 Sep 2021 23:43:09 -0700


manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-923686813



   There are two sub-issues here, if you follow the discussion:
   
   P1: The AoT executor codegen creates tir.allocates that should be served by 
TVMBackendAllocWorkspace (TVMBAW) calls more generally.
   P2: CPU PrimFuncs generate stack allocations for some allocates based on a 
per-allocate heuristic.
   
   For P1,
   We are working on a PR (to be published soon) to make the AoT executor 
codegen create tir.allocates with the 'global.workspace' storage scope (as 
discussed above), so that they get lowered to TVMBAWs, which are 
device/memory-agnostic calls from a codegen perspective.
   
   For P2,
   We might need to control the stack-alloca heuristic for each compilation.
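   
   As a minimal sketch of what such per-compilation control could look like, 
the decision can be written as a function of compilation options rather than a 
fixed constant. The names AllocLoweringOptions and UseStackAlloca, and the 
1024-byte example limit, are hypothetical and chosen for illustration only; 
they are not existing TVM structures or APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-compilation options; not an existing TVM structure. */
typedef struct {
  bool allow_stack_alloca;         /* false: every allocate goes through TVMBAW */
  uint64_t max_stack_alloca_bytes; /* exclusive upper bound for one stack allocation */
} AllocLoweringOptions;

/* Returns true if an allocate of `const_elems` elements of `elem_bytes` bytes
 * each may be placed on the stack. Pass const_elems == 0 when the size is not
 * a compile-time constant. */
static bool UseStackAlloca(const AllocLoweringOptions* opts,
                           uint64_t const_elems, uint64_t elem_bytes) {
  if (!opts->allow_stack_alloca) return false;
  if (const_elems == 0 || elem_bytes == 0) return false;
  /* Reject sizes whose product would overflow uint64_t. */
  if (const_elems > UINT64_MAX / elem_bytes) return false;
  return const_elems * elem_bytes < opts->max_stack_alloca_bytes;
}
```

   With something along these lines, a compilation for a target with a small 
stack could set allow_stack_alloca to false (or lower the limit), while a host 
CPU build keeps the existing behaviour.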
   
   I feel that should solve the issues in scope here.
   
   **@areusch shall we move the discussion to another issue/discuss post about 
promoting/introducing a storage_scope that is higher-level than the current 
definition of 'global'?**
   
   I agree with you that the current definition of 'global' is a misnomer. 
There are certain memories (e.g. stack) that could be 'owned' by a TargetDevice 
-- in which case we could use \<TargetDevice\>.stack. In a way, the stack is 
special because it does not require an opaque call to get hold of the memory, 
so LLVM (or any other post-TVM compiler) can optimize it. Hence my emphasis on 
treating it specially with the 'local' storage_scope. However, I think it would 
be too big a change given the historical precedent in TVM, as @tqchen points 
out.
   
   However, in shared-memory SoCs, assigning ownership of memories to a single 
TargetDevice will not scale very well. For example, if we have an SRAM that is 
accessed by the CPU, DSP, and NPU in the same SoC, what should the prefix be? 
The approach we are taking with USMP (and discussed in the RFC) is to use the 
storage_scope global.<pool_name>, where pool_name points to a PoolInfo that 
describes which TargetDevices have access to it. Therefore, I believe we could 
make an Arena that is pinned to the memories in the runtime layer to serve each 
Pool via TVMBAWs. If USMP is invoked, the static-sized allocations will be 
translated to fixed offsets within a static buffer pinned to the same memories. 
I guess my point is that the provision of Arenas/buffers should be lifted out 
of the codegen to the application/runtime layer and interfaced with by the 
generated code via TVMBAWs (for non-static-sized allocates) or offsets (for 
static-sized allocates).
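   
   To make the two access paths concrete, here is a rough sketch of an 
application-owned pool, assuming a simple bump allocator and a fixed 16-byte 
alignment. Arena, ArenaAlloc and StaticAt are hypothetical names used for 
illustration; they are not TVM runtime APIs, and a real TVMBAW implementation 
would also need a matching free path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arena over an application-provided buffer pinned to one memory. */
typedef struct {
  uint8_t* base; /* start of the buffer (assumed suitably aligned) */
  size_t size;   /* total bytes in the buffer */
  size_t next;   /* bump offset for the next dynamic request */
} Arena;

/* Non-static-sized allocate: what a TVMBAW-style call could forward to.
 * Returns NULL when the arena cannot satisfy the request. */
static void* ArenaAlloc(Arena* arena, size_t nbytes) {
  const size_t align = 16; /* assumed alignment, relative to base */
  size_t start = (arena->next + align - 1) & ~(align - 1);
  if (start > arena->size || nbytes > arena->size - start) return NULL;
  arena->next = start + nbytes;
  return arena->base + start;
}

/* Static-sized allocate: a planner has already resolved it to a fixed offset
 * inside a static buffer, so the generated code only needs pointer arithmetic.
 * Returns NULL if the requested range falls outside the buffer. */
static void* StaticAt(uint8_t* buffer, size_t buffer_size,
                      size_t offset, size_t nbytes) {
  if (offset > buffer_size || nbytes > buffer_size - offset) return NULL;
  return buffer + offset;
}
```

   The point of the split is that neither path requires the codegen to know 
which physical memory backs the pool; the application decides where each buffer 
lives.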


