tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920882592
Thanks for the discussions. Before we suggest a resolution, it would be helpful to summarize the discussions so far.
# Semantics of Allocate and storage_scope
Allocate at the moment does not specify the way to allocate the memory. The lowering phase is free to pick a runtime-dependent method or a more platform-native one (e.g. stack alloca or direct declaration of shared memory).
In practice, we see two different kinds of needs:
- N0: The need to optimize "device"-level code.
- N1: The need to allocate memory from the host side, possibly in a runtime-dependent way.
N0 is needed to get the best-performing kernel, since a native way of allocation that the compiler understands gives the best chance for follow-up optimizations. This is the case for CPU-related optimizations. Note that such optimization is also needed in the micro setting, when we use TIR to generate kernel code that requests temp scratch memory for output tiling.
N1 is needed for "driving" the set of computations. In the case of GPU, we have to rely on a host-side driver function. In the case of micro AOT, it is desirable to allocate using platform-specific memory.
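The N0/N1 split above can be sketched as a small decision rule. This is a toy illustration, not TVM code: the function name, parameters, and the size threshold are all hypothetical, chosen only to make the distinction concrete.

```python
# Toy sketch (not TVM code) of the N0/N1 split described above.
# All names and the 4 KiB threshold are hypothetical.

def choose_allocation(size_bytes, on_device, constant_size):
    """Pick an allocation strategy for a single Allocate node.

    N0: device-level code benefits from a native, compiler-visible
        allocation (e.g. stack alloca), enabling follow-up optimizations.
    N1: host-side "driver" code may need a runtime-dependent allocation
        (e.g. a device API call).
    """
    if on_device and constant_size and size_bytes <= 4096:
        return "stack_alloca"          # N0: native, visible to the compiler
    return "runtime_device_alloc"      # N1: delegated to the runtime/driver

print(choose_allocation(256, True, True))        # small device temp -> N0
print(choose_allocation(1 << 20, False, True))   # host-side buffer  -> N1
```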
One possible approach is to ask the user to differentiate these two kinds of allocations. However, to give users maximum convenience, TVM allows a mixed host/device programming model like CUDA's and hides that complexity from the user. The boundary between N0 and N1 can also be blurred, e.g. a kernel can request an allocation from platform-dependent memory.
On specialized target devices, we also see an additional need:
- N2: Allocating memory with special semantics.

For example, unified device pinned memory that is accessible from both NPU and CPU, or a specialized texture memory or shared memory. The request in N2 is quite different and brings additional requirements on how we allocate the memory and how the memory can be used and represented in codegen.
Right now N2 can be covered by introducing a special memory tag -- "global.npu_unified". This usually lets TVM know that such memory needs to be lowered by special target-dependent passes, and that the generic lowering should leave those allocations out.
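Such tagged scopes follow a "base.tag" string convention, as in "global.npu_unified" or the "global.workspace" tag discussed below. A minimal sketch of splitting such a string (an illustration of the convention, not TVM's actual StorageScope implementation):

```python
# Split a storage scope string of the form "base.tag" into its parts,
# following the convention used by tags like "global.npu_unified".
# Illustrative only; not TVM's StorageScope implementation.

def split_scope(scope: str):
    """Return (base_scope, tag); tag is "" when the scope is untagged."""
    base, _, tag = scope.partition(".")
    return base, tag

print(split_scope("global.npu_unified"))  # ('global', 'npu_unified')
print(split_scope("global"))              # ('global', '')
```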
## Possible Resolutions
Based on the discussions so far, there are two suggested resolutions, plus an orthogonal addition:
- R0: Separate out a "local" scope that carries the stack-allocation behavior.
- R1: Keep the "global" scope as it is, and introduce a special tagged scope "global.workspace" that represents a global memory request specifically for workspace memory, along with lowerings for it. For specialized memory (e.g. NPU unified), introduce separate memory scopes.
- R2: Introduce a target-specific attribute that marks the possible stack-alloca size for lowering the R0 ("local") or R1 ("global") allocations. Note that R2 can be done in addition to R0 or R1.
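To make the R1 direction concrete, here is a hedged sketch of how a lowering pass might dispatch on the scope string: plain "global" keeps its current native lowering, "global.workspace" becomes a runtime workspace request (e.g. via `TVMBackendAllocWorkspace`), and other tagged scopes are deferred to target-dependent passes. The dispatch function and its return strings are hypothetical stand-ins for the real lowering.

```python
# Hedged sketch of an R1-style lowering decision. The function and the
# returned descriptions are hypothetical; only the scope strings and
# TVMBackendAllocWorkspace come from the discussion/TVM runtime.

def lower_allocation(scope: str):
    if scope == "global.workspace":
        # R1: workspace memory goes through the runtime workspace API.
        return "call: TVMBackendAllocWorkspace"
    if scope == "global":
        # Untagged global keeps the current native lowering.
        return "native allocation (e.g. stack alloca)"
    # Specialized tags (e.g. "global.npu_unified") are left for
    # target-dependent passes, as described for N2 above.
    return "defer to target-dependent pass"

print(lower_allocation("global.workspace"))
```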