tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920882592
Thanks for the discussions. Before we suggest a resolution, it would be helpful to summarize the discussions so far.
# Semantics of Allocate and storage_scope
Allocate at the moment does not specify the way to allocate the memory. The lowering phase is free to pick a runtime-dependent method or a more platform-native one (e.g. stack alloca or direct declaration of shared memory).
In practice, we see two different kinds of needs:
- N0: The need to optimize "device"-level code.
- N1: The need to allocate memory from the host side, possibly in a runtime-dependent way.
N0 is needed to get the best-performing kernel, since a native way of allocation that the compiler understands gives the best chance for follow-up optimizations. This is the case for CPU-related optimizations. Note that such optimization is also needed in the micro setting, when we use TIR to generate kernel code that requests temp scratch memory for output tiling.
N1 is needed for "driving" the set of computations. In the case of GPU, we have to rely on a host-side driver function. In the case of micro AOT, it is desirable to allocate using platform-specific memory.
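The N0/N1 split above can be sketched as a small decision rule. This is a toy illustration, not TVM code: the function name, parameters, and the size threshold are all hypothetical, chosen only to make the distinction concrete.

```python
# Toy sketch (not TVM code) of the N0/N1 split described above.
# All names and the 4 KiB threshold are hypothetical.

def choose_allocation(size_bytes, on_device, constant_size):
    """Pick an allocation strategy for a single Allocate node.

    N0: device-level code benefits from a native, compiler-visible
        allocation (e.g. stack alloca), enabling follow-up optimizations.
    N1: host-side "driver" code may need a runtime-dependent allocation
        (e.g. a device API call).
    """
    if on_device and constant_size and size_bytes <= 4096:
        return "stack_alloca"          # N0: native, visible to the compiler
    return "runtime_device_alloc"      # N1: delegated to the runtime/driver

print(choose_allocation(256, True, True))        # small device temp -> N0
print(choose_allocation(1 << 20, False, True))   # host-side buffer  -> N1
```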
One possible approach is to ask the user to differentiate these two kinds of allocations. However, to give users maximum convenience, TVM allows a mixed host/device programming model like CUDA's and hides that complexity from the user. The boundary between N0 and N1 can also be blurred, e.g. a kernel can request an allocation from platform-dependent memory.
On specialized target devices, we also see an additional need:
- N2: Allocating memory with special semantics.

For example, unified device pinned memory that is accessible from both NPU and CPU, or a specialized texture memory or shared memory. The request in N2 is quite different and brings additional requirements on how we allocate the memory and how the memory can be used and represented in codegen.
Right now N2 can be covered by introducing a special memory tag -- "global.npu_unified". This usually lets TVM know that such memory needs to be lowered by special target-dependent passes, and that the generic lowering should leave those allocations out.
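Such tagged scopes follow a "base.tag" string convention, as in "global.npu_unified" or the "global.workspace" tag discussed below. A minimal sketch of splitting such a string (an illustration of the convention, not TVM's actual StorageScope implementation):

```python
# Split a storage scope string of the form "base.tag" into its parts,
# following the convention used by tags like "global.npu_unified".
# Illustrative only; not TVM's StorageScope implementation.

def split_scope(scope: str):
    """Return (base_scope, tag); tag is "" when the scope is untagged."""
    base, _, tag = scope.partition(".")
    return base, tag

print(split_scope("global.npu_unified"))  # ('global', 'npu_unified')
print(split_scope("global"))              # ('global', '')
```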
## Possible Resolutions
Based on the discussions so far, there are two suggested resolutions, plus an orthogonal addition:
- R0: Separate out a "local" scope that carries the stack-allocation behavior.
- R1: Keep the "global" scope as it is, and introduce a special tagged scope "global.workspace" that represents a global memory request specifically for workspace memory, along with lowerings for it. For specialized memory (e.g. NPU unified), introduce separate memory scopes.
- R2: Introduce a target-specific attribute that marks the possible stack-alloca size for lowering the R0 ("local") or R1 ("global") allocations. Note that R2 can be done in addition to R0 or R1.
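To make the R1 direction concrete, here is a hedged sketch of how a lowering pass might dispatch on the scope string: plain "global" keeps its current native lowering, "global.workspace" becomes a runtime workspace request (e.g. via `TVMBackendAllocWorkspace`), and other tagged scopes are deferred to target-dependent passes. The dispatch function and its return strings are hypothetical stand-ins for the real lowering.

```python
# Hedged sketch of an R1-style lowering decision. The function and the
# returned descriptions are hypothetical; only the scope strings and
# TVMBackendAllocWorkspace come from the discussion/TVM runtime.

def lower_allocation(scope: str):
    if scope == "global.workspace":
        # R1: workspace memory goes through the runtime workspace API.
        return "call: TVMBackendAllocWorkspace"
    if scope == "global":
        # Untagged global keeps the current native lowering.
        return "native allocation (e.g. stack alloca)"
    # Specialized tags (e.g. "global.npu_unified") are left for
    # target-dependent passes, as described for N2 above.
    return "defer to target-dependent pass"

print(lower_allocation("global.workspace"))
```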