manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921044001
Thanks @tqchen for summarizing the ideas and presenting possible resolutions.
The two needs seem very valid.
For N0, the operators should really be tagged with the 'local' storage scope,
as these allocations are quite local to the operator PrimFunc and they benefit
from further optimizations within and beyond TVM, i.e. in the follow-up C
compiler / LLVM.
For N1, we could use the 'global' tag to give the application/runtime layer the
responsibility of servicing the allocation.
Therefore, the actual fix should have been to tag the allocates that are
expected to be optimized as 'local' to the PrimFunc, rather than treating
'global' allocates on CPU as local.
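
To make the split concrete, here is a minimal sketch in plain Python (not TVM code) that picks a lowering from the storage scope alone. The `Allocate` record and the `lowering_for` helper are hypothetical names invented for this illustration; only the scope strings and TVMBackendAllocWorkspace come from the discussion.

```python
from dataclasses import dataclass


@dataclass
class Allocate:
    """Hypothetical stand-in for an allocate node: buffer name, size, scope."""
    name: str
    nbytes: int
    scope: str


def lowering_for(alloc: Allocate) -> str:
    """Choose a lowering from the storage scope alone, never from the target kind."""
    if alloc.scope == "local":
        # N0: private to the PrimFunc, so the follow-up C compiler / LLVM
        # can see and optimize the allocation.
        return "stack"
    if alloc.scope == "global" or alloc.scope.startswith("global."):
        # N1: the application/runtime layer services the request (TVMBAW style).
        return "TVMBackendAllocWorkspace"
    raise ValueError(f"unhandled storage scope: {alloc.scope!r}")


tile = Allocate("output_tile", 256, "local")
workspace = Allocate("conv_workspace", 1 << 20, "global")
print(lowering_for(tile))
print(lowering_for(workspace))
```

The point of the sketch is that 'global' never silently turns into a stack allocation just because the target happens to be a CPU.
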
> N0 is needed to get best performing kernel, since a native way of allocation
> that can be understood by the compiler would give the best chance for followup
> optimizations. This is the case for CPU related optimizations. Note that such
> optimization is also needed for micro setting, when we use TIR to generate
> kernel code that requested a temp scratch memory for output tiling.
I feel we are incorrectly tagging the storage_scopes here. They should
really be 'local' for this specific use case.
> First of all, R0 and R1 are not that different in nature. Both try to
> introduce two separate scopes that bring different behavior. The main
> question boils down to how we name the "global" scope.
I still see solution R1 as a workaround for the incorrect treatment of
'global' scoped memories, where we create an override of the actual 'global'
that we declare as 'global.workspace'. In shared-memory SoCs, there would be an
unscalable explosion of tags if we keep tagging memories for the devices that
have access to them. I would think we'd want to treat memories and devices as
having a many-to-many relationship.
The lowering we had until a few days back was general enough that these
allocates were serviced by a runtime/application layer routine, and that was
more aligned with what we call 'global' (with respect to codegen) scoped storage.
> Per allocate semantics, we treat "global" as normal CPU memory which can
> come from stack or platform specific allocation
Can you explain what you define as 'normal CPU memory'? A CPU can
technically have access to many memories.
> However, the need N0 would favor stack allocation when possible. Note that
> we will likely need a related behavior for micro devices as well when
> generating operator kernels.
It would have been nice to have an RFC (apologies in advance if there was one
already and I missed it) to discuss this before we moved away from TVMBAW
(TVMBackendAllocWorkspace) style allocation, which I find more generic than
just stack allocations. It almost feels like the schedules should have tagged
them 'local' if this was the expectation, rather than assuming a combined
logic: 'global' and CPU.
> One possible approach is to try to ask user to differentiate these two
> kinds of allocations.
Wouldn't it be simpler to tag allocations for N0 as 'local' and those for N1
as 'global'?
> N2: Allocating memory with special semantics. For example, a unified
> device pinned memory that is accessible from both NPU and CPU. A specialized
> texture memory or shared memory. The request in N2 is quite different and
> brings additional requirement to how we allocate the memory and how the memory
> can be used and represented in codegen.
It's a memory that multiple Targets have access to, which the
runtime/application could provide via TVMBAW with a specialized
global.<pool_name>.
> It is important to do this because the compiler makes no assumption that
> "global" can be accessed by other types of devices.
Hmmmm, this argument seems counter-intuitive to me. I think we should assume
'global' memory to be accessible unless it is explicitly specified to be
restricted, i.e. global.<pool_name>. Otherwise, the terminology is confusing.
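
A minimal sketch of that rule in plain Python, assuming a hypothetical pool table: the `POOLS` mapping, the pool names "sram" and "dtcm", the target names, and the `can_access` helper are all made up for illustration and are not existing TVM APIs. Plain 'global' is reachable from every target, while 'global.<pool_name>' narrows access to the targets listed for that pool, which also captures the many-to-many relationship between memories and devices.

```python
from typing import Dict, FrozenSet

# Hypothetical pool table: each named pool lists the targets that can reach it.
# A target may appear under several pools, and a pool may serve several targets.
POOLS: Dict[str, FrozenSet[str]] = {
    "sram": frozenset({"cpu", "npu"}),
    "dtcm": frozenset({"cpu"}),
}


def can_access(scope: str, target: str) -> bool:
    """Plain 'global' is open to every target; 'global.<pool_name>' is restricted."""
    if scope == "global":
        return True
    prefix, _, pool_name = scope.partition(".")
    if prefix != "global" or pool_name not in POOLS:
        raise ValueError(f"unknown storage scope: {scope!r}")
    return target in POOLS[pool_name]


print(can_access("global", "npu"))
print(can_access("global.sram", "npu"))
print(can_access("global.dtcm", "npu"))
```

With this shape, adding a device does not require a new scope tag; only the pool table changes.
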
> R0: Separate out a "local" scope that carries the stack allocation
> behavior. (proposed by @manupa-arm)
> R1: Keep "global" scope as it is, introduce a special tagged scope
> "global.workspace" that represents a global memory request specifically for
> workspace memory. And introduce lowerings for them. For specialized memory
> (e.g. NPU unified), introduce separate memory scopes.
> R3: Introduce target specific attribute that marks the possible stack
> alloca size for lowering the R0 ("local") or R1 ("global"). Note R3 can be
> done in addition to R0 or R1.
Ideally, the allocates destined to end up on the stack should have been
'local'. Moreover, at that point we can decide not to tag allocates that
exceed the target-specific attribute, rather than dealing with this in the
codegen or the lower_builtin_tvm pass.
I would propose:
R4: R0, plus an optional pass to tag memories with 'local', which I think is
cleaner. That pass should only tag allocates that fit within the
target-specific max stack alloca size attribute. This could work until we fix
the schedules that want them on the stack to use the 'local' storage scope for
need N0, which is a conscious decision the schedule writer / autoscheduler
takes.
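
As a rough sketch of what such an optional pass could look like, written in plain Python rather than as a real TIR pass: the `Allocate` record, the `tag_small_allocates_local` function, and the `max_stack_alloca_size` parameter name are hypothetical, and the size check is assumed to apply per allocate rather than to a cumulative stack budget.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Allocate:
    """Hypothetical stand-in for an allocate node: buffer name, size, scope."""
    name: str
    nbytes: int
    scope: str


def tag_small_allocates_local(
    allocs: List[Allocate], max_stack_alloca_size: int
) -> List[Allocate]:
    """Retag 'global' allocates that fit the per-allocate limit as 'local'.

    Allocates above the limit, or already carrying another scope, are returned
    unchanged, so codegen never has to second-guess a 'global' tag.
    """
    tagged = []
    for alloc in allocs:
        if alloc.scope == "global" and alloc.nbytes <= max_stack_alloca_size:
            tagged.append(replace(alloc, scope="local"))
        else:
            tagged.append(alloc)
    return tagged


allocs = [
    Allocate("output_tile", 256, "global"),
    Allocate("conv_workspace", 1 << 20, "global"),
]
for alloc in tag_small_allocates_local(allocs, max_stack_alloca_size=1024):
    print(alloc.name, alloc.scope)
```

Once schedules tag their own stack-bound buffers as 'local', a pass like this would become a no-op and could be dropped.
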