manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920577170
@tqchen @mbs-octoml ,
This is not specific to Arm(R) Ethos(TM)-U codegen and its generally
applicable for any micro controller where we would want to avoid creating
allocation of memories in the stack but rather service them via platform
abstraction that is handled via TVMBackendWorkspaceAlloc -->
TVMPlatformAllocate.
This only showed up in Arm(R) Ethos(TM)-U codegen because we use
TVMPlatformAllocate allocate memory from a buffer placed in a memory that is
both accessible by CPU and NPU. Thus, it makes this a functional bug.
However, with this change current main produces code that have much higher
stack allocation for micros -- that is not desired.
cc : @u99127 @areusch
> Stack allocation is important for the performance of the CPU code. In the
case of TVM, we do not have explicit concept of registers in most cases.
Instead we need to rely on LLVM's mem2reg pass to transform a set of constant
indexing into stack allocation and turn them into registers, so the code can
run effectively. So removing this code path can complicate the code generator
side optimization by quite a bit and slow down the CPU code.
The correct way represent this seems to be using tir.allocates with
storage_scope="local" for device=CPU to go into the stack. For targets that
needs this behavior, there should be an explicit pass to convert them to local
to make them placed to the stack rather than assuming this default behaviour.
> Of course this can be a target specific thing. LowerTVMBuiltin right now
has the assumption to only run on host(CPU) code.
> Allocate always prefers (native) stack allocation when possible, but
also allows other means of opaque allocation(as long as the allocation is
fulfilled)
There are however, cases when stack allocation is not possible
When the size of memory requested is too big, stack alloca will
explode the stack space(That is why there is a size check in the CPU case and
the use of global opaque was meant as a fallback to avoid stackoverflow in
models with big intermediate temp space)
LowerTVMBuiltin was originally designed to run on the host side,
which means as soon as the allocation is about device side memory, it will need
to call onto a (host side) device API to allocate the memory instead
Clearly, this definition of CPU leaves out micros.
It feels wrong to print out allocates with "global" storage_scope directly
into CPU PrimFunc that gets printed as a stack allocation rather it should be
serviced via TVMBAW call which moves the responsibility for the
runtime/application layer.
> So rationales for the specific CPU side logic:
> We want to have stack alloca on host when possible(to gain mem2reg
optimization)
When the requested size is too large, we fallback to opaque workspace
allocation on heap to allow the code to safely handle code with big temp memory
requests as well as dynamic size allocation requests.
This certainly sounds like we could use an optimization pass to convert the
tir.allocate's storage_scope for targets that requires this rather than making
that the default behaviour for tir.allocates with "global" storage scope.
cc : @tom-gall @mbaret @Mousius
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]