tqchen commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921820526
Please allow me to explain the overall rationale here, in particular around the term "constraint":
- C0: On one hand, we want a "default" memory to be generically accessible (per @manupa-arm 's comment) in all cases, so runtime libraries can be built to leverage the generic property (e.g. access from an NPU).
- C1: On the other hand, from the compiler's POV, we want to leave flexibility to the code optimization and codegen phases, and only constrain the properties we need (e.g. accessible from the CPU).
## The two ways to see the constraint
The way C0 sees the constraint is in terms of the possible accessors of the memory:
- "global" => memory can be accessed from { cpu, npu, other devices }
- "global.stack" => memory can be accessed from { cpu }

The way C1 sees the constraint is in terms of the possible memories to choose from:
- "global" (memory that is accessible from the CPU) => can choose from { stack, heap, TVMBAW }
- "global.workspace" => can choose from { TVMBAW }
- "global.stack" => can choose from { stack }
## Discussions
When we say a compiler IR is more constrained than another, we usually mean that fewer optimizations can be performed, because there is a lack of flexibility in terms of rewriting. For example, the `volatile` keyword puts additional constraints on memory access.
This makes C1 more aligned with common compiler IR design. Note that "memory that is accessible from all devices" is a term that depends on the specific runtime platform, and is not very well defined in a generic IR.
The more constraints we put on the memory itself, the smaller the set of candidate memories becomes. As a result, there are fewer opportunities for code transformations and optimizations.
Under the current semantics, "CPU" && "global" can result in stack allocation. Note that this is exactly the kind of flexibility we want to offer to later stages so that specializations can be made.
- So yes, it is indeed OK for a pass to map "global" to TVMBAW; the resulting program will run slower, but still correctly, on the CPU.
- It also does not preclude TVMLowerBuiltin from taking advantage of these semantics to choose stack allocation, which usually benefits performance.
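As a sketch of the kind of specialization this flexibility permits (hypothetical Python, not the actual C++ pass; the threshold value is illustrative), the decision for a "global"-scoped allocation could look like:

```python
K_MAX_STACK_ALLOCA_SIZE = 1024  # illustrative threshold, not TVM's actual constant

def lower_global_alloc(size_bytes: int) -> str:
    """Pick a backing store for a "global"-scoped allocation.

    Either choice is correct: "global" only promises CPU
    accessibility, so stack, heap, or TVMBAW are all valid.
    """
    if size_bytes <= K_MAX_STACK_ALLOCA_SIZE:
        # Small buffers: stack allocation usually benefits performance.
        return "stack"
    # Larger buffers: fall back to TVMBackendAllocWorkspace (TVMBAW),
    # which is slower but still correct on the CPU.
    return "TVMBAW"
```

For example, `lower_global_alloc(256)` returns `"stack"`, while `lower_global_alloc(4096)` returns `"TVMBAW"` — and by the semantics above, a program is correct either way.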
One thing we should keep in mind is that codegen and the AOT compiler should not rely on the behavior of TVMLowerBuiltin to ensure correctness (since it can choose to do anything in the case of "global", including dispatching to another custom allocator). If a special kind of memory is needed, we should declare such a constraint through the IR. Attaching a special scope is the best way to do so under the current semantics, regardless of the implementation of TVMLowerBuiltin.
TVMLowerBuiltin picks `kMaxStackAllocaSize` as a heuristic number that maximizes the benefit of stack allocation without blowing up the stack. Of course a better heuristic could be used; setting the default to 0 would bring down the performance of a lot of CPU code, so that is not as desirable. We could certainly introduce a target-dependent property for micro targets and set it to 0 there.
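A target-dependent default could be sketched like this (a hypothetical lookup in plain Python; the real mechanism would be a target attribute in TVM, and the target names and values here are only for illustration):

```python
def stack_alloca_limit(target_kind: str) -> int:
    """Return the stack-allocation size threshold for a target (illustrative).

    A micro target gets 0, forcing every "global" allocation through
    the workspace allocator; CPU-style targets keep a nonzero heuristic
    default so small buffers can still go on the stack.
    """
    limits = {
        "llvm": 1024,   # illustrative CPU default, not TVM's actual constant
        "c": 1024,
        "micro": 0,     # hypothetical micro target: never use the stack
    }
    return limits.get(target_kind, 1024)
```

With such a property, the same TVMLowerBuiltin logic would automatically route all micro-target allocations through TVMBAW without changing the semantics of "global" for other targets.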
Please let me know if this clarifies things.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]