manupa-arm commented on a change in pull request #65:
URL: https://github.com/apache/tvm-rfcs/pull/65#discussion_r839354482
##########
File path: rfcs/0009_Unified_Static_Memory_Planning.md
##########
@@ -349,6 +349,108 @@ tvmc compile my_model.tflite --executor=aot --output-format=mlf --target=c
TVMExecute(&my_model, &inputs, &outputs, &context);
}
```
+
+## U4 : User wants to write/read directly to the workspace buffer
+
+This use case allows the space used by the I/O tensors to be reused by the inference.
+
+### TVMC
+```
+ tvmc compile my_model.tflite
+ --executor=aot
+ --target=c
+ --workspace-pools=sram
+ --pass-config tir.usmp.enable=1
+ --pass-config tir.usmp.use_workspace_io=1
+
+```
+### Codegen'd Artifacts
+```
+ //Codegen'd artifacts in metadata.c (lib0.c)
+
 +  int32_t tvmgen_my_model_run(
 +      tvmgen_my_model_workspace_pools* workspace_pools
 +  ) {
 +      return my_model_main(workspace_pools->sram);
 +  }
+
+ // Returns a handle pointing to space inside the
+ // workspace pool where input should be stored
+
 + tvmgen_my_model_inputs tvmgen_my_model_map_inputs(
 +    tvmgen_my_model_workspace_pools* workspace_pools
 + ) {
 +     tvmgen_my_model_inputs inputs = {
 +         .input0 = &workspace_pools->sram[<INPUT0_OFFSET>],
 +     };
 +     return inputs;
 + }
+
+ // Returns a handle pointing to space inside the
+ // workspace pool where output is stored
+
 + tvmgen_my_model_outputs tvmgen_my_model_map_outputs(
 +    tvmgen_my_model_workspace_pools* workspace_pools
 + ) {
 +     tvmgen_my_model_outputs outputs = {
 +         .output0 = &workspace_pools->sram[<OUTPUT0_OFFSET>],
 +     };
 +     return outputs;
 + }
+```
+```
+// metadata.h
+
+ #define TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE xxxx
+
+ typedef struct {
+ uint8_t* sram;
+ } tvmgen_my_model_workspace_pools;
+
+ typedef struct {
+ uint8_t* input0;
+ } tvmgen_my_model_inputs;
+
+ typedef struct {
+ uint8_t* output0;
+ } tvmgen_my_model_outputs;
+
+ tvmgen_my_model_inputs tvmgen_my_model_map_inputs(
+ tvmgen_my_model_workspace_pools* workspace_pools
+ );
+
+ tvmgen_my_model_outputs tvmgen_my_model_map_outputs(
+ tvmgen_my_model_workspace_pools* workspace_pools
+ );
+```
+### User Application
+```
 +  // The User Application
 +  __attribute__((section("SRAM"), aligned(16))) static uint8_t workspace_buffer_sram[TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE];
 +
 +  int main(...) {
 +    ...
 +    tvmgen_my_model_workspace_pools workspaces = {
 +        .sram = workspace_buffer_sram,
 +    };
+ tvmgen_my_model_inputs inputs =
+ tvmgen_my_model_map_inputs(&workspaces);
+ tvmgen_my_model_outputs outputs =
+ tvmgen_my_model_map_outputs(&workspaces);
+
 +    // Generate input tensor by passing the handle
 +    // E.g. this could be a driver writing directly to
 +    // the workspace buffer
 +    GenerateInput(inputs.input0);
 +
 +    tvmgen_my_model_run(&workspaces);
 +    ...
 +  }
 +```
Review comment:
Good points @areusch!
One thing I would like to make clear: the proposal here is to extend TVM's current status. Therefore, this change does not consider the following two scenarios:
A1. Using tvm.relay.build(List[IRModule]).
A2. Selective re-use of the space of some I/O tensors.
Thus, I would consider them out of scope for the extension proposed here. However, I'd like to highlight how this work is unrelated to, and does not block, such extensions in the future.
For A1,
If and when we want to do this, we should discuss (hopefully in that RFC) why we would ever want to. As of now I feel this would be over-stretching that API. Even if we wanted to do it, there is a larger issue to be solved: the PassConfig would need to be set separately for the different programs coupled in the build call, e.g. we might not want to vectorize one program while we do want to vectorize the other.
For A2,
This is a perfectly valid extension of this work.
Right now, we have kept the change self-contained within the USMP pass, hence the use of PassConfig.
I would still consider this route valuable because otherwise the user would need to manually annotate all the I/O tensors, and the PassConfig route should account for most usage of this feature.
One thing that I find challenging is making use of such annotations before relay.build(...), given the agreement reached here:
https://github.com/apache/tvm-rfcs/blob/main/rfcs/0029-migrating-to-irmodule-attributes.md,
i.e. the annotations are applied inside relay.build(...).
Let's say we could overcome that hurdle; then all that needs to be done to support it is to use the annotations to filter which tensors the CreateAllocateIO pass selects. Thus, one could simply extend the feature proposed here to support that derivative advanced feature.
I can add some text as future considerations, if you like. lmk.
##########
File path: rfcs/0009_Unified_Static_Memory_Planning.md
##########
@@ -515,4 +663,6 @@ NOTE : to support tir.constants generally, we'll be enhancing the bound relay.co
# Drawbacks
-* The relay "main" function that describes the call order to operator PrimFuncs has to be described in TIR to be able to integrate the USMP into the respective executor codegen. However, we dont view this as a major problem as the relay "main" function could easily be lowered to TIR.
\ No newline at end of file
+* The relay "main" function that describes the call order to operator PrimFuncs has to be described in TIR to be able to integrate the USMP into the respective executor codegen. However, we don't view this as a major problem, as the relay "main" function can easily be lowered to TIR.
+
+* The U4 use case will only be supported with the [Embedded C Runtime Interface](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951/14). This is mainly because the nature of the requirement is associated with embedded use cases. However, the USMP changes here should be complementary to supporting other runtime interfaces, such as the Module-based Model Runtime Interface's set_input and set_output, in the future.
Review comment:
Good point again!
This is generally applicable to the Embedded C Runtime Interface as well.
Another way to frame the suggestion is to ask: why can't we treat the buffer space used by I/O tensors as separate workspace buffers?
I have considered this too :). The short answer is that using disjoint workspace buffers rather than a single workspace buffer is sub-optimal, because it does not allow partial overlaps of tensors across such buffers.
In more detail: the workspace buffer is viewed as a spatio-temporal buffer, and tensors are considered spatio-temporal allocations. Having disjoint workspace buffers creates a spatial barrier that could potentially lead to higher fragmentation. Therefore, out of the two options, we are prioritizing the better one here. Again, this could become another use case if there is value in it.
Again, I can add a few lines on this as a future consideration.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]