manupa-arm commented on a change in pull request #9: URL: https://github.com/apache/tvm-rfcs/pull/9#discussion_r664881130
########## File path: rfcs/0009_Unified_Static_Memory_Planning.md ##########
@@ -0,0 +1,467 @@

- Feature Name: Unified Static Memory Planner
- Start Date: 2021 June 1
- RFC PR: #0009
- GitHub Issue: https://github.com/apache/tvm/issues/8404

# Background

Currently, given an ML model, TVM will generate two main artifacts:

* A1: a description of the sequential execution of operators:
  1. If the "executor" is "graph", this is a JSON
  2. If the "executor" is "aot", this is a main function describing the call graph of operators
* A2: a library of operators (in the form of runtime.Module)

A1 is generally created by lowering the "main" Relay function, while A2 is created by lowering fused Relay primitive functions → TIR PrimFuncs → C or LLVM artifacts of the operator library.

### Is there some sort of memory planning already being performed?

Yes, there is.

For A1, the inter-(fused-)operator tensors are visible in the "main" Relay function. Thus, there currently exists a Relay-level pass known as "GraphPlanMemory" that works on the Relay IR to share the space used by tensors that are visible between (fused) operators and are not live simultaneously. Currently, the said pass uses the Shared Memory Buffer Object memory planning scheme (see https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html) to perform the planning.

For A2, the operators are lowered to TIR PrimFuncs. There exists a pass called StorageRewrite that does more or less the same thing as "GraphPlanMemory", but on TIR, for the tensors that are visible within (fused) operators and are not live simultaneously.

# Motivation

For embedded use-cases, it is widely accepted that aggressive memory optimizations are vital. Initially, we are looking at enabling memory planning for embedded use-cases using the AoT executor.
Therefore, there exist two main shortcomings of the current approach:

* The memory used by intermediary tensors within operators is not shared with the memory used by inter-operator tensors.

Example TIR:
```
primfn(placeholder_3: handle, placeholder_4: handle, placeholder_5: handle, T_cast_1: handle) -> ()
  attr = {"global_symbol": "fused_nn_conv2d_add_fixed_point_multiply_clip_cast_cast_21", "tir.noalias": True}
  buffers = {T_cast: Buffer(T_cast_2: Pointer(int16), int16, [1, 56, 56, 128], []),
             placeholder_2: Buffer(placeholder_6: Pointer(int32), int32, [1, 1, 1, 128], []),
             placeholder: Buffer(placeholder_7: Pointer(int16), int16, [1, 56, 56, 128], []),
             placeholder_1: Buffer(placeholder_8: Pointer(int16), int16, [3, 3, 128, 1], [])}
  buffer_map = {placeholder_3: placeholder, placeholder_4: placeholder_1, placeholder_5: placeholder_2, T_cast_1: T_cast} {
  attr [PaddedInput: Pointer(int16)] "storage_scope" = "global";
  allocate(PaddedInput, int16, [430592]);
  attr [DepthwiseConv2d: Pointer(int32)] "storage_scope" = "global";
  allocate(DepthwiseConv2d, int32, [401408]) {
    for (i1: int32, 0, 58) {
      for (i2: int32, 0, 58) {
        for (i3: int32, 0, 128) {
          PaddedInput[(((i1*7424) + (i2*128)) + i3)] = @tir.if_then_else(((((1 <= i1) && (i1 < 57)) && (1 <= i2)) && (i2 < 57)), (int16*)placeholder_7[((((i1*7168) + (i2*128)) + i3) - 7296)], 0i16, dtype=int16)
        }
```

The above TIR snippet shows that the two intra-operator buffers, PaddedInput and DepthwiseConv2d, are not visible to the Relay GraphPlanMemory pass and so cannot be shared.

* Assumption of local optimization: performing sharing inside the operator first and subsequently sharing that workspace with inter-operator tensors would be sub-optimal.

Thus, for embedded use-cases, we need a unified static memory planner that performs memory planning of all tensors holistically to achieve the best memory utilization.

# Goals

G1.
There should be no TVMBackendAlloc(/Free)Workspace calls generated for tir.allocates that can be evaluated at compile time.

Currently, the TVM codegen and the AoT executor rely on TVMB(A/F)W calls to increment/decrement a pointer into a user-provided workspace buffer. By the end of this set of work, if the backend uses Unified Static Memory Planning, there should be no TVMB(A/F)W calls; instead, correct offsets into the user-provided buffer should be codegen'd for allocates that can be evaluated at compile time. Dynamically sized allocates will remain untouched and will be lowered as usual.

G2. The static memory planning algorithm should be changeable.

There is a variety of memory planning algorithms in discussion, with different trade-offs (see https://discuss.tvm.apache.org/t/discussion-alignment-memory-planning/9730 and https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html). Depending on the topology and the schedules of intermediary buffers, the memory planning algorithm should be easy to swap. However, the current design ties the algorithm intimately to the IR constructs, making it hard to modularize or change the algorithm without inventing a whole new pass. In reality, the outcome of USMP's algorithm is a set of offsets within a given workspace buffer, and to produce it the algorithm should only need to know the size of each tensor and their relative liveness. Therefore, the algorithm interface to USMP should be kept simple so that more algorithms can be added.

G3. Multiple pool support (including constants)

Ideally, the user would provide these buffers at the granularity of the memories they want to pin them to. E.g., if there are two RW memories, DRAM and SRAM, the buffers need to be identified and pooled accordingly by the compiler.
Similarly, for constant data, we need a mechanism that allows the user to pin it to appropriate memories; addresses in the IR would then simply be offsets into the constant buffer(s) provided by the user.

# Guide-level explanation

## U1: Most simple use case

### TVMC

```
tvmc compile my_model.tflite --executor=aot --output-format=mlf --target=c
```

### Codegen'd artifacts

```
// Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model = {
    ...
    .entrypoint = &entrypoint,
};

static uint8_t workspace_buffer[WORKSPACE_BUFFER_SIZE];
static const uint8_t parameters_buffer[PARAMETERS_BUFFER_SIZE] = <compiler_generated_constant_data>;

static int32_t entrypoint(TVMInputs_my_model* inputs,
                          TVMOutputs_my_model* outputs,
                          TVMContext* context){
    return my_model_main(inputs->input0,
                         outputs->output0,
                         &workspace_buffer,
                         parameters_buffer,
                         context->resource_handle);
}
```
```
// metadata.h

typedef struct {
    uint8_t* input0;
} TVMInputs_my_model;

typedef struct {
    uint8_t* output0;
} TVMOutputs_my_model;
```

### User Application
```
// The User Application
extern const TVMModel my_model;
int main(...) {
    ...
    TVMInputs_my_model inputs = {my_data};
    TVMOutputs_my_model outputs = {output_space};
    TVMExecute(&my_model,
               &inputs,
               &outputs,
               NULL);
}
```
## U2: User wants to share workspaces

### TVMC
```
tvmc compile my_model_1.tflite
    --executor=aot
    --output-format=mlf
    --target=accel,c
    --with-workspace-buffer="name=sram;target=c,accel"

tvmc compile my_model_2.tflite
    --executor=aot
    --output-format=mlf
    --target=accel,c
    --with-workspace-buffer="name=sram;target=c,accel"
```

Review comment: These buffers should ideally be compiler flags (and the same goes for flags such as --executor and --runtime). However, we are currently using the target to hold such flags.
Depending on the progress of the activity, where these flags end up will change :)
