areusch commented on a change in pull request #9: URL: https://github.com/apache/tvm-rfcs/pull/9#discussion_r688845134
########## File path: rfcs/0009_Unified_Static_Memory_Planning.md ##########
@@ -0,0 +1,476 @@

Feature Name: Unified Static Memory Planner
Start Date: 2021 June 1
RFC PR: #0009
GitHub Issue: https://github.com/apache/tvm/issues/8404

# Background

Currently, given an ML model, TVM will primarily generate three main artifacts:

* A1 : executor configuration : the description of the sequential execution of operators
    1. If the "executor" is "graph", this would be a JSON
    2. If the "executor" is "aot", this would be a main function describing the call graph of operators
    3. If the "executor" is "vm", this would be a series of VM bytecode instructions
* A2 : library of operators (in the form of runtime.Module)
* A3 : compiled parameters of the model

A1 is generally created by lowering the "main" relay function, and A2 is created by lowering fused relay primitive functions → TIR PrimFuncs → C or LLVM artifacts of the operator library.

### Is there some sort of memory planning already being performed ?

Yes, there is.

For A1, the inter-(fused)-operator tensors are visible in the "main" relay function. There currently exists a Relay-level pass known as "GraphPlanMemory" that works on the Relay IR to share the space used by tensors that are not live simultaneously and are visible between (fused) operators. Currently, the said pass uses the Shared Memory Buffer Object memory planning scheme (see https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html) to perform the planning.

For A2, the operators are lowered to TIR PrimFuncs. There exists a pass called StorageRewrite that does more or less the same thing as "GraphPlanMemory", but on TIR, for tensors that are visible within (fused) operators and are not live simultaneously.

# Motivation

For embedded use-cases, it is widely accepted that aggressive memory optimizations are vital. Initially, we are looking at enabling memory planning for embedded use-cases using the AoT executor.
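To make the discussion concrete, the liveness-based sharing that both existing passes perform can be pictured with a small sketch. This is a simplified illustration with hypothetical names, not TVM's actual implementation:

```python
# Hypothetical sketch of liveness-based storage sharing, in the spirit of
# GraphPlanMemory / StorageRewrite (not TVM's actual API or data structures).

def share_storage(tensors):
    """tensors: list of (name, size_bytes, first_use, last_use).
    Returns a mapping name -> storage id; tensors whose live ranges
    do not overlap may be assigned the same storage id."""
    storages = []    # per storage id: (size_so_far, last_use)
    assignment = {}
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        for sid, (ssize, slast) in enumerate(storages):
            if slast < first:  # previous occupant is dead: reuse this storage
                storages[sid] = (max(ssize, size), last)
                assignment[name] = sid
                break
        else:                  # every existing storage is still live: new one
            storages.append((size, last))
            assignment[name] = len(storages) - 1
    return assignment

# Two tensors that are never live at the same time end up sharing storage:
plan = share_storage([("a", 100, 0, 1), ("b", 50, 2, 3)])
```

Both existing passes apply this idea, but each only within its own scope: GraphPlanMemory across (fused) operators, StorageRewrite inside a single operator.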
Therefore, there exist two main shortcomings in the current approach:

* The memory used by intermediary tensors within operators is not shared with the memory used by inter-operator tensors.

Example TIR :
```
primfn(placeholder_3: handle, placeholder_4: handle, placeholder_5: handle, T_cast_1: handle) -> ()
  attr = {"global_symbol": "fused_nn_conv2d_add_fixed_point_multiply_clip_cast_cast_21", "tir.noalias": True}
  buffers = {T_cast: Buffer(T_cast_2: Pointer(int16), int16, [1, 56, 56, 128], []),
             placeholder_2: Buffer(placeholder_6: Pointer(int32), int32, [1, 1, 1, 128], []),
             placeholder: Buffer(placeholder_7: Pointer(int16), int16, [1, 56, 56, 128], []),
             placeholder_1: Buffer(placeholder_8: Pointer(int16), int16, [3, 3, 128, 1], [])}
  buffer_map = {placeholder_3: placeholder, placeholder_4: placeholder_1, placeholder_5: placeholder_2, T_cast_1: T_cast} {
  attr [PaddedInput: Pointer(int16)] "storage_scope" = "global";
  allocate(PaddedInput, int16, [430592]);
  attr [DepthwiseConv2d: Pointer(int32)] "storage_scope" = "global";
  allocate(DepthwiseConv2d, int32, [401408]) {
    for (i1: int32, 0, 58) {
      for (i2: int32, 0, 58) {
        for (i3: int32, 0, 128) {
          PaddedInput[(((i1*7424) + (i2*128)) + i3)] = @tir.if_then_else(((((1 <= i1) && (i1 < 57)) && (1 <= i2)) && (i2 < 57)), (int16*)placeholder_7[((((i1*7168) + (i2*128)) + i3) - 7296)], 0i16, dtype=int16)
        }
```

The above TIR snippet shows that the two intra-operator buffers, PaddedInput and DepthwiseConv2d, are not visible for optimization by the Relay-level GraphPlanMemory approach.

* Assumption of local optimization : performing sharing inside the operator first and subsequently sharing that workspace with inter-operator tensors would be sub-optimal.

Thus, for embedded use-cases, we'd need a unified static memory planner that performs memory planning of all tensors holistically to achieve the best memory utilization.

# Goals

G1.
There would be no TVMBackendAlloc(/Free)Workspace calls generated for tir.allocates whose size can be evaluated at compile time.

Currently, the TVM codegen and the AoT executor rely on TVMB(A/F)W calls to increment/decrement a pointer into a user-provided workspace buffer. By the end of this set of work, if the backend uses Unified Static Memory Planning, there should be no TVMB(A/F)W calls; rather, the correct offset into the user-provided buffer should be codegen'd for allocates whose size argument can be evaluated at compile time. Dynamically sized allocates will remain untouched, and thus will be lowered as usual.

G2. The static memory planning algorithm should be changeable.

There is a variety of memory planning algorithms under discussion, with different tradeoffs (see https://discuss.tvm.apache.org/t/discussion-alignment-memory-planning/9730 and https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html). Depending on the topology and schedules of intermediary buffers, it should be easy to swap the memory planning algorithm. However, the current design ties the algorithm intimately to the IR constructs, making it hard to modularize or change the algorithm without inventing a whole new pass. In reality, the outcome of USMP's algorithm is a set of offsets within a given workspace buffer. Moreover, to produce that, it should only need to know the sizes of the tensors and their relative liveness. Therefore, the algorithm interface to USMP should be kept simple so that more algorithms can be added.

G3. Multiple pool support (including constants)

Ideally, the user would expect to provide these buffers at the granularity of the memories they'd want to pin them to. E.g., if there are two RW memories, DRAM and SRAM, the buffers need to be identified and pooled by the compiler.
Similarly, for constant data, we need a mechanism to allow the user to pin them to the appropriate memories, and addresses in the IR would simply be offsets into the constant buffer(s) provided by the user.

# Guide-level explanation

## U1: Most simple use case

### TVMC

```
tvmc compile my_model.tflite --executor=aot --output-format=mlf --target=c
```

### Codegen'd artifacts

```
//Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model = {
    ...
    .entrypoint = &entrypoint,
}

static uint8_t workspace_buffer[WORKSPACE_BUFFER_SIZE];
static const uint8_t parameters_buffer[PARAMETERS_BUFFER_SIZE] = <compiler_generated_constant_data>;

static int32_t entrypoint(TVMInputs_my_model* inputs,
                          TVMOutputs_my_model* outputs,
                          TVMContext* context){
    return my_model_main(inputs->input0,
                         outputs->output0,
                         workspace_buffer,
                         parameters_buffer,
                         context->resource_handle);
}
```
```
// metadata.h

typedef struct {
    uint8_t* input0;
} TVMInputs_my_model;

typedef struct {
    uint8_t* output0;
} TVMOutputs_my_model;
```

### User Application
```
// The User Application
extern const TVMModel my_model;
int main(...) {
    ...
    TVMInputs_my_model inputs = {my_data};
    TVMOutputs_my_model outputs = {output_space};
    TVMExecute(&my_model,
               &inputs,
               &outputs,
               NULL);
}
```
## U2: User wants to share workspaces

### TVMC
```
tvmc compile my_model_1.tflite
    --executor=aot
    --output-format=mlf
    --target=accel,c
    --with-workspace-buffer="name=sram;target=c,accel"

tvmc compile my_model_2.tflite
    --executor=aot
    --output-format=mlf
    --target=accel,c
    --with-workspace-buffer="name=sram;target=c,accel"
```
### Codegen'd Artifacts
```
//Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model_1 = {
    ...
    .entrypoint = &entrypoint,
}

static const uint8_t parameters_buffer[PARAMETERS_BUFFER_SIZE] = <compiler_generated_constant_data>;

static int32_t entrypoint(TVMInputs_my_model_1* inputs,
                          TVMOutputs_my_model_1* outputs,
                          TVMContext* context){
    return my_model_1_main(inputs->input0,
                           outputs->output0,
                           parameters_buffer,
                           context->workspaces.sram,
                           context->resource_handle);
}
```
```
// metadata.h

#define TVM_MY_MODEL_1_SRAM_WORKSPACE_BUFFER_SIZE xxxx

typedef struct {
    uint8_t* sram;
} TVMWorkspaces_my_model_1;

typedef struct {
    uint8_t* input0;
} TVMInputs_my_model_1;

typedef struct {
    uint8_t* output0;
} TVMOutputs_my_model_1;
```
```
//Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model_2 = {
    ...
    .entrypoint = &entrypoint,
}

static const uint8_t parameters_buffer[PARAMETERS_BUFFER_SIZE] = <compiler_generated_constant_data>;

static int32_t entrypoint(TVMInputs_my_model_2* inputs,
                          TVMOutputs_my_model_2* outputs,
                          TVMContext* context){
    return my_model_2_main(inputs->input0,
                           outputs->output0,
                           parameters_buffer,
                           context->workspaces.sram,
                           context->resource_handle);
}
```
```
// metadata.h

#define TVM_MY_MODEL_2_SRAM_WORKSPACE_BUFFER_SIZE xxxx

typedef struct {
    uint8_t* sram;
} TVMWorkspaces_my_model_2;

typedef struct {
    uint8_t* input0;
} TVMInputs_my_model_2;

typedef struct {
    uint8_t* output0;
} TVMOutputs_my_model_2;
```
### User Application
```
// The User Application
extern const TVMModel my_model_1;
extern const TVMModel my_model_2;

// Calculate the maximum of TVM_MY_MODEL_1_SRAM_WORKSPACE_BUFFER_SIZE and
// TVM_MY_MODEL_2_SRAM_WORKSPACE_BUFFER_SIZE and define it as
// TVM_MY_MODELS_COMMON_WORKSPACE_BUFFER_SIZE.
// Alternatively, the user could use malloc (if permitted and desired) to compute the max at runtime.
static uint8_t workspace_buffer[TVM_MY_MODELS_COMMON_WORKSPACE_BUFFER_SIZE];

int main(...) {
    ...
    TVMContext context;
    TVMInputs_my_model_1 inputs_1 = {my_data_1};
    TVMOutputs_my_model_1 outputs_1 = {output_space_1};
    TVMWorkspaces_my_model_1 workspaces_1 = {
        .sram = workspace_buffer,
    };
    TVMSetWorkspaces(&context, &workspaces_1);
    TVMExecute(&my_model_1, &inputs_1, &outputs_1, &context);
    ...
    TVMInputs_my_model_2 inputs_2 = {my_data_2};
    TVMOutputs_my_model_2 outputs_2 = {output_space_2};
    TVMWorkspaces_my_model_2 workspaces_2 = {
        .sram = workspace_buffer,
    };
    TVMSetWorkspaces(&context, &workspaces_2);
    TVMExecute(&my_model_2, &inputs_2, &outputs_2, &context);
    ...
}
```
## U3 : User wants to pin buffers to different memories

### TVMC
```
tvmc compile my_model.tflite
    --executor=aot
    --target=accel,c
    --with-workspace-buffer="name=dtcm;target=c;size=1000"  # Here the size is more of a hint/guide provided to USMP
    --with-workspace-buffer="name=sram;target=c,accel"
    --with-parameter-buffer="name=itcm;target=c;size=5000"  # Here the size is more of a hint/guide provided to USMP
    --with-parameter-buffer="name=flash;target=c,accel"
```
### Codegen'd Artifacts
```
//Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model = {
    ...
    .entrypoint = &entrypoint,
}

static int32_t entrypoint(TVMInputs_my_model* inputs,
                          TVMOutputs_my_model* outputs,
                          TVMContext* context){
    return my_model_main(inputs->input0,
                         outputs->output0,
                         context->workspaces.dtcm,
                         context->workspaces.sram,
                         context->parameters.itcm,
                         context->parameters.flash,
                         context->resource_handle);
}
```
```
// metadata.h

#define TVM_MY_MODEL_DTCM_WORKSPACE_BUFFER_SIZE xxxx
#define TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE xxxx
#define TVM_MY_MODEL_ITCM_PARAMETER_BUFFER_SIZE xxxx
#define TVM_MY_MODEL_FLASH_PARAMETER_BUFFER_SIZE xxxx

typedef struct {
    uint8_t* dtcm;
    uint8_t* sram;
} TVMWorkspaces_my_model;

typedef struct {
    uint8_t* itcm;
    uint8_t* flash;
} TVMParameters_my_model;

typedef struct {
    uint8_t* input0;
} TVMInputs_my_model;

typedef struct {
    uint8_t* output0;
} TVMOutputs_my_model;
```
### User Application
```
// The User Application
extern const TVMModel my_model;
__attribute__((section("ITCM"))) const uint8_t my_model_params_1[TVM_MY_MODEL_ITCM_PARAMETER_BUFFER_SIZE] = <param_1_data>;
__attribute__((section("FLASH"), aligned(16))) const uint8_t my_model_params_2[TVM_MY_MODEL_FLASH_PARAMETER_BUFFER_SIZE] = <param_2_data>;
__attribute__((section("DTCM"))) static uint8_t workspace_buffer_1[TVM_MY_MODEL_DTCM_WORKSPACE_BUFFER_SIZE];
__attribute__((section("SRAM"), aligned(16))) static uint8_t workspace_buffer_2[TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE];

int main(...) {
    ...
    TVMContext context;
    TVMInputs_my_model inputs = {input};
    TVMOutputs_my_model outputs = {output};
    TVMWorkspaces_my_model workspaces = {
        .dtcm = workspace_buffer_1,
        .sram = workspace_buffer_2,
    };
    TVMParameters_my_model parameters = {
        .itcm = my_model_params_1,
        .flash = my_model_params_2,
    };
    TVMSetWorkspaces(&context, &workspaces);
    TVMSetParameters(&context, &parameters);
    TVMExecute(&my_model, &inputs, &outputs, &context);
}
```
# Reference-level explanation

## Overview

This should be an IRModule (TIR) → IRModule (TIR) pass.

Inputs :
* AoT TIR PrimFunc (the control function describing the call graph to operators)

Review comment:

hi Manupa,

@jroesch @csullivan and I discussed this at length over the last two days in conjunction with @jroesch's plans to unify the Graph, AOT, and VM lowering pipelines. The root of the concern was over four issues:

1. It may be of interest for some users to explicitly specify in which memory pool a Relay operator shall place its results. How shall TVM interpret such requests?
2. What if memory planning impacts scheduling (e.g. you may want to choose a different schedule because memory does not fit)?
3. As a side effect of unifying the lowering flows, it will be desirable to also unify memory planning across the three. Given the VM supports additional features over the AOT/Graph executors (e.g. dynamic shapes, potentially additional control flow such as recursion), it may not be possible to perform static memory planning on all tensors in such a world. We want to ensure such a unified flow is possible without being overly complex and while still supporting those features.
4. There was some confusion as to how GraphExecutor/AOTExecutor may both use USMP.

Point #1 can be handled with a Relay-level annotation on either the function or the variable, which can then be moved into TIR to drive the memory planner.
Point #4 is handled by writing a pass to translate the USMP-planned AOT top-level function into a GraphExecutor graph.json when possible. I've clarified this with @jroesch and @csullivan.

To point #2: there are a couple of classes of this problem: 1) when a schedule must be discarded because a single buffer never fits into RAM; 2) when a particular configuration of a schedule can fit into RAM, but not all configurations can. Ultimately, this problem can always be addressed by iteratively running the compilation pipeline starting from scheduling and discarding the results after memory planning if they don't work out. However, this solution is very slow, so we prefer to avoid it. There are a couple of ways to do this:

1. Introduce memory constraints at schedule time so that it's obvious when a buffer is far too big for the available memory.
2. We can model allocations before scheduling, like the VM does. This can allow us to discard schedules in class 1 before doing expensive memory planning.
3. We may be able to leverage AutoTIR to "tweak" a schedule (e.g. reduce the intermediate scratchpad usage) when that usage is proportional to an inner/outer loop nest (this solution re

To point #3: After some discussion, it appears that the planning done in the VM and USMP have something in common: they both create allocate nodes to model the graph-level tensors in TIR. Since USMP relies on building an abstract representation of allocate/free events, it should be possible to continue to build this after unifying the flows, based on either the AOT top-level function or a similar function produced by lowering the VM-optimized Relay module to TIR. Therefore, the static tensors used in a TIR IRModule should still be able to be planned using USMP after unifying the flows. The dynamic tensors may need to be handled separately, but in general, doing much ahead-of-time planning for tensors with data-dependent lifetimes (e.g. when handling recursion) is difficult.
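On the algorithm side, the RFC's G2 argues the planner should only need each buffer's size and relative liveness, and should output offsets into a workspace buffer. A minimal sketch of what such a pluggable algorithm could look like is below; the function name and data shapes are hypothetical, not the actual USMP interface:

```python
# Hypothetical sketch of a pluggable static planning algorithm in the spirit
# of the RFC's G2: input is (size, liveness) per buffer, output is an offset
# per buffer plus the total workspace size. Not TVM's actual API.

def greedy_by_size(buffers):
    """buffers: dict name -> (size_bytes, (first_use, last_use)).
    Returns (offsets, workspace_size). Buffers with overlapping live
    ranges must not overlap in the workspace."""
    offsets = {}
    # Place the largest buffers first; smaller ones fill the gaps.
    for name, (size, (first, last)) in sorted(
            buffers.items(), key=lambda kv: -kv[1][0]):
        # Already-placed buffers whose live ranges overlap this one.
        conflicts = [(offsets[n], buffers[n][0]) for n in offsets
                     if not (buffers[n][1][1] < first or last < buffers[n][1][0])]
        offset = 0
        for c_off, c_size in sorted(conflicts):
            if offset + size <= c_off:
                break                      # fits in a gap before this conflict
            offset = max(offset, c_off + c_size)
        offsets[name] = offset
    total = max((offsets[n] + buffers[n][0] for n in offsets), default=0)
    return offsets, total

# "a" and "c" are never live together, so they share offset 0;
# "b" overlaps both and is placed after them.
offsets, size = greedy_by_size({
    "a": (100, (0, 2)),
    "b": (50, (1, 3)),
    "c": (80, (3, 4)),
})
```

Keeping the interface to plain (size, liveness) pairs is what lets alternative algorithms (e.g. the TFLite Micro greedy planner linked in the RFC) be swapped in without touching the IR-facing parts of the pass.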
Finally, related: @jroesch also mentioned that in unifying the flows, he wants to move to a world where a conceptual equivalent of the VM buffer/storage nodes is injected pre-scheduling. There will be some work in doing this to ensure that they are fully compatible with the memory pool concept here, but at first glance they appear to be broadly compatible. The VM carries a somewhat similar concept of a per-device dynamic allocator, which is conceptually close to what may be required if we were to consider, at a later time, supporting dynamic models in the C runtime. We think these concepts are similar and can be unified moving forwards. Here, our intent isn't to choose any one implementation over another, but purely to ensure we can reduce the conceptual complexity of the compiler over time while retaining its flexibility.

My thoughts are that adopting this approach (and therefore going ahead with USMP now) should allow others to begin building static memory planning algorithms while we are working to simplify the compiler, and this will ensure we can continue to support any memory planning algorithms developed on top of USMP in the medium-to-long term.

Please let me know if this makes sense to you! I think this should allow everyone to continue working without stepping on each others' toes excessively.

-Andrew

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
