manupa-arm commented on a change in pull request #9: URL: https://github.com/apache/tvm-rfcs/pull/9#discussion_r685513200
########## File path: rfcs/0009_Unified_Static_Memory_Planning.md ##########
@@ -0,0 +1,473 @@
+ Feature Name: Unified Static Memory Planner
+ Start Date: 2021 June 1
+ RFC PR: #0009
+ GitHub Issue: https://github.com/apache/tvm/issues/8404
+
+# Background
+
+Currently, given an ML model, TVM will primarily generate two main artifacts:
+
+* A1: A description of the sequential execution of operators:
+    1. If the "executor" is "graph", this is a JSON
+    2. If the "executor" is "aot", this is a main function describing the call graph of operators
+    3. If the "executor" is "vm", this is a series of VM bytecode instructions
+* A2: A library of operators (in the form of runtime.Module)
+
+A1 is generally created by lowering the "main" Relay function, and A2 is created by lowering fused Relay primitive functions → TIR PrimFuncs → C or LLVM artifacts of the operator library.
+
+### Is there some sort of memory planning already being performed?
+
+Yes, there is.
+
+For A1, the inter-(fused)-operator tensors are visible in the "main" Relay function. Thus, there currently exists a Relay-level pass known as "GraphPlanMemory" that works on the Relay IR to share the space used by tensors which are not live simultaneously and are visible between (fused) operators. Currently, the said pass uses the Shared Memory Buffer Object memory planning scheme (see https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html) to perform the planning.
+
+For A2, the operators are lowered to TIR PrimFuncs. There exists a pass called StorageRewrite that does more or less the same thing as "GraphPlanMemory", but on TIR, for the tensors that are visible within (fused) operators and are not live simultaneously.
+
+# Motivation
+
+For embedded use-cases, it is widely accepted that aggressive memory optimizations are vital. Initially, we are looking at enabling memory planning for embedded use-cases using the AoT executor.
+
+Therefore, there exist two main shortcomings of the current approach:
+
+* The memory used by intermediary tensors within operators is not shared with the memory used by inter-operator tensors.
+
+Example TIR:
```
+ primfn(placeholder_3: handle, placeholder_4: handle, placeholder_5: handle, T_cast_1: handle) -> ()
+  attr = {"global_symbol": "fused_nn_conv2d_add_fixed_point_multiply_clip_cast_cast_21", "tir.noalias": True}
+  buffers = {T_cast: Buffer(T_cast_2: Pointer(int16), int16, [1, 56, 56, 128], []),
+             placeholder_2: Buffer(placeholder_6: Pointer(int32), int32, [1, 1, 1, 128], []),
+             placeholder: Buffer(placeholder_7: Pointer(int16), int16, [1, 56, 56, 128], []),
+             placeholder_1: Buffer(placeholder_8: Pointer(int16), int16, [3, 3, 128, 1], [])}
+  buffer_map = {placeholder_3: placeholder, placeholder_4: placeholder_1, placeholder_5: placeholder_2, T_cast_1: T_cast} {
+    attr [PaddedInput: Pointer(int16)] "storage_scope" = "global";
+    allocate(PaddedInput, int16, [430592]);
+    attr [DepthwiseConv2d: Pointer(int32)] "storage_scope" = "global";
+    allocate(DepthwiseConv2d, int32, [401408]) {
+      for (i1: int32, 0, 58) {
+        for (i2: int32, 0, 58) {
+          for (i3: int32, 0, 128) {
+            PaddedInput[(((i1*7424) + (i2*128)) + i3)] = @tir.if_then_else(((((1 <= i1) && (i1 < 57)) && (1 <= i2)) && (i2 < 57)), (int16*)placeholder_7[((((i1*7168) + (i2*128)) + i3) - 7296)], 0i16, dtype=int16)
+          }
```
+
+The above TIR snippet shows that the two intra-operator buffers, PaddedInput and DepthwiseConv2d, are not visible to Relay's GraphPlanMemory to be shared.

Review comment: Done
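As a rough illustration of the liveness-based sharing discussed above (the Shared Memory Buffer Object scheme from the referenced TFLite blog post), the sketch below greedily packs tensors into one arena so that tensors whose live ranges do not overlap can reuse the same addresses. This is a hypothetical, simplified sketch for intuition only, not TVM's actual implementation or API; the function name and tuple layout are invented here.

```python
from typing import Dict, List, Tuple

def plan_memory(tensors: List[Tuple[str, int, int, int]]) -> Tuple[Dict[str, int], int]:
    """Greedy arena planning sketch (illustrative, not TVM's real pass).

    tensors: (name, size_bytes, first_use, last_use), live range inclusive.
    Returns ({name: byte offset into a single arena}, total arena size).
    """
    placed: List[Tuple[int, int, int, int]] = []  # (offset, size, first, last)
    offsets: Dict[str, int] = {}
    # Place the largest tensors first, each at the lowest offset that does not
    # overlap any already-placed tensor that is simultaneously live.
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Address ranges of tensors whose live ranges overlap this one.
        conflicts = sorted((o, s) for o, s, f, l in placed
                           if not (last < f or l < first))
        offset = 0
        for c_off, c_size in conflicts:
            if offset + size <= c_off:
                break                      # fits entirely before this conflict
            offset = max(offset, c_off + c_size)  # otherwise skip past it
        placed.append((offset, size, first, last))
        offsets[name] = offset
    arena = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, arena

# C is not live at the same time as A, so it reuses A's space:
offs, arena = plan_memory([
    ("A", 1024, 0, 1),
    ("B", 512, 1, 2),   # overlaps both A and C in time
    ("C", 256, 2, 3),
])
# → C lands at offset 0 (sharing with A); arena is 1536, not 1024+512+256
```

The point of the example is the one USMP sets out to fix: a planner like this can only pack the tensors it can see, so intra-operator buffers such as PaddedInput and DepthwiseConv2d, being invisible at the Relay level, never enter the `tensors` list and cannot share the arena.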
