manupa-arm commented on a change in pull request #46: URL: https://github.com/apache/tvm-rfcs/pull/46#discussion_r804789970
########## File path: rfcs/0046-module-based-model-runtime-for-aot.md ##########
@@ -0,0 +1,348 @@
+# Module-based Model Runtime Interface for AOT
+
+- Feature Name: module_based_model_runtime_for_aot
+- Start Date: 2021-09-17
+- RFC PR: [apache/tvm-rfcs#0046](https://github.com/apache/tvm-rfcs/pull/0046)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# **Summary**
+
+This RFC describes a [Module-based Model Runtime
+interface](https://discuss.tvm.apache.org/t/discuss-module-based-model-runtime-interface/5025) for
+the [Ahead-of-Time Executor](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206), thereby
+enabling its use from the TVM C++ Runtime.
+
+# **Motivation**
+
+The microTVM project has made significant progress towards an Ahead-of-Time Executor for compiled
+Relay models. At the time of writing, it's now possible to codegen a TIR function which executes
+Relay models that have known shapes, don't have graph-level control flow, and execute only on the
+CPU device. Right now, the C runtime is the only such runtime environment which can interact with
+this generated code. However, significant interest exists in enabling the C++ runtime to use the
+Ahead-of-Time executor.
+
+# **Guide-level explanation**
+
+Users select the AOT executor at compile time through the traditional GraphExecutor compilation flow
+(e.g. `tvm.relay.build`) by including `--executor=aot` in the Target [1]. The return value of
+`tvm.relay.build` in this case is an `AotExecutorFactory` Module object. Users instantiate the AOT
+executor via `AotExecutorFactory` as they do with `GraphExecutor`:
+
+```python
+ir_mod = tvm.parser.fromtext("""\
+  #[version = "0.0.5"]
+  def @main(%a : Tensor[(1, 2), uint8], %b : Tensor[(1, 2), uint8]) {
+    %0 = %a + %b;
+    %0
+  }"""
+)
+
+with tvm.transform.PassContext(opt_level=3):
+    factory : AotExecutorFactory = tvm.relay.build(
+        ir_mod, "llvm -executor=aot", module_name="my_mod")
+
+aot_executor : AotExecutor = factory["my_mod"](tvm.cpu(0))
+```
+
+`AotExecutor` supports the traditional Module-Based Model Runtime Interface and can be used as a
+user normally would `GraphExecutor`:
+
+```python
+aot_executor.set_input("a", tvm.nd.array(np.array([[1, 2]], dtype="uint8")))
+aot_executor.set_input("b", tvm.nd.array(np.array([[3, 5]], dtype="uint8")))
+aot_executor.run()
+output = aot_executor.get_output(0)
+assert (output.asnumpy() == np.array([[4, 7]], dtype="uint8")).all()
+```
+
+[1] NOTE: The target string is not the final place this customization should be made. However, it's
+been the place where we've been putting runtime-related options. A separate RFC will split the
+Target string into Target options (which affect tuning) and runtime options.
+
+# **Reference-level explanation**
+
+Already committed to TVM is the AotExecutorCodegen. This module produces a TIR top-level function
+which invokes the Relay operators (implemented in TIR) in a correct order. An example is given
+below:
+
+```
+PrimFunc([input1, input2, output]) attrs={"global_symbol": "tvmgen_my_mod_run_model", "runner_function": (bool)1} {
+  // attr [(nullptr)] device_id = 0
+  // attr [(nullptr)] device_type = 1
+  tir.tvm_call_packed("tvmgen_my_mod_fused_add", input1, input2, output)
+}
+```
+
+The AotExecutor then needs to accomplish the following to meet the Module-based Model Runtime Interface:
+
+1. Allocate input and output tensors as defined in the `run_model` function using the correct Device

Review comment:

> However, consider a microTVM use case where DMA is used to copy data from e.g. a camera into an SRAM dedicated to accelerator usage. In this case, TVM should really do as much as possible to stay out of the way--the copy operation is application- or at least SoC-specific.
>
> Additionally, I imagine in the double-buffered input use case, the user would need to somehow communicate that space for two input buffers is needed. Not sure how to do that now outside of downsizing a memory pool.

So the requirement here is that the accelerator can only access certain memories, and the input tensor is not originally present in such a memory. From an AoT perspective, we should ideally provide the compiler with this information, because we can't generally know where the input is going to be. Then, when the compiler knows the tensor needs copying, it could insert a copy -- lowered to produce an intermediary Allocate node -- into accelerator-accessible memory. Once we have that, it should translate down to a loop nest of copies that could be scheduled via the double_buffer() scheduling primitive in either TE or S-TIR. This will create intermediary storage in the form of Allocate nodes that can be planned properly. Additionally, depending on who performs the copy (the accelerator or the host), we need to place it accordingly; copy insertion then becomes an IRModule --> IRModule transformation, for which I think a wrapper would not be sufficient -- I would assume copies might need to end up in the device functions.

Now, I agree that this is out of scope / too much to consider for this work. In its absence, I think letting the set_input runtime function handle the copy makes sense. It is just that the short-term solution runs against the long-term solution in my mind, which is what sparked this conversation. However, I agree with you that there is no easy way other than inserting the copies in the host PrimFunc of the (host, device) PrimFunc pair if we want to do it in a more codegen'd flow.
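The double-buffered input staging discussed above can be sketched in plain Python. This is a hypothetical illustration of the ping-pong pattern only, not TVM's `double_buffer()` primitive or any real DMA API; `process`, `run_double_buffered`, and `CHUNK` are invented names:

```python
# Hypothetical sketch of double-buffered input staging: while the
# "accelerator" consumes one buffer, the next chunk is staged (in a real
# system, by DMA, concurrently) into the other buffer. Note it requires
# space for TWO input buffers, which is the sizing concern raised above.

CHUNK = 4

def process(chunk):
    # Stand-in for the accelerator consuming a staged chunk.
    return sum(chunk)

def run_double_buffered(data):
    buffers = [[0] * CHUNK, [0] * CHUNK]  # two ping-pong staging buffers
    results = []
    n_chunks = len(data) // CHUNK
    # Prologue: stage the first chunk into buffer 0.
    buffers[0][:] = data[0:CHUNK]
    for i in range(n_chunks):
        cur = buffers[i % 2]
        # Stage chunk i+1 into the *other* buffer; on hardware this copy
        # would overlap with the compute on chunk i below.
        if i + 1 < n_chunks:
            buffers[(i + 1) % 2][:] = data[(i + 1) * CHUNK:(i + 2) * CHUNK]
        results.append(process(cur))
    return results
```

In a codegen'd flow, the staging copy would instead be an Allocate plus a copy loop nest in TIR, which is where a `double_buffer()`-style scheduling transform could apply.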
########## File path: rfcs/0046-module-based-model-runtime-for-aot.md ##########
@@ -0,0 +1,348 @@
[… diff context identical to the previous hunk through "1. Allocate input and output tensors …", then:]
+1. Allocate input and output tensors as defined in the `run_model` function using the correct Device
+   API.
+2. Provide a mapping from relay parameter name to positional argument.
+3. Invoke the generated TIR function and provide profiling.
+
+### Compiler ↔ Runtime Metadata
+
+In order to implement (1) and (2) above, additional metadata about the `run_model` function needs to
+be communicated from Compiler to Runtime:
+
+- The mapping between Relay parameter name and TIR argument position
+- The number of inputs and outputs
+- The type of each parameter
+- Information sufficient to choose a Device API to allocate memory for that data.
+
+At present, Metadata is passed from Compiler to Runtime in several different ways:
+
+1. Constant DLTensor can be bundled with code and supplied to `runtime::Module` via
+   `runtime::MetadataModule`
+2. Many non-DSO-exportable backends (`cuda`, `hexagon`, `metal`, `opencl`, `sdaccel`, `rocm`,
+   `vulkan`) have adopted the convention of including a
+   [`runtime::FunctionInfo`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L106)
+   (NOTE: distinct from `tvm::relay::transform::FunctionInfo`) in their serialization:
+
+   ```c++
+   /*! \brief function information needed by device */
+   struct FunctionInfo {
+     std::string name;
+     std::vector<DLDataType> arg_types;
+     std::vector<std::string> launch_param_tags;
+   };
+   ```
+
+3. AotExecutorCodegen and GraphExecutorCodegen have adopted the practice of producing the
+   graph-level
+   [`runtime::MetadataNode`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L55):
+
+   ```c++
+   /*!
+    * \brief Structure that can be optionally used by the executor codegen
+    */
+   class MetadataNode : public Object {
+    public:
+     /*! \brief input information for the main function */
+     Array<String> inputs;
+     /*! \brief number of outputs of the main function */
+     int num_outputs = 1;
+     /*! \brief the executor to be used to run the model */
+     String executor = kTvmExecutorGraph;
+
+     String mod_name = "";
+   };
+   ```
+
+4. The recent AOTExecutor implementation has created `tvm::relay::transform::FunctionInfo`, which
+   communicates statistics about memory usage and I/O operations for each TIR operator and aggregate
+   statistics for the top-level AOT function:
+
+   ```c++
+   struct FunctionInfoNode : public Object {
+     Map<Target, Integer> workspace_sizes;
+     Map<Target, Integer> io_sizes;
+     Map<Target, Integer> constant_sizes;
+     Map<Target, tir::PrimFunc> tir_primfuncs;
+     Map<Target, Function> relay_primfuncs;
+   };
+   ```
+
+Some duplication of information is already present. Likely this is due in part to the existing
+middle-end compiler design, in which a separate `IRModule` is produced for each backend. Another

Review comment:

I don't dispute the claim that it is IRModule --> (tree of runtime.Modules). The current lowering flow creates an IRModule per backend, and these get translated to runtime.Modules. I see that as (correct me if I am wrong):

Unified IRModule --> [IRModule per backend] --> tree of runtime.Modules

which is basically a host runtime.Module holding a flat array of device runtime.Modules. The proposal here wants to attach model-level Metadata from the Unified IRModule to the root of that tree of runtime.Modules.

So the gap in the text is that it assumes it is common knowledge how the [IRModule per backend] stage disappears. Hypothetically, if that stage were not there, the text would make sense, because the model-level metadata could be attached to the Unified IRModule and then passed on to the tree of runtime.Modules.

My question is this: in the absence of a proposal for removing the IRModule-per-backend stage (or, if one already exists, please link the RFC/pre-RFC), this RFC needs to outline how the metadata will be communicated from the Unified IRModule to the tree of runtime.Modules in the current lowering flow and/or with the changes brought by this RFC.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
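The host/device module tree and root-level metadata attachment described in the comment above can be sketched in plain Python. The classes and helper here are hypothetical stand-ins, not TVM's actual `runtime::Module` API:

```python
# Hypothetical model of the lowering output discussed in the comment:
# a host runtime.Module whose flat imports array holds one device module
# per backend, with model-level metadata attached only at the tree's root.

class Module:
    """Illustrative stand-in for a runtime.Module node."""
    def __init__(self, name, target):
        self.name = name
        self.target = target
        self.imports = []     # flat array of device modules
        self.metadata = None  # populated only at the root

# Unified IRModule -> [IRModule per backend] -> tree of runtime.Modules
host = Module("tvmgen_my_mod", target="llvm")
host.imports.append(Module("tvmgen_my_mod_dev", target="cuda"))

# Model-level metadata (input names, output count) attached at the root,
# mirroring the inputs/num_outputs/mod_name fields of MetadataNode.
host.metadata = {"inputs": ["a", "b"], "num_outputs": 1, "mod_name": "my_mod"}

def root_metadata(mod):
    # Lookup starts from the root; device modules carry no model metadata.
    return mod.metadata
```

The open question in the comment is precisely where, in the current lowering flow, this root attachment happens once the per-backend IRModules have been translated away.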
