[GitHub] [tvm-rfcs] areusch commented on a change in pull request #46: Module Based Model Runtime for AOT

GitBox Tue, 08 Feb 2022 12:30:44 -0800


areusch commented on a change in pull request #46:
URL: https://github.com/apache/tvm-rfcs/pull/46#discussion_r802025838




##########
File path: rfcs/0046-module-based-model-runtime-for-aot.md
##########
@@ -0,0 +1,348 @@
+# Module-based Model Runtime Interface for AOT
+
+- Feature Name: module_based_model_runtime_for_aot
+- Start Date: 2021-09-17
+- RFC PR: [apache/tvm-rfcs#0046](https://github.com/apache/tvm-rfcs/pull/0046)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# **Summary**
+
+This RFC describes a [Module-based Model Runtime
+interface](https://discuss.tvm.apache.org/t/discuss-module-based-model-runtime-interface/5025)
 for
+the [Ahead-of-Time 
Executor](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206), 
thereby
+enabling its use from the TVM C++ Runtime.
+
+# **Motivation**
+
+The microTVM project has made significant progress towards an Ahead-of-Time 
Executor for compiled
+Relay models. At the time of writing, it's now possible to codegen a TIR 
function which executes
+Relay models that have known shapes, don't have graph-level control flow, and 
execute only on the
+CPU device. Right now, the C runtime is the only such runtime environment 
which can interact with
+this generated code. However, significant interest exists in enabling the C++ 
runtime to use the
+Ahead-of-Time executor.
+
+# **Guide-level explanation**
+
+Users select the AOT executor at compile time through the traditional 
GraphExecutor compilation flow
+(e.g. `[tvm.relay.build](http://tvm.relay.build)`) by including 
`--executor=aot` in the Target
+[1]. The return value of `tvm.relay.build` in this case is an 
`AotExecutorFactory` Module
+object. Users instantiate the AOT executor via `AotExecutorFactory` as they do 
with `GraphExecutor`:
+
+```bash
+ir_mod = tvm.parser.fromtext("""\
+      #[version = "0.0.5"]
+      def @main(%a : Tensor[(1, 2), uint8], %b : Tensor[(1, 2), uint8]) {
+          %0 = %a + %b;
+          %0
+      }"""
+    )
+
+with PassConfig(opt_level=3):
+  factory : AotExecutorFactory = tvm.relay.build(
+       ir_mod, "llvm -executor=aot", module_name="my_mod")
+
+aot_executor : AotExecutor = factory["my_mod"](tvm.cpu(0))
+```
+
+`AotExecutor` supports the traditional Module-Based Model Runtime Interface 
and can be used as a
+user normally would `GraphExecutor`:
+
+```bash
+aot_executor.set_input("a", tvm.nd.array(np.ndarray([1, 2], dtype="uint8")))
+aot_executor.set_input("b", tvm.nd.array(np.ndarray([3, 5], dtype="uint8")))
+aot_exec.run()
+output = aot_exec.get_output(0)
+assert output.asnumpy() == np.ndarray([5, 7], dtype="uint8")
+```
+
+[1] NOTE: The target string is not the final place this customization should 
be made. However, it's
+been the place where we've been putting runtime-related stuff. A separate RFC 
will split the Target
+string into Target options (which affect tuning) and runtime options.
+
+# **Reference-level explanation**
+
+Already committed to TVM is the AotExecutorCodegen. This module produces a TIR 
top-level function
+which invokes the Relay operators (implemented in TIR) in a correct order. An 
example is given
+below:
+
+```bash
+PrimFunc([input1, input2, output]) attrs={"global_symbol": 
"tvmgen_my_mod_run_model", "runner_function": (bool)1} {
+  // attr [(nullptr)] device_id = 0
+  // attr [(nullptr)] device_type = 1
+  tir.tvm_call_packed("tvmgen_my_mod_fused_add", input1, input2, output)
+}
+```
+
+The AotExecutor then needs to accomplish the following to meet Module-based 
Model Runtime Interface:
+
+1. Allocate input and output tensors as defined in the `run_model` function 
using the correct Device

Review comment:
       > I suppose the proposal here is we create a runtime wrapper around it 
-- which also works but with cons of not exposing these allocations to the core 
compiler for further optimization.
   
   I actually think we should defer the question of how the input tensors are 
loaded and consumed to the runtime. You're right that AOTExecutor here is 
intended to be a generic runtime wrapper when there are enough resources on the 
system to handle this at runtime. However, consider a microTVM use case where 
DMA is used to copy data from e.g. a camera into an SRAM dedicated to 
accelerator usage. In this case, TVM should really do as much as possible to 
stay out of the way--the copy operation is application- or at least SoC- 
specific. TVM should provide pointer and sizing information so this can be 
handled separately. That argues for exposing some type of `get_input_tensor` 
function as you're mentioning here as a first-class citizen of MBMR. I think we 
should take that up as we move on the C Device API and USMP integrations. 
   
   > However, Im curious to know whether it would just easier to create a copy 
in the main body to tir.allocate node that get translated to a device copy -- 
which I think has the same effect.
   
   I think I see what you're raising here--we have to explicitly inject the 
input and output tensors into USMP if they are not modeled by a `tir.allocate` 
somewhere. We could model them without specially injecting them by wrapping the 
TIR top-level function in a MBMR-compatible TIR function which contains 
`tir.allocate` nodes, IIUC. I'm not sure how we get away from needing to do 
this at present, though--for `DLDevice(kDLCPU, 0)`, no such copy is needed, but 
we would then still need the `tir.allocate` nodes to make the input tensors 
visible to USMP.
   
   Additionally, I imagine in the double-buffered input use case, the user 
would need to somehow communicate that space for two input buffers is needed. 
Not sure how to do that now outside of downsizing a memory pool.
   
   > I would propose being explicit here that this is a runtime wrapper.
   Possibly, would it be possible to add a few notes that why/ why we should 
not do it the codegen of main ?
   
   Added.
   
   

##########
File path: rfcs/0046-module-based-model-runtime-for-aot.md
##########
@@ -0,0 +1,348 @@
+# Module-based Model Runtime Interface for AOT
+
+- Feature Name: module_based_model_runtime_for_aot
+- Start Date: 2021-09-17
+- RFC PR: [apache/tvm-rfcs#0046](https://github.com/apache/tvm-rfcs/pull/0046)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# **Summary**
+
+This RFC describes a [Module-based Model Runtime
+interface](https://discuss.tvm.apache.org/t/discuss-module-based-model-runtime-interface/5025)
 for
+the [Ahead-of-Time 
Executor](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206), 
thereby
+enabling its use from the TVM C++ Runtime.
+
+# **Motivation**
+
+The microTVM project has made significant progress towards an Ahead-of-Time 
Executor for compiled
+Relay models. At the time of writing, it's now possible to codegen a TIR 
function which executes
+Relay models that have known shapes, don't have graph-level control flow, and 
execute only on the
+CPU device. Right now, the C runtime is the only such runtime environment 
which can interact with
+this generated code. However, significant interest exists in enabling the C++ 
runtime to use the
+Ahead-of-Time executor.
+
+# **Guide-level explanation**
+
+Users select the AOT executor at compile time through the traditional 
GraphExecutor compilation flow
+(e.g. `[tvm.relay.build](http://tvm.relay.build)`) by including 
`--executor=aot` in the Target
+[1]. The return value of `tvm.relay.build` in this case is an 
`AotExecutorFactory` Module
+object. Users instantiate the AOT executor via `AotExecutorFactory` as they do 
with `GraphExecutor`:
+
+```bash
+ir_mod = tvm.parser.fromtext("""\
+      #[version = "0.0.5"]
+      def @main(%a : Tensor[(1, 2), uint8], %b : Tensor[(1, 2), uint8]) {
+          %0 = %a + %b;
+          %0
+      }"""
+    )
+
+with PassConfig(opt_level=3):
+  factory : AotExecutorFactory = tvm.relay.build(
+       ir_mod, "llvm -executor=aot", module_name="my_mod")
+
+aot_executor : AotExecutor = factory["my_mod"](tvm.cpu(0))
+```
+
+`AotExecutor` supports the traditional Module-Based Model Runtime Interface 
and can be used as a
+user normally would `GraphExecutor`:
+
+```bash
+aot_executor.set_input("a", tvm.nd.array(np.ndarray([1, 2], dtype="uint8")))
+aot_executor.set_input("b", tvm.nd.array(np.ndarray([3, 5], dtype="uint8")))
+aot_exec.run()
+output = aot_exec.get_output(0)
+assert output.asnumpy() == np.ndarray([5, 7], dtype="uint8")
+```
+
+[1] NOTE: The target string is not the final place this customization should 
be made. However, it's
+been the place where we've been putting runtime-related stuff. A separate RFC 
will split the Target
+string into Target options (which affect tuning) and runtime options.
+
+# **Reference-level explanation**
+
+Already committed to TVM is the AotExecutorCodegen. This module produces a TIR 
top-level function
+which invokes the Relay operators (implemented in TIR) in a correct order. An 
example is given
+below:
+
+```bash
+PrimFunc([input1, input2, output]) attrs={"global_symbol": 
"tvmgen_my_mod_run_model", "runner_function": (bool)1} {
+  // attr [(nullptr)] device_id = 0
+  // attr [(nullptr)] device_type = 1
+  tir.tvm_call_packed("tvmgen_my_mod_fused_add", input1, input2, output)
+}
+```
+
+The AotExecutor then needs to accomplish the following to meet Module-based 
Model Runtime Interface:
+
+1. Allocate input and output tensors as defined in the `run_model` function 
using the correct Device
+   API.
+2. Provide a mapping from relay parameter name to positional argument.
+3. Invoke the generated TIR function and provide profiling.
+
+### Compiler ↔ Runtime Metadata
+
+In order to implement (1) and (2) above, additional metadata about the 
`run_model` function needs to
+be communicated from Compiler to Runtime:
+
+- The mapping between Relay parameter name and TIR argument position
+- The number of inputs and outputs
+- The type of each parameter
+- Information sufficient to choose a Device API to allocate memory for that 
data.
+
+At present, Metadata is passed from Compiler to Runtime in several different 
ways:
+
+1. Constant DLTensor can be bundled with code and supplied to 
`runtime::Module` via
+   `runtime::MetadataModule`
+2. Many non-DSO-exportable backends (`cuda`, `hexagon`, `metal`, `opencl`, 
`sdaccel`, `rocm`,
+   `vulkan`) have adopted the convention of including a
+   
[1runtime::FunctionInfo`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L106)
+   (NOTE: distinct from `tvm::relay::transform::FunctionInfo`) in their 
serialization:
+
+    ```bash
+    /*! \brief function information needed by device */
+    struct FunctionInfo {
+      std::string name;
+      std::vector<DLDataType> arg_types;
+      std::vector<std::string> launch_param_tags;
+    }
+    ```
+
+3. AotExecutorCodegen and GraphExecutorCodegen have adopted the practice of 
producing the
+   graph-level
+   
[`runtime::MetadataNode`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L55):
+
+    ```bash
+    /*!
+     * \brief Structure that can be optionally used by the executor codegen
+     */
+    class MetadataNode : public Object {
+     public:
+      /*! \brief input information for the main function */
+      Array<String> inputs;
+      /*! \brief number of outputs of the main function */
+      int num_outputs = 1;
+      /*! \brief the executor to be used to run the model */
+      String executor = kTvmExecutorGraph;
+
+      String mod_name = "";
+    }
+    ```
+
+4. The recent AOTExecutor implementation has created 
`tvm::relay::transform::FunctionInfo` which
+   communicates statistics about memory usage and I/O operation for each TIR 
operator and aggregate
+   statistics for the top-level AOT function:
+
+    ```bash
+    struct FunctionInfoNode : public Object {
+      Map<Target, Integer> workspace_sizes;
+      Map<Target, Integer> io_sizes;
+      Map<Target, Integer> constant_sizes;
+      Map<Target, tir::PrimFunc> tir_primfuncs;
+      Map<Target, Function> relay_primfuncs;
+    }
+    ```
+
+
+Some duplication of information is already present. Likely this is due in part 
to the existing
+middle-end compiler design, in which a separate `IRModule` is produced for 
each backend. Another
+factor may be: since `runtime::Module` are responsible for their own 
serialization, and passing
+`Node` across `PackedFunc` requires a cast, the lack of a centralized facility 
for
+`runtime::Modules` to obtain module-level Metadata has led backend authors to 
roll their own. This
+pattern means that it's very difficult to assess the full scope of metadata 
handed to the runtime,
+particularly across all backends.
+
+Work is currently ongoing to unify the pre-codegen `IRModule` into a single 
instance. After this
+work is completed, it will be much easier to produce a centralized 
module-level Metadata. This RFC
+argues for the expansion of `runtime::MetadataNode` in the following ways:

Review comment:
       ah i see, changed the wording. i guess i was trying to say that the name 
MetadataNode should cover a more complex thing, but in practice you're right 
that this proposes a restructuring.

##########
File path: rfcs/0046-module-based-model-runtime-for-aot.md
##########
@@ -0,0 +1,348 @@
+# Module-based Model Runtime Interface for AOT
+
+- Feature Name: module_based_model_runtime_for_aot
+- Start Date: 2021-09-17
+- RFC PR: [apache/tvm-rfcs#0046](https://github.com/apache/tvm-rfcs/pull/0046)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# **Summary**
+
+This RFC describes a [Module-based Model Runtime
+interface](https://discuss.tvm.apache.org/t/discuss-module-based-model-runtime-interface/5025)
 for
+the [Ahead-of-Time 
Executor](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206), 
thereby
+enabling its use from the TVM C++ Runtime.
+
+# **Motivation**
+
+The microTVM project has made significant progress towards an Ahead-of-Time 
Executor for compiled
+Relay models. At the time of writing, it's now possible to codegen a TIR 
function which executes
+Relay models that have known shapes, don't have graph-level control flow, and 
execute only on the
+CPU device. Right now, the C runtime is the only such runtime environment 
which can interact with
+this generated code. However, significant interest exists in enabling the C++ 
runtime to use the
+Ahead-of-Time executor.
+
+# **Guide-level explanation**
+
+Users select the AOT executor at compile time through the traditional 
GraphExecutor compilation flow
+(e.g. `[tvm.relay.build](http://tvm.relay.build)`) by including 
`--executor=aot` in the Target
+[1]. The return value of `tvm.relay.build` in this case is an 
`AotExecutorFactory` Module
+object. Users instantiate the AOT executor via `AotExecutorFactory` as they do 
with `GraphExecutor`:
+
+```bash
+ir_mod = tvm.parser.fromtext("""\
+      #[version = "0.0.5"]
+      def @main(%a : Tensor[(1, 2), uint8], %b : Tensor[(1, 2), uint8]) {
+          %0 = %a + %b;
+          %0
+      }"""
+    )
+
+with PassConfig(opt_level=3):
+  factory : AotExecutorFactory = tvm.relay.build(
+       ir_mod, "llvm -executor=aot", module_name="my_mod")
+
+aot_executor : AotExecutor = factory["my_mod"](tvm.cpu(0))
+```
+
+`AotExecutor` supports the traditional Module-Based Model Runtime Interface 
and can be used as a
+user normally would `GraphExecutor`:
+
+```bash
+aot_executor.set_input("a", tvm.nd.array(np.ndarray([1, 2], dtype="uint8")))
+aot_executor.set_input("b", tvm.nd.array(np.ndarray([3, 5], dtype="uint8")))
+aot_exec.run()
+output = aot_exec.get_output(0)
+assert output.asnumpy() == np.ndarray([5, 7], dtype="uint8")
+```
+
+[1] NOTE: The target string is not the final place this customization should 
be made. However, it's
+been the place where we've been putting runtime-related stuff. A separate RFC 
will split the Target
+string into Target options (which affect tuning) and runtime options.
+
+# **Reference-level explanation**
+
+Already committed to TVM is the AotExecutorCodegen. This module produces a TIR 
top-level function
+which invokes the Relay operators (implemented in TIR) in a correct order. An 
example is given
+below:
+
+```bash
+PrimFunc([input1, input2, output]) attrs={"global_symbol": 
"tvmgen_my_mod_run_model", "runner_function": (bool)1} {
+  // attr [(nullptr)] device_id = 0
+  // attr [(nullptr)] device_type = 1
+  tir.tvm_call_packed("tvmgen_my_mod_fused_add", input1, input2, output)
+}
+```
+
+The AotExecutor then needs to accomplish the following to meet Module-based 
Model Runtime Interface:
+
+1. Allocate input and output tensors as defined in the `run_model` function 
using the correct Device
+   API.
+2. Provide a mapping from relay parameter name to positional argument.
+3. Invoke the generated TIR function and provide profiling.
+
+### Compiler ↔ Runtime Metadata
+
+In order to implement (1) and (2) above, additional metadata about the 
`run_model` function needs to
+be communicated from Compiler to Runtime:
+
+- The mapping between Relay parameter name and TIR argument position
+- The number of inputs and outputs
+- The type of each parameter
+- Information sufficient to choose a Device API to allocate memory for that 
data.
+
+At present, Metadata is passed from Compiler to Runtime in several different 
ways:
+
+1. Constant DLTensor can be bundled with code and supplied to 
`runtime::Module` via
+   `runtime::MetadataModule`
+2. Many non-DSO-exportable backends (`cuda`, `hexagon`, `metal`, `opencl`, 
`sdaccel`, `rocm`,
+   `vulkan`) have adopted the convention of including a
+   
[1runtime::FunctionInfo`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L106)
+   (NOTE: distinct from `tvm::relay::transform::FunctionInfo`) in their 
serialization:
+
+    ```bash
+    /*! \brief function information needed by device */
+    struct FunctionInfo {
+      std::string name;
+      std::vector<DLDataType> arg_types;
+      std::vector<std::string> launch_param_tags;
+    }
+    ```
+
+3. AotExecutorCodegen and GraphExecutorCodegen have adopted the practice of 
producing the
+   graph-level
+   
[`runtime::MetadataNode`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L55):
+
+    ```bash
+    /*!
+     * \brief Structure that can be optionally used by the executor codegen
+     */
+    class MetadataNode : public Object {
+     public:
+      /*! \brief input information for the main function */
+      Array<String> inputs;
+      /*! \brief number of outputs of the main function */
+      int num_outputs = 1;
+      /*! \brief the executor to be used to run the model */
+      String executor = kTvmExecutorGraph;
+
+      String mod_name = "";
+    }
+    ```
+
+4. The recent AOTExecutor implementation has created 
`tvm::relay::transform::FunctionInfo` which
+   communicates statistics about memory usage and I/O operation for each TIR 
operator and aggregate
+   statistics for the top-level AOT function:
+
+    ```bash
+    struct FunctionInfoNode : public Object {
+      Map<Target, Integer> workspace_sizes;
+      Map<Target, Integer> io_sizes;
+      Map<Target, Integer> constant_sizes;
+      Map<Target, tir::PrimFunc> tir_primfuncs;
+      Map<Target, Function> relay_primfuncs;
+    }
+    ```
+
+
+Some duplication of information is already present. Likely this is due in part 
to the existing
+middle-end compiler design, in which a separate `IRModule` is produced for 
each backend. Another

Review comment:
       hmm, i'm not quite sure i follow you here. I'm happy to add a reference 
to the Artifact proposal, but I'm not sure it's quite exactly what I'm stating 
here. Here, what I mean is that the TIR-to-Runtime interface is `IRModule -> 
(tree of runtime.Module)`. The existing MetadataModule (which is proposed to 
rename to ConstLoaderModule here) seems to have arisen out of a desire to build 
common infrastructure to handle loading DLTensor from `.text` in the C++ 
runtime. Here what I'm trying to point out is that since the TIR-to-Runtime 
interface provides no facility for the TIR-to-runtime processes to return 
metadata outside of the `runtime::Module`, this leads to duplication of 
information should it be required by the compiler in any way. For example, 
`constant_sizes` could be deduced from the DLTensor passed to 
ConstLoaderModule, but ConstLoaderModule is not supported by all runtimes and 
not the de-facto way to load constant data or metadata at runtime because it 
doesn't support enc
 oding structs and scalar values. You can also see some duplication in 
CudaModule.
   
   I think this proposal is attempting to start down the path of unifying these 
different methods of providing data generated during lowering to the runtime as 
Metadata. I think that's mainly covered here, but happy to add a reference to 
the Artifact thing if it helps, it just seems a bit orthogonal to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [tvm-rfcs] areusch commented on a change in pull request #46: Module Based Model Runtime for AOT

Reply via email to