manupa-arm commented on a change in pull request #46: URL: https://github.com/apache/tvm-rfcs/pull/46#discussion_r804789970
########## File path: rfcs/0046-module-based-model-runtime-for-aot.md ##########
@@ -0,0 +1,348 @@
+# Module-based Model Runtime Interface for AOT
+
+- Feature Name: module_based_model_runtime_for_aot
+- Start Date: 2021-09-17
+- RFC PR: [apache/tvm-rfcs#0046](https://github.com/apache/tvm-rfcs/pull/0046)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# **Summary**
+
+This RFC describes a [Module-based Model Runtime
+interface](https://discuss.tvm.apache.org/t/discuss-module-based-model-runtime-interface/5025) for
+the [Ahead-of-Time Executor](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206), thereby
+enabling its use from the TVM C++ Runtime.
+
+# **Motivation**
+
+The microTVM project has made significant progress towards an Ahead-of-Time Executor for compiled
+Relay models. At the time of writing, it's now possible to codegen a TIR function which executes
+Relay models that have known shapes, don't have graph-level control flow, and execute only on the
+CPU device. Right now, the C runtime is the only such runtime environment which can interact with
+this generated code. However, significant interest exists in enabling the C++ runtime to use the
+Ahead-of-Time executor.
+
+# **Guide-level explanation**
+
+Users select the AOT executor at compile time through the traditional GraphExecutor compilation flow
+(e.g. `tvm.relay.build`) by including `--executor=aot` in the Target [1]. The return value of
+`tvm.relay.build` in this case is an `AotExecutorFactory` Module object. Users instantiate the AOT
+executor via `AotExecutorFactory` as they do with `GraphExecutor`:
+
+```python
+ir_mod = tvm.parser.fromtext("""\
+  #[version = "0.0.5"]
+  def @main(%a : Tensor[(1, 2), uint8], %b : Tensor[(1, 2), uint8]) {
+    %0 = %a + %b;
+    %0
+  }"""
+)
+
+with tvm.transform.PassContext(opt_level=3):
+    factory : AotExecutorFactory = tvm.relay.build(
+        ir_mod, "llvm -executor=aot", module_name="my_mod")
+
+aot_executor : AotExecutor = factory["my_mod"](tvm.cpu(0))
+```
+
+`AotExecutor` supports the traditional Module-Based Model Runtime Interface and can be used as a
+user normally would `GraphExecutor`:
+
+```python
+aot_executor.set_input("a", tvm.nd.array(np.array([[1, 2]], dtype="uint8")))
+aot_executor.set_input("b", tvm.nd.array(np.array([[3, 5]], dtype="uint8")))
+aot_executor.run()
+output = aot_executor.get_output(0)
+assert (output.asnumpy() == np.array([[4, 7]], dtype="uint8")).all()
+```
+
+[1] NOTE: The target string is not the final place this customization should be made. However, it's
+been the place where we've been putting runtime-related options. A separate RFC will split the
+Target string into Target options (which affect tuning) and runtime options.
+
+# **Reference-level explanation**
+
+Already committed to TVM is the AotExecutorCodegen. This module produces a TIR top-level function
+which invokes the Relay operators (implemented in TIR) in a correct order. An example is given
+below:
+
+```
+PrimFunc([input1, input2, output]) attrs={"global_symbol": "tvmgen_my_mod_run_model", "runner_function": (bool)1} {
+  // attr [(nullptr)] device_id = 0
+  // attr [(nullptr)] device_type = 1
+  tir.tvm_call_packed("tvmgen_my_mod_fused_add", input1, input2, output)
+}
+```
+
+The AotExecutor then needs to accomplish the following to meet the Module-based Model Runtime Interface:
+
+1. Allocate input and output tensors as defined in the `run_model` function using the correct Device

Review comment:

> However, consider a microTVM use case where DMA is used to copy data from e.g. a camera into an SRAM dedicated to accelerator usage. In this case, TVM should really do as much as possible to stay out of the way--the copy operation is application- or at least SoC-specific.
>
> Additionally, I imagine in the double-buffered input use case, the user would need to somehow communicate that space for two input buffers is needed. Not sure how to do that now outside of downsizing a memory pool.

So the requirement here is that the accelerator can only access certain memories, and the input tensor is not originally present in such a memory. From an AoT perspective, we should ideally provide the compiler with this information, because we can't generally know where the input is going to be. Then, when the compiler knows the tensor needs copying, it could insert a copy -- lowered to produce an intermediary Allocate node -- into accelerator-accessible memory. Once we have that, it should translate down to a loop nest of copies that could be scheduled via the double_buffer() scheduling primitive in either TE or S-TIR. This will create intermediary storage in the form of Allocate nodes that can be planned properly. Additionally, depending on who performs the copy (the accelerator or the host), we need to place it accordingly; copy insertion then becomes an IRModule --> IRModule transformation, for which I think a wrapper would not be sufficient -- I would assume copies might need to end up in the device functions.

Now, I agree that this is out of scope / too much to consider for this work. In its absence, I think letting the set_input runtime function handle the copy makes sense. It is just that the short-term solution runs against the long-term solution in my mind, which is what sparked this conversation. However, I agree with you that there is no easy way other than inserting the copies in the host PrimFunc of the (host, device) PrimFunc pair if we want to do it in a more codegen'd flow.
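The double-buffered input staging discussed above can be sketched in plain Python. This is a hypothetical illustration of the ping-pong pattern only, not TVM's `double_buffer()` primitive or any real DMA API; `process`, `run_double_buffered`, and `CHUNK` are invented names:

```python
# Hypothetical sketch of double-buffered input staging: while the
# "accelerator" consumes one buffer, the next chunk is staged (in a real
# system, by DMA, concurrently) into the other buffer. Note it requires
# space for TWO input buffers, which is the sizing concern raised above.

CHUNK = 4

def process(chunk):
    # Stand-in for the accelerator consuming a staged chunk.
    return sum(chunk)

def run_double_buffered(data):
    buffers = [[0] * CHUNK, [0] * CHUNK]  # two ping-pong staging buffers
    results = []
    n_chunks = len(data) // CHUNK
    # Prologue: stage the first chunk into buffer 0.
    buffers[0][:] = data[0:CHUNK]
    for i in range(n_chunks):
        cur = buffers[i % 2]
        # Stage chunk i+1 into the *other* buffer; on hardware this copy
        # would overlap with the compute on chunk i below.
        if i + 1 < n_chunks:
            buffers[(i + 1) % 2][:] = data[(i + 1) * CHUNK:(i + 2) * CHUNK]
        results.append(process(cur))
    return results
```

In a codegen'd flow, the staging copy would instead be an Allocate plus a copy loop nest in TIR, which is where a `double_buffer()`-style scheduling transform could apply.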
########## File path: rfcs/0046-module-based-model-runtime-for-aot.md ##########
@@ -0,0 +1,348 @@
[… diff context identical to the previous hunk through "1. Allocate input and output tensors …", then:]
+1. Allocate input and output tensors as defined in the `run_model` function using the correct Device
+   API.
+2. Provide a mapping from relay parameter name to positional argument.
+3. Invoke the generated TIR function and provide profiling.
+
+### Compiler ↔ Runtime Metadata
+
+In order to implement (1) and (2) above, additional metadata about the `run_model` function needs to
+be communicated from Compiler to Runtime:
+
+- The mapping between Relay parameter name and TIR argument position
+- The number of inputs and outputs
+- The type of each parameter
+- Information sufficient to choose a Device API to allocate memory for that data.
+
+At present, Metadata is passed from Compiler to Runtime in several different ways:
+
+1. Constant DLTensor can be bundled with code and supplied to `runtime::Module` via
+   `runtime::MetadataModule`
+2. Many non-DSO-exportable backends (`cuda`, `hexagon`, `metal`, `opencl`, `sdaccel`, `rocm`,
+   `vulkan`) have adopted the convention of including a
+   [`runtime::FunctionInfo`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L106)
+   (NOTE: distinct from `tvm::relay::transform::FunctionInfo`) in their serialization:
+
+   ```c++
+   /*! \brief function information needed by device */
+   struct FunctionInfo {
+     std::string name;
+     std::vector<DLDataType> arg_types;
+     std::vector<std::string> launch_param_tags;
+   };
+   ```
+
+3. AotExecutorCodegen and GraphExecutorCodegen have adopted the practice of producing the
+   graph-level
+   [`runtime::MetadataNode`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L55):
+
+   ```c++
+   /*!
+    * \brief Structure that can be optionally used by the executor codegen
+    */
+   class MetadataNode : public Object {
+    public:
+     /*! \brief input information for the main function */
+     Array<String> inputs;
+     /*! \brief number of outputs of the main function */
+     int num_outputs = 1;
+     /*! \brief the executor to be used to run the model */
+     String executor = kTvmExecutorGraph;
+
+     String mod_name = "";
+   };
+   ```
+
+4. The recent AOTExecutor implementation has created `tvm::relay::transform::FunctionInfo`, which
+   communicates statistics about memory usage and I/O operations for each TIR operator and aggregate
+   statistics for the top-level AOT function:
+
+   ```c++
+   struct FunctionInfoNode : public Object {
+     Map<Target, Integer> workspace_sizes;
+     Map<Target, Integer> io_sizes;
+     Map<Target, Integer> constant_sizes;
+     Map<Target, tir::PrimFunc> tir_primfuncs;
+     Map<Target, Function> relay_primfuncs;
+   };
+   ```
+
+Some duplication of information is already present. Likely this is due in part to the existing
+middle-end compiler design, in which a separate `IRModule` is produced for each backend. Another

Review comment:

I don't dispute the claim that it is IRModule --> (tree of runtime.Modules). The current lowering flow creates an IRModule per backend, and these get translated to runtime.Modules. I see that as (correct me if I am wrong):

Unified IRModule --> [IRModule per backend] --> tree of runtime.Modules

which is basically a host runtime.Module holding a flat array of device runtime.Modules. The proposal here wants to attach model-level Metadata from the Unified IRModule to the root of that tree of runtime.Modules.

So the gap in the text is that it assumes it is common knowledge how the [IRModule per backend] stage disappears. Hypothetically, if that stage were not there, the text would make sense, because the model-level metadata could be attached to the Unified IRModule and then passed on to the tree of runtime.Modules.

My question is this: in the absence of a proposal for removing the IRModule-per-backend stage (or, if one already exists, please link the RFC/pre-RFC), this RFC needs to outline how the metadata will be communicated from the Unified IRModule to the tree of runtime.Modules in the current lowering flow and/or with the changes brought by this RFC.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
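The host/device module tree and root-level metadata attachment described in the comment above can be sketched in plain Python. The classes and helper here are hypothetical stand-ins, not TVM's actual `runtime::Module` API:

```python
# Hypothetical model of the lowering output discussed in the comment:
# a host runtime.Module whose flat imports array holds one device module
# per backend, with model-level metadata attached only at the tree's root.

class Module:
    """Illustrative stand-in for a runtime.Module node."""
    def __init__(self, name, target):
        self.name = name
        self.target = target
        self.imports = []     # flat array of device modules
        self.metadata = None  # populated only at the root

# Unified IRModule -> [IRModule per backend] -> tree of runtime.Modules
host = Module("tvmgen_my_mod", target="llvm")
host.imports.append(Module("tvmgen_my_mod_dev", target="cuda"))

# Model-level metadata (input names, output count) attached at the root,
# mirroring the inputs/num_outputs/mod_name fields of MetadataNode.
host.metadata = {"inputs": ["a", "b"], "num_outputs": 1, "mod_name": "my_mod"}

def root_metadata(mod):
    # Lookup starts from the root; device modules carry no model metadata.
    return mod.metadata
```

The open question in the comment is precisely where, in the current lowering flow, this root attachment happens once the per-backend IRModules have been translated away.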
