sunggg commented on code in PR #89: URL: https://github.com/apache/tvm-rfcs/pull/89#discussion_r950360721
########## rfcs/0089-relax-upstreaming.md: ########## @@ -0,0 +1,701 @@ +- Feature Name: Relax Upstreaming +- Start Date: 2022-08-17 +- RFC PR: [apache/tvm-rfcs#0089](https://github.com/apache/tvm-rfcs/pull/0089) +- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000) +- Co-Authors: [@denise-k](https://github.com/denise-k), [@jwfromm](https://github.com/jwfromm) + +# 1. **Summary** + +This RFC proposes to upstream the core foundation of Relax (Relay Next). Relax is a new graph-level IR that enables new capabilities to address the [critical needs](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344) identified by the TVM community over the years of using and developing deep learning compilers. + +# 2. **Motivation and goals** + +Relax is an effort within [TVM Unity](https://tvm.apache.org/2021/12/15/tvm-unity) that aims to evolve the graph-level IR to maximize **expressibility, performance, and portability** across today and tomorrow’s workloads. Relax has three key goals motivated by the TVM community’s needs, and lessons the community has learned in ML acceleration through years of using and developing TVM: + +- Build a unified interface to transcends the boundaries of TVM’s abstractions between graph-level IR, tensor programs (TensorIR), and runtime libraries (PackedFunc); +- Enable and optimize dynamic shape workloads; +- Support “computational graph” style optimizations with advanced dataflow semantics. + +For more details on the design goals of Relax, please check out the [discuss forum post](https://discuss.tvm.apache.org/t/relax-co-designing-high-level-abstraction-towards-tvm-unity/12496). + +The main focus of this upstreaming RFC is to upstream the **core foundation** of Relax as an **optional** compilation flow in TVM with two principles: + +- **Minimize disruption:** This upstreaming should provide a **non-default** path to enable new capabilities for users/developers who are interested in what Relax brings, so it will not break the current default Relay flow. +- **Minimize complexity:** This upstreaming should reuse existing TVM/Relay infrastructure as much as possible (for example IRModule, runtime Module, TOPI library, etc.) to avoid duplicated effort and code. + +This initial upstreaming will open the path for TVM Unity, and incrementally bring Relax into the community. + +# 3. **Guide-level explanation** + +This section introduces the three major design points of Relax, which map directly to the three key goals of Relax in the last section. At the beginning of this section, we first introduce what user-facing interfaces will look like after this RFC lands. + +(Most of the code examples in this RFC are written in [TVMScript](https://github.com/apache/tvm-rfcs/pull/74/files#diff-6965a40ad8df7618ae68e11c88f924542a506c74a931cc3011ae9f99989b5f51R21-R27), which enables users to write and print TVM programs containing both Relax and TIR functions with Python syntax.) + +## User-facing interface + +After this upstreaming lands, users are able to write a Relax program in TVMScript or translate a model directly from Relay. Relax provides a simple API to compile the IRModule to VM executable, and run it on Relax VM. + +```python +import tvm.script +from tvm.script import relax as R, tir as T + +# Relax IRModule written in TVMScript [email protected]_module +class MyIRModule: + # This is a TIR PrimFunc which calls the TIR intrinsic T.exp + @T.prim_func + def tir_exp_func(x: T.handle, y: T.handle): ## <= D2 + X = T.match_buffer(x, (n,), "float32") + Y = T.match_buffer(y, (n,), "float32") + with T.grid(n) as i: + Y[i] = T.exp(X[i]) + + # This is a Relax function which contains a dataflow block + # representing a computational graph, as well as a call to an + # opaque packed function which performs an in-place update to the + # data in variable gv0. + # We mark the corresponding design points (D0, D1, D2) that map to + # the following sections throughout the relax function bellow. + @R.function + def relax_func(x: R.Tensor[(n, k), "float32"], w: R.Tensor[_, "float32"]): + # n, k above are implicitly defined within the function signature + # so we will be able to refer to n, k within all of relax_func + with R.dataflow(): ## <= D2 + lv0 = R.match_shape(w, (k, m)) ## <= D1 + lv1: R.Tensor[(n, m), "float32"] = R.dot(x, lv0) + lv2: R.Tensor[(n * m,), "float32"] = R.flatten(lv1) ## <= D1 + lv3: R.Shape = (n * m,) ## <= D1 + gv0 = R.call_tir(tir_exp_func, [lv2], lv3, dtype="float32") ## <= D0 + R.outputs(gv0) + + R.call_packed("custom_inplace_update", gv0) ## <= D0, D2 + return gv0 + +# Print IRModule with syntax highlighting +MyIRModule.show() + +# Build the Relax IRModule +target = tvm.target.Target("llvm") +exec = relax.vm.build(MyIRModule, target) + +# Dump the VM executable instructions as text +print(ex.as_text()) + +# Run the function on Relax VM runtime +vm = relax.VirtualMachine(exec, tvm.cpu()) +shape = (2, 3) +data = tvm.nd.array(np.random.rand(*shape).astype(np.float32)) +res = vm["relax_func"](data) +``` + +## D0: ****Unified abstractions and optimizations across layers**** + +The first key design point is to allow the high-level graph IR to be able to directly interact and call into lower-level TensorIR and PackedFunc (TVM FFI). + +The TensorIR PrimFunc and many external libraries adopt a **destination-passing-style** (DPS) calling convention that both input and output are passed to the function as arguments, and the outputs are mutated directly inside the function: + +```python +def low_level_func(input0, input1, ..., output): + # implementations +``` + +The main idea of DPS is that input and output are explicitly allocated outside and passed to the low-level primitive function. This style is commonly used in low-level library designs (for example TensorRT), so that higher-level frameworks (for example, the compiler) can handle memory allocation. + +### ****call_tir**** + +In Relax, we introduce `call_tir` to bridge graph-level IR and TIR. `call_tir` is an intrinsic that calls a TIR PrimFunc (that follows DPS) and returns the output. The semantics of `call_tir` can be demonstrated by the code below. + +```python +def call_tir(tir_primfunc: GlobalVar, + inputs: Tuple[Expr], + output_shape: Shape, + output_dtype: DataType) -> Expr: + """Example code to demonstrate the semantics of call_tir""" + out_tensor = alloc_tensor(output_shape, output_dtype) + low_level_func(*inputs, out_tensor) + return out_tensor +``` + +`call_tir` takes in tir_primfunc (a GlobalVar that maps to a TIR PrimFunc in the IRModule), a tuple of inputs, output tensor shape and datatype. Notably, when the compiler lowers `call_tir`, it is not required to individually allocate each output tensor. The compiler can choose to create a memory plan of the intermediate tensors and tie things together for effective reuse. + +`call_tir` is implemented as a special relax operator to minimize the impact on the IR changes (instead of a standalone IR node). From the AST point of view, it becomes: + +```python +Call( + op=Op::Get("relax.call_tir"), + tir_primfunc, + inputs, + output_shape, + output_dtype +) +``` + +### ****call_packed**** + +In Relax, we introduce `call_packed` to bridge graph-level IR and PackedFunc. It indicates a call to a **non-DPS packed function** that is registered in the environment via TVM FFI. + +From the AST’s point of view, we do not need to introduce an additional call node, instead, we introduce an `ExternFunc` construct that represents a PackedFunc that we can call into (the PackedFunc may or may not return a value): + +```python +Call(op=ExternFunc("my_packed_func"), *args) +``` + +`R.call_packed("my_packed_func", gv0)` in TVMScript (as shown in the User-facing interface section) only served as a syntax sugar to represent the above AST node. + +### ****call_dps_packed**** + +To be able to call into a DPS packed function (many low-level library (e.g. TensorRT) functions are designed in this way), and hence the compiler is able to directly handle the output memory, we introduce a `call_dps_packed` intrinsic, which corresponds to the following AST: + +```python +Call( + op=Op::Get("relax.call_dps_packed"), + ExternFunc("my_packed_func"), + inputs, + output_shape, + output_dtype +) +``` + +Suppose `custom_packed_func` is a user-defined packed function in DPS: + +```python +R.call_dps_packed("custom_packed_func", (input0, input1), output_shape=(3, 4), output_dtype="float32") +``` + +corresponds to the following AST: + +```python +Call( + op=Op::Get("relax.call_dps_packed"), + ExternFunc("custom_packed_func"), + (input0, input1), + output_shape=(3, 4), + output_dtype="float32" +) +``` + +The following program in TVMScript shows that with `call_tir`, `call_packed`, and `call_dps_packed`, we can directly embed and call the TIR and PackedFunc functions in the high-level Relax IR program. + +```python +from tvm.script import relax as R + +# User-defined packed functions +# Non-DPS PackedFunc with return [email protected]_func("custom_add") +def add_packed(a, b): + ret = a.numpy() + b.numpy() + return tvm.nd.array(ret) + +# Non-DPS PackedFunc without return [email protected]_func("custom_print") +def print_packed(a): + print(a) + +# DPS PackedFunc [email protected]_func("custom_tile") +def tile_packed(a, b): + b[:] = tvm.nd.array(np.tile(a.numpy(), (1, 2))) + [email protected]_module +class MyIRModule: + # define a PrimFunc to do matrix multiply + # note TIR PrimFunc is in DPS, here z is the output + @T.prim_func + def tir_matmul(x: T.handle, y: T.handle, z: T.handle) -> None: + m = T.var("int32") + n = T.var("int32") + k = T.var("int32") + A = T.match_buffer(x, (m, n)) + B = T.match_buffer(y, (n, k)) + C = T.match_buffer(z, (m, k)) + + for (i0, j0, k0) in T.grid(m, n, k): + with T.block(): + i, j, k = T.axis.remap("SSR", [i0, j0, k0]) + with T.init(): + C[i, j] = 0.0 + C[i, j] += A[i, k] * B[j, k] + + @R.function + def relax_func(x: R.Tensor[(m, n), "float32"], y: R.Tensor[(n, k), "float32"]): + with R.dataflow(): + # call_tir calls into a PrimFunc, and returns the matrix multiplication result + gv0 = R.call_tir(tir_matmul, (x, y), (m, k), dtype="float32") + R.outputs(gv0) + + # call into a PackedFunc to print the value of gv0 + R.call_packed("custom_print", gv0) + + # call the registered "custom_add" non-DPS PackedFunc and return the result + gv1 = R.call_packed("custom_add", gv0, gv0) + + # call the registered "custom_tile" DPS PackedFunc and return the result + gv2 = R.call_dps_packed("custom_tile", (gv1), (m, k * 2), dtype="float32") + return gv2 +``` + +This cross-level interaction unlocks many interesting things that were not possible before, including, but not limited to: + +- Incrementally lower different parts of a program using different strategies, instead of lowering the entire program to TIR directly from Relay as today. +- Allow for more customized optimizations, such as whole program optimizations, cascading, and other post-schedule optimizations. +- Enable automation (MetaSchedule) to analyze call_tir nodes and the callee TIR programs, perform optimizations and rewrites to one or more call_tir nodes, thus feeding decisions such as layout rewrite directly to the high-level IR. +- By turning subgraphs into calls to PackedFunc (via call_dps_packed), BYOC becomes an IRModule ⇒ IRModule transformation as a natural part of compilation. Review Comment: Let me start with my comments for the previous comment :) > Relax would provide more unified and organized support for BYOC. In Relay, although the functionalities are provided, IMHO, it has been quite fragmented so it has been tricky to use multiple BYOCs together with TVM's pipeline. So, in relax, our goal is to > (1) Support the existing functionalities > (2) Organize and simplify BYOC workflow Personally, my ultimate goal is to unlock Relax to allow the easy customization of both TVM internal and BYOC pipeline. For example, we may want to rewrite the graph IRs while considering low-level optimizations for TIR and multiple BYOC choices. (e.g., for `Matmul+add`, we have bunch of choices: Tune with MetaSchedule, use BYOC like TensorRT, Cutlass, Cublas...). And some of recent BYOCs, such as Cutlass, are tunable. To enable this functionalities, pipeline should be easy to customize, straightforward, and maintainable for the future extension. This has been difficult to achieve in Relay for many reasons: for example, inflexible lowering, each BYOC has its own greedy partitioning without considering other potential BYOCs, no interaction between BYOC and main pipeline, etc. In relax, although it is still work-in-progess, we've seen some great signals. I would be happy to discuss further if you are interested in. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
