jcf94 opened a new pull request #5962: URL: https://github.com/apache/incubator-tvm/pull/5962
Hi all, The last PR of Ansor #5883 is not clear enough for reviewers to fully understand our design. After some discussion, we changed our upstream plan to propose a minimal version of Ansor which contains a small but complete framework, so others can get a better understanding of the whole structure of Ansor. --- In [[RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0)](https://discuss.tvm.ai/t/rfc-ansor-an-auto-scheduler-for-tvm-autotvm-v2-0/7005), we've introduced the **Ansor** auto-scheduler. And we reached an agreement that should replace AutoTVM with Ansor in the end. For most existing templates, current Ansor can directly replace them with better performance and less tuning time. For other special templates (low-precision, sparse), the plan is to introduce search space customization and gradually rewrite them with Ansor's new API. This PR introduces a self-contained minimum version of Ansor with most of the bones. This PR includes the interface of core data structures and an empty search policy that does nothing. More advanced search policy and cost models will be in the next few PRs. # Infrastructure: A Sketch IR for Schedule Searching Different from AutoTVM, whose tuning spaces are composed of predefined parameters, Ansor constructs a search space by manipulating the loop structures of the given compute DAG. To enable flexible manipulation of the loop structures, we implemented a lightweight loop structure IR (Intermediate Representation) based on the original TVM IR but specifically for schedule search. The IR is composed of the **state** and **action**, which are defined as follows: - **State**: A state of schedule search is the loop structure defined by the schedule (i.e., the TVM IR created by `tvm.lower`). See **LoopState** in the Key Data Structure for details. - **Action**: An action is composed of one or more schedule primitives to manipulate (e.g., split, reorder, fuse) a state. See **TransformStep** in the Key Data Structure for details. 
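The state/action split above can be sketched in plain Python. This is a hypothetical illustration (the `State`/`Action` classes and the `apply` method below are made up for this sketch, not the real Ansor classes): a state carries a loop-structure preview plus the actions that produced it, and applying an action yields a new state with the action appended to its history.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical sketch of the state/action abstraction, not Ansor's real classes.

@dataclass(frozen=True)
class Action:
    """One schedule primitive applied to a state (e.g. split, fuse)."""
    primitive: str  # "split", "reorder", "fuse", ...
    args: Tuple     # primitive-specific arguments

@dataclass
class State:
    """A point in the search space: a loop-structure preview plus the
    actions that produced it, so it can be replayed, backtracked, or mutated."""
    loops: List[str]
    history: List[Action] = field(default_factory=list)

    def apply(self, action: Action) -> "State":
        # Incremental change: copy the loop nest, record the action,
        # and update the preview (only "split" is sketched here).
        new_loops = list(self.loops)
        if action.primitive == "split":
            axis, factor = action.args
            i = new_loops.index(axis)
            new_loops[i:i + 1] = [axis + ".outer", axis + ".inner"]
        return State(new_loops, self.history + [action])

s0 = State(loops=["i", "j", "k"])
s1 = s0.apply(Action("split", ("i", 16)))
```

Because `apply` returns a fresh state, `s0` is untouched and `s1.history` records exactly how `s1` was reached, which is the property the real LoopState relies on for replay and mutation.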
We extend a new Sketch IR on top of the existing TVM IR, instead of using TVM IR directly, because:

1. We want fast incremental changes to the loop structures;
2. We want a serializable transform history for replay, backtracking, and mutation;
3. We may create macro schedule primitives that represent combinations of several TVM schedule primitives.

After the search is done, we lower this IR to TVM IR with TVM schedule primitives.

# Key Data Structures

To help build an overview of Ansor, this diagram shows the class relations of some important Ansor data structures:

<img width="1222" alt="Class relations of key Ansor data structures" src="https://user-images.githubusercontent.com/12119698/86079551-55da9980-bac3-11ea-9f5f-d51e3dce5dfd.png">

- **ComputeDAG**: The compute declaration graph and its related analysis tools. Related source files: `src/ansor/compute_dag.*`, `python/tvm/ansor/compute_dag.py`.
  Ansor takes a compute declaration described by `tvm.compute`, which can be a single operator or a subgraph, as input and converts it to a ComputeDAG. The ComputeDAG implementation includes a set of analyses, such as the total number of float operations, the consumer/producer relations of each operation stage, and whether a stage should be tiled or compute-inlined (*some of these analyses will be included in follow-up PRs*). These analyses help the search policy make decisions during the schedule search.
- **LoopState**: This defines the "state" for the search problem. Related source files: `src/ansor/loop_state.*`, `python/tvm/ansor/loop_state.py`.
  Each LoopState corresponds to a specific schedule for the target ComputeDAG. A LoopState consists of: 1. the current loop structure; 2. the transform history used to reach this loop structure. The loop structure keeps a **preview** of how the schedule will finally look after lowering (the number of iterators, the extent of each iterator, the `compute_at` locations, etc.), which helps the search policy make decisions during the search.
  The transform history is a sequence of `TransformStep`s, which will finally be mapped to schedule primitives.
- **TransformStep**: This defines the "action" for the search problem, i.e., the schedule primitives of our sketch IR. Related source files: `src/ansor/transform_step.*`, `python/tvm/ansor/loop_state.py`.
  Each step has a corresponding `tvm.te` schedule primitive. We record all TransformSteps of every state as its transform history. After the schedule search finishes, these transform steps are lowered to their corresponding TVM schedule primitives.
  *Note*: This PR only contains a small subset of the TransformSteps. The complete set of transform steps will come in the next PRs.
  > ComputeDAG also plays the role of connecting the Ansor state system to the TVM schedule system: a ComputeDAG can replay TransformSteps to produce the final TVM schedule (e.g., `ComputeDAG(state, actions)`).
- **SearchTask**: Meta information and hardware parameters for a specific schedule search task. Related source files: `src/ansor/search_task.*`.
  This structure includes the target ComputeDAG and device information, as well as hardware parameters obtained from the system or from user input.
- **SearchPolicy**: The search policy Ansor uses to auto-generate a high-performance schedule for different computations. Related source files: `src/ansor/search_policy/*`.
  A SearchPolicy takes a `SearchTask`, system information, and some tuning options as input, performs the schedule search, and returns the state with the best performance, which can later be applied to a TVM schedule. Note that in the Ansor paper ([https://arxiv.org/abs/2006.06762](https://arxiv.org/abs/2006.06762)) we proposed a sketch generation policy that achieves good results on various workloads across different devices. In this minimum-system PR, however, we only provide an `EmptyPolicy` to illustrate the search policy interface.
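How these structures fit together can be sketched with a minimal mock. All the classes below are hypothetical stand-ins written for this sketch, not the real `tvm.ansor` API; they only mirror the roles described above (ComputeDAG wraps the computation and replays states, SearchTask packs meta information, and a do-nothing policy stands in for `EmptyPolicy`).

```python
# Hypothetical stand-ins for the Ansor structures, not the real tvm.ansor API.

class ComputeDAG:
    """Wraps a compute declaration; can replay a state to a schedule."""
    def __init__(self, compute):
        self.compute = compute

    def get_init_state(self):
        # Initial state: untouched loop nest, empty transform history.
        return {"loops": ["i", "j", "k"], "history": []}

    def apply_steps_from_state(self, state):
        # Replay the recorded steps to produce a (mock) schedule.
        return "schedule-for:" + self.compute, state["history"]

class SearchTask:
    """Packs the DAG with task name and target/device information."""
    def __init__(self, dag, name, target):
        self.dag, self.name, self.target = dag, name, target

class EmptyPolicy:
    """Performs no search: returns the initial state, like this PR's stub."""
    def search(self, task):
        return task.dag.get_init_state()

dag = ComputeDAG("matmul")
task = SearchTask(dag, "test", "llvm")
best = EmptyPolicy().search(task)                 # search returns a state
sched, steps = dag.apply_steps_from_state(best)   # state -> schedule
```

The key design point this mock captures is that the policy only ever produces states; turning a state back into a concrete schedule is the ComputeDAG's job.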
# Ansor Minimum System

This is a brief diagram of the Ansor system:

<img width="1481" alt="Ansor system overview" src="https://user-images.githubusercontent.com/12119698/86077606-d3e87180-babe-11ea-8c3d-fe52c918db18.png">

1. Define the target computation with the TVM `te` API and create a `ComputeDAG` structure.
2. Specify the target device, hardware parameters, and tuning options, and pack them with the `ComputeDAG` to create a `SearchTask` structure.
3. `SearchPolicy` takes the `SearchTask` as input and performs the schedule search. During the search, `SearchPolicy` generates multiple candidate states, each of which corresponds to a specific TVM schedule.
4. Take the best state and use the ComputeDAG API to transform it into the final TVM schedule.

In the Ansor system, we use the sketch generation policy described in the paper (to be brought in by later PRs) as the default search policy, which should cover most use cases. Meanwhile, we will post an RFC for a custom rule mechanism that enables user-defined template search, serving the same purpose as today's AutoTVM templates. Specifically, we will provide Python APIs for the new IR that users can employ for sketch customization. They look very similar to the existing schedule primitives, as shown in `python/tvm/ansor/loop_state.py`. Our goal is to make sure Ansor covers all AutoTVM functionality while achieving the same or better performance, so that the community can gradually switch from AutoTVM to Ansor.

# More Details on LoopState

In this section, we illustrate what a loop state looks like and how it connects to the current TVM build system.
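One thing a state's preview tracks is the extent of each iterator, and the extents follow directly from the recorded steps. A small sketch of that arithmetic (the helper functions are made up for illustration, not Ansor code), using the 512x512x512 matmul shapes from the example that this section walks through:

```python
# Hypothetical extent bookkeeping for split/fuse; not Ansor code.

def split_extent(extent, factor):
    # Splitting an iterator by `factor` yields (outer, inner) extents.
    return extent // factor, factor

def fuse_extent(*extents):
    # A fused iterator covers the product of the fused extents.
    out = 1
    for e in extents:
        out *= e
    return out

# 512x512 output: split i by 16, split j by 8, then fuse the two outer loops.
i_outer, i_inner = split_extent(512, 16)   # (32, 16)
j_outer, j_inner = split_extent(512, 8)    # (64, 8)
fused = fuse_extent(i_outer, j_outer)      # 32 * 64 = 2048
```

This is where the `(0,2048)` extent of the fused outermost loop in the preview below comes from.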
Take a simple state that includes Split, Fuse, and Reorder steps as an example:

```python
A, B, C = matmul_ansor_test(512, 512, 512)
dag = ansor.ComputeDAG([A, B, C])
state = dag.get_init_state()
i, j, k = state[C].iters
io, ii = state.split(C, i, [16])
jo, ji = state.split(C, j, [8])
state.reorder(C, [io, jo, k, ji, ii])
fused_it = state.fuse(C, [io, jo])
```

First, let's print out the state. It shows the loop structure of the corresponding TVM schedule, the "preview":

```python
>>> print(state)
Placeholder: A, B
for i.outer@j.outer@ (0,2048)
  for k (0,512)
    for j.1 (0,8)
      for i.1 (0,16)
        C = ...
```

The state also stores all the transform steps required to reach it. We can print the transform history as TVM's Python schedule API. This is useful for debugging, or for applying the schedule on an older TVM version without Ansor support:

```python
>>> print(dag.print_python_code_from_state(state))
i, j, k = tuple(C.op.axis) + tuple(C.op.reduce_axis)
i_o, i_i = s[C].split(i, factor=16)
j_o, j_i = s[C].split(j, factor=8)
s[C].reorder(i_o, j_o, k, j_i, i_i)
i_o_j_o_fused = s[C].fuse(i_o, j_o)
```

We can also replay these steps to get a schedule for `tvm.lower` and `tvm.build`.
```python
>>> sche, args = dag.apply_steps_from_state(state)
>>> print(tvm.lower(sche, args, simple_mode=True))
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: handle, float32, [512, 512], []),
             A: Buffer(A_2: handle, float32, [512, 512], []),
             B: Buffer(B_2: handle, float32, [512, 512], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i.outer.j.outer.fused: int32, 0, 2048) {
    for (j.inner.init: int32, 0, 8) {
      for (i.inner.init: int32, 0, 16) {
        C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner.init*512)) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner.init)] = 0f32
      }
    }
    for (k: int32, 0, 512) {
      for (j.inner: int32, 0, 8) {
        for (i.inner: int32, 0, 16) {
          C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)] = (((float32*)C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)]) + (((float32*)A_2[(((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + k)])*((float32*)B_2[(((k*512) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)])))
        }
      }
    }
  }
}
```

The steps of this state can be serialized into the log file as:

```python
>>> target = tvm.target.create("llvm")
>>> task = ansor.SearchTask(dag, "test", target)
>>> inp = ansor.measure.MeasureInput(task, state)
>>> res = ansor.measure.MeasureResult([0.1], 0, "", 0.2, 1)
>>> with open("test.log", "w") as fp:
...     ansor.serialization.write_measure_records_to_file(fp.name, [inp], [res])
{"i": [["test", "llvm"], [[], [["SP", 2, 0, 512, [16], 1], ["SP", 2, 2, 512, [8], 1], ["RE", 2, [0, 2, 4, 3, 1]], ["FU", 2, [0, 1]]]]], "r": [[0.1], 0, 0.2, 1], "v": "v0.2"}
```

Ansor serializes all transform steps to the log file, while AutoTVM serializes the parameters of a predefined template. The log format follows the discussion in https://discuss.tvm.ai/t/rfc-canonicalizing-autotvm-log-format/7038/.
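Each log record is one JSON object per line, so it can be read back with standard tooling. Below is a small sketch of parsing the exact record shown above; the field meanings ("i" = input, "r" = result, "v" = version) are inferred from the example and should be treated as assumptions rather than a documented schema.

```python
import json

# Parse the Ansor log record shown above. Field meanings are inferred
# from the example ("i" = input, "r" = result, "v" = version): assumptions.

line = ('{"i": [["test", "llvm"], [[], [["SP", 2, 0, 512, [16], 1], '
        '["SP", 2, 2, 512, [8], 1], ["RE", 2, [0, 2, 4, 3, 1]], '
        '["FU", 2, [0, 1]]]]], "r": [[0.1], 0, 0.2, 1], "v": "v0.2"}')

record = json.loads(line)
(task_name, target), (_, steps) = record["i"]
costs, error_no, all_cost, timestamp = record["r"]

# Each step record starts with a short opcode: "SP"=split, "RE"=reorder, "FU"=fuse.
step_kinds = [s[0] for s in steps]
```

Note how the four recorded steps (two splits, one reorder, one fuse) line up one-to-one with the transform history printed earlier, which is exactly the "serialize transform steps, not template parameters" property described above.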
---

In the next few PRs, we will introduce the complete search policy, tutorials for single-op/subgraph schedule search, Relay integration, tutorials for end-to-end network schedule search, and custom rules to support customized search spaces.

This is joint work by @merrymercy @jcf94 @minminsun @FrozenGene @comaniac @yangjunpro @yidawang.
