jcf94 opened a new pull request #5962:
URL: https://github.com/apache/incubator-tvm/pull/5962


   Hi all,
   The last PR of Ansor #5883 is not clear enough for reviewers to fully 
understand our design. After some discussion, we changed our upstream plan to 
propose a minimal version of Ansor which contains a small but complete 
framework, so others can get a better understanding of the whole structure of 
Ansor.
   
   ---
   
   
   In [[RFC] Ansor: An Auto-scheduler for TVM (AutoTVM 
v2.0)](https://discuss.tvm.ai/t/rfc-ansor-an-auto-scheduler-for-tvm-autotvm-v2-0/7005),
 we've introduced the **Ansor** auto-scheduler. And we reached an agreement 
that should replace AutoTVM with Ansor in the end.
   For most existing templates, current Ansor can directly replace them with 
better performance and less tuning time.
   For other special templates (low-precision, sparse), the plan is to 
introduce search space customization and gradually rewrite them with Ansor's 
new API.
   
   This PR introduces a self-contained minimum version of Ansor with most of 
the bones.
   
   This PR includes the interface of core data structures and an empty search 
policy that does nothing. More advanced search policy and cost models will be 
in the next few PRs.
   
   # Infrastructure: A Sketch IR for Schedule Searching
   Different from AutoTVM, whose tuning spaces are composed of predefined 
parameters, Ansor constructs a search space by manipulating the loop structures 
of the given compute DAG.
   To enable flexible manipulation of the loop structures, we implemented a 
lightweight loop structure IR (Intermediate Representation) based on the 
original TVM IR but specifically for schedule search. The IR is composed of the 
**state** and **action**, which are defined as follows:
   
   - **State**: A state of schedule search is the loop structure defined by the 
schedule (i.e., the TVM IR created by `tvm.lower`). See **LoopState** in the 
Key Data Structure for details.
   
   - **Action**: An action is composed of one or more schedule primitives to 
manipulate (e.g., split, reorder, fuse) a state. See **TransformStep** in the 
Key Data Structure for details.
   
   We don't use the existing TVM IR but to extend a new Sketch IR on it is 
because:
   1. We want fast incremental change to the loop structures;
   2. We want serializable transform history for replay, backtracking, and 
mutation;
   3. We may create some macro schedule primitives that represent the 
combination of several TVM schedule primitives
   
   After the search is done, we will lower this to TVM IR with TVM schedule 
primitives.
   
   # Key Data Structures
   To help build an overview of Ansor, this shows the class relations of some 
important Ansor data structures:
   <img width="1222" alt="截屏2020-06-30 上午11 17 25" 
src="https://user-images.githubusercontent.com/12119698/86079551-55da9980-bac3-11ea-9f5f-d51e3dce5dfd.png";>
   
   - **ComputeDAG**: Compute declaration graph and its related analysis tools.
   Related source files: `src/ansor/compute_dag.*`, 
`python/tvm/ansor/compute_dag.py`
   Ansor takes a compute declaration, which could be a single operator or a 
subgraph, described by `tvm.compute` as an input and converts it to a 
ComputeDAG.
   ComputeDAG implementation includes a set of analyses such as the total float 
operations, consumer/producer relations of each operation stage, whether a 
stage should be tiled/compute inlined, and so on (*some of the analysis will be 
included in the follow-up PRs*). These analyses can help the search policy to 
do specific decisions during schedule search process.
   
   - **LoopState**: This defines the "state" for the search problem.
   Related source files: `src/ansor/loop_state.*`, 
`python/tvm/ansor/loop_state.py`.
   Each LoopState corresponds to a specific schedule for the target ComputeDAG.
   A LoopState consists of: 1. the current loop structure; 2. the transform 
history to reach this loop structure.
   The loop structure keeps a **preview** of how the schedule will finally look 
like after lowering (number of iterators, the extent of each iterator, the 
`compute_at` locations, etc), which can help the search policy to make 
decisions during the search.
   The transform history is a sequence of `TransformStep` which will finally be 
mapped to schedule primitives.
   
   - **TransformStep**: This defines the "action" for the search problem, i.e., 
the schedule primitives for our sketch IR.
   Related files: `src/ansor/transform_step.*`, `python/tvm/ansor/loop_state.py`
   Each step has its corresponding `tvm.te` schedule primitives. We record all 
TransformSteps for every state as its transform history. After finishing the 
schedule search, these transform steps will be lowered with their corresponding 
TVM's schedule primitives.
   *Note*: This PR only contains a small subset of the TransformSteps. The 
complete set of transform steps will be in the next PRs.
   
   > ComputeDAG is also playing a role of connecting Ansor state system to TVM 
schedule system. That is, ComputeDAG is able to replay TransformSteps to the 
final TVM schedule (e.g., `ComputeDAG(state, actions)`).
   
   - **SearchTask**: Meta information and hardware parameters for a specific 
schedule search task.
   Related source files: `src/ansor/search_task.*`
   This structure includes the target ComputeDAG, device information as well as 
some hardware parameters obtained from the system or user inputs.
   
   - **SearchPolicy**: The search policy is defined for Ansor to auto-generate 
a high-performance schedule for different computations.
   Related source files: `src/ansor/search_policy/*`
   A SearchPolicy takes a `SearchTask`, system information and some tuning 
options as inputs, performs the schedule search, and returns a state with the 
best performance. The resulting state can be used to apply to TVM schedule 
later.
   
     Note that in Ansor paper 
([https://arxiv.org/abs/2006.06762](https://arxiv.org/abs/2006.06762)), we 
proposed a sketch generation policy which achieves pretty good results with 
various workloads on different devices. On the other hand, in this minimum 
system PR, we only provide an `EmptyPolicy` to illustrate the search policy 
interface.
   
   # Ansor Minimum System
   
   This is a brief diagram for the Ansor system:
   <img width="1481" alt="截屏2020-06-30 上午10 44 59" 
src="https://user-images.githubusercontent.com/12119698/86077606-d3e87180-babe-11ea-8c3d-fe52c918db18.png";>
   
   1. Define the target computation with TVM `te` API and create a ComputeDAG 
structure.
   2. Specify the target device, hardware parameters, tuning options and pack 
those with `ComputeDAG` to create a `SearchTask` structure.
   3. `SearchPolicy` takes the `SearchTask` as input and performs the schedule 
search. During the search process, `SearchPolicy` will generate multiple 
candidate states, each of which corresponds to a specific TVM schedule.
   4. Get the best state and use ComputeDAG API to transform it to the final 
TVM schedule.
   
   In the Ansor system, we use the sketch generation policy (will be brought in 
later PRs) described in the paper as the default search policy, which should be 
enough to cover most use cases. Meanwhile, we will have an RFC for a custom 
rule mechanism that enables user-defined template search to serve the same 
functionality as the current AutoTVM template. Specifically, we will provide 
Python APIs for the new IR that is intended to be used by users for sketch 
customization. They look very similar to existing schedule primitives, as shown 
in `python/tvm/ansor/loop_state.py`.
   
   Our goal is to make sure Ansor can cover all AutoTVM functionalities while 
achieving the same or better performance so that the community can gradually 
switch to Ansor from AutoTVM.
   
   # More Details on LoopState
   
   In this section, we illustrate how a loop state looks like and how does it 
connect to the current TVM build system.
   
   Take a simple State that includes Split, Fuse and Reorder steps for example:
   
   ```python
   A, B, C = matmul_ansor_test(512, 512, 512)
   dag = ansor.ComputeDAG([A, B, C])
   state = dag.get_init_state()
   i, j, k = state[C].iters
   io, ii = state.split(C, i, [16])
   jo, ji = state.split(C, j, [8])
   state.reorder(C, [io, jo, k, ji, ii])
   fused_it = state.fuse(C, [io, jo])
   ```
   
   First, let's print out a state. It shows the loop structure of the 
corresponding TVM schedule, the "preview":
   ```python
   >>> print(state)
   
   Placeholder: A, B
   for i.0@j.0@ (0,2048)
     for k (0,512)
       for j.1 (0,8)
         for i.1 (0,16)
           C = ...
   ```
   
   It stores all history transform steps required to reach the current state. 
We can print the history transform steps as TVM's python schedule API. This can 
be used for debugging or to apply the schedule on a former TVM version without 
Ansor support.
   
   ```python
   >>> print(dag.print_python_code_from_state(state))
   
   i, j, k = tuple(C.op.axis) + tuple(C.op.reduce_axis)
   i_o, i_i = s[C].split(i, factor=16)
   j_o, j_i = s[C].split(j, factor=8)
   s[C].reorder(i_o, j_o, k, j_i, i_i)
   i_o_j_o_fused = s[C].fuse(i_o, j_o)
   ```
   
   We can also replay these steps to get a schedule for `tvm.lower` and 
`tvm.build`.
   
   ```python
   >>> sche, args = dag.apply_steps_from_state(state)
   >>> print(tvm.lower(sche, args, simple_mode=True))
   
   primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
     attr = {"global_symbol": "main", "tir.noalias": True}
     buffers = {C: Buffer(C_2: handle, float32, [512, 512], []),
                A: Buffer(A_2: handle, float32, [512, 512], []),
                B: Buffer(B_2: handle, float32, [512, 512], [])}
     buffer_map = {A_1: A, B_1: B, C_1: C} {
     for (i.outer.j.outer.fused: int32, 0, 2048) {
       for (j.inner.init: int32, 0, 8) {
         for (i.inner.init: int32, 0, 16) {
           C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + 
(i.inner.init*512)) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner.init)] 
= 0f32
         }
       }
       for (k: int32, 0, 512) {
         for (j.inner: int32, 0, 8) {
           for (i.inner: int32, 0, 16) {
             C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) 
+ (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)] = 
((float32*)C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + 
(floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)]) + 
((float32*)A_2[(((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + 
k)])*(float32*)B_2[(((k*512) + (floormod(i.outer.j.outer.fused, 64)*8)) + 
j.inner)])))
           }
         }
       }
     }
   }
   ```
   
   The steps of this state can be serialized into the log file as:
   ```python
   >>> target = tvm.target.create("llvm")
   >>> task = ansor.SearchTask(dag, "test", target)
   >>> inp = ansor.measure.MeasureInput(task, state)
   >>> res = ansor.measure.MeasureResult([0.1], 0, "", 0.2, 1)
   >>> with open("test.log", "w") as fp:
   >>>     ansor.serialization.write_measure_records_to_file(fp.name, [inp], 
[res])
   
   {"i": [["test", "llvm"], [[], [["SP", 2, 0, 512, [16], 1], ["SP", 2, 2, 512, 
[8], 1], ["RE", 2, [0, 2, 4, 3, 1]], ["FU", 2, [0, 1]]]]], "r": [[0.1], 0, 0.2, 
1], "v": "v0.2"}
   ```
   Ansor serializes all transform steps to the log file; while AutoTVM 
serializes parameters of a predefined template. The log format discussion would 
be based on 
https://discuss.tvm.ai/t/rfc-canonicalizing-autotvm-log-format/7038/.
   
   ---
   
   In the next few PRs, we'll introduce the complete search policy and 
tutorials for single op/ subgraph schedule search, Relay integration, and 
tutorials for end-to-end network schedule search, custom rules to support 
customized search space.
   
   This is a joint work by @merrymercy @jcf94 @minminsun @FrozenGene @comaniac 
@yangjunpro @yidawang .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to