This is an automated email from the ASF dual-hosted git repository.
tqchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-rfcs.git
The following commit(s) were added to refs/heads/main by this push:
new 4e26cc6 [RFC] Meta Schedule (AutoTIR) (#5)
4e26cc6 is described below
commit 4e26cc66a79775dc7cbd570d31f998dfdaf48285
Author: Junru Shao <[email protected]>
AuthorDate: Mon Jul 26 17:19:19 2021 -0700
[RFC] Meta Schedule (AutoTIR) (#5)
Co-authored-by: Siyuan Feng <[email protected]>
Co-authored-by: Bohan Hou
<[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Hongyi Jin <[email protected]>
Co-authored-by: Wuwei Lin <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Tristan Konolige <[email protected]>
Co-authored-by: Andrew Reusch <[email protected]>
---
resources/meta-schedule-workflow.png | Bin 0 -> 42034 bytes
rfcs/0001-meta-schedule-autotensorir.md | 604 ++++++++++++++++++++++++++++++++
2 files changed, 604 insertions(+)
diff --git a/resources/meta-schedule-workflow.png
b/resources/meta-schedule-workflow.png
new file mode 100644
index 0000000..300b9eb
Binary files /dev/null and b/resources/meta-schedule-workflow.png differ
diff --git a/rfcs/0001-meta-schedule-autotensorir.md
b/rfcs/0001-meta-schedule-autotensorir.md
new file mode 100644
index 0000000..79c192d
--- /dev/null
+++ b/rfcs/0001-meta-schedule-autotensorir.md
@@ -0,0 +1,604 @@
+<!--- Licensed to the Apache Software Foundation (ASF) under one -->
+<!--- or more contributor license agreements. See the NOTICE file -->
+<!--- distributed with this work for additional information -->
+<!--- regarding copyright ownership. The ASF licenses this file -->
+<!--- to you under the Apache License, Version 2.0 (the -->
+<!--- "License"); you may not use this file except in compliance -->
+<!--- with the License. You may obtain a copy of the License at -->
+
+<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
+
+<!--- Unless required by applicable law or agreed to in writing, -->
+<!--- software distributed under the License is distributed on an -->
+<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
+<!--- KIND, either express or implied. See the License for the -->
+<!--- specific language governing permissions and limitations -->
+<!--- under the License. -->
+
+* Feature Name: Meta Schedule (Formerly AutoTIR)
+* Start Date: 2021-05-28
+* RFC PR: https://github.com/apache/tvm-rfcs/pull/5/
+* GitHub Issue: https://github.com/apache/tvm/issues/8473
+
+## 1. Summary
+
+This proposal introduces Meta Schedule: a scheduling DSL on TIR that unifies
the
+approaches of AutoTVM [1] and AutoScheduler [2]. Meta schedule provides a
pragmatic way to define
+the space of automatic tuning, extensibility in terms of all possible TIR
schedule primitives like
+tensorization and loop partitioning, and customizability on every layer of the
automation system.
+
+Meta Schedule is the 3rd generation automatic scheduling system.
+
+## 2. Motivation
+
+**Scheduling.** In TVM,
+both TensorIR and TE allow direct or indirect IR transformation guided by a
+developer-provided program, for example,
+specifying a particular reordering of loops for better locality,
+or tensorizing a compute region with specific hardware intrinsics.
+The process of invoking such a set of pre-defined transformations is called
"**scheduling**",
+and each of such transformations is called a "**schedule primitive**".
+These primitives form a domain-specific language (DSL).
+
+**Design space.** The set of all possible schedulings of a TE/TensorIR program is called its
+design space. Optimization in TVM is essentially exploring this space to find a scheduling that
+transforms the IR into the kernel with optimal performance.
+
+### Problems with the current scheduling system
+
+Currently there are 3 sets of scheduling APIs in TVM:
+* **Manual schedule**: Developers optimize their programs by manually invoking
schedule primitives,
+ i.e. explore points in the design space with humans in the loop. This can be
a tedious and
+ error-prone approach, hence the creation of AutoTVM and AutoScheduler.
+* **AutoTVM**: The automation system requires users to define the design space through
+  per-operator "schedule templates". Therefore, programmer time is a bottleneck in scaling to
+  hundreds of operators across many hardware platforms.
+* **AutoScheduler**: It automatically generates schedule templates as the
design space,
+ according to a set of predefined "search rules". However, it is non-trivial
to extend
+ AutoScheduler to new schedule primitives (tensorize, loop partition,
software pipelining, etc).
+* The three systems above have isolated sets of APIs with several layers of
their own abstraction,
+ which are not only hard to learn, but also engineering-intensive to
customize.
+
+### Benefits of Meta Schedule
+
+The three existing scheduling systems are mutually incompatible in API design: besides manual TE
+scheduling, AutoTVM requires users to learn a new set of APIs, and AutoScheduler brings in another
+set of C++-based search rules. This divergence adds to users' mental overhead in understanding and
+extending the existing systems. Further, the inability to switch between template-based and
+template-free auto-tuning leads to inferior customizability and can make it needlessly difficult
+to achieve optimal performance.
+
+Meta schedule provides:
+* Succinct syntax and APIs consistent with TensorIR scheduling, with no extra layer of
+  abstraction.
+* Unified APIs for implementing manual schedules, AutoTVM-style schedules, and
AutoScheduler-style
+ schedules.
+* Extensibility of all the schedule primitives, including tensorization and
loop partitioning.
+ Almost no extra effort is needed to use a new primitive in auto-tuning.
+* An automation infrastructure that is extensible in every one of its components. Every component
+  of the system can be customized easily in pure Python, C++, or both. For example, one can
+  develop a new design space generator in Python, a new ProgramRunner in Python, etc.
+
+
+## 3. Guide-level explanation
+
+Meta Schedule DSL is a language that provides TVM backend developers
+a flexible way to define or auto-generate the operator design space.
+
+This section introduces the syntax of Meta Schedule DSL by describing the 5
common usage patterns
+envisioned by this RFC. These patterns are:
+1) Manually constructing a schedule using existing schedule primitives
(Section 3.1);
+2) Defining composite schedules to simplify the sequence of schedule primitives (Section 3.2);
+3) Describing a design space of possible schedules,
+a.k.a. AutoTVM-style schedule templates (Section 3.3);
+4) Automatically generating the design space, a.k.a. AutoScheduler-style
search rules (Section 3.4);
+5) Mixing the usage of manual schedule, AutoTVM and AutoScheduler-style design
space specification
+in Meta Schedule (Section 3.5).
+
+### 3.1. Manual Schedule
+
+Meta schedule APIs are almost the same as TE or TensorIR scheduling. Here is
an example of a manual
+schedule for matrix multiplication:
+
+```python
+# Designate a set of tile sizes
+i_tiles = [16, 8, 8, 8]
+j_tiles = [16, 8, 8, 8]
+k_tiles = [256, 8]
+
+# Tile the loops according to the tile sizes
+i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
+j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
+k_0, k_1 = sch.split(loop=k, factors=k_tiles)
+
+# Organize the loops into "SSRSRS" 6-level tiles
+sch.reorder(
+ i_0, j_0, # S: the 1st spatial tile
+ i_1, j_1, # S: the 2nd spatial tile
+ k_0, # R: the 1st reduction tile
+ i_2, j_2, # S: the 3rd spatial tile
+ k_1, # R: the 2nd reduction tile
+ i_3, j_3, # S: the 4th spatial tile
+)
+```
+
+In this manual scheduling example, developers tweak the tile sizes and measure the performance of
+the generated kernels to explore potential optimization opportunities.
+
+Generally speaking, while writing a schedule, there are often some parameters
that are hard to
+determine ahead of time, for example, tile sizes, unroll steps, or which
tensor intrinsics to use.
+Developers may manually enumerate possible combinations of these unknown
factors, and then pick the
+best schedule according to measurement results on their device.
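
This enumerate-and-measure loop can be sketched in plain Python. The sketch below is purely illustrative: `build_and_measure` is a hypothetical stand-in for compiling the scheduled kernel and timing it on the device.

```python
import itertools

def build_and_measure(i_tiles, j_tiles, k_tiles):
    # Hypothetical stand-in: in reality this would compile the scheduled
    # kernel and time it on the target device. Here it returns a fake cost.
    return abs(i_tiles[-1] - 8) + abs(j_tiles[-1] - 8) + abs(k_tiles[-1] - 8)

# Candidate tile sizes the developer wants to try by hand
candidates_i = [[16, 8, 8, 8], [8, 16, 8, 8]]
candidates_j = [[16, 8, 8, 8], [16, 16, 4, 4]]
candidates_k = [[256, 8], [128, 16]]

# Exhaustively measure every combination and keep the fastest one
best = min(
    itertools.product(candidates_i, candidates_j, candidates_k),
    key=lambda cfg: build_and_measure(*cfg),
)
```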
+
+### 3.2. Composite Schedule Rules
+
+As introduced in the previous section, in TensorIR, each schedule primitive
handles only a very
+basic transformation of the IR. For example, `split` only splits a loop into
two new loops. In the
+real world, the over-fine granularity of those primitives usually leads to
repetitive and verbose
+scheduling code. Take the code snippet in the previous section as an example: a sequence of
+`split`s is invoked, followed by a `reorder`. Taken together, these 4 primitives are colloquially
+known as "SSRSRS" tiling.
+
+To make it more convenient and modular, users are allowed to register
**composite schedules** that apply
+a sequence of schedule primitives according to certain analysis of the IR.
+The word **composite** here means the schedule transformation is *composed* of
those **primitives**.
+
+For example, suppose there is a composite schedule called `InlineElementwiseOperation`, which
+inlines elementwise computations into their consumers whenever possible. Applying it to the
+following TensorIR:
+
+```python
[email protected]
+def example_func(...):
+ for i, j in ...:
+ with tir.Block("B") ...:
+ B[i, j] = A[i, j] + 1
+ for i, j in ...:
+ with tir.Block("C") ...:
+ C[i, j] = B[i, j] + 1
+ for i, j in ...:
+ with tir.Block("D") ...:
+ D[i, j] = C[i, j] + 1
+
+sch = tir.Schedule(example_func)
+# `InlineElementwiseOperation` is a composite schedule rule that analyzes a given block.
+# If the block contains only elementwise computation and can be inlined into its consumer,
+# then `sch.compute_inline` is called on that block.
+inliner = InlineElementwiseOperation()
+inliner.apply(schedule=sch, block=sch.get_block("B"))
+inliner.apply(schedule=sch, block=sch.get_block("C"))
+inliner.apply(schedule=sch, block=sch.get_block("D"))
+```
+
+Below is the result after applying this composite schedule, and its
corresponding trace:
+
+```python
+
+>>> print(tvm.script.asscript(sch.mod))
[email protected]
+def example_func(...):
+ for i, j in ...:
+ with tir.Block("D") ...:
+ D[i, j] = A[i, j] + 1 + 1 + 1
+
+>>> print(sch.trace)
+# Block "B" is elementwise and inlinable, then `sch.compute_inline(B)` is called
+B = sch.get_block("B")
+sch.compute_inline(B)
+# Block "C" is elementwise and inlinable, then `sch.compute_inline(C)` is called
+C = sch.get_block("C")
+sch.compute_inline(C)
+# Block "D" is elementwise but does not have a consumer,
+# so the rule does not call `compute_inline` because it is not inlinable
+D = sch.get_block("D")
+```
+
+### 3.3. AutoTVM-style Design Space Description
+
+Meta schedule extends the schedule DSL with a set of new schedule primitives
+called **sampling instructions**. These primitives do not transform the
TensorIR,
+but instead introduce random variables which can be referenced later in scheduling
+to parameterize the schedule. Incorporating **sampling instructions** into an operator's schedule
+allows the backend developers to succinctly describe a design space in terms of
+tiling strategies, fusion levels, unroll lengths, etc.
+
+The matmul example above is extended to cover all possible tilings using these
sampling
+instructions:
+
+```python
+# Sample tile sizes
+i_tiles = sch.sample_perfect_tile(i, n=4) # was: i_tiles = [16, 8, 8, 8]
+j_tiles = sch.sample_perfect_tile(j, n=4) # was: j_tiles = [16, 8, 8, 8]
+k_tiles = sch.sample_perfect_tile(k, n=2) # was: k_tiles = [256, 8]
+# Tile the loops according to the random variables
+i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
+j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
+k_0, k_1 = sch.split(loop=k, factors=k_tiles)
+# Organize the loops into "SSRSRS" 6-level tiles
+sch.reorder(
+ i_0, j_0, # S: the 1st spatial tile
+ i_1, j_1, # S: the 2nd spatial tile
+ k_0, # R: the 1st reduction tile
+ i_2, j_2, # S: the 3rd spatial tile
+ k_1, # R: the 2nd reduction tile
+ i_3, j_3, # S: the 4th spatial tile
+)
+```
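
For intuition, here is a pure-Python sketch of what a primitive like `sample_perfect_tile` could do: enumerate every n-way factorization whose product equals the loop extent (a "perfect" tiling), then sample one point uniformly. The function and its behavior are an illustration under stated assumptions, not the actual TVM implementation.

```python
import random

def perfect_tiles(extent, n):
    """All n-tuples of positive factors whose product is exactly `extent`."""
    if n == 1:
        return [[extent]]
    result = []
    for f in range(1, extent + 1):
        if extent % f == 0:
            result.extend([f] + rest for rest in perfect_tiles(extent // f, n - 1))
    return result

space = perfect_tiles(64, n=2)   # e.g. [[1, 64], [2, 32], [4, 16], [8, 8], ...]
one_point = random.choice(perfect_tiles(1024, n=4))  # one sampled tiling
```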
+
+### 3.4. AutoScheduler-style Design Space Generation
+
+To generate design space, AutoScheduler applies a set of rules to each
+[TE stage](https://tvm.apache.org/docs/api/python/te.html#tvm.te.Stage) that corresponds to a
+[TE operation](https://tvm.apache.org/docs/api/doxygen/classtvm_1_1te_1_1Operation.html),
+defined by [`te.compute(...)`](https://tvm.apache.org/docs/api/python/te.html#tvm.te.compute).
+The rules analyze the TE operations and apply an internal DSL to manipulate their internal IR,
+which is in the end mapped to TE schedule primitives. This process is called *sketch generation*.
+
+Composite schedule rules work in a similar way on TensorIR, as introduced in Section 3.2:
+they analyze the TensorIR and apply schedule primitives to it directly.
+When such rules are applied to each TensorIR block in a certain order (post-DFS is provided as the
+builtin order, but customization is allowed),
+they generate a sequence of schedule primitives.
+This process corresponds to the *sketch generation* phase in AutoScheduler.
+If sampling instructions are present in this sequence,
+then a design space is defined by those instructions for the meta schedule to
explore.
+This process is similar to the *random annotation* phase in AutoScheduler.
+
+Several built-in composite schedule rules are shipped with meta schedule to
align with the design
+space generated by AutoScheduler:
+
+* Multi-level tiling
+* Inline pure spatial blocks
+* Parallelize & vectorize & unroll
+* Auto tensorize
+
+Developers may implement their own rules in either Python or C++. They may
specify which rules to
+use with the following syntax:
+
+```python
+from tvm import meta_schedule as ms
+
+design_space_generator = ms.PostOrderApply(rules=[
+ ms.MultiLevelTiling(...),
+ CustomRule(...),
+ ms.OtherBuiltinRules(...),
+])
+
+```
+
+### 3.5. Unifying manual schedule / AutoTVM / AutoScheduler
+
+This subsection shows that the design spaces induced by TE manual scheduling, AutoTVM and
+AutoScheduler are all subsets of meta schedule's, and that meta schedule further allows mixing the
+three styles to search jointly.
+
+- **Manual schedule**. A TE schedule is a special case of a meta schedule program in which no
+randomness is introduced by sampling instructions; it is a single point in the design space.
+- **AutoTVM (Template-based tuning)**. AutoTVM’s schedule templates (knobs) are represented more
+naturally as one or more meta schedule functions with sampling instructions. The probability space
+supported by the sampling instructions is the design space to be explored.
+- **AutoScheduler (Template-free tuning)**. As mentioned in the previous
section, application
+ of composite schedule rules generates the design space, which is equivalent
to AutoScheduler’s
+ sketch generation.
+- **Mixing styles in design space definition**. By taking the union of the spaces induced by the
+  three special cases, meta schedule allows developers to combine the generic rules that
+  AutoScheduler provides with operator-specific scheduling.
+
+## 4. Reference-level explanation
+
+This section introduces the underlying techniques for the automation system to
extract and
+explore the design space. The figure below briefly illustrates the workflow of
the system:
+
+![Meta schedule workflow](../resources/meta-schedule-workflow.png)
+
+### 4.1. Execution trace as the design space
+
+**Trace**. To represent the design space defined by the meta schedule DSL, the
underlying system
+records all the instructions users applied to the schedule class, including
sampling and schedule
+primitives. The list of invoked scheduling instructions, along with the random decisions made on
+sampling instructions, is called a trace.
+
+For instance, executing the example above results in the following trace:
+
+```
+Instruction 0. Sample-Perfect-Tile
+Instruction 1. Sample-Perfect-Tile
+Instruction 2. Sample-Perfect-Tile
+Instruction 3. Split
+Instruction 4. Split
+Instruction 5. Split
+Instruction 6. Reorder
+```
+
+**Trace forms design space.** A trace may contain zero or more sampling instructions, which
+introduce uncertainty into scheduling: one instance of sampling results in one point in the design
+space. Therefore, the trace itself forms a design space to explore, e.g. which set of tile sizes
+works best on a specific hardware target.
+
+**Union of design spaces**. Meta schedule works on a set of traces, representing the union of
+the design spaces that the individual traces define.
+
+**Fork a trace**. When two different decisions in the scheduling process are equally important
+for generating high-performance schedules, the trace may be forked into two, and the design space
+is the union of the forked traces.
+
+The trace is not strictly user-facing, but can be accessed and printed with
the following syntax:
+
+```python
+# Require the schedule to trace its execution
+sch = tir.Schedule(..., traced=True)
+# do a lot of scheduling
+...
+# print the trace
+print(sch.trace)
+```
+
+Below is an example of the printed trace, which faithfully reflects the schedule as a snippet of
+a Python scheduling function:
+
+```python
+b0 = sch.get_block(name="matmul", func_name="main")
+l1, l2, l3 = sch.get_loops(block=b0)
+v4, v5, v6, v7 = sch.sample_perfect_tile(loop=l1, n=4, max_innermost_factor=16, decision=[32, 1, 16, 2])
+v8, v9, v10, v11 = sch.sample_perfect_tile(loop=l2, n=4, max_innermost_factor=16, decision=[64, 4, 2, 2])
+v12, v13 = sch.sample_perfect_tile(loop=l3, n=2, max_innermost_factor=16, decision=[64, 16])
+l14, l15, l16, l17 = sch.split(loop=l1, factors=[v4, v5, v6, v7])
+l18, l19, l20, l21 = sch.split(loop=l2, factors=[v8, v9, v10, v11])
+l22, l23 = sch.split(loop=l3, factors=[v12, v13])
+sch.reorder(l14, l18, l15, l19, l22, l16, l20, l23, l17, l21)
+```
+
+**Implementation.** A trace is defined as:
+
+```python
+class Trace:
+ instructions: List[Instruction]
+ decisions: Dict[Instruction, Any]
+```
+
+For each sampling instruction in the trace, if it has a corresponding entry in
the decisions dict,
+then the output is uniquely determined by the decision, hence reproducibility
is guaranteed
+(Example 1); if no corresponding entry is present, then randomness is introduced when the trace is
+interpreted (Example 2).
+
+```python
+# Example 1. Trace with deterministic result
+l1, l2 = sch.sample_perfect_tile(loop, n=2, decision=[4, 32])  # deterministic: l1 = 4, l2 = 32
+# Example 2. Trace with randomized result
+l1, l2 = sch.sample_perfect_tile(loop, n=2) # l1 and l2 are random
+```
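
A minimal sketch of the replay logic behind these two examples (the class and function names are hypothetical, not the actual TVM classes): when a sampling instruction has a recorded decision, the decision is reused; otherwise the instruction samples afresh.

```python
import random

class SamplingInstruction:
    def __init__(self, name, sample_fn):
        self.name = name
        self.sample_fn = sample_fn  # draws a fresh random decision

def replay(instructions, decisions):
    """Re-execute a trace; recorded decisions override fresh sampling."""
    outputs = {}
    for inst in instructions:
        if inst in decisions:
            outputs[inst.name] = decisions[inst]   # deterministic replay
        else:
            outputs[inst.name] = inst.sample_fn()  # randomized replay
    return outputs

tile = SamplingInstruction("tile", lambda: random.choice([[4, 32], [8, 16]]))
deterministic = replay([tile], decisions={tile: [4, 32]})  # always [4, 32]
randomized = replay([tile], decisions={})                  # [4, 32] or [8, 16]
```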
+
+### 4.2. Exploring the Design Space
+
+Meta Schedule provides several built-in exploration strategies to exhaustively or efficiently
+search for efficient schedules. Most strategies work by re-executing either a function or a trace
+with a builtin interpreter in meta schedule; this process is called **replay**.
+
+#### Random search by replaying schedule functions
+
+With a user-provided schedule function as a black-box design space generator, meta schedule
+repetitively invokes this opaque TVM packed function without doing any extra analysis.
+If sampling instructions are present, then scheduling is non-deterministic (random decisions may
+not be repeated across runs). Effectively, this is equivalent to random exploration without a
+trace. It gives users the flexibility to define arbitrary functions that traces may not support
+well (e.g. control flow diverging on the value of intermediate random variables), but it
+forecloses any future trace-based analysis.
+
+#### Random search by replaying traces
+
+A builtin interpreter directly replays the traces obtained
+from manual schedule, template-based or template-free design space generation.
+If sampling instructions are present in the traces,
+then their random decisions are mutated during each replay, i.e. each replay jumps to a new point
+in the design space. Therefore, repetitive replay of those traces is equivalent to exploring the
+design space.
+
+The search speed of meta schedule could be improved by allowing traces to be
analyzed before they
+are run. For example, trace analysis could reject obviously invalid schedules (e.g. ones using
+too many CUDA resources) and remove dead code before the traces are run. The cost model could
+benefit from the traces as well by extracting trace-level features.
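
As an illustration of such trace-level analysis, the sketch below removes dead instructions from a toy trace encoding; the `(name, inputs, outputs)` tuples are a made-up representation, not the real trace format.

```python
def remove_dead_instructions(trace):
    """Drop instructions whose outputs are never consumed later on."""
    live = set()
    kept = []
    for name, inputs, outputs in reversed(trace):
        # Keep instructions that mutate the IR (no outputs) or whose
        # outputs feed a later live instruction.
        if not outputs or any(o in live for o in outputs):
            kept.append((name, inputs, outputs))
            live.update(inputs)
    return list(reversed(kept))

trace = [
    ("sample_perfect_tile", [], ["v0", "v1"]),
    ("sample_perfect_tile", [], ["v2", "v3"]),  # outputs unused: dead code
    ("split", ["v0", "v1"], []),                # transforms the IR directly
]
cleaned = remove_dead_instructions(trace)       # the dead sample is dropped
```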
+
+#### Cost-model-guided evolutionary search
+
+This is a more efficient exploration strategy, introduced by AutoScheduler.
+For more details, please refer to Section 5.1 of its paper [2].
+
+The evolutionary search strategy requires two sets of rule-specific logic to validate or tweak
+the produced schedules:
+
+* Mutator: defines how to jump to a point’s "neighbor" in the design space
+* Postprocessor: after the trace is executed, there are some extra rules to
execute, for
+ example:
+  * Check CUDA resource limits: CUDA requires that the number of threads per block not exceed
+    1024, but the thread count is a random variable that cannot be determined before actually
+    executing the trace. In this case, a postprocessor is implemented to error out when such
+    conditions are not satisfied.
+ * Fuse outer loops until the extent of the fused loops is large enough: The
number of outer loops
+ to be fused together depends on their extents, which are random variables.
In this case, the
+ composite schedule rule annotates the maximum extent allowed on the block,
and a corresponding
+ postprocessor detects the annotation and does the actual fusion.
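
The CUDA resource check above boils down to a simple predicate over the sampled thread extents, which only becomes checkable after the trace is executed. A sketch (the function name is hypothetical):

```python
def check_cuda_thread_limit(thread_extents, limit=1024):
    """Postprocessor sketch: reject schedules whose sampled thread
    extents multiply beyond the CUDA per-block thread limit."""
    total = 1
    for extent in thread_extents:
        total *= extent
    return total <= limit

ok = check_cuda_thread_limit([32, 16])   # 512 threads: accepted
bad = check_cuda_thread_limit([32, 64])  # 2048 threads: rejected
```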
+
+The evolutionary search algorithm uses mutators to find possible schedules in the design space,
+then applies postprocessors and asks the cost model to predict their performance. After a few
+iterations, the new schedules with the highest scores are finally compiled and
measured on device.
+Epsilon-greedy is used in this process to balance exploitation and exploration.
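
The epsilon-greedy step can be sketched as follows; `predicted_cost` stands in for the cost model, and the candidate encoding is arbitrary:

```python
import random

def epsilon_greedy_pick(candidates, predicted_cost, epsilon=0.1, rng=random):
    """With probability epsilon explore a random candidate; otherwise
    exploit the candidate the cost model predicts to be fastest."""
    if rng.random() < epsilon:
        return rng.choice(candidates)           # exploration
    return min(candidates, key=predicted_cost)  # exploitation

costs = {"sched_a": 3.0, "sched_b": 1.0, "sched_c": 2.0}
greedy = epsilon_greedy_pick(list(costs), costs.get, epsilon=0.0)
```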
+
+### 4.3. Database
+
+All the measure records are serialized and stored in a database. The database schema contains
+the following information:
+
+- The workload, a serialized TensorIR;
+- The hardware target where the measure is conducted;
+- Argument type: the shapes and dtypes of the input tensors fed to the
measured PrimFunc.
+This field can be useful for future dynamic-shape workloads;
+- The trace, including the schedule primitives used and their corresponding
decisions (if any);
+- The measured running time;
+- The version of the log.
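
Such a record might be serialized as one JSON object per line; the dataclass below merely mirrors the schema, and its field names and values are illustrative rather than the actual storage format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TuningRecord:
    workload: str    # serialized TensorIR
    target: str      # hardware target of the measurement
    arg_info: list   # shapes and dtypes of the input tensors
    trace: list      # schedule primitives plus their decisions
    run_secs: list   # measured running times
    version: str     # version of the log

record = TuningRecord(
    workload="matmul-tensorir", target="llvm",
    arg_info=[["float32", 1024, 1024]],
    trace=[["sample_perfect_tile", [32, 1, 16, 2]]],
    run_secs=[0.0012], version="0.1",
)
line = json.dumps(asdict(record))             # one line in the database
restored = TuningRecord(**json.loads(line))   # round-trips losslessly
```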
+
+### 4.4. Python-first for flexibility & customizability
+
+The system is implemented in a way that all levels are decoupled and open to
customization, aiming
+at providing a playground for developers to try out new ideas and potentially
deliver performance
+quickly.
+
+While all the important APIs are implemented in C++ for efficiency, every part of the system can
+be easily switched to a customized Python implementation. For example:
+
+#### Customize design space in python
+
+The design space can be defined as a Python function:
+
+```python
+def schedule_matmul(sch: Schedule) -> Schedule:
+ i, j, k = sch.get_loops(sch.get_block("matmul"))
+ i_tiles = sch.sample_perfect_tile(i, n=4)
+ j_tiles = sch.sample_perfect_tile(j, n=4)
+ k_tiles = sch.sample_perfect_tile(k, n=2)
+ # Tile the loops according to the random variables
+ i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
+ j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
+ k_0, k_1 = sch.split(loop=k, factors=k_tiles)
+ # Organize the loops into "SSRSRS" 6-level tiles
+ sch.reorder(
+ i_0, j_0, # S
+ i_1, j_1, # S
+ k_0, # R
+ i_2, j_2, # S
+ k_1, # R
+ i_3, j_3, # S
+ )
+ return sch
+```
+
+#### Customize composite schedule in python
+
+The system provides two interfaces to define a composite schedule in Python: one is more
+succinct, and the other is more comprehensive.
+
+Method 1. Derive from `PyCompositeSchedule`, and implement two methods
`initialize` and `apply`:
+
+```python
+class MultiLevelTiling(PyCompositeSchedule):
+    def initialize(self, ...):
+        # Initialize the rule; usually this method is empty
+        ...
+
+    def apply(self, sch: Schedule, block: BlockRV) -> Union[Schedule, List[Schedule]]:
+        # Write any python code, including:
+        # - analyze `block`
+        # - invoke schedule primitives in `sch`
+        # - do debug printing
+        ...
+```
+
+Method 2. Use a decorator as syntactic sugar when the `initialize` method is empty; it converts
+the decorated function into the `apply` method.
+
+```python
[email protected]_composite_schedule(name="multi-level-tiling")
+def multi_level_tiling(sch: Schedule, block: BlockRV) -> Union[Schedule, List[Schedule]]:
+ # write any python code, including:
+ # - analyze `block`
+ # - invoke schedule primitives in `sch`
+ # - do debug printing
+ ...
+```
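
The decorator sugar itself is straightforward to emulate in plain Python; this sketch (not the actual TVM implementation) wraps the function into an object whose `initialize` is empty and whose `apply` delegates to the function:

```python
def as_composite_schedule(name):
    """Sketch: turn a plain function into a composite-schedule object."""
    def wrap(func):
        class _Rule:
            rule_name = name
            def initialize(self):
                pass  # empty by construction
            def apply(self, sch, block):
                return func(sch, block)  # delegate to the decorated function
        return _Rule()
    return wrap

@as_composite_schedule(name="multi-level-tiling")
def multi_level_tiling(sch, block):
    return sch  # placeholder body

result = multi_level_tiling.apply("sch", "block")
```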
+
+#### Customize exploration strategies in python
+
+Developers can also implement any search algorithm in Python by deriving from `PySearchPolicy`;
+the syntax is identical to customizing with `PyCompositeSchedule`.
+
+#### Other customizable components
+
+This list includes:
+
+* Cost model
+* Database
+* Measure callbacks
+* Feature extractor
+* Program builder & runner
+* Analysis methods
+* ...
+
+In short, almost every component of the system is decoupled from the others, and extensions can
+be easily plugged in.
+
+## 5. Drawbacks
+
+Meta schedule requires integration with Relay operator strategy and compile
engine (TECompiler) to
+properly lower a Relay subgraph for task extraction. Further, TE schedules in
TOPI will need to be
+migrated to TensorIR schedules.
+
+## 6. Rationale and alternatives
+
+The system is designed with the principle of minimalism: assuming users already know the TensorIR
+scheduling APIs, there is no extra API set to learn; previous programs that schedule TensorIR
+still work out of the box with the meta schedule DSL. This could lower the bar to adopting the
+system.
+
+Unifying manual scheduling, AutoTVM's semi-automatic templates and AutoScheduler's fully
+automatic sketch generation provides a flexible way to balance injecting new domain knowledge and
+automation.
+
+Flexibility in customization allows quick experiments with new tasks, new strategies and new
+hardware targets without deep knowledge of the system.
+
+## 7. Prior art
+
+**Tensor Expression (TE)** in TVM is a DSL that decouples compute and
schedule, which provides
+convenient ways to handcraft optimized kernels for different hardware targets.
+
+**TensorIR** is the low-level IR in TVM. Its capability of eagerly applying
schedule primitives
+opens the door for meta schedule, the proposed next-generation auto scheduling
system.
+
+**AutoTVM** is the 1st generation automation framework in TVM; it requires developers to
+implement per-operator scheduling templates, while the system handles the tuning process.
+
+**AutoScheduler** is the 2nd generation automation framework in TVM, whose
built-in rules
+could automatically generate schedule templates for almost all the operators
on CPU, GPU, etc.
+
+## 8. Unresolved questions
+
+**Control Flow**
+
+The meta schedule DSL does not support control flow yet. Although no real-world use case has been
+reported at the time of writing, control flow could appear in future workloads. The best syntax
+for it is not yet determined; a working example could be TensorFlow's `tf.cond`.
+
+**Assertion**
+
+Sampling instructions may lead to invalid schedules on CUDA, e.g. the resulting program may use
+too much shared memory, too many threads, etc. Such cases are detected by a postprocessor. To
+accelerate the process,
+it is possible to introduce an assertion statement that exits early if the GPU
code is not valid,
+and its syntax can be, for example:
+
+```python
+sch.assert(j_2 * j_2 <= 1024)
+```
+
+## 9. Future possibilities
+
+**Unifying Manual Scheduling, AutoTVM and AutoScheduler in TOPI**
+
+As described in Section 3.5, meta schedule provides an idiomatic approach to
unify the three
+existing scheduling APIs in TVM:
+
+* Manual schedules are meta schedules without sampling instructions;
+* AutoTVM templates are meta schedules where knobs are replaced by sampling
instructions;
+* Each of AutoScheduler’s search rules generates a snippet of a meta schedule;
+* A mixture of the three approaches above covers a larger design space.
+
+At the time of writing, TOPI contains a number of schedule functions
implemented either in manual TE
+or AutoTVM-style. It is our future work to unify these existing scheduling
APIs on TOPI operators,
+and enable different styles to be auto-tuned jointly.
+
+[1] Chen, Tianqi, et al. "Learning to Optimize Tensor Programs." Advances in Neural Information
+Processing Systems 31 (2018): 3389-3400.
+
+[2] Zheng, Lianmin, et al. "Ansor: Generating High-Performance Tensor Programs for Deep Learning."
+14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020.