This is an automated email from the ASF dual-hosted git repository.
tqchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-rfcs.git
The following commit(s) were added to refs/heads/main by this push:
new 4e26cc6 [RFC] Meta Schedule (AutoTIR) (#5)
4e26cc6 is described below
commit 4e26cc66a79775dc7cbd570d31f998dfdaf48285
Author: Junru Shao <[email protected]>
AuthorDate: Mon Jul 26 17:19:19 2021 -0700
[RFC] Meta Schedule (AutoTIR) (#5)
Co-authored-by: Siyuan Feng <[email protected]>
Co-authored-by: Bohan Hou
<[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Hongyi Jin <[email protected]>
Co-authored-by: Wuwei Lin <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Tristan Konolige <[email protected]>
Co-authored-by: Andrew Reusch <[email protected]>
---
resources/meta-schedule-workflow.png | Bin 0 -> 42034 bytes
rfcs/0001-meta-schedule-autotensorir.md | 604 ++++++++++++++++++++++++++++++++
2 files changed, 604 insertions(+)
diff --git a/resources/meta-schedule-workflow.png
b/resources/meta-schedule-workflow.png
new file mode 100644
index 0000000..300b9eb
Binary files /dev/null and b/resources/meta-schedule-workflow.png differ
diff --git a/rfcs/0001-meta-schedule-autotensorir.md
b/rfcs/0001-meta-schedule-autotensorir.md
new file mode 100644
index 0000000..79c192d
--- /dev/null
+++ b/rfcs/0001-meta-schedule-autotensorir.md
@@ -0,0 +1,604 @@
+<!--- Licensed to the Apache Software Foundation (ASF) under one -->
+<!--- or more contributor license agreements. See the NOTICE file -->
+<!--- distributed with this work for additional information -->
+<!--- regarding copyright ownership. The ASF licenses this file -->
+<!--- to you under the Apache License, Version 2.0 (the -->
+<!--- "License"); you may not use this file except in compliance -->
+<!--- with the License. You may obtain a copy of the License at -->
+
+<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
+
+<!--- Unless required by applicable law or agreed to in writing, -->
+<!--- software distributed under the License is distributed on an -->
+<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
+<!--- KIND, either express or implied. See the License for the -->
+<!--- specific language governing permissions and limitations -->
+<!--- under the License. -->
+
+* Feature Name: Meta Schedule (Formerly AutoTIR)
+* Start Date: 2021-05-28
+* RFC PR: https://github.com/apache/tvm-rfcs/pull/5/
+* GitHub Issue: https://github.com/apache/tvm/issues/8473
+
+## 1. Summary
+
+This proposal introduces Meta Schedule: a scheduling DSL on TIR that unifies
the
+approaches of AutoTVM [1] and AutoScheduler [2]. Meta schedule provides a
pragmatic way to define
+the space of automatic tuning, extensibility in terms of all possible TIR
schedule primitives like
+tensorization and loop partitioning, and customizability on every layer of the
automation system.
+
+Meta Schedule is the 3rd generation automatic scheduling system.
+
+## 2. Motivation
+
+**Scheduling.** In TVM,
+both TensorIR and TE allow direct or indirect IR transformation guided by a
+developer-provided program, for example,
+specifying a particular reordering of loops for better locality,
+or tensorizing a compute region with specific hardware intrinsics.
+The process of invoking such a set of pre-defined transformations is called
"**scheduling**",
+and each of such transformations is called a "**schedule primitive**".
+These primitives form a domain-specific language (DSL).
+
+**Design space.** The set of all possible schedulings of a TE/TensorIR program is called its
+design space. Optimization in TVM is essentially exploring this space to find a scheduling that
+transforms the IR into the kernel with optimal performance.
+
+### Problems with the current scheduling system
+
+Currently there are 3 sets of scheduling APIs in TVM:
+* **Manual schedule**: Developers optimize their programs by manually invoking
schedule primitives,
+ i.e. explore points in the design space with humans in the loop. This can be
a tedious and
+ error-prone approach, hence the creation of AutoTVM and AutoScheduler.
+* **AutoTVM**: The automation system requires users to define the design space through
+  per-operator "schedule templates". Therefore, programmer time is a bottleneck in scaling to
+  hundreds of operators across many hardware platforms.
+* **AutoScheduler**: It automatically generates schedule templates as the
design space,
+ according to a set of predefined "search rules". However, it is non-trivial
to extend
+ AutoScheduler to new schedule primitives (tensorize, loop partition,
software pipelining, etc).
+* The three systems above have isolated sets of APIs with several layers of
their own abstraction,
+ which are not only hard to learn, but also engineering-intensive to
customize.
+
+### Benefits of Meta Schedule
+
+The three existing scheduling systems are mutually incompatible in API design: besides manual TE
+scheduling, AutoTVM requires users to learn a new set of APIs, and AutoScheduler brings in another
+set of C++-based search rules. This divergence adds to users' mental overhead in understanding and
+extending the existing systems. Further, the inability to switch between template-based and
+template-free auto-tuning leads to inferior customizability and can make it needlessly difficult
+to achieve optimal performance.
+
+Meta schedule provides:
+* Succinct syntax and APIs consistent with TensorIR scheduling, with no extra layer of
+  abstraction.
+* Unified APIs for implementing manual schedules, AutoTVM-style schedules, and
AutoScheduler-style
+ schedules.
+* Extensibility of all the schedule primitives, including tensorization and
loop partitioning.
+ Almost no extra effort is needed to use a new primitive in auto-tuning.
+* An automation infrastructure that is extensible in every one of its components. Every component
+  of the system can be customized easily in pure Python, C++, or both. For example, one can
+  develop a new design space generator in Python, a new ProgramRunner in Python, etc.
+
+
+## 3. Guide-level explanation
+
+Meta Schedule DSL is a language that provides TVM backend developers
+a flexible way to define or auto-generate the operator design space.
+
+This section introduces the syntax of Meta Schedule DSL by describing the 5
common usage patterns
+envisioned by this RFC. These patterns are:
+1) Manually constructing a schedule using existing schedule primitives
(Section 3.1);
+2) Defining composite schedules to simplify the sequence of schedule primitives (Section 3.2);
+3) Describing a design space of possible schedules,
+a.k.a. AutoTVM-style schedule templates (Section 3.3);
+4) Automatically generating the design space, a.k.a. AutoScheduler-style
search rules (Section 3.4);
+5) Mixing the usage of manual schedule, AutoTVM and AutoScheduler-style design
space specification
+in Meta Schedule (Section 3.5).
+
+### 3.1. Manual Schedule
+
+Meta schedule APIs are almost the same as TE or TensorIR scheduling. Here is
an example of a manual
+schedule for matrix multiplication:
+
+```python
+# Designate a set of tile sizes
+i_tiles = [16, 8, 8, 8]
+j_tiles = [16, 8, 8, 8]
+k_tiles = [256, 8]
+
+# Tile the loops according to the tile sizes
+i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
+j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
+k_0, k_1 = sch.split(loop=k, factors=k_tiles)
+
+# Organize the loops into "SSRSRS" 6-level tiles
+sch.reorder(
+ i_0, j_0, # S: the 1st spatial tile
+ i_1, j_1, # S: the 2nd spatial tile
+ k_0, # R: the 1st reduction tile
+ i_2, j_2, # S: the 3rd spatial tile
+ k_1, # R: the 2nd reduction tile
+ i_3, j_3, # S: the 4th spatial tile
+)
+```
+
+In this manual scheduling example, developers tweak the tile sizes and measure the performance of
+the generated kernels to explore potential optimization opportunities.
+
+Generally speaking, while writing a schedule, there are often some parameters
that are hard to
+determine ahead of time, for example, tile sizes, unroll steps, or which
tensor intrinsics to use.
+Developers may manually enumerate possible combinations of these unknown
factors, and then pick the
+best schedule according to measurement results on their device.
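
This enumerate-and-measure loop can be sketched in plain Python. The sketch below is purely illustrative: `build_and_measure` is a hypothetical stand-in for compiling the scheduled kernel and timing it on the device.

```python
import itertools

def build_and_measure(i_tiles, j_tiles, k_tiles):
    # Hypothetical stand-in: in reality this would compile the scheduled
    # kernel and time it on the target device. Here it returns a fake cost.
    return abs(i_tiles[-1] - 8) + abs(j_tiles[-1] - 8) + abs(k_tiles[-1] - 8)

# Candidate tile sizes the developer wants to try by hand
candidates_i = [[16, 8, 8, 8], [8, 16, 8, 8]]
candidates_j = [[16, 8, 8, 8], [16, 16, 4, 4]]
candidates_k = [[256, 8], [128, 16]]

# Exhaustively measure every combination and keep the fastest one
best = min(
    itertools.product(candidates_i, candidates_j, candidates_k),
    key=lambda cfg: build_and_measure(*cfg),
)
```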
+
+### 3.2. Composite Schedule Rules
+
+As introduced in the previous section, in TensorIR, each schedule primitive
handles only a very
+basic transformation of the IR. For example, `split` only splits a loop into
two new loops. In the
+real world, the over-fine granularity of those primitives usually leads to
repetitive and verbose
+scheduling code. Take the code snippet in the previous section as an example: a sequence of
+`split`s is invoked, followed by a `reorder`. Taken together, these 4 primitives are colloquially
+known as "SSRSRS" tiling.
+
+To make it more convenient and modular, users are allowed to register
**composite schedules** that apply
+a sequence of schedule primitives according to certain analysis of the IR.
+The word **composite** here means the schedule transformation is *composed* of
those **primitives**.
+
+For example, suppose there is a composite schedule called `InlineElementwiseOperation`, which
+inlines elementwise computations into their consumers whenever possible. Applying it to the
+following TensorIR:
+
+```python
[email protected]
+def example_func(...):
+ for i, j in ...:
+ with tir.Block("B") ...:
+ B[i, j] = A[i, j] + 1
+ for i, j in ...:
+ with tir.Block("C") ...:
+ C[i, j] = B[i, j] + 1
+ for i, j in ...:
+ with tir.Block("D") ...:
+ D[i, j] = C[i, j] + 1
+
+sch = tir.Schedule(example_func)
+# `InlineElementwiseOperation` is a composite schedule rule that analyzes a given block.
+# If the block contains only elementwise computation and can be inlined into its consumer,
+# then `sch.compute_inline` is called on that block.
+inliner = InlineElementwiseOperation()
+inliner.apply(schedule=sch, block=sch.get_block("B"))
+inliner.apply(schedule=sch, block=sch.get_block("C"))
+inliner.apply(schedule=sch, block=sch.get_block("D"))
+```
+
+Below is the result after applying this composite schedule, and its
corresponding trace:
+
+```python
+
+>>> print(tvm.script.asscript(sch.mod))
[email protected]
+def example_func(...):
+ for i, j in ...:
+ with tir.Block("D") ...:
+ D[i, j] = A[i, j] + 1 + 1 + 1
+
+>>> print(sch.trace)
+# Block "B" is elementwise and inlinable, then `sch.compute_inline(B)` is called
+B = sch.get_block("B")
+sch.compute_inline(B)
+# Block "C" is elementwise and inlinable, then `sch.compute_inline(C)` is called
+C = sch.get_block("C")
+sch.compute_inline(C)
+# Block "D" is elementwise but does not have a consumer,
+# so the rule does not call `compute_inline` because it is not inlinable
+D = sch.get_block("D")
+```
+
+### 3.3. AutoTVM-style Design Space Description
+
+Meta schedule extends the schedule DSL with a set of new schedule primitives
+called **sampling instructions**. These primitives do not transform the
TensorIR,
+but instead introduce random variables which can be referenced later in scheduling
+to parameterize the schedule. Incorporating **sampling instructions** into an operator's schedule
+allows the backend developers to succinctly describe a design space in terms of
+tiling strategies, fusion levels, unroll lengths, etc.
+
+The matmul example above is extended to cover all possible tilings using these
sampling
+instructions:
+
+```python
+# Sample tile sizes
+i_tiles = sch.sample_perfect_tile(i, n=4) # was: i_tiles = [16, 8, 8, 8]
+j_tiles = sch.sample_perfect_tile(j, n=4) # was: j_tiles = [16, 8, 8, 8]
+k_tiles = sch.sample_perfect_tile(k, n=2) # was: k_tiles = [256, 8]
+# Tile the loops according to the random variables
+i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
+j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
+k_0, k_1 = sch.split(loop=k, factors=k_tiles)
+# Organize the loops into "SSRSRS" 6-level tiles
+sch.reorder(
+ i_0, j_0, # S: the 1st spatial tile
+ i_1, j_1, # S: the 2nd spatial tile
+ k_0, # R: the 1st reduction tile
+ i_2, j_2, # S: the 3rd spatial tile
+ k_1, # R: the 2nd reduction tile
+ i_3, j_3, # S: the 4th spatial tile
+)
+```
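
For intuition, here is a pure-Python sketch of what a primitive like `sample_perfect_tile` could do: enumerate every n-way factorization whose product equals the loop extent (a "perfect" tiling), then sample one point uniformly. The function and its behavior are an illustration under stated assumptions, not the actual TVM implementation.

```python
import random

def perfect_tiles(extent, n):
    """All n-tuples of positive factors whose product is exactly `extent`."""
    if n == 1:
        return [[extent]]
    result = []
    for f in range(1, extent + 1):
        if extent % f == 0:
            result.extend([f] + rest for rest in perfect_tiles(extent // f, n - 1))
    return result

space = perfect_tiles(64, n=2)   # e.g. [[1, 64], [2, 32], [4, 16], [8, 8], ...]
one_point = random.choice(perfect_tiles(1024, n=4))  # one sampled tiling
```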
+
+### 3.4. AutoScheduler-style Design Space Generation
+
+To generate design space, AutoScheduler applies a set of rules to each
+[TE stage](https://tvm.apache.org/docs/api/python/te.html#tvm.te.Stage) that corresponds to a
+[TE operation](https://tvm.apache.org/docs/api/doxygen/classtvm_1_1te_1_1Operation.html),
+defined by [`te.compute(...)`](https://tvm.apache.org/docs/api/python/te.html#tvm.te.compute).
+The rules analyze the TE operations and apply an internal DSL to manipulate their internal IR,
+which is in the end mapped to TE schedule primitives. This process is called *sketch generation*.
+
+Composite schedule rules work in a similar way on TensorIR, as introduced in Section 3.2:
+they analyze the TensorIR and apply schedule primitives to it directly.
+When such rules are applied to each TensorIR block in a certain order (post-DFS is provided as the
+builtin order, but customization is allowed),
+they generate a sequence of schedule primitives.
+This process corresponds to the *sketch generation* phase in AutoScheduler.
+If sampling instructions are present in this sequence,
+then a design space is defined by those instructions for the meta schedule to
explore.
+This process is similar to the *random annotation* phase in AutoScheduler.
+
+Several built-in composite schedule rules are shipped with meta schedule to
align with the design
+space generated by AutoScheduler:
+
+* Multi-level tiling
+* Inline pure spatial blocks
+* Parallelize & vectorize & unroll
+* Auto tensorize
+
+Developers may implement their own rules in either Python or C++. They may
specify which rules to
+use with the following syntax:
+
+```python
+from tvm import meta_schedule as ms
+
+design_space_generator = ms.PostOrderApply(rules=[
+ ms.MultiLevelTiling(...),
+ CustomRule(...),
+ ms.OtherBuiltinRules(...),
+])
+
+```
+
+### 3.5. Unifying manual schedule / AutoTVM / AutoScheduler
+
+This subsection shows that the design spaces induced by TE manual scheduling, AutoTVM and
+AutoScheduler are all subsets of meta schedule's, and that meta schedule further allows mixing the
+three styles to search jointly.
+
+- **Manual schedule**. A TE schedule is a special case of a meta schedule program in which no
+randomness is introduced by sampling instructions; it is a single point in the design space.
+- **AutoTVM (Template-based tuning)**. AutoTVM’s schedule templates (knobs) are represented more
+naturally as one or more meta schedule functions with sampling instructions. The probability space
+supported by the sampling instructions is the design space to be explored.
+- **AutoScheduler (Template-free tuning)**. As mentioned in the previous
section, application
+ of composite schedule rules generates the design space, which is equivalent
to AutoScheduler’s
+ sketch generation.
+- **Mixing styles in design space definition**. By taking the union of the spaces induced by the
+  three special cases, meta schedule allows developers to combine the generic rules that
+  AutoScheduler provides with operator-specific scheduling.
+
+## 4. Reference-level explanation
+
+This section introduces the underlying techniques for the automation system to
extract and
+explore the design space. The figure below briefly illustrates the workflow of
the system:
+
+![Meta schedule workflow](../resources/meta-schedule-workflow.png)
+
+### 4.1. Execution trace as the design space
+
+**Trace**. To represent the design space defined by the meta schedule DSL, the
underlying system
+records all the instructions users applied to the schedule class, including
sampling and schedule
+primitives. The list of invoked scheduling instructions, along with the random decisions made on
+sampling instructions, is called a trace.
+
+For instance, executing the example above results in the following trace:
+
+```
+Instruction 0. Sample-Perfect-Tile
+Instruction 1. Sample-Perfect-Tile
+Instruction 2. Sample-Perfect-Tile
+Instruction 3. Split
+Instruction 4. Split
+Instruction 5. Split
+Instruction 6. Reorder
+```
+
+**Trace forms design space.** A trace may contain zero or more sampling instructions, which
+introduce uncertainty into scheduling: one instance of sampling results in one point in the design
+space. Therefore, the trace itself forms a design space to explore, e.g. which set of tile sizes
+works best on a specific hardware target.
+
+**Union of design spaces**. Meta schedule works on a set of traces, representing the union of
+the design spaces that the individual traces define.
+
+**Fork a trace**. When two different decisions in the scheduling process are equally important
+for generating high-performance schedules, the trace may be forked into two, and the design space
+is the union of the forked traces.
+
+The trace is not strictly user-facing, but can be accessed and printed with
the following syntax:
+
+```python
+# Require the schedule to trace its execution
+sch = tir.Schedule(..., traced=True)
+# do a lot of scheduling
+...
+# print the trace
+print(sch.trace)
+```
+
+Below is an example of the printed trace, which faithfully reflects the schedule as a snippet of
+a Python scheduling function:
+
+```python
+b0 = sch.get_block(name="matmul", func_name="main")
+l1, l2, l3 = sch.get_loops(block=b0)
+v4, v5, v6, v7 = sch.sample_perfect_tile(loop=l1, n=4, max_innermost_factor=16, decision=[32, 1, 16, 2])
+v8, v9, v10, v11 = sch.sample_perfect_tile(loop=l2, n=4, max_innermost_factor=16, decision=[64, 4, 2, 2])
+v12, v13 = sch.sample_perfect_tile(loop=l3, n=2, max_innermost_factor=16, decision=[64, 16])
+l14, l15, l16, l17 = sch.split(loop=l1, factors=[v4, v5, v6, v7])
+l18, l19, l20, l21 = sch.split(loop=l2, factors=[v8, v9, v10, v11])
+l22, l23 = sch.split(loop=l3, factors=[v12, v13])
+sch.reorder(l14, l18, l15, l19, l22, l16, l20, l23, l17, l21)
+```
+
+**Implementation.** A trace is defined as:
+
+```python
+class Trace:
+ instructions: List[Instruction]
+ decisions: Dict[Instruction, Any]
+```
+
+For each sampling instruction in the trace, if it has a corresponding entry in
the decisions dict,
+then the output is uniquely determined by the decision, hence reproducibility
is guaranteed
+(Example 1); if no corresponding entry is present, then randomness is introduced when the trace is
+interpreted (Example 2).
+
+```python
+# Example 1. Trace with deterministic result
+l1, l2 = sch.sample_perfect_tile(loop, n=2, decision=[4, 32])  # deterministic: l1 = 4, l2 = 32
+# Example 2. Trace with randomized result
+l1, l2 = sch.sample_perfect_tile(loop, n=2) # l1 and l2 are random
+```
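
A minimal sketch of the replay logic behind these two examples (the class and function names are hypothetical, not the actual TVM classes): when a sampling instruction has a recorded decision, the decision is reused; otherwise the instruction samples afresh.

```python
import random

class SamplingInstruction:
    def __init__(self, name, sample_fn):
        self.name = name
        self.sample_fn = sample_fn  # draws a fresh random decision

def replay(instructions, decisions):
    """Re-execute a trace; recorded decisions override fresh sampling."""
    outputs = {}
    for inst in instructions:
        if inst in decisions:
            outputs[inst.name] = decisions[inst]   # deterministic replay
        else:
            outputs[inst.name] = inst.sample_fn()  # randomized replay
    return outputs

tile = SamplingInstruction("tile", lambda: random.choice([[4, 32], [8, 16]]))
deterministic = replay([tile], decisions={tile: [4, 32]})  # always [4, 32]
randomized = replay([tile], decisions={})                  # [4, 32] or [8, 16]
```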
+
+### 4.2. Exploring the Design Space
+
+Meta Schedule provides several built-in exploration strategies to exhaustively or efficiently
+search for efficient schedules. Most strategies work by re-executing either a function or a trace
+with a builtin interpreter in meta schedule; this process is called **replay**.
+
+#### Random search by replaying schedule functions
+
+With a user-provided schedule function as a black-box design space generator, meta schedule
+repetitively invokes this opaque TVM packed function without doing any extra analysis.
+If sampling instructions are present, then scheduling is non-deterministic (random decisions may
+not be repeated across runs). Effectively, this is equivalent to random exploration without a
+trace. It gives users the flexibility to define arbitrary functions that traces may not support
+well (e.g. control flow diverging on the value of intermediate random variables), but it
+forecloses any future trace-based analysis.
+
+#### Random search by replaying traces
+
+A builtin interpreter directly replays the traces obtained
+from manual schedule, template-based or template-free design space generation.
+If sampling instructions are present in the traces,
+then their random decisions are mutated during each replay, i.e. each replay jumps to a new point
+in the design space. Therefore, repetitive replay of those traces is equivalent to exploring the
+design space.
+
+The search speed of meta schedule could be improved by allowing traces to be
analyzed before they
+are run. For example, trace analysis could reject obviously invalid schedules (e.g. ones using
+too many CUDA resources) and remove dead code before the traces are run. The cost model could
+benefit from the traces as well by extracting trace-level features.
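
As an illustration of such trace-level analysis, the sketch below removes dead instructions from a toy trace encoding; the `(name, inputs, outputs)` tuples are a made-up representation, not the real trace format.

```python
def remove_dead_instructions(trace):
    """Drop instructions whose outputs are never consumed later on."""
    live = set()
    kept = []
    for name, inputs, outputs in reversed(trace):
        # Keep instructions that mutate the IR (no outputs) or whose
        # outputs feed a later live instruction.
        if not outputs or any(o in live for o in outputs):
            kept.append((name, inputs, outputs))
            live.update(inputs)
    return list(reversed(kept))

trace = [
    ("sample_perfect_tile", [], ["v0", "v1"]),
    ("sample_perfect_tile", [], ["v2", "v3"]),  # outputs unused: dead code
    ("split", ["v0", "v1"], []),                # transforms the IR directly
]
cleaned = remove_dead_instructions(trace)       # the dead sample is dropped
```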
+
+#### Cost-model-guided evolutionary search
+
+This is a more efficient exploration strategy, introduced by AutoScheduler.
+For more details, please refer to Section 5.1 of its paper [2].
+
+The evolutionary search strategy requires two sets of rule-specific logic to validate or tweak
+the produced schedules:
+
+* Mutator: defines how to jump to a point’s "neighbor" in the design space
+* Postprocessor: after the trace is executed, there are some extra rules to
execute, for
+ example:
+  * Check CUDA resource limits: CUDA requires that the number of threads per block not exceed
+    1024, but the thread count is a random variable that cannot be determined before actually
+    executing the trace. In this case, a postprocessor is implemented to error out when such
+    conditions are not satisfied.
+ * Fuse outer loops until the extent of the fused loops is large enough: The
number of outer loops
+ to be fused together depends on their extents, which are random variables.
In this case, the
+ composite schedule rule annotates the maximum extent allowed on the block,
and a corresponding
+ postprocessor detects the annotation and does the actual fusion.
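
The CUDA resource check above boils down to a simple predicate over the sampled thread extents, which only becomes checkable after the trace is executed. A sketch (the function name is hypothetical):

```python
def check_cuda_thread_limit(thread_extents, limit=1024):
    """Postprocessor sketch: reject schedules whose sampled thread
    extents multiply beyond the CUDA per-block thread limit."""
    total = 1
    for extent in thread_extents:
        total *= extent
    return total <= limit

ok = check_cuda_thread_limit([32, 16])   # 512 threads: accepted
bad = check_cuda_thread_limit([32, 64])  # 2048 threads: rejected
```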
+
+The evolutionary search algorithm uses mutators to find possible schedules in the design space,
+then applies postprocessors and asks the cost model to predict their performance. After a few
+iterations, the new schedules with the highest scores are finally compiled and
measured on device.
+Epsilon-greedy is used in this process to balance exploitation and exploration.
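
The epsilon-greedy step can be sketched as follows; `predicted_cost` stands in for the cost model, and the candidate encoding is arbitrary:

```python
import random

def epsilon_greedy_pick(candidates, predicted_cost, epsilon=0.1, rng=random):
    """With probability epsilon explore a random candidate; otherwise
    exploit the candidate the cost model predicts to be fastest."""
    if rng.random() < epsilon:
        return rng.choice(candidates)           # exploration
    return min(candidates, key=predicted_cost)  # exploitation

costs = {"sched_a": 3.0, "sched_b": 1.0, "sched_c": 2.0}
greedy = epsilon_greedy_pick(list(costs), costs.get, epsilon=0.0)
```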
+
+### 4.3. Database
+
+All the measure records are serialized and stored in a database. The database schema contains
+the following information:
+
+- The workload, a serialized TensorIR;
+- The hardware target where the measure is conducted;
+- Argument type: the shapes and dtypes of the input tensors fed to the
measured PrimFunc.
+This field can be useful for future dynamic-shape workloads;
+- The trace, including the schedule primitives used and their corresponding
decisions (if any);
+- The measured running time;
+- The version of the log.
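
Such a record might be serialized as one JSON object per line; the dataclass below merely mirrors the schema, and its field names and values are illustrative rather than the actual storage format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TuningRecord:
    workload: str    # serialized TensorIR
    target: str      # hardware target of the measurement
    arg_info: list   # shapes and dtypes of the input tensors
    trace: list      # schedule primitives plus their decisions
    run_secs: list   # measured running times
    version: str     # version of the log

record = TuningRecord(
    workload="matmul-tensorir", target="llvm",
    arg_info=[["float32", 1024, 1024]],
    trace=[["sample_perfect_tile", [32, 1, 16, 2]]],
    run_secs=[0.0012], version="0.1",
)
line = json.dumps(asdict(record))             # one line in the database
restored = TuningRecord(**json.loads(line))   # round-trips losslessly
```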
+
+### 4.4. Python-first for flexibility & customizability
+
+The system is implemented in a way that all levels are decoupled and open to
customization, aiming
+at providing a playground for developers to try out new ideas and potentially
deliver performance
+quickly.
+
+While all the important APIs are implemented in C++ for efficiency, every part of the system can
+be easily switched to a customized Python implementation. For example:
+
+#### Customize design space in python
+
+The design space can be defined as a Python function:
+
+```python
+def schedule_matmul(sch: Schedule) -> Schedule:
+ i, j, k = sch.get_loops(sch.get_block("matmul"))
+ i_tiles = sch.sample_perfect_tile(i, n=4)
+ j_tiles = sch.sample_perfect_tile(j, n=4)
+ k_tiles = sch.sample_perfect_tile(k, n=2)
+ # Tile the loops according to the random variables
+ i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
+ j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
+ k_0, k_1 = sch.split(loop=k, factors=k_tiles)
+ # Organize the loops into "SSRSRS" 6-level tiles
+ sch.reorder(
+ i_0, j_0, # S
+ i_1, j_1, # S
+ k_0, # R
+ i_2, j_2, # S
+ k_1, # R
+ i_3, j_3, # S
+ )
+ return sch
+```
+
+#### Customize composite schedule in python
+
+The system provides two interfaces to define a composite schedule in Python: one is more
+succinct, and the other is more comprehensive.
+
+Method 1. Derive from `PyCompositeSchedule`, and implement two methods
`initialize` and `apply`:
+
+```python
+class MultiLevelTiling(PyCompositeSchedule):
+    def initialize(self, ...):
+        # Initialize the rule; usually this method is empty
+        ...
+
+    def apply(self, sch: Schedule, block: BlockRV) -> Union[Schedule, List[Schedule]]:
+        # Write any python code, including:
+        # - analyze `block`
+        # - invoke schedule primitives in `sch`
+        # - do debug printing
+        ...
+```
+
+Method 2. Use a decorator as syntactic sugar when the `initialize` method is empty; it converts
+the decorated function into the `apply` method.
+
+```python
[email protected]_composite_schedule(name="multi-level-tiling")
+def multi_level_tiling(sch: Schedule, block: BlockRV) -> Union[Schedule, List[Schedule]]:
+ # write any python code, including:
+ # - analyze `block`
+ # - invoke schedule primitives in `sch`
+ # - do debug printing
+ ...
+```
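
The decorator sugar itself is straightforward to emulate in plain Python; this sketch (not the actual TVM implementation) wraps the function into an object whose `initialize` is empty and whose `apply` delegates to the function:

```python
def as_composite_schedule(name):
    """Sketch: turn a plain function into a composite-schedule object."""
    def wrap(func):
        class _Rule:
            rule_name = name
            def initialize(self):
                pass  # empty by construction
            def apply(self, sch, block):
                return func(sch, block)  # delegate to the decorated function
        return _Rule()
    return wrap

@as_composite_schedule(name="multi-level-tiling")
def multi_level_tiling(sch, block):
    return sch  # placeholder body

result = multi_level_tiling.apply("sch", "block")
```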
+
+#### Customize exploration strategies in python
+
+Developers can also implement any search algorithm in Python by deriving from `PySearchPolicy`;
+the syntax is identical to customizing with `PyCompositeSchedule`.
+
+#### Other customizable components
+
+This list includes:
+
+* Cost model
+* Database
+* Measure callbacks
+* Feature extractor
+* Program builder & runner
+* Analysis methods
+* ...
+
+In short, almost every component of the system is decoupled from the others, and extensions can
+be easily plugged in.
+
+## 5. Drawbacks
+
+Meta schedule requires integration with Relay operator strategy and compile
engine (TECompiler) to
+properly lower a Relay subgraph for task extraction. Further, TE schedules in
TOPI will need to be
+migrated to TensorIR schedules.
+
+## 6. Rationale and alternatives
+
+The system is designed with the principle of minimalism: assuming users already know the TensorIR
+scheduling APIs, there is no extra API set to learn; previous programs that schedule TensorIR
+still work out of the box with the meta schedule DSL. This could lower the bar to adopting the
+system.
+
+Unifying manual scheduling, AutoTVM's semi-automatic templates and AutoScheduler's fully
+automatic sketch generation provides a flexible way to balance injecting new domain knowledge and
+automation.
+
+Flexibility in customization allows quick experiments with new tasks, new strategies and new
+hardware targets without deep knowledge of the system.
+
+## 7. Prior art
+
+**Tensor Expression (TE)** in TVM is a DSL that decouples compute and
schedule, which provides
+convenient ways to handcraft optimized kernels for different hardware targets.
+
+**TensorIR** is the low-level IR in TVM. Its capability of eagerly applying
schedule primitives
+opens the door for meta schedule, the proposed next-generation auto scheduling
system.
+
+**AutoTVM** is the 1st generation automation framework in TVM; it requires developers to
+implement per-operator scheduling templates, while the system handles the tuning process.
+
+**AutoScheduler** is the 2nd generation automation framework in TVM, whose
built-in rules
+could automatically generate schedule templates for almost all the operators
on CPU, GPU, etc.
+
+## 8. Unresolved questions
+
+**Control Flow**
+
+The meta schedule DSL does not support control flow yet. Although no real-world use case has been
+reported at the time of writing, control flow could appear in future workloads. The best syntax
+for it is not yet determined; a working example could be TensorFlow's `tf.cond`.
+
+**Assertion**
+
+Sampling instructions may lead to invalid schedules on CUDA, e.g. the resulting program may use
+too much shared memory, too many threads, etc. Such cases are detected by a postprocessor. To
+accelerate the process,
+it is possible to introduce an assertion statement that exits early if the GPU
code is not valid,
+and its syntax can be, for example:
+
+```python
+sch.assert(j_2 * j_2 <= 1024)
+```
+
+## 9. Future possibilities
+
+**Unifying Manual Scheduling, AutoTVM and AutoScheduler in TOPI**
+
+As described in Section 3.5, meta schedule provides an idiomatic approach to
unify the three
+existing scheduling APIs in TVM:
+
+* Manual schedules are meta schedules without sampling instructions;
+* AutoTVM templates are meta schedules where knobs are replaced by sampling
instructions;
+* Each of AutoScheduler’s search rules generates a snippet of a meta schedule;
+* A mixture of the three approaches above covers a larger design space.
+
+At the time of writing, TOPI contains a number of schedule functions
implemented either in manual TE
+or AutoTVM-style. It is our future work to unify these existing scheduling
APIs on TOPI operators,
+and enable different styles to be auto-tuned jointly.
+
+[1] Chen, Tianqi, et al. "Learning to Optimize Tensor Programs." Advances in Neural Information
+Processing Systems 31 (2018): 3389-3400.
+
+[2] Zheng, Lianmin, et al. "Ansor: Generating High-Performance Tensor Programs for Deep Learning."
+14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020.