This is an automated email from the ASF dual-hosted git repository.
jwfromm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-rfcs.git
The following commit(s) were added to refs/heads/main by this push:
new 23250f5 Collage RFC (#62)
23250f5 is described below
commit 23250f59cfc89673b7efe1d8f13192ed8381824d
Author: Mark Shields <[email protected]>
AuthorDate: Fri Apr 1 11:50:58 2022 -0700
Collage RFC (#62)
* Collage RFC
* - Add some FAQ entries. Make the partition_for_x preamble issue more
explicit.
* - Clarify pass ordering issue.
* - Strawman for moving CollageFuseOps early.
* - Switch from 'fusion' to 'partition' focus, matched prototype again.
* - Err, three advantages, not four.
+* - minor clarification
* - some more clarifications based on github comments
* - expand on PartitionRule
* - Manupa's comments
* - Remainder of Manupa's comments (oops)
- Add some pictures
---
rfcs/0062-collage.md | 1041 ++++++++++++++++++++
.../assets/0062/dataflow_graphs_and_sub_graphs.png | Bin 0 -> 99128 bytes
rfcs/assets/0062/optimal_placement.png | Bin 0 -> 125042 bytes
rfcs/assets/0062/partition_rules.png | Bin 0 -> 188360 bytes
rfcs/assets/0062/search_graph.png | Bin 0 -> 191885 bytes
5 files changed, 1041 insertions(+)
diff --git a/rfcs/0062-collage.md b/rfcs/0062-collage.md
new file mode 100644
index 0000000..0e7a5c4
--- /dev/null
+++ b/rfcs/0062-collage.md
@@ -0,0 +1,1041 @@
+```
+Feature Name: Collage [Draft 0.81]
+Start Date: Mar 2022
+Authors: Mark Shields ([email protected])
+RFC PR: https://github.com/apache/tvm-rfcs/pull/62
+
+History:
+- v0.7: First draft.
+- v0.8: Rework to emphasise 'partitioning' (quite early in pipeline) instead
of 'fusion' (quite late in pipeline).
+```
+
+# Summary
+
+This design doc (with accompanying
+['v2' prototype
implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch))
+shows how to bring tuning to TVM's BYOC partitioning passes. The tuning search
explores the choice of sub-graphs (aka '
+partitions') and toolchains (aka 'backends') so as to minimize the expected
model inference latency. Both 'graph
+style' (eg TensorRT) and 'library style' (eg DNNL) BYOC integrations are
supported. We call the result an 'optimal
+partitioning'. This new tuning layer complements the tuning traditionally done
by TVM and other toolchains during
+lowering. It can also complement any global tuning, for example to explore the
choice of layout convention or device
+assignment.
+
+The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):
+
+> *Collage: Automated Integration of Deep Learning Backends*
+> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia
+
+(See Appendix A for a comparison of this proposal and the paper's
implementation. See Appendix D for TODO items in the '
+v2' prototype.)
+
+When Collage is enabled it subsumes the existing
`MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/
+`PartitionGraph` passes embedded within each `partition_for_<toolchain>`
function with a single new
+`CollagePartitioner` pass. The pass is guided by the list of available
`Target`s and three existing sources:
+
+1. The `"TOpPattern"` attributes provided for every Relay operator and used by
TVM's built-in `FuseOps`.
+2. The BYOC `"target.<toolchain>"` operator predicates provided for some
operator/toolchain pairs by
+ 'operator-based' BYOC integrations.
+   TODO(mbs): Consider removing predicate-based BYOC integrations once
+   TensorRT has been transitioned to be pattern based.
+3. The BYOC operator pattern/predicates (usually) registered in the pattern
table by 'pattern-based' BYOC integrations.
+
+The pass is run as early in the compilation flow as possible (see Appendix C).
+
+Only some boilerplate aspects of existing BYOC integrations need to be
adjusted to support Collage (patterns must
+be registered in the standard pattern table, 'preamble' passes need to be
split out as per Appendix C, and any
+mandatory post lowering helpers must be folded into the custom lowering
function. We'll make sure these changes are
+part of or coordinated with the UMA project). However Collage may require more
robustness from the BYOC integrations,
+see Appendix F.
+
+Note however that we are **not** proposing to deprecate the existing
`partition_for_<toolchain>` operations (or their
+UMA equivalent). This is mostly because Collage is inherently a tuning-based
system which is not practical for users who
+need a stand-alone compiler. But it is also because of challenges with
establishing a common pass ordering which will
+work for both TVM and all BYOC toolchains (see Appendix C for more details).
+
+# Motivation
+
+This tuning approach contrasts with TVM's existing "greedy" and "manual"
approaches to partitioning:
+
+- Greedy: Currently only the largest possible supported sub-graphs are used
for partitions, irrespective of their
+ execution time. With Collage many more candidate sub-graphs are explored,
and it is possible for two smaller
+ sub-graphs to yield better overall latency than one large sub-graph if they
mix toolchains.
+- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke
the corresponding
+ `partition_for_<toolchain>` function before the main TVM compilation flow
begins. With Collage the choice of toolchain
+ can be automated based on measured latency. Collage will also explore mixing
and matching between multiple BYOC
+ toolchains as well as TVM's native backend.
+
+
+Collage offers three advantages:
+
+- **Latency**: Overall model latency may be reduced compared to TVM native,
TVM with a single
+ `partition_for_<toolchain>` call, or a non-TVM stand-alone compiler such as
TensorRT.
+- **Automation**: The choice of which BYOC toolchains to enable can be
automated.
+- **Economy and modularity of implementation**: Four standalone passes using
two separate mechanisms for expressing
+ partitioning rules/algorithms can be replaced with one, which itself is
built from compositional primitives. (The
+ machinery is also reusable for the very similar problem of choosing TVM
fusion kernels, which we'll tackle in the
+ future).
+
+See Appendix H for some frequently asked questions.
+
+# Success Metrics
+
+1. Collage offers at least a 10% latency improvement for a selection of
standard ONNX models and NVIDIA hardware using
+ targets which include the CuDNN and CuBlas libraries, the CUTLASS library
(with tuning, via BYOC), the TensorRT
+ compiler (via BYOC), and (obviously!) TVM native.
+2. Collage does not require new per-target or per-model patterns or rules to
be implemented independently of the BYOC
+ integrations.
+3. Collage with a `Target` list enabling just one BYOC toolchain is never
worse than using the existing
+ `partition_for_<toolchain>` function directly. (Since partitioning for
multiple toolchains in sequence should never
+ improve the result for any single toolchain we consider just the single
BYOC case.)
+
+
+# Project Milestones
+
+- [Done] M0: Port paper prototype to recent TVM main and validate paper
results.
+- [Done] M1: Internal design doc.
+- [Done] M2: Use 'v2' prototype to test design doc, and rework ready for TVM
community.
+- [In progress] M3: RFC
+- [2022Q1] M4: Re-validate results on 'v2' prototype for larger models (eg
GPT2) and more NVIDIA targets.
+- [2022Q2] M5: Implementation in TVM main, including 'sub-projects' listed
below.
+- [OctoML internal] M6: Estimator integrated into OctoML platform, validation
against OctoML test suite.
+- [OctoML internal] M7: Productionization for OctoML.
+
+# Check-in plan
+
+Though the 'v2' prototype is in a personal branch we'd like to transition to
main ASAP and rely on directory/namespace
+separation, maintaining backwards compat, and a new `PassConfig` flag to
isolate all Collage changes from the rest of
+TVM. A rough PR progression is:
+
+- TensorRT and CUTLASS BYOC changes are backwards compat. The existing
`partition_for_<toolchain>` functions remain. The
+ CUTLASS-specific tuning and codegen functions will either continue to be
supported or we'll work with users to account
+ for them being folded into the function-at-a-time `relay.ext.cutlass`
codegen function.
+- The `DFPattern` and friends changes are mostly for improving the
robustness of the
+ `IndexedGraph<T>` class and can go into main independently.
+- Some basic `Expr` improvements can go into main independently.
+- The design allows for multiple `Target`s for the same `DLDeviceType`. That
requires the various
+ `build` interfaces which currently accept `Union[Target,Dict]` to also
accept a list of `Target`s, and can be
+ backwards compat.
+- The new Collage code can go in bottom-up as we develop unit tests:
+ - Support utils, including `NameSupply`, `IndexSet`, `PriorityQueue`,
`Cost`, `CostEstimator`.
+ - The core `SubGraph` datatype.
+ - `CandidatePartition`.
+ - The `PartitionRule` class hierarchy, as a series of PRs, ending with
`PartitionSpec`.
+ - `GatherPartitionSpecs` helper for bridging the existing BYOC world with
the Collage `PartitionRule` world.
+ - The `CollagePartitioner` driver pass itself.
+
+# Guide-level explanation
+
+Collage allows the choice of BYOC toolchains and the partitioning among them
+to be determined automatically
+so as to minimize overall (expected) model execution latency.
+
+To compile with Collage it's necessary to set a `PassContext` flag, and include
+'Collage aware' `Targets` in the build's `target` argument.
+
+
+For example, assume `mod` is bound to
[MNIST](https://github.com/onnx/models/tree/main/vision/classification/mnist):
+
+```
+def @main(%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+ %0 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+ %1 = nn.conv2d(%0, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
+ padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5]);
+ %2 = add(%1, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
+ %3 = nn.relu(%2);
+  %4 = nn.max_pool2d(%3, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]);
+  %5 = nn.pad(%4, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %6 = nn.conv2d(%5, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/,
+         padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
+  %7 = add(%6, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
+  %8 = nn.relu(%7);
+  %9 = nn.max_pool2d(%8, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]);
+  %10 = reshape(%9, newshape=[1, 256]);
+  %11 = nn.dense(%10, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/,
+          units=None, out_dtype="float32");
+ add(%11, meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
+}
+```
+
+We can compile this with Collage enabled for a variety of NVIDIA
toolchains/libraries with
+the following fragment:
+
+```
+with tvm.transform.PassContext(config={"relay.fallback_device_type": 2,
+                                       "relay.collage.enable_collage": True}):
+ host_target = tvm.target.Target("llvm")
+ generic_target = tvm.target.Target("cuda", host_target)
+ cutlass_target = tvm.target.Target("cuda -compiler=cutlass", host_target)
+ tensorrt_target = tvm.target.Target("cuda -compiler=tensorrt", host_target)
+ cudnn_target = tvm.target.Target("cuda -compiler=cudnn", host_target)
+ cublas_target = tvm.target.Target("cuda -compiler=cublas", host_target)
+ targets = [generic_target, cutlass_target, tensorrt_target, cudnn_target,
cublas_target]
+ exe = tvm.relay.vm.compile(mod, target=targets)
+```
+
+(Note that `cudnn` and `cublas` are not yet supported in the 'v2' prototype,
see Appendix B.)
+
+After the `CollagePartitioner` pass, the intermediate `"main"` global function
could resemble the following
+(though we've modified this "optimal" partitioning by hand for illustration so
don't take it as representative of actual
+performance):
+
+```
+def @main(%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+ # Operators left behind in the function body are intended for TVM.
+ # The usual Relay passes may rewrite them, then FuseOps will push them
+ # into "Primitive" functions (without any "Compiler" attribute) ready
+ # for TVM lowering.
+ %4 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+ # This conv2d will be offloaded to cudnn. However the main TVM compilation
+ # flow is responsible for emitting the call.
+ %6 = fn (%FunctionVar_5: Tensor[(1, 1, 32, 32), float32],
+ Composite="cudnn.conv2d") -> Tensor[(1, 8, 28, 28), float32] {
+    nn.conv2d(%FunctionVar_5, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
+              padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5])
+ };
+ # Back to vanilla TVM.
+ %7 = %6(%4);
+ %3 = add(%7, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
+ %9 = nn.relu(%3);
+  %11 = nn.max_pool2d(%9, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]);
+ %13 = nn.pad(%11, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  # Use TensorRT. The "Primitive" function delineates the partition.
+ %14 = fn (%FunctionVar_03: Tensor[(1, 8, 18, 18), float32],
+ %FunctionVar_11: Tensor[(16, 1, 1), float32],
+ Primitive=1,
+ Compiler="tensorrt",
+            global_symbol="collage_nn_conv2d_add_nn_relu_1") -> Tensor[(1, 16, 14, 14), float32] {
+    %1 = nn.conv2d(%FunctionVar_03, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/,
+                   padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
+ %2 = add(%1, %FunctionVar_11);
+ nn.relu(%2)
+ };
+ %15 = %14(%13, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
+ # Back to vanilla TVM.
+  %17 = nn.max_pool2d(%15, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]);
+ %19 = reshape(%17, newshape=[1, 256]);
+  # Use CUTLASS. Note the double function nesting: the outer "Primitive" function
+  # delineates the partition and the inner "Composite" function maps the original
+  # Relay operators to a tag to be used during compilation/build/lowering with
+  # the CUTLASS BYOC integration.
+ %20 = fn (%FunctionVar_0: Tensor[(1, 256), float32],
+ %FunctionVar_1: Tensor[(10, 256), float32],
+ %FunctionVar_2: Tensor[(1, 10), float32],
+ Primitive=1,
+ Compiler="cutlass",
+            global_symbol="collage_cutlass_dense_bias_nn_dense_add") -> Tensor[(1, 10), float32] {
+ %1 = fn (%FunctionVar_01: Tensor[(1, 256), float32],
+ %FunctionVar_11: Tensor[(10, 256), float32],
+ %FunctionVar_21: Tensor[(1, 10), float32],
+ Composite="cutlass.dense_bias") -> Tensor[(1, 10), float32] {
+      %0 = nn.dense(%FunctionVar_01, %FunctionVar_11, units=None, out_dtype="float32");
+ add(%0, %FunctionVar_21)
+ };
+ %1(%FunctionVar_0, %FunctionVar_1, %FunctionVar_2)
+ };
+ %20(%19, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/,
+ meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
+}
+```
+
+The remainder of the compilation will respect the partitioning found by
Collage without
+any further user involvement.
+
+# Reference-level explanation
+
+The implementation is mostly under `src/relay/collage/...` (namespace
`tvm::relay::collage`), with just a few Python
+helper functions under `python/tvm/relay/collage`.
+
+If the `relay.collage.enable_collage` `PassConfig` attribute is true then a
new `CollagePartitioner` pass is inserted
+before all other Relay passes. The result of the pass is:
+
+- All Relay sub-graphs in all global functions which are to be handed off to a
BYOC toolchain are replaced by calls to
+ an inline `"Primitive"` function with `"Compiler"` and `"global_symbol"`
attributes.
+- Relay operators, or groups of operators, which are to be translated to
particular library or BYOC-supplied function
+ are replaced by calls to an inline `"Composite"` function. (This encoding is
supported for both BYOC and external
+ libraries.)
+
+TODO(mbs): We need to also support
+[RFC10](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0010-target-registered-compiler-flow-customisation.md)
style BYOC extensions in the partitioning encoding.
+
+Note that no `"Primitive"` functions denoting TVM kernels are produced -- the
existing `FuseOps` pass is still required.
+
+The `CollagePartitioner` pass has four phases:
+
+- **Phase 1**: The available `Target`s are scanned to build a list of rules
describing how to find possible partitions (
+ see `PartitionSpec` and `PartitionRule` below). Depending on the `Target`
the rules may incorporate entries from the
+ BYOC pattern table. (The remaining phases execute on each global function
separately.)
+- **Phase 2**: A dataflow graph is constructed for the global function (which
is just an `IndexedGraph<Expr>`). The
+ available rules from phase 1 are evaluated on the dataflow graph to yield a
(possibly overlapping) set of candidate
+ partitions for each target (see `CandidatePartition` below). Each candidate
efficiently describes a sub-graph of the
+ global function's body without the need to construct any new expressions
(see `SubGraph` below).
+- **Phase 3**: A least cost path is found in the following (implicit and
lazily constructed) search graph:
+ - Search Node: Each node represents the set of 'covered' dataflow nodes
which have been assigned to a
+ candidate partition on every path to the node from the starting node.
+ - Starting node: The search node with empty 'covered' set.
+ - Ending node: The search node with every dataflow node in the 'covered' set.
+  - Search Edge X->Y: An edge exists for every candidate partition P which does
+    not overlap X's 'covered' nodes; Y's 'covered' nodes are those of X union P.
+    To avoid an unnecessary search-space explosion the candidate must also
+    include the next yet-to-be-covered dataflow node in X.
+ - Edge cost: The estimated latency of the candidate partition, plus a
partition transition penalty. Note
+ that though we need to be able to extract the candidate's sub-graph in
order to build a function
+ representing the candidate to measure with, we do not yet need to
partition the overall function body
+ expression.
+
+ Other search algorithms are certainly possible, eg the paper uses an
evolutionary search to refine the partitioning
+ found by the dynamic-programming search. We can easily abstract away the
search interface to support multiple
+ implementations in the future.
+- **Phase 4**: The function body is partitioned according to the candidate
kernels on the shortest path. This phase
+ can be run independently of the first three so that additional inspection or
optimization may be applied to
+  the intermediate optimal partitioning.
+
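The phase 3 search is essentially a least-cost path problem. The following sketch is illustrative only (the names, the candidate encoding as plain index sets, and the scalar transition penalty are simplifications, not the prototype's API); it shows a Dijkstra-style search over 'covered' sets in which each expansion must include the next yet-to-be-covered node:

```python
import heapq
import itertools

def least_cost_cover(num_nodes, candidates, transition_penalty=0.0):
    """Least-cost selection of non-overlapping candidates covering all nodes.

    candidates: list of (frozenset of covered post-dfs indexes, cost) pairs.
    Returns (total_cost, list of chosen candidate indices), or None if the
    candidates cannot cover every node.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares sets
    start = frozenset()
    queue = [(0.0, next(tie), start, ())]
    best = {start: 0.0}
    while queue:
        cost, _, covered, chosen = heapq.heappop(queue)
        if len(covered) == num_nodes:
            return cost, list(chosen)
        if cost > best.get(covered, float("inf")):
            continue  # stale queue entry
        # Only expand with candidates covering the first uncovered node;
        # this prunes the search space without losing optimality.
        next_node = min(set(range(num_nodes)) - covered)
        for i, (nodes, c) in enumerate(candidates):
            if next_node in nodes and not (nodes & covered):
                new_covered = covered | nodes
                new_cost = cost + c + transition_penalty
                if new_cost < best.get(new_covered, float("inf")):
                    best[new_covered] = new_cost
                    heapq.heappush(
                        queue, (new_cost, next(tie), new_covered, chosen + (i,)))
    return None
```

In the real pass the 'cost' of each expansion comes from measuring the candidate's extracted function, and the search graph is constructed lazily.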
+In the following we introduce the new datatypes, then expand on the phases.
+
+### Util Datatypes
+
+- `PostDfsIndex`: The integer index of a Relay sub-expression in a post-dfs
traversal of the overall Relay expression.
+ If index i is less than index j then we know the sub-expression for j cannot
influence the value of the sub-expression
+ for i.
+- `DataflowGraph`: An alias for the existing `IndexedGraph<Expr>` from the
`DFPatternMatcher` suite (which in turn is a
+ reworked copy of the `IndexedGraph` private to `fuse_ops.cc`). It is used
throughout to manage the three-way bijection
+ from Relay `ExprNode`s to `PostDfsIndex`s to
+ `DataflowGraph::Node`s. Each `DataflowGraph::Node` describes the
sub-expression's dataflow inputs, outputs, dominator
+ and inverse-dominators.
+- `IndexSet`: A bit vector indexed by `PostDfsIndex`s. These are used as a
compact representation for an arbitrary set
+ of dataflow nodes in a dataflow graph.
+- `Cost`: A `double` representing a candidate partition (or kernel) 'cost',
which currently is just mean execution
+ latency in seconds. Collage only cares that costs are additive and a total
order, so in the future we could support
+ cost functions which balance execution time against high memory watermark or
other measures. Costs may be `Unknown`
+ (ie NaN) to signal some other heuristic should be used to compare kernel
costs. Costs may be `Invalid` (ie +inf)
+ to signal the toolchain could not compile and run a candidate kernel.
+
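The cost conventions above can be captured in a few lines. This sketch is illustrative (the class and method names are hypothetical, not the C++ API), but shows how `Unknown` (NaN) and `Invalid` (+inf) behave under addition, and how a total order can still be imposed so that measured candidates always win:

```python
import math

class Cost:
    """Candidate cost: mean latency in seconds; NaN = unknown, +inf = invalid."""

    def __init__(self, value):
        self.value = value

    @staticmethod
    def unknown():
        return Cost(math.nan)

    @staticmethod
    def invalid():
        return Cost(math.inf)

    def is_unknown(self):
        return math.isnan(self.value)

    def is_invalid(self):
        return math.isinf(self.value)

    def __add__(self, other):
        # Costs are additive: unknown propagates (NaN + x = NaN) and
        # invalid dominates (+inf + x = +inf), both for free via float math.
        return Cost(self.value + other.value)

    def __lt__(self, other):
        # Total order: known < unknown < invalid, so candidates with
        # measured latencies are always preferred.
        def key(c):
            if math.isnan(c.value):
                return (1, 0.0)
            if math.isinf(c.value):
                return (2, 0.0)
            return (0, c.value)
        return key(self) < key(other)
```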
+### SubGraph
+
+A `SubGraph` is an `IndexSet` of the `PostDfsIndex`s of all dataflow nodes
'inside' an arbitrary sub-graph of the
+overall dataflow graph. This and `PartitionRule` below are the core Collage
datatypes. The following illustrates
+the dataflow graph, indexes and one sub-graph for 'mini' MNIST (MNIST with the
second layer removed):
+
+
+
+Sub-graphs can be used to represent partitions/kernels/composite functions
without having to pay the cost of
+constructing or rewriting any expressions. We also allow 'extracting' a
function to use for measuring a
+partition/kernel's latency independently from 'rewriting' the overall Relay
expression since only a tiny subset of
+candidate partitions will end up being needed after Collage has completed its
search.
+
+We expect O(thousands) of sub-graphs to be in flight while processing a given
model, so are mindful of space overhead.
+
+A sub-graph classifies every dataflow node of the overall expression as either
'inside' or
+'outside' the sub-graph. Obviously not all such divisions make sense, for
example it is not valid for an inside node to
+feed into another inside node via outside nodes. We provide an
+`IsValid` method to check for validity, and `SubGraphConfig` to control which
validity rules apply (such as maximum
+depth).
+
+We generally work with the `DataflowGraph` representation of the overall Relay
expression rather than the expression
+itself. We use the post-dfs visit index to uniquely refer to expression nodes.
+
+As well as 'inside' and 'outside' we have four other flavors of dataflow
+nodes, all uniquely determined from the 'inside' nodes:
+
+- 'entry' nodes are those inside with at least one dataflow input outside.
+- 'exit' nodes are those inside with at least one dataflow output outside, or
which are considered 'external' in the
+ underlying dataflow graph (eg because they represent the result of the
overall function).
+- 'input' nodes are those outside with at least one dataflow output inside.
+- 'output' nodes are those outside with at least one dataflow input inside.
+
+Index sets for these are cached with the sub-graph for performance.
+
+It is valid to have multiple entry nodes (we can bind a parameter for each).
It may be valid to have multiple exit
+nodes (we can build a tuple of all such). It may be valid to have exit nodes
which also contribute to other inside
+nodes (ie represent a 'tap' on an intermediate result).
+
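All four flavors are determined by the 'inside' set and the dataflow edges alone. A minimal sketch (the helper and its argument encoding are hypothetical; the dataflow graph is reduced to an edge list rather than the real `DataflowGraph`):

```python
def node_flavors(num_nodes, edges, inside, external_outputs=frozenset()):
    """Classify nodes relative to an 'inside' set.

    edges: (src, dst) pairs meaning the value at index src flows into dst.
    external_outputs: inside nodes considered outputs of the overall function.
    Returns (entry, exit, input, output) sets of post-dfs indexes.
    """
    inputs_of = {n: set() for n in range(num_nodes)}
    outputs_of = {n: set() for n in range(num_nodes)}
    for src, dst in edges:
        inputs_of[dst].add(src)
        outputs_of[src].add(dst)
    # Entry: inside, with at least one dataflow input outside.
    entry = {n for n in inside if inputs_of[n] - inside}
    # Exit: inside, feeding outside or 'external' in the overall graph.
    exit_ = {n for n in inside
             if outputs_of[n] - inside or n in external_outputs}
    # Input: outside, with at least one dataflow output inside.
    input_ = {n for n in range(num_nodes)
              if n not in inside and outputs_of[n] & inside}
    # Output: outside, with at least one dataflow input inside.
    output = {n for n in range(num_nodes)
              if n not in inside and inputs_of[n] & inside}
    return entry, exit_, input_, output
```

For the chain `0 -> 1 -> 2 -> 3` with `inside = {1, 2}`, node 1 is the entry, 2 the exit, 0 the input and 3 the output.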
+Sub-graphs are closed under:
+
+- Disjoint union.
+- Wrapping by a function with given attributes. This can be used to encode
"Composite" functions, or to represent a
+ candidate kernel within a "Primitive" function. (By combining 'wrapping' with
+ 'union' we can encode, eg, 'this sub-graph should be placed inside a
primitive function which itself may have calls to
+ composite functions).
+- Substitution, which allows a sub-graph w.r.t. one dataflow graph to be
transformed to match some other (typically
+ smaller) dataflow graph.
+
+Note that the Relay `PatternPartitioner` goes directly from `Expr` to
partitioned `Expr` without stopping at any
+intermediate representation. It may be worth 'promoting' `SubGraph` out of
Collage and into the standard `DFPattern`
+suite, we leave that to future work.
+
+### CandidatePartition
+
+A `CandidatePartition` pairs a `SubGraph` with a `Target`. All Collage search
and measurement is in terms of candidate
+partitions.
+
+### PartitionRule
+
+A `PartitionRule` describes how to find a set of `CandidatePartition`s for a
`DataflowGraph`. This and `SubGraph`
+above are the core Collage datatypes. All partition rules implement the method:
+
+```
+virtual std::vector<CandidatePartition> AllCandidates(
+    const DataflowGraph& dataflow_graph, const PartitionSpec& spec) const;
+```
+
+The candidates are allowed to overlap, and ultimately it is the job of the
Collage searcher to find a selection of
+candidates which cover the whole Relay expression without overlap. There may
be many thousands of candidates in flight
+during the Collage search.
+
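In Python pseudocode the rule interface and a simple combinator might look as follows. This is a hedged sketch with hypothetical names (the real interface is the C++ `AllCandidates` above), and the 'dataflow graph' is reduced to a list of operator names indexed by post-dfs index:

```python
class PartitionRule:
    """Base interface: produce (possibly overlapping) candidate index-sets."""

    def all_candidates(self, dataflow_graph):
        raise NotImplementedError

class SingletonOpRule(PartitionRule):
    """'Base' rule: one candidate per node whose operator name is in `ops`."""

    def __init__(self, ops):
        self.ops = ops

    def all_candidates(self, dataflow_graph):
        return [frozenset({i}) for i, op in enumerate(dataflow_graph)
                if op in self.ops]

class UnionRule(PartitionRule):
    """'Combinator' rule: union of the sub-rules' candidates."""

    def __init__(self, *sub_rules):
        self.sub_rules = sub_rules

    def all_candidates(self, dataflow_graph):
        result = []
        for rule in self.sub_rules:
            result.extend(rule.all_candidates(dataflow_graph))
        return result
```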
+We currently have three distinct flavors of partitions:
+
+- For pattern-based BYOC integrations, individual `DFPattern`s are used to
select the `"Composite"` functions to
+ offload, and those are grouped into a `"Primitive"` Relay function with a
`"Compiler"` attribute.
+- For operator-based BYOC integrations, per-operator predicates indicate
operators to offload, and again those are
+ grouped into a `"Primitive"` Relay function with a `"Compiler"` attribute.
+  TODO(mbs): Consider removing predicate-based BYOC integrations once
+  TensorRT has been transitioned to be pattern based.
+- For TVM, obviously all of Relay can go into a single partition, however for
search efficiency the partitions should
+ roughly mimic the Relay `FuseOps`. That pass uses the `"TOpPattern"` (of
type `OpPatternKind`) attribute on all Relay
+ operators, and rules for when operators of one kind can be folded into
another (typically by moving scalar ops from
+ elementwise operators into the output position of an earlier operator). This
is implemented as a
+ stand-alone analysis which encodes its result using `"Primitive"` functions.
+
+Two new flavors are also showing up:
+
+- For easy external library integration we would like to borrow the
`DFPattern`-with-composite-functions approach from
+ pattern-based BYOC integrations. But we'd like to leave those composite
functions outside of any `"Primitive"`
+ function so that the library calls could end up within larger TVM kernels.
+- `FuseOps` is generally considered too inflexible, and we've sought a more
flexible way to express target-dependent
+ fusion rules.
+
+So in the same way `DFPattern`s provide a small library of 'base' and
'combinator' pattern rules supporting a wide
+variety of examples, we seek the same economy and flexibility from
`PartitionRule`s.
+
+An obvious question is whether all partition rules should be expressed with
`DFPattern`s, possibly by extending
+the `DFPattern` library itself. Indeed, though it does not appear to be used
in prod, the `DominatorPattern` is
+an attempt to use `DFPattern`s to subsume the existing `FuseOps` machinery. We
actually went down this path but
+decided to back out:
+- We'd need a new pattern combinator to associate predicates with sub-patterns.
+- Since we are interested in searching over possibly overlapping candidate
partitions we'd need the `DFPattern`
+  machinery to allow enumeration over all matching sub-expressions. That would
require a rewrite of the
+ `DFPatternMatcher`.
+- Some of the more subtle fusion rules are difficult to express as patterns.
+- `DFPattern`s are widely used outside of just partitioning, so any change
would need to ensure no efficiency
+ or cognitive overhead for those common cases.
+
+That pushed us to the present design, which builds on `DFPatterns`, but
introduces new 'base' and 'combinator'
+partition rules which can be combined to match the desired partition flavors:
+- The 'base' rules produce candidates from the dataflow graph directly. Eg we
have a base rule to produce
+ all sub-graphs matching a given `DFPattern`.
+- The 'combinator' rules combine the candidates found by one or more sub-rules
into a new set of
+  candidates. The sub-rule(s) can be 'base' or 'combinator' rules. We call the
candidates produced by
+ a sub-rule 'sub-candidates'. Eg we have a combinator rule which wraps all
sub-candidates in a
+ `"Composite"` function (when the overall expression is rewritten).
+
+The following illustrates some base and combinator patterns on the earlier
mini MNIST dataflow graph:
+
+
+
+The base rules are:
+
+- `DFPatternPartitionRule`: Given a `DFPattern` and expression predicate,
produces a candidate for every sub-graph
+ matched by the pattern and predicate. Unlike the `PatternRewriter`,
candidates are free to overlap. Mainly used
+ to bring BYOC patterns into the Collage framework.
+- `OpPredicatePartitionRule`: Given an attribute name, produces a candidate
for every call to a primitive Relay
+  operator where the operator i) has a predicate bound to that attribute which
ii) returns true given the
+ call sub-expression. Generally this will result in a singleton sub-graph
containing only the call, but it may also
+ pull in constant arguments to the call should they be required. Used to
bring BYOC operator predicates into the
+ Collage framework.
+  TODO(mbs): Consider removing predicate-based BYOC integrations once
+  TensorRT has been transitioned to be pattern based.
+- `OpCallByKindPartitionRule`: Uses the `"TOpPattern"` attribute provided for
every Relay operator to produce a
+ candidate for every call to a 'fusable Relay operator'. Used to select the
operators which `FuseOps` will consider
+ parts of kernels.
+
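The key difference from the `PatternRewriter` is that every match is enumerated, overlaps included. A toy stand-in (hypothetical function; real rules match `DFPattern`s over the dataflow graph, not name sequences over a list) shows the idea on a linear chain of operators:

```python
def all_chain_matches(ops, pattern):
    """Enumerate every (possibly overlapping) window of `ops` whose operator
    names match `pattern`, as frozensets of post-dfs indexes — a stand-in for
    enumerating all DFPattern matches rather than rewriting the first found."""
    k = len(pattern)
    return [frozenset(range(i, i + k))
            for i in range(len(ops) - k + 1)
            if ops[i:i + k] == pattern]
```

For `ops = ["add", "add", "add"]` and pattern `["add", "add"]` this yields the overlapping candidates `{0, 1}` and `{1, 2}`; the searcher, not the rule, decides which survive.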
+The combinator rules are:
+
+- `CompositePartitionRule`: Indicates all sub-candidates matched by the
sub-rule should be wrapped by a `"Composite"`
+ function. The `"Composite"` name is taken from the rule name. Used to
indicate Relay operators (or groups of Relay
+ operators) should be mapped to target-specific operators, both for BYOC and
TVM external library integrations.
+- `PrimitivePartitionRule`: Indicates all sub-candidates matched by the
sub-rule should be wrapped by a `"Primitive"`
+ function, possibly with an additional `"Compiler"` attribute. Used to
delineate a partition (or kernel).
+- `UnionPartitionRule`: Simply unions all the sub-candidates from all
sub-rules together. Used to combine
+ individual `DFPatternPartitionRules`.
+- `CombinePartitionRule`: Given a sub-rule and a list of 'combiner' rules (see
below), finds all possible ways of
+ combining the sub-candidates to yield even larger candidates. Note that the
sub-candidates may also be directly
+ included in the results. The 'combiner' rules allow combining by
`OpPatternKinds`, combining the arguments to
+ tuples which themselves are arguments to Relay operator calls, and so on.
This rule is intended to mimic the
+ existing TVM `FuseOps` pass, though: i) all candidates are found rather than
just the largest, ii) the starting
+ set of candidates can be provided by any other rule, and iii) we rely on
`SubGraph` validity checking to weed out
+ infeasible candidates.
+- `OnlyValidPartitionRule`: Given a `SubGraphConfig`, ignores candidates with
'invalid' sub-graphs. Used to limit the
+ maximum candidate depth, the number of independent outputs, and whether
intermediate 'taps' are allowed.
+- `HostPartitionRule`: Produces candidates for all Relay expressions which
could be
+ 'left behind' for execution by the host (eg on the VM). This rule lets us
move special case handling out of the
+ core search algorithm and into a simple rule.
+
+Here are some typical ways to combine `PartitionRules` for different partition
flavors. (These combinations
+may be generated during phase 1 by inspection of the `Target` and BYOC
registration -- see 'Phase 1' below.)
+
+- Classic operator-predicate based BYOC with
+ `AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes (eg see
`tensorrt.py`):
+ ```
+ PrimitivePartitionRule
+ OnlyValidPartitionRule
+ CombinePartitionRule (with a join-anything combiner rule)
+ OpPredicatePartitionRule
+ ```
+  TODO(mbs): Consider removing predicate-based BYOC integrations once
+  TensorRT has been transitioned to be pattern based.
+
+- Classic pattern-based BYOC with
`MergeComposite`/`AnnotateTarget`/`PartitionGraph` passes
+  (eg see `cutlass.py`):
+ ```
+ PrimitivePartitionRule
+ OnlyValidPartitionRule
+ CombinePartitionRule (with join-anything combiner rule)
+ UnionPartitionRule
+ CompositePartitionRule(label1)
+ DFPatternPartitionRule(pattern1)
+ :
+ CompositePartitionRule(labeln)
+ DFPatternPartitionRule(patternn)
+ ```
+
+ The `CompositePartitionRule`/`DFPatternPartitionRule` combination is
repeated for each entry in the pattern table for
+ the BYOC toolchain name, eg:
+ ```
+  CompositePartitionRule(
+    rule_name="cutlass.conv2d_bias_residual_multiply_relu"
+    sub_rule=DFPatternPartitionRule(
+      pattern=CallPatternNode(Op(nn.relu),
+        [AltPattern(
+           CallPatternNode(Op(multiply),
+             [CallPatternNode(AltPattern(Op(add) | Op(nn.bias_add)),
+                [CallPatternNode(Op(nn.conv2d), [*, *]), *]),
+              *]) |
+           CallPatternNode(Op(multiply),
+             [*,
+              CallPatternNode(AltPattern(Op(add) | Op(nn.bias_add)),
+                [CallPatternNode(Op(nn.conv2d), [*, *]), *])]))])))
+ ```
+
+- "Consider this library implementation for these sub-expressions", using
`DFPatterns` to pick out which Relay operators
+ are supported (a new scheme):
+ ```
+ OnlyValidPartitionRule
+ CombinePartitionRule (with default TVM combiner rules)
+ UnionPartitionRule
+ OpCallByKindPartitionRule
+        CompositePartitionRule(label1)
+ DFPatternPartitionRule(pattern1)
+ :
+        CompositePartitionRule(labeln)
+ DFPatternPartitionRule(patternn)
+ ```
+
+- Classic TVM `FuseOps`:
+ ```
+ PrimitivePartitionRule
+ OnlyValidPartitionRule
+ CombinePartitionRule (with default TVM combiner rules)
+ OpCallByKindPartitionRule
+ ```
+
+- "Just fuse what I tell you to fuse", using `DFPatterns` to directly select
candidates (a new scheme):
+ ```
+ PrimitivePartitionRule
+ OnlyValidPartitionRule
+ UnionPartitionRule
+ DFPatternPartitionRule(pattern1)
+ :
+ DFPatternPartitionRule(patternn)
+ ```
+
+### PartitionSpec
+
+A `PartitionSpec` pairs a `PartitionRule` with one or more `Target`s.
+
+### Phase 1
+
+We build on the existing TVM support for heterogeneous devices and targets.
The available `Targets` are extracted from
+the compilation configuration (eg using the existing `CompilationConfig`
helper class). Each target is inspected to
+decide on how to construct a `PartitionRule`, which will guide Collage in the
selection of candidate kernels to explore
+for that target. (See Appendix G for the requirements which motivated this
part of the design.)
+
+- If the `Target` has a `"partition_rule"` attribute, use that directly. This
would allow users to directly control
+  partitioning/fusion for the targets they care about.
+- If the `Target` has a `"compiler"` attribute (eg `"cutlass"`), and the
global pattern table has an entry for that
+ attribute value, assume the `Target` denotes a pattern-based BYOC
integration to explore. The `PartitionRule`
+ will import all the BYOC patterns and predicates automatically.
+- As above, but if the global pattern table has no matching entry, assume the `Target` denotes a predicate-based BYOC
+  integration to explore (eg `"tensorrt"`). The `PartitionRule` will look for and evaluate predicates with the
+  `"target.<compiler>"` attribute on all Relay operators.
+- Otherwise, assume the `Target` denotes a TVM-native target. The
`PartitionRule` mimics `FuseOps`, but now generalized
+ to explore multiple candidates so as to leave room for possible BYOC
candidates.
+
+Note that to make this approach work we need to allow for multiple `Target`s
with the same `DLDeviceKind`. For the VM
+simply switching the `target` argument from dictionary to list and removing
some redundant Python preprocessing code was
+all that was required to support this.
+
+The user can use `on_device` annotations to constrain sub-graphs to particular
devices. When Collage is considering
+candidate partitions, it should be sure to choose a candidate `Target` which
'refines' the `Target` for every
sub-expression discovered by the `PlanDevicesPass`. Given targets T and U we say 'T refines U' if T has
+`"compiler"` and/or `"partition_rule"` attributes, U has no such attributes, and T and U otherwise agree on all other
+fields.
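To make the relation concrete, here is a minimal sketch. Targets are modelled as plain attribute dictionaries rather than TVM `Target` objects, and the helper name is illustrative, not part of the design:

```python
def refines(t, u):
    """Sketch of 'T refines U': T adds "compiler" and/or "partition_rule"
    attributes that U lacks, and otherwise agrees with U on all fields.
    Targets are modelled here as plain dicts (illustrative only)."""
    special = {"compiler", "partition_rule"}
    if not (special & t.keys()):
        return False  # T must have at least one of the special attributes
    if special & u.keys():
        return False  # U must have no such attributes
    strip = lambda d: {k: v for k, v in d.items() if k not in special}
    return strip(t) == strip(u)
```

For example, `refines({"kind": "cuda", "compiler": "tensorrt"}, {"kind": "cuda"})` holds, while two identical plain targets do not refine each other.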
+
+### Phase 2
+
+Most of the hard work for this phase is carried by the `AllCandidates`
implementations of the `PartitionRule`s. The main
+driver simply needs to index all the found `CandidatePartitions` by their
minimum 'inside' `PostDfsIndex`
+for rapid retrieval during the shortest path search.
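The indexing step can be sketched as follows, with candidates modelled as sets of `PostDfsIndex`es (names illustrative):

```python
from collections import defaultdict

def index_by_min_inside(candidates):
    """Group each candidate by the minimum PostDfsIndex of its 'inside'
    nodes, so the search can quickly fetch every candidate able to cover
    the first yet-to-be-covered dataflow node (sketch only)."""
    index = defaultdict(list)
    for cand in candidates:
        index[min(cand)].append(cand)
    return index
```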
+
+### Phase 3
+
+We find it most natural to use Dijkstra to find the optimal partitioning. A
`SearchState` is a node in the search
+graph, and contains:
+
+- An `IndexSet` of the dataflow nodes already 'covered' by candidates on every
path to this state. This is the
+ identifying key for the state.
+- The predecessor `SearchState` in the best path to this state.
+- The `Cost` of the best path to this state. This is the order for the
Dijkstra priority queue.
+- The `CandidatePartition` for the transition from the best predecessor to
this state.
+
+The starting state has no covered nodes. The final state has all nodes
covered. The following is an example search
+graph fragment for the mini MNIST example:
+![search_graph](assets/0062/search_graph.png)
+
+
+
+When expanding a state we could choose any `CandidatePartition` collected from
phase 2 provided it doesn't overlap with
+the state's covered set. However, a search path applying candidates C then D
is equivalent to one applying D then C, so
+we only consider candidates which intersect the next yet-to-be-covered
dataflow node. For each such candidate we use
+the `CostEstimator` (with its assumed cache) to get the candidate's cost,
build the successor state, and 'relax' the
+successor state in the usual way. (See Appendix E for more details on
`CostEstimator`.)
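The expansion and relaxation steps above can be sketched in Python. Candidates are modelled as frozensets of dataflow node indices and costs come from a plain function; all names are illustrative, not the prototype's API:

```python
import heapq

def best_partitioning(num_nodes, candidates, cost_of):
    """Dijkstra over 'covered set' states: a state is the set of dataflow
    nodes already covered; edges apply a non-overlapping candidate that
    includes the next uncovered node (sketch only)."""
    all_nodes = frozenset(range(num_nodes))
    start, goal = frozenset(), all_nodes
    best = {start: (0.0, None, None)}  # state -> (cost, predecessor, candidate)
    queue, tick = [(0.0, 0, start)], 1  # tick breaks ties between equal costs
    while queue:
        cost, _, state = heapq.heappop(queue)
        if cost > best[state][0]:
            continue  # stale queue entry
        if state == goal:
            break
        next_node = min(all_nodes - state)  # canonical expansion order
        for cand in candidates:
            if next_node in cand and not (cand & state):
                succ = state | cand
                new_cost = cost + cost_of(cand)
                if succ not in best or new_cost < best[succ][0]:
                    best[succ] = (new_cost, state, cand)
                    heapq.heappush(queue, (new_cost, tick, succ))
                    tick += 1
    # Walk the best path backwards from the goal state.
    path, state = [], goal
    while state != start:
        _, pred, cand = best[state]
        path.append(cand)
        state = pred
    return best[goal][0], list(reversed(path))
```

On a toy three-node graph this prefers a fused `{0, 1}` candidate of cost 3 plus `{2}` of cost 1 over three singleton candidates totalling 5.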
+
+The `HostPartitionRule` is used to allow some dataflow nodes to be 'left
behind' for execution by the host.
+
+The result at this stage is an `Array<CandidatePartition>`, which can be
materialized and restored using the standard
+TVM object graph machinery if desired. An example least-cost path for the mini
MNIST example could be the following:
+
+![optimal_placement](assets/0062/optimal_placement.png)
+
+
+### Phase 4
+
+The overall Relay expression is partitioned over all the `CandidatePartition`s
on the lowest cost path 'in parallel'.
+Since all the candidates are expressed using `SubGraph`s w.r.t. the original
dataflow graph, we must be careful not to
+invalidate yet-to-be-partitioned candidates as we go. Working backwards in
dataflow order avoids this problem.
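The 'work backwards' trick can be illustrated on a toy flattened dataflow; the helpers below are hypothetical, using contiguous index spans for simplicity:

```python
def fuse_span(nodes, part):
    """Collapse the (contiguous, for simplicity) partition into one node."""
    lo, hi = min(part), max(part)
    return nodes[:lo] + [tuple(nodes[lo:hi + 1])] + nodes[hi + 1:]

def apply_partitions(nodes, partitions):
    """Rewrite the highest-indexed partition first so the indices of the
    remaining (earlier) partitions stay valid (sketch only)."""
    for part in sorted(partitions, key=min, reverse=True):
        nodes = fuse_span(nodes, part)
    return nodes
```

Here `apply_partitions(["a", "b", "c", "d"], [{0, 1}, {2, 3}])` yields `[("a", "b"), ("c", "d")]`; rewriting `{0, 1}` first would leave `{2, 3}` pointing at stale positions.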
+
+# Drawbacks
+
+- **Some BYOC boilerplate changes required**: TVM's current BYOC integration
API only requires the 'lowering/codegen'
+ function to be registered to a well-known global function name. Everything
else is up to the BYOC author.
+ - Collage requires pattern-based BYOC integrations to register their
patterns in the global pattern table. Some BYOC
integrations use the table but many do not; it's an easy fix.
+ - Collage requires the BYOC lowering function to yield a valid
`runtime::Module` without requiring any additional
+ BYOC-specific passes to be run. Some BYOC integrations require the user
to run separate passes to tune and/or
+    compile the partitioned model; those need to be moved into the lowering function itself.
+ - Collage requires the BYOC integration to either correctly test for which
operators are supported in the
+ pattern/operator predicate, or gracefully propagate failure rather than
CHECK-fail if an unsupported operator is
+ included in a candidate kernel. Thus a BYOC integration will need to be
'robustified' to become 'Collage
+ compatible'.
+
+ Overall we've tried to make as few changes as possible. Collage will happily
follow along with any improvements to the
+ BYOC integration API (eg via the UMA project).
+
+- **Non-compositional BYOC toolchains**: BYOC partitioning functions often run
global passes to get the Relay graph into
+ a state better aligned with the toolchain on the assumption they are the
exclusive partitioning pass. Most obvious is
+ the choice of layout, and if two BYOC integrations have a different choice
of layout then there's currently no way for
+ them to be used concurrently. All of those passes must either be i) pushed
up to global configuration (which could be
+ explored by a search layer outside of TVM), ii) pushed into the BYOC
lowering/codegen function (to prepare the
+ sub-graph for further compilation) or iii) moved into the standard Relay
optimization passes run before
+ `CollagePartitioner`.
+
+- **Higher tuning cost**: Obviously Collage needs to estimate the latency of
partitions. For TVM this can trigger
+  tuning of schedules for novel kernels, which can require O(thousands) of
trials and take O(hours), so we'll be very
+ dependent on cached tuning logs to amortize this cost between models for the
same target.
+
+- **Task extraction vs Tuning**: Traditionally TVM has had three phases: i)
Task extraction (find the fused sub-graphs
+ to tune), ii) Tuning (find a good schedule for those sub-graphs), and iii)
Compilation (re-compile the model, now
+ retrieving schedules for all the anticipated sub-graphs from the cache.)
However the Collage 'v2' prototype collapses
+ all these phases. This lets us lazily explore the implied search graph
(nodes = partially rewritten models, edges =
+  selection of sub-graph and toolchain as a candidate partition, cost =
estimated sum of partition costs plus transition
+ penalties), and thus only pay the cost of measuring (and tuning) candidates
which could possibly influence the final
+ partitioning.
+
+- **No non-local optimization**: Though Collage can explore the choice of
sub-graph and toolchain, it cannot explore any
+ choices which require the arguments and/or result of the sub-graph to be
rewritten, or the overall `IRModule` to be
+ changed. Thus Collage **cannot** be used to search over:
+ - Choice of layout for arguments/results (may require insertion of layout
transforms),
+ - Choice of memory scope for arguments/results (may require insertion of
device copies),
+ - Choice of device on which to host the kernel (ditto),
+  - Choice of quantization scheme.
+
+ To support this efficiently we'd need to abandon the simple-minded but fast
`SubGraph` representation we describe here
+ in favor of something like an EGraph representation, which seems like a very
large change for TVM.
+
+- **Dependency management**: Currently BYOC integrations tend to assume they
are the only non-TVM toolchain in use. So
+ it's possible two toolchains introduce runtime dependencies which can't be
satisfied. Collage has no notion of
+  dependencies or incompatibilities and may attempt to mix candidate kernels we
can't support in prod. It's also possible
+ for two BYOC integrations to have incompatible runtimes.
+
+- **Additive cost assumption**: Collage as per this design assumes the cost of
running candidate partitions is additive,
+ plus a small transition penalty. However cache effects can dominate measured
latency, particularly for
+  'lightweight' kernels. Thus there may be an **additive error** in the final
result:
+
+ > additive_error = measured_latency(collage_partitioning) - sum_{partition}
(estimated_latency(partition) + penalty)
+
+ The evolutionary search explored by the paper can help here since it uses
measured end-to-end model latency as its
+ cost function, but we're deferring that to future work.
+
+- **Limited search space**: Naively exploring all sub-graphs is O(n!), so we
need to constrain the search. The easiest
+ approach is just to limit candidates to sub-graphs of just a few operators.
This can mean significantly faster
+ candidates are not explored, yielding a partitioning with high **optimality
loss**:
+
+ > optimality_loss = measured_latency(collage_partitioning) -
measured_latency(true_optimal_partitioning)
+
+ Though the 'true' optimal partitioning may be infeasible to find, the user
may easily discover a high
+ **apparent loss**, eg by comparing the Collage result with a traditional
BYOC partitioning result:
+
+ > apparent_loss = measured_latency(collage_partitioning) -
measured_latency(users_own_partitioning)
+
+- **Fragile toolchains**: Some BYOC toolchains are intended to be stand-alone
compilers in their own right, and have
+ been tuned against common models and include global flags to guide
optimizations such as reducing precision. However
+ Collage will only feed these toolchains smaller sub-graphs, thus making the
limited search space problem more severe.
+
+- **High variance in lightweight kernels**: Small kernels can have high
variance, thus the choice of which toolchain to
+ use can be arbitrary. We probably want to i) validate our variance estimator
is accurate, ii) choose a percentile
+ slightly above 50% for the estimated candidate kernel latency, and iii) fall
back to hard-coded priorities when the
+ measured variance is too high.
+
+- **Explainability**: It's easy to show the user the final partitioning and
estimated times for each kernel, but harder
+ to show why that partitioning won out over all others during search.
+
+- **Does not subsume `partition_for_<toolchain>`**: We don't have any plan to
deprecate the existing patterns of each
+  BYOC integration supplying a `partition_for_<toolchain>` function. If the
user has a specific toolchain in mind then
+ making the partition explicit enjoys both faster compilation and can
incorporate global optimization passes which
+ Collage cannot currently account for (eg enforcing a particular layout).
+
+# Prior art
+
+- The [Cascading
Scheduler](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0037-arm-ethosu-cascading-scheduler.md)
+ combines i) dynamic-programming to find an optimal grouping of TE
sub-expressions, ii) an analytic model of cost to
+ guide the search, and iii) cascading scheduling of the TE sub-expressions so
as to reduce memory high-watermark. By
+ contrast Collage i) also uses dynamic-programming, but to find an optimal
grouping of Relay sub-expressions, ii)
+ uses (very much slower!) measurement to guide the search and iii) has no
influence over how either TVM or BYOC
+ toolchains actually lower the sub-graphs given to them.
+- The [Universal modular Accelerator
Interface](https://github.com/apache/tvm-rfcs/pull/60) proposal adds a layer on
top
+ of the existing and separate TVM BYOC, operator strategy, operator
scheduling, target-specific passes and
+ target-specific code generation extension points. Collage currently relies
only on the global pattern registry and
+ global `relay.ext.<toolchain>` function to integrate with BYOC integrations,
but this is easy to rework should UMA
+ change the source of truth.
+
+# Appendix A: Differences between the paper's implementation and this proposal
+
+The results of the paper were derived in a
[branch](https://github.com/cmu-catalyst/collage) from
+[TVM](https://github.com/apache/tvm) at
`461d06eb5cfc7954f1983779acd05c47cea269f1`. We ported/rebased that code onto
+main, and refer to it as the
+['v1' prototype
implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-port).
+
+The 'v1' prototype has nine main parts:
+
+1. A new
+
[backend](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/include/tvm/ir/expr.h#L208)
+ field on every Relay `Expr` to capture the pattern name and backend name
chosen during search. Downstream compilation
+ must be forced to respect that choice.
+2. An
+
[intercept](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/src/relay/transforms/fuse_ops.cc#L1392)
+ in `fuse_ops.cc` which redirects to the main Collage fuser/searcher before
TVM’s fusion rules kick in.
+3. The main fuser/searcher
+
[implementation](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/python/collage/optimizer/comp_graph_optimizer.py#L221)
+ (for the simpler DP algorithm). This implementation:
+4. Uses both Relay `Pattern` s and its own path-based fusion algorithm to find
candidate sub-graphs.
+5. Uses the DP algorithm to find the best assignment of fused sub-graphs and
backends to cover the whole Relay graph.
+6. Applies the resulting assignment to the `IRModule` using the new `backend`
field on every expression.
+7. An evolutionary search algorithm may optionally run after the above, and
will attempt to replace ‘op’ kernels
+   (which use a library) with ‘graph’ kernels (arbitrary sub-graph), but only
if there’s a unique graph backend enabled.
+8. An intercept
+
([here](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/src/relay/transforms/fuse_ops.cc#L1402)
+ and
+
[here](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/python/collage/optimizer/_optimizer.py#L48))
+ in `fuse_ops.cc` to actually effect the fusion for BYOC backends depending
on the new `backend` field.
+9. An intercept
+
([here](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/src/relay/backend/te_compiler_cache.cc#L284)
+ and
+
[here](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/python/collage/backend/collage_strategies.py#L18))
+ in `te_compiler.cc` to take over the selection of `OpStrategy` based on the
`backend` field.
+
+Note that the 'v1' prototype only supports `IRModules` with a single `"main"`
whose body is in the ‘pure dataflow’ Relay
+subset. That is, only calls, tuples, tuple projections, function variables and
constants are supported.
+
+In comparison to the 'v1' prototype, this design:
+
+1. Shifts the search to occur at the very start of the compilation flow rather
than just before `FuseOps` so that
+ Collage sees the same input model as would an existing
`partition_for_<toolchain>` BYOC function. This change allows
+ us to use the existing BYOC patterns/predicates to guide the selection of
candidate partitions instead of requiring
+ new patterns to be added to support the combination of models and BYOC
toolchains. It also ensures the existing BYOC
+ lowering functions see partitions before any TVM-lowering specific passes
have been applied, such as
+ `qnn::transform::Legalize` and `transform::CombineParallelConv2D`.
+2. Builds on the existing support for heterogeneous `Target`s to represent the
menu of available toolchains to use
+ during search. In particular, we want to allow users to blend `on_device`
annotations (to express preferences for
+ which devices should execute which sub-graphs) with Collage (to find the
best partitions respecting those device
+ preferences).
+3. Uses the existing convention for `"Primitive"`, `"Composite"` and
`"Compiler"` attributes on Relay `Function`s to
+ encode partitioning choices.
+4. Does not treat library integrations (eg for `CuDnn`) differently from
toolchain integrations (eg for `TensorRT`). See
+ Appendix B for a sketch of the issues.
+5. Supports all of Relay.
+6. Is implemented almost entirely in C++.
+
+However:
+
+7. We have only re-implemented the 'op-level' dynamic-programming based search
strategy from the paper. Though the paper
+ reports encouraging results with the 'graph-level' evolutionary-search
strategy we leave that to future work.
+
+# Appendix B: Easier Library Integration
+
+TVM has two very different ways to make external library implementations
available for use as (or in) kernels:
+The pattern-based BYOC approach and the TVM `te.extern` approach.
+
+The pattern-based approach allows library implementations to match with more
than one Relay operator, such as for biased
+convolution with an activation function. For example, for
+[DNNL](https://oneapi-src.github.io/oneDNN/v1.3/index.html) the global pattern
table is extended
+in `python/tvm/relay/op/contrib/dnnl.py`, and the pattern labels encode the
intended corresponding DNNL functions. The
+user is responsible for partitioning using the usual
`MergeComposite`/`AnnotateTarget`/`PartitionGraph`
+sequence (surprisingly there's no `partition_for_dnnl` convenience function).
The `relay.ext.dnnl` BYOC function
+in `src/relay/backend/contrib/dnnl/codegen.cc` looks for calls to
`"Composite"` functions in the overall `"Primitive"`
+function, and dispatches based on the `"Composite"` label. C code is emitted
to target the DNNL library, and the
+standard C compiler helper is invoked to produce a `runtime::Module`.
+
+Note that it is not possible for a TVM-generated kernel to call a library
function integrated this way. In effect every
+library function must go into a library-specific kernel (though kernels may
group calls to multiple library functions).
+
+The `te.extern` approach only allows library implementations which are 1:1
with Relay operators. However the library may
+be used as part of a larger TVM-generated kernel, and the usual TVM tuning
machinery may choose to use the library based
+on overall kernel performance measured during TVM tuning. For example,
`batch_matmul` can be implemented using
+[CuBLAS](https://developer.nvidia.com/cublas) via the strategy `batch_matmul`
in `python/tvm/contrib/cublas.py`, which
+is made available to the operator's `OpStrategy` using
`batch_matmul_strategy_cuda` in
+`python/tvm/relay/op/strategy/cuda.py` when `cublas` appears in the `Target`s
`libs` attribute. That strategy simply
+calls the `PackedFunc` registered as `tvm.contrib.cublas.batch_matmul` and
implemented in
+`src/runtime/contrib/cublas/cublas.cc` as part of the TVM runtime.
+
+The `te.extern` approach also supports integrating 'micro-kernels' which may
be invoked as part of the TVM schedule for
+some larger Relay operator.
+
+Collage as presented can work with either approach. For the pattern-based BYOC
approach Collage doesn't need to know
+what's going on under the BYOC integration hood, it only needs to see a
`Target` with the appropriate
+`compiler` attribute. For the `te.extern` approach Collage similarly doesn't
need to know that the TVM partition may
+result in a kernel whose schedule includes a call to the linked library
provided the `Target` has the appropriate
+`libs` attribute.
+
+However, we'd like to make library integration with Collage as seamless as possible
since we expect it to be the common case.
+The requirements are roughly:
+
+- Support library functions which match sub-graphs as well as single Relay
operators.
+- Allow library calls from within TVM-generated kernels.
+- Avoid the boilerplate needed for full BYOC integrations, but retain the
familiar BYOC pattern-based mechanism.
+- Express the choice of external library in the same way we express other
partitioning choices.
+
+One possibility is:
+
+- Like the `te.extern` approach, libraries can be made available to the TVM
runtime via registered `PackedFunc`s.
+- Like the pattern-based BYOC approach, labelled patterns can be supplied
which indicate how Relay operators could be
+ mapped to registered `PackedFunc`s.
+- Like the BYOC custom lowering approach, a distinguished compiler name
controls when the library is available and
+ causes lowering to go via a different path.
+- But unlike the BYOC custom lowering approach, the rewrite to an external
library call is made available in TE or TIR
+ form so that it can be incorporated into larger TVM kernels.
+
+We'll follow up on this separately, but mention it here since it motivates why
we've tried to handle "Composite"
+patterns fairly generally.
+
+# Appendix C: When to run Collage?
+
+There's a few cross-cutting issues when it comes to when Collage should run in
the compilation flow:
+
+- The current TVM convention is for BYOC integrations to supply a
`partition_for_<toolchain>` function which can be
+ called by the user on the original input model, ie before **any** Relay
passes are run.
+- Many `partition_for_<toolchain>` functions run their own 'preamble' passes
before the standard
+ `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph`
passes. The preamble passes are sometimes
+ just generic Relay passes such as `FoldConstant`. But some partitioning
functions impose specific global rewrites,
+ such as for layout. All the BYOC op patterns and op predicates are thus
written expecting the Relay model in 'vanilla'
+ form with only those preamble passes applied.
+- There's no reason to expect the preamble passes are compositional in any
sensible way between different BYOC
+ integrations.
+- Some BYOC integrations also supply additional passes which are expected to
be run after partitioning and lowering, for
+ example to finish tuning or compilation.
+- The standard Relay pass prefix includes many passes which are either target
dependent (for example to
+ 'legalize' quantized version of ops depending on the intended target), or
which prepare the model for Relay fusion and
+ subsequent lowering. These passes are all run before `FuseOps`. Those passes
are not universally applicable to all
+ BYOC integrations, and BYOC patterns are not guaranteed to be invariant over
them.
+- Relay's default `FuseOps` is currently hard-coded to greedily find the
largest possible kernels using fixed
+ `OpPatternKind` based rules. Those rules are intended to predict exactly
what TVM's scheduling can support. There's
+ interest in bringing customization (eg limiting fusion patterns to directly
match hardware supported primitives,
+ supporting custom 'horizontal' and 'vertical' fusion) and search (to reduce
the strong coupling of fusion with
+ lowering) to `FuseOps`, which all looks very similar to customization and
search Collage brings to partitioning.
+- Finally(!), Collage should obviously explore candidate partitions which both
TVM and the BYOC toolchains can do well
+ on. That encourages large partitions with fusion opportunities. But a naive
search over all O(n!) possibilities is
+ also obviously not feasible. This means Collage should limit its search to
candidates which more or less correspond to
+ the kernels each backend would choose using their own fusion rules. This in
turn requires Collage's partitioning rules
+ to roughly match the backend fusion rules.
+
+This puts us in a bit of a pickle since there's no obvious single point in the
compilation flow for Collage:
+
+1. We could run before any Relay passes at all, just as
`partition_for_<toolchain>` functions are run today. However
+ there's no opportunity to apply any BYOC preamble passes which may be
needed before patterns are used.
+2. We could run just after the BYOC preamble passes. However that's
prematurely committing to a particular BYOC
+ integration, and there's no way to detect when two BYOC integrations have
incompatible preambles.
+3. We could run instead of `FuseOps` to collapse partitioning with fusion.
However, by that stage too many TVM-specific
+ optimizations have been applied for the BYOC integrations to work.
+
+Our compromise is to i) run Collage at the very beginning of compilation (ie
option 1), ii) require the user to manually
+apply global passes which may assist particular BYOC integrations (such as to
choose a particularly favorable layout),
+and iii) leave `FuseOps` unchanged. (Note the first draft of the RFC instead
chose option 3.)
+
+In more detail, here's a taxonomy of pass instances and how we can handle them
in Collage:
+
+- **BYOC global** (eg `ConvertLayout`): These passes are currently run in the
preamble of some of the
+ `partition_for_<toolchain>` functions to apply a global rewrite to improve
efficiency for the intended target.
+ - Under Collage, these passes should be made available as a new
`optimize_for_<toolchain>` function. The user
+ (or some top-level search outside of TVM) can apply this function in the
same way they currently apply
+ `partition_for_<toolchain>`.
+ - Ideally this would also be a well-known UMA extension point.
+- **BYOC partitioning**
(`MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` or a
subset thereof):
+  These passes are currently run after the preamble of the
`partition_for_<toolchain>` function to effect the desired
+ partitioning and composite function labelling.
+ - Under Collage, these passes are subsumed by `CollagePartitioner`.
+ - They will also be called automatically by the UMA 'partitioner'.
+- **BYOC lowering** (eg `Function->runtime::Module` function registered as
`relay.ext.<toolchain>`). These passes invoke
+ the BYOC-specific compilation toolchain.
+ - Under Collage, the standard `TELower` pass will continue to invoke these
functions depending on partitioning
+ annotations. Collage will also need to support the other per-target
compilation overrides.
+- **BYOC post-lowering** (eg `tune_cutlass_kernels`): These follow-on passes
are supplied by some BYOC integrations to
+ further prepare the `runtime::Module` after lowering.
+ - Under Collage, these passes need to be folded back into the generic BYOC
lowering extension point.
+- **Safe global** (eg `FoldConstant`): These passes are run within the
standard Relay pass prefix, but may also be
+ included in the BYOC preamble. However the pass is universally applicable to
all BYOC and TVM toolchains.
+ - Under Collage this 'safe' prefix of passes can be run before
`CollagePartitioner`. If any BYOC predicates/patterns
+ are not invariant to these safe passes then we'll need to generalize
them. Note that currently this pass set is
+ empty.
+- **Target specific** (eg `qnn::transform::Legalize`): These passes are also
within the standard Relay pass prefix. They
+ apply per-operator or other rewrites which may be target-dependent. Clearly
the target must already be known.
+ Technically they should be run after `PlanDevices` to support heterogeneous
execution, but this is not currently the case
+ (and a few are disabled in the heterogeneous case).
+ - Under Collage these passes are run after `PlanDevices` (which may use
`on_device` annotations to enforce some
+ target constraints) and `CollagePartitioner` (which will choose targets
for all partitions subject to any existing
+ constraints). But they are only run on non-BYOC partitions, ie on
everything other than `"Primitive"`
+ functions with a `"Compiler"` attribute.
+- **Lowering specific** (eg `CanonicalizeOps`): These passes apply
optimizations preparing for `FuseOps` and subsequent
+ TVM lowering. They are also within the standard Relay pass prefix.
+ - Under Collage, same as for the target specific passes above.
+
+# Appendix D: TODO items in the 'v2' prototype before proceeding
+
+- Implement the penalty for transitioning execution between partitions.
+- Bring in `cudnn` and `cublas` to support measurement (ideally as Appendix B,
but probably as heavyweight BYOC
+ integration pending better support).
+- Support RFC10-style integrations in the partitioning rules / sub-graph
realization.
+- Implement TVM-tuning during Collage search.
+- Cross-check measurements against some of the 'v1' models.
+- Bring up on `GPT2` and other 'larger' models.
+- Measure additive error.
+- Measure apparent loss with `partition_for_tensorrt`.
+- Explore `float16` performance mixing `CUTLASS` and `TensorRT`.
+- Find model+target combination that shows compelling speedup from mixing
w.r.t. all other options, including stand-alone `TensorRT`.
+- If needed, implement 'lookahead' from the current search state to find the
'next' dataflow node(s) which have
+ candidates crossing multiple `PartitionSpec`s. That defines a sub-graph.
There's no need to search over all possible
+ candidates within that sub-graph since almost certainly the maximal
candidates will be best. Somehow prune the
+ candidates to implement that.
+- If needed, build indexes in `CombinePartitionRule` to avoid O(n^2) iteration
over candidates.
+- Reconcile with the 'RFC10' style BYOC extension methods -- we should support
them all but for simplicity have just
+ focussed on the traditional "Compiler" annotation.
+
+# Appendix E: Robust candidate kernel latency measurement
+
+Collage requires an implementation of a `CostEstimator`:
+
+```
+class CostEstimator {
+ public:
+ /*!
+ * \brief Returns the estimated cost (possibly after many many minutes of
training time) of
+ * running "main" in mod using target, which represents a possible
partitioning of some overall
+ * Relay expression.
+ */
+ virtual Cost Estimate(const IRModule& mod, const Target& target) const;
+};
+```
+
+The 'v2' prototype has implemented this with an in-memory cache and a small
Python driver which defers to
TVM's `tvm.runtime.vm.VirtualMachine`'s `benchmark` helper. The following needs
to be designed and implemented:
+
+- The recent MetaSchedule work has provided `BuilderInput`
(`include/tvm/meta_schedule/builder.h`),
+ `RunnerInput` (`include/tvm/meta_schedule/runner.h`) and `Database`
(`include/tvm/meta_schedule/database.h`)
+ interfaces. The latter is for `TuningRecord`s of `Workload`s. It looks like
these interfaces can support the
+  measurement of Collage `CandidatePartition`s with no or minor changes.
+- Collage converts measured 50th %ile latencies to costs in seconds. We may
+  need to consider taking a slightly higher %ile to be more robust against
+  variance on small kernels. We need to validate that the estimated variance
+  reflects the true variance.
+- For TVM-native targets, we would like the `Estimate` call to perform any TVM
tuning required for a novel candidate
+ kernel.
+- Collage needs an estimate of the cost of transitioning between partitions
(or kernels). Ideally that estimate would be
+ based on measurement rather than hard coded.
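
The caching and %ile-to-cost behaviour above can be sketched in plain Python.
This is only an illustrative model, not the prototype's actual API: names such
as `CachingCostEstimator`, `measure_fn` and `percentile` are hypothetical
stand-ins, and the real driver defers to the `VirtualMachine` `benchmark`
helper for measurement:

```python
# Hypothetical sketch: an in-memory cache over a pluggable measurement
# function, converting a %ile of latency samples into a cost in seconds.
import math


class Cost:
    """A candidate cost in seconds; infinity marks an invalid candidate."""

    def __init__(self, seconds):
        self.seconds = seconds

    @staticmethod
    def invalid():
        return Cost(math.inf)


def percentile(latencies, pct):
    """Nearest-rank pct-th percentile of a non-empty list of latencies."""
    ranked = sorted(latencies)
    rank = max(0, math.ceil(pct / 100.0 * len(ranked)) - 1)
    return ranked[rank]


class CachingCostEstimator:
    """Caches estimates keyed by (module hash, target) to avoid re-measuring."""

    def __init__(self, measure_fn, pct=50):
        self._measure_fn = measure_fn  # returns a list of latency samples
        self._pct = pct  # 50 today; a higher %ile may prove more robust
        self._cache = {}

    def estimate(self, mod_hash, target):
        key = (mod_hash, target)
        if key not in self._cache:
            samples = self._measure_fn(mod_hash, target)
            self._cache[key] = Cost(percentile(samples, self._pct))
        return self._cache[key]
```

A persistent variant of the same cache is what the MetaSchedule `Database`
interface could provide, with `TuningRecord`s playing the role of the cached
entries.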
+
+# Appendix F: Robust BYOC integrations for targets of interest
+
+Overall any BYOC toolchain which could be supported by Collage needs to be
brought to a high standard:
+
+- It should support the latest toolchain/library versions.
+- It should support as much of Relay (both operators and dtypes) as feasible.
In particular, Collage will only find
+ interesting mixes when BYOC toolchains have overlapping operator and dtype
support.
+- It should correctly report which operators/patterns are supported.
+- It should have good unit test coverage in CI.
+- Dependencies should be documented and installation scripted (hopefully this
is an easy consequence of the above).
+- The translation scheme should give the BYOC toolchain the best chance to do
well. In particular, if Collage reports
+ toolchain X 'is better' than toolchain Y for a candidate sub-graph we want
to have confidence that's not just because
+ toolchain Y has been hobbled by a poor translation, API misuse, or other
'holding it wrong' issue.
+- Where feasible, using `partition_for_<toolchain>` (ie using TVM but not
Collage) should not be worse than using the
+ toolchain directly (ie not using TVM at all).
+
+Our current focus is on TensorRT, CUTLASS, cuDNN and cuBLAS.
+
+# Appendix G: Visualization
+
+A [netron](https://netron.app/) style visualization for Relay which clearly
shows the partitioning and cost for all the
+kernels would be very valuable. The paper prototype produces such a
visualization but we've lost that functionality in
+the transition to 'v2'.
+
+# Appendix H: FAQ
+
+- **Are you deprecating `FuseOps`?** No. `FuseOps` will be run along with all
the other Relay passes on the TVM
+ partitions (ie all Relay expressions not partitioned onto a BYOC backend).
+- **Are you deprecating the BYOC `partition_for_<toolchain>` functions?** No.
Collage does not yet have a way to handle
+ any global passes invoked before partitioning in those functions. Those
functions are still the best approach for
+ users who cannot tolerate long search/tuning times.
+- **Can I use Collage for optimizing layout? Device placement? Quantization
strategy?** No. Collage only explores
+ partitionings, and cannot currently explore rewrites. Though Collage could
allow sub-graphs to be rewritten as part of
+ a partitioning choice (eg to insert `device_copy` nodes on inputs and
outputs), there's little utility to doing so
+ since Collage won't be able to measure the effect of those rewrites on the
overall model latency after further
+ downstream passes (eg to collapse unnecessary `device_copy` nodes). These
sorts of global optimization problems can be
+ tackled by additional analysis passes before `CollagePartitioner`.
+- **Won't this increase tuning time?** Yes. Collage will explore significantly
more candidate partitions, and for the
+ TVM backend the resulting kernels will themselves require schedule tuning.
+- **Does this clash with the UMA proposal?** No. Though Collage takes over
partitioning, it can still greatly benefit
+ from the UMA proposal to better organize all the BYOC extension points. Any
changes made by the UMA project should be
+ easy to account for in Collage.
+- **Why replace the existing
`MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/
+ `PartitionGraph` passes?** Those passes don't allow us to represent the
search space over all partitionings.
+- **Why not just build on `DFPattern`s instead of introducing the
+  `PartitionRule` family?** We actually started in that direction but found
+  the complexity overwhelming. We believe it's best to keep `DFPattern`s
+  focused on the simple and common case of deterministically matching
+  specific sub-expressions.
+- **Why should partitioning have anything to do with fusion?** For efficiency
Collage should only explore candidate
+ partitions which roughly match kernel boundaries. This means Collage's
partitioning rules need to roughly predict the
+ fusion rules of each backend, TVM included.
+
+# Appendix I: Representing Collage 'backends'
+
+The paper introduced an explicit representation for 'backends'. In this design
we've chosen to merge this notion back
+into the existing `Target` machinery:
+
+- All the backends we know of are dependent on a family of `Target`s. For
example, `TensorRT` obviously only applies to
+ CUDA targets, `DNNL` only applies to CPU targets, and so on.
+- So it seems most natural to just extend `TargetKind`s with the ability to
specify a particular BYOC toolchain, and
+ allow the user to supply as many `Target`s as needed to cover all the BYOC
toolchains they'd like to include in the
+ Collage search.
+- There are then two subtleties which are easy to handle:
+  - A `Target` which is specific about its BYOC toolchain should be
+    considered a refinement of the same `Target` without any such detail.
+  - The user may supply multiple `Target`s for the same `DLDeviceType`. There
+    are a few places in device planning where the `DLDeviceType` to `Target`
+    mapping needs to choose the least-refined `Target`.
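
The two subtleties can be illustrated with a toy model. Everything here is
hypothetical (a `Target` class with a `compiler` field, `refines`,
`least_refined`); the real implementation extends TVM's existing `TargetKind`
machinery rather than defining new classes:

```python
# Hypothetical sketch: a Target is modeled as a device type plus an optional
# BYOC compiler name; None means 'plain TVM' for that device type.


class Target:
    def __init__(self, device_type, compiler=None):
        self.device_type = device_type  # e.g. "cuda", "cpu"
        self.compiler = compiler  # e.g. "tensorrt", or None for plain TVM

    def refines(self, other):
        """A Target naming a BYOC toolchain refines the same Target without one."""
        return (
            self.device_type == other.device_type
            and other.compiler is None
            and self.compiler is not None
        )


def least_refined(targets, device_type):
    """Device planning maps a DLDeviceType to its least-refined Target."""
    candidates = [t for t in targets if t.device_type == device_type]
    plain = [t for t in candidates if t.compiler is None]
    if plain:
        return plain[0]
    return candidates[0] if candidates else None
```

With this model the user can pass, say, a plain CUDA `Target` alongside a
TensorRT-refined one: Collage searches over both, while device planning keeps
using the plain `Target` for the `DLDeviceType` mapping.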
diff --git a/rfcs/assets/0062/dataflow_graphs_and_sub_graphs.png
b/rfcs/assets/0062/dataflow_graphs_and_sub_graphs.png
new file mode 100644
index 0000000..4af6497
Binary files /dev/null and
b/rfcs/assets/0062/dataflow_graphs_and_sub_graphs.png differ
diff --git a/rfcs/assets/0062/optimal_placement.png
b/rfcs/assets/0062/optimal_placement.png
new file mode 100644
index 0000000..fbacff2
Binary files /dev/null and b/rfcs/assets/0062/optimal_placement.png differ
diff --git a/rfcs/assets/0062/partition_rules.png
b/rfcs/assets/0062/partition_rules.png
new file mode 100644
index 0000000..2efd15c
Binary files /dev/null and b/rfcs/assets/0062/partition_rules.png differ
diff --git a/rfcs/assets/0062/search_graph.png
b/rfcs/assets/0062/search_graph.png
new file mode 100644
index 0000000..50ac85c
Binary files /dev/null and b/rfcs/assets/0062/search_graph.png differ