cgerum commented on a change in pull request #62:
URL: https://github.com/apache/tvm-rfcs/pull/62#discussion_r828055710



##########
File path: rfcs/xxxx-collage.md
##########
@@ -0,0 +1,833 @@
+# Design Doc: Collage [Draft 0.7]
+
+```
+Feature Name: Collage
+Start Date: Mar 2022
+Authors: Mark Shields ([email protected])
+RFC PR: <tbd>
+GitHub Issue: <tbd>
+```
+
+This design doc (with accompanying
+['v2' prototype 
implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch))
+shows how to bring tuning to TVM's operator fusion and BYOC partitioning 
passes. The tuning search explores the choice
+of sub-graphs (aka 'partitions') as well as choice of toolchain (TVM native or 
one of the available BYOC integrations,
+aka 'backends') for each candidate kernel so as to minimize the expected model 
inference latency. We call the result
+an 'optimal partitioning'. This new tuning layer complements the tuning 
traditionally done by TVM and other toolchains
+during lowering. It can also complement any global tuning, for example to 
explore all possible global layouts.
+
+The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):
+
+> *Collage: Automated Integration of Deep Learning Backends*  
+> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia
+
+This tuning approach contrasts with TVM's existing "greedy" and "manual" 
approaches to fusion and BYOC:
+
+- Greedy: Currently only the largest possible supported sub-graphs are used 
for kernels, irrespective of their execution
+  time. With Collage many more candidate sub-graphs are explored, and it is 
possible for two smaller sub-graphs to yield
+  better overall latency than one large sub-graph if they mix toolchains.
+- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke 
the corresponding partitioning function
+  before the main TVM compilation flow proceeds. With Collage the choice of 
toolchain can be automated based on measured
+  latency. Collage will also explore mixing and matching between multiple BYOC 
toolchains as well as TVM's native
+  backend.
+
+The design (when Collage is enabled) replaces TVM's fixed `FuseOps` and 
BYOC-provided `partition_for_<toolchain>`
+operations (built using the 
`MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` 
passes) with a
+single new
+`CollageFuseOps` pass. The pass is carefully engineered to build directly on 
the existing `"TOpPattern"` attributes
+(provided for every Relay operator and used by `FuseOps`), BYOC 
`"target.<toolchain>"`
+operator predicates (provided for some operator/toolchain pairs by 
'operator-based' BYOC integrations) and BYOC operator
+pattern/predicates (registered in the pattern table by 'pattern-based' BYOC 
integrations). In this way only the more
+boilerplate aspects of existing BYOC integrations need to be adjusted to 
support Collage. The
+`partition_for_<toolchain>` operations are retained for users who wish to 
retain manual control.
+
+> NOTE: We'd like to coordinate these changes with the UMA project. Our aim in 
this design is to make the smallest
+> possible changes to BYOC. We think the changes described here can be 
easily reworked to follow any BYOC API
+> proposals settled on by UMA. See also "Related Work."
+
+Collage offers four advantages:
+
+- **Latency**: Overall model latency may be reduced compared to TVM native, 
TVM with a specific BYOC toolchain, or a
+  non-TVM compiler such as TensorRT.
+- **Automation**: The choice of which BYOC toolchains to enable can be 
automated.
+- **Economy of implementation**: Five standalone passes using three separate 
mechanisms for expressing fusion
+  rules/algorithms and implementing partitioning can be replaced with one, 
which itself is built from compositional
+  primitives.
+- **Decoupling**: It is ok for a candidate kernel found during search to 
actually not be valid for a toolchain (even
+  TVM's). Such candidates could be given 'infinite' cost and thus ignored 
during search. In this way we can avoid tight
+  coupling between backends and fusion rules.
+
+## FAQ
+
+Pending.
+
+## Success Metrics
+
+1. Collage offers at least a 10% latency improvement for a selection of 
standard ONNX models and NVIDIA hardware using
+   targets which include the CuDNN and CuBlas libraries, the CUTLASS library 
(with tuning, via BYOC), the TensorRT
+   compiler (via BYOC), and (obviously!) TVM native.
+2. Collage does not require new per-target or per-model patterns or rules to 
be implemented independently of the BYOC
+   integrations.
+3. Collage with just the native TVM and a single BYOC toolchain enabled is 
never worse than using the
+   existing `partition_for_<toolchain>` method in TVM today.
+
+## Project Milestones
+
+- [Done] M0: Port paper prototype to recent TVM main and validate paper 
results.
+- [Done] M1: Internal design doc.
+- [Done] M2: Use 'v2' prototype to test design doc, and rework ready for TVM 
community.
+- [In progress] M3: RFC
+- [2022Q1] M4: Re-validate results on 'v2' prototype for larger models (eg 
GPT2) and more NVIDIA targets.
+- [2022Q2] M5: Implementation in TVM main, including 'sub-projects' listed 
below.
+- [OctoML internal] M6: Estimator integrated into OctoML platform, validation 
against OctoML test suite.
+- [OctoML internal] M7: Productionization for OctoML.
+
+## Check-in plan
+
+Though the 'v2' prototype is in a personal branch we'd like to transition to 
main ASAP and rely on directory/namespace
+separation, maintaining backwards compat, and a new `PassConfig` flag to 
isolate all Collage changes from the rest of
+TVM. A rough PR progression is:
+
+- TensorRT and CUTLASS BYOC changes are backwards compat. The existing 
`partition_for_X` functions remain. The
+  CUTLASS-specific tuning and codegen functions will either continue to be 
supported or we'll work with users to account
+  for them being folded into the function-at-a-time `relay.ext.cutlass`
+  codegen function.
+- The `DFPattern` and friends changes are mostly just for improving 
the robustness of the
+  `IndexedGraph<T>` class and can go into main independently.
+- Some basic `Expr` improvements can go into main independently.
+- The design allows for multiple `Target`s for the same `DLDeviceType`. That 
requires the various
+  `build` interfaces which currently accept `Union[Target,Dict]` to also 
accept a list of `Target`s, and can be
+  backwards compat.
+- The new Collage code can go in bottom-up as we develop unit tests:
+    - Support utils, including `NameSupply`, `IndexSet`, `PriorityQueue`, 
`Cost`, `CostEstimator`.
+    - The core `SubGraph` datatype.
+    - `CandidateKernel`.
+    - The `FusionRule` class hierarchy (which itself can be broken into 
sub-PRs).
+    - `FusionSpec`.
+    - `GatherFusionSpecs` helper for bridging the existing BYOC world with the 
Collage 'FusionRule' world.
+    - The `CollageFuseOps` driver pass itself.
+
+## Related Work
+
+- The [Cascading 
Scheduler](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0037-arm-ethosu-cascading-scheduler.md)
 combines i) dynamic-programming
+  to find an optimal grouping of TE sub-expressions, ii) an analytic model of 
cost to guide the search,
+  and iii) cascading scheduling of the TE sub-expressions so as to reduce 
memory high-watermark. By contrast
+  Collage i) also uses dynamic-programming, but to find an optimal grouping of 
Relay sub-expressions, ii)
+  uses measurement to guide the search, and iii) assumes the toolchain will 
'do its best' with the
+  sub-graph offered to it.
+- The [Universal Modular Accelerator 
Interface](https://github.com/apache/tvm-rfcs/pull/60) proposal
+  adds a layer on top of the existing and separate TVM BYOC, operator 
strategy, operator scheduling,
+  target-specific passes and target-specific code generation extension points. 
Collage currently relies
+  only on the global pattern registry and global `relay.ext.<toolchain>` 
function to integrate with BYOC
+  integrations, but this is trivial to change should this project change the 
source of truth.
+
+## Example
+
+We start with `mod` bound to 
[MNIST](https://github.com/onnx/models/tree/main/vision/classification/mnist):
+
+```
+fn (%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+  %0 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %1 = nn.conv2d(%0, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
+                 padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5]);
+  %2 = add(%1, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
+  %3 = nn.relu(%2);
+  %4 = nn.max_pool2d(%3, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 
0]);
+  %5 = nn.pad(%4, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %6 = nn.conv2d(%5, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), 
float32]*/,
+                 padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
+  %7 = add(%6, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
+  %8 = nn.relu(%7);
+  %9 = nn.max_pool2d(%8, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 
0]);
+  %10 = reshape(%9, newshape=[1, 256]);
+  %11 = nn.dense(%10, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, 
units=None, out_dtype="float32");
+  add(%11, meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
+}
+```
+
+We can compile this with Collage enabled for a variety of NVIDIA 
toolchains/libraries as follows:
+
+```
+with tvm.transform.PassContext(config={"relay.fallback_device_type": 2, 
"relay.collage.enable_collage": True}):
+    host_target = tvm.target.Target("llvm")
+    generic_target = tvm.target.Target("cuda", host_target)
+    cutlass_target = tvm.target.Target("cuda -compiler=cutlass", host_target)
+    tensorrt_target = tvm.target.Target("cuda -compiler=tensorrt", host_target)
+    cudnn_target = tvm.target.Target("cuda -libs=cudnn", host_target)
+    cublas_target = tvm.target.Target("cuda -libs=cublas", host_target)
+    targets = [generic_target, cutlass_target, tensorrt_target, cudnn_target, 
cublas_target]
+    exe = tvm.relay.vm.compile(mod, target=targets)
+```
+
+(Note that `cudnn` and `cublas` are not yet supported in the 'v2' prototype.)
+
+After the `CollageFuseOps` pass, the intermediate `"main"` global function 
could resemble the following (though we've
+modified this "optimal" partitioning by hand to illustrate all the varieties 
of kernels so don't take it as
+representative of actual performance):
+
+```
+fn (%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+  # Use TVM native
+  %3 = fn (%FunctionVar_08: Tensor[(1, 1, 28, 28), float32],
+           Primitive=1) -> Tensor[(1, 1, 32, 32), float32] {
+    nn.pad(%FunctionVar_08, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]])
+  };
+  %4 = %3(%x);
+  # Use TVM native, but indicate we wish to link to CuDNN
+  %6 = fn (%FunctionVar_07: Tensor[(1, 1, 32, 32), float32],
+           Primitive=1) -> Tensor[(1, 8, 28, 28), float32] {
+    %5 = fn (%FunctionVar_5: Tensor[(1, 1, 32, 32), float32],
+             Composite="cudnn.conv2d") -> Tensor[(1, 8, 28, 28), float32] {
+      nn.conv2d(%FunctionVar_5, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), 
float32]*/,
+                padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5])
+    };
+    %5(%FunctionVar_07)  
+  };
+  %7 = %6(%4);
+  # Use TVM native, with fusion
+  %8 = fn (%FunctionVar_06: Tensor[(1, 8, 28, 28), float32],
+           %FunctionVar_12: Tensor[(8, 1, 1), float32],
+           Primitive=1) -> Tensor[(1, 8, 28, 28), float32] {
+    %3 = add(%FunctionVar_06, %FunctionVar_12);
+    nn.relu(%3)
+  };
+  %9 = %8(%7, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
+  # Use TVM native
+  %10 = fn (%FunctionVar_05: Tensor[(1, 8, 28, 28), float32],
+            Primitive=1) -> Tensor[(1, 8, 14, 14), float32] {
+    nn.max_pool2d(%FunctionVar_05, pool_size=[2, 2], strides=[2, 2], 
padding=[0, 0, 0, 0])
+  };
+  %11 = %10(%9);
+  # Use TVM native
+  %12 = fn (%FunctionVar_04: Tensor[(1, 8, 14, 14), float32],
+            Primitive=1) -> Tensor[(1, 8, 18, 18), float32] {
+    nn.pad(%FunctionVar_04, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]])
+  };
+  %13 = %12(%11);
+  # Use TensorRT, with fusion
+  %14 = fn (%FunctionVar_03: Tensor[(1, 8, 18, 18), float32],
+            %FunctionVar_11: Tensor[(16, 1, 1), float32],
+            Primitive=1,
+            Compiler="tensorrt",
+            global_symbol="collage_nn_conv2d_add_nn_relu_1") -> Tensor[(1, 16, 
14, 14), float32] {
+    %1 = nn.conv2d(%FunctionVar_03, meta[relay.Constant][2] /*Tensor[(16, 8, 
5, 5), float32]*/,
+                   padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
+    %2 = add(%1, %FunctionVar_11);
+    nn.relu(%2)
+  };
+  %15 = %14(%13, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
+  # Use TVM native
+  %16 = fn (%FunctionVar_02: Tensor[(1, 16, 14, 14), float32],
+            Primitive=1) -> Tensor[(1, 16, 4, 4), float32] {
+    nn.max_pool2d(%FunctionVar_02, pool_size=[3, 3], strides=[3, 3], 
padding=[0, 0, 0, 0])
+  };
+  %17 = %16(%15);
+  # Use TVM native
+  %18 = fn (%FunctionVar_01: Tensor[(1, 16, 4, 4), float32],
+            Primitive=1) -> Tensor[(1, 256), float32] {
+    reshape(%FunctionVar_01, newshape=[1, 256])
+  };
+  %19 = %18(%17);
+  # Use CUTLASS, with fusion
+  %20 = fn (%FunctionVar_0: Tensor[(1, 256), float32],
+            %FunctionVar_1: Tensor[(10, 256), float32],
+            %FunctionVar_2: Tensor[(1, 10), float32],
+            Primitive=1,
+            Compiler="cutlass",
+            global_symbol="collage_cutlass_dense_bias_nn_dense_add") -> 
Tensor[(1, 10), float32] {
+    %1 = fn (%FunctionVar_01: Tensor[(1, 256), float32],
+             %FunctionVar_11: Tensor[(10, 256), float32],
+             %FunctionVar_21: Tensor[(1, 10), float32],
+             Composite="cutlass.dense_bias") -> Tensor[(1, 10), float32] {
+      %0 = nn.dense(%FunctionVar_01, %FunctionVar_11, units=None, 
out_dtype="float32");
+      add(%0, %FunctionVar_21)
+    };
+    %1(%FunctionVar_0, %FunctionVar_1, %FunctionVar_2)
+  };
+  %20(%19, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, 
meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
+}
+```
+
+## Design
+
+The implementation is mostly under `src/relay/collage/...` (namespace 
`tvm::relay::collage`), with some helper Python
+under `python/tvm/relay/collage`.
+
+If the `relay.collage.enable_collage` `PassConfig` attribute is true then a 
new `CollageFuseOps` pass is inserted before
+the existing `FuseOps` pass. The new pass effects the invariant:
+
+> All Relay sub-graphs in all global functions which are to be lowered to a 
kernel are replaced by calls to an inline
+> `"Primitive"` `Function`. Functions which are to be lowered by a 
BYOC-provided toolchain are given
+> `"Compiler"` and `"global_symbol"` attributes. The bodies of those functions 
may contain calls to inlined
+> `"Composite"` annotated functions to further direct lowering within the 
kernel.
+
+The `CollageFuseOps` pass proceeds in four phases:
+
+- **Phase 1**: The available `Target`s are scanned to build a list of 
`FusionSpec`s. Each `FusionSpec` is built from
+  (a tree of) `FusionRule`s. How the rules are constructed depends on the `Target` 
itself. The remaining phases execute on
+  each global function separately.
+- **Phase 2**: A `DataflowGraph` is constructed for the global function. The 
available `FusionRule`s are evaluated on
+  the dataflow graph to yield a (possibly overlapping) set of 
`CandidateKernels` for each target. Each candidate is
+  described by a `SubGraph` which efficiently denotes a sub-graph of the 
global function's body without the need to
+  construct any new expressions. The candidates are placed in a 
`CandidateKernelIndex` for use below.
+- **Phase 3**: A shortest path is found in the following (implicit) search 
graph:
+    - Search Nodes: An `IndexSet` describing which dataflow nodes have been 
assigned to a candidate kernel so far.
+    - Search Edge X->Y: A `CandidateKernel` can be applied to node X to give 
node Y. The candidate is disjoint from all
+      dataflow nodes already assigned in X. To avoid an unnecessary search 
space explosion the candidate must also
+      include the next yet-to-be-assigned dataflow node in X.
+    - Edge cost: Estimated latency of the candidate kernel, plus a kernel 
launch penalty. Note that though we need to be
+      able to extract the candidate's sub-graph in order to build the kernel, 
we do not yet need to partition the
+      overall function body expression.
+  Other search algorithms are certainly possible, eg the Paper uses an 
evolutionary search to refine
+  the partitioning found by the dynamic-programming search. We can easily 
abstract away the search
+  interface to support multiple implementations in the future.
+- **Phase 4**: The function body is partitioned according to the candidate 
kernels on the shortest path.
+
+In the following we introduce the new datatypes, then expand on the phases.
+
+### Util Datatypes
+
+- `PostDfsIndex`: The integer index of a Relay sub-expression in a post-dfs 
traversal of the overall Relay expression.
+  If index i is less than index j then we know the sub-expression for j cannot 
influence the value of the sub-expression
+  for i.
+- `DataflowGraph`: An alias for the existing `IndexedGraph<Expr>` from the 
`DFPatternMatcher` suite (which in turn is a
+  reworked copy of the `IndexedGraph` private to `fuse_ops.cc`). It is used 
throughout to manage the three-way bijection
+  from Relay `ExprNode`s to `PostDfsIndex`s to
+  `DataflowGraph::Node`s. Each `DataflowGraph::Node` describes the 
sub-expression's dataflow inputs, outputs, dominator
+  and inverse-dominators.
+- `IndexSet`:  A bit vector indexed by `PostDfsIndex`s. These are used as a 
compact representation for an arbitrary set
+  of dataflow nodes in a dataflow graph.
+- `Cost`: A `double` representing a candidate kernel's 'cost', which currently 
is just mean execution latency in
+  seconds. Collage only cares that costs are additive and a total order, so in 
the future we could support cost
+  functions which balance execution time against high memory watermark or 
other measures. Costs may be `Unknown`
+  (ie NaN) to signal some other heuristic should be used to compare kernel 
costs. Costs may be `Invalid` (ie +inf)
+  to signal the toolchain could not compile and run a candidate kernel.
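For illustration, here is a minimal Python sketch of the `Cost` semantics just described (NaN for `Unknown`, +inf for `Invalid`). The class and method names are hypothetical, not the actual C++ API; the point is that IEEE arithmetic makes the additive semantics fall out for free:

```python
import math

class Cost:
    """Sketch of a candidate kernel's cost: a float where NaN means
    'Unknown' (use some other heuristic) and +inf means 'Invalid'
    (the toolchain could not compile/run the candidate)."""

    def __init__(self, value):
        self.value = value

    @staticmethod
    def unknown():
        return Cost(math.nan)

    @staticmethod
    def invalid():
        return Cost(math.inf)

    def is_unknown(self):
        return math.isnan(self.value)

    def is_invalid(self):
        return math.isinf(self.value)

    def __add__(self, other):
        # Costs are additive; Unknown and Invalid propagate naturally
        # through IEEE arithmetic (NaN + x = NaN, inf + x = inf).
        return Cost(self.value + other.value)
```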
+
+### SubGraph
+
+A `SubGraph` is an `IndexSet` of the `PostDfsIndex`s of all dataflow nodes 
'inside' an arbitrary sub-graph of the
+overall dataflow graph. This and `FusionRule` below are the core Collage 
datatypes.
+
+Sub-graphs can be used to represent 'composite' and 'fused' functions without 
having to pay the cost of constructing
+either the function or the rewritten overall 'partitioned' expression which 
calls that function. We also allow functions
+to be extracted independently of partitioning, since we'll need to estimate 
the latency of many more kernel functions
+than will ultimately be used in the final Relay expression. We expect 
O(thousands) of sub-graphs to be in flight while
+processing a given model.
+
+A sub-graph classifies every dataflow node of the overall expression as either 
'inside' or 'outside' the sub-graph.
+Obviously not all such divisions make sense, for example it is not valid for 
an inside node to feed into another inside
+node via outside nodes. We provide the `IsValid` method to check for validity, 
and `SubGraphConfig` to control which
+rules apply (such as maximum depth).
+
+As well as 'inside' and 'outside' we have four other flavors of dataflow nodes 
(all uniquely determined from the
+'inside' nodes):
+
+- 'entry' nodes are those inside with at least one dataflow input outside.
+- 'exit' nodes are those inside with at least one dataflow output outside, or 
which are considered 'external' in the
+  underlying dataflow graph (eg because they represent the result of the 
overall function).
+- 'input' nodes are those outside with at least one dataflow output inside.
+- 'output' nodes are those outside with at least one dataflow input inside.
+
+It is valid to have multiple entry nodes (we'll bind a parameter for each). It 
may be valid to have multiple exit
+nodes (we'll build a tuple of all such). It may be valid to have exit nodes 
which also contribute to other inside
+nodes (ie represent a 'tap' on an intermediate result).
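A minimal Python sketch of how the four flavors fall out of the 'inside' set. The function name and the dict-based graph encoding are hypothetical, not the C++ representation; `external` marks nodes considered 'external' in the underlying dataflow graph (eg the overall function result):

```python
def node_flavors(inputs, outputs, inside, external=frozenset()):
    """Derive the entry/exit/input/output node flavors from an 'inside'
    set. `inputs`/`outputs` map each node index to the set of its
    dataflow input/output node indices."""
    # Inside nodes fed by at least one outside node.
    entry = {n for n in inside if inputs[n] - inside}
    # Inside nodes feeding at least one outside node, or external nodes.
    exit_ = {n for n in inside if (outputs[n] - inside) or n in external}
    outside = set(inputs) - inside
    # Outside nodes feeding at least one inside node.
    input_ = {n for n in outside if outputs[n] & inside}
    # Outside nodes fed by at least one inside node.
    output_ = {n for n in outside if inputs[n] & inside}
    return entry, exit_, input_, output_
```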
+
+Sub-graphs are closed under:
+
+- Disjoint union.
+- Wrapping by a label, which indicates the wrapped sub-graph should be 
extracted as a sub-function with a "Composite"
+  label.
+- Substitution, which allows a sub-graph w.r.t. one dataflow graph to be 
transformed to match some other (typically
+  smaller) dataflow graph.
+
+To support some of the `OpPatternKind`-based fusion rules (see below) we give 
sub-graphs a kind, which is generally the
+maximum of the kinds of all the operator calls appearing inside it. We also 
give sub-graphs a label to help debugging.
+
+Note that the Relay `PatternPartitioner` goes directly from `Expr` to 
partitioned `Expr` without stopping at any
+intermediate representation. It may be worth 'promoting' `SubGraph` out of 
Collage and into the standard `DFPattern`
+suite.
+
+Note that to support closure on both disjoint union and wrapping by a label 
`SubGraph`s are actually recursive -- see
+the 'v2' prototype `sub_graph.cc` for details.
+
+### CandidateKernel
+
+A `CandidateKernel` pairs a `SubGraph` with a `FusionSpec` (from which the 
intended `Target` for the candidate kernel
+can be extracted). All Collage search and measurement is in units of candidate 
kernels.
+
+### FusionRule
+
+A `FusionRule` describes how to find a set of `CandidateKernel`s for a 
`DataflowGraph`. This and `SubGraph` above are
+the core Collage datatypes. All fusion rules implement the method:
+
+```
+virtual Array<CandidateKernel> AllCandidateKernels(const DataflowGraph& 
dataflow_graph,
+                                                   const FusionSpec& spec) 
const;
+```
+
+The candidates are allowed to overlap, and ultimately it is the job of the 
Collage fusion searcher to find a selection
+of candidates which covers the whole Relay expression without overlap.
+
+We provide a set of 'base' fusion rules which produce candidates from the 
dataflow graph directly. We also provide a set
+of 'combinator' rules which can produce new candidates from the results of 
arbitrary sub-rule or sub-rules. In this way
+it is possible to combine the fusion rules to express a wide variety of fusion 
strategies, akin to the way we can
+combine TVM passes.
+
+There may be many thousands of candidates in flight during the fusion search. 
We take care to defer rewriting any Relay
+expressions (eg to extract the fused function, or partition the model) until 
absolutely necessary.
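The base/combinator split above can be sketched in Python as follows. The names are hypothetical renderings of the C++ classes (the real `UnionFusionRule` appears in the diagrams later in this section), but the shape is the same: every rule yields candidates, and combinators build on the candidates of sub-rules:

```python
class FusionRule:
    """Sketch of the rule interface: every rule yields candidate
    kernels for a dataflow graph, mirroring AllCandidateKernels."""

    def all_candidate_kernels(self, dataflow_graph):
        raise NotImplementedError


class UnionFusionRule(FusionRule):
    """Combinator: simply the union of all sub-rules' candidates.
    Candidates may overlap; the search resolves overlaps later."""

    def __init__(self, sub_rules):
        self.sub_rules = sub_rules

    def all_candidate_kernels(self, dataflow_graph):
        candidates = []
        for rule in self.sub_rules:
            candidates.extend(rule.all_candidate_kernels(dataflow_graph))
        return candidates
```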
+
+The base rules implemented so far:
+
+- `DFPatternFusionRule`: Given a `DFPattern` and expression predicate, 
produces a candidate for every sub-graph matched
+  by the pattern and predicate. Unlike the Relay `PatternRewriter`, candidates 
are free to overlap. This is the
+  foundation for pattern-based BYOC integrations, and can be used to write 
targeted fusion rules as well as find
+  examples of 'composite' operators.
+- `OpPredicateFusionRule`: Given an attribute name, produces a candidate for 
every call to a primitive Relay operator
where the operator has a predicate bound to that attribute which returns true 
given the call sub-expression. Generally
+  this will result in a singleton sub-graph containing only the call, but it 
may pull in constant arguments to the call
+  should they be required. This is the foundation for operator-based BYOC 
integrations, though we should consider
+  retiring this mechanism in favor of pattern-based alone.
+- `OpCallByKindFusionRule`: Uses the `"TOpPattern"` attribute provided for 
every Relay operator to produce a candidate
+  for every call to a 'fusable Relay operator'. This can be used as the 
foundation for generic fusion patterns which
+  work over all Relay operators with particular properties (elementwise, 
broadcast, injective, reductive, anchor).
+
+The combinator rules implemented so far:
+
+- `CompositeFusionRule`: 'Tags' the candidates matched by an arbitrary 
sub-rule with the rule name. Tagged sub-graphs
+  are turned into "Primitive" Functions with the "Composite"
+  attribute bound to the tag. This can be used to indicate Relay operators (or 
groups of Relay operators) are to be
+  rewritten to specific target-specific operators. This combinator wraps the 
`DFPatternFusionRules` for the
+  pattern-based BYOC integrations. However it could also be used with the 
default TVM backend, eg to indicate Relay
+  operators should be replaced with particular external library 
implementations.
+- `CombineByPrimitivesFusionRule`: Given a sub-rule and a list of 'primitive' 
rules, finds all possible ways of
+  combining the sub-rule candidates to yield even larger candidates. Note that 
the sub-rule's candidates may also be
+  included in the results -- that is every combination of candidates is 
considered optional. The 'primitive' rules allow
+  combining by
+  `OpPatternKinds`, and combining the arguments to tuples which themselves are 
arguments to Relay operator calls. This
+  rule is intended to mimic the existing TVM `FuseOps` pass, though: i) all 
combinations are found, ii) the starting set
+  of candidates can be provided by any other rule (ie not just 
`OpCallByKindFusionRule`), and iii) we rely on `SubGraph`
+  validity checking to weed out infeasible candidates.
+
+Though not yet implemented, we'd like to allow a combinator rule which will 
union candidates based on their 'anchor'
+operators. This can be used to implement 'vertical' and 'horizontal' fusion on 
more primitive candidates. Note that the
+`SubGraph` machinery supports multiple-input and -output sub-graphs and their 
validation, so horizontal fusion is easy to
+implement.
+
+We also have `MaxCoalesceFusionRule`, which eagerly combines 'touching' 
candidates (ie candidates where the output of
+one sub-graph can be directly connected to the input of the other sub-graph)
+to form the largest possible candidate. The idea is once the search has been 
completed this rule can be used to collapse
+adjacent kernels intended for the same target.
+
+Here are some typical `FusionRule` combinations for different fusion strategies 
(please excuse the crudity of the diagram,
+I didn't have time to build it to scale or paint it):
+
+- Classic TVM `FuseOps`:
+
+```
+      OpCallByKindFusionRule
+                |
+                v
+    CombineByPrimitivesFusionRule (with default TVM primitive rules)
+```
+
+- Classic operator-based BYOC with 
`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes:
+
+```
+      OpPredicateFusionRule
+                |
+                v
+   CombineByPrimitivesFusionRule (with join anything primitive rule)
+```
+
+- Classic pattern-based BYOC with 
`MergeComposite`/`AnnotateTarget`/`PartitionGraph` passes:
+
+```
+     DFPatternFusionRule(pattern1)  ...  DFPatternFusionRule(patternn)
+                |                                    |
+                v                                    v
+      CompositeFusionRule(label1)   ...   CompositeFusionRule(labeln)
+                        \                     /
+                         v                   v
+                            UnionFusionRule
+                                   |
+                                   v
+                     CombineByPrimitivesFusionRule (with join anything 
primitive rule)
+```
+
+- "Just fuse what I tell you to fuse", using `DFPatterns` to directly select 
candidates:
+
+```
+     DFPatternFusionRule(pattern1)  ...  DFPatternFusionRule(patternn)
+                           \                 /
+                            v               v
+                             UnionFusionRule
+```
+
+- "Consider this library implementation for these sub-expressions", using 
`DFPatterns` to pick out which Relay operators
+  are supported (note that TVM lowering does not currently support this):
+
+```
+    OpCallByKindFusionRule     DFPatternFusionRule(pattern1) ... 
DFPatternFusionRule(patternn)
+                    \                       |                                 |
+                     \                      v                                 v
+                      \         CompositeFusionRule(label1)  ...  
CompositeFusionRule(labeln)
+                       \                    |                        /
+                        v                   v                       v
+                                      UnionFusionRule
+                                            |
+                                            v
+                             CombineByPrimitivesFusionRule (with default TVM 
primitive rules)
+```
+
+### FusionSpec
+
+A `FusionSpec` pairs a `FusionRule` with a `Target`.
+
+### Phase 1
+
+We build on the existing TVM support for heterogeneous devices and targets. 
The available `Targets` are extracted from
+the compilation configuration (eg using the existing `CompilationConfig` 
helper class). Each target is inspected to
+decide on how to construct a `FusionSpec`, which will guide Collage in the 
selection of candidate kernels to explore for
+that target.
+
+- If the `Target` has a `"fusion_spec"` attribute, use that directly (not 
currently in the 'v2' prototype). This would
+  allow users to directly control fusion for the targets they care about.
+- If the `Target` has a `"compiler"` attribute (eg `"cutlass"`), and the 
global pattern table has an entry for that
+  attribute value, assume the `Target` denotes a pattern-based BYOC 
integration to explore. The `FusionSpec`
+  will import all the BYOC patterns and predicates automatically.
+- As above, but if global pattern has no matching entry, assume the `Target` 
denotes a predicate-based BYOC integration
+  to explore (eg `"tensorrt"`). The `FusionSpec` will look for and evaluate 
predicates with the
+  `"target.<compiler>"` attribute on all Relay operators.
+- Otherwise, assume the `Target` denotes a TVM-native target. The `FusionSpec` 
mimics the existing `FuseOps`, but now
+  generalized to explore multiple candidates so as to leave room for possible 
BYOC candidates.
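The dispatch above can be sketched as follows. All names here are hypothetical stand-ins for the C++ helpers; `target_attrs` is the `Target`'s attribute dictionary and `pattern_table` the global BYOC pattern table:

```python
def fusion_spec_kind(target_attrs, pattern_table):
    """Decide how a FusionSpec should be built for a Target, mirroring
    the Phase 1 case analysis."""
    if "fusion_spec" in target_attrs:
        # User supplied an explicit spec for this target.
        return "explicit"
    compiler = target_attrs.get("compiler")
    if compiler is not None:
        if compiler in pattern_table:
            # Pattern-based BYOC integration, eg "cutlass".
            return "pattern_based_byoc"
        # Otherwise look for per-operator "target.<compiler>"
        # predicates, eg "tensorrt".
        return "predicate_based_byoc"
    # No BYOC markers: mimic FuseOps for the TVM-native backend.
    return "tvm_native"
```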
+
+Note that to make this approach work we need to allow for multiple `Target`s 
with the same `DLDeviceType`. For the VM
+simply switching the `target` argument from dictionary to list and removing 
some redundant Python preprocessing code was
+all that was required to support this.
+
+The user can use `on_device` annotations to constrain sub-graphs to particular 
devices. When Collage is considering
+candidate kernels, it should be sure to choose a candidate `Target` which 
'refines' the `Target` for every
+sub-expression discovered by the `PlanDevicesPass`. Given targets T and U we 
say 'T refines U' if T has a
+'"compiler"' and/or '"fusion_spec"' attributes, U has no such attributes, and 
T and U otherwise agree on all other
+fields. (This is not currently in the 'v2' prototype).
+
+### Phase 2
+
+Most of the hard work for this phase is carried by the `AllCandidateKernels` 
implementations of the `FusionRule`s. The
+main driver simply needs to index all the found `CandidateKernels` by their 
minimum 'inside' `PostDfsIndex`
+for rapid retrieval during the shortest path search.
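That index can be sketched in a few lines (hypothetical Python; here a candidate is just a pair of an 'inside' node set and a kernel id):

```python
from collections import defaultdict

def build_candidate_index(candidates):
    """Bucket candidates by their minimum 'inside' PostDfsIndex so the
    search can cheaply retrieve candidates touching a given node."""
    index = defaultdict(list)
    for nodes, kernel in candidates:
        index[min(nodes)].append((nodes, kernel))
    return index
```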
+
+### Phase 3
+
+We find it most natural to use Dijkstra to find the optimal partitioning. A 
`SearchState` is:
+
+- An `IndexSet` of the dataflow nodes already 'covered' by candidates on the 
best path to this state. This is the
+  identifying key for the state.
+- The predecessor `SearchState` in the best path to this state.
+- The `Cost` of the best path to this state. This is the order for the 
Dijkstra priority queue.
+- The `CandidateKernel` for the transition from the best predecessor to this 
state.
+
+The starting state has no covered nodes. The final state has all nodes covered.
+
+When expanding a state we could choose any `CandidateKernel` collected from 
phase 2 provided it doesn't overlap with the
+state's covered set. However, a search path applying candidates C then D is 
equivalent to one applying D then C, so we
+only consider candidates which intersect the next yet-to-be-covered dataflow 
node. For each such candidate we use
+the `CostEstimator` (with its assumed cache) to get the candidate's cost, build the successor state, and 'relax' the
+successor state in the usual way.
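
The loop described above might be sketched as follows. This is a pure-Python toy, not the C++ implementation: `frozenset`s stand in for `IndexSet`s, dicts for `CandidateKernel`s, and nodes 'left behind' can be modeled as zero-cost singleton candidates:

```python
import heapq

def optimal_partitioning(num_nodes, candidates_by_first_node, cost_of):
    """Dijkstra over covered-node sets. candidates_by_first_node maps a
    PostDfsIndex to the candidates whose minimum 'inside' index is that
    node; cost_of returns the (cached) estimated cost of a candidate."""
    start = frozenset()
    best = {start: (0.0, None, None)}  # state -> (cost, predecessor, candidate)
    queue = [(0.0, start)]
    while queue:
        cost, covered = heapq.heappop(queue)
        if cost > best[covered][0]:
            continue  # stale queue entry
        if len(covered) == num_nodes:
            # Recover the candidates along the best path.
            path = []
            while best[covered][2] is not None:
                path.append(best[covered][2])
                covered = best[covered][1]
            return cost, list(reversed(path))
        # Only candidates touching the next uncovered node need be tried;
        # any other candidate containing it would overlap covered nodes.
        next_node = min(set(range(num_nodes)) - covered)
        for cand in candidates_by_first_node.get(next_node, []):
            if cand["inside"] & covered:
                continue  # overlaps nodes already assigned
            succ = covered | cand["inside"]
            new_cost = cost + cost_of(cand)
            if succ not in best or new_cost < best[succ][0]:
                best[succ] = (new_cost, covered, cand)
                heapq.heappush(queue, (new_cost, succ))
    return None
```

On a toy graph of three nodes, a single two-node kernel plus a singleton can beat three singletons if its estimated cost is lower than their sum.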
+
+Not all Relay expression nodes need to be assigned to a kernel since the VM or 
other execution provider can happily
+evaluate most Relay expressions except for calls to primitive operators. Thus the search must allow for the possibility
+of an expression node being 'left behind'.
+
+### Phase 4
+
+The overall Relay expression is partitioned over all the `CandidateKernel`s on 
the shortest path 'in parallel'. Since
+all the candidates are expressed using `SubGraph`s w.r.t. the original 
dataflow graph, we must be careful not to
+invalidate yet-to-be-partitioned candidates as we go. Working backwards in 
dataflow order avoids this problem.
+
+Note that all the extracted functions in the result will be marked as 
`"Primitive"`, and thus will be left alone by most
+other Relay passes except `LowerTEPass`. Thus it's fine for `FuseOps` to be run (repeatedly) after
+`CollageFuseOps`.
+
+## Known Limitations
+
+- **Some BYOC boilerplate changes required**: TVM's current BYOC integration 
API only requires the 'lowering/codegen'
+  function to be registered to a well-known global function name. Everything 
else is up to the BYOC author.
+    - Collage requires pattern-based BYOC integrations to register their 
patterns in the global pattern table.
+    - Collage requires the BYOC lowering function to yield a valid 
`runtime::Module` without requiring any additional
+      BYOC-specific passes to be run.
+    - Collage requires the BYOC integration to either correctly test for which 
operators are supported in the
+      pattern/operator predicate, or gracefully propagate failure rather than 
CHECK-fail if an unsupported operator is
+      included in a candidate kernel. Thus a BYOC integration will need to be 
'robustified' to become 'Collage
+      compatible'. Overall we've tried to make as few changes as possible. 
Collage will happily follow along with any
+      improvements to the BYOC integration API (eg via the UMA project).
+- **Higher tuning cost**: Obviously Collage needs to estimate the latency of 
many more candidate kernels, and each
+  candidate may itself trigger tuning during lowering. For TVM this can require O(thousands) of trials and take
+  O(hours), so we'll be very dependent on cached tuning logs to amortize this cost between models for the same target.
+  Currently Collage will measure more candidates even if TVM native is the 
only available target.
+- **Task extraction vs Tuning**: Traditionally TVM has had three phases: i) 
Task extraction (find the fused sub-graphs
+  to tune), ii) Tuning (find a good schedule for those sub-graphs), and iii) 
Compilation (re-compile the model, now
+  retrieving schedules for all the anticipated sub-graphs from the cache.) 
However the Collage 'v2' prototype collapses
+  all these phases. This lets us lazily explore the implied search graph (nodes = partially rewritten models,
+  edges = selection of sub-graph and toolchain as a candidate kernel, cost = estimated sum of kernel costs plus
+  launch penalties), and thus only pay the cost of tuning candidate kernels which could possibly influence the
+  final partitioning.
+- **No non-local optimization**: Though Collage can explore the choice of 
sub-graph and toolchain, it cannot explore any
+  choices which require the arguments and/or result of the sub-graph to be 
rewritten. Thus Collage **cannot** be used to
+  search over:
+    - choice of layout for arguments/results (may require insertion of layout 
transforms),
+    - choice of memory scope for arguments/results (may require insertion of 
device copies),
+    - choice of device on which to host the kernel (ditto),
+  since all those choices can require changes beyond the candidate's sub-graph. In particular, Collage cannot search
+  over the choice of layout for a kernel, since any choice other than the model's default must be 'corrected' for by
+  inserted layout transformations. To support this efficiently we'd need to abandon the simple-minded but fast
+  `SubGraph` representation we describe below in favor of something like an EGraph representation, which seems like a
+  very large change for TVM.
+- **Dependency management**: Currently BYOC integrations tend to assume they 
are the only non-TVM toolchain in use. So
+  it's possible two toolchains introduce runtime dependencies which can't be 
+  satisfied. Collage has no notion of dependencies or incompatibilities and may attempt to mix candidate kernels we
+  can't support in production. It's also possible for two BYOC integrations to have incompatible runtimes.
+- **Additive kernel cost assumption**: Collage as per this design assumes the 
cost of running candidate kernels is
+  additive, plus a small launch penalty. However cache effects can dominate 
measured latency, particularly for 'light'
+  kernels. Thus there may be an **additive error** in the final result:
+
+  > additive_error = measured_latency(collage_partitioning) - sum_{kernel} (estimated_latency(kernel) + penalty)
+
+  The evolutionary search explored by the Collage paper can help here since it 
uses measured end-to-end model latency as
+  its cost function, but we're deferring that to future work.
+
+- **Limited search space**: Naively exploring all sub-graphs is O(n!), so we 
need to constrain the search. The easiest
+  approach is just to limit candidate kernels to sub-graphs of just a few operators. This can mean significantly faster

Review comment:
       This might be very bad on hardware accelerators that use extensive fusion, e.g. EthosU. In these cases it would
probably be preferable to limit the search space by using a maximum number of splits of potentially fusible regions.
Heuristics might be selected as an attribute of `CombineByPrimitivesFusionRule`.

##########
File path: rfcs/xxxx-collage.md
##########
@@ -0,0 +1,833 @@
+# Design Doc: Collage [Draft 0.7]
+
+```
+Feature Name: Collage
+Start Date: Mar 2022
+Authors: Mark Shields ([email protected])
+RFC PR: <tbd>
+GitHub Issue: <tbd>
+```
+
+This design doc (with accompanying
+['v2' prototype 
implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch))
+shows how to bring tuning to TVM's operator fusion and BYOC partitioning 
passes. The tuning search explores the choice
+of sub-graphs (aka 'partitions') as well as choice of toolchain (TVM native or 
one of the available BYOC integrations,
+aka 'backends') for each candidate kernel so as to minimize the expected model 
inference latency. We call the result
+an 'optimal partitioning'. This new tuning layer complements the tuning 
traditionally done by TVM and other toolchains
+during lowering. It can also complement any global tuning, for example to 
explore all possible global layouts.
+
+The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):
+
+> *Collage: Automated Integration of Deep Learning Backends*  
+> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia
+
+This tuning approach contrasts with TVM's existing "greedy" and "manual" 
approaches to fusion and BYOC:
+
+- Greedy: Currently only the largest possible supported sub-graphs are used 
for kernels, irrespective of their execution
+  time. With Collage many more candidate sub-graphs are explored, and it is 
possible for two smaller sub-graphs to yield
+  better overall latency than one large sub-graph if they mix toolchains.
+- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke 
the corresponding partitioning function
+  before the main TVM compilation flow proceeds. With Collage the choice of 
toolchain can be automated based on measured
+  latency. Collage will also explore mixing and matching between multiple BYOC 
toolchains as well as TVM's native
+  backend.
+
+The design (when Collage is enabled) subsumes TVM's fixed `FuseOps` and 
BYOC-provided `partition_for_<toolchain>`
+operations (built using the 
`MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` 
passes) with a
+single new
+`CollageFuseOps` pass. The pass is carefully engineered to build directly on 
the existing `"TOpPattern"` attributes
+(provided for every Relay operator and used by `FuseOps`), BYOC 
`"target.<toolchain>"`
+operator predicates (provided for some operator/toolchain pairs by 
'operator-based' BYOC integrations) and BYOC operator
+pattern/predicates (registered in the pattern table by 'pattern-based' BYOC 
integrations). In this way only the more
+boilerplate aspects of existing BYOC integrations need to be adjusted to 
support Collage. The
+`partition_for_<toolchain>` operations are retained for users who wish to 
retain manual control.
+
+> NOTE: We'd like to coordinate these changes with the UMA project. Our aim in 
this design is to make the smallest
+> changes to BYOC as possible. We think the changes described here can be 
easily reworked to follow any BYOC API
+> proposals settled on by UMA. See also "Related Work."
+
+Collage offers four advantages:
+
+- **Latency**: Overall model latency may be reduced compared to TVM native, 
TVM with a specific BYOC toolchain, or a
+  non-TVM compiler such as TensorRT.
+- **Automation**: The choice of which BYOC toolchains to enable can be 
automated.
+- **Economy of implementation**: Five standalone passes using three separate 
mechanisms for expressing fusion
+  rules/algorithms and implementing partitioning can be replaced with one, 
which itself is built from compositional
+  primitives.
+- **Decoupling**: It is ok for a candidate kernel found during search to 
actually not be valid for a toolchain (even
+  TVM's). Such candidates could be given 'infinite' cost and thus ignored 
during search. In this way we can avoid tight
+  coupling between backends and fusion rules.
+
+## FAQ
+
+Pending.
+
+## Success Metrics
+
+1. Collage offers at least a 10% latency improvement for a selection of 
standard ONNX models and NVIDIA hardware using
+   targets which include the CuDNN and CuBlas libraries, the CUTLASS library 
(with tuning, via BYOC), the TensorRT
+   compiler (via BYOC), and (obviously!) TVM native.
+2. Collage does not require new per-target or per-model patterns or rules to 
be implemented independently of the BYOC
+   integrations.
+3. Collage with just the native TVM backend and a single BYOC toolchain enabled is never worse than using the
+   existing `partition_for_<toolchain>` method in TVM today.
+
+## Project Milestones
+
+- [Done] M0: Port paper prototype to recent TVM main and validate paper 
results.
+- [Done] M1: Internal design doc.
+- [Done] M2: Use 'v2' prototype to test design doc, and rework ready for TVM 
community.
+- [In progress] M3: RFC
+- [2022Q1] M4: Re-validate results on 'v2' prototype for larger models (eg 
GPT2) and more NVIDIA targets.
+- [2022Q2] M5: Implementation in TVM main, including 'sub-projects' listed 
below.
+- [OctoML internal] M6: Estimator integrated into OctoML platform, validation 
against OctoML test suite.
+- [OctoML internal] M7: Productionization for OctoML.
+
+## Check-in plan
+
+Though the 'v2' prototype is in a personal branch we'd like to transition to 
main ASAP and rely on directory/namespace
+separation, maintaining backwards compat, and a new `PassConfig` flag to 
isolate all Collage changes from the rest of
+TVM. A rough PR progression is:
+
+- TensorRT and CUTLASS BYOC changes are backwards compat. The existing 
`partition_for_X` functions remain. The
+  CUTLASS-specific tuning and codegen functions will either continue to be 
supported or we'll work with users to account
+  for them being folded into the function-at-a-time `relay.ext.cutlass`
+  codegen function.
+- The `DFPattern` and friends changes are mostly just for improving the robustness of the
+  `IndexedGraph<T>` class and can go into main independently.
+- Some basic `Expr` improvements can go into main independently.
+- The design allows for multiple `Target`s for the same `DLDeviceType`. That 
requires the various
+  `build` interfaces which currently accept `Union[Target,Dict]` to also 
accept a list of `Target`s, and can be
+  backwards compat.
+- The new Collage code can go in bottom-up as we develop unit tests:
+    - Support utils, including `NameSupply`, `IndexSet`, `PriorityQueue`, 
`Cost`, `CostEstimator`.
+    - The core `SubGraph` datatype.
+    - `CandidateKernel`.
+    - The `FusionRule` class hierarchy (which itself can be broken into sub-PRs).
+    - `FusionSpec`.
+    - `GatherFusionSpecs` helper for bridging the existing BYOC world with the 
Collage 'FusionRule' world.
+    - The `CollageFuseOps` driver pass itself.
+
+## Related Work
+
+- The [Cascading 
Scheduler](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0037-arm-ethosu-cascading-scheduler.md)
 combines i) dynamic-programming
+  to find an optimal grouping of TE sub-expressions, ii) an analytic model of 
cost to guide the search,
+  and iii) cascading scheduling of the TE sub-expressions so as to reduce 
memory high-watermark. By contrast
+  Collage i) also uses dynamic-programming, but to find an optimal grouping of 
Relay sub-expressions, ii)
+  uses measurement to guide the search, and iii) assumes the toolchain will 'do its best' with the
+  sub-graph offered to it.
+- The [Universal modular Accelerator 
Interface](https://github.com/apache/tvm-rfcs/pull/60) proposal
+  adds a layer on top of the existing and separate TVM BYOC, operator 
strategy, operator scheduling,
+  target-specific passes and target-specific code generation extension points. 
Collage currently relies
+  only on the global pattern registry and global `relay.ext.<toolchain>` 
function to integrate with BYOC
+  integrations, but this is trivial to change should this project change the 
source of truth.
+
+## Example
+
+We start with `mod` bound to 
[MNIST](https://github.com/onnx/models/tree/main/vision/classification/mnist):
+
+```
+fn (%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+  %0 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %1 = nn.conv2d(%0, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
+                 padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5]);
+  %2 = add(%1, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
+  %3 = nn.relu(%2);
+  %4 = nn.max_pool2d(%3, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 
0]);
+  %5 = nn.pad(%4, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %6 = nn.conv2d(%5, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), 
float32]*/,
+                 padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
+  %7 = add(%6, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
+  %8 = nn.relu(%7);
+  %9 = nn.max_pool2d(%8, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 
0]);
+  %10 = reshape(%9, newshape=[1, 256]);
+  %11 = nn.dense(%10, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, 
units=None, out_dtype="float32");
+  add(%11, meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
+}
+```
+
+We can compile this with Collage enabled for a variety of NVIDIA 
toolchains/libraries as follows:
+
+```
+with tvm.transform.PassContext(config={"relay.fallback_device_type": 2, 
"relay.collage.enable_collage": True}):
+    host_target = tvm.target.Target("llvm")
+    generic_target = tvm.target.Target("cuda", host_target)
+    cutlass_target = tvm.target.Target("cuda -compiler=cutlass", host_target)
+    tensorrt_target = tvm.target.Target("cuda -compiler=tensorrt", host_target)
+    cudnn_target = tvm.target.Target("cuda -libs=cudnn", host_target)
+    cublas_target = tvm.target.Target("cuda -libs=cublas", host_target)
+    targets = [generic_target, cutlass_target, tensorrt_target, cudnn_target, 
cublas_target]
+    exe = tvm.relay.vm.compile(mod, target=targets)
+```
+
+(Note that `cudnn` and `cublas` are not yet supported in the 'v2' prototype.)
+
+After the `CollageFuseOps` pass, the intermediate `"main"` global function 
could resemble the following (though we've
+modified this "optimal" partitioning by hand to illustrate all the varieties 
of kernels so don't take it as
+representative of actual performance):
+
+```
+fn (%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+  # Use TVM native
+  %3 = fn (%FunctionVar_08: Tensor[(1, 1, 28, 28), float32],
+           Primitive=1) -> Tensor[(1, 1, 32, 32), float32] {
+    nn.pad(%FunctionVar_08, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]])
+  };
+  %4 = %3(%x);
+  # Use TVM native, but indicate we wish to link to CuDNN
+  %6 = fn (%FunctionVar_07: Tensor[(1, 1, 32, 32), float32],
+           Primitive=1) -> Tensor[(1, 8, 28, 28), float32] {
+    %5 = fn (%FunctionVar_5: Tensor[(1, 1, 32, 32), float32],
+             Composite="cudnn.conv2d") -> Tensor[(1, 8, 28, 28), float32] {
+      nn.conv2d(%FunctionVar_5, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), 
float32]*/,
+                padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5])
+    };
+    %5(%FunctionVar_07)  
+  };
+  %7 = %6(%4);
+  # Use TVM native, with fusion
+  %8 = fn (%FunctionVar_06: Tensor[(1, 8, 28, 28), float32],
+           %FunctionVar_12: Tensor[(8, 1, 1), float32],
+           Primitive=1) -> Tensor[(1, 8, 28, 28), float32] {
+    %3 = add(%FunctionVar_06, %FunctionVar_12);
+    nn.relu(%3)
+  };
+  %9 = %8(%7, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
+  # Use TVM native
+  %10 = fn (%FunctionVar_05: Tensor[(1, 8, 28, 28), float32],
+            Primitive=1) -> Tensor[(1, 8, 14, 14), float32] {
+    nn.max_pool2d(%FunctionVar_05, pool_size=[2, 2], strides=[2, 2], 
padding=[0, 0, 0, 0])
+  };
+  %11 = %10(%9);
+  # Use TVM native
+  %12 = fn (%FunctionVar_04: Tensor[(1, 8, 14, 14), float32],
+            Primitive=1) -> Tensor[(1, 8, 18, 18), float32] {
+    nn.pad(%FunctionVar_04, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]])
+  };
+  %13 = %12(%11);
+  # Use TensorRT, with fusion
+  %14 = fn (%FunctionVar_03: Tensor[(1, 8, 18, 18), float32],
+            %FunctionVar_11: Tensor[(16, 1, 1), float32],
+            Primitive=1,
+            Compiler="tensorrt",
+            global_symbol="collage_nn_conv2d_add_nn_relu_1") -> Tensor[(1, 16, 
14, 14), float32] {
+    %1 = nn.conv2d(%FunctionVar_03, meta[relay.Constant][2] /*Tensor[(16, 8, 
5, 5), float32]*/,
+                   padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
+    %2 = add(%1, %FunctionVar_11);
+    nn.relu(%2)
+  };
+  %15 = %14(%13, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
+  # Use TVM native
+  %16 = fn (%FunctionVar_02: Tensor[(1, 16, 14, 14), float32],
+            Primitive=1) -> Tensor[(1, 16, 4, 4), float32] {
+    nn.max_pool2d(%FunctionVar_02, pool_size=[3, 3], strides=[3, 3], 
padding=[0, 0, 0, 0])
+  };
+  %17 = %16(%15);
+  # Use TVM native
+  %18 = fn (%FunctionVar_01: Tensor[(1, 16, 4, 4), float32],
+            Primitive=1) -> Tensor[(1, 256), float32] {
+    reshape(%FunctionVar_01, newshape=[1, 256])
+  };
+  %19 = %18(%17);
+  # Use CUTLASS, with fusion
+  %20 = fn (%FunctionVar_0: Tensor[(1, 256), float32],
+            %FunctionVar_1: Tensor[(10, 256), float32],
+            %FunctionVar_2: Tensor[(1, 10), float32],
+            Primitive=1,
+            Compiler="cutlass",
+            global_symbol="collage_cutlass_dense_bias_nn_dense_add") -> 
Tensor[(1, 10), float32] {
+    %1 = fn (%FunctionVar_01: Tensor[(1, 256), float32],
+             %FunctionVar_11: Tensor[(10, 256), float32],
+             %FunctionVar_21: Tensor[(1, 10), float32],
+             Composite="cutlass.dense_bias") -> Tensor[(1, 10), float32] {
+      %0 = nn.dense(%FunctionVar_01, %FunctionVar_11, units=None, 
out_dtype="float32");
+      add(%0, %FunctionVar_21)
+    };
+    %1(%FunctionVar_0, %FunctionVar_1, %FunctionVar_2)
+  };
+  %20(%19, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, 
meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
+}
+```
+
+## Design
+
+The implementation is mostly under `src/relay/collage/...` (namespace 
`tvm::relay::collage`), with some helper Python
+under `python/tvm/relay/collage`.
+
+If the `relay.collage.enable_collage` `PassConfig` attribute is true then a 
new `CollageFuseOps` pass is inserted before
+the existing `FuseOps` pass. The new pass establishes the invariant:
+
+> All Relay sub-graphs in all global functions which are to be lowered to a 
kernel are replaced by calls to an inline
+> `"Primitive"` `Function`. Functions which are to be lowered by a 
BYOC-provided toolchain are given
+> `"Compiler"` and `"global_symbol"` attributes. The bodies of those functions may contain calls to inlined
+> `"Composite"` annotated functions to further direct lowering within the 
kernel.
+
+The `CollageFuseOps` pass proceeds in four phases:
+
+- **Phase 1**: The available `Target`s are scanned to build a list of 
`FusionSpec`s. Each `FusionSpec` is built from
+  (a tree of) `FusionRule`s. How the rules are constructed depends on `Target` 
itself. The remaining phases execute on
+  each global function separately.
+- **Phase 2**: A `DataflowGraph` is constructed for the global function. The 
available `FusionRule`s are evaluated on
+  the dataflow graph to yield a (possibly overlapping) set of 
`CandidateKernels` for each target. Each candidate is
+  described by a `SubGraph` which efficiently denotes a sub-graph of the 
global function's body without the need to
+  construct any new expressions. The candidates are placed in a 
`CandidateKernelIndex` for use below.
+- **Phase 3**: A shortest path is found in the following (implicit) search 
graph:
+    - Search Nodes: An `IndexSet` describing which dataflow nodes have been assigned to a candidate kernel so far.
+    - Search Edge X->Y: A `CandidateKernel` can be applied to node X to give 
node Y. The candidate is disjoint from all
+      dataflow nodes already assigned in X. To avoid an unnecessary search 
space explosion the candidate must also
+      include the next yet-to-be-assigned dataflow node in X.
+    - Edge cost: Estimated latency of the candidate kernel, plus a kernel 
launch penalty. Note that though we need to be
+      able to extract the candidate's sub-graph in order to build the kernel, 
we do not yet need to partition the
+      overall function body expression.
+  Other search algorithms are certainly possible, eg the paper uses an evolutionary search to refine
+  the partitioning found by the dynamic-programming search. We can easily 
abstract away the search
+  interface to support multiple implementations in the future.
+- **Phase 4**: The function body is partitioned according to the candidate 
kernels on the shortest path.
+
+In the following we introduce the new datatypes, then expand on the phases.
+
+### Util Datatypes
+
+- `PostDfsIndex`: The integer index of a Relay sub-expression in a post-dfs 
traversal of the overall Relay expression.
+  If index i is less than index j then we know the sub-expression for j cannot 
influence the value of the sub-expression
+  for i.
+- `DataflowGraph`: An alias for the existing `IndexedGraph<Expr>` from the `DFPatternMatcher` suite (which in turn is a
+  reworked copy of the `IndexedGraph` private to `fuse_ops.cc`). It is used 
throughout to manage the three-way bijection
+  from Relay `ExprNode`s to `PostDfsIndex`s to
+  `DataflowGraph::Node`s. Each `DataflowGraph::Node` describes the 
sub-expression's dataflow inputs, outputs, dominator
+  and inverse-dominators.
+- `IndexSet`:  A bit vector indexed by `PostDfsIndex`s. These are used as a 
compact representation for an arbitrary set
+  of dataflow nodes in a dataflow graph.
+- `Cost`: A `double` representing a candidate kernel's 'cost', which currently 
is just mean execution latency in
+  seconds. Collage only cares that costs are additive and a total order, so in 
the future we could support cost
+  functions which balance execution time against high memory watermark or 
other measures. Costs may be `Unknown`
+  (ie NaN) to signal some other heuristic should be used to compare kernel 
costs. Costs may be `Invalid` (ie +inf)
+  to signal the toolchain could not compile and run a candidate kernel.
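
The `Unknown`/`Invalid` conventions can be sketched with plain floats (illustrative only; the actual `Cost` is the C++ type described above):

```python
import math

UNKNOWN = math.nan   # fall back to a heuristic when comparing
INVALID = math.inf   # toolchain failed on this candidate

def add_costs(a, b):
    """Costs are additive; Unknown and Invalid are both contagious."""
    return a + b  # nan + x == nan, inf + x == inf

def better(a, b):
    """True if cost a is definitely better than b. Unknown never wins
    outright, so callers must apply their own tie-breaking heuristic."""
    if math.isnan(a) or math.isnan(b):
        return False
    return a < b
```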
+
+### SubGraph
+
+A `SubGraph` is an `IndexSet` of the `PostDfsIndex`s of all dataflow nodes 
'inside' an arbitrary sub-graph of the
+overall dataflow graph. This and `FusionRule` below are the core Collage 
datatypes.
+
+Sub-graphs can be used to represent 'composite' and 'fused' functions without having to pay the cost of constructing
+either the function or the rewritten overall 'partitioned' expression which 
calls that function. We also allow functions
+to be extracted independently of partitioning, since we'll need to estimate 
the latency of many more kernel functions
+than will ultimately be used in the final Relay expression. We expect 
O(thousands) of sub-graphs to be in flight while
+processing a given model.
+
+A sub-graph classifies every dataflow node of the overall expression as either 
'inside' or 'outside' the sub-graph.
+Obviously not all such divisions make sense, for example it is not valid for 
an inside node to feed into another inside
+node via outside nodes. We provide the `IsValid` method to check for validity, 
and `SubGraphConfig` to control which
+rules apply (such as maximum depth).
+
+As well as 'inside' and 'outside' we have four other flavors of dataflow nodes 
(all uniquely determined from the
+'inside' nodes):
+
+- 'entry' nodes are those inside with at least one dataflow input outside.
+- 'exit' nodes are those inside with at least one dataflow output outside, or 
which are considered 'external' in the
+  underlying dataflow graph (eg because they represent the result of the 
overall function).
+- 'input' nodes are those outside with at least one dataflow output inside.
+- 'output' nodes are those outside with at least one dataflow input inside.
+
+It is valid to have multiple entry nodes (we'll bind a parameter for each). It 
may be valid to have multiple exit
+nodes (we'll build a tuple of all such). It may be valid to have exit nodes 
which also contribute to other inside
+nodes (ie represent a 'tap' on an intermediate result).
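
All four flavors are derivable from the 'inside' set plus the dataflow edges. A toy sketch, with the graph given as a node-to-outputs adjacency dict (names are illustrative, and the handling of 'external' exit nodes is simplified):

```python
def classify(inside, outputs, external=frozenset()):
    """Given the 'inside' node set and a node->outputs adjacency dict,
    compute the entry/exit/input/output node sets described above."""
    inputs_of = {n: set() for n in outputs}
    for n, outs in outputs.items():
        for o in outs:
            inputs_of[o].add(n)
    # Inside nodes fed by at least one outside node.
    entry = {n for n in inside if inputs_of[n] - inside}
    # Inside nodes feeding at least one outside node, or externally visible.
    exit_ = {n for n in inside if (outputs[n] - inside) or n in external}
    # Outside nodes feeding at least one inside node.
    inp = {n for n in outputs if n not in inside and outputs[n] & inside}
    # Outside nodes fed by at least one inside node.
    out = {n for n in outputs if n not in inside and inputs_of[n] & inside}
    return entry, exit_, inp, out
```

For the linear dataflow 0 -> 1 -> 2 -> 3 with nodes {1, 2} inside, node 1 is the entry, 2 the exit, 0 the input and 3 the output.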
+
+Sub-graphs are closed under:
+
+- Disjoint union.
+- Wrapping by a label, which indicates the wrapped sub-graph should be 
extracted as a sub-function with a "Composite"
+  label.
+- Substitution, which allows a sub-graph w.r.t. one dataflow graph to be 
transformed to match some other (typically
+  smaller) dataflow graph.
+
+To support some of the `OpPatternKind`-based fusion rules (see below) we give 
sub-graphs a kind, which is generally the
+maximum of the kinds of all the operator calls appearing inside it. We also give sub-graphs a label to aid debugging.
+
+Note that the Relay `PatternPartitioner` goes directly from `Expr` to partitioned `Expr` without stopping at any
+intermediate representation. It may be worth 'promoting' `SubGraph` out of Collage and into the standard `DFPattern`
+suite.
+
+Note that to support closure on both disjoint union and wrapping by a label, `SubGraph`s are actually recursive -- see
+the 'v2' prototype `sub_graph.cc` for details.
+
+### CandidateKernel
+
+A `CandidateKernel` pairs a `SubGraph` with a `FusionSpec` (from which the 
intended `Target` for the candidate kernel
+can be extracted). All Collage search and measurement is in units of candidate 
kernels.
+
+### FusionRule
+
+A `FusionRule` describes how to find a set of `CandidateKernel`s for a 
`DataflowGraph`. This and `SubGraph` above are
+the core Collage datatypes. All fusion rules implement the method:
+
+```
+virtual Array<CandidateKernel> AllCandidateKernels(const DataflowGraph& 
dataflow_graph,
+                                                   const FusionSpec& spec) 
const;
+```
+
+The candidates are allowed to overlap, and ultimately it is the job of the 
Collage fusion searcher to find a selection
+of candidates which covers the whole Relay expression without overlap.
+
+We provide a set of 'base' fusion rules which produce candidates from the 
dataflow graph directly. We also provide a set
+of 'combinator' rules which can produce new candidates from the results of 
arbitrary sub-rule or sub-rules. In this way
+it is possible to combine the fusion rules to express a wide variety of fusion 
strategies, akin to the way we can
+combine TVM passes.
+
+There may be many thousands of candidates in flight during the fusion search. 
We take care to defer rewriting any Relay
+expressions (eg to extract the fused function, or partition the model) until 
absolutely necessary.
+
+The base rules implemented so far:
+
+- `DFPatternFusionRule`: Given a `DFPattern` and expression predicate, 
produces a candidate for every sub-graph matched
+  by the pattern and predicate. Unlike the Relay `PatternRewriter`, candidates 
are free to overlap. This is the
+  foundation for pattern-based BYOC integrations, and can be used to write 
targeted fusion rules as well as find
+  examples of 'composite' operators.
+- `OpPredicateFusionRule`: Given an attribute name, produces a candidate for 
every call to a primitive Relay operator
+  where the operator has predicate bound to that attribute which returns true 
given the call sub-expression. Generally
+  this will result in a singleton sub-graph containing only the call, but it 
may pull in constant arguments to the call
+  should they be required. This is the foundation for operator-based BYOC 
integrations, though we should consider
+  retiring this mechanism in favor of pattern-based alone.
+- `OpCallByKindFusionRule`: Uses the `"TOpPattern"` attribute provided for 
every Relay operator to produce a candidate
+  for every call to a 'fusable Relay operator'. This can be used as the 
foundation for generic fusion patterns which
+  work over all Relay operators with particular properties (elementwise, 
broadcast, injective, reductive, anchor).
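
As a toy illustration of the `OpCallByKindFusionRule` idea (pure-Python stand-ins; the real rule reads each operator's `"TOpPattern"` attribute and builds `SubGraph`s):

```python
# Toy stand-in: ops with an abbreviated TOpPattern kind, in post-dfs order.
ELEMWISE, BROADCAST, INJECTIVE, OUT_FUSABLE, OPAQUE = range(5)

def op_call_by_kind(ops):
    """Yield a singleton candidate for every call to a fusable operator,
    mimicking OpCallByKindFusionRule. Candidates may overlap with those
    from other rules -- selection happens later, during the search."""
    return [
        {"inside": frozenset({i}), "kind": kind}
        for i, (name, kind) in enumerate(ops)
        if kind != OPAQUE
    ]
```

Combinator rules such as `CombineByPrimitivesFusionRule` would then grow these singletons into larger candidates.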
+
+The combinator rules implemented so far:
+
+- `CompositeFusionRule`: 'Tags' the candidates matched by an arbitrary 
sub-rule with the rule name. Tagged sub-graphs
+  are turned into "Primitive" Functions with the "Composite"
+  attribute bound to the tag. This can be used to indicate Relay operators (or 
groups of Relay operators) are to be
+  rewritten to specific target-specific operators. This combinator wraps the 
`DFPatternFusionRules` for the
+  pattern-based BYOC integrations. However it could also be used with the 
default TVM backend, eg to indicate Relay
+  operators should be replaced with particular external library 
implementations.
+- `CombineByPrimitivesFusionRule`: Given a sub-rule and a list of 'primitive' 
rules, finds all possible ways of
+  combining the sub-rule candidates to yield even larger candidates. Note that 
the sub-rule's candidates may also be
+  included in the results -- that is every combination of candidates is 
considered optional. The 'primitive' rules allow
+  combining by
+  `OpPatternKinds`, and combining the arguments to tuples which themselves are 
arguments to Relay operator calls. This
+  rule is intended to mimic the existing TVM `FuseOps` pass, though: i) all 
combinations are found, ii) the starting set
+  of candidates can be provided by any other rule (ie not just 
`OpCallByKindFusionRule`), and iii) we rely on `SubGraph`
+  validity checking to weed out infeasible candidates.
+
+Though not yet implemented, we'd like to allow a combinator rule which will union candidates based on their 'anchor'
+operators. This can be used to implement 'vertical' and 'horizontal' fusion on more primitive candidates. Note that the
+`SubGraph` machinery supports multiple-input and -output sub-graphs and their validation, so horizontal fusion is easy
+to implement.
+
+We also have `MaxCoalesceFusionRule`, which eagerly combines 'touching' 
candidates (ie candidates where the output of
+one sub-graph can be directly connected to the input of the other sub-graph)
+to form the largest possible candidate. The idea is once the search has been 
completed this rule can be used to collapse
+adjacent kernels intended for the same target.
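A toy sketch of that eager coalescing (illustrative only, with 'touching' approximated as two disjoint candidates connected by a dataflow edge; not the actual Collage code):

```python
def coalesce(candidates, edges):
    """candidates: disjoint sets of dataflow node ids (all assumed to be
    headed for the same target); edges: set of (src, dst) dataflow edges.

    Greedily merges 'touching' candidates until no merge applies, yielding
    the largest possible groups."""
    cands = [frozenset(c) for c in candidates]
    changed = True
    while changed:
        changed = False
        for i in range(len(cands)):
            for j in range(i + 1, len(cands)):
                a, b = cands[i], cands[j]
                # 'Touching': disjoint, and the output of one feeds an
                # input of the other.
                touching = not (a & b) and any(
                    (s in a and d in b) or (s in b and d in a)
                    for s, d in edges)
                if touching:
                    cands[i] = a | b
                    del cands[j]
                    changed = True
                    break
            if changed:
                break
    return cands
```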
+
+Here are some typical `FusionRule` combinations for different fusion strategies (please excuse the crudity of the
+diagrams, I didn't have time to build them to scale or paint them):
+
+- Classic TVM `FuseOps`:
+
+```
+      OpCallByKindFusionRule
+                |
+                v
+    CombineByPrimitivesFusionRule (with default TVM primitive rules)
+```
+
+- Classic operator-based BYOC with 
`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes:
+
+```
+      OpPredicateFusionRule
+                |
+                v
+   CombineByPrimitivesFusionRule (with join anything primitive rule)
+```
+
+- Classic pattern-based BYOC with 
`MergeComposite`/`AnnotateTarget`/`PartitionGraph` passes:
+
+```
+     DFPatternFusionRule(pattern_1)  ...  DFPatternFusionRule(pattern_n)
+                |                                    |
+                v                                    v
+      CompositeFusionRule(label_1)   ...   CompositeFusionRule(label_n)
+                        \                     /
+                         v                   v
+                            UnionFusionRule
+                                   |
+                                   v
+                     CombineByPrimitivesFusionRule (with join anything primitive rule)
+```
+
+- "Just fuse what I tell you to fuse", using `DFPatterns` to directly select 
candidates:
+
+```
+     DFPatternFusionRule(pattern_1)  ...  DFPatternFusionRule(pattern_n)
+                           \                 /
+                            v               v
+                             UnionFusionRule
+```
+
+- "Consider this library implementation for these sub-expressions", using 
`DFPatterns` to pick out which Relay operators
+  are supported (note that TVM lowering does not currently support this):
+
+```
+    OpCallByKindFusionRule     DFPatternFusionRule(pattern_1) ... DFPatternFusionRule(pattern_n)
+                    \                       |                                 |
+                     \                      v                                 v
+                      \         CompositeFusionRule(label_1)  ...  CompositeFusionRule(label_n)
+                       \                    |                        /
+                        v                   v                       v
+                                      UnionFusionRule
+                                            |
+                                            v
+                             CombineByPrimitivesFusionRule (with default TVM primitive rules)
+```
+
+### FusionSpec
+
+A `FusionSpec` pairs a `FusionRule` with a `Target`.
+
+### Phase 1
+
+We build on the existing TVM support for heterogeneous devices and targets. 
The available `Targets` are extracted from
+the compilation configuration (eg using the existing `CompilationConfig` 
helper class). Each target is inspected to
+decide on how to construct a `FusionSpec`, which will guide Collage in the 
selection of candidate kernels to explore for
+that target.
+
+- If the `Target` has a `"fusion_spec"` attribute, use that directly (not 
currently in the 'v2' prototype). This would
+  allow users to directly control fusion for the targets they care about.
+- If the `Target` has a `"compiler"` attribute (eg `"cutlass"`), and the 
global pattern table has an entry for that
+  attribute value, assume the `Target` denotes a pattern-based BYOC 
integration to explore. The `FusionSpec`
+  will import all the BYOC patterns and predicates automatically.
+- As above, but if the global pattern table has no matching entry, assume the `Target`
denotes a predicate-based BYOC integration
+  to explore (eg `"tensorrt"`). The `FusionSpec` will look for and evaluate 
predicates with the
+  `"target.<compiler>"` attribute on all Relay operators.
+- Otherwise, assume the `Target` denotes a TVM-native target. The `FusionSpec` 
mimics the existing `FuseOps`, but now
+  generalized to explore multiple candidates so as to leave room for possible 
BYOC candidates.
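The dispatch above can be sketched as follows. Everything here (`SimpleTarget`, `FusionSpec`, the style strings) is an illustrative stand-in rather than the actual TVM/Collage API:

```python
from dataclasses import dataclass, field

@dataclass
class SimpleTarget:
    kind: str                 # eg "cuda", "llvm"
    attrs: dict = field(default_factory=dict)

@dataclass
class FusionSpec:
    target: SimpleTarget
    style: str                # how candidate kernels will be generated
    detail: object = None     # patterns, predicate attribute prefix, etc.

def make_fusion_spec(target, pattern_table):
    # 1. An explicit "fusion_spec" attribute on the Target wins outright.
    if "fusion_spec" in target.attrs:
        return target.attrs["fusion_spec"]
    compiler = target.attrs.get("compiler")
    if compiler is not None:
        if compiler in pattern_table:
            # 2. Pattern-based BYOC: import the registered patterns/predicates.
            return FusionSpec(target, "patterns", pattern_table[compiler])
        # 3. Predicate-based BYOC: evaluate "target.<compiler>" op predicates.
        return FusionSpec(target, "op_predicates", "target." + compiler)
    # 4. TVM-native: mimic FuseOps, but generating multiple candidates.
    return FusionSpec(target, "tvm_fuse_ops")
```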
+
+Note that to make this approach work we need to allow for multiple `Target`s 
with the same `DLDeviceKind`. For the VM
+simply switching the `target` argument from dictionary to list and removing 
some redundant Python preprocessing code was
+all that was required to support this.
+
+The user can use `on_device` annotations to constrain sub-graphs to particular 
devices. When Collage is considering
+candidate kernels, it should be sure to choose a candidate `Target` which 
'refines' the `Target` for every
+sub-expression discovered by the `PlanDevicesPass`. Given targets T and U we 
say 'T refines U' if T has a
+'"compiler"' and/or '"fusion_spec"' attributes, U has no such attributes, and 
T and U otherwise agree on all other
+fields. (This is not currently in the 'v2' prototype).
+
+### Phase 2
+
+Most of the hard work for this phase is carried by the `AllCandidateKernels` 
implementations of the `FusionRule`s. The
+main driver simply needs to index all the found `CandidateKernels` by their 
minimum 'inside' `PostDfsIndex`
+for rapid retrieval during the shortest path search.
+
+### Phase 3
+
+We find it most natural to use Dijkstra to find the optimal partitioning. A 
`SearchState` is:
+
+- An `IndexSet` of the dataflow nodes already 'covered' by candidates on the 
best path to this state. This is the
+  identifying key for the state.
+- The predecessor `SearchState` in the best path to this state.
+- The `Cost` of the best path to this state. This is the order for the 
Dijkstra priority queue.
+- The `CandidateKernel` for the transition from the best predecessor to this 
state.
+
+The starting state has no covered nodes. The final state has all nodes covered.
+
+When expanding a state we could choose any `CandidateKernel` collected from 
phase 2 provided it doesn't overlap with the
+state's covered set. However, a search path applying candidates C then D is 
equivalent to one applying D then C, so we
+only consider candidates which intersect the next yet-to-be-covered dataflow 
node. For each such candidate we use
+the `CostEstimator` (with its assumed cache) to get the candidate's cost,
build the successor state, and 'relax' the
+successor state in the usual way.
+
+Not all Relay expression nodes need to be assigned to a kernel since the VM or 
other execution provider can happily
+evaluate most Relay expressions except for calls to primitive operators. Thus 
the search must allow for the possibility
+of an expression node being 'left behind'.
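The whole Phase 3 search can be sketched with toy stand-ins for `CandidateKernel` and the `CostEstimator` (candidate node sets with fixed costs); 'left behind' nodes are modelled as per-node host fallbacks. This is purely illustrative, not the actual implementation:

```python
import heapq
import itertools

def optimal_partitioning(num_nodes, candidates, host_cost=1.0):
    """candidates: iterable of (nodes, cost, label) toy candidate kernels,
    where nodes is a set of dataflow node indices. Returns the least total
    cost of covering all nodes, plus the labels of the chosen candidates in
    application order. Any node may be 'left behind' for the host at
    host_cost."""
    all_nodes = frozenset(range(num_nodes))
    # Phase 2: index candidates by their minimum 'inside' node index.
    by_min = {}
    for nodes, cost, label in candidates:
        by_min.setdefault(min(nodes), []).append((frozenset(nodes), cost, label))
    # covered -> (best cost, predecessor covered-set, candidate label)
    best = {frozenset(): (0.0, None, None)}
    tie = itertools.count()  # tie-breaker; frozensets aren't orderable
    queue = [(0.0, next(tie), frozenset())]
    while queue:
        cost, _, covered = heapq.heappop(queue)
        if cost > best[covered][0]:
            continue  # stale queue entry
        if covered == all_nodes:
            labels = []
            while covered:  # walk predecessors back to the start state
                _, covered, label = best[covered]
                labels.append(label)
            return cost, labels[::-1]
        # Only expand candidates touching the next uncovered node: applying
        # C then D reaches the same state as applying D then C.
        nxt = min(all_nodes - covered)
        options = [(frozenset([nxt]), host_cost, "host")]  # 'left behind'
        options += [c for c in by_min.get(nxt, []) if not (c[0] & covered)]
        for nodes, c_cost, label in options:
            succ = covered | nodes
            new_cost = cost + c_cost
            if succ not in best or new_cost < best[succ][0]:
                best[succ] = (new_cost, covered, label)
                heapq.heappush(queue, (new_cost, next(tie), succ))
```

For example, given candidates A covering nodes {0,1} at cost 1.0, B covering {2} at cost 1.0, and C covering all three at cost 2.5, the search picks A then B for a total of 2.0.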
+
+### Phase 4
+
+The overall Relay expression is partitioned over all the `CandidateKernel`s on 
the shortest path 'in parallel'. Since
+all the candidates are expressed using `SubGraph`s w.r.t. the original 
dataflow graph, we must be careful not to
+invalidate yet-to-be-partitioned candidates as we go. Working backwards in 
dataflow order avoids this problem.
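The 'work backwards' trick is the same one used when splicing spans out of a flat list: rewriting in reverse order leaves the indices of not-yet-rewritten spans valid. A toy illustration (a list standing in for the dataflow graph, index ranges standing in for `SubGraph`s):

```python
def partition(nodes, spans):
    """nodes: a flat list standing in for the dataflow graph.
    spans: (start, end, label) half-open index ranges standing in for the
    chosen candidates' SubGraphs. Rewriting from back to front means each
    splice never invalidates the indices of spans still to be applied."""
    out = list(nodes)
    for start, end, label in sorted(spans, reverse=True):
        out[start:end] = [(label, tuple(out[start:end]))]
    return out
```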
+
+Note that all the extracted functions in the result will be marked as 
`"Primitive"`, and thus will be left alone by most
+other Relay passes except `LowerTEPass`. Thus it's fine for `FuseOps` to be 
run (repeatedly) after
+`CollageFuseOps`.
+
+## Known Limitations
+
+- **Some BYOC boilerplate changes required**: TVM's current BYOC integration 
API only requires the 'lowering/codegen'
+  function to be registered to a well-known global function name. Everything 
else is up to the BYOC author.
+    - Collage requires pattern-based BYOC integrations to register their 
patterns in the global pattern table.
+    - Collage requires the BYOC lowering function to yield a valid 
`runtime::Module` without requiring any additional
+      BYOC-specific passes to be run.
+    - Collage requires the BYOC integration to either correctly test for which 
operators are supported in the
+      pattern/operator predicate, or gracefully propagate failure rather than 
CHECK-fail if an unsupported operator is
+      included in a candidate kernel. Thus a BYOC integration will need to be 
'robustified' to become 'Collage
+      compatible'. Overall we've tried to make as few changes as possible. 
Collage will happily follow along with any
+      improvements to the BYOC integration API (eg via the UMA project).
+- **Higher tuning cost**: Obviously Collage needs to estimate the latency of 
many more candidate kernels, and each
+  candidate may itself trigger tuning during lowering. For TVM this can 
require O(thousands) of trials and take O(hours),
+  so we'll be very dependent on cached tuning logs to amortize this cost between models for the same target.
+  Currently Collage will measure more candidates even if TVM native is the 
only available target.
+- **Task extraction vs Tuning**: Traditionally TVM has had three phases: i) 
Task extraction (find the fused sub-graphs
+  to tune), ii) Tuning (find a good schedule for those sub-graphs), and iii) 
Compilation (re-compile the model, now
+  retrieving schedules for all the anticipated sub-graphs from the cache.) 
However the Collage 'v2' prototype collapses
+  all these phases. This lets us lazily explore the implied search graph 
(nodes = partially rewritten models, edges =
+  selection of sub-graph and toolchain as a candidate kernel, cost = estimated sum of kernel costs plus launch
+  penalties), and thus only pay the cost of tuning candidate kernels which could possibly influence the final
+  partitioning.
+- **No non-local optimization**: Though Collage can explore the choice of 
sub-graph and toolchain, it cannot explore any
+  choices which require the arguments and/or result of the sub-graph to be 
rewritten. Thus Collage **cannot** be used to
+  search over:
+    - choice of layout for arguments/results (may require insertion of layout 
transforms),
+    - choice of memory scope for arguments/results (may require insertion of 
device copies),
+    - choice of device on which to host the kernel (ditto)
+      since all those choices can require changes beyond the candidate's sub-graph.
+    - the choice of layout for a kernel, since any choice other than the model's default must be
+      'corrected' for by the inserted layout transformations. To support this efficiently we'd need to abandon the
+      simple-minded but fast `SubGraph` representation we describe below in favor of something like an EGraph

Review comment:
       EGraphs would of course be a cool feature, especially if one wanted to 
integrate it with target-specific layout transforms.

##########
File path: rfcs/xxxx-collage.md
##########
@@ -0,0 +1,833 @@
+# Design Doc: Collage [Draft 0.7]
+
+```
+Feature Name: Collage
+Start Date: Mar 2022
+Authors: Mark Shields ([email protected])
+RFC PR: <tbd>
+GitHub Issue: <tbd>
+```
+
+This design doc (with accompanying
+['v2' prototype 
implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch))
+shows how to bring tuning to TVM's operator fusion and BYOC partitioning 
passes. The tuning search explores the choice
+of sub-graphs (aka 'partitions') as well as choice of toolchain (TVM native or 
one of the available BYOC integrations,
+aka 'backends') for each candidate kernel so as to minimize the expected model 
inference latency. We call the result
+an 'optimal partitioning'. This new tuning layer complements the tuning 
traditionally done by TVM and other toolchains
+during lowering. It can also complement any global tuning, for example to 
explore all possible global layouts.
+
+The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):
+
+> *Collage: Automated Integration of Deep Learning Backends*  
+> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia
+
+This tuning approach contrasts with TVM's existing "greedy" and "manual" 
approaches to fusion and BYOC:
+
+- Greedy: Currently only the largest possible supported sub-graphs are used 
for kernels, irrespective of their execution
+  time. With Collage many more candidate sub-graphs are explored, and it is 
possible for two smaller sub-graphs to yield
+  better overall latency than one large sub-graph if they mix toolchains.
+- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke 
the corresponding partitioning function
+  before the main TVM compilation flow proceeds. With Collage the choice of 
toolchain can be automated based on measured
+  latency. Collage will also explore mixing and matching between multiple BYOC 
toolchains as well as TVM's native
+  backend.
+
+The design (when Collage is enabled) replaces TVM's fixed `FuseOps` and BYOC-provided `partition_for_<toolchain>`
+operations (built using the `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes) with a
+single new
+`CollageFuseOps` pass. The pass is carefully engineered to build directly on 
the existing `"TOpPattern"` attributes
+(provided for every Relay operator and used by `FuseOps`), BYOC 
`"target.<toolchain>"`
+operator predicates (provided for some operator/toolchain pairs by 
'operator-based' BYOC integrations) and BYOC operator
+pattern/predicates (registered in the pattern table by 'pattern-based' BYOC 
integrations). In this way only the more
+boilerplate aspects of existing BYOC integrations need to be adjusted to 
support Collage. The
+`partition_for_<toolchain>` operations are retained for users who wish to 
retain manual control.
+
+> NOTE: We'd like to coordinate these changes with the UMA project. Our aim in 
this design is to make the smallest
+> changes to BYOC possible. We think the changes described here can be 
easily reworked to follow any BYOC API
+> proposals settled on by UMA. See also "Related Work."
+
+Collage offers four advantages:
+
+- **Latency**: Overall model latency may be reduced compared to TVM native, 
TVM with a specific BYOC toolchain, or a
+  non-TVM compiler such as TensorRT.
+- **Automation**: The choice of which BYOC toolchains to enable can be 
automated.
+- **Economy of implementation**: Five standalone passes using three separate 
mechanisms for expressing fusion
+  rules/algorithms and implementing partitioning can be replaced with one, 
which itself is built from compositional
+  primitives.
+- **Decoupling**: It is ok for a candidate kernel found during search to 
actually not be valid for a toolchain (even
+  TVM's). Such candidates could be given 'infinite' cost and thus ignored 
during search. In this way we can avoid tight
+  coupling between backends and fusion rules.
+
+## FAQ
+
+Pending.
+
+## Success Metrics
+
+1. Collage offers at least a 10% latency improvement for a selection of 
standard ONNX models and NVIDIA hardware using
+   targets which include the CuDNN and CuBlas libraries, the CUTLASS library 
(with tuning, via BYOC), the TensorRT
+   compiler (via BYOC), and (obviously!) TVM native.
+2. Collage does not require new per-target or per-model patterns or rules to 
be implemented independently of the BYOC
+   integrations.
+3. Collage with just the native TVM and a single BYOC toolchain enabled is never worse than using the
+   existing `partition_for_<toolchain>` method in TVM today.
+
+## Project Milestones
+
+- [Done] M0: Port paper prototype to recent TVM main and validate paper 
results.
+- [Done] M1: Internal design doc.
+- [Done] M2: Use 'v2' prototype to test design doc, and rework ready for TVM 
community.
+- [In progress] M3: RFC
+- [2022Q1] M4: Re-validate results on 'v2' prototype for larger models (eg 
GPT2) and more NVIDIA targets.
+- [2022Q2] M5: Implementation in TVM main, including 'sub-projects' listed 
below.
+- [OctoML internal] M6: Estimator integrated into OctoML platform, validation 
against OctoML test suite.
+- [OctoML internal] M7: Productionization for OctoML.
+
+## Check-in plan
+
+Though the 'v2' prototype is in a personal branch we'd like to transition to 
main ASAP and rely on directory/namespace
+separation, maintaining backwards compat, and a new `PassConfig` flag to 
isolate all Collage changes from the rest of
+TVM. A rough PR progression is:
+
+- TensorRT and CUTLASS BYOC changes are backwards compat. The existing 
`partition_for_X` functions remain. The
+  CUTLASS-specific tuning and codegen functions will either continue to be 
supported or we'll work with users to account
+  for them being folded into the function-at-a-time `relay.ext.cutlass`
+  codegen function.
+- The `DFPattern` and friends changes are mostly just for improving the robustness of the
+  `IndexedGraph<T>` class and can go into main independently.
+- Some basic `Expr` improvements can go into main independently.
+- The design allows for multiple `Target`s for the same `DLDeviceType`. That 
requires the various
+  `build` interfaces which currently accept `Union[Target,Dict]` to also 
accept a list of `Target`s, and can be
+  backwards compat.
+- The new Collage code can go in bottom-up as we develop unit tests:
+    - Support utils, including `NameSupply`, `IndexSet`, `PriorityQueue`, 
`Cost`, `CostEstimator`.
+    - The core `SubGraph` datatype.
+    - `CandidateKernel`.
+    - The `FusionRule` class hierarchy (which itself can be broken into 
sub-PRs).
+    - `FusionSpec`.
+    - `GatherFusionSpecs` helper for bridging the existing BYOC world with the 
Collage 'FusionRule' world.
+    - The `CollageFuseOps` driver pass itself.
+
+## Related Work
+
+- The [Cascading 
Scheduler](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0037-arm-ethosu-cascading-scheduler.md)
 combines i) dynamic-programming
+  to find an optimal grouping of TE sub-expressions, ii) an analytic model of 
cost to guide the search,
+  and iii) cascading scheduling of the TE sub-expressions so as to reduce 
memory high-watermark. By contrast
+  Collage i) also uses dynamic-programming, but to find an optimal grouping of 
Relay sub-expressions, ii)
+  uses measurement to guide the search, and iii) assumes the toolchain will 'do its best' with the
+  sub-graph offered to it.
+- The [Universal modular Accelerator 
Interface](https://github.com/apache/tvm-rfcs/pull/60) proposal
+  adds a layer on top of the existing and separate TVM BYOC, operator 
strategy, operator scheduling,
+  target-specific passes and target-specific code generation extension points. 
Collage currently relies
+  only on the global pattern registry and global `relay.ext.<toolchain>` 
function to integrate with BYOC

Review comment:
       UMA and Ethos-U/CMSISNN use RFC10 eg.:
   
   ```cpp
   TVM_REGISTER_TARGET_KIND("ethos-n", kDLCPU)
       .set_attr<FTVMRelayToRuntime>("RelayToRuntime", EthosNCodeGen)
       .set_attr<FTVMConstantUpdater>("UpdateConstants", EthosNConstantUpdater);
   ```
   But as mentioned, this should be relatively easy to support. 



