mbs-octoml commented on a change in pull request #62: URL: https://github.com/apache/tvm-rfcs/pull/62#discussion_r834608024
########## File path: rfcs/xxxx-collage.md ##########
@@ -0,0 +1,987 @@

# Design Doc: Collage [Draft 0.8]

```
Feature Name: Collage
Start Date: Mar 2022
Authors: Mark Shields ([email protected])
RFC PR: <tbd>
GitHub Issue: <tbd>

History:
- v0.7: First draft.
- v0.8: Rework to emphasise 'partitioning' (quite early in pipeline) instead of 'fusion' (quite late in pipeline).
```

This design doc (with accompanying ['v2' prototype implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch)) shows how to bring tuning to TVM's BYOC partitioning passes. The tuning search explores the choice of sub-graphs (aka 'partitions') and toolchains (aka 'backends') so as to minimize the expected model inference latency. Both 'graph style' (eg TensorRT) and 'library style' (eg DNNL) BYOC integrations are supported. We call the result an 'optimal partitioning'. This new tuning layer complements the tuning traditionally done by TVM and other toolchains during lowering. It can also complement any global tuning, for example to explore the choice of layout convention or device assignment.

The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):

> *Collage: Automated Integration of Deep Learning Backends*
> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia

(See Appendix A for a comparison of this proposal and the paper's implementation. See Appendix D for TODO items in the 'v2' prototype.)

This tuning approach contrasts with TVM's existing "greedy" and "manual" approaches to partitioning:

- Greedy: Currently only the largest possible supported sub-graphs are used for partitions, irrespective of their execution time. With Collage many more candidate sub-graphs are explored, and it is possible for two smaller sub-graphs to yield better overall latency than one large sub-graph if they mix toolchains.
- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke the corresponding `partition_for_<toolchain>` function before the main TVM compilation flow begins. With Collage the choice of toolchain can be automated based on measured latency. Collage will also explore mixing and matching between multiple BYOC toolchains as well as TVM's native backend.

When Collage is enabled it subsumes the existing `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes embedded within each `partition_for_<toolchain>` function with a single new `CollagePartitioner` pass. The pass is guided by the list of available `Target`s and three existing sources:

1. The `"TOpPattern"` attributes provided for every Relay operator and used by TVM's built-in `FuseOps`.
2. The BYOC `"target.<toolchain>"` operator predicates provided for some operator/toolchain pairs by 'operator-based' BYOC integrations.
3. The BYOC operator pattern/predicates (usually) registered in the pattern table by 'pattern-based' BYOC integrations.

Only some boilerplate aspects of existing BYOC integrations need to be adjusted to support Collage (and we will make these changes either as part of or in coordination with the UMA project). However Collage may require more robustness from the BYOC integrations, see Appendix F.

Note however that we are **not** proposing to deprecate the existing `partition_for_<toolchain>` operations (or their UMA equivalent). This is mostly because Collage is inherently a tuning-based system which is not practical for users who need a stand-alone compiler. But it is also because of challenges with establishing a common pass ordering which will work for both TVM and all BYOC toolchains (see Appendix C for more details).
Collage offers three advantages:

- **Latency**: Overall model latency may be reduced compared to TVM native, TVM with a single `partition_for_<toolchain>` call, or a non-TVM stand-alone compiler such as TensorRT.
- **Automation**: The choice of which BYOC toolchains to enable can be automated.
- **Economy and modularity of implementation**: Four standalone passes using two separate mechanisms for expressing partitioning rules/algorithms can be replaced with one, which itself is built from compositional primitives. (The machinery is also reusable for the very similar problem of choosing TVM fusion kernels, which we'll tackle in the future.)

See Appendix H for some frequently asked questions.

## Success Metrics

1. Collage offers at least a 10% latency improvement for a selection of standard ONNX models and NVIDIA hardware using targets which include the CuDNN and CuBlas libraries, the CUTLASS library (with tuning, via BYOC), the TensorRT compiler (via BYOC), and (obviously!) TVM native.
2. Collage does not require new per-target or per-model patterns or rules to be implemented independently of the BYOC integrations.
3. Collage with a `Target` list enabling just one BYOC toolchain is never worse than using the existing `partition_for_<toolchain>` function directly.

## Project Milestones

- [Done] M0: Port paper prototype to recent TVM main and validate paper results.
- [Done] M1: Internal design doc.
- [Done] M2: Use 'v2' prototype to test design doc, and rework ready for TVM community.
- [In progress] M3: RFC.
- [2022Q1] M4: Re-validate results on 'v2' prototype for larger models (eg GPT2) and more NVIDIA targets.
- [2022Q2] M5: Implementation in TVM main, including 'sub-projects' listed below.
- [OctoML internal] M6: Estimator integrated into OctoML platform, validation against OctoML test suite.
- [OctoML internal] M7: Productionization for OctoML.
## Check-in plan

Though the 'v2' prototype is in a personal branch we'd like to transition to main ASAP and rely on directory/namespace separation, maintaining backwards compat, and a new `PassConfig` flag to isolate all Collage changes from the rest of TVM. A rough PR progression is:

- TensorRT and CUTLASS BYOC changes are backwards compat. The existing `partition_for_<toolchain>` functions remain. The CUTLASS-specific tuning and codegen functions will either continue to be supported or we'll work with users to account for them being folded into the function-at-a-time `relay.ext.cutlass` codegen function.
- The `DFPattern` and friends changes are all mostly just for improving the robustness of the `IndexedGraph<T>` class and can go into main independently.
- Some basic `Expr` improvements can go into main independently.
- The design allows for multiple `Target`s for the same `DLDeviceType`. That requires the various `build` interfaces which currently accept `Union[Target,Dict]` to also accept a list of `Target`s, and can be backwards compat.
- The new Collage code can go in bottom-up as we develop unit tests:
  - Support utils, including `NameSupply`, `IndexSet`, `PriorityQueue`, `Cost`, `CostEstimator`.
  - The core `SubGraph` datatype.
  - `CandidatePartition`.
  - The `PartitionRule` class hierarchy, as a series of PRs, ending with `PartitionSpec`.
  - `GatherPartitionSpecs` helper for bridging the existing BYOC world with the Collage `PartitionRule` world.
  - The `CollagePartitioner` driver pass itself.

## Related Work

- The [Cascading Scheduler](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0037-arm-ethosu-cascading-scheduler.md) combines i) dynamic-programming to find an optimal grouping of TE sub-expressions, ii) an analytic model of cost to guide the search, and iii) cascading scheduling of the TE sub-expressions so as to reduce memory high-watermark. By contrast Collage i) also uses dynamic-programming, but to find an optimal grouping of Relay sub-expressions, ii) uses (very much slower!) measurement to guide the search, and iii) has no influence over how either TVM or BYOC toolchains actually lower the sub-graphs given to them.
- The [Universal Modular Accelerator Interface](https://github.com/apache/tvm-rfcs/pull/60) proposal adds a layer on top of the existing and separate TVM BYOC, operator strategy, operator scheduling, target-specific passes and target-specific code generation extension points. Collage currently relies only on the global pattern registry and global `relay.ext.<toolchain>` function to integrate with BYOC integrations, but this is easy to rework should UMA change the source of truth.

## Example

We start with `mod` bound to [MNIST](https://github.com/onnx/models/tree/main/vision/classification/mnist):

```
fn (%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
  %0 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
  %1 = nn.conv2d(%0, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
                 padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5]);
  %2 = add(%1, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
  %3 = nn.relu(%2);
  %4 = nn.max_pool2d(%3, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]);
  %5 = nn.pad(%4, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
  %6 = nn.conv2d(%5, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/,
                 padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
  %7 = add(%6, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
  %8 = nn.relu(%7);
  %9 = nn.max_pool2d(%8, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]);
  %10 = reshape(%9, newshape=[1, 256]);
  %11 = nn.dense(%10, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, units=None, out_dtype="float32");
  add(%11, meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
}
```

We can compile this with
Collage enabled for a variety of NVIDIA toolchains/libraries as follows:

```
with tvm.transform.PassContext(config={"relay.fallback_device_type": 2, "relay.collage.enable_collage": True}):
    host_target = tvm.target.Target("llvm")
    generic_target = tvm.target.Target("cuda", host_target)
    cutlass_target = tvm.target.Target("cuda -compiler=cutlass", host_target)
    tensorrt_target = tvm.target.Target("cuda -compiler=tensorrt", host_target)
    cudnn_target = tvm.target.Target("cuda -compiler=cudnn", host_target)
    cublas_target = tvm.target.Target("cuda -compiler=cublas", host_target)
    targets = [generic_target, cutlass_target, tensorrt_target, cudnn_target, cublas_target]
    exe = tvm.relay.vm.compile(mod, target=targets)
```

(Note that `cudnn` and `cublas` are not yet supported in the 'v2' prototype, see Appendix B.)

After the `CollagePartitioner` pass, the intermediate `"main"` global function could resemble the following (though we've modified this "optimal" partitioning by hand for illustration so don't take it as representative of actual performance):

```
fn (%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
  # Operators left behind in the function body are intended for TVM.
  # The usual Relay passes may rewrite them, then FuseOps will push them
  # into "Primitive" functions (without any "Compiler" attribute) ready
  # for TVM lowering.
  %4 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
  # This conv2d will be offloaded to cudnn. However the main TVM compilation
  # flow is responsible for emitting the call.
  %6 = fn (%FunctionVar_5: Tensor[(1, 1, 32, 32), float32],
           Composite="cudnn.conv2d") -> Tensor[(1, 8, 28, 28), float32] {
    nn.conv2d(%FunctionVar_5, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
              padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5])
  };
  # Back to vanilla TVM.
  %7 = %6(%4);
  %3 = add(%7, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
  %9 = nn.relu(%3);
  %11 = nn.max_pool2d(%9, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]);
  %13 = nn.pad(%11, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
  # Use TensorRT. The "Primitive" function delineates the partition.
  %14 = fn (%FunctionVar_03: Tensor[(1, 8, 18, 18), float32],
            %FunctionVar_11: Tensor[(16, 1, 1), float32],
            Primitive=1,
            Compiler="tensorrt",
            global_symbol="collage_nn_conv2d_add_nn_relu_1") -> Tensor[(1, 16, 14, 14), float32] {
    %1 = nn.conv2d(%FunctionVar_03, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/,
                   padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
    %2 = add(%1, %FunctionVar_11);
    nn.relu(%2)
  };
  %15 = %14(%13, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
  # Back to vanilla TVM.
  %17 = nn.max_pool2d(%15, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]);
  %19 = reshape(%17, newshape=[1, 256]);
  # Use CUTLASS. Note the double function nesting: the outer "Primitive" function
  # delineates the partition and the inner "Composite" function maps the original
  # Relay operators to a tag to be used during compilation/build/lowering with the
  # CUTLASS BYOC integration.
  %20 = fn (%FunctionVar_0: Tensor[(1, 256), float32],
            %FunctionVar_1: Tensor[(10, 256), float32],
            %FunctionVar_2: Tensor[(1, 10), float32],
            Primitive=1,
            Compiler="cutlass",
            global_symbol="collage_cutlass_dense_bias_nn_dense_add") -> Tensor[(1, 10), float32] {
    %1 = fn (%FunctionVar_01: Tensor[(1, 256), float32],
             %FunctionVar_11: Tensor[(10, 256), float32],
             %FunctionVar_21: Tensor[(1, 10), float32],
             Composite="cutlass.dense_bias") -> Tensor[(1, 10), float32] {
      %0 = nn.dense(%FunctionVar_01, %FunctionVar_11, units=None, out_dtype="float32");
      add(%0, %FunctionVar_21)
    };
    %1(%FunctionVar_0, %FunctionVar_1, %FunctionVar_2)
  };
  %20(%19, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/,
      meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
}
```

Ideally this optimal partitioning would be understandable to the user, see Appendix G.

## Design

The implementation is mostly under `src/relay/collage/...` (namespace `tvm::relay::collage`), with just a few Python helper functions under `python/tvm/relay/collage`.

If the `relay.collage.enable_collage` `PassConfig` attribute is true then a new `CollagePartitioner` pass is inserted before all other Relay passes. The result of the pass is:

- All Relay sub-graphs in all global functions which are to be handed off to a BYOC toolchain are replaced by calls to an inline `"Primitive"` function with `"Compiler"` and `"global_symbol"` attributes.
- Relay operators, or groups of operators, which are to be translated to a particular library or BYOC-supplied function are replaced by calls to an inline `"Composite"` function. (This encoding is supported for both BYOC and external libraries.)

Note that no `"Primitive"` functions denoting TVM kernels are produced -- the existing `FuseOps` pass is still required.
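To make the two encodings concrete, here is a toy sketch in plain Python (not the TVM API; the helper name `partition_attrs` is ours) of which attributes end up on the function wrapping a partition:

```python
# Illustrative sketch only: the attribute sets used to mark partitions, as
# described above. BYOC partitions get "Primitive"/"Compiler"/"global_symbol";
# library-mapped operator groups get a "Composite" label.

def partition_attrs(kind, toolchain, symbol=None):
    """Attributes attached to the function wrapping a partition (toy model)."""
    if kind == "byoc":
        # Whole sub-graph handed to a BYOC toolchain's codegen.
        return {"Primitive": 1, "Compiler": toolchain, "global_symbol": symbol}
    if kind == "library":
        # Operator group mapped to a library/BYOC-supplied function; any
        # surrounding "Primitive" wrapper is added separately.
        return {"Composite": toolchain}
    raise ValueError(f"unknown partition kind: {kind}")

print(partition_attrs("byoc", "tensorrt", "collage_nn_conv2d_add_nn_relu_1"))
print(partition_attrs("library", "cudnn.conv2d"))
```

This mirrors the example output above: the TensorRT and CUTLASS partitions carry the first attribute set, the cudnn composite function the second.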
The `CollagePartitioner` pass has four phases:

- **Phase 1**: The available `Target`s are scanned to build a list of rules describing how to find possible partitions (see `PartitionSpec` and `PartitionRule` below). Depending on the `Target` the rules may incorporate entries from the BYOC pattern table. (The remaining phases execute on each global function separately.)
- **Phase 2**: A dataflow graph is constructed for the global function (which is just an `IndexedGraph<Expr>`). The available rules from phase 1 are evaluated on the dataflow graph to yield a (possibly overlapping) set of candidate partitions for each target (see `CandidatePartition` below). Each candidate efficiently describes a sub-graph of the global function's body without the need to construct any new expressions (see `SubGraph` below).
- **Phase 3**: A shortest path is found in the following (implicit and lazily constructed) search graph:
  - Search Nodes: The set of dataflow nodes which have already been assigned to a candidate partition in all paths to the node.
  - Search Edge X->Y: A candidate partition can be applied to node X to give node Y. The candidate is disjoint from all dataflow nodes already assigned in X. To avoid an unnecessary search space explosion the candidate must also include the next yet-to-be-assigned dataflow node in X.
  - Edge cost: The estimated latency of the candidate partition, plus a partition transition penalty. Note that though we need to be able to extract the candidate's sub-graph in order to build a function representing the candidate, we do not yet need to partition the overall function body expression.

  Other search algorithms are certainly possible, eg the paper uses an evolutionary search to refine the partitioning found by the dynamic-programming search. We can easily abstract away the search interface to support multiple implementations in the future.
- **Phase 4**: The function body is partitioned according to the candidate kernels on the shortest path.

In the following we introduce the new datatypes, then expand on the phases.

### Util Datatypes

- `PostDfsIndex`: The integer index of a Relay sub-expression in a post-dfs traversal of the overall Relay expression. If index i is less than index j then we know the sub-expression for j cannot influence the value of the sub-expression for i.
- `DataflowGraph`: An alias for the existing `IndexedGraph<Expr>` from the `DFPatternMatcher` suite (which in turn is a reworked copy of the `IndexedGraph` private to `fuse_ops.cc`). It is used throughout to manage the three-way bijection from Relay `ExprNode`s to `PostDfsIndex`s to `DataflowGraph::Node`s. Each `DataflowGraph::Node` describes the sub-expression's dataflow inputs, outputs, dominator and inverse-dominators.
- `IndexSet`: A bit vector indexed by `PostDfsIndex`s. These are used as a compact representation for an arbitrary set of dataflow nodes in a dataflow graph.
- `Cost`: A `double` representing a candidate partition (or kernel) 'cost', which currently is just mean execution latency in seconds. Collage only cares that costs are additive and have a total order, so in the future we could support cost functions which balance execution time against high memory watermark or other measures. Costs may be `Unknown` (ie NaN) to signal some other heuristic should be used to compare kernel costs. Costs may be `Invalid` (ie +inf) to signal the toolchain could not compile and run a candidate kernel.

### SubGraph

A `SubGraph` is an `IndexSet` of the `PostDfsIndex`s of all dataflow nodes 'inside' an arbitrary sub-graph of the overall dataflow graph. This and `PartitionRule` below are the core Collage datatypes.

Sub-graphs can be used to represent partitions/kernels/composite functions without having to pay the cost of constructing or rewriting any expressions.
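The `IndexSet` idea from the util datatypes above can be sketched in a few lines; this is an assumed toy model (a Python int as the bit vector), not the TVM C++ implementation:

```python
# Minimal sketch of IndexSet: a set of PostDfsIndexes stored as a bit vector.
# Method names here are illustrative only.

class IndexSet:
    def __init__(self, indexes=()):
        self.bits = 0
        for i in indexes:
            self.bits |= 1 << i

    def union(self, other):
        result = IndexSet()
        result.bits = self.bits | other.bits
        return result

    def intersects(self, other):
        # The cheap disjointness test used when assembling
        # non-overlapping candidates during the search.
        return (self.bits & other.bits) != 0

    def indexes(self):
        # Recover the PostDfsIndexes in ascending order.
        return [i for i in range(self.bits.bit_length()) if self.bits >> i & 1]

inside = IndexSet([3, 4, 5])      # nodes 'inside' a toy sub-graph
other = IndexSet([5, 6])
print(inside.intersects(other))   # True: both contain index 5
print(inside.union(other).indexes())   # [3, 4, 5, 6]
```

The point of the bit-vector representation is that union, intersection and disjointness checks are a handful of word operations, which matters with thousands of sub-graphs in flight.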
We also allow 'extracting' a function to use for measuring a partition/kernel's latency independently from 'rewriting' the overall Relay expression, since only a tiny subset of candidate partitions will end up being needed after Collage has completed its search.

We expect O(thousands) of sub-graphs to be in flight while processing a given model, so we are mindful of space overhead.

A sub-graph classifies every dataflow node of the overall expression as either 'inside' or 'outside' the sub-graph. Obviously not all such divisions make sense, for example it is not valid for an inside node to feed into another inside node via outside nodes. We provide an `IsValid` method to check for validity, and `SubGraphConfig` to control which validity rules apply (such as maximum depth).

We generally work with the `DataflowGraph` representation of the overall Relay expression rather than the expression itself. We use the post-dfs visit index to uniquely refer to expression nodes.

As well as 'inside' and 'outside' we have four other flavors of dataflow nodes, all uniquely determined from the 'inside' nodes:

- 'entry' nodes are those inside with at least one dataflow input outside.
- 'exit' nodes are those inside with at least one dataflow output outside, or which are considered 'external' in the underlying dataflow graph (eg because they represent the result of the overall function).
- 'input' nodes are those outside with at least one dataflow output inside.
- 'output' nodes are those outside with at least one dataflow input inside.

Index sets for these are cached with the sub-graph for performance.

It is valid to have multiple entry nodes (we can bind a parameter for each). It may be valid to have multiple exit nodes (we can build a tuple of all such). It may be valid to have exit nodes which also contribute to other inside nodes (ie represent a 'tap' on an intermediate result).

Sub-graphs are closed under:

- Disjoint union.
- Wrapping by a function with given attributes. This can be used to encode "Composite" functions, or to represent a candidate kernel within a "Primitive" function. (By combining 'wrapping' with 'union' we can encode, eg, 'this sub-graph should be placed inside a primitive function which itself may have calls to composite functions'.)
- Substitution, which allows a sub-graph w.r.t. one dataflow graph to be transformed to match some other (typically smaller) dataflow graph.

Note that the Relay `PatternPartitioner` goes directly from `Expr` to partitioned `Expr` without stopping at any intermediate representation. It may be worth 'promoting' `SubGraph` out of Collage and into the standard `DFPattern` suite; we leave that to future work.

### CandidatePartition

A `CandidatePartition` pairs a `SubGraph` with a `Target`. All Collage search and measurement is in terms of candidate partitions.

### PartitionRule

A `PartitionRule` describes how to find a set of `CandidatePartition`s for a `DataflowGraph`. This and `SubGraph` above are the core Collage datatypes. All partition rules implement the method:

```
virtual std::vector<CandidatePartition> AllCandidates(const DataflowGraph& dataflow_graph,
                                                      const PartitionSpec& spec) const;
```

The candidates are allowed to overlap, and ultimately it is the job of the Collage searcher to find a selection of candidates which cover the whole Relay expression without overlap. There may be many thousands of candidates in flight during the Collage search.

We currently have three distinct flavors of partitions:

- For pattern-based BYOC integrations, individual `DFPattern`s are used to select the `"Composite"` functions to offload, and those are grouped into a `"Primitive"` Relay function with a `"Compiler"` attribute.
- For operator-based BYOC integrations, per-operator predicates indicate operators to offload, and again those are grouped into a `"Primitive"` Relay function with a `"Compiler"` attribute.
- For TVM, obviously all of Relay can go into a single partition, however for search efficiency the partitions should roughly mimic those of the Relay `FuseOps` pass. That pass uses the `"TOpPattern"` (of type `OpPatternKind`) attribute on all Relay operators, and rules for when operators of one kind can be folded into another (typically by moving scalar ops from elementwise operators into the output position of an earlier operator). This is implemented as a stand-alone analysis which encodes its result using `"Primitive"` functions.

Two new flavors are also showing up:

- For easy external library integration we would like to borrow the `DFPattern`-with-composite-functions approach from pattern-based BYOC integrations. But we'd like to leave those composite functions outside of any `"Primitive"` function so that the library calls could end up within larger TVM kernels.
- `FuseOps` is generally considered too inflexible, and we've sought a more flexible way to express target-dependent fusion rules.

So in the same way `DFPattern`s provide a small library of 'base' and 'combinator' pattern rules supporting a wide variety of examples, we seek the same economy and flexibility from `PartitionRule`s.

An obvious question is whether all partition rules should be expressed with `DFPattern`s, possibly by extending the `DFPattern` library itself. Indeed, though it does not appear to be used in prod, the `DominatorPattern` is an attempt to use `DFPattern`s to subsume the existing `FuseOps` machinery. We actually went down this path but decided to back out:

- Since we are interested in searching over possibly overlapping candidate partitions we'd need the `DFPattern` machinery to allow enumeration over all matching sub-expressions.
  That would require a rewrite of the `DFPatternMatcher`.
- Some of the more subtle fusion rules are difficult to express as patterns.
- `DFPattern`s are widely used outside of just partitioning, so any change would need to ensure no efficiency or cognitive overhead for those common cases.

That pushed us to the present design, which builds on `DFPattern`s, but introduces new 'base' and 'combinator' partition rules which can be combined to match the desired partition flavors. The 'base' rules produce candidates from the dataflow graph directly (eg all sub-graphs matching a given `DFPattern`). The 'combinator' rules combine the candidates found by one or more sub-rules into a new set of candidates (eg by indicating the sub-candidates should be wrapped by a `"Composite"` function, or touching sub-candidates should be unioned together).

The base rules are:

- `DFPatternPartitionRule`: Given a `DFPattern` and expression predicate, produces a candidate for every sub-graph matched by the pattern and predicate. Unlike the `PatternRewriter`, candidates are free to overlap. Mainly used to bring BYOC patterns into the Collage framework.
- `OpPredicatePartitionRule`: Given an attribute name, produces a candidate for every call to a primitive Relay operator where the operator i) has a predicate bound to that attribute which ii) returns true given the call sub-expression. Generally this will result in a singleton sub-graph containing only the call, but it may also pull in constant arguments to the call should they be required. Used to bring BYOC operator predicates into the Collage framework.
- `OpCallByKindPartitionRule`: Uses the `"TOpPattern"` attribute provided for every Relay operator to produce a candidate for every call to a 'fusable Relay operator'. Used to select the operators which `FuseOps` will consider parts of kernels.
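Under toy assumptions the base-rule idea can be sketched as follows: the dataflow graph is a list of (post-dfs index, op name) pairs plus an edge map, a candidate is just a set of node indexes, and the rule/function names are ours, not the TVM API:

```python
# Toy sketch of 'base' PartitionRules. Note that, unlike a rewriting pass,
# candidates produced by different rules are free to overlap.

nodes = [(0, "nn.conv2d"), (1, "add"), (2, "nn.relu"), (3, "nn.dense")]
edges = {0: [1], 1: [2], 2: [3]}   # index -> dataflow outputs

def op_predicate_rule(supported):
    """Like OpPredicatePartitionRule: a singleton candidate per supported call."""
    return [frozenset([i]) for i, op in nodes if op in supported]

def chain_pattern_rule(first, second):
    """Like DFPatternPartitionRule for a two-op chain pattern."""
    return [frozenset([i, j])
            for i, op in nodes if op == first
            for j in edges.get(i, []) if nodes[j][1] == second]

candidates = (op_predicate_rule({"nn.conv2d", "add"}) +
              chain_pattern_rule("nn.conv2d", "add"))
print(sorted(sorted(c) for c in candidates))   # [[0], [0, 1], [1]] -- overlapping
```

The overlap in the output is deliberate: it is the searcher's job, not the rule's, to pick a non-overlapping covering.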
The combinator rules are:

- `CompositePartitionRule`: Indicates all sub-candidates matched by the sub-rule should be wrapped by a `"Composite"` function. The `"Composite"` name is taken from the rule name. Used to indicate Relay operators (or groups of Relay operators) should be mapped to target-specific operators, both for BYOC and TVM external library integrations.
- `PrimitivePartitionRule`: Indicates all sub-candidates matched by the sub-rule should be wrapped by a `"Primitive"` function, possibly with an additional `"Compiler"` attribute. Used to delineate a partition (or kernel).
- `UnionPartitionRule`: Simply unions all the sub-candidates from all sub-rules together. Used to combine individual `DFPatternPartitionRule`s.
- `CombinePartitionRule`: Given a sub-rule and a list of 'combiner' rules (see below), finds all possible ways of combining the sub-candidates to yield even larger candidates. Note that the sub-candidates may also be directly included in the results. The 'combiner' rules allow combining by `OpPatternKind`s, combining the arguments to tuples which themselves are arguments to Relay operator calls, and so on. This rule is intended to mimic the existing TVM `FuseOps` pass, though: i) all candidates are found rather than just the largest, ii) the starting set of candidates can be provided by any other rule, and iii) we rely on `SubGraph` validity checking to weed out infeasible candidates.
- `OnlyValidPartitionRule`: Given a `SubGraphConfig`, ignores candidates with 'invalid' sub-graphs. Used to limit the maximum candidate depth, the number of independent outputs, and whether intermediate 'taps' are allowed.
- `HostPartitionRule`: Produces candidates for all Relay expressions which could be 'left behind' for execution by the host (eg on the VM). This rule lets us simplify the overall Collage search algorithm.
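A toy sketch of how combinator rules transform the candidate sets produced by sub-rules (illustrative only, not the TVM API; candidates are sets of node indexes, attribute-wrapping is modelled as tagging, and the real `CombinePartitionRule` finds all combinations rather than the single pass shown here):

```python
# Toy combinator rules over candidates from sub-rules.

edges = {0: [1], 1: [2]}   # toy dataflow: node 0 -> 1 -> 2

def union_rule(*sub_results):
    """Like UnionPartitionRule: concatenate the sub-rules' candidates."""
    return [c for r in sub_results for c in r]

def composite_rule(label, sub_result):
    """Like CompositePartitionRule: tag each candidate with a Composite label."""
    return [(label, c) for c in sub_result]

def touches(a, b):
    """True if some dataflow edge crosses between candidates a and b."""
    return (any(j in b for i in a for j in edges.get(i, [])) or
            any(j in a for i in b for j in edges.get(i, [])))

def combine_rule(sub_result):
    """Like CombinePartitionRule, simplified: keep the sub-candidates and
    add the union of every touching pair (one step of the real search for
    all feasible combinations)."""
    out = list(sub_result)
    for i in range(len(sub_result)):
        for j in range(i + 1, len(sub_result)):
            if touches(sub_result[i], sub_result[j]):
                out.append(sub_result[i] | sub_result[j])
    return out

singletons = union_rule([frozenset([0])], [frozenset([1])], [frozenset([2])])
print(sorted(sorted(c) for c in combine_rule(singletons)))
# [[0], [0, 1], [1], [1, 2], [2]]
print(composite_rule("cutlass.dense_bias", [frozenset([0, 1])]))
```

As in the real design, sub-candidates survive alongside their combinations, so the searcher can still prefer two small partitions over one large one.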
Here are some typical ways to combine `PartitionRule`s for different partition flavors:

- Classic operator-predicate based BYOC with `AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes (eg see `tensorrt.py`):
  ```
  PrimitivePartitionRule
    OnlyValidPartitionRule
      CombinePartitionRule (with a join-anything combiner rule)
        OpPredicatePartitionRule
  ```

- Classic pattern-based BYOC with `MergeComposite`/`AnnotateTarget`/`PartitionGraph` passes (eg see `cutlass.py`):
  ```
  PrimitivePartitionRule
    OnlyValidPartitionRule
      CombinePartitionRule (with join-anything combiner rule)
        UnionPartitionRule
          CompositePartitionRule(label1)
            DFPatternPartitionRule(pattern1)
          :
          CompositePartitionRule(labeln)
            DFPatternPartitionRule(patternn)
  ```

  The `CompositePartitionRule`/`DFPatternPartitionRule` combination is repeated for each entry in the pattern table for the BYOC toolchain name, eg:
  ```
  CompositePartitionRule(
    rule_name="cutlass.conv2d_bias_residual_multiply_relu"
    sub_rule=DFPatternPartitionRule(
      pattern=CallPatternNode(Op(nn.relu),
                [AltPattern(CallPatternNode(Op(multiply),
                              [CallPatternNode(AltPattern(Op(add) | Op(nn.bias_add)),
                                 [CallPatternNode(Op(nn.conv2d), [*, *]), *]),
                               *]) |
                            CallPatternNode(Op(multiply),
                              [*,
                               CallPatternNode(AltPattern(Op(add) | Op(nn.bias_add)),
                                 [CallPatternNode(Op(nn.conv2d), [*, *]), *])
                              ]))
                ])))
  ```

- "Consider this library implementation for these sub-expressions", using `DFPattern`s to pick out which Relay operators are supported (a new scheme):
  ```
  OnlyValidPartitionRule
    CombinePartitionRule (with default TVM combiner rules)
      UnionPartitionRule
        OpCallByKindPartitionRule
        CompositePartitionRule(label1)
          DFPatternPartitionRule(pattern1)
        :
        CompositePartitionRule(labeln)
          DFPatternPartitionRule(patternn)
  ```

- Classic TVM `FuseOps`:
  ```
  PrimitivePartitionRule
    OnlyValidPartitionRule
      CombinePartitionRule (with default TVM combiner rules)
        OpCallByKindPartitionRule
  ```

- "Just fuse what I tell you to fuse", using `DFPattern`s to directly select candidates (a new scheme):
  ```
  PrimitivePartitionRule
    OnlyValidPartitionRule
      UnionPartitionRule
        DFPatternPartitionRule(pattern1)
        :
        DFPatternPartitionRule(patternn)
  ```

### PartitionSpec

A `PartitionSpec` pairs a `PartitionRule` with one or more `Target`s.

### Phase 1

We build on the existing TVM support for heterogeneous devices and targets. The available `Target`s are extracted from the compilation configuration (eg using the existing `CompilationConfig` helper class). Each target is inspected to decide how to construct a `PartitionRule`, which will guide Collage in the selection of candidate kernels to explore for that target. (See Appendix G for the requirements which motivated this part of the design.)

- If the `Target` has a `"partition_rule"` attribute, use that directly. This would allow users to directly control partitioning/fusion for the targets they care about.
- If the `Target` has a `"compiler"` attribute (eg `"cutlass"`), and the global pattern table has an entry for that attribute value, assume the `Target` denotes a pattern-based BYOC integration to explore. The `PartitionRule` will import all the BYOC patterns and predicates automatically.
- As above, but if the global pattern table has no matching entry, assume the `Target` denotes a predicate-based BYOC integration to explore (eg `"tensorrt"`). The `PartitionRule` will look for and evaluate predicates with the `"target.<compiler>"` attribute on all Relay operators.
- Otherwise, assume the `Target` denotes a TVM-native target. The `PartitionRule` mimics `FuseOps`, but now generalized to explore multiple candidates so as to leave room for possible BYOC candidates.

Note that to make this approach work we need to allow for multiple `Target`s with the same `DLDeviceKind`.
For the VM, simply switching the `target` argument from dictionary to list and removing some redundant Python preprocessing code was all that was required to support this.

The user can use `on_device` annotations to constrain sub-graphs to particular devices. When Collage is considering candidate partitions, it should be sure to choose a candidate `Target` which 'refines' the `Target` for every sub-expression discovered by the `PlanDevicesPass`. Given targets T and U we say 'T refines U' if T has a `"compiler"` and/or `"partition_rule"` attribute, U has no such attributes, and T and U otherwise agree on all other fields.

### Phase 2

Most of the hard work for this phase is carried by the `AllCandidates` implementations of the `PartitionRule`s. The main driver simply needs to index all the found `CandidatePartition`s by their minimum 'inside' `PostDfsIndex` for rapid retrieval during the shortest path search.

### Phase 3

We find it most natural to use Dijkstra's algorithm to find the optimal partitioning. A `SearchState` is:

- An `IndexSet` of the dataflow nodes already 'covered' by candidates on the best path to this state. This is the identifying key for the state.
- The predecessor `SearchState` in the best path to this state.
- The `Cost` of the best path to this state. This is the order for the Dijkstra priority queue.
- The `CandidatePartition` for the transition from the best predecessor to this state.

The starting state has no covered nodes. The final state has all nodes covered.

When expanding a state we could choose any `CandidatePartition` collected from phase 2 provided it doesn't overlap with the state's covered set. However, a search path applying candidates C then D is equivalent to one applying D then C, so we only consider candidates which intersect the next yet-to-be-covered dataflow node. For each such candidate we use the `CostEstimator` (with its assumed cache) to get the candidate's cost, build the successor state, and 'relax' the successor state in the usual way. (See Appendix E for more details on `CostEstimator`.)

The `HostPartitionRule` is used to allow some dataflow nodes to be 'left behind' for execution by the host.

### Phase 4

The overall Relay expression is partitioned over all the `CandidatePartition`s on the shortest path 'in parallel'. Since all the candidates are expressed using `SubGraph`s w.r.t. the original dataflow graph, we must be careful not to invalidate yet-to-be-partitioned candidates as we go. Working backwards in dataflow order avoids this problem.

## Known Limitations

- **Some BYOC boilerplate changes required**: TVM's current BYOC integration API only requires the 'lowering/codegen' function to be registered to a well-known global function name. Everything else is up to the BYOC author.
  - Collage requires pattern-based BYOC integrations to register their patterns in the global pattern table. Some BYOC integrations use the table but many do not, but it's an easy fix.
  - Collage requires the BYOC lowering function to yield a valid `runtime::Module` without requiring any additional

Review comment:

It will need to, yes, I've just not gotten around to figuring out the encoding so that the hooks make the correct redirection, and how to tell the difference between using Compiler=<toolchain> vs Target=.... But I think it's just leg work.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
