manupa-arm commented on a change in pull request #62:
URL: https://github.com/apache/tvm-rfcs/pull/62#discussion_r833980767



##########
File path: rfcs/xxxx-collage.md
##########
@@ -0,0 +1,987 @@
+# Design Doc: Collage [Draft 0.8]
+
+```
+Feature Name: Collage
+Start Date: Mar 2022
+Authors: Mark Shields ([email protected])
+RFC PR: <tbd>
+GitHub Issue: <tbd>
+
+History:
+- v0.7: First draft.
+- v0.8: Rework to emphasise 'partitioning' (quite early in pipeline) instead of 'fusion' (quite late in pipeline).
+```
+
+This design doc (with accompanying
+['v2' prototype implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch))
+shows how to bring tuning to TVM's BYOC partitioning passes. The tuning search explores the choice of sub-graphs (aka
+'partitions') and toolchains (aka 'backends') so as to minimize the expected model inference latency. Both 'graph
+style' (eg TensorRT) and 'library style' (eg DNNL) BYOC integrations are supported. We call the result an 'optimal
+partitioning'. This new tuning layer complements the tuning traditionally done by TVM and other toolchains during
+lowering. It can also complement any global tuning, for example to explore the choice of layout convention or device
+assignment.
+
+The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):
+
+> *Collage: Automated Integration of Deep Learning Backends*  
+> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia
+
+(See Appendix A for a comparison of this proposal and the paper's implementation. See Appendix D for TODO items in the
+'v2' prototype.)
+
+This tuning approach contrasts with TVM's existing "greedy" and "manual" approaches to partitioning:
+
+- Greedy: Currently only the largest possible supported sub-graphs are used for partitions, irrespective of their
+  execution time. With Collage many more candidate sub-graphs are explored, and it is possible for two smaller
+  sub-graphs to yield better overall latency than one large sub-graph if they mix toolchains.
+- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke the corresponding
+  `partition_for_<toolchain>` function before the main TVM compilation flow begins. With Collage the choice of toolchain
+  can be automated based on measured latency. Collage will also explore mixing and matching between multiple BYOC
+  toolchains as well as TVM's native backend.
+
+When Collage is enabled it subsumes the existing `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/
+`PartitionGraph` passes embedded within each `partition_for_<toolchain>` function with a single new
+`CollagePartitioner` pass. The pass is guided by the list of available `Target`s and three existing sources:
+
+1. The `"TOpPattern"` attributes provided for every Relay operator and used by TVM's built-in `FuseOps`.
+2. The BYOC `"target.<toolchain>"` operator predicates provided for some operator/toolchain pairs by
+   'operator-based' BYOC integrations.
+3. The BYOC operator patterns/predicates (usually) registered in the pattern table by 'pattern-based' BYOC integrations.
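To make the flow from these three sources concrete, here is a minimal self-contained Python sketch. All names (`TOP_PATTERN`, `OPERATOR_PREDICATES`, `PATTERN_TABLE`, `gather_partition_specs`) are hypothetical stand-ins for illustration only, not the actual TVM registries or API:

```python
# 1. "TOpPattern" fusion-kind attributes, one per Relay operator (illustrative).
TOP_PATTERN = {
    "nn.conv2d": "kOutEWiseFusable",
    "add": "kBroadcast",
    "nn.relu": "kElemWise",
}

# 2. Operator-based BYOC predicates, keyed by (toolchain, operator).
def dnnl_conv2d_supported(attrs):
    # A toolchain predicate typically inspects call attributes.
    return attrs.get("data_layout", "NCHW") == "NCHW"

OPERATOR_PREDICATES = {
    ("dnnl", "nn.conv2d"): dnnl_conv2d_supported,
}

# 3. Pattern-based BYOC integrations: (composite name, pattern, predicate).
PATTERN_TABLE = [
    ("cutlass.dense_bias", "nn.dense -> add", lambda attrs: True),
]

def gather_partition_specs(toolchains):
    """Collect, per toolchain, the rules that will generate candidate partitions."""
    specs = {}
    for toolchain in toolchains:
        rules = [(op, pred) for (tc, op), pred in OPERATOR_PREDICATES.items()
                 if tc == toolchain]
        rules += [(name, pattern, pred) for (name, pattern, pred) in PATTERN_TABLE
                  if name.startswith(toolchain + ".")]
        specs[toolchain] = rules
    return specs
```

The point of the sketch is only that each enabled `Target` contributes its own bundle of candidate-generating rules, drawn from whichever of the three sources its integration uses.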
+
+Only some boilerplate aspects of existing BYOC integrations need to be adjusted to support Collage (and we will make
+these changes either as part of or in coordination with the UMA project). However Collage may require more robustness
+from the BYOC integrations, see Appendix F.
+
+Note however that we are **not** proposing to deprecate the existing `partition_for_<toolchain>` operations (or their
+UMA equivalent). This is mostly because Collage is inherently a tuning-based system which is not practical for users who
+need a stand-alone compiler. But it is also because of challenges with establishing a common pass ordering which will
+work for both TVM and all BYOC toolchains (see Appendix C for more details).
+
+Collage offers three advantages:
+
+- **Latency**: Overall model latency may be reduced compared to TVM native, TVM with a single
+  `partition_for_<toolchain>` call, or a non-TVM stand-alone compiler such as TensorRT.
+- **Automation**: The choice of which BYOC toolchains to enable can be automated.
+- **Economy and modularity of implementation**: Four standalone passes using two separate mechanisms for expressing
+  partitioning rules/algorithms can be replaced with one, which itself is built from compositional primitives. (The
+  machinery is also reusable for the very similar problem of choosing TVM fusion kernels, which we'll tackle in the
+  future.)
+
+See Appendix H for some frequently asked questions.
+
+## Success Metrics
+
+1. Collage offers at least a 10% latency improvement for a selection of standard ONNX models and NVIDIA hardware using
+   targets which include the CuDNN and CuBlas libraries, the CUTLASS library (with tuning, via BYOC), the TensorRT
+   compiler (via BYOC), and (obviously!) TVM native.
+2. Collage does not require new per-target or per-model patterns or rules to be implemented independently of the BYOC
+   integrations.
+3. Collage with a `Target` list enabling just one BYOC toolchain is never worse than using the existing
+   `partition_for_<toolchain>` function directly.
+
+## Project Milestones
+
+- [Done] M0: Port paper prototype to recent TVM main and validate paper results.
+- [Done] M1: Internal design doc.
+- [Done] M2: Use 'v2' prototype to test design doc, and rework ready for TVM community.
+- [In progress] M3: RFC.
+- [2022Q1] M4: Re-validate results on 'v2' prototype for larger models (eg GPT2) and more NVIDIA targets.
+- [2022Q2] M5: Implementation in TVM main, including 'sub-projects' listed below.
+- [OctoML internal] M6: Estimator integrated into OctoML platform, validation against OctoML test suite.
+- [OctoML internal] M7: Productionization for OctoML.
+
+## Check-in plan
+
+Though the 'v2' prototype is in a personal branch we'd like to transition to main ASAP and rely on directory/namespace
+separation, maintaining backwards compat, and a new `PassConfig` flag to isolate all Collage changes from the rest of
+TVM. A rough PR progression is:
+
+- TensorRT and CUTLASS BYOC changes are backwards compat. The existing `partition_for_<toolchain>` functions remain. The
+  CUTLASS-specific tuning and codegen functions will either continue to be supported or we'll work with users to account
+  for them being folded into the function-at-a-time `relay.ext.cutlass` codegen function.
+- The `DFPattern` and friends changes are all mostly just for improving the robustness of the
+  `IndexedGraph<T>` class and can go into main independently.
+- Some basic `Expr` improvements can go into main independently.
+- The design allows for multiple `Target`s for the same `DLDeviceType`. That requires the various
+  `build` interfaces which currently accept `Union[Target,Dict]` to also accept a list of `Target`s, and can be
+  backwards compat.
+- The new Collage code can go in bottom-up as we develop unit tests:
+    - Support utils, including `NameSupply`, `IndexSet`, `PriorityQueue`, `Cost`, `CostEstimator`.
+    - The core `SubGraph` datatype.
+    - `CandidatePartition`.
+    - The `PartitionRule` class hierarchy, as a series of PRs, ending with `PartitionSpec`.
+    - `GatherPartitionSpecs` helper for bridging the existing BYOC world with the Collage `PartitionRule` world.
+    - The `CollagePartitioner` driver pass itself.
+
+## Related Work
+
+- The [Cascading Scheduler](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0037-arm-ethosu-cascading-scheduler.md)
+  combines i) dynamic-programming to find an optimal grouping of TE sub-expressions, ii) an analytic model of cost to
+  guide the search, and iii) cascading scheduling of the TE sub-expressions so as to reduce memory high-watermark. By
+  contrast Collage i) also uses dynamic-programming, but to find an optimal grouping of Relay sub-expressions, ii)
+  uses (very much slower!) measurement to guide the search and iii) has no influence over how either TVM or BYOC
+  toolchains actually lower the sub-graphs given to them.
+- The [Universal Modular Accelerator Interface](https://github.com/apache/tvm-rfcs/pull/60) proposal adds a layer on top
+  of the existing and separate TVM BYOC, operator strategy, operator scheduling, target-specific passes and
+  target-specific code generation extension points. Collage currently relies only on the global pattern registry and
+  global `relay.ext.<toolchain>` function to integrate with BYOC integrations, but this is easy to rework should UMA
+  change the source of truth.
+
+## Example
+
+We start with `mod` bound to [MNIST](https://github.com/onnx/models/tree/main/vision/classification/mnist):
+
+```
+fn (%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+  %0 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %1 = nn.conv2d(%0, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
+                 padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5]);
+  %2 = add(%1, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
+  %3 = nn.relu(%2);
+  %4 = nn.max_pool2d(%3, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]);
+  %5 = nn.pad(%4, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %6 = nn.conv2d(%5, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/,
+                 padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
+  %7 = add(%6, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
+  %8 = nn.relu(%7);
+  %9 = nn.max_pool2d(%8, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]);
+  %10 = reshape(%9, newshape=[1, 256]);
+  %11 = nn.dense(%10, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, units=None, out_dtype="float32");
+  add(%11, meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
+}
+```
+
+We can compile this with Collage enabled for a variety of NVIDIA toolchains/libraries as follows:
+
+```
+with tvm.transform.PassContext(config={"relay.fallback_device_type": 2, "relay.collage.enable_collage": True}):
+    host_target = tvm.target.Target("llvm")
+    generic_target = tvm.target.Target("cuda", host_target)
+    cutlass_target = tvm.target.Target("cuda -compiler=cutlass", host_target)
+    tensorrt_target = tvm.target.Target("cuda -compiler=tensorrt", host_target)
+    cudnn_target = tvm.target.Target("cuda -compiler=cudnn", host_target)
+    cublas_target = tvm.target.Target("cuda -compiler=cublas", host_target)
+    targets = [generic_target, cutlass_target, tensorrt_target, cudnn_target, cublas_target]
+    exe = tvm.relay.vm.compile(mod, target=targets)
+```
+
+(Note that `cudnn` and `cublas` are not yet supported in the 'v2' prototype, see Appendix B.)
+
+After the `CollagePartitioner` pass, the intermediate `"main"` global function could resemble the following
+(though we've modified this "optimal" partitioning by hand for illustration, so don't take it as representative of
+actual performance):
+
+```
+fn (%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+  # Operators left behind in the function body are intended for TVM.
+  # The usual Relay passes may rewrite them, then FuseOps will push them
+  # into "Primitive" functions (without any "Compiler" attribute) ready
+  # for TVM lowering.
+  %4 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  # This conv2d will be offloaded to cudnn. However the main TVM compilation
+  # flow is responsible for emitting the call.
+  %6 = fn (%FunctionVar_5: Tensor[(1, 1, 32, 32), float32],
+           Composite="cudnn.conv2d") -> Tensor[(1, 8, 28, 28), float32] {
+    nn.conv2d(%FunctionVar_5, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
+              padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5])
+  };
+  # Back to vanilla TVM.
+  %7 = %6(%4);
+  %3 = add(%7, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
+  %9 = nn.relu(%3);
+  %11 = nn.max_pool2d(%9, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]);
+  %13 = nn.pad(%11, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  # Use TensorRT. The "Primitive" function delineates the partition.
+  %14 = fn (%FunctionVar_03: Tensor[(1, 8, 18, 18), float32],
+            %FunctionVar_11: Tensor[(16, 1, 1), float32],
+            Primitive=1,
+            Compiler="tensorrt",
+            global_symbol="collage_nn_conv2d_add_nn_relu_1") -> Tensor[(1, 16, 14, 14), float32] {
+    %1 = nn.conv2d(%FunctionVar_03, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/,
+                   padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
+    %2 = add(%1, %FunctionVar_11);
+    nn.relu(%2)
+  };
+  %15 = %14(%13, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
+  # Back to vanilla TVM.
+  %17 = nn.max_pool2d(%15, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]);
+  %19 = reshape(%17, newshape=[1, 256]);
+  # Use CUTLASS. Note the double function nesting: the outer "Primitive" function
+  # delineates the partition and the inner "Composite" function maps the original
+  # Relay operators to a tag to be used during compilation/build/lowering with the
+  # CUTLASS BYOC integration.
+  %20 = fn (%FunctionVar_0: Tensor[(1, 256), float32],
+            %FunctionVar_1: Tensor[(10, 256), float32],
+            %FunctionVar_2: Tensor[(1, 10), float32],
+            Primitive=1,
+            Compiler="cutlass",
+            global_symbol="collage_cutlass_dense_bias_nn_dense_add") -> Tensor[(1, 10), float32] {
+    %1 = fn (%FunctionVar_01: Tensor[(1, 256), float32],
+             %FunctionVar_11: Tensor[(10, 256), float32],
+             %FunctionVar_21: Tensor[(1, 10), float32],
+             Composite="cutlass.dense_bias") -> Tensor[(1, 10), float32] {
+      %0 = nn.dense(%FunctionVar_01, %FunctionVar_11, units=None, out_dtype="float32");
+      add(%0, %FunctionVar_21)
+    };
+    %1(%FunctionVar_0, %FunctionVar_1, %FunctionVar_2)
+  };
+  %20(%19, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/,
+      meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
+}
+```
+
+Ideally this optimal partitioning would be understandable to the user, see Appendix G.
+
+## Design
+
+The implementation is mostly under `src/relay/collage/...` (namespace `tvm::relay::collage`), with just a few Python
+helper functions under `python/tvm/relay/collage`.
+
+If the `relay.collage.enable_collage` `PassConfig` attribute is true then a new `CollagePartitioner` pass is inserted
+before all other Relay passes. The result of the pass is:
+
+- All Relay sub-graphs in all global functions which are to be handed off to a BYOC toolchain are replaced by calls to
+  an inline `"Primitive"` function with `"Compiler"` and `"global_symbol"` attributes.
+- Relay operators, or groups of operators, which are to be translated to a particular library or BYOC-supplied function
+  are replaced by calls to an inline `"Composite"` function. (This encoding is supported for both BYOC and external
+  libraries.)
+
+Note that no `"Primitive"` functions denoting TVM kernels are produced -- the existing `FuseOps` pass is still required.
+
+The `CollagePartitioner` pass has four phases:
+
+- **Phase 1**: The available `Target`s are scanned to build a list of rules describing how to find possible partitions
+  (see `PartitionSpec` and `PartitionRule` below). Depending on the `Target` the rules may incorporate entries from the
+  BYOC pattern table. (The remaining phases execute on each global function separately.)
+- **Phase 2**: A dataflow graph is constructed for the global function (which is just an `IndexedGraph<Expr>`). The
+  available rules from phase 1 are evaluated on the dataflow graph to yield a (possibly overlapping) set of candidate
+  partitions for each target (see `CandidatePartition` below). Each candidate efficiently describes a sub-graph of the
+  global function's body without the need to construct any new expressions (see `SubGraph` below).
+- **Phase 3**: A shortest path is found in the following (implicit and lazily constructed) search graph:
+    - Search Nodes: The set of dataflow nodes which have already been assigned to a candidate partition in all paths to
+      the node.
+    - Search Edge X->Y: A candidate partition can be applied to node X to give node Y. The candidate is disjoint from
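The Phase 3 search can be pictured with a small, self-contained sketch. This is a hypothetical illustration of the dynamic-programming/shortest-path formulation, not the actual `CollagePartitioner` implementation: the real pass constructs the search graph lazily and uses measured latency as the edge cost, whereas here candidate costs are simply given as inputs.

```python
import heapq

def best_partitioning(num_nodes, candidates):
    """Shortest path over search states.

    A state is the set of dataflow nodes already assigned to some chosen
    candidate; applying a disjoint candidate "gives" state Y = X | candidate
    from state X. candidates is a list of (frozenset of node ids, cost, label).
    """
    goal = frozenset(range(num_nodes))
    # Priority queue of (total cost, tiebreak, covered nodes, chosen labels).
    queue = [(0.0, 0, frozenset(), [])]
    expanded = set()
    tiebreak = 1
    while queue:
        cost, _, covered, chosen = heapq.heappop(queue)
        if covered == goal:
            return cost, chosen
        if covered in expanded:
            continue
        expanded.add(covered)
        for nodes, c, label in candidates:
            if nodes.isdisjoint(covered):  # candidates on one path never overlap
                heapq.heappush(
                    queue,
                    (cost + c, tiebreak, covered | nodes, chosen + [label]))
                tiebreak += 1
    return None
```

For example, with three dataflow nodes and candidates covering {0,1} at cost 1.0, {2} at cost 0.5, and {0,1,2} at cost 2.0, the search returns the two smaller candidates (total 1.5) rather than the single large one.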

Review comment:
       I'm not sure I follow what is meant by "give" here. I assume the candidate partitions change between the search nodes?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

