mbs-octoml commented on a change in pull request #62:
URL: https://github.com/apache/tvm-rfcs/pull/62#discussion_r827395902



##########
File path: rfcs/xxxx-collage.md
##########
@@ -0,0 +1,833 @@
+# Design Doc: Collage [Draft 0.7]
+
+```
+Feature Name: Collage
+Start Date: Mar 2022
+Authors: Mark Shields ([email protected])
+RFC PR: <tbd>
+GitHub Issue: <tbd>
+```
+
+This design doc (with accompanying
+['v2' prototype implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch))
+shows how to bring tuning to TVM's operator fusion and BYOC partitioning passes. The tuning search explores the choice
+of sub-graphs (aka 'partitions') as well as the choice of toolchain (TVM native or one of the available BYOC
+integrations, aka 'backends') for each candidate kernel so as to minimize the expected model inference latency. We call
+the result an 'optimal partitioning'. This new tuning layer complements the tuning traditionally done by TVM and other
+toolchains during lowering. It can also complement any global tuning, for example to explore all possible global
+layouts.
+
+The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):
+
+> *Collage: Automated Integration of Deep Learning Backends*  
+> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia
+
+This tuning approach contrasts with TVM's existing "greedy" and "manual" approaches to fusion and BYOC:
+
+- Greedy: Currently only the largest possible supported sub-graphs are used for kernels, irrespective of their
+  execution time. With Collage many more candidate sub-graphs are explored, and it is possible for two smaller
+  sub-graphs to yield better overall latency than one large sub-graph if they mix toolchains.
+- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke the corresponding partitioning function
+  before the main TVM compilation flow proceeds. With Collage the choice of toolchain can be automated based on
+  measured latency. Collage will also explore mixing and matching between multiple BYOC toolchains as well as TVM's
+  native backend.
+
+The design (when Collage is enabled) subsumes TVM's fixed `FuseOps` and BYOC-provided `partition_for_<toolchain>`
+operations (built using the `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes) with a
+single new `CollageFuseOps` pass. The pass is carefully engineered to build directly on the existing `"TOpPattern"`
+attributes (provided for every Relay operator and used by `FuseOps`), BYOC `"target.<toolchain>"` operator predicates
+(provided for some operator/toolchain pairs by 'operator-based' BYOC integrations) and BYOC operator
+patterns/predicates (registered in the pattern table by 'pattern-based' BYOC integrations). In this way only the more
+boilerplate aspects of existing BYOC integrations need to be adjusted to support Collage. The
+`partition_for_<toolchain>` operations are retained for users who wish to keep manual control.
+
+> NOTE: We'd like to coordinate these changes with the UMA project. Our aim in this design is to make the smallest
+> possible changes to BYOC. We think the changes described here can be easily reworked to follow any BYOC API
+> proposals settled on by UMA. See also "Related Work."
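+
+For reference, an operator-based BYOC integration registers such a `"target.<toolchain>"` predicate via
+`tvm.ir.register_op_attr` roughly as follows (a sketch only; the toolchain name `my_toolchain` and the predicate body
+are placeholders, not part of this design):
+
+```
+@tvm.ir.register_op_attr("nn.conv2d", "target.my_toolchain")
+def conv2d_supported(expr):
+    # True iff this toolchain can compile this particular call.
+    return expr.args[0].checked_type.dtype == "float32"
+```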
+
+Collage offers four advantages:
+
+- **Latency**: Overall model latency may be reduced compared to TVM native, TVM with a specific BYOC toolchain, or a
+  non-TVM compiler such as TensorRT.
+- **Automation**: The choice of which BYOC toolchains to enable can be automated.
+- **Economy of implementation**: Five standalone passes using three separate mechanisms for expressing fusion
+  rules/algorithms and implementing partitioning can be replaced with one pass, which itself is built from
+  compositional primitives.
+- **Decoupling**: It is OK for a candidate kernel found during search to turn out not to be valid for a toolchain (even
+  TVM's). Such candidates can be given 'infinite' cost and thus ignored during search. In this way we avoid tight
+  coupling between backends and fusion rules.
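+
+In pseudocode, the search this enables looks roughly as follows (an illustration of the idea, not the actual
+implementation):
+
+```
+for each candidate sub-graph S and each backend B:
+    cost(S, B) = measured latency of S compiled by B, or 'infinite' if B rejects S
+find, by dynamic programming, the partitioning of the model into (S, B) pairs
+which minimizes the sum of the costs
+```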
+
+## FAQ
+
+Pending.
+
+## Success Metrics
+
+1. Collage offers at least a 10% latency improvement for a selection of standard ONNX models and NVIDIA hardware using
+   targets which include the cuDNN and cuBLAS libraries, the CUTLASS library (with tuning, via BYOC), the TensorRT
+   compiler (via BYOC), and (obviously!) TVM native.
+2. Collage does not require new per-target or per-model patterns or rules to be implemented independently of the BYOC
+   integrations.
+3. Collage with just native TVM and a single BYOC toolchain enabled is never worse than using the
+   existing `partition_for_<toolchain>` method in TVM today.
+
+## Project Milestones
+
+- [Done] M0: Port paper prototype to recent TVM main and validate paper results.
+- [Done] M1: Internal design doc.
+- [Done] M2: Use 'v2' prototype to test design doc, and rework ready for the TVM community.
+- [In progress] M3: RFC.
+- [2022Q1] M4: Re-validate results on 'v2' prototype for larger models (eg GPT2) and more NVIDIA targets.
+- [2022Q2] M5: Implementation in TVM main, including the 'sub-projects' listed below.
+- [OctoML internal] M6: Estimator integrated into the OctoML platform, validation against the OctoML test suite.
+- [OctoML internal] M7: Productionization for OctoML.
+
+## Check-in plan
+
+Though the 'v2' prototype is in a personal branch, we'd like to transition to main ASAP and rely on
+directory/namespace separation, maintained backwards compatibility, and a new `PassConfig` flag to isolate all Collage
+changes from the rest of TVM. A rough PR progression is:
+
+- The TensorRT and CUTLASS BYOC changes are backwards compatible. The existing `partition_for_X` functions remain. The
+  CUTLASS-specific tuning and codegen functions will either continue to be supported or we'll work with users to
+  account for them being folded into the function-at-a-time `relay.ext.cutlass` codegen function.
+- The `DFPattern` and friends changes are mostly just for improving the robustness of the `IndexedGraph<T>` class and
+  can go into main independently.
+- Some basic `Expr` improvements can go into main independently.
+- The design allows for multiple `Target`s for the same `DLDeviceType`. That requires the various `build` interfaces
+  which currently accept `Union[Target,Dict]` to also accept a list of `Target`s, and can be backwards compatible.
+- The new Collage code can go in bottom-up as we develop unit tests:
+    - Support utils, including `NameSupply`, `IndexSet`, `PriorityQueue`, `Cost`, `CostEstimator`.
+    - The core `SubGraph` datatype.
+    - `CandidateKernel`.
+    - The `FusionRule` class hierarchy (which itself can be broken into sub-PRs).
+    - `FusionSpec`.
+    - The `GatherFusionSpecs` helper for bridging the existing BYOC world with the Collage `FusionRule` world.
+    - The `CollageFuseOps` driver pass itself.
+
+## Related Work
+
+- The [Cascading Scheduler](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0037-arm-ethosu-cascading-scheduler.md)
+  combines i) dynamic programming to find an optimal grouping of TE sub-expressions, ii) an analytic cost model to
+  guide the search, and iii) cascading scheduling of the TE sub-expressions so as to reduce the memory high-watermark.
+  By contrast Collage i) also uses dynamic programming, but to find an optimal grouping of Relay sub-expressions, ii)
+  uses measurement to guide the search, and iii) assumes the toolchain will 'do its best' with the sub-graph offered
+  to it.
+- The [Universal Modular Accelerator Interface](https://github.com/apache/tvm-rfcs/pull/60) proposal
+  adds a layer on top of the existing and separate TVM BYOC, operator strategy, operator scheduling, target-specific
+  passes and target-specific code generation extension points. Collage currently relies only on the global pattern
+  registry and the global `relay.ext.<toolchain>` function to integrate with BYOC integrations, but this is trivial to
+  change should that project change the source of truth.
+
+## Example
+
+We start with `mod` bound to [MNIST](https://github.com/onnx/models/tree/main/vision/classification/mnist):
+
+```
+fn (%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+  %0 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %1 = nn.conv2d(%0, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
+                 padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5]);
+  %2 = add(%1, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
+  %3 = nn.relu(%2);
+  %4 = nn.max_pool2d(%3, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]);
+  %5 = nn.pad(%4, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %6 = nn.conv2d(%5, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/,
+                 padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
+  %7 = add(%6, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
+  %8 = nn.relu(%7);
+  %9 = nn.max_pool2d(%8, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]);
+  %10 = reshape(%9, newshape=[1, 256]);
+  %11 = nn.dense(%10, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, units=None, out_dtype="float32");
+  add(%11, meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
+}
+```
+
+We can compile this with Collage enabled for a variety of NVIDIA toolchains/libraries as follows:
+
+```
+with tvm.transform.PassContext(config={"relay.fallback_device_type": 2, "relay.collage.enable_collage": True}):
+    host_target = tvm.target.Target("llvm")
+    generic_target = tvm.target.Target("cuda", host_target)
+    cutlass_target = tvm.target.Target("cuda -compiler=cutlass", host_target)
+    tensorrt_target = tvm.target.Target("cuda -compiler=tensorrt", host_target)
+    cudnn_target = tvm.target.Target("cuda -libs=cudnn", host_target)
+    cublas_target = tvm.target.Target("cuda -libs=cublas", host_target)
+    targets = [generic_target, cutlass_target, tensorrt_target, cudnn_target, cublas_target]
+    exe = tvm.relay.vm.compile(mod, target=targets)
+```
+
+(Note that `cudnn` and `cublas` are not yet supported in the 'v2' prototype.)

Review comment:
       That's actually good -- we want to explore the possibility that some Relay ops (with some shapes and dtypes) are
better implemented by a target-specific library. It's perfectly fine if that support is tiny.
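
For illustration only (none of the names below are Collage or TVM APIs, and the latencies are made up), the per-op
choice described above, where a library wins only for the (op, shape, dtype) combinations it supports, can be sketched
as:

```python
# Illustrative sketch: pick the cheapest backend per (op, shape, dtype),
# treating unsupported backends as if they had infinite cost.
INF = float("inf")

measured_ms = {
    # The hypothetical library supports only this one combination...
    ("nn.dense", (1, 256), "float32"): {"tvm": 0.09, "some_lib": 0.04},
    # ...and has no entry here, so TVM native wins by default.
    ("nn.conv2d", (1, 8, 14, 14), "float32"): {"tvm": 0.30},
}

def best_backend(op, shape, dtype):
    costs = measured_ms.get((op, shape, dtype), {"tvm": INF})
    return min(costs, key=lambda backend: costs[backend])

print(best_backend("nn.dense", (1, 256), "float32"))         # some_lib
print(best_backend("nn.conv2d", (1, 8, 14, 14), "float32"))  # tvm
```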




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

