mbs-octoml commented on a change in pull request #62: URL: https://github.com/apache/tvm-rfcs/pull/62#discussion_r827396907
##########
File path: rfcs/xxxx-collage.md
##########

@@ -0,0 +1,833 @@
# Design Doc: Collage [Draft 0.7]

```
Feature Name: Collage
Start Date: Mar 2022
Authors: Mark Shields ([email protected])
RFC PR: <tbd>
GitHub Issue: <tbd>
```

This design doc (with an accompanying
['v2' prototype implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch))
shows how to bring tuning to TVM's operator fusion and BYOC partitioning passes. The tuning search explores the choice
of sub-graphs (aka 'partitions') as well as the choice of toolchain (TVM native or one of the available BYOC
integrations, aka 'backends') for each candidate kernel so as to minimize the expected model inference latency. We call
the result an 'optimal partitioning'. This new tuning layer complements the tuning traditionally done by TVM and other
toolchains during lowering. It can also complement any global tuning, for example exploring all possible global layouts.

The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):

> *Collage: Automated Integration of Deep Learning Backends*
> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia

This tuning approach contrasts with TVM's existing "greedy" and "manual" approaches to fusion and BYOC:

- Greedy: Currently only the largest possible supported sub-graphs are used for kernels, irrespective of their
  execution time. With Collage many more candidate sub-graphs are explored, and it is possible for two smaller
  sub-graphs to yield better overall latency than one large sub-graph if they mix toolchains.
- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke the corresponding partitioning function
  before the main TVM compilation flow proceeds. With Collage the choice of toolchain can be automated based on
  measured latency. Collage will also explore mixing and matching between multiple BYOC toolchains as well as TVM's
  native backend.
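The greedy-vs-Collage contrast above can be made concrete with a small, self-contained sketch (all op names, backends, and latency numbers below are invented for illustration; this is not the prototype's code). A dynamic program over split points of a three-op chain picks the cheapest set of kernels covering the chain, and a mixed CUTLASS+TVM split beats the single large TensorRT kernel that the greedy rule would choose:

```python
# Hypothetical measured latencies (ms, invented) of running a contiguous span of
# ops [start, end) as a single kernel on a given backend. A missing key means
# that backend cannot compile that span.
LATENCY = {
    (0, 3, "tensorrt"): 5.0,   # one large kernel: the greedy choice
    (0, 2, "cutlass"): 1.5,    # conv2d+bias as one CUTLASS kernel
    (2, 3, "tvm"): 0.8,        # relu lowered by TVM native
    (0, 1, "tvm"): 2.0,
    (1, 2, "tvm"): 1.0,
    (2, 3, "tensorrt"): 1.2,
}

def best_partition(n):
    """Dynamic program over split points: cheapest way to cover ops [0, n)."""
    best = {0: (0.0, [])}  # prefix length -> (total latency, chosen kernels)
    for end in range(1, n + 1):
        candidates = []
        for start in range(end):
            if start not in best:
                continue
            prefix_cost, prefix_kernels = best[start]
            for (s, e, backend), cost in LATENCY.items():
                if (s, e) == (start, end):
                    candidates.append(
                        (prefix_cost + cost, prefix_kernels + [(s, e, backend)])
                    )
        if candidates:
            best[end] = min(candidates)
    return best[n]

cost, kernels = best_partition(3)
# The mixed CUTLASS+TVM split (1.5 + 0.8 = 2.3 ms) beats the single large
# TensorRT kernel (5.0 ms), even though the latter is the largest sub-graph.
```

The real search is over arbitrary sub-graphs of a dataflow graph rather than spans of a chain, but the principle is the same: minimize measured total latency rather than maximize kernel size.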
The design (when Collage is enabled) replaces TVM's fixed `FuseOps` pass and the BYOC-provided
`partition_for_<toolchain>` operations (built using the `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes) with a single new
`CollageFuseOps` pass. The pass is carefully engineered to build directly on the existing `"TOpPattern"` attributes
(provided for every Relay operator and used by `FuseOps`), the BYOC `"target.<toolchain>"`
operator predicates (provided for some operator/toolchain pairs by 'operator-based' BYOC integrations), and the BYOC
operator patterns/predicates (registered in the pattern table by 'pattern-based' BYOC integrations). In this way only
the more boilerplate aspects of existing BYOC integrations need to be adjusted to support Collage. The
`partition_for_<toolchain>` operations are retained for users who wish to keep manual control.

> NOTE: We'd like to coordinate these changes with the UMA project. Our aim in this design is to keep the changes to
> BYOC as small as possible. We think the changes described here can easily be reworked to follow any BYOC API
> proposals settled on by UMA. See also "Related Work."

Collage offers four advantages:

- **Latency**: Overall model latency may be reduced compared to TVM native, TVM with a specific BYOC toolchain, or a
  non-TVM compiler such as TensorRT.
- **Automation**: The choice of which BYOC toolchains to enable can be automated.
- **Economy of implementation**: Five standalone passes, using three separate mechanisms for expressing fusion
  rules/algorithms and implementing partitioning, can be replaced with one pass, itself built from compositional
  primitives.
- **Decoupling**: It is acceptable for a candidate kernel found during search to turn out not to be valid for a
  toolchain (even TVM's). Such candidates can be given 'infinite' cost and thus ignored during the search. In this way
  we avoid tight coupling between backends and fusion rules.

## FAQ

Pending.
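The "Decoupling" point can be sketched in a few lines (again with invented names and costs; `estimate_cost` is a hypothetical stand-in for compiling and measuring a candidate): a candidate kernel that a backend rejects is not an error, it simply receives infinite cost and drops out of the search:

```python
import math

def estimate_cost(candidate, backend):
    # Hypothetical stand-in for compiling + measuring a candidate kernel. A
    # backend predicate failure (e.g. an unsupported op) yields infinite cost
    # rather than aborting the whole search.
    measured = {
        ("conv2d", "tensorrt"): 1.2,
        ("conv2d", "tvm"): 2.0,
        ("erf", "tvm"): 0.3,
        # ("erf", "tensorrt") deliberately absent: this backend rejects it.
    }
    return measured.get((candidate, backend), math.inf)

def pick_backend(candidate, backends):
    """Choose the cheapest backend; invalid (infinite-cost) options lose."""
    backend = min(backends, key=lambda b: estimate_cost(candidate, b))
    return backend if math.isfinite(estimate_cost(candidate, backend)) else None

# pick_backend("conv2d", ["tvm", "tensorrt"]) -> "tensorrt"
# pick_backend("erf", ["tvm", "tensorrt"]) -> "tvm"
```

Because invalidity is just a cost, the fusion rules never need to encode each backend's exact support matrix, which is what keeps backends and fusion rules loosely coupled.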
## Success Metrics

1. Collage offers at least a 10% latency improvement for a selection of standard ONNX models and NVIDIA hardware using
   targets which include the CuDNN and CuBlas libraries, the CUTLASS library (with tuning, via BYOC), the TensorRT
   compiler (via BYOC), and (obviously!) TVM native.
2. Collage does not require new per-target or per-model patterns or rules to be implemented independently of the BYOC
   integrations.
3. Collage with just the native TVM and a single BYOC toolchain enabled is never worse than using the

Review comment:
   Oh darn, now I've gone and leaked it. You didn't see anything.
