mbs-octoml commented on a change in pull request #62: URL: https://github.com/apache/tvm-rfcs/pull/62#discussion_r827396907
##########
File path: rfcs/xxxx-collage.md
##########

@@ -0,0 +1,833 @@
# Design Doc: Collage [Draft 0.7]

```
Feature Name: Collage
Start Date: Mar 2022
Authors: Mark Shields ([email protected])
RFC PR: <tbd>
GitHub Issue: <tbd>
```

This design doc (with an accompanying
['v2' prototype implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch))
shows how to bring tuning to TVM's operator fusion and BYOC partitioning passes. The tuning search explores the choice
of sub-graphs (aka 'partitions') as well as the choice of toolchain (TVM native or one of the available BYOC
integrations, aka 'backends') for each candidate kernel so as to minimize the expected model inference latency. We call
the result an 'optimal partitioning'. This new tuning layer complements the tuning traditionally done by TVM and other
toolchains during lowering. It can also complement any global tuning, for example exploring all possible global layouts.

The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):

> *Collage: Automated Integration of Deep Learning Backends*
> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia

This tuning approach contrasts with TVM's existing "greedy" and "manual" approaches to fusion and BYOC:

- Greedy: Currently only the largest possible supported sub-graphs are used for kernels, irrespective of their
  execution time. With Collage many more candidate sub-graphs are explored, and it is possible for two smaller
  sub-graphs to yield better overall latency than one large sub-graph if they mix toolchains.
- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke the corresponding partitioning function
  before the main TVM compilation flow proceeds. With Collage the choice of toolchain can be automated based on
  measured latency. Collage will also explore mixing and matching between multiple BYOC toolchains as well as TVM's
  native backend.
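The greedy-vs-Collage contrast above can be made concrete with a small, self-contained sketch (all op names, backends, and latency numbers below are invented for illustration; this is not the prototype's code). A dynamic program over split points of a three-op chain picks the cheapest set of kernels covering the chain, and a mixed CUTLASS+TVM split beats the single large TensorRT kernel that the greedy rule would choose:

```python
# Hypothetical measured latencies (ms, invented) of running a contiguous span of
# ops [start, end) as a single kernel on a given backend. A missing key means
# that backend cannot compile that span.
LATENCY = {
    (0, 3, "tensorrt"): 5.0,   # one large kernel: the greedy choice
    (0, 2, "cutlass"): 1.5,    # conv2d+bias as one CUTLASS kernel
    (2, 3, "tvm"): 0.8,        # relu lowered by TVM native
    (0, 1, "tvm"): 2.0,
    (1, 2, "tvm"): 1.0,
    (2, 3, "tensorrt"): 1.2,
}

def best_partition(n):
    """Dynamic program over split points: cheapest way to cover ops [0, n)."""
    best = {0: (0.0, [])}  # prefix length -> (total latency, chosen kernels)
    for end in range(1, n + 1):
        candidates = []
        for start in range(end):
            if start not in best:
                continue
            prefix_cost, prefix_kernels = best[start]
            for (s, e, backend), cost in LATENCY.items():
                if (s, e) == (start, end):
                    candidates.append(
                        (prefix_cost + cost, prefix_kernels + [(s, e, backend)])
                    )
        if candidates:
            best[end] = min(candidates)
    return best[n]

cost, kernels = best_partition(3)
# The mixed CUTLASS+TVM split (1.5 + 0.8 = 2.3 ms) beats the single large
# TensorRT kernel (5.0 ms), even though the latter is the largest sub-graph.
```

The real search is over arbitrary sub-graphs of a dataflow graph rather than spans of a chain, but the principle is the same: minimize measured total latency rather than maximize kernel size.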
The design (when Collage is enabled) replaces TVM's fixed `FuseOps` pass and the BYOC-provided
`partition_for_<toolchain>` operations (built using the `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes) with a single new
`CollageFuseOps` pass. The pass is carefully engineered to build directly on the existing `"TOpPattern"` attributes
(provided for every Relay operator and used by `FuseOps`), the BYOC `"target.<toolchain>"`
operator predicates (provided for some operator/toolchain pairs by 'operator-based' BYOC integrations), and the BYOC
operator patterns/predicates (registered in the pattern table by 'pattern-based' BYOC integrations). In this way only
the more boilerplate aspects of existing BYOC integrations need to be adjusted to support Collage. The
`partition_for_<toolchain>` operations are retained for users who wish to keep manual control.

> NOTE: We'd like to coordinate these changes with the UMA project. Our aim in this design is to keep the changes to
> BYOC as small as possible. We think the changes described here can easily be reworked to follow any BYOC API
> proposals settled on by UMA. See also "Related Work."

Collage offers four advantages:

- **Latency**: Overall model latency may be reduced compared to TVM native, TVM with a specific BYOC toolchain, or a
  non-TVM compiler such as TensorRT.
- **Automation**: The choice of which BYOC toolchains to enable can be automated.
- **Economy of implementation**: Five standalone passes, using three separate mechanisms for expressing fusion
  rules/algorithms and implementing partitioning, can be replaced with one pass, itself built from compositional
  primitives.
- **Decoupling**: It is acceptable for a candidate kernel found during search to turn out not to be valid for a
  toolchain (even TVM's). Such candidates can be given 'infinite' cost and thus ignored during the search. In this way
  we avoid tight coupling between backends and fusion rules.

## FAQ

Pending.
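The "Decoupling" point can be sketched in a few lines (again with invented names and costs; `estimate_cost` is a hypothetical stand-in for compiling and measuring a candidate): a candidate kernel that a backend rejects is not an error, it simply receives infinite cost and drops out of the search:

```python
import math

def estimate_cost(candidate, backend):
    # Hypothetical stand-in for compiling + measuring a candidate kernel. A
    # backend predicate failure (e.g. an unsupported op) yields infinite cost
    # rather than aborting the whole search.
    measured = {
        ("conv2d", "tensorrt"): 1.2,
        ("conv2d", "tvm"): 2.0,
        ("erf", "tvm"): 0.3,
        # ("erf", "tensorrt") deliberately absent: this backend rejects it.
    }
    return measured.get((candidate, backend), math.inf)

def pick_backend(candidate, backends):
    """Choose the cheapest backend; invalid (infinite-cost) options lose."""
    backend = min(backends, key=lambda b: estimate_cost(candidate, b))
    return backend if math.isfinite(estimate_cost(candidate, backend)) else None

# pick_backend("conv2d", ["tvm", "tensorrt"]) -> "tensorrt"
# pick_backend("erf", ["tvm", "tensorrt"]) -> "tvm"
```

Because invalidity is just a cost, the fusion rules never need to encode each backend's exact support matrix, which is what keeps backends and fusion rules loosely coupled.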
## Success Metrics

1. Collage offers at least a 10% latency improvement for a selection of standard ONNX models and NVIDIA hardware using
   targets which include the CuDNN and CuBlas libraries, the CUTLASS library (with tuning, via BYOC), the TensorRT
   compiler (via BYOC), and (obviously!) TVM native.
2. Collage does not require new per-target or per-model patterns or rules to be implemented independently of the BYOC
   integrations.
3. Collage with just the native TVM and a single BYOC toolchain enabled is never worse than using the

Review comment:
   Oh darn, now I've gone and leaked it. You didn't see anything.
