manupa-arm commented on a change in pull request #62: URL: https://github.com/apache/tvm-rfcs/pull/62#discussion_r840467074
########## File path: rfcs/0062-collage.md ##########
@@ -0,0 +1,1041 @@

```
Feature Name: Collage [Draft 0.81]
Start Date: Mar 2022
Authors: Mark Shields ([email protected])
RFC PR: https://github.com/apache/tvm-rfcs/pull/62

History:
- v0.7: First draft.
- v0.8: Rework to emphasise 'partitioning' (quite early in the pipeline) instead of 'fusion' (quite late in the pipeline).
```

# Summary

This design doc (with accompanying
['v2' prototype implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch))
shows how to bring tuning to TVM's BYOC partitioning passes. The tuning search explores the choice of sub-graphs (aka
'partitions') and toolchains (aka 'backends') so as to minimize the expected model inference latency. Both 'graph
style' (eg TensorRT) and 'library style' (eg DNNL) BYOC integrations are supported. We call the result an 'optimal
partitioning'. This new tuning layer complements the tuning traditionally done by TVM and other toolchains during
lowering. It can also complement any global tuning, for example to explore the choice of layout convention or device
assignment.

The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):

> *Collage: Automated Integration of Deep Learning Backends*
> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia

(See Appendix A for a comparison of this proposal and the paper's implementation. See Appendix D for TODO items in the
'v2' prototype.)

When Collage is enabled it replaces the existing `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/
`PartitionGraph` passes embedded within each `partition_for_<toolchain>` function with a single new
`CollagePartitioner` pass. The pass is guided by the list of available `Target`s and three existing sources:

1. The `"TOpPattern"` attributes provided for every Relay operator and used by TVM's built-in `FuseOps`.
2. 
The BYOC `"target.<toolchain>"` operator predicates provided for some operator/toolchain pairs by
   'operator-based' BYOC integrations.
   TODO(mbs): Consider removing predicate-based BYOC integrations once TensorRT has been transitioned to be
   pattern based.
3. The BYOC operator patterns/predicates (usually) registered in the pattern table by 'pattern-based' BYOC integrations.

The pass is run as early in the compilation flow as possible (see Appendix C).

Only some boilerplate aspects of existing BYOC integrations need to be adjusted to support Collage: patterns must
be registered in the standard pattern table, 'preamble' passes need to be split out as per Appendix C, and any
mandatory post-lowering helpers must be folded into the custom lowering function. We'll make sure these changes are
part of, or coordinated with, the UMA project. However, Collage may require more robustness from the BYOC integrations;
see Appendix F.

Note, however, that we are **not** proposing to deprecate the existing `partition_for_<toolchain>` operations (or their
UMA equivalents). This is mostly because Collage is inherently a tuning-based system, which is not practical for users
who need a stand-alone compiler. But it is also because of challenges with establishing a common pass ordering that
works for both TVM and all BYOC toolchains (see Appendix C for more details).

# Motivation

This tuning approach contrasts with TVM's existing "greedy" and "manual" approaches to partitioning:

- Greedy: Currently only the largest possible supported sub-graphs are used for partitions, irrespective of their
  execution time. With Collage many more candidate sub-graphs are explored, and it is possible for two smaller
  sub-graphs to yield better overall latency than one large sub-graph if they mix toolchains.
- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke the corresponding
  `partition_for_<toolchain>` function before the main TVM compilation flow begins. With Collage the choice of toolchain
  can be automated based on measured latency. Collage will also explore mixing and matching between multiple BYOC
  toolchains as well as TVM's native backend.

Collage offers three advantages:

- **Latency**: Overall model latency may be reduced compared to TVM native, TVM with a single
  `partition_for_<toolchain>` call, or a non-TVM stand-alone compiler such as TensorRT.
- **Automation**: The choice of which BYOC toolchains to enable can be automated.
- **Economy and modularity of implementation**: Four standalone passes using two separate mechanisms for expressing
  partitioning rules/algorithms can be replaced with one pass, which itself is built from compositional primitives.
  (The machinery is also reusable for the very similar problem of choosing TVM fusion kernels, which we'll tackle in
  the future.)

See Appendix H for some frequently asked questions.

# Success Metrics

1. Collage offers at least a 10% latency improvement for a selection of standard ONNX models and NVIDIA hardware using
   targets which include the cuDNN and cuBLAS libraries, the CUTLASS library (with tuning, via BYOC), the TensorRT
   compiler (via BYOC), and (obviously!) TVM native.
2. Collage does not require new per-target or per-model patterns or rules to be implemented independently of the BYOC
   integrations.
3. Collage with a `Target` list enabling just one BYOC toolchain is never worse than using the existing
   `partition_for_<toolchain>` function directly. (Since partitioning for multiple toolchains in sequence should never
   improve the result for any single toolchain, we consider just the single BYOC case.)

# Project Milestones

- [Done] M0: Port paper prototype to recent TVM main and validate paper results.
- [Done] M1: Internal design doc.
- [Done] M2: Use 'v2' prototype to test design doc, and rework ready for TVM community.
- [In progress] M3: RFC.
- [2022Q1] M4: Re-validate results on 'v2' prototype for larger models (eg GPT2) and more NVIDIA targets.
- [2022Q2] M5: Implementation in TVM main, including 'sub-projects' listed below.
- [OctoML internal] M6: Estimator integrated into OctoML platform, validation against OctoML test suite.
- [OctoML internal] M7: Productionization for OctoML.

# Check-in plan

Though the 'v2' prototype is in a personal branch, we'd like to transition to main ASAP and rely on directory/namespace
separation, maintaining backwards compat, and a new `PassConfig` flag to isolate all Collage changes from the rest of
TVM. A rough PR progression is:

- TensorRT and CUTLASS BYOC changes are backwards compat. The existing `partition_for_<toolchain>` functions remain. The
  CUTLASS-specific tuning and codegen functions will either continue to be supported or we'll work with users to account
  for them being folded into the function-at-a-time `relay.ext.cutlass` codegen function.
- The `DFPattern` and friends changes are mostly for improving the robustness of the
  `IndexedGraph<T>` class and can go into main independently.
- Some basic `Expr` improvements can go into main independently.
- The design allows for multiple `Target`s for the same `DLDeviceType`. That requires the various
  `build` interfaces which currently accept `Union[Target,Dict]` to also accept a list of `Target`s, and can be
  backwards compat.
- The new Collage code can go in bottom-up as we develop unit tests:
  - Support utils, including `NameSupply`, `IndexSet`, `PriorityQueue`, `Cost`, `CostEstimator`.
  - The core `SubGraph` datatype.
  - `CandidatePartition`.
  - The `PartitionRule` class hierarchy, as a series of PRs, ending with `PartitionSpec`.
  - `GatherPartitionSpecs` helper for bridging the existing BYOC world with the Collage `PartitionRule` world.
  - The `CollagePartitioner` driver pass itself.

# Guide-level explanation

Collage allows the choice of BYOC toolchains, and the partitioning across them, to be determined automatically
so as to minimize overall (expected) model execution latency.

To compile with Collage it's necessary to set a `PassContext` flag and include
'Collage aware' `Target`s in the build's `target` argument.

For example, assume `mod` is bound to [MNIST](https://github.com/onnx/models/tree/main/vision/classification/mnist):

```
def @main(%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
  %0 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
  %1 = nn.conv2d(%0, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
                 padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5]);
  %2 = add(%1, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
  %3 = nn.relu(%2);
  %4 = nn.max_pool2d(%3, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]);
  %5 = nn.pad(%4, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
  %6 = nn.conv2d(%5, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/,
                 padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
  %7 = add(%6, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
  %8 = nn.relu(%7);
  %9 = nn.max_pool2d(%8, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]);
  %10 = reshape(%9, newshape=[1, 256]);
  %11 = nn.dense(%10, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, units=None, out_dtype="float32");
  add(%11, meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
}
```

We can compile this with Collage enabled for a variety of NVIDIA toolchains/libraries with
the following fragment:

```
with tvm.transform.PassContext(config={"relay.fallback_device_type": 2, "relay.collage.enable_collage": True}):
    host_target = tvm.target.Target("llvm")
    generic_target = tvm.target.Target("cuda", host_target)
    cutlass_target = tvm.target.Target("cuda -compiler=cutlass", host_target)
    tensorrt_target = tvm.target.Target("cuda -compiler=tensorrt", host_target)
    cudnn_target = tvm.target.Target("cuda -compiler=cudnn", host_target)
    cublas_target = tvm.target.Target("cuda -compiler=cublas", host_target)
    targets = [generic_target, cutlass_target, tensorrt_target, cudnn_target, cublas_target]
    exe = tvm.relay.vm.compile(mod, target=targets)
```

(Note that `cudnn` and `cublas` are not yet supported in the 'v2' prototype; see Appendix B.)

After the `CollagePartitioner` pass, the intermediate `"main"` global function could resemble the following …

Review comment:
One small question -- would we be able to export the IRModule after the pass, and make sure that if we pass that
IRModule to `relay.build(...)` the compilation works as if it were tuned by Collage?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
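For intuition about the search the RFC describes (choosing sub-graphs and backends so as to minimize estimated latency), here is a purely illustrative toy dynamic program over a linear chain of operators. This is not the Collage implementation — Collage searches over arbitrary dataflow sub-graphs and uses measured latencies — and every operator name and cost below is made up. It does show the Summary's claim that two smaller mixed-toolchain partitions can beat one large one:

```python
def best_partitioning(ops, segment_cost):
    """Minimize total estimated latency over contiguous segments of `ops`.

    `segment_cost(backend, segment)` returns the estimated latency of running
    that tuple of ops on `backend` as one partition, or None if the backend
    does not support the segment.
    """
    n = len(ops)
    # best[i] = (lowest cost found for ops[:i], plan achieving it)
    best = [(0.0, [])] + [(float("inf"), [])] * n
    for i in range(n):
        for j in range(i + 1, n + 1):
            segment = tuple(ops[i:j])
            for backend in ("tvm", "byoc"):
                cost = segment_cost(backend, segment)
                if cost is None:
                    continue
                total = best[i][0] + cost
                if total < best[j][0]:
                    best[j] = (total, best[i][1] + [(backend, segment)])
    return best[n]

# Made-up latencies: the hypothetical 'byoc' backend is fast on conv2d/relu
# (and cheaper still on the fused pair) but cannot run reshape at all.
COSTS = {
    ("tvm", ("conv2d",)): 5.0, ("tvm", ("relu",)): 1.0, ("tvm", ("reshape",)): 0.5,
    ("byoc", ("conv2d",)): 1.0, ("byoc", ("relu",)): 0.5,
    ("byoc", ("conv2d", "relu")): 1.2,
}

def segment_cost(backend, segment):
    return COSTS.get((backend, segment))

cost, plan = best_partitioning(["conv2d", "relu", "reshape"], segment_cost)
# With these costs the optimal plan mixes backends: byoc takes the fused
# conv2d+relu segment while tvm takes the reshape that byoc cannot run.
print(cost, plan)
```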
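The check-in plan above lists support utilities such as `Cost` and `CostEstimator`. As a TVM-independent sketch of the kind of memoizing latency estimator such a tuning search needs — the class name, key format, and repeat count here are all invented for illustration and are not the actual Collage classes:

```python
import statistics
import time

class ToyCostEstimator:
    """Illustrative sketch only; not the Collage `CostEstimator`.

    The search can rediscover the same candidate partition many times, so the
    estimator measures each candidate a few times and caches the result under
    a canonical key to avoid re-measurement.
    """

    def __init__(self, repeats=3):
        self.repeats = repeats
        self.cache = {}  # canonical candidate key -> estimated latency (seconds)

    def estimate(self, key, run_candidate):
        if key not in self.cache:
            samples = []
            for _ in range(self.repeats):
                start = time.perf_counter()
                run_candidate()  # stand-in for executing the compiled candidate
                samples.append(time.perf_counter() - start)
            # Median is robust against one-off scheduling hiccups.
            self.cache[key] = statistics.median(samples)
        return self.cache[key]

estimator = ToyCostEstimator(repeats=3)
runs = []
first = estimator.estimate("conv2d+relu@byoc", lambda: runs.append(1))
second = estimator.estimate("conv2d+relu@byoc", lambda: runs.append(1))
# The candidate only ran for the first estimate; the second hit the cache.
print(len(runs), first == second)
```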
