mbs-octoml commented on a change in pull request #62: URL: https://github.com/apache/tvm-rfcs/pull/62#discussion_r827415931
########## File path: rfcs/xxxx-collage.md ##########
@@ -0,0 +1,833 @@
+# Design Doc: Collage [Draft 0.7]
+
+```
+Feature Name: Collage
+Start Date: Mar 2022
+Authors: Mark Shields ([email protected])
+RFC PR: <tbd>
+GitHub Issue: <tbd>
+```
+
+This design doc (with accompanying
+['v2' prototype implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch))
+shows how to bring tuning to TVM's operator fusion and BYOC partitioning passes. The tuning search explores the choice
+of sub-graphs (aka 'partitions') as well as choice of toolchain (TVM native or one of the available BYOC integrations,
+aka 'backends') for each candidate kernel so as to minimize the expected model inference latency. We call the result
+an 'optimal partitioning'. This new tuning layer complements the tuning traditionally done by TVM and other toolchains
+during lowering. It can also complement any global tuning, for example to explore all possible global layouts.
+
+The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf):
+
+> *Collage: Automated Integration of Deep Learning Backends*
+> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia
+
+This tuning approach contrasts with TVM's existing "greedy" and "manual" approaches to fusion and BYOC:
+
+- Greedy: Currently only the largest possible supported sub-graphs are used for kernels, irrespective of their
+  execution time. With Collage many more candidate sub-graphs are explored, and it is possible for two smaller
+  sub-graphs to yield better overall latency than one large sub-graph if they mix toolchains.
+- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke the corresponding partitioning function
+  before the main TVM compilation flow proceeds. With Collage the choice of toolchain can be automated based on
+  measured latency. Collage will also explore mixing and matching between multiple BYOC toolchains as well as TVM's
+  native backend.
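The "Greedy" point above can be made concrete with a toy cost model. This is a pure-Python illustration with invented latency numbers (not measurements, and not the Collage data structures): a greedy partitioner commits to the one largest supported sub-graph, while a search over smaller candidates can discover a cheaper mixed-toolchain split.

```python
# Toy illustration (hypothetical numbers): a chain conv2d -> add -> relu.
# A greedy partitioner takes the largest supported sub-graph for one backend;
# a search may instead split the chain across backends if that is faster.

# Invented latencies (ms) for candidate sub-graph / backend pairs.
candidate_costs = {
    ("conv2d+add+relu", "tvm"): 1.8,  # one large fused TVM kernel
    ("conv2d", "cudnn"): 0.6,         # library call for just the conv
    ("add+relu", "tvm"): 0.4,         # small fused TVM kernel
}

greedy = candidate_costs[("conv2d+add+relu", "tvm")]
mixed = candidate_costs[("conv2d", "cudnn")] + candidate_costs[("add+relu", "tvm")]
assert mixed < greedy  # on this toy model the mixed partitioning wins
```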
+
+The design (when Collage is enabled) subsumes TVM's fixed `FuseOps` and BYOC-provided `partition_for_<toolchain>`
+operations (built using the `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes) with a
+single new `CollageFuseOps` pass. The pass is carefully engineered to build directly on the existing `"TOpPattern"`
+attributes (provided for every Relay operator and used by `FuseOps`), BYOC `"target.<toolchain>"` operator predicates
+(provided for some operator/toolchain pairs by 'operator-based' BYOC integrations) and BYOC operator
+patterns/predicates (registered in the pattern table by 'pattern-based' BYOC integrations). In this way only the more
+boilerplate aspects of existing BYOC integrations need to be adjusted to support Collage. The
+`partition_for_<toolchain>` operations are retained for users who wish to keep manual control.
+
+> NOTE: We'd like to coordinate these changes with the UMA project. Our aim in this design is to make the smallest
+> possible changes to BYOC. We think the changes described here can be easily reworked to follow any BYOC API
+> proposals settled on by UMA. See also "Related Work."
+
+Collage offers four advantages:
+
+- **Latency**: Overall model latency may be reduced compared to TVM native, TVM with a specific BYOC toolchain, or a
+  non-TVM compiler such as TensorRT.
+- **Automation**: The choice of which BYOC toolchains to enable can be automated.
+- **Economy of implementation**: Five standalone passes using three separate mechanisms for expressing fusion
+  rules/algorithms and implementing partitioning can be replaced with one, which itself is built from compositional
+  primitives.
+- **Decoupling**: It is ok for a candidate kernel found during search to turn out not to be valid for a toolchain
+  (even TVM's). Such candidates can be given 'infinite' cost and thus ignored during search. In this way we can avoid
+  tight coupling between backends and fusion rules.
+
+## FAQ
+
+Pending.
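The "Decoupling" point can be sketched in a few lines of pure Python (hypothetical helper names, not the actual Collage API): a candidate the backend rejects is assigned infinite cost, so the cost-minimizing search never selects it and no fusion rule needs to know the backend's exact validity constraints.

```python
import math

def estimated_cost(candidate, backend_is_valid, measure):
    # Sketch of the 'Decoupling' idea: a candidate kernel the toolchain
    # rejects gets infinite cost, so the search simply never picks it.
    # (backend_is_valid / measure are hypothetical stand-ins.)
    if not backend_is_valid(candidate):
        return math.inf
    return measure(candidate)

# Even a very slow valid candidate beats any invalid one.
cost_invalid = estimated_cost("bad", lambda c: False, lambda c: 0.1)
cost_valid = estimated_cost("ok", lambda c: True, lambda c: 5.0)
assert cost_valid < cost_invalid
```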
+
+## Success Metrics
+
+1. Collage offers at least a 10% latency improvement for a selection of standard ONNX models and NVIDIA hardware using
+   targets which include the CuDNN and CuBlas libraries, the CUTLASS library (with tuning, via BYOC), the TensorRT
+   compiler (via BYOC), and (obviously!) TVM native.
+2. Collage does not require new per-target or per-model patterns or rules to be implemented independently of the BYOC
+   integrations.
+3. Collage with just the native TVM and a single BYOC toolchain enabled is never worse than using the
+   existing `partition_for_<toolchain>` method in TVM today.
+
+## Project Milestones
+
+- [Done] M0: Port paper prototype to recent TVM main and validate paper results.
+- [Done] M1: Internal design doc.
+- [Done] M2: Use 'v2' prototype to test design doc, and rework ready for TVM community.
+- [In progress] M3: RFC.
+- [2022Q1] M4: Re-validate results on 'v2' prototype for larger models (eg GPT2) and more NVIDIA targets.
+- [2022Q2] M5: Implementation in TVM main, including 'sub-projects' listed below.
+- [OctoML internal] M6: Estimator integrated into OctoML platform, validation against OctoML test suite.
+- [OctoML internal] M7: Productionization for OctoML.
+
+## Check-in plan
+
+Though the 'v2' prototype is in a personal branch we'd like to transition to main ASAP and rely on directory/namespace
+separation, maintaining backwards compat, and a new `PassConfig` flag to isolate all Collage changes from the rest of
+TVM. A rough PR progression is:
+
+- TensorRT and CUTLASS BYOC changes are backwards compat. The existing `partition_for_X` functions remain. The
+  CUTLASS-specific tuning and codegen functions will either continue to be supported or we'll work with users to
+  account for them being folded into the function-at-a-time `relay.ext.cutlass` codegen function.
+- The `DFPattern` and friends changes are mostly just for improving the robustness of the
+  `IndexedGraph<T>` class and can go into main independently.
+- Some basic `Expr` improvements can go into main independently.
+- The design allows for multiple `Target`s for the same `DLDeviceType`. That requires the various
+  `build` interfaces which currently accept `Union[Target,Dict]` to also accept a list of `Target`s, and can be
+  backwards compat.
+- The new Collage code can go in bottom-up as we develop unit tests:
+  - Support utils, including `NameSupply`, `IndexSet`, `PriorityQueue`, `Cost`, `CostEstimator`.
+  - The core `SubGraph` datatype.
+  - `CandidateKernel`.
+  - The `FusionRule` class hierarchy (which itself can be broken into sub-PRs).
+  - `FusionSpec`.
+  - `GatherFusionSpecs` helper for bridging the existing BYOC world with the Collage 'FusionRule' world.
+  - The `CollageFuseOps` driver pass itself.
+
+## Related Work
+
+- The [Cascading Scheduler](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0037-arm-ethosu-cascading-scheduler.md)
+  combines i) dynamic-programming to find an optimal grouping of TE sub-expressions, ii) an analytic model of cost to
+  guide the search, and iii) cascading scheduling of the TE sub-expressions so as to reduce the memory high-watermark.
+  By contrast Collage i) also uses dynamic-programming, but to find an optimal grouping of Relay sub-expressions, ii)
+  uses measurement to guide the search, and iii) assumes the toolchain will 'do its best' with the sub-graph offered
+  to it.
+- The [Universal Modular Accelerator Interface](https://github.com/apache/tvm-rfcs/pull/60) proposal
+  adds a layer on top of the existing and separate TVM BYOC, operator strategy, operator scheduling,
+  target-specific passes and target-specific code generation extension points. Collage currently relies
+  only on the global pattern registry and global `relay.ext.<toolchain>` function to integrate with BYOC
+  integrations, but this is trivial to change should this project change the source of truth.
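The dynamic-programming search mentioned above can be sketched as a pure-Python toy (hypothetical data and function names, not the real Collage implementation, which searches over dataflow graphs rather than chains): `candidates` maps a `(start, end)` span of a linearized model to the best cost any backend achieved for that sub-graph, and the program finds the cheapest non-overlapping cover.

```python
import math

def optimal_partition_cost(num_nodes, candidates):
    # Toy dynamic program over a linearized model: best[i] is the cheapest
    # way to cover nodes 0..i-1 with non-overlapping candidate kernels.
    best = [math.inf] * (num_nodes + 1)
    best[0] = 0.0
    for end in range(1, num_nodes + 1):
        for (start, stop), cost in candidates.items():
            if stop == end and best[start] + cost < best[end]:
                best[end] = best[start] + cost
    return best[num_nodes]

# Three nodes; one big candidate vs. two smaller ones from different backends.
candidates = {(0, 3): 1.8, (0, 1): 0.6, (1, 3): 0.4}
assert optimal_partition_cost(3, candidates) == 1.0
```

If the single large candidate were cheaper (say cost 0.5), the same search would pick it instead; the point is that the choice falls out of measured costs rather than a fixed greedy rule.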
+
+## Example
+
+We start with `mod` bound to [MNIST](https://github.com/onnx/models/tree/main/vision/classification/mnist):
+
+```
+fn (%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+  %0 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %1 = nn.conv2d(%0, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/,
+                 padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5]);
+  %2 = add(%1, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/);
+  %3 = nn.relu(%2);
+  %4 = nn.max_pool2d(%3, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]);
+  %5 = nn.pad(%4, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]);
+  %6 = nn.conv2d(%5, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/,
+                 padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]);
+  %7 = add(%6, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/);
+  %8 = nn.relu(%7);
+  %9 = nn.max_pool2d(%8, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]);
+  %10 = reshape(%9, newshape=[1, 256]);
+  %11 = nn.dense(%10, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, units=None, out_dtype="float32");
+  add(%11, meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/)
+}
+```
+
+We can compile this with Collage enabled for a variety of NVIDIA toolchains/libraries as follows:
+
+```
+with tvm.transform.PassContext(config={"relay.fallback_device_type": 2, "relay.collage.enable_collage": True}):
+    host_target = tvm.target.Target("llvm")
+    generic_target = tvm.target.Target("cuda", host_target)
+    cutlass_target = tvm.target.Target("cuda -compiler=cutlass", host_target)
+    tensorrt_target = tvm.target.Target("cuda -compiler=tensorrt", host_target)
+    cudnn_target = tvm.target.Target("cuda -libs=cudnn", host_target)
+    cublas_target = tvm.target.Target("cuda -libs=cublas", host_target)
+    targets = [generic_target, cutlass_target, tensorrt_target, cudnn_target, cublas_target]
+    exe = tvm.relay.vm.compile(mod, target=targets)

Review comment:

   Thanks.
   Ideally I'd not need to innovate in target handling, so if there's an easier path I'm all for it. I do see
   https://discuss.tvm.apache.org/t/rfc-composite-target/7744/10, which explains the intent is similar to what I need
   here, but I don't see any downstream handling of it -- do you have any pointers? There's also been quite a bit of
   discussion lately about handling multiple targets which I've not followed; perhaps there's a summary of that?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
