This is an automated email from the ASF dual-hosted git repository.

lmzheng pushed a commit to branch pr-ansor-blog
in repository https://gitbox.apache.org/repos/asf/tvm-site.git
commit 8f416972a358de045c81fcd78adf8b6bde38f035
Author: Lianmin Zheng <[email protected]>
AuthorDate: Mon Mar 1 17:23:17 2021 -0800

    add ansor blog
---
 _posts/2021-03-01-intro-auto-scheduler.md       | 131 ++++++++++++++++++++++++
 images/intro-auto-scheduler/code_perf.png       | Bin 0 -> 36724 bytes
 images/intro-auto-scheduler/search_overview.png | Bin 0 -> 433415 bytes
 images/intro-auto-scheduler/search_time.png     | Bin 0 -> 45583 bytes
 images/intro-auto-scheduler/workflow.png        | Bin 0 -> 1014076 bytes
 5 files changed, 131 insertions(+)

diff --git a/_posts/2021-03-01-intro-auto-scheduler.md b/_posts/2021-03-01-intro-auto-scheduler.md
new file mode 100644
index 0000000..dfdad51
--- /dev/null
+++ b/_posts/2021-03-01-intro-auto-scheduler.md

---
layout: post
title: Introducing TVM Auto-scheduler (a.k.a. Ansor)
date: 2021-03-01
author: Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu
---

Optimizing the execution speed of deep neural networks is extremely hard given the growing
model sizes, operator diversity, and hardware heterogeneity.
From a computational perspective, deep neural networks are just layers and layers of tensor computations.
These tensor computations, such as matmul and conv2d, can be easily described by mathematical expressions.
However, providing high-performance implementations for them on modern hardware can be very challenging.
We have to apply various low-level optimizations and utilize special hardware intrinsics to achieve high performance.
It takes a huge engineering effort to build linear algebra and neural network acceleration libraries such as cuBLAS, cuDNN, oneMKL, and oneDNN.

Life would be much easier if we could just write down the mathematical expressions and have something
magically turn them into efficient code implementations.
Three years ago, we built AutoTVM as the first step towards this goal.
AutoTVM employs a template-based search algorithm to find efficient implementations for a given tensor computation.
However, because it is a template-based approach, it still requires domain experts to implement a non-trivial manual template
for every operator on every platform.
Today, there are more than 15k lines of code for these templates in the TVM code repository.
Besides being very hard to develop, these templates often have limited search spaces,
making them unable to achieve optimal performance.

To address the limitations of AutoTVM, we started the Ansor project to build a fully automated auto-scheduler for code generation.
The Ansor auto-scheduler takes only tensor expressions as input and generates high-performance code without manual templates.
We made innovations in both the search space construction and the search algorithm.
As a result, the auto-scheduler can achieve better performance with less search time in a more automated way.

The Ansor auto-scheduler is now integrated into Apache TVM as the `tvm.auto_scheduler` package.
Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and Mali GPUs on the TVM website [1].
In this blog post, we will give a high-level introduction and show some benchmark results.

# System Overview

## AutoTVM vs. Auto-scheduler
{:center: style="text-align: center"}
![image](/images/intro-auto-scheduler/workflow.png){: width="75%"}
{:center}
<center> Table 1. Workflow Comparison </center> <p></p>

Table 1 compares the workflows for generating code for an operator with AutoTVM and with the auto-scheduler.
With AutoTVM, the developer has to go through three steps.
In step 1, the developer writes the compute definition in TVM's tensor expression language.
This part is relatively easy because TVM's tensor expression language reads just like a math expression.
In step 2, the developer writes a schedule template, which typically consists of 20-100 lines of tricky DSL code.
This part requires domain expertise in both the target hardware architecture and the operator semantics, so it is difficult.
The last step, step 3, is automated by a search algorithm.
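To make steps 1 and 2 concrete, here is a minimal sketch of a matmul in this style, modeled on the public TVM tutorials; the template name `"example/matmul"` and the two tiling knobs are illustrative choices, not the actual templates shipped with TVM.

```python
import tvm
from tvm import te, autotvm

# Step 1: the compute definition in TVM's tensor expression language.
# It reads just like the math: C[i, j] = sum_k A[i, k] * B[k, j].
@autotvm.template("example/matmul")  # illustrative template name
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    # Step 2 (AutoTVM only): a manual schedule template. The developer decides
    # which loop transformations to apply and which knobs to expose for tuning.
    s = te.create_schedule(C.op)
    y, x = s[C].op.axis
    (kr,) = s[C].op.reduce_axis

    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)  # tunable tiling factors
    cfg.define_split("tile_x", x, num_outputs=2)
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, kr, yi, xi)
    return s, [A, B, C]
```

With the auto-scheduler, only the compute definition at the top is required; the tiling and reordering decisions that the template hand-codes are instead derived by the automatic search space construction described next.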
With the auto-scheduler, we eliminate the most difficult step 2 through automatic search space construction, and we accelerate step 3 with a better search algorithm.
By constructing the search space automatically, we not only eliminate a huge amount of manual effort
but also enable the exploration of many more combinations of optimizations.
This automation does not come for free, because we still need to design rules to generate the search space.
However, these rules are very general: they are based on static analysis of the tensor expressions.
We only need to design a few general rules once, and we can then apply them to almost all tensor computations in deep learning.

## Search Process
{:center: style="text-align: center"}
![image](/images/intro-auto-scheduler/search_overview.png){: width="40%"}
{:center}
<center> Figure 1. Search Process Overview </center> <p></p>

Figure 1 shows the search process of the auto-scheduler when optimizing a whole neural network.
The system takes deep learning models as input.
It then partitions the big model into small subgraphs with Relay's operator fusion pass.
A task scheduler is used to allocate the time budget for optimizing the many subgraphs.
At each iteration, it picks the subgraph with the most potential to improve the end-to-end performance.
For this subgraph, we analyze its tensor expression and generate several sketches for it.
Then we run an evolutionary search with a learned cost model to get a batch of optimized programs.
The optimized programs are sent to actual hardware for measurement.
When the measurements are finished, the profiling results are used as feedback to update all components of the system.
This process is repeated iteratively until the optimization converges or we run out of time budget.
More technical details can be found in our paper [3] and our code.

The auto-scheduler reuses the existing computation definitions in TOPI but does not use any schedule templates.
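For reference, the sketch below shows one way to drive this whole-network search from Python, following the patterns in the public `tvm.auto_scheduler` tutorials; the Relay module `mod`, its `params`, the target, the log-file name, and the trial count are placeholders to substitute for your own model and hardware.

```python
import tvm
from tvm import relay, auto_scheduler

# `mod` and `params` are assumed to be a Relay module and its weights,
# e.g. imported from PyTorch or TensorFlow via the Relay frontends.
target = tvm.target.Target("llvm")  # placeholder hardware target
log_file = "network_tuning.json"    # placeholder log file

# Partition the network into subgraphs and create one search task per subgraph.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

# The task scheduler allocates measurement trials across the tasks,
# prioritizing the subgraphs with the most end-to-end potential.
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(
    auto_scheduler.TuningOptions(
        num_measure_trials=20000,  # total trials shared by all tasks
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
)

# Compile the network with the best schedules found during the search.
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
```

A single operator can be tuned the same way by wrapping its compute definition with `auto_scheduler.register_workload` and creating an `auto_scheduler.SearchTask` directly.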
# Benchmark Results
In this section, we benchmark the performance of AutoTVM and the auto-scheduler.
The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped with an 18-core Intel Skylake 8124M CPU.
The GPU benchmark is done on an AWS g4dn.4xlarge instance, which is equipped with an NVIDIA T4 GPU.
All benchmark code, raw data, and tuning logs can be found in this repo [2].

## Performance of the generated code
We benchmark the fp32 single-batch inference latency on three networks.
Figure 2 shows the relative speedup the auto-scheduler achieves over AutoTVM.
The auto-scheduler outperforms AutoTVM in all cases, with speedups ranging from 1.02x to 8.95x.
This is because the auto-scheduler explores a larger search space, so it can find more efficient combinations
of optimizations that are missed by the manual templates.
BERT-base@GPU is an extreme case in which the manual templates are very badly designed:
the manual template for dense layers does not perform well for the shapes in the BERT model.

{:center: style="text-align: center"}
![image](/images/intro-auto-scheduler/code_perf.png){: width="85%"}
{:center}
<center> Figure 2. Code Performance Comparison (Higher is better) </center> <p></p>

## Search Time
Search-based approaches can be very time-consuming, so we also care about the search time.
It typically takes several hours for the search to converge for a single neural network.
Figure 3 compares the search time of AutoTVM and the auto-scheduler.
The auto-scheduler requires much less time to converge in most cases, despite its larger search space.
This is because the auto-scheduler has a better cost model and task scheduler.

{:center: style="text-align: center"}
![image](/images/intro-auto-scheduler/search_time.png){: width="85%"}
{:center}
<center> Figure 3. Search Time Comparison (Lower is better) </center> <p></p>

## More Results
The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
You can find results for more libraries and backends in our paper [3].
Recently, this blog post [4] also tried the auto-scheduler on an Apple M1 chip and got some good results.

# Conclusion
We built the TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
Compared with its predecessor, AutoTVM, the auto-scheduler does not require manual templates.
Moreover, it generates better code with less search time.
We achieved this by making innovations in both the search space construction and the search algorithm.

We are excited about the current performance of the auto-scheduler.
In the future, we are interested in extending the auto-scheduler to better support
sparse operators, low-precision operators, and dynamic shapes.

# Links
[1] Tutorials: [https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling](https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling)
[2] Benchmark repo: [https://github.com/tlc-pack/TLCBench](https://github.com/tlc-pack/TLCBench)
[3] OSDI paper: [Ansor: Generating High-Performance Tensor Programs for Deep Learning](https://arxiv.org/abs/2006.06762)
[4] Results on the Apple M1 chip: [https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d](https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d)

diff --git a/images/intro-auto-scheduler/code_perf.png b/images/intro-auto-scheduler/code_perf.png
new file mode 100644
index 0000000..d070a6e
Binary files /dev/null and b/images/intro-auto-scheduler/code_perf.png differ
diff --git a/images/intro-auto-scheduler/search_overview.png b/images/intro-auto-scheduler/search_overview.png
new file mode 100644
index 0000000..7b6f56d
Binary files /dev/null and b/images/intro-auto-scheduler/search_overview.png differ
diff --git a/images/intro-auto-scheduler/search_time.png b/images/intro-auto-scheduler/search_time.png
new file mode 100644
index 0000000..4bd700b
Binary files /dev/null and b/images/intro-auto-scheduler/search_time.png differ
diff --git a/images/intro-auto-scheduler/workflow.png b/images/intro-auto-scheduler/workflow.png
new file mode 100644
index 0000000..b2c7b26
Binary files /dev/null and b/images/intro-auto-scheduler/workflow.png differ
