This is an automated email from the ASF dual-hosted git repository.

lmzheng pushed a commit to branch pr-ansor-blog
in repository https://gitbox.apache.org/repos/asf/tvm-site.git
commit 8f416972a358de045c81fcd78adf8b6bde38f035
Author: Lianmin Zheng <[email protected]>
AuthorDate: Mon Mar 1 17:23:17 2021 -0800

    add ansor blog
---
 _posts/2021-03-01-intro-auto-scheduler.md       | 131 ++++++++++++++++++++++++
 images/intro-auto-scheduler/code_perf.png       | Bin 0 -> 36724 bytes
 images/intro-auto-scheduler/search_overview.png | Bin 0 -> 433415 bytes
 images/intro-auto-scheduler/search_time.png     | Bin 0 -> 45583 bytes
 images/intro-auto-scheduler/workflow.png        | Bin 0 -> 1014076 bytes
 5 files changed, 131 insertions(+)

diff --git a/_posts/2021-03-01-intro-auto-scheduler.md b/_posts/2021-03-01-intro-auto-scheduler.md
new file mode 100644
index 0000000..dfdad51
--- /dev/null
+++ b/_posts/2021-03-01-intro-auto-scheduler.md

---
layout: post
title: Introducing TVM Auto-scheduler (a.k.a. Ansor)
date: 2021-03-01
author: Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu
---

Optimizing the execution speed of deep neural networks is extremely hard given the growing
model sizes, operator diversity, and hardware heterogeneity.
From a computational perspective, deep neural networks are just layers and layers of tensor computations.
These tensor computations, such as matmul and conv2d, can be easily described by mathematical expressions.
However, providing high-performance implementations for them on modern hardware can be very challenging.
We have to apply various low-level optimizations and utilize special hardware intrinsics to achieve high performance.
It takes a huge engineering effort to build linear algebra and neural network acceleration libraries such as cuBLAS, cuDNN, oneMKL, and oneDNN.

Life would be much easier if we could just write down the mathematical expressions and have something
magically turn them into efficient code implementations.
Three years ago, we built AutoTVM as the first step towards this goal.
AutoTVM employs a template-based search algorithm to find efficient implementations for a given tensor computation.
However, because it is a template-based approach, it still requires domain experts to implement a non-trivial manual template
for every operator on every platform.
Today, there are more than 15k lines of code for these templates in the TVM code repository.
Besides being very hard to develop, these templates often have limited search spaces,
making them unable to achieve optimal performance.

To address the limitations of AutoTVM, we started the Ansor project to build a fully automated auto-scheduler for code generation.
The Ansor auto-scheduler takes only tensor expressions as input and generates high-performance code without manual templates.
We made innovations in both the search space construction and the search algorithm.
As a result, the auto-scheduler can achieve better performance with less search time in a more automated way.

The Ansor auto-scheduler is now integrated into Apache TVM as the `tvm.auto_scheduler` package.
Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and Mali GPUs on the TVM website [1].
In this blog post, we will give a high-level introduction and show some benchmark results.

# System Overview

## AutoTVM vs. Auto-scheduler
{:center: style="text-align: center"}
![image](/images/intro-auto-scheduler/workflow.png){: width="75%"}
{:center}
<center> Table 1. Workflow Comparison </center> <p></p>

Table 1 compares the workflows for generating code for an operator with AutoTVM and with the auto-scheduler.
With AutoTVM, the developer has to go through three steps.
In step 1, the developer writes the compute definition in TVM's tensor expression language.
This part is relatively easy because TVM's tensor expression language reads just like a math expression.
In step 2, the developer writes a schedule template, which typically consists of 20-100 lines of tricky DSL code.
This part requires domain expertise in both the target hardware architecture and the operator semantics, so it is difficult.
The last step, step 3, is automated by a search algorithm.
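To make steps 1 and 2 concrete, here is a minimal sketch of a matmul in this style, modeled on the public TVM tutorials; the template name `"example/matmul"` and the two tiling knobs are illustrative choices, not the actual templates shipped with TVM.

```python
import tvm
from tvm import te, autotvm

# Step 1: the compute definition in TVM's tensor expression language.
# It reads just like the math: C[i, j] = sum_k A[i, k] * B[k, j].
@autotvm.template("example/matmul")  # illustrative template name
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    # Step 2 (AutoTVM only): a manual schedule template. The developer decides
    # which loop transformations to apply and which knobs to expose for tuning.
    s = te.create_schedule(C.op)
    y, x = s[C].op.axis
    (kr,) = s[C].op.reduce_axis

    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)  # tunable tiling factors
    cfg.define_split("tile_x", x, num_outputs=2)
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, kr, yi, xi)
    return s, [A, B, C]
```

With the auto-scheduler, only the compute definition at the top is required; the tiling and reordering decisions that the template hand-codes are instead derived by the automatic search space construction described next.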
With the auto-scheduler, we eliminate the most difficult step 2 through automatic search space construction, and we accelerate step 3 with a better search algorithm.
By constructing the search space automatically, we not only eliminate a huge amount of manual effort
but also enable the exploration of many more combinations of optimizations.
This automation does not come for free, because we still need to design rules to generate the search space.
However, these rules are very general: they are based on static analysis of the tensor expressions.
We only need to design a few general rules once, and we can then apply them to almost all tensor computations in deep learning.

## Search Process
{:center: style="text-align: center"}
![image](/images/intro-auto-scheduler/search_overview.png){: width="40%"}
{:center}
<center> Figure 1. Search Process Overview </center> <p></p>

Figure 1 shows the search process of the auto-scheduler when optimizing a whole neural network.
The system takes deep learning models as input.
It then partitions the big model into small subgraphs with Relay's operator fusion pass.
A task scheduler is used to allocate the time budget for optimizing the many subgraphs.
At each iteration, it picks the subgraph with the most potential to improve the end-to-end performance.
For this subgraph, we analyze its tensor expression and generate several sketches for it.
Then we run an evolutionary search with a learned cost model to get a batch of optimized programs.
The optimized programs are sent to actual hardware for measurement.
When the measurements are finished, the profiling results are used as feedback to update all components of the system.
This process is repeated iteratively until the optimization converges or we run out of time budget.
More technical details can be found in our paper [3] and our code.

The auto-scheduler reuses the existing computation definitions in TOPI but does not use any schedule templates.
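For reference, the sketch below shows one way to drive this whole-network search from Python, following the patterns in the public `tvm.auto_scheduler` tutorials; the Relay module `mod`, its `params`, the target, the log-file name, and the trial count are placeholders to substitute for your own model and hardware.

```python
import tvm
from tvm import relay, auto_scheduler

# `mod` and `params` are assumed to be a Relay module and its weights,
# e.g. imported from PyTorch or TensorFlow via the Relay frontends.
target = tvm.target.Target("llvm")  # placeholder hardware target
log_file = "network_tuning.json"    # placeholder log file

# Partition the network into subgraphs and create one search task per subgraph.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

# The task scheduler allocates measurement trials across the tasks,
# prioritizing the subgraphs with the most end-to-end potential.
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(
    auto_scheduler.TuningOptions(
        num_measure_trials=20000,  # total trials shared by all tasks
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
)

# Compile the network with the best schedules found during the search.
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
```

A single operator can be tuned the same way by wrapping its compute definition with `auto_scheduler.register_workload` and creating an `auto_scheduler.SearchTask` directly.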
# Benchmark Results
In this section, we benchmark the performance of AutoTVM and the auto-scheduler.
The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped with an 18-core Intel Skylake 8124M CPU.
The GPU benchmark is done on an AWS g4dn.4xlarge instance, which is equipped with an NVIDIA T4 GPU.
All benchmark code, raw data, and tuning logs can be found in this repo [2].

## Performance of the generated code
We benchmark the fp32 single-batch inference latency on three networks.
Figure 2 shows the relative speedup the auto-scheduler achieves over AutoTVM.
The auto-scheduler outperforms AutoTVM in all cases, with speedups ranging from 1.02x to 8.95x.
This is because the auto-scheduler explores a larger search space, so it can find more efficient combinations
of optimizations that are missed by the manual templates.
BERT-base@GPU is an extreme case in which the manual templates are very badly designed:
the manual template for dense layers does not perform well for the shapes in the BERT model.

{:center: style="text-align: center"}
![image](/images/intro-auto-scheduler/code_perf.png){: width="85%"}
{:center}
<center> Figure 2. Code Performance Comparison (Higher is better) </center> <p></p>

## Search Time
Search-based approaches can be very time-consuming, so we also care about the search time.
It typically takes several hours for the search to converge for a single neural network.
Figure 3 compares the search time of AutoTVM and the auto-scheduler.
The auto-scheduler requires much less time to converge in most cases, despite its larger search space.
This is because the auto-scheduler has a better cost model and task scheduler.

{:center: style="text-align: center"}
![image](/images/intro-auto-scheduler/search_time.png){: width="85%"}
{:center}
<center> Figure 3. Search Time Comparison (Lower is better) </center> <p></p>

## More Results
The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
You can find results for more libraries and backends in our paper [3].
Recently, this blog post [4] also tried the auto-scheduler on an Apple M1 chip and got some good results.

# Conclusion
We built the TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
Compared with its predecessor, AutoTVM, the auto-scheduler does not require manual templates.
Moreover, it generates better code with less search time.
We achieved this by making innovations in both the search space construction and the search algorithm.

We are excited about the current performance of the auto-scheduler.
In the future, we are interested in extending the auto-scheduler to better support
sparse operators, low-precision operators, and dynamic shapes.

# Links
[1] Tutorials: [https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling](https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling)
[2] Benchmark repo: [https://github.com/tlc-pack/TLCBench](https://github.com/tlc-pack/TLCBench)
[3] OSDI paper: [Ansor: Generating High-Performance Tensor Programs for Deep Learning](https://arxiv.org/abs/2006.06762)
[4] Results on the Apple M1 chip: [https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d](https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d)

diff --git a/images/intro-auto-scheduler/code_perf.png b/images/intro-auto-scheduler/code_perf.png
new file mode 100644
index 0000000..d070a6e
Binary files /dev/null and b/images/intro-auto-scheduler/code_perf.png differ
diff --git a/images/intro-auto-scheduler/search_overview.png b/images/intro-auto-scheduler/search_overview.png
new file mode 100644
index 0000000..7b6f56d
Binary files /dev/null and b/images/intro-auto-scheduler/search_overview.png differ
diff --git a/images/intro-auto-scheduler/search_time.png b/images/intro-auto-scheduler/search_time.png
new file mode 100644
index 0000000..4bd700b
Binary files /dev/null and b/images/intro-auto-scheduler/search_time.png differ
diff --git a/images/intro-auto-scheduler/workflow.png b/images/intro-auto-scheduler/workflow.png
new file mode 100644
index 0000000..b2c7b26
Binary files /dev/null and b/images/intro-auto-scheduler/workflow.png differ
