This is an automated email from the ASF dual-hosted git repository.
lmzheng pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/tvm-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 8828684 Build at Wed Mar 3 01:20:50 PST 2021
8828684 is described below
commit 88286848906663319587f02c11f37dd2fe696f30
Author: Lianmin Zheng <[email protected]>
AuthorDate: Wed Mar 3 01:20:50 2021 -0800
Build at Wed Mar 3 01:20:50 PST 2021
---
2017/08/17/tvm-release-announcement.html | 2 +-
...s-with-TVM-A-Depthwise-Convolution-Example.html | 2 +-
2017/10/06/nnvm-compiler-announcement.html | 2 +-
...s-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html | 2 +-
2017/11/08/android-rpc-introduction.html | 2 +-
2018/01/16/opt-mali-gpu.html | 2 +-
2018/03/12/webgl.html | 2 +-
2018/03/23/nmt-transformer-optimize.html | 2 +-
2018/07/12/vta-release-announcement.html | 2 +-
2018/08/10/DLPack-Bridge.html | 2 +-
2018/10/03/auto-opt-all.html | 2 +-
2018/10/09/ml-in-tees.html | 2 +-
2018/12/18/lowprecision-conv.html | 2 +-
2019/01/19/Golang.html | 2 +-
2019/03/18/tvm-apache-announcement.html | 2 +-
2019/04/29/opt-cuda-quantized.html | 2 +-
2019/05/30/pytorch-frontend.html | 2 +-
...machine-learning-to-webassembly-and-webgpu.html | 2 +-
2020/06/04/tinyml-how-tvm-is-taming-tiny.html | 2 +-
2020/07/14/bert-pytorch-tvm.html | 2 +-
.../15/how-to-bring-your-own-codegen-to-tvm.html | 2 +-
2020/09/26/bring-your-own-datatypes.html | 2 +-
2021/03/03/intro-auto-scheduler.html | 321 +++++++++++++++++++++
atom.xml | 253 +++++++++-------
blog.html | 10 +
community.html | 4 +
feed.xml | 291 +++++++++----------
images/community/sjtu.png | Bin 0 -> 236508 bytes
images/intro-auto-scheduler/code_perf.png | Bin 0 -> 36724 bytes
images/intro-auto-scheduler/search_overview.png | Bin 0 -> 433415 bytes
images/intro-auto-scheduler/search_time.png | Bin 0 -> 45583 bytes
images/intro-auto-scheduler/workflow.png | Bin 0 -> 1014076 bytes
rss.xml | 255 +++++++++-------
sitemap.txt | 1 +
34 files changed, 789 insertions(+), 390 deletions(-)
diff --git a/2017/08/17/tvm-release-announcement.html
b/2017/08/17/tvm-release-announcement.html
index ea95cf0..dbd65e1 100644
--- a/2017/08/17/tvm-release-announcement.html
+++ b/2017/08/17/tvm-release-announcement.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>TVM: An End to End IR Stack for Deploying Deep Learning Workloads on
Hardware Platforms </h1>
<p class="post-meta">
- <time datetime="2017-08-17T15:00:00-04:00" itemprop="datePublished">
+ <time datetime="2017-08-17T12:00:00-07:00" itemprop="datePublished">
Aug 17, 2017
</time>
diff --git
a/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
b/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
index 96b2e16..13a15a3 100644
---
a/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
+++
b/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Optimize Deep Learning GPU Operators with TVM: A Depthwise
Convolution Example </h1>
<p class="post-meta">
- <time datetime="2017-08-22T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2017-08-22T00:00:00-07:00" itemprop="datePublished">
Aug 22, 2017
</time>
diff --git a/2017/10/06/nnvm-compiler-announcement.html
b/2017/10/06/nnvm-compiler-announcement.html
index 40557e0..b627ca6 100644
--- a/2017/10/06/nnvm-compiler-announcement.html
+++ b/2017/10/06/nnvm-compiler-announcement.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>NNVM Compiler: Open Compiler for AI Frameworks </h1>
<p class="post-meta">
- <time datetime="2017-10-06T11:30:00-04:00" itemprop="datePublished">
+ <time datetime="2017-10-06T08:30:00-07:00" itemprop="datePublished">
Oct 6, 2017
</time>
diff --git
a/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
b/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
index 06f20bd..e6a6c2f 100644
--- a/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
+++ b/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Bringing AMDGPUs to TVM Stack and NNVM Compiler with ROCm </h1>
<p class="post-meta">
- <time datetime="2017-10-30T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2017-10-30T00:00:00-07:00" itemprop="datePublished">
Oct 30, 2017
</time>
diff --git a/2017/11/08/android-rpc-introduction.html
b/2017/11/08/android-rpc-introduction.html
index 7d15d82..f7e34b5 100644
--- a/2017/11/08/android-rpc-introduction.html
+++ b/2017/11/08/android-rpc-introduction.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Remote Profile and Test Deep Learning Cross Compilation on Mobile
Phones with TVM RPC </h1>
<p class="post-meta">
- <time datetime="2017-11-08T00:00:00-05:00" itemprop="datePublished">
+ <time datetime="2017-11-08T00:00:00-08:00" itemprop="datePublished">
Nov 8, 2017
</time>
diff --git a/2018/01/16/opt-mali-gpu.html b/2018/01/16/opt-mali-gpu.html
index a039779..40fc7f0 100644
--- a/2018/01/16/opt-mali-gpu.html
+++ b/2018/01/16/opt-mali-gpu.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Optimizing Mobile Deep Learning on ARM GPU with TVM </h1>
<p class="post-meta">
- <time datetime="2018-01-16T00:00:00-05:00" itemprop="datePublished">
+ <time datetime="2018-01-16T00:00:00-08:00" itemprop="datePublished">
Jan 16, 2018
</time>
diff --git a/2018/03/12/webgl.html b/2018/03/12/webgl.html
index 792c922..74313b5 100644
--- a/2018/03/12/webgl.html
+++ b/2018/03/12/webgl.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Compiling Deep Learning Models to WebGL with TVM </h1>
<p class="post-meta">
- <time datetime="2018-03-12T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2018-03-12T00:00:00-07:00" itemprop="datePublished">
Mar 12, 2018
</time>
diff --git a/2018/03/23/nmt-transformer-optimize.html
b/2018/03/23/nmt-transformer-optimize.html
index 2182327..35c211a 100644
--- a/2018/03/23/nmt-transformer-optimize.html
+++ b/2018/03/23/nmt-transformer-optimize.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Bringing TVM into TensorFlow for Optimizing Neural Machine
Translation on GPU </h1>
<p class="post-meta">
- <time datetime="2018-03-23T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2018-03-23T00:00:00-07:00" itemprop="datePublished">
Mar 23, 2018
</time>
diff --git a/2018/07/12/vta-release-announcement.html
b/2018/07/12/vta-release-announcement.html
index c60a3e1..1250749 100644
--- a/2018/07/12/vta-release-announcement.html
+++ b/2018/07/12/vta-release-announcement.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>VTA: An Open, Customizable Deep Learning Acceleration Stack </h1>
<p class="post-meta">
- <time datetime="2018-07-12T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2018-07-12T00:00:00-07:00" itemprop="datePublished">
Jul 12, 2018
</time>
diff --git a/2018/08/10/DLPack-Bridge.html b/2018/08/10/DLPack-Bridge.html
index 7ec1aaa..af4d193 100644
--- a/2018/08/10/DLPack-Bridge.html
+++ b/2018/08/10/DLPack-Bridge.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Building a Cross-Framework Deep Learning Compiler via DLPack </h1>
<p class="post-meta">
- <time datetime="2018-08-10T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2018-08-10T00:00:00-07:00" itemprop="datePublished">
Aug 10, 2018
</time>
diff --git a/2018/10/03/auto-opt-all.html b/2018/10/03/auto-opt-all.html
index 98269c7..ac36190 100644
--- a/2018/10/03/auto-opt-all.html
+++ b/2018/10/03/auto-opt-all.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Automatic Kernel Optimization for Deep Learning on All Hardware
Platforms </h1>
<p class="post-meta">
- <time datetime="2018-10-03T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2018-10-03T00:00:00-07:00" itemprop="datePublished">
Oct 3, 2018
</time>
diff --git a/2018/10/09/ml-in-tees.html b/2018/10/09/ml-in-tees.html
index 992e1a3..0f59a69 100644
--- a/2018/10/09/ml-in-tees.html
+++ b/2018/10/09/ml-in-tees.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Efficient Privacy-Preserving ML Using TVM </h1>
<p class="post-meta">
- <time datetime="2018-10-09T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2018-10-09T00:00:00-07:00" itemprop="datePublished">
Oct 9, 2018
</time>
diff --git a/2018/12/18/lowprecision-conv.html
b/2018/12/18/lowprecision-conv.html
index c5def47..f32251d 100644
--- a/2018/12/18/lowprecision-conv.html
+++ b/2018/12/18/lowprecision-conv.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Automating Generation of Low Precision Deep Learning Operators </h1>
<p class="post-meta">
- <time datetime="2018-12-18T00:00:00-05:00" itemprop="datePublished">
+ <time datetime="2018-12-18T00:00:00-08:00" itemprop="datePublished">
Dec 18, 2018
</time>
diff --git a/2019/01/19/Golang.html b/2019/01/19/Golang.html
index 27a39f0..6b8b94a 100644
--- a/2019/01/19/Golang.html
+++ b/2019/01/19/Golang.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>TVM Golang Runtime for Deep Learning Deployment </h1>
<p class="post-meta">
- <time datetime="2019-01-19T00:00:00-05:00" itemprop="datePublished">
+ <time datetime="2019-01-19T00:00:00-08:00" itemprop="datePublished">
Jan 19, 2019
</time>
diff --git a/2019/03/18/tvm-apache-announcement.html
b/2019/03/18/tvm-apache-announcement.html
index 386de84..19b5017 100644
--- a/2019/03/18/tvm-apache-announcement.html
+++ b/2019/03/18/tvm-apache-announcement.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>TVM Deep Learning Compiler Joins Apache Software Foundation </h1>
<p class="post-meta">
- <time datetime="2019-03-18T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2019-03-18T00:00:00-07:00" itemprop="datePublished">
Mar 18, 2019
</time>
diff --git a/2019/04/29/opt-cuda-quantized.html
b/2019/04/29/opt-cuda-quantized.html
index 3b401af..1c55a9a 100644
--- a/2019/04/29/opt-cuda-quantized.html
+++ b/2019/04/29/opt-cuda-quantized.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Automating Optimization of Quantized Deep Learning Models on CUDA
</h1>
<p class="post-meta">
- <time datetime="2019-04-29T12:00:00-04:00" itemprop="datePublished">
+ <time datetime="2019-04-29T09:00:00-07:00" itemprop="datePublished">
Apr 29, 2019
</time>
diff --git a/2019/05/30/pytorch-frontend.html b/2019/05/30/pytorch-frontend.html
index ad8281b..a4dd9a3 100644
--- a/2019/05/30/pytorch-frontend.html
+++ b/2019/05/30/pytorch-frontend.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Integrating TVM into PyTorch </h1>
<p class="post-meta">
- <time datetime="2019-05-30T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2019-05-30T00:00:00-07:00" itemprop="datePublished">
May 30, 2019
</time>
diff --git
a/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html
b/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html
index 38bd956..50f01e7 100644
--- a/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html
+++ b/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Compiling Machine Learning to WASM and WebGPU with Apache TVM </h1>
<p class="post-meta">
- <time datetime="2020-05-14T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2020-05-14T00:00:00-07:00" itemprop="datePublished">
May 14, 2020
</time>
diff --git a/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
b/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
index bcb1aed..ec640c7 100644
--- a/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
+++ b/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>TinyML - How TVM is Taming Tiny </h1>
<p class="post-meta">
- <time datetime="2020-06-04T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2020-06-04T00:00:00-07:00" itemprop="datePublished">
Jun 4, 2020
</time>
diff --git a/2020/07/14/bert-pytorch-tvm.html b/2020/07/14/bert-pytorch-tvm.html
index a563504..387e219 100644
--- a/2020/07/14/bert-pytorch-tvm.html
+++ b/2020/07/14/bert-pytorch-tvm.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Bridging PyTorch and TVM </h1>
<p class="post-meta">
- <time datetime="2020-07-14T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2020-07-14T00:00:00-07:00" itemprop="datePublished">
Jul 14, 2020
</time>
diff --git a/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
b/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
index a2066ec..3d39e96 100644
--- a/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
+++ b/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>How to Bring Your Own Codegen to TVM </h1>
<p class="post-meta">
- <time datetime="2020-07-15T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2020-07-15T00:00:00-07:00" itemprop="datePublished">
Jul 15, 2020
</time>
diff --git a/2020/09/26/bring-your-own-datatypes.html
b/2020/09/26/bring-your-own-datatypes.html
index 0dc4fb0..135d0db 100644
--- a/2020/09/26/bring-your-own-datatypes.html
+++ b/2020/09/26/bring-your-own-datatypes.html
@@ -140,7 +140,7 @@
<div class="span14 w-100">
<h1>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in
TVM </h1>
<p class="post-meta">
- <time datetime="2020-09-26T00:00:00-04:00" itemprop="datePublished">
+ <time datetime="2020-09-26T00:00:00-07:00" itemprop="datePublished">
Sep 26, 2020
</time>
diff --git a/2021/03/03/intro-auto-scheduler.html
b/2021/03/03/intro-auto-scheduler.html
new file mode 100644
index 0000000..e10a971
--- /dev/null
+++ b/2021/03/03/intro-auto-scheduler.html
@@ -0,0 +1,321 @@
+<html lang="en">
+<head>
+ <meta charset="UTF-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <title>Introducing TVM Auto-scheduler (a.k.a. Ansor)</title>
+ <link rel="shortcut icon" href="/assets/images/favicon.ico">
+ <link rel="stylesheet"
href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/css/bootstrap.min.css"
integrity="sha384-MCw98/SFnGE8fJT3GXwEOngsV7Zt27NXFoaoApmYm81iuXoPkFOJwJ8ERdknLPMO"
crossorigin="anonymous">
+ <link rel="stylesheet" href="/css/slick.css">
+ <link rel="stylesheet" href="/css/slick-theme.css">
+ <link rel="stylesheet" href="/css/custom.css">
+</head>
+<body>
+
+
+<div class="bannerPage">
+ <header class="header">
+ <div class="container">
+ <div class="headerInner d-flex justify-content-between
align-items-center">
+ <div class="headerLogo">
+ <a href="/"><img src="/assets/images/logo.svg" alt="Logo"></a>
+ </div>
+ <div id="headMenu" class="headerNav">
+ <button type="button" id="closeHeadMenu" class="navCloseBtn"><img
src="/assets/images/close-icon.svg"
+ alt="Close"></button>
+ <ul class="nav">
+
+ <li class="nav-item">
+ <a class="nav-link" href="/community">Community</a>
+ </li>
+
+ <li class="nav-item">
+ <a class="nav-link" href="/download">Download</a>
+ </li>
+
+ <li class="nav-item">
+ <a class="nav-link" href="/vta">VTA</a>
+ </li>
+
+ <li class="nav-item">
+ <a class="nav-link" href="/blog">Blog</a>
+ </li>
+
+ <li class="nav-item">
+ <a class="nav-link" href="https://tvm.apache.org/docs/">Docs</a>
+ </li>
+
+ <li class="nav-item">
+ <a class="nav-link" href="https://tvmconf.org/">Conference</a>
+ </li>
+
+ <li class="nav-item">
+ <a class="nav-link"
href="https://github.com/apache/incubator-tvm/">Github</a>
+ </li>
+
+</ul>
+ <div class="responsiveasfdropdown">
+ <button type="button" class="btn-link">
+ ASF
+ </button>
+ <ul>
+
+ <li>
+ <a href="https://www.apache.org/">Apache Homepage</a>
+ </li>
+
+ <li>
+ <a href="https://www.apache.org/licenses/">License</a>
+ </li>
+
+ <li>
+ <a
href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a>
+ </li>
+
+ <li>
+ <a href="https://www.apache.org/security/">Security</a>
+ </li>
+
+ <li>
+ <a href="https://www.apache.org/foundation/thanks.html">Thanks</a>
+ </li>
+
+ <li>
+ <a href="https://www.apache.org/events/current-event">Events</a>
+ </li>
+
+</ul>
+ </div>
+ </div>
+ <div class="responsiveMenuIcon">
+ <button type="button" id="menuBtn" class="btn-menu"><img
src="/assets/images/menu-icon.svg"
+ alt="Menu Icon" /></button>
+ </div>
+ <div class="asfDropdown">
+ <div class="dropdown">
+ <button type="button" class="btn-link dropdown-toggle"
data-toggle="dropdown" aria-haspopup="true"
+ aria-expanded="false">
+ ASF
+ </button>
+ <div class="dropdown-menu dropdown-menu-right">
+ <ul>
+
+ <li>
+ <a href="https://www.apache.org/">Apache Homepage</a>
+ </li>
+
+ <li>
+ <a href="https://www.apache.org/licenses/">License</a>
+ </li>
+
+ <li>
+ <a
href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a>
+ </li>
+
+ <li>
+ <a href="https://www.apache.org/security/">Security</a>
+ </li>
+
+ <li>
+ <a href="https://www.apache.org/foundation/thanks.html">Thanks</a>
+ </li>
+
+ <li>
+ <a href="https://www.apache.org/events/current-event">Events</a>
+ </li>
+
+</ul>
+ </div>
+ </div>
+ </div>
+ </div>
+ </div>
+ </header>
+
+</div>
+
+
+<div class="container">
+<div class="content">
+ <div class="row">
+ <div class="span14 w-100">
+ <h1>Introducing TVM Auto-scheduler (a.k.a. Ansor) </h1>
+ <p class="post-meta">
+ <time datetime="2021-03-03T00:00:00-08:00" itemprop="datePublished">
+ Mar 3, 2021
+ </time>
+
+ • <span itemprop="author" itemscope
itemtype="http://schema.org/Person">
+ <span itemprop="name">Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao
Wu, Cody Hao Yu</span>
+ </span>
+
+ </p>
+ <p class="post-meta">
+ </p>
+ <br />
+ <p>Optimizing the execution speed of deep neural networks is extremely
hard with the growing
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and
layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described
by mathematical expressions.
+However, providing high-performance implementations for them on modern
hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware
intrinsics to achieve high performance.
+It takes a huge engineering effort to build linear algebra and neural network
acceleration libraries like cuBLAS, cuDNN, oneMKL, and oneDNN.</p>
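<p>To make that gap concrete, here is a generic NumPy sketch (our illustration, not from the original post): the mathematical definition of matmul transcribes directly into code, yet that direct transcription is nowhere near the speed of a tuned library call such as the BLAS routine behind <code>@</code>.</p>

```python
# The math "C[i, j] = sum_k A[i, k] * B[k, j]" is trivial to *describe*,
# but its naive transcription is far slower than an optimized library.
import numpy as np

def matmul_naive(A, B):
    """Direct transcription of the mathematical definition of matmul."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))
B = rng.standard_normal((16, 16))

# Both compute the same result; the tuned BLAS call behind `@` is orders of
# magnitude faster on large inputs.
assert np.allclose(matmul_naive(A, B), A @ B)
```

<p>Closing that performance gap by hand, for every operator and every backend, is exactly the engineering cost described above.</p>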
+
+<p>Our life will be much easier if we can just write mathematical expressions
and have something
+magically turn them into efficient code implementations.
+Three years ago, deep learning compiler TVM and its search module AutoTVM were
built as the first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient
implementations for a given tensor computation.
+However, this template-based approach still requires domain experts
to write a non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM
code repository.
+Besides being very hard to develop, these templates often have inefficient and
limited search spaces,
+making them unable to achieve optimal performance.</p>
+
+<p>To address the limitations of AutoTVM, we started the Ansor project, aiming at a
fully automated auto-scheduler for
+generating code for tensor computations.
+Ansor auto-scheduler only takes tensor expressions as input and generates
high-performance code without manual templates.
+We made innovations in the search space construction and search algorithm.
+As a result, the auto-scheduler can achieve better performance with less
search time in a more automated way.</p>
+
+<p>Ansor auto-scheduler is now integrated into Apache TVM as the <code
class="language-plaintext highlighter-rouge">tvm.auto_scheduler</code> package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS, and
OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and
Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some
benchmark results.</p>
+
+<h1 id="system-overview">System Overview</h1>
+
+<h2 id="autotvm-vs-auto-scheduler">AutoTVM vs Auto-scheduler</h2>
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/workflow.png" alt="image" width="75%" /></p>
+<center> Table 1. Workflow Comparison </center>
+<p></p>
+
+<p>Table 1 compares the workflow for generating code for an operator in
AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor
expression language.
+This part is relatively easy because TVM’s tensor expression language looks
just like math expressions.
+In step 2, the developer has to write a schedule template, which typically
consists of 20-100 lines of tricky DSL code.
+This part requires domain expertise of both the target hardware architecture
and operator semantics, so it is difficult.
+The last step, step 3, is automated by a search algorithm.</p>
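<p>As a rough illustration of step 1 (the real API lives in <code>tvm.te</code>, with <code>te.placeholder</code>, <code>te.compute</code>, and <code>te.reduce_axis</code>; the dependency-free <code>compute</code> mimic below is ours, written only so the sketch runs on its own), a compute definition really does read almost like the math:</p>

```python
# A toy stand-in for the tensor-expression style: `compute` evaluates a
# lambda at every output index, the way te.compute declares a computation.
# Not the real TVM API; an illustrative mimic only.
import itertools
import numpy as np

def compute(shape, fcompute):
    """Toy stand-in for te.compute: evaluate fcompute at every output index."""
    out = np.empty(shape)
    for idx in itertools.product(*(range(s) for s in shape)):
        out[idx] = fcompute(*idx)
    return out

K = 8
A = np.arange(K * K, dtype=float).reshape(K, K)
B = np.ones((K, K))

# "C[i, j] = sum over k of A[i, k] * B[k, j]", written almost verbatim:
C = compute((K, K), lambda i, j: sum(A[i, k] * B[k, j] for k in range(K)))

assert np.allclose(C, A @ B)
```

<p>Step 2, the schedule template that turns such a definition into fast code, is what the auto-scheduler removes.</p>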
+
+<p>In auto-scheduler, we eliminate the most difficult step 2 by automatic
search space construction and accelerate step 3 with a better search algorithm.
+By doing automatic search space construction, we not only eliminate huge
manual effort,
+but also enable the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules
to generate the search space.
+However, these rules are very general. They are based on static analysis of
the tensor expressions.
+We only need to design a few general rules once and can apply them to almost
all tensor computations in deep learning.</p>
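<p>A toy sketch of why such generated spaces get large (the rule and the loop extents here are our invention, not TVM's actual sketch-generation rules): a single generic "split this loop into tiles" rule, applied independently to a few loops, already yields a combinatorial space that nobody would enumerate by hand.</p>

```python
import itertools

# One generic rule -- split a loop of extent N into (outer, inner) tiles --
# applied to several loops. Illustrative only.
def factor_splits(extent):
    """All (outer, inner) tile splits of a loop with the given extent."""
    return [(extent // f, f) for f in range(1, extent + 1) if extent % f == 0]

def tiling_space(loop_extents):
    """Cross product of tile choices over several loops."""
    return list(itertools.product(*(factor_splits(e) for e in loop_extents)))

# Tiling the three loops of a 512x512x512 matmul with one rule:
space = tiling_space([512, 512, 512])
print(len(space))  # 10 divisors per loop -> 10**3 = 1000 candidates
```

<p>Real rules compose (tiling, fusion, unrolling, vectorization, and so on), which is why the automatically built space can cover combinations that hand-written templates miss.</p>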
+
+<h2 id="search-process">Search Process</h2>
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/search_overview.png" alt="image" width="40%"
/></p>
+<center> Figure 1. Search Process Overview </center>
+<p></p>
+
+<p>Figure 1 shows the search process of auto-scheduler when optimizing a
whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator
fusion pass.
+A task scheduler allocates the tuning time budget across the many
subgraphs.
+At each iteration, it picks a subgraph that has the most potential to increase
the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several
sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of
optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback
to update all components of the system.
+This process is repeated iteratively until the optimization converges or we
run out of the time budget.
+More technical details can be found in our paper [3] and our code.</p>
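<p>The evolutionary-search loop described above can be caricatured in a few lines of plain Python. Everything here is invented for illustration: the cost function stands in for real hardware measurements, the mutation set stands in for program rewrites, and in the real system a learned cost model ranks candidates so that most of them never reach hardware at all.</p>

```python
import random

random.seed(0)

def measure(tile):
    """Stand-in for measuring a candidate on real hardware (lower is better)."""
    return abs(tile - 48)

population = [1, 8, 64, 256]            # initial candidate "programs"

for _ in range(50):
    parent = random.choice(population)  # evolutionary step: mutate a survivor
    child = max(1, parent + random.choice([-16, -4, -1, 1, 4, 16]))
    population.append(child)
    # keep only the fittest candidates; the real system consults a learned
    # cost model here instead of re-measuring everything
    population = sorted(set(population), key=measure)[:4]

best = population[0]
print("best candidate:", best, "cost:", measure(best))
```

<p>Because the fittest candidate always survives the selection step, the best cost is monotonically non-increasing across iterations, mirroring how the real search only ever improves on its incumbent.</p>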
+
+<p>It is worth noting that since the auto-scheduler generates schedules from
scratch,
+it reuses the existing computation definitions in TOPI but not schedule
templates.</p>
+
+<h1 id="benchmark-results">Benchmark Results</h1>
+<p>In this section, we benchmark the performance of AutoTVM and Auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge, which is equipped with an
18-core Intel Skylake 8124M CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge, which is equipped with an
NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo [2].</p>
+
+<h2 id="performance-of-the-generated-code">Performance of the generated
code</h2>
+<p>We benchmark the fp32 single-batch inference latency on three networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see auto-scheduler outperforms AutoTVM in all cases with 1.02x to 8.95x
speedup.
+This is because auto-scheduler explores a larger search space, which covers
more efficient combinations
+of optimizations that are missed in TOPI manual templates.
+BERT-base@GPU is an extreme case where the manual templates are poorly
designed:
+the manual template for dense layers does not perform well for
the shapes in the BERT model.</p>
+
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/code_perf.png" alt="image" width="85%" /></p>
+<center> Figure 2. Code Performance Comparison (Higher is better) </center>
+<p></p>
+
+<h2 id="search-time">Search Time</h2>
+<p>Search-based approaches can be very time-consuming, so we
also care about the search time.
+It typically takes several hours to let the search converge for a single
neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its
larger search space.
+This is mainly because auto-scheduler has a better cost model and task
scheduler.</p>
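<p>The task scheduler's contribution can be sketched as follows; the subgraph names, latencies, and the fixed 10% gain per tuning batch are all invented, and the real scheduler estimates each subgraph's potential gain with a cost model rather than reading true latencies directly.</p>

```python
# Toy task scheduler: each round, spend the next tuning batch on the
# subgraph that currently dominates end-to-end latency.
latency = {"conv_a": 10.0, "conv_b": 4.0, "dense": 2.0}   # ms, illustrative
trials = {name: 0 for name in latency}

def tune_once(name):
    """Pretend one tuning batch shaves 10% off a subgraph's latency."""
    latency[name] *= 0.9

for _ in range(12):
    target = max(latency, key=latency.get)   # most end-to-end potential
    tune_once(target)
    trials[target] += 1

# the dominant subgraph receives most of the tuning budget
assert trials["conv_a"] > trials["dense"]
print(trials)
```

<p>Concentrating trials where they matter for end-to-end latency is one reason the total search time drops even though the per-subgraph search space is larger.</p>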
+
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/search_time.png" alt="image" width="85%"
/></p>
+<center> Figure 3. Search Time Comparison (Lower is better) </center>
+<p></p>
+
+<h2 id="more-results">More Results</h2>
+<p>The repo above serves as an internal benchmark tool for TVM, so it only
compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, this blog post [4] also tried auto-scheduler on an Apple M1 chip and
got some good results.</p>
+
+<h1 id="conclusion">Conclusion</h1>
+<p>We built TVM auto-scheduler, a system that automatically generates
high-performance code for tensor expressions.
+Compared with the predecessor AutoTVM, auto-scheduler does not require manual
templates.
+Besides, auto-scheduler is capable of generating schedules with better
performance in a shorter time.
+We achieve this by making innovations in the search space construction and
search algorithm.</p>
+
+<p>We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending auto-scheduler to better
support
+sparse operators, low-precision operators, and dynamic shapes.</p>
+
+<h1 id="links">Links</h1>
+<p>[1] Tutorials: <a
href="https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling">https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling</a><br
/>
+[2] Benchmark repo: <a
href="https://github.com/tlc-pack/TLCBench">https://github.com/tlc-pack/TLCBench</a><br
/>
+[3] OSDI Paper: <a href="https://arxiv.org/abs/2006.06762">Ansor : Generating
High-Performance Tensor Programs for Deep Learning</a><br />
+[4] Results on Apple M1 chip: <a
href="https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d">https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d</a>.</p>
+
+
+ </div>
+ </div>
+</div>
+</div>
+
+
+
+
+
+
+ <script src="https://code.jquery.com/jquery-2.2.0.min.js"
type="text/javascript"></script>
+ <script
src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.3/umd/popper.min.js"
integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49"
crossorigin="anonymous"></script>
+ <script
src="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/js/bootstrap.min.js"
integrity="sha384-ChfqqxuZUCnJSK3+MXmPNIyE6ZbWh2IMqE241rYiqJxyMiZ6OW/JmZQ5stwEULTy"
crossorigin="anonymous"></script>
+ <!-- <script src="./assets/js/slick.js"></script> -->
+ <script src="/assets/js/custome.js"></script>
+ <script async
src="https://www.googletagmanager.com/gtag/js?id=UA-75982049-2"></script>
+ <script>
+ window.dataLayer = window.dataLayer || [];
+ function gtag(){dataLayer.push(arguments);}
+ gtag('js', new Date());
+ gtag('config', 'UA-75982049-2');
+ </script>
+</body>
+<section class="footerSec">
+ <div class="footerHeader">
+ <ul class="container d-flex align-md-items-center justify-content-between
flex-column flex-md-row">
+ <li class="logo">
+
+ <p><a href="/"><img src="/assets/images/logo.svg" alt="logo"
title="logo" /></a></p>
+ </li>
+ <li class="copywrite d-flex align-items-center">
+ <h5 id="apache-software-foundation--all-right-reserved">© 2020 Apache
Software Foundation | All right reserved</h5>
+ </li>
+ </ul>
+
+ </div>
+
+ <ul class="container">
+ <li class="footernote">
+ Copyright © 2020 The Apache Software Foundation. Apache TVM, Apache, the
Apache feather, and the Apache TVM project logo are either trademarks or
registered trademarks of the Apache Software Foundation.</li>
+ </ul>
+
+</section>
+</html>
diff --git a/atom.xml b/atom.xml
index 84cd5f0..cb57f8a 100644
--- a/atom.xml
+++ b/atom.xml
@@ -4,7 +4,7 @@
<title>TVM</title>
<link href="https://tvm.apache.org" rel="self"/>
<link href="https://tvm.apache.org"/>
- <updated>2021-01-04T16:22:52-05:00</updated>
+ <updated>2021-03-03T01:20:46-08:00</updated>
<id>https://tvm.apache.org</id>
<author>
<name></name>
@@ -13,9 +13,139 @@
<entry>
+ <title>Introducing TVM Auto-scheduler (a.k.a. Ansor)</title>
+ <link href="https://tvm.apache.org/2021/03/03/intro-auto-scheduler"/>
+ <updated>2021-03-03T00:00:00-08:00</updated>
+ <id>https://tvm.apache.org/2021/03/03/intro-auto-scheduler</id>
+ <content type="html"><p>Optimizing the execution speed of deep neural
networks is extremely hard with the growing
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and
layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described
by mathematical expressions.
+However, providing high-performance implementations for them on modern
hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware
intrinsics to achieve high performance.
+It takes a huge engineering effort to build linear algebra and neural network
acceleration libraries like cuBLAS, cuDNN, oneMKL, and oneDNN.</p>
+
+<p>Our life will be much easier if we can just write mathematical
expressions and have something
+magically turn them into efficient code implementations.
+Three years ago, deep learning compiler TVM and its search module AutoTVM were
built as the first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient
implementations for a given tensor computation.
+However, this template-based approach still requires domain experts
to write a non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM
code repository.
+Besides being very hard to develop, these templates often have inefficient and
limited search spaces,
+making them unable to achieve optimal performance.</p>
+
+<p>To address the limitations of AutoTVM, we started project Ansor
aiming at a fully automated auto-scheduler for
+generating code for tensor computations.
+Ansor auto-scheduler only takes tensor expressions as input and generates
high-performance code without manual templates.
+We made innovations in the search space construction and search algorithm.
+As a result, the auto-scheduler can achieve better performance with less
search time in a more automated way.</p>
+
+<p>Ansor auto-scheduler is now integrated into Apache TVM as <code
class="language-plaintext
highlighter-rouge">tvm.auto_scheduler</code> package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS and
OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and
Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some
benchmark results.</p>
+
+<h1 id="system-overview">System Overview</h1>
+
+<h2 id="autotvm-vs-auto-scheduler">AutoTVM vs
Auto-scheduler</h2>
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/workflow.png" alt="image"
width="75%" /></p>
+<center> Table 1. Workflow Comparison </center>
+<p></p>
+
+<p>Table 1 compares the workflow for generating code for an operator in
AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor
expression language.
+This part is relatively easy because TVM’s tensor expression language looks
just like math expressions.
+In step 2, the developer has to write a schedule template, which typically
consists of 20-100 lines of tricky DSL code.
+This part is difficult because it requires domain expertise in both the target hardware architecture and the operator semantics.
+The last step, step 3, is automated by a search algorithm.</p>
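As a hypothetical illustration of what a schedule template parameterizes (this toy is plain Python, not AutoTVM's DSL): in step 2 the developer fixes the loop structure by hand, and the search in step 3 only fills in knobs such as tile sizes. Every knob setting computes the same result; the settings differ only in performance:

```python
# A "template": the tiled loop nest is hand-written; tile_i and tile_j
# are the knobs the search explores. All settings are semantically equal.
def tiled_matmul(A, B, tile_i=2, tile_j=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile_i):              # knob-controlled split of i
        for j0 in range(0, m, tile_j):          # knob-controlled split of j
            for i in range(i0, min(i0 + tile_i, n)):
                for j in range(j0, min(j0 + tile_j, m)):
                    C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
    return C
```

Writing and maintaining such templates for every operator on every platform is exactly the manual burden described above.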
+
+<p>In auto-scheduler, we eliminate the most difficult step 2 by
automatic search space construction and accelerate step 3 with a better search
algorithm.
+By doing automatic search space construction, we not only eliminate a huge manual effort but also enable the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules
to generate the search space.
+However, these rules are very general. They are based on static analysis of
the tensor expressions.
+We only need to design a few general rules once and can apply them to almost
all tensor computations in deep learning.</p>
+
+<h2 id="search-process">Search Process</h2>
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/search_overview.png"
alt="image" width="40%" /></p>
+<center> Figure 1. Search Process Overview </center>
+<p></p>
+
+<p>Figure 1 shows the search process of auto-scheduler when optimizing
a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator
fusion pass.
+A task scheduler is used to allocate the tuning time across the many subgraphs.
+At each iteration, it picks a subgraph that has the most potential to increase
the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several
sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of
optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback
to update all components of the system.
+This process is repeated iteratively until the optimization converges or we
run out of time budget.
+More technical details can be found in our paper [3] and our code.</p>
+
+<p>It is worth noting that since the auto-scheduler generates schedules
from scratch,
+it reuses the existing computation definitions in TOPI but not schedule
templates.</p>
+
+<h1 id="benchmark-results">Benchmark Results</h1>
+<p>In this section, we benchmark the performance of AutoTVM and
Auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge, which is equipped with an 18-core Intel Skylake 8124M CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge, which is equipped with an
NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo [2].</p>
+
+<h2 id="performance-of-the-generated-code">Performance of the
generated code</h2>
+<p>We benchmark the fp32 single-batch inference latency on three
networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see that auto-scheduler outperforms AutoTVM in all cases, with speedups ranging from 1.02x to 8.95x.
+This is because auto-scheduler explores a larger search space, which covers more efficient combinations of optimizations that are missed by the manual TOPI templates.
+BERT-base@GPU is an extreme case where the manual templates are badly designed: the manual template for dense layers does not perform well for the shapes in the BERT model.</p>
+
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/code_perf.png"
alt="image" width="85%" /></p>
+<center> Figure 2. Code Performance Comparison (Higher is better)
</center>
+<p></p>
+
+<h2 id="search-time">Search Time</h2>
+<p>Search-based approaches can be very time-consuming, so we also care about the search time.
+It typically takes several hours for the search to converge on a single neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its
larger search space.
+This is mainly because auto-scheduler has a better cost model and task
scheduler.</p>
+
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/search_time.png"
alt="image" width="85%" /></p>
+<center> Figure 3. Search Time Comparison (Lower is better)
</center>
+<p></p>
+
+<h2 id="more-results">More Results</h2>
+<p>The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, this blog post [4] also tried auto-scheduler on an Apple M1 chip and
got some good results.</p>
+
+<h1 id="conclusion">Conclusion</h1>
+<p>We built TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
+Compared with the predecessor AutoTVM, auto-scheduler does not require manual
templates.
+In addition, auto-scheduler can generate schedules with better performance in a shorter search time.
+We achieve this by making innovations in the search space construction and
search algorithm.</p>
+
+<p>We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending auto-scheduler to better support
+sparse operators, low-precision operators, and dynamic shapes.</p>
+
+<h1 id="links">Links</h1>
+<p>[1] Tutorials: <a
href="https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling">https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling</a><br
/>
+[2] Benchmark repo: <a
href="https://github.com/tlc-pack/TLCBench">https://github.com/tlc-pack/TLCBench</a><br
/>
+[3] OSDI Paper: <a
href="https://arxiv.org/abs/2006.06762">Ansor : Generating
High-Performance Tensor Programs for Deep Learning</a><br />
+[4] Results on Apple M1 chip: <a
href="https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d">https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d</a>.</p>
+
+</content>
+ </entry>
+
+ <entry>
<title>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in
TVM</title>
<link href="https://tvm.apache.org/2020/09/26/bring-your-own-datatypes"/>
- <updated>2020-09-26T00:00:00-04:00</updated>
+ <updated>2020-09-26T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</id>
<content type="html"><p>In this post, we describe the Bring Your Own
Datatypes framework, which enables the use of custom datatypes within
TVM.</p>
@@ -308,7 +438,7 @@ For more documentation about the Bring Your Own Datatypes
framework
<entry>
<title>How to Bring Your Own Codegen to TVM</title>
<link
href="https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm"/>
- <updated>2020-07-15T00:00:00-04:00</updated>
+ <updated>2020-07-15T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</id>
<content type="html"><p>To free data scientists from worrying about
the performance when developing a new model, hardware backend providers (e.g.,
Intel, NVIDIA, ARM, etc) either provide kernel libraries such as cuBLAS or
cuDNN with many commonly used deep learning kernels, or provide frameworks such
as DNNL or TensorRT with a graph engine to let users describe their models in a
certain way to achieve high performance. In addition, emerging deep learning
accelerators also have t [...]
@@ -787,7 +917,7 @@ Figure 4: After Graph Partitioning.
<entry>
<title>Bridging PyTorch and TVM</title>
<link href="https://tvm.apache.org/2020/07/14/bert-pytorch-tvm"/>
- <updated>2020-07-14T00:00:00-04:00</updated>
+ <updated>2020-07-14T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</id>
<content type="html">
<p>(A more code-heavy variant is crossposted on the more PyTorch affine
<a
href="https://lernapparat.de/transformers-pytorch-tvm/">Lernapparat</a>,
@@ -1310,7 +1440,7 @@ He is a PyTorch core developer and co-authored <a
href="https://www.mann
<entry>
<title>TinyML - How TVM is Taming Tiny</title>
<link
href="https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny"/>
- <updated>2020-06-04T00:00:00-04:00</updated>
+ <updated>2020-06-04T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</id>
<content type="html">
<p><img src="/images/microtvm/logo.png" alt="microTVM
logo" width="30%" /><br /></p>
@@ -1619,7 +1749,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix
multiplication microkernel</
<entry>
<title>Compiling Machine Learning to WASM and WebGPU with Apache TVM</title>
<link
href="https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu"/>
- <updated>2020-05-14T00:00:00-04:00</updated>
+ <updated>2020-05-14T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</id>
<content type="html"><p><strong>TLDR</strong></p>
@@ -1706,7 +1836,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix
multiplication microkernel</
<entry>
<title>Integrating TVM into PyTorch</title>
<link href="https://tvm.apache.org/2019/05/30/pytorch-frontend"/>
- <updated>2019-05-30T00:00:00-04:00</updated>
+ <updated>2019-05-30T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2019/05/30/pytorch-frontend</id>
<content type="html"><p>As TVM continuously demonstrates improvements
to the efficiency of deep learning execution,
it has become clear that PyTorch stands to benefit from directly leveraging
the compiler stack.
@@ -1808,7 +1938,7 @@ relay_graph = torch_tvm.to_relay(mul, inputs)
<entry>
<title>Automating Optimization of Quantized Deep Learning Models on
CUDA</title>
<link href="https://tvm.apache.org/2019/04/29/opt-cuda-quantized"/>
- <updated>2019-04-29T12:00:00-04:00</updated>
+ <updated>2019-04-29T09:00:00-07:00</updated>
<id>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</id>
<content type="html"><p>Deep learning has been successfully applied
to a variety of tasks.
On real-time scenarios such as inference on autonomous vehicles, the inference
speed of the model is critical.
@@ -1952,7 +2082,7 @@ We show that automatic optimization in TVM makes it easy
and flexible to support
<entry>
<title>TVM Deep Learning Compiler Joins Apache Software Foundation</title>
<link href="https://tvm.apache.org/2019/03/18/tvm-apache-announcement"/>
- <updated>2019-03-18T00:00:00-04:00</updated>
+ <updated>2019-03-18T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</id>
<content type="html"><p>There is an increasing need to bring machine
learning to a wide diversity of hardware devices. Current frameworks rely on
vendor-specific operator libraries and optimize for a narrow range of
server-class GPUs. Deploying workloads to new platforms – such as mobile
phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) – requires
significant manual effort.</p>
@@ -1975,7 +2105,7 @@ We show that automatic optimization in TVM makes it easy
and flexible to support
<entry>
<title>TVM Golang Runtime for Deep Learning Deployment</title>
<link href="https://tvm.apache.org/2019/01/19/Golang"/>
- <updated>2019-01-19T00:00:00-05:00</updated>
+ <updated>2019-01-19T00:00:00-08:00</updated>
<id>https://tvm.apache.org/2019/01/19/Golang</id>
<content type="html"><h2
id="introduction">Introduction</h2>
@@ -2145,7 +2275,7 @@ closure as TVM packed function and invoke the same across
programming language b
<entry>
<title>Automating Generation of Low Precision Deep Learning
Operators</title>
<link href="https://tvm.apache.org/2018/12/18/lowprecision-conv"/>
- <updated>2018-12-18T00:00:00-05:00</updated>
+ <updated>2018-12-18T00:00:00-08:00</updated>
<id>https://tvm.apache.org/2018/12/18/lowprecision-conv</id>
<content type="html"><p>As deep learning models grow larger and more
complex, deploying them on low powered phone and IoT
devices becomes challenging because of their limited compute and energy
budgets. A recent trend
@@ -2306,7 +2436,7 @@ Note: x86 doesn’t support a vectorized popcount for this
microarchitecture, so
<entry>
<title>Efficient Privacy-Preserving ML Using TVM</title>
<link href="https://tvm.apache.org/2018/10/09/ml-in-tees"/>
- <updated>2018-10-09T00:00:00-04:00</updated>
+ <updated>2018-10-09T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2018/10/09/ml-in-tees</id>
<content type="html"><p>This post describes Myelin, a framework for
privacy-preserving machine learning in trusted hardware enclaves, and how TVM
makes Myelin fast.
The key idea is that TVM, unlike other popular ML frameworks, compiles models
into lightweight, optimized, and dependency-free libraries which can fit into
resource constrained enclaves.</p>
@@ -2422,7 +2552,7 @@ His research interest is in the general domain of ML on
shared private data, but
<entry>
<title>Automatic Kernel Optimization for Deep Learning on All Hardware
Platforms</title>
<link href="https://tvm.apache.org/2018/10/03/auto-opt-all"/>
- <updated>2018-10-03T00:00:00-04:00</updated>
+ <updated>2018-10-03T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2018/10/03/auto-opt-all</id>
<content type="html"><p>Optimizing the performance of deep neural
network on a diverse range of hardware platforms is still a hard
problem for AI developers. In terms of system support, we are facing a
many-to-many problem here:
@@ -2816,7 +2946,7 @@ for inference deployment. TVM just provides such a
solution.</p>
<entry>
<title>Building a Cross-Framework Deep Learning Compiler via DLPack</title>
<link href="https://tvm.apache.org/2018/08/10/DLPack-Bridge"/>
- <updated>2018-08-10T00:00:00-04:00</updated>
+ <updated>2018-08-10T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2018/08/10/DLPack-Bridge</id>
<content type="html"><p>Deep learning frameworks such as Tensorflow,
PyTorch, and ApacheMxNet provide a
powerful toolbox for quickly prototyping and deploying deep learning models.
@@ -2955,7 +3085,7 @@ support, and can be used to implement convenient
converters, such as
<entry>
<title>VTA: An Open, Customizable Deep Learning Acceleration Stack </title>
<link href="https://tvm.apache.org/2018/07/12/vta-release-announcement"/>
- <updated>2018-07-12T00:00:00-04:00</updated>
+ <updated>2018-07-12T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2018/07/12/vta-release-announcement</id>
<content type="html"><p style="text-align: center">Thierry
Moreau(VTA architect), Tianqi Chen(TVM stack), Ziheng Jiang†(graph
compilation), Luis Vega(cloud deployment)</p>
<p style="text-align: center">Advisors: Luis Ceze, Carlos
Guestrin, Arvind Krishnamurthy</p>
@@ -3097,7 +3227,7 @@ This kind of high-level visibility is essential to system
designers who want to
<entry>
<title>Bringing TVM into TensorFlow for Optimizing Neural Machine
Translation on GPU</title>
<link href="https://tvm.apache.org/2018/03/23/nmt-transformer-optimize"/>
- <updated>2018-03-23T00:00:00-04:00</updated>
+ <updated>2018-03-23T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</id>
<content type="html"><h2 id="author">Author</h2>
@@ -3363,7 +3493,7 @@ C = tvm.compute(
<entry>
<title>Compiling Deep Learning Models to WebGL with TVM</title>
<link href="https://tvm.apache.org/2018/03/12/webgl"/>
- <updated>2018-03-12T00:00:00-04:00</updated>
+ <updated>2018-03-12T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2018/03/12/webgl</id>
<content type="html"><p>Now TVM comes with a brand-new OpenGL/WebGL
backend!
This blog post explains what it is, and what you can achieve with it.</p>
@@ -3479,7 +3609,7 @@ optimizations into the TVM stack.</p>
<entry>
<title>Optimizing Mobile Deep Learning on ARM GPU with TVM</title>
<link href="https://tvm.apache.org/2018/01/16/opt-mali-gpu"/>
- <updated>2018-01-16T00:00:00-05:00</updated>
+ <updated>2018-01-16T00:00:00-08:00</updated>
<id>https://tvm.apache.org/2018/01/16/opt-mali-gpu</id>
<content type="html"><p>With the great success of deep learning, the
demand for
deploying deep neural networks to mobile devices is growing rapidly.
@@ -4053,7 +4183,7 @@ advice and <a
href="https://github.com/yzhliu">Yizhi Liu</a&g
<entry>
<title>Remote Profile and Test Deep Learning Cross Compilation on Mobile
Phones with TVM RPC</title>
<link href="https://tvm.apache.org/2017/11/08/android-rpc-introduction"/>
- <updated>2017-11-08T00:00:00-05:00</updated>
+ <updated>2017-11-08T00:00:00-08:00</updated>
<id>https://tvm.apache.org/2017/11/08/android-rpc-introduction</id>
<content type="html"><p>TVM stack is an end to end compilation stack
to deploy deep learning workloads to all hardware backends.
Thanks to the NNVM compiler support of TVM stack, we can now directly compile
descriptions from deep learning frameworks and compile them to bare metal code.
@@ -4281,7 +4411,7 @@ make jvminstall
<entry>
<title>Bringing AMDGPUs to TVM Stack and NNVM Compiler with ROCm</title>
<link
href="https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm"/>
- <updated>2017-10-30T00:00:00-04:00</updated>
+ <updated>2017-10-30T00:00:00-07:00</updated>
<id>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</id>
<content type="html"><p style="text-align: center">Aditya
Atluri, Advanced Micro Devices, Inc.</p>
<p style="text-align: center">Masahiro Masuda, Ziosoft,
Inc.</p>
@@ -4504,88 +4634,5 @@ BB0_6:
</content>
</entry>
- <entry>
- <title>NNVM Compiler: Open Compiler for AI Frameworks</title>
- <link href="https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement"/>
- <updated>2017-10-06T11:30:00-04:00</updated>
- <id>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</id>
- <content type="html"><p style="text-align: center">Paul G.
Allen School of Computer Science &amp; Engineering, University of
Washington</p>
-<p style="text-align: center">Amazon Web Service AI
team</p>
-<p style="text-align: center">DMLC open-source
community</p>
-
-<p>Deep learning has become ubiquitous and indispensable. We are seeing
a rising need for deploying deep learning workloads on many kinds of platforms
such as mobile phones, GPU, IoT devices and specialized accelerators. Last
month, we announced TVM stack to close the gap between deep learning
frameworks, and the performance- or efficiency-oriented hardware backends. TVM
stack makes it easy to build an end to end compilation for a deep learning
framework. However, we think it wo [...]
-
-<p>Today, UW Allen school and AWS AI team, together with other
contributors, are excited to announce the release of NNVM compiler, an open
deep learning compiler to compile front-end framework workloads directly to
hardware backends. We build it using the two-level intermediate
representation(IR) in the TVM stack.
-The reader is welcome to refer to the <a
href="http://www.tvmlang.org/2017/08/17/tvm-release-announcement.html">original
TVM announcement</a> for more technical details about TVM stack. With
the help of TVM stack, NNVM compiler can:</p>
-
-<ul>
- <li>Represent and optimize the common deep learning workloads in high
level graph IR</li>
- <li>Transform the computation graph to minimize memory utilization,
optimize data layout and fuse computation patterns for different hardware
backends.</li>
- <li>Present an end to end compilation pipeline from front-end deep
learning frameworks to bare metal hardwares.</li>
-</ul>
-
-<p style="text-align: center"><img
src="/images/nnvm/nnvm_compiler_stack.png" alt="image"
width="612px" /></p>
-
-<p>The NNVM compiler can directly take models from deep learning
frameworks such as Apache MXNet.
-It also support model exchange formats such as ONNX and CoreML. ONNX support
enables NNVM to compile deep learning models from PyTorch, Caffe2 and CNTK.
-The CoreML frontend enables deployment of CoreML models to non-iOS
devices.</p>
-
-<p style="text-align: center"><img
src="/images/nnvm/nnvm_compiler_code.png" alt="image"
width="712px" /></p>
-
-<h2 id="separation-of-optimization-and-deployment">Separation
of Optimization and Deployment</h2>
-
-<p style="text-align: center"><img
src="/images/nnvm/nnvm_deploy.png" alt="image"
width="512px" /></p>
-
-<p>NNVM compiler applies graph level and tensor level optimizations and
jointly optimize them to get the best performance. We take a different approach
from existing deep learning frameworks, which packages the graph optimization
with the deployment runtime. NNVM compiler adopts the conventional wisdom from
compiler to separate the optimization from the actual deployment runtime. This
approach offers substantial optimization but still keeps the runtime
lightweight. The compiled mo [...]
-
-<h2 id="performance">Performance</h2>
-
-<p>NNVM compiler is still under active development, and we can expect
more improvements to come, but we have started to see promising results.
-We benchmarked its performance and compared it against Apache MXNet on two
typical hardware configurations: ARM CPU on Raspberry PI and Nvidia GPU on AWS.
Despite the radical architecture difference between these two chips, we can use
the same infrastructure and only need to change the schedule for each type of
hardware.</p>
-
-<h3 id="nvidia-gpu">Nvidia GPU</h3>
-
-<p>GPU benchmarks and schedules are contributed by Leyuan Wang
(AWS/UCDavis) and Yuwei Hu (TuSimple). We compared the NNVM compiler against
Apache MXNet with CUDA8 and cuDNN7 as the backend on Nvidia K80. This is a very
strong baseline, as Apache MXNet turns on auto-tuning to select the best kernel
from CuDNN. We also used the optimized depthwise kernel in MXNet to optimize
MobileNet workload.</p>
-
-<p style="text-align: center"><img
src="/images/nnvm/nnvm_k80_result.png" alt="image"
width="400px" /></p>
-
-<p>As can be seen, NNVM compiler generate code that outperforms Apache
MXNet on K80. These improvements are due to the joint graph level and kernel
level optimizations. It is worth noting that NNVM compiler generates all the
optimized GPU kernels on its own without relying on external libraries like
CuDNN.</p>
-
-<h3 id="raspberry-pi-3b">Raspberry Pi 3b</h3>
-
-<p>The Rasberry Pi compilation stack is contributed by Ziheng
Jiang(AWS/FDU).
-We compared NNVM compiler against Apache MXNet with OpenBLAS and NNPack.
-We explored the setups to get the best performance out of MXNet: we turned on
Winograd convolution in the NNPACK for 3x3 convolutions, enabled
multi-threading and disabled the additional scheduler thread (so all threads
are used by NNPack).</p>
-
-<p style="text-align: center"><img
src="/images/nnvm/nnvm_rasp_result.png" alt="image"
width="400px" /></p>
-
-<p>As can be seen, the code generated by NNVM compiler is two times
faster on ResNet18.
-The gap on MobileNet is mainly due to lack of depthwise convolution in
existing CPU DNN libraries. NNVM compiler takes benefit of direct generating
efficient ARM code directly.</p>
-
-<h2 id="acknowledgement">Acknowledgement</h2>
-<p>This project wouldn’t become possible without our early contributors
in the DMLC community.
-We would like to specially thank Yuwei Hu(TuSimple), Leyuan Wang(AWS/UCDavis),
Joshua Z. Zhang(AWS)
-and Xingjian Shi(HKUST) for their early contributions to the project. We would
also like to thank all the contributors
-to the TVM stack.</p>
-
-<p>We also learnt a lot from the following projects when building NNVM
Compiler.</p>
-<ul>
- <li><a
href="https://github.com/Theano/Theano">Theano</a>: possibly
the earliest compiler for deep learning</li>
- <li><a
href="https://github.com/halide/Halide">Halide</a>: TVM uses
<a href="https://github.com/dmlc/HalideIR">HalideIR</a>
as data structure for
-arithematic simplification and low level lowering. HalideIR is derived from
Halide.
-We also learns from Halide when implementing the lowering pipeline in
TVM.</li>
- <li><a
href="https://github.com/inducer/loopy">Loopy</a>: use of
integer set analysis and its loop transformation primitives.</li>
-</ul>
-
-<h2 id="links">Links</h2>
-<ul>
- <li>Github page of NNVM Compiler: <a
href="https://github.com/dmlc/nnvm">https://github.com/dmlc/nnvm</a></li>
- <li>Github page of TVM: <a
href="https://github.com/dmlc/tvm">https://github.com/dmlc/tvm</a></li>
- <li><a
href="https://news.cs.washington.edu/2017/10/06/allen-school-and-aws-team-up-on-new-nnvm-compiler-for-deep-learning-frameworks/">UW
Allen school blog about NNVM compiler</a></li>
- <li><a
href="https://aws.amazon.com/blogs/ai/introducing-nnvm-compiler-a-new-open-end-to-end-compiler-for-ai-frameworks/">AWS
blogpost about NNVM compiler</a></li>
-</ul>
-</content>
- </entry>
-
</feed>
diff --git a/blog.html b/blog.html
index 8dbab07..ae12173 100644
--- a/blog.html
+++ b/blog.html
@@ -146,6 +146,16 @@
<li>
<span>
+ <a class="post-link" href="/2021/03/03/intro-auto-scheduler">Introducing
TVM Auto-scheduler (a.k.a. Ansor)</a>
+ </span>
+ </br>
+ <span>
+ Mar 3, 2021
+ </span>
+</li>
+
+<li>
+ <span>
<a class="post-link" href="/2020/09/26/bring-your-own-datatypes">Bring
Your Own Datatypes: Enabling Custom Datatype Exploration in TVM</a>
</span>
</br>
diff --git a/community.html b/community.html
index 365bb79..d3e347f 100644
--- a/community.html
+++ b/community.html
@@ -279,6 +279,10 @@ This is a community maintained list of organizations using
and contributing to t
</li>
<li>
+ <img src="/images/community/sjtu.png" />
+ </li>
+
+ <li>
<img src="/images/community/ucberkeley.png" />
</li>
diff --git a/feed.xml b/feed.xml
index a3d90e2..5d387ea 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,124 @@
-<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="4.1.1">Jekyll</generator><link href="/feed.xml" rel="self"
type="application/atom+xml" /><link href="/" rel="alternate" type="text/html"
/><updated>2021-01-04T16:22:52-05:00</updated><id>/feed.xml</id><title
type="html">TVM</title><author><name>{"name"=>nil}</name></author><entry><title
type="html">Bring Your Own Datatypes: Enabling Custom Datatype [...]
+<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="4.1.1">Jekyll</generator><link href="/feed.xml" rel="self"
type="application/atom+xml" /><link href="/" rel="alternate" type="text/html"
/><updated>2021-03-03T01:20:46-08:00</updated><id>/feed.xml</id><title
type="html">TVM</title><author><name>{"name"=>nil}</name></author><entry><title
type="html">Introducing TVM Auto-scheduler (a.k.a. Ansor)</tit [...]
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and
layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described
by mathematical expressions.
+However, providing high-performance implementations for them on modern
hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware
intrinsics to achieve high performance.
+It takes huge engineering effort to build linear algebra and neural network
acceleration libraries like cuBLAS, cuDNN, oneMKL, and oneDNN.</p>
+
+<p>Our life would be much easier if we could just write mathematical expressions and have something magically turn them into efficient implementations.
+Three years ago, the deep learning compiler TVM and its search module AutoTVM were built as the first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient
implementations for a given tensor computation.
+However, this template-based approach still requires domain experts to implement a non-trivial manual template for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM
code repository.
+Besides being very hard to develop, these templates often have inefficient and
limited search spaces,
+making them unable to achieve optimal performance.</p>
+
+<p>To address the limitations of AutoTVM, we started the Ansor project, aiming at a fully automated auto-scheduler for generating code for tensor computations.
+Ansor auto-scheduler only takes tensor expressions as input and generates
high-performance code without manual templates.
+We made innovations in the search space construction and search algorithm.
+As a result, the auto-scheduler can achieve better performance with less
search time in a more automated way.</p>
+
+<p>Ansor auto-scheduler is now integrated into Apache TVM as the <code
class="language-plaintext
highlighter-rouge">tvm.auto_scheduler</code> package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS and
OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some
benchmark results.</p>
+
+<h1 id="system-overview">System Overview</h1>
+
+<h2 id="autotvm-vs-auto-scheduler">AutoTVM vs
Auto-scheduler</h2>
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/workflow.png" alt="image"
width="75%" /></p>
+<center> Table 1. Workflow Comparison </center>
+<p></p>
+
+<p>Table 1 compares the workflow for generating code for an operator in
AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor
expression language.
+This part is relatively easy because TVM’s tensor expression language looks
just like math expressions.
+In step 2, the developer has to write a schedule template, which typically
consists of 20-100 lines of tricky DSL code.
+This part is difficult because it requires domain expertise in both the target hardware architecture and the operator semantics.
+The last step, step 3, is automated by a search algorithm.</p>
+
+<p>In auto-scheduler, we eliminate the most difficult step 2 by
automatic search space construction and accelerate step 3 with a better search
algorithm.
+By doing automatic search space construction, we not only eliminate a huge manual effort but also enable the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules
to generate the search space.
+However, these rules are very general. They are based on static analysis of
the tensor expressions.
+We only need to design a few general rules once and can apply them to almost
all tensor computations in deep learning.</p>
+
+<h2 id="search-process">Search Process</h2>
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/search_overview.png"
alt="image" width="40%" /></p>
+<center> Figure 1. Search Process Overview </center>
+<p></p>
+
+<p>Figure 1 shows the search process of auto-scheduler when optimizing
a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator
fusion pass.
+A task scheduler is used to allocate the tuning time across the many subgraphs.
+At each iteration, it picks a subgraph that has the most potential to increase
the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several
sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of
optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback
to update all components of the system.
+This process is repeated iteratively until the optimization converges or we
run out of time budget.
+More technical details can be found in our paper [3] and our code.</p>
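The outer loop of this process can be sketched in a few lines (a toy simulation with made-up latencies, not the real system): the task scheduler repeatedly picks the subgraph contributing most to end-to-end latency, tunes it for one round, and feeds the measurement back.

```python
import random

random.seed(0)

# Hypothetical per-subgraph latency (ms) and how often each appears in the model.
tasks = {
    "conv_a": {"latency": 5.0, "weight": 2},
    "conv_b": {"latency": 3.0, "weight": 1},
    "dense":  {"latency": 8.0, "weight": 1},
}

def pick_task(tasks):
    # Simple heuristic: potential gain ~ contribution to end-to-end latency.
    return max(tasks, key=lambda t: tasks[t]["latency"] * tasks[t]["weight"])

for _ in range(10):
    name = pick_task(tasks)
    # "Tune" one round: measured candidates improve the best-known latency.
    tasks[name]["latency"] *= 1 - random.uniform(0.05, 0.15)

total = sum(t["latency"] * t["weight"] for t in tasks.values())
print(round(total, 2))  # end-to-end latency, down from the initial 21.0
```

In the real system the per-round improvement comes from evolutionary search guided by the learned cost model, and the measurements also retrain that cost model.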
+
+<p>It is worth noting that since the auto-scheduler generates schedules from scratch,
+it reuses the existing computation definitions in TOPI but not the schedule templates.</p>
+
+<h1 id="benchmark-results">Benchmark Results</h1>
+<p>In this section, we benchmark the performance of AutoTVM and auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped with an 18-core Intel Skylake 8124M CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge instance, which is equipped with an NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in the repo [2].</p>
+
+<h2 id="performance-of-the-generated-code">Performance of the
generated code</h2>
+<p>We benchmark the fp32 single-batch inference latency on three networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see that auto-scheduler outperforms AutoTVM in all cases, with speedups ranging from 1.02x to 8.95x.
+This is because auto-scheduler explores a larger search space, which covers more efficient combinations
+of optimizations that are missed by the manual templates in TOPI.
+BERT-base@GPU is an extreme case where the manual templates fall short:
+the manual template for dense layers does not perform well for the shapes in the BERT model.</p>
+
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/code_perf.png"
alt="image" width="85%" /></p>
+<center> Figure 2. Code Performance Comparison (Higher is better) </center>
+<p></p>
+
+<h2 id="search-time">Search Time</h2>
+<p>Search-based approaches can be very time-consuming, so we also care about the search time.
+It typically takes several hours for the search to converge for a single neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its larger search space.
+This is mainly because auto-scheduler has a better cost model and task scheduler.</p>
+
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/search_time.png"
alt="image" width="85%" /></p>
+<center> Figure 3. Search Time Comparison (Lower is better) </center>
+<p></p>
+
+<h2 id="more-results">More Results</h2>
+<p>The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, the blog post [4] also tried auto-scheduler on an Apple M1 chip and reported good results.</p>
+
+<h1 id="conclusion">Conclusion</h1>
+<p>We built TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
+Compared with its predecessor AutoTVM, auto-scheduler does not require manual templates.
+Moreover, auto-scheduler is capable of generating schedules with better performance in a shorter time.
+We achieve this through innovations in search space construction and the search algorithm.</p>
+
+<p>We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending auto-scheduler to better support
+sparse operators, low-precision operators, and dynamic shapes.</p>
+
+<h1 id="links">Links</h1>
+<p>[1] Tutorials: <a
href="https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling">https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling</a><br
/>
+[2] Benchmark repo: <a
href="https://github.com/tlc-pack/TLCBench">https://github.com/tlc-pack/TLCBench</a><br
/>
+[3] OSDI Paper: <a href="https://arxiv.org/abs/2006.06762">Ansor: Generating High-Performance Tensor Programs for Deep Learning</a><br />
+[4] Results on Apple M1 chip: <a
href="https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d">https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d</a>.</p></content><author><name>Lianmin
Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu</name></author><summary
type="html">Optimizing the execution speed of deep neural networks i [...]
<h2 id="introduction">Introduction</h2>
@@ -282,7 +402,7 @@ For more documentation about the Bring Your Own Datatypes
framework
<p><a
href="https://posithub.org/docs/BeatingFloatingPoint.pdf"
target="_blank">Beating Floating Point at its Own Game: Posit
Arithmetic</a> <a href="#fnref:posit"
class="reversefootnote"
role="doc-backlink">&#8617;</a></p>
</li>
</ol>
-</div></content><author><name>Gus Smith, Andrew
Liu</name></author><summary type="html">In this post, we describe the Bring
Your Own Datatypes framework, which enables the use of custom datatypes within
TVM.</summary></entry><entry><title type="html">How to Bring Your Own Codegen
to TVM</title><link href="/2020/07/15/how-to-bring-your-own-codegen-to-tvm"
rel="alternate" type="text/html" title="How to Bring Your Own Codegen to TVM"
/><published>2020-07-15T00:00:00-04:00</published>< [...]
+</div></content><author><name>Gus Smith, Andrew
Liu</name></author><summary type="html">In this post, we describe the Bring
Your Own Datatypes framework, which enables the use of custom datatypes within
TVM.</summary></entry><entry><title type="html">How to Bring Your Own Codegen
to TVM</title><link href="/2020/07/15/how-to-bring-your-own-codegen-to-tvm"
rel="alternate" type="text/html" title="How to Bring Your Own Codegen to TVM"
/><published>2020-07-15T00:00:00-07:00</published>< [...]
<p>However, users have to learn a new programming interface when they
attempt to work on a new kernel library or a device. As a result, the demand
for a unified programming interface becomes more and more important to let all
users and hardware backend providers stand on the same page.</p>
@@ -751,7 +871,7 @@ Figure 4: After Graph Partitioning.
<h2 id="acknowledgment">Acknowledgment</h2>
-<p>We would like to thank our colleague Animesh Jain for valuable
discussions in the framework design; Tianqi Chen and Jared Roesch from OctoML
for system design discussions and prototyping; Masahiro Masuda from the TVM
community to help code review and improve the DNNL integration. We would also
like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and
Luke Hutton from ARM, U.K. for contributing several helpful ideas, related
Relay passes, and the Arm Compute Li [...]
+<p>We would like to thank our colleague Animesh Jain for valuable
discussions in the framework design; Tianqi Chen and Jared Roesch from OctoML
for system design discussions and prototyping; Masahiro Masuda from the TVM
community to help code review and improve the DNNL integration. We would also
like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and
Luke Hutton from ARM, U.K. for contributing several helpful ideas, related
Relay passes, and the Arm Compute Li [...]
the Jupyter Notebook to follow along is on <a
href="https://github.com/t-vi/pytorch-tvmisc/tree/master/transformers-pytorch-tvm/">github</a>.)</p>
<p>Some of the most intriguing applications of Artificial Intelligence
have been in Natural Language Processing.
@@ -1264,7 +1384,7 @@ one would want to re-do cheap computation, most
prominently point-wise computati
<h1 id="author">Author</h1>
<p><a href="https://lernapparat.de/">Thomas
Viehmann</a> is the founder of <a
href="https://mathinf.eu/">MathInf GmbH</a>, Munich,
Germany, a boutique training and consultancy firm focusing on Machine Learning
and PyTorch.
-He is a PyTorch core developer and co-authored <a
href="https://www.manning.com/books/deep-learning-with-pytorch">Deep
Learning with PyTorch</a>, which currently available as <a
href="https://pytorch.org/deep-learning-with-pytorch">free
download from the PyTorch
website</a>.</p></content><author><name>Thomas Viehmann, MathInf
GmbH</name></author><summary type="html"></summary></entry><entry><title
type="html">TinyML - How TVM is Taming Ti [...]
+He is a PyTorch core developer and co-authored <a
href="https://www.manning.com/books/deep-learning-with-pytorch">Deep
Learning with PyTorch</a>, which is currently available as <a
href="https://pytorch.org/deep-learning-with-pytorch">free
download from the PyTorch
website</a>.</p></content><author><name>Thomas Viehmann, MathInf
GmbH</name></author><summary type="html"></summary></entry><entry><title
type="html">TinyML - How TVM is Taming Ti [...]
<p>The proliferation of low-cost, AI-powered consumer devices has led to
widespread interest in “bare-metal” (low-power, often without an operating
system) devices among ML researchers and practitioners. While it is already
possible for experts to run <em>some</em> models on
<em>some</em> bare-metal devices, optimizing models for diverse
sets of devices is challenging, often requiring manually optimized
device-specific libraries. And for those platforms wi [...]
@@ -1563,7 +1683,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix
multiplication microkernel</
<li><a
href="https://homes.cs.washington.edu/~moreau/">Thierry
Moreau</a>, for mentoring me during my time at OctoML.</li>
<li><a
href="https://homes.cs.washington.edu/~vegaluis/">Luis
Vega</a>, for teaching me the fundamentals of interacting with
microcontrollers.</li>
<li><a
href="https://www.linkedin.com/in/themadrasi/?originalSubdomain=uk">Ramana
Radhakrishnan</a>, for supplying the Arm hardware used in our
experiments and for providing guidance on its usage.</li>
-</ul></content><author><name>Logan Weber and Andrew Reusch,
OctoML</name></author><summary type="html"></summary></entry><entry><title
type="html">Compiling Machine Learning to WASM and WebGPU with Apache
TVM</title><link
href="/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu"
rel="alternate" type="text/html" title="Compiling Machine Learning to WASM and
WebGPU with Apache TVM"
/><published>2020-05-14T00:00:00-04:00</published><updated>2020-05-14T00:00:00-04:00</upd
[...]
+</ul></content><author><name>Logan Weber and Andrew Reusch,
OctoML</name></author><summary type="html"></summary></entry><entry><title
type="html">Compiling Machine Learning to WASM and WebGPU with Apache
TVM</title><link
href="/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu"
rel="alternate" type="text/html" title="Compiling Machine Learning to WASM and
WebGPU with Apache TVM"
/><published>2020-05-14T00:00:00-07:00</published><updated>2020-05-14T00:00:00-07:00</upd
[...]
<p>We introduced support for WASM and WebGPU to the Apache TVM deep
learning compiler. Our experiments shows that TVM’s WebGPU backend can get
<strong>close to native</strong> <strong>GPU
performance</strong> when deploying models to the web.</p>
@@ -1641,7 +1761,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix
multiplication microkernel</
<h2 id="acknowledgement">Acknowledgement</h2>
-<p>We would like to thank the emscripten project for providing the WASM
compilation infrastructures as well as the JS library support on the web. We
would also like to thank the WebGPU community for various helpful discussions.
Thanks to Fletcher Haynes for valuable feedbacks to the
post.</p></content><author><name>Tianqi Chen and Jared Roesch,
OctoML</name></author><summary type="html">TLDR</summary></entry><entry><title
type="html">Integrating TVM into PyTorch</title><link [...]
+<p>We would like to thank the emscripten project for providing the WASM
compilation infrastructures as well as the JS library support on the web. We
would also like to thank the WebGPU community for various helpful discussions.
Thanks to Fletcher Haynes for valuable feedbacks to the
post.</p></content><author><name>Tianqi Chen and Jared Roesch,
OctoML</name></author><summary type="html">TLDR</summary></entry><entry><title
type="html">Integrating TVM into PyTorch</title><link [...]
it has become clear that PyTorch stands to benefit from directly leveraging
the compiler stack.
A major tenet of PyTorch is providing seamless and robust integrations that
don’t get in the user’s way.
To that end, PyTorch now has an official TVM-based backend, <a
href="https://github.com/pytorch/tvm">torch_tvm</a>.</p>
@@ -1733,7 +1853,7 @@ def mul(a, b, c):
# via script
relay_graph = torch_tvm.to_relay(mul, inputs)
-</code></pre></div></div></content><author><name>Bram
Wasti</name></author><summary type="html">As TVM continuously demonstrates
improvements to the efficiency of deep learning execution, it has become clear
that PyTorch stands to benefit from directly leveraging the compiler stack. A
major tenet of PyTorch is providing seamless and robust integrations that don’t
get in the user’s way. To that end, PyTorch now has an official TVM-based
backend, torch_tvm.</summary [...]
+</code></pre></div></div></content><author><name>Bram
Wasti</name></author><summary type="html">As TVM continuously demonstrates
improvements to the efficiency of deep learning execution, it has become clear
that PyTorch stands to benefit from directly leveraging the compiler stack. A
major tenet of PyTorch is providing seamless and robust integrations that don’t
get in the user’s way. To that end, PyTorch now has an official TVM-based
backend, torch_tvm.</summary [...]
On real-time scenarios such as inference on autonomous vehicles, the inference
speed of the model is critical.
Network quantization is an effective approach to accelerating deep learning
models.
In quantized models, both data and model parameters are represented with low
precision data types such as <code class="language-plaintext
highlighter-rouge">int8</code> and <code
class="language-plaintext highlighter-rouge">float16</code>.
@@ -1868,7 +1988,7 @@ We show that automatic optimization in TVM makes it easy
and flexible to support
</ul>
<h1 id="bio--acknowledgement">Bio &amp;
Acknowledgement</h1>
-<p><a href="https://wuwei.io/">Wuwei Lin</a> is an
undergraduate student at SJTU. He is currently an intern at TuSimple. The
author has many thanks to <a
href="https://homes.cs.washington.edu/~tqchen/">Tianqi
Chen</a> and <a
href="https://homes.cs.washington.edu/~eqy/">Eddie Yan</a>
for their reviews.</p></content><author><name>Wuwei
Lin</name></author><summary type="html">Deep learning has been successfully ap
[...]
+<p><a href="https://wuwei.io/">Wuwei Lin</a> is an
undergraduate student at SJTU. He is currently an intern at TuSimple. The
author has many thanks to <a
href="https://homes.cs.washington.edu/~tqchen/">Tianqi
Chen</a> and <a
href="https://homes.cs.washington.edu/~eqy/">Eddie Yan</a>
for their reviews.</p></content><author><name>Wuwei
Lin</name></author><summary type="html">Deep learning has been successfully ap
[...]
<p>TVM is an open source deep learning compiler stack that closes the
gap between the productivity-focused deep learning frameworks, and the
performance- or efficiency-oriented hardware backends. Today, we are glad to
announce that the TVM community has decided to move on to Apache incubator, and
becomes an Apache(incubating) project.</p>
@@ -1882,7 +2002,7 @@ We show that automatic optimization in TVM makes it easy
and flexible to support
<p>We would like to take this chance to thank the Allen School for
supporting the SAMPL team that gave birth to the TVM project. We would also
like to thank the Halide project which provided the basis for TVM’s loop-level
IR and initial code generation. We would like to thank our Apache incubator
mentors for introducing the project to Apache and providing useful guidance.
Finally, we would like to thank the TVM community and all of the organizations,
as listed above, that supported [...]
-<p>See also the <a
href="https://news.cs.washington.edu/2019/03/18/allen-schools-tvm-deep-learning-compiler-framework-transitions-to-apache/">Allen
School news about the transition here</a>, <a
href="https://sampl.cs.washington.edu/tvmconf/#about-tvmconf">TVM
conference program slides and recordings</a>, and <a
href="https://tvm.apache.org/docs//contribute/community.html">our
community guideline here</a>. Follow us o [...]
+<p>See also the <a
href="https://news.cs.washington.edu/2019/03/18/allen-schools-tvm-deep-learning-compiler-framework-transitions-to-apache/">Allen
School news about the transition here</a>, <a
href="https://sampl.cs.washington.edu/tvmconf/#about-tvmconf">TVM
conference program slides and recordings</a>, and <a
href="https://tvm.apache.org/docs//contribute/community.html">our
community guideline here</a>. Follow us o [...]
<p>TVM is an open deep learning compiler stack to compile various deep
learning models from different
frameworks to CPU, GPU or specialized accelerators. TVM supports model
compilation from a wide range
@@ -2043,155 +2163,4 @@ closure as TVM packed function and invoke the same
across programming language b
<li>[5] <a
href="https://blog.learngoprogramming.com/golang-variadic-funcs-how-to-patterns-369408f19085">Go
Variadic Functions</a></li>
<li>[6] <a
href="https://github.com/jdeng/gomxnet">CFFI
Ref</a></li>
<li>[7] <a
href="https://golang.org/pkg/runtime/#SetFinalizer">Go
Finalizers</a></li>
-</ul></content><author><name>Siva</name></author><summary
type="html">Introduction</summary></entry><entry><title type="html">Automating
Generation of Low Precision Deep Learning Operators</title><link
href="/2018/12/18/lowprecision-conv" rel="alternate" type="text/html"
title="Automating Generation of Low Precision Deep Learning Operators"
/><published>2018-12-18T00:00:00-05:00</published><updated>2018-12-18T00:00:00-05:00</updated><id>/2018/12/18/lowprecision-conv</id><content
ty [...]
-devices becomes challenging because of their limited compute and energy
budgets. A recent trend
- in deep learning is the use of extremely quantized models that
operate on inputs and
- weights of a few bits, with networks like XNOR-Net, DoReFa-Net, and
HWGQ-Net making steady
-progress improving accuracy.</p>
-
-<p>An example of a low precision graph snippet is below. The low
precision convolution takes in
-quantized data and bitpacks into the proper data layout for an efficient
bitserial convolution.
-The output is in a higher precision and traditional deep learning layers such
as batch normalization and ReLu are applied to it, before being re-quantized
and sent through another low precision operator.</p>
-
-<p style="text-align: center"><img
src="/images/low-precision/workflow.png" alt="image"
width="50%" /></p>
-<center> Low precision convolution pipeline.</center>
-<p></p>
-
-<p>Theoretically, low precision operators use less operations than
-floating point operators, leading many to believe they can achieve up
tremendous speedups.
-However, deep learning frameworks leverage decades of engineering work
through low level
-BLAS and LAPACK libraries that are incredibly well optimized, and CPUs
include intrinsic
-instructions to accelerate these tasks. In practice, it is not simple
to develop low-level
-operators such as convolutions that are competitive with 8-bit quantized
or even floating
-point operators.
-In this post we introduce our approach to automatically generating
optimized
-low precision convolutions for CPUs. We declare our low precision operators
so that they compute
-on efficiently stored low precision inputs, and describe a schedule that
describes a search space
-of implementation parameters. We rely on AutoTVM to quickly search the space
and find optimized
-parameters for the particular convolution, precision, and backend.</p>
-
-<h2 id="bitserial-computation-background">Bitserial
Computation Background</h2>
-
-<p>The core of low precision models is the bitserial dot product
that enables convolution and
-dense operators to be computed using only bitwise operations and popcount.
- Typically, a dot product is computed by element wise multiplication of two
vectors followed by
- summing all the elements, like the simple example below. If all the data is
binary, the input
- vectors can be packed into single integer, and the dot product can be
computed by bitwise-anding
- the packed inputs and counting the number of 1’s in the result using popcount.
-Note: Depending how the input data is quantized, bitwise-xnor may be used
instead of bitwise-and.</p>
-
-<p style="text-align: center"><img
src="/images/low-precision/binary-dotproduct.png"
alt="image" width="50%" /></p>
-<center> Binary dot product.</center>
-<p></p>
-
-<p>Arbitrary precision dot products can be computed in this fashion by
first separating input data
-into bitplanes. Once in this representation we can compute dotproduct by
summing weighted binary
-dot products between the bitplanes of A and B. The number of binary
dotproducts grows with the
-product of A and B’s precision, so this method is only practical for very low
precision data.</p>
-
-<p style="text-align: center"><img
src="/images/low-precision/bitserial-dotproduct.png"
alt="image" width="50%" /></p>
-<center> Bitserial dot product.</center>
-<p></p>
-
-<h2 id="defining-operators-in-tvm">Defining Operators in
TVM</h2>
-<p>Before the computation, input data needs to be bitpacked so that the
bitplanes of the input data
-can be accessed and are packed into a supported datatype such as a uint8 or
uint32. We provide
-a flexible bitpacking operator that takes arbitrary size input tensors and
returns a bitpacked
-tensor where the user specifies which axis the bitplanes should be.</p>
-
-<p style="text-align: center"><img
src="/images/low-precision/bitpack.png" alt="image"
width="50%" /></p>
-<center> Different bitpacked layouts.</center>
-<p></p>
-
-<p>Once in this bitpacked format the low precision convolution can be
computed bitserially.
-For this demo, that data is packed along the input channel and the bitplanes
are added to the
-innermost axis, and the data is packed into 32-bit integers. The bitserial
convolution is computed
-similar to a normal convolution, but the bitwise-and (&amp;) replaces
multiplication, and we use
-popcount to accumulate values in the packed data. The bitplane axes become
additional reduction axes
-and compute the binary dot products between different bitplanes of the input
and kernel.
-Finally, the output is computed in an unpacked format and in higher
precision.</p>
-
-<div class="language-python highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code><span
class="n">Input_bitpacked</span> <span
class="o">=</span> <span
class="n">bitpack</span><span
class="p">(</span><span
class="n">Input</span><span
class="p">,</span> <span class="n">acti [...]
-<span class="n">Weights_bitpacked</span> <span
class="o">=</span> <span
class="n">bitpack</span><span
class="p">(</span><span
class="n">Filter</span><span
class="p">,</span> <span
class="n">weight_bits</span><span
class="p">,</span> <span
class="n">pack_axis</span><span class="o"& [...]
-<span class="n">batch</span><span
class="p">,</span> <span
class="n">in_height</span><span
class="p">,</span> <span
class="n">in_width</span><span
class="p">,</span> <span
class="n">in_channel_q</span><span
class="p">,</span> <span
class="n">_</span> <span
class="o">=</span& [...]
-<span class="n">kernel_h</span><span
class="p">,</span> <span
class="n">kernel_w</span><span
class="p">,</span> <span
class="n">_</span><span
class="p">,</span> <span
class="n">num_filter</span><span
class="p">,</span> <span
class="n">_</span> <span
class="o">=</span> < [...]
-
-<span class="n">stride_h</span><span
class="p">,</span> <span
class="n">stride_w</span> <span
class="o">=</span> <span
class="n">stride</span>
-<span class="n">pad_top</span><span
class="p">,</span> <span
class="n">pad_left</span><span
class="p">,</span> <span
class="n">pad_down</span><span
class="p">,</span> <span
class="n">pad_right</span> <span
class="o">=</span> <span
class="n">get_pad_tuple</span><span
class="p">( [...]
-
-<span class="c1"># Computing the output shape
-</span><span class="n">out_channel</span> <span
class="o">=</span> <span
class="n">num_filter</span>
-<span class="n">out_height</span> <span
class="o">=</span> <span
class="n">simplify</span><span
class="p">((</span><span
class="n">in_height</span> <span
class="o">-</span> <span
class="n">kernel_h</span> <span
class="o">+</span> <span
class="n">pad_top</span> <span class="o">+
[...]
-<span class="n">out_width</span> <span
class="o">=</span> <span
class="n">simplify</span><span
class="p">((</span><span
class="n">in_width</span> <span
class="o">-</span> <span
class="n">kernel_w</span> <span
class="o">+</span> <span
class="n">pad_left</span> <span class="o">+&
[...]
-<span class="n">pad_before</span> <span
class="o">=</span> <span
class="p">[</span><span
class="mi">0</span><span
class="p">,</span> <span
class="n">pad_top</span><span
class="p">,</span> <span
class="n">pad_left</span><span
class="p">,</span> <span
class="mi">0</span>< [...]
-<span class="n">pad_after</span> <span
class="o">=</span> <span
class="p">[</span><span
class="mi">0</span><span
class="p">,</span> <span
class="n">pad_down</span><span
class="p">,</span> <span
class="n">pad_right</span><span
class="p">,</span> <span
class="mi">0</span>&l [...]
-<span class="n">Input_padded</span> <span
class="o">=</span> <span
class="n">pad</span><span
class="p">(</span><span
class="n">Input_bitpacked</span><span
class="p">,</span> <span
class="n">pad_before</span><span
class="p">,</span> <span
class="n">pad_after</span><span class="p"&g
[...]
-
-<span class="c1"># Treat the bitplane axes like additional
reduction axes
-</span><span class="n">rc</span> <span
class="o">=</span> <span
class="n">tvm</span><span
class="p">.</span><span
class="n">reduce_axis</span><span
class="p">((</span><span
class="mi">0</span><span
class="p">,</span> <span
class="n">in_channel_q</span><span
class="p">),&l [...]
-<span class="n">ry</span> <span
class="o">=</span> <span
class="n">tvm</span><span
class="p">.</span><span
class="n">reduce_axis</span><span
class="p">((</span><span
class="mi">0</span><span
class="p">,</span> <span
class="n">kernel_h</span><span
class="p">),</span> <s [...]
-<span class="n">rx</span> <span
class="o">=</span> <span
class="n">tvm</span><span
class="p">.</span><span
class="n">reduce_axis</span><span
class="p">((</span><span
class="mi">0</span><span
class="p">,</span> <span
class="n">kernel_w</span><span
class="p">),</span> <s [...]
-<span class="n">ib</span> <span
class="o">=</span> <span
class="n">tvm</span><span
class="p">.</span><span
class="n">reduce_axis</span><span
class="p">((</span><span
class="mi">0</span><span
class="p">,</span> <span
class="n">input_bits</span><span
class="p">),</span> < [...]
-<span class="n">wb</span> <span
class="o">=</span> <span
class="n">tvm</span><span
class="p">.</span><span
class="n">reduce_axis</span><span
class="p">((</span><span
class="mi">0</span><span
class="p">,</span> <span
class="n">weight_bits</span><span
class="p">),</span> &l [...]
-
-
-<span class="n">tvm</span><span
class="p">.</span><span
class="n">compute</span><span
class="p">((</span><span
class="n">batch</span><span
class="p">,</span> <span
class="n">out_height</span><span
class="p">,</span> <span
class="n">out_width</span><span
class="p">,</span> [...]
- <span class="n">tvm</span><span
class="p">.</span><span
class="nb">sum</span><span
class="p">(</span><span
class="n">tvm</span><span
class="p">.</span><span
class="n">popcount</span><span
class="p">(</span>
- <span
class="n">Input_padded</span><span
class="p">[</span><span
class="n">nn</span><span
class="p">,</span> <span
class="n">yy</span> <span
class="o">*</span> <span
class="n">stride_h</span> <span
class="o">+</span> <span
class="n">ry</span><span class="p">,<
[...]
- <span
class="n">Weights_bitpacked</span><span
class="p">[</span><span
class="n">ry</span><span
class="p">,</span> <span
class="n">rx</span><span
class="p">,</span> <span
class="n">rc</span><span
class="p">,</span> <span
class="n">ff</span><span class="p">,</sp
[...]
- <span class="n">axis</span><span
class="o">=</span><span
class="p">[</span><span
class="n">rc</span><span
class="p">,</span> <span
class="n">ry</span><span
class="p">,</span> <span
class="n">rx</span><span
class="p">,</span> <span
class="n">wb</span><spa [...]
-
-</code></pre></div></div>
-
-<p>In our schedule we apply common optimizations like vectorization and
memory tiling to provide better
-memory locality and take advantage of SIMD units. Some of these optimizations
such as tiling,
-require parameters that need to be tuned to for the specific
microarchitecture. We expose these
-parameters as knobs to TVM and use AutoTVM to automatically tune all the
parameters simultaneously.</p>
-
-<p>Finally, we can craft small microkernels to replace the innermost
loop(s) of computation and schedule
- them using TVM’s tensorize primitive. Since, compilers often produce
suboptimal code, people can
- often write short assembly sequences that are more efficient. These
microkernels often take advantage
- of new intrinsics that are being introduced to help accelerate deep learning
workloads and use
- them clever ways to improve memory accesses or reduce the number instructions
required.</p>
-
-<h2 id="results">Results</h2>
-
-<h3 id="raspberry-pi">Raspberry Pi</h3>
-<p>Convolution speedups on Raspberry Pi 3B compared to 16-bit integer
TVM implementation.
-Workload are convolution layers from ResNet18.</p>
-
-<p style="text-align: center"><img
src="/images/low-precision/rasp-conv.png" alt="image"
width="50%" /></p>
-<center> Speedup of low precision convolutions on a Raspberry Pi
compared to 16-bit TVM implementation.</center>
-<p></p>
-
-<p>2-bit activation, 1-bit weight convolution speedups on Raspberry Pi
3B compared to hand optimized implementation from <a
href="https://arxiv.org/pdf/1712.02427.pdf">High performance
ultra-low-precision convolutions
-on mobile devices.</a>.
-Workload are convolution layers from ResNet18.</p>
-
-<p style="text-align: center"><img
src="/images/low-precision/rasp-conv-2.png" alt="image"
width="50%" /></p>
-<center> Speedup of 2-bit weight 1-bit activation Raspberry Pi
convolutions against a hand optimized implementation.</center>
-<p></p>
-
-<h3 id="x86">x86</h3>
-
-<p>Convolution speedups on x86 compared to a 32-bit floating point TVM
implementation.
-Note: x86 doesn’t support a vectorized popcount for this microarchitecture, so
speedups are lower.</p>
-<p style="text-align: center"><img
src="/images/low-precision/x86-conv.png" alt="image"
width="50%" /></p>
-<center> Speedup of x86 low precision convolutions compared to a 32-bit
floating point TVM implementation.</center>
-<p></p>
-
-<h2 id="show-me-the-code">Show me the code</h2>
-
-<ul>
- <li><a
href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/nn/bitserial_conv2d.py">TOPI
bitserial convolution</a></li>
- <li><a
href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/arm_cpu/bitserial_conv2d.py">TOPI
ARM cpu bitserial convolution</a></li>
-</ul>
-
-<h2 id="references">References</h2>
-
-<ul>
- <li>[1] <a
href="https://arxiv.org/abs/1810.11066">Automating Generation of
Low Precision Deep Learning Operators</a></li>
- <li>[2] <a
href="https://arxiv.org/abs/1603.05279">XNOR-Net</a></li>
- <li>[3] <a
href="https://arxiv.org/abs/1702.00953">HWGQ</a></li>
- <li>[4] <a
href="https://arxiv.org/abs/1606.06160">DoReFa</a></li>
-</ul></content><author><name>Meghan Cowan</name></author><summary
type="html">As deep learning models grow larger and more complex, deploying
them on low powered phone and IoT devices becomes challenging because of their
limited compute and energy budgets. A recent trend in deep learning is the use
of extremely quantized models that operate on inputs and weights of a few bits,
with networks like XNOR-Net, DoReFa-Net, and HWGQ-Net making steady progress
improving accuracy.</summary> [...]
\ No newline at end of file
+</ul></content><author><name>Siva</name></author><summary
type="html">Introduction</summary></entry></feed>
\ No newline at end of file
diff --git a/images/community/sjtu.png b/images/community/sjtu.png
new file mode 100644
index 0000000..0de00de
Binary files /dev/null and b/images/community/sjtu.png differ
diff --git a/images/intro-auto-scheduler/code_perf.png
b/images/intro-auto-scheduler/code_perf.png
new file mode 100644
index 0000000..d070a6e
Binary files /dev/null and b/images/intro-auto-scheduler/code_perf.png differ
diff --git a/images/intro-auto-scheduler/search_overview.png
b/images/intro-auto-scheduler/search_overview.png
new file mode 100644
index 0000000..7b6f56d
Binary files /dev/null and b/images/intro-auto-scheduler/search_overview.png
differ
diff --git a/images/intro-auto-scheduler/search_time.png
b/images/intro-auto-scheduler/search_time.png
new file mode 100644
index 0000000..4bd700b
Binary files /dev/null and b/images/intro-auto-scheduler/search_time.png differ
diff --git a/images/intro-auto-scheduler/workflow.png
b/images/intro-auto-scheduler/workflow.png
new file mode 100644
index 0000000..b2c7b26
Binary files /dev/null and b/images/intro-auto-scheduler/workflow.png differ
diff --git a/rss.xml b/rss.xml
index f2dfac7..2173b21 100644
--- a/rss.xml
+++ b/rss.xml
@@ -5,12 +5,142 @@
<description>TVM - </description>
<link>https://tvm.apache.org</link>
<atom:link href="https://tvm.apache.org" rel="self"
type="application/rss+xml" />
- <lastBuildDate>Mon, 04 Jan 2021 16:22:52 -0500</lastBuildDate>
- <pubDate>Mon, 04 Jan 2021 16:22:52 -0500</pubDate>
+ <lastBuildDate>Wed, 03 Mar 2021 01:20:46 -0800</lastBuildDate>
+ <pubDate>Wed, 03 Mar 2021 01:20:46 -0800</pubDate>
<ttl>60</ttl>
<item>
+ <title>Introducing TVM Auto-scheduler (a.k.a. Ansor)</title>
+ <description><p>Optimizing the execution speed of deep
neural networks is extremely hard with the growing
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and
layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described
by mathematical expressions.
+However, providing high-performance implementations for them on modern
hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware
intrinsics to achieve high performance.
+It takes huge engineering effort to build linear algebra and neural network
acceleration libraries like cuBLAS, cuDNN, oneMKL, and oneDNN.</p>
+
+<p>Life would be much easier if we could just write mathematical
expressions and have something
+magically turn them into efficient implementations.
+Three years ago, deep learning compiler TVM and its search module AutoTVM were
built as the first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient
implementations for a given tensor computation.
+However, this template-based approach still requires domain experts
to implement a non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM
code repository.
+Besides being very hard to develop, these templates often have inefficient and
limited search spaces,
+making them unable to achieve optimal performance.</p>
+
+<p>To address the limitations of AutoTVM, we started the Ansor project,
aiming at a fully automated auto-scheduler for
+generating code for tensor computations.
+Ansor auto-scheduler only takes tensor expressions as input and generates
high-performance code without manual templates.
+We made innovations in the search space construction and search algorithm.
+As a result, the auto-scheduler can achieve better performance with less
search time in a more automated way.</p>
+
+<p>Ansor auto-scheduler is now integrated into Apache TVM as the <code
class="language-plaintext
highlighter-rouge">tvm.auto_scheduler</code> package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS and
OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and
Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some
benchmark results.</p>
+
+<h1 id="system-overview">System Overview</h1>
+
+<h2 id="autotvm-vs-auto-scheduler">AutoTVM vs
Auto-scheduler</h2>
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/workflow.png" alt="image"
width="75%" /></p>
+<center> Table 1. Workflow Comparison </center>
+<p></p>
+
+<p>Table 1 compares the workflow for generating code for an operator in
AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor
expression language.
+This part is relatively easy because TVM’s tensor expression language looks
just like math expressions.
+In step 2, the developer has to write a schedule template, which typically
consists of 20-100 lines of tricky DSL code.
+This part requires domain expertise of both the target hardware architecture
and operator semantics, so it is difficult.
+The last step, step 3, is automated by a search algorithm.</p>
+
+<p>In auto-scheduler, we eliminate the most difficult step 2 by
automatic search space construction and accelerate step 3 with a better search
algorithm.
+By doing automatic search space construction, we not only eliminate huge
manual effort,
+but also enable the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules
to generate the search space.
+However, these rules are very general. They are based on static analysis of
the tensor expressions.
+We only need to design a few general rules once and can apply them to almost
all tensor computations in deep learning.</p>
+
+<h2 id="search-process">Search Process</h2>
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/search_overview.png"
alt="image" width="40%" /></p>
+<center> Figure 1. Search Process Overview </center>
+<p></p>
+
+<p>Figure 1 shows the search process of auto-scheduler when optimizing
a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator
fusion pass.
+A task scheduler allocates the tuning time budget across these subgraphs.
+At each iteration, it picks a subgraph that has the most potential to increase
the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several
sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of
optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback
to update all components of the system.
+This process is repeated iteratively until the optimization converges or we
run out of time budget.
+More technical details can be found in our paper [3] and our code.</p>
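The evolutionary-search loop guided by a learned cost model, as described above, can be sketched generically in a few lines of Python. This is a simplified conceptual illustration only, not TVM's actual `tvm.auto_scheduler` implementation; the function names and the encoding of candidates as tuples of integer knobs are invented for the example:

```python
import random

def mutate(candidate):
    # Toy mutation: perturb one integer tuning knob (e.g. a tile size) by +/-1.
    knobs = list(candidate)
    i = random.randrange(len(knobs))
    knobs[i] = max(1, knobs[i] + random.choice([-1, 1]))
    return tuple(knobs)

def evolutionary_search(initial_sketches, cost_model, measure,
                        generations=10, population=32, n_measure=8):
    # initial_sketches: candidate programs (here just tuples of integer knobs)
    # cost_model: cheap estimate of runtime (lower is better)
    # measure: expensive run on real hardware, used only for the final top few
    pop = list(initial_sketches)
    for _ in range(generations):
        # Explore the space by mutating randomly chosen survivors.
        children = [mutate(random.choice(pop)) for _ in range(population)]
        # Rank all candidates cheaply with the cost model; keep the best.
        pop = sorted(set(pop + children), key=cost_model)[:population]
    # Only the most promising candidates reach real hardware measurement.
    return min(pop[:n_measure], key=measure)
```

In the real system the cost model is itself retrained on the hardware measurements collected at each iteration, which is what closes the feedback loop in Figure 1.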
+
+<p>It is worth noting that since the auto-scheduler generates schedules
from scratch,
+it reuses the existing computation definitions in TOPI but not schedule
templates.</p>
+
+<h1 id="benchmark-results">Benchmark Results</h1>
+<p>In this section, we benchmark the performance of AutoTVM and
auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped
with an 18-core Intel Skylake 8124M CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge, which is equipped with an
NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo
[2].</p>
+
+<h2 id="performance-of-the-generated-code">Performance of the
generated code</h2>
+<p>We benchmark the fp32 single-batch inference latency on three
networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see that auto-scheduler outperforms AutoTVM in all cases, with speedups
ranging from 1.02x to 8.95x.
+This is because auto-scheduler explores a larger search space, which covers
more efficient combinations
+of optimizations that the manual TOPI templates miss.
+The BERT-base@GPU is an extreme case where the manual templates are very badly
designed.
+In other words, the manual template for dense layers does not perform well for
the shapes in the BERT model.</p>
+
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/code_perf.png"
alt="image" width="85%" /></p>
+<center> Figure 2. Code Performance Comparison (Higher is better)
</center>
+<p></p>
+
+<h2 id="search-time">Search Time</h2>
+<p>Search-based approaches can be very time-consuming,
so we also care about the search time.
+It typically takes several hours to let the search converge for a single
neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its
larger search space.
+This is mainly because auto-scheduler has a better cost model and task
scheduler.</p>
+
+<p style="text-align: center"><img
src="/images/intro-auto-scheduler/search_time.png"
alt="image" width="85%" /></p>
+<center> Figure 3. Search Time Comparison (Lower is better)
</center>
+<p></p>
+
+<h2 id="more-results">More Results</h2>
+<p>The repo above serves as an internal benchmark tool for TVM, so it
only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, this blog post [4] also tried auto-scheduler on an Apple M1 chip and
got some good results.</p>
+
+<h1 id="conclusion">Conclusion</h1>
+<p>We built TVM auto-scheduler, a system that automatically generates
high-performance code for tensor expressions.
+Compared with the predecessor AutoTVM, auto-scheduler does not require manual
templates.
+Besides, auto-scheduler is capable of generating schedules with better
performance in a shorter time.
+We achieve this by making innovations in the search space construction and
search algorithm.</p>
+
+<p>We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending auto-scheduler to better
support
+sparse operators, low-precision operators, and dynamic shapes.</p>
+
+<h1 id="links">Links</h1>
+<p>[1] Tutorials: <a
href="https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling">https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling</a><br
/>
+[2] Benchmark repo: <a
href="https://github.com/tlc-pack/TLCBench">https://github.com/tlc-pack/TLCBench</a><br
/>
+[3] OSDI Paper: <a
href="https://arxiv.org/abs/2006.06762">Ansor: Generating
High-Performance Tensor Programs for Deep Learning</a><br />
+[4] Results on Apple M1 chip: <a
href="https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d">https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d</a>.</p>
+
+</description>
+
<link>https://tvm.apache.org/2021/03/03/intro-auto-scheduler</link>
+
<guid>https://tvm.apache.org/2021/03/03/intro-auto-scheduler</guid>
+ <pubDate>Wed, 03 Mar 2021 00:00:00 -0800</pubDate>
+ </item>
+
+ <item>
<title>Bring Your Own Datatypes: Enabling Custom Datatype
Exploration in TVM</title>
<description><p>In this post, we describe the Bring Your
Own Datatypes framework, which enables the use of custom datatypes within
TVM.</p>
@@ -300,7 +430,7 @@ For more documentation about the Bring Your Own Datatypes
framework
</description>
<link>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</link>
<guid>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</guid>
- <pubDate>Sat, 26 Sep 2020 00:00:00 -0400</pubDate>
+ <pubDate>Sat, 26 Sep 2020 00:00:00 -0700</pubDate>
</item>
<item>
@@ -779,7 +909,7 @@ Figure 4: After Graph Partitioning.
</description>
<link>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</link>
<guid>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</guid>
- <pubDate>Wed, 15 Jul 2020 00:00:00 -0400</pubDate>
+ <pubDate>Wed, 15 Jul 2020 00:00:00 -0700</pubDate>
</item>
<item>
@@ -1302,7 +1432,7 @@ He is a PyTorch core developer and co-authored <a
href="https://www.mann
</description>
<link>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</link>
<guid>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</guid>
- <pubDate>Tue, 14 Jul 2020 00:00:00 -0400</pubDate>
+ <pubDate>Tue, 14 Jul 2020 00:00:00 -0700</pubDate>
</item>
<item>
@@ -1611,7 +1741,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix
multiplication microkernel</
</description>
<link>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</link>
<guid>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</guid>
- <pubDate>Thu, 04 Jun 2020 00:00:00 -0400</pubDate>
+ <pubDate>Thu, 04 Jun 2020 00:00:00 -0700</pubDate>
</item>
<item>
@@ -1698,7 +1828,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix
multiplication microkernel</
</description>
<link>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</link>
<guid>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</guid>
- <pubDate>Thu, 14 May 2020 00:00:00 -0400</pubDate>
+ <pubDate>Thu, 14 May 2020 00:00:00 -0700</pubDate>
</item>
<item>
@@ -1800,7 +1930,7 @@ relay_graph = torch_tvm.to_relay(mul, inputs)
</description>
<link>https://tvm.apache.org/2019/05/30/pytorch-frontend</link>
<guid>https://tvm.apache.org/2019/05/30/pytorch-frontend</guid>
- <pubDate>Thu, 30 May 2019 00:00:00 -0400</pubDate>
+ <pubDate>Thu, 30 May 2019 00:00:00 -0700</pubDate>
</item>
<item>
@@ -1944,7 +2074,7 @@ We show that automatic optimization in TVM makes it easy
and flexible to support
</description>
<link>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</link>
<guid>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</guid>
- <pubDate>Mon, 29 Apr 2019 12:00:00 -0400</pubDate>
+ <pubDate>Mon, 29 Apr 2019 09:00:00 -0700</pubDate>
</item>
<item>
@@ -1967,7 +2097,7 @@ We show that automatic optimization in TVM makes it easy
and flexible to support
</description>
<link>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</link>
<guid>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</guid>
- <pubDate>Mon, 18 Mar 2019 00:00:00 -0400</pubDate>
+ <pubDate>Mon, 18 Mar 2019 00:00:00 -0700</pubDate>
</item>
<item>
@@ -2137,7 +2267,7 @@ closure as TVM packed function and invoke the same across
programming language b
</description>
<link>https://tvm.apache.org/2019/01/19/Golang</link>
<guid>https://tvm.apache.org/2019/01/19/Golang</guid>
- <pubDate>Sat, 19 Jan 2019 00:00:00 -0500</pubDate>
+ <pubDate>Sat, 19 Jan 2019 00:00:00 -0800</pubDate>
</item>
<item>
@@ -2298,7 +2428,7 @@ Note: x86 doesn’t support a vectorized popcount for this
microarchitecture, so
</description>
<link>https://tvm.apache.org/2018/12/18/lowprecision-conv</link>
<guid>https://tvm.apache.org/2018/12/18/lowprecision-conv</guid>
- <pubDate>Tue, 18 Dec 2018 00:00:00 -0500</pubDate>
+ <pubDate>Tue, 18 Dec 2018 00:00:00 -0800</pubDate>
</item>
<item>
@@ -2414,7 +2544,7 @@ His research interest is in the general domain of ML on
shared private data, but
</description>
<link>https://tvm.apache.org/2018/10/09/ml-in-tees</link>
<guid>https://tvm.apache.org/2018/10/09/ml-in-tees</guid>
- <pubDate>Tue, 09 Oct 2018 00:00:00 -0400</pubDate>
+ <pubDate>Tue, 09 Oct 2018 00:00:00 -0700</pubDate>
</item>
<item>
@@ -2808,7 +2938,7 @@ for inference deployment. TVM just provides such a
solution.</p>
</description>
<link>https://tvm.apache.org/2018/10/03/auto-opt-all</link>
<guid>https://tvm.apache.org/2018/10/03/auto-opt-all</guid>
- <pubDate>Wed, 03 Oct 2018 00:00:00 -0400</pubDate>
+ <pubDate>Wed, 03 Oct 2018 00:00:00 -0700</pubDate>
</item>
<item>
@@ -2947,7 +3077,7 @@ support, and can be used to implement convenient
converters, such as
</description>
<link>https://tvm.apache.org/2018/08/10/DLPack-Bridge</link>
<guid>https://tvm.apache.org/2018/08/10/DLPack-Bridge</guid>
- <pubDate>Fri, 10 Aug 2018 00:00:00 -0400</pubDate>
+ <pubDate>Fri, 10 Aug 2018 00:00:00 -0700</pubDate>
</item>
<item>
@@ -3089,7 +3219,7 @@ This kind of high-level visibility is essential to system
designers who want to
</description>
<link>https://tvm.apache.org/2018/07/12/vta-release-announcement</link>
<guid>https://tvm.apache.org/2018/07/12/vta-release-announcement</guid>
- <pubDate>Thu, 12 Jul 2018 00:00:00 -0400</pubDate>
+ <pubDate>Thu, 12 Jul 2018 00:00:00 -0700</pubDate>
</item>
<item>
@@ -3355,7 +3485,7 @@ C = tvm.compute(
</description>
<link>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</link>
<guid>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</guid>
- <pubDate>Fri, 23 Mar 2018 00:00:00 -0400</pubDate>
+ <pubDate>Fri, 23 Mar 2018 00:00:00 -0700</pubDate>
</item>
<item>
@@ -3471,7 +3601,7 @@ optimizations into the TVM stack.</p>
</description>
<link>https://tvm.apache.org/2018/03/12/webgl</link>
<guid>https://tvm.apache.org/2018/03/12/webgl</guid>
- <pubDate>Mon, 12 Mar 2018 00:00:00 -0400</pubDate>
+ <pubDate>Mon, 12 Mar 2018 00:00:00 -0700</pubDate>
</item>
<item>
@@ -4045,7 +4175,7 @@ advice and <a
href="https://github.com/yzhliu">Yizhi Liu</a&g
</description>
<link>https://tvm.apache.org/2018/01/16/opt-mali-gpu</link>
<guid>https://tvm.apache.org/2018/01/16/opt-mali-gpu</guid>
- <pubDate>Tue, 16 Jan 2018 00:00:00 -0500</pubDate>
+ <pubDate>Tue, 16 Jan 2018 00:00:00 -0800</pubDate>
</item>
<item>
@@ -4273,7 +4403,7 @@ make jvminstall
</description>
<link>https://tvm.apache.org/2017/11/08/android-rpc-introduction</link>
<guid>https://tvm.apache.org/2017/11/08/android-rpc-introduction</guid>
- <pubDate>Wed, 08 Nov 2017 00:00:00 -0500</pubDate>
+ <pubDate>Wed, 08 Nov 2017 00:00:00 -0800</pubDate>
</item>
<item>
@@ -4499,90 +4629,7 @@ BB0_6:
</description>
<link>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</link>
<guid>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</guid>
- <pubDate>Mon, 30 Oct 2017 00:00:00 -0400</pubDate>
- </item>
-
- <item>
- <title>NNVM Compiler: Open Compiler for AI Frameworks</title>
- <description><p style="text-align:
center">Paul G. Allen School of Computer Science &amp; Engineering,
University of Washington</p>
-<p style="text-align: center">Amazon Web Service AI
team</p>
-<p style="text-align: center">DMLC open-source
community</p>
-
-<p>Deep learning has become ubiquitous and indispensable. We are seeing
a rising need for deploying deep learning workloads on many kinds of platforms
such as mobile phones, GPU, IoT devices and specialized accelerators. Last
month, we announced TVM stack to close the gap between deep learning
frameworks, and the performance- or efficiency-oriented hardware backends. TVM
stack makes it easy to build an end to end compilation for a deep learning
framework. However, we think it wo [...]
-
-<p>Today, UW Allen school and AWS AI team, together with other
contributors, are excited to announce the release of NNVM compiler, an open
deep learning compiler to compile front-end framework workloads directly to
hardware backends. We build it using the two-level intermediate
representation(IR) in the TVM stack.
-The reader is welcome to refer to the <a
href="http://www.tvmlang.org/2017/08/17/tvm-release-announcement.html">original
TVM announcement</a> for more technical details about TVM stack. With
the help of TVM stack, NNVM compiler can:</p>
-
-<ul>
- <li>Represent and optimize the common deep learning workloads in high
level graph IR</li>
- <li>Transform the computation graph to minimize memory utilization,
optimize data layout and fuse computation patterns for different hardware
backends.</li>
- <li>Present an end to end compilation pipeline from front-end deep
learning frameworks to bare metal hardwares.</li>
-</ul>
-
-<p style="text-align: center"><img
src="/images/nnvm/nnvm_compiler_stack.png" alt="image"
width="612px" /></p>
-
-<p>The NNVM compiler can directly take models from deep learning
frameworks such as Apache MXNet.
-It also support model exchange formats such as ONNX and CoreML. ONNX support
enables NNVM to compile deep learning models from PyTorch, Caffe2 and CNTK.
-The CoreML frontend enables deployment of CoreML models to non-iOS
devices.</p>
-
-<p style="text-align: center"><img
src="/images/nnvm/nnvm_compiler_code.png" alt="image"
width="712px" /></p>
-
-<h2 id="separation-of-optimization-and-deployment">Separation
of Optimization and Deployment</h2>
-
-<p style="text-align: center"><img
src="/images/nnvm/nnvm_deploy.png" alt="image"
width="512px" /></p>
-
-<p>NNVM compiler applies graph level and tensor level optimizations and
jointly optimize them to get the best performance. We take a different approach
from existing deep learning frameworks, which packages the graph optimization
with the deployment runtime. NNVM compiler adopts the conventional wisdom from
compiler to separate the optimization from the actual deployment runtime. This
approach offers substantial optimization but still keeps the runtime
lightweight. The compiled mo [...]
-
-<h2 id="performance">Performance</h2>
-
-<p>NNVM compiler is still under active development, and we can expect
more improvements to come, but we have started to see promising results.
-We benchmarked its performance and compared it against Apache MXNet on two
typical hardware configurations: ARM CPU on Raspberry PI and Nvidia GPU on AWS.
Despite the radical architecture difference between these two chips, we can use
the same infrastructure and only need to change the schedule for each type of
hardware.</p>
-
-<h3 id="nvidia-gpu">Nvidia GPU</h3>
-
-<p>GPU benchmarks and schedules are contributed by Leyuan Wang
(AWS/UCDavis) and Yuwei Hu (TuSimple). We compared the NNVM compiler against
Apache MXNet with CUDA8 and cuDNN7 as the backend on Nvidia K80. This is a very
strong baseline, as Apache MXNet turns on auto-tuning to select the best kernel
from CuDNN. We also used the optimized depthwise kernel in MXNet to optimize
MobileNet workload.</p>
-
-<p style="text-align: center"><img
src="/images/nnvm/nnvm_k80_result.png" alt="image"
width="400px" /></p>
-
-<p>As can be seen, NNVM compiler generate code that outperforms Apache
MXNet on K80. These improvements are due to the joint graph level and kernel
level optimizations. It is worth noting that NNVM compiler generates all the
optimized GPU kernels on its own without relying on external libraries like
CuDNN.</p>
-
-<h3 id="raspberry-pi-3b">Raspberry Pi 3b</h3>
-
-<p>The Rasberry Pi compilation stack is contributed by Ziheng
Jiang(AWS/FDU).
-We compared NNVM compiler against Apache MXNet with OpenBLAS and NNPack.
-We explored the setups to get the best performance out of MXNet: we turned on
Winograd convolution in the NNPACK for 3x3 convolutions, enabled
multi-threading and disabled the additional scheduler thread (so all threads
are used by NNPack).</p>
-
-<p style="text-align: center"><img
src="/images/nnvm/nnvm_rasp_result.png" alt="image"
width="400px" /></p>
-
-<p>As can be seen, the code generated by NNVM compiler is two times
faster on ResNet18.
-The gap on MobileNet is mainly due to lack of depthwise convolution in
existing CPU DNN libraries. NNVM compiler takes benefit of direct generating
efficient ARM code directly.</p>
-
-<h2 id="acknowledgement">Acknowledgement</h2>
-<p>This project wouldn’t become possible without our early contributors
in the DMLC community.
-We would like to specially thank Yuwei Hu(TuSimple), Leyuan Wang(AWS/UCDavis),
Joshua Z. Zhang(AWS)
-and Xingjian Shi(HKUST) for their early contributions to the project. We would
also like to thank all the contributors
-to the TVM stack.</p>
-
-<p>We also learnt a lot from the following projects when building NNVM
Compiler.</p>
-<ul>
- <li><a
href="https://github.com/Theano/Theano">Theano</a>: possibly
the earliest compiler for deep learning</li>
- <li><a
href="https://github.com/halide/Halide">Halide</a>: TVM uses
<a href="https://github.com/dmlc/HalideIR">HalideIR</a>
as data structure for
-arithematic simplification and low level lowering. HalideIR is derived from
Halide.
-We also learns from Halide when implementing the lowering pipeline in
TVM.</li>
- <li><a
href="https://github.com/inducer/loopy">Loopy</a>: use of
integer set analysis and its loop transformation primitives.</li>
-</ul>
-
-<h2 id="links">Links</h2>
-<ul>
- <li>Github page of NNVM Compiler: <a
href="https://github.com/dmlc/nnvm">https://github.com/dmlc/nnvm</a></li>
- <li>Github page of TVM: <a
href="https://github.com/dmlc/tvm">https://github.com/dmlc/tvm</a></li>
- <li><a
href="https://news.cs.washington.edu/2017/10/06/allen-school-and-aws-team-up-on-new-nnvm-compiler-for-deep-learning-frameworks/">UW
Allen school blog about NNVM compiler</a></li>
- <li><a
href="https://aws.amazon.com/blogs/ai/introducing-nnvm-compiler-a-new-open-end-to-end-compiler-for-ai-frameworks/">AWS
blogpost about NNVM compiler</a></li>
-</ul>
-</description>
-
<link>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</link>
-
<guid>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</guid>
- <pubDate>Fri, 06 Oct 2017 11:30:00 -0400</pubDate>
+ <pubDate>Mon, 30 Oct 2017 00:00:00 -0700</pubDate>
</item>
diff --git a/sitemap.txt b/sitemap.txt
index bfad106..db8795d 100644
--- a/sitemap.txt
+++ b/sitemap.txt
@@ -16,6 +16,7 @@ https://tvm.apache.org/vta
https://tvm.apache.org/feed.xml
https://tvm.apache.org/css/custom.css.map
+https://tvm.apache.org/2021/03/03/intro-auto-scheduler
https://tvm.apache.org/2020/09/26/bring-your-own-datatypes
https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm
https://tvm.apache.org/2020/07/14/bert-pytorch-tvm