This is an automated email from the ASF dual-hosted git repository.

lmzheng pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 8828684  Build at Wed Mar  3 01:20:50 PST 2021
8828684 is described below

commit 88286848906663319587f02c11f37dd2fe696f30
Author: Lianmin Zheng <[email protected]>
AuthorDate: Wed Mar 3 01:20:50 2021 -0800

    Build at Wed Mar  3 01:20:50 PST 2021
---
 2017/08/17/tvm-release-announcement.html           |   2 +-
 ...s-with-TVM-A-Depthwise-Convolution-Example.html |   2 +-
 2017/10/06/nnvm-compiler-announcement.html         |   2 +-
 ...s-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html |   2 +-
 2017/11/08/android-rpc-introduction.html           |   2 +-
 2018/01/16/opt-mali-gpu.html                       |   2 +-
 2018/03/12/webgl.html                              |   2 +-
 2018/03/23/nmt-transformer-optimize.html           |   2 +-
 2018/07/12/vta-release-announcement.html           |   2 +-
 2018/08/10/DLPack-Bridge.html                      |   2 +-
 2018/10/03/auto-opt-all.html                       |   2 +-
 2018/10/09/ml-in-tees.html                         |   2 +-
 2018/12/18/lowprecision-conv.html                  |   2 +-
 2019/01/19/Golang.html                             |   2 +-
 2019/03/18/tvm-apache-announcement.html            |   2 +-
 2019/04/29/opt-cuda-quantized.html                 |   2 +-
 2019/05/30/pytorch-frontend.html                   |   2 +-
 ...machine-learning-to-webassembly-and-webgpu.html |   2 +-
 2020/06/04/tinyml-how-tvm-is-taming-tiny.html      |   2 +-
 2020/07/14/bert-pytorch-tvm.html                   |   2 +-
 .../15/how-to-bring-your-own-codegen-to-tvm.html   |   2 +-
 2020/09/26/bring-your-own-datatypes.html           |   2 +-
 2021/03/03/intro-auto-scheduler.html               | 321 +++++++++++++++++++++
 atom.xml                                           | 253 +++++++++-------
 blog.html                                          |  10 +
 community.html                                     |   4 +
 feed.xml                                           | 291 +++++++++----------
 images/community/sjtu.png                          | Bin 0 -> 236508 bytes
 images/intro-auto-scheduler/code_perf.png          | Bin 0 -> 36724 bytes
 images/intro-auto-scheduler/search_overview.png    | Bin 0 -> 433415 bytes
 images/intro-auto-scheduler/search_time.png        | Bin 0 -> 45583 bytes
 images/intro-auto-scheduler/workflow.png           | Bin 0 -> 1014076 bytes
 rss.xml                                            | 255 +++++++++-------
 sitemap.txt                                        |   1 +
 34 files changed, 789 insertions(+), 390 deletions(-)

diff --git a/2017/08/17/tvm-release-announcement.html 
b/2017/08/17/tvm-release-announcement.html
index ea95cf0..dbd65e1 100644
--- a/2017/08/17/tvm-release-announcement.html
+++ b/2017/08/17/tvm-release-announcement.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>TVM: An End to End IR Stack for Deploying Deep Learning Workloads on 
Hardware Platforms </h1>
       <p class="post-meta">
-        <time datetime="2017-08-17T15:00:00-04:00" itemprop="datePublished">
+        <time datetime="2017-08-17T12:00:00-07:00" itemprop="datePublished">
           Aug 17, 2017
         </time>
         
diff --git 
a/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
 
b/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
index 96b2e16..13a15a3 100644
--- 
a/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
+++ 
b/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Optimize Deep Learning GPU Operators with TVM: A Depthwise 
Convolution Example </h1>
       <p class="post-meta">
-        <time datetime="2017-08-22T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2017-08-22T00:00:00-07:00" itemprop="datePublished">
           Aug 22, 2017
         </time>
         
diff --git a/2017/10/06/nnvm-compiler-announcement.html 
b/2017/10/06/nnvm-compiler-announcement.html
index 40557e0..b627ca6 100644
--- a/2017/10/06/nnvm-compiler-announcement.html
+++ b/2017/10/06/nnvm-compiler-announcement.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>NNVM Compiler: Open Compiler for AI Frameworks </h1>
       <p class="post-meta">
-        <time datetime="2017-10-06T11:30:00-04:00" itemprop="datePublished">
+        <time datetime="2017-10-06T08:30:00-07:00" itemprop="datePublished">
           Oct 6, 2017
         </time>
         
diff --git 
a/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html 
b/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
index 06f20bd..e6a6c2f 100644
--- a/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
+++ b/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Bringing AMDGPUs to TVM Stack and NNVM Compiler with ROCm </h1>
       <p class="post-meta">
-        <time datetime="2017-10-30T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2017-10-30T00:00:00-07:00" itemprop="datePublished">
           Oct 30, 2017
         </time>
         
diff --git a/2017/11/08/android-rpc-introduction.html 
b/2017/11/08/android-rpc-introduction.html
index 7d15d82..f7e34b5 100644
--- a/2017/11/08/android-rpc-introduction.html
+++ b/2017/11/08/android-rpc-introduction.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Remote Profile and Test Deep Learning Cross Compilation on Mobile 
Phones with TVM RPC </h1>
       <p class="post-meta">
-        <time datetime="2017-11-08T00:00:00-05:00" itemprop="datePublished">
+        <time datetime="2017-11-08T00:00:00-08:00" itemprop="datePublished">
           Nov 8, 2017
         </time>
         
diff --git a/2018/01/16/opt-mali-gpu.html b/2018/01/16/opt-mali-gpu.html
index a039779..40fc7f0 100644
--- a/2018/01/16/opt-mali-gpu.html
+++ b/2018/01/16/opt-mali-gpu.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Optimizing Mobile Deep Learning on ARM GPU with TVM </h1>
       <p class="post-meta">
-        <time datetime="2018-01-16T00:00:00-05:00" itemprop="datePublished">
+        <time datetime="2018-01-16T00:00:00-08:00" itemprop="datePublished">
           Jan 16, 2018
         </time>
         
diff --git a/2018/03/12/webgl.html b/2018/03/12/webgl.html
index 792c922..74313b5 100644
--- a/2018/03/12/webgl.html
+++ b/2018/03/12/webgl.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Compiling Deep Learning Models to WebGL with TVM </h1>
       <p class="post-meta">
-        <time datetime="2018-03-12T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-03-12T00:00:00-07:00" itemprop="datePublished">
           Mar 12, 2018
         </time>
         
diff --git a/2018/03/23/nmt-transformer-optimize.html 
b/2018/03/23/nmt-transformer-optimize.html
index 2182327..35c211a 100644
--- a/2018/03/23/nmt-transformer-optimize.html
+++ b/2018/03/23/nmt-transformer-optimize.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Bringing TVM into TensorFlow for Optimizing Neural Machine 
Translation on GPU </h1>
       <p class="post-meta">
-        <time datetime="2018-03-23T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-03-23T00:00:00-07:00" itemprop="datePublished">
           Mar 23, 2018
         </time>
         
diff --git a/2018/07/12/vta-release-announcement.html 
b/2018/07/12/vta-release-announcement.html
index c60a3e1..1250749 100644
--- a/2018/07/12/vta-release-announcement.html
+++ b/2018/07/12/vta-release-announcement.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>VTA: An Open, Customizable Deep Learning Acceleration Stack  </h1>
       <p class="post-meta">
-        <time datetime="2018-07-12T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-07-12T00:00:00-07:00" itemprop="datePublished">
           Jul 12, 2018
         </time>
         
diff --git a/2018/08/10/DLPack-Bridge.html b/2018/08/10/DLPack-Bridge.html
index 7ec1aaa..af4d193 100644
--- a/2018/08/10/DLPack-Bridge.html
+++ b/2018/08/10/DLPack-Bridge.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Building a Cross-Framework Deep Learning Compiler via DLPack </h1>
       <p class="post-meta">
-        <time datetime="2018-08-10T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-08-10T00:00:00-07:00" itemprop="datePublished">
           Aug 10, 2018
         </time>
         
diff --git a/2018/10/03/auto-opt-all.html b/2018/10/03/auto-opt-all.html
index 98269c7..ac36190 100644
--- a/2018/10/03/auto-opt-all.html
+++ b/2018/10/03/auto-opt-all.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Automatic Kernel Optimization for Deep Learning on All Hardware 
Platforms </h1>
       <p class="post-meta">
-        <time datetime="2018-10-03T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-10-03T00:00:00-07:00" itemprop="datePublished">
           Oct 3, 2018
         </time>
         
diff --git a/2018/10/09/ml-in-tees.html b/2018/10/09/ml-in-tees.html
index 992e1a3..0f59a69 100644
--- a/2018/10/09/ml-in-tees.html
+++ b/2018/10/09/ml-in-tees.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Efficient Privacy-Preserving ML Using TVM </h1>
       <p class="post-meta">
-        <time datetime="2018-10-09T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-10-09T00:00:00-07:00" itemprop="datePublished">
           Oct 9, 2018
         </time>
         
diff --git a/2018/12/18/lowprecision-conv.html 
b/2018/12/18/lowprecision-conv.html
index c5def47..f32251d 100644
--- a/2018/12/18/lowprecision-conv.html
+++ b/2018/12/18/lowprecision-conv.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Automating Generation of Low Precision Deep Learning Operators </h1>
       <p class="post-meta">
-        <time datetime="2018-12-18T00:00:00-05:00" itemprop="datePublished">
+        <time datetime="2018-12-18T00:00:00-08:00" itemprop="datePublished">
           Dec 18, 2018
         </time>
         
diff --git a/2019/01/19/Golang.html b/2019/01/19/Golang.html
index 27a39f0..6b8b94a 100644
--- a/2019/01/19/Golang.html
+++ b/2019/01/19/Golang.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>TVM Golang Runtime for Deep Learning Deployment </h1>
       <p class="post-meta">
-        <time datetime="2019-01-19T00:00:00-05:00" itemprop="datePublished">
+        <time datetime="2019-01-19T00:00:00-08:00" itemprop="datePublished">
           Jan 19, 2019
         </time>
         
diff --git a/2019/03/18/tvm-apache-announcement.html 
b/2019/03/18/tvm-apache-announcement.html
index 386de84..19b5017 100644
--- a/2019/03/18/tvm-apache-announcement.html
+++ b/2019/03/18/tvm-apache-announcement.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>TVM Deep Learning Compiler Joins Apache Software Foundation </h1>
       <p class="post-meta">
-        <time datetime="2019-03-18T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2019-03-18T00:00:00-07:00" itemprop="datePublished">
           Mar 18, 2019
         </time>
         
diff --git a/2019/04/29/opt-cuda-quantized.html 
b/2019/04/29/opt-cuda-quantized.html
index 3b401af..1c55a9a 100644
--- a/2019/04/29/opt-cuda-quantized.html
+++ b/2019/04/29/opt-cuda-quantized.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Automating Optimization of Quantized Deep Learning Models on CUDA 
</h1>
       <p class="post-meta">
-        <time datetime="2019-04-29T12:00:00-04:00" itemprop="datePublished">
+        <time datetime="2019-04-29T09:00:00-07:00" itemprop="datePublished">
           Apr 29, 2019
         </time>
         
diff --git a/2019/05/30/pytorch-frontend.html b/2019/05/30/pytorch-frontend.html
index ad8281b..a4dd9a3 100644
--- a/2019/05/30/pytorch-frontend.html
+++ b/2019/05/30/pytorch-frontend.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Integrating TVM into PyTorch </h1>
       <p class="post-meta">
-        <time datetime="2019-05-30T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2019-05-30T00:00:00-07:00" itemprop="datePublished">
           May 30, 2019
         </time>
         
diff --git 
a/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html 
b/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html
index 38bd956..50f01e7 100644
--- a/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html
+++ b/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Compiling Machine Learning to WASM and WebGPU with Apache TVM </h1>
       <p class="post-meta">
-        <time datetime="2020-05-14T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2020-05-14T00:00:00-07:00" itemprop="datePublished">
           May 14, 2020
         </time>
         
diff --git a/2020/06/04/tinyml-how-tvm-is-taming-tiny.html 
b/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
index bcb1aed..ec640c7 100644
--- a/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
+++ b/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>TinyML - How TVM is Taming Tiny </h1>
       <p class="post-meta">
-        <time datetime="2020-06-04T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2020-06-04T00:00:00-07:00" itemprop="datePublished">
           Jun 4, 2020
         </time>
         
diff --git a/2020/07/14/bert-pytorch-tvm.html b/2020/07/14/bert-pytorch-tvm.html
index a563504..387e219 100644
--- a/2020/07/14/bert-pytorch-tvm.html
+++ b/2020/07/14/bert-pytorch-tvm.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Bridging PyTorch and TVM </h1>
       <p class="post-meta">
-        <time datetime="2020-07-14T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2020-07-14T00:00:00-07:00" itemprop="datePublished">
           Jul 14, 2020
         </time>
         
diff --git a/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html 
b/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
index a2066ec..3d39e96 100644
--- a/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
+++ b/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>How to Bring Your Own Codegen to TVM </h1>
       <p class="post-meta">
-        <time datetime="2020-07-15T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2020-07-15T00:00:00-07:00" itemprop="datePublished">
           Jul 15, 2020
         </time>
         
diff --git a/2020/09/26/bring-your-own-datatypes.html 
b/2020/09/26/bring-your-own-datatypes.html
index 0dc4fb0..135d0db 100644
--- a/2020/09/26/bring-your-own-datatypes.html
+++ b/2020/09/26/bring-your-own-datatypes.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in 
TVM </h1>
       <p class="post-meta">
-        <time datetime="2020-09-26T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2020-09-26T00:00:00-07:00" itemprop="datePublished">
           Sep 26, 2020
         </time>
         
diff --git a/2021/03/03/intro-auto-scheduler.html 
b/2021/03/03/intro-auto-scheduler.html
new file mode 100644
index 0000000..e10a971
--- /dev/null
+++ b/2021/03/03/intro-auto-scheduler.html
@@ -0,0 +1,321 @@
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Introducing TVM Auto-scheduler (a.k.a. Ansor)</title>
+    <link rel="shortcut icon" href="/assets/images/favicon.ico">
+    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/css/bootstrap.min.css" integrity="sha384-MCw98/SFnGE8fJT3GXwEOngsV7Zt27NXFoaoApmYm81iuXoPkFOJwJ8ERdknLPMO" crossorigin="anonymous">
+    <link rel="stylesheet" href="/css/slick.css">
+    <link rel="stylesheet" href="/css/slick-theme.css">
+    <link rel="stylesheet" href="/css/custom.css">
+</head>
+<body>
+
+    
+<div class="bannerPage">
+      <header class="header">
+      <div class="container">
+        <div class="headerInner d-flex justify-content-between 
align-items-center">
+          <div class="headerLogo">
+            <a href="/"><img src="/assets/images/logo.svg" alt="Logo"></a>
+          </div>
+          <div id="headMenu" class="headerNav">
+            <button type="button" id="closeHeadMenu" class="navCloseBtn"><img 
src="/assets/images/close-icon.svg"
+                alt="Close"></button>
+                <ul class="nav">
+    
+    <li class="nav-item">
+        <a class="nav-link" href="/community">Community</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="/download">Download</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="/vta">VTA</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="/blog">Blog</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="https://tvm.apache.org/docs/">Docs</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="https://tvmconf.org/">Conference</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="https://github.com/apache/incubator-tvm/">Github</a>
+    </li>
+    
+</ul>
+            <div class="responsiveasfdropdown">
+              <button type="button" class="btn-link">
+                ASF
+              </button>
+              <ul>
+    
+    <li>
+        <a href="https://www.apache.org/">Apache Homepage</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/licenses/">License</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/security/">Security</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/foundation/thanks.html">Thanks</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/events/current-event">Events</a>
+    </li>
+    
+</ul>
+            </div>
+          </div>
+          <div class="responsiveMenuIcon">
+            <button type="button" id="menuBtn" class="btn-menu"><img 
src="/assets/images/menu-icon.svg"
+                alt="Menu Icon" /></button>
+          </div>
+          <div class="asfDropdown">
+            <div class="dropdown">
+              <button type="button" class="btn-link dropdown-toggle" 
data-toggle="dropdown" aria-haspopup="true"
+                aria-expanded="false">
+                ASF
+              </button>
+              <div class="dropdown-menu dropdown-menu-right">
+                <ul>
+    
+    <li>
+        <a href="https://www.apache.org/">Apache Homepage</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/licenses/">License</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/security/">Security</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/foundation/thanks.html">Thanks</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/events/current-event">Events</a>
+    </li>
+    
+</ul>
+              </div>
+            </div>
+          </div>
+        </div>
+      </div>
+    </header>
+
+</div>
+
+
+<div class="container">
+<div class="content">
+  <div class="row">
+    <div class="span14 w-100">
+      <h1>Introducing TVM Auto-scheduler (a.k.a. Ansor) </h1>
+      <p class="post-meta">
+        <time datetime="2021-03-03T00:00:00-08:00" itemprop="datePublished">
+          Mar 3, 2021
+        </time>
+        
+        • <span itemprop="author" itemscope itemtype="http://schema.org/Person">
+          <span itemprop="name">Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao 
Wu, Cody Hao Yu</span>
+        </span>
+        
+      </p>
+      <p class="post-meta">
+        </p>
+    </br>
+    <p>Optimizing the execution speed of deep neural networks is extremely 
hard with the growing
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and 
layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described 
by mathematical expressions.
+However, providing high-performance implementations for them on modern 
hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware 
intrinsics to achieve high performance.
+It takes a huge engineering effort to build linear algebra and neural network
+acceleration libraries such as cuBLAS, cuDNN, oneMKL, and oneDNN.</p>
+
+<p>Our lives would be much easier if we could just write mathematical expressions
+and have something magically turn them into efficient code implementations.
+Three years ago, the deep learning compiler TVM and its search module AutoTVM
+were built as a first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient
+implementations for a given tensor computation.
+Because the approach is template-based, it still requires domain experts to
+implement a non-trivial manual template for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM
+code repository.
+Besides being very hard to develop, these templates often define inefficient
+and limited search spaces, making them unable to achieve optimal performance.</p>
+
+<p>To address the limitations of AutoTVM, we started the Ansor project, aiming at a
+fully automated auto-scheduler for generating code for tensor computations.
+The Ansor auto-scheduler takes only tensor expressions as input and generates
+high-performance code without manual templates.
+We made innovations in both the search space construction and the search algorithm.
+As a result, the auto-scheduler achieves better performance with less
+search time in a more automated way.</p>
+
+<p>Ansor auto-scheduler is now integrated into Apache TVM as the <code class="language-plaintext highlighter-rouge">tvm.auto_scheduler</code> package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS, and OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and
+Mali GPUs on the TVM website [1].
+In this blog post, we give a high-level introduction and show some
+benchmark results.</p>
+
+<h1 id="system-overview">System Overview</h1>
+
+<h2 id="autotvm-vs-auto-scheduler">AutoTVM vs Auto-scheduler</h2>
+<p style="text-align: center"><img 
src="/images/intro-auto-scheduler/workflow.png" alt="image" width="75%" /></p>
+<center> Table 1. Workflow Comparison </center>
+<p></p>
+
+<p>Table 1 compares the workflow for generating code for an operator in 
AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor 
expression language.
+This part is relatively easy because TVM’s tensor expression language looks 
just like math expressions.
+In step 2, the developer has to write a schedule template, which typically
+consists of 20-100 lines of tricky DSL code.
+This step requires domain expertise in both the target hardware architecture
+and the operator semantics, so it is difficult.
+The last step, step 3, is automated by a search algorithm.</p>
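Why step 1 is "relatively easy" can be seen in a toy sketch. The snippet below is not TVM's actual `te` API; it is a minimal, self-contained stand-in showing that a compute definition is just an output shape plus an index expression with an explicit reduction axis, i.e. essentially the math itself:

```python
# Toy stand-in for a tensor-expression compute definition (NOT TVM's te API):
# a compute definition is an output shape plus an index rule with a
# reduction axis, which reads almost exactly like the math.
def matmul_compute(N, L, M):
    """C[i, j] = sum_k A[i, k] * B[k, j], expressed as shape + index rule."""
    shape = (N, M)
    reduce_axis = range(L)
    expr = lambda A, B, i, j: sum(A[i][k] * B[k][j] for k in reduce_axis)
    return shape, expr

def evaluate(compute, A, B):
    """Naively materialize the compute definition (no scheduling at all)."""
    shape, expr = compute
    return [[expr(A, B, i, j) for j in range(shape[1])] for i in range(shape[0])]

C = evaluate(matmul_compute(2, 2, 2), [[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)  # [[19, 22], [43, 50]]
```

Steps 2 and 3 then decide *how* this definition is executed (tiling, vectorization, thread binding), which is where all the difficulty lies.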
+
+<p>In auto-scheduler, we eliminate the most difficult step 2 with automatic
+search space construction and accelerate step 3 with a better search algorithm.
+Automatic search space construction not only eliminates a huge amount of manual
+effort but also enables the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules
+to generate the search space.
+However, these rules are very general: they are based on static analysis of
+the tensor expressions.
+We only need to design a few general rules once, and we can then apply them to
+almost all tensor computations in deep learning.</p>
+
+<h2 id="search-process">Search Process</h2>
+<p style="text-align: center"><img 
src="/images/intro-auto-scheduler/search_overview.png" alt="image" width="40%" 
/></p>
+<center> Figure 1. Search Process Overview  </center>
+<p></p>
+
+<p>Figure 1 shows the search process of auto-scheduler when optimizing a
+whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator
+fusion pass.
+A task scheduler allocates the time budget across the many subgraphs to be optimized.
+At each iteration, it picks the subgraph with the most potential to improve
+end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several
+sketches for it.
+We then run an evolutionary search guided by a learned cost model to get a
+batch of optimized programs.
+The optimized programs are sent to actual hardware for measurement.
+When the measurements finish, the profiling results are fed back to update all
+components of the system.
+This process repeats until the optimization converges or the time budget runs out.
+More technical details can be found in our paper [3] and our code.</p>
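The control flow of Figure 1 can be sketched in a few lines of illustrative Python. Everything here is mocked (subgraph names, latencies, and the "search" itself are invented for illustration); only the loop structure mirrors the real system:

```python
import random
random.seed(0)  # make the toy run deterministic

# Toy model of the Figure 1 loop: a task scheduler repeatedly picks the
# subgraph with the largest estimated end-to-end gain, runs a (mocked)
# evolutionary search, "measures" the candidate, and keeps the best result.
subgraphs = {"conv_block": 10.0, "dense_block": 6.0}  # current latency (ms)
weight = {"conv_block": 3, "dense_block": 1}          # occurrences in the model

def estimated_gain(name):
    # potential end-to-end gain if this subgraph improved by 10%
    return weight[name] * subgraphs[name] * 0.1

def evolutionary_search(name):
    # mock search + hardware measurement: a candidate latency near the current one
    return subgraphs[name] * random.uniform(0.85, 1.05)

for _ in range(20):                                   # fixed time budget
    task = max(subgraphs, key=estimated_gain)         # task scheduler picks
    measured = evolutionary_search(task)              # search + measurement
    subgraphs[task] = min(subgraphs[task], measured)  # feedback: keep the best

total = sum(weight[n] * subgraphs[n] for n in subgraphs)
print(total <= 36.0)  # True: we only ever keep improvements over 3*10 + 1*6 = 36 ms
```

In the real system, the feedback step also retrains the cost model, so later iterations search more accurately as well as in more promising subgraphs.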
+
+<p>It is worth noting that since the auto-scheduler generates schedules from
+scratch, it reuses the existing computation definitions in TOPI but not the
+schedule templates.</p>
+
+<h1 id="benchmark-results">Benchmark Results</h1>
+<p>In this section, we benchmark the performance of AutoTVM and auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped with
+an 18-core Intel Skylake 8124-M CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge instance, which is equipped
+with an NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo [2].</p>
+
+<h2 id="performance-of-the-generated-code">Performance of the generated 
code</h2>
+<p>We benchmark the fp32 single-batch inference latency on three networks.
+Figure 2 shows the relative speedup of auto-scheduler over AutoTVM.
+Auto-scheduler outperforms AutoTVM in all cases, with speedups ranging from
+1.02x to 8.95x.
+This is because auto-scheduler explores a larger search space, which covers
+more efficient combinations of optimizations that the manual TOPI templates miss.
+BERT-base@GPU is an extreme case where the manual templates are badly designed:
+the manual template for dense layers does not perform well for the shapes in
+the BERT model.</p>
+
+<p style="text-align: center"><img 
src="/images/intro-auto-scheduler/code_perf.png" alt="image" width="85%" /></p>
+<center> Figure 2. Code Performance Comparison (Higher is better) </center>
+<p></p>
+
+<h2 id="search-time">Search Time</h2>
+<p>Search-based approaches can be very time-consuming, so we also care about
+the search time.
+It typically takes several hours for the search to converge for a single
+neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its
+larger search space.
+This is mainly because auto-scheduler has a better cost model and task
+scheduler.</p>
+
+<p style="text-align: center"><img 
src="/images/intro-auto-scheduler/search_time.png" alt="image" width="85%" 
/></p>
+<center> Figure 3. Search Time Comparison (Lower is better) </center>
+<p></p>
+
+<h2 id="more-results">More Results</h2>
+<p>The repo above serves as an internal benchmark tool for TVM, so it only
+compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, this blog post [4] also tried auto-scheduler on an Apple M1 chip and
+got some good results.</p>
+
+<h1 id="conclusion">Conclusion</h1>
+<p>We built TVM auto-scheduler, a system that automatically generates
+high-performance code for tensor expressions.
+Compared with its predecessor AutoTVM, auto-scheduler does not require manual
+templates.
+Moreover, auto-scheduler generates schedules with better performance in a
+shorter search time.
+We achieved this by making innovations in the search space construction and
+the search algorithm.</p>
+
+<p>We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending auto-scheduler to better support
+sparse operators, low-precision operators, and dynamic shapes.</p>
+
+<h1 id="links">Links</h1>
+<p>[1] Tutorials: <a href="https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling">https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling</a><br />
+[2] Benchmark repo: <a href="https://github.com/tlc-pack/TLCBench">https://github.com/tlc-pack/TLCBench</a><br />
+[3] OSDI paper: <a href="https://arxiv.org/abs/2006.06762">Ansor: Generating High-Performance Tensor Programs for Deep Learning</a><br />
+[4] Results on the Apple M1 chip: <a href="https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d">https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d</a>.</p>
+
+
+    </div>
+  </div>
+</div>
+</div>
+
+    
+
+
+
+
+  <script src="https://code.jquery.com/jquery-2.2.0.min.js" type="text/javascript"></script>
+  <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.3/umd/popper.min.js" integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49" crossorigin="anonymous"></script>
+  <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/js/bootstrap.min.js" integrity="sha384-ChfqqxuZUCnJSK3+MXmPNIyE6ZbWh2IMqE241rYiqJxyMiZ6OW/JmZQ5stwEULTy" crossorigin="anonymous"></script>
+  <!-- <script src="./assets/js/slick.js"></script> -->
+  <script src="/assets/js/custome.js"></script>
+  <script async src="https://www.googletagmanager.com/gtag/js?id=UA-75982049-2"></script>
+  <script>
+    window.dataLayer = window.dataLayer || [];
+    function gtag(){dataLayer.push(arguments);}
+    gtag('js', new Date());
+    gtag('config', 'UA-75982049-2');
+  </script>
+</body>
+<section class="footerSec">
+  <div class="footerHeader">
+    <ul class="container d-flex align-md-items-center justify-content-between 
flex-column flex-md-row">
+      <li class="logo">
+
+        <p><a href="/"><img src="/assets/images/logo.svg" alt="logo" 
title="logo" /></a></p>
+      </li>
+      <li class="copywrite d-flex align-items-center">
+        <h5 id="apache-software-foundation--all-right-reserved">© 2020 Apache Software Foundation | All rights reserved</h5>
+      </li>
+    </ul>
+
+  </div>
+
+  <ul class="container">
+    <li class="footernote">
+      Copyright © 2020 The Apache Software Foundation. Apache TVM, Apache, the 
Apache feather, and the Apache TVM project logo are either trademarks or 
registered trademarks of the Apache Software Foundation.</li>
+  </ul>
+
+</section>
+</html>
diff --git a/atom.xml b/atom.xml
index 84cd5f0..cb57f8a 100644
--- a/atom.xml
+++ b/atom.xml
@@ -4,7 +4,7 @@
  <title>TVM</title>
 <link href="https://tvm.apache.org" rel="self"/>
  <link href="https://tvm.apache.org"/>
- <updated>2021-01-04T16:22:52-05:00</updated>
+ <updated>2021-03-03T01:20:46-08:00</updated>
  <id>https://tvm.apache.org</id>
  <author>
    <name></name>
@@ -13,9 +13,139 @@
 
  
  <entry>
+   <title>Introducing TVM Auto-scheduler (a.k.a. Ansor)</title>
+   <link href="https://tvm.apache.org/2021/03/03/intro-auto-scheduler"/>
+   <updated>2021-03-03T00:00:00-08:00</updated>
+   <id>https://tvm.apache.org/2021/03/03/intro-auto-scheduler</id>
+   <content type="html">&lt;p&gt;Optimizing the execution speed of deep neural 
networks is extremely hard with the growing
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and 
layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described 
by mathematical expressions.
+However, providing high-performance implementations for them on modern 
hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware 
intrinsics to achieve high performance.
+It takes huge engineering effort to build linear algebra and neural network 
acceleration libraries like CuBLAS, CuDNN, oneMKL, and oneDNN.&lt;/p&gt;
+
+&lt;p&gt;Our life will be much easier if we can just write mathematical 
expressions and have something
+magically turn them into efficient code implementations.
+Three years ago, the deep learning compiler TVM and its search module AutoTVM were 
built as a first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient 
implementations for a given tensor computation.
+However, being template-based, it still requires domain experts 
to implement a non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM 
code repository.
+Besides being very hard to develop, these templates often have inefficient and 
limited search spaces,
+making them unable to achieve optimal performance.&lt;/p&gt;
+
+&lt;p&gt;To address the limitations of AutoTVM, we started project Ansor, 
aiming to build a fully automated auto-scheduler for 
+generating code for tensor computations.
+The Ansor auto-scheduler takes only tensor expressions as input and generates 
high-performance code without manual templates.
+We made innovations in the search space construction and search algorithm.
+As a result, the auto-scheduler can achieve better performance with less 
search time in a more automated way.&lt;/p&gt;
+
+&lt;p&gt;The Ansor auto-scheduler is now integrated into Apache TVM as the &lt;code 
class=&quot;language-plaintext 
highlighter-rouge&quot;&gt;tvm.auto_scheduler&lt;/code&gt; package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS, and 
OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and 
Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some 
benchmark results.&lt;/p&gt;
+
+&lt;h1 id=&quot;system-overview&quot;&gt;System Overview&lt;/h1&gt;
+
+&lt;h2 id=&quot;autotvm-vs-auto-scheduler&quot;&gt;AutoTVM vs 
Auto-scheduler&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/workflow.png&quot; alt=&quot;image&quot; 
width=&quot;75%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Table 1. Workflow Comparison &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Table 1 compares the workflow for generating code for an operator in 
AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor 
expression language.
+This part is relatively easy because TVM’s tensor expression language looks 
just like math expressions.
+In step 2, the developer has to write a schedule template, which typically 
consists of 20-100 lines of tricky DSL code.
+This part requires domain expertise in both the target hardware architecture 
and operator semantics, so it is difficult.
+The last step, step 3, is automated by a search algorithm.&lt;/p&gt;
+
+&lt;p&gt;In auto-scheduler, we eliminate the most difficult step 2 by 
automatic search space construction and accelerate step 3 with a better search 
algorithm.
+By doing automatic search space construction, we not only eliminate huge 
manual effort, 
+but also enabling the exploration of much more optimization combinations.
+This automation does not come for free, because we still need to design rules 
to generate the search space.
+However, these rules are very general. They are based on static analysis of 
the tensor expressions.
+We only need to design a few general rules once and can apply them to almost 
all tensor computations in deep learning.&lt;/p&gt;
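To give a flavor of what such rules look like, here is a deliberately simplified toy in plain Python with invented names (not TVM's actual rule engine): each rule statically inspects a description of a tensor computation and, when its condition matches, contributes steps to the schedule sketch, much as Ansor's real rules (multi-level tiling for reduction-heavy ops, inlining for simple element-wise ops) operate on tensor expressions.

```python
# Hypothetical toy rules: each maps a static description of a tensor
# computation to schedule-sketch steps, or None when it does not apply.

def rule_multi_level_tiling(desc):
    # Compute-intensive ops (those with a reduction axis) get tiled.
    if desc["has_reduction"]:
        return ["split_i", "split_j", "split_k", "reorder"]
    return None

def rule_inline_elementwise(desc):
    # Simple element-wise ops are inlined into their consumers.
    if desc["is_elementwise"]:
        return ["compute_inline"]
    return None

RULES = [rule_multi_level_tiling, rule_inline_elementwise]

def generate_sketch(desc):
    """Apply every matching rule to build a schedule sketch."""
    steps = []
    for rule in RULES:
        matched = rule(desc)
        if matched:
            steps.extend(matched)
    return steps

matmul_desc = {"has_reduction": True, "is_elementwise": False}
relu_desc = {"has_reduction": False, "is_elementwise": True}
```

Because the rules key off static properties of the expression rather than the operator's name, the same small rule set applies to matmul, conv2d, and almost any other tensor computation.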
+
+&lt;h2 id=&quot;search-process&quot;&gt;Search Process&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/search_overview.png&quot; 
alt=&quot;image&quot; width=&quot;40%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 1. Search Process Overview  &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 1 shows the search process of auto-scheduler when optimizing 
a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator 
fusion pass.
+A task scheduler is used to allocate the tuning time budget across the many 
subgraphs.
+At each iteration, it picks a subgraph that has the most potential to increase 
the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several 
sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of 
optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback 
to update all components of the system.
+This process is repeated iteratively until the optimization converges or the 
time budget runs out.
+More technical details can be found in our paper [3] and our code.&lt;/p&gt;
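The feedback loop described above can be sketched in a few lines of plain Python. This is a conceptual toy with invented names, not TVM's implementation: a candidate is a single tile size, the "measurement" is a made-up cost function, and the learned cost model is a nearest-neighbor lookup over past measurements.

```python
import random

random.seed(0)

def measure(tile):
    # Stand-in for an expensive on-device measurement.
    return abs(tile - 37) + 1  # pretend tile size 37 is optimal

class CostModel:
    """Learned cost model, updated online from measured results."""
    def __init__(self):
        self.measured = {}  # tile -> measured cost
    def predict(self, tile):
        if not self.measured:
            return 0.0
        nearest = min(self.measured, key=lambda t: abs(t - tile))
        return self.measured[nearest]
    def update(self, tile, cost):
        self.measured[tile] = cost

def evolutionary_search(rounds=10, pop_size=8, measure_budget=4):
    model = CostModel()
    population = [random.randint(1, 128) for _ in range(pop_size)]
    best = None
    for _ in range(rounds):
        # Mutate existing candidates to enlarge the pool.
        pool = population + [max(1, t + random.randint(-8, 8)) for t in population]
        # Rank cheaply with the cost model; measure only the most promising.
        pool.sort(key=model.predict)
        for tile in pool[:measure_budget]:
            cost = measure(tile)
            model.update(tile, cost)  # feed the result back into the model
            if best is None or cost < best[1]:
                best = (tile, cost)
        population = pool[:pop_size]
    return best

best_tile, best_cost = evolutionary_search()
```

The key property illustrated here is that only a few candidates per round are measured on real hardware; the cost model filters the rest, which is what keeps the search tractable over a huge space.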
+
+&lt;p&gt;It is worth noting that since the auto-scheduler generates schedules 
from scratch, 
+it reuses the existing computation definitions in TOPI but not schedule 
templates.&lt;/p&gt;
+
+&lt;h1 id=&quot;benchmark-results&quot;&gt;Benchmark Results&lt;/h1&gt;
+&lt;p&gt;In this section, we benchmark the performance of AutoTVM and 
Auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge, which is equipped with an 
Intel 18-core Skylake 8124M CPU. 
+The GPU benchmark is done on an AWS g4dn.4xlarge, which is equipped with an 
NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo 
[2].&lt;/p&gt;
+
+&lt;h2 id=&quot;performance-of-the-generated-code&quot;&gt;Performance of the 
generated code&lt;/h2&gt;
+&lt;p&gt;We benchmark the fp32 single-batch inference latency on three 
networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see that auto-scheduler outperforms AutoTVM in all cases with a 1.02x to 8.95x 
speedup.
+This is because auto-scheduler explores a larger search space, which covers 
efficient combinations
+of optimizations that are missed by the manual templates in TOPI.
+The BERT-base@GPU is an extreme case where the manual templates are very badly 
designed.
+In other words, the manual template for dense layers does not perform well for 
the shapes in the BERT model.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/code_perf.png&quot; 
alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 2. Code Performance Comparison (Higher is better) 
&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;search-time&quot;&gt;Search Time&lt;/h2&gt;
+&lt;p&gt;Search-based approaches can be very time-consuming, 
so we also care about the search time.
+It typically takes several hours for the search to converge on a single 
neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its 
larger search space.
+This is mainly because auto-scheduler has a better cost model and task 
scheduler.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/search_time.png&quot; 
alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 3. Search Time Comparison (Lower is better) 
&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;more-results&quot;&gt;More Results&lt;/h2&gt;
+&lt;p&gt;The repo above serves as an internal benchmark tool for TVM, so it 
only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, this blog post [4] also tried auto-scheduler on an Apple M1 chip and 
got some good results.&lt;/p&gt;
+
+&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
+&lt;p&gt;We built TVM auto-scheduler, a system that automatically generates 
high-performance code for tensor expressions.
+Compared with its predecessor AutoTVM, auto-scheduler does not require manual 
templates.
+Moreover, auto-scheduler is capable of generating schedules with better 
performance in a shorter time.
+We achieve this by making innovations in the search space construction and 
search algorithm.&lt;/p&gt;
+
+&lt;p&gt;We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending auto-scheduler to better 
support
+sparse operators, low-precision operators, and dynamic shapes.&lt;/p&gt;
+
+&lt;h1 id=&quot;links&quot;&gt;Links&lt;/h1&gt;
+&lt;p&gt;[1] Tutorials: &lt;a 
href=&quot;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&quot;&gt;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&lt;/a&gt;&lt;br
 /&gt;
+[2] Benchmark repo: &lt;a 
href=&quot;https://github.com/tlc-pack/TLCBench&quot;&gt;https://github.com/tlc-pack/TLCBench&lt;/a&gt;&lt;br
 /&gt;
+[3] OSDI Paper: &lt;a 
href=&quot;https://arxiv.org/abs/2006.06762&quot;&gt;Ansor : Generating 
High-Performance Tensor Programs for Deep Learning&lt;/a&gt;&lt;br /&gt;
+[4] Results on Apple M1 chip: &lt;a 
href=&quot;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&quot;&gt;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&lt;/a&gt;.&lt;/p&gt;
+
+</content>
+ </entry>
+ 
+ <entry>
    <title>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in 
TVM</title>
    <link href="https://tvm.apache.org/2020/09/26/bring-your-own-datatypes"/>
-   <updated>2020-09-26T00:00:00-04:00</updated>
+   <updated>2020-09-26T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</id>
    <content type="html">&lt;p&gt;In this post, we describe the Bring Your Own 
Datatypes framework, which enables the use of custom datatypes within 
TVM.&lt;/p&gt;
 
@@ -308,7 +438,7 @@ For more documentation about the Bring Your Own Datatypes 
framework
  <entry>
    <title>How to Bring Your Own Codegen to TVM</title>
    <link 
href="https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm"/>
-   <updated>2020-07-15T00:00:00-04:00</updated>
+   <updated>2020-07-15T00:00:00-07:00</updated>
    
<id>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</id>
    <content type="html">&lt;p&gt;To free data scientists from worrying about 
the performance when developing a new model, hardware backend providers (e.g., 
Intel, NVIDIA, ARM, etc) either provide kernel libraries such as cuBLAS or 
cuDNN with many commonly used deep learning kernels, or provide frameworks such 
as DNNL or TensorRT with a graph engine to let users describe their models in a 
certain way to achieve high performance. In addition, emerging deep learning 
accelerators also have t [...]
 
@@ -787,7 +917,7 @@ Figure 4: After Graph Partitioning.
  <entry>
    <title>Bridging PyTorch and TVM</title>
    <link href="https://tvm.apache.org/2020/07/14/bert-pytorch-tvm"/>
-   <updated>2020-07-14T00:00:00-04:00</updated>
+   <updated>2020-07-14T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</id>
    <content type="html">
 &lt;p&gt;(A more code-heavy variant is crossposted on the more PyTorch affine 
&lt;a 
href=&quot;https://lernapparat.de/transformers-pytorch-tvm/&quot;&gt;Lernapparat&lt;/a&gt;,
@@ -1310,7 +1440,7 @@ He is a PyTorch core developer and co-authored &lt;a 
href=&quot;https://www.mann
  <entry>
    <title>TinyML - How TVM is Taming Tiny</title>
    <link 
href="https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny"/>
-   <updated>2020-06-04T00:00:00-04:00</updated>
+   <updated>2020-06-04T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</id>
    <content type="html">
 &lt;p&gt;&lt;img src=&quot;/images/microtvm/logo.png&quot; alt=&quot;microTVM 
logo&quot; width=&quot;30%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
@@ -1619,7 +1749,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix 
multiplication microkernel&lt;/
  <entry>
    <title>Compiling Machine Learning to WASM and WebGPU with Apache TVM</title>
    <link 
href="https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu"/>
-   <updated>2020-05-14T00:00:00-04:00</updated>
+   <updated>2020-05-14T00:00:00-07:00</updated>
    
<id>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</id>
    <content type="html">&lt;p&gt;&lt;strong&gt;TLDR&lt;/strong&gt;&lt;/p&gt;
 
@@ -1706,7 +1836,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix 
multiplication microkernel&lt;/
  <entry>
    <title>Integrating TVM into PyTorch</title>
    <link href="https://tvm.apache.org/2019/05/30/pytorch-frontend"/>
-   <updated>2019-05-30T00:00:00-04:00</updated>
+   <updated>2019-05-30T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2019/05/30/pytorch-frontend</id>
    <content type="html">&lt;p&gt;As TVM continuously demonstrates improvements 
to the efficiency of deep learning execution,
 it has become clear that PyTorch stands to benefit from directly leveraging 
the compiler stack.
@@ -1808,7 +1938,7 @@ relay_graph = torch_tvm.to_relay(mul, inputs)
  <entry>
    <title>Automating Optimization of Quantized Deep Learning Models on 
CUDA</title>
    <link href="https://tvm.apache.org/2019/04/29/opt-cuda-quantized"/>
-   <updated>2019-04-29T12:00:00-04:00</updated>
+   <updated>2019-04-29T09:00:00-07:00</updated>
    <id>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</id>
    <content type="html">&lt;p&gt;Deep learning has been successfully applied 
to a variety of tasks.
 On real-time scenarios such as inference on autonomous vehicles, the inference 
speed of the model is critical.
@@ -1952,7 +2082,7 @@ We show that automatic optimization in TVM makes it easy 
and flexible to support
  <entry>
    <title>TVM Deep Learning Compiler Joins Apache Software Foundation</title>
    <link href="https://tvm.apache.org/2019/03/18/tvm-apache-announcement"/>
-   <updated>2019-03-18T00:00:00-04:00</updated>
+   <updated>2019-03-18T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</id>
    <content type="html">&lt;p&gt;There is an increasing need to bring machine 
learning to a wide diversity of hardware devices. Current frameworks rely on 
vendor-specific operator libraries and optimize for a narrow range of 
server-class GPUs. Deploying workloads to new platforms – such as mobile 
phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) – requires 
significant manual effort.&lt;/p&gt;
 
@@ -1975,7 +2105,7 @@ We show that automatic optimization in TVM makes it easy 
and flexible to support
  <entry>
    <title>TVM Golang Runtime for Deep Learning Deployment</title>
    <link href="https://tvm.apache.org/2019/01/19/Golang"/>
-   <updated>2019-01-19T00:00:00-05:00</updated>
+   <updated>2019-01-19T00:00:00-08:00</updated>
    <id>https://tvm.apache.org/2019/01/19/Golang</id>
    <content type="html">&lt;h2 
id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
 
@@ -2145,7 +2275,7 @@ closure as TVM packed function and invoke the same across 
programming language b
  <entry>
    <title>Automating Generation of Low Precision Deep Learning 
Operators</title>
    <link href="https://tvm.apache.org/2018/12/18/lowprecision-conv"/>
-   <updated>2018-12-18T00:00:00-05:00</updated>
+   <updated>2018-12-18T00:00:00-08:00</updated>
    <id>https://tvm.apache.org/2018/12/18/lowprecision-conv</id>
    <content type="html">&lt;p&gt;As deep learning models grow larger and more 
complex, deploying them on low powered phone and IoT
 devices becomes challenging because of their limited compute and energy 
budgets. A  recent  trend
@@ -2306,7 +2436,7 @@ Note: x86 doesn’t support a vectorized popcount for this 
microarchitecture, so
  <entry>
    <title>Efficient Privacy-Preserving ML Using TVM</title>
    <link href="https://tvm.apache.org/2018/10/09/ml-in-tees"/>
-   <updated>2018-10-09T00:00:00-04:00</updated>
+   <updated>2018-10-09T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/10/09/ml-in-tees</id>
    <content type="html">&lt;p&gt;This post describes Myelin, a framework for 
privacy-preserving machine learning in trusted hardware enclaves, and how TVM 
makes Myelin fast.
 The key idea is that TVM, unlike other popular ML frameworks, compiles models 
into lightweight, optimized, and dependency-free libraries which can fit into 
resource constrained enclaves.&lt;/p&gt;
@@ -2422,7 +2552,7 @@ His research interest is in the general domain of ML on 
shared private data, but
  <entry>
    <title>Automatic Kernel Optimization for Deep Learning on All Hardware 
Platforms</title>
    <link href="https://tvm.apache.org/2018/10/03/auto-opt-all"/>
-   <updated>2018-10-03T00:00:00-04:00</updated>
+   <updated>2018-10-03T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/10/03/auto-opt-all</id>
    <content type="html">&lt;p&gt;Optimizing the performance of deep neural 
network on a diverse range of hardware platforms is still a hard
 problem for AI developers. In terms of system support, we are facing a 
many-to-many problem here:
@@ -2816,7 +2946,7 @@ for inference deployment. TVM just provides such a 
solution.&lt;/p&gt;
  <entry>
    <title>Building a Cross-Framework Deep Learning Compiler via DLPack</title>
    <link href="https://tvm.apache.org/2018/08/10/DLPack-Bridge"/>
-   <updated>2018-08-10T00:00:00-04:00</updated>
+   <updated>2018-08-10T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/08/10/DLPack-Bridge</id>
    <content type="html">&lt;p&gt;Deep learning frameworks such as Tensorflow, 
PyTorch, and ApacheMxNet provide a
 powerful toolbox for quickly prototyping and deploying deep learning models.
@@ -2955,7 +3085,7 @@ support, and can be used to implement convenient 
converters, such as
  <entry>
    <title>VTA: An Open, Customizable Deep Learning Acceleration Stack </title>
    <link href="https://tvm.apache.org/2018/07/12/vta-release-announcement"/>
-   <updated>2018-07-12T00:00:00-04:00</updated>
+   <updated>2018-07-12T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/07/12/vta-release-announcement</id>
    <content type="html">&lt;p style=&quot;text-align: center&quot;&gt;Thierry 
Moreau(VTA architect), Tianqi Chen(TVM stack), Ziheng Jiang†(graph 
compilation), Luis Vega(cloud deployment)&lt;/p&gt;
 &lt;p style=&quot;text-align: center&quot;&gt;Advisors: Luis Ceze, Carlos 
Guestrin, Arvind Krishnamurthy&lt;/p&gt;
@@ -3097,7 +3227,7 @@ This kind of high-level visibility is essential to system 
designers who want to
  <entry>
    <title>Bringing TVM into TensorFlow for Optimizing Neural Machine 
Translation on GPU</title>
    <link href="https://tvm.apache.org/2018/03/23/nmt-transformer-optimize"/>
-   <updated>2018-03-23T00:00:00-04:00</updated>
+   <updated>2018-03-23T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</id>
    <content type="html">&lt;h2 id=&quot;author&quot;&gt;Author&lt;/h2&gt;
 
@@ -3363,7 +3493,7 @@ C = tvm.compute(
  <entry>
    <title>Compiling Deep Learning Models to WebGL with TVM</title>
    <link href="https://tvm.apache.org/2018/03/12/webgl"/>
-   <updated>2018-03-12T00:00:00-04:00</updated>
+   <updated>2018-03-12T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/03/12/webgl</id>
    <content type="html">&lt;p&gt;Now TVM comes with a brand-new OpenGL/WebGL 
backend!
 This blog post explains what it is, and what you can achieve with it.&lt;/p&gt;
@@ -3479,7 +3609,7 @@ optimizations into the TVM stack.&lt;/p&gt;
  <entry>
    <title>Optimizing Mobile Deep Learning on ARM GPU with TVM</title>
    <link href="https://tvm.apache.org/2018/01/16/opt-mali-gpu"/>
-   <updated>2018-01-16T00:00:00-05:00</updated>
+   <updated>2018-01-16T00:00:00-08:00</updated>
    <id>https://tvm.apache.org/2018/01/16/opt-mali-gpu</id>
    <content type="html">&lt;p&gt;With the great success of deep learning, the 
demand for
 deploying deep neural networks to mobile devices is growing rapidly.
@@ -4053,7 +4183,7 @@ advice and &lt;a 
href=&quot;https://github.com/yzhliu&quot;&gt;Yizhi Liu&lt;/a&g
  <entry>
    <title>Remote Profile and Test Deep Learning Cross Compilation on Mobile 
Phones with TVM RPC</title>
    <link href="https://tvm.apache.org/2017/11/08/android-rpc-introduction"/>
-   <updated>2017-11-08T00:00:00-05:00</updated>
+   <updated>2017-11-08T00:00:00-08:00</updated>
    <id>https://tvm.apache.org/2017/11/08/android-rpc-introduction</id>
    <content type="html">&lt;p&gt;TVM stack is an end to end compilation stack 
to deploy deep learning workloads to all hardware backends.
 Thanks to the NNVM compiler support of TVM stack, we can now directly compile 
descriptions from deep learning frameworks and compile them to bare metal code.
@@ -4281,7 +4411,7 @@ make jvminstall
  <entry>
    <title>Bringing AMDGPUs to TVM Stack and NNVM Compiler with ROCm</title>
    <link 
href="https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm"/>
-   <updated>2017-10-30T00:00:00-04:00</updated>
+   <updated>2017-10-30T00:00:00-07:00</updated>
    
<id>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</id>
    <content type="html">&lt;p style=&quot;text-align: center&quot;&gt;Aditya 
Atluri, Advanced Micro Devices, Inc.&lt;/p&gt;
 &lt;p style=&quot;text-align: center&quot;&gt;Masahiro Masuda, Ziosoft, 
Inc.&lt;/p&gt;
@@ -4504,88 +4634,5 @@ BB0_6:
 </content>
  </entry>
  
- <entry>
-   <title>NNVM Compiler: Open Compiler for AI Frameworks</title>
-   <link href="https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement"/>
-   <updated>2017-10-06T11:30:00-04:00</updated>
-   <id>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</id>
-   <content type="html">&lt;p style=&quot;text-align: center&quot;&gt;Paul G. 
Allen School of Computer Science &amp;amp; Engineering, University of 
Washington&lt;/p&gt;
-&lt;p style=&quot;text-align: center&quot;&gt;Amazon Web Service AI 
team&lt;/p&gt;
-&lt;p style=&quot;text-align: center&quot;&gt;DMLC open-source 
community&lt;/p&gt;
-
-&lt;p&gt;Deep learning has become ubiquitous and indispensable. We are seeing 
a rising need for deploying deep learning workloads on many kinds of platforms 
such as mobile phones, GPU, IoT devices and specialized accelerators.  Last 
month, we announced TVM stack to close the gap between deep learning 
frameworks, and the performance- or efficiency-oriented hardware backends.  TVM 
stack makes it easy to build an end to end compilation for a deep learning 
framework.  However, we think it wo [...]
-
-&lt;p&gt;Today, UW Allen school and AWS AI team, together with other 
contributors, are excited to announce the release of NNVM compiler, an open 
deep learning compiler to compile front-end framework workloads directly to 
hardware backends. We build it using the two-level intermediate 
representation(IR) in the TVM stack.
-The reader is welcome to refer to the &lt;a 
href=&quot;http://www.tvmlang.org/2017/08/17/tvm-release-announcement.html&quot;&gt;original
 TVM announcement&lt;/a&gt; for more technical details about TVM stack. With 
the help of TVM stack, NNVM compiler can:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;Represent and optimize the common deep learning workloads in high 
level graph IR&lt;/li&gt;
-  &lt;li&gt;Transform the computation graph to minimize memory utilization, 
optimize data layout and fuse computation patterns for different hardware 
backends.&lt;/li&gt;
-  &lt;li&gt;Present an end to end compilation pipeline from front-end deep 
learning frameworks to bare metal hardwares.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/nnvm/nnvm_compiler_stack.png&quot; alt=&quot;image&quot; 
width=&quot;612px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;The NNVM compiler can directly take models from deep learning 
frameworks such as Apache MXNet.
-It also support model exchange formats such as ONNX and CoreML. ONNX support 
enables NNVM to compile deep learning models from PyTorch, Caffe2 and CNTK.
-The CoreML frontend enables deployment of CoreML models to non-iOS 
devices.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/nnvm/nnvm_compiler_code.png&quot; alt=&quot;image&quot; 
width=&quot;712px&quot; /&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;separation-of-optimization-and-deployment&quot;&gt;Separation 
of Optimization and Deployment&lt;/h2&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/nnvm/nnvm_deploy.png&quot; alt=&quot;image&quot; 
width=&quot;512px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;NNVM compiler applies graph level and tensor level optimizations and 
jointly optimize them to get the best performance. We take a different approach 
from existing deep learning frameworks, which packages the graph optimization 
with the deployment runtime.  NNVM compiler adopts the conventional wisdom from 
compiler to separate the optimization from the actual deployment runtime. This 
approach offers substantial optimization but still keeps the runtime 
lightweight. The compiled mo [...]
-
-&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
-
-&lt;p&gt;NNVM compiler is still under active development, and we can expect 
more improvements to come, but we have started to see promising results.
-We benchmarked its performance and compared it against Apache MXNet on two 
typical hardware configurations: ARM CPU on Raspberry PI and Nvidia GPU on AWS. 
Despite the radical architecture difference between these two chips, we can use 
the same infrastructure and only need to change the schedule for each type of 
hardware.&lt;/p&gt;
-
-&lt;h3 id=&quot;nvidia-gpu&quot;&gt;Nvidia GPU&lt;/h3&gt;
-
-&lt;p&gt;GPU benchmarks and schedules are contributed by Leyuan Wang 
(AWS/UCDavis) and Yuwei Hu (TuSimple). We compared the NNVM compiler against 
Apache MXNet with CUDA8 and cuDNN7 as the backend on Nvidia K80. This is a very 
strong baseline, as Apache MXNet turns on auto-tuning to select the best kernel 
from CuDNN. We also used the optimized depthwise kernel in MXNet to optimize 
MobileNet workload.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/nnvm/nnvm_k80_result.png&quot; alt=&quot;image&quot; 
width=&quot;400px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;As can be seen, NNVM compiler generate code that outperforms Apache 
MXNet on K80. These improvements are due to the joint graph level and kernel 
level optimizations. It is worth noting that NNVM compiler generates all the 
optimized GPU kernels on its own without relying on external libraries like 
CuDNN.&lt;/p&gt;
-
-&lt;h3 id=&quot;raspberry-pi-3b&quot;&gt;Raspberry Pi 3b&lt;/h3&gt;
-
-&lt;p&gt;The Rasberry Pi compilation stack is contributed by Ziheng 
Jiang(AWS/FDU).
-We compared NNVM compiler against Apache MXNet with OpenBLAS and NNPack.
-We explored the setups to get the best performance out of MXNet: we turned on 
Winograd convolution in the NNPACK for 3x3 convolutions, enabled 
multi-threading and disabled the additional scheduler thread (so all threads 
are used by NNPack).&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/nnvm/nnvm_rasp_result.png&quot; alt=&quot;image&quot; 
width=&quot;400px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;As can be seen, the code generated by NNVM compiler is two times 
faster on ResNet18.
-The gap on MobileNet is mainly due to lack of depthwise convolution in 
existing CPU DNN libraries. NNVM compiler takes benefit of direct generating 
efficient ARM code directly.&lt;/p&gt;
-
-&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;
-&lt;p&gt;This project wouldn’t become possible without our early contributors 
in the DMLC community.
-We would like to specially thank Yuwei Hu(TuSimple), Leyuan Wang(AWS/UCDavis), 
Joshua Z. Zhang(AWS)
-and Xingjian Shi(HKUST) for their early contributions to the project. We would 
also like to thank all the contributors
-to the TVM stack.&lt;/p&gt;
-
-&lt;p&gt;We also learnt a lot from the following projects when building NNVM 
Compiler.&lt;/p&gt;
-&lt;ul&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://github.com/Theano/Theano&quot;&gt;Theano&lt;/a&gt;: possibly 
the earliest compiler for deep learning&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://github.com/halide/Halide&quot;&gt;Halide&lt;/a&gt;: TVM uses 
&lt;a href=&quot;https://github.com/dmlc/HalideIR&quot;&gt;HalideIR&lt;/a&gt; 
as data structure for
-arithematic simplification and low level lowering. HalideIR is derived from 
Halide.
-We also learns from Halide when implementing the lowering pipeline in 
TVM.&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://github.com/inducer/loopy&quot;&gt;Loopy&lt;/a&gt;: use of 
integer set analysis and its loop transformation primitives.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;
-&lt;ul&gt;
-  &lt;li&gt;Github page of NNVM Compiler: &lt;a 
href=&quot;https://github.com/dmlc/nnvm&quot;&gt;https://github.com/dmlc/nnvm&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;Github page of TVM: &lt;a 
href=&quot;https://github.com/dmlc/tvm&quot;&gt;https://github.com/dmlc/tvm&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://news.cs.washington.edu/2017/10/06/allen-school-and-aws-team-up-on-new-nnvm-compiler-for-deep-learning-frameworks/&quot;&gt;UW
 Allen school blog about NNVM compiler&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://aws.amazon.com/blogs/ai/introducing-nnvm-compiler-a-new-open-end-to-end-compiler-for-ai-frameworks/&quot;&gt;AWS
 blogpost about NNVM compiler&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;
-</content>
- </entry>
- 
  
 </feed>
diff --git a/blog.html b/blog.html
index 8dbab07..ae12173 100644
--- a/blog.html
+++ b/blog.html
@@ -146,6 +146,16 @@
 
 <li>
   <span>
+    <a class="post-link" href="/2021/03/03/intro-auto-scheduler">Introducing 
TVM Auto-scheduler (a.k.a. Ansor)</a>
+  </span>
+  </br>
+  <span>
+    Mar 3, 2021
+  </span>
+</li>
+
+<li>
+  <span>
     <a class="post-link" href="/2020/09/26/bring-your-own-datatypes">Bring 
Your Own Datatypes: Enabling Custom Datatype Exploration in TVM</a>
   </span>
   </br>
diff --git a/community.html b/community.html
index 365bb79..d3e347f 100644
--- a/community.html
+++ b/community.html
@@ -279,6 +279,10 @@ This is a community maintained list of organizations using 
and contributing to t
     </li>
     
     <li>
+        <img src="/images/community/sjtu.png" />
+    </li>
+    
+    <li>
         <img src="/images/community/ucberkeley.png" />
     </li>
     
diff --git a/feed.xml b/feed.xml
index a3d90e2..5d387ea 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,124 @@
-<?xml version="1.0" encoding="utf-8"?><feed 
xmlns="http://www.w3.org/2005/Atom"; ><generator uri="https://jekyllrb.com/"; 
version="4.1.1">Jekyll</generator><link href="/feed.xml" rel="self" 
type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" 
/><updated>2021-01-04T16:22:52-05:00</updated><id>/feed.xml</id><title 
type="html">TVM</title><author><name>{&quot;name&quot;=&gt;nil}</name></author><entry><title
 type="html">Bring Your Own Datatypes: Enabling Custom Datatype [...]
+<?xml version="1.0" encoding="utf-8"?><feed 
xmlns="http://www.w3.org/2005/Atom"; ><generator uri="https://jekyllrb.com/"; 
version="4.1.1">Jekyll</generator><link href="/feed.xml" rel="self" 
type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" 
/><updated>2021-03-03T01:20:46-08:00</updated><id>/feed.xml</id><title 
type="html">TVM</title><author><name>{&quot;name&quot;=&gt;nil}</name></author><entry><title
 type="html">Introducing TVM Auto-scheduler (a.k.a. Ansor)</tit [...]
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and 
layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described 
by mathematical expressions.
+However, providing high-performance implementations for them on modern 
hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware 
intrinsics to achieve high performance.
+It takes huge engineering effort to build linear algebra and neural network 
acceleration libraries like cuBLAS, cuDNN, oneMKL, and oneDNN.&lt;/p&gt;
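To make the gap concrete, here is an illustrative pure-Python sketch (not from the post, and not TVM code): the mathematical definition C[i][j] = sum_k A[i][k] * B[k][j] transcribes directly into a naive loop nest, and everything beyond this naive form, such as tiling, vectorization, and hardware intrinsics, is exactly what the hand-optimized libraries add.

```python
def matmul(a, b):
    """Naive matmul, a direct transcription of the math definition."""
    n, k = len(a), len(b)
    m = len(b[0])
    # High-performance implementations keep this same math but add tiling,
    # vectorization, and hardware intrinsics on top of it.
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # prints [[19, 22], [43, 50]]
```

The expression is trivial to write; making it run fast on a given CPU or GPU is the hard part that the rest of the post is about.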
+
+&lt;p&gt;Life would be much easier if we could just write mathematical 
expressions and have something
+magically turn them into efficient code implementations.
+Three years ago, the deep learning compiler TVM and its search module AutoTVM 
were built as the first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient 
implementations for a given tensor computation.
+However, being template-based, it still requires domain experts to write a 
non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM 
code repository.
+Besides being very hard to develop, these templates often have inefficient and 
limited search spaces,
+making them unable to achieve optimal performance.&lt;/p&gt;
+
+&lt;p&gt;To address the limitations of AutoTVM, we started the Ansor project, 
aiming at a fully automated auto-scheduler for 
+generating code for tensor computations.
+Ansor auto-scheduler only takes tensor expressions as input and generates 
high-performance code without manual templates.
+We made innovations in the search space construction and search algorithm.
+As a result, the auto-scheduler can achieve better performance with less 
search time in a more automated way.&lt;/p&gt;
+
+&lt;p&gt;Ansor auto-scheduler is now integrated into Apache TVM as the &lt;code 
class=&quot;language-plaintext 
highlighter-rouge&quot;&gt;tvm.auto_scheduler&lt;/code&gt; package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS and 
OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and 
Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some 
benchmark results.&lt;/p&gt;
+
+&lt;h1 id=&quot;system-overview&quot;&gt;System Overview&lt;/h1&gt;
+
+&lt;h2 id=&quot;autotvm-vs-auto-scheduler&quot;&gt;AutoTVM vs 
Auto-scheduler&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/workflow.png&quot; alt=&quot;image&quot; 
width=&quot;75%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Table 1. Workflow Comparison &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Table 1 compares the workflow for generating code for an operator in 
AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor 
expression language.
+This part is relatively easy because TVM’s tensor expression language looks 
just like math expressions.
+In step 2, the developer has to write a schedule template, which typically 
consists of 20-100 lines of tricky DSL code.
+This part requires expertise in both the target hardware architecture 
and operator semantics, so it is difficult.
+The last step, step 3, is automated by a search algorithm.&lt;/p&gt;
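As a toy illustration of step 3 (not TVM's implementation; the knob names and the mock measurement function are invented for this sketch), the search algorithm simply tries configurations from the template-defined space and keeps the fastest one. In real AutoTVM the measurement is a compile-and-run on actual hardware.

```python
import itertools
import random

random.seed(0)

def mock_measure(tile_x, tile_y):
    # Stand-in for compiling the kernel with these tile sizes and timing it
    # on hardware; lower is better. Purely synthetic for this illustration.
    return abs(tile_x - 16) + abs(tile_y - 8) + random.random()

# Step 2 (the manual part in AutoTVM): a human enumerates the tunable knobs.
space = list(itertools.product([4, 8, 16, 32], [2, 4, 8, 16]))

# Step 3 (automated): search the space for the fastest configuration.
best = min(space, key=lambda cfg: mock_measure(*cfg))
print(best)  # prints (16, 8)
```

The search itself is mechanical; the expensive human work in AutoTVM is defining which knobs exist, which is the step auto-scheduler removes.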
+
+&lt;p&gt;In auto-scheduler, we eliminate the most difficult step 2 by 
automatic search space construction and accelerate step 3 with a better search 
algorithm.
+By doing automatic search space construction, we not only eliminate huge 
manual effort, 
+but also enable the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules 
to generate the search space.
+However, these rules are very general. They are based on static analysis of 
the tensor expressions.
+We only need to design a few general rules once and can apply them to almost 
all tensor computations in deep learning.&lt;/p&gt;
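The flavor of these rules can be sketched in a few lines of toy Python (a deliberate simplification; the rule names and the operator representation here are illustrative, not TVM's API). Static properties of the tensor expression, such as whether it has a reduction, decide which rules fire.

```python
def has_reduction(op):
    return bool(op["reduce_axes"])

def rule_multi_level_tiling(op):
    # Compute-intensive ops (those with reductions) get multi-level tiling.
    return ["multi-level tiling"] if has_reduction(op) else []

def rule_fuse_elementwise(op):
    # Elementwise ops are fused into their consumers instead.
    return [] if has_reduction(op) else ["fuse into consumer"]

RULES = [rule_multi_level_tiling, rule_fuse_elementwise]

def generate_sketch(op):
    # Static analysis of the expression decides which rules apply.
    steps = []
    for rule in RULES:
        steps += rule(op)
    return steps

matmul_op = {"name": "matmul", "reduce_axes": ["k"]}
relu_op = {"name": "relu", "reduce_axes": []}
print(generate_sketch(matmul_op))  # prints ['multi-level tiling']
print(generate_sketch(relu_op))    # prints ['fuse into consumer']
```

Because the rules key off structural properties rather than specific operators, the same handful of rules covers matmul, conv2d, and most other deep learning workloads.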
+
+&lt;h2 id=&quot;search-process&quot;&gt;Search Process&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/search_overview.png&quot; 
alt=&quot;image&quot; width=&quot;40%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 1. Search Process Overview  &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 1 shows the search process of auto-scheduler when optimizing 
a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator 
fusion pass.
+A task scheduler is utilized to allocate the tuning time budget across the 
many subgraphs.
+At each iteration, it picks a subgraph that has the most potential to increase 
the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several 
sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of 
optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback 
to update all components of the system.
+This process is repeated iteratively until the optimization converges or we 
run out of time budget.
+More technical details can be found in our paper [3] and our code.&lt;/p&gt;
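The task scheduler's greedy allocation can be simulated with a toy model (illustrative only; the 10% per-round improvement and the gain estimate are invented assumptions, not TVM's actual formula): each round, tune the subgraph whose estimated contribution to end-to-end latency reduction is largest.

```python
# Each task is one fused subgraph; "count" is how often it appears in the net.
tasks = {
    "conv2d_1": {"latency": 10.0, "count": 4},
    "dense_1":  {"latency": 5.0,  "count": 1},
    "conv2d_2": {"latency": 2.0,  "count": 2},
}

def estimated_gain(t):
    # Assume one more tuning round shaves ~10% off a task's latency, so the
    # end-to-end gain is proportional to latency times occurrence count.
    return 0.1 * t["latency"] * t["count"]

for _ in range(3):  # three tuning rounds of budget to allocate
    name = max(tasks, key=lambda n: estimated_gain(tasks[n]))
    tasks[name]["latency"] *= 0.9  # pretend the round improved it by 10%

end_to_end = sum(t["latency"] * t["count"] for t in tasks.values())
print(round(end_to_end, 2))  # prints 38.16
```

All three rounds go to conv2d_1, the dominant subgraph, which is the intended behavior: time is spent where it moves end-to-end latency the most.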
+
+&lt;p&gt;It is worth noting that since the auto-scheduler generates schedules 
from scratch, 
+it reuses the existing computation definitions in TOPI but not the schedule 
templates.&lt;/p&gt;
+
+&lt;h1 id=&quot;benchmark-results&quot;&gt;Benchmark Results&lt;/h1&gt;
+&lt;p&gt;In this section, we benchmark the performance of AutoTVM and 
Auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge, which is equipped with an 
18-core Intel Skylake 8124M CPU. 
+The GPU benchmark is done on an AWS g4dn.4xlarge, which is equipped with an 
NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo 
[2].&lt;/p&gt;
+
+&lt;h2 id=&quot;performance-of-the-generated-code&quot;&gt;Performance of the 
generated code&lt;/h2&gt;
+&lt;p&gt;We benchmark the fp32 single-batch inference latency on three 
networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see auto-scheduler outperforms AutoTVM in all cases with 1.02x to 8.95x 
speedup.
+This is because auto-scheduler explores a larger search space, which covers 
more efficient combinations
+of optimizations that are missed by the manual TOPI templates.
+BERT-base@GPU is an extreme case where the manual templates fall far 
short:
+the manual template for dense layers does not perform well for 
the shapes in the BERT model.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/code_perf.png&quot; 
alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 2. Code Performance Comparison (Higher is better) 
&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;search-time&quot;&gt;Search Time&lt;/h2&gt;
+&lt;p&gt;Search-based approaches can be very time-consuming, 
so we also care about the search time.
+It typically takes several hours to let the search converge for a single 
neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its 
larger search space.
+This is mainly because auto-scheduler has a better cost model and task 
scheduler.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/search_time.png&quot; 
alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 3. Search Time Comparison (Lower is better) 
&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;more-results&quot;&gt;More Results&lt;/h2&gt;
+&lt;p&gt;The repo above serves as an internal benchmark tool for TVM, so it 
only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, this blog post [4] also tried auto-scheduler on an Apple M1 chip and 
got some good results.&lt;/p&gt;
+
+&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
+&lt;p&gt;We built TVM auto-scheduler, a system that automatically generates 
high-performance code for tensor expressions.
+Compared with the predecessor AutoTVM, auto-scheduler does not require manual 
templates.
+In addition, auto-scheduler is capable of generating schedules with better 
performance in a shorter time.
+We achieve this by making innovations in the search space construction and 
search algorithm.&lt;/p&gt;
+
+&lt;p&gt;We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending auto-scheduler to better 
support
+sparse operators, low-precision operators, and dynamic shapes.&lt;/p&gt;
+
+&lt;h1 id=&quot;links&quot;&gt;Links&lt;/h1&gt;
+&lt;p&gt;[1] Tutorials: &lt;a 
href=&quot;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&quot;&gt;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&lt;/a&gt;&lt;br
 /&gt;
+[2] Benchmark repo: &lt;a 
href=&quot;https://github.com/tlc-pack/TLCBench&quot;&gt;https://github.com/tlc-pack/TLCBench&lt;/a&gt;&lt;br
 /&gt;
+[3] OSDI Paper: &lt;a 
href=&quot;https://arxiv.org/abs/2006.06762&quot;&gt;Ansor: Generating 
High-Performance Tensor Programs for Deep Learning&lt;/a&gt;&lt;br /&gt;
+[4] Results on Apple M1 chip: &lt;a 
href=&quot;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&quot;&gt;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&lt;/a&gt;.&lt;/p&gt;</content><author><name>Lianmin
 Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu</name></author><summary 
type="html">Optimizing the execution speed of deep neural networks i [...]
 
 &lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
 
@@ -282,7 +402,7 @@ For more documentation about the Bring Your Own Datatypes 
framework
       &lt;p&gt;&lt;a 
href=&quot;https://posithub.org/docs/BeatingFloatingPoint.pdf&quot; 
target=&quot;_blank&quot;&gt;Beating Floating Point at its Own Game: Posit 
Arithmetic&lt;/a&gt; &lt;a href=&quot;#fnref:posit&quot; 
class=&quot;reversefootnote&quot; 
role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
     &lt;/li&gt;
   &lt;/ol&gt;
-&lt;/div&gt;</content><author><name>Gus Smith, Andrew 
Liu</name></author><summary type="html">In this post, we describe the Bring 
Your Own Datatypes framework, which enables the use of custom datatypes within 
TVM.</summary></entry><entry><title type="html">How to Bring Your Own Codegen 
to TVM</title><link href="/2020/07/15/how-to-bring-your-own-codegen-to-tvm" 
rel="alternate" type="text/html" title="How to Bring Your Own Codegen to TVM" 
/><published>2020-07-15T00:00:00-04:00</published>< [...]
+&lt;/div&gt;</content><author><name>Gus Smith, Andrew 
Liu</name></author><summary type="html">In this post, we describe the Bring 
Your Own Datatypes framework, which enables the use of custom datatypes within 
TVM.</summary></entry><entry><title type="html">How to Bring Your Own Codegen 
to TVM</title><link href="/2020/07/15/how-to-bring-your-own-codegen-to-tvm" 
rel="alternate" type="text/html" title="How to Bring Your Own Codegen to TVM" 
/><published>2020-07-15T00:00:00-07:00</published>< [...]
 
 &lt;p&gt;However, users have to learn a new programming interface when they 
attempt to work on a new kernel library or a device. As a result, the demand 
for a unified programming interface becomes more and more important to let all 
users and hardware backend providers stand on the same page.&lt;/p&gt;
 
@@ -751,7 +871,7 @@ Figure 4: After Graph Partitioning.
 
 &lt;h2 id=&quot;acknowledgment&quot;&gt;Acknowledgment&lt;/h2&gt;
 
-&lt;p&gt;We would like to thank our colleague Animesh Jain for valuable 
discussions in the framework design; Tianqi Chen and Jared Roesch from OctoML 
for system design discussions and prototyping; Masahiro Masuda from the TVM 
community to help code review and improve the DNNL integration. We would also 
like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and 
Luke Hutton from ARM, U.K. for contributing several helpful ideas, related 
Relay passes, and the Arm Compute Li [...]
+&lt;p&gt;We would like to thank our colleague Animesh Jain for valuable 
discussions in the framework design; Tianqi Chen and Jared Roesch from OctoML 
for system design discussions and prototyping; Masahiro Masuda from the TVM 
community to help code review and improve the DNNL integration. We would also 
like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and 
Luke Hutton from ARM, U.K. for contributing several helpful ideas, related 
Relay passes, and the Arm Compute Li [...]
  the Jupyter Notebook to follow along is on &lt;a 
href=&quot;https://github.com/t-vi/pytorch-tvmisc/tree/master/transformers-pytorch-tvm/&quot;&gt;github&lt;/a&gt;.)&lt;/p&gt;
 
 &lt;p&gt;Some of the most intriguing applications of Artificial Intelligence 
have been in Natural Language Processing.
@@ -1264,7 +1384,7 @@ one would want to re-do cheap computation, most 
prominently point-wise computati
 &lt;h1 id=&quot;author&quot;&gt;Author&lt;/h1&gt;
 
 &lt;p&gt;&lt;a href=&quot;https://lernapparat.de/&quot;&gt;Thomas 
Viehmann&lt;/a&gt; is the founder of &lt;a 
href=&quot;https://mathinf.eu/&quot;&gt;MathInf GmbH&lt;/a&gt;, Munich, 
Germany, a boutique training and consultancy firm focusing on Machine Learning 
and PyTorch.
-He is a PyTorch core developer and co-authored &lt;a 
href=&quot;https://www.manning.com/books/deep-learning-with-pytorch&quot;&gt;Deep
 Learning with PyTorch&lt;/a&gt;, which currently available as &lt;a 
href=&quot;https://pytorch.org/deep-learning-with-pytorch&quot;&gt;free 
download from the PyTorch 
website&lt;/a&gt;.&lt;/p&gt;</content><author><name>Thomas Viehmann, MathInf 
GmbH</name></author><summary type="html"></summary></entry><entry><title 
type="html">TinyML - How TVM is Taming Ti [...]
+He is a PyTorch core developer and co-authored &lt;a 
href=&quot;https://www.manning.com/books/deep-learning-with-pytorch&quot;&gt;Deep
 Learning with PyTorch&lt;/a&gt;, which currently available as &lt;a 
href=&quot;https://pytorch.org/deep-learning-with-pytorch&quot;&gt;free 
download from the PyTorch 
website&lt;/a&gt;.&lt;/p&gt;</content><author><name>Thomas Viehmann, MathInf 
GmbH</name></author><summary type="html"></summary></entry><entry><title 
type="html">TinyML - How TVM is Taming Ti [...]
 
 &lt;p&gt;The proliferation of low-cost, AI-powered consumer devices has led to 
widespread interest in “bare-metal” (low-power, often without an operating 
system) devices among ML researchers and practitioners.  While it is already 
possible for experts to run &lt;em&gt;some&lt;/em&gt; models on 
&lt;em&gt;some&lt;/em&gt; bare-metal devices, optimizing models for diverse 
sets of devices is challenging, often requiring manually optimized 
device-specific libraries.  And for those platforms wi [...]
 
@@ -1563,7 +1683,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix 
multiplication microkernel&lt;/
   &lt;li&gt;&lt;a 
href=&quot;https://homes.cs.washington.edu/~moreau/&quot;&gt;Thierry 
Moreau&lt;/a&gt;, for mentoring me during my time at OctoML.&lt;/li&gt;
   &lt;li&gt;&lt;a 
href=&quot;https://homes.cs.washington.edu/~vegaluis/&quot;&gt;Luis 
Vega&lt;/a&gt;, for teaching me the fundamentals of interacting with 
microcontrollers.&lt;/li&gt;
   &lt;li&gt;&lt;a 
href=&quot;https://www.linkedin.com/in/themadrasi/?originalSubdomain=uk&quot;&gt;Ramana
 Radhakrishnan&lt;/a&gt;, for supplying the Arm hardware used in our 
experiments and for providing guidance on its usage.&lt;/li&gt;
-&lt;/ul&gt;</content><author><name>Logan Weber and Andrew Reusch, 
OctoML</name></author><summary type="html"></summary></entry><entry><title 
type="html">Compiling Machine Learning to WASM and WebGPU with Apache 
TVM</title><link 
href="/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu" 
rel="alternate" type="text/html" title="Compiling Machine Learning to WASM and 
WebGPU with Apache TVM" 
/><published>2020-05-14T00:00:00-04:00</published><updated>2020-05-14T00:00:00-04:00</upd
 [...]
+&lt;/ul&gt;</content><author><name>Logan Weber and Andrew Reusch, 
OctoML</name></author><summary type="html"></summary></entry><entry><title 
type="html">Compiling Machine Learning to WASM and WebGPU with Apache 
TVM</title><link 
href="/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu" 
rel="alternate" type="text/html" title="Compiling Machine Learning to WASM and 
WebGPU with Apache TVM" 
/><published>2020-05-14T00:00:00-07:00</published><updated>2020-05-14T00:00:00-07:00</upd
 [...]
 
 &lt;p&gt;We introduced support for WASM and WebGPU to the Apache TVM deep 
learning compiler. Our experiments shows that  TVM’s WebGPU backend can get 
&lt;strong&gt;close to native&lt;/strong&gt; &lt;strong&gt;GPU 
performance&lt;/strong&gt; when deploying models to the web.&lt;/p&gt;
 
@@ -1641,7 +1761,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix 
multiplication microkernel&lt;/
 
 &lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;
 
-&lt;p&gt;We would like to thank the emscripten project for providing the WASM 
compilation infrastructures as well as the JS library support on the web. We 
would also like to thank the WebGPU community for various helpful discussions. 
Thanks to Fletcher Haynes for valuable feedbacks to the 
post.&lt;/p&gt;</content><author><name>Tianqi Chen and Jared Roesch, 
OctoML</name></author><summary type="html">TLDR</summary></entry><entry><title 
type="html">Integrating TVM into PyTorch</title><link  [...]
+&lt;p&gt;We would like to thank the emscripten project for providing the WASM 
compilation infrastructures as well as the JS library support on the web. We 
would also like to thank the WebGPU community for various helpful discussions. 
Thanks to Fletcher Haynes for valuable feedbacks to the 
post.&lt;/p&gt;</content><author><name>Tianqi Chen and Jared Roesch, 
OctoML</name></author><summary type="html">TLDR</summary></entry><entry><title 
type="html">Integrating TVM into PyTorch</title><link  [...]
 it has become clear that PyTorch stands to benefit from directly leveraging 
the compiler stack.
 A major tenet of PyTorch is providing seamless and robust integrations that 
don’t get in the user’s way.
 To that end, PyTorch now has an official TVM-based backend, &lt;a 
href=&quot;https://github.com/pytorch/tvm&quot;&gt;torch_tvm&lt;/a&gt;.&lt;/p&gt;
@@ -1733,7 +1853,7 @@ def mul(a, b, c):
 
 # via script
 relay_graph = torch_tvm.to_relay(mul, inputs)
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;</content><author><name>Bram 
Wasti</name></author><summary type="html">As TVM continuously demonstrates 
improvements to the efficiency of deep learning execution, it has become clear 
that PyTorch stands to benefit from directly leveraging the compiler stack. A 
major tenet of PyTorch is providing seamless and robust integrations that don’t 
get in the user’s way. To that end, PyTorch now has an official TVM-based 
backend, torch_tvm.</summary [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;</content><author><name>Bram 
Wasti</name></author><summary type="html">As TVM continuously demonstrates 
improvements to the efficiency of deep learning execution, it has become clear 
that PyTorch stands to benefit from directly leveraging the compiler stack. A 
major tenet of PyTorch is providing seamless and robust integrations that don’t 
get in the user’s way. To that end, PyTorch now has an official TVM-based 
backend, torch_tvm.</summary [...]
 On real-time scenarios such as inference on autonomous vehicles, the inference 
speed of the model is critical.
 Network quantization is an effective approach to accelerating deep learning 
models.
 In quantized models, both data and model parameters are represented with low 
precision data types such as &lt;code class=&quot;language-plaintext 
highlighter-rouge&quot;&gt;int8&lt;/code&gt; and &lt;code 
class=&quot;language-plaintext highlighter-rouge&quot;&gt;float16&lt;/code&gt;.
@@ -1868,7 +1988,7 @@ We show that automatic optimization in TVM makes it easy 
and flexible to support
 &lt;/ul&gt;
 
 &lt;h1 id=&quot;bio--acknowledgement&quot;&gt;Bio &amp;amp; 
Acknowledgement&lt;/h1&gt;
-&lt;p&gt;&lt;a href=&quot;https://wuwei.io/&quot;&gt;Wuwei Lin&lt;/a&gt; is an 
undergraduate student at SJTU. He is currently an intern at TuSimple. The 
author has many thanks to &lt;a 
href=&quot;https://homes.cs.washington.edu/~tqchen/&quot;&gt;Tianqi 
Chen&lt;/a&gt; and &lt;a 
href=&quot;https://homes.cs.washington.edu/~eqy/&quot;&gt;Eddie Yan&lt;/a&gt; 
for their reviews.&lt;/p&gt;</content><author><name>Wuwei 
Lin</name></author><summary type="html">Deep learning has been successfully ap 
[...]
+&lt;p&gt;&lt;a href=&quot;https://wuwei.io/&quot;&gt;Wuwei Lin&lt;/a&gt; is an 
undergraduate student at SJTU. He is currently an intern at TuSimple. The 
author has many thanks to &lt;a 
href=&quot;https://homes.cs.washington.edu/~tqchen/&quot;&gt;Tianqi 
Chen&lt;/a&gt; and &lt;a 
href=&quot;https://homes.cs.washington.edu/~eqy/&quot;&gt;Eddie Yan&lt;/a&gt; 
for their reviews.&lt;/p&gt;</content><author><name>Wuwei 
Lin</name></author><summary type="html">Deep learning has been successfully ap 
[...]
 
 &lt;p&gt;TVM is an open source deep learning compiler stack that closes the 
gap between the productivity-focused deep learning frameworks, and the 
performance- or efficiency-oriented hardware backends. Today, we are glad to 
announce that the TVM community has decided to move on to Apache incubator, and 
becomes an Apache(incubating) project.&lt;/p&gt;
 
@@ -1882,7 +2002,7 @@ We show that automatic optimization in TVM makes it easy 
and flexible to support
 
 &lt;p&gt;We would like to take this chance to thank the Allen School for 
supporting the SAMPL team that gave birth to the TVM project. We would also 
like to thank the Halide project which provided the basis for TVM’s loop-level 
IR and initial code generation. We would like to thank our Apache incubator 
mentors for introducing the project to Apache and providing useful guidance. 
Finally, we would like to thank the TVM community and all of the organizations, 
as listed above, that supported [...]
 
-&lt;p&gt;See also the &lt;a 
href=&quot;https://news.cs.washington.edu/2019/03/18/allen-schools-tvm-deep-learning-compiler-framework-transitions-to-apache/&quot;&gt;Allen
 School news about the transition here&lt;/a&gt;, &lt;a 
href=&quot;https://sampl.cs.washington.edu/tvmconf/#about-tvmconf&quot;&gt;TVM 
conference program slides and recordings&lt;/a&gt;, and &lt;a 
href=&quot;https://tvm.apache.org/docs//contribute/community.html&quot;&gt;our 
community guideline here&lt;/a&gt;. Follow us o [...]
+&lt;p&gt;See also the &lt;a 
href=&quot;https://news.cs.washington.edu/2019/03/18/allen-schools-tvm-deep-learning-compiler-framework-transitions-to-apache/&quot;&gt;Allen
 School news about the transition here&lt;/a&gt;, &lt;a 
href=&quot;https://sampl.cs.washington.edu/tvmconf/#about-tvmconf&quot;&gt;TVM 
conference program slides and recordings&lt;/a&gt;, and &lt;a 
href=&quot;https://tvm.apache.org/docs//contribute/community.html&quot;&gt;our 
community guideline here&lt;/a&gt;. Follow us o [...]
 
 &lt;p&gt;TVM is an open deep learning compiler stack to compile various deep 
learning models from different
 frameworks to CPU, GPU or specialized accelerators.  TVM supports model 
compilation from a wide range
@@ -2043,155 +2163,4 @@ closure as TVM packed function and invoke the same 
across programming language b
   &lt;li&gt;[5] &lt;a 
href=&quot;https://blog.learngoprogramming.com/golang-variadic-funcs-how-to-patterns-369408f19085&quot;&gt;Go
 Variadic Functions&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;[6] &lt;a 
href=&quot;https://github.com/jdeng/gomxnet&quot;&gt;CFFI 
Ref&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;[7] &lt;a 
href=&quot;https://golang.org/pkg/runtime/#SetFinalizer&quot;&gt;Go 
Finalizers&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;</content><author><name>Siva</name></author><summary 
type="html">Introduction</summary></entry><entry><title type="html">Automating 
Generation of Low Precision Deep Learning Operators</title><link 
href="/2018/12/18/lowprecision-conv" rel="alternate" type="text/html" 
title="Automating Generation of Low Precision Deep Learning Operators" 
/><published>2018-12-18T00:00:00-05:00</published><updated>2018-12-18T00:00:00-05:00</updated><id>/2018/12/18/lowprecision-conv</id><content
 ty [...]
-devices becomes challenging because of their limited compute and energy 
budgets. A  recent  trend
- in  deep  learning  is  the  use  of  extremely  quantized  models  that 
operate  on  inputs  and
- weights  of  a  few  bits, with networks like XNOR-Net, DoReFa-Net, and 
HWGQ-Net making steady
-progress improving accuracy.&lt;/p&gt;
-
-&lt;p&gt;An example of a low precision graph snippet is below. The low 
precision convolution takes in
-quantized data and bitpacks into the proper data layout for an efficient 
bitserial convolution.
-The output is in a higher precision and traditional deep learning layers such 
as batch normalization and ReLu are applied to it, before being re-quantized 
and sent through another low precision operator.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/low-precision/workflow.png&quot; alt=&quot;image&quot; 
width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Low precision convolution pipeline.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;p&gt;Theoretically,  low  precision operators use less operations than
-floating point operators, leading many to believe they can achieve up 
tremendous speedups.
-However, deep  learning frameworks  leverage  decades  of  engineering  work  
through  low  level
-BLAS  and LAPACK libraries that are incredibly well optimized, and CPUs 
include intrinsic
-instructions to accelerate these tasks.  In  practice,  it  is  not  simple  
to  develop low-level
-operators such as convolutions  that  are competitive  with  8-bit  quantized  
or  even floating
-point operators.
-In  this  post  we  introduce  our  approach to automatically generating 
optimized
-low  precision  convolutions for  CPUs. We declare our low precision operators 
so that they compute
-on efficiently stored low precision inputs, and describe a schedule that 
describes a search space
-of implementation parameters. We rely on AutoTVM to quickly search the space 
and find optimized
-parameters for the particular convolution, precision, and backend.&lt;/p&gt;
-
-&lt;h2 id=&quot;bitserial-computation-background&quot;&gt;Bitserial 
Computation Background&lt;/h2&gt;
-
-&lt;p&gt;The  core  of  low  precision  models  is  the bitserial dot product 
that enables convolution and
-dense operators to be computed using only bitwise operations and popcount.
- Typically, a dot product is computed by element wise multiplication of two 
vectors followed by
- summing all the elements, like the simple example below. If all the data is 
binary, the input
- vectors can be packed into single integer, and the dot product can be 
computed by  bitwise-anding
- the packed inputs and counting the number of 1’s in the result using popcount.
-Note: Depending how the input data is quantized, bitwise-xnor may be used 
instead of bitwise-and.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/low-precision/binary-dotproduct.png&quot; 
alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Binary dot product.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;p&gt;Arbitrary precision dot products can be computed in this fashion by 
first separating input data
-into bitplanes. Once in this representation we can compute dotproduct by 
summing weighted binary
-dot products between the bitplanes of A and B. The number of binary 
dotproducts grows with the
-product of A and B’s precision, so this method is only practical for very low 
precision data.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/low-precision/bitserial-dotproduct.png&quot; 
alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Bitserial dot product.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;defining-operators-in-tvm&quot;&gt;Defining Operators in 
TVM&lt;/h2&gt;
-&lt;p&gt;Before the computation, input data needs to be bitpacked so that the 
bitplanes of the input data
-can be accessed and are packed into a supported datatype such as a uint8 or 
uint32. We provide
-a flexible bitpacking operator that takes arbitrary size input tensors and 
returns a bitpacked
-tensor where the user specifies which axis the bitplanes should be.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/low-precision/bitpack.png&quot; alt=&quot;image&quot; 
width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Different bitpacked layouts.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;p&gt;Once in this bitpacked format the low precision  convolution can be 
computed bitserially.
-For this demo, that data is packed along the input channel and the bitplanes 
are added to the
-innermost axis, and the data is packed into 32-bit integers. The bitserial 
convolution is computed
-similar to a normal convolution, but the bitwise-and (&amp;amp;) replaces 
multiplication, and we use
-popcount to accumulate values in the packed data. The bitplane axes become 
additional reduction axes
-and compute the binary dot products between different bitplanes of the input 
and kernel.
-Finally, the output is computed in an unpacked format and in higher 
precision.&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;Input_bitpacked&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;bitpack&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;Input&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acti [...]
-&lt;span class=&quot;n&quot;&gt;Weights_bitpacked&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;bitpack&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;Filter&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;weight_bits&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pack_axis&lt;/span&gt;&lt;span class=&quot;o&quot;& [...]
-&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;in_height&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;in_width&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;in_channel_q&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span& [...]
-&lt;span class=&quot;n&quot;&gt;kernel_h&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;kernel_w&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;num_filter&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt [...]
-
-&lt;span class=&quot;n&quot;&gt;stride_h&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;stride_w&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;stride&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;pad_top&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_left&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_down&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_right&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;get_pad_tuple&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;( [...]
-
-&lt;span class=&quot;c1&quot;&gt;# Computing the output shape
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out_channel&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;num_filter&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;out_height&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;simplify&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;in_height&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;kernel_h&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_top&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+ 
[...]
-&lt;span class=&quot;n&quot;&gt;out_width&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;simplify&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;in_width&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;kernel_w&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+& 
[...]
-&lt;span class=&quot;n&quot;&gt;pad_before&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_top&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_left&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt [...]
-&lt;span class=&quot;n&quot;&gt;pad_after&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_down&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_right&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&l [...]
-&lt;span class=&quot;n&quot;&gt;Input_padded&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;Input_bitpacked&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_before&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;pad_after&lt;/span&gt;&lt;span class=&quot;p&quot;&g 
[...]
-
-&lt;span class=&quot;c1&quot;&gt;# Treat the bitplane axes like additional 
reduction axes
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rc&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;in_channel_q&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;),&l [...]
-&lt;span class=&quot;n&quot;&gt;ry&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;kernel_h&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;s [...]
-&lt;span class=&quot;n&quot;&gt;rx&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;kernel_w&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;s [...]
-&lt;span class=&quot;n&quot;&gt;ib&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;input_bits&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;),&lt;/span&gt; &lt [...]
-&lt;span class=&quot;n&quot;&gt;wb&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;weight_bits&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;),&lt;/span&gt; &l [...]
-
-
-&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;compute&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;out_height&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;out_width&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; [...]
-             &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;popcount&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;
-               &lt;span 
class=&quot;n&quot;&gt;Input_padded&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;yy&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;stride_h&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;ry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt; 
[...]
-               &lt;span 
class=&quot;n&quot;&gt;Weights_bitpacked&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;ry&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;rx&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;rc&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;ff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/sp 
[...]
-               &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;rc&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;ry&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;rx&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;wb&lt;/span&gt;&lt;spa [...]
-
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;In our schedule we apply common optimizations, like vectorization and memory tiling, to provide better
-memory locality and take advantage of SIMD units. Some of these optimizations, such as tiling,
-require parameters that must be tuned for the specific microarchitecture. We expose these
-parameters as knobs to TVM and use AutoTVM to automatically tune all the parameters simultaneously.&lt;/p&gt;
-
-&lt;p&gt;Finally, we can craft small microkernels to replace the innermost loop(s) of computation and schedule
- them using TVM’s tensorize primitive. Since compilers often produce suboptimal code, people can
- often write short assembly sequences that are more efficient. These microkernels often take advantage
- of new intrinsics that are being introduced to help accelerate deep learning workloads and use
- them in clever ways to improve memory accesses or reduce the number of instructions required.&lt;/p&gt;
-
-&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
-
-&lt;h3 id=&quot;raspberry-pi&quot;&gt;Raspberry Pi&lt;/h3&gt;
-&lt;p&gt;Convolution speedups on a Raspberry Pi 3B compared to a 16-bit integer TVM implementation.
-Workloads are convolution layers from ResNet18.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/low-precision/rasp-conv.png&quot; alt=&quot;image&quot; 
width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Speedup of low precision convolutions on a Raspberry Pi 
compared to 16-bit TVM implementation.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;p&gt;2-bit activation, 1-bit weight convolution speedups on a Raspberry Pi 3B compared to the hand-optimized implementation from &lt;a href=&quot;https://arxiv.org/pdf/1712.02427.pdf&quot;&gt;High performance ultra-low-precision convolutions
-on mobile devices&lt;/a&gt;.
-Workloads are convolution layers from ResNet18.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/low-precision/rasp-conv-2.png&quot; alt=&quot;image&quot; 
width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Speedup of 2-bit activation, 1-bit weight convolutions on a Raspberry Pi against a hand-optimized implementation.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;h3 id=&quot;x86&quot;&gt;x86&lt;/h3&gt;
-
-&lt;p&gt;Convolution speedups on x86 compared to a 32-bit floating point TVM 
implementation.
-Note: this x86 microarchitecture doesn’t support a vectorized popcount, so speedups are lower.&lt;/p&gt;
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/low-precision/x86-conv.png&quot; alt=&quot;image&quot; 
width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Speedup of x86 low precision convolutions compared to a 32-bit 
floating point TVM implementation.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;show-me-the-code&quot;&gt;Show me the code&lt;/h2&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/nn/bitserial_conv2d.py&quot;&gt;TOPI
 bitserial convolution&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/arm_cpu/bitserial_conv2d.py&quot;&gt;TOPI
 ARM cpu bitserial convolution&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;[1] &lt;a 
href=&quot;https://arxiv.org/abs/1810.11066&quot;&gt;Automating Generation of 
Low Precision Deep Learning Operators&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;[2] &lt;a 
href=&quot;https://arxiv.org/abs/1603.05279&quot;&gt;XNOR-Net&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;[3] &lt;a 
href=&quot;https://arxiv.org/abs/1702.00953&quot;&gt;HWGQ&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;[4] &lt;a 
href=&quot;https://arxiv.org/abs/1606.06160&quot;&gt;DoReFa&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;</content><author><name>Meghan Cowan</name></author><summary 
type="html">As deep learning models grow larger and more complex, deploying 
them on low powered phone and IoT devices becomes challenging because of their 
limited compute and energy budgets. A recent trend in deep learning is the use 
of extremely quantized models that operate on inputs and weights of a few bits, 
with networks like XNOR-Net, DoReFa-Net, and HWGQ-Net making steady progress 
improving accuracy.</summary> [...]
\ No newline at end of file
+&lt;/ul&gt;</content><author><name>Siva</name></author><summary 
type="html">Introduction</summary></entry></feed>
\ No newline at end of file
diff --git a/images/community/sjtu.png b/images/community/sjtu.png
new file mode 100644
index 0000000..0de00de
Binary files /dev/null and b/images/community/sjtu.png differ
diff --git a/images/intro-auto-scheduler/code_perf.png 
b/images/intro-auto-scheduler/code_perf.png
new file mode 100644
index 0000000..d070a6e
Binary files /dev/null and b/images/intro-auto-scheduler/code_perf.png differ
diff --git a/images/intro-auto-scheduler/search_overview.png 
b/images/intro-auto-scheduler/search_overview.png
new file mode 100644
index 0000000..7b6f56d
Binary files /dev/null and b/images/intro-auto-scheduler/search_overview.png 
differ
diff --git a/images/intro-auto-scheduler/search_time.png 
b/images/intro-auto-scheduler/search_time.png
new file mode 100644
index 0000000..4bd700b
Binary files /dev/null and b/images/intro-auto-scheduler/search_time.png differ
diff --git a/images/intro-auto-scheduler/workflow.png 
b/images/intro-auto-scheduler/workflow.png
new file mode 100644
index 0000000..b2c7b26
Binary files /dev/null and b/images/intro-auto-scheduler/workflow.png differ
diff --git a/rss.xml b/rss.xml
index f2dfac7..2173b21 100644
--- a/rss.xml
+++ b/rss.xml
@@ -5,12 +5,142 @@
         <description>TVM - </description>
         <link>https://tvm.apache.org</link>
         <atom:link href="https://tvm.apache.org"; rel="self" 
type="application/rss+xml" />
-        <lastBuildDate>Mon, 04 Jan 2021 16:22:52 -0500</lastBuildDate>
-        <pubDate>Mon, 04 Jan 2021 16:22:52 -0500</pubDate>
+        <lastBuildDate>Wed, 03 Mar 2021 01:20:46 -0800</lastBuildDate>
+        <pubDate>Wed, 03 Mar 2021 01:20:46 -0800</pubDate>
         <ttl>60</ttl>
 
 
         <item>
+                <title>Introducing TVM Auto-scheduler (a.k.a. Ansor)</title>
+                <description>&lt;p&gt;Optimizing the execution speed of deep 
neural networks is extremely hard with the growing
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and 
layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described 
by mathematical expressions.
+However, providing high-performance implementations for them on modern 
hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware 
intrinsics to achieve high performance.
+It takes huge engineering effort to build linear algebra and neural network 
acceleration libraries like CuBLAS, CuDNN, oneMKL, and oneDNN.&lt;/p&gt;
+
+&lt;p&gt;Our lives would be much easier if we could just write mathematical expressions and have something
+magically turn them into efficient code implementations.
+Three years ago, the deep learning compiler TVM and its search module AutoTVM were built as the first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient 
implementations for a given tensor computation.
+However, it is a template-based approach, so it still requires domain experts 
to implement a non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM 
code repository.
+Besides being very hard to develop, these templates often have inefficient and 
limited search spaces,
+making them unable to achieve optimal performance.&lt;/p&gt;
+
+&lt;p&gt;To address the limitations of AutoTVM, we started the Ansor project, aiming at a fully automated auto-scheduler for
+generating code for tensor computations.
+Ansor auto-scheduler only takes tensor expressions as input and generates 
high-performance code without manual templates.
+We made innovations in the search space construction and search algorithm.
+As a result, the auto-scheduler can achieve better performance with less 
search time in a more automated way.&lt;/p&gt;
+
+&lt;p&gt;Ansor auto-scheduler is now integrated into Apache TVM as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tvm.auto_scheduler&lt;/code&gt; package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS and 
OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some 
benchmark results.&lt;/p&gt;
+
+&lt;h1 id=&quot;system-overview&quot;&gt;System Overview&lt;/h1&gt;
+
+&lt;h2 id=&quot;autotvm-vs-auto-scheduler&quot;&gt;AutoTVM vs 
Auto-scheduler&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/workflow.png&quot; alt=&quot;image&quot; 
width=&quot;75%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Table 1. Workflow Comparison &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Table 1 compares the workflow for generating code for an operator in 
AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor 
expression language.
+This part is relatively easy because TVM’s tensor expression language looks 
just like math expressions.
+In step 2, the developer has to write a schedule template, which typically 
consists of 20-100 lines of tricky DSL code.
+This part requires domain expertise in both the target hardware architecture and operator semantics, so it is difficult.
+The last step, step 3, is automated by a search algorithm.&lt;/p&gt;
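To make step 1 concrete: a compute definition really is just the math. Below is an illustrative plain-Python analogue of a matmul definition; the real version is written in TVM's tensor expression language (e.g. with te.compute), which this sketch only mimics.

```python
def matmul(A, B):
    """Plain-Python analogue of a tensor compute definition:
    C[i][j] = sum_k A[i][k] * B[k][j]."""
    N, K, M = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(M)]
            for i in range(N)]

# Example: a 2x2 matrix product.
C = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```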
+
+&lt;p&gt;In auto-scheduler, we eliminate the most difficult step 2 by 
automatic search space construction and accelerate step 3 with a better search 
algorithm.
+By doing automatic search space construction, we not only eliminate huge manual effort,
+but also enable the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules 
to generate the search space.
+However, these rules are very general. They are based on static analysis of 
the tensor expressions.
+We only need to design a few general rules once and can apply them to almost 
all tensor computations in deep learning.&lt;/p&gt;
+
+&lt;h2 id=&quot;search-process&quot;&gt;Search Process&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/search_overview.png&quot; 
alt=&quot;image&quot; width=&quot;40%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 1. Search Process Overview  &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 1 shows the search process of auto-scheduler when optimizing a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator 
fusion pass.
+A task scheduler is used to allocate tuning time across the many subgraphs.
+At each iteration, it picks a subgraph that has the most potential to increase 
the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several 
sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of 
optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback 
to update all components of the system.
+This process is repeated iteratively until the optimization converges or the time budget runs out.
+More technical details can be found in our paper [3] and our code.&lt;/p&gt;
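The task-scheduler loop described above can be sketched in a few lines of plain Python. This is a toy model, not the tvm.auto_scheduler implementation: the latency/potential fields and the decay factors are made-up illustrations of "pick the subgraph with the most end-to-end potential".

```python
def allocate_trials(subgraphs, total_trials, batch=64):
    """Toy task scheduler: each round, spend one batch of measurement
    trials on the subgraph with the largest estimated end-to-end gain
    (current latency x remaining tuning potential)."""
    spent = 0
    while spent < total_trials:
        task = max(subgraphs, key=lambda t: t["latency"] * t["potential"])
        # A real system would run `batch` hardware measurements here and
        # feed the results back into a learned cost model; we just fake
        # the effect of one tuning round:
        task["latency"] *= 0.9      # tuning found a faster schedule
        task["potential"] *= 0.8    # diminishing returns on this task
        spent += batch
    return subgraphs

# A dominant subgraph (latency 10) attracts all the tuning budget.
subgraphs = [{"latency": 10.0, "potential": 1.0},
             {"latency": 1.0, "potential": 1.0}]
allocate_trials(subgraphs, 128)
```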
+
+&lt;p&gt;It is worth noting that since the auto-scheduler generates schedules from scratch,
+it reuses the existing computation definitions in TOPI but not the schedule templates.&lt;/p&gt;
+
+&lt;h1 id=&quot;benchmark-results&quot;&gt;Benchmark Results&lt;/h1&gt;
+&lt;p&gt;In this section, we benchmark the performance of AutoTVM and 
Auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped with an Intel 18-core Skylake 8124M CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge, which is equipped with an 
NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo [2].&lt;/p&gt;
+
+&lt;h2 id=&quot;performance-of-the-generated-code&quot;&gt;Performance of the 
generated code&lt;/h2&gt;
+&lt;p&gt;We benchmark the fp32 single-batch inference latency on three 
networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see auto-scheduler outperforms AutoTVM in all cases with 1.02x to 8.95x 
speedup.
+This is because auto-scheduler explores a larger search space, which covers more efficient combinations
+of optimizations that are missed by the manual TOPI templates.
+BERT-base@GPU is an extreme case where the manual templates fall particularly short:
+the manual template for dense layers does not perform well for the shapes in the BERT model.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/code_perf.png&quot; 
alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 2. Code Performance Comparison (Higher is better) &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;search-time&quot;&gt;Search Time&lt;/h2&gt;
+&lt;p&gt;Search-based approaches can be very time-consuming, so we also care about the search time.
+It typically takes several hours to let the search converge for a single 
neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its 
larger search space.
+This is mainly because auto-scheduler has a better cost model and task scheduler.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/intro-auto-scheduler/search_time.png&quot; 
alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 3. Search Time Comparison (Lower is better) &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;more-results&quot;&gt;More Results&lt;/h2&gt;
+&lt;p&gt;The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, this blog post [4] also tried auto-scheduler on an Apple M1 chip and 
got some good results.&lt;/p&gt;
+
+&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
+&lt;p&gt;We built TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
+Compared with its predecessor AutoTVM, auto-scheduler does not require manual templates.
+Moreover, auto-scheduler is capable of generating schedules with better performance in a shorter time.
+We achieved this by making innovations in the search space construction and the search algorithm.&lt;/p&gt;
+
+&lt;p&gt;We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending auto-scheduler to better support
+sparse operators, low-precision operators, and dynamic shapes.&lt;/p&gt;
+
+&lt;h1 id=&quot;links&quot;&gt;Links&lt;/h1&gt;
+&lt;p&gt;[1] Tutorials: &lt;a 
href=&quot;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&quot;&gt;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&lt;/a&gt;&lt;br
 /&gt;
+[2] Benchmark repo: &lt;a 
href=&quot;https://github.com/tlc-pack/TLCBench&quot;&gt;https://github.com/tlc-pack/TLCBench&lt;/a&gt;&lt;br
 /&gt;
+[3] OSDI Paper: &lt;a href=&quot;https://arxiv.org/abs/2006.06762&quot;&gt;Ansor: Generating High-Performance Tensor Programs for Deep Learning&lt;/a&gt;&lt;br /&gt;
+[4] Results on Apple M1 chip: &lt;a 
href=&quot;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&quot;&gt;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&lt;/a&gt;.&lt;/p&gt;
+
+</description>
+                
<link>https://tvm.apache.org/2021/03/03/intro-auto-scheduler</link>
+                
<guid>https://tvm.apache.org/2021/03/03/intro-auto-scheduler</guid>
+                <pubDate>Wed, 03 Mar 2021 00:00:00 -0800</pubDate>
+        </item>
+
+        <item>
                 <title>Bring Your Own Datatypes: Enabling Custom Datatype 
Exploration in TVM</title>
                 <description>&lt;p&gt;In this post, we describe the Bring Your 
Own Datatypes framework, which enables the use of custom datatypes within 
TVM.&lt;/p&gt;
 
@@ -300,7 +430,7 @@ For more documentation about the Bring Your Own Datatypes 
framework
 </description>
                 
<link>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</link>
                 
<guid>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</guid>
-                <pubDate>Sat, 26 Sep 2020 00:00:00 -0400</pubDate>
+                <pubDate>Sat, 26 Sep 2020 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -779,7 +909,7 @@ Figure 4: After Graph Partitioning.
 </description>
                 
<link>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</link>
                 
<guid>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</guid>
-                <pubDate>Wed, 15 Jul 2020 00:00:00 -0400</pubDate>
+                <pubDate>Wed, 15 Jul 2020 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1302,7 +1432,7 @@ He is a PyTorch core developer and co-authored &lt;a 
href=&quot;https://www.mann
 </description>
                 <link>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</link>
                 <guid>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</guid>
-                <pubDate>Tue, 14 Jul 2020 00:00:00 -0400</pubDate>
+                <pubDate>Tue, 14 Jul 2020 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1611,7 +1741,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix 
multiplication microkernel&lt;/
 </description>
                 
<link>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</link>
                 
<guid>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</guid>
-                <pubDate>Thu, 04 Jun 2020 00:00:00 -0400</pubDate>
+                <pubDate>Thu, 04 Jun 2020 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1698,7 +1828,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix 
multiplication microkernel&lt;/
 </description>
                 
<link>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</link>
                 
<guid>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</guid>
-                <pubDate>Thu, 14 May 2020 00:00:00 -0400</pubDate>
+                <pubDate>Thu, 14 May 2020 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1800,7 +1930,7 @@ relay_graph = torch_tvm.to_relay(mul, inputs)
 </description>
                 <link>https://tvm.apache.org/2019/05/30/pytorch-frontend</link>
                 <guid>https://tvm.apache.org/2019/05/30/pytorch-frontend</guid>
-                <pubDate>Thu, 30 May 2019 00:00:00 -0400</pubDate>
+                <pubDate>Thu, 30 May 2019 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1944,7 +2074,7 @@ We show that automatic optimization in TVM makes it easy 
and flexible to support
 </description>
                 
<link>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</link>
                 
<guid>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</guid>
-                <pubDate>Mon, 29 Apr 2019 12:00:00 -0400</pubDate>
+                <pubDate>Mon, 29 Apr 2019 09:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1967,7 +2097,7 @@ We show that automatic optimization in TVM makes it easy 
and flexible to support
 </description>
                 
<link>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</link>
                 
<guid>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</guid>
-                <pubDate>Mon, 18 Mar 2019 00:00:00 -0400</pubDate>
+                <pubDate>Mon, 18 Mar 2019 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -2137,7 +2267,7 @@ closure as TVM packed function and invoke the same across 
programming language b
 </description>
                 <link>https://tvm.apache.org/2019/01/19/Golang</link>
                 <guid>https://tvm.apache.org/2019/01/19/Golang</guid>
-                <pubDate>Sat, 19 Jan 2019 00:00:00 -0500</pubDate>
+                <pubDate>Sat, 19 Jan 2019 00:00:00 -0800</pubDate>
         </item>
 
         <item>
@@ -2298,7 +2428,7 @@ Note: x86 doesn’t support a vectorized popcount for this 
microarchitecture, so
 </description>
                 
<link>https://tvm.apache.org/2018/12/18/lowprecision-conv</link>
                 
<guid>https://tvm.apache.org/2018/12/18/lowprecision-conv</guid>
-                <pubDate>Tue, 18 Dec 2018 00:00:00 -0500</pubDate>
+                <pubDate>Tue, 18 Dec 2018 00:00:00 -0800</pubDate>
         </item>
 
         <item>
@@ -2414,7 +2544,7 @@ His research interest is in the general domain of ML on 
shared private data, but
 </description>
                 <link>https://tvm.apache.org/2018/10/09/ml-in-tees</link>
                 <guid>https://tvm.apache.org/2018/10/09/ml-in-tees</guid>
-                <pubDate>Tue, 09 Oct 2018 00:00:00 -0400</pubDate>
+                <pubDate>Tue, 09 Oct 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -2808,7 +2938,7 @@ for inference deployment. TVM just provides such a 
solution.&lt;/p&gt;
 </description>
                 <link>https://tvm.apache.org/2018/10/03/auto-opt-all</link>
                 <guid>https://tvm.apache.org/2018/10/03/auto-opt-all</guid>
-                <pubDate>Wed, 03 Oct 2018 00:00:00 -0400</pubDate>
+                <pubDate>Wed, 03 Oct 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -2947,7 +3077,7 @@ support, and can be used to implement convenient 
converters, such as
 </description>
                 <link>https://tvm.apache.org/2018/08/10/DLPack-Bridge</link>
                 <guid>https://tvm.apache.org/2018/08/10/DLPack-Bridge</guid>
-                <pubDate>Fri, 10 Aug 2018 00:00:00 -0400</pubDate>
+                <pubDate>Fri, 10 Aug 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -3089,7 +3219,7 @@ This kind of high-level visibility is essential to system 
designers who want to
 </description>
                 
<link>https://tvm.apache.org/2018/07/12/vta-release-announcement</link>
                 
<guid>https://tvm.apache.org/2018/07/12/vta-release-announcement</guid>
-                <pubDate>Thu, 12 Jul 2018 00:00:00 -0400</pubDate>
+                <pubDate>Thu, 12 Jul 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -3355,7 +3485,7 @@ C = tvm.compute(
 </description>
                 
<link>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</link>
                 
<guid>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</guid>
-                <pubDate>Fri, 23 Mar 2018 00:00:00 -0400</pubDate>
+                <pubDate>Fri, 23 Mar 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -3471,7 +3601,7 @@ optimizations into the TVM stack.&lt;/p&gt;
 </description>
                 <link>https://tvm.apache.org/2018/03/12/webgl</link>
                 <guid>https://tvm.apache.org/2018/03/12/webgl</guid>
-                <pubDate>Mon, 12 Mar 2018 00:00:00 -0400</pubDate>
+                <pubDate>Mon, 12 Mar 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -4045,7 +4175,7 @@ advice and &lt;a 
href=&quot;https://github.com/yzhliu&quot;&gt;Yizhi Liu&lt;/a&g
 </description>
                 <link>https://tvm.apache.org/2018/01/16/opt-mali-gpu</link>
                 <guid>https://tvm.apache.org/2018/01/16/opt-mali-gpu</guid>
-                <pubDate>Tue, 16 Jan 2018 00:00:00 -0500</pubDate>
+                <pubDate>Tue, 16 Jan 2018 00:00:00 -0800</pubDate>
         </item>
 
         <item>
@@ -4273,7 +4403,7 @@ make jvminstall
 </description>
                 
<link>https://tvm.apache.org/2017/11/08/android-rpc-introduction</link>
                 
<guid>https://tvm.apache.org/2017/11/08/android-rpc-introduction</guid>
-                <pubDate>Wed, 08 Nov 2017 00:00:00 -0500</pubDate>
+                <pubDate>Wed, 08 Nov 2017 00:00:00 -0800</pubDate>
         </item>
 
         <item>
@@ -4499,90 +4629,7 @@ BB0_6:
 </description>
                 
<link>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</link>
                 
<guid>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</guid>
-                <pubDate>Mon, 30 Oct 2017 00:00:00 -0400</pubDate>
-        </item>
-
-        <item>
-                <title>NNVM Compiler: Open Compiler for AI Frameworks</title>
-                <description>&lt;p style=&quot;text-align: 
center&quot;&gt;Paul G. Allen School of Computer Science &amp;amp; Engineering, 
University of Washington&lt;/p&gt;
-&lt;p style=&quot;text-align: center&quot;&gt;Amazon Web Service AI 
team&lt;/p&gt;
-&lt;p style=&quot;text-align: center&quot;&gt;DMLC open-source 
community&lt;/p&gt;
-
-&lt;p&gt;Deep learning has become ubiquitous and indispensable. We are seeing 
a rising need for deploying deep learning workloads on many kinds of platforms 
such as mobile phones, GPU, IoT devices and specialized accelerators.  Last 
month, we announced TVM stack to close the gap between deep learning 
frameworks, and the performance- or efficiency-oriented hardware backends.  TVM 
stack makes it easy to build an end to end compilation for a deep learning 
framework.  However, we think it wo [...]
-
-&lt;p&gt;Today, UW Allen school and AWS AI team, together with other 
contributors, are excited to announce the release of NNVM compiler, an open 
deep learning compiler to compile front-end framework workloads directly to 
hardware backends. We build it using the two-level intermediate 
representation(IR) in the TVM stack.
-The reader is welcome to refer to the &lt;a 
href=&quot;http://www.tvmlang.org/2017/08/17/tvm-release-announcement.html&quot;&gt;original
 TVM announcement&lt;/a&gt; for more technical details about TVM stack. With 
the help of TVM stack, NNVM compiler can:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;Represent and optimize the common deep learning workloads in high 
level graph IR&lt;/li&gt;
-  &lt;li&gt;Transform the computation graph to minimize memory utilization, 
optimize data layout and fuse computation patterns for different hardware 
backends.&lt;/li&gt;
-  &lt;li&gt;Present an end to end compilation pipeline from front-end deep 
learning frameworks to bare metal hardwares.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/nnvm/nnvm_compiler_stack.png&quot; alt=&quot;image&quot; 
width=&quot;612px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;The NNVM compiler can directly take models from deep learning 
frameworks such as Apache MXNet.
-It also support model exchange formats such as ONNX and CoreML. ONNX support 
enables NNVM to compile deep learning models from PyTorch, Caffe2 and CNTK.
-The CoreML frontend enables deployment of CoreML models to non-iOS 
devices.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/nnvm/nnvm_compiler_code.png&quot; alt=&quot;image&quot; 
width=&quot;712px&quot; /&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;separation-of-optimization-and-deployment&quot;&gt;Separation 
of Optimization and Deployment&lt;/h2&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/nnvm/nnvm_deploy.png&quot; alt=&quot;image&quot; 
width=&quot;512px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;NNVM compiler applies graph level and tensor level optimizations and 
jointly optimize them to get the best performance. We take a different approach 
from existing deep learning frameworks, which packages the graph optimization 
with the deployment runtime.  NNVM compiler adopts the conventional wisdom from 
compiler to separate the optimization from the actual deployment runtime. This 
approach offers substantial optimization but still keeps the runtime 
lightweight. The compiled mo [...]
-
-&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
-
-&lt;p&gt;NNVM compiler is still under active development, and we can expect 
more improvements to come, but we have started to see promising results.
-We benchmarked its performance and compared it against Apache MXNet on two 
typical hardware configurations: ARM CPU on Raspberry PI and Nvidia GPU on AWS. 
Despite the radical architecture difference between these two chips, we can use 
the same infrastructure and only need to change the schedule for each type of 
hardware.&lt;/p&gt;
-
-&lt;h3 id=&quot;nvidia-gpu&quot;&gt;Nvidia GPU&lt;/h3&gt;
-
-&lt;p&gt;GPU benchmarks and schedules are contributed by Leyuan Wang 
(AWS/UCDavis) and Yuwei Hu (TuSimple). We compared the NNVM compiler against 
Apache MXNet with CUDA8 and cuDNN7 as the backend on Nvidia K80. This is a very 
strong baseline, as Apache MXNet turns on auto-tuning to select the best kernel 
from CuDNN. We also used the optimized depthwise kernel in MXNet to optimize 
MobileNet workload.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/nnvm/nnvm_k80_result.png&quot; alt=&quot;image&quot; 
width=&quot;400px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;As can be seen, NNVM compiler generate code that outperforms Apache 
MXNet on K80. These improvements are due to the joint graph level and kernel 
level optimizations. It is worth noting that NNVM compiler generates all the 
optimized GPU kernels on its own without relying on external libraries like 
CuDNN.&lt;/p&gt;
-
-&lt;h3 id=&quot;raspberry-pi-3b&quot;&gt;Raspberry Pi 3b&lt;/h3&gt;
-
-&lt;p&gt;The Rasberry Pi compilation stack is contributed by Ziheng 
Jiang(AWS/FDU).
-We compared NNVM compiler against Apache MXNet with OpenBLAS and NNPack.
-We explored the setups to get the best performance out of MXNet: we turned on 
Winograd convolution in the NNPACK for 3x3 convolutions, enabled 
multi-threading and disabled the additional scheduler thread (so all threads 
are used by NNPack).&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/nnvm/nnvm_rasp_result.png&quot; alt=&quot;image&quot; 
width=&quot;400px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;As can be seen, the code generated by NNVM compiler is two times 
faster on ResNet18.
-The gap on MobileNet is mainly due to lack of depthwise convolution in 
existing CPU DNN libraries. NNVM compiler takes benefit of direct generating 
efficient ARM code directly.&lt;/p&gt;
-
-&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;
-&lt;p&gt;This project wouldn’t become possible without our early contributors 
in the DMLC community.
-We would like to specially thank Yuwei Hu(TuSimple), Leyuan Wang(AWS/UCDavis), 
Joshua Z. Zhang(AWS)
-and Xingjian Shi(HKUST) for their early contributions to the project. We would 
also like to thank all the contributors
-to the TVM stack.&lt;/p&gt;
-
-&lt;p&gt;We also learnt a lot from the following projects when building NNVM 
Compiler.&lt;/p&gt;
-&lt;ul&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://github.com/Theano/Theano&quot;&gt;Theano&lt;/a&gt;: possibly 
the earliest compiler for deep learning&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://github.com/halide/Halide&quot;&gt;Halide&lt;/a&gt;: TVM uses 
&lt;a href=&quot;https://github.com/dmlc/HalideIR&quot;&gt;HalideIR&lt;/a&gt; 
as data structure for
-arithematic simplification and low level lowering. HalideIR is derived from 
Halide.
-We also learns from Halide when implementing the lowering pipeline in 
TVM.&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://github.com/inducer/loopy&quot;&gt;Loopy&lt;/a&gt;: use of 
integer set analysis and its loop transformation primitives.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;
-&lt;ul&gt;
-  &lt;li&gt;Github page of NNVM Compiler: &lt;a 
href=&quot;https://github.com/dmlc/nnvm&quot;&gt;https://github.com/dmlc/nnvm&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;Github page of TVM: &lt;a 
href=&quot;https://github.com/dmlc/tvm&quot;&gt;https://github.com/dmlc/tvm&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://news.cs.washington.edu/2017/10/06/allen-school-and-aws-team-up-on-new-nnvm-compiler-for-deep-learning-frameworks/&quot;&gt;UW
 Allen school blog about NNVM compiler&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a 
href=&quot;https://aws.amazon.com/blogs/ai/introducing-nnvm-compiler-a-new-open-end-to-end-compiler-for-ai-frameworks/&quot;&gt;AWS
 blogpost about NNVM compiler&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;
-</description>
-                
<link>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</link>
-                
<guid>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</guid>
-                <pubDate>Fri, 06 Oct 2017 11:30:00 -0400</pubDate>
+                <pubDate>Mon, 30 Oct 2017 00:00:00 -0700</pubDate>
         </item>
 
 
diff --git a/sitemap.txt b/sitemap.txt
index bfad106..db8795d 100644
--- a/sitemap.txt
+++ b/sitemap.txt
@@ -16,6 +16,7 @@ https://tvm.apache.org/vta
 https://tvm.apache.org/feed.xml
 https://tvm.apache.org/css/custom.css.map
 
+https://tvm.apache.org/2021/03/03/intro-auto-scheduler
 https://tvm.apache.org/2020/09/26/bring-your-own-datatypes
 https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm
 https://tvm.apache.org/2020/07/14/bert-pytorch-tvm