This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new beecf83  Build at Thu Jun  4 09:03:33 PDT 2020
beecf83 is described below

commit beecf8354b43a57e70ec3f609720508da6f35f2e
Author: tqchen <[email protected]>
AuthorDate: Thu Jun 4 09:03:34 2020 -0700

    Build at Thu Jun  4 09:03:33 PDT 2020
---
 2020/06/04/tinyml-how-tvm-is-taming-tiny.html      | 505 +++++++++++++++++++++
 assets/themes/custom-twitter/css/style.css         |   2 +-
 atom.xml                                           | 311 ++++++++++++-
 blog.html                                          |  11 +
 images/microtvm/autotvm-infrastructure.png         | Bin 0 -> 294584 bytes
 images/microtvm/hardware-connection-diagram.png    | Bin 0 -> 723278 bytes
 images/microtvm/logo.png                           | Bin 0 -> 17808 bytes
 .../autotuned-cifar10-int-8-cnn-x86.png            | Bin 0 -> 9978 bytes
 .../autotuned-cifar10-int-8-cnn.png                | Bin 0 -> 12267 bytes
 .../microtvm/post-2020-05-28/cifar10-graphical.png | Bin 0 -> 472116 bytes
 .../microtvm/post-2020-05-28/cifar10-int-8-cnn.png | Bin 0 -> 9288 bytes
 images/microtvm/post-2020-05-28/memory-layout.png  | Bin 0 -> 39021 bytes
 images/microtvm/post-2020-05-28/simd-diagram.png   | Bin 0 -> 208006 bytes
 images/microtvm/self-hosted-runtime.png            | Bin 0 -> 301584 bytes
 rss.xml                                            | 313 ++++++++++++-
 sitemap.txt                                        |   1 +
 16 files changed, 1139 insertions(+), 4 deletions(-)

diff --git a/2020/06/04/tinyml-how-tvm-is-taming-tiny.html 
b/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
new file mode 100644
index 0000000..09bca33
--- /dev/null
+++ b/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
@@ -0,0 +1,505 @@
+
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8">
+    <title>TinyML - How TVM is Taming Tiny</title>
+    
+    <meta name="author" content="">
+
+    <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
+    <!--[if lt IE 9]>
+      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
+    <![endif]-->
+
+    <!-- Le styles -->
+    <link href="/assets/themes/custom-twitter/css/1.4.0/bootstrap.css" 
rel="stylesheet">
+    <link href="/assets/themes/custom-twitter/css/style.css?body=1" 
rel="stylesheet" type="text/css" media="all">
+
+    <!-- Le fav and touch icons -->
+  <!-- Update these with your own images
+    <link rel="shortcut icon" href="images/logo/tvm-logo.png">
+  <link rel="shortcut icon" href="images/logo/tvm-logo.png">
+  -->
+  <link href="/images/logo/tvm-logo-square.png" rel="icon" type="image/png"/>
+  <!-- Global site tag (gtag.js) - Google Analytics -->
+  <script async src="https://www.googletagmanager.com/gtag/js?id=UA-75982049-2"></script>
+  <script>
+    window.dataLayer = window.dataLayer || [];
+    function gtag(){dataLayer.push(arguments);}
+
+    gtag('js', new Date());
+    gtag('config', 'UA-75982049-2');
+  </script>
+
+</head>
+
+  <body>
+    <div class="topbar">
+      <div class="fill">
+        <div class="container">
+          <h2 id="logo-wrap">
+            <a href="/" class="nav">
+              <img src="/images/logo/tvm-logo-small-black.png" width="100px">
+            </a>
+          </h2>
+          <ul class="nav" id="nav-bar">
+            
+            
+            
+
+
+
+  
+    
+      
+      
+    
+  
+    
+      
+      
+    
+  
+    
+      
+      
+    
+  
+    
+      
+      
+    
+  
+    
+      
+      
+    
+  
+    
+      
+      
+    
+  
+    
+      
+       
+       <li><a href="/community">Community</a></li>
+       
+      
+      
+    
+  
+    
+      
+       
+       <li><a href="/download">Download</a></li>
+       
+      
+      
+    
+  
+    
+      
+       
+       <li><a href="/about">About</a></li>
+       
+      
+      
+    
+  
+    
+      
+      
+    
+  
+    
+      
+       
+       <li><a href="/vta">VTA</a></li>
+       
+      
+      
+    
+  
+    
+      
+      
+       
+       <li><a href="/blog">Blog</a></li>
+       
+      
+    
+  
+
+
+
+
+            <li> <a href="https://tvm.apache.org/docs";>Docs</a></li>
+            <li> <a href="https://tvmconf.org";>TVM Conference</a></li>
+            <li> <a href="https://github.com/apache/incubator-tvm/">Github</a></li>
+            <li> <a href="/asf">ASF</a></li>
+          </ul>
+        </div>
+      </div>
+    </div>
+    
+<div class="container">
+<div class="content">
+  <div class="row">
+    <div class="span14">
+      <h1>TinyML - How TVM is Taming Tiny </h1>
+      <p class="post-meta">
+        <time datetime="2020-06-04T00:00:00-07:00" itemprop="datePublished">
+          Jun 4, 2020
+        </time>
+        
+        • <span itemprop="author" itemscope itemtype="http://schema.org/Person">
+          <span itemprop="name">Logan Weber and Andrew Reusch, OctoML</span>
+        </span>
+        
+      </p>
+      <p class="post-meta">
+        </p>
+    <br />
+    
+<p><img src="/images/microtvm/logo.png" alt="microTVM logo" width="30%" /><br 
/></p>
+
+<p>The proliferation of low-cost, AI-powered consumer devices has led to 
widespread interest in “bare-metal” (low-power, often without an operating 
system) devices among ML researchers and practitioners.  While it is already 
possible for experts to run <em>some</em> models on <em>some</em> bare-metal 
devices, optimizing models for diverse sets of devices is challenging, often 
requiring manually optimized device-specific libraries.  And for those 
platforms without, say, Linux support, the [...]
+
+<p>The manual optimization of machine learning software is not unique to the 
domain of bare-metal devices.  In fact, this has been a common theme for 
developers working with other hardware backends (e.g., GPUs and FPGAs).  TVM 
has proven resilient to the onslaught of new hardware targets, but until now, 
it couldn’t grapple with the unique profile of microcontrollers.  To solve the 
problem in this domain, we’ve extended TVM to feature a microcontroller 
backend, called µTVM (footnote: pron [...]
+
+<p style="text-align: center"><img 
src="/images/microtvm/autotvm-infrastructure.png" 
alt="/images/microtvm/autotvm-infrastructure.png" width="80%" /><br /></p>
+
+<h1 id="lets-see-it-in-action">Let’s see it in action</h1>
+
+<p>Before we talk about what TVM/MicroTVM is or how it works, let’s see a 
quick example of it in action.</p>
+
+<p style="text-align: center"><img 
src="/images/microtvm/hardware-connection-diagram.png" 
alt="/images/microtvm/hardware-connection-diagram.png" width="80%" /><br />
+A standard µTVM setup, where the host communicates with the device via 
JTAG.</p>
+
+<p>Above, we have an <a href="https://www.st.com/en/microcontrollers-microprocessors/stm32f746zg.html">STM32F746ZG board</a>, housing an ARM Cortex-M7 processor, an ideal part for AI on the edge given its strong performance in a low-power envelope. We use its USB-JTAG port to connect it to our desktop machine.  On the desktop, we run OpenOCD to open a JTAG connection with the device; in turn, OpenOCD allows µTVM to control the M7 processor using a device-agnostic TCP socket.  With this  [...]
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">OPENOCD_SERVER_ADDR</span> <span 
class="o">=</span> <span class="s">'127.0.0.1'</span>
+<span class="n">OPENOCD_SERVER_PORT</span> <span class="o">=</span> <span 
class="mi">6666</span>
+<span class="n">TARGET</span> <span class="o">=</span> <span 
class="n">tvm</span><span class="o">.</span><span class="n">target</span><span 
class="o">.</span><span class="n">create</span><span class="p">(</span><span 
class="s">'c -device=micro_dev'</span><span class="p">)</span>
+<span class="n">DEV_CONFIG</span> <span class="o">=</span> <span 
class="n">stm32f746xx</span><span class="o">.</span><span 
class="n">default_config</span><span class="p">(</span><span 
class="n">OPENOCD_SERVER_ADDR</span><span class="p">,</span> <span 
class="n">OPENOCD_SERVER_PORT</span><span class="p">)</span>
+
+<span class="n">module</span><span class="p">,</span> <span 
class="n">params</span> <span class="o">=</span> <span 
class="n">get_cifar10_cnn</span><span class="p">()</span>
+<span class="k">with</span> <span class="n">micro</span><span 
class="o">.</span><span class="n">Session</span><span class="p">(</span><span 
class="n">device_config</span><span class="p">)</span> <span 
class="k">as</span> <span class="n">sess</span><span class="p">:</span>
+       <span class="n">graph</span><span class="p">,</span> <span 
class="n">c_module</span><span class="p">,</span> <span class="n">params</span> 
<span class="o">=</span> <span class="n">relay</span><span 
class="o">.</span><span class="n">build</span><span class="p">(</span><span 
class="n">module</span><span class="p">[</span><span 
class="s">'main'</span><span class="p">],</span> <span 
class="n">target</span><span class="o">=</span><span 
class="n">TARGET</span><span class="p">,</span> <span cl [...]
+  <span class="n">micro_mod</span> <span class="o">=</span> <span 
class="n">micro</span><span class="o">.</span><span 
class="n">create_micro_mod</span><span class="p">(</span><span 
class="n">c_module</span><span class="p">,</span> <span 
class="n">DEV_CONFIG</span><span class="p">)</span>
+  <span class="n">graph_mod</span> <span class="o">=</span> <span 
class="n">graph_runtime</span><span class="o">.</span><span 
class="n">create</span><span class="p">(</span><span 
class="n">graph</span><span class="p">,</span> <span 
class="n">micro_mod</span><span class="p">,</span> <span 
class="n">ctx</span><span class="o">=</span><span class="n">tvm</span><span 
class="o">.</span><span class="n">micro_dev</span><span class="p">(</span><span 
class="mi">0</span><span class="p">))</span>
+  <span class="n">graph_mod</span><span class="o">.</span><span 
class="n">run</span><span class="p">(</span><span class="n">data</span><span 
class="o">=</span><span class="n">data_np</span><span class="p">)</span>
+  <span class="n">prediction</span> <span class="o">=</span> <span 
class="n">CIFAR10_CLASSES</span><span class="p">[</span><span 
class="n">np</span><span class="o">.</span><span class="n">argmax</span><span 
class="p">(</span><span class="n">graph_mod</span><span class="o">.</span><span 
class="n">get_output</span><span class="p">(</span><span 
class="mi">0</span><span class="p">)</span><span class="o">.</span><span 
class="n">asnumpy</span><span class="p">())]</span>
+  <span class="k">print</span><span class="p">(</span><span 
class="n">f</span><span class="s">'prediction was {prediction}'</span><span 
class="p">)</span>
+</code></pre></div></div>
+
+<p>Below are the performance results of MicroTVM, compared with <a href="https://github.com/ARM-software/CMSIS_5/releases/tag/5.6.0">CMSIS-NN version 5.7.0</a> (commit <code class="highlighter-rouge">a65b7c9a</code>), a hand-optimized library of ML kernels.</p>
+
+<p style="text-align: center"><img 
src="/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png" 
alt="/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png" width="60%" /><br 
/></p>
+
+<p>As we can see, the out-of-the-box performance isn’t great, but this is where <a href="https://dl.acm.org/doi/10.5555/3327144.3327258">AutoTVM</a> comes to the rescue.  We can write a schedule template for our device, do a round of autotuning, then achieve significantly better results.  To plug in our autotuned results, we only need to replace this line:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">graph</span><span class="p">,</span> 
<span class="n">c_module</span><span class="p">,</span> <span 
class="n">params</span> <span class="o">=</span> <span 
class="n">relay</span><span class="o">.</span><span class="n">build</span><span 
class="p">(</span><span class="n">module</span><span class="p">[</span><span 
class="s">'main'</span><span class="p">],</span> <span class="n">t [...]
+</code></pre></div></div>
+
+<p>with these lines:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="k">with</span> <span 
class="n">TARGET</span><span class="p">,</span> <span 
class="n">autotvm</span><span class="o">.</span><span 
class="n">apply_history_best</span><span class="p">(</span><span 
class="n">TUNING_RESULTS_FILE</span><span class="p">):</span>
+  <span class="n">graph</span><span class="p">,</span> <span 
class="n">c_module</span><span class="p">,</span> <span class="n">params</span> 
<span class="o">=</span> <span class="n">relay</span><span 
class="o">.</span><span class="n">build</span><span class="p">(</span><span 
class="n">module</span><span class="p">[</span><span 
class="s">'main'</span><span class="p">],</span> <span 
class="n">target</span><span class="o">=</span><span 
class="n">TARGET</span><span class="p">,</span> <span c [...]
+</code></pre></div></div>
+
+<p>And our results now look like this:</p>
+
+<p style="text-align: center"><img 
src="/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png" 
alt="/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png" 
width="60%" /><br /></p>
+
+<p>We’ve improved our performance by ~2x, and we’re now much closer to CMSIS-NN. Although the MicroTVM CIFAR10 implementation is competitive with a similar TFLite/CMSIS-NN model, this work has just begun to take advantage of TVM’s optimization features. There’s room to optimize further by accelerating other operators such as dense/fully-connected and taking advantage of TVM’s model-specific quantization and operator fusion capabilities. TVM with µTVM enables you to play with the best  [...]
+
+<h1 id="design">Design</h1>
+
+<p style="text-align: center"><img 
src="/images/microtvm/post-2020-05-28/memory-layout.png" 
alt="/images/microtvm/post-2020-05-28/memory-layout.png" width="20%" /><br />
+The µTVM Device Memory Layout in RAM</p>
+
+<p>µTVM aims to support the lowest common denominator of devices by minimizing 
the set of requirements that must be satisfied.  In particular, users need only 
provide:</p>
+
+<ol>
+  <li>a C cross-compiler toolchain for their device</li>
+  <li>a method for reading/writing to device memory and executing code on the 
device</li>
+  <li>a specification containing the device’s memory layout and general 
architectural characteristics</li>
+  <li>a code snippet that prepares the device for function execution</li>
+</ol>
+
+<p>Most bare-metal devices have support for C and JTAG (a debugging protocol), 
so (1) and (2) usually come for free!  Furthermore, (3) and (4) are often very 
small asks.  Below are examples of (3) and (4) for STM32F746-series boards.</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">device_config</span> <span 
class="o">=</span> <span class="p">{</span>
+    <span class="s">'device_id'</span><span class="p">:</span> <span 
class="s">'arm.stm32f746xx'</span><span class="p">,</span>        <span 
class="c1"># unique identifier for the device
+</span>    <span class="s">'toolchain_prefix'</span><span class="p">:</span> 
<span class="s">'arm-none-eabi-'</span><span class="p">,</span>  <span 
class="c1"># prefix of each binary in the cross-compilation toolchain (e.g., 
arm-none-eabi-gcc)
+</span>    <span class="s">'base_addr'</span><span class="p">:</span> <span 
class="mh">0x20000000</span><span class="p">,</span>               <span 
class="c1"># first address of RAM
+</span>    <span class="s">'section_sizes'</span><span class="p">:</span> 
<span class="p">{</span>                     <span class="c1"># dictionary of 
desired section sizes in bytes
+</span>         <span class="s">'text'</span><span class="p">:</span> <span 
class="mi">18000</span><span class="p">,</span>
+         <span class="s">'rodata'</span><span class="p">:</span> <span 
class="mi">100</span><span class="p">,</span>
+         <span class="s">'data'</span><span class="p">:</span> <span 
class="mi">100</span><span class="p">,</span>
+         <span class="o">...</span>
+    <span class="p">},</span>
+    <span class="s">'word_size'</span><span class="p">:</span> <span 
class="mi">4</span><span class="p">,</span>                        <span 
class="c1"># device word size
+</span>    <span class="s">'thumb_mode'</span><span class="p">:</span> <span 
class="bp">True</span><span class="p">,</span>                    <span 
class="c1"># whether to use ARM's thumb ISA
+</span>    <span class="s">'comms_method'</span><span class="p">:</span> <span 
class="s">'openocd'</span><span class="p">,</span>             <span 
class="c1"># method of communication with the device
+</span>    <span class="s">'server_addr'</span><span class="p">:</span> <span 
class="s">'127.0.0.1'</span><span class="p">,</span>            <span 
class="c1"># OpenOCD server address (if 'comms_method' is 'openocd')
+</span>    <span class="s">'server_port'</span><span class="p">:</span> <span 
class="mi">6666</span><span class="p">,</span>                   <span 
class="c1"># OpenOCD server port (if 'comms_method' is 'openocd')
+</span><span class="p">}</span>
+</code></pre></div></div>
+
+<div class="language-cpp highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="p">.</span><span class="n">syntax</span> 
<span class="n">unified</span>
+<span class="p">.</span><span class="n">cpu</span> <span 
class="n">cortex</span><span class="o">-</span><span class="n">m7</span>
+<span class="p">.</span><span class="n">fpu</span> <span 
class="n">softvfp</span>
+<span class="p">.</span><span class="n">thumb</span>
+
+<span class="p">.</span><span class="n">section</span> <span 
class="p">.</span><span class="n">text</span><span class="p">.</span><span 
class="n">UTVMInit</span>
+<span class="p">.</span><span class="n">type</span> <span 
class="n">UTVMInit</span><span class="p">,</span> <span class="o">%</span><span 
class="n">function</span>
+<span class="n">UTVMInit</span><span class="o">:</span>
+  <span class="cm">/* enable fpu */</span>
+  <span class="n">ldr</span> <span class="n">r0</span><span class="p">,</span> 
<span class="o">=</span><span class="mh">0xE000ED88</span>
+  <span class="n">ldr</span> <span class="n">r1</span><span class="p">,</span> 
<span class="p">[</span><span class="n">r0</span><span class="p">]</span>
+  <span class="n">ldr</span> <span class="n">r2</span><span class="p">,</span> 
<span class="o">=</span><span class="mh">0xF00000</span>
+  <span class="n">orr</span> <span class="n">r1</span><span class="p">,</span> 
<span class="n">r2</span>
+  <span class="n">str</span> <span class="n">r1</span><span class="p">,</span> 
<span class="p">[</span><span class="n">r0</span><span class="p">]</span>
+  <span class="n">dsb</span>
+  <span class="n">isb</span>
+  <span class="cm">/* set stack pointer */</span>
+  <span class="n">ldr</span> <span class="n">sp</span><span class="p">,</span> 
<span class="o">=</span><span class="n">_utvm_stack_pointer_init</span>
+  <span class="n">bl</span> <span class="n">UTVMMain</span>
+<span class="p">.</span><span class="n">size</span> <span 
class="n">UTVMInit</span><span class="p">,</span> <span class="p">.</span><span 
class="o">-</span><span class="n">UTVMInit</span>
+</code></pre></div></div>
+
+<p>The µTVM infrastructure and device runtime rely only on these requirements, and we’re working to relax them by supporting common open-source runtime platforms, such as Mbed OS, to handle the compilation and linking processes.</p>
+
+<h2 id="device-sessions">Device Sessions</h2>
+
+<p>Given the networked nature of microcontroller interaction, we slightly 
deviate from standard TVM code by introducing the concept of <code 
class="highlighter-rouge">MicroSession</code>.</p>
+
+<p>Every piece of functionality in µTVM relies on having an open session with the target device.  If you’re familiar with TVM, you may have noticed a line of code that deviates from the norm in our first code snippet; namely, this one:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="o">...</span>
+<span class="k">with</span> <span class="n">micro</span><span 
class="o">.</span><span class="n">Session</span><span class="p">(</span><span 
class="n">device_config</span><span class="p">)</span> <span 
class="k">as</span> <span class="n">sess</span><span class="p">:</span>
+       <span class="o">...</span>
+</code></pre></div></div>
+
+<p>Every line inside this <code class="highlighter-rouge">with</code> block 
can call functions in µTVM, with the context being the device specified by 
<code class="highlighter-rouge">device_config</code>.  This line is doing a 
number of things under the hood, so let’s unpack it.</p>
+
+<p>First, it initializes a connection with your device, using whichever 
communication method you specified (usually OpenOCD).  The µTVM device runtime 
is then cross-compiled, using whichever cross-compiler you specified.  Finally, 
space for the compiled binary is allocated by the host, and the binary is 
loaded onto the device using the opened connection.</p>
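The three steps above can be sketched as a small driver function. The helper names below are hypothetical, purely to illustrate the order of operations; they are not actual TVM APIs:

```python
def open_session(device_config, connect, cross_compile_runtime, load_binary):
    """Hypothetical outline of opening a µTVM session.

    The three callables stand in for the steps described above; their
    names are illustrative, not real TVM APIs.
    """
    conn = connect(device_config)                       # 1. open comms channel (e.g., OpenOCD)
    runtime_bin = cross_compile_runtime(device_config)  # 2. cross-compile the device runtime
    load_binary(conn, runtime_bin)                      # 3. allocate space and load the binary
    return conn

# Trace the order of operations with stub callables:
trace = []
open_session(
    {"comms_method": "openocd"},
    connect=lambda cfg: (trace.append("connect"), "conn")[1],
    cross_compile_runtime=lambda cfg: (trace.append("compile"), b"bin")[1],
    load_binary=lambda conn, binary: trace.append("load"),
)
print(trace)  # ['connect', 'compile', 'load']
```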
+
+<p>With the runtime now situated on the device, we’ll naturally want some 
functions to run through it.</p>
+
+<h2 id="module-loading">Module Loading</h2>
+
+<p>One of the core abstractions in TVM is that of a module.  A module stores a 
set of related functions for a particular device/runtime target.  Given that 
microcontrollers don’t normally have operating systems, µTVM needs to do a lot 
of extra work to maintain this high-level abstraction.  To see what’s going on, 
we’ll trace through the process of creating and loading a µTVM-compatible 
module.</p>
+
+<p>Suppose we have a <code class="highlighter-rouge">micro.Session</code> open 
with our device and a TVM schedule that implements 2D convolution.  If we want 
to load it onto our microcontroller, we need it to emit C code.  To do so, we 
just need to set the <code class="highlighter-rouge">target</code> in either 
<code class="highlighter-rouge">tvm.build</code> or <code 
class="highlighter-rouge">relay.build</code>.  Example:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">graph</span><span class="p">,</span> 
<span class="n">c_module</span><span class="p">,</span> <span 
class="n">params</span> <span class="o">=</span> <span 
class="n">relay</span><span class="o">.</span><span class="n">build</span><span 
class="p">(</span><span class="n">module</span><span class="p">[</span><span 
class="s">'main'</span><span class="p">],</span> <span class="n">t [...]
+</code></pre></div></div>
+
+<p>By setting the target like so, the build process runs through our C code 
generation backend.  However, the resulting C module still resides on the host 
machine.  In order to load it onto the device, we run it through one of the 
core functions in the µTVM infrastructure: <code 
class="highlighter-rouge">create_micro_mod</code>.  Example:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">micro_mod</span> <span 
class="o">=</span> <span class="n">micro</span><span class="o">.</span><span 
class="n">create_micro_mod</span><span class="p">(</span><span 
class="n">c_module</span><span class="p">,</span> <span 
class="n">DEV_CONFIG</span><span class="p">)</span>
+</code></pre></div></div>
+
+<p>The line above cross-compiles the C source within the module, allocates 
room for the resulting binary (so it can coexist with the runtime in device 
memory), then sends each section of the binary to its allocated slot on the 
device.  Once the module binary is snug in device memory, function pointers 
within the binary are patched to give the module access to helper functions in 
the device runtime (e.g., for allocating scratchpads).</p>
+
+<p>Now, with our kernel loaded on the device, we can grab a remote handle to 
the convolution function like so:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">micro_func</span> <span 
class="o">=</span> <span class="n">micro_mod</span><span 
class="p">[</span><span class="s">'conv2d'</span><span class="p">]</span>
+</code></pre></div></div>
+
+<h2 id="tensor-loading">Tensor Loading</h2>
+
+<p>If we want to call an operator, we first need some tensors as arguments:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">data_np</span><span class="p">,</span> 
<span class="n">kernel_np</span> <span class="o">=</span> <span 
class="n">get_conv_inputs</span><span class="p">()</span>
+<span class="n">ctx</span> <span class="o">=</span> <span 
class="n">tvm</span><span class="o">.</span><span 
class="n">micro_dev</span><span class="p">(</span><span 
class="mi">0</span><span class="p">)</span>
+<span class="n">data</span> <span class="o">=</span> <span 
class="n">tvm</span><span class="o">.</span><span class="n">nd</span><span 
class="o">.</span><span class="n">array</span><span class="p">(</span><span 
class="n">data_np</span><span class="p">,</span> <span 
class="n">ctx</span><span class="o">=</span><span class="n">ctx</span><span 
class="p">)</span>
+<span class="n">kernel</span> <span class="o">=</span> <span 
class="n">tvm</span><span class="o">.</span><span class="n">nd</span><span 
class="o">.</span><span class="n">array</span><span class="p">(</span><span 
class="n">kernel_np</span><span class="p">,</span> <span 
class="n">ctx</span><span class="o">=</span><span class="n">ctx</span><span 
class="p">)</span>
+</code></pre></div></div>
+
+<p>Based on its data type (e.g., <code class="highlighter-rouge">int8</code>, 
<code class="highlighter-rouge">float32</code>, etc.) and shape, each tensor’s 
size in bytes is calculated, and the host allocates a region of memory on the 
device’s heap.  The tensor’s data is then loaded into the allocated region.</p>
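The size computation described here can be illustrated in a few lines of Python (a simplified sketch, not the actual µTVM allocator):

```python
import numpy as np

def tensor_nbytes(shape, dtype):
    """Bytes to reserve on the device heap for a tensor of this shape/dtype."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize

# A 1x3x32x32 float32 CIFAR-10 input needs 3072 elements * 4 bytes:
print(tensor_nbytes((1, 3, 32, 32), "float32"))  # 12288
print(tensor_nbytes((10, 10), "int8"))           # 100
```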
+
+<h2 id="function-calls">Function Calls</h2>
+
+<p>Operator execution is perhaps the trickiest part of this system.  To simplify its presentation, we’ll first cover strict execution (where operators are executed as soon as they’re called), then lazy execution (where operators are only executed once their results are needed); the latter is how the system actually works.</p>
+
+<h3 id="strict-execution">Strict Execution</h3>
+
+<p>When calling a function, both input and output tensors are passed as 
arguments, in what’s known as destination-passing style:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">conv2D</span><span 
class="p">(</span><span class="n">data</span><span class="p">,</span> <span 
class="n">kernel</span><span class="p">,</span> <span 
class="n">output</span><span class="p">)</span>
+</code></pre></div></div>
+
+<p>Given that these tensors are already allocated on the device, we only need 
to send <em>metadata</em> to the device (device address, shape, and data type), 
so it knows which of its resident tensors to use.  The runtime representation 
of a function call includes this metadata, as well as the address of the 
function being called (shown below).  Before constructing this representation, 
the metadata needs to be serialized into the arguments section on the device 
that exists expressly for t [...]
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="cm">/*
+ * task struct for uTVM
+ */</span>
+<span class="k">typedef</span> <span class="k">struct</span> <span 
class="p">{</span>
+  <span class="cm">/* pointer to function to call for this task */</span>
+  <span class="kt">int32_t</span> <span class="p">(</span><span 
class="o">*</span><span class="n">func</span><span class="p">)(</span><span 
class="kt">void</span><span class="o">*</span><span class="p">,</span> <span 
class="kt">void</span><span class="o">*</span><span class="p">,</span> <span 
class="kt">int32_t</span><span class="p">);</span>
+  <span class="cm">/* array of argument tensors */</span>
+  <span class="n">TVMValue</span><span class="o">*</span> <span 
class="n">arg_values</span><span class="p">;</span>
+  <span class="cm">/* array of datatype codes for each argument */</span>
+  <span class="kt">int</span><span class="o">*</span> <span 
class="n">arg_type_codes</span><span class="p">;</span>
+  <span class="cm">/* number of arguments */</span>
+  <span class="kt">int32_t</span> <span class="n">num_args</span><span 
class="p">;</span>
+<span class="p">}</span> <span class="n">UTVMTask</span><span 
class="p">;</span>
+</code></pre></div></div>
+
+<p>In the strict setting, there is a single global <code 
class="highlighter-rouge">UTVMTask</code> instance that we, from the host side, 
write into.  Once we have written to the task, the runtime has everything it 
needs to execute the function, and we can begin execution at the runtime’s 
entry point.  The runtime will perform some lightweight initialization, run our 
operator, then return control to the host.</p>
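To make the serialization step concrete, here is how a UTVMTask-like record could be packed for a 32-bit little-endian target. The exact on-device layout is an assumption for illustration; the field order mirrors the struct above:

```python
import struct

def pack_utvm_task(func_ptr, arg_values_ptr, arg_type_codes_ptr, num_args):
    """Serialize a UTVMTask-like record for a 32-bit little-endian device.

    Field layout is an illustrative assumption: three 4-byte pointers
    (func, arg_values, arg_type_codes) followed by an int32 num_args.
    """
    return struct.pack("<IIIi", func_ptr, arg_values_ptr,
                       arg_type_codes_ptr, num_args)

blob = pack_utvm_task(0x20001000, 0x20002000, 0x20002100, 3)
print(len(blob))  # 16 bytes: four 4-byte fields
```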
+
+<h3 id="lazy-execution">Lazy Execution</h3>
+
+<p>In practice, executing operators as soon as the user requests them becomes prohibitively expensive, as communication overhead begins to dominate.  We can improve the throughput of our system by delaying evaluation until the user wants the results of the call.</p>
+
+<p>From an implementation standpoint, instead of eagerly serializing argument 
metadata and <code class="highlighter-rouge">UTVMTask</code> data, we now need 
to accumulate function call metadata on the host side, before flushing it to 
the device.  The device runtime also needs a few changes: (1) we must now have 
a global array of <code class="highlighter-rouge">UTVMTask</code> and (2) we 
need to loop through and execute each task in order.</p>
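The host-side change can be modeled as a tiny task queue. This is a toy sketch; the real runtime serializes UTVMTask structs into device memory rather than holding Python objects:

```python
class LazyTaskQueue:
    """Toy model of lazy execution: queue call metadata, flush in one batch."""

    def __init__(self, flush_fn):
        self._tasks = []
        self._flush_fn = flush_fn  # sends the whole batch to the device

    def call(self, func_addr, arg_metadata):
        # Record the call instead of executing it immediately.
        self._tasks.append((func_addr, arg_metadata))

    def flush(self):
        # One round-trip executes every queued task in order.
        batch, self._tasks = self._tasks, []
        self._flush_fn(batch)
        return len(batch)

executed = []
q = LazyTaskQueue(flush_fn=executed.extend)
for i in range(8):
    q.call(0x20000000 + i, {"num_args": 3})
assert len(executed) == 0  # nothing has run yet
q.flush()
assert len(executed) == 8  # all eight ran in a single round-trip
```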
+
+<h2 id="autotvm-with-microtvm">AutoTVM with MicroTVM</h2>
+
+<p>So far, the runtime we’ve described doesn’t seem very useful for <em>model 
deployment</em>, since it relies so heavily on a host machine.  This is 
intentional, and the runtime has in fact been designed for a different goal: 
<strong>AutoTVM support</strong>.</p>
+
+<p>In general, AutoTVM proposes candidate kernels, runs them on the target 
backend with random inputs, then uses the timing results to improve its search 
process.  Given that AutoTVM only cares about single operator executions, we 
have designed the runtime to be operator-oriented, as opposed to being 
model-oriented.  In the case of µTVM though, communication with the device will 
usually dominate the execution time.  Lazy execution allows us to run the same 
operator many times without ret [...]
+
+<p>Because AutoTVM requires rapid iteration on large numbers of candidate kernels, the µTVM infrastructure currently makes use of RAM only.  However, for a self-hosted runtime, we will surely need to make use of both flash memory and RAM.</p>
+
+<h2 id="the-hosted-graph-runtime">The Hosted Graph Runtime</h2>
+
+<p>Although the hosted runtime was designed for AutoTVM, we can still run full 
models (as long as they don’t have any control flow).  This functionality comes 
for free just by using TVM’s graph runtime, but with a µTVM context.  In fact, 
the only reliance on the host with the graph runtime is for tensor allocation 
and operator scheduling (which is just a topological sort of the dependence 
graph).</p>
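The operator-scheduling step mentioned here is just a topological sort; a minimal sketch using Kahn's algorithm on a hypothetical dependence graph (not TVM's actual data structures):

```python
from collections import deque

def topo_order(deps):
    """Kahn's algorithm: deps maps each node to the nodes it depends on."""
    indegree = {n: len(d) for n, d in deps.items()}
    dependents = {n: [] for n in deps}
    for n, d in deps.items():
        for p in d:
            dependents[p].append(n)  # p must run before n
    ready = deque(n for n, k in indegree.items() if k == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in dependents[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    return order

# conv depends on input; relu on conv; dense on relu
print(topo_order({"input": [], "conv": ["input"],
                  "relu": ["conv"], "dense": ["relu"]}))
# → ['input', 'conv', 'relu', 'dense']
```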
+
+<h1 id="evaluation">Evaluation</h1>
+
+<p>With this infrastructure in place, we sought to answer the following 
questions:</p>
+
+<ol>
+  <li>Is µTVM truly device-agnostic?</li>
+  <li>How much effort is required to experiment with optimizations using 
µTVM?</li>
+</ol>
+
+<p>To evaluate (1), we ran our experiments on two targets:</p>
+
+<ul>
+  <li>An <a 
href="https://www.st.com/en/microcontrollers-microprocessors/stm32f746ng.html";>Arm
 STM32F746NG development board</a>, featuring a Cortex-M7 processor</li>
+  <li>The µTVM host emulated device, which creates a memory arena on the host 
machine that is interfaced with as if it were a bare-metal device.</li>

+</ul>
+
+<p>To evaluate (2), we explore optimizations for the Arm board that give the 
biggest bang for your buck.</p>
+
+<p>As a point of comparison, we pulled a quantized CIFAR-10 CNN from <a 
href="https://developer.arm.com/solutions/machine-learning-on-arm/developer-material/how-to-guides/image-recognition-on-arm-cortex-m-with-cmsis-nn/single-page";>this
 tutorial by Arm</a>.  In the tutorial, <a 
href="https://arm-software.github.io/CMSIS_5/NN/html/index.html";>CMSIS-NN</a> 
(a library of highly optimized kernels by Arm experts) is used as the operator 
library, making this CNN the perfect evaluation target,  [...]
+
+<p style="text-align: center"><img 
src="/images/microtvm/post-2020-05-28/cifar10-graphical.png" 
alt="/images/microtvm/post-2020-05-28/cifar10-graphical.png" width="80%" /><br 
/>
+Diagram of CIFAR-10 CNN</p>
+
+<h2 id="methodology">Methodology</h2>
+
+<p>In our experiments, we use TVM from HEAD (commit <code 
class="highlighter-rouge">9fa8341</code>), version 5.7.0 of CMSIS-NN (commit 
<code class="highlighter-rouge">a65b7c9a</code>), version 1.16.0 of 
STM32CubeF7, and GCC from Arm’s GNU Tools for Arm Embedded Processors 
9-2019-q4-major 9.2.1 toolchain (revision 277599).  The host machine used in 
our experiments runs Ubuntu Linux 18.04.4 LTS and sports an AMD Ryzen 
Threadripper 2990WX 32-Core Processor with 62GB of RAM.  All evaluation  [...]
+
+<h3 id="arm-specific-optimizations">Arm-Specific Optimizations</h3>
+
+<p>With CMSIS-NN, the first convolution maps to their <a 
href="https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/NN/Source/ConvolutionFunctions/arm_convolve_HWC_q7_RGB.c";>RGB
 convolution implementation</a> (specifically for usage in input layers) and 
the latter two map to their <a 
href="https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/NN/Source/ConvolutionFunctions/arm_convolve_HWC_q7_fast.c";>“fast”
 convolution implementation</a>.  We felt our performance was close eno [...]
+
+<p style="text-align: center"><img 
src="/images/microtvm/post-2020-05-28/simd-diagram.png" 
alt="/images/microtvm/post-2020-05-28/simd-diagram.png" width="80%" /><br />
+Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel</p>
+
+<p>Tensorization works by defining a microkernel that can be inserted into the 
innermost loop of a TVM operator.  Using this mechanism, adding SIMD support 
for the Arm board was as simple as defining a microkernel in C (found <a 
href="https://github.com/apache/incubator-tvm/blob/8d7249688771bb6806595931586d95648036f383/topi/python/topi/arm_cpu/cortex_m7/micro_kernel/gemm.py";>here</a>)
 that mirrored the implementation in their paper.  We defined a schedule that 
used this microkernel (foun [...]
+
+<p>While we were able to use the SIMD microkernel for direct convolution, 
CMSIS-NN uses what they call “partial im2col” as their implementation strategy, 
which offers a tradeoff between performance and memory usage.  Instead of 
manifesting the entire im2col matrix at once, partial im2col generates only a 
few columns at a time.  Then, with each batch, they can send the matrix to 
their SIMD matmul function.</p>
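+<p>The idea behind partial im2col can be sketched in plain Python (a toy single-channel example; the function names and parameters are illustrative, not CMSIS-NN's API):</p>
+

```python
# Sketch of "partial im2col": instead of materializing the full im2col
# matrix at once, generate only `batch_size` columns at a time and hand
# each batch to the matmul routine. Toy single-channel example; the
# function names and parameters are illustrative, not CMSIS-NN's API.

def im2col_columns(image, k):
    """Yield one flattened k x k patch (one im2col column) per output pixel."""
    h, w = len(image), len(image[0])
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            yield [image[i + di][j + dj] for di in range(k) for dj in range(k)]

def conv_partial_im2col(image, kernel, k, batch_size):
    flat_kernel = [kernel[di][dj] for di in range(k) for dj in range(k)]
    outputs, batch = [], []
    for col in im2col_columns(image, k):
        batch.append(col)
        if len(batch) == batch_size:   # the matrix is only batch_size columns wide
            outputs += [sum(a * b for a, b in zip(flat_kernel, c)) for c in batch]
            batch = []
    # flush any leftover columns in a final, smaller batch
    outputs += [sum(a * b for a, b in zip(flat_kernel, c)) for c in batch]
    return outputs

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(conv_partial_im2col(image, kernel, k=2, batch_size=3))  # [6, 8, 12, 14]
```

Smaller batch sizes use less memory for the im2col buffer but call the matmul routine more often; that is exactly the performance/memory tradeoff described above.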
+
+<p>Our hypothesis was that, among other optimizations, we could find the 
optimal batch size via autotuning.  In practice, we found partial im2col to be 
significantly slower than our direct convolution implementation, so we don’t 
include it in the rest of our results.</p>
+
+<p>There are certainly other optimizations we could pull from CMSIS-NN to 
close the gap even further:</p>
+
+<ul>
+  <li>Batch expansion of <code class="highlighter-rouge">int8</code> weights 
into <code class="highlighter-rouge">int16</code>, to cut down on duplicate 
expansion for SIMD</li>
+  <li>Splitting convolution into 3x3 tiles to reduce padding checks</li>
+</ul>
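+<p>The first of these optimizations can be illustrated in plain Python (a toy model of the idea, not CMSIS-NN's actual routine):</p>
+

```python
# Toy sketch of batch weight expansion: sign-extend int8 weights to
# int16 once, up front, instead of re-expanding them inside every
# SIMD multiply. This models the idea only; it is not CMSIS-NN's routine.

def sign_extend_int8_to_int16(w):
    # int8 values are stored as 0..255 bytes; recover the signed value.
    return w - 256 if w > 127 else w

raw_weights = [0x01, 0xFF, 0x80, 0x7F]   # bytes as read from flash
expanded = [sign_extend_int8_to_int16(w) for w in raw_weights]
print(expanded)  # [1, -1, -128, 127]

# Later multiplies reuse `expanded` directly, so the sign-extension cost
# is paid once per weight rather than once per use.
activations = [2, 2, 2, 2]
acc = sum(w * a for w, a in zip(expanded, activations))
print(acc)  # 2*(1 - 1 - 128 + 127) = -2
```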
+
+<p>But our goal in this blog post is to show the broad strokes of what can be 
done with µTVM.  Even so, it’s not a competition, because CMSIS-NN (and any 
other hand-optimized library) can plug directly into TVM using the <a 
href="https://tvm.apache.org/docs/dev/relay_bring_your_own_codegen.html";>Bring 
Your Own Codegen framework</a>.</p>
+
+<h2 id="end-to-end">End-To-End</h2>
+
+<h3 id="cifar-10">CIFAR-10</h3>
+
+<p>After exploring optimizations for convolution, we set out to measure their 
effects on end-to-end performance.  For the Arm board, we collected untuned 
results, results that were tuned <strong>without</strong> any use of SIMD, 
results that were tuned <strong>with</strong> SIMD, and results using CMSIS-NN. 
 For the emulated host device, we only collected untuned results and generic 
tuned results.</p>
+
+<p>The code used to collect these results is available at <a 
href="https://github.com/areusch/microtvm-blogpost-eval";>https://github.com/areusch/microtvm-blogpost-eval</a>.</p>
+
+<p style="text-align: center"><img 
src="/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png" 
alt="/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png" 
width="60%" /><br />
+<code class="highlighter-rouge">int8</code>-quantized CIFAR-10 CNN comparison 
on an Arm STM32F746NG (re-posted from above)</p>
+
+<p style="text-align: center"><img 
src="/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png" 
alt="/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png" 
width="60%" /><br />
+<code class="highlighter-rouge">int8</code>-quantized CIFAR-10 CNN comparison 
on µTVM’s emulated host device</p>
+
+<p>On the Arm STM32-series board, we were able to improve performance by ~2x 
compared to the initial untuned operators, and we achieved results much closer 
to CMSIS-NN.  Additionally, we were able to significantly improve performance 
on the host emulated device.  Though the x86 <strong><em>numbers</em></strong> 
don’t mean much, they show we can use the same infrastructure (µTVM) to 
optimize performance on vastly different architectures.</p>
+
+<p>Stay tuned for more end-to-end benchmarks as we scale this approach out more broadly.</p>
+
+<h1 id="self-hosted-runtime-the-final-frontier">Self-Hosted Runtime: The Final 
Frontier</h1>
+
+<p style="text-align: center"><img 
src="/images/microtvm/self-hosted-runtime.png" 
alt="/images/microtvm/self-hosted-runtime.png" width="80%" /><br /></p>
+
+<p>The envisioned µTVM optimization and deployment pipeline</p>
+
+<p>While end-to-end benchmark results are already obtainable with the current 
runtime, as we demonstrated above, deploying these models in a standalone 
capacity is still on our roadmap.  The gap is that the AutoTVM-oriented runtime 
currently relies on the host to allocate tensors and to schedule function 
execution. However, to be useful at the edge, we need a 
pipeline through µTVM that generates a <strong>single</strong> binary to be run 
on a bare-metal device. Users will  [...]
+
+<h1 id="conclusion">Conclusion</h1>
+
+<p>MicroTVM for single-kernel optimization is ready <strong>today</strong> and 
is <em>the</em> choice for that use case.  As we now build out self-hosted 
deployment support we hope you’re just as excited as we are to make µTVM 
<em>the</em> choice for model deployment as well. However, this isn’t just a 
spectator sport - remember: this is all open source!  µTVM is still in its 
early days, so every individual can have a great deal of impact on its 
trajectory. Check out the <a href="https:/ [...]
+
+<h2 id="acknowledgements">Acknowledgements</h2>
+
+<p>None of this work would have been possible, if not for the following 
people:</p>
+
+<ul>
+  <li><a href="https://tqchen.com/";>Tianqi Chen</a>, for guiding the design 
and for being a fantastic mentor.</li>
+  <li><a href="https://homes.cs.washington.edu/~patelp1/";>Pratyush Patel</a>, 
for collaborating on early prototypes of MicroTVM.</li>
+  <li><a href="https://octoml.ai/";>OctoML</a>, for facilitating the 
internships where I have been able to go full steam on this project.</li>
+  <li><a href="https://homes.cs.washington.edu/~moreau/";>Thierry Moreau</a>, 
for mentoring me during my time at OctoML.</li>
+  <li><a href="https://homes.cs.washington.edu/~vegaluis/";>Luis Vega</a>, for 
teaching me the fundamentals of interacting with microcontrollers.</li>
+  <li><a 
href="https://www.linkedin.com/in/themadrasi/?originalSubdomain=uk";>Ramana 
Radhakrishnan</a>, for supplying the Arm hardware used in our experiments and 
for providing guidance on its usage.</li>
+</ul>
+
+    </div>
+  </div>
+</div>
+</div>
+
+
+    
+
+
+
+
+
+    <div class="container">
+
+      <footer class="small">
+        Apache TVM is an effort undergoing incubation at The Apache Software 
Foundation (ASF),
+        sponsored by the <i>Apache Incubator</i>. Incubation is required
+        of all newly accepted projects until a further review indicates that 
the infrastructure,
+        communications, and decision making process have stabilized in a 
manner consistent with other
+        successful ASF projects. While incubation status is not necessarily a 
reflection of the completeness
+        or stability of the code, it does indicate that the project has yet to 
be fully endorsed by the ASF.
+
+        Copyright © 2020 The Apache Software Foundation. Apache TVM, Apache,
+        the Apache feather, and the Apache TVM project logo are either 
trademarks or registered trademarks of the Apache Software Foundation.
+
+        See also other useful <a href="/asf" class="footer-link">ASF links</a>:
+        <a href="https://www.apache.org/"; class="footer-link">Apache 
Homepage</a>,
+        <a href="https://www.apache.org/licenses/"; 
class="footer-link">License</a>
+        <a href="https://www.apache.org/foundation/sponsorship.html"; 
class="footer-link">Sponsorship</a>,
+        <a href="https://www.apache.org/security/"; 
class="footer-link">Security</a>
+        <a href="https://www.apache.org/foundation/thanks.html"; 
class="footer-link">Thanks</a>,
+        <a href="https://www.apache.org/events/current-event.html"; 
class="footer-link">Current Event</a>
+
+      </footer>
+    </div>
+  </body>
+</html>
+
+
diff --git a/assets/themes/custom-twitter/css/style.css 
b/assets/themes/custom-twitter/css/style.css
index a3d33c9..520deb4 100644
--- a/assets/themes/custom-twitter/css/style.css
+++ b/assets/themes/custom-twitter/css/style.css
@@ -125,7 +125,7 @@ footer {
    line-height: 24px;
 }
 
-.content ul {
+.content ul, .content ol {
    margin-top:8px;
    font-size: 16px;
    line-height: 24px;
diff --git a/atom.xml b/atom.xml
index 3de1519..4a1f194 100644
--- a/atom.xml
+++ b/atom.xml
@@ -4,7 +4,7 @@
  <title>TVM</title>
  <link href="https://tvm.apache.org"; rel="self"/>
  <link href="https://tvm.apache.org"/>
- <updated>2020-05-20T17:08:53-07:00</updated>
+ <updated>2020-06-04T09:03:32-07:00</updated>
  <id>https://tvm.apache.org</id>
  <author>
    <name></name>
@@ -13,6 +13,315 @@
 
  
  <entry>
+   <title>TinyML - How TVM is Taming Tiny</title>
+   <link 
href="https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny"/>
+   <updated>2020-06-04T00:00:00-07:00</updated>
+   <id>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</id>
+   <content type="html">
+&lt;p&gt;&lt;img src=&quot;/images/microtvm/logo.png&quot; alt=&quot;microTVM 
logo&quot; width=&quot;30%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;The proliferation of low-cost, AI-powered consumer devices has led to 
widespread interest in “bare-metal” (low-power, often without an operating 
system) devices among ML researchers and practitioners.  While it is already 
possible for experts to run &lt;em&gt;some&lt;/em&gt; models on 
&lt;em&gt;some&lt;/em&gt; bare-metal devices, optimizing models for diverse 
sets of devices is challenging, often requiring manually optimized 
device-specific libraries.  And for those platforms wi [...]
+
+&lt;p&gt;The manual optimization of machine learning software is not unique to 
the domain of bare-metal devices.  In fact, this has been a common theme for 
developers working with other hardware backends (e.g., GPUs and FPGAs).  TVM 
has proven resilient to the onslaught of new hardware targets, but until now, 
it couldn’t grapple with the unique profile of microcontrollers.  To solve the 
problem in this domain, we’ve extended TVM to feature a microcontroller 
backend, called µTVM (footnote [...]
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/autotvm-infrastructure.png&quot; 
alt=&quot;/images/microtvm/autotvm-infrastructure.png&quot; 
width=&quot;80%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;h1 id=&quot;lets-see-it-in-action&quot;&gt;Let’s see it in 
action&lt;/h1&gt;
+
+&lt;p&gt;Before we talk about what TVM/MicroTVM is or how it works, let’s see 
a quick example of it in action.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/hardware-connection-diagram.png&quot; 
alt=&quot;/images/microtvm/hardware-connection-diagram.png&quot; 
width=&quot;80%&quot; /&gt;&lt;br /&gt;
+A standard µTVM setup, where the host communicates with the device via 
JTAG.&lt;/p&gt;
+
+&lt;p&gt;Above, we have an &lt;a 
href=&quot;https://www.st.com/en/microcontrollers-microprocessors/stm32f746zg.html&quot;&gt;STM32F746ZG
 board&lt;/a&gt;, housing an ARM Cortex-M7 processor, an ideal part for AI on 
the edge, given its strong performance in a low-power envelope. We use its 
USB-JTAG port to connect it to our desktop machine.  On the desktop, we run 
OpenOCD to open a JTAG connection with the device; in turn, OpenOCD allows µTVM 
to control the M7 processor using a device-agno [...]
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;OPENOCD_SERVER_ADDR&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;s&quot;&gt;'127.0.0.1'&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;OPENOCD_SERVER_PORT&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;6666&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;TARGET&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'c 
-device=micro_dev'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;DEV_CONFIG&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;stm32f746xx&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;default_config&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;OPENOCD_SERVER_ADDR&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;OPENOCD_SERVER_PORT&lt;/span&gt;&lt;spa [...]
+
+&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;get_cifar10_cnn&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;()&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;Session&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;device_config&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span 
class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;sess&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt;
+       &lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt [...]
+  &lt;span class=&quot;n&quot;&gt;micro_mod&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;create_micro_mod&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;DEV_CONFIG&lt;/span&gt;&lt;span class=&quot;p&quot;&g 
[...]
+  &lt;span class=&quot;n&quot;&gt;graph_mod&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;graph_runtime&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro_mod&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt [...]
+  &lt;span class=&quot;n&quot;&gt;graph_mod&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;data_np&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;CIFAR10_CLASSES&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;argmax&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;graph_mod&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt; [...]
+  &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span 
class=&quot;s&quot;&gt;'prediction was {prediction}'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Below are the performance results of MicroTVM, compared with &lt;a 
href=&quot;https://github.com/ARM-software/CMSIS_5/releases/tag/5.6.0&quot;&gt;CMSIS-NN
 version 5.7.0&lt;/a&gt; (commit &lt;code 
class=&quot;highlighter-rouge&quot;&gt;a65b7c9a&lt;/code&gt;), a hand-optimized 
library of ML kernels.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png&quot; 
alt=&quot;/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png&quot; 
width=&quot;60%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;As we can see, the out-of-the-box performance isn’t great, but this 
is where &lt;a 
href=&quot;https://dl.acm.org/doi/10.5555/3327144.3327258&quot;&gt;AutoTVM&lt;/a&gt;
 comes to the rescue.  We can write a schedule template for our device, do a 
round of autotuning, then achieve significantly better results.  To plug in our 
autotuned results, we only need to replace this line:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;relay&lt;/s [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;with these lines:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;TARGET&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;autotvm&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;apply_history_best&lt;/span&gt;&lt;span 
class=&quot;p&quot;&g [...]
+  &lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&l [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;And our results now look like this:&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot;
 
alt=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot;
 width=&quot;60%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;We’ve improved our performance by ~2x, and we’re now much closer to 
CMSIS-NN. Although the MicroTVM CIFAR-10 implementation is competitive with a 
similar TFLite/CMSIS-NN model, this work has just begun to take advantage of 
TVM’s optimization features. There’s room to optimize further by accelerating 
other operators such as dense/fully-connected and taking advantage of TVM’s 
model-specific quantization and operator fusion capabilities. TVM with µTVM 
enables you to play with the [...]
+
+&lt;h1 id=&quot;design&quot;&gt;Design&lt;/h1&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/memory-layout.png&quot; 
alt=&quot;/images/microtvm/post-2020-05-28/memory-layout.png&quot; 
width=&quot;20%&quot; /&gt;&lt;br /&gt;
+The µTVM Device Memory Layout in RAM&lt;/p&gt;
+
+&lt;p&gt;µTVM aims to support the lowest common denominator of devices by 
minimizing the set of requirements that must be satisfied.  In particular, 
users need only provide:&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;a C cross-compiler toolchain for their device&lt;/li&gt;
+  &lt;li&gt;a method for reading/writing to device memory and executing code 
on the device&lt;/li&gt;
+  &lt;li&gt;a specification containing the device’s memory layout and general 
architectural characteristics&lt;/li&gt;
+  &lt;li&gt;a code snippet that prepares the device for function 
execution&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;Most bare-metal devices have support for C and JTAG (a debugging 
protocol), so (1) and (2) usually come for free!  Furthermore, (3) and (4) are 
often very small asks.  Below are examples of (3) and (4) for STM32F746-series 
boards.&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;device_config&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;{&lt;/span&gt;
+    &lt;span class=&quot;s&quot;&gt;'device_id'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;s&quot;&gt;'arm.stm32f746xx'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# 
unique identifier for the device
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'toolchain_prefix'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;s&quot;&gt;'arm-none-eabi-'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# 
prefix of each binary in the cross-compilation toolchain (e.g., 
arm-none-eabi-gcc)
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'base_addr'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mh&quot;&gt;0x20000000&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;               &lt;span 
class=&quot;c1&quot;&gt;# first address of RAM
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'section_sizes'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;{&lt;/span&gt;                     &lt;span 
class=&quot;c1&quot;&gt;# dictionary of desired section sizes in bytes
+&lt;/span&gt;         &lt;span 
class=&quot;s&quot;&gt;'text'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;18000&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;
+         &lt;span class=&quot;s&quot;&gt;'rodata'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;
+         &lt;span class=&quot;s&quot;&gt;'data'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;
+         &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
+    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
+    &lt;span class=&quot;s&quot;&gt;'word_size'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;                        &lt;span 
class=&quot;c1&quot;&gt;# device word size
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'thumb_mode'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;                    &lt;span 
class=&quot;c1&quot;&gt;# whether to use ARM's thumb ISA
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'comms_method'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;s&quot;&gt;'openocd'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;             &lt;span 
class=&quot;c1&quot;&gt;# method of communication with the device
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'server_addr'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;s&quot;&gt;'127.0.0.1'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;            &lt;span 
class=&quot;c1&quot;&gt;# OpenOCD server address (if 'comms_method' is 
'openocd')
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'server_port'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;6666&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;                   &lt;span 
class=&quot;c1&quot;&gt;# OpenOCD server port (if 'comms_method' is 'openocd')
+&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;syntax&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;unified&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;cpu&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;cortex&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;m7&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;fpu&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;softvfp&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;thumb&lt;/span&gt;
+
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;section&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;type&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;function&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;:&lt;/span&gt;
+  &lt;span class=&quot;cm&quot;&gt;/* enable fpu */&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;mh&quot;&gt;0xE000ED88&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;r0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;]&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r2&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;mh&quot;&gt;0xF00000&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;orr&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r2&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;str&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;r0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;]&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;dsb&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;isb&lt;/span&gt;
+  &lt;span class=&quot;cm&quot;&gt;/* set stack pointer */&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;sp&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;_utvm_stack_pointer_init&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;bl&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;UTVMMain&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;The µTVM infrastructure and device runtime have been built to only 
make use of these requirements, and we’re working to lessen these requirements 
by supporting common open source runtime platforms such as Mbed OS to handle 
the compilation and linking processes.&lt;/p&gt;
+
+&lt;h2 id=&quot;device-sessions&quot;&gt;Device Sessions&lt;/h2&gt;
+
+&lt;p&gt;Given the networked nature of microcontroller interaction, we slightly deviate from standard TVM code by introducing the concept of a &lt;code class=&quot;highlighter-rouge&quot;&gt;MicroSession&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;Every piece of functionality in µTVM relies on having an open session with the target device.  If you’re familiar with TVM, you may have noticed a line of code that deviates from the norm in our first code snippet, namely this one:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;o&quot;&gt;...&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;Session&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;device_config&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span 
class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;sess&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt;
+       &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Every line inside this &lt;code 
class=&quot;highlighter-rouge&quot;&gt;with&lt;/code&gt; block can call 
functions in µTVM, with the context being the device specified by &lt;code 
class=&quot;highlighter-rouge&quot;&gt;device_config&lt;/code&gt;.  This line 
is doing a number of things under the hood, so let’s unpack it.&lt;/p&gt;
+
+&lt;p&gt;First, it initializes a connection with your device, using whichever 
communication method you specified (usually OpenOCD).  The µTVM device runtime 
is then cross-compiled, using whichever cross-compiler you specified.  Finally, 
space for the compiled binary is allocated by the host, and the binary is 
loaded onto the device using the opened connection.&lt;/p&gt;
+
+&lt;p&gt;With the runtime now situated on the device, we’ll naturally want 
some functions to run through it.&lt;/p&gt;
+
+&lt;h2 id=&quot;module-loading&quot;&gt;Module Loading&lt;/h2&gt;
+
+&lt;p&gt;One of the core abstractions in TVM is that of a module.  A module 
stores a set of related functions for a particular device/runtime target.  
Given that microcontrollers don’t normally have operating systems, µTVM needs 
to do a lot of extra work to maintain this high-level abstraction.  To see 
what’s going on, we’ll trace through the process of creating and loading a 
µTVM-compatible module.&lt;/p&gt;
+
+&lt;p&gt;Suppose we have a &lt;code 
class=&quot;highlighter-rouge&quot;&gt;micro.Session&lt;/code&gt; open with our 
device and a TVM schedule that implements 2D convolution.  If we want to load it onto our microcontroller, we need TVM to emit C code.  To do so, we just need 
to set the &lt;code class=&quot;highlighter-rouge&quot;&gt;target&lt;/code&gt; 
in either &lt;code 
class=&quot;highlighter-rouge&quot;&gt;tvm.build&lt;/code&gt; or &lt;code 
class=&quot;highlighter-rouge&quot;&gt;relay.b [...]
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;relay&lt;/s [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;By setting the target like so, the build process runs through our C 
code generation backend.  However, the resulting C module still resides on the 
host machine.  In order to load it onto the device, we run it through one of 
the core functions in the µTVM infrastructure: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;create_micro_mod&lt;/code&gt;.  
Example:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;micro_mod&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;create_micro_mod&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c_ [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;The line above cross-compiles the C source within the module, 
allocates room for the resulting binary (so it can coexist with the runtime in 
device memory), then sends each section of the binary to its allocated slot on 
the device.  Once the module binary is snug in device memory, function pointers 
within the binary are patched to give the module access to helper functions in 
the device runtime (e.g., for allocating scratchpads).&lt;/p&gt;
+
+&lt;p&gt;Now, with our kernel loaded on the device, we can grab a remote 
handle to the convolution function like so:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;micro_func&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro_mod&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;s&quot;&gt;'conv2d'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;]&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;h2 id=&quot;tensor-loading&quot;&gt;Tensor Loading&lt;/h2&gt;
+
+&lt;p&gt;If we want to call an operator, we first need some tensors as 
arguments:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;data_np&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;kernel_np&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;get_conv_inputs&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;()&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;micro_dev&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;nd&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;data_np&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span clas [...]
+&lt;span class=&quot;n&quot;&gt;kernel&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;nd&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;kernel_np&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span  [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Based on its data type (e.g., &lt;code 
class=&quot;highlighter-rouge&quot;&gt;int8&lt;/code&gt;, &lt;code 
class=&quot;highlighter-rouge&quot;&gt;float32&lt;/code&gt;, etc.) and shape, 
each tensor’s size in bytes is calculated, and the host allocates a region of 
memory on the device’s heap.  The tensor’s data is then loaded into the 
allocated region.&lt;/p&gt;
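As a sketch of the size calculation described above (using numpy's dtype machinery; `tensor_size_bytes` is an illustrative helper, not a µTVM API):

```python
import numpy as np

def tensor_size_bytes(shape, dtype):
    """Bytes the host must reserve on the device heap for a tensor of
    the given shape and data type: element count times element size."""
    # np.dtype(...).itemsize is the per-element size in bytes,
    # e.g. 1 for int8 and 4 for float32.
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

# A CIFAR-10-sized input image in NCHW layout:
print(tensor_size_bytes((1, 3, 32, 32), "int8"))     # 3072
print(tensor_size_bytes((1, 3, 32, 32), "float32"))  # 12288
```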
+
+&lt;h2 id=&quot;function-calls&quot;&gt;Function Calls&lt;/h2&gt;
+
+&lt;p&gt;Operator execution is perhaps the trickiest part of this system.  To simplify its presentation, we’ll first cover strict execution (where operators are executed as soon as they’re called), then lazy execution (where operators are only executed once their results are needed); the latter is how the system actually works.&lt;/p&gt;
+
+&lt;h3 id=&quot;strict-execution&quot;&gt;Strict Execution&lt;/h3&gt;
+
+&lt;p&gt;When calling a function, both input and output tensors are passed as 
arguments, in what’s known as destination-passing style:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;conv2D&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;kernel&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;output&lt;/span& [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Given that these tensors are already allocated on the device, we only 
need to send &lt;em&gt;metadata&lt;/em&gt; to the device (device address, 
shape, and data type), so it knows which of its resident tensors to use.  The 
runtime representation of a function call includes this metadata, as well as 
the address of the function being called (shown below).  Before constructing 
this representation, the metadata needs to be serialized into the arguments 
section on the device that exis [...]
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/*
+ * task struct for uTVM
+ */&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span 
class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;{&lt;/span&gt;
+  &lt;span class=&quot;cm&quot;&gt;/* pointer to function to call for this 
task */&lt;/span&gt;
+  &lt;span class=&quot;kt&quot;&gt;int32_t&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span 
class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span  [...]
+  &lt;span class=&quot;cm&quot;&gt;/* array of argument tensors */&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;TVMValue&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;arg_values&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;;&lt;/span&gt;
+  &lt;span class=&quot;cm&quot;&gt;/* array of datatype codes for each 
argument */&lt;/span&gt;
+  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;arg_type_codes&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;;&lt;/span&gt;
+  &lt;span class=&quot;cm&quot;&gt;/* number of arguments */&lt;/span&gt;
+  &lt;span class=&quot;kt&quot;&gt;int32_t&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;num_args&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;;&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;UTVMTask&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;;&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;In the strict setting, there is a single global &lt;code 
class=&quot;highlighter-rouge&quot;&gt;UTVMTask&lt;/code&gt; instance that we, 
from the host side, write into.  Once we have written to the task, the runtime 
has everything it needs to execute the function, and we can begin execution at 
the runtime’s entry point.  The runtime will perform some lightweight 
initialization, run our operator, then return control to the host.&lt;/p&gt;
+
+&lt;h3 id=&quot;lazy-execution&quot;&gt;Lazy Execution&lt;/h3&gt;
+
+&lt;p&gt;In practice, executing each operator as soon as the user requests it becomes prohibitively expensive, as communication overhead begins to dominate.  We can improve the throughput of our system by delaying evaluation until the user wants the results of the call.&lt;/p&gt;
+
+&lt;p&gt;From an implementation standpoint, instead of eagerly serializing 
argument metadata and &lt;code 
class=&quot;highlighter-rouge&quot;&gt;UTVMTask&lt;/code&gt; data, we now need 
to accumulate function call metadata on the host side, before flushing it to 
the device.  The device runtime also needs a few changes: (1) we must now have 
a global array of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;UTVMTask&lt;/code&gt; and (2) we need to 
loop through and execute each task in order. [...]
+
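The host-side bookkeeping for lazy execution can be sketched as a small task queue that records calls and flushes them to the device in one batch (all names here are illustrative stand-ins, not the actual µTVM internals):

```python
class LazyTaskQueue:
    """Sketch of lazy execution: calls are recorded on the host and
    only flushed to the device when a result is actually needed."""

    def __init__(self, device):
        self.device = device
        self.pending = []  # accumulated UTVMTask-like records

    def call(self, func_name, arg_metadata):
        # Record the call instead of paying a round trip per operator.
        self.pending.append((func_name, arg_metadata))

    def flush(self):
        # Serialize every pending task into the device's global task
        # array in one transfer, then execute them in order.
        executed = self.device.run_tasks(list(self.pending))
        self.pending.clear()
        return executed


class EchoDevice:
    """Stand-in for the device runtime: 'executes' tasks by counting them."""

    def run_tasks(self, tasks):
        return len(tasks)
```

With this shape, ten queued operator calls cost one host-device round trip instead of ten.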
+&lt;h2 id=&quot;autotvm-with-microtvm&quot;&gt;AutoTVM with MicroTVM&lt;/h2&gt;
+
+&lt;p&gt;So far, the runtime we’ve described doesn’t seem very useful for 
&lt;em&gt;model deployment&lt;/em&gt;, since it relies so heavily on a host 
machine.  This is intentional, and the runtime has in fact been designed for a 
different goal: &lt;strong&gt;AutoTVM support&lt;/strong&gt;.&lt;/p&gt;
+
+&lt;p&gt;In general, AutoTVM proposes candidate kernels, runs them on the 
target backend with random inputs, then uses the timing results to improve its 
search process.  Given that AutoTVM only cares about single operator 
executions, we have designed the runtime to be operator-oriented, as opposed to 
being model-oriented.  In the case of µTVM though, communication with the 
device will usually dominate the execution time.  Lazy execution allows us to 
run the same operator many times witho [...]
+
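The tuning loop described above can be sketched as follows (heavily simplified, with hypothetical callback names; the real AutoTVM search uses cost models and guided exploration rather than exhaustive trials):

```python
def autotune(candidate_schedules, build, run_on_device, trials=3):
    """Sketch of the AutoTVM loop: build each candidate kernel, time it
    on the device a few times, and keep the fastest schedule seen."""
    best_time, best_sched = float("inf"), None
    for sched in candidate_schedules:
        kernel = build(sched)
        # With lazy execution, these repeated runs can be queued up and
        # flushed to the device in a single round trip.
        times = [run_on_device(kernel) for _ in range(trials)]
        mean = sum(times) / len(times)
        if mean < best_time:
            best_time, best_sched = mean, sched
    return best_sched, best_time

# Toy usage: "schedules" are just ints, and runtime is half the value.
best = autotune([3, 1, 2], build=lambda s: s, run_on_device=lambda k: k * 0.5)
# best == (1, 0.5)
```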
+&lt;p&gt;Because AutoTVM requires rapid iteration on large numbers of candidate kernels, the µTVM infrastructure currently makes use of RAM only.  However, a self-hosted runtime will surely need to make use of both flash memory and RAM.&lt;/p&gt;
+
+&lt;h2 id=&quot;the-hosted-graph-runtime&quot;&gt;The Hosted Graph 
Runtime&lt;/h2&gt;
+
+&lt;p&gt;Although the hosted runtime was designed for AutoTVM, we can still 
run full models (as long as they don’t have any control flow).  This 
functionality comes for free just by using TVM’s graph runtime, but with a µTVM 
context.  In fact, the only reliance on the host with the graph runtime is for 
tensor allocation and operator scheduling (which is just a topological sort of 
the dependence graph).&lt;/p&gt;
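The operator-scheduling step really is just a topological sort; a minimal sketch over a hypothetical dependence graph:

```python
def schedule_ops(deps):
    """Topologically sort an operator dependence graph given as
    {op: [ops it depends on]}; every op appears after its inputs."""
    order, visited = [], set()

    def visit(op):
        if op in visited:
            return
        visited.add(op)
        for dep in deps.get(op, []):
            visit(dep)      # emit dependencies first
        order.append(op)

    for op in deps:
        visit(op)
    return order

# conv2d feeds bias_add, which feeds relu:
print(schedule_ops({"relu": ["bias_add"], "bias_add": ["conv2d"], "conv2d": []}))
# ['conv2d', 'bias_add', 'relu']
```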
+
+&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;/h1&gt;
+
+&lt;p&gt;With this infrastructure in place, we sought to answer the following 
questions:&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;Is µTVM truly device-agnostic?&lt;/li&gt;
+  &lt;li&gt;How much effort is required to experiment with optimizations using 
µTVM?&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;To evaluate (1), we ran our experiments on two targets:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;An &lt;a 
href=&quot;https://www.st.com/en/microcontrollers-microprocessors/stm32f746ng.html&quot;&gt;Arm
 STM32F746NG development board&lt;/a&gt;, featuring a Cortex-M7 
processor&lt;/li&gt;
+  &lt;li&gt;The µTVM host-emulated device, which creates a memory arena on the host machine and interfaces with it as if it were a bare-metal device.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;To evaluate (2), we explore optimizations for the Arm board that give 
the biggest bang for your buck.&lt;/p&gt;
+
+&lt;p&gt;As a point of comparison, we pulled a quantized CIFAR-10 CNN from 
&lt;a 
href=&quot;https://developer.arm.com/solutions/machine-learning-on-arm/developer-material/how-to-guides/image-recognition-on-arm-cortex-m-with-cmsis-nn/single-page&quot;&gt;this
 tutorial by Arm&lt;/a&gt;.  In the tutorial, &lt;a 
href=&quot;https://arm-software.github.io/CMSIS_5/NN/html/index.html&quot;&gt;CMSIS-NN&lt;/a&gt;
 (a library of highly optimized kernels by Arm experts) is used as the operator 
librar [...]
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/cifar10-graphical.png&quot; 
alt=&quot;/images/microtvm/post-2020-05-28/cifar10-graphical.png&quot; 
width=&quot;80%&quot; /&gt;&lt;br /&gt;
+Diagram of CIFAR-10 CNN&lt;/p&gt;
+
+&lt;h2 id=&quot;methodology&quot;&gt;Methodology&lt;/h2&gt;
+
+&lt;p&gt;In our experiments, we use TVM from HEAD (commit &lt;code 
class=&quot;highlighter-rouge&quot;&gt;9fa8341&lt;/code&gt;), version 5.7.0 of 
CMSIS-NN (commit &lt;code 
class=&quot;highlighter-rouge&quot;&gt;a65b7c9a&lt;/code&gt;), version 1.16.0 
of STM32CubeF7, and GCC from Arm’s GNU Tools for Arm Embedded Processors 
9-2019-q4-major 9.2.1 toolchain (revision 277599).  The host machine used in 
our experiments runs Ubuntu Linux 18.04.4 LTS and sports an AMD Ryzen 
Threadripper 2990WX 32 [...]
+
+&lt;h3 id=&quot;arm-specific-optimizations&quot;&gt;Arm-Specific 
Optimizations&lt;/h3&gt;
+
+&lt;p&gt;With CMSIS-NN, the first convolution maps to their &lt;a 
href=&quot;https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/NN/Source/ConvolutionFunctions/arm_convolve_HWC_q7_RGB.c&quot;&gt;RGB
 convolution implementation&lt;/a&gt; (specifically for usage in input layers) 
and the latter two map to their &lt;a 
href=&quot;https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/NN/Source/ConvolutionFunctions/arm_convolve_HWC_q7_fast.c&quot;&gt;“fast”
 convolution implementation [...]
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/simd-diagram.png&quot; 
alt=&quot;/images/microtvm/post-2020-05-28/simd-diagram.png&quot; 
width=&quot;80%&quot; /&gt;&lt;br /&gt;
+Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication 
microkernel&lt;/p&gt;
+
+&lt;p&gt;Tensorization works by defining a microkernel that can be inserted 
into the innermost loop of a TVM operator.  Using this mechanism, adding SIMD 
support for the Arm board was as simple as defining a microkernel in C (found 
&lt;a 
href=&quot;https://github.com/apache/incubator-tvm/blob/8d7249688771bb6806595931586d95648036f383/topi/python/topi/arm_cpu/cortex_m7/micro_kernel/gemm.py&quot;&gt;here&lt;/a&gt;)
 that mirrored the implementation in their paper.  We defined a schedule that 
[...]
+
+&lt;p&gt;While we were able to use the SIMD microkernel for direct 
convolution, CMSIS-NN uses what they call “partial im2col” as their 
implementation strategy, which offers a tradeoff between performance and memory 
usage.  Instead of manifesting the entire im2col matrix at once, partial im2col 
generates only a few columns at a time.  Then, with each batch, they can send 
the matrix to their SIMD matmul function.&lt;/p&gt;
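To make the memory/performance tradeoff concrete, here is a rough 1-D analogue of partial im2col in numpy: only `batch_size` columns of the im2col matrix exist at any one time before being handed off to the matmul routine (simplified; CMSIS-NN operates on 2-D quantized tensors):

```python
import numpy as np

def partial_im2col_1d(data, kernel_width, batch_size):
    """Yield the im2col matrix of a 1-D convolution in column batches,
    so peak memory holds batch_size columns instead of the full matrix."""
    num_cols = len(data) - kernel_width + 1
    for start in range(0, num_cols, batch_size):
        stop = min(start + batch_size, num_cols)
        # Each column is one sliding window of the input.
        cols = [data[i:i + kernel_width] for i in range(start, stop)]
        yield np.stack(cols, axis=1)  # shape: (kernel_width, <= batch_size)

data = np.arange(6)  # 4 sliding windows of width 3
batches = list(partial_im2col_1d(data, kernel_width=3, batch_size=2))
# Two (3, 2) batches; each batch would be fed to the SIMD matmul in turn.
```

A larger `batch_size` means fewer matmul invocations but a larger scratch buffer, which is exactly the knob autotuning could explore.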
+
+&lt;p&gt;Our hypothesis was that, among other optimizations, we could find the 
optimal batch size via autotuning.  In practice, we found partial im2col to be 
significantly slower than our direct convolution implementation, so we don’t 
include it in the rest of our results.&lt;/p&gt;
+
+&lt;p&gt;There are certainly other optimizations we could pull from CMSIS-NN 
to close the gap even further:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Batch expansion of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;int8&lt;/code&gt; weights into &lt;code 
class=&quot;highlighter-rouge&quot;&gt;int16&lt;/code&gt;, to cut down on 
duplicate expansion for SIMD&lt;/li&gt;
+  &lt;li&gt;Splitting convolution into 3x3 tiles to reduce padding 
checks&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;But our goal in this blog post is to show the broad strokes of what 
can be done with µTVM.  Even so, it’s not a competition, because CMSIS-NN (and 
any other hand-optimized library) can plug directly into TVM using the &lt;a 
href=&quot;https://tvm.apache.org/docs/dev/relay_bring_your_own_codegen.html&quot;&gt;Bring
 Your Own Codegen framework&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h2 id=&quot;end-to-end&quot;&gt;End-To-End&lt;/h2&gt;
+
+&lt;h3 id=&quot;cifar-10&quot;&gt;CIFAR-10&lt;/h3&gt;
+
+&lt;p&gt;After exploring optimizations for convolution, we set out to measure 
their effects on end-to-end performance.  For the Arm board, we collected 
untuned results, results that were tuned &lt;strong&gt;without&lt;/strong&gt; 
any use of SIMD, results that were tuned &lt;strong&gt;with&lt;/strong&gt; 
SIMD, and results using CMSIS-NN.  For the emulated host device, we only 
collected untuned results and generic tuned results.&lt;/p&gt;
+
+&lt;p&gt;&lt;a 
href=&quot;https://github.com/areusch/microtvm-blogpost-eval&quot;&gt;https://github.com/areusch/microtvm-blogpost-eval&lt;/a&gt;&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot;
 
alt=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot;
 width=&quot;60%&quot; /&gt;&lt;br /&gt;
+&lt;code class=&quot;highlighter-rouge&quot;&gt;int8&lt;/code&gt;-quantized 
CIFAR-10 CNN comparison on an Arm STM32F746NG (re-posted from above)&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png&quot;
 
alt=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png&quot;
 width=&quot;60%&quot; /&gt;&lt;br /&gt;
+&lt;code class=&quot;highlighter-rouge&quot;&gt;int8&lt;/code&gt;-quantized 
CIFAR-10 CNN comparison on µTVM’s emulated host device&lt;/p&gt;
+
+&lt;p&gt;On the Arm STM32-series board, we were able to improve performance by 
~2x compared to the initial untuned operators, and we achieved results much 
closer to CMSIS-NN.  Additionally, we were able to significantly improve 
performance on the host emulated device.  Though the x86 
&lt;strong&gt;&lt;em&gt;numbers&lt;/em&gt;&lt;/strong&gt; don’t mean much, they 
show we can use the same infrastructure (µTVM) to optimize performance on 
vastly different architectures.&lt;/p&gt;
+
+&lt;p&gt;Stay tuned for more end-to-end benchmarks as we scale this approach out more broadly.&lt;/p&gt;
+
+&lt;h1 id=&quot;self-hosted-runtime-the-final-frontier&quot;&gt;Self-Hosted 
Runtime: The Final Frontier&lt;/h1&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/self-hosted-runtime.png&quot; 
alt=&quot;/images/microtvm/self-hosted-runtime.png&quot; width=&quot;80%&quot; 
/&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;The envisioned µTVM optimization and deployment pipeline&lt;/p&gt;
+
+&lt;p&gt;While end-to-end benchmark results are already obtainable with the 
current runtime as we demonstrated above, deployment of these models in a 
standalone capacity is currently still on our roadmap: the AutoTVM-oriented runtime relies on the host to allocate tensors and to schedule function execution. However, to be useful at the edge, we need a 
pipeline through µTVM that generates a &lt;strong&gt;single&lt;/strong&gt; 
binary to be run on a bare-metal d [...]
+
+&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
+
+&lt;p&gt;MicroTVM for single-kernel optimization is ready 
&lt;strong&gt;today&lt;/strong&gt; and is &lt;em&gt;the&lt;/em&gt; choice for 
that use case.  As we now build out self-hosted deployment support we hope 
you’re just as excited as we are to make µTVM &lt;em&gt;the&lt;/em&gt; choice 
for model deployment as well. However, this isn’t just a spectator sport. Remember: this is all open source!  µTVM is still in its early days, so every 
individual can have a great deal of impact on its  [...]
+
+&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;
+
+&lt;p&gt;None of this work would have been possible, if not for the following 
people:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;a href=&quot;https://tqchen.com/&quot;&gt;Tianqi 
Chen&lt;/a&gt;, for guiding the design and for being a fantastic 
mentor.&lt;/li&gt;
+  &lt;li&gt;&lt;a 
href=&quot;https://homes.cs.washington.edu/~patelp1/&quot;&gt;Pratyush 
Patel&lt;/a&gt;, for collaborating on early prototypes of MicroTVM.&lt;/li&gt;
+  &lt;li&gt;&lt;a href=&quot;https://octoml.ai/&quot;&gt;OctoML&lt;/a&gt;, for 
facilitating the internships where I have been able to go full steam on this 
project.&lt;/li&gt;
+  &lt;li&gt;&lt;a 
href=&quot;https://homes.cs.washington.edu/~moreau/&quot;&gt;Thierry 
Moreau&lt;/a&gt;, for mentoring me during my time at OctoML.&lt;/li&gt;
+  &lt;li&gt;&lt;a 
href=&quot;https://homes.cs.washington.edu/~vegaluis/&quot;&gt;Luis 
Vega&lt;/a&gt;, for teaching me the fundamentals of interacting with 
microcontrollers.&lt;/li&gt;
+  &lt;li&gt;&lt;a 
href=&quot;https://www.linkedin.com/in/themadrasi/?originalSubdomain=uk&quot;&gt;Ramana
 Radhakrishnan&lt;/a&gt;, for supplying the Arm hardware used in our 
experiments and for providing guidance on its usage.&lt;/li&gt;
+&lt;/ul&gt;
+</content>
+ </entry>
+ 
+ <entry>
    <title>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in 
TVM</title>
    <link href="https://tvm.apache.org/2020/05/20/bring-your-own-datatypes"/>
    <updated>2020-05-20T00:00:00-07:00</updated>
diff --git a/blog.html b/blog.html
index fa09728..c54638e 100644
--- a/blog.html
+++ b/blog.html
@@ -1,3 +1,4 @@
+
 <!DOCTYPE html>
 <html lang="en">
   <head>
@@ -155,6 +156,16 @@
 
 <li>
   <span>
+    <a class="post-link" 
href="/2020/06/04/tinyml-how-tvm-is-taming-tiny">TinyML - How TVM is Taming 
Tiny</a>
+  </span>
+  </br>
+  <span>
+    Jun 4, 2020
+  </span>
+</li>
+
+<li>
+  <span>
     <a class="post-link" href="/2020/05/20/bring-your-own-datatypes">Bring 
Your Own Datatypes: Enabling Custom Datatype Exploration in TVM</a>
   </span>
   </br>
diff --git a/images/microtvm/autotvm-infrastructure.png 
b/images/microtvm/autotvm-infrastructure.png
new file mode 100644
index 0000000..01cecc8
Binary files /dev/null and b/images/microtvm/autotvm-infrastructure.png differ
diff --git a/images/microtvm/hardware-connection-diagram.png 
b/images/microtvm/hardware-connection-diagram.png
new file mode 100644
index 0000000..249241b
Binary files /dev/null and b/images/microtvm/hardware-connection-diagram.png 
differ
diff --git a/images/microtvm/logo.png b/images/microtvm/logo.png
new file mode 100644
index 0000000..ce9f201
Binary files /dev/null and b/images/microtvm/logo.png differ
diff --git 
a/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png 
b/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png
new file mode 100644
index 0000000..6ad3b4b
Binary files /dev/null and 
b/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png differ
diff --git a/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png 
b/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png
new file mode 100644
index 0000000..498c32a
Binary files /dev/null and 
b/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png differ
diff --git a/images/microtvm/post-2020-05-28/cifar10-graphical.png 
b/images/microtvm/post-2020-05-28/cifar10-graphical.png
new file mode 100644
index 0000000..9bd65bb
Binary files /dev/null and 
b/images/microtvm/post-2020-05-28/cifar10-graphical.png differ
diff --git a/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png 
b/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png
new file mode 100644
index 0000000..96489ee
Binary files /dev/null and 
b/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png differ
diff --git a/images/microtvm/post-2020-05-28/memory-layout.png 
b/images/microtvm/post-2020-05-28/memory-layout.png
new file mode 100644
index 0000000..343424d
Binary files /dev/null and b/images/microtvm/post-2020-05-28/memory-layout.png 
differ
diff --git a/images/microtvm/post-2020-05-28/simd-diagram.png 
b/images/microtvm/post-2020-05-28/simd-diagram.png
new file mode 100644
index 0000000..f7606b5
Binary files /dev/null and b/images/microtvm/post-2020-05-28/simd-diagram.png 
differ
diff --git a/images/microtvm/self-hosted-runtime.png 
b/images/microtvm/self-hosted-runtime.png
new file mode 100644
index 0000000..62771a4
Binary files /dev/null and b/images/microtvm/self-hosted-runtime.png differ
diff --git a/rss.xml b/rss.xml
index ab26bf9..6e4c53d 100644
--- a/rss.xml
+++ b/rss.xml
@@ -5,12 +5,321 @@
         <description>TVM - </description>
         <link>https://tvm.apache.org</link>
         <atom:link href="https://tvm.apache.org"; rel="self" 
type="application/rss+xml" />
-        <lastBuildDate>Wed, 20 May 2020 17:08:53 -0700</lastBuildDate>
-        <pubDate>Wed, 20 May 2020 17:08:53 -0700</pubDate>
+        <lastBuildDate>Thu, 04 Jun 2020 09:03:32 -0700</lastBuildDate>
+        <pubDate>Thu, 04 Jun 2020 09:03:32 -0700</pubDate>
         <ttl>60</ttl>
 
 
         <item>
+                <title>TinyML - How TVM is Taming Tiny</title>
+                <description>
+&lt;p&gt;&lt;img src=&quot;/images/microtvm/logo.png&quot; alt=&quot;microTVM 
logo&quot; width=&quot;30%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;The proliferation of low-cost, AI-powered consumer devices has led to 
widespread interest in “bare-metal” (low-power, often without an operating 
system) devices among ML researchers and practitioners.  While it is already 
possible for experts to run &lt;em&gt;some&lt;/em&gt; models on 
&lt;em&gt;some&lt;/em&gt; bare-metal devices, optimizing models for diverse 
sets of devices is challenging, often requiring manually optimized 
device-specific libraries.  And for those platforms wi [...]
+
+&lt;p&gt;The manual optimization of machine learning software is not unique to 
the domain of bare-metal devices.  In fact, this has been a common theme for 
developers working with other hardware backends (e.g., GPUs and FPGAs).  TVM 
has proven resilient to the onslaught of new hardware targets, but until now, 
it couldn’t grapple with the unique profile of microcontrollers.  To solve the 
problem in this domain, we’ve extended TVM to feature a microcontroller 
backend, called µTVM (footnote [...]
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/autotvm-infrastructure.png&quot; 
alt=&quot;/images/microtvm/autotvm-infrastructure.png&quot; 
width=&quot;80%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;h1 id=&quot;lets-see-it-in-action&quot;&gt;Let’s see it in 
action&lt;/h1&gt;
+
+&lt;p&gt;Before we talk about what TVM/MicroTVM is or how it works, let’s see 
a quick example of it in action.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/hardware-connection-diagram.png&quot; 
alt=&quot;/images/microtvm/hardware-connection-diagram.png&quot; 
width=&quot;80%&quot; /&gt;&lt;br /&gt;
+A standard µTVM setup, where the host communicates with the device via 
JTAG.&lt;/p&gt;
+
+&lt;p&gt;Above, we have an &lt;a 
href=&quot;https://www.st.com/en/microcontrollers-microprocessors/stm32f746zg.html&quot;&gt;STM32F746ZG
 board&lt;/a&gt;, housing an Arm Cortex-M7 processor, an ideal part for AI on the edge given its strong performance in a low-power envelope. We use its 
USB-JTAG port to connect it to our desktop machine.  On the desktop, we run 
OpenOCD to open a JTAG connection with the device; in turn, OpenOCD allows µTVM 
to control the M7 processor using a device-agno [...]
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;OPENOCD_SERVER_ADDR&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;s&quot;&gt;'127.0.0.1'&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;OPENOCD_SERVER_PORT&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;6666&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;TARGET&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'c 
-device=micro_dev'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;DEV_CONFIG&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;stm32f746xx&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;default_config&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;OPENOCD_SERVER_ADDR&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;OPENOCD_SERVER_PORT&lt;/span&gt;&lt;spa [...]
+
+&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;get_cifar10_cnn&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;()&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;Session&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;device_config&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span 
class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;sess&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt;
+       &lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt [...]
+  &lt;span class=&quot;n&quot;&gt;micro_mod&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;create_micro_mod&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;DEV_CONFIG&lt;/span&gt;&lt;span class=&quot;p&quot;&g 
[...]
+  &lt;span class=&quot;n&quot;&gt;graph_mod&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;graph_runtime&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro_mod&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt [...]
+  &lt;span class=&quot;n&quot;&gt;graph_mod&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;data_np&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;CIFAR10_CLASSES&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;argmax&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;graph_mod&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt; [...]
+  &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span 
class=&quot;s&quot;&gt;'prediction was {prediction}'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Below are the performance results of MicroTVM, compared with &lt;a 
href=&quot;https://github.com/ARM-software/CMSIS_5/releases/tag/5.6.0&quot;&gt;CMSIS-NN
 version 5.7.0&lt;/a&gt; (commit &lt;code 
class=&quot;highlighter-rouge&quot;&gt;a65b7c9a&lt;/code&gt;), a hand-optimized 
library of ML kernels.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png&quot; 
alt=&quot;/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png&quot; 
width=&quot;60%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;As we can see, the out-of-the-box performance isn’t great, but this 
is where &lt;a 
href=&quot;https://dl.acm.org/doi/10.5555/3327144.3327258&quot;&gt;AutoTVM&lt;/a&gt;
 comes to the rescue.  We can write a schedule template for our device, do a 
round of autotuning, then achieve significantly better results.  To plug in our 
autotuned results, we only need to replace this line:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;relay&lt;/s [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;with these lines:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;TARGET&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;autotvm&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;apply_history_best&lt;/span&gt;&lt;span 
class=&quot;p&quot;&g [...]
+  &lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&l [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;And our results now look like this:&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot;
 
alt=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot;
 width=&quot;60%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;We’ve improved our performance by ~2x, and we’re now much closer to 
CMSIS-NN. Although the MicroTVM CIFAR10 implementation is competitive with a 
similar TFLite/CMSIS-NN model, this work has just begun to take advantage of 
TVM’s optimization features. There’s room to optimize further by accelerating 
other operators such as dense/fully-connected and taking advantage of TVM’s 
model-specific quantization and operator fusion capabilities. TVM with µTVM 
enables you to play with the [...]
+
+&lt;h1 id=&quot;design&quot;&gt;Design&lt;/h1&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/memory-layout.png&quot; 
alt=&quot;/images/microtvm/post-2020-05-28/memory-layout.png&quot; 
width=&quot;20%&quot; /&gt;&lt;br /&gt;
+The µTVM Device Memory Layout in RAM&lt;/p&gt;
+
+&lt;p&gt;µTVM aims to support the lowest common denominator of devices by 
minimizing the set of requirements that must be satisfied.  In particular, 
users need only provide:&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;a C cross-compiler toolchain for their device&lt;/li&gt;
+  &lt;li&gt;a method for reading/writing to device memory and executing code 
on the device&lt;/li&gt;
+  &lt;li&gt;a specification containing the device’s memory layout and general 
architectural characteristics&lt;/li&gt;
+  &lt;li&gt;a code snippet that prepares the device for function 
execution&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;Most bare-metal devices have support for C and JTAG (a debugging 
protocol), so (1) and (2) usually come for free!  Furthermore, (3) and (4) are 
often very small asks.  Below are examples of (3) and (4) for STM32F746-series 
boards.&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;device_config&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;{&lt;/span&gt;
+    &lt;span class=&quot;s&quot;&gt;'device_id'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;s&quot;&gt;'arm.stm32f746xx'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# 
unique identifier for the device
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'toolchain_prefix'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;s&quot;&gt;'arm-none-eabi-'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# 
prefix of each binary in the cross-compilation toolchain (e.g., 
arm-none-eabi-gcc)
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'base_addr'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mh&quot;&gt;0x20000000&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;               &lt;span 
class=&quot;c1&quot;&gt;# first address of RAM
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'section_sizes'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;{&lt;/span&gt;                     &lt;span 
class=&quot;c1&quot;&gt;# dictionary of desired section sizes in bytes
+&lt;/span&gt;         &lt;span 
class=&quot;s&quot;&gt;'text'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;18000&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;
+         &lt;span class=&quot;s&quot;&gt;'rodata'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;
+         &lt;span class=&quot;s&quot;&gt;'data'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;
+         &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
+    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
+    &lt;span class=&quot;s&quot;&gt;'word_size'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;                        &lt;span 
class=&quot;c1&quot;&gt;# device word size
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'thumb_mode'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;                    &lt;span 
class=&quot;c1&quot;&gt;# whether to use ARM's thumb ISA
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'comms_method'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;s&quot;&gt;'openocd'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;             &lt;span 
class=&quot;c1&quot;&gt;# method of communication with the device
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'server_addr'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;s&quot;&gt;'127.0.0.1'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;            &lt;span 
class=&quot;c1&quot;&gt;# OpenOCD server address (if 'comms_method' is 
'openocd')
+&lt;/span&gt;    &lt;span 
class=&quot;s&quot;&gt;'server_port'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;6666&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt;                   &lt;span 
class=&quot;c1&quot;&gt;# OpenOCD server port (if 'comms_method' is 'openocd')
+&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;syntax&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;unified&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;cpu&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;cortex&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;m7&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;fpu&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;softvfp&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;thumb&lt;/span&gt;
+
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;section&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;type&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;function&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;:&lt;/span&gt;
+  &lt;span class=&quot;cm&quot;&gt;/* enable fpu */&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;mh&quot;&gt;0xE000ED88&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;r0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;]&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r2&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;mh&quot;&gt;0xF00000&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;orr&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r2&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;str&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;r0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;]&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;dsb&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;isb&lt;/span&gt;
+  &lt;span class=&quot;cm&quot;&gt;/* set stack pointer */&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;sp&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;_utvm_stack_pointer_init&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;bl&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;UTVMMain&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;The µTVM infrastructure and device runtime rely only on these requirements, and we’re working to relax them by supporting common open-source runtime platforms such as Mbed OS to handle the compilation and linking processes.&lt;/p&gt;
+
+&lt;h2 id=&quot;device-sessions&quot;&gt;Device Sessions&lt;/h2&gt;
+
+&lt;p&gt;Given the networked nature of microcontroller interaction, we 
slightly deviate from standard TVM code by introducing the concept of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;MicroSession&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;Every piece of functionality in µTVM relies on having an open session 
with the target device.  If you’re familiar with TVM, you may have noticed a 
line of code that deviates from the norm in our first code snippet, namely this one:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;o&quot;&gt;...&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;Session&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;device_config&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span 
class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;sess&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt;
+       &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Every line inside this &lt;code 
class=&quot;highlighter-rouge&quot;&gt;with&lt;/code&gt; block can call 
functions in µTVM, with the context being the device specified by &lt;code 
class=&quot;highlighter-rouge&quot;&gt;device_config&lt;/code&gt;.  This line 
is doing a number of things under the hood, so let’s unpack it.&lt;/p&gt;
+
+&lt;p&gt;First, it initializes a connection with your device, using whichever 
communication method you specified (usually OpenOCD).  The µTVM device runtime 
is then cross-compiled, using whichever cross-compiler you specified.  Finally, 
space for the compiled binary is allocated by the host, and the binary is 
loaded onto the device using the opened connection.&lt;/p&gt;
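+
+The three steps above can be sketched as a context manager. This is a host-side simulation only; the helper names (`open_connection`, `cross_compile_runtime`, `load_runtime_binary`) are hypothetical stand-ins, not the real µTVM internals:

```python
from contextlib import contextmanager

# Hypothetical stand-ins for the three steps a micro.Session performs;
# none of these names are part of the actual µTVM API.
log = []

def open_connection(config):
    log.append(f"connect via {config['comms_method']}")

def cross_compile_runtime(config):
    log.append(f"compile runtime with {config['toolchain_prefix']}gcc")

def load_runtime_binary(config):
    log.append("load runtime binary onto device")

@contextmanager
def session(config):
    open_connection(config)          # 1. open the device connection
    cross_compile_runtime(config)    # 2. cross-compile the device runtime
    load_runtime_binary(config)      # 3. allocate space and load the binary
    yield log

with session({'comms_method': 'openocd',
              'toolchain_prefix': 'arm-none-eabi-'}) as s:
    assert len(s) == 3  # connected, compiled, loaded
```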
+
+&lt;p&gt;With the runtime now situated on the device, we’ll naturally want 
some functions to run through it.&lt;/p&gt;
+
+&lt;h2 id=&quot;module-loading&quot;&gt;Module Loading&lt;/h2&gt;
+
+&lt;p&gt;One of the core abstractions in TVM is that of a module.  A module 
stores a set of related functions for a particular device/runtime target.  
Given that microcontrollers don’t normally have operating systems, µTVM needs 
to do a lot of extra work to maintain this high-level abstraction.  To see 
what’s going on, we’ll trace through the process of creating and loading a 
µTVM-compatible module.&lt;/p&gt;
+
+&lt;p&gt;Suppose we have a &lt;code 
class=&quot;highlighter-rouge&quot;&gt;micro.Session&lt;/code&gt; open with our 
device and a TVM schedule that implements 2D convolution.  If we want to load 
it onto our microcontroller, we need it to emit C code.  To do so, we just need 
to set the &lt;code class=&quot;highlighter-rouge&quot;&gt;target&lt;/code&gt; 
in either &lt;code 
class=&quot;highlighter-rouge&quot;&gt;tvm.build&lt;/code&gt; or &lt;code 
class=&quot;highlighter-rouge&quot;&gt;relay.b [...]
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;relay&lt;/s [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;By setting the target like so, the build process runs through our C 
code generation backend.  However, the resulting C module still resides on the 
host machine.  In order to load it onto the device, we run it through one of 
the core functions in the µTVM infrastructure: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;create_micro_mod&lt;/code&gt;.  
Example:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;micro_mod&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;create_micro_mod&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c_ [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;The line above cross-compiles the C source within the module, 
allocates room for the resulting binary (so it can coexist with the runtime in 
device memory), then sends each section of the binary to its allocated slot on 
the device.  Once the module binary is snug in device memory, function pointers 
within the binary are patched to give the module access to helper functions in 
the device runtime (e.g., for allocating scratchpads).&lt;/p&gt;
+
+&lt;p&gt;Now, with our kernel loaded on the device, we can grab a remote 
handle to the convolution function like so:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;micro_func&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;micro_mod&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span 
class=&quot;s&quot;&gt;'conv2d'&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;]&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;h2 id=&quot;tensor-loading&quot;&gt;Tensor Loading&lt;/h2&gt;
+
+&lt;p&gt;If we want to call an operator, we first need some tensors as 
arguments:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;data_np&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;kernel_np&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;get_conv_inputs&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;()&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;micro_dev&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;nd&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;data_np&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span clas [...]
+&lt;span class=&quot;n&quot;&gt;kernel&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;nd&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;kernel_np&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span  [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Based on its data type (e.g., &lt;code 
class=&quot;highlighter-rouge&quot;&gt;int8&lt;/code&gt;, &lt;code 
class=&quot;highlighter-rouge&quot;&gt;float32&lt;/code&gt;, etc.) and shape, 
each tensor’s size in bytes is calculated, and the host allocates a region of 
memory on the device’s heap.  The tensor’s data is then loaded into the 
allocated region.&lt;/p&gt;
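+
+The size calculation itself is straightforward. As a rough host-side sketch using NumPy (this is illustrative arithmetic, not the actual µTVM allocator):

```python
import numpy as np

# A CIFAR-10-shaped int8 input tensor (NCHW); the shape is illustrative.
data_np = np.zeros((1, 3, 32, 32), dtype='int8')

# Bytes to reserve on the device heap: element count times element width.
nbytes = data_np.size * data_np.dtype.itemsize
assert nbytes == data_np.nbytes == 3072
```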
+
+&lt;h2 id=&quot;function-calls&quot;&gt;Function Calls&lt;/h2&gt;
+
+&lt;p&gt;Operator execution is perhaps the trickiest part of this system.  To 
simplify its presentation, we’ll first cover strict execution (where operators 
are executed as soon as they’re called), then lazy execution (where operators 
are only executed once their results are needed); the latter is how the system actually works.&lt;/p&gt;
+
+&lt;h3 id=&quot;strict-execution&quot;&gt;Strict Execution&lt;/h3&gt;
+
+&lt;p&gt;When calling a function, both input and output tensors are passed as 
arguments, in what’s known as destination-passing style:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;conv2D&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;kernel&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;output&lt;/span& [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Given that these tensors are already allocated on the device, we only 
need to send &lt;em&gt;metadata&lt;/em&gt; to the device (device address, 
shape, and data type), so it knows which of its resident tensors to use.  The 
runtime representation of a function call includes this metadata, as well as 
the address of the function being called (shown below).  Before constructing 
this representation, the metadata needs to be serialized into the arguments 
section on the device that exis [...]
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/*
+ * task struct for uTVM
+ */&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span 
class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;{&lt;/span&gt;
+  &lt;span class=&quot;cm&quot;&gt;/* pointer to function to call for this 
task */&lt;/span&gt;
+  &lt;span class=&quot;kt&quot;&gt;int32_t&lt;/span&gt; &lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span 
class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span  [...]
+  &lt;span class=&quot;cm&quot;&gt;/* array of argument tensors */&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;TVMValue&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;arg_values&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;;&lt;/span&gt;
+  &lt;span class=&quot;cm&quot;&gt;/* array of datatype codes for each 
argument */&lt;/span&gt;
+  &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;arg_type_codes&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;;&lt;/span&gt;
+  &lt;span class=&quot;cm&quot;&gt;/* number of arguments */&lt;/span&gt;
+  &lt;span class=&quot;kt&quot;&gt;int32_t&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;num_args&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;;&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;UTVMTask&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;;&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;In the strict setting, there is a single global &lt;code 
class=&quot;highlighter-rouge&quot;&gt;UTVMTask&lt;/code&gt; instance that we, 
from the host side, write into.  Once we have written to the task, the runtime 
has everything it needs to execute the function, and we can begin execution at 
the runtime’s entry point.  The runtime will perform some lightweight 
initialization, run our operator, then return control to the host.&lt;/p&gt;
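+
+A host-side analogue of this flow, with a Python dict standing in for the single global `UTVMTask` (purely a simulation of the control flow, not the device runtime):

```python
import numpy as np

def add_one(args):                  # stands in for a compiled operator
    data, output = args
    np.add(data, 1, out=output)     # destination-passing: writes into output
    return 0                        # 0 signals success, like the int32_t return

# "Write" the task: function address plus argument metadata.
task = {'func': add_one,
        'arg_values': (np.array([1, 2, 3]), np.empty(3, dtype=int)),
        'num_args': 2}

# "Runtime entry point": execute the task, then return control to the host.
ret = task['func'](task['arg_values'])
assert ret == 0
assert list(task['arg_values'][1]) == [2, 3, 4]
```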
+
+&lt;h3 id=&quot;lazy-execution&quot;&gt;Lazy Execution&lt;/h3&gt;
+
+&lt;p&gt;In practice, executing operators as soon as the user requests them becomes prohibitively expensive, as communication overhead begins to dominate.  
We can improve the throughput of our system by delaying evaluation until the 
user wants the results of the call.&lt;/p&gt;
+
+&lt;p&gt;From an implementation standpoint, instead of eagerly serializing 
argument metadata and &lt;code 
class=&quot;highlighter-rouge&quot;&gt;UTVMTask&lt;/code&gt; data, we now need 
to accumulate function call metadata on the host side, before flushing it to 
the device.  The device runtime also needs a few changes: (1) we must now have 
a global array of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;UTVMTask&lt;/code&gt; and (2) we need to 
loop through and execute each task in order. [...]
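+
+The host-side batching this implies can be sketched as follows (again a simulation of the idea, not the real implementation):

```python
import numpy as np

task_queue = []                      # host-side accumulation of pending calls

def lazy_call(func, *args):
    task_queue.append((func, args))  # record the call instead of executing it

def flush():
    # One "flush" to the device: execute every accumulated task in order.
    results = [func(*args) for func, args in task_queue]
    task_queue.clear()
    return results

lazy_call(np.add, 1, 2)
lazy_call(np.multiply, 3, 4)
assert flush() == [3, 12]            # both ops ran in a single round trip
assert task_queue == []
```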
+
+&lt;h2 id=&quot;autotvm-with-microtvm&quot;&gt;AutoTVM with MicroTVM&lt;/h2&gt;
+
+&lt;p&gt;So far, the runtime we’ve described doesn’t seem very useful for 
&lt;em&gt;model deployment&lt;/em&gt;, since it relies so heavily on a host 
machine.  This is intentional, and the runtime has in fact been designed for a 
different goal: &lt;strong&gt;AutoTVM support&lt;/strong&gt;.&lt;/p&gt;
+
+&lt;p&gt;In general, AutoTVM proposes candidate kernels, runs them on the 
target backend with random inputs, then uses the timing results to improve its 
search process.  Given that AutoTVM only cares about single operator 
executions, we have designed the runtime to be operator-oriented, as opposed to 
being model-oriented.  In the case of µTVM though, communication with the 
device will usually dominate the execution time.  Lazy execution allows us to 
run the same operator many times witho [...]
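+
+The propose-run-time loop AutoTVM performs can be caricatured in a few lines. This toy version times plain Python functions on the host rather than device kernels, and is only meant to show the shape of the search:

```python
import timeit

# Two toy "candidate kernels" for the same computation.
def candidate_a(xs): return sum(xs)
def candidate_b(xs): return sum(x for x in xs)

candidates = {'a': candidate_a, 'b': candidate_b}
data = list(range(1000))

# Run each candidate, record its best timing, and keep the fastest one.
timings = {name: min(timeit.repeat(lambda f=f: f(data), number=100, repeat=3))
           for name, f in candidates.items()}
best = min(timings, key=timings.get)
assert best in candidates
```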
+
+&lt;p&gt;Because AutoTVM requires rapid iteration on large numbers of candidate kernels, the µTVM infrastructure currently makes use of RAM only.  However, for a self-hosted runtime, we will surely need to make use of both flash memory and RAM.&lt;/p&gt;
+
+&lt;h2 id=&quot;the-hosted-graph-runtime&quot;&gt;The Hosted Graph 
Runtime&lt;/h2&gt;
+
+&lt;p&gt;Although the hosted runtime was designed for AutoTVM, we can still 
run full models (as long as they don’t have any control flow).  This 
functionality comes for free just by using TVM’s graph runtime, but with a µTVM 
context.  In fact, the only reliance on the host with the graph runtime is for 
tensor allocation and operator scheduling (which is just a topological sort of 
the dependence graph).&lt;/p&gt;
+
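The operator-scheduling half of that host responsibility really is just a topological sort. The sketch below illustrates it; the dependence-graph encoding is invented for this example and is not TVM's graph JSON format.

```python
# Minimal sketch of host-side operator scheduling: a topological sort
# of the model's dependence graph.  The graph encoding is invented for
# illustration and is not TVM's graph JSON format.

def schedule(deps):
    """deps maps each op to the list of ops whose outputs it consumes."""
    order, visiting, done = [], set(), set()

    def visit(op):
        if op in done:
            return
        if op in visiting:
            # Cycles would imply control flow, which this runtime rejects.
            raise ValueError("dependence graph must be acyclic")
        visiting.add(op)
        for producer in deps[op]:
            visit(producer)
        visiting.discard(op)
        done.add(op)
        order.append(op)

    for op in deps:
        visit(op)
    return order

# A tiny CNN-shaped dependence graph: conv -> relu -> pool -> dense.
g = {"conv": [], "relu": ["conv"], "pool": ["relu"], "dense": ["pool"]}
print(schedule(g))  # ['conv', 'relu', 'pool', 'dense']
```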
+&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;/h1&gt;
+
+&lt;p&gt;With this infrastructure in place, we sought to answer the following 
questions:&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;Is µTVM truly device-agnostic?&lt;/li&gt;
+  &lt;li&gt;How much effort is required to experiment with optimizations using 
µTVM?&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;To evaluate (1), we ran our experiments on two targets:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;An &lt;a 
href=&quot;https://www.st.com/en/microcontrollers-microprocessors/stm32f746ng.html&quot;&gt;Arm
 STM32F746NG development board&lt;/a&gt;, featuring a Cortex-M7 
processor&lt;/li&gt;
+  &lt;li&gt;The µTVM host emulated device, which creates a memory arena on the 
host machine that is interfaced with as if it were a bare-metal device.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;To evaluate (2), we explore optimizations for the Arm board that give 
the biggest bang for your buck.&lt;/p&gt;
+
+&lt;p&gt;As a point of comparison, we pulled a quantized CIFAR-10 CNN from 
&lt;a 
href=&quot;https://developer.arm.com/solutions/machine-learning-on-arm/developer-material/how-to-guides/image-recognition-on-arm-cortex-m-with-cmsis-nn/single-page&quot;&gt;this
 tutorial by Arm&lt;/a&gt;.  In the tutorial, &lt;a 
href=&quot;https://arm-software.github.io/CMSIS_5/NN/html/index.html&quot;&gt;CMSIS-NN&lt;/a&gt;
 (a library of highly optimized kernels by Arm experts) is used as the operator 
librar [...]
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/cifar10-graphical.png&quot; 
alt=&quot;/images/microtvm/post-2020-05-28/cifar10-graphical.png&quot; 
width=&quot;80%&quot; /&gt;&lt;br /&gt;
+Diagram of CIFAR-10 CNN&lt;/p&gt;
+
+&lt;h2 id=&quot;methodology&quot;&gt;Methodology&lt;/h2&gt;
+
+&lt;p&gt;In our experiments, we use TVM from HEAD (commit &lt;code 
class=&quot;highlighter-rouge&quot;&gt;9fa8341&lt;/code&gt;), version 5.7.0 of 
CMSIS-NN (commit &lt;code 
class=&quot;highlighter-rouge&quot;&gt;a65b7c9a&lt;/code&gt;), version 1.16.0 
of STM32CubeF7, and GCC from Arm’s GNU Tools for Arm Embedded Processors 
9-2019-q4-major 9.2.1 toolchain (revision 277599).  The host machine used in 
our experiments runs Ubuntu Linux 18.04.4 LTS and sports an AMD Ryzen 
Threadripper 2990WX 32 [...]
+
+&lt;h3 id=&quot;arm-specific-optimizations&quot;&gt;Arm-Specific 
Optimizations&lt;/h3&gt;
+
+&lt;p&gt;With CMSIS-NN, the first convolution maps to their &lt;a 
href=&quot;https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/NN/Source/ConvolutionFunctions/arm_convolve_HWC_q7_RGB.c&quot;&gt;RGB
 convolution implementation&lt;/a&gt; (specifically for usage in input layers) 
and the latter two map to their &lt;a 
href=&quot;https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/NN/Source/ConvolutionFunctions/arm_convolve_HWC_q7_fast.c&quot;&gt;“fast”
 convolution implementation [...]
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/simd-diagram.png&quot; 
alt=&quot;/images/microtvm/post-2020-05-28/simd-diagram.png&quot; 
width=&quot;80%&quot; /&gt;&lt;br /&gt;
+Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication 
microkernel&lt;/p&gt;
+
+&lt;p&gt;Tensorization works by defining a microkernel that can be inserted 
into the innermost loop of a TVM operator.  Using this mechanism, adding SIMD 
support for the Arm board was as simple as defining a microkernel in C (found 
&lt;a 
href=&quot;https://github.com/apache/incubator-tvm/blob/8d7249688771bb6806595931586d95648036f383/topi/python/topi/arm_cpu/cortex_m7/micro_kernel/gemm.py&quot;&gt;here&lt;/a&gt;)
 that mirrored the implementation in their paper.  We defined a schedule that 
[...]
+
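To show the shape of such a microkernel, here is a plain-Python sketch of a 2x2 matmul tile like the one in the CMSIS-NN diagram. On the device this is C using SMLAD-style SIMD intrinsics; this version only mirrors the loop structure, and the function name is illustrative.

```python
# Plain-Python sketch of a 2x2 matmul microkernel like the one tensorized
# for the Cortex-M7.  On the device this would be C with SIMD intrinsics;
# here it only shows the loop structure.  The name is illustrative.

def gemm_2x2_microkernel(A, B, C, i, j, K):
    """Accumulate C[i:i+2, j:j+2] += A[i:i+2, :K] @ B[:K, j:j+2]."""
    acc00 = acc01 = acc10 = acc11 = 0
    for k in range(K):
        a0, a1 = A[i][k], A[i + 1][k]
        b0, b1 = B[k][j], B[k][j + 1]
        # Four multiply-accumulates per k step; on Arm, pairs of int16
        # products would be fused into single SMLAD instructions.
        acc00 += a0 * b0
        acc01 += a0 * b1
        acc10 += a1 * b0
        acc11 += a1 * b1
    C[i][j] += acc00
    C[i][j + 1] += acc01
    C[i + 1][j] += acc10
    C[i + 1][j + 1] += acc11

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
gemm_2x2_microkernel(A, B, C, 0, 0, K=2)
print(C)  # [[19, 22], [43, 50]]
```

TVM's tensorize primitive replaces the innermost loop nest of the generated operator with a call to exactly this kind of fixed-size tile routine.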
+&lt;p&gt;While we were able to use the SIMD microkernel for direct 
convolution, CMSIS-NN uses what they call “partial im2col” as their 
implementation strategy, which offers a tradeoff between performance and memory 
usage.  Instead of manifesting the entire im2col matrix at once, partial im2col 
generates only a few columns at a time.  Then, with each batch, they can send 
the matrix to their SIMD matmul function.&lt;/p&gt;
+
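The memory/performance tradeoff is easiest to see in one dimension. The generator below is an invented illustration of the partial-im2col idea (it is not CMSIS-NN or TVM code): each batch materializes only `batch_size` columns, so peak memory is bounded by the batch size rather than by the full output width.

```python
# Illustrative sketch of "partial im2col" for a 1-D convolution: instead
# of materializing the whole im2col matrix, emit batch_size columns at a
# time, each batch of which would be fed to the SIMD matmul routine.
# All names are invented for illustration.

def partial_im2col_1d(x, kernel_width, batch_size):
    n_cols = len(x) - kernel_width + 1  # one column per output position
    for start in range(0, n_cols, batch_size):
        batch = [x[c:c + kernel_width]
                 for c in range(start, min(start + batch_size, n_cols))]
        yield batch  # peak memory ~ batch_size columns, not n_cols

x = [1, 2, 3, 4, 5]
batches = list(partial_im2col_1d(x, kernel_width=3, batch_size=2))
print(batches)  # [[[1, 2, 3], [2, 3, 4]], [[3, 4, 5]]]
```

Autotuning the batch size then amounts to searching over this one knob, trading matmul efficiency against the working-set size.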
+&lt;p&gt;Our hypothesis was that, among other optimizations, we could find the 
optimal batch size via autotuning.  In practice, we found partial im2col to be 
significantly slower than our direct convolution implementation, so we don’t 
include it in the rest of our results.&lt;/p&gt;
+
+&lt;p&gt;There are certainly other optimizations we could pull from CMSIS-NN 
to close the gap even further:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Batch expansion of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;int8&lt;/code&gt; weights into &lt;code 
class=&quot;highlighter-rouge&quot;&gt;int16&lt;/code&gt;, to cut down on 
duplicate expansion for SIMD&lt;/li&gt;
+  &lt;li&gt;Splitting convolution into 3x3 tiles to reduce padding 
checks&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;But our goal in this blog post is to show the broad strokes of what 
can be done with µTVM.  Even so, it’s not a competition, because CMSIS-NN (and 
any other hand-optimized library) can plug directly into TVM using the &lt;a 
href=&quot;https://tvm.apache.org/docs/dev/relay_bring_your_own_codegen.html&quot;&gt;Bring
 Your Own Codegen framework&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h2 id=&quot;end-to-end&quot;&gt;End-To-End&lt;/h2&gt;
+
+&lt;h3 id=&quot;cifar-10&quot;&gt;CIFAR-10&lt;/h3&gt;
+
+&lt;p&gt;After exploring optimizations for convolution, we set out to measure 
their effects on end-to-end performance.  For the Arm board, we collected 
untuned results, results that were tuned &lt;strong&gt;without&lt;/strong&gt; 
any use of SIMD, results that were tuned &lt;strong&gt;with&lt;/strong&gt; 
SIMD, and results using CMSIS-NN.  For the emulated host device, we only 
collected untuned results and generic tuned results.&lt;/p&gt;
+
+&lt;p&gt;&lt;a 
href=&quot;https://github.com/areusch/microtvm-blogpost-eval&quot;&gt;https://github.com/areusch/microtvm-blogpost-eval&lt;/a&gt;&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot;
 
alt=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot;
 width=&quot;60%&quot; /&gt;&lt;br /&gt;
+&lt;code class=&quot;highlighter-rouge&quot;&gt;int8&lt;/code&gt;-quantized 
CIFAR-10 CNN comparison on an Arm STM32F746NG (re-posted from above)&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png&quot;
 
alt=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png&quot;
 width=&quot;60%&quot; /&gt;&lt;br /&gt;
+&lt;code class=&quot;highlighter-rouge&quot;&gt;int8&lt;/code&gt;-quantized 
CIFAR-10 CNN comparison on µTVM’s emulated host device&lt;/p&gt;
+
+&lt;p&gt;On the Arm STM32-series board, we were able to improve performance by 
~2x compared to the initial untuned operators, and we achieved results much 
closer to CMSIS-NN.  Additionally, we were able to significantly improve 
performance on the host emulated device.  Though the x86 
&lt;strong&gt;&lt;em&gt;numbers&lt;/em&gt;&lt;/strong&gt; don’t mean much, they 
show we can use the same infrastructure (µTVM) to optimize performance on 
vastly different architectures.&lt;/p&gt;
+
+&lt;p&gt;Stay tuned in the future for more end-to-end benchmarks as we scale 
this approach out more broadly.&lt;/p&gt;
+
+&lt;h1 id=&quot;self-hosted-runtime-the-final-frontier&quot;&gt;Self-Hosted 
Runtime: The Final Frontier&lt;/h1&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img 
src=&quot;/images/microtvm/self-hosted-runtime.png&quot; 
alt=&quot;/images/microtvm/self-hosted-runtime.png&quot; width=&quot;80%&quot; 
/&gt;&lt;br /&gt;&lt;/p&gt;
+
+&lt;p&gt;The envisioned µTVM optimization and deployment pipeline&lt;/p&gt;
+
+&lt;p&gt;While end-to-end benchmark results are already obtainable with the 
current runtime, as we demonstrated above, deployment of these models in a 
standalone capacity is still on our roadmap.  The gap is that the 
AutoTVM-oriented runtime currently relies on the host to allocate tensors and 
to schedule function execution.  To be useful at the edge, however, we need a 
pipeline through µTVM that generates a &lt;strong&gt;single&lt;/strong&gt; 
binary to be run on a bare-metal d [...]
+
+&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
+
+&lt;p&gt;MicroTVM for single-kernel optimization is ready 
&lt;strong&gt;today&lt;/strong&gt; and is &lt;em&gt;the&lt;/em&gt; choice for 
that use case.  As we now build out self-hosted deployment support, we hope 
you’re just as excited as we are to make µTVM &lt;em&gt;the&lt;/em&gt; choice 
for model deployment as well.  However, this isn’t just a spectator sport; 
remember: this is all open source!  µTVM is still in its early days, so every 
individual can have a great deal of impact on its  [...]
+
+&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;
+
+&lt;p&gt;None of this work would have been possible, if not for the following 
people:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;a href=&quot;https://tqchen.com/&quot;&gt;Tianqi 
Chen&lt;/a&gt;, for guiding the design and for being a fantastic 
mentor.&lt;/li&gt;
+  &lt;li&gt;&lt;a 
href=&quot;https://homes.cs.washington.edu/~patelp1/&quot;&gt;Pratyush 
Patel&lt;/a&gt;, for collaborating on early prototypes of MicroTVM.&lt;/li&gt;
+  &lt;li&gt;&lt;a href=&quot;https://octoml.ai/&quot;&gt;OctoML&lt;/a&gt;, for 
facilitating the internships where I have been able to go full steam on this 
project.&lt;/li&gt;
+  &lt;li&gt;&lt;a 
href=&quot;https://homes.cs.washington.edu/~moreau/&quot;&gt;Thierry 
Moreau&lt;/a&gt;, for mentoring me during my time at OctoML.&lt;/li&gt;
+  &lt;li&gt;&lt;a 
href=&quot;https://homes.cs.washington.edu/~vegaluis/&quot;&gt;Luis 
Vega&lt;/a&gt;, for teaching me the fundamentals of interacting with 
microcontrollers.&lt;/li&gt;
+  &lt;li&gt;&lt;a 
href=&quot;https://www.linkedin.com/in/themadrasi/?originalSubdomain=uk&quot;&gt;Ramana
 Radhakrishnan&lt;/a&gt;, for supplying the Arm hardware used in our 
experiments and for providing guidance on its usage.&lt;/li&gt;
+&lt;/ul&gt;
+</description>
+                
<link>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</link>
+                
<guid>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</guid>
+                <pubDate>Thu, 04 Jun 2020 00:00:00 -0700</pubDate>
+        </item>
+
+        <item>
                 <title>Bring Your Own Datatypes: Enabling Custom Datatype 
Exploration in TVM</title>
                 <description>&lt;p&gt;In this post, we describe the Bring Your 
Own Datatypes framework, which enables the use of custom datatypes within 
TVM.&lt;/p&gt;
 
diff --git a/sitemap.txt b/sitemap.txt
index 3bf5f42..7e3c273 100644
--- a/sitemap.txt
+++ b/sitemap.txt
@@ -12,6 +12,7 @@ https://tvm.apache.org/sitemap.txt
 https://tvm.apache.org/tags
 https://tvm.apache.org/vta
 
+https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny
 https://tvm.apache.org/2020/05/20/bring-your-own-datatypes
 
https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu
 https://tvm.apache.org/2019/05/30/pytorch-frontend
