This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-tvm-site.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 2330d86  Build at Tue Nov 3 09:02:03 EST 2020
2330d86 is described below

commit 2330d862e2d490be1c9e5633de8b550e14182c52
Author: tqchen <tianqi.tc...@gmail.com>
AuthorDate: Tue Nov 3 09:02:03 2020 -0500

    Build at Tue Nov 3 09:02:03 EST 2020
---
 2017/08/17/tvm-release-announcement.html           |  2 +-
 ...s-with-TVM-A-Depthwise-Convolution-Example.html |  8 +--
 2017/10/06/nnvm-compiler-announcement.html         |  2 +-
 ...s-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html |  4 +-
 2017/11/08/android-rpc-introduction.html           |  2 +-
 2018/01/16/opt-mali-gpu.html                       |  2 +-
 2018/03/12/webgl.html                              |  2 +-
 2018/03/23/nmt-transformer-optimize.html           |  2 +-
 2018/07/12/vta-release-announcement.html           | 12 ++--
 2018/08/10/DLPack-Bridge.html                      |  4 +-
 2018/10/03/auto-opt-all.html                       |  2 +-
 2018/10/09/ml-in-tees.html                         |  2 +-
 2018/12/18/lowprecision-conv.html                  |  6 +-
 2019/01/19/Golang.html                             |  8 +--
 2019/03/18/tvm-apache-announcement.html            |  2 +-
 2019/04/29/opt-cuda-quantized.html                 | 12 ++--
 2019/05/30/pytorch-frontend.html                   |  2 +-
 ...machine-learning-to-webassembly-and-webgpu.html |  2 +-
 2020/06/04/tinyml-how-tvm-is-taming-tiny.html      |  2 +-
 2020/07/14/bert-pytorch-tvm.html                   |  2 +-
 .../15/how-to-bring-your-own-codegen-to-tvm.html   |  2 +-
 2020/09/26/bring-your-own-datatypes.html           |  2 +-
 atom.xml                                           | 76 ++++++++++-----------
 feed.xml                                           | 40 +++++------
 rss.xml                                            | 78 +++++++++++-----------
 25 files changed, 139 insertions(+), 139 deletions(-)

diff --git a/2017/08/17/tvm-release-announcement.html b/2017/08/17/tvm-release-announcement.html index e5ee2d1..9b83eb3 100644 --- a/2017/08/17/tvm-release-announcement.html +++ b/2017/08/17/tvm-release-announcement.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>TVM: An End to End IR Stack for Deploying Deep Learning Workloads on Hardware Platforms </h1> <p class="post-meta"> - <time datetime="2017-08-17T12:00:00-07:00" itemprop="datePublished"> + <time datetime="2017-08-17T15:00:00-04:00" itemprop="datePublished"> Aug
17, 2017 </time> diff --git a/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html b/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html index 5d0fa56..a03a6bf 100644 --- a/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html +++ b/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Optimize Deep Learning GPU Operators with TVM: A Depthwise Convolution Example </h1> <p class="post-meta"> - <time datetime="2017-08-22T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2017-08-22T00:00:00-04:00" itemprop="datePublished"> Aug 22, 2017 </time> @@ -705,9 +705,9 @@ Below is the result with Input = [1, 256, 96, 96], Filter = [256, 1, 3, 3], stri <h2 id="show-me-the-code">Show me the code</h2> <ul> - <li>Declare: <a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/nn/depthwise_conv2d.py">https://github.com/dmlc/tvm/blob/master/topi/python/topi/nn/depthwise_conv2d.py</a></li> - <li>Schedule: <a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/depthwise_conv2d.py">https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/depthwise_conv2d.py</a></li> - <li>Test: <a href="https://github.com/dmlc/tvm/blob/master/topi/recipe/conv/depthwise_conv2d_test.py">https://github.com/dmlc/tvm/blob/master/topi/recipe/conv/depthwise_conv2d_test.py</a></li> + <li>Declare: <a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/nn/depthwise_conv2d.py">https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/nn/depthwise_conv2d.py</a></li> + <li>Schedule: <a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/depthwise_conv2d.py">https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/depthwise_conv2d.py</a></li> + <li>Test: <a 
href="https://github.com/apache/incubator-tvm/blob/main/topi/recipe/conv/depthwise_conv2d_test.py">https://github.com/apache/incubator-tvm/blob/main/topi/recipe/conv/depthwise_conv2d_test.py</a></li> </ul> <h2 id="acknowledgements">Acknowledgements</h2> diff --git a/2017/10/06/nnvm-compiler-announcement.html b/2017/10/06/nnvm-compiler-announcement.html index d7b9c05..d3eb49f 100644 --- a/2017/10/06/nnvm-compiler-announcement.html +++ b/2017/10/06/nnvm-compiler-announcement.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>NNVM Compiler: Open Compiler for AI Frameworks </h1> <p class="post-meta"> - <time datetime="2017-10-06T08:30:00-07:00" itemprop="datePublished"> + <time datetime="2017-10-06T11:30:00-04:00" itemprop="datePublished"> Oct 6, 2017 </time> diff --git a/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html b/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html index eb4caed..1b48741 100644 --- a/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html +++ b/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Bringing AMDGPUs to TVM Stack and NNVM Compiler with ROCm </h1> <p class="post-meta"> - <time datetime="2017-10-30T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2017-10-30T00:00:00-04:00" itemprop="datePublished"> Oct 30, 2017 </time> @@ -204,7 +204,7 @@ TVM prediction top-1: 282 tiger cat</code></pre></figure> <h2 id="a-note-on-performance">A Note on performance</h2> -<p>The current support on ROCm focuses on the functionality coverage. We have already seen promising performance results by simply adopting existing TVM schedules for CUDA backend. For example, you can try running <a href="https://github.com/dmlc/tvm/blob/master/topi/recipe/gemm/cuda_gemm_square.py">the gemm test script</a> in the TVM repository and see the result. 
For two types of cards we tested, the current gemm recipe for square matrix multiplication (not yet specifically optimized f [...] +<p>The current support on ROCm focuses on the functionality coverage. We have already seen promising performance results by simply adopting existing TVM schedules for CUDA backend. For example, you can try running <a href="https://github.com/apache/incubator-tvm/blob/main/topi/recipe/gemm/cuda_gemm_square.py">the gemm test script</a> in the TVM repository and see the result. For two types of cards we tested, the current gemm recipe for square matrix multiplication (not yet specifically o [...] This is already a promising start, as it is very hard to optimize performance to get to peak and we did not yet apply AMD GPU specific optimizations. We are starting to look at performance optimization and we expect more improvement to come.</p> diff --git a/2017/11/08/android-rpc-introduction.html b/2017/11/08/android-rpc-introduction.html index d354c0a..104829e 100644 --- a/2017/11/08/android-rpc-introduction.html +++ b/2017/11/08/android-rpc-introduction.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Remote Profile and Test Deep Learning Cross Compilation on Mobile Phones with TVM RPC </h1> <p class="post-meta"> - <time datetime="2017-11-08T00:00:00-08:00" itemprop="datePublished"> + <time datetime="2017-11-08T00:00:00-05:00" itemprop="datePublished"> Nov 8, 2017 </time> diff --git a/2018/01/16/opt-mali-gpu.html b/2018/01/16/opt-mali-gpu.html index 814ea6e..71d3d86 100644 --- a/2018/01/16/opt-mali-gpu.html +++ b/2018/01/16/opt-mali-gpu.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Optimizing Mobile Deep Learning on ARM GPU with TVM </h1> <p class="post-meta"> - <time datetime="2018-01-16T00:00:00-08:00" itemprop="datePublished"> + <time datetime="2018-01-16T00:00:00-05:00" itemprop="datePublished"> Jan 16, 2018 </time> diff --git a/2018/03/12/webgl.html b/2018/03/12/webgl.html index 81e89ac..db05f52 100644 --- 
a/2018/03/12/webgl.html +++ b/2018/03/12/webgl.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Compiling Deep Learning Models to WebGL with TVM </h1> <p class="post-meta"> - <time datetime="2018-03-12T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2018-03-12T00:00:00-04:00" itemprop="datePublished"> Mar 12, 2018 </time> diff --git a/2018/03/23/nmt-transformer-optimize.html b/2018/03/23/nmt-transformer-optimize.html index 7dd4172..9ec078f 100644 --- a/2018/03/23/nmt-transformer-optimize.html +++ b/2018/03/23/nmt-transformer-optimize.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Bringing TVM into TensorFlow for Optimizing Neural Machine Translation on GPU </h1> <p class="post-meta"> - <time datetime="2018-03-23T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2018-03-23T00:00:00-04:00" itemprop="datePublished"> Mar 23, 2018 </time> diff --git a/2018/07/12/vta-release-announcement.html b/2018/07/12/vta-release-announcement.html index a4b1dd0..7155faa 100644 --- a/2018/07/12/vta-release-announcement.html +++ b/2018/07/12/vta-release-announcement.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>VTA: An Open, Customizable Deep Learning Acceleration Stack </h1> <p class="post-meta"> - <time datetime="2018-07-12T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2018-07-12T00:00:00-04:00" itemprop="datePublished"> Jul 12, 2018 </time> @@ -158,7 +158,7 @@ <p>VTA is more than a standalone accelerator design: it’s an end-to-end solution that includes drivers, a JIT runtime, and an optimizing compiler stack based on TVM. The current release includes a behavioral hardware simulator, as well as the infrastructure to deploy VTA on low-cost FPGA hardware for fast prototyping. By extending the TVM stack with a customizable, and open source deep learning hardware accelerator design, we are exposing a transparent end-to-end deep learning stack from [...] 
-<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_stack.png" alt="image" width="50%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_stack.png" alt="image" width="50%" /></p> <p>The VTA and TVM stack together constitute a blueprint for end-to-end, accelerator-centric deep learning system that can:</p> @@ -213,7 +213,7 @@ The extendability of the compiler stack, combined with the ability to modify the <p>The Vanilla Tensor Accelerator (VTA) is a generic deep learning accelerator built around a GEMM core, which performs dense matrix multiplication at a high computational throughput. The design is inspired by mainstream deep learning accelerators, of the likes of Google’s TPU accelerator. The design adopts decoupled access-execute to hide memory access latency and maximize utilization of compute resources. To a broader extent, VTA can serve as a template deep learning accelerator design, exposing a clean tensor computation abstraction to the compiler stack.</p> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_overview.png" alt="image" width="60%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_overview.png" alt="image" width="60%" /></p> <p>The figure above presents a high-level overview of the VTA hardware organization. VTA is composed of four modules that communicate between each other via FIFO queues and single-writer/single-reader SRAM memory blocks, to allow for task-level pipeline parallelism. The compute module performs both dense linear algebra computation with its GEMM core, and general computation with its tensor ALU. 
@@ -230,7 +230,7 @@ The first approach, which doesn’t require special hardware is to run deep lear This simulator back-end is readily available for developers to experiment with. The second approach relies on an off-the-shelf and low-cost FPGA development board – the <a href="http://www.pynq.io/">Pynq board</a>, which exposes a reconfigurable FPGA fabric and an ARM SoC.</p> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_system.png" alt="image" width="70%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_system.png" alt="image" width="70%" /></p> <p>The VTA release offers a simple compilation and deployment flow of the VTA hardware design and TVM workloads on the Pynq platform, with the help of an RPC server interface. The RPC server handles FPGA reconfiguration tasks and TVM module invocation offloading onto the VTA runtime. @@ -253,7 +253,7 @@ While this platform is meant for prototyping (the 2012 FPGA cannot compete with <p>A popular method used to assess the efficient use of hardware are roofline diagrams: given a hardware design, how efficiently are different workloads utilizing the hardware compute and memory resources. The roofline plot below shows the throughput achieved on different convolution layers of the ResNet-18 inference benchmark. Each layer has a different arithmetic intensity, i.e. compute to data movement ratio. 
In the left half, convolution layers are bandwidth limited, whereas on the right half, they are compute limited.</p> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_roofline.png" alt="image" width="60%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_roofline.png" alt="image" width="60%" /></p> <p>The goal behind designing a hardware architecture, and a compiler stack is to bring each workload as close as possible to the roofline of the target hardware. The roofline plot shows the effects of having the hardware and compiler work together to maximize utilization of the available hardware resources. @@ -262,7 +262,7 @@ The result is an overall higher utilization of the available compute and memory <h3 id="end-to-end-resnet-18-evaluation">End to end ResNet-18 evaluation</h3> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_e2e.png" alt="image" width="60%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_e2e.png" alt="image" width="60%" /></p> <p>A benefit of having a complete compiler stack built for VTA is the ability to run end-to-end workloads. This is compelling in the context of hardware acceleration because we need to understand what performance bottlenecks, and Amdahl limitations stand in the way to obtaining faster performance. The bar plot above shows inference performance with and without offloading the ResNet convolutional layers to the FPGA-based VTA design, on the Pynq board’s ARM Cortex A9 SoC. 
diff --git a/2018/08/10/DLPack-Bridge.html b/2018/08/10/DLPack-Bridge.html index b64eead..0ec196d 100644 --- a/2018/08/10/DLPack-Bridge.html +++ b/2018/08/10/DLPack-Bridge.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Building a Cross-Framework Deep Learning Compiler via DLPack </h1> <p class="post-meta"> - <time datetime="2018-08-10T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2018-08-10T00:00:00-04:00" itemprop="datePublished"> Aug 10, 2018 </time> @@ -262,7 +262,7 @@ found <a href="https://tvm.apache.org/docs//tutorials/optimize/opt_gemm.html">he </code></pre></div></div> <h2 id="under-the-hood-of-the-pytorch-example">Under the hood of the PyTorch Example</h2> -<p>As TVM provides <a href="https://github.com/dmlc/tvm/blob/master/include/tvm/runtime/c_runtime_api.h#L455">functions</a> to convert dlpack tensors to tvm <code class="language-plaintext highlighter-rouge">NDArray</code>s and +<p>As TVM provides <a href="https://github.com/apache/incubator-tvm/blob/main/include/tvm/runtime/c_runtime_api.h#L455">functions</a> to convert dlpack tensors to tvm <code class="language-plaintext highlighter-rouge">NDArray</code>s and vice-versa, so all that is needed is some syntactic sugar by wrapping functions. 
<code class="language-plaintext highlighter-rouge">convert_func</code> is a generic converter for frameworks using tensors with dlpack support, and can be used to implement convenient converters, such as diff --git a/2018/10/03/auto-opt-all.html b/2018/10/03/auto-opt-all.html index 005b8fc..f5f1482 100644 --- a/2018/10/03/auto-opt-all.html +++ b/2018/10/03/auto-opt-all.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Automatic Kernel Optimization for Deep Learning on All Hardware Platforms </h1> <p class="post-meta"> - <time datetime="2018-10-03T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2018-10-03T00:00:00-04:00" itemprop="datePublished"> Oct 3, 2018 </time> diff --git a/2018/10/09/ml-in-tees.html b/2018/10/09/ml-in-tees.html index 85f637d..3838be6 100644 --- a/2018/10/09/ml-in-tees.html +++ b/2018/10/09/ml-in-tees.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Efficient Privacy-Preserving ML Using TVM </h1> <p class="post-meta"> - <time datetime="2018-10-09T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2018-10-09T00:00:00-04:00" itemprop="datePublished"> Oct 9, 2018 </time> diff --git a/2018/12/18/lowprecision-conv.html b/2018/12/18/lowprecision-conv.html index e1fafc9..31738a3 100644 --- a/2018/12/18/lowprecision-conv.html +++ b/2018/12/18/lowprecision-conv.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Automating Generation of Low Precision Deep Learning Operators </h1> <p class="post-meta"> - <time datetime="2018-12-18T00:00:00-08:00" itemprop="datePublished"> + <time datetime="2018-12-18T00:00:00-05:00" itemprop="datePublished"> Dec 18, 2018 </time> @@ -292,8 +292,8 @@ Note: x86 doesn’t support a vectorized popcount for this microarchitecture, so <h2 id="show-me-the-code">Show me the code</h2> <ul> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/nn/bitserial_conv2d.py">TOPI bitserial convolution</a></li> - <li><a 
href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/arm_cpu/bitserial_conv2d.py">TOPI ARM cpu bitserial convolution</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/nn/bitserial_conv2d.py">TOPI bitserial convolution</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/arm_cpu/bitserial_conv2d.py">TOPI ARM cpu bitserial convolution</a></li> </ul> <h2 id="references">References</h2> diff --git a/2019/01/19/Golang.html b/2019/01/19/Golang.html index da4cdbd..87e1e11 100644 --- a/2019/01/19/Golang.html +++ b/2019/01/19/Golang.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>TVM Golang Runtime for Deep Learning Deployment </h1> <p class="post-meta"> - <time datetime="2019-01-19T00:00:00-08:00" itemprop="datePublished"> + <time datetime="2019-01-19T00:00:00-05:00" itemprop="datePublished"> Jan 19, 2019 </time> @@ -293,14 +293,14 @@ For simplicity the error handling is ignored here, but is important in real appl </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">gotvm</code> extends the TVM packed function system to support golang function closures as packed functions. 
-<a href="https://github.com/dmlc/tvm/blob/master/golang/sample">Examples</a> available to register golang +<a href="https://github.com/apache/incubator-tvm/blob/main/golang/sample">Examples</a> available to register golang closure as TVM packed function and invoke the same across programming language barriers.</p> <h2 id="show-me-the-code">Show me the code</h2> <ul> - <li><a href="https://github.com/dmlc/tvm/blob/master/golang/src">Package Source</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/golang/sample">Examples</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/golang/src">Package Source</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/golang/sample">Examples</a></li> </ul> <h2 id="references">References</h2> diff --git a/2019/03/18/tvm-apache-announcement.html b/2019/03/18/tvm-apache-announcement.html index 012b8e3..0e06763 100644 --- a/2019/03/18/tvm-apache-announcement.html +++ b/2019/03/18/tvm-apache-announcement.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>TVM Deep Learning Compiler Joins Apache Software Foundation </h1> <p class="post-meta"> - <time datetime="2019-03-18T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2019-03-18T00:00:00-04:00" itemprop="datePublished"> Mar 18, 2019 </time> diff --git a/2019/04/29/opt-cuda-quantized.html b/2019/04/29/opt-cuda-quantized.html index 8e24619..aebb4c7 100644 --- a/2019/04/29/opt-cuda-quantized.html +++ b/2019/04/29/opt-cuda-quantized.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Automating Optimization of Quantized Deep Learning Models on CUDA </h1> <p class="post-meta"> - <time datetime="2019-04-29T09:00:00-07:00" itemprop="datePublished"> + <time datetime="2019-04-29T12:00:00-04:00" itemprop="datePublished"> Apr 29, 2019 </time> @@ -219,7 +219,7 @@ Figure 2. 
2D convolution with data layout in NCHW4c and weight layout in OIHW4o4 </div> <p></p> -<p>After we have specified the layout of convolution layers, other operators such as <code class="language-plaintext highlighter-rouge">add</code> and activations can automatically adapt to the chosen layout during the <a href="https://github.com/dmlc/tvm/blob/master/src/relay/pass/alter_op_layout.cc">AlterOpLayout</a> pass in Relay. +<p>After we have specified the layout of convolution layers, other operators such as <code class="language-plaintext highlighter-rouge">add</code> and activations can automatically adapt to the chosen layout during the <a href="https://github.com/apache/incubator-tvm/blob/main/src/relay/pass/alter_op_layout.cc">AlterOpLayout</a> pass in Relay. The layout transformation of the weight can be precomputed offline. Therefore, we can run the whole model in the same layout without extra overhead.</p> <h2 id="designing-search-space-for-automatic-optimization">Designing Search Space for Automatic Optimization</h2> @@ -280,10 +280,10 @@ We show that automatic optimization in TVM makes it easy and flexible to support <h1 id="show-me-the-code">Show Me the Code</h1> <ul> <li><a href="https://github.com/vinx13/tvm-cuda-int8-benchmark">Benchmark</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/conv2d_int8.py">CUDA int8 conv2d</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/group_conv2d_nchw.py">CUDA int8 group conv2d</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/dense.py">CUDA int8 dense</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/tensor_intrin.py">Tensor intrinsics declaration</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/conv2d_int8.py">CUDA int8 conv2d</a></li> + <li><a 
href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/group_conv2d_nchw.py">CUDA int8 group conv2d</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/dense.py">CUDA int8 dense</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/tensor_intrin.py">Tensor intrinsics declaration</a></li> </ul> <h1 id="bio--acknowledgement">Bio & Acknowledgement</h1> diff --git a/2019/05/30/pytorch-frontend.html b/2019/05/30/pytorch-frontend.html index fc95ffa..4f1ba30 100644 --- a/2019/05/30/pytorch-frontend.html +++ b/2019/05/30/pytorch-frontend.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Integrating TVM into PyTorch </h1> <p class="post-meta"> - <time datetime="2019-05-30T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2019-05-30T00:00:00-04:00" itemprop="datePublished"> May 30, 2019 </time> diff --git a/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html b/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html index 0b08a29..d4e4ec8 100644 --- a/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html +++ b/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Compiling Machine Learning to WASM and WebGPU with Apache TVM </h1> <p class="post-meta"> - <time datetime="2020-05-14T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2020-05-14T00:00:00-04:00" itemprop="datePublished"> May 14, 2020 </time> diff --git a/2020/06/04/tinyml-how-tvm-is-taming-tiny.html b/2020/06/04/tinyml-how-tvm-is-taming-tiny.html index 586ccf4..08c5e67 100644 --- a/2020/06/04/tinyml-how-tvm-is-taming-tiny.html +++ b/2020/06/04/tinyml-how-tvm-is-taming-tiny.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>TinyML - How TVM is Taming Tiny </h1> <p class="post-meta"> - <time datetime="2020-06-04T00:00:00-07:00" itemprop="datePublished"> + <time 
datetime="2020-06-04T00:00:00-04:00" itemprop="datePublished"> Jun 4, 2020 </time> diff --git a/2020/07/14/bert-pytorch-tvm.html b/2020/07/14/bert-pytorch-tvm.html index 2b63cf0..43cc791 100644 --- a/2020/07/14/bert-pytorch-tvm.html +++ b/2020/07/14/bert-pytorch-tvm.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Bridging PyTorch and TVM </h1> <p class="post-meta"> - <time datetime="2020-07-14T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2020-07-14T00:00:00-04:00" itemprop="datePublished"> Jul 14, 2020 </time> diff --git a/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html b/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html index c92704c..155ea18 100644 --- a/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html +++ b/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>How to Bring Your Own Codegen to TVM </h1> <p class="post-meta"> - <time datetime="2020-07-15T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2020-07-15T00:00:00-04:00" itemprop="datePublished"> Jul 15, 2020 </time> diff --git a/2020/09/26/bring-your-own-datatypes.html b/2020/09/26/bring-your-own-datatypes.html index 22bcf89..0486f82 100644 --- a/2020/09/26/bring-your-own-datatypes.html +++ b/2020/09/26/bring-your-own-datatypes.html @@ -140,7 +140,7 @@ <div class="span14 w-100"> <h1>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in TVM </h1> <p class="post-meta"> - <time datetime="2020-09-26T00:00:00-07:00" itemprop="datePublished"> + <time datetime="2020-09-26T00:00:00-04:00" itemprop="datePublished"> Sep 26, 2020 </time> diff --git a/atom.xml b/atom.xml index a68c2a3..3b07147 100644 --- a/atom.xml +++ b/atom.xml @@ -4,7 +4,7 @@ <title>TVM</title> <link href="https://tvm.apache.org" rel="self"/> <link href="https://tvm.apache.org"/> - <updated>2020-11-02T16:31:02-08:00</updated> + <updated>2020-11-03T09:01:59-05:00</updated> <id>https://tvm.apache.org</id> <author> <name></name> @@ -15,7 +15,7 
@@ <entry> <title>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in TVM</title> <link href="https://tvm.apache.org/2020/09/26/bring-your-own-datatypes"/> - <updated>2020-09-26T00:00:00-07:00</updated> + <updated>2020-09-26T00:00:00-04:00</updated> <id>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</id> <content type="html"><p>In this post, we describe the Bring Your Own Datatypes framework, which enables the use of custom datatypes within TVM.</p> @@ -308,7 +308,7 @@ For more documentation about the Bring Your Own Datatypes framework <entry> <title>How to Bring Your Own Codegen to TVM</title> <link href="https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm"/> - <updated>2020-07-15T00:00:00-07:00</updated> + <updated>2020-07-15T00:00:00-04:00</updated> <id>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</id> <content type="html"><p>To free data scientists from worrying about the performance when developing a new model, hardware backend providers (e.g., Intel, NVIDIA, ARM, etc) either provide kernel libraries such as cuBLAS or cuDNN with many commonly used deep learning kernels, or provide frameworks such as DNNL or TensorRT with a graph engine to let users describe their models in a certain way to achieve high performance. In addition, emerging deep learning accelerators also have t [...] @@ -787,7 +787,7 @@ Figure 4: After Graph Partitioning. 
<entry> <title>Bridging PyTorch and TVM</title> <link href="https://tvm.apache.org/2020/07/14/bert-pytorch-tvm"/> - <updated>2020-07-14T00:00:00-07:00</updated> + <updated>2020-07-14T00:00:00-04:00</updated> <id>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</id> <content type="html"> <p>(A more code-heavy variant is crossposted on the more PyTorch affine <a href="https://lernapparat.de/transformers-pytorch-tvm/">Lernapparat</a>, @@ -1310,7 +1310,7 @@ He is a PyTorch core developer and co-authored <a href="https://www.mann <entry> <title>TinyML - How TVM is Taming Tiny</title> <link href="https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny"/> - <updated>2020-06-04T00:00:00-07:00</updated> + <updated>2020-06-04T00:00:00-04:00</updated> <id>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</id> <content type="html"> <p><img src="/images/microtvm/logo.png" alt="microTVM logo" width="30%" /><br /></p> @@ -1619,7 +1619,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel</ <entry> <title>Compiling Machine Learning to WASM and WebGPU with Apache TVM</title> <link href="https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu"/> - <updated>2020-05-14T00:00:00-07:00</updated> + <updated>2020-05-14T00:00:00-04:00</updated> <id>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</id> <content type="html"><p><strong>TLDR</strong></p> @@ -1706,7 +1706,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel</ <entry> <title>Integrating TVM into PyTorch</title> <link href="https://tvm.apache.org/2019/05/30/pytorch-frontend"/> - <updated>2019-05-30T00:00:00-07:00</updated> + <updated>2019-05-30T00:00:00-04:00</updated> <id>https://tvm.apache.org/2019/05/30/pytorch-frontend</id> <content type="html"><p>As TVM continuously demonstrates improvements to the efficiency of deep learning execution, it has become clear that PyTorch stands to 
benefit from directly leveraging the compiler stack. @@ -1808,7 +1808,7 @@ relay_graph = torch_tvm.to_relay(mul, inputs) <entry> <title>Automating Optimization of Quantized Deep Learning Models on CUDA</title> <link href="https://tvm.apache.org/2019/04/29/opt-cuda-quantized"/> - <updated>2019-04-29T09:00:00-07:00</updated> + <updated>2019-04-29T12:00:00-04:00</updated> <id>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</id> <content type="html"><p>Deep learning has been successfully applied to a variety of tasks. On real-time scenarios such as inference on autonomous vehicles, the inference speed of the model is critical. @@ -1877,7 +1877,7 @@ Figure 2. 2D convolution with data layout in NCHW4c and weight layout in OIHW4o4 </div> <p></p> -<p>After we have specified the layout of convolution layers, other operators such as <code class="language-plaintext highlighter-rouge">add</code> and activations can automatically adapt to the chosen layout during the <a href="https://github.com/dmlc/tvm/blob/master/src/relay/pass/alter_op_layout.cc">AlterOpLayout</a> pass in Relay. +<p>After we have specified the layout of convolution layers, other operators such as <code class="language-plaintext highlighter-rouge">add</code> and activations can automatically adapt to the chosen layout during the <a href="https://github.com/apache/incubator-tvm/blob/main/src/relay/pass/alter_op_layout.cc">AlterOpLayout</a> pass in Relay. The layout transformation of the weight can be precomputed offline. 
Therefore, we can run the whole model in the same layout without extra overhead.</p> <h2 id="designing-search-space-for-automatic-optimization">Designing Search Space for Automatic Optimization</h2> @@ -1938,10 +1938,10 @@ We show that automatic optimization in TVM makes it easy and flexible to support <h1 id="show-me-the-code">Show Me the Code</h1> <ul> <li><a href="https://github.com/vinx13/tvm-cuda-int8-benchmark">Benchmark</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/conv2d_int8.py">CUDA int8 conv2d</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/group_conv2d_nchw.py">CUDA int8 group conv2d</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/dense.py">CUDA int8 dense</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/tensor_intrin.py">Tensor intrinsics declaration</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/conv2d_int8.py">CUDA int8 conv2d</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/group_conv2d_nchw.py">CUDA int8 group conv2d</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/dense.py">CUDA int8 dense</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/tensor_intrin.py">Tensor intrinsics declaration</a></li> </ul> <h1 id="bio--acknowledgement">Bio &amp; Acknowledgement</h1> @@ -1952,7 +1952,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support <entry> <title>TVM Deep Learning Compiler Joins Apache Software Foundation</title> <link href="https://tvm.apache.org/2019/03/18/tvm-apache-announcement"/> - <updated>2019-03-18T00:00:00-07:00</updated> + <updated>2019-03-18T00:00:00-04:00</updated> <id>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</id> <content type="html"><p>There is an 
increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms – such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) – requires significant manual effort.</p> @@ -1975,7 +1975,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support <entry> <title>TVM Golang Runtime for Deep Learning Deployment</title> <link href="https://tvm.apache.org/2019/01/19/Golang"/> - <updated>2019-01-19T00:00:00-08:00</updated> + <updated>2019-01-19T00:00:00-05:00</updated> <id>https://tvm.apache.org/2019/01/19/Golang</id> <content type="html"><h2 id="introduction">Introduction</h2> @@ -2118,14 +2118,14 @@ For simplicity the error handling is ignored here, but is important in real appl </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">gotvm</code> extends the TVM packed function system to support golang function closures as packed functions. 
-<a href="https://github.com/dmlc/tvm/blob/master/golang/sample">Examples</a> available to register golang +<a href="https://github.com/apache/incubator-tvm/blob/main/golang/sample">Examples</a> available to register golang closure as TVM packed function and invoke the same across programming language barriers.</p> <h2 id="show-me-the-code">Show me the code</h2> <ul> - <li><a href="https://github.com/dmlc/tvm/blob/master/golang/src">Package Source</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/golang/sample">Examples</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/golang/src">Package Source</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/golang/sample">Examples</a></li> </ul> <h2 id="references">References</h2> @@ -2145,7 +2145,7 @@ closure as TVM packed function and invoke the same across programming language b <entry> <title>Automating Generation of Low Precision Deep Learning Operators</title> <link href="https://tvm.apache.org/2018/12/18/lowprecision-conv"/> - <updated>2018-12-18T00:00:00-08:00</updated> + <updated>2018-12-18T00:00:00-05:00</updated> <id>https://tvm.apache.org/2018/12/18/lowprecision-conv</id> <content type="html"><p>As deep learning models grow larger and more complex, deploying them on low powered phone and IoT devices becomes challenging because of their limited compute and energy budgets. 
A recent trend @@ -2287,8 +2287,8 @@ Note: x86 doesn’t support a vectorized popcount for this microarchitecture, so <h2 id="show-me-the-code">Show me the code</h2> <ul> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/nn/bitserial_conv2d.py">TOPI bitserial convolution</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/arm_cpu/bitserial_conv2d.py">TOPI ARM cpu bitserial convolution</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/nn/bitserial_conv2d.py">TOPI bitserial convolution</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/arm_cpu/bitserial_conv2d.py">TOPI ARM cpu bitserial convolution</a></li> </ul> <h2 id="references">References</h2> @@ -2306,7 +2306,7 @@ Note: x86 doesn’t support a vectorized popcount for this microarchitecture, so <entry> <title>Efficient Privacy-Preserving ML Using TVM</title> <link href="https://tvm.apache.org/2018/10/09/ml-in-tees"/> - <updated>2018-10-09T00:00:00-07:00</updated> + <updated>2018-10-09T00:00:00-04:00</updated> <id>https://tvm.apache.org/2018/10/09/ml-in-tees</id> <content type="html"><p>This post describes Myelin, a framework for privacy-preserving machine learning in trusted hardware enclaves, and how TVM makes Myelin fast. 
The key idea is that TVM, unlike other popular ML frameworks, compiles models into lightweight, optimized, and dependency-free libraries which can fit into resource constrained enclaves.</p> @@ -2422,7 +2422,7 @@ His research interest is in the general domain of ML on shared private data, but <entry> <title>Automatic Kernel Optimization for Deep Learning on All Hardware Platforms</title> <link href="https://tvm.apache.org/2018/10/03/auto-opt-all"/> - <updated>2018-10-03T00:00:00-07:00</updated> + <updated>2018-10-03T00:00:00-04:00</updated> <id>https://tvm.apache.org/2018/10/03/auto-opt-all</id> <content type="html"><p>Optimizing the performance of deep neural network on a diverse range of hardware platforms is still a hard problem for AI developers. In terms of system support, we are facing a many-to-many problem here: @@ -2816,7 +2816,7 @@ for inference deployment. TVM just provides such a solution.</p> <entry> <title>Building a Cross-Framework Deep Learning Compiler via DLPack</title> <link href="https://tvm.apache.org/2018/08/10/DLPack-Bridge"/> - <updated>2018-08-10T00:00:00-07:00</updated> + <updated>2018-08-10T00:00:00-04:00</updated> <id>https://tvm.apache.org/2018/08/10/DLPack-Bridge</id> <content type="html"><p>Deep learning frameworks such as Tensorflow, PyTorch, and ApacheMxNet provide a powerful toolbox for quickly prototyping and deploying deep learning models. @@ -2928,7 +2928,7 @@ found <a href="https://tvm.apache.org/docs//tutorials/optimize/opt_gemm. 
</code></pre></div></div> <h2 id="under-the-hood-of-the-pytorch-example">Under the hood of the PyTorch Example</h2> -<p>As TVM provides <a href="https://github.com/dmlc/tvm/blob/master/include/tvm/runtime/c_runtime_api.h#L455">functions</a> to convert dlpack tensors to tvm <code class="language-plaintext highlighter-rouge">NDArray</code>s and +<p>As TVM provides <a href="https://github.com/apache/incubator-tvm/blob/main/include/tvm/runtime/c_runtime_api.h#L455">functions</a> to convert dlpack tensors to tvm <code class="language-plaintext highlighter-rouge">NDArray</code>s and vice-versa, so all that is needed is some syntactic sugar by wrapping functions. <code class="language-plaintext highlighter-rouge">convert_func</code> is a generic converter for frameworks using tensors with dlpack support, and can be used to implement convenient converters, such as @@ -2955,7 +2955,7 @@ support, and can be used to implement convenient converters, such as <entry> <title>VTA: An Open, Customizable Deep Learning Acceleration Stack </title> <link href="https://tvm.apache.org/2018/07/12/vta-release-announcement"/> - <updated>2018-07-12T00:00:00-07:00</updated> + <updated>2018-07-12T00:00:00-04:00</updated> <id>https://tvm.apache.org/2018/07/12/vta-release-announcement</id> <content type="html"><p style="text-align: center">Thierry Moreau(VTA architect), Tianqi Chen(TVM stack), Ziheng Jiang†(graph compilation), Luis Vega(cloud deployment)</p> <p style="text-align: center">Advisors: Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy</p> @@ -2967,7 +2967,7 @@ support, and can be used to implement convenient converters, such as <p>VTA is more than a standalone accelerator design: it’s an end-to-end solution that includes drivers, a JIT runtime, and an optimizing compiler stack based on TVM. The current release includes a behavioral hardware simulator, as well as the infrastructure to deploy VTA on low-cost FPGA hardware for fast prototyping. 
By extending the TVM stack with a customizable, and open source deep learning hardware accelerator design, we are exposing a transparent end-to-end deep learning stac [...] -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_stack.png" alt="image" width="50%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_stack.png" alt="image" width="50%" /></p> <p>The VTA and TVM stack together constitute a blueprint for end-to-end, accelerator-centric deep learning system that can:</p> @@ -3022,7 +3022,7 @@ The extendability of the compiler stack, combined with the ability to modify the <p>The Vanilla Tensor Accelerator (VTA) is a generic deep learning accelerator built around a GEMM core, which performs dense matrix multiplication at a high computational throughput. The design is inspired by mainstream deep learning accelerators, of the likes of Google’s TPU accelerator. The design adopts decoupled access-execute to hide memory access latency and maximize utilization of compute resources. To a broader extent, VTA can serve as a template deep learning accelerator design, exposing a clean tensor computation abstraction to the compiler stack.</p> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_overview.png" alt="image" width="60%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_overview.png" alt="image" width="60%" /></p> <p>The figure above presents a high-level overview of the VTA hardware organization. VTA is composed of four modules that communicate between each other via FIFO queues and single-writer/single-reader SRAM memory blocks, to allow for task-level pipeline parallelism. 
The compute module performs both dense linear algebra computation with its GEMM core, and general computation with its tensor ALU. @@ -3039,7 +3039,7 @@ The first approach, which doesn’t require special hardware is to run deep lear This simulator back-end is readily available for developers to experiment with. The second approach relies on an off-the-shelf and low-cost FPGA development board – the <a href="http://www.pynq.io/">Pynq board</a>, which exposes a reconfigurable FPGA fabric and an ARM SoC.</p> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_system.png" alt="image" width="70%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_system.png" alt="image" width="70%" /></p> <p>The VTA release offers a simple compilation and deployment flow of the VTA hardware design and TVM workloads on the Pynq platform, with the help of an RPC server interface. The RPC server handles FPGA reconfiguration tasks and TVM module invocation offloading onto the VTA runtime. @@ -3062,7 +3062,7 @@ While this platform is meant for prototyping (the 2012 FPGA cannot compete with <p>A popular method used to assess the efficient use of hardware are roofline diagrams: given a hardware design, how efficiently are different workloads utilizing the hardware compute and memory resources. The roofline plot below shows the throughput achieved on different convolution layers of the ResNet-18 inference benchmark. Each layer has a different arithmetic intensity, i.e. compute to data movement ratio. 
In the left half, convolution layers are bandwidth limited, whereas on the right half, they are compute limited.</p> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_roofline.png" alt="image" width="60%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_roofline.png" alt="image" width="60%" /></p> <p>The goal behind designing a hardware architecture, and a compiler stack is to bring each workload as close as possible to the roofline of the target hardware. The roofline plot shows the effects of having the hardware and compiler work together to maximize utilization of the available hardware resources. @@ -3071,7 +3071,7 @@ The result is an overall higher utilization of the available compute and memory <h3 id="end-to-end-resnet-18-evaluation">End to end ResNet-18 evaluation</h3> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_e2e.png" alt="image" width="60%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_e2e.png" alt="image" width="60%" /></p> <p>A benefit of having a complete compiler stack built for VTA is the ability to run end-to-end workloads. This is compelling in the context of hardware acceleration because we need to understand what performance bottlenecks, and Amdahl limitations stand in the way to obtaining faster performance. The bar plot above shows inference performance with and without offloading the ResNet convolutional layers to the FPGA-based VTA design, on the Pynq board’s ARM Cortex A9 SoC. 
@@ -3097,7 +3097,7 @@ This kind of high-level visibility is essential to system designers who want to <entry> <title>Bringing TVM into TensorFlow for Optimizing Neural Machine Translation on GPU</title> <link href="https://tvm.apache.org/2018/03/23/nmt-transformer-optimize"/> - <updated>2018-03-23T00:00:00-07:00</updated> + <updated>2018-03-23T00:00:00-04:00</updated> <id>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</id> <content type="html"><h2 id="author">Author</h2> @@ -3363,7 +3363,7 @@ C = tvm.compute( <entry> <title>Compiling Deep Learning Models to WebGL with TVM</title> <link href="https://tvm.apache.org/2018/03/12/webgl"/> - <updated>2018-03-12T00:00:00-07:00</updated> + <updated>2018-03-12T00:00:00-04:00</updated> <id>https://tvm.apache.org/2018/03/12/webgl</id> <content type="html"><p>Now TVM comes with a brand-new OpenGL/WebGL backend! This blog post explains what it is, and what you can achieve with it.</p> @@ -3479,7 +3479,7 @@ optimizations into the TVM stack.</p> <entry> <title>Optimizing Mobile Deep Learning on ARM GPU with TVM</title> <link href="https://tvm.apache.org/2018/01/16/opt-mali-gpu"/> - <updated>2018-01-16T00:00:00-08:00</updated> + <updated>2018-01-16T00:00:00-05:00</updated> <id>https://tvm.apache.org/2018/01/16/opt-mali-gpu</id> <content type="html"><p>With the great success of deep learning, the demand for deploying deep neural networks to mobile devices is growing rapidly. 
@@ -4053,7 +4053,7 @@ advice and <a href="https://github.com/yzhliu">Yizhi Liu</a&g <entry> <title>Remote Profile and Test Deep Learning Cross Compilation on Mobile Phones with TVM RPC</title> <link href="https://tvm.apache.org/2017/11/08/android-rpc-introduction"/> - <updated>2017-11-08T00:00:00-08:00</updated> + <updated>2017-11-08T00:00:00-05:00</updated> <id>https://tvm.apache.org/2017/11/08/android-rpc-introduction</id> <content type="html"><p>TVM stack is an end to end compilation stack to deploy deep learning workloads to all hardware backends. Thanks to the NNVM compiler support of TVM stack, we can now directly compile descriptions from deep learning frameworks and compile them to bare metal code. @@ -4281,7 +4281,7 @@ make jvminstall <entry> <title>Bringing AMDGPUs to TVM Stack and NNVM Compiler with ROCm</title> <link href="https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm"/> - <updated>2017-10-30T00:00:00-07:00</updated> + <updated>2017-10-30T00:00:00-04:00</updated> <id>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</id> <content type="html"><p style="text-align: center">Aditya Atluri, Advanced Micro Devices, Inc.</p> <p style="text-align: center">Masahiro Masuda, Ziosoft, Inc.</p> @@ -4339,7 +4339,7 @@ TVM prediction top-1: 282 tiger cat</code></pre></figure> <h2 id="a-note-on-performance">A Note on performance</h2> -<p>The current support on ROCm focuses on the functionality coverage. We have already seen promising performance results by simply adopting existing TVM schedules for CUDA backend. For example, you can try running <a href="https://github.com/dmlc/tvm/blob/master/topi/recipe/gemm/cuda_gemm_square.py">the gemm test script</a> in the TVM repository and see the result. For two types of cards we tested, the current gemm recipe for square matrix multiplication (not [...] +<p>The current support on ROCm focuses on the functionality coverage. 
We have already seen promising performance results by simply adopting existing TVM schedules for CUDA backend. For example, you can try running <a href="https://github.com/apache/incubator-tvm/blob/main/topi/recipe/gemm/cuda_gemm_square.py">the gemm test script</a> in the TVM repository and see the result. For two types of cards we tested, the current gemm recipe for square matrix multiplica [...] This is already a promising start, as it is very hard to optimize performance to get to peak and we did not yet apply AMD GPU specific optimizations. We are starting to look at performance optimization and we expect more improvement to come.</p> @@ -4507,7 +4507,7 @@ BB0_6: <entry> <title>NNVM Compiler: Open Compiler for AI Frameworks</title> <link href="https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement"/> - <updated>2017-10-06T08:30:00-07:00</updated> + <updated>2017-10-06T11:30:00-04:00</updated> <id>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</id> <content type="html"><p style="text-align: center">Paul G. Allen School of Computer Science &amp; Engineering, University of Washington</p> <p style="text-align: center">Amazon Web Service AI team</p> diff --git a/feed.xml b/feed.xml index 0f56d75..2406202 100644 --- a/feed.xml +++ b/feed.xml @@ -1,4 +1,4 @@ -<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.1.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2020-11-02T16:31:02-08:00</updated><id>/feed.xml</id><title type="html">TVM</title><author><name>{"name"=>nil}</name></author><entry><title type="html">Bring Your Own Datatypes: Enabling Custom Datatype [...] 
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.1.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2020-11-03T09:01:59-05:00</updated><id>/feed.xml</id><title type="html">TVM</title><author><name>{"name"=>nil}</name></author><entry><title type="html">Bring Your Own Datatypes: Enabling Custom Datatype [...] <h2 id="introduction">Introduction</h2> @@ -282,7 +282,7 @@ For more documentation about the Bring Your Own Datatypes framework <p><a href="https://posithub.org/docs/BeatingFloatingPoint.pdf" target="_blank">Beating Floating Point at its Own Game: Posit Arithmetic</a> <a href="#fnref:posit" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> -</div></content><author><name>Gus Smith, Andrew Liu</name></author><summary type="html">In this post, we describe the Bring Your Own Datatypes framework, which enables the use of custom datatypes within TVM.</summary></entry><entry><title type="html">How to Bring Your Own Codegen to TVM</title><link href="/2020/07/15/how-to-bring-your-own-codegen-to-tvm" rel="alternate" type="text/html" title="How to Bring Your Own Codegen to TVM" /><published>2020-07-15T00:00:00-07:00</published>< [...] +</div></content><author><name>Gus Smith, Andrew Liu</name></author><summary type="html">In this post, we describe the Bring Your Own Datatypes framework, which enables the use of custom datatypes within TVM.</summary></entry><entry><title type="html">How to Bring Your Own Codegen to TVM</title><link href="/2020/07/15/how-to-bring-your-own-codegen-to-tvm" rel="alternate" type="text/html" title="How to Bring Your Own Codegen to TVM" /><published>2020-07-15T00:00:00-04:00</published>< [...] <p>However, users have to learn a new programming interface when they attempt to work on a new kernel library or a device. 
As a result, the demand for a unified programming interface becomes more and more important to let all users and hardware backend providers stand on the same page.</p> @@ -751,7 +751,7 @@ Figure 4: After Graph Partitioning. <h2 id="acknowledgment">Acknowledgment</h2> -<p>We would like to thank our colleague Animesh Jain for valuable discussions in the framework design; Tianqi Chen and Jared Roesch from OctoML for system design discussions and prototyping; Masahiro Masuda from the TVM community to help code review and improve the DNNL integration. We would also like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and Luke Hutton from ARM, U.K. for contributing several helpful ideas, related Relay passes, and the Arm Compute Li [...] +<p>We would like to thank our colleague Animesh Jain for valuable discussions in the framework design; Tianqi Chen and Jared Roesch from OctoML for system design discussions and prototyping; Masahiro Masuda from the TVM community to help code review and improve the DNNL integration. We would also like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and Luke Hutton from ARM, U.K. for contributing several helpful ideas, related Relay passes, and the Arm Compute Li [...] the Jupyter Notebook to follow along is on <a href="https://github.com/t-vi/pytorch-tvmisc/tree/master/transformers-pytorch-tvm/">github</a>.)</p> <p>Some of the most intriguing applications of Artificial Intelligence have been in Natural Language Processing. @@ -1264,7 +1264,7 @@ one would want to re-do cheap computation, most prominently point-wise computati <h1 id="author">Author</h1> <p><a href="https://lernapparat.de/">Thomas Viehmann</a> is the founder of <a href="https://mathinf.eu/">MathInf GmbH</a>, Munich, Germany, a boutique training and consultancy firm focusing on Machine Learning and PyTorch. 
-He is a PyTorch core developer and co-authored <a href="https://www.manning.com/books/deep-learning-with-pytorch">Deep Learning with PyTorch</a>, which currently available as <a href="https://pytorch.org/deep-learning-with-pytorch">free download from the PyTorch website</a>.</p></content><author><name>Thomas Viehmann, MathInf GmbH</name></author><summary type="html"></summary></entry><entry><title type="html">TinyML - How TVM is Taming Ti [...] +He is a PyTorch core developer and co-authored <a href="https://www.manning.com/books/deep-learning-with-pytorch">Deep Learning with PyTorch</a>, which currently available as <a href="https://pytorch.org/deep-learning-with-pytorch">free download from the PyTorch website</a>.</p></content><author><name>Thomas Viehmann, MathInf GmbH</name></author><summary type="html"></summary></entry><entry><title type="html">TinyML - How TVM is Taming Ti [...] <p>The proliferation of low-cost, AI-powered consumer devices has led to widespread interest in “bare-metal” (low-power, often without an operating system) devices among ML researchers and practitioners. While it is already possible for experts to run <em>some</em> models on <em>some</em> bare-metal devices, optimizing models for diverse sets of devices is challenging, often requiring manually optimized device-specific libraries. And for those platforms wi [...] 
@@ -1563,7 +1563,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel</ <li><a href="https://homes.cs.washington.edu/~moreau/">Thierry Moreau</a>, for mentoring me during my time at OctoML.</li> <li><a href="https://homes.cs.washington.edu/~vegaluis/">Luis Vega</a>, for teaching me the fundamentals of interacting with microcontrollers.</li> <li><a href="https://www.linkedin.com/in/themadrasi/?originalSubdomain=uk">Ramana Radhakrishnan</a>, for supplying the Arm hardware used in our experiments and for providing guidance on its usage.</li> -</ul></content><author><name>Logan Weber and Andrew Reusch, OctoML</name></author><summary type="html"></summary></entry><entry><title type="html">Compiling Machine Learning to WASM and WebGPU with Apache TVM</title><link href="/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu" rel="alternate" type="text/html" title="Compiling Machine Learning to WASM and WebGPU with Apache TVM" /><published>2020-05-14T00:00:00-07:00</published><updated>2020-05-14T00:00:00-07:00</upd [...] +</ul></content><author><name>Logan Weber and Andrew Reusch, OctoML</name></author><summary type="html"></summary></entry><entry><title type="html">Compiling Machine Learning to WASM and WebGPU with Apache TVM</title><link href="/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu" rel="alternate" type="text/html" title="Compiling Machine Learning to WASM and WebGPU with Apache TVM" /><published>2020-05-14T00:00:00-04:00</published><updated>2020-05-14T00:00:00-04:00</upd [...] <p>We introduced support for WASM and WebGPU to the Apache TVM deep learning compiler. 
Our experiments shows that TVM’s WebGPU backend can get <strong>close to native</strong> <strong>GPU performance</strong> when deploying models to the web.</p> @@ -1641,7 +1641,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel</ <h2 id="acknowledgement">Acknowledgement</h2> -<p>We would like to thank the emscripten project for providing the WASM compilation infrastructures as well as the JS library support on the web. We would also like to thank the WebGPU community for various helpful discussions. Thanks to Fletcher Haynes for valuable feedbacks to the post.</p></content><author><name>Tianqi Chen and Jared Roesch, OctoML</name></author><summary type="html">TLDR</summary></entry><entry><title type="html">Integrating TVM into PyTorch</title><link [...] +<p>We would like to thank the emscripten project for providing the WASM compilation infrastructures as well as the JS library support on the web. We would also like to thank the WebGPU community for various helpful discussions. Thanks to Fletcher Haynes for valuable feedbacks to the post.</p></content><author><name>Tianqi Chen and Jared Roesch, OctoML</name></author><summary type="html">TLDR</summary></entry><entry><title type="html">Integrating TVM into PyTorch</title><link [...] it has become clear that PyTorch stands to benefit from directly leveraging the compiler stack. A major tenet of PyTorch is providing seamless and robust integrations that don’t get in the user’s way. To that end, PyTorch now has an official TVM-based backend, <a href="https://github.com/pytorch/tvm">torch_tvm</a>.</p> @@ -1733,7 +1733,7 @@ def mul(a, b, c): # via script relay_graph = torch_tvm.to_relay(mul, inputs) -</code></pre></div></div></content><author><name>Bram Wasti</name></author><summary type="html">As TVM continuously demonstrates improvements to the efficiency of deep learning execution, it has become clear that PyTorch stands to benefit from directly leveraging the compiler stack. 
A major tenet of PyTorch is providing seamless and robust integrations that don’t get in the user’s way. To that end, PyTorch now has an official TVM-based backend, torch_tvm.</summary [...] +</code></pre></div></div></content><author><name>Bram Wasti</name></author><summary type="html">As TVM continuously demonstrates improvements to the efficiency of deep learning execution, it has become clear that PyTorch stands to benefit from directly leveraging the compiler stack. A major tenet of PyTorch is providing seamless and robust integrations that don’t get in the user’s way. To that end, PyTorch now has an official TVM-based backend, torch_tvm.</summary [...] On real-time scenarios such as inference on autonomous vehicles, the inference speed of the model is critical. Network quantization is an effective approach to accelerating deep learning models. In quantized models, both data and model parameters are represented with low precision data types such as <code class="language-plaintext highlighter-rouge">int8</code> and <code class="language-plaintext highlighter-rouge">float16</code>. @@ -1800,7 +1800,7 @@ Figure 2. 2D convolution with data layout in NCHW4c and weight layout in OIHW4o4 </div> <p></p> -<p>After we have specified the layout of convolution layers, other operators such as <code class="language-plaintext highlighter-rouge">add</code> and activations can automatically adapt to the chosen layout during the <a href="https://github.com/dmlc/tvm/blob/master/src/relay/pass/alter_op_layout.cc">AlterOpLayout</a> pass in Relay. +<p>After we have specified the layout of convolution layers, other operators such as <code class="language-plaintext highlighter-rouge">add</code> and activations can automatically adapt to the chosen layout during the <a href="https://github.com/apache/incubator-tvm/blob/main/src/relay/pass/alter_op_layout.cc">AlterOpLayout</a> pass in Relay. The layout transformation of the weight can be precomputed offline. 
Therefore, we can run the whole model in the same layout without extra overhead.</p> <h2 id="designing-search-space-for-automatic-optimization">Designing Search Space for Automatic Optimization</h2> @@ -1861,14 +1861,14 @@ We show that automatic optimization in TVM makes it easy and flexible to support <h1 id="show-me-the-code">Show Me the Code</h1> <ul> <li><a href="https://github.com/vinx13/tvm-cuda-int8-benchmark">Benchmark</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/conv2d_int8.py">CUDA int8 conv2d</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/group_conv2d_nchw.py">CUDA int8 group conv2d</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/dense.py">CUDA int8 dense</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/tensor_intrin.py">Tensor intrinsics declaration</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/conv2d_int8.py">CUDA int8 conv2d</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/group_conv2d_nchw.py">CUDA int8 group conv2d</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/dense.py">CUDA int8 dense</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/tensor_intrin.py">Tensor intrinsics declaration</a></li> </ul> <h1 id="bio--acknowledgement">Bio &amp; Acknowledgement</h1> -<p><a href="https://wuwei.io/">Wuwei Lin</a> is an undergraduate student at SJTU. He is currently an intern at TuSimple. The author has many thanks to <a href="https://homes.cs.washington.edu/~tqchen/">Tianqi Chen</a> and <a href="https://homes.cs.washington.edu/~eqy/">Eddie Yan</a> for their reviews.</p></content><author><name>Wuwei Lin</name></author><summary type="html">Deep learning has been successfully ap [...] 
+<p><a href="https://wuwei.io/">Wuwei Lin</a> is an undergraduate student at SJTU. He is currently an intern at TuSimple. The author has many thanks to <a href="https://homes.cs.washington.edu/~tqchen/">Tianqi Chen</a> and <a href="https://homes.cs.washington.edu/~eqy/">Eddie Yan</a> for their reviews.</p></content><author><name>Wuwei Lin</name></author><summary type="html">Deep learning has been successfully ap [...] <p>TVM is an open source deep learning compiler stack that closes the gap between the productivity-focused deep learning frameworks, and the performance- or efficiency-oriented hardware backends. Today, we are glad to announce that the TVM community has decided to move on to Apache incubator, and becomes an Apache(incubating) project.</p> @@ -1882,7 +1882,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support <p>We would like to take this chance to thank the Allen School for supporting the SAMPL team that gave birth to the TVM project. We would also like to thank the Halide project which provided the basis for TVM’s loop-level IR and initial code generation. We would like to thank our Apache incubator mentors for introducing the project to Apache and providing useful guidance. Finally, we would like to thank the TVM community and all of the organizations, as listed above, that supported [...] -<p>See also the <a href="https://news.cs.washington.edu/2019/03/18/allen-schools-tvm-deep-learning-compiler-framework-transitions-to-apache/">Allen School news about the transition here</a>, <a href="https://sampl.cs.washington.edu/tvmconf/#about-tvmconf">TVM conference program slides and recordings</a>, and <a href="https://tvm.apache.org/docs//contribute/community.html">our community guideline here</a>. Follow us o [...] 
+<p>See also the <a href="https://news.cs.washington.edu/2019/03/18/allen-schools-tvm-deep-learning-compiler-framework-transitions-to-apache/">Allen School news about the transition here</a>, <a href="https://sampl.cs.washington.edu/tvmconf/#about-tvmconf">TVM conference program slides and recordings</a>, and <a href="https://tvm.apache.org/docs//contribute/community.html">our community guideline here</a>. Follow us o [...] <p>TVM is an open deep learning compiler stack to compile various deep learning models from different frameworks to CPU, GPU or specialized accelerators. TVM supports model compilation from a wide range @@ -2023,14 +2023,14 @@ For simplicity the error handling is ignored here, but is important in real appl </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">gotvm</code> extends the TVM packed function system to support golang function closures as packed functions. -<a href="https://github.com/dmlc/tvm/blob/master/golang/sample">Examples</a> available to register golang +<a href="https://github.com/apache/incubator-tvm/blob/main/golang/sample">Examples</a> available to register golang closure as TVM packed function and invoke the same across programming language barriers.</p> <h2 id="show-me-the-code">Show me the code</h2> <ul> - <li><a href="https://github.com/dmlc/tvm/blob/master/golang/src">Package Source</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/golang/sample">Examples</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/golang/src">Package Source</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/golang/sample">Examples</a></li> </ul> <h2 id="references">References</h2> @@ -2043,7 +2043,7 @@ closure as TVM packed function and invoke the same across programming language b <li>[5] <a href="https://blog.learngoprogramming.com/golang-variadic-funcs-how-to-patterns-369408f19085">Go Variadic Functions</a></li> <li>[6] <a 
href="https://github.com/jdeng/gomxnet">CFFI Ref</a></li> <li>[7] <a href="https://golang.org/pkg/runtime/#SetFinalizer">Go Finalizers</a></li> -</ul></content><author><name>Siva</name></author><summary type="html">Introduction</summary></entry><entry><title type="html">Automating Generation of Low Precision Deep Learning Operators</title><link href="/2018/12/18/lowprecision-conv" rel="alternate" type="text/html" title="Automating Generation of Low Precision Deep Learning Operators" /><published>2018-12-18T00:00:00-08:00</published><updated>2018-12-18T00:00:00-08:00</updated><id>/2018/12/18/lowprecision-conv</id><content ty [...] +</ul></content><author><name>Siva</name></author><summary type="html">Introduction</summary></entry><entry><title type="html">Automating Generation of Low Precision Deep Learning Operators</title><link href="/2018/12/18/lowprecision-conv" rel="alternate" type="text/html" title="Automating Generation of Low Precision Deep Learning Operators" /><published>2018-12-18T00:00:00-05:00</published><updated>2018-12-18T00:00:00-05:00</updated><id>/2018/12/18/lowprecision-conv</id><content ty [...] devices becomes challenging because of their limited compute and energy budgets. 
A recent trend in deep learning is the use of extremely quantized models that operate on inputs and weights of a few bits, with networks like XNOR-Net, DoReFa-Net, and HWGQ-Net making steady @@ -2183,8 +2183,8 @@ Note: x86 doesn’t support a vectorized popcount for this microarchitecture, so <h2 id="show-me-the-code">Show me the code</h2> <ul> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/nn/bitserial_conv2d.py">TOPI bitserial convolution</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/arm_cpu/bitserial_conv2d.py">TOPI ARM cpu bitserial convolution</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/nn/bitserial_conv2d.py">TOPI bitserial convolution</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/arm_cpu/bitserial_conv2d.py">TOPI ARM cpu bitserial convolution</a></li> </ul> <h2 id="references">References</h2> diff --git a/rss.xml b/rss.xml index f44a1bc..cc2324e 100644 --- a/rss.xml +++ b/rss.xml @@ -5,8 +5,8 @@ <description>TVM - </description> <link>https://tvm.apache.org</link> <atom:link href="https://tvm.apache.org" rel="self" type="application/rss+xml" /> - <lastBuildDate>Mon, 02 Nov 2020 16:31:02 -0800</lastBuildDate> - <pubDate>Mon, 02 Nov 2020 16:31:02 -0800</pubDate> + <lastBuildDate>Tue, 03 Nov 2020 09:01:59 -0500</lastBuildDate> + <pubDate>Tue, 03 Nov 2020 09:01:59 -0500</pubDate> <ttl>60</ttl> @@ -300,7 +300,7 @@ For more documentation about the Bring Your Own Datatypes framework </description> <link>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</link> <guid>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</guid> - <pubDate>Sat, 26 Sep 2020 00:00:00 -0700</pubDate> + <pubDate>Sat, 26 Sep 2020 00:00:00 -0400</pubDate> </item> <item> @@ -779,7 +779,7 @@ Figure 4: After Graph Partitioning. 
</description> <link>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</link> <guid>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</guid> - <pubDate>Wed, 15 Jul 2020 00:00:00 -0700</pubDate> + <pubDate>Wed, 15 Jul 2020 00:00:00 -0400</pubDate> </item> <item> @@ -1302,7 +1302,7 @@ He is a PyTorch core developer and co-authored <a href="https://www.mann </description> <link>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</link> <guid>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</guid> - <pubDate>Tue, 14 Jul 2020 00:00:00 -0700</pubDate> + <pubDate>Tue, 14 Jul 2020 00:00:00 -0400</pubDate> </item> <item> @@ -1611,7 +1611,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel</ </description> <link>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</link> <guid>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</guid> - <pubDate>Thu, 04 Jun 2020 00:00:00 -0700</pubDate> + <pubDate>Thu, 04 Jun 2020 00:00:00 -0400</pubDate> </item> <item> @@ -1698,7 +1698,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel</ </description> <link>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</link> <guid>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</guid> - <pubDate>Thu, 14 May 2020 00:00:00 -0700</pubDate> + <pubDate>Thu, 14 May 2020 00:00:00 -0400</pubDate> </item> <item> @@ -1800,7 +1800,7 @@ relay_graph = torch_tvm.to_relay(mul, inputs) </description> <link>https://tvm.apache.org/2019/05/30/pytorch-frontend</link> <guid>https://tvm.apache.org/2019/05/30/pytorch-frontend</guid> - <pubDate>Thu, 30 May 2019 00:00:00 -0700</pubDate> + <pubDate>Thu, 30 May 2019 00:00:00 -0400</pubDate> </item> <item> @@ -1872,7 +1872,7 @@ Figure 2. 
2D convolution with data layout in NCHW4c and weight layout in OIHW4o4 </div> <p></p> -<p>After we have specified the layout of convolution layers, other operators such as <code class="language-plaintext highlighter-rouge">add</code> and activations can automatically adapt to the chosen layout during the <a href="https://github.com/dmlc/tvm/blob/master/src/relay/pass/alter_op_layout.cc">AlterOpLayout</a> pass in Relay. +<p>After we have specified the layout of convolution layers, other operators such as <code class="language-plaintext highlighter-rouge">add</code> and activations can automatically adapt to the chosen layout during the <a href="https://github.com/apache/incubator-tvm/blob/main/src/relay/pass/alter_op_layout.cc">AlterOpLayout</a> pass in Relay. The layout transformation of the weight can be precomputed offline. Therefore, we can run the whole model in the same layout without extra overhead.</p> <h2 id="designing-search-space-for-automatic-optimization">Designing Search Space for Automatic Optimization</h2> @@ -1933,10 +1933,10 @@ We show that automatic optimization in TVM makes it easy and flexible to support <h1 id="show-me-the-code">Show Me the Code</h1> <ul> <li><a href="https://github.com/vinx13/tvm-cuda-int8-benchmark">Benchmark</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/conv2d_int8.py">CUDA int8 conv2d</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/group_conv2d_nchw.py">CUDA int8 group conv2d</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/dense.py">CUDA int8 dense</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/tensor_intrin.py">Tensor intrinsics declaration</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/conv2d_int8.py">CUDA int8 conv2d</a></li> + <li><a 
href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/group_conv2d_nchw.py">CUDA int8 group conv2d</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/dense.py">CUDA int8 dense</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/tensor_intrin.py">Tensor intrinsics declaration</a></li> </ul> <h1 id="bio--acknowledgement">Bio &amp; Acknowledgement</h1> @@ -1944,7 +1944,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support </description> <link>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</link> <guid>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</guid> - <pubDate>Mon, 29 Apr 2019 09:00:00 -0700</pubDate> + <pubDate>Mon, 29 Apr 2019 12:00:00 -0400</pubDate> </item> <item> @@ -1967,7 +1967,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support </description> <link>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</link> <guid>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</guid> - <pubDate>Mon, 18 Mar 2019 00:00:00 -0700</pubDate> + <pubDate>Mon, 18 Mar 2019 00:00:00 -0400</pubDate> </item> <item> @@ -2113,14 +2113,14 @@ For simplicity the error handling is ignored here, but is important in real appl </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">gotvm</code> extends the TVM packed function system to support golang function closures as packed functions. 
-<a href="https://github.com/dmlc/tvm/blob/master/golang/sample">Examples</a> available to register golang +<a href="https://github.com/apache/incubator-tvm/blob/main/golang/sample">Examples</a> available to register golang closure as TVM packed function and invoke the same across programming language barriers.</p> <h2 id="show-me-the-code">Show me the code</h2> <ul> - <li><a href="https://github.com/dmlc/tvm/blob/master/golang/src">Package Source</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/golang/sample">Examples</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/golang/src">Package Source</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/golang/sample">Examples</a></li> </ul> <h2 id="references">References</h2> @@ -2137,7 +2137,7 @@ closure as TVM packed function and invoke the same across programming language b </description> <link>https://tvm.apache.org/2019/01/19/Golang</link> <guid>https://tvm.apache.org/2019/01/19/Golang</guid> - <pubDate>Sat, 19 Jan 2019 00:00:00 -0800</pubDate> + <pubDate>Sat, 19 Jan 2019 00:00:00 -0500</pubDate> </item> <item> @@ -2282,8 +2282,8 @@ Note: x86 doesn’t support a vectorized popcount for this microarchitecture, so <h2 id="show-me-the-code">Show me the code</h2> <ul> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/nn/bitserial_conv2d.py">TOPI bitserial convolution</a></li> - <li><a href="https://github.com/dmlc/tvm/blob/master/topi/python/topi/arm_cpu/bitserial_conv2d.py">TOPI ARM cpu bitserial convolution</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/nn/bitserial_conv2d.py">TOPI bitserial convolution</a></li> + <li><a href="https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/arm_cpu/bitserial_conv2d.py">TOPI ARM cpu bitserial convolution</a></li> </ul> <h2 id="references">References</h2> @@ -2298,7 +2298,7 @@ Note: x86 doesn’t support a vectorized popcount for this 
microarchitecture, so </description> <link>https://tvm.apache.org/2018/12/18/lowprecision-conv</link> <guid>https://tvm.apache.org/2018/12/18/lowprecision-conv</guid> - <pubDate>Tue, 18 Dec 2018 00:00:00 -0800</pubDate> + <pubDate>Tue, 18 Dec 2018 00:00:00 -0500</pubDate> </item> <item> @@ -2414,7 +2414,7 @@ His research interest is in the general domain of ML on shared private data, but </description> <link>https://tvm.apache.org/2018/10/09/ml-in-tees</link> <guid>https://tvm.apache.org/2018/10/09/ml-in-tees</guid> - <pubDate>Tue, 09 Oct 2018 00:00:00 -0700</pubDate> + <pubDate>Tue, 09 Oct 2018 00:00:00 -0400</pubDate> </item> <item> @@ -2808,7 +2808,7 @@ for inference deployment. TVM just provides such a solution.</p> </description> <link>https://tvm.apache.org/2018/10/03/auto-opt-all</link> <guid>https://tvm.apache.org/2018/10/03/auto-opt-all</guid> - <pubDate>Wed, 03 Oct 2018 00:00:00 -0700</pubDate> + <pubDate>Wed, 03 Oct 2018 00:00:00 -0400</pubDate> </item> <item> @@ -2923,7 +2923,7 @@ found <a href="https://tvm.apache.org/docs//tutorials/optimize/opt_gemm. </code></pre></div></div> <h2 id="under-the-hood-of-the-pytorch-example">Under the hood of the PyTorch Example</h2> -<p>As TVM provides <a href="https://github.com/dmlc/tvm/blob/master/include/tvm/runtime/c_runtime_api.h#L455">functions</a> to convert dlpack tensors to tvm <code class="language-plaintext highlighter-rouge">NDArray</code>s and +<p>As TVM provides <a href="https://github.com/apache/incubator-tvm/blob/main/include/tvm/runtime/c_runtime_api.h#L455">functions</a> to convert dlpack tensors to tvm <code class="language-plaintext highlighter-rouge">NDArray</code>s and vice-versa, so all that is needed is some syntactic sugar by wrapping functions. 
<code class="language-plaintext highlighter-rouge">convert_func</code> is a generic converter for frameworks using tensors with dlpack support, and can be used to implement convenient converters, such as @@ -2947,7 +2947,7 @@ support, and can be used to implement convenient converters, such as </description> <link>https://tvm.apache.org/2018/08/10/DLPack-Bridge</link> <guid>https://tvm.apache.org/2018/08/10/DLPack-Bridge</guid> - <pubDate>Fri, 10 Aug 2018 00:00:00 -0700</pubDate> + <pubDate>Fri, 10 Aug 2018 00:00:00 -0400</pubDate> </item> <item> @@ -2962,7 +2962,7 @@ support, and can be used to implement convenient converters, such as <p>VTA is more than a standalone accelerator design: it’s an end-to-end solution that includes drivers, a JIT runtime, and an optimizing compiler stack based on TVM. The current release includes a behavioral hardware simulator, as well as the infrastructure to deploy VTA on low-cost FPGA hardware for fast prototyping. By extending the TVM stack with a customizable, and open source deep learning hardware accelerator design, we are exposing a transparent end-to-end deep learning stac [...] -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_stack.png" alt="image" width="50%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_stack.png" alt="image" width="50%" /></p> <p>The VTA and TVM stack together constitute a blueprint for end-to-end, accelerator-centric deep learning system that can:</p> @@ -3017,7 +3017,7 @@ The extendability of the compiler stack, combined with the ability to modify the <p>The Vanilla Tensor Accelerator (VTA) is a generic deep learning accelerator built around a GEMM core, which performs dense matrix multiplication at a high computational throughput. The design is inspired by mainstream deep learning accelerators, of the likes of Google’s TPU accelerator. 
The design adopts decoupled access-execute to hide memory access latency and maximize utilization of compute resources. To a broader extent, VTA can serve as a template deep learning accelerator design, exposing a clean tensor computation abstraction to the compiler stack.</p> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_overview.png" alt="image" width="60%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_overview.png" alt="image" width="60%" /></p> <p>The figure above presents a high-level overview of the VTA hardware organization. VTA is composed of four modules that communicate between each other via FIFO queues and single-writer/single-reader SRAM memory blocks, to allow for task-level pipeline parallelism. The compute module performs both dense linear algebra computation with its GEMM core, and general computation with its tensor ALU. @@ -3034,7 +3034,7 @@ The first approach, which doesn’t require special hardware is to run deep lear This simulator back-end is readily available for developers to experiment with. The second approach relies on an off-the-shelf and low-cost FPGA development board – the <a href="http://www.pynq.io/">Pynq board</a>, which exposes a reconfigurable FPGA fabric and an ARM SoC.</p> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_system.png" alt="image" width="70%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_system.png" alt="image" width="70%" /></p> <p>The VTA release offers a simple compilation and deployment flow of the VTA hardware design and TVM workloads on the Pynq platform, with the help of an RPC server interface. The RPC server handles FPGA reconfiguration tasks and TVM module invocation offloading onto the VTA runtime. 
@@ -3057,7 +3057,7 @@ While this platform is meant for prototyping (the 2012 FPGA cannot compete with <p>A popular method used to assess the efficient use of hardware are roofline diagrams: given a hardware design, how efficiently are different workloads utilizing the hardware compute and memory resources. The roofline plot below shows the throughput achieved on different convolution layers of the ResNet-18 inference benchmark. Each layer has a different arithmetic intensity, i.e. compute to data movement ratio. In the left half, convolution layers are bandwidth limited, whereas on the right half, they are compute limited.</p> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_roofline.png" alt="image" width="60%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_roofline.png" alt="image" width="60%" /></p> <p>The goal behind designing a hardware architecture, and a compiler stack is to bring each workload as close as possible to the roofline of the target hardware. The roofline plot shows the effects of having the hardware and compiler work together to maximize utilization of the available hardware resources. @@ -3066,7 +3066,7 @@ The result is an overall higher utilization of the available compute and memory <h3 id="end-to-end-resnet-18-evaluation">End to end ResNet-18 evaluation</h3> -<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsaml/web-data/master/vta/blogpost/vta_e2e.png" alt="image" width="60%" /></p> +<p style="text-align: center"><img src="https://raw.githubusercontent.com/uwsampl/web-data/master/vta/blogpost/vta_e2e.png" alt="image" width="60%" /></p> <p>A benefit of having a complete compiler stack built for VTA is the ability to run end-to-end workloads. 
This is compelling in the context of hardware acceleration because we need to understand what performance bottlenecks, and Amdahl limitations stand in the way to obtaining faster performance. The bar plot above shows inference performance with and without offloading the ResNet convolutional layers to the FPGA-based VTA design, on the Pynq board’s ARM Cortex A9 SoC. @@ -3089,7 +3089,7 @@ This kind of high-level visibility is essential to system designers who want to </description> <link>https://tvm.apache.org/2018/07/12/vta-release-announcement</link> <guid>https://tvm.apache.org/2018/07/12/vta-release-announcement</guid> - <pubDate>Thu, 12 Jul 2018 00:00:00 -0700</pubDate> + <pubDate>Thu, 12 Jul 2018 00:00:00 -0400</pubDate> </item> <item> @@ -3355,7 +3355,7 @@ C = tvm.compute( </description> <link>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</link> <guid>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</guid> - <pubDate>Fri, 23 Mar 2018 00:00:00 -0700</pubDate> + <pubDate>Fri, 23 Mar 2018 00:00:00 -0400</pubDate> </item> <item> @@ -3471,7 +3471,7 @@ optimizations into the TVM stack.</p> </description> <link>https://tvm.apache.org/2018/03/12/webgl</link> <guid>https://tvm.apache.org/2018/03/12/webgl</guid> - <pubDate>Mon, 12 Mar 2018 00:00:00 -0700</pubDate> + <pubDate>Mon, 12 Mar 2018 00:00:00 -0400</pubDate> </item> <item> @@ -4045,7 +4045,7 @@ advice and <a href="https://github.com/yzhliu">Yizhi Liu</a&g </description> <link>https://tvm.apache.org/2018/01/16/opt-mali-gpu</link> <guid>https://tvm.apache.org/2018/01/16/opt-mali-gpu</guid> - <pubDate>Tue, 16 Jan 2018 00:00:00 -0800</pubDate> + <pubDate>Tue, 16 Jan 2018 00:00:00 -0500</pubDate> </item> <item> @@ -4273,7 +4273,7 @@ make jvminstall </description> <link>https://tvm.apache.org/2017/11/08/android-rpc-introduction</link> <guid>https://tvm.apache.org/2017/11/08/android-rpc-introduction</guid> - <pubDate>Wed, 08 Nov 2017 00:00:00 -0800</pubDate> + <pubDate>Wed, 08 Nov 2017 00:00:00 
-0500</pubDate> </item> <item> @@ -4334,7 +4334,7 @@ TVM prediction top-1: 282 tiger cat</code></pre></figure> <h2 id="a-note-on-performance">A Note on performance</h2> -<p>The current support on ROCm focuses on the functionality coverage. We have already seen promising performance results by simply adopting existing TVM schedules for CUDA backend. For example, you can try running <a href="https://github.com/dmlc/tvm/blob/master/topi/recipe/gemm/cuda_gemm_square.py">the gemm test script</a> in the TVM repository and see the result. For two types of cards we tested, the current gemm recipe for square matrix multiplication (not [...] +<p>The current support on ROCm focuses on the functionality coverage. We have already seen promising performance results by simply adopting existing TVM schedules for CUDA backend. For example, you can try running <a href="https://github.com/apache/incubator-tvm/blob/main/topi/recipe/gemm/cuda_gemm_square.py">the gemm test script</a> in the TVM repository and see the result. For two types of cards we tested, the current gemm recipe for square matrix multiplica [...] This is already a promising start, as it is very hard to optimize performance to get to peak and we did not yet apply AMD GPU specific optimizations. 
We are starting to look at performance optimization and we expect more improvement to come.</p> @@ -4499,7 +4499,7 @@ BB0_6: </description> <link>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</link> <guid>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</guid> - <pubDate>Mon, 30 Oct 2017 00:00:00 -0700</pubDate> + <pubDate>Mon, 30 Oct 2017 00:00:00 -0400</pubDate> </item> <item> @@ -4582,7 +4582,7 @@ We also learns from Halide when implementing the lowering pipeline in TVM.</l </description> <link>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</link> <guid>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</guid> - <pubDate>Fri, 06 Oct 2017 08:30:00 -0700</pubDate> + <pubDate>Fri, 06 Oct 2017 11:30:00 -0400</pubDate> </item>