tkonolige commented on a change in pull request #7612:
URL: https://github.com/apache/tvm/pull/7612#discussion_r591717419
##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -302,18 +371,437 @@
fadd_cl(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
-######################################################################
-# Summary
-# -------
-# This tutorial provides a walk through of TVM workflow using
-# a vector add example. The general workflow is
+################################################################################
+# .. note:: Code Specialization
+#
+# As you may have noticed, the declarations of A, B and C all take the same
+# shape argument, n. TVM will take advantage of this to pass only a single
+# shape argument to the kernel, as you will find in the printed device code.
+# This is one form of specialization.
+#
+# On the host side, TVM will automatically generate code that checks the
+# constraints on the parameters. So if you pass arrays with different shapes
+# into fadd, an error will be raised.
+#
+# We can do further specialization. For example, we can write
+# :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")` in
+# the computation declaration. The generated function will then only take
+# vectors of length 1024.
+
+################################################################################
+# .. note:: TE Scheduling Primitives
+#
+# TVM includes a number of different scheduling primitives:
+#
+# - split: splits a specified axis into two axes by the defined factor.
+# - tile: splits a computation across two axes by the defined factors.
+# - fuse: fuses two consecutive axes of one computation.
+# - reorder: reorders the axes of a computation into a defined order.
+# - bind: binds a computation to a specific thread; useful in GPU programming.
+# - compute_at: by default, TVM will compute tensors at the root. compute_at
+#   specifies that one tensor should be computed at the first axis of
+#   computation of another operator.
+# - compute_inline: when marked inline, a computation will be expanded and then
+#   inserted into the address where the tensor is required.
+# - compute_root: moves a computation to the root stage.
+#
+# A complete description of these primitives can be found in the
+# `Schedule Primitives <https://tvm.apache.org/docs/tutorials/language/schedule_primitives.html>`_
+# docs page.
+
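To make the first two primitives concrete, here is a plain-Python sketch (not the TVM API) of what `split` does to a loop nest: splitting an axis of extent `n` by a factor turns one loop into two nested loops, without changing which iterations run. The extent 8 and factor 4 are illustrative choices.

```python
def original_loop(n):
    # One flat loop over the axis of extent n.
    visited = []
    for i in range(n):
        visited.append(i)
    return visited

def split_loop(n, factor):
    # After a split, the axis i is rewritten as i_outer * factor + i_inner.
    # The same iterations run, just restructured into a two-level nest.
    visited = []
    for i_outer in range(n // factor):
        for i_inner in range(factor):
            visited.append(i_outer * factor + i_inner)
    return visited

# Both forms visit exactly the same indices in the same order.
assert original_loop(8) == split_loop(8, 4)
```

`tile` is the two-dimensional analogue: it applies this rewrite to two axes at once.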
+################################################################################
+# Example 2: Manually Optimizing Matrix Multiplication with TE
+# ------------------------------------------------------------
+#
+# Now we will consider a second, more advanced example, demonstrating how with
+# just 18 lines of Python code TVM can speed up a common matrix multiplication
+# operation by up to 18x.
+#
+# **There are two important optimizations for compute-intensive applications
+# run on a CPU:**
+#
+# 1. Increase the cache hit rate of memory accesses. Both complex numerical
+#    computation and hot-spot memory access can be accelerated by a high cache
+#    hit rate. This requires us to transform the original memory access
+#    pattern into a pattern that fits the cache policy.
+# 2. SIMD (Single Instruction, Multiple Data), also known as vector
+#    processing. Each SIMD instruction processes a small batch of data rather
+#    than a single value. This requires us to transform the data access
+#    pattern in the loop body into a uniform pattern so that the LLVM backend
+#    can lower it to SIMD instructions.
+#
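To illustrate the first point, here is a minimal pure-NumPy sketch of cache blocking, the same idea that TVM's `tile` primitive expresses at the schedule level. The block size `bn = 32` is an illustrative choice, not a tuned value.

```python
import numpy as np

def blocked_matmul(a, b, bn=32):
    # Compute the product block by block so that the working set of A, B,
    # and C tiles stays small enough to remain resident in cache.
    M, K = a.shape
    _, N = b.shape
    c = np.zeros((M, N), dtype=a.dtype)
    for i in range(0, M, bn):
        for j in range(0, N, bn):
            for k in range(0, K, bn):
                c[i:i + bn, j:j + bn] += a[i:i + bn, k:k + bn] @ b[k:k + bn, j:j + bn]
    return c

a = np.random.rand(128, 128).astype("float32")
b = np.random.rand(128, 128).astype("float32")
np.testing.assert_allclose(blocked_matmul(a, b), a @ b, rtol=1e-3)
```

The blocked version computes exactly the same result; the benefit is purely in the memory access pattern, which is what the TE schedules below manipulate.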
+# The techniques used in this tutorial are a subset of tricks mentioned in this
+# `repository <https://github.com/flame/how-to-optimize-gemm>`_. Some of them
+# have been applied by TVM abstraction automatically, but some of them cannot
+# be simply applied due to TVM constraints.
+#
+# All of the experimental results below were measured on a 2015 15" MacBook
+# Pro with an Intel i7-4770HQ CPU. The cache line size should be 64 bytes for
+# all x86 CPUs.
+
+################################################################################
+# Preparation and Performance Baseline
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# We begin by collecting performance data on the `numpy` implementation of
+# matrix multiplication.
+
+import tvm
+import tvm.testing
+from tvm import te
+import numpy
+import timeit
+
+# The size of the matrix
+# (M, K) x (K, N)
+# You are free to try out different shapes; sometimes TVM optimization
+# outperforms numpy with MKL.
+M = 1024
+K = 1024
+N = 1024
+
+# The default tensor data type in TVM
+dtype = "float32"
+
+# To get the best performance using the Intel AVX2 (Advanced Vector
+# Extensions) ISA for SIMD, change the following line to
+# ``llvm -mcpu=core-avx2``, or to the specific type of CPU you use.
+target = "llvm"
+ctx = tvm.context(target, 0)
+
+# Randomly generated tensors for testing
+a = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), ctx)
+b = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), ctx)
+
+# Repeatedly perform a matrix multiplication to get a performance baseline
+# for the default numpy implementation
+np_repeat = 100
+np_running_time = timeit.timeit(
+ setup="import numpy\n"
+ "M = " + str(M) + "\n"
+ "K = " + str(K) + "\n"
+ "N = " + str(N) + "\n"
+ 'dtype = "float32"\n'
+ "a = numpy.random.rand(M, K).astype(dtype)\n"
+ "b = numpy.random.rand(K, N).astype(dtype)\n",
+ stmt="answer = numpy.dot(a, b)",
+ number=np_repeat,
+)
+print("Numpy running time: %f" % (np_runing_time / np_repeat))
+
+answer = numpy.dot(a.asnumpy(), b.asnumpy())
+
+################################################################################
+# Now, write a basic matrix multiplication using TVM TE and verify that it
+# produces the same results as the numpy implementation. We also write a
+# function that will help us measure the performance of the schedule
+# optimizations.
+
+# TVM Matrix Multiplication using TE
+k = te.reduce_axis((0, K), "k")
+A = te.placeholder((M, K), name="A")
+B = te.placeholder((K, N), name="B")
+C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")
+
+# Default schedule
+s = te.create_schedule(C.op)
+func = tvm.build(s, [A, B, C], target=target, name="mmult")
+assert func
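For reference, the TE compute declaration above corresponds to the following naive triple loop, written here as a plain-Python/NumPy sketch over small illustrative matrices (not part of the tutorial code). The innermost loop over `k` is the reduction that `te.sum(..., axis=k)` expresses.

```python
import numpy as np

def naive_matmul(a, b):
    # C[x, y] = sum over k of A[x, k] * B[k, y], mirroring the
    # te.compute/te.reduce_axis declaration above.
    M, K = a.shape
    _, N = b.shape
    c = np.zeros((M, N), dtype=a.dtype)
    for x in range(M):
        for y in range(N):
            for k in range(K):
                c[x, y] += a[x, k] * b[k, y]
    return c

a = np.random.rand(8, 8).astype("float32")
b = np.random.rand(8, 8).astype("float32")
np.testing.assert_allclose(naive_matmul(a, b), a @ b, rtol=1e-4)
```

The default schedule built from `C.op` lowers to essentially this loop nest; the optimizations that follow restructure it without changing the result.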
Review comment:
yep
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]