hogepodge commented on a change in pull request #7642:
URL: https://github.com/apache/tvm/pull/7642#discussion_r600791467



##########
File path: tutorials/get_started/tensor_expr_get_started.py
##########
@@ -302,18 +385,452 @@
     fadd_cl(a, b, c)
     tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())
 
-######################################################################
-# Summary
-# -------
-# This tutorial provides a walk through of TVM workflow using
-# a vector add example. The general workflow is
+################################################################################
+# .. note:: Code Specialization
+#
+#   As you may have noticed, the declarations of A, B and C all take the same
+#   shape argument, n. TVM will take advantage of this to pass only a single
+#   shape argument to the kernel, as you will find in the printed device code.
+#   This is one form of specialization.
+#
+#   On the host side, TVM will automatically generate code that checks the
+#   shape constraints on the parameters, so if you pass arrays with different
+#   shapes into fadd, an error will be raised.
+#
+#   We can apply more specialization. For example, we can write
+#   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`
+#   in the computation declaration. The generated function will then only
+#   accept vectors of length 1024.
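+
+################################################################################
+# The snippet below is a minimal sketch of that fixed-shape specialization. The
+# names (:code:`A_fixed`, :code:`myadd_fixed`) and the plain "llvm" target are
+# only for illustration, and the :code:`tvm`/:code:`te` imports from the top of
+# this tutorial are reused. Because the shape is a constant, the resulting
+# kernel only accepts 1024-element vectors.
+
+n_fixed = tvm.runtime.convert(1024)  # constant length instead of te.var("n")
+A_fixed = te.placeholder((n_fixed,), name="A")
+B_fixed = te.placeholder((n_fixed,), name="B")
+C_fixed = te.compute(A_fixed.shape, lambda i: A_fixed[i] + B_fixed[i], name="C")
+s_fixed = te.create_schedule(C_fixed.op)
+# "llvm" stands in here for whichever target was selected earlier in the tutorial
+fadd_fixed = tvm.build(s_fixed, [A_fixed, B_fixed, C_fixed], target="llvm", name="myadd_fixed")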
+
+################################################################################
+# .. note:: TE Scheduling Primitives
+#
+#   TVM includes a number of different scheduling primitives:
+#
+#   - split: splits a specified axis into two axes by the defined factor.
+#   - tile: splits a computation across two axes by the defined factors.
+#   - fuse: fuses two consecutive axes of one computation.
+#   - reorder: reorders the axes of a computation into a defined order.
+#   - bind: binds a computation to a specific thread, useful in GPU
+#     programming.
+#   - compute_at: by default, TVM will compute tensors at the outermost level
+#     of the function, or the root. compute_at specifies that one tensor
+#     should instead be computed at the first axis of computation of another
+#     operator.
+#   - compute_inline: when marked inline, a computation will be expanded and
+#     then inserted where the tensor is required.
+#   - compute_root: moves a computation to the outermost layer, or root, of the
+#     function. This means that stage of the computation will be fully computed
+#     before it moves on to the next stage.
+#
+#   A complete description of these primitives can be found in the
+#   `Schedule Primitives <https://tvm.apache.org/docs/tutorials/language/schedule_primitives.html>`_
+#   docs page. A brief sketch using :code:`split` and :code:`fuse` follows
+#   below.
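+
+################################################################################
+# As a rough, minimal sketch of :code:`split` and :code:`fuse` (the names such
+# as :code:`C_demo` are illustrative only, and the :code:`tvm`/:code:`te`
+# imports from the top of this tutorial are reused), we can split the single
+# axis of the vector add, fuse the pieces back together, and inspect the loop
+# structure with :code:`tvm.lower`.
+
+n_demo = te.var("n")
+A_demo = te.placeholder((n_demo,), name="A")
+B_demo = te.placeholder((n_demo,), name="B")
+C_demo = te.compute(A_demo.shape, lambda i: A_demo[i] + B_demo[i], name="C")
+s_demo = te.create_schedule(C_demo.op)
+
+# split: break the single axis into an outer loop and an inner loop of length 32
+outer, inner = s_demo[C_demo].split(C_demo.op.axis[0], factor=32)
+# fuse: merge the two loops back into a single axis
+fused = s_demo[C_demo].fuse(outer, inner)
+# print the lowered IR to see how the loop structure changed
+print(tvm.lower(s_demo, [A_demo, B_demo, C_demo], simple_mode=True))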
+
+################################################################################
+# Example 2: Manually Optimizing Matrix Multiplication with TE
+# ------------------------------------------------------------
+#
+# Now we will consider a second, more advanced example, demonstrating how with
+# just 18 lines of Python code TVM speeds up a common matrix multiplication
+# operation by 18x.
+#
+# **Matrix multiplication is a compute-intensive operation. There are two
+# important optimizations for good CPU performance:**
+#
+# 1. Increase the cache hit rate of memory access. Both complex numerical
+#    computation and hot-spot memory access can be accelerated by a high cache
+#    hit rate. This requires us to transform the original memory access
+#    pattern into a pattern that fits the cache policy.
+# 2. SIMD (single instruction, multiple data), also known as the vector
+#    processing unit. On each cycle, instead of processing a single value,
+#    SIMD can process a small batch of data. This requires us to transform the
+#    data access pattern in the loop body into a uniform pattern so that the
+#    LLVM backend can lower it to SIMD.
+#
+# The techniques used in this tutorial are a subset of the tricks mentioned in
+# this `repository <https://github.com/flame/how-to-optimize-gemm>`_. Some of
+# them are applied automatically by the TVM abstraction, but some cannot be
+# applied automatically due to TVM constraints.
+#
+# All of the experiment results mentioned below were obtained on a 2015 15"
+# MacBook equipped with an Intel i7-4770HQ CPU. The cache line size should be
+# 64 bytes for all x86 CPUs.
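+
+################################################################################
+# Before walking through the optimizations, here is a rough, minimal sketch of
+# the two ideas above. The names (:code:`A_s`, :code:`s_mm`) and the 32x32
+# block size are illustrative only, reusing the :code:`tvm`/:code:`te` imports
+# from the top of this tutorial: :code:`tile` blocks the output so a block of C
+# stays in cache, and :code:`vectorize` lets the innermost loop be lowered to
+# SIMD instructions.
+
+M_s, K_s, N_s = 1024, 1024, 1024
+k_red = te.reduce_axis((0, K_s), name="k")
+A_s = te.placeholder((M_s, K_s), name="A")
+B_s = te.placeholder((K_s, N_s), name="B")
+C_s = te.compute(
+    (M_s, N_s), lambda m, n: te.sum(A_s[m, k_red] * B_s[k_red, n], axis=k_red), name="C"
+)
+
+s_mm = te.create_schedule(C_s.op)
+# tile: block the two output axes so each 32x32 block of C is reused from cache
+mo, no, mi, ni = s_mm[C_s].tile(C_s.op.axis[0], C_s.op.axis[1], 32, 32)
+# move the reduction loop outside the blocked spatial loops
+(k_axis,) = s_mm[C_s].op.reduce_axis
+s_mm[C_s].reorder(mo, no, k_axis, mi, ni)
+# vectorize: lower the innermost loop to SIMD via the LLVM backend
+s_mm[C_s].vectorize(ni)
+print(tvm.lower(s_mm, [A_s, B_s, C_s], simple_mode=True))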

Review comment:
       Since we will always be chasing the exact specs on this, I just decided 
to remove the line.



