[GitHub] [tvm-rfcs] masahi commented on a change in pull request #18: [RFC] Adding initial SVE implementation

GitBox Fri, 27 Aug 2021 00:01:13 -0700


masahi commented on a change in pull request #18:
URL: https://github.com/apache/tvm-rfcs/pull/18#discussion_r697200267




##########
File path: rfcs/0018-initial-sve-addition.md
##########
@@ -0,0 +1,221 @@
+- Feature Name: Adding Initial SVE Support to TVM 
+- Start Date: 2021-07-30
+- RFC PR: https://github.com/apache/tvm-rfcs/pull/18
+
+Authors: Meera Nakrani, Sjoerd Meijer
+
+## Introduction
+
+In this RFC we would like to propose a TIR extension to support scalable
+vectorisation. Scalable vectorisation is extracting data parallelism from 
+code, but as opposed to a fixed width vectorisation, the vector length is 
+unknown at compile time. A scalable vector's total number of elements is a 
+constant multiple of a specified number of elements. The 
+[LLVM LangRef](https://llvm.org/docs/LangRef.html) refers to this constant 
+multiple as vscale. It is a positive integer that is unknown at compile time, 
+therefore the overall vector length (VL) is also unknown. The value of vscale, 
+and therefore VL, will depend on the architecture that is running the program. 
+More details and an overview of this is given in 
+[this 
tutorial](https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf),
 
+where an example of a daxpy kernel is given from slide 17 onwards. In this 
RFC, 
+we will show an example of lowering from TE for a (scalable) vector addition 
+kernel all the way down to LLVM IR, further illustrating the vscale concept. 
+We will also cover TIR support and how it affects the LLVM codegen. This is an 
+introductory RFC to see if the design of our prototype implementation, see 
+https://github.com/apache/tvm/pull/8655, is sound and we welcome any feedback 
+on this prosposal.
+
+Before we explain this in more detail, let's first briefly look at the current
+state and terminology with an example. Vectorisation along the x-axis of an
+addition of two one-dimensional tensors A and B of size 18, writing the result
+to C, will result in the following TIR:
+
+```
+C[ramp(0, 1, 17)] = A[ramp(0, 1, 17)] + B[ramp(0, 1, 17)]`
+```
+where the Ramp TIR node has the form 'Ramp(base, stride, lanes)' showing that
+these elements are processed in (vector) lanes.
+
+The size of 18 has been chosen to demonstrate the challenges of vectorising
+this example. Vector architecture extensions (e.g. X86 AVX512 or AArch Neon)
+typically allow to pack and operate on a power-of-2 number of elements, so 2,
+4, 8, 16, etc.  elements. If the elements are integers, and a vector register
+is 128-bits wide, we can pack 4 integer elements into one vector register (if
+an integer is 4 bytes). This is an example of fixed width vectorisation,
+because the vector registers have a fixed width of 128-bits. Since we have 18, 
the
+number of elements in the vectors A, B, and C, is not a multiple of 4, we need
+4 vector operations processing 4 * 4 = 16 elements, and 2 scalar operations are
+required for processing the 16th and 17th elements which we call the scalar
+epilogue.
+
+## Motivation
+
+However, most modern vector architectures (e.g. X86 AVX512 and the Arm
+Architecture's MVE and SVE extensions) support predicated vector instructions,
+removing the need for such a scalar epilogue and also allowing more code to be
+vectorised.  Lane predication allows the enabling/disabling of certain lanes in
+vector operations.  This allows us to have just 5 vector operations for our
+example, and importantly no scalar epilogue. But since we do not need to
+process 5 * 4 = 20 elements, the last vector operation only needs to write two
+elements, which can be achieved by predication as we can enable the first two
+lanes and disable the last 2 lanes.
+
+In addition to predication, and also related to it, some new vector 
+architectures also allow scalable vectorisation. As opposed to so called fixed
+width vectorisation (e.g. AArch Neon), the Arm architecture SVE vector
+extension allows implementations to choose a vector register length between 128
+and 2048 bits.  It supports a vector length agnostic programming model which
+allows code to run and scale automatically across all vector lengths without
+recompilation.
+
+## Problem Statement
+
+We would like to add support for Arm Architecture's Scalable Vector Extension 
(SVE) 
+in TVM by introducing features for Vector Length Agnostic (VLA) programs and
+predication, i.e. the 2 main new SVE features. Thus we would like to express
+scalable vectorisation in both TE and TIR. The question is how to achieve 
that? In
+Tensor Expression language, our example to add two tensors A and B would look
+like this:
+
+```
+n = 17
+A = te.placeholder((n,), name="A", dtype = "int8")
+B = te.placeholder((n,), name="B", dtype = "int8")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+s = te.create_schedule(C.op)
+x, = C.op.axis
+s[C].vectorize(x)
+```
+
+Vectorisation along the x-axis is requested with _vectorize(x)_, and will
+result in the TIR example shown in the Introduction. However, this requires
+knowing the vector length at compile time; it is an example of fixed width
+vectorisation. Instead, we would like for it to work with an unknown vector
+length at compile time.
+
+## Solution Approach
+
+In order to address the problem of expressing scalable vectorisation, we would
+like to propose the addition of a new _vectorize_scalable_ function to the 
Tensor
+Expression language, for example:
+``` 
+s[C].vectorize_scalable(x)
+```
+The TIR output of this would be:
+
+```
+primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
+  attr = {"global_symbol": "main", "tir.noalias": True}
+  buffers = {C: Buffer(C_2: Pointer(int8), int8, [17], []),
+             A: Buffer(A_2: Pointer(int8), int8, [17], []),
+             B: Buffer(B_2: Pointer(int8), int8, [17], [])}
+  buffer_map = {A_1: A, B_1: B, C_1: C} {
+  for (i: int32, 0, 17;i+=VL) {
+    C_2[ramp(i, 1, VL)] = ((int8xVL*)A_2[ramp(i, 1, VL)] + 
(int8xVL*)B_2[ramp(i, 1, VL)])
+  }
+}
+```
+
+In the above TIR, we can see the the for loop is looping with an agnostic
+stride _VL_, which stands for Vector Length. _VL_ is only showed for ease of
+representation and we don't store _VL_ anywhere inside the TIR data structures.
+
+We can also see the syntax of the Ramp nodes have now been modified to handle
+an unknown vector length, as seen by _ramp(i, 1, VL)_, instead of a fixed
+integer. The form is still _Ramp(base, stride, lanes)_ and the semantics of it
+are still the same, the only difference is that the number of lanes is unknown
+at compile time, and so we use VL as a way of representing that.
+
+## Implementation 
+
+An agnostic constructor has been added to the Ramp node, as well as to the
+Broadcast node, with an additional parameter. This parameter is a boolean named
+_is_scalable_, in order to enable both fixed and scalable vectorisation.
+
+This boolean has also been added in _data_type.h_ as the type of the Ramp node
+has changed, it is now scalable. The constructor is:
+
+```
+DataType(int code, int bits, int lanes, bool is_scalable = false)
+```
+
+Originally, for fixed vectorisation, _is_scalable_ will be false, but when
+scalable vectorisation is enabled we will set _is_scalable_ to true.
+
+In TIR we introduced a new ForKind called _kVectorizeScalable_ which marks a 
+loop as able to be vectorized but the value of VL will be unknown. This loop 
+is then legalised during a new pass called _VectorizeLoopScalable_ pass, which 
+is triggered by the vectorize_scalable function mentioned previously. This 
pass 
+transforms the loop so that it is able to handle the unknown constant VL. Our 
+prototype was implemented before the addition of the While node to TVM, and so 
+it currently transforms a For loop into a variable For loop. To do this, the 
For 
+node had to have extra parameters added to its implementation that would only 
be 
+used in this one specific case. One change we are planning to make is to make 
use 
+of this existing While node and transform a For Loop into a While loop during 
the 
+_VectorizeLoopScalable_ pass, since it is the more natural choice and it is 
what 

Review comment:
       :+1: This is a great use case of TIR `While` loop that I didn't imagine.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [tvm-rfcs] masahi commented on a change in pull request #18: [RFC] Adding initial SVE implementation

Reply via email to