tkonolige commented on a change in pull request #18:
URL: https://github.com/apache/tvm-rfcs/pull/18#discussion_r682981173



##########
File path: rfcs/initial_sve_addition.md
##########
@@ -0,0 +1,164 @@
+- Feature Name: Adding Initial SVE Support to TVM 
+- Start Date: 2021-07-30
+- RFC PR: TODO
+
+Authors: Meera Nakrani, Sjoerd Meijer
+
+## Introduction
+
+In this RFC we would like to propose a TIR extension to support scalable
+vectorisation. This is an introductory RFC to see if the design of our

Review comment:
       Can you define "scalable vectorization" in this paragraph?

##########
File path: rfcs/initial_sve_addition.md
##########
@@ -0,0 +1,164 @@
+- Feature Name: Adding Initial SVE Support to TVM 
+- Start Date: 2021-07-30
+- RFC PR: TODO
+
+Authors: Meera Nakrani, Sjoerd Meijer
+
+## Introduction
+
+In this RFC we would like to propose a TIR extension to support scalable
+vectorisation. This is an introductory RFC to see if the design of our
+prototype implementation, see TODO, is sound. We will cover TIR support and how
+it affects the LLVM codegen, and welcome any feedback on this proposal.
+
+Before we explain this in more detail, let's first briefly look at the current
+state and terminology with an example. Vectorisation along the x-axis of an
+addition of two one-dimensional tensors A and B of size 18, writing the result
+to C, results in the following TIR:
+
+```
+C[ramp(0, 1, 18)] = A[ramp(0, 1, 18)] + B[ramp(0, 1, 18)]
+```
+where the Ramp TIR node has the form _Ramp(base, stride, lanes)_, showing that
+these elements are processed in (vector) lanes.
+
+The size of 18 has been chosen to demonstrate the challenges of vectorising
+this example. Vector architecture extensions (e.g. X86 AVX or Arm Neon)
+typically allow packing and operating on a power-of-2 number of elements: 2,
+4, 8, 16, etc. If the elements are integers, and a vector register is
+128 bits wide, we can pack 4 integer elements into one vector register (if
+an integer is 4 bytes). This is an example of fixed-width vectorisation,
+because the vector registers have a fixed width of 128 bits. Since 18, the
+number of elements in the vectors A, B, and C, is not a multiple of 4, we need
+4 vector operations processing 4 * 4 = 16 elements, plus 2 scalar operations
+to process the last two elements (indices 16 and 17); this tail is called the
+scalar epilogue.
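(Editorial illustration, not part of the quoted RFC.) The split into a vector body and a scalar epilogue described above can be sketched in plain Python; `add_fixed_width` and the lane loops are hypothetical stand-ins for real vector instructions:

```python
# Sketch of fixed-width vectorisation with a scalar epilogue.
# vw is the compile-time vector width in elements; 18 is not a multiple
# of 4, so the last two elements fall into a scalar tail loop.
def add_fixed_width(a, b, vw=4):
    n = len(a)
    c = [0] * n
    body = n - n % vw            # 16 when n == 18 and vw == 4
    for i in range(0, body, vw): # 4 "vector" operations of vw lanes each
        for lane in range(vw):
            c[i + lane] = a[i + lane] + b[i + lane]
    for i in range(body, n):     # scalar epilogue: elements 16 and 17
        c[i] = a[i] + b[i]
    return c
```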
+
+## Motivation
+
+Most modern vector architectures (e.g. X86 AVX and the Arm Architecture's
+MVE and SVE extensions) support predicated vector instructions, which remove
+the need for such a scalar epilogue and also allow more code to be
+vectorised. Lane predication enables or disables individual lanes in
+vector operations. For our example this gives just 5 vector operations and,
+importantly, no scalar epilogue: 5 operations would cover 5 * 4 = 20 elements,
+but only 18 need to be processed, so the last vector operation writes only two
+elements, which is achieved by enabling its first two lanes and disabling the
+last two.
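(Editorial illustration, not part of the quoted RFC.) The predicated variant can be sketched the same way; the per-iteration `mask` is a hypothetical model of a hardware lane predicate:

```python
# Sketch of lane predication: every iteration issues a full vw-lane
# operation, but a per-lane mask disables lanes past the end, so no
# scalar epilogue is needed (5 masked operations for 18 elements).
def add_predicated(a, b, vw=4):
    n = len(a)
    c = [0] * n
    for i in range(0, n, vw):
        mask = [i + lane < n for lane in range(vw)]  # active lanes
        for lane in range(vw):
            if mask[lane]:
                c[i + lane] = a[i + lane] + b[i + lane]
    return c
```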
+
+In addition to predication, and related to it, some newer vector
+architectures also allow scalable vectorisation. As opposed to so-called
+fixed-width vectorisation (e.g. Arm Neon), the Arm architecture's SVE vector
+extension allows implementations to choose a vector register length between 128
+and 2048 bits. It supports a vector length agnostic programming model which
+allows code to run and scale automatically across all vector lengths without
+recompilation.
+
+## Problem Statement
+
+We would like to add support for the Arm Architecture's Scalable Vector
+Extension (SVE) in TVM by introducing features for Vector Length Agnostic
+(VLA) programs and predication, i.e. the 2 main new SVE features. Thus we
+would like to express scalable vectorisation in TE and TIR. The question is
+how to achieve that. In the Tensor Expression language, our example of adding
+two tensors A and B looks like this:
+
+```
+n = 18
+A = te.placeholder((n,), name="A", dtype="int8")
+B = te.placeholder((n,), name="B", dtype="int8")
+C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")
+s = te.create_schedule(C.op)
+x, = C.op.axis
+s[C].vectorize(x)
+```
+
+Vectorisation along the x-axis is requested with _vectorize(x)_ and results
+in the TIR shown in the Introduction. However, this requires knowing the
+vector length at compile time; it is an example of fixed-width vectorisation.
+Instead, we would like it to work with a vector length that is unknown at
+compile time.
+
+## Solution Approach
+
+In order to address the problem of expressing scalable vectorisation, we
+would like to propose the addition of a new _vectorize_scalable_ function to
+the Tensor Expression language, for example:
+``` 
+s[C].vectorize_scalable(x)
+```
+The TIR output of this would be:
+
+```
+primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
+  attr = {"global_symbol": "main", "tir.noalias": True}
+  buffers = {C: Buffer(C_2: Pointer(int8), int8, [18], []),
+             A: Buffer(A_2: Pointer(int8), int8, [18], []),
+             B: Buffer(B_2: Pointer(int8), int8, [18], [])}
+  buffer_map = {A_1: A, B_1: B, C_1: C} {
+  for (i: int32, 0, 18; i += VL) {
+    C_2[ramp(i, 1, VL)] = ((int8xVL*)A_2[ramp(i, 1, VL)] + (int8xVL*)B_2[ramp(i, 1, VL)])
+  }
+}
+```
+
+In the above TIR, we can see that the for loop strides by an agnostic
+amount _VL_, which stands for Vector Length. _VL_ is only shown for ease of
+representation; we do not store _VL_ anywhere inside the TIR data structures.
+
+We can also see that the syntax of the Ramp nodes has been modified to handle
+an unknown vector length, as seen in _ramp(i, 1, VL)_, instead of a fixed
+integer. The form is still _Ramp(base, stride, lanes)_ and its semantics are
+unchanged; the only difference is that the number of lanes is unknown at
+compile time, and we use VL as a way of representing that.
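(Editorial illustration, not part of the quoted RFC.) How such a Ramp resolves at run time can be sketched in Python; `query_vl` is a hypothetical stand-in for reading the hardware vector length, which here pretends the machine has 8 lanes:

```python
# Hypothetical sketch: a Ramp whose lane count is only known at run time.
def query_vl():
    return 8  # pretend the hardware reports 8 int8 lanes per vector

def ramp(base, stride, lanes):
    # Same semantics as TIR's Ramp(base, stride, lanes):
    # the sequence base, base + stride, ..., base + (lanes - 1) * stride.
    return [base + stride * lane for lane in range(lanes)]

vl = query_vl()
indices = ramp(0, 1, vl)  # indices touched by one scalable iteration
```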
+
+## Implementation 
+
+A length-agnostic constructor has been added to the Ramp node, as well as to
+the Broadcast node, with an additional parameter. This parameter is a boolean
+named _is_scalable_, which enables both fixed and scalable vectorisation.
+
+This boolean has also been added in _data_type.h_, as the type of a Ramp node
+can now be scalable. The constructor is:
+
+```
+DataType(int code, int bits, int lanes, bool is_scalable = false)

Review comment:
       Why not use a special value for lanes to indicate that it can be a 
variable number?

##########
File path: rfcs/initial_sve_addition.md
##########
@@ -0,0 +1,164 @@
[quoted context identical to the previous hunk elided]
+
+For fixed-width vectorisation _is_scalable_ is false (the default); when
+scalable vectorisation is enabled, we set _is_scalable_ to true.
+
+The _vectorize_scalable_ function triggers the transformations that make the
+loops scalable, via a new pass called VectorizeLoopScalable. This pass visits
+the TIR nodes and transforms them so that they account for the unknown vector
+length. Our prototype was implemented before the addition of the While node to
+TVM, and so it currently generates a for loop with a variable stride. One
+change we are planning to make is to transform a for loop directly into a
+while loop with an unknown vector length, as that is what would be generated
+in the assembly.
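(Editorial illustration, not part of the quoted RFC.) The planned for-to-while lowering can be sketched in Python; `query_vl` and `add_scalable` are hypothetical names, and the `active` count models the predicate on the final iteration:

```python
# Sketch of the planned lowering: a loop striding by the run-time vector
# length VL, with the last iteration predicated so elements past the end
# are left untouched.
def query_vl():
    return 8  # stand-in for the hardware vector-length query

def add_scalable(a, b):
    n = len(a)
    c = [0] * n
    vl = query_vl()
    i = 0
    while i < n:                  # the for loop rewritten as a while loop
        active = min(vl, n - i)   # predicate: lanes still in range
        for lane in range(active):
            c[i + lane] = a[i + lane] + b[i + lane]
        i += vl
    return c
```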
+
+This is all handled on the TE and TIR side; for code generation we introduce
+a file called codegen_aarch64.cc which handles the creation of SVE intrinsics
+in LLVM by visiting the Load, Store and For nodes (to be replaced with a While
+node) and generating the relevant LLVM IR.
+

Review comment:
       Please include a drawbacks section.

##########
File path: rfcs/initial_sve_addition.md
##########
@@ -0,0 +1,164 @@
[quoted context identical to the previous hunk elided]
+## Problem Statement
+
+We would like to add support for the Arm Architecture's Scalable Vector
+Extension (SVE) in TVM by introducing features for Vector Length Agnostic
+(VLA) programs and predication, i.e. the 2 main new SVE features. Thus we
+would like to express

Review comment:
       You mention predication here, yet you never talk about how it is being 
added to TIR.

##########
File path: rfcs/initial_sve_addition.md
##########
@@ -0,0 +1,164 @@
[quoted context identical to the previous hunk elided]
+## Solution Approach
+
+In order to address the problem of expressing scalable vectorisation, we
+would like to propose the addition of a new _vectorize_scalable_ function to
+the Tensor Expression language, for example:
+``` 
+s[C].vectorize_scalable(x)
+```

Review comment:
       What do you think about adding a new `ForKind` so that we can express 
scalable vectorization in TIR?

##########
File path: rfcs/initial_sve_addition.md
##########
@@ -0,0 +1,164 @@
[quoted context identical to the previous hunk elided]
+In the above TIR, we can see that the for loop strides by an agnostic
+amount _VL_, which stands for Vector Length. _VL_ is only shown for ease of
+representation; we do not store _VL_ anywhere inside the TIR data structures.
+
+We can also see that the syntax of the Ramp nodes has been modified to handle
+an unknown vector length, as seen in _ramp(i, 1, VL)_, instead of a fixed
+integer. The form is still _Ramp(base, stride, lanes)_ and its semantics

Review comment:
       We already have a concept of an unknown scalar value at compile time: 
`Any`. It seems like you could just have the lanes of Ramp be a PrimExpr 
instead of an int. Is there a reason not to take this approach?

##########
File path: rfcs/initial_sve_addition.md
##########
@@ -0,0 +1,164 @@
[quoted context identical to the previous hunk elided]
+## Motivation
+
+Most modern vector architectures (e.g. X86 AVX and the Arm
+Architecture's MVE and SVE extensions) support predicated vector instructions,
+which remove the need for such a scalar epilogue and also allow more code to be
+vectorised. Lane predication enables or disables individual lanes in

Review comment:
       Is this PR talking about using predicated vector instructions or 
scalable vector instructions? If you aren't talking about predicated vector 
instructions, you could probably just remove this paragraph.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

