tqchen commented on code in PR #94:
URL: https://github.com/apache/tvm-rfcs/pull/94#discussion_r985240243
##########
rfcs/0094-aarch64-backend-with-sve.md:
##########
@@ -0,0 +1,140 @@
+- Feature Name: aarch64_backend
+- Start Date: 2022-09-26
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- Co-Authors: [@manupak](https://github.com/manupak), [@u99127](https://github.com/u99127)
+
+# Summary
+
+This RFC introduces a new TIR backend for AArch64 codegen to support target-specific features, specifically SVE. Currently, AArch64-specific code is generated either through the generic LLVM backend or through a tensorize implementation (e.g. the MMLA Arm(R) Neon(TM) instruction), but we could benefit from finer-grained control over the LLVM IR generated for AArch64.
+
+# Motivation
+
+The main motivation behind this work is to introduce SVE instructions in codegen without changing IRs, scheduling primitives or TVM passes. An AArch64 backend would be a good place to work around the issues in LLVM SVE code generation that surfaced while adding support for SVE in Halide. In addition, the `CodegenAArch64` backend would not be limited to SVE codegen; it could be used to introduce AArch64-specific lowering where required, either for specialised use of AArch64 intrinsics or to work around limitations of LLVM.
+
+# Guide-level explanation
+
+In comparison to the Arm(R) Neon(TM) instruction set, which uses a fixed vector length, SVE allows the developer to write vectorized code where the exact length of a vector is unknown at compile time. That code can then run on hardware implementations with different choices of vector lengths. The only constraint on a hardware implementation's vector length is that it has to be a minimum of 128 bits and a multiple of 128 bits. *Vscale* is the number of sets of 128 bits that fit into the SVE vector, e.g. a vscale of 4 results in a vector length of 512 bits.
+
+The initial SVE implementation in TVM would focus on two main capabilities of SVE:
+
+**1. Vector length agnostic loops**
+As an example, consider this vector length agnostic loop that adds two vectors with FP32 elements:
+
+```
+for (i = 0; i < n; i += 4 * vscale)
+    c[i : i + 4 * vscale] = a[i : i + 4 * vscale] + b[i : i + 4 * vscale]
+```
+
+The number 4 in the above example comes from the fact that we can fit four FP32 elements into 128 bits. The number of times we have to run the loop depends on vscale, which is a hardware implementation detail. If the vector length was, for example, 256 bits, we could process 8 FP32 elements in one iteration, meaning we would have to do `n / 8` iterations. By increasing the vector length to 512 bits, we would only need `n / 16` iterations.
+
+**2. Predication**
+SVE provides support for predication, enabling us to deal efficiently with loop tails, among other things. In the example above, `n` may or may not be a multiple of `4 * vscale`. Predication allows us to handle this loop without any special consideration for the remainder of the elements, i.e. `c[n - n % (4 * vscale) : n]`. Essentially, every operation on SVE registers would take a predicate register as one of its arguments, acting as a bit mask that indicates which elements are active. Similarly to the vector length, the length of a predicate depends on the hardware implementation.
+
+```
+whilelt p0.s, w17, w12
+ld1w { z0.s }, p0/z, [x2, x17, lsl #2]
+ld1w { z1.s }, p0/z, [x1, x17, lsl #2]
+fadd z2.s, z0.s, z1.s
+st1w { z2.s }, p0, [x0, x17, lsl #2]
+```
+
+In that example, `whilelt` constructs the predicate register `p0` from the loop induction variable and the loop bound stored in `w` registers.
+
+## How to target the AArch64 backend
+
+Similarly to how we target other LLVM codegen backends, we would invoke the AArch64 backend by parsing the `-mtriple` in the target string:
+
+```
+target = "llvm -mtriple=aarch64-linux-gnu -mattr=+sve"
+```
+
+The node visitors in the AArch64 backend implementation would generate SVE code when `+sve` is part of `-mattr`.
+
+# Reference-level explanation
+
+The main difference compared to CodegenLLVM would be how we generate LLVM IR and assembly for `Ramp` and `Broadcast` nodes.
+
+Let's take a simple vectorized addition of two-dimensional tensors as an example:
+
+```
+A = te.placeholder((200, 200), name="A")
+B = te.placeholder((200, 200), name="B")
+T = te.compute((200, 200), lambda i, j: A[i, j] + B[i, j])
+
+s = te.create_schedule(T.op)
+xo, yo, xi, yi = s[T].tile(T.op.axis[0], T.op.axis[1], x_factor=10, y_factor=5)
+                                                     # ^^ this would be the vector length
+s[T].vectorize(yi)
+```
+
+Currently, loops that are annotated with vectorize will be represented as `Ramp` nodes in TIR:
+
+```
+@main = primfn(A_1: handle, B_1: handle, m: int32, n: int32) -> ()
+  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
+  buffers = {A: Buffer(A_2: Pointer(float32), float32, [200, 200], []),
+             B: Buffer(B_2: Pointer(float32), float32, [200, 200], [])}
+  buffer_map = {A_1: A, B_1: B} {
+  realize(compute: Buffer(compute_1: Pointer(float32), float32, [200, 200], []), [0:200, 0:200], True {
+    for (i.outer: int32, 0, 20) {
+      for (j.outer: int32, 0, 40) {
+        for (i.inner: int32, 0, 10) "unroll" {
+          compute[(i.inner + (i.outer*10)), ramp((j.outer*5), 1, 5)] = (A[(i.inner + (i.outer*10)), ramp((j.outer*5), 1, 5)] + B[(i.inner + (i.outer*10)), ramp((j.outer*5), 1, 5)])
+        }
+      }
+    }
+  })
+}
+```
+
+The above TIR segment contains static numbers for the lane count (5) and the inferred bound (40) of the `j` axis. If SVE is used, the AArch64 backend would treat the lane count as `llvm.vscale() * 4` and the corresponding loop bound as `ceil((40 * 5) / (llvm.vscale() * 4))`, i.e. the 200 elements of the `j` axis divided by the scalable lane count.
+
+With SVE enabled, this TIR would further be lowered to LLVM:
+
+```

Review Comment:
Based on this description, it seems the proposed approach is that:
- we pattern match a fixed vectorization (lanes=5)
- raise it back to an SVE pattern (with vscale and lanes != 5)
- codegen

One concern is that the code can be simplified under the assumption (lanes=5) during the lowering phase, but that simplification does not work for the general case.

Edit: After thinking a bit more, I now think the above concern can be addressed by clarifying a strict set of raising rules, so feel free to ignore this.
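To make "a strict set of raising rules" concrete, here is a minimal Python sketch of what one such rule might look like. This is purely illustrative: the `Ramp` fields mirror TIR's ramp node, but `PredicatedScalableRamp` and the rule itself are hypothetical, not existing TVM APIs.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for TIR nodes, for illustration only.
@dataclass
class Ramp:
    base: int
    stride: int
    lanes: int

@dataclass
class PredicatedScalableRamp:
    base: int
    lanes_per_vscale: int  # lanes per 128 bits of vector length
    active_lanes: int      # predicate keeps only the first k lanes

def raise_to_scalable(ramp: Ramp, elems_per_128b: int = 4):
    """One possible raising rule: rewrite a contiguous fixed-lane
    Ramp(base, 1, k) into a vscale-based form that does not bake the
    concrete lane count k into the vector width."""
    if ramp.stride != 1:
        return ramp  # strided access: keep the fixed-length lowering
    return PredicatedScalableRamp(
        base=ramp.base,
        lanes_per_vscale=elems_per_128b,  # e.g. 4 fp32 lanes per 128 bits
        active_lanes=ramp.lanes,          # original lane count -> predicate bound
    )

# e.g. the ramp((j.outer*5), 1, 5) from the TIR above would become
# raise_to_scalable(Ramp(base=j_outer * 5, stride=1, lanes=5))
```

Because a rule like this keys only on the shape of the ramp (contiguous, unit stride) and carries the original lane count into the predicate bound rather than the vector width, any simplification that assumed lanes=5 stays confined to the predicate.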
##########
rfcs/0094-aarch64-backend-with-sve.md:
##########

Review Comment:
As an alternative, would it be possible to directly generate from a non-vectorized spec? The question is: if we are already in a loop with a VLA annotation, presumably the cost of pattern matching is similar?

```c++
for (i: int32, 0, 17, annotation={"VLA"}) {
  C_2[i] = A_2[i] + B_2[i];
}
```

We would then defer the vectorized instruction generation to the codegen phase, by specially handling the patterns in the `for` loop annotated as VLA. Of course, we can only support a limited set of patterns (such as reads/writes to the same vector index, or limited reduction support), which is why a legalize step is needed to make sure the body of the VLA for loop satisfies the pattern.
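To sketch what that could look like on the codegen side: a rough Python outline of dispatching on the annotation, assuming legalization runs first. All names here (`CodegenAArch64VLA`, `emit_scalable_loop`, `emit_serial_loop`, `has_sve`) are made up for illustration and are not existing TVM APIs.

```python
class CodegenAArch64VLA:
    """Illustrative-only sketch of deferring VLA vectorization to the
    codegen phase; this class and its helpers are hypothetical."""

    def __init__(self, has_sve: bool):
        self.has_sve = has_sve

    def visit_for(self, loop):
        # Dispatch a For node: VLA-annotated loops take the scalable
        # path, everything else keeps the ordinary lowering.
        if self.has_sve and "VLA" in loop.annotations:
            # A legalize pass is assumed to have verified that the body
            # only contains supported patterns (elementwise reads/writes
            # indexed by the loop variable), so a predicated scalable
            # loop is safe: step = 4 * vscale, whilelt-style predicate
            # for the tail.
            self.emit_scalable_loop(loop, elems_per_128b=4)
        else:
            self.emit_serial_loop(loop)

    def emit_scalable_loop(self, loop, elems_per_128b):
        ...  # would emit a vscale-stepped loop with predicated loads/stores

    def emit_serial_loop(self, loop):
        ...  # would emit the existing scalar lowering
```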
