kparzysz-quic commented on PR #18:
URL: https://github.com/apache/tvm-rfcs/pull/18#issuecomment-1172632753
To reiterate, my original concern was that the first RFC proposed
changes to the target-independent part of TVM to add support for a very
target-specific feature. However, I do think that we can move this forward in
a way that would be useful overall.
Here is the outline of my thoughts on this. Let me know what you think.
First, a couple of observations:
1. Architectures that support vectors can be assumed to also support vector
predication. I'm talking specifically about masked operations, and in
particular about predicated loads and stores.
2. For ARM/AArch64, it may be beneficial to distinguish vectorization via
fixed-length vectors from vectorization via scalable vectors. If this choice
is to be made by auto-scheduling, it should be expressible in TIR.
What this RFC proposes is very close to allowing vectorization of countable
loops with variable iteration count (VIC), and I insist that we keep this in
mind as a goal.
The way that vectorization works right now is that a loop like
```
for (i : [0, 130)) {
  C[i] = A[i] + B[i]
  D[i] = A[i] * B[i]
}
```
will be replaced with statements
```
C[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] + B[Ramp(0, 1, 130)]
D[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] * B[Ramp(0, 1, 130)]
```
The expressions within these statements are all `PrimExpr`, whose type must
be expressible by `DataType`. All parameters in `DataType` are compile-time
integers, which means that a single statement can only represent vectors with a
known number of lanes. In other words, neither VIC nor VLA (vector-length
agnostic) vectorization can be implemented without some changes. These changes
may be in how types are represented in `DataType`, or in how vectorization is
done (or a combination of the two).
We are already considering a special value for `DataType::lanes` that would
represent the yet-unknown vector length (VL). Following Halide's approach to
vectorization, I propose that we change vectorization to take an explicit
vector length as a parameter. As a special case for SVE, the scalable VL could
be represented by the same constant we chose for `DataType::lanes`. For
compatibility with existing code, `stage.vectorize()` would be equivalent to
`stage.vectorize(vector_length=iter_count)`, since currently only loops with
known iteration count can be vectorized. The argument value `vector_length=VL`
would indicate using SVE. With `vectorize(vector_length=32)`, the loop above
would be turned into
```
for (i : [0, (130+31)/32)) {
  // i-th vector is [32*i..32*(i+1))
  C[Ramp(32*i, 1, 32), pred=(Ramp(32*i, 1, 32) < Broadcast(130, 32))] =
    A[Ramp..., pred=...] + ...
  ...
}
```
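The predicated form above can be checked against the scalar loop with a small plain-Python model. This is only a sketch under stated assumptions: `ramp` and `predicated_add_mul` are illustrative helpers, not TVM APIs, and the per-lane `if` stands in for a hardware predicate on loads and stores.

```python
def ramp(base, stride, lanes):
    """Index vector [base, base + stride, ..., base + (lanes - 1) * stride]."""
    return [base + k * stride for k in range(lanes)]


def predicated_add_mul(A, B, n, vl):
    """Strip-mine an n-iteration loop into chunks of vl predicated lanes."""
    C = [None] * n
    D = [None] * n
    num_chunks = (n + vl - 1) // vl  # (130 + 31) / 32 in the example above
    for i in range(num_chunks):
        idx = ramp(vl * i, 1, vl)
        # Models Ramp(vl*i, 1, vl) < Broadcast(n, vl)
        pred = [j < n for j in idx]
        for j, active in zip(idx, pred):
            if active:  # inactive lanes neither load nor store
                C[j] = A[j] + B[j]
                D[j] = A[j] * B[j]
    return C, D


# Usage: the tail chunk covers indices 128..159, but only 128 and 129 are active.
A = list(range(130))
B = [3] * 130
C, D = predicated_add_mul(A, B, 130, 32)
assert C == [a + b for a, b in zip(A, B)]
assert D == [a * b for a, b in zip(A, B)]
```

Because the predicate masks out-of-range lanes, the last chunk never touches memory past element 129, so no scalar epilogue loop is needed.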
If the loop iteration count changed from a known integer `130` to some
expression `N`, the generated code would remain mostly the same: the structure
does not depend on the fact that `130` is a compile-time constant. Similarly,
the `32` indicating vector length could be replaced with the predefined value
for "scalable vector length"; the only potential issue is calculating
the iteration count of the `for` loop above. If we were to allow an explicit
"stride" in `For`, the issue would go away (the RFC proposes something like
that).
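To illustrate why an explicit stride sidesteps the trip-count calculation, here is a plain-Python sketch in which both the iteration count `n` and the vector length `vl` are ordinary runtime values. The function name `strided_add` is hypothetical, and `vl` merely stands in for a scalable vector length that the hardware would supply.

```python
def strided_add(A, B, n, vl):
    """Step the loop variable by vl; no ceil(n / vl) trip count is computed."""
    C = [None] * n
    for base in range(0, n, vl):  # models: for (base : [0, n), stride = vl)
        for lane in range(vl):
            j = base + lane
            if j < n:  # per-lane predicate, as in the fixed-length case
                C[j] = A[j] + B[j]
    return C


# The same code works unchanged for different (runtime) vector lengths.
A = list(range(130))
B = [1] * 130
assert strided_add(A, B, 130, 32) == strided_add(A, B, 130, 48)
```

The loop bound is just `n` and the step is `vl`, so neither value has to be a compile-time constant for the loop structure to be expressed.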
To summarize:
1. Introduce `kScalableVectorLaneMark` (as suggested by @tqchen).
2. Make vector length a parameter to `stage.vectorize`.
3. Introduce "predicate" to `BufferLoad` and `BufferStore`.
4. Allow non-unit strides in `For` loops (as per the RFC).
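Points 1 and 2 could interact roughly as in the sketch below. Everything here is an assumption made for illustration: `SCALABLE_LANES` is a stand-in for `kScalableVectorLaneMark` (its actual value is not specified in this thread), and `resolve_vector_length` is a hypothetical helper, not an existing TVM function.

```python
SCALABLE_LANES = -1  # hypothetical sentinel; the real value is undecided


def resolve_vector_length(iter_count, vector_length=None):
    """Pick the lane count for a hypothetical vectorize(vector_length=...)."""
    if vector_length is None:
        # Backward-compatible default: vectorize the whole loop, which
        # requires a known integer iteration count.
        if not isinstance(iter_count, int):
            raise ValueError("default vectorization needs a constant iteration count")
        return iter_count
    if vector_length == SCALABLE_LANES:
        return SCALABLE_LANES  # lanes are only known at run time (e.g. SVE)
    if vector_length <= 0:
        raise ValueError("vector_length must be positive")
    return vector_length


assert resolve_vector_length(130) == 130        # stage.vectorize()
assert resolve_vector_length(130, 32) == 32     # fixed-length vectors
assert resolve_vector_length("N", SCALABLE_LANES) == SCALABLE_LANES
```

With this shape, existing schedules keep their current meaning, while an explicit length (fixed or scalable) also admits loops whose iteration count is a symbolic expression.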