kparzysz-quic commented on PR #18:
URL: https://github.com/apache/tvm-rfcs/pull/18#issuecomment-1172632753

   To reiterate: my original concern was that the first RFC proposed 
changes to the target-independent part of TVM to support a very 
target-specific feature.  However, I do think that we can move this forward in 
a way that is useful overall.
   
   Here is the outline of my thoughts on this.  Let me know what you think.
   
   First, a couple of observations:
   1. Architectures that support vectors can be assumed to also support vector 
predication.  I'm talking specifically about masked operations, and in 
particular about predicated loads and stores.
   2. For ARM/AArch64, it may be beneficial to distinguish vectorization via 
fixed-length vectors from vectorization via scalable vectors.  If this choice 
is to be made by auto-scheduling, it should be expressible in TIR.
   
   What this RFC proposes is very close to allowing vectorization of countable 
loops with variable iteration count, and I insist that we keep this in mind as 
a goal.
   
   The way that vectorization works right now is that a loop like
   ```
   for (i : [0, 130)) {
     C[i] = A[i] + B[i]
     D[i] = A[i] * B[i]
   }
   ```
   will be replaced with statements
   ```
   C[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] + B[Ramp(0, 1, 130)]
   D[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] * B[Ramp(0, 1, 130)]
   ```
   The expressions within these statements are all `PrimExpr`, whose type must 
be expressible by `DataType`.  All parameters in `DataType` are compile-time 
integers, which means that a single statement can only represent vectors with a 
known number of lanes.  In other words, neither variable iteration count (VIC) 
nor vector-length-agnostic (VLA) code can be implemented without some changes.  
These changes may be in how types are represented in `DataType`, in how 
vectorization is done, or in a combination of the two.
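
To make the current behaviour concrete, here is a minimal plain-Python model of it (not TVM code). The helper names `ramp` and `vectorize_whole_loop` are made up for illustration; `ramp` only mirrors the index set that `Ramp(base, stride, lanes)` denotes. The point is that the lane count has to be a known integer before the single vector statement can be formed.

```python
def ramp(base, stride, lanes):
    """Indices denoted by Ramp(base, stride, lanes)."""
    return [base + stride * k for k in range(lanes)]


def vectorize_whole_loop(a, b, extent):
    """Model the current scheme: one vector statement spanning every iteration."""
    idx = ramp(0, 1, extent)  # lane count equals the (constant) iteration count
    c = [a[j] + b[j] for j in idx]  # C[Ramp(0, 1, extent)] = A[...] + B[...]
    d = [a[j] * b[j] for j in idx]  # D[Ramp(0, 1, extent)] = A[...] * B[...]
    return c, d


a = list(range(130))
b = [2] * 130
c, d = vectorize_whole_loop(a, b, 130)
```

In this model `extent` plays the role of `DataType::lanes`, which is why an unknown iteration count or an unknown hardware vector length cannot be expressed without one of the changes discussed here.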
   
   We are already considering a special value for `DataType::lanes` that would 
represent the yet-unknown vector length (VL).  Following Halide's approach to 
vectorization, I propose that we change vectorization to take an explicit 
vector length as a parameter.  As a special case for SVE, the scalable VL could 
be represented by the same constant we choose for `DataType::lanes`.  For 
compatibility with existing code, `stage.vectorize()` would be equivalent to 
`stage.vectorize(vector_length=iter_count)`, since currently only loops with 
known iteration count can be vectorized.  The argument value `vector_length=VL` 
would indicate using SVE.  With `vectorize(vector_length=32)`, the loop above 
would be turned into
   ```
   for (i : [0, (130+31)/32)) {
     // i-th vector is [32*i..32*(i+1))
     C[Ramp(32*i, 1, 32), pred=(Ramp(32*i, 1, 32) < Broadcast(130, 32))] =
         A[Ramp..., pred=...] + ...
     ...
   }
   ```
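
The predicated loop above can be simulated in plain Python to check that masking the tail gives the same result as the scalar loop. This is only a sketch of the semantics I have in mind, not TVM code: `ceil_div` and `predicated_add` are hypothetical helpers, and the choice of `0` for inactive lanes of a masked load is an assumption made for the simulation.

```python
def ceil_div(n, d):
    """Number of vectors of length d needed to cover n elements."""
    return (n + d - 1) // d


def predicated_add(a, b, n, vl):
    """Simulate C[i] = A[i] + B[i] over [0, n) using vectors of vl lanes."""
    c = [None] * n
    for i in range(ceil_div(n, vl)):
        idx = [vl * i + k for k in range(vl)]  # Ramp(vl*i, 1, vl)
        pred = [j < n for j in idx]            # Ramp(...) < Broadcast(n, vl)
        # Masked loads: inactive lanes never touch memory (0 is a placeholder).
        va = [a[j] if p else 0 for j, p in zip(idx, pred)]
        vb = [b[j] if p else 0 for j, p in zip(idx, pred)]
        vc = [x + y for x, y in zip(va, vb)]
        # Masked store: only active lanes are written back.
        for j, p, v in zip(idx, pred, vc):
            if p:
                c[j] = v
    return c


a = list(range(130))
b = [10] * 130
result = predicated_add(a, b, 130, 32)
```

With `n = 130` and `vl = 32` the loop runs five times, and only the last iteration has inactive lanes (elements 130 through 159).
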
   If the loop iteration count changed from a known integer `130` to some 
expression `N`, the generated code would remain mostly the same: the structure 
does not depend on the fact that `130` is a compile-time constant.  Similarly, 
the `32` indicating vector length could be replaced with the predefined value 
for "scalable vector length"; the only potential issue is calculating the 
iteration count of the `for` loop above.  If we were to allow an explicit 
"stride" in `For`, the issue would go away (the RFC proposes something like 
that).
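
As a rough illustration of why a strided `For` removes the need to compute an iteration count, here is a plain-Python sketch in which both the extent `n` and the vector length `vl` are runtime values. The function `strided_mul` is hypothetical, and `range(0, n, vl)` stands in for a `For` with an explicit stride; none of this is existing TVM API.

```python
def strided_mul(a, b, n, vl):
    """Simulate D[i] = A[i] * B[i] over [0, n) with a loop that steps by vl."""
    d = [None] * n
    for base in range(0, n, vl):  # models For(base, 0, n, stride=vl)
        for k in range(vl):       # lanes of Ramp(base, 1, vl)
            j = base + k
            if j < n:             # predicate: Ramp(base, 1, vl) < Broadcast(n, vl)
                d[j] = a[j] * b[j]
    return d


a = list(range(130))
b = [3] * 130
out = strided_mul(a, b, 130, 32)
```

The loop body never divides by `vl`, so the same structure works whether `vl` is a compile-time constant or a value only known at run time.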
   
   To summarize:
   1. Introduce `kScalableVectorLaneMark` (as suggested by @tqchen).
   2. Make vector length a parameter to `stage.vectorize`.
   3. Introduce "predicate" to `BufferLoad` and `BufferStore`.
   4. Allow non-unit strides in `For` loops (as per the RFC).

