I have an NPU backend (a domain-specific architecture, DSA) whose hardware
supports vector addition of arbitrary length.
My current implementation performs loop transformations during the pass phase:
it splits the innermost loop to a fixed factor of 32, then applies tensorize
to the resulting inner loop (see the scheduling sketch after the snippet
below). During codegen, the tensorized block is replaced with a call to a
kernel function that performs vector addition with a vector length of 32.
However, this approach has a problem: the innermost loop can only be
tensorized at the fixed size of 32. If the input data shape is, for example,
1x33, the extent is not divisible by 32 and the lowering no longer works.
Are there alternative approaches to solve this problem?
```
from tvm.script import tir as T
from tvm.tir import TensorIntrin

MAX_SIZE = 32

# Description: the computation pattern that tensorize matches against.
@T.prim_func
def add(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (MAX_SIZE,), offset_factor=1)
    B = T.match_buffer(b, (MAX_SIZE,), offset_factor=1)
    C = T.match_buffer(c, (MAX_SIZE,), offset_factor=1)
    with T.block():
        T.reads(A[0:MAX_SIZE], B[0:MAX_SIZE])
        T.writes(C[0:MAX_SIZE])
        for i in T.serial(MAX_SIZE):
            with T.block():
                vi = T.axis.remap("S", [i])
                C[vi] = A[vi] + B[vi]

# Implementation: the matched region is replaced by a packed call to the
# backend's fixed-length (32-lane) vector-add kernel.
@T.prim_func
def add_impl(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (MAX_SIZE,), offset_factor=1)
    B = T.match_buffer(b, (MAX_SIZE,), offset_factor=1)
    C = T.match_buffer(c, (MAX_SIZE,), offset_factor=1)
    with T.block():
        T.reads(A[0:MAX_SIZE], B[0:MAX_SIZE])
        T.writes(C[0:MAX_SIZE])
        T.call_packed(
            "add_mm",
            C.data, C.elem_offset,
            B.data, B.elem_offset,
            A.data, A.elem_offset,
            MAX_SIZE,
        )

ADD_MM = "add_mm"
TensorIntrin.register(ADD_MM, add, add_impl)
```
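For reference, here is a minimal standalone sketch of the split-then-tensorize
step described above, assuming the registration snippet has already run. The
1024-element workload and the block name "add" are purely illustrative:

```
import tvm
from tvm.script import tir as T

# Illustrative workload; the extent 1024 is a multiple of MAX_SIZE,
# so the split below is exact and tensorize can match.
@T.prim_func
def workload(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (1024,))
    B = T.match_buffer(b, (1024,))
    C = T.match_buffer(c, (1024,))
    for i in T.serial(1024):
        with T.block("add"):
            vi = T.axis.remap("S", [i])
            C[vi] = A[vi] + B[vi]

sch = tvm.tir.Schedule(workload)
(i,) = sch.get_loops(sch.get_block("add"))
# Split the loop so the inner extent equals the fixed vector length,
# then replace the inner loop body with the registered intrinsic.
_, i_inner = sch.split(i, factors=[None, MAX_SIZE])
sch.tensorize(i_inner, ADD_MM)

# With an extent such as 33, the split is no longer exact: the block
# gets a guard predicate, and tensorize fails to match the description
# `add` — which is exactly the problem described above.
```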