LeiWang1999 opened a new pull request, #17133:
URL: https://github.com/apache/tvm/pull/17133
We currently only support cross-thread reduction lowering under several constraints. For example, `lower_cross_thread` only applies when the thread-bound reduced axis is the innermost loop and the block has an init block. This can be limiting in some cases.
For example, when tensorizing the reduction block (e.g., dp4a or mma), it
becomes difficult to tensorize the init statement as well:
```python
with T.block("block"):
vi = T.axis.spatial(2, i_0 * 16 + i_1)
vk = T.axis.reduce(32, k_0 * 64 + k_1)
T.where(i_0 * 16 + i_1 < 2 and k_0 * 64 + k_1 < 32)
T.reads(A[vi, vk])
T.writes(B[vi])
with T.init():
B[vi] = T.float32(0)
B[vi] = B[vi] + A[vi, vk]
```
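
In such cases, a natural workaround is to split the init out of the reduction block so that the init and the update can be tensorized (or scheduled) independently. Below is a minimal sketch of obtaining that separate-init form via `tir.Schedule.decompose_reduction`; the shapes and names are illustrative only, not taken from this PR:

```python
import tvm
from tvm import tir
from tvm.script import tir as T


@T.prim_func
def row_sum(A: T.Buffer((2, 32), "float32"), B: T.Buffer((2,), "float32")):
    for i, k in T.grid(2, 32):
        with T.block("block"):
            vi, vk = T.axis.remap("SR", [i, k])
            with T.init():
                B[vi] = T.float32(0)
            B[vi] = B[vi] + A[vi, vk]


sch = tir.Schedule(row_sum)
blk = sch.get_block("block")
i, k = sch.get_loops(blk)
# Hoist T.init() out of the reduction block into its own init block,
# placed right before the reduction loop `k`.
sch.decompose_reduction(blk, k)
print(sch.mod.script())
```

The improved pass is intended to accept this decomposed form, where the init block and the reduction block are separate.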
Moreover, certain cases, such as small GEMMs, prefer block reduction in shared memory to increase parallelism and better utilize hardware resources.
This pull request improves the `lower_cross_thread` pass: it can now handle thread-block reduction lowering with separate init and reduce blocks, and it removes the constraint that the reduced axis must be the innermost loop, enabling TensorCore with block reduction.
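
As a rough usage sketch (buffer shapes and names are illustrative assumptions, not from this PR), binding the reduction loop to `threadIdx.x` produces a cross-thread reduction that `LowerCrossThreadReduction` lowers to a thread-block-level allreduce:

```python
import tvm
from tvm.script import tir as T


@tvm.script.ir_module
class Before:
    @T.prim_func
    def main(A: T.Buffer((2, 32), "float32"), B: T.Buffer((2,), "float32")):
        for i in T.thread_binding(2, thread="blockIdx.x"):
            for k in T.thread_binding(32, thread="threadIdx.x"):
                with T.block("block"):
                    vi, vk = T.axis.remap("SR", [i, k])
                    with T.init():
                        B[vi] = T.float32(0)
                    # The reduction axis is bound to threadIdx.x, so this is
                    # a cross-thread (block) reduction.
                    B[vi] = B[vi] + A[vi, vk]


# Lower the cross-thread reduction to a thread-block-level allreduce.
After = tvm.tir.transform.LowerCrossThreadReduction()(Before)
print(After.script())
```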
Relevant test cases can be found in
`tests/python/tir-transform/test_tir_transform_lower_cross_thread_reduction.py`.
cc @MasterJH5574