masahi opened a new pull request, #13357: URL: https://github.com/apache/tvm/pull/13357
Currently, the shared-to-global store in tensor core auto-tensorization is not vectorized properly, because most blocks carry a `T.where` predicate, which disables vectorization. The predicate is introduced by the `Split` in cooperative fetch: https://github.com/apache/tvm/blob/main/src/meta_schedule/postproc/rewrite_cooperative_fetch.cc#L159-L162

As the code says, this split is supposed to be applied to a fused loop. That holds for cache read blocks, where `AddReadReuse` explicitly fuses the loops around each cache read block. But `AddWriteReuseTensorCore` doesn't fuse the loops after cache write. So for cache write blocks, we always try to split a single axis by large factors like [None, 4, 32, 2], and that is where the predicate comes from.
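To make the arithmetic behind the predicate concrete, here is a minimal pure-Python sketch. It does not call TVM; `split_needs_predicate` is a hypothetical helper, and the 64x64 tile shape is an assumed example rather than a value taken from the PR. It only models the rule that a split needs a guard when the product of the known factors does not evenly divide the loop extent.

```python
import math


def split_needs_predicate(extent, factors):
    """Model splitting a loop of `extent` by `factors` (one factor is None).

    Returns (inferred_outer_extent, needs_predicate). This is an
    illustrative model only, not TVM's implementation.
    """
    # Product of the explicitly given factors (the None one is inferred).
    known = math.prod(f for f in factors if f is not None)
    # The inferred factor is the ceiling of extent / known.
    outer = -(-extent // known)
    # If the split over-covers the original extent, a guard is required.
    return outer, outer * known != extent


factors = [None, 4, 32, 2]  # factors quoted in the PR description

# Assumed 64x64 tile: splitting only one axis of extent 64.
print(split_needs_predicate(64, factors))       # (1, True)

# Same tile with both axes fused into one loop of extent 4096.
print(split_needs_predicate(64 * 64, factors))  # (16, False)
```

Under these assumptions, the single axis is too short for the 256-wide inner factors and needs a guard, while the fused loop divides evenly and does not.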
