masahi opened a new pull request, #13357:
URL: https://github.com/apache/tvm/pull/13357

   Currently, the shared-to-global store in tensor core auto-tensorization 
is not vectorized properly, because most blocks carry a `T.where` predicate, 
which disables vectorization. 
   
   The predicate is introduced by the `Split` in cooperative fetch: 
https://github.com/apache/tvm/blob/main/src/meta_schedule/postproc/rewrite_cooperative_fetch.cc#L159-L162
   As the code says, this split is supposed to be applied to a fused loop. That 
holds for cache read blocks, because `AddReadReuse` explicitly fuses the loops 
around them. But `AddWriteReuseTensorCore` doesn't fuse the loops after cache 
write. So for cache write blocks, we always try to split a single axis by large 
factors like [None, 4, 32, 2]. When the extent of that axis is not divisible by 
the product of the factors, the split has to guard the block with a predicate.
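
   To make the divisibility issue concrete, here is a minimal pure-Python 
sketch of the check. It is a simplified model, not TVM's actual `Split` 
implementation: `infer_split` is a hypothetical helper, and the extents (32 
for a single axis, 64 * 32 for a fused loop) are made-up values chosen only 
for illustration. It assumes the `None` factor is inferred by ceiling division.

```python
from math import prod


def infer_split(extent, factors):
    """Fill in the single None factor and report whether a predicate is needed.

    Simplified model (an assumption, not TVM code): the None factor is
    inferred by ceiling division, and a bounds predicate is required
    whenever the product of all factors differs from the loop extent.
    """
    known = prod(f for f in factors if f is not None)
    inferred = -(-extent // known)  # ceiling division
    full = [inferred if f is None else f for f in factors]
    needs_predicate = prod(full) != extent
    return full, needs_predicate


factors = [None, 4, 32, 2]

# Hypothetical extents, for illustration only.
single_axis_extent = 32       # one unfused axis of a cache write block
fused_extent = 64 * 32        # the same data with both axes fused

print(infer_split(single_axis_extent, factors))  # ([1, 4, 32, 2], True)
print(infer_split(fused_extent, factors))        # ([8, 4, 32, 2], False)
```

   Under these assumptions, the unfused axis is over-covered by the factors 
and needs a predicate, while the fused loop divides evenly and does not.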

