huangxiao2008 commented on pull request #10580:
URL: https://github.com/apache/tvm/pull/10580#issuecomment-1066257810


   @masahi @Meteorix Take a concrete case for example , such as we compute a 
GEMM with M = 128, N = 128 and K = 16, and Batch = 1.  and Hardcode setting 
block_row_warps = 2, and block_col_warps = 2 in schedule template.
   
   In this case, a Warp will cover 16x16 tile size, and single wmma.sync 
instruction will do all the work. 
   
   ======================================Before  modify , Each Warp takes 4 
wmma.sync ======================================================
   
![image](https://user-images.githubusercontent.com/8383035/158091276-6e84f56e-b12f-4b96-a579-4d52cedf0555.png)
   
   =====================================After modify, the loop is binded to 
threadIdx.z threadIdx.y and each warp   has a single wmma.sync 
====================
   
![image](https://user-images.githubusercontent.com/8383035/158091488-4fc627dd-f487-4038-b3e8-7b21614f8c0d.png)
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to