[GitHub] [tvm] huangxiao2008 commented on pull request #10580: [Minor] fix redundant compute in cuda batch matmul tensorcore

GitBox Sun, 13 Mar 2022 18:54:17 -0700


huangxiao2008 commented on pull request #10580:
URL: https://github.com/apache/tvm/pull/10580#issuecomment-1066257810



   @masahi @Meteorix Take a concrete case: we compute a GEMM with 
M = 128, N = 128, K = 16, and Batch = 1, and hardcode 
block_row_warps = 2 and block_col_warps = 2 in the schedule template.
   
   In this case, each warp covers a 16x16 tile, so a single wmma.sync 
instruction does all of that warp's work. 
   
   ====== Before the change: each warp issues 4 wmma.sync instructions ======
   
![image](https://user-images.githubusercontent.com/8383035/158091276-6e84f56e-b12f-4b96-a579-4d52cedf0555.png)
   
   ====== After the change: the loop is bound to threadIdx.z and 
threadIdx.y, and each warp issues a single wmma.sync ======
   
![image](https://user-images.githubusercontent.com/8383035/158091488-4fc627dd-f487-4038-b3e8-7b21614f8c0d.png)
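   To make the difference concrete, here is a minimal Python sketch that counts wmma.sync calls per warp. It is not the TVM schedule code: the function name `wmma_calls_per_warp` and the `bind_to_threads` flag are hypothetical, and it only models a single thread block with one 16x16 tile per warp, as in the case above. Which of threadIdx.y / threadIdx.z maps to rows or columns is also an assumption here.

```python
from collections import Counter


def wmma_calls_per_warp(block_row_warps, block_col_warps, bind_to_threads):
    """Count the wmma.sync calls each warp issues for one block of warp tiles.

    Hypothetical model: the block holds block_row_warps x block_col_warps
    warps and the same number of 16x16 output tiles.
    """
    calls = Counter()
    for warp_y in range(block_row_warps):          # stands in for threadIdx.y
        for warp_z in range(block_col_warps):      # stands in for threadIdx.z
            # Loop over the output tiles of the block.
            for tile_i in range(block_row_warps):
                for tile_j in range(block_col_warps):
                    # When the tile loop is bound to the thread axes, a warp
                    # only executes the iteration matching its own indices.
                    if bind_to_threads and (tile_i, tile_j) != (warp_y, warp_z):
                        continue
                    calls[(warp_y, warp_z)] += 1
    return dict(calls)


# Unbound loop: every warp runs every iteration (redundant work).
before = wmma_calls_per_warp(2, 2, bind_to_threads=False)
# Loop bound to the thread axes: each warp runs only its own tile.
after = wmma_calls_per_warp(2, 2, bind_to_threads=True)

print("before:", before)
print("after: ", after)
```

   With block_row_warps = 2 and block_col_warps = 2, the unbound version gives 4 calls for each of the 4 warps, while the bound version gives 1 call per warp, which matches the two screenshots.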
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

