masahi commented on PR #80:
URL: https://github.com/apache/tvm-rfcs/pull/80#issuecomment-1159398339

   @Hzfengsy On NVIDIA Ampere, the only asynchronous operation is the global-to-shared-memory copy (`cp.async`), so wmma, mma, and CUDA-core schedules can all benefit from it.
   
   I have an MMA schedule with a multi-stage pipeline, where the global-to-shared-memory copy is 4x multi-buffered and asynchronous. The test case is here:
https://github.com/masahi/tvm/blob/2b325394e951b8b38c9a24d9a4b7a8c6f6d749e7/tests/python/unittest/test_tir_transform_inject_software_pipeline.py#L1436.
 It generates interesting code, but I'm not getting good performance yet: the baseline MMA schedule without any pipelining or async gets 40 TFLOPS on an RTX 3070, while this fancy schedule gets only 33 TFLOPS.
   
   My next step is to get the Ampere async GEMM to run on par with, or better than, the baseline.
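   For illustration, here is a minimal Python sketch of the multi-buffered pipeline structure described above (this is hypothetical pseudocode modeling the schedule shape, not TVM's actual generated code; `pipelined_gemm_trace` and the op names are made up for this sketch). With a 4-deep buffer, the prologue prefetches three tiles, and the steady-state loop issues the async copy for tile i+3 while computing on tile i:

```python
NUM_STAGES = 4  # pipeline depth: 4 shared-memory buffers (4x multi-buffering)

def pipelined_gemm_trace(num_tiles):
    """Return the interleaved (op, tile) trace of a software-pipelined loop."""
    trace = []
    # Prologue: prefetch the first NUM_STAGES - 1 tiles into shared memory.
    for i in range(min(NUM_STAGES - 1, num_tiles)):
        trace.append(("async_copy", i))    # cp.async-style global->shared copy
    # Steady state: overlap the copy of tile i + NUM_STAGES - 1 with compute on tile i.
    for i in range(num_tiles):
        nxt = i + NUM_STAGES - 1
        if nxt < num_tiles:
            trace.append(("async_copy", nxt))
        trace.append(("wait", i))          # wait until buffer i's copy has landed
        trace.append(("mma_compute", i))   # tensor-core MMA on shared-memory tile i
    return trace

trace = pipelined_gemm_trace(8)
```

   In this trace, every tile's copy is issued NUM_STAGES - 1 iterations ahead of its compute, which is what lets the copy latency hide behind the MMA work.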


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
