masahi commented on PR #80: URL: https://github.com/apache/tvm-rfcs/pull/80#issuecomment-1159398339
@Hzfengsy On NVIDIA Ampere, the only asynchronous operation is the global-to-shared-memory copy (`cp.async`), so wmma, mma, and CUDA-core workloads can all benefit from it. I have an MMA schedule with a multi-stage pipeline in which the global-to-shared copy is 4x multi-buffered and asynchronous. The test case is here: https://github.com/masahi/tvm/blob/2b325394e951b8b38c9a24d9a4b7a8c6f6d749e7/tests/python/unittest/test_tir_transform_inject_software_pipeline.py#L1436. It generates interesting code, but I'm not getting good performance yet: the baseline MMA schedule, without any pipelining or async, reaches 40 TFLOPS on an RTX 3070, while this fancier schedule reaches only 33 TFLOPS. My next step is to get the Ampere async GEMM to run on par with, or better than, the baseline.
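For readers unfamiliar with the pattern being discussed: the multi-buffered async pipeline above can be sketched in plain CUDA using the `cuda_pipeline.h` primitives that map to Ampere's `cp.async`. This is a hand-written illustration of the technique, not the code TVM generates; the names `STAGES`, `TILE`, and `pipelined_copy`, and the tile sizes, are all illustrative assumptions.

```cuda
// Illustrative sketch of a 4-stage (4x multi-buffered) async
// global-to-shared copy pipeline on Ampere. Not TVM-generated code.
#include <cuda_pipeline.h>

#define STAGES 4    // 4x multi-buffering, as in the schedule above
#define TILE   128  // hypothetical tile size in floats

__global__ void pipelined_copy(const float* __restrict__ gmem, int n) {
    __shared__ float smem[STAGES][TILE];
    int tid = threadIdx.x;
    int num_tiles = n / TILE;

    // Prologue: issue async copies for the first STAGES-1 tiles.
    for (int s = 0; s < STAGES - 1 && s < num_tiles; ++s) {
        __pipeline_memcpy_async(&smem[s][tid],
                                &gmem[s * TILE + tid], sizeof(float));
        __pipeline_commit();
    }

    for (int t = 0; t < num_tiles; ++t) {
        // Issue the copy for a future tile while computing on the current one.
        int prefetch = t + STAGES - 1;
        if (prefetch < num_tiles) {
            __pipeline_memcpy_async(&smem[prefetch % STAGES][tid],
                                    &gmem[prefetch * TILE + tid], sizeof(float));
            __pipeline_commit();
        }
        // Wait until at most STAGES-2 copy batches are still in flight,
        // i.e. the copy for tile t has completed.
        __pipeline_wait_prior(STAGES - 2);
        __syncthreads();

        // ... compute (wmma/mma) on smem[t % STAGES] here ...
        __syncthreads();
    }
}
```

The key point is that the copy issued for tile `t + STAGES - 1` overlaps with the compute on tile `t`; on pre-Ampere hardware the same multi-buffering is possible, but the copy consumes registers and is not truly asynchronous.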
