masahi commented on pull request #10110:
URL: https://github.com/apache/tvm/pull/10110#issuecomment-1026837332


   HUGE UPDATE: Thanks to a tip from @manishucsd and @hwu36, it turns out that 
upgrading CUDA from 11.3 to 11.6 alone gives a 2x speedup on cutlass 
strided dgrad (unreal). Moreover, there was a critical bug in the 
initialization of the `beta` parameter, which was causing unnecessary memory 
traffic and hurt performance badly in the batch 256 case. The result was still 
correct because the `C` tensor, which aliases the output buffer, was 
initialized with zeros.
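   
   To illustrate why the `beta` bug was a performance problem and not a correctness problem, here is a minimal pure-Python sketch of a linear-combination epilogue, `D = alpha * accum + beta * C`. It is not the actual cutlass or TVM code: the function name, the read counter, and the buggy value `beta = 1.0` are assumptions made for illustration (the exact wrong value is not stated above).

```python
def linear_combination_epilogue(accum, c, alpha=1.0, beta=0.0):
    """Hypothetical model of the epilogue D = alpha * accum + beta * C.

    Returns (D, number of C elements read). The read count stands in for
    the extra global-memory traffic caused by a nonzero beta.
    """
    if beta == 0.0:
        # With beta == 0 the source tensor C does not need to be loaded.
        return [alpha * a for a in accum], 0
    # With beta != 0 every element of C is read, even if C is all zeros.
    return [alpha * a + beta * ci for a, ci in zip(accum, c)], len(c)


accum = [1.5, -2.0, 0.25, 4.0]   # stand-in for the conv accumulator values
c_zeros = [0.0] * len(accum)     # C was zero-initialized, as noted above

# Assumed buggy setting: beta is nonzero, so C is read for no benefit.
d_buggy, reads_buggy = linear_combination_epilogue(accum, c_zeros, beta=1.0)
# Fixed setting: beta is zero, so C is never touched.
d_fixed, reads_fixed = linear_combination_epilogue(accum, c_zeros, beta=0.0)

assert d_buggy == d_fixed        # same output either way
print(reads_buggy, reads_fixed)  # extra reads only in the buggy case
```

   Under these assumptions the output is identical in both cases, but the nonzero `beta` reads one extra tensor the size of the output, which would matter most for large batches.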
   
   Here are the updated results after these two fixes:
   
   * [Batch size 8](https://gist.github.com/masahi/90f68803ade90a5900029b128eb59dcf)
   * [Batch size 256](https://gist.github.com/masahi/a373b06e76e9d228d45ffdddc1cd9f76)
   
   cutlass now wins in ALL but one case at batch size 256, and even there the 
gap is small (0.96 vs 0.94). Note that activation fusion is not enabled for 
dgrad yet, so I expect the cutlass perf to be much better in practice for DL 
training use cases.

