masahi edited a comment on pull request #10185:
URL: https://github.com/apache/tvm/pull/10185#issuecomment-1033068680


   Actually, the accuracy difference was there even before I added parallel 
split-k to wgrad, and the result got closer to cuDNN after adding split-k. So I 
believe the issue is not in the parallel reduction; something is off 
elsewhere. I have seen some workloads where cuDNN itself uses cutlass's wgrad 
and reduction kernels, and even in that case there was a difference. I should 
probably look at how TVM is using wgrad first.
   
   I haven't tried fp32 wgrad on large inputs yet. For the small ones we use in 
the unit test, the result looked good. We also have the option of comparing 
against TVM native results, which I have only looked at briefly.
   
   > CUTLASS profiler uses integer input to initialize tensors and matrices. 
This is to make the error checking easier. You can also use the CUTLASS 
profiler approach to make sure there are no functional error, i.e., try the 
operation on integer input.
   
   That's very interesting... I didn't know that. I can definitely try, thanks. 
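
A minimal sketch of why integer inputs make this kind of check easier, assuming a toy float32 dot product stands in for the wgrad reduction. `split_k_dot` and `to_f32` are hypothetical helpers written for illustration, not CUTLASS or TVM APIs. With small integer inputs every product and partial sum is exactly representable in float32, so any split count must give bit-identical results; with fractional inputs the summation order can change the rounding.

```python
import random
import struct


def to_f32(x):
    """Round a Python float (fp64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32(a, b):
    """Dot product with each multiply and add rounded to float32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc


def split_k_dot(a, b, num_splits):
    """Toy split-k: reduce K in slices, then sum the partial results."""
    k = len(a)
    step = (k + num_splits - 1) // num_splits
    partials = [dot_f32(a[i:i + step], b[i:i + step]) for i in range(0, k, step)]
    total = 0.0
    for p in partials:
        total = to_f32(total + p)
    return total


random.seed(0)
K = 4096  # reduction length, chosen arbitrarily for the illustration

# Integer-valued inputs: all intermediate values stay far below 2**24,
# so float32 arithmetic is exact and the summation order cannot matter.
int_a = [float(random.randint(-3, 3)) for _ in range(K)]
int_b = [float(random.randint(-3, 3)) for _ in range(K)]
int_results = {split_k_dot(int_a, int_b, s) for s in (1, 4, 16)}

# Fractional inputs: results may differ slightly between split counts.
real_a = [to_f32(random.uniform(-1, 1)) for _ in range(K)]
real_b = [to_f32(random.uniform(-1, 1)) for _ in range(K)]
real_results = [split_k_dot(real_a, real_b, s) for s in (1, 4, 16)]
max_diff = max(real_results) - min(real_results)

print("distinct integer results:", len(int_results))
print("spread across split counts (fractional inputs):", max_diff)
```

If the integer case ever produces more than one distinct value, the mismatch is a functional bug rather than rounding noise, which is the property the quoted suggestion relies on.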



