> * It shocks me that your solution is even faster than cuBLAS and cuDNN. I tried 
> to reproduce the result but failed. Did you use BatchMatMul and BatchConv? And 
> which GPU did you test on? Could you share the details of the performance 
> measurements?
> 
Our fp16 TensorCore kernels are tuned on a V100 with CUDA toolkit 9.0 and driver 
396.44. The int8 TensorCore kernels are tuned on a T4 with CUDA toolkit 10.1 and 
driver 418.39. On different GPUs, the performance of the tuned kernels can 
differ.
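
As a minimal sketch (not part of the original reply), one way to confirm that the local GPU model, driver, and CUDA toolkit match the configurations above before comparing benchmark numbers:

```python
# Hypothetical environment check, assuming nvidia-smi and nvcc are on PATH.
# Tuned kernels are hardware-specific, so results should only be compared
# against the GPU / driver / toolkit combination they were tuned on.
import subprocess

def report_environment():
    # GPU model and driver version via nvidia-smi.
    gpu_info = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    # CUDA toolkit version via nvcc (the line containing "release").
    nvcc_out = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    release_line = next(line for line in nvcc_out.splitlines() if "release" in line)

    print("GPU / driver :", gpu_info)
    print("CUDA toolkit :", release_line.strip())

if __name__ == "__main__":
    report_environment()
```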


