masahi commented on pull request #10423:
URL: https://github.com/apache/tvm/pull/10423#issuecomment-1055852972


   > (N.B. I think I'm compiling the CUTLASS target wrong, because the results 
are suspiciously close to the regular CUDA target)
   
   So you are using cutlass on a regular conv2d (not conv2d_transpose) with 
groups = 1? What is the input data type (fp16 or fp32)? With fp32, it is 
expected to be slow unless you are on an A100. We always use tensorcore for 
cutlass, even if the input is fp32. We use a neat trick to do that, but it is 
known to be slow on Geforce cards because their TF32 perf is no better than 
fp32. See https://github.com/NVIDIA/cutlass/discussions/390
   
   Also, unless the input is already in NHWC layout, cutlass will ignore your 
op, so it is likely that you are not using cutlass at all. 
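   The conditions above can be summarized as a small sanity-check helper. This 
is a toy sketch: the function name, return values, and the grouped-conv 
assumption are hypothetical, not TVM's actual CUTLASS BYOC partitioning logic.

```python
# Toy predicate mirroring the offload conditions described above.
# NOTE: cutlass_offload_check is a hypothetical helper for illustration,
# not part of TVM's API.

def cutlass_offload_check(layout: str, dtype: str, groups: int = 1):
    """Return (offloaded, note) for a conv2d under the rules stated above."""
    # CUTLASS ignores ops whose data layout is not NHWC.
    if layout != "NHWC":
        return False, "not offloaded: CUTLASS requires NHWC layout"
    # Assumption: the groups = 1 question above suggests grouped conv2d
    # does not take the CUTLASS path.
    if groups != 1:
        return False, "not offloaded: grouped conv2d assumed unsupported"
    # Tensor cores are always used, even for fp32 (via TF32), but TF32
    # is no faster than fp32 on GeForce cards.
    if dtype == "float32":
        return True, "offloaded, but expect fp32/TF32 to be slow off A100"
    return True, "offloaded"
```

   For example, `cutlass_offload_check("NCHW", "float16")` reports not 
offloaded, which matches the diagnosis above for an op left in NCHW layout.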


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
