masahi commented on pull request #10423: URL: https://github.com/apache/tvm/pull/10423#issuecomment-1055852972
> (N.B. I think I'm compiling the CUTLASS target wrong, because the results are suspiciously close to the regular CUDA target)

So you are using CUTLASS on a regular conv2d (not conv2d_transpose) with groups = 1? What is the input data type (fp16 or fp32)? On fp32 it is expected to be slow unless you are on an A100. We always use Tensor Cores for CUTLASS, even if the input is fp32. We use a neat trick to do that, but it is known to be slow on GeForce cards because their TF32 performance is no better than fp32. See https://github.com/NVIDIA/cutlass/discussions/390

Also, unless the input is already in NHWC layout, CUTLASS will ignore your op. So it is likely that you are not using CUTLASS at all.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
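A quick way to see the layout requirement mentioned above: CUTLASS offload matches conv2d ops whose data layout is NHWC, while PyTorch/default Relay models typically use NCHW. A minimal sketch (plain NumPy, not the TVM API) of what the NCHW→NHWC conversion does to a tensor; in TVM itself this is usually handled by a layout-conversion pass such as `relay.transform.ConvertLayout` before partitioning for CUTLASS:

```python
import numpy as np

# Hypothetical input in NCHW layout: batch=1, channels=64, height=32, width=32.
nchw = np.random.rand(1, 64, 32, 32).astype("float16")

# Reorder axes to NHWC (N, H, W, C), the layout the CUTLASS path expects.
nhwc = nchw.transpose(0, 2, 3, 1)

print(nchw.shape)  # (1, 64, 32, 32)
print(nhwc.shape)  # (1, 32, 32, 64)
```

If the model stays in NCHW, the CUTLASS partitioner simply leaves the conv2d to the regular CUDA target, which would explain benchmark numbers that match the CUDA baseline.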
