JCBrouwer commented on issue #10223: URL: https://github.com/apache/tvm/issues/10223#issuecomment-1052408332

> How TVM + cuDNN compares to PT? Since you are running on fp16, I'd hope that we can use tensorcore. But I've never seen grouped convolution running on tensorcore. Also cutlass is generally faster than cuDNN but it doesn't support grouped or depthwise afaik.

I'm not quite sure what the best way to benchmark/profile things is. I've been trying to use `ncu`, but it's __very__ slow for the PyTorch models. At the moment the PyTorch models (vanilla, traced, `optimize_for_inference`) reach about 9-11 fps, and the TVM + cuDNN version reaches about 15-19 fps. I'm hoping to get into the 25-30 fps range. I'm trying to get more of the computation done in fp16, but I've run into issue #10397.

> You may try our auto-scheduler to see if it can beat cuDNN.

So far the AutoTVM tuner hasn't been successful for me. It took a couple of days to tune all of the ops with the default settings from the tutorial, and the result was actually slower than the untuned TVM + cuDNN version. I haven't looked too deeply into the auto-scheduler yet because I couldn't find a good tutorial on applying it to a large model (I think the only tutorial is for single ops?). I also figured it would be less effective here because of the use of cuDNN, which might reduce the flexibility of the scheduler, although I'm not sure whether that's actually the case.
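For reference, the fps numbers above come from simple wall-clock timing of repeated forward passes. A minimal sketch of that kind of measurement (the `run_inference` callable and the dummy workload below are placeholders, not the actual model):

```python
import time

def measure_fps(run_inference, warmup=10, iters=100):
    """Estimate frames per second of an inference callable via wall-clock timing.

    `run_inference` stands in for one forward pass of the real model. For GPU
    frameworks it should synchronize the device before returning (e.g. call
    torch.cuda.synchronize()), otherwise the timing only measures kernel launch.
    """
    # Warm up first so JIT compilation, cuDNN autotuning, caches, etc.
    # do not pollute the measured iterations.
    for _ in range(warmup):
        run_inference()
    start = time.perf_counter()
    for _ in range(iters):
        run_inference()
    elapsed = time.perf_counter() - start
    return iters / elapsed

# Usage with a dummy CPU workload standing in for the model:
fps = measure_fps(lambda: sum(i * i for i in range(10_000)))
print(f"{fps:.1f} fps")
```

This only gives end-to-end throughput; for per-kernel breakdowns `ncu` (or the coarser `nsys`) is still needed, slow as it is.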
