JCBrouwer edited a comment on pull request #10423:
URL: https://github.com/apache/tvm/pull/10423#issuecomment-1055433768


   On a related note, from more testing on my larger model (a StyleGAN 
generator), I've found that it's actually faster to split up grouped conv2d and 
conv2d_transpose into multiple non-grouped versions (except for depthwise ones).
   
   I'm not sure whether that will be the case for all models (it might also be 
the downstream optimization passes that make this faster), but maybe it makes 
sense to add an option that allows splitting grouped convs rather than lowering 
them to the possibly slower true group conv ops?
   
   ```
   name                                               time ms    fps
   G_tvm_target=cuda_split=none                       44886      0.0223    
   G_tvm_target=cuda_split=transpose                  17853      0.0560    
   G_tvm_target=cuda_split=both                       16114      0.6206    
   G_tvm_target=cuda,cutlass_split=none               44887      0.0223    
   G_tvm_target=cuda,cutlass_split=transpose          17852      0.0560    
   G_tvm_target=cuda,cutlass_split=both               16116      0.6205    
   G_tvm_target=cuda,cudnn_split=none                 59         16.8368   
   G_tvm_target=cuda,cudnn_split=transpose            61         16.3961   
   G_tvm_target=cuda,cudnn_split=both                 49         20.4347   
   ```
   
   (N.B. I think I'm compiling the CUTLASS target wrong, because the results 
are suspiciously close to the regular CUDA target)
   
   EDIT: Maybe this just makes more sense as a custom pass, although it might 
be nice to write it down somewhere...
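   The rewrite being proposed can be sketched numerically. This is a minimal 
illustration, not TVM code: `conv2d` below is a naive NCHW reference 
implementation (stride 1, no padding), and the function names are hypothetical. 
It checks that a grouped conv2d (expressed as one dense conv with a 
block-diagonal kernel) gives the same result as G ordinary conv2ds on channel 
slices, concatenated along the output-channel axis.

```python
import numpy as np

def conv2d(x, w):
    # Naive reference conv: x (C_in, H, W), w (C_out, C_in, kh, kw),
    # stride 1, no padding -> (C_out, H-kh+1, W-kw+1).
    c_out, _, kh, kw = w.shape
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((c_out, oh, ow))
    for o in range(c_out):
        for i in range(oh):
            for j in range(ow):
                out[o, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * w[o])
    return out

def grouped_as_dense(x, w, groups):
    # Grouped conv as one dense conv: embed the per-group weights
    # (C_out, C_in/groups, kh, kw) into a block-diagonal dense kernel.
    c_out, cin_g = w.shape[0], w.shape[1]
    cout_g = c_out // groups
    w_full = np.zeros((c_out, x.shape[0]) + w.shape[2:])
    for g in range(groups):
        w_full[g * cout_g:(g + 1) * cout_g,
               g * cin_g:(g + 1) * cin_g] = w[g * cout_g:(g + 1) * cout_g]
    return conv2d(x, w_full)

def grouped_as_split(x, w, groups):
    # The proposed rewrite: slice channels, run one ordinary conv2d
    # per group, concatenate along the output-channel axis.
    cin_g, cout_g = x.shape[0] // groups, w.shape[0] // groups
    return np.concatenate(
        [conv2d(x[g * cin_g:(g + 1) * cin_g],
                w[g * cout_g:(g + 1) * cout_g])
         for g in range(groups)], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))      # C_in = 4
w = rng.standard_normal((6, 2, 3, 3))   # C_out = 6, groups = 2
assert np.allclose(grouped_as_dense(x, w, 2), grouped_as_split(x, w, 2))
```

   Whether the split form is actually faster then comes down to how well the 
smaller non-grouped convs map onto the available schedules, which is why it may 
vary per model.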

