t-vi commented on pull request #5791:
URL: https://github.com/apache/incubator-tvm/pull/5791#issuecomment-644258600


   @mbrookhart Yes, my use case is transformers. The PyTorch frontend 
translates the matmul used in HuggingFace `transformers`' BERT into 
`batch_matmul`. The speedup is roughly 1.5x-2x on ROCm (gfx906), with some 
speedup on a GTX 1080 Ti as well, even though it currently hits a reshape right 
after the `batch_matmul`. I don't quite reach the speed of ONNXRuntime yet.
   I'm currently preparing a detailed writeup (and that's the pattern behind my 
recent PRs: tunable BMM, this one, and support for integers and other 
non-float32 dtypes in the PyTorch frontend).
   
   I imagine it would be cool to move the pass to pattern matching. I would 
expect it to replace the code shared by the combine passes for `batch_matmul` 
and `conv2d` (and, to some extent, the `dense` combiner) rather than the parts 
that are separate. Incidentally, I have been wondering about the efficiency of 
the `dense` combiner: the code comments mention BERT as a use case, but it is 
unclear to me whether `dense` -> `batch_matmul` with a "duplicated" (possibly 
stride-0) input is better than `dense` -> `dense` with non-contiguous results 
(the columns would still be contiguous; only the rows would be interleaved 
between the ops). But then, I haven't looked much at how TVM deals with 
strides, which matters here because self-attention typically has some reshapes 
that would be nice to fuse.
   

