t-vi commented on pull request #5791: URL: https://github.com/apache/incubator-tvm/pull/5791#issuecomment-644258600
@mbrookhart Yes, my use-case is transformers. The PyTorch frontend translates the matmul used in HuggingFace `transformers`' BERT into `batch_matmul`. The speedup is roughly 1.5x-2x on ROCm (gfx906), and there is also some speedup on a GTX 1080 Ti, even though it currently hits a reshape right after `batch_matmul`. I don't quite reach the speed of ONNXRuntime yet. I'm currently preparing a detailed writeup (and that's the pattern of my recent PRs - tuneable BMM, this, support for integers and other non-float32 types in the PyTorch frontend).

I imagine it would be cool to move the pass to pattern matching. I would expect that it would replace the code shared by the combine passes of `batch_matmul` and `conv2d` (and to some extent the `dense` combiner), rather than the parts that are separate.

I have been wondering about the efficiency of `dense`, by the way - it mentions BERT as a use-case in the code comments, but it is unclear to me whether `dense` -> `batch_matmul` with a "duplicated" (possibly stride-0) input is better than `dense` -> `dense` with non-contiguous results (though the columns would still be contiguous and only the rows would be interleaved between the ops). But then I haven't looked much at how TVM deals with strides (which is relatively significant because self-attention typically has some reshapes that would be nice to fuse).
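For reference, here is a small NumPy sketch of what combining parallel `dense` ops into one `batch_matmul` computes; the shapes and names are made up for illustration, and the broadcast stands in for the stride-0 "duplicated" input mentioned above:

```python
import numpy as np

# Hypothetical BERT-style shapes (small for illustration):
# batch of 8 sequences, seq_len 128, hidden size 64.
x = np.random.randn(8, 128, 64).astype("float32")
wq = np.random.randn(64, 64).astype("float32")  # query projection
wk = np.random.randn(64, 64).astype("float32")  # key projection

# Option A: two separate dense ops (2-D matmul applied per batch element).
q_dense = x @ wq
k_dense = x @ wk

# Option B: one batched matmul over the stacked weights. The input x is
# "duplicated" along the new leading axis only via broadcasting (stride 0),
# so no actual copy is needed.
w = np.stack([wq, wk])                    # (2, 64, 64)
qk = np.einsum("bsh,nho->nbso", x, w)     # (2, 8, 128, 64)

# Both forms produce the same results.
assert np.allclose(q_dense, qk[0], atol=1e-5)
assert np.allclose(k_dense, qk[1], atol=1e-5)
```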
