masahi opened a new pull request #9439:
URL: https://github.com/apache/tvm/pull/9439


   Adds support for offloading `batch_matmul` via the CUTLASS `GemmBatched` kernel. Dynamic shapes are also supported.
   I didn't add a profiler specifically for the `GemmBatched` kernel; instead I piggyback on the `dense` profiler, because CUTLASS uses the same kernel for batched GEMM, parallelized over the batch dimension (the grid Z dimension).
   
   This allows me to test the CUTLASS BYOC path on Hugging Face `BERT-large` end to end. I also have a number for TensorRT, measured using their [BERT demo](https://github.com/NVIDIA/TensorRT/tree/master/demo/BERT); note that the TensorRT result uses Google's own implementation of BERT.
   
   This is the current result comparing CUTLASS offload, AutoTVM native, and TensorRT, all using tensor cores.
   
   Input size is `(8, 128)`.
   
   | CUTLASS | AutoTVM | TensorRT (Google's implementation) |
   | -- | -- | -- |
   | 23.0551 | 23.6729 | 14.0 |
   
   Here is the detailed nvprof output from the CUTLASS and AutoTVM runs:
   * cutlass https://gist.github.com/masahi/01c064529f859afefd0cc34ef138aa08
   * autotvm https://gist.github.com/masahi/45ac7c45b637c2f3e4c35f8db11e9c88
   
   As you can see, activation fusion is currently enabled only for GeLU. There are other activations and elementwise ops that could be fused in principle, such as:
   * 
https://gist.github.com/masahi/45ac7c45b637c2f3e4c35f8db11e9c88#file-bert_large_gelu_fusion_autotvm-txt-L30-L32
   * 
https://gist.github.com/masahi/45ac7c45b637c2f3e4c35f8db11e9c88#file-bert_large_gelu_fusion_autotvm-txt-L26
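   For reference, the GeLU epilogue that is fused is the standard exact-erf form; a minimal NumPy sketch (not the CUTLASS code):

   ```python
   import math
   import numpy as np

   def gelu(x):
       # Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
       erf = np.vectorize(math.erf)
       return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

   # Fusing GeLU into the GEMM epilogue means applying this elementwise
   # function to the accumulator tile before it is written to global
   # memory, instead of launching separate kernels afterwards.
   x = np.linspace(-4.0, 4.0, 9)
   y = gelu(x)
   ```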
   
   Together, they account for about 15% of total execution time, which is a shame. To fuse them, we need to overcome all of the following blockers:
   * Support ND-input dense to remove the `reshape` between `dense` and activations 
(https://github.com/apache/tvm/issues/8412)
   * Support multiple "source" tensors in the CUTLASS epilogue 
(https://github.com/NVIDIA/cutlass/discussions/347)
   * Enable fusion of the `cast` op into the activation 
(https://github.com/NVIDIA/cutlass/discussions/352)
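   The first blocker can be sketched in NumPy: `dense` only takes 2D input, so a 3D activation is flattened before the matmul and restored afterwards, leaving reshapes between `dense` and the activation. The shapes below are illustrative, not taken from the BERT trace.

   ```python
   import numpy as np

   # Today's pattern: reshape -> dense (2D matmul) -> reshape, which
   # puts reshapes between dense and the activation and blocks fusion.
   rng = np.random.default_rng(0)
   batch, seq, hidden, out = 8, 128, 64, 32
   x = rng.standard_normal((batch, seq, hidden)).astype(np.float32)
   w = rng.standard_normal((out, hidden)).astype(np.float32)

   flat = x.reshape(batch * seq, hidden)   # reshape before dense
   y2d = flat @ w.T                        # dense: (B*S, H) x (H, O)
   y = y2d.reshape(batch, seq, out)        # reshape after dense

   # An ND-aware dense would compute the same result with no reshapes:
   y_nd = np.einsum("bsh,oh->bso", x, w)
   assert np.allclose(y, y_nd, atol=1e-4)
   ```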
   
   
   The nvprof output also shows that softmax is a huge bottleneck, accounting for 24% of end-to-end time. TVM's CUDA softmax appears to be about 2x slower than cuDNN's; here is the result if we use cuDNN's softmax:
   
   | CUTLASS | AutoTVM | TensorRT (Google's implementation) |
   | -- | -- | -- |
   | 20.2517 | 21.1869 | 14.0 |
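   The softmax being compared is the standard numerically stable reduction; for reference, a NumPy sketch of the equivalent computation (both TVM and cuDNN implement this semantics, just with very different kernel efficiency):

   ```python
   import numpy as np

   def softmax(x, axis=-1):
       # Numerically stable softmax: subtract the row max before exp so
       # the largest exponent is 0, then normalize by the row sum.
       x_max = np.max(x, axis=axis, keepdims=True)
       e = np.exp(x - x_max)
       return e / np.sum(e, axis=axis, keepdims=True)

   # Attention-shaped input, e.g. (batch * heads, seq, seq)
   scores = np.random.default_rng(0).standard_normal((16, 128, 128)).astype(np.float32)
   probs = softmax(scores)
   assert np.allclose(probs.sum(axis=-1), 1.0, atol=1e-5)
   ```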
   
   
   
   

