masahi commented on PR #15111:
URL: https://github.com/apache/tvm/pull/15111#issuecomment-1614534204
I compared against the exllama int4 kernel for decoding. The FT GEMM kernel appears to be overkill for mat x vec multiplies: the custom mat x vec kernel from exllama is much faster. Profiles below: the FT GEMM path (`GemmFpAIntB`) first, exllama's `q4_matmul_kernel` second.
```
 Time (%)  Total Time (ns)  Instances   Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------
     59.6    3,032,906,823     81,824   37,066.2   26,882.0    26,528    78,275     12,595.2  void cutlass::Kernel<cutlass::gemm::kernel::GemmFpAIntB<cutlass::gemm::threadblock::DqMmaMultistage…
     30.7    1,561,633,485     32,704   47,750.5   47,601.5    26,657   217,961     20,785.9  void cutlass::Kernel<cutlass::gemm::kernel::GemmFpAIntB<cutlass::gemm::threadblock::DqMmaMultistage…
      4.6      233,826,480     16,416   14,243.8   14,016.0     4,257    22,881      4,835.9  void attention_kernel_batched_impl<AttentionKernel<cutlass::half_t, cutlass::arch::Sm80, (bool)1, (…
      1.2       59,620,306        513  116,218.9  116,196.0   114,597   120,773        660.2  void cutlass::Kernel<cutlass::gemm::kernel::GemmFpAIntB<cutlass::gemm::threadblock::DqMmaMultistage…
```
```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------
     77.0    1,563,943,842     57,344  27,273.0  18,111.0    16,351    42,174     11,590.3  void q4_matmul_kernel<(bool)1, (bool)1, (bool)0>(const __half *, const unsigned int *, __half *, co…
```
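The gap is consistent with decode being a batch-1 GEMV: the activation is a single row, so a tiled GEMM wastes most of each threadblock, while a matvec kernel can dequantize the int4 weights on the fly and accumulate one dot product per output column. A minimal NumPy sketch of that idea (the packing layout, group scheme, and names here are illustrative assumptions, not the actual exllama q4 format):

```python
import numpy as np

K, N = 16, 4  # small toy shapes; real decode layers are e.g. 4096 x 4096
rng = np.random.default_rng(0)

# Unpacked int4 weights in [0, 16), stored column-major-per-output here.
w_int4 = rng.integers(0, 16, size=(K, N), dtype=np.uint32)

# Pack 8 int4 values along K into each uint32 (assumed layout: nibble i of
# word kp holds row 8*kp + i).
packed = np.zeros((K // 8, N), dtype=np.uint32)
for i in range(8):
    packed |= w_int4[i::8, :] << (4 * i)

# One scale/zero per output column (assumption; real formats group along K).
scales = rng.random(N).astype(np.float32) * 0.1
zeros = np.full(N, 8.0, dtype=np.float32)

x = rng.random(K).astype(np.float32)  # single decode token: a 1 x K row

# Matvec with on-the-fly dequantization:
#   y[n] = scale[n] * sum_k (q[k, n] - zero[n]) * x[k]
y = np.zeros(N, dtype=np.float32)
for n in range(N):
    acc = 0.0
    for kp in range(K // 8):
        word = packed[kp, n]
        for i in range(8):
            q = (word >> (4 * i)) & 0xF
            acc += (float(q) - float(zeros[n])) * float(x[kp * 8 + i])
    y[n] = scales[n] * acc

# Reference: fully dequantize, then matvec.
w_ref = (w_int4.astype(np.float32) - zeros) * scales
assert np.allclose(y, x @ w_ref, atol=1e-4)
```

In a CUDA GEMV kernel the inner loop maps naturally to one warp (or thread group) per output column, with the uint32 weight words streamed once from global memory, which is why this shape favors the custom kernel over a multistage GEMM pipeline.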