masahi commented on PR #15111:
URL: https://github.com/apache/tvm/pull/15111#issuecomment-1614534204

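   I compared the FT GEMM kernel against the exllama int4 kernel for decoding. 
The FT GEMM kernel appears to be overkill for mat x vec multiply: the custom 
mat x vec kernel from exllama is much faster (median of about 26.9 us per call 
for the most frequent FT kernel versus about 18.1 us for `q4_matmul_kernel`). 
The first profile below is FT, the second is exllama.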
   
   ```
   Time (%)  Total Time (ns)  Instances   Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
   --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----
       59.6    3,032,906,823     81,824   37,066.2   26,882.0    26,528    78,275     12,595.2  void cutlass::Kernel<cutlass::gemm::kernel::GemmFpAIntB<cutlass::gemm::threadblock::DqMmaMultistage…
       30.7    1,561,633,485     32,704   47,750.5   47,601.5    26,657   217,961     20,785.9  void cutlass::Kernel<cutlass::gemm::kernel::GemmFpAIntB<cutlass::gemm::threadblock::DqMmaMultistage…
        4.6      233,826,480     16,416   14,243.8   14,016.0     4,257    22,881      4,835.9  void attention_kernel_batched_impl<AttentionKernel<cutlass::half_t, cutlass::arch::Sm80, (bool)1, (…
        1.2       59,620,306        513  116,218.9  116,196.0   114,597   120,773        660.2  void cutlass::Kernel<cutlass::gemm::kernel::GemmFpAIntB<cutlass::gemm::threadblock::DqMmaMultistage…
   ```
   
   ```
   Time (%)  Total Time (ns)  Instances   Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
   --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----
       77.0    1,563,943,842     57,344   27,273.0   18,111.0    16,351    42,174     11,590.3  void q4_matmul_kernel<(bool)1, (bool)1, (bool)0>(const __half *, const unsigned int *, __half *, co…
   ```
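
To put a rough number on "much faster", the sketch below recomputes the per-call average from the two tables above. It rests on assumptions the comment does not state: that the three `GemmFpAIntB` rows together do the job that `q4_matmul_kernel` does in the exllama run, and that the two profiles cover comparable workloads. The instance counts differ (115,041 versus 57,344), so only per-call times are compared, not totals. The helper name `avg_ns_per_call` is made up for this example.

```python
# Rows copied from the nsys tables above: (total time in ns, instances).
# Assumption: all three GemmFpAIntB rows count as "the FT GEMM kernel";
# the attention kernel row is left out because it is not a quantized matmul.
ft_gemm_rows = [
    (3_032_906_823, 81_824),
    (1_561_633_485, 32_704),
    (59_620_306, 513),
]
exllama_rows = [
    (1_563_943_842, 57_344),
]


def avg_ns_per_call(rows):
    """Instance-weighted average kernel time in ns (hypothetical helper)."""
    total_ns = sum(total for total, _ in rows)
    calls = sum(instances for _, instances in rows)
    return total_ns / calls


ft_avg = avg_ns_per_call(ft_gemm_rows)
exllama_avg = avg_ns_per_call(exllama_rows)
ratio = ft_avg / exllama_avg

print(f"FT GEMM avg per call:  {ft_avg:,.1f} ns")
print(f"exllama avg per call:  {exllama_avg:,.1f} ns")
print(f"FT / exllama:          {ratio:.2f}x")
```

Under those assumptions the FT GEMM calls take roughly 1.5x as long per call as the exllama kernel. The medians of the dominant rows (26,882.0 ns versus 18,111.0 ns) give about the same ratio.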

