masahi commented on PR #15111:
URL: https://github.com/apache/tvm/pull/15111#issuecomment-1710616892

   @LeiWang1999 yes, we have done a similar perf comparison against cuBLAS fp16, 
and indeed we found that the FT int4 kernel is faster only for small M. This is 
not surprising: a larger M (which may correspond to "batch" or "token" count in 
LLM inference) amortizes the cost of loading the large weight matrix over more 
rows, while int4 adds dequantization overhead on top of the compute. 
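   
   To make the amortization argument concrete, here is a minimal back-of-the-envelope memory-traffic sketch (not FT's actual cost model; the 4096x4096 layer shape is an assumption for illustration). It counts bytes moved for a GEMM with fp16 activations and either fp16 or int4 weights; the fp16/int4 traffic ratio starts near 4x at M=1, where weight loading dominates, and shrinks toward 1x as M grows, which is when the dequantization overhead starts to outweigh the bandwidth savings.
   
   ```python
   # Rough memory-traffic model for a GEMM of shape (M, K) x (K, N) -> (M, N).
   # Activations and outputs are fp16 (2 bytes); weights are fp16 or int4.
   # Illustrative sketch only -- ignores caches, tiling, and dequant compute cost.
   
   def bytes_moved(M, K, N, weight_bits):
       activations = M * K * 2             # fp16 input rows
       weights = K * N * weight_bits // 8  # the weight matrix, read once
       output = M * N * 2                  # fp16 result
       return activations + weights + output
   
   K = N = 4096  # hypothetical square projection layer
   for M in (1, 16, 256, 4096):
       ratio = bytes_moved(M, K, N, 16) / bytes_moved(M, K, N, 4)
       print(f"M={M:5d}  fp16/int4 traffic ratio = {ratio:.2f}")
   ```
   
   At M=1 the weight matrix is essentially all of the traffic, so int4 cuts bytes moved by nearly 4x; at M=4096 activations and outputs dominate and the ratio falls close to 1, at which point the dequantize work makes int4 a net loss versus cuBLAS fp16.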
   
   The developers also discussed this point in their GTC23 presentation. Here 
is one of their slides.  
   
   ![Screenshot 2023-08-18 8 15 
03](https://github.com/apache/tvm/assets/1776403/1f35417f-a7c1-4b5c-9462-2c517aae466b)
   
    

