masahi commented on PR #15111: URL: https://github.com/apache/tvm/pull/15111#issuecomment-1710616892
@LeiWang1999 yes, we have done a similar perf comparison against cuBLAS fp16, and indeed we found that the FT int4 kernel is faster only for small M (where M may correspond to "batch" or "token count" in LLM inference). This is not surprising: a larger M amortizes the cost of loading the large weight matrix over more rows, while int4 adds dequantization overhead on every use. The developers also discussed this point in their GTC23 presentation. Here is one of their slides.
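The tradeoff can be sketched with a rough roofline-style cost model. All numbers below (bandwidth, throughput, the dequantization penalty) are hypothetical illustrative values, not measurements of any real GPU or of the FT kernels:

```python
def gemm_time_us(m, n, k, weight_bytes_per_elem, tflops, bandwidth_gb_s=2000.0):
    """Toy estimate of GEMM time in microseconds: the kernel is bound by
    whichever is larger, weight-loading time or math time."""
    mem_us = n * k * weight_bytes_per_elem / (bandwidth_gb_s * 1e3)  # bytes -> us
    compute_us = 2.0 * m * n * k / (tflops * 1e6)                    # flops -> us
    return max(mem_us, compute_us)

N = K = 4096
for m in (1, 16, 128, 1024):
    fp16 = gemm_time_us(m, N, K, weight_bytes_per_elem=2.0, tflops=300.0)
    # int4: 4x less weight traffic, but dequantization lowers effective
    # compute throughput (300 -> 250 TFLOPs is an assumed, illustrative penalty)
    int4 = gemm_time_us(m, N, K, weight_bytes_per_elem=0.5, tflops=250.0)
    print(f"M={m}: fp16={fp16:.1f}us  int4={int4:.1f}us")
```

With these assumed numbers, small M is memory-bound, so the 4x smaller int4 weight wins; at large M both kernels are compute-bound and the dequantization overhead makes int4 lose, matching the crossover behavior described above.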
