MasterJH5574 opened a new pull request, #16731:
URL: https://github.com/apache/tvm/pull/16731

   Prior to this PR, the shared memory estimation of the GeMV rule was missing 
one part. The GeMV rule optimizes by using cross-thread reduction. When the 
target does not support warp reduction primitives, the cross-thread reduction 
is further lowered to a shared memory implementation, which consumes 
additional shared memory.
   
   If the GeMV rule does not account for this part, the total shared memory 
usage can exceed the target's shared memory limit. For example, 
mlc-ai/mlc-llm#1841 reports that the Vulkan shared memory limit was exceeded.
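
   The accounting gap can be illustrated with a minimal sketch. This is not 
the code from this PR: the function name, its parameters, the assumption that 
the fallback reduction needs one scratch element per reducing thread, and the 
16 KiB limit are all hypothetical and chosen only to make the arithmetic 
concrete.

```python
def estimate_shared_bytes(staged_elems, reduce_threads, dtype_bytes,
                          has_warp_reduction):
    """Rough shared memory estimate for one thread block (illustrative only)."""
    # Buffer that the schedule itself places in shared memory.
    staging = staged_elems * dtype_bytes
    # Assumption: without warp reduction primitives, the cross-thread
    # reduction is lowered to a shared memory buffer with one element
    # per reducing thread.
    scratch = 0 if has_warp_reduction else reduce_threads * dtype_bytes
    return staging + scratch


LIMIT_BYTES = 16 * 1024  # hypothetical per-block shared memory limit

# 8192 float16 elements staged, 256 threads taking part in the reduction.
with_warp = estimate_shared_bytes(8192, 256, 2, has_warp_reduction=True)
without_warp = estimate_shared_bytes(8192, 256, 2, has_warp_reduction=False)

print(with_warp, with_warp <= LIMIT_BYTES)        # 16384 True
print(without_warp, without_warp <= LIMIT_BYTES)  # 16896 False
```

   In this made-up configuration the staged buffer alone fits exactly, and 
the reduction scratch space is what pushes the total over the limit.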
   
   This PR fixes the issue by introducing a flag `SUPPORT_WARP_SHUFFLE` to the 
GeMV rule. We enable warp shuffle only for the CUDA and Metal backends and turn 
it off for all other backends. This is basically aligned with the lowering rule 
of the thread allreduce intrinsic.
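
   A minimal sketch of how such a flag could be derived from the target kind 
is shown below. Only the flag name `SUPPORT_WARP_SHUFFLE` and the CUDA/Metal 
choice come from this description; the helper function, the constant set, and 
the use of plain strings for target kinds are assumptions for illustration 
and do not reproduce the actual rule code.

```python
# Backends for which the rule assumes warp shuffle is available.
# Everything else falls back to the conservative shared memory estimate.
WARP_SHUFFLE_TARGET_KINDS = frozenset({"cuda", "metal"})


def support_warp_shuffle(target_kind: str) -> bool:
    """Return the value the rule would use for SUPPORT_WARP_SHUFFLE.

    `target_kind` is a plain string such as "cuda" or "vulkan"
    (a hypothetical stand-in for the target's kind name).
    """
    return target_kind in WARP_SHUFFLE_TARGET_KINDS


for kind in ("cuda", "metal", "rocm", "vulkan"):
    SUPPORT_WARP_SHUFFLE = support_warp_shuffle(kind)
    print(f"{kind}: SUPPORT_WARP_SHUFFLE={SUPPORT_WARP_SHUFFLE}")
```

   Keeping the list of supporting backends explicit means any unlisted 
backend is treated conservatively by default.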
   
   P.S. ROCm also supports warp shuffle, but with some limitations that not 
every set of parameters in the GeMV rule can meet. Therefore, we regard ROCm as 
"not supported". This only means we are conservative in the shared memory 
estimate for ROCm; it does not mean that lowering will skip warp shuffle when 
the workload is eligible.

