ksgr5566 opened a new pull request, #18823:
URL: https://github.com/apache/tvm/pull/18823

   ## Summary
   This adds gating logic on top of #17699 to support optional subgroup shuffle 
   primitives based on a compile-time flag.
   
   ## Problem
   PR #17699 always generates subgroup shuffle ops when targeting WebGPU. 
   However, not all WebGPU devices support subgroups, so we need a way to:
   - Default to shared memory reductions (universally compatible)
   - Optionally enable subgroup shuffles for devices that support them
   
   ## Solution
   Implement gating via TVM target parameter:
   - Default `thread_warp_size=1` disables warp reductions (uses shared memory 
+ barriers)
   - Add target parser `UpdateWebGPUAttrs()` that sets `thread_warp_size=32` 
when `supports_subgroups=true`
   - Add `--enable-subgroups` CLI flag in mlc-llm to surface the option to users
   
   The gating happens at the reduction path selection level 
(`IsWarpReduction()` in 
   `lower_thread_allreduce.cc`), ensuring subgroup ops are never generated 
unless explicitly enabled.
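   The parser logic described above can be sketched in Python (names are 
   illustrative; the actual `UpdateWebGPUAttrs()` lives in TVM's C++ target 
   parser, and `supports_subgroups` / `thread_warp_size` are the attributes 
   named in this PR):

   ```python
   def update_webgpu_attrs(attrs: dict) -> dict:
       """Sketch of the gating: default to shared-memory reductions,
       widen the warp size only when subgroups are explicitly enabled."""
       attrs = dict(attrs)
       # Default thread_warp_size=1 keeps IsWarpReduction() false,
       # so lowering falls back to shared memory + barriers.
       attrs.setdefault("thread_warp_size", 1)
       # Opting in via supports_subgroups=true enables the
       # subgroup-shuffle reduction path.
       if attrs.get("supports_subgroups"):
           attrs["thread_warp_size"] = 32
       return attrs
   ```

   With this shape, a target built without the flag never selects the warp 
   path, matching the "never generated unless explicitly enabled" guarantee.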
   
   ## Changes
   - TVM: Target parser + default thread_warp_size=1
   - MLC-LLM: --enable-subgroups flag  
(https://github.com/mlc-ai/mlc-llm/pull/3431)
   - WebLLM: WGSL shader dumping for verification
   
   ## Testing
   
   Tested with Llama-3.2-1B-q4f16_1. The baseline (no flag) uses shared-memory 
   reductions; with the flag, the compiler generates subgroupShuffle* ops.
   Both generated WGSL shaders are here: 
https://gist.github.com/ksgr5566/301664a5dda3e46f44092be4d09b2d4f
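   A quick way to verify the two paths from dumped WGSL (an illustrative 
   check, not part of the PR) is to grep for subgroup intrinsics versus 
   workgroup-memory declarations:

   ```python
   def uses_subgroup_shuffle(wgsl_source: str) -> bool:
       """True if the shader relies on subgroup shuffles rather than
       shared-memory (workgroup) reductions."""
       return "subgroupShuffle" in wgsl_source

   # Representative snippets of the two reduction styles:
   shared_mem_wgsl = "var<workgroup> red_buf: array<f32, 256>;"
   subgroup_wgsl = "let v = subgroupShuffleDown(x, 1u);"
   ```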


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

