ksgr5566 opened a new pull request, #18823: URL: https://github.com/apache/tvm/pull/18823
## Summary

This adds gating logic on top of #17699 to support optional subgroup shuffle primitives behind a compile-time flag.

## Problem

PR #17699 always generates subgroup shuffle ops when targeting WebGPU, but not all WebGPU devices support subgroups. We need a way to:

- Default to shared memory reductions (universally compatible)
- Optionally enable subgroup shuffles on devices that support them

## Solution

Implement gating via a TVM target parameter:

- The default `thread_warp_size=1` disables warp reductions (falling back to shared memory + barriers)
- Add a target parser `UpdateWebGPUAttrs()` that sets `thread_warp_size=32` when `supports_subgroups=true`
- Add an `--enable-subgroups` CLI flag in mlc-llm to surface the option to users

The gating happens at the reduction path selection level (`IsWarpReduction()` in `lower_thread_allreduce.cc`), ensuring subgroup ops are never generated unless explicitly enabled.

## Changes

- TVM: target parser + default `thread_warp_size=1`
- MLC-LLM: `--enable-subgroups` flag (https://github.com/mlc-ai/mlc-llm/pull/3431)
- WebLLM: WGSL shader dumping for verification

## Testing

Tested with Llama-3.2-1B-q4f16_1. The baseline (no flag) uses shared memory reductions; with the flag enabled, the compiler generates `subgroupShuffle*` ops. Both generated WGSL shaders are here: https://gist.github.com/ksgr5566/301664a5dda3e46f44092be4d09b2d4f
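The gating decision described above can be sketched in a few lines. This is a minimal illustration of the control flow, not TVM's actual API: `select_reduction_path` and `update_webgpu_attrs` are hypothetical stand-ins for the checks in `IsWarpReduction()` and the `UpdateWebGPUAttrs()` target parser.

```python
def update_webgpu_attrs(supports_subgroups: bool) -> dict:
    """Mimic the target parser: thread_warp_size defaults to 1 and is
    only raised to 32 when the user opts in via supports_subgroups."""
    return {"thread_warp_size": 32 if supports_subgroups else 1}

def select_reduction_path(thread_warp_size: int) -> str:
    """Mimic the IsWarpReduction() check: warp (subgroup) reductions are
    chosen only when thread_warp_size is greater than 1."""
    if thread_warp_size > 1:
        return "warp_shuffle"   # would emit subgroupShuffle* ops in WGSL
    return "shared_memory"      # universally compatible fallback

# Default: shared memory + barriers
attrs = update_webgpu_attrs(supports_subgroups=False)
print(select_reduction_path(attrs["thread_warp_size"]))  # shared_memory

# Opt-in: subgroup shuffles
attrs = update_webgpu_attrs(supports_subgroups=True)
print(select_reduction_path(attrs["thread_warp_size"]))  # warp_shuffle
```

Because the selection happens at this single choke point, devices that never set the flag can never see a subgroup op in the generated shader.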
