py312 [beam]

via GitHub Wed, 29 Apr 2026 08:44:38 -0700


dependabot[bot] opened a new pull request, #38322:
URL: https://github.com/apache/beam/pull/38322


   Bumps [vllm](https://github.com/vllm-project/vllm) from 0.10.1.1 to 0.20.0.
   <details>
   <summary>Release notes</summary>
   <p><em>Sourced from <a 
href="https://github.com/vllm-project/vllm/releases";>vllm's 
releases</a>.</em></p>
   <blockquote>
   <h2>v0.20.0</h2>
   <h1>vLLM v0.20.0</h1>
   <h2>Highlights</h2>
   <p>This release features 752 commits from 320 contributors (123 new)!</p>
   <ul>
   <li><strong>DeepSeek V4</strong>: Initial DeepSeek V4 support landed (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40860";>#40860</a>), 
with DSML token-leakage fix in DSV4/3.2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40806";>#40806</a>), 
DSA + MTP IMA fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40772";>#40772</a>), 
and a silu clamp limit on the shared expert (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40950";>#40950</a>).</li>
   <li><strong>CUDA 13.0 default</strong>: Default CUDA wheel on PyPI and 
<code>vllm/vllm-openai:v0.20.0</code> image switched to CUDA 13.0; architecture 
lists and build-args cleaned up (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39878";>#39878</a>), 
and CUDA bumped to 13.0.2 to match PyTorch 2.11.0 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40669";>#40669</a>). 
As a general rule of thumb, our CUDA version policy follows PyTorch's. We 
highly recommend installing vLLM with <code>uv</code> and using 
<code>--torch-backend=cu129</code> if you are on CUDA 12.9.</li>
   <li><strong>PyTorch 2.11 upgrade</strong> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/34644";>#34644</a>): 
vLLM ships on torch 2.11 for CUDA, and XPU is now also on torch 2.11 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37947";>#37947</a>) — 
XPU is no longer pinned to 2.10. This is a breaking change for environment 
dependencies.</li>
   <li><strong>Python 3.14</strong>: Added to the supported Python version list 
(<a 
href="https://redirect.github.com/vllm-project/vllm/issues/34770";>#34770</a>).</li>
   <li><strong>Transformers v5</strong>: vLLM now runs on HuggingFace 
<code>transformers&gt;=5</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/30566";>#30566</a>), 
with vision-encoder torch.compile bypass (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/30518";>#30518</a>) 
and continued v4/v5 compat fixes including PaddleOCR-VL image processor 
<code>max_pixels</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38629";>#38629</a>), 
Mistral YaRN warning (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37292";>#37292</a>), 
and Jina ColBERT rotary inv_freq recompute (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39176";>#39176</a>).</li>
   <li><strong>New large models</strong>: Hunyuan v3 (Hy3) preview (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40681";>#40681</a>) 
with HYV3 reasoning parser (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40713";>#40713</a>); 
Granite 4.1 Vision as a built-in multimodal model (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40282";>#40282</a>).</li>
   <li><strong>FlashAttention 4 as default MLA prefill</strong>: FA4 re-enabled 
as the default MLA prefill backend (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38819";>#38819</a>) 
with head-dim 512 and paged-KV support on SM90+ (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38835";>#38835</a>), 
plus an upstream FA4 sync (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38690";>#38690</a>).</li>
   <li><strong>TurboQuant 2-bit KV cache</strong>: New attention backend 
delivering 2-bit KV cache compression with 4× capacity (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38479";>#38479</a>), 
now with FA3/FA4 prefill support (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40092";>#40092</a>).</li>
   <li><strong>Online quantization frontend</strong>: New end-to-end online 
quantization frontend (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38138";>#38138</a>), 
with docs (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39736";>#39736</a>); 
experts_int8 consolidated into the FP8 online path (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38463";>#38463</a>); 
MXFP8 online quant moved to the new frontend (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40152";>#40152</a>).</li>
   <li><strong>vLLM IR</strong>: Initial IR skeleton with rms_norm op (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/33825";>#33825</a>), 
OOT-platform kernel imports (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38807";>#38807</a>), 
gemma_rms_norm reworked on IR (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39014";>#39014</a>), 
and IR op testing/benchmarking infra added (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40167";>#40167</a>) — 
foundation for future kernel work.</li>
   <li><strong>Model Runner V2 advances</strong>: Eagle prefill full-CUDA-graph 
(<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37588";>#37588</a>), 
auto-resolve cudagraph mode/sizes from attention backend (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/32936";>#32936</a>), 
fused probabilistic rejection sample kernels (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38496";>#38496</a>), 
config validation for unsupported features (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38758";>#38758</a>), 
piecewise-fallback disabled for eagle draft decodes (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39773";>#39773</a>), 
multiple prompt-logprobs support (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39937";>#39937</a>), 
prefill warmup coverage (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40746";>#40746</a>), 
and a fix for accuracy regression caused by stale sampled/draft tokens (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39833">#39833</a>).</li>
   <li><strong>MoE refactor series</strong>: Unquantized migrated to Full 
Oracle Flow (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/36286";>#36286</a>), 
CT W8A8 to Oracle (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39187";>#39187</a>), 
SharedExperts class (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35153";>#35153</a>), 
<code>SharedFusedMoE</code> removed (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35782";>#35782</a>), 
DefaultMoERunner split (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35326";>#35326</a>) 
and later combined back into <code>MoERunnerBase</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40560";>#40560</a>), 
shared/fused expert output sum moved into <code>MoERunnerBase</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35949";>#35949</a>), 
ZeroExpertFusedMoE in new framework (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35549">#35549</a>), 
<code>compressed_tensors_moe.py</code> split (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38960";>#38960</a>), 
<code>GPTQMarlinMoEMethod</code> reworked with MK (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37990";>#37990</a>), 
XPU &amp; CUTLASS MoE relocated to <code>fused_moe/experts/</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40568";>#40568</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40574";>#40574</a>), 
<code>make_expert_params_mapping</code> renamed (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40671";>#40671</a>), 
MoE LoRA refactor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40338";>#40338</a>), 
and MoE DP chunking removed (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39107";>#39107</a>).</li>
   <li><strong>Performance</strong>: Optimize batch invariant with fused rms 
norm — 2.1% E2E latency improvement (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40413";>#40413</a>); 
avoid <code>seq_lens_cpu</code> GPU→CPU sync (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40654";>#40654</a>); 
cache <code>InductorPass.hash_source</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39328";>#39328</a>); 
skip FX-graph deserialization on loading for faster warm compile (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40151";>#40151</a>); 
CUDAGraph memory profiling enabled by default for clearer startup memory 
accounting (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38284";>#38284</a>).</li>
   </ul>
   <h3>Model Support</h3>
   <ul>
   <li>New architectures: DeepSeek V4 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40860";>#40860</a>), 
Hunyuan v3 preview (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40681";>#40681</a>), 
Granite 4.1 Vision (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40282";>#40282</a>), 
EXAONE-4.5 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39388";>#39388</a>), 
BharatGen Param2MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38000";>#38000</a>), 
Phi-4-reasoning-vision-15B (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38306";>#38306</a>), 
Cheers multimodal (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38788";>#38788</a>), 
telechat3 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38510";>#38510</a>), 
FireRedLID (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39290";>#39290</a>), 
jina-reranker-v3 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38800">#38800</a>), 
Jina Embeddings v5 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39575";>#39575</a>), 
Nemotron-v3 VL Nano/Super (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39747";>#39747</a>).</li>
   <li>Gemma4 series: fast prefill (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38879";>#38879</a>), 
quantized MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39045";>#39045</a>), 
Eagle3 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39450";>#39450</a>), 
block-local attention + YaRN for Gemma3 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39823";>#39823</a>), 
bidirectional vision attention for sliding layers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40534";>#40534</a>), 
token-repetition fix via dynamic BOS (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39842";>#39842</a>), 
multimodal embedder norm-order fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40411";>#40411</a>), 
plus a string of streaming/tool-call fixes (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38844";>#38844</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38909">#38909</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38992";>#38992</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39114";>#39114</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39679";>#39679</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39027";>#39027</a>).</li>
   <li>Quantization formats: GGUF support for MiniMax-M2.1 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/36965";>#36965</a>), 
non-standard GGUF quant types with prefix such as UD-IQ1_S (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39471";>#39471</a>).</li>
   <li>Speculative decoding: Eagle3 for MiniMax-M2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37512";>#37512</a>), 
Eagle3 for Gemma4 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39450";>#39450</a>).</li>
   <li>LoRA: Qwen3ASRForConditionalGeneration (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37247";>#37247</a>), 
Gemma4ForConditionalGeneration (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39291";>#39291</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38844";>#38844</a>), 
DeepSeek V3.2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35077";>#35077</a>), 
Qwen3.5 / Step3.x expert base_layer extension (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37114";>#37114</a>), 
MoE LoRA refactor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40338";>#40338</a>), 
dual-CUDA-streams linear layer (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35721";>#35721</a>).</li>
   <li>Multimodal MRoPE refresh: mm_features-based MRoPE for Ernie-4.5 VL (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39753";>#39753</a>), 
Keye-VL / Keye-1.5-VL (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39869";>#39869</a>), 
PaddleOCR-VL (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39888";>#39888</a>).</li>
   <li>Other: Nano-Nemotron-VL static image inputs fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40724";>#40724</a>); 
Qwen3 MoE no longer calls gate twice (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40664";>#40664</a>); 
DeepSeek V2-Lite accuracy drop fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40673";>#40673</a>); 
Parakeet UX / perf enhancements (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39423";>#39423</a>); 
ColModernVBERT updated for latest HF checkpoint (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39307";>#39307</a>); 
NemotronH default <code>mamba_ssm_cache_dtype=float32</code> with 
NemotronHNanoVLV2 auto-hook (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39032";>#39032</a>); 
new TP plan styles for the Transformers backend (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40467";>#40467</a>); 
GLM-5.1 fix on ROCm (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40763">#40763</a>).</li>
   </ul>
   <h3>Engine Core</h3>
   <ul>
   <li><strong>Model Runner V2</strong>: Full CUDA graph for eagle prefill (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37588";>#37588</a>), 
auto cudagraph mode/sizes based on attention backend (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/32936";>#32936</a>), 
fused probabilistic rejection-sample kernels (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38496";>#38496</a>), 
config validation (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38758";>#38758</a>), 
eagle-draft piecewise fallback disabled (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39773";>#39773</a>), 
multiple prompt logprobs (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39937";>#39937</a>), 
prefill warmup coverage (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40746";>#40746</a>), 
stale sampled/draft tokens accuracy fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39833";>#39833</a>).</li>
   <li><strong>vLLM IR</strong>: IR skeleton + rms_norm (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/33825";>#33825</a>), 
OOT kernel import hooks (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38807";>#38807</a>), 
gemma_rms_norm on IR (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39014";>#39014</a>), 
IR op testing/benchmarking infra (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40167";>#40167</a>).</li>
   <li><strong>torch.compile</strong>: Opaque Objects on torch 2.11 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39286";>#39286</a>), 
AOT compile with batch-invariance mode (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39201";>#39201</a>), 
Inductor cache nested under AOT dir (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39718";>#39718</a>), 
split FX graph via codegen (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38657";>#38657</a>), 
Inductor pre-grad passes re-enabled for torch≥2.12 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38944";>#38944</a>), 
strings in custom ops without compile regressions (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38123";>#38123</a>), 
MLA + group FP8 fusion (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38877";>#38877</a>), 
SiluMul activation+quant fusion refactor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39684">#39684</a>), 
<code>donate_graph_module=True</code> for <code>standalone_compile</code> 
(<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39733";>#39733</a>), 
skip FX graph deserialization on loading (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40151";>#40151</a>), 
include Inductor &amp; functorch configs in compile-cache key (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40627";>#40627</a>), 
respect <code>TORCH_COMPILE_DISABLE</code> at vLLM config level (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40715";>#40715</a>), 
disable Sequence Parallelism for piecewise compilation (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38373";>#38373</a>).</li>
   <li><strong>Attention</strong>: FA4 as default MLA prefill (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38819";>#38819</a>), 
head-dim 512 + paged-KV on sm90+FA4 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38835";>#38835</a>), 
FA4 upstream sync (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38690";>#38690</a>), 
full CUDA graph for FlexAttention (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/36298";>#36298</a>), 
FlexAttention non-causal support (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40394";>#40394</a>), 
unified 2D/3D triton_unified_attention (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40631";>#40631</a>), 
TRTLLM minimax_allreduce_rms ported (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37045";>#37045</a>), 
<code>concat_mla_q</code> half-types only (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37892";>#37892</a>), 
batch-invariance-aware backend auto-selection (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40193";>#40193</a>), 
avoid <code>seq_lens_cpu</code> GPU→CPU sync (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40654";>#40654</a>).</li>
   <li><strong>Helion kernels</strong>: torch.compile support for Helion 
kernels (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38592";>#38592</a>).</li>
   <li><strong>HMA / KV offload</strong>: GPU-side KV events for HMA (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37688";>#37688</a>), 
group block hashes/IDs tracked (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37109";>#37109</a>), 
unified memory layout for offloading workers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37206";>#37206</a>), 
<code>shutdown()</code> on OffloadingConnector (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39182";>#39182</a>), 
request context passed through KV offload (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39185";>#39185</a>), 
sliding-window lookup (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/36645";>#36645</a>), 
multi-group worker transfer (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38453";>#38453</a>), 
multi-KV-group lookup/load/store (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39401";>#39401</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39402">#39402</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39403";>#39403</a>).</li>
   <li><strong>Features</strong>: NUMA binding for GPU workers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38635";>#38635</a>), 
opt-in <code>VLLM_MEDIA_CACHE</code> media URL caching (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37123";>#37123</a>), 
safe request abort when FSM fails to advance (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38663";>#38663</a>), 
KV connector prioritized over internal registry (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38301";>#38301</a>), 
CUDAGraph memory profiling on by default (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38284";>#38284</a>), 
shared-expert overlap restored (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39222";>#39222</a>), 
<code>CONFIG_REGISTRY</code> config-class lookup fix when on-disk model_type 
differs (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39554";>#39554</a>), 
workspace-resize GPU memory leak fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39226">#39226</a>), 
SWA/chunked-local runtime admission capped to startup pool-sizing bound (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40946";>#40946</a>).</li>
   <li><strong>Pluggable layers</strong>: Applied to llm_head / vocab embedding 
(<a 
href="https://redirect.github.com/vllm-project/vllm/issues/33465";>#33465</a>) 
and MoE layers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/33556";>#33556</a>).</li>
   <li><strong>Mamba</strong>: Stochastic rounding (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35753";>#35753</a>), 
different Conv state layouts (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37416";>#37416</a>), 
FlashInfer <code>selective_state_update</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/36162";>#36162</a>).</li>
   <li><strong>Metrics &amp; scheduling</strong>: Labeled waiting-breakdown 
(capacity/deferred) metric (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38435";>#38435</a>), 
API server handshake simplified (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39364";>#39364</a>), 
mm-scheduler <code>get_num_embed</code> overhead reduced (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40143";>#40143</a>), 
<code>request_id</code> on <code>FinishedRequestStats</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39710";>#39710</a>).</li>
   <li><strong>Executor</strong>: RayExecutorV2 introduced (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/36836";>#36836</a>); 
unified engine process monitoring with Ray backend (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35862";>#35862</a>).</li>
   </ul>
   <h3>Hardware &amp; Performance</h3>
   <ul>
   <li><strong>NVIDIA</strong>: swapAB support for SM120 CUTLASS blockwise FP8 
GEMM (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38325";>#38325</a>), 
MXFP4 W4A4 CUTLASS MoE for SM100 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37463";>#37463</a>), 
TRTLLM GEN NVFP4 MoE with non-512-aligned hidden dims via weight padding (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39510";>#39510</a>), 
TRTLLM FP8 MoE with shuffled weights + BlockMajorK layout (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38993";>#38993</a>), 
fused qknorm+rope kernel on SM9.0 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37376";>#37376</a>), 
tuned fused_moe config for RTX PRO 6000 Blackwell (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39183";>#39183</a>), 
ViT full CUDA graph for Qwen3-VL video (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38061";>#38061</a>), 
<code>--enable-vit-cuda-graph</code> for VLM examples (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40580";>#40580</a>), 
default <code>max_frames_per_batch</code> auto-infer for ViT CG video (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40445";>#40445</a>), 
fused FP8 output quantization into <code>merge_attn_states</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/36518";>#36518</a>), 
batched KV-cache swap via <code>cuMemcpyBatchAsync</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38460";>#38460</a>), 
sm_110 (Jetson Thor) added to CUDA 13.0 build targets (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39233";>#39233</a>).</li>
   <li><strong>AMD ROCm</strong>: ZenCPU / AMD Zen CPU backend via zentorch (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39967";>#39967</a>), 
RDNA 3.5/4 device IDs (gfx1150/1151/1201) (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38455";>#38455</a>), 
gfx1102/gfx1103 added (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40037";>#40037</a>), 
MORI EP for unquantized MoE with AITER (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37529";>#37529</a>), 
MoRI build with AMD AINIC stack (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38371";>#38371</a>), 
MoRI-IO message format aligned with P2pNcclConnector and vllm-router (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39565";>#39565</a>), 
MORI prefill/decode API correction (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39835";>#39835</a>), 
AITER gemm w8a8 ptpc integration (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/33773">#33773</a>), 
TritonW4A16LinearKernel (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37352">#37352</a>), 
asymmetric INT8 in <code>TritonInt8ScaledMMLinearKernel</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38501">#38501</a>), 
<code>fused_silu_mul_block_quant</code> enabled (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38817">#38817</a>), 
KV-cache shuffle for <code>paged_attention_common</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/32914">#32914</a>), 
MLA decode output zero-fill removed in AITER (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37539">#37539</a>), 
MLA dual RMS norm fusion pass for DeepSeek/Kimi-K2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39242">#39242</a>, 
with older-AITer guard <a 
href="https://redirect.github.com/vllm-project/vllm/issues/40386">#40386</a>), 
AITER MLA + Eagle3 spec decode (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39616">#39616</a>), 
DFlash on ROCm (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39703";>#39703</a>), 
wvSplitK FP8 path for RDNA (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37712";>#37712</a>), 
GPU↔NUMA-node detection (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40015";>#40015</a>), 
non-causal attention in <code>ROCM_ATTN</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40176";>#40176</a>), 
engine-shutdown GPU memory leak fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38503";>#38503</a>), 
score-correction-bias dtype cast for DeepSeek/Kimi-K2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39999";>#39999</a>).</li>
   <li><strong>Intel XPU</strong>: torch 2.11 upgrade for XPU (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37947";>#37947</a>) — 
no longer pinned to 2.10, initial GDN attention for Qwen3-Next / Qwen3.5 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/33657";>#33657</a>), 
torch.compile for XPU GDN attention (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39466";>#39466</a>), 
XPU MXFP8 quant op (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38682";>#38682</a>), 
XPU MXFP4 quant op (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39857";>#39857</a>), 
per-channel FP8 linear (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38316";>#38316</a>), 
FP8 KV cache on XPU (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37731";>#37731</a>), 
<code>round_int8</code> for Intel Triton (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38825";>#38825</a>), 
MoE Triton in online FP8 quantization fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40109";>#40109</a>), 
<code>current_platform.supports_fp8()</code> updated for TritonExperts (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40132";>#40132</a>), 
NIXL import on XPU fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40430";>#40430</a>), 
fusion-pattern support disabled on XPU (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39789";>#39789</a>).</li>
   <li><strong>CPU</strong>: CPU draft-model speculative decoding (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/32662";>#32662</a>), 
CPU int8 compute mode in AWQ (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35697";>#35697</a>), 
head_size 512 in <code>cpu_attn</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38676";>#38676</a>), 
gelu in <code>cpu_fused_moe</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38770";>#38770</a>), 
OMP replacement (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/36487";>#36487</a>), 
BF16 GELU LUT on ARM (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37469";>#37469</a>), 
W4A16 Autoround on CPU (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38192";>#38192</a>), 
CPU affinity/memory mgmt refactor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39781";>#39781</a>), 
IBM Z s390x torch 2.11 builds (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39910">#39910</a>), 
faster exp routine for 
lower-precision dtypes (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38112";>#38112</a>), 
inter-node pipeline parallel fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40150";>#40150</a>), 
RISC-V multiple RVV VLEN targets (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39478";>#39478</a>), 
RISC-V platform detection fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40427";>#40427</a>), 
exp() input clamp to prevent NaN on CPU/RISC-V (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40428";>#40428</a>).</li>
   <li><strong>TPU</strong>: tpu-inference upgraded to 0.18.0 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40395";>#40395</a>).</li>
   <li><strong>DeepSeek / MLA / Indexer</strong>: Persistent TopK scheduler for 
DSV3.2 DSA decode (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37421";>#37421</a>), 
DSV3.2 indexer fused weights projection (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38684";>#38684</a>), 
Triton MLA perf fixes (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/33529";>#33529</a>), 
indexer WK upcast to BF16 for fusion (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38928";>#38928</a>), 
MLA indexer uniform-decode optimization for MTP&gt;1 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39458";>#39458</a>), 
DSA + MTP IMA fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40772";>#40772</a>).</li>
   <li><strong>GDN / Mamba</strong>: Kernel fusion in GDN (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37813">#37813</a>), 
TMA aligned with upstream FLA (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38981">#38981</a>), 
GPU↔CPU syncs eliminated in prefill and spec-decode paths (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38361">#38361</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38047">#38047</a>).</li>
   </ul>
   <!-- raw HTML omitted -->
   </blockquote>
   <p>... (truncated)</p>
   </details>
   <details>
   <summary>Commits</summary>
   <ul>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/88d34c6409e9fb3c7b8ca0c04756f061d2099eb1"><code>88d34c6</code></a>
 [Docker] Install numactl CLI in CUDA runtime image (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41032">#41032</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/b8160878f07fe6aff02deb12bc842df3fa4a9237"><code>b816087</code></a>
 [DSV4] Add silu clamp limit to shared expert (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40950">#40950</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/84c276d7ea00c5bcb6af21b8035d57735479e0ab"><code>84c276d</code></a>
 [Bugfix] Cap SWA/chunked-local runtime admission to startup pool-sizing 
bound...</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/5eb36575786d7034f36315f09cf8248fbfd4230b"><code>5eb3657</code></a>
 Revert &quot;[Frontend] Remove frontend pooling multi task support.  (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37861">#37861</a>)&quot;</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/4d51588e2381018348f1022dfa3a7698899805b7"><code>4d51588</code></a>
 [Feat] DeepSeek V4 Rebased  (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40860">#40860</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/32e45636e3d7e02615facc8c63645ce4ac1d7e11"><code>32e4563</code></a>
 [torch.compile]: Disable Sequence Parallelism (SP) for piecewise compilation 
...</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/b39c266dae8cd7aee31f667c973e9698ed0b2361"><code>b39c266</code></a>
 [KV Offload] Offload all KV blocks when doing prefill in P/D (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40346">#40346</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/9558f43903faa1b6db08ac98802bf88111196345"><code>9558f43</code></a>
 [Bugfix] Size FlashInfer NVLink MNNVL workspace to EP group (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40893">#40893</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/8cd174fa358326d5cc4195446be2ebcd65c481ce"><code>8cd174f</code></a>
 [LoRA] MoE LoRA Refactor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40338">#40338</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/c798593f0d88cec583c599ea7ea40a2cc26c312b"><code>c798593</code></a>
 [Bugfix] Fix the DSML token leakage in DSV4/3.2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40806">#40806</a>)</li>
   <li>Additional commits viewable in <a 
href="https://github.com/vllm-project/vllm/compare/v0.10.1.1...v0.20.0">compare 
view</a></li>
   </details>
   <br />
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=vllm&package-manager=pip&previous-version=0.10.1.1&new-version=0.20.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   <details>
   <summary>Dependabot commands and options</summary>
   <br />
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   You can disable automated security fix PRs for this repo from the [Security 
Alerts page](https://github.com/apache/beam/network/alerts).
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
