Re: RFR: 8372153: AArch64: Performance regression in long reduction microbenchmarks after JDK-8340093 [v7]

Fei Gao Thu, 11 Jun 2026 05:18:57 -0700

> [JDK-8340093](https://bugs.openjdk.org/browse/JDK-8340093) enabled 
> auto-vectorization for more reduction loop cases using `128-bit` vector 
> operations. As a result, the following microbenchmark is negatively affected:
> `VectorReduction2.longAddDotProduct`
> 
> This patch fixes the regression.
> 
> **1. Improve code generation for MLA**
> 
> For 
> [longAddDotProduct](https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1096)[1],
>  the current implementation generates vectorized code similar to:
> 
> ldr     q17, [x12, #16]
> ldr     q18, [x11, #16]
> mla     z16.d, p7/m, z17.d, z18.d
> ldr     q17, [x11, #32]
> ldr     q18, [x12, #32]
> mla     z16.d, p7/m, z18.d, z17.d
> ...
> ldr     q17, [x11, #128]
> ldr     q18, [x12, #128]
> mla     z16.d, p7/m, z18.d, z17.d
> 
> `z16` is both the accumulator source and the destination register, so there 
> are true (read-after-write) dependencies between consecutive `mla`[2] 
> instructions. As a result, this vectorized code performs significantly worse 
> than the scalar version due to limited instruction-level parallelism.
> 
> These `mla` instructions are produced by a backend match rule that fuses 
> `AddVL` and `MulVL` into a vector `MLA`[3]. In this situation, avoiding 
> instruction fusion and instead generating separate SVE `mul` and `add` 
> instructions can improve instruction-level parallelism and overall 
> performance.
> 
> To address this, this patch introduces
> `is_multiply_accumulate_candidate()` to determine whether a node is a 
> suitable vector `MLA` candidate. For node patterns that may increase 
> execution latency, instruction fusion into `MLA` is disabled.
> 
> After applying this patch, the generated assembly looks like:
> 
> ldr     q17, [x12, #16]
> ldr     q18, [x11, #16]
> ldr     q19, [x11, #32]
> mul     z17.d, p7/m, z17.d, z18.d
> ldr     q18, [x12, #32]
> ldr     q20, [x11, #48]
> mul     z18.d, p7/m, z18.d, z19.d
> ldr     q19, [x12, #48]
> add     v16.2d, v17.2d, v16.2d
> ldr     q17, [x11, #64]
> add     v16.2d, v18.2d, v16.2d
> ldr     q18, [x12, #64]
> mul     z19.d, p7/m, z19.d, z20.d
> ldr     q20, [x12, #80]
> add     v16.2d, v19.2d, v16.2d
> 
> This sequence exposes more independent operations and reduces dependency 
> chains, leading to improved performance.
> 
> Since SVE `mls` instructions may suffer from similar issues, the same logic 
> has been extended to cover `MLS` as well. Additional microbenchmarks have 
> been added accordingly.
> 
> **2. Results**
> 
> Performance measurements on 128-bit SVE machines show that these changes 
> improve overall performance fo...
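
The dependency-chain argument in the quoted description can be illustrated with a small scalar Java sketch. It is only a model of the dataflow, not the benchmark source and not what C2 emits: the class name `LongDotProductSketch` and both method names are hypothetical, and the actual `VectorReduction2.longAddDotProduct` body may differ. The first method has the fused shape, where every step reads and writes the accumulator. The second separates the multiplies, which do not depend on the accumulator, from the adds, which are the only operations left on the chain.

```java
public class LongDotProductSketch {

    // Fused shape: acc = acc + a[i] * b[i].
    // Each step needs the accumulator produced by the previous step,
    // which mirrors the chain of mla instructions on one register.
    static long fused(long[] a, long[] b) {
        long acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i] * b[i];
        }
        return acc;
    }

    // Split shape: the products are computed without touching acc,
    // so they are independent of each other; only the adds are chained.
    static long split(long[] a, long[] b) {
        long[] products = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            products[i] = a[i] * b[i];   // independent multiplies
        }
        long acc = 0;
        for (int i = 0; i < products.length; i++) {
            acc += products[i];          // chained adds
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3, 4};
        long[] b = {5, 6, 7, 8};
        System.out.println(fused(a, b));
        System.out.println(split(a, b));
    }
}
```

Both methods return the same value; the difference is only in which operations sit on the loop-carried dependency. Whether the split form is faster on a given core depends on its multiply, add, and multiply-accumulate latencies, which this sketch does not model.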


Fei Gao has updated the pull request with a new target base due to a merge or a 
rebase. The incremental webrev excludes the unrelated changes brought in by the 
merge/rebase. The pull request contains 11 additional commits since the last 
revision:

 - Extend the fix to Vector API masked operations
 - Merge branch 'master' into fix-long-redu-regression
 - Add Vector API IR test case
 - Add a VectorAPI micro-benchmark case
 - Merge branch 'master' into fix-long-redu-regression
 - Refine the comments
 - Add an IR test case and one extra benchmark case
 - Merge branch 'master' into fix-long-redu-regression
 - Dropped unrelated changes and added the AvoidMLAChain option to enable this 
optimization selectively on Neoverse cores
 - Merge branch 'master' into fix-long-redu-regression
 - ... and 1 more: https://git.openjdk.org/jdk/compare/cdc556f5...9c38d647

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/30237/files
  - new: https://git.openjdk.org/jdk/pull/30237/files/0d5039bf..9c38d647

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=30237&range=06
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=30237&range=05-06

  Stats: 13142 lines in 179 files changed: 12271 ins; 336 del; 535 mod
  Patch: https://git.openjdk.org/jdk/pull/30237.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/30237/head:pull/30237

PR: https://git.openjdk.org/jdk/pull/30237
