Re: RFR: 8372153: AArch64: Performance regression in long reduction microbenchmarks after JDK-8340093 [v7]

Emanuel Peter Thu, 11 Jun 2026 07:36:58 -0700

On Thu, 11 Jun 2026 12:13:53 GMT, Fei Gao <[email protected]> wrote:

>> [JDK-8340093](https://bugs.openjdk.org/browse/JDK-8340093) enabled 
>> auto-vectorization for more reduction loop cases using `128-bit` vector 
>> operations. As a result, the following microbenchmark is negatively affected:
>> `VectorReduction2.longAddDotProduct`
>> 
>> This patch fixes these regressions.
>> 
>> **1. Improve code generation for MLA**
>> 
>> For 
>> [longAddDotProduct](https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1096)[1],
>>  the current implementation generates vectorized code similar to:
>> 
>> ldr     q17, [x12, #16]
>> ldr     q18, [x11, #16]
>> mla     z16.d, p7/m, z17.d, z18.d
>> ldr     q17, [x11, #32]
>> ldr     q18, [x12, #32]
>> mla     z16.d, p7/m, z18.d, z17.d
>> ...
>> ldr     q17, [x11, #128]
>> ldr     q18, [x12, #128]
>> mla     z16.d, p7/m, z18.d, z17.d
>> 
>> `z16` is the third source and destination register. There are true 
>> dependencies between consecutive `mla`[2] instructions. As a result, this 
>> vectorized code performs significantly worse than the scalar version due to 
>> limited instruction-level parallelism.
>> 
>> These `mla` instructions are produced by a backend match rule that fuses 
>> `AddVL` and `MulVL` into a vector `MLA`[3]. In this situation, avoiding 
>> instruction fusion and instead generating separate SVE `mul` and `add` 
>> instructions can improve instruction-level parallelism and overall 
>> performance.
>> 
>> To address this, this patch introduces
>> `is_multiply_accumulate_candidate()` to determine whether a node is a 
>> suitable vector `MLA` candidate. For node patterns that may increase 
>> execution latency, instruction fusion into `MLA` is disabled.
>> 
>> After applying this patch, the generated assembly looks like:
>> 
>> ldr     q17, [x12, #16]
>> ldr     q18, [x11, #16]
>> ldr     q19, [x11, #32]
>> mul     z17.d, p7/m, z17.d, z18.d
>> ldr     q18, [x12, #32]
>> ldr     q20, [x11, #48]
>> mul     z18.d, p7/m, z18.d, z19.d
>> ldr     q19, [x12, #48]
>> add     v16.2d, v17.2d, v16.2d
>> ldr     q17, [x11, #64]
>> add     v16.2d, v18.2d, v16.2d
>> ldr     q18, [x12, #64]
>> mul     z19.d, p7/m, z19.d, z20.d
>> ldr     q20, [x12, #80]
>> add     v16.2d, v19.2d, v16.2d
>> 
>> This sequence exposes more independent operations and reduces dependency 
>> chains, leading to improved performance.
>> 
>> Since SVE `mls` instructions may suffer from similar issues, the same logic 
>> has been extended to cover `MLS` as well. Additional microbenchmarks have 
>> been added accordingly.
>> 
>> **2. Results**
>> P...
>
> Fei Gao has updated the pull request with a new target base due to a merge or 
> a rebase. The incremental webrev excludes the unrelated changes brought in by 
> the merge/rebase. The pull request contains 11 additional commits since the 
> last revision:
> 
>  - Extend the fix to Vector API masked operations
>  - Merge branch 'master' into fix-long-redu-regression
>  - Add Vector API IR test case
>  - Add a VectorAPI micro-benchmark case
>  - Merge branch 'master' into fix-long-redu-regression
>  - Refine the comments
>  - Add an IR test case and one extra benchmark case
>  - Merge branch 'master' into fix-long-redu-regression
>  - Dropped unrelated changes and added the AvoidMLAChain option to enable 
> this optimization selectively on Neoverse cores
>  - Merge branch 'master' into fix-long-redu-regression
>  - ... and 1 more: https://git.openjdk.org/jdk/compare/8474f2a1...9c38d647

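The dependency-chain argument in the quoted description can be illustrated with a small standalone sketch. It is scalar C++, not the C2-generated code: `dot_fused` and `dot_split` are hypothetical names, and each statement only stands in for the `mla`, `mul`, and `add` instructions shown in the listings above. Inputs are assumed small enough that the signed 64-bit sum does not overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fused form: one multiply-accumulate per element. Every step both reads and
// writes `acc`, so step i cannot start before step i-1 has produced its result.
std::int64_t dot_fused(const std::vector<std::int64_t>& a,
                       const std::vector<std::int64_t>& b) {
    assert(a.size() == b.size());
    std::int64_t acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc = acc + a[i] * b[i];  // stands in for a single `mla` on the accumulator
    }
    return acc;
}

// Split form: the multiply does not touch `acc`, so the products of different
// iterations are independent of each other; only the add is on the chain.
std::int64_t dot_split(const std::vector<std::int64_t>& a,
                       const std::vector<std::int64_t>& b) {
    assert(a.size() == b.size());
    std::int64_t acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::int64_t prod = a[i] * b[i];  // stands in for `mul`, no use of acc
        acc = acc + prod;                       // stands in for `add`, the only chained op
    }
    return acc;
}
```

Both functions compute the same value; the difference is only in which operation carries the loop dependency. Whether the split form is actually faster depends on the core's multiply-accumulate and add latencies, which this sketch does not model.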

I'll have to review more later. But I have this question for now ;)

src/hotspot/cpu/aarch64/aarch64_vector.ad line 417:

> 415:     if (n->Opcode() != Op_AddVL && n->Opcode() != Op_SubVL) {
> 416:       return true;
> 417:     }

Just a question: you converted the assert into a check. Why? Does the 
asserted condition no longer always hold? If so, do you have an example?
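
The distinction behind this question can be sketched as follows. This is a standalone illustration, not HotSpot code: the `Opcode` enum, both function names, and the `feeds_reduction_chain` parameter are hypothetical, and the standard `assert` macro stands in for HotSpot's own assert.

```cpp
#include <cassert>

// Hypothetical stand-ins for the C2 opcodes; not the real HotSpot definitions.
enum class Opcode { AddVL, SubVL, MulVL };

// Variant A: the caller promises to pass only AddVL or SubVL. Any other opcode
// is treated as a bug and aborts in debug builds (no check when NDEBUG is set).
bool candidate_with_assert(Opcode op, bool feeds_reduction_chain) {
    assert(op == Opcode::AddVL || op == Opcode::SubVL);
    return !feeds_reduction_chain;
}

// Variant B: other opcodes are legal inputs and are accepted as candidates,
// mirroring the early `return true` in the quoted lines.
bool candidate_with_check(Opcode op, bool feeds_reduction_chain) {
    if (op != Opcode::AddVL && op != Opcode::SubVL) {
        return true;
    }
    return !feeds_reduction_chain;
}
```

Variant B only makes sense if some caller can now reach the function with a different opcode; if no such caller exists, Variant A documents the invariant more strictly.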

-------------

PR Review: https://git.openjdk.org/jdk/pull/30237#pullrequestreview-4476928235
PR Review Comment: https://git.openjdk.org/jdk/pull/30237#discussion_r3396052045
