On Mon, 8 Jun 2026 08:05:17 GMT, Fei Gao <[email protected]> wrote:
>> [JDK-8340093](https://bugs.openjdk.org/browse/JDK-8340093) enabled
>> auto-vectorization for more reduction loop cases using `128-bit` vector
>> operations. As a result, the following microbenchmark is negatively affected:
>> `VectorReduction2.longAddDotProduct`
>>
>> This patch fixes these regressions.
>>
>> **1. Improve code generation for MLA**
>>
>> For
>> [longAddDotProduct](https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1096)[1],
>> the current implementation generates vectorized code similar to:
>>
>>     ldr q17, [x12, #16]
>>     ldr q18, [x11, #16]
>>     mla z16.d, p7/m, z17.d, z18.d
>>     ldr q17, [x11, #32]
>>     ldr q18, [x12, #32]
>>     mla z16.d, p7/m, z18.d, z17.d
>>     ...
>>     ldr q17, [x11, #128]
>>     ldr q18, [x12, #128]
>>     mla z16.d, p7/m, z18.d, z17.d
>>
>> `z16` is the third source and destination register. There are true
>> dependencies between consecutive `mla`[2] instructions. As a result, this
>> vectorized code performs significantly worse than the scalar version due to
>> limited instruction-level parallelism.
>>
>> These `mla` instructions are produced by a backend match rule that fuses
>> `AddVL` and `MulVL` into a vector `MLA`[3]. In this situation, avoiding
>> instruction fusion and instead generating separate SVE `mul` and `add`
>> instructions can improve instruction-level parallelism and overall
>> performance.
>>
>> To address this, this patch introduces
>> `is_multiply_accumulate_candidate()` to determine whether a node is a
>> suitable vector `MLA` candidate. For node patterns that may increase
>> execution latency, instruction fusion into `MLA` is disabled.
>>
>> After applying this patch, the generated assembly looks like:
>>
>>     ldr q17, [x12, #16]
>>     ldr q18, [x11, #16]
>>     ldr q19, [x11, #32]
>>     mul z17.d, p7/m, z17.d, z18.d
>>     ldr q18, [x12, #32]
>>     ldr q20, [x11, #48]
>>     mul z18.d, p7/m, z18.d, z19.d
>>     ldr q19, [x12, #48]
>>     add v16.2d, v17.2d, v16.2d
>>     ldr q17, [x11, #64]
>>     add v16.2d, v18.2d, v16.2d
>>     ldr q18, [x12, #64]
>>     mul z19.d, p7/m, z19.d, z20.d
>>     ldr q20, [x12, #80]
>>     add v16.2d, v19.2d, v16.2d
>>
>> This sequence exposes more independent operations and reduces dependency
>> chains, leading to improved performance.
>>
>> Since SVE `mls` instructions may suffer from similar issues, the same logic
>> has been extended to cover `MLS` as well. Additional microbenchmarks have
>> been added accordingly.
>>
>> **2. Results**
>> P...
>
> Fei Gao has updated the pull request incrementally with one additional commit
> since the last revision:
>
> Add Vector API IR test case

src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 1537:

> 1535:
> 1536: instruct vmla_masked(vReg dst_src1, vReg src2, vReg src3, pRegGov pg) %{
> 1537: predicate(UseSVE > 0);

Should we also check for `is_multiply_accumulate_candidate` here? Or is the
pattern different?

test/hotspot/jtreg/compiler/vectorization/TestVmlaAArch64.java line 67:

> 65: @IR(applyIfCPUFeature = {"sve", "true"},
> 66: applyIfAnd = {"MaxVectorSize", "<= 16", "AvoidMLAChain", "true"},
> 67: counts = {IRNode.VMLA, "=0"})

Should there be similar `VMLS` cases tested, with IR rules as well?
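
To make the `VMLS` suggestion concrete, here is a rough sketch of the kind of
kernels such a test could exercise. The first mirrors the `longAddDotProduct`
shape with a subtract in place of the add, so each product feeds the same
accumulator. The class and method names are hypothetical, and the `@IR` rule
shown in the comment assumes an `IRNode.VMLS` constant and reuses the
`AvoidMLAChain` flag from the rule above; none of this is taken from the patch.

```java
// Hypothetical kernels for illustration only; not taken from the patch.
final class VmlsReductionSketch {

    // MLS-shaped reduction: every iteration subtracts a product from the
    // same accumulator, so a fused vector mls could form a serial chain.
    //
    // A matching IR rule might look like this (assumed, unverified):
    //   @IR(applyIfCPUFeature = {"sve", "true"},
    //       applyIfAnd = {"MaxVectorSize", "<= 16", "AvoidMLAChain", "true"},
    //       counts = {IRNode.VMLS, "=0"})
    static long longSubDotProduct(long[] a, long[] b) {
        long acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc -= a[i] * b[i];
        }
        return acc;
    }

    // Non-reduction form: each element has its own destination, so the
    // source loop carries no dependency from one iteration to the next.
    static void longMulSub(long[] dst, long[] a, long[] b) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] -= a[i] * b[i];
        }
    }
}
```

Having both shapes in the test would show whether `mls` fusion is suppressed
only for the chained case, assuming that is the intended behavior.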

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30237#discussion_r3379163596
PR Review Comment: https://git.openjdk.org/jdk/pull/30237#discussion_r3379903108