> [JDK-8340093](https://bugs.openjdk.org/browse/JDK-8340093) enabled
> auto-vectorization for more reduction loop cases using `128-bit` vector
> operations. As a result, the following microbenchmark is negatively affected:
> `VectorReduction2.longAddDotProduct`
> 
> This patch fixes these regressions.
> 
> **1. Improve code generation for MLA**
> 
> For
> [longAddDotProduct](https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1096)[1],
> the current implementation generates vectorized code similar to:
> 
>     ldr q17, [x12, #16]
>     ldr q18, [x11, #16]
>     mla z16.d, p7/m, z17.d, z18.d
>     ldr q17, [x11, #32]
>     ldr q18, [x12, #32]
>     mla z16.d, p7/m, z18.d, z17.d
>     ...
>     ldr q17, [x11, #128]
>     ldr q18, [x12, #128]
>     mla z16.d, p7/m, z18.d, z17.d
> 
> `z16` is the third source and destination register. There are true
> dependencies between consecutive `mla`[2] instructions. As a result, this
> vectorized code performs significantly worse than the scalar version due to
> limited instruction-level parallelism.
> 
> These `mla` instructions are produced by a backend match rule that fuses
> `AddVL` and `MulVL` into a vector `MLA`[3]. In this situation, avoiding
> instruction fusion and instead generating separate SVE `mul` and `add`
> instructions can improve instruction-level parallelism and overall
> performance.
> 
> To address this, this patch introduces
> `is_multiply_accumulate_candidate()` to determine whether a node is a
> suitable vector `MLA` candidate. For node patterns that may increase
> execution latency, instruction fusion into `MLA` is disabled.
> 
> After applying this patch, the generated assembly looks like:
> 
>     ldr q17, [x12, #16]
>     ldr q18, [x11, #16]
>     ldr q19, [x11, #32]
>     mul z17.d, p7/m, z17.d, z18.d
>     ldr q18, [x12, #32]
>     ldr q20, [x11, #48]
>     mul z18.d, p7/m, z18.d, z19.d
>     ldr q19, [x12, #48]
>     add v16.2d, v17.2d, v16.2d
>     ldr q17, [x11, #64]
>     add v16.2d, v18.2d, v16.2d
>     ldr q18, [x12, #64]
>     mul z19.d, p7/m, z19.d, z20.d
>     ldr q20, [x12, #80]
>     add v16.2d, v19.2d, v16.2d
> 
> This sequence exposes more independent operations and reduces dependency
> chains, leading to improved performance.
> 
> Since SVE `mls` instructions may suffer from similar issues, the same logic
> has been extended to cover `MLS` as well. Additional microbenchmarks have
> been added accordingly.
> 
> 2. Results
> Performance measurements on 128-bit SVE machines show that these changes
> improve overall performance fo...
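
As a source-level illustration of the dependency argument in the quoted description, here is a minimal Java sketch. The class and method names are hypothetical, and the loop shape is only assumed to resemble the benchmark kernel, which is not reproduced here. In the first method every multiply feeds the single accumulator directly, which is the shape a fused `MLA` serializes; in the second the multiplies are independent of one another and only the adds form a chain. What C2 actually emits for either loop depends on the platform and flags.

```java
public class DotProductSketch {
    // Hypothetical stand-in for a long dot-product reduction: one accumulator,
    // so each multiply-add depends on the result of the previous iteration.
    public static long dotProduct(long[] a, long[] b) {
        long acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i] * b[i];
        }
        return acc;
    }

    // Same result with the multiply separated from the accumulate: the
    // products do not depend on each other, only the additions are chained.
    public static long dotProductSplit(long[] a, long[] b) {
        long[] products = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            products[i] = a[i] * b[i];
        }
        long acc = 0;
        for (int i = 0; i < products.length; i++) {
            acc += products[i];
        }
        return acc;
    }
}
```

Both methods compute the same value; the difference is only in which operations sit on the loop-carried dependency chain.
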
Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 11 additional commits since the last revision:

 - Extend the fix to Vector API masked operations
 - Merge branch 'master' into fix-long-redu-regression
 - Add Vector API IR test case
 - Add a VectorAPI micro-benchmark case
 - Merge branch 'master' into fix-long-redu-regression
 - Refine the comments
 - Add an IR test case and one extra benchmark case
 - Merge branch 'master' into fix-long-redu-regression
 - Dropped unrelated changes and added the AvoidMLAChain option to enable this optimization selectively on Neoverse cores
 - Merge branch 'master' into fix-long-redu-regression
 - ... and 1 more: https://git.openjdk.org/jdk/compare/cdc556f5...9c38d647

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/30237/files
  - new: https://git.openjdk.org/jdk/pull/30237/files/0d5039bf..9c38d647

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=30237&range=06
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=30237&range=05-06

  Stats: 13142 lines in 179 files changed: 12271 ins; 336 del; 535 mod
  Patch: https://git.openjdk.org/jdk/pull/30237.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/30237/head:pull/30237

PR: https://git.openjdk.org/jdk/pull/30237
