On Mon, 25 May 2026 02:35:28 GMT, Eric Fang <[email protected]> wrote:

> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
> provides a more efficient mapping for this operation through the NEON `BSL`
> and SVE `BSL` (bitwise select) instructions.
>
> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower
> them to the dedicated AArch64 instructions for better performance.
>
> The change includes the AArch64 match rules and assembler support, updates
> the AArch64 asm tests, adds IR framework nodes for the new mach instructions,
> introduces a new jtreg IR test and extends the MaskedLogicOpts JMH benchmark
> for 128-bit long type.
>
> JMH results show **11% - 54%** performance improvements for the optimized
> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and
> NEON configurations.
>
> On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>
> Benchmark                     Unit     ARRAYLEN   Before  Error    After  Error  Uplift
> bitwiseBlendOperationInt128   ops/s      256.00  3787.49   5.29  4277.64   8.89    1.13
> bitwiseBlendOperationInt128   ops/s      512.00  1888.24  11.02  2143.21   6.32    1.14
> bitwiseBlendOperationInt128   ops/s     1024.00   938.22   6.24  1053.45  14.68    1.12
> bitwiseBlendOperationLong128  ops/s      256.00  1895.45  13.68  2140.31   3.68    1.13
> bitwiseBlendOperationLong128  ops/s      512.00   938.71   5.32  1052.16  14.07    1.12
> bitwiseBlendOperationLong128  ops/s     1024.00   474.15   2.33   526.49   2.62    1.11
>
>
> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>
> Benchmark                     Unit     ARRAYLEN   Before  Error    After  Error  Uplift
> bitwiseBlendOperationInt128   ops/s      256.00  2051.52  13.85  2481.44   0.27    1.21
> bitwiseBlendOperationInt128   ops/s      512.00   995.47  20.77  1235.10   5.70    1.24
> bitwiseBlendOperationInt128   ops/s     1024.00   507.73   9.83   617.59   2.43    1.22
> bitwiseBlendOperationLong128  ops/s      256.00  1000.99  21.50  1235.39   5.48    1.23
> bitwiseBlendOperationLong128  ops/s      512.00   507.73   9.74   617.67   2.32    1.22
> bitwiseBlendOperationLong128  ops/s     1024.00   258.86   0.01   310.70   0.04    1.20
>
>
> On a Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
>
> Benchmark                     Unit     ARRAYLEN   Before  Error    After  Error  Uplift
> bitwiseBlendOperationInt128   ops/s      256.00  2336.17  13.18  3505.19  19.61    1.50
> bitwiseBlendOperationInt128   ops/s      512.00  1145.50  12.40  1735.24  10.43    1.51
> bitwiseBlendOperationInt128   ops/s  1...

This looks like a reasonable optimization, and it looks good to me.

src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 4763:

> 4761:     // adlc only auto-swaps commutative ops when at least one operand is a subtree,
> 4762:     // not when both sides are leaves, so both shapes need explicit match rules.
> 4763:     match(Set dst_src1 (XorV (Binary src3 (AndV dst_src1 (XorV src3 src2))) pg));

I am not sure whether it would be better to do this as an IR-level transformation by adding a new IR node such as `VectorBitwiseBlend`. Benefits:
1. Other platforms could share the same optimization more easily if they need it.
2. The rules for masked vectors could be removed by implementing them with a `VectorBlend` node instead.
3. We would not need to add two rules manually in every case to handle the commutativity of `XorV`.

-------------

PR Review: https://git.openjdk.org/jdk/pull/31269#pullrequestreview-4377473064
PR Review Comment: https://git.openjdk.org/jdk/pull/31269#discussion_r3314981781
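
For readers following the `(XorV(AndV(XorV)))` discussion, here is a minimal scalar sketch in plain Java of the identity behind it: `a ^ ((a ^ b) & c)` selects the bits of `b` where `c` is set and the bits of `a` elsewhere, which is what a bitwise select computes. The class and method names (`BitwiseBlendSketch`, `blendXor`, `blendSelect`, `allFormsAgree`) are hypothetical and are not part of the PR. The sketch uses `long` scalars rather than the Vector API, and the reordered expressions only illustrate why a pattern matcher can see several operand orders for the same computation.

```java
import java.util.Random;

public class BitwiseBlendSketch {
    // XOR/AND form, the shape the generic lowering is built from.
    public static long blendXor(long a, long b, long c) {
        return a ^ ((a ^ b) & c);
    }

    // Select form: take bits of b where c is 1, bits of a where c is 0.
    public static long blendSelect(long a, long b, long c) {
        return (a & ~c) | (b & c);
    }

    // Checks that the select form and several operand orderings of the
    // XOR/AND form agree on pseudo-random inputs (fixed seed, repeatable).
    public static boolean allFormsAgree(int iterations) {
        Random rnd = new Random(42);
        for (int i = 0; i < iterations; i++) {
            long a = rnd.nextLong();
            long b = rnd.nextLong();
            long c = rnd.nextLong();
            long expected = blendSelect(a, b, c);
            // XOR and AND are commutative, so all of these compute the same
            // value even though their expression trees differ in shape.
            boolean same = blendXor(a, b, c) == expected
                    && (((a ^ b) & c) ^ a) == expected
                    && ((c & (b ^ a)) ^ a) == expected
                    && (a ^ (c & (a ^ b))) == expected;
            if (!same) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("all forms agree: " + allFormsAgree(10_000));
    }
}
```

Each reordered expression has a different tree shape, which is the situation the quoted adlc comment describes when it says both shapes need explicit match rules.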
