On Thu, 28 May 2026 08:58:31 GMT, Andrew Haley <[email protected]> wrote:

>> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
>> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
>> provides a more efficient mapping for this operation through the NEON `BSL`
>> and SVE `BSL` (bitwise select) instructions.
>>
>> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower
>> them to the dedicated AArch64 instructions for better performance.
>>
>> The change includes the AArch64 match rules and assembler support, updates
>> the AArch64 asm tests, adds IR framework nodes for the new mach
>> instructions, introduces a new jtreg IR test and extends the MaskedLogicOpts
>> JMH benchmark for 128-bit long type.
>>
>> JMH results show **11% - 54%** performance improvements for the optimized
>> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and
>> NEON configurations.
>>
>> On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>>
>> Benchmark                     Unit   ARRAYLEN   Before  Error    After  Error  Uplift
>> bitwiseBlendOperationInt128   ops/s    256.00  3787.49   5.29  4277.64   8.89    1.13
>> bitwiseBlendOperationInt128   ops/s    512.00  1888.24  11.02  2143.21   6.32    1.14
>> bitwiseBlendOperationInt128   ops/s   1024.00   938.22   6.24  1053.45  14.68    1.12
>> bitwiseBlendOperationLong128  ops/s    256.00  1895.45  13.68  2140.31   3.68    1.13
>> bitwiseBlendOperationLong128  ops/s    512.00   938.71   5.32  1052.16  14.07    1.12
>> bitwiseBlendOperationLong128  ops/s   1024.00   474.15   2.33   526.49   2.62    1.11
>>
>>
>> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>>
>> Benchmark                     Unit   ARRAYLEN   Before  Error    After  Error  Uplift
>> bitwiseBlendOperationInt128   ops/s    256.00  2051.52  13.85  2481.44   0.27    1.21
>> bitwiseBlendOperationInt128   ops/s    512.00   995.47  20.77  1235.10   5.70    1.24
>> bitwiseBlendOperationInt128   ops/s   1024.00   507.73   9.83   617.59   2.43    1.22
>> bitwiseBlendOperationLong128  ops/s    256.00  1000.99  21.50  1235.39   5.48    1.23
>> bitwiseBlendOperationLong128  ops/s    512.00   507.73   9.74   617.67   2.32    1.22
>> bitwiseBlendOperationLong128  ops/s   1024.00   258.86   0.01   310.70   0.04    1.20
>>
>>
>> On a Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
>>
>> Benchmark                     Unit   ARRAYLEN   Before  Error    After  Error  Uplift
>> bitwiseBlendOperationInt128   ops/s    256.00  2336.17  13.18  3505.19  19.61    1.50
>> bitwiseBlendOperationInt128   ops/s    512.00  1145.50 ...
>
> src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 4761:
>
>> 4759:   match(Set dst_src1 (XorV (Binary src3 (AndV dst_src1 (XorV src2 src3))) pg));
>> 4760:   // Second form: inner XorV may have operands (src3, src2) after Ideal/GVN.
>> 4761:   // adlc only auto-swaps commutative ops when at least one operand is a subtree,
>
> Is there a way to canonicalize this further, so that the two forms are not
> both necessary?

This is an ADLC limitation; see
https://github.com/openjdk/jdk/blob/f2f8828188f45d16344c82adfbf951f7409b8825/src/hotspot/share/adlc/formssel.cpp#L3974
and
https://github.com/openjdk/jdk/blob/f2f8828188f45d16344c82adfbf951f7409b8825/src/hotspot/share/adlc/formssel.cpp#L4032

The restriction appears to be intentional rather than accidental: ADLC auto-swaps the operands of a commutative op only when at least one of them is a subtree.

I did try lifting that restriction and hit real build failures: some "commutative" leaves are commutative as a Java/IR operation but order-sensitive in their `expand` lowering. For example, AArch64's `minI_reg_reg`:

    instruct minI_reg_reg(iRegINoSp dst, iRegIorL2I src1, iRegIorL2I src2) %{
      match(Set dst (MinI src1 src2));
      expand %{
        rFlagsReg cr;
        compI_reg_reg(cr, src1, src2);  // operand order matters here
        cmovI_reg_reg_lt(dst, src1, src2, cr);
      %}
    %}

If adlc were to auto-swap the leaves of `MinI`, the `compI`/`cmov` expand would become incorrect. Several platforms have similar patterns, so a safe fix would have to audit them all.
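
For reference, here is a scalar Java sketch of why the two match forms quoted above are semantically the same operation. It assumes the pattern reads as `src3 ^ (mask & (src2 ^ src3))`, with `dst_src1` acting as the mask, and it ignores the governing predicate `pg`. The class and method names are hypothetical and exist only for this illustration; they are not part of HotSpot or the Vector API.

```java
import java.util.Random;

public class BitwiseBlendForms {
    // First form: inner XOR written as (src2 ^ src3).
    static long blendForm1(long mask, long src2, long src3) {
        return src3 ^ (mask & (src2 ^ src3));
    }

    // Second form: inner XOR operands swapped to (src3 ^ src2),
    // as may happen after Ideal/GVN reorders the inputs.
    static long blendForm2(long mask, long src2, long src3) {
        return src3 ^ (mask & (src3 ^ src2));
    }

    // Reference bitwise select: take bits of src2 where mask is 1,
    // bits of src3 where mask is 0.
    static long select(long mask, long src2, long src3) {
        return (src2 & mask) | (src3 & ~mask);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000; i++) {
            long mask = rnd.nextLong();
            long src2 = rnd.nextLong();
            long src3 = rnd.nextLong();
            long expected = select(mask, src2, src3);
            if (blendForm1(mask, src2, src3) != expected
                    || blendForm2(mask, src2, src3) != expected) {
                throw new AssertionError("forms disagree at iteration " + i);
            }
        }
        System.out.println("both forms match the reference select");
    }
}
```

Both forms compute the same value, so the second rule is needed only because of how the matcher sees the tree, not because the operation differs.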

That said, I will first try the approach suggested by @XiaohongGong: introduce a `BitwiseBlendNode` and convert this pattern into a `BitwiseBlendNode` in IGVN. That may avoid the need to write these match rules at all.

Thanks~

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31269#discussion_r3321849567
