On Thu, 4 Jun 2026 07:46:10 GMT, Eric Fang <[email protected]> wrote:
>> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
>> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
>> provides a more efficient mapping for this operation through the NEON `BSL`
>> and SVE `BSL` (bitwise select) instructions.
>>
>> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower
>> them to the dedicated AArch64 instructions for better performance.
>>
>> The change includes the AArch64 match rules and assembler support, updates
>> the AArch64 asm tests, adds IR framework nodes for the new mach
>> instructions, introduces a new jtreg IR test and extends the MaskedLogicOpts
>> JMH benchmark for 128-bit long type.
>>
>> JMH results show **11% - 54%** performance improvements for the optimized
>> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and
>> NEON configurations.
>>
>> On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>>
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>>
>>
>> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>>
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
>> bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
>> bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
>> bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
>> bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
>> bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
>>
>>
>> On an Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
>>
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
>> bitwiseBlendOperationInt128   ops/s  512.00    1145.50  ...
>
> Eric Fang has updated the pull request with a new target base due to a merge
> or a rebase. The incremental webrev excludes the unrelated changes brought in
> by the merge/rebase. The pull request contains three additional commits since
> the last revision:
>
> - Implement bitwise_blend in IGVN
>
> The latest changes:
>
> 1. Defined a new IR `VectorBitwiseBlendNode`
> 2. Do the optimization in IGVN:
>    // XorV(a, AndV(sel, XorV(a, b))) => VectorBitwiseBlend(a, b, sel)
>    // XorV(a, AndV(sel, XorV(a, b)), mask) =>
>    //   VectorBlend(a, VectorBitwiseBlend(a, b, sel), mask)
> 3. Adjust the ad file match rules to match `VectorBitwiseBlendNode`.
> 4. Adjust the JTReg tests to check `VectorBitwiseBlendNode`.
> - Merge branch 'master' into JDK-8382052-bitwise-blend
> - 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND operation
> with BSL
>
> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
> provides a more efficient mapping for this operation through the NEON
> `BSL` and SVE `BSL` (bitwise select) instructions.
>
> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and
> lower them to the dedicated AArch64 instructions for better performance.
>
> The change includes the AArch64 match rules and assembler support,
> updates the AArch64 asm tests, adds IR framework nodes for the new mach
> instructions, introduces a new jtreg IR test and extends the
> MaskedLogicOpts JMH benchmark for 128-bit long type.
>
> JMH results show **11% - 54%** performance improvements for the
> optimized cases, and all jtreg tests (tier1, tier2 and tier3) pass on
> SVE2, SVE1, and NEON configurations.
>
> On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
> ```
> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
> ```
>
> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
> ```
> Benchmar...
src/hotspot/share/opto/vectornode.hpp line 1811:
> 1809: VectorBitwiseBlendNode(Node* vec_false, Node* vec_true, Node* sel,
> const TypeVect* vt)
> 1810: : VectorNode(vec_false, vec_true, sel, vt) {}
> 1811: virtual int Opcode() const;
Could we add some asserts on the input nodes? For example, that all three are
vectors and that they have the same length.
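
As a rough sketch of the kind of checks meant here (not a proposed patch):
the `VecType` and `Node` structs below are hypothetical stand-ins so the
snippet compiles on its own, and comparing total size in bytes rather than
lane count is an assumption on my side, since a bitwise operation does not
depend on the lane type. In vectornode.hpp the same idea would presumably be
expressed with the real type queries in the constructor body.

```cpp
#include <cassert>

// Hypothetical stand-in for C2's TypeVect; only what this sketch needs.
struct VecType {
  int elem_bytes;  // size of one lane in bytes
  int lanes;       // number of lanes
  int length_in_bytes() const { return elem_bytes * lanes; }
};

// Hypothetical stand-in for a C2 Node; vect_type is nullptr for non-vectors.
struct Node {
  const VecType* vect_type;
};

// True when `n` produces a vector with the same total size as `vt`.
// The caller must pass a non-null `vt`.
inline bool is_vector_of_size(const Node* n, const VecType* vt) {
  return n != nullptr && n->vect_type != nullptr &&
         n->vect_type->length_in_bytes() == vt->length_in_bytes();
}

// The conditions a constructor could assert on its three inputs.
inline bool bitwise_blend_inputs_ok(const Node* vec_false, const Node* vec_true,
                                    const Node* sel, const VecType* vt) {
  return vt != nullptr &&
         is_vector_of_size(vec_false, vt) &&
         is_vector_of_size(vec_true, vt) &&
         is_vector_of_size(sel, vt);
}

// Mock node showing where the assert would sit; not the real class.
struct BitwiseBlendSketch {
  BitwiseBlendSketch(const Node* vec_false, const Node* vec_true,
                     const Node* sel, const VecType* vt) {
    assert(bitwise_blend_inputs_ok(vec_false, vec_true, sel, vt) &&
           "inputs must be vectors of the same size as the result");
  }
};
```

With these stand-ins, three 128-bit inputs pass the check, while a scalar
input or a 256-bit selector is rejected.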
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/31269#discussion_r3378616250