On Wed, 10 Jun 2026 04:37:32 GMT, Eric Fang <[email protected]> wrote:
>> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
>> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
>> provides a more efficient mapping for this operation through the NEON `BSL`
>> and SVE `BSL` (bitwise select) instructions.
>>
>> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower
>> them to the dedicated AArch64 instructions for better performance.
>>
>> The change includes the AArch64 match rules and assembler support, updates
>> the AArch64 asm tests, adds IR framework nodes for the new mach
>> instructions, introduces a new jtreg IR test and extends the MaskedLogicOpts
>> JMH benchmark for 128-bit long type.
>>
>> JMH results show **11% - 54%** performance improvements for the optimized
>> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and
>> NEON configurations.
>>
>> On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>>
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>>
>>
>> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>>
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
>> bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
>> bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
>> bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
>> bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
>> bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
>>
>>
>> On an Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
>>
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
>> bitwiseBlendOperationInt128   ops/s  512.00    1145.50  ...
>
> Eric Fang has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Merge two conditional branches
Patch looks good, thanks @erifan for the updates!
Ok, I'm running some testing now :)
Just one clarifying question below.
src/hotspot/cpu/aarch64/aarch64_vector.ad line 325:
> 323: if (UseSVE < 2 && length_in_bytes > 16) {
> 324: return false;
> 325: }
NEON is also affected, right? Just asking because @XiaohongGong asked to change
the title to be SVE specific above.
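
For readers following along, here is a minimal scalar sketch of the identity that the `(XorV(AndV(XorV)))` pattern relies on. It assumes, as I understand the Vector API documentation, that `BITWISE_BLEND` computes `a ^ ((a ^ b) & c)`, which is a per-bit select of `b` where `c` is set and `a` elsewhere. The class and method names are made up for illustration and are not part of the patch, and the sketch says nothing about how operands map onto `BSL` registers.

```java
// Illustrative only: scalar analogue of the vector pattern, not code from the PR.
public class BitwiseBlendSketch {
    // XOR/AND form: the scalar shape of (XorV (AndV (XorV))).
    static int blendXor(int a, int b, int c) {
        return a ^ ((a ^ b) & c);
    }

    // Select form: bits of b where c is 1, bits of a where c is 0.
    static int blendSelect(int a, int b, int c) {
        return (b & c) | (a & ~c);
    }

    public static void main(String[] args) {
        int[] samples = {0, -1, 0x0F0F0F0F, 0x12345678, 0xFFFF0000, Integer.MIN_VALUE};
        // Check every combination of the sample values for a, b and c.
        for (int a : samples) {
            for (int b : samples) {
                for (int c : samples) {
                    if (blendXor(a, b, c) != blendSelect(a, b, c)) {
                        throw new AssertionError("mismatch for a=" + a + " b=" + b + " c=" + c);
                    }
                }
            }
        }
        System.out.println("both forms agree on all samples");
    }
}
```

The two forms agree because, for each bit where `c` is 1, `a ^ (a ^ b)` reduces to `b`, and where `c` is 0 the XOR with zero leaves `a` unchanged.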
-------------
PR Review: https://git.openjdk.org/jdk/pull/31269#pullrequestreview-4465028903
PR Review Comment: https://git.openjdk.org/jdk/pull/31269#discussion_r3385902123