Re: RFR: 8382052: VectorAPI: Optimize the lanewise BITWISE_BLEND for AArch64 [v5]

Xiaohong Gong Thu, 11 Jun 2026 18:19:15 -0700

On Thu, 11 Jun 2026 07:37:30 GMT, Eric Fang <[email protected]> wrote:


>> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a 
>> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64 
>> provides a more efficient mapping for this operation through the NEON `BSL` 
>> and SVE `BSL` (bitwise select) instructions.
>> 
>> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower 
>> them to the dedicated AArch64 instructions for better performance.
>> 
>> The change includes the AArch64 match rules and assembler support, updates 
>> the AArch64 asm tests, adds IR framework nodes for the new mach 
>> instructions, introduces a new jtreg IR test, and extends the MaskedLogicOpts 
>> JMH benchmark to cover the 128-bit long type.
>> 
>> JMH results show **11% - 54%** performance improvements for the optimized 
>> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and 
>> NEON configurations.
>> 
>> On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>> 
>> 
>> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
>> bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
>> bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
>> bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
>> bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
>> bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
>> 
>> 
>> On an Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
>> bitwiseBlendOperationInt128   ops/s  512.00    1145.50 ...
>
> Eric Fang has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Add a movprfx optimization entry point for the newly added sve_bsl 
> instruction

LGTM!
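
For readers following the thread, here is a minimal scalar sketch of the identity behind the pattern being matched. It assumes the documented definition of `BITWISE_BLEND` as `a ^ ((a ^ b) & c)`, which is the `XorV(AndV(XorV))` shape quoted above, and checks it against the bitwise-select form `(a & ~c) | (b & c)`. The class and method names are hypothetical and exist only for illustration; this is not the C2 match rule or the code from the PR, and it ignores the operand ordering of the actual `BSL` instructions.

```java
import java.util.Random;

public class BitwiseBlendSketch {
    // XOR form: the shape of the generic XorV(AndV(XorV)) sequence.
    public static long blendXor(long a, long b, long c) {
        return a ^ ((a ^ b) & c);
    }

    // Select form: take bits of b where c is 1, bits of a where c is 0.
    public static long blendSelect(long a, long b, long c) {
        return (a & ~c) | (b & c);
    }

    public static void main(String[] args) {
        // Fixed seed so the check is repeatable.
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000; i++) {
            long a = rnd.nextLong();
            long b = rnd.nextLong();
            long c = rnd.nextLong();
            if (blendXor(a, b, c) != blendSelect(a, b, c)) {
                throw new AssertionError("forms disagree for a=" + a + " b=" + b + " c=" + c);
            }
        }
        System.out.println("XOR form and select form agree");
    }
}
```

Because the two forms are equal bit for bit, replacing the three-node sequence with a single select operation does not change results, only the number of instructions issued.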

-------------

Marked as reviewed by xgong (Committer).

PR Review: https://git.openjdk.org/jdk/pull/31269#pullrequestreview-4481625841
