On Thu, 28 May 2026 02:30:16 GMT, Xiaohong Gong wrote:

>> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a 
>> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64 
>> provides a more efficient mapping for this operation through the NEON `BSL` 
>> and SVE `BSL` (bitwise select) instructions.
>> 
>> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower 
>> them to the dedicated AArch64 instructions for better performance.
>> 
>> The change includes the AArch64 match rules and assembler support, updates 
>> the AArch64 asm tests, adds IR framework nodes for the new mach 
>> instructions, introduces a new jtreg IR test, and extends the 
>> MaskedLogicOpts JMH benchmark to cover the 128-bit long type.
>> 
>> JMH results show **11% - 54%** performance improvements for the optimized 
>> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and 
>> NEON configurations.
>> 
>> On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>> 
>> 
>> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
>> bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
>> bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
>> bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
>> bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
>> bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
>> 
>> 
>> On a Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
>> 
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
>> bitwiseBlendOperationInt128   ops/s  512.00    1145.50 ...
>
> src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 4763:
> 
>> 4761:   // adlc only auto-swaps commutative ops when at least one operand is a subtree,
>> 4762:   // not when both sides are leaves, so both shapes need explicit match rules.
>> 4763:   match(Set dst_src1 (XorV (Binary src3 (AndV dst_src1 (XorV src3 src2))) pg));
> 
> Not sure whether it would be better to add an IR-level transformation by 
> introducing a new IR node such as `VectorBitwiseBlend`. Benefits:
> 1. Other platforms could share the same optimization more easily if they 
> need it.
> 2. The rules for masked vectors can be removed by implementing them with a 
> `VectorBlend` IR node instead.
> 3. We do not need to add two rules manually to handle the commutative issues 
> for `XorV` in all rules.

Thanks for the suggestion. I'll look into whether this would be a better approach.
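
For reference while comparing the two approaches, here is a minimal scalar sketch of the identity that both would rely on. The Vector API documents `BITWISE_BLEND` as `a^((a^b)&c)`, which is a per-bit select: take `b` where `c` has a 1 bit and `a` elsewhere, which is what a bitwise-select instruction such as `BSL` computes. The class and method names below are hypothetical and only illustrate the equivalence; this is not the C2 code from the PR.

```java
import java.util.Random;

// Hypothetical illustration only; names are not taken from the PR.
public class BitwiseBlendSketch {
    // Same shape as the generic sequence: Xor(a, And(Xor(a, b), c)).
    public static long blendXorForm(long a, long b, long c) {
        return a ^ ((a ^ b) & c);
    }

    // Bitwise select: take b where c has a 1 bit, a where c has a 0 bit.
    public static long blendSelectForm(long a, long b, long c) {
        return (a & ~c) | (b & c);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // Compare both forms on random inputs; any mismatch is an error.
        for (int i = 0; i < 1_000; i++) {
            long a = rnd.nextLong();
            long b = rnd.nextLong();
            long c = rnd.nextLong();
            if (blendXorForm(a, b, c) != blendSelectForm(a, b, c)) {
                throw new AssertionError("forms differ for " + a + ", " + b + ", " + c);
            }
        }
        System.out.println("both forms agree");
    }
}
```

Because the two forms agree bit for bit, either the existing match rules or a dedicated IR node would describe the same operation; the choice is about where the pattern is recognized, not what it computes.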

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31269#discussion_r3321814935
