On Mon, 25 May 2026 02:35:28 GMT, Eric Fang <[email protected]> wrote:

> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a 
> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64 
> provides a more efficient mapping for this operation through the NEON `BSL` 
> and SVE `BSL` (bitwise select) instructions.
> 
> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower 
> them to the dedicated AArch64 instructions for better performance.
> 
> The change includes the AArch64 match rules and assembler support, updates 
> the AArch64 asm tests, adds IR framework nodes for the new mach instructions, 
> introduces a new jtreg IR test and extends the MaskedLogicOpts JMH benchmark 
> for 128-bit long type.
> 
> JMH results show **11% - 54%** performance improvements for the optimized 
> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and 
> NEON configurations.
> 
> On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
> 
> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
> 
> 
> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
> 
> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
> bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
> bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
> bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
> bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
> bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
> bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
> 
> 
> On an Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
> 
> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
> bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
> bitwiseBlendOperationInt128   ops/s  512.00    1145.50  12.40  1735.24  10.43  1.51
> bitwiseBlendOperationInt128   ops/s  1...

This looks like a reasonable optimization, and it looks good to me.

src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 4763:

> 4761:   // adlc only auto-swaps commutative ops when at least one operand is a subtree,
> 4762:   // not when both sides are leaves, so both shapes need explicit match rules.
> 4763:   match(Set dst_src1 (XorV (Binary src3 (AndV dst_src1 (XorV src3 src2))) pg));
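
For reference, ignoring the predicate `pg`, the shape matched here appears to 
be the XOR form of a bitwise select, `a ^ ((a ^ b) & c)`, which is equivalent 
to `(a & ~c) | (b & c)`. A minimal scalar sketch of that identity follows; the 
class and method names are hypothetical and are not taken from the PR.

```java
import java.util.Random;

public class BitwiseBlendIdentity {
    // Select form: take bits of b where c is set, bits of a elsewhere.
    static long blendSelect(long a, long b, long c) {
        return (a & ~c) | (b & c);
    }

    // XOR form, i.e. the Xor(And(Xor)) shape that the generic lowering produces.
    static long blendXor(long a, long b, long c) {
        return a ^ ((a ^ b) & c);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed so runs are repeatable
        for (int i = 0; i < 1_000_000; i++) {
            long a = rnd.nextLong(), b = rnd.nextLong(), c = rnd.nextLong();
            if (blendSelect(a, b, c) != blendXor(a, b, c)) {
                throw new AssertionError("mismatch for " + a + ", " + b + ", " + c);
            }
        }
        System.out.println("select form and XOR form agree");
    }
}
```

The two forms agree bit by bit: where `c` is 0 the XOR form yields `a`, and 
where `c` is 1 it yields `a ^ a ^ b`, which is `b`.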

I'm not sure whether it would be better to do this as an IR-level 
transformation by adding a new IR node such as `VectorBitwiseBlend`. Benefits:
1. Other platforms could share the same optimization more easily if they need 
it.
2. The rules for masked vectors could be removed by implementing them with a 
`VectorBlend` IR node instead.
3. We would not need to manually add two variants of every rule to handle the 
commutativity of `XorV`.
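
To illustrate point 3, here is a toy sketch of how a single recognizer over 
an expression tree could cover every commutative operand order in one place. 
This is not C2 code: the `Node`, `Leaf`, `Xor`, `And` and `Blend` types and 
the `recognize` method are hypothetical stand-ins, and a real transformation 
would also have to deal with predicates and vector types.

```java
import java.util.Optional;

public class BlendRecognizer {
    // Toy expression tree; hypothetical types, not C2 node classes.
    interface Node {}
    record Leaf(String name) implements Node {}
    record Xor(Node l, Node r) implements Node {}
    record And(Node l, Node r) implements Node {}
    record Blend(Node a, Node b, Node c) implements Node {}

    // Recognizes a ^ ((a ^ b) & c) in any operand order and
    // returns Blend(a, b, c); returns empty for any other shape.
    static Optional<Node> recognize(Node n) {
        if (!(n instanceof Xor outer)) return Optional.empty();
        // The outer Xor is commutative, so try both operand orders.
        Optional<Node> r = matchOuter(outer.l(), outer.r());
        return r.isPresent() ? r : matchOuter(outer.r(), outer.l());
    }

    private static Optional<Node> matchOuter(Node a, Node andSide) {
        if (!(andSide instanceof And and)) return Optional.empty();
        // The And is commutative as well.
        Optional<Node> r = matchAnd(a, and.l(), and.r());
        return r.isPresent() ? r : matchAnd(a, and.r(), and.l());
    }

    private static Optional<Node> matchAnd(Node a, Node xorSide, Node c) {
        if (!(xorSide instanceof Xor inner)) return Optional.empty();
        // One input of the inner Xor must be the same value as a.
        if (inner.l().equals(a)) return Optional.of(new Blend(a, inner.r(), c));
        if (inner.r().equals(a)) return Optional.of(new Blend(a, inner.l(), c));
        return Optional.empty();
    }

    public static void main(String[] args) {
        Node a = new Leaf("a"), b = new Leaf("b"), c = new Leaf("c");
        // One of the swapped shapes: ((b ^ a) & c) ^ a
        System.out.println(recognize(new Xor(new And(new Xor(b, a), c), a)));
    }
}
```

With this structure the operand-order handling lives in one function, rather 
than being duplicated across match rules in each backend.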

-------------

PR Review: https://git.openjdk.org/jdk/pull/31269#pullrequestreview-4377473064
PR Review Comment: https://git.openjdk.org/jdk/pull/31269#discussion_r3314981781
