> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a 
> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64 
> provides a more efficient mapping for this operation through the NEON `BSL` 
> and SVE `BSL` (bitwise select) instructions.
> 
> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower 
> them to the dedicated AArch64 instructions for better performance.
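
For readers unfamiliar with the pattern, here is a minimal scalar sketch of what a bitwise blend computes. It assumes the `a ^ ((a ^ b) & c)` definition given in the Vector API Javadoc for `BITWISE_BLEND`. The class and method names are hypothetical and are not part of this patch. The sketch only illustrates that the XOR/AND/XOR form is equivalent to a per-bit select, which is the kind of operation a bitwise-select instruction performs in one step.

```java
public class BitwiseBlendSketch {
    // Generic form, mirroring the XorV(AndV(XorV)) node shape: a ^ ((a ^ b) & c).
    static long blendXorAndXor(long a, long b, long c) {
        return a ^ ((a ^ b) & c);
    }

    // Select form: take the bit from b where c is 1, and from a where c is 0.
    static long blendSelect(long a, long b, long c) {
        return (a & ~c) | (b & c);
    }

    public static void main(String[] args) {
        // Illustrative values only; any inputs satisfy the identity.
        long a = 0x00FF00FF00FF00FFL;
        long b = 0x123456789ABCDEF0L;
        long c = 0xFFFF0000FFFF0000L;

        long viaXor = blendXorAndXor(a, b, c);
        long viaSelect = blendSelect(a, b, c);

        System.out.println(Long.toHexString(viaXor)); // prints 123400ff9abc00ff
        System.out.println(viaXor == viaSelect);      // prints true
    }
}
```

The first form needs three dependent logical operations per vector, while the select form maps naturally onto a single instruction, which is presumably where the reported uplift comes from.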
> 
> The change includes the AArch64 match rules and assembler support, updates 
> the AArch64 asm tests, adds IR framework nodes for the new mach instructions, 
> introduces a new jtreg IR test, and extends the MaskedLogicOpts JMH benchmark 
> to cover the 128-bit long type.
> 
> JMH results show **11% - 54%** performance improvements for the optimized 
> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and 
> NEON configurations.
> 
> On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
> 
> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
> 
> 
> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
> 
> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
> bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
> bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
> bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
> bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
> bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
> bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
> 
> 
> On an Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
> 
> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
> bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
> bitwiseBlendOperationInt128   ops/s  512.00    1145.50  12.40  1735.24  10.43  1.51
> bitwiseBlendOperationInt128   ops/s  1...

Eric Fang has updated the pull request incrementally with one additional commit 
since the last revision:

  Add a movprfx optimization entry point for the newly added sve_bsl instruction

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/31269/files
  - new: https://git.openjdk.org/jdk/pull/31269/files/6574f70c..f4723a0b

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=31269&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=31269&range=03-04

  Stats: 12 lines in 1 file changed: 6 ins; 0 del; 6 mod
  Patch: https://git.openjdk.org/jdk/pull/31269.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/31269/head:pull/31269

PR: https://git.openjdk.org/jdk/pull/31269
