The Vector API `lanewise BITWISE_BLEND` operation on AArch64 is currently 
lowered to a generic vector sequence built from `(XorV(AndV(XorV)))` nodes. 
AArch64 provides a more efficient mapping for this operation through the NEON 
and SVE `BSL` (bitwise select) instructions.

This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower them 
to the dedicated AArch64 instructions for better performance.
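
For readers less familiar with the operation, here is a minimal scalar sketch of why the generic `(XorV(AndV(XorV)))` shape and a bitwise select compute the same result. `VectorOperators.BITWISE_BLEND` is documented as `a^((a^b)&c)`, which picks each bit from `b` where the mask `c` is set and from `a` otherwise. The class and method names below are hypothetical and used only for illustration; this is not code from the patch.

```java
import java.util.Random;

public class BitwiseBlendSketch {
    // Generic form, mirroring the XorV(AndV(XorV)) node shape: a ^ ((a ^ b) & c).
    static long blendXorForm(long a, long b, long c) {
        return a ^ ((a ^ b) & c);
    }

    // Select form: each bit comes from b where c is set, otherwise from a.
    static long blendSelectForm(long a, long b, long c) {
        return (b & c) | (a & ~c);
    }

    public static void main(String[] args) {
        // Compare both forms on random inputs (fixed seed for repeatability).
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000; i++) {
            long a = rnd.nextLong();
            long b = rnd.nextLong();
            long c = rnd.nextLong();
            if (blendXorForm(a, b, c) != blendSelectForm(a, b, c)) {
                throw new AssertionError("forms disagree for a=" + a + " b=" + b + " c=" + c);
            }
        }
        System.out.println("both forms agree on all sampled inputs");
    }
}
```

The generic form needs three dependent logical operations per vector, while a select instruction produces the same bits in one, which is the motivation for matching the pattern.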

The change includes the AArch64 match rules and assembler support, updates the 
AArch64 asm tests, adds IR framework nodes for the new mach instructions, 
introduces a new jtreg IR test, and extends the MaskedLogicOpts JMH benchmark 
to cover the 128-bit long type.

JMH results show **11% - 54%** performance improvements for the optimized 
cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and 
NEON configurations.

On an NVIDIA Grace (Neoverse-V2) machine with 128-bit SVE2:

Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11


On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:

Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20


On an NVIDIA Grace (Neoverse-V2) machine with 128-bit NEON:

Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
bitwiseBlendOperationInt128   ops/s  512.00    1145.50  12.40  1735.24  10.43  1.51
bitwiseBlendOperationInt128   ops/s  1024.00   571.41   6.51   866.01   3.34   1.52
bitwiseBlendOperationLong128  ops/s  256.00    1140.38  13.77  1740.28  11.16  1.53
bitwiseBlendOperationLong128  ops/s  512.00    570.20   7.58   865.67   3.33   1.52
bitwiseBlendOperationLong128  ops/s  1024.00   280.94   2.58   432.78   0.19   1.54





---------
- [x] I confirm that I make this contribution in accordance with the [OpenJDK 
Interim AI Policy](https://openjdk.org/legal/ai).

-------------

Commit messages:
 - 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND operation 
with BSL

Changes: https://git.openjdk.org/jdk/pull/31269/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=31269&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8382052
  Stats: 479 lines in 8 files changed: 411 ins; 2 del; 66 mod
  Patch: https://git.openjdk.org/jdk/pull/31269.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/31269/head:pull/31269

PR: https://git.openjdk.org/jdk/pull/31269
