> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a 
> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64 
> provides a more efficient mapping for this operation through the NEON `BSL` 
> and SVE `BSL` (bitwise select) instructions.
> 
> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower 
> them to the dedicated AArch64 instructions for better performance.
> 
> The change includes the AArch64 match rules and assembler support, updates 
> the AArch64 asm tests, adds IR framework nodes for the new mach instructions, 
> adds a new jtreg IR test, and extends the MaskedLogicOpts JMH benchmark to 
> cover the 128-bit long type.
> 
> JMH results show **11% - 54%** performance improvements for the optimized 
> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and 
> NEON configurations.
> 
> On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
> 
> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
> 
> 
> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
> 
> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
> bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
> bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
> bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
> bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
> bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
> bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
> 
> 
> On an Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
> 
> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
> bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
> bitwiseBlendOperationInt128   ops/s  512.00    1145.50  12.40  1735.24  10.43  1.51
> bitwiseBlendOperationInt128       ops/s       1...

Eric Fang has updated the pull request with a new target base due to a merge or 
a rebase. The incremental webrev excludes the unrelated changes brought in by 
the merge/rebase. The pull request contains three additional commits since the 
last revision:

 - Implement bitwise_blend in IGVN
   
   The latest changes:
   
   1. Define a new IR node, `VectorBitwiseBlendNode`.
   2. Perform the optimization in IGVN:
   // XorV(a, AndV(sel, XorV(a, b))) => VectorBitwiseBlend(a, b, sel)
   // XorV(a, AndV(sel, XorV(a, b)), mask) =>
   //   VectorBlend(a, VectorBitwiseBlend(a, b, sel), mask)
   
   3. Adjust the ad file match rules to match `VectorBitwiseBlendNode`.
   4. Adjust the JTReg tests to check `VectorBitwiseBlendNode`.
 - Merge branch 'master' into JDK-8382052-bitwise-blend
 - 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND operation with BSL
   
   Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
   generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
   provides a more efficient mapping for this operation through the NEON
   `BSL` and SVE `BSL` (bitwise select) instructions.
   
   This change teaches C2 to recognize the `BITWISE_BLEND` patterns and
   lower them to the dedicated AArch64 instructions for better performance.
   
   The change includes the AArch64 match rules and assembler support,
   updates the AArch64 asm tests, adds IR framework nodes for the new mach
   instructions, adds a new jtreg IR test, and extends the MaskedLogicOpts
   JMH benchmark to cover the 128-bit long type.
   
   JMH results show **11% - 54%** performance improvements for the
   optimized cases, and all jtreg tests (tier1, tier2 and tier3) pass on
   SVE2, SVE1, and NEON configurations.
   
   On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
   ```
   Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
   bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
   bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
   bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
   bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
   bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
   bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
   ```
   
   On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
   ```
   Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
   bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
   bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
   bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
   bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
   bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
   bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
   ```
   
   On an Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
   ```
   Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
   bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
   bitwiseBlendOperationInt128   ops/s  512.00    1145.50  12.40  1735.24  10.43  1.51
   bitwiseBlendOperationInt128   ops/s  1024.00   571.41   6.51   866.01   3.34   1.52
   bitwiseBlendOperationLong128  ops/s  256.00    1140.38  13.77  1740.28  11.16  1.53
   bitwiseBlendOperationLong128  ops/s  512.00    570.20   7.58   865.67   3.33   1.52
   bitwiseBlendOperationLong128  ops/s  1024.00   280.94   2.58   432.78   0.19   1.54
   ```
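
For readers unfamiliar with the pattern, the identity behind the IGVN rewrite of `XorV(a, AndV(sel, XorV(a, b)))` can be checked on scalars. The sketch below is illustrative only: the class and method names are hypothetical and are not part of the patch. It shows that the XOR/AND/XOR form is equivalent to a per-bit select that takes the bit from `b` where `sel` is 1 and from `a` where `sel` is 0.

```java
import java.util.Random;

// Hypothetical illustration class; not part of the pull request.
final class BitwiseBlendSketch {

    // XOR/AND/XOR form, mirroring XorV(a, AndV(sel, XorV(a, b))).
    static long blendXorForm(long a, long b, long sel) {
        return a ^ ((a ^ b) & sel);
    }

    // Per-bit select form: bit of b where sel is 1, bit of a where sel is 0.
    static long blendSelectForm(long a, long b, long sel) {
        return (a & ~sel) | (b & sel);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // Compare both forms on random 64-bit inputs.
        for (int i = 0; i < 1_000; i++) {
            long a = rnd.nextLong();
            long b = rnd.nextLong();
            long sel = rnd.nextLong();
            if (blendXorForm(a, b, sel) != blendSelectForm(a, b, sel)) {
                throw new AssertionError("forms differ for a=" + a + " b=" + b + " sel=" + sel);
            }
        }
        System.out.println("both forms agree");
    }
}
```

Because the two forms agree bit for bit, a single bitwise-select operation can stand in for the three-node sequence, which is the kind of replacement the commit message describes.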

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/31269/files
  - new: https://git.openjdk.org/jdk/pull/31269/files/55d4e819..b3fe4a97

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=31269&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=31269&range=00-01

  Stats: 134539 lines in 1702 files changed: 30993 ins; 96111 del; 7435 mod
  Patch: https://git.openjdk.org/jdk/pull/31269.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/31269/head:pull/31269

PR: https://git.openjdk.org/jdk/pull/31269
