Re: RFR: 8370863: VectorAPI: Optimize the VectorMaskCast chain in specific patterns [v2]

Eric Fang Wed, 19 Nov 2025 20:04:41 -0800

> `VectorMaskCastNode` is used to cast a vector mask from one type to another 
> type. The cast may be generated by calling the vector API `cast` or generated 
> by the compiler. For example, some vector mask operations like `trueCount` 
> require the input mask to be integer types, so for floating point type masks, 
> the compiler will cast the mask to the corresponding integer type mask 
> automatically before doing the mask operation. This kind of cast is very 
> common.
> 
> If the vector element size is not changed, the `VectorMaskCastNode` does not 
> generate code; otherwise code is generated to extend or narrow the mask. 
> This IR node is not free whether or not it generates code, because it may 
> block some optimizations. For example:
> 1. `(VectorStoreMask (VectorMaskCast (VectorLoadMask x)))`: the middle 
> `VectorMaskCast` prevents the optimization `(VectorStoreMask 
> (VectorLoadMask x)) => (x)`.
> 2. `(VectorMaskToLong (VectorMaskCast (VectorLongToMask x)))`, which blocks 
> the optimization `(VectorMaskToLong (VectorLongToMask x)) => (x)`.
> 
> In these IR patterns, the value of the input `x` is not changed, so we can 
> safely do the optimization. But if the input value is changed, we can't 
> eliminate the cast.
> 
> The general idea of this PR is to introduce an `uncast_mask` helper function, 
> which can be used to uncast a chain of `VectorMaskCastNode`, like the 
> existing `Node::uncast(bool)` function. The function returns the first node 
> that is not a `VectorMaskCastNode`.
> 
> The intended use case is when the IR pattern to be optimized may contain one 
> or more consecutive `VectorMaskCastNode` and this does not affect the 
> correctness of the optimization. Then this function can be called to 
> eliminate the `VectorMaskCastNode` chain.
> 
> Current optimizations related to `VectorMaskCastNode` include:
> 1. `(VectorMaskCast (VectorMaskCast x)) => (x)`, see JDK-8356760.
> 2. `(XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) => 
> (VectorMaskCast (VectorMaskCmp src1 src2 ncond))`, see JDK-8354242.
> 
> This PR does the following optimizations:
> 1. Extends the optimization pattern `(VectorMaskCast (VectorMaskCast x)) => 
> (x)` to `(VectorMaskCast (VectorMaskCast ... (VectorMaskCast x))) => (x)`. 
> This is correct as long as the types of the head and tail 
> `VectorMaskCastNode` are consistent.
> 2. Supports a new optimization pattern `(VectorStoreMask (VectorMaskCast ... 
> (VectorLoadMask x))) => (x)`. Since the value before and after the pattern is 
> a boolean vector, it remains unchanged as long as th...


Eric Fang has updated the pull request with a new target base due to a merge or 
a rebase. The incremental webrev excludes the unrelated changes brought in by 
the merge/rebase. The pull request contains three additional commits since the 
last revision:

 - Don't read and write the same memory in the JMH benchmarks
 - Merge branch 'master' into JDK-8370863-mask-cast-opt
 - 8370863: VectorAPI: Optimize the VectorMaskCast chain in specific patterns
   
   `VectorMaskCastNode` is used to cast a vector mask from one type to
   another type. The cast may be generated by calling the vector API `cast`
   or generated by the compiler. For example, some vector mask operations
   like `trueCount` require the input mask to be integer types, so for
   floating point type masks, the compiler will cast the mask to the
   corresponding integer type mask automatically before doing the mask
   operation. This kind of cast is very common.
   
   If the vector element size is not changed, the `VectorMaskCastNode`
   does not generate code; otherwise code is generated to extend or narrow
   the mask. This IR node is not free whether or not it generates code,
   because it may block some optimizations. For example:
   1. `(VectorStoreMask (VectorMaskCast (VectorLoadMask x)))`
   The middle `VectorMaskCast` prevents the following optimization:
   `(VectorStoreMask (VectorLoadMask x)) => (x)`
   2. `(VectorMaskToLong (VectorMaskCast (VectorLongToMask x)))`, which
   blocks the optimization `(VectorMaskToLong (VectorLongToMask x)) => (x)`.
   
   In these IR patterns, the value of the input `x` is not changed, so we
   can safely do the optimization. But if the input value is changed, we
   can't eliminate the cast.
   
   The general idea of this PR is to introduce an `uncast_mask` helper
   function, which can be used to uncast a chain of `VectorMaskCastNode`,
   like the existing `Node::uncast(bool)` function. The function returns
   the first node that is not a `VectorMaskCastNode`.
   
   The intended use case is when the IR pattern to be optimized may
   contain one or more consecutive `VectorMaskCastNode` and this does not
   affect the correctness of the optimization. Then this function can be
   called to eliminate the `VectorMaskCastNode` chain.
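
As an illustration of the chain-skipping idea, here is a minimal sketch in Java rather than HotSpot C++. The `Node` class, its string opcodes, and the `uncastMask` method are hypothetical stand-ins for the C2 IR and the `uncast_mask` helper; this is not the code from the PR.

```java
final class UncastMaskSketch {
    // Toy IR node (hypothetical, not HotSpot's Node class): an opcode name and one input.
    static final class Node {
        final String op;
        final Node in;

        Node(String op, Node in) {
            this.op = op;
            this.in = in;
        }

        boolean isMaskCast() {
            return op.equals("VectorMaskCast");
        }
    }

    // Skip consecutive VectorMaskCast nodes and return the first node that is not a cast.
    static Node uncastMask(Node n) {
        while (n != null && n.isMaskCast()) {
            n = n.in;
        }
        return n;
    }

    public static void main(String[] args) {
        Node x = new Node("x", null);
        Node load = new Node("VectorLoadMask", x);
        Node chain = new Node("VectorMaskCast", new Node("VectorMaskCast", load));
        System.out.println(uncastMask(chain).op); // prints "VectorLoadMask"
    }
}
```

A node that is not a cast is returned unchanged, so a caller can apply the helper unconditionally before matching the pattern underneath.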
   
   Current optimizations related to `VectorMaskCastNode` include:
   1. `(VectorMaskCast (VectorMaskCast x)) => (x)`, see JDK-8356760.
   2. `(XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1))
       => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))`, see JDK-8354242.
   
   This PR does the following optimizations:
   1. Extends the optimization pattern
   `(VectorMaskCast (VectorMaskCast x)) => (x)` to
   `(VectorMaskCast (VectorMaskCast ... (VectorMaskCast x))) => (x)`.
   This is correct as long as the types of the head and tail
   `VectorMaskCastNode` are consistent.
   2. Supports a new optimization pattern
   `(VectorStoreMask (VectorMaskCast ... (VectorLoadMask x))) => (x)`.
   Since the value before and after the pattern is a boolean vector, it
   remains unchanged as long as the vector length remains the same, and
   this is guaranteed at the API level.
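
The second pattern could be sketched on a toy IR as follows. Everything here is an assumption made for illustration: the `Node` class, the `lanes` field, and `foldStoreMask` are hypothetical names, and the explicit lane-count check only models the "vector length remains the same" condition rather than reproducing how C2 verifies it.

```java
final class StoreMaskFoldSketch {
    // Toy IR node (hypothetical): opcode name, one input, and the vector length in lanes.
    static final class Node {
        final String op;
        final Node in;
        final int lanes;

        Node(String op, Node in, int lanes) {
            this.op = op;
            this.in = in;
            this.lanes = lanes;
        }
    }

    // Skip consecutive VectorMaskCast nodes.
    static Node uncastMask(Node n) {
        while (n != null && n.op.equals("VectorMaskCast")) {
            n = n.in;
        }
        return n;
    }

    // Fold (VectorStoreMask (VectorMaskCast ... (VectorLoadMask x))) to x when the
    // lane count is unchanged; otherwise return the node untouched.
    static Node foldStoreMask(Node store) {
        if (!store.op.equals("VectorStoreMask")) {
            return store;
        }
        Node inner = uncastMask(store.in);
        if (inner != null && inner.op.equals("VectorLoadMask") && inner.lanes == store.lanes) {
            return inner.in;
        }
        return store;
    }

    public static void main(String[] args) {
        Node x = new Node("BoolVector", null, 4);
        Node load = new Node("VectorLoadMask", x, 4);
        Node cast = new Node("VectorMaskCast", new Node("VectorMaskCast", load, 4), 4);
        Node store = new Node("VectorStoreMask", cast, 4);
        System.out.println(foldStoreMask(store).op); // prints "BoolVector"
    }
}
```

If the node under the cast chain is not a `VectorLoadMask`, the sketch leaves the store in place, which mirrors the point that the cast cannot be dropped when the input value may have changed.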
   
   I conducted some simple research on different mask generation methods
   and mask operations, and obtained the following table, which includes
   some potential optimization opportunities that may use this `uncast_mask`
   function.
   
   ```
   mask_gen\op    toLong   anyTrue allTrue trueCount firstTrue lastTrue
   compare        N/A      N/A     N/A     N/A       N/A       N/A
   maskAll        TBI      TBI     TBI     TBI       TBI       TBI
   fromLong       TBI      TBI     N/A     TBI       TBI       TBI
   
   mask_gen\op    and      or      xor     andNot    not       laneIsSet
   compare        N/A      N/A     N/A     N/A       TBI       N/A
   maskAll        TBI      TBI     TBI     TBI       TBI       TBI
   fromLong       N/A      N/A     N/A     N/A       TBI       TBI
   ```
   `TBI` indicates that there may be potential optimizations here that
   require further investigation.
   
   Benchmarks:
   
   On an Nvidia Grace machine with 128-bit SVE2:
   ```
   Benchmark                        Unit    Before  Error   After   Error   Uplift
   microMaskLoadCastStoreByte64     ops/us  59.23   0.21    148.12  0.07    2.50
   microMaskLoadCastStoreDouble128  ops/us  2.43    0.00    38.31   0.01    15.73
   microMaskLoadCastStoreFloat128   ops/us  6.19    0.00    75.67   0.11    12.22
   microMaskLoadCastStoreInt128     ops/us  6.19    0.00    75.67   0.03    12.22
   microMaskLoadCastStoreLong128    ops/us  2.43    0.00    38.32   0.01    15.74
   microMaskLoadCastStoreShort64    ops/us  28.89   0.02    75.60   0.09    2.62
   ```
   
   On an Nvidia Grace machine with 128-bit NEON:
   ```
   Benchmark                        Unit    Before  Error   After   Error   Uplift
   microMaskLoadCastStoreByte64     ops/us  75.75   0.19    149.74  0.08    1.98
   microMaskLoadCastStoreDouble128  ops/us  8.71    0.03    38.71   0.05    4.44
   microMaskLoadCastStoreFloat128   ops/us  24.05   0.03    76.49   0.05    3.18
   microMaskLoadCastStoreInt128     ops/us  24.06   0.02    76.51   0.05    3.18
   microMaskLoadCastStoreLong128    ops/us  8.72    0.01    38.71   0.02    4.44
   microMaskLoadCastStoreShort64    ops/us  24.64   0.01    76.43   0.06    3.10
   ```
   
   On an AMD EPYC 9124 16-Core Processor with AVX3:
   ```
   Benchmark                        Unit    Before  Error   After   Error   Uplift
   microMaskLoadCastStoreByte64     ops/us  82.13   0.31    115.14  0.08    1.40
   microMaskLoadCastStoreDouble128  ops/us  0.32    0.00    0.32    0.00    1.01
   microMaskLoadCastStoreFloat128   ops/us  42.18   0.05    57.56   0.07    1.36
   microMaskLoadCastStoreInt128     ops/us  42.19   0.01    57.53   0.08    1.36
   microMaskLoadCastStoreLong128    ops/us  0.30    0.01    0.32    0.00    1.05
   microMaskLoadCastStoreShort64    ops/us  42.18   0.05    57.59   0.01    1.37
   ```
   
   On an AMD EPYC 9124 16-Core Processor with AVX2:
   ```
   Benchmark                        Unit    Before  Error   After   Error   Uplift
   microMaskLoadCastStoreByte64     ops/us  73.53   0.20    114.98  0.03    1.56
   microMaskLoadCastStoreDouble128  ops/us  0.29    0.01    0.30    0.00    1.00
   microMaskLoadCastStoreFloat128   ops/us  30.78   0.14    57.50   0.01    1.87
   microMaskLoadCastStoreInt128     ops/us  30.65   0.26    57.50   0.01    1.88
   microMaskLoadCastStoreLong128    ops/us  0.30    0.00    0.30    0.00    0.99
   microMaskLoadCastStoreShort64    ops/us  24.92   0.00    57.49   0.01    2.31
   ```
   
   On an AMD EPYC 9124 16-Core Processor with AVX1:
   ```
   Benchmark                        Unit    Before  Error   After   Error   Uplift
   microMaskLoadCastStoreByte64     ops/us  79.68   0.01    248.49  0.91    3.12
   microMaskLoadCastStoreDouble128  ops/us  0.28    0.00    0.28    0.00    1.00
   microMaskLoadCastStoreFloat128   ops/us  31.11   0.04    95.48   2.27    3.07
   microMaskLoadCastStoreInt128     ops/us  31.10   0.03    99.94   1.87    3.21
   microMaskLoadCastStoreLong128    ops/us  0.28    0.00    0.28    0.00    0.99
   microMaskLoadCastStoreShort64    ops/us  31.11   0.02    94.97   2.30    3.05
   ```
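
The tables do not define the `Uplift` column. It appears to be the ratio of the `After` score to the `Before` score, which is an assumption based on the numbers rather than something stated in the message. A small sketch of that reading, using the SVE2 `microMaskLoadCastStoreByte64` row (the class and method names are made up for the example):

```java
final class UpliftSketch {
    // Assumed definition: throughput after the change divided by throughput before it.
    static double uplift(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        // SVE2 row for microMaskLoadCastStoreByte64: Before 59.23, After 148.12 (ops/us).
        double u = uplift(59.23, 148.12);
        System.out.println(String.format(java.util.Locale.ROOT, "%.2f", u)); // prints "2.50"
    }
}
```

Because the published scores are rounded to two decimals, recomputing the ratio from them can differ from the listed uplift in the last digit for rows with small scores.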
   
   This PR was tested on 128-bit, 256-bit, and 512-bit (QEMU) aarch64
   environments, and on two 512-bit x64 machines with various
   configurations, including sve2, sve1, neon, avx3, avx2, avx1, sse4 and
   sse3. All tests passed.

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/28313/files
  - new: https://git.openjdk.org/jdk/pull/28313/files/fca9b3e5..3b0ff7d6

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=28313&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=28313&range=00-01

  Stats: 28723 lines in 501 files changed: 18169 ins; 7171 del; 3383 mod
  Patch: https://git.openjdk.org/jdk/pull/28313.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28313/head:pull/28313

PR: https://git.openjdk.org/jdk/pull/28313
