On Fri, 14 Nov 2025 01:17:50 GMT, Eric Fang <[email protected]> wrote:

> `VectorMaskCastNode` is used to cast a vector mask from one type to another 
> type. The cast may be generated by calling the vector API `cast` or generated 
> by the compiler. For example, some vector mask operations like `trueCount` 
> require the input mask to be integer types, so for floating point type masks, 
> the compiler will cast the mask to the corresponding integer type mask 
> automatically before doing the mask operation. This kind of cast is very 
> common.
> 
> If the vector element size is not changed, the `VectorMaskCastNode` doesn't 
> generate code; otherwise code will be generated to extend or narrow the mask. 
> This IR node is not free whether or not it generates code, because it may 
> block some optimizations. For example:
> 1. `(VectorStoreMask (VectorMaskCast (VectorLoadMask x)))`. The middle 
> `VectorMaskCast` prevents the following optimization: `(VectorStoreMask 
> (VectorLoadMask x)) => (x)`
> 2. `(VectorMaskToLong (VectorMaskCast (VectorLongToMask x)))`, which blocks 
> the optimization `(VectorMaskToLong (VectorLongToMask x)) => (x)`.
> 
> In these IR patterns, the value of the input `x` is not changed, so we can 
> safely do the optimization. But if the input value is changed, we can't 
> eliminate the cast.
> 
> The general idea of this PR is to introduce an `uncast_mask` helper function, 
> which can be used to uncast a chain of `VectorMaskCastNode`, like the 
> existing `Node::uncast(bool)` function. The function returns the first node 
> that is not a `VectorMaskCastNode`.
> 
> The intended use case is when the IR pattern to be optimized may contain one 
> or more consecutive `VectorMaskCastNode` and this does not affect the 
> correctness of the optimization. Then this function can be called to 
> eliminate the `VectorMaskCastNode` chain.
> 
> Current optimizations related to `VectorMaskCastNode` include:
> 1. `(VectorMaskCast (VectorMaskCast x)) => (x)`, see JDK-8356760.
> 2. `(XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) => 
> (VectorMaskCast (VectorMaskCmp src1 src2 ncond))`, see JDK-8354242.
> 
> This PR does the following optimizations:
> 1. Extends the optimization pattern `(VectorMaskCast (VectorMaskCast x)) => 
> (x)` to `(VectorMaskCast (VectorMaskCast ... (VectorMaskCast x))) => (x)`. 
> This is correct as long as the types of the head and tail 
> `VectorMaskCastNode` are consistent.
> 2. Supports a new optimization pattern `(VectorStoreMask (VectorMaskCast ... 
> (VectorLoadMask x))) => (x)`. Since the value before and after the pattern is 
> a boolean vector, it remains unchanged as long as th...
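
As a rough illustration of the `uncast_mask` idea in the quoted description above, here is a minimal, self-contained sketch on a toy IR. Everything in it is hypothetical: the `Node` class, the string opcodes, and the method names `uncastMask` and `simplifyStoreMask` are illustration only and are not the C2 classes or the code in the PR. It also ignores the type and vector-length conditions that the real optimization has to check.

```java
// Toy model only: this is NOT HotSpot's C2 Node hierarchy.
final class MaskCastSketch {
    static final class Node {
        final String op;   // hypothetical opcode name, e.g. "VectorMaskCast"
        final Node input;  // single input, or null for a leaf
        Node(String op, Node input) {
            this.op = op;
            this.input = input;
        }
    }

    // Walk through consecutive VectorMaskCast nodes and return the first
    // node that is not a cast (or null if the chain ends without one).
    static Node uncastMask(Node n) {
        while (n != null && n.op.equals("VectorMaskCast")) {
            n = n.input;
        }
        return n;
    }

    // (VectorStoreMask (VectorMaskCast ... (VectorLoadMask x))) => x
    // Returns x when the pattern matches, otherwise the original node.
    // Assumption: no type or length checks are modelled here.
    static Node simplifyStoreMask(Node store) {
        if (!store.op.equals("VectorStoreMask")) {
            return store;
        }
        Node in = uncastMask(store.input);
        if (in != null && in.op.equals("VectorLoadMask")) {
            return in.input;
        }
        return store;
    }

    public static void main(String[] args) {
        Node x = new Node("x", null);
        Node chain = new Node("VectorStoreMask",
            new Node("VectorMaskCast",
                new Node("VectorMaskCast",
                    new Node("VectorLoadMask", x))));
        // The whole chain collapses to the original input x.
        System.out.println(simplifyStoreMask(chain) == x);
    }
}
```

The point of the helper is that the pattern match is written once against the uncast input, so any number of intermediate casts is handled without a separate rule per chain length.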

I updated the JMH benchmarks. The new test results are:

On a Nvidia Grace machine with 128-bit SVE2:

    Benchmark                       Unit    Before  Error   After   Error   Uplift
    microMaskLoadCastStoreByte64    ops/us  64.29   0.02    146.67  0.09    2.28
    microMaskLoadCastStoreDouble128 ops/us  10.05   0.00    38.10   0.01    3.79
    microMaskLoadCastStoreFloat128  ops/us  19.94   0.00    75.05   0.07    3.76
    microMaskLoadCastStoreInt128    ops/us  19.94   0.00    75.13   0.01    3.77
    microMaskLoadCastStoreLong128   ops/us  10.04   0.00    38.09   0.01    3.79
    microMaskLoadCastStoreShort64   ops/us  31.52   0.02    75.12   0.02    2.38

On a Nvidia Grace machine with 128-bit NEON:

    Benchmark                       Unit    Before  Error   After   Error   Uplift
    microMaskLoadCastStoreByte64    ops/us  73.33   0.01    147.01  0.06    2.00
    microMaskLoadCastStoreDouble128 ops/us  8.54    0.03    38.19   0.01    4.47
    microMaskLoadCastStoreFloat128  ops/us  23.75   0.01    75.27   0.10    3.17
    microMaskLoadCastStoreInt128    ops/us  23.73   0.01    75.25   0.07    3.17
    microMaskLoadCastStoreLong128   ops/us  8.56    0.03    38.19   0.01    4.46
    microMaskLoadCastStoreShort64   ops/us  24.32   0.00    75.35   0.07    3.10

On an AMD EPYC 9124 16-Core Processor with AVX3:

    Benchmark                       Unit    Before  Error   After   Error   Uplift
    microMaskLoadCastStoreByte64    ops/us  82.39   0.11    115.15  0.03    1.40
    microMaskLoadCastStoreDouble128 ops/us  0.32    0.00    0.32    0.00    0.99
    microMaskLoadCastStoreFloat128  ops/us  42.10   0.10    57.58   0.02    1.37
    microMaskLoadCastStoreInt128    ops/us  42.10   0.08    57.57   0.02    1.37
    microMaskLoadCastStoreLong128   ops/us  0.32    0.00    0.32    0.00    0.99
    microMaskLoadCastStoreShort64   ops/us  42.16   0.05    57.54   0.04    1.36


On an AMD EPYC 9124 16-Core Processor with AVX2:

    Benchmark                       Unit    Before  Error   After   Error   Uplift
    microMaskLoadCastStoreByte64    ops/us  73.59   0.27    115.14  0.04    1.56
    microMaskLoadCastStoreDouble128 ops/us  0.30    0.00    0.30    0.00    1.01
    microMaskLoadCastStoreFloat128  ops/us  30.68   0.09    57.57   0.02    1.88
    microMaskLoadCastStoreInt128    ops/us  30.75   0.09    57.58   0.01    1.87
    microMaskLoadCastStoreLong128   ops/us  0.30    0.00    0.30    0.00    1.00
    microMaskLoadCastStoreShort64   ops/us  24.95   0.01    57.59   0.01    2.31

On an AMD EPYC 9124 16-Core Processor with AVX1:

    Benchmark                       Unit    Before  Error   After   Error   Uplift
    microMaskLoadCastStoreByte64    ops/us  73.68   0.02    115.17  0.03    1.56
    microMaskLoadCastStoreDouble128 ops/us  0.30    0.00    0.30    0.00    1.01
    microMaskLoadCastStoreFloat128  ops/us  30.80   0.12    57.59   0.01    1.87
    microMaskLoadCastStoreInt128    ops/us  30.70   0.11    57.58   0.01    1.88
    microMaskLoadCastStoreLong128   ops/us  0.30    0.00    0.30    0.00    0.99
    microMaskLoadCastStoreShort64   ops/us  24.95   0.01    57.56   0.02    2.31
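
For reading the tables, the Uplift column is consistent with After divided by Before. The sketch below assumes that definition (the email does not state it) and uses a hypothetical helper name, `uplift`. Values recomputed from the rounded table entries can differ from the reported ones in the last digit, as in the rows showing 0.32 before and after with an uplift of 0.99.

```java
import java.util.Locale;

final class UpliftSketch {
    // Assumed definition: throughput after the change divided by throughput before.
    static double uplift(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        // microMaskLoadCastStoreByte64 on the 128-bit SVE2 run: 64.29 -> 146.67 ops/us
        System.out.println(String.format(Locale.ROOT, "%.2f", uplift(64.29, 146.67)));
    }
}
```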

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28313#issuecomment-3555660413
