`VectorMaskCastNode` casts a vector mask from one element type to another. The 
cast may be generated by a call to the Vector API `cast` method, or inserted by 
the compiler. For example, some vector mask operations like `trueCount` require 
the input mask to have an integer element type, so for floating-point masks the 
compiler automatically casts the mask to the corresponding integer-type mask 
before performing the operation. This kind of cast is very common.

If the vector element size is unchanged, the `VectorMaskCastNode` generates no 
code; otherwise, code is generated to extend or narrow the mask. Either way, 
this IR node is not free, because it may block other optimizations. For example:
1. `(VectorStoreMask (VectorMaskCast (VectorLoadMask x)))`: the intermediate 
`VectorMaskCast` prevents the optimization `(VectorStoreMask 
(VectorLoadMask x)) => (x)`.
2. `(VectorMaskToLong (VectorMaskCast (VectorLongToMask x)))`: the intermediate 
`VectorMaskCast` blocks the optimization `(VectorMaskToLong 
(VectorLongToMask x)) => (x)`.

In these IR patterns, the casts do not change the value of the input `x`, so 
the optimization is safe. If a cast did change the value, the cast could not be 
eliminated.

The general idea of this PR is to introduce an `uncast_mask` helper function, 
which strips a chain of `VectorMaskCastNode`s, similar to the existing 
`Node::uncast(bool)` function. The function returns the first node that is not 
a `VectorMaskCastNode`.

The intended use case is an IR pattern whose match may contain one or more 
consecutive `VectorMaskCastNode`s that do not affect the correctness of the 
optimization. This function can then be called to skip the whole 
`VectorMaskCastNode` chain.

Current optimizations related to `VectorMaskCastNode` include:
1. `(VectorMaskCast (VectorMaskCast x)) => (x)`, see JDK-8356760.
2. `(XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) => 
(VectorMaskCast (VectorMaskCmp src1 src2 ncond))`, see JDK-8354242.

This PR does the following optimizations:
1. Extends the optimization pattern `(VectorMaskCast (VectorMaskCast x)) => 
(x)` to `(VectorMaskCast (VectorMaskCast ... (VectorMaskCast x))) => (x)`. 
This is correct as long as the types of the head and tail 
`VectorMaskCastNode` are consistent.
2. Adds a new optimization pattern `(VectorStoreMask (VectorMaskCast ... 
(VectorLoadMask x))) => (x)`. Since the value both before and after this 
pattern is a boolean vector, it remains unchanged as long as the vector length 
stays the same, which is guaranteed at the API level.

I surveyed the different mask generation methods and mask operations and 
obtained the following table, which lists potential optimization opportunities 
that may use this `uncast_mask` function.

mask_gen/op | toLong | anyTrue | allTrue | trueCount | firstTrue | lastTrue | and | or | xor | andNot | not | laneIsSet
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
compare | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | TBI | N/A
maskAll | TBI | TBI | TBI | TBI | TBI | TBI | TBI | TBI | TBI | TBI | TBI | TBI
fromLong | TBI | TBI | N/A | TBI | TBI | TBI | N/A | N/A | N/A | N/A | TBI | TBI

`TBI` indicates that there may be a potential optimization that requires 
further investigation.

Benchmarks:

On an Nvidia Grace machine with 128-bit SVE2:

Benchmark                       Unit    Before  Error   After   Error   Uplift
microMaskLoadCastStoreByte64    ops/us  59.23   0.21    148.12  0.07    2.50
microMaskLoadCastStoreDouble128 ops/us  2.43    0.00    38.31   0.01    15.73
microMaskLoadCastStoreFloat128  ops/us  6.19    0.00    75.67   0.11    12.22
microMaskLoadCastStoreInt128    ops/us  6.19    0.00    75.67   0.03    12.22
microMaskLoadCastStoreLong128   ops/us  2.43    0.00    38.32   0.01    15.74
microMaskLoadCastStoreShort64   ops/us  28.89   0.02    75.60   0.09    2.62


On an Nvidia Grace machine with 128-bit NEON:

Benchmark                       Unit    Before  Error   After   Error   Uplift
microMaskLoadCastStoreByte64    ops/us  75.75   0.19    149.74  0.08    1.98
microMaskLoadCastStoreDouble128 ops/us  8.71    0.03    38.71   0.05    4.44
microMaskLoadCastStoreFloat128  ops/us  24.05   0.03    76.49   0.05    3.18
microMaskLoadCastStoreInt128    ops/us  24.06   0.02    76.51   0.05    3.18
microMaskLoadCastStoreLong128   ops/us  8.72    0.01    38.71   0.02    4.44
microMaskLoadCastStoreShort64   ops/us  24.64   0.01    76.43   0.06    3.10


On an AMD EPYC 9124 16-Core Processor with AVX3:

Benchmark                       Unit    Before  Error   After   Error   Uplift
microMaskLoadCastStoreByte64    ops/us  82.13   0.31    115.14  0.08    1.40
microMaskLoadCastStoreDouble128 ops/us  0.32    0.00    0.32    0.00    1.01
microMaskLoadCastStoreFloat128  ops/us  42.18   0.05    57.56   0.07    1.36
microMaskLoadCastStoreInt128    ops/us  42.19   0.01    57.53   0.08    1.36
microMaskLoadCastStoreLong128   ops/us  0.30    0.01    0.32    0.00    1.05
microMaskLoadCastStoreShort64   ops/us  42.18   0.05    57.59   0.01    1.37


On an AMD EPYC 9124 16-Core Processor with AVX2:

Benchmark                       Unit    Before  Error   After   Error   Uplift
microMaskLoadCastStoreByte64    ops/us  73.53   0.20    114.98  0.03    1.56
microMaskLoadCastStoreDouble128 ops/us  0.29    0.01    0.30    0.00    1.00
microMaskLoadCastStoreFloat128  ops/us  30.78   0.14    57.50   0.01    1.87
microMaskLoadCastStoreInt128    ops/us  30.65   0.26    57.50   0.01    1.88
microMaskLoadCastStoreLong128   ops/us  0.30    0.00    0.30    0.00    0.99
microMaskLoadCastStoreShort64   ops/us  24.92   0.00    57.49   0.01    2.31


On an AMD EPYC 9124 16-Core Processor with AVX1:

Benchmark                       Unit    Before  Error   After   Error   Uplift
microMaskLoadCastStoreByte64    ops/us  79.68   0.01    248.49  0.91    3.12
microMaskLoadCastStoreDouble128 ops/us  0.28    0.00    0.28    0.00    1.00
microMaskLoadCastStoreFloat128  ops/us  31.11   0.04    95.48   2.27    3.07
microMaskLoadCastStoreInt128    ops/us  31.10   0.03    99.94   1.87    3.21
microMaskLoadCastStoreLong128   ops/us  0.28    0.00    0.28    0.00    0.99
microMaskLoadCastStoreShort64   ops/us  31.11   0.02    94.97   2.30    3.05


This PR was tested on 128-bit, 256-bit, and 512-bit (QEMU) aarch64 
environments and on two 512-bit x64 machines, with various configurations 
including sve2, sve1, neon, avx3, avx2, avx1, sse4 and sse3. All tests passed.

-------------

Commit messages:
 - 8370863: VectorAPI: Optimize the VectorMaskCast chain in specific patterns

Changes: https://git.openjdk.org/jdk/pull/28313/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28313&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8370863
  Stats: 582 lines in 7 files changed: 498 ins; 0 del; 84 mod
  Patch: https://git.openjdk.org/jdk/pull/28313.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28313/head:pull/28313

PR: https://git.openjdk.org/jdk/pull/28313
