> This patch adds intrinsic support for UMIN and UMAX reduction operations in
> the Vector API on AArch64, enabling direct hardware instruction mapping for
> better performance.
>
> Changes:
> --------
>
> 1. C2 mid-end:
> - Added UMinReductionVNode and UMaxReductionVNode
>
> 2. AArch64 Backend:
> - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
> - Updated match rules for all vector sizes and element types
> - Both NEON and SVE implementations are supported
>
> 3. Test:
> - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
> - Added assembly tests in aarch64-asmtest.py for new instructions
> - Added a JTReg test file VectorUMinMaxReductionTest.java
>
> Different configurations were tested on aarch64 and x86 machines, and all
> tests passed.
>
> Test results of JMH benchmarks from the panama-vector project:
> --------
>
> On an NVIDIA Grace machine with 128-bit SVE:
>
> Benchmark                       Unit     Before    Error     After    Error  Uplift
> Byte128Vector.UMAXLanes         ops/ms   411.60    42.18  25226.51    33.92   61.29
> Byte128Vector.UMAXMaskedLanes   ops/ms   558.56    85.12  25182.90    28.74   45.09
> Byte128Vector.UMINLanes         ops/ms   645.58   780.76  28396.29   103.11   43.99
> Byte128Vector.UMINMaskedLanes   ops/ms   621.09   718.27  26122.62    42.68   42.06
> Byte64Vector.UMAXLanes          ops/ms   296.33    34.44  14357.74    15.95   48.45
> Byte64Vector.UMAXMaskedLanes    ops/ms   376.54    44.01  14269.24    21.41   37.90
> Byte64Vector.UMINLanes          ops/ms   373.45   426.51  15425.36    66.20   41.31
> Byte64Vector.UMINMaskedLanes    ops/ms   353.32   346.87  14201.37    13.79   40.19
> Int128Vector.UMAXLanes          ops/ms   174.79   192.51   9906.07   286.93   56.67
> Int128Vector.UMAXMaskedLanes    ops/ms   157.23   206.68  10246.77    11.44   65.17
> Int64Vector.UMAXLanes           ops/ms    95.30   126.49   4719.30    98.57   49.52
> Int64Vector.UMAXMaskedLanes     ops/ms    88.19    87.44   4693.18    19.76   53.22
> Long128Vector.UMAXLanes         ops/ms    80.62    97.82   5064.01    35.52   62.82
> Long128Vector.UMAXMaskedLanes   ops/ms    78.15   102.91   5028.24     8.74   64.34
> Long64Vector.UMAXLanes          ops/ms    47.56    62.01     46.76    52.28    0.98
> Long64Vector.UMAXMaskedLanes    ops/ms    45.44    46.76     45.79    42.91    1.01
> Short128Vector.UMAXLanes        ops/ms   316.65   410.30  14814.82    23.65   46.79
> Short128Vector.UMAXMaskedLanes  ops/ms   308.90   351.78  15155.26    31.03   49.06
> Sh...
Eric Fang has updated the pull request with a new target base due to a merge or
a rebase. The pull request now contains four commits:
- Rebase commit 56d7b52
- Merge branch 'master' into JDK-8372980-umin-umax-intrinsic
- 8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max
reduction operations
This patch adds intrinsic support for UMIN and UMAX reduction operations
in the Vector API on AArch64, enabling direct hardware instruction mapping
for better performance.
Changes:
--------
1. C2 mid-end:
- Added UMinReductionVNode and UMaxReductionVNode
2. AArch64 Backend:
- Added uminp/umaxp/sve_uminv/sve_umaxv instructions
- Updated match rules for all vector sizes and element types
- Both NEON and SVE implementations are supported
3. Test:
- Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
- Added assembly tests in aarch64-asmtest.py for new instructions
- Added a JTReg test file VectorUMinMaxReductionTest.java
Different configurations were tested on aarch64 and x86 machines, and
all tests passed.
Test results of JMH benchmarks from the panama-vector project:
--------
On an NVIDIA Grace machine with 128-bit SVE:
```
Benchmark                       Unit     Before    Error     After    Error  Uplift
Byte128Vector.UMAXLanes         ops/ms   411.60    42.18  25226.51    33.92   61.29
Byte128Vector.UMAXMaskedLanes   ops/ms   558.56    85.12  25182.90    28.74   45.09
Byte128Vector.UMINLanes         ops/ms   645.58   780.76  28396.29   103.11   43.99
Byte128Vector.UMINMaskedLanes   ops/ms   621.09   718.27  26122.62    42.68   42.06
Byte64Vector.UMAXLanes          ops/ms   296.33    34.44  14357.74    15.95   48.45
Byte64Vector.UMAXMaskedLanes    ops/ms   376.54    44.01  14269.24    21.41   37.90
Byte64Vector.UMINLanes          ops/ms   373.45   426.51  15425.36    66.20   41.31
Byte64Vector.UMINMaskedLanes    ops/ms   353.32   346.87  14201.37    13.79   40.19
Int128Vector.UMAXLanes          ops/ms   174.79   192.51   9906.07   286.93   56.67
Int128Vector.UMAXMaskedLanes    ops/ms   157.23   206.68  10246.77    11.44   65.17
Int64Vector.UMAXLanes           ops/ms    95.30   126.49   4719.30    98.57   49.52
Int64Vector.UMAXMaskedLanes     ops/ms    88.19    87.44   4693.18    19.76   53.22
Long128Vector.UMAXLanes         ops/ms    80.62    97.82   5064.01    35.52   62.82
Long128Vector.UMAXMaskedLanes   ops/ms    78.15   102.91   5028.24     8.74   64.34
Long64Vector.UMAXLanes          ops/ms    47.56    62.01     46.76    52.28    0.98
Long64Vector.UMAXMaskedLanes    ops/ms    45.44    46.76     45.79    42.91    1.01
Short128Vector.UMAXLanes        ops/ms   316.65   410.30  14814.82    23.65   46.79
Short128Vector.UMAXMaskedLanes  ops/ms   308.90   351.78  15155.26    31.03   49.06
Short64Vector.UMAXLanes         ops/ms   190.38   245.09   8022.46    14.30   42.14
Short64Vector.UMAXMaskedLanes   ops/ms   195.54    36.15   7930.28    11.88   40.56
```
On an NVIDIA Grace machine with 128-bit NEON:
```
Benchmark                       Unit     Before    Error     After    Error  Uplift
Byte128Vector.UMAXLanes         ops/ms   414.69    42.52  25257.61    25.91   60.91
Byte128Vector.UMAXMaskedLanes   ops/ms   552.00    56.61  23063.14   304.45   41.78
Byte128Vector.UMINLanes         ops/ms   634.98   849.04  28444.37   180.80   44.80
Byte128Vector.UMINMaskedLanes   ops/ms   612.88   735.18  26127.07    27.99   42.63
Byte64Vector.UMAXLanes          ops/ms   291.53    32.19  13893.62    28.09   47.66
Byte64Vector.UMAXMaskedLanes    ops/ms   363.34    48.17  13290.59    12.53   36.58
Byte64Vector.UMINLanes          ops/ms   368.70   433.60  15416.90    15.80   41.81
Byte64Vector.UMINMaskedLanes    ops/ms   350.46   371.05  14524.29   121.63   41.44
Int128Vector.UMAXLanes          ops/ms   177.67   201.38  10182.82    20.21   57.31
Int128Vector.UMAXMaskedLanes    ops/ms   155.25   187.88   9194.13   393.35   59.22
Int64Vector.UMAXLanes           ops/ms    93.93   115.02   5106.79     4.54   54.37
Int64Vector.UMAXMaskedLanes     ops/ms    87.01    88.50   4405.87     8.06   50.63
Long128Vector.UMAXLanes         ops/ms    80.32    98.50   3229.80    40.53   40.21
Long128Vector.UMAXMaskedLanes   ops/ms    77.65   103.25   3161.50     4.45   40.72
Long64Vector.UMAXLanes          ops/ms    47.72    65.38     46.41    50.38    0.97
Long64Vector.UMAXMaskedLanes    ops/ms    45.26    47.46     45.13    47.23    1.00
Short128Vector.UMAXLanes        ops/ms   316.09   429.34  14748.07    14.78   46.66
Short128Vector.UMAXMaskedLanes  ops/ms   307.70   342.54  14359.11    44.99   46.67
Short64Vector.UMAXLanes         ops/ms   187.67   253.01   8180.63   178.65   43.59
Short64Vector.UMAXMaskedLanes   ops/ms   191.10    33.51   7949.19   108.65   41.60
```
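The tables do not say how the Uplift column is computed. The reported values are consistent with the ratio After / Before, and the sketch below checks that assumption against two rows of the SVE table. The class and method names are hypothetical and are not part of the patch or the benchmark suite.

```java
import java.util.Locale;

public class UpliftCheck {
    // Assumption: Uplift is the "After" throughput divided by the "Before" throughput.
    static double uplift(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        // Byte128Vector.UMAXLanes on SVE: Before 411.60, After 25226.51, reported Uplift 61.29.
        System.out.println(String.format(Locale.ROOT, "%.2f", uplift(411.60, 25226.51)));
        // Long64Vector.UMAXLanes on SVE: Before 47.56, After 46.76, reported Uplift 0.98.
        System.out.println(String.format(Locale.ROOT, "%.2f", uplift(47.56, 46.76)));
    }
}
```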
- 8372978: [VectorAPI] Fix incorrect identity values in UMIN/UMAX reductions
The original implementation of UMIN/UMAX reductions in JDK-8346174
used incorrect identity values in the Java implementation and test code.
Problem:
--------
UMIN was using MAX_OR_INF (signed maximum value) as the identity:
- Byte.MAX_VALUE (127) instead of max unsigned byte (255)
- Short.MAX_VALUE (32767) instead of max unsigned short (65535)
- Integer.MAX_VALUE instead of max unsigned int (-1)
- Long.MAX_VALUE instead of max unsigned long (-1)
UMAX was using MIN_OR_INF (signed minimum value) as the identity:
- Byte.MIN_VALUE (-128) instead of 0
- Short.MIN_VALUE (-32768) instead of 0
- Integer.MIN_VALUE instead of 0
- Long.MIN_VALUE instead of 0
This caused incorrect results. For example:
UMAX([42,42,...,42]) returned 128 instead of 42
Solution:
---------
Use correct unsigned identity values:
- UMIN: ($type$)-1 (maximum unsigned value)
- UMAX: ($type$)0 (minimum unsigned value)
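The effect of the identity value can be reproduced with a scalar model of the reduction. The sketch below is illustrative only: it does not use the Vector API or the template code, and the class and method names are hypothetical. It seeds an unsigned byte reduction with the old and the corrected identities.

```java
import java.util.Arrays;

public class UnsignedReductionIdentity {
    // Scalar model of an unsigned-max reduction over bytes, seeded with `identity`.
    static byte umaxReduce(byte[] lanes, byte identity) {
        byte acc = identity;
        for (byte lane : lanes) {
            // Compare both values as unsigned integers in the range 0..255.
            if (Byte.toUnsignedInt(lane) > Byte.toUnsignedInt(acc)) {
                acc = lane;
            }
        }
        return acc;
    }

    // Scalar model of an unsigned-min reduction over bytes, seeded with `identity`.
    static byte uminReduce(byte[] lanes, byte identity) {
        byte acc = identity;
        for (byte lane : lanes) {
            if (Byte.toUnsignedInt(lane) < Byte.toUnsignedInt(acc)) {
                acc = lane;
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] lanes = new byte[16];
        Arrays.fill(lanes, (byte) 42);

        // Old identity: Byte.MIN_VALUE is 0x80, which is 128 when read as unsigned.
        System.out.println(Byte.toUnsignedInt(umaxReduce(lanes, Byte.MIN_VALUE))); // 128
        // Corrected identity: 0 is the smallest unsigned value.
        System.out.println(Byte.toUnsignedInt(umaxReduce(lanes, (byte) 0)));       // 42

        byte[] large = {(byte) 200, (byte) 250};
        // Old identity: Byte.MAX_VALUE (127) is smaller than both unsigned inputs.
        System.out.println(Byte.toUnsignedInt(uminReduce(large, Byte.MAX_VALUE))); // 127
        // Corrected identity: (byte) -1 is 0xFF, the largest unsigned byte (255).
        System.out.println(Byte.toUnsignedInt(uminReduce(large, (byte) -1)));      // 200
    }
}
```

The same reasoning applies to short, int, and long: an all-ones bit pattern is the largest unsigned value and zero is the smallest, so neither identity can displace a real lane value.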
Changes:
--------
- X-Vector.java.template: Fixed identity values in reductionOperations
- gen-template.sh: Fixed identity values for test code generation
- templates/Unit-header.template: Updated copyright year to 2025
- Regenerated all Vector classes and test files
Testing:
--------
All types (byte/short/int/long) now return correct results in both
interpreter mode (-Xint) and compiled mode.
-------------
Changes: https://git.openjdk.org/jdk/pull/28693/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28693&range=01
Stats: 1101 lines in 12 files changed: 685 ins; 16 del; 400 mod
Patch: https://git.openjdk.org/jdk/pull/28693.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/28693/head:pull/28693
PR: https://git.openjdk.org/jdk/pull/28693