> This patch adds intrinsic support for UMIN and UMAX reduction operations in 
> the Vector API on AArch64, enabling direct hardware instruction mapping for 
> better performance.
> 
> Changes:
> --------
> 
> 1. C2 mid-end:
>    - Added UMinReductionVNode and UMaxReductionVNode
> 
> 2. AArch64 Backend:
>    - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
>    - Updated match rules for all vector sizes and element types
>    - Both NEON and SVE implementations are supported
> 
> 3. Test:
>    - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
>    - Added assembly tests in aarch64-asmtest.py for new instructions
>    - Added a JTReg test file VectorUMinMaxReductionTest.java
> 
> Different configurations were tested on aarch64 and x86 machines, and all 
> tests passed.
> 
> Test results of JMH benchmarks from the panama-vector project:
> --------
> 
> On an NVIDIA Grace machine with 128-bit SVE:
> 
> Benchmark                       Unit    Before  Error   After     Error   Uplift
> Byte128Vector.UMAXLanes         ops/ms  411.60  42.18   25226.51  33.92   61.29
> Byte128Vector.UMAXMaskedLanes   ops/ms  558.56  85.12   25182.90  28.74   45.09
> Byte128Vector.UMINLanes         ops/ms  645.58  780.76  28396.29  103.11  43.99
> Byte128Vector.UMINMaskedLanes   ops/ms  621.09  718.27  26122.62  42.68   42.06
> Byte64Vector.UMAXLanes          ops/ms  296.33  34.44   14357.74  15.95   48.45
> Byte64Vector.UMAXMaskedLanes    ops/ms  376.54  44.01   14269.24  21.41   37.90
> Byte64Vector.UMINLanes          ops/ms  373.45  426.51  15425.36  66.20   41.31
> Byte64Vector.UMINMaskedLanes    ops/ms  353.32  346.87  14201.37  13.79   40.19
> Int128Vector.UMAXLanes          ops/ms  174.79  192.51  9906.07   286.93  56.67
> Int128Vector.UMAXMaskedLanes    ops/ms  157.23  206.68  10246.77  11.44   65.17
> Int64Vector.UMAXLanes           ops/ms  95.30   126.49  4719.30   98.57   49.52
> Int64Vector.UMAXMaskedLanes     ops/ms  88.19   87.44   4693.18   19.76   53.22
> Long128Vector.UMAXLanes         ops/ms  80.62   97.82   5064.01   35.52   62.82
> Long128Vector.UMAXMaskedLanes   ops/ms  78.15   102.91  5028.24   8.74    64.34
> Long64Vector.UMAXLanes          ops/ms  47.56   62.01   46.76     52.28   0.98
> Long64Vector.UMAXMaskedLanes    ops/ms  45.44   46.76   45.79     42.91   1.01
> Short128Vector.UMAXLanes        ops/ms  316.65  410.30  14814.82  23.65   46.79
> Short128Vector.UMAXMaskedLanes  ops/ms  308.90  351.78  15155.26  31.03   49.06
> Sh...

Eric Fang has updated the pull request with a new target base due to a merge or 
a rebase. The pull request now contains four commits:

 - Rebase commit 56d7b52
 - Merge branch 'master' into JDK-8372980-umin-umax-intrinsic
 - 8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max reduction operations
   
   This patch adds intrinsic support for UMIN and UMAX reduction operations
   in the Vector API on AArch64, enabling direct hardware instruction mapping
   for better performance.
   
   Changes:
   --------
   
   1. C2 mid-end:
      - Added UMinReductionVNode and UMaxReductionVNode
   
   2. AArch64 Backend:
      - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
      - Updated match rules for all vector sizes and element types
      - Both NEON and SVE implementations are supported
   
   3. Test:
      - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
      - Added assembly tests in aarch64-asmtest.py for new instructions
      - Added a JTReg test file VectorUMinMaxReductionTest.java
   
   Different configurations were tested on aarch64 and x86 machines, and
   all tests passed.
   
   Test results of JMH benchmarks from the panama-vector project:
   --------
   
   On an NVIDIA Grace machine with 128-bit SVE:
   ```
   Benchmark                       Unit    Before  Error   After     Error   Uplift
   Byte128Vector.UMAXLanes         ops/ms  411.60  42.18   25226.51  33.92   61.29
   Byte128Vector.UMAXMaskedLanes   ops/ms  558.56  85.12   25182.90  28.74   45.09
   Byte128Vector.UMINLanes         ops/ms  645.58  780.76  28396.29  103.11  43.99
   Byte128Vector.UMINMaskedLanes   ops/ms  621.09  718.27  26122.62  42.68   42.06
   Byte64Vector.UMAXLanes          ops/ms  296.33  34.44   14357.74  15.95   48.45
   Byte64Vector.UMAXMaskedLanes    ops/ms  376.54  44.01   14269.24  21.41   37.90
   Byte64Vector.UMINLanes          ops/ms  373.45  426.51  15425.36  66.20   41.31
   Byte64Vector.UMINMaskedLanes    ops/ms  353.32  346.87  14201.37  13.79   40.19
   Int128Vector.UMAXLanes          ops/ms  174.79  192.51  9906.07   286.93  56.67
   Int128Vector.UMAXMaskedLanes    ops/ms  157.23  206.68  10246.77  11.44   65.17
   Int64Vector.UMAXLanes           ops/ms  95.30   126.49  4719.30   98.57   49.52
   Int64Vector.UMAXMaskedLanes     ops/ms  88.19   87.44   4693.18   19.76   53.22
   Long128Vector.UMAXLanes         ops/ms  80.62   97.82   5064.01   35.52   62.82
   Long128Vector.UMAXMaskedLanes   ops/ms  78.15   102.91  5028.24   8.74    64.34
   Long64Vector.UMAXLanes          ops/ms  47.56   62.01   46.76     52.28   0.98
   Long64Vector.UMAXMaskedLanes    ops/ms  45.44   46.76   45.79     42.91   1.01
   Short128Vector.UMAXLanes        ops/ms  316.65  410.30  14814.82  23.65   46.79
   Short128Vector.UMAXMaskedLanes  ops/ms  308.90  351.78  15155.26  31.03   49.06
   Short64Vector.UMAXLanes         ops/ms  190.38  245.09  8022.46   14.30   42.14
   Short64Vector.UMAXMaskedLanes   ops/ms  195.54  36.15   7930.28   11.88   40.56
   ```
   
   On an NVIDIA Grace machine with 128-bit NEON:
   ```
   Benchmark                       Unit    Before  Error   After     Error   Uplift
   Byte128Vector.UMAXLanes         ops/ms  414.69  42.52   25257.61  25.91   60.91
   Byte128Vector.UMAXMaskedLanes   ops/ms  552.00  56.61   23063.14  304.45  41.78
   Byte128Vector.UMINLanes         ops/ms  634.98  849.04  28444.37  180.80  44.80
   Byte128Vector.UMINMaskedLanes   ops/ms  612.88  735.18  26127.07  27.99   42.63
   Byte64Vector.UMAXLanes          ops/ms  291.53  32.19   13893.62  28.09   47.66
   Byte64Vector.UMAXMaskedLanes    ops/ms  363.34  48.17   13290.59  12.53   36.58
   Byte64Vector.UMINLanes          ops/ms  368.70  433.60  15416.90  15.80   41.81
   Byte64Vector.UMINMaskedLanes    ops/ms  350.46  371.05  14524.29  121.63  41.44
   Int128Vector.UMAXLanes          ops/ms  177.67  201.38  10182.82  20.21   57.31
   Int128Vector.UMAXMaskedLanes    ops/ms  155.25  187.88  9194.13   393.35  59.22
   Int64Vector.UMAXLanes           ops/ms  93.93   115.02  5106.79   4.54    54.37
   Int64Vector.UMAXMaskedLanes     ops/ms  87.01   88.50   4405.87   8.06    50.63
   Long128Vector.UMAXLanes         ops/ms  80.32   98.50   3229.80   40.53   40.21
   Long128Vector.UMAXMaskedLanes   ops/ms  77.65   103.25  3161.50   4.45    40.72
   Long64Vector.UMAXLanes          ops/ms  47.72   65.38   46.41     50.38   0.97
   Long64Vector.UMAXMaskedLanes    ops/ms  45.26   47.46   45.13     47.23   1.00
   Short128Vector.UMAXLanes        ops/ms  316.09  429.34  14748.07  14.78   46.66
   Short128Vector.UMAXMaskedLanes  ops/ms  307.70  342.54  14359.11  44.99   46.67
   Short64Vector.UMAXLanes         ops/ms  187.67  253.01  8180.63   178.65  43.59
   Short64Vector.UMAXMaskedLanes   ops/ms  191.10  33.51   7949.19   108.65  41.60
   ```
 - 8372978: [VectorAPI] Fix incorrect identity values in UMIN/UMAX reductions
   
   The original implementation of UMIN/UMAX reductions in JDK-8346174
   used incorrect identity values in the Java implementation and test code.
   
   Problem:
   --------
   UMIN was using MAX_OR_INF (signed maximum value) as the identity:
     - Byte.MAX_VALUE (127) instead of max unsigned byte (255)
     - Short.MAX_VALUE (32767) instead of max unsigned short (65535)
     - Integer.MAX_VALUE instead of max unsigned int (-1)
     - Long.MAX_VALUE instead of max unsigned long (-1)
   
   UMAX was using MIN_OR_INF (signed minimum value) as the identity:
     - Byte.MIN_VALUE (-128) instead of 0
     - Short.MIN_VALUE (-32768) instead of 0
     - Integer.MIN_VALUE instead of 0
     - Long.MIN_VALUE instead of 0
   
   This caused incorrect results. For example:
     UMAX([42,42,...,42]) returned 128 instead of 42
   
   Solution:
   ---------
   Use correct unsigned identity values:
     - UMIN: ($type$)-1 (maximum unsigned value)
     - UMAX: ($type$)0 (minimum unsigned value)
   
   Changes:
   --------
   - X-Vector.java.template: Fixed identity values in reductionOperations
   - gen-template.sh: Fixed identity values for test code generation
   - templates/Unit-header.template: Updated copyright year to 2025
   - Regenerated all Vector classes and test files
   
   Testing:
   --------
   All types (byte/short/int/long) now return correct results in both
   interpreter mode (-Xint) and compiled mode.
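
The identity problem described in the 8372978 commit message can be reproduced with a small scalar model. The sketch below is not the JDK code: the class `UnsignedReductionIdentity` and its `umax`/`umin` helpers are hypothetical names used only for illustration, and the model assumes the reduction starts from the identity and folds in each byte lane with an unsigned comparison.

```java
import java.util.Arrays;

// Hypothetical scalar model of an unsigned byte reduction seeded with an identity value.
class UnsignedReductionIdentity {

    // Unsigned max over all lanes, starting from the given identity.
    public static byte umax(byte[] lanes, byte identity) {
        byte acc = identity;
        for (byte lane : lanes) {
            if (Byte.compareUnsigned(lane, acc) > 0) {
                acc = lane;
            }
        }
        return acc;
    }

    // Unsigned min over all lanes, starting from the given identity.
    public static byte umin(byte[] lanes, byte identity) {
        byte acc = identity;
        for (byte lane : lanes) {
            if (Byte.compareUnsigned(lane, acc) < 0) {
                acc = lane;
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] low = new byte[16];
        Arrays.fill(low, (byte) 42);
        // Signed identity: Byte.MIN_VALUE is 0x80, which is 128 when read as unsigned.
        System.out.println(Byte.toUnsignedInt(umax(low, Byte.MIN_VALUE))); // 128
        // Unsigned identity for UMAX is 0.
        System.out.println(Byte.toUnsignedInt(umax(low, (byte) 0)));       // 42

        byte[] high = new byte[16];
        Arrays.fill(high, (byte) 200);
        // Signed identity: Byte.MAX_VALUE (127) is smaller than every lane.
        System.out.println(Byte.toUnsignedInt(umin(high, Byte.MAX_VALUE))); // 127
        // Unsigned identity for UMIN is (byte) -1, i.e. 255.
        System.out.println(Byte.toUnsignedInt(umin(high, (byte) -1)));      // 200
    }
}
```

In this model the signed identities win the comparison whenever every lane falls on the other side of the signed/unsigned boundary, which matches the `UMAX([42,42,...,42])` example above.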

-------------

Changes: https://git.openjdk.org/jdk/pull/28693/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28693&range=01
  Stats: 1101 lines in 12 files changed: 685 ins; 16 del; 400 mod
  Patch: https://git.openjdk.org/jdk/pull/28693.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28693/head:pull/28693

PR: https://git.openjdk.org/jdk/pull/28693
