On Mon, 26 Jan 2026 09:26:35 GMT, Eric Fang <[email protected]> wrote:

>> This patch adds intrinsic support for UMIN and UMAX reduction operations in 
>> the Vector API on AArch64, enabling direct hardware instruction mapping for 
>> better performance.
>> 
>> Changes:
>> --------
>> 
>> 1. C2 mid-end:
>>    - Added UMinReductionVNode and UMaxReductionVNode
>> 
>> 2. AArch64 Backend:
>>    - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
>>    - Updated match rules for all vector sizes and element types
>>    - Both NEON and SVE implementations are supported
>> 
>> 3. Test:
>>    - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
>>    - Added assembly tests in aarch64-asmtest.py for new instructions
>>    - Added a JTReg test file VectorUMinMaxReductionTest.java
>> 
>> Different configurations were tested on aarch64 and x86 machines, and all 
>> tests passed.
>> 
>> Test results of JMH benchmarks from the panama-vector project:
>> --------
>> 
>> On an Nvidia Grace machine with 128-bit SVE:
>> 
>> Benchmark                       Unit    Before  Error   After     Error   Uplift
>> Byte128Vector.UMAXLanes         ops/ms  411.60  42.18   25226.51  33.92   61.29
>> Byte128Vector.UMAXMaskedLanes   ops/ms  558.56  85.12   25182.90  28.74   45.09
>> Byte128Vector.UMINLanes         ops/ms  645.58  780.76  28396.29  103.11  43.99
>> Byte128Vector.UMINMaskedLanes   ops/ms  621.09  718.27  26122.62  42.68   42.06
>> Byte64Vector.UMAXLanes          ops/ms  296.33  34.44   14357.74  15.95   48.45
>> Byte64Vector.UMAXMaskedLanes    ops/ms  376.54  44.01   14269.24  21.41   37.90
>> Byte64Vector.UMINLanes          ops/ms  373.45  426.51  15425.36  66.20   41.31
>> Byte64Vector.UMINMaskedLanes    ops/ms  353.32  346.87  14201.37  13.79   40.19
>> Int128Vector.UMAXLanes          ops/ms  174.79  192.51  9906.07   286.93  56.67
>> Int128Vector.UMAXMaskedLanes    ops/ms  157.23  206.68  10246.77  11.44   65.17
>> Int64Vector.UMAXLanes           ops/ms  95.30   126.49  4719.30   98.57   49.52
>> Int64Vector.UMAXMaskedLanes     ops/ms  88.19   87.44   4693.18   19.76   53.22
>> Long128Vector.UMAXLanes         ops/ms  80.62   97.82   5064.01   35.52   62.82
>> Long128Vector.UMAXMaskedLanes   ops/ms  78.15   102.91  5028.24   8.74    64.34
>> Long64Vector.UMAXLanes          ops/ms  47.56   62.01   46.76     52.28   0.98
>> Long64Vector.UMAXMaskedLanes    ops/ms  45.44   46.76   45.79     42.91   1.01
>> Short128Vector.UMAXLanes        ops/ms  316.65  410.30  14814.82  23.65   46.79
>> ...
>
> Eric Fang has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Move helper functions into c2_MacroAssembler_aarch64.hpp

The general direction code flows right now, though not often, is from 
jdk/master to panama-vector/vectorIntrinsics, since most of the development 
work happens in the mainline (the exceptions are the float16 and Valhalla 
alignment work, which are large efforts).

I am very reluctant to include all the auto-generated micro benchmarks in 
mainline. There is a huge number of them, and I am not certain they provide as 
much value as they did now that we have the IR test framework. In many cases, 
given the simplicity of what they measure, they were designed to ensure C2 
generates the right instructions. The IR test framework is better at 
determining that by testing that the right IR nodes are generated (a sketch 
follows below) - and such tests get run as part of the existing HotSpot test 
suite.
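
For illustration, a minimal sketch of such an IR test for the unsigned-min 
reduction. The class and method names here are hypothetical, not from the PR; 
it assumes the HotSpot compiler.lib.ir_framework test library, the incubating 
jdk.incubator.vector API, and the UMIN_REDUCTION_V constant this PR adds to 
IRNode.java:

    // Hypothetical sketch, not taken from the PR: asserts that C2 emits the
    // unsigned-min reduction IR node instead of falling back to scalar code.
    import compiler.lib.ir_framework.*;
    import jdk.incubator.vector.*;

    public class VectorUMinReductionIRSketch {
        static final VectorSpecies<Integer> I_SPECIES = IntVector.SPECIES_128;
        static int[] in = new int[1024];

        public static void main(String[] args) {
            // The incubating Vector API module must be added to the test VM.
            TestFramework.runWithFlags("--add-modules=jdk.incubator.vector");
        }

        @Test
        @IR(counts = {IRNode.UMIN_REDUCTION_V, "> 0"})
        static int uminLanes() {
            return IntVector.fromArray(I_SPECIES, in, 0)
                            .reduceLanes(VectorOperators.UMIN);
        }
    }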

The IR test framework is of course no substitute, in general, for performance 
tests. A better focus for Vector API performance tests is, I think, Emanuel's 
work [here](https://github.com/openjdk/jdk/pull/28639/) and 
use-cases/algorithms that can be implemented concisely, in the spirit of the 
sketch below.
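
As a rough illustration, a use-case style benchmark could look like the 
following JMH sketch of an unsigned-min reduction over an array (all names 
are hypothetical and not taken from either PR):

    // Hypothetical JMH sketch: throughput of an unsigned-min reduction over
    // an int array, written with the incubating Vector API.
    import java.util.concurrent.TimeUnit;
    import jdk.incubator.vector.*;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @State(Scope.Thread)
    public class UMinReduceBench {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
        int[] data;

        @Setup
        public void setup() {
            data = new int[4096];
            for (int i = 0; i < data.length; i++) {
                data[i] = i * 31; // arbitrary deterministic contents
            }
        }

        @Benchmark
        public int uminReduce() {
            // -1 has all bits set: the identity element for unsigned min.
            IntVector acc = IntVector.broadcast(SPECIES, -1);
            int i = 0;
            for (; i < SPECIES.loopBound(data.length); i += SPECIES.length()) {
                acc = acc.lanewise(VectorOperators.UMIN,
                                   IntVector.fromArray(SPECIES, data, i));
            }
            int r = acc.reduceLanes(VectorOperators.UMIN);
            for (; i < data.length; i++) { // scalar tail
                r = Integer.compareUnsigned(data[i], r) < 0 ? data[i] : r;
            }
            return r;
        }
    }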

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28693#issuecomment-3806851359
