On Wed, 21 Jan 2026 10:15:24 GMT, Eric Fang <[email protected]> wrote:

>> This patch adds intrinsic support for UMIN and UMAX reduction operations in 
>> the Vector API on AArch64, enabling direct hardware instruction mapping for 
>> better performance.
>> 
>> Changes:
>> --------
>> 
>> 1. C2 mid-end:
>>    - Added UMinReductionVNode and UMaxReductionVNode
>> 
>> 2. AArch64 Backend:
>>    - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
>>    - Updated match rules for all vector sizes and element types
>>    - Both NEON and SVE implementations are supported
>> 
>> 3. Test:
>>    - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
>>    - Added assembly tests in aarch64-asmtest.py for new instructions
>>    - Added a JTReg test file VectorUMinMaxReductionTest.java
>> 
>> Different configurations were tested on aarch64 and x86 machines, and all 
>> tests passed.
>> 
>> Test results of JMH benchmarks from the panama-vector project:
>> --------
>> 
>> On an Nvidia Grace machine with 128-bit SVE:
>> 
>> Benchmark                       Unit    Before  Error   After     Error   Uplift
>> Byte128Vector.UMAXLanes         ops/ms  411.60  42.18   25226.51  33.92   61.29
>> Byte128Vector.UMAXMaskedLanes   ops/ms  558.56  85.12   25182.90  28.74   45.09
>> Byte128Vector.UMINLanes         ops/ms  645.58  780.76  28396.29  103.11  43.99
>> Byte128Vector.UMINMaskedLanes   ops/ms  621.09  718.27  26122.62  42.68   42.06
>> Byte64Vector.UMAXLanes          ops/ms  296.33  34.44   14357.74  15.95   48.45
>> Byte64Vector.UMAXMaskedLanes    ops/ms  376.54  44.01   14269.24  21.41   37.90
>> Byte64Vector.UMINLanes          ops/ms  373.45  426.51  15425.36  66.20   41.31
>> Byte64Vector.UMINMaskedLanes    ops/ms  353.32  346.87  14201.37  13.79   40.19
>> Int128Vector.UMAXLanes          ops/ms  174.79  192.51  9906.07   286.93  56.67
>> Int128Vector.UMAXMaskedLanes    ops/ms  157.23  206.68  10246.77  11.44   65.17
>> Int64Vector.UMAXLanes           ops/ms  95.30   126.49  4719.30   98.57   49.52
>> Int64Vector.UMAXMaskedLanes     ops/ms  88.19   87.44   4693.18   19.76   53.22
>> Long128Vector.UMAXLanes         ops/ms  80.62   97.82   5064.01   35.52   62.82
>> Long128Vector.UMAXMaskedLanes   ops/ms  78.15   102.91  5028.24   8.74    64.34
>> Long64Vector.UMAXLanes          ops/ms  47.56   62.01   46.76     52.28   0.98
>> Long64Vector.UMAXMaskedLanes    ops/ms  45.44   46.76   45.79     42.91   1.01
>> Short128Vector.UMAXLanes        ops/ms  316.65  410.30  14814.82  23.65   46.79
>> ...
>
> Eric Fang has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Extract some helper functions for better readability

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.hpp line 50:

> 48:   void sve_maxv(bool is_unsigned, FloatRegister dst, SIMD_RegVariant size,
> 49:                 PRegister pg, FloatRegister src);
> 50: 

Using separate definitions here adds unnecessary complexity.

I'd do something like this in the header file:


  // Typedefs used to disambiguate overloaded member functions.
  typedef void (Assembler::*neon_reduction2)
    (FloatRegister, Assembler::SIMD_Arrangement, FloatRegister);
  typedef void (Assembler::*sve_reduction3)
    (FloatRegister, Assembler::SIMD_RegVariant, PRegister, FloatRegister);

  // Helper functions for min/max reduction operations
  void neon_minp(bool is_unsigned, FloatRegister dst, SIMD_Arrangement size,
                 FloatRegister src1, FloatRegister src2) {
    auto m = is_unsigned ? &Assembler::uminp : &Assembler::sminp;
    (this->*m)(dst, size, src1, src2);
  }
  void neon_maxp(bool is_unsigned, FloatRegister dst, SIMD_Arrangement size,
                 FloatRegister src1, FloatRegister src2) {
    auto m = is_unsigned ? &Assembler::umaxp : &Assembler::smaxp;
    (this->*m)(dst, size, src1, src2);
  }
  void neon_minv(bool is_unsigned, FloatRegister dst, SIMD_Arrangement size,
                 FloatRegister src) {
    auto m = is_unsigned ? (neon_reduction2)&Assembler::uminv
                         : &Assembler::sminv;
    (this->*m)(dst, size, src);
  }
  void neon_maxv(bool is_unsigned, FloatRegister dst, SIMD_Arrangement size,
                 FloatRegister src) {
    auto m = is_unsigned ? (neon_reduction2)&Assembler::umaxv
                         : &Assembler::smaxv;
    (this->*m)(dst, size, src);
  }
  void sve_minv(bool is_unsigned, FloatRegister dst, SIMD_RegVariant size,
                PRegister pg, FloatRegister src) {
    auto m = is_unsigned ? (sve_reduction3)&Assembler::sve_uminv
                         : &Assembler::sve_sminv;
    (this->*m)(dst, size, pg, src);
  }
  void sve_maxv(bool is_unsigned, FloatRegister dst, SIMD_RegVariant size,
                PRegister pg, FloatRegister src) {
    auto m = is_unsigned ? (sve_reduction3)&Assembler::sve_umaxv
                         : &Assembler::sve_smaxv;
    (this->*m)(dst, size, pg, src);
  }



To some extent it's a matter of taste, but please try not to use more 
repetitive, boilerplate code than you need to.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28693#discussion_r2717902945
