On Wed, 17 Dec 2025 12:40:38 GMT, Marc Chevalier <[email protected]> wrote:

>> Bhavana Kilambi has updated the pull request with a new target base due to a 
>> merge or a rebase. The incremental webrev excludes the unrelated changes 
>> brought in by the merge/rebase. The pull request contains three additional 
>> commits since the last revision:
>> 
>>  - Address review comments
>>  - Merge 'master'
>>  - 8366444: Add support for add/mul reduction operations for Float16
>>    
>>    This patch adds mid-end support for vectorized add/mul reduction
>>    operations for half floats. It also includes backend aarch64 support for
>>    these operations. Only vectorization support through autovectorization
>>    is added as VectorAPI currently does not support Float16 vector species.
>>    
>>    Both add and mul reductions, when vectorized through autovectorization,
>>    must be strictly ordered. The following describes how each of these
>>    reductions is implemented on different aarch64 targets -
>>    
>>    For AddReduction :
>>    On Neon-only targets (UseSVE = 0): Generates scalarized additions
>>    using the scalar "fadd" instruction for both 8B and 16B vector lengths.
>>    This is because Neon does not provide a direct instruction for computing
>>    strictly ordered floating point add reduction.
>>    
>>    On SVE targets (UseSVE > 0): Generates the "fadda" instruction which
>>    computes add reduction for floating point in strict order.
>>    
>>    For MulReduction :
>>    Neither Neon nor SVE provides a direct instruction for computing
>>    strictly ordered floating point multiply reduction. For vector lengths
>>    of 8B and 16B, a scalarized sequence of scalar "fmul" instructions is
>>    generated and multiply reduction for vector lengths > 16B is not
>>    supported.
>>    
>>    Below is the performance of the two newly added microbenchmarks in
>>    Float16OperationsBenchmark.java tested on three different aarch64
>>    machines and with varying MaxVectorSize -
>>    
>>    Note: On all machines, the score (ops/ms) is compared with the master
>>    branch without this patch, which generates a sequence of loads ("ldrsh")
>>    to load the FP16 value into an FPR and a scalar "fadd/fmul" to
>>    add/multiply the loaded value into the running sum/product. The ratios
>>    given below are the ratios of the throughput with this patch to the
>>    throughput without it.
>>    Ratio > 1 indicates the performance with this patch is better than the
>>    master branch.
>>    
>>    N1 (UseSVE = 0, max vector length = 16B):
>>    Benchmark         vecto...
>
> src/hotspot/share/opto/vectornode.hpp line 328:
> 
>> 326:     ReductionNode(ctrl, in1, in2), _requires_strict_order(requires_strict_order) {}
>> 327: 
>> 328:   virtual int Opcode() const;
> 
> Build is failing on Mac because of `-Winconsistent-missing-override`: since 
> you specified `override` on `bottom_type` and `ideal_reg`, you need to put 
> `override` everywhere it applies. That means `Opcode`,
> `requires_strict_order`, `hash`, `cmp`, and `size_of`. And the same in
> `MulReductionVHFNode`.

Done, thanks. Could you please take another look?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2630407402
