> This patch adds mid-end support for vectorized add/mul reduction operations for half floats. It also includes backend aarch64 support for these operations. Only vectorization support through autovectorization is added, as the VectorAPI currently does not support a Float16 vector species.
>
> Both add and mul reductions vectorized through autovectorization mandate the implementation to be strictly ordered. Each of these reductions is implemented as follows for the different aarch64 targets:
>
> **For AddReduction:**
> On Neon-only targets (UseSVE = 0): generates scalarized additions using the scalar `fadd` instruction for both 8B and 16B vector lengths. This is because Neon does not provide a direct instruction for computing a strictly ordered floating-point add reduction.
>
> On SVE targets (UseSVE > 0): generates the `fadda` instruction, which computes a floating-point add reduction in strict order.
>
> **For MulReduction:**
> Neither Neon nor SVE provides a direct instruction for computing a strictly ordered floating-point multiply reduction. For vector lengths of 8B and 16B, a scalarized sequence of scalar `fmul` instructions is generated; multiply reduction for vector lengths > 16B is not supported.
>
> Below is the performance of the two newly added microbenchmarks in `Float16OperationsBenchmark.java`, tested on three different aarch64 machines and with varying `MaxVectorSize`.
>
> Note: On all machines, the score (ops/ms) is compared with the master branch without this patch, which generates a sequence of loads (`ldrsh`) to load the FP16 value into an FPR and a scalar `fadd`/`fmul` to add/multiply the loaded value to the running sum/product. The ratios given below are the ratios between the throughput with this patch and the throughput without it. A ratio > 1 indicates the performance with this patch is better than the master branch.
> **N1 (UseSVE = 0, max vector length = 16B):**
>
> Benchmark         vectorDim  Mode  Cnt    8B   16B
> ReductionAddFP16        256  thrpt   9  1.41  1.40
> ReductionAddFP16        512  thrpt   9  1.41  1.41
> ReductionAddFP16       1024  thrpt   9  1.43  1.40
> ReductionAddFP16       2048  thrpt   9  1.43  1.40
> ReductionMulFP16        256  thrpt   9  1.22  1.22
> ReductionMulFP16        512  thrpt   9  1.21  1.23
> ReductionMulFP16       1024  thrpt   9  1.21  1.22
> ReductionMulFP16       2048  thrpt   9  1.20  1.22
>
> On N1, the scalarized sequence of `fadd`/`fmul` is generated for both `MaxVectorSize` of 8B and 16B for add reduction ...
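For context on why the reductions above must be strictly ordered: FP16 addition is not associative, because each partial sum is rounded to half precision. The sketch below is not the patch's or the benchmark's actual code; the class and method names are hypothetical, and it models FP16 arithmetic with `Float.floatToFloat16`/`Float.float16ToFloat` (available since JDK 20).

```java
// Hypothetical sketch of a strictly ordered FP16 add reduction.
// Each partial sum is rounded back to half precision, so the order of
// the additions is observable and the compiler must not reassociate them.
public class Fp16ReduceSketch {
    static float reduceAdd(float... values) {
        short acc = Float.floatToFloat16(0.0f);   // running sum as FP16 bits
        for (float v : values) {
            float sum = Float.float16ToFloat(acc) + v;
            acc = Float.floatToFloat16(sum);      // round to FP16 after every add
        }
        return Float.float16ToFloat(acc);
    }

    public static void main(String[] args) {
        // FP16 has an 11-bit significand, so near 2048 the spacing between
        // representable values is 2.0: the partial sum 2048 + 1 ties and
        // rounds back down to 2048 (round-to-nearest-even).
        System.out.println(reduceAdd(2048f, 1f, 1f));  // prints 2048.0
        // The same inputs in a different order give a different result,
        // which is why reordering the reduction would change the answer.
        System.out.println(reduceAdd(1f, 1f, 2048f));  // prints 2050.0
    }
}
```

This order sensitivity is why Neon targets fall back to a scalarized `fadd`/`fmul` sequence, while SVE can use `fadda`, which performs the add reduction in strict lane order in hardware.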
Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:

 - Address review feedback
 - Merge from mainline
 - Address review feedback
 - merge from main
 - Merge commit '9f13ec1ccb684398e311b5f139773ca9f39561fe' into HEAD
 - Address review comments for the JTREG test and microbenchmark
 - Merge branch 'master'
 - Address review comments
 - Fix build failures on Mac
 - Address review comments
 - ... and 2 more: https://git.openjdk.org/jdk/compare/5aa115b5...b552ee31

-------------

Changes: https://git.openjdk.org/jdk/pull/27526/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27526&range=06
   Stats: 831 lines in 14 files changed: 764 ins; 13 del; 54 mod
   Patch: https://git.openjdk.org/jdk/pull/27526.diff
   Fetch: git fetch https://git.openjdk.org/jdk.git pull/27526/head:pull/27526

PR: https://git.openjdk.org/jdk/pull/27526
