On Fri, 26 Sep 2025 12:00:31 GMT, Bhavana Kilambi <[email protected]> wrote:
> This patch adds mid-end support for vectorized add/mul reduction operations > for half floats. It also includes backend aarch64 support for these > operations. Only vectorization support through autovectorization is added as > VectorAPI currently does not support Float16 vector species. > > Both add and mul reduction vectorized through autovectorization mandate the > implementation to be strictly ordered. The following is how each of these > reductions is implemented for different aarch64 targets - > > **For AddReduction :** > On Neon only targets (UseSVE = 0): Generates scalarized additions using the > scalar `fadd` instruction for both 8B and 16B vector lengths. This is because > Neon does not provide a direct instruction for computing strictly ordered > floating point add reduction. > > On SVE targets (UseSVE > 0): Generates the `fadda` instruction which computes > add reduction for floating point in strict order. > > **For MulReduction :** > Both Neon and SVE do not provide a direct instruction for computing strictly > ordered floating point multiply reduction. For vector lengths of 8B and 16B, > a scalarized sequence of scalar `fmul` instructions is generated and multiply > reduction for vector lengths > 16B is not supported. > > Below is the performance of the two newly added microbenchmarks in > `Float16OperationsBenchmark.java` tested on three different aarch64 machines > and with varying `MaxVectorSize` - > > Note: On all machines, the score (ops/ms) is compared with the master branch > without this patch which generates a sequence of loads (`ldrsh`) to load the > FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded > value to the running sum/product. The ratios given below are the ratios > between the throughput with this patch and the throughput without this patch. > Ratio > 1 indicates the performance with this patch is better than the master > branch. > > **N1 (UseSVE = 0, max vector length = 16B):** > > Benchmark vectorDim Mode Cnt 8B 16B > ReductionAddFP16 256 thrpt 9 1.41 1.40 > ReductionAddFP16 512 thrpt 9 1.41 1.41 > ReductionAddFP16 1024 thrpt 9 1.43 1.40 > ReductionAddFP16 2048 thrpt 9 1.43 1.40 > ReductionMulFP16 256 thrpt 9 1.22 1.22 > ReductionMulFP16 512 thrpt 9 1.21 1.23 > ReductionMulFP16 1024 thrpt 9 1.21 1.22 > ReductionMulFP16 2048 thrpt 9 1.20 1.22 > > > On N1, the scalarized sequence of `fadd/fmul` are generated for both > `MaxVectorSize` of 8B and 16B for add reduction ... And it may be worth to fix this test before adding more changes to Float16, just to be sure we have a correct base to build on and are not making thins worse ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/27526#issuecomment-3641706953
