> This patch adds mid-end support for vectorized add/mul reduction operations for half floats. It also includes backend aarch64 support for these operations. Only vectorization support through autovectorization is added, as the VectorAPI currently does not support a Float16 vector species.
>
> Both add and mul reductions vectorized through autovectorization mandate the implementation to be strictly ordered. Each of these reductions is implemented as follows for the different aarch64 targets:
>
> **For AddReduction:**
> On Neon-only targets (UseSVE = 0): generates scalarized additions using the scalar `fadd` instruction for both 8B and 16B vector lengths. This is because Neon does not provide a direct instruction for computing a strictly ordered floating-point add reduction.
>
> On SVE targets (UseSVE > 0): generates the `fadda` instruction, which computes a floating-point add reduction in strict order.
>
> **For MulReduction:**
> Neither Neon nor SVE provides a direct instruction for computing a strictly ordered floating-point multiply reduction. For vector lengths of 8B and 16B, a scalarized sequence of scalar `fmul` instructions is generated; multiply reduction for vector lengths > 16B is not supported.
>
> Below is the performance of the two newly added microbenchmarks in `Float16OperationsBenchmark.java`, tested on three different aarch64 machines and with varying `MaxVectorSize`.
>
> Note: On all machines, the score (ops/ms) is compared with the master branch without this patch, which generates a sequence of loads (`ldrsh`) to load the FP16 value into an FPR and a scalar `fadd`/`fmul` to add/multiply the loaded value to the running sum/product. The ratios given below are the ratios between the throughput with this patch and the throughput without it. A ratio > 1 indicates the performance with this patch is better than the master branch.
> **N1 (UseSVE = 0, max vector length = 16B):**
>
> Benchmark         vectorDim  Mode  Cnt    8B   16B
> ReductionAddFP16        256  thrpt   9  1.41  1.40
> ReductionAddFP16        512  thrpt   9  1.41  1.41
> ReductionAddFP16       1024  thrpt   9  1.43  1.40
> ReductionAddFP16       2048  thrpt   9  1.43  1.40
> ReductionMulFP16        256  thrpt   9  1.22  1.22
> ReductionMulFP16        512  thrpt   9  1.21  1.23
> ReductionMulFP16       1024  thrpt   9  1.21  1.22
> ReductionMulFP16       2048  thrpt   9  1.20  1.22
>
> On N1, the scalarized sequence of `fadd`/`fmul` is generated for both `MaxVectorSize` of 8B and 16B for add reduction ...
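For context on why the reductions above must be strictly ordered: FP16 addition is not associative, because each partial sum is rounded to half precision. The sketch below is not the patch's or the benchmark's actual code; the class and method names are hypothetical, and it models FP16 arithmetic with `Float.floatToFloat16`/`Float.float16ToFloat` (available since JDK 20).

```java
// Hypothetical sketch of a strictly ordered FP16 add reduction.
// Each partial sum is rounded back to half precision, so the order of
// the additions is observable and the compiler must not reassociate them.
public class Fp16ReduceSketch {
    static float reduceAdd(float... values) {
        short acc = Float.floatToFloat16(0.0f);   // running sum as FP16 bits
        for (float v : values) {
            float sum = Float.float16ToFloat(acc) + v;
            acc = Float.floatToFloat16(sum);      // round to FP16 after every add
        }
        return Float.float16ToFloat(acc);
    }

    public static void main(String[] args) {
        // FP16 has an 11-bit significand, so near 2048 the spacing between
        // representable values is 2.0: the partial sum 2048 + 1 ties and
        // rounds back down to 2048 (round-to-nearest-even).
        System.out.println(reduceAdd(2048f, 1f, 1f));  // prints 2048.0
        // The same inputs in a different order give a different result,
        // which is why reordering the reduction would change the answer.
        System.out.println(reduceAdd(1f, 1f, 2048f));  // prints 2050.0
    }
}
```

This order sensitivity is why Neon targets fall back to a scalarized `fadd`/`fmul` sequence, while SVE can use `fadda`, which performs the add reduction in strict lane order in hardware.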
Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:

 - Address review feedback
 - Merge from mainline
 - Address review feedback
 - merge from main
 - Merge commit '9f13ec1ccb684398e311b5f139773ca9f39561fe' into HEAD
 - Address review comments for the JTREG test and microbenchmark
 - Merge branch 'master'
 - Address review comments
 - Fix build failures on Mac
 - Address review comments
 - ... and 2 more: https://git.openjdk.org/jdk/compare/5aa115b5...b552ee31

-------------

Changes: https://git.openjdk.org/jdk/pull/27526/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27526&range=06
   Stats: 831 lines in 14 files changed: 764 ins; 13 del; 54 mod
   Patch: https://git.openjdk.org/jdk/pull/27526.diff
   Fetch: git fetch https://git.openjdk.org/jdk.git pull/27526/head:pull/27526

PR: https://git.openjdk.org/jdk/pull/27526
