> This patch adds mid-end support for vectorized add/mul reduction operations 
> for half floats, along with aarch64 backend support for these operations. 
> Only autovectorization support is added, as the Vector API currently does 
> not support Float16 vector species.
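> 
> For context, the shape of code this patch targets is a plain scalar `Float16` 
> reduction loop that C2's autovectorizer can now turn into 
> `AddReductionVHF`/`MulReductionVHF` nodes. Below is a minimal sketch 
> (illustrative only, not the code added in this patch; it assumes the FP16 data 
> is kept as raw short bits) -
> 
> ```java
> // Requires the incubator module: --add-modules jdk.incubator.vector
> import jdk.incubator.vector.Float16;
> 
> public class Fp16ReduceSketch {
>     // FP16 values are stored as their raw 16-bit encodings in a short[].
>     static short reduceAdd(short[] src) {
>         Float16 sum = Float16.shortBitsToFloat16((short) 0);   // +0.0
>         for (int i = 0; i < src.length; i++) {
>             // Each iteration folds one element into the running sum in source
>             // order, which is exactly the strict ordering the vectorized
>             // reduction has to preserve.
>             sum = Float16.add(sum, Float16.shortBitsToFloat16(src[i]));
>         }
>         return Float16.float16ToRawShortBits(sum);
>     }
> }
> ```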
> 
> Both add and mul reductions, when vectorized through autovectorization, must 
> be strictly ordered, because floating-point arithmetic is not associative and 
> the vectorized reduction has to produce the same result as the sequential 
> loop. The following is how each of these reductions is implemented for the 
> different aarch64 targets -
> 
> **For AddReduction:**
> On Neon-only targets (UseSVE = 0): Generates scalarized additions using the 
> scalar `fadd` instruction for both 8B and 16B vector lengths, because Neon 
> does not provide a direct instruction for computing a strictly ordered 
> floating-point add reduction.
> 
> On SVE targets (UseSVE > 0): Generates the `fadda` instruction, which 
> computes a strictly ordered floating-point add reduction.
> 
> **For MulReduction:**
> Neither Neon nor SVE provides a direct instruction for computing a strictly 
> ordered floating-point multiply reduction. For vector lengths of 8B and 16B, 
> a scalarized sequence of scalar `fmul` instructions is generated; multiply 
> reduction for vector lengths > 16B is not supported.
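> 
> In both the Neon-only add case and the mul case, the generated scalar 
> sequence is semantically an in-order fold of the vector lanes into the 
> accumulator; SVE `fadda` performs the same fold for add in a single 
> instruction. A rough sketch of that per-chunk fold for a 16B (8-lane) chunk, 
> with Java scalar ops standing in for the generated `fadd`/`fmul` (the helper 
> below is illustrative, not the backend code) -
> 
> ```java
> import jdk.incubator.vector.Float16;
> 
> public class Fp16FoldSketch {
>     // Fold the 8 lanes of one 16B chunk into the running accumulator strictly
>     // in lane order. A pairwise/tree reduction could round differently because
>     // floating-point addition is not associative.
>     static Float16 foldChunkAdd(Float16 acc, short[] lanes, int base) {
>         for (int lane = 0; lane < 8; lane++) {
>             acc = Float16.add(acc, Float16.shortBitsToFloat16(lanes[base + lane]));
>         }
>         return acc;   // MulReductionVHF follows the same shape with Float16.multiply
>     }
> }
> ```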
> 
> Below is the performance of the two newly added microbenchmarks in 
> `Float16OperationsBenchmark.java`, measured on three different aarch64 
> machines with varying `MaxVectorSize` -
> 
> Note: On all machines, the score (ops/ms) is compared against the master 
> branch without this patch, which generates a sequence of loads (`ldrsh`) to 
> load each FP16 value into an FPR and a scalar `fadd`/`fmul` to add/multiply 
> the loaded value into the running sum/product. The ratios given below are the 
> throughput with this patch divided by the throughput without it; a ratio > 1 
> means this patch performs better than the master branch.
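> 
> For reference, the added kernels presumably look something like the JMH 
> sketch below, where `vectorDim` presumably corresponds to the input array 
> length. All names, setup values and annotations here are hypothetical 
> stand-ins, not the actual code in `Float16OperationsBenchmark.java` -
> 
> ```java
> import java.util.Arrays;
> import jdk.incubator.vector.Float16;
> import org.openjdk.jmh.annotations.*;
> 
> @State(Scope.Thread)
> public class Fp16ReductionBenchSketch {
>     @Param({"256", "512", "1024", "2048"})
>     int vectorDim;
> 
>     short[] input;
> 
>     @Setup
>     public void setup() {
>         input = new short[vectorDim];
>         // Fill with FP16 1.0 (bit pattern 0x3C00) for simplicity; the real
>         // benchmark presumably uses varied input data.
>         Arrays.fill(input, (short) 0x3C00);
>     }
> 
>     @Benchmark
>     public short reductionMulFP16() {
>         Float16 prod = Float16.shortBitsToFloat16((short) 0x3C00);   // 1.0
>         for (int i = 0; i < vectorDim; i++) {
>             // Strictly ordered multiply reduction (MulReductionVHF once vectorized)
>             prod = Float16.multiply(prod, Float16.shortBitsToFloat16(input[i]));
>         }
>         return Float16.float16ToRawShortBits(prod);
>     }
> }
> ```
> 
> The add kernel follows the same shape with `Float16.add`, as in the loop 
> shown earlier.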
> 
> **N1 (UseSVE = 0, max vector length = 16B):**
> 
> Benchmark         vectorDim  Mode   Cnt  8B     16B
> ReductionAddFP16  256        thrpt  9    1.41   1.40
> ReductionAddFP16  512        thrpt  9    1.41   1.41
> ReductionAddFP16  1024       thrpt  9    1.43   1.40
> ReductionAddFP16  2048       thrpt  9    1.43   1.40
> ReductionMulFP16  256        thrpt  9    1.22   1.22
> ReductionMulFP16  512        thrpt  9    1.21   1.23
> ReductionMulFP16  1024       thrpt  9    1.21   1.22
> ReductionMulFP16  2048       thrpt  9    1.20   1.22
> 
> 
> On N1, the scalarized sequences of `fadd`/`fmul` are generated at both 
> `MaxVectorSize` 8B and 16B for add reduction ...

Bhavana Kilambi has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains three additional 
commits since the last revision:

 - Address review comments
 - Merge 'master'
 - 8366444: Add support for add/mul reduction operations for Float16
   
   This patch adds mid-end support for vectorized add/mul reduction
   operations for half floats. It also includes backend aarch64 support for
   these operations. Only vectorization support through autovectorization
   is added as VectorAPI currently does not support Float16 vector species.
   
   Both add and mul reduction vectorized through autovectorization mandate
   the implementation to be strictly ordered. The following is how each of
   these reductions is implemented for different aarch64 targets -
   
   For AddReduction :
   On Neon only targets (UseSVE = 0): Generates scalarized additions
   using the scalar "fadd" instruction for both 8B and 16B vector lengths.
   This is because Neon does not provide a direct instruction for computing
   strictly ordered floating point add reduction.
   
   On SVE targets (UseSVE > 0): Generates the "fadda" instruction which
   computes add reduction for floating point in strict order.
   
   For MulReduction :
   Neither Neon nor SVE provides a direct instruction for computing a
   strictly ordered floating point multiply reduction. For vector lengths
   of 8B and 16B, a scalarized sequence of scalar "fmul" instructions is
   generated; multiply reduction for vector lengths > 16B is not
   supported.
   
   Below is the performance of the two newly added microbenchmarks in
   Float16OperationsBenchmark.java tested on three different aarch64
   machines and with varying MaxVectorSize -
   
   Note: On all machines, the score (ops/ms) is compared with the master
   branch without this patch which generates a sequence of loads ("ldrsh")
   to load the FP16 value into an FPR and a scalar "fadd/fmul" to
   add/multiply the loaded value to the running sum/product. The ratios
   given below are the ratios between the throughput with this patch and
   the throughput without this patch.
   Ratio > 1 indicates the performance with this patch is better than the
   master branch.
   
   N1 (UseSVE = 0, max vector length = 16B):
   Benchmark         vectorDim  Mode   Cnt  8B     16B
   ReductionAddFP16  256        thrpt  9    1.41   1.40
   ReductionAddFP16  512        thrpt  9    1.41   1.41
   ReductionAddFP16  1024       thrpt  9    1.43   1.40
   ReductionAddFP16  2048       thrpt  9    1.43   1.40
   ReductionMulFP16  256        thrpt  9    1.22   1.22
   ReductionMulFP16  512        thrpt  9    1.21   1.23
   ReductionMulFP16  1024       thrpt  9    1.21   1.22
   ReductionMulFP16  2048       thrpt  9    1.20   1.22
   
   On N1, the scalarized sequences of fadd/fmul are generated for both
   MaxVectorSize 8B and 16B, for add reduction and mul reduction
   respectively.
   
   V1 (UseSVE = 1, max vector length = 32B):
   Benchmark         vectorDim  Mode   Cnt  8B     16B     32B
   ReductionAddFP16  256        thrpt  9    1.11   1.75    2.02
   ReductionAddFP16  512        thrpt  9    1.02   1.64    1.93
   ReductionAddFP16  1024       thrpt  9    1.02   1.59    1.85
   ReductionAddFP16  2048       thrpt  9    1.02   1.56    1.80
   ReductionMulFP16  256        thrpt  9    1.12   0.99    1.09
   ReductionMulFP16  512        thrpt  9    1.04   1.01    1.04
   ReductionMulFP16  1024       thrpt  9    1.02   1.02    1.00
   ReductionMulFP16  2048       thrpt  9    1.01   1.01    1.00
   
   On V1, for MaxVectorSize = 8: scalarized fadd/fmul sequence will be
   generated for AddReductionVHF/MulReductionVHF as UseSVE defaults to 0
   [2].
   For MaxVectorSize = 16: a scalarized "fmul" sequence is generated for
   MulReductionVHF and "fadda" is generated for AddReductionVHF, which
   yields significant gains.
   For MaxVectorSize = 32: Autovectorization of MulReductionVHF is disabled
   for MaxVectorSize > 16B so the autovectorizer checks for maximal
   implemented size[1] which is 16B and generates scalarized "fmul"
   sequence for 16B in this case. For AddReductionVHF, it generates the
   "fadda" instruction.
   
   V2 (UseSVE = 2, max vector length = 16B):
   Benchmark         vectorDim  Mode   Cnt  8B     16B
   ReductionAddFP16  256        thrpt  9    1.16   1.70
   ReductionAddFP16  512        thrpt  9    1.02   1.61
   ReductionAddFP16  1024       thrpt  9    1.01   1.53
   ReductionAddFP16  2048       thrpt  9    1.00   1.49
   ReductionMulFP16  256        thrpt  9    1.18   0.99
   ReductionMulFP16  512        thrpt  9    1.04   1.01
   ReductionMulFP16  1024       thrpt  9    1.02   1.02
   ReductionMulFP16  2048       thrpt  9    1.01   1.01
   
   On V2, for MaxVectorSize = 8: scalarized fadd/fmul sequence will be
   generated as UseSVE defaults to 0 [2].
   For MaxVectorSize = 16: "fadda" instruction is generated for
   AddReductionVHF which results in significant gains in performance. For
   MulReductionVHF, the scalarized "fmul" sequence will be generated.
   
   Testing:
   hotspot_all, jdk(tiers1-3) and langtools(tier1) all pass on N1/V1/V2.
   
   [1] https://github.com/openjdk/jdk/blob/a272696813f2e5e896ac9de9985246aaeb9d476c/src/hotspot/share/opto/superword.cpp#L1677
   [2] https://github.com/openjdk/jdk/blob/a272696813f2e5e896ac9de9985246aaeb9d476c/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L479

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/27526/files
  - new: https://git.openjdk.org/jdk/pull/27526/files/b8eb35ba..e8e3989d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=27526&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27526&range=00-01

  Stats: 432095 lines in 4952 files changed: 278406 ins; 97133 del; 56556 mod
  Patch: https://git.openjdk.org/jdk/pull/27526.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27526/head:pull/27526

PR: https://git.openjdk.org/jdk/pull/27526
