On Mon, 29 Dec 2025 17:39:42 GMT, Bhavana Kilambi <[email protected]> wrote:

>> This patch adds mid-end support for vectorized add/mul reduction operations 
>> for half floats. It also includes backend aarch64 support for these 
>> operations. Only vectorization support through autovectorization is added as 
>> VectorAPI currently does not support Float16 vector species.
>> 
>> Both add and mul reduction vectorized through autovectorization mandate the 
>> implementation to be strictly ordered. The following is how each of these 
>> reductions is implemented for different aarch64 targets -
>> 
>> **For AddReduction :**
>> On Neon only targets (UseSVE = 0): Generates scalarized additions using the 
>> scalar `fadd` instruction for both 8B and 16B vector lengths. This is 
>> because Neon does not provide a direct instruction for computing strictly 
>> ordered floating point add reduction.
>> 
>> On SVE targets (UseSVE > 0): Generates the `fadda` instruction which 
>> computes add reduction for floating point in strict order.
>> 
>> **For MulReduction :**
>> Neither Neon nor SVE provides a direct instruction for computing strictly 
>> ordered floating point multiply reduction. For vector lengths of 8B and 
>> 16B, a scalarized sequence of scalar `fmul` instructions is generated; 
>> multiply reduction for vector lengths > 16B is not supported.
>> 
>> Below is the performance of the two newly added microbenchmarks in 
>> `Float16OperationsBenchmark.java` tested on three different aarch64 machines 
>> and with varying `MaxVectorSize` -
>> 
>> Note: On all machines, the score (ops/ms) is compared with the master branch 
>> without this patch which generates a sequence of loads (`ldrsh`) to load the 
>> FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded 
>> value to the running sum/product. The ratios given below are the ratios 
>> between the throughput with this patch and the throughput without this patch.
>> Ratio > 1 indicates the performance with this patch is better than the 
>> master branch.
>> 
>> **N1 (UseSVE = 0, max vector length = 16B):**
>> 
>> Benchmark         vectorDim  Mode   Cnt  8B     16B
>> ReductionAddFP16  256        thrpt  9    1.41   1.40
>> ReductionAddFP16  512        thrpt  9    1.41   1.41
>> ReductionAddFP16  1024       thrpt  9    1.43   1.40
>> ReductionAddFP16  2048       thrpt  9    1.43   1.40
>> ReductionMulFP16  256        thrpt  9    1.22   1.22
>> ReductionMulFP16  512        thrpt  9    1.21   1.23
>> ReductionMulFP16  1024       thrpt  9    1.21   1.22
>> ReductionMulFP16  2048       thrpt  9    1.20   1.22
>> 
>> 
>> On N1, the scalarized sequence of `fadd/fmul` are gener...
>
> Bhavana Kilambi has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains seven commits:
> 
>  - Address review comments for the JTREG test and microbenchmark
>  - Merge branch 'master'
>  - Address review comments
>  - Fix build failures on Mac
>  - Address review comments
>  - Merge 'master'
>  - 8366444: Add support for add/mul reduction operations for Float16
>    
>    This patch adds mid-end support for vectorized add/mul reduction
>    operations for half floats. It also includes backend aarch64 support for
>    these operations. Only vectorization support through autovectorization
>    is added as VectorAPI currently does not support Float16 vector species.
>    
>    Both add and mul reduction vectorized through autovectorization mandate
>    the implementation to be strictly ordered. The following is how each of
>    these reductions is implemented for different aarch64 targets -
>    
>    For AddReduction :
>    On Neon only targets (UseSVE = 0): Generates scalarized additions
>    using the scalar "fadd" instruction for both 8B and 16B vector lengths.
>    This is because Neon does not provide a direct instruction for computing
>    strictly ordered floating point add reduction.
>    
>    On SVE targets (UseSVE > 0): Generates the "fadda" instruction which
>    computes add reduction for floating point in strict order.
>    
>    For MulReduction :
>    Neither Neon nor SVE provides a direct instruction for computing
>    strictly ordered floating point multiply reduction. For vector lengths
>    of 8B and 16B, a scalarized sequence of scalar "fmul" instructions is
>    generated; multiply reduction for vector lengths > 16B is not
>    supported.
>    
>    Below is the performance of the two newly added microbenchmarks in
>    Float16OperationsBenchmark.java tested on three different aarch64
>    machines and with varying MaxVectorSize -
>    
>    Note: On all machines, the score (ops/ms) is compared with the master
>    branch without this patch which generates a sequence of loads ("ldrsh")
>    to load the FP16 value into an FPR and a scalar "fadd/fmul" to
>    add/multiply the loaded value to the running sum/product. The ratios
>    given below are the ratios between the throughput with this patch and
>    the throughput without this patch.
>    Ratio > 1 indicates the performance with this patch is better than the
>    master branch.
>    
>    N1 (UseSVE = 0, max vector length = 16B):
>    Benchmark         vectorDim  Mode   Cnt  8B     16B
>    ReductionAddFP16  256        th...

Here are the new benchmark results:

Neoverse N1 (UseSVE = 0, max vector length = 16B):

Benchmark            vectorDim   Mode    Cnt     8B     16B
ReductionAddFP16        256     thrpt     9     1.17    1.21
ReductionAddFP16        512     thrpt     9     1.17    1.18
ReductionAddFP16       1024     thrpt     9     1.18    1.17
ReductionAddFP16       2048     thrpt     9     1.19    1.16
ReductionMulFP16        256     thrpt     9     1.03    1.04
ReductionMulFP16        512     thrpt     9     1.02    1.03
ReductionMulFP16       1024     thrpt     9     1.01    1.02
ReductionMulFP16       2048     thrpt     9     1.01    1.01


Neoverse V1 (UseSVE = 1, max vector length = 32B):

Benchmark            vectorDim   Mode   Cnt     8B     16B     32B
ReductionAddFP16        256     thrpt     9     1.12    1.75    1.95
ReductionAddFP16        512     thrpt     9     1.07    1.64    1.87
ReductionAddFP16       1024     thrpt     9     1.05    1.59    1.78
ReductionAddFP16       2048     thrpt     9     1.04    1.56    1.74
ReductionMulFP16        256     thrpt     9     1.12    1.12    1.11
ReductionMulFP16        512     thrpt     9     1.04    1.05    1.05
ReductionMulFP16       1024     thrpt     9     1.02    1.02    0.99
ReductionMulFP16       2048     thrpt     9     1.01    1.01    1.00


Neoverse V2 (UseSVE = 2, max vector length = 16B):

Benchmark            vectorDim   Mode    Cnt     8B     16B
ReductionAddFP16        256     thrpt     9     1.16    1.70
ReductionAddFP16        512     thrpt     9     1.07    1.61
ReductionAddFP16       1024     thrpt     9     1.03    1.53
ReductionAddFP16       2048     thrpt     9     1.02    1.50
ReductionMulFP16        256     thrpt     9     1.18    1.18
ReductionMulFP16        512     thrpt     9     1.08    1.07
ReductionMulFP16       1024     thrpt     9     1.04    1.04
ReductionMulFP16       2048     thrpt     9     1.02    1.01
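
As context for the strict-ordering requirement mentioned in the description
above: floating point addition is not associative, so an auto-vectorized
reduction must accumulate in the same element order as the scalar loop to
produce bit-identical results. A minimal sketch using plain `float` (Float16
arithmetic behaves analogously; the array values here are illustrative, not
taken from the benchmark):

```java
public class StrictOrderDemo {
    public static void main(String[] args) {
        // FP addition is not associative: reordering changes the result.
        float[] a = {1e8f, 1.0f, -1e8f, 1.0f};

        // Strictly ordered reduction, matching scalar loop semantics:
        // (((0 + 1e8) + 1) + -1e8) + 1
        // The +1 is absorbed into 1e8 (float has a 24-bit significand),
        // then cancelled exactly, leaving the final +1 visible.
        float strict = 0.0f;
        for (float v : a) {
            strict += v;
        }

        // A pairwise (tree-shaped) reduction, as an unordered SIMD
        // reduction instruction might compute: (1e8 + 1) + (-1e8 + 1)
        // Both +1 terms are absorbed before the cancellation.
        float pairwise = (a[0] + a[1]) + (a[2] + a[3]);

        System.out.println("strict   = " + strict);   // 1.0
        System.out.println("pairwise = " + pairwise); // 0.0
    }
}
```

This is why, on Neon-only targets, the patch emits a chain of scalar `fadd`
instructions rather than a tree-shaped vector reduction, and why SVE's
`fadda`, which is defined to accumulate in element order, can be used
directly.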

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27526#issuecomment-3861614647
