Re: RFR: 8366444: Add support for add/mul reduction operations for Float16

Bhavana Kilambi Thu, 11 Dec 2025 04:24:49 -0800

On Fri, 26 Sep 2025 12:00:31 GMT, Bhavana Kilambi <[email protected]> wrote:


> This patch adds mid-end support for vectorized add/mul reduction operations 
> for half floats. It also includes backend aarch64 support for these 
> operations. Only vectorization support through autovectorization is added as 
> VectorAPI currently does not support Float16 vector species.
> 
> Both add and mul reduction vectorized through autovectorization mandate the 
> implementation to be strictly ordered. The following is how each of these 
> reductions is implemented for different aarch64 targets -
> 
> **For AddReduction :**
> On Neon only targets (UseSVE = 0): Generates scalarized additions using the 
> scalar `fadd` instruction for both 8B and 16B vector lengths. This is because 
> Neon does not provide a direct instruction for computing strictly ordered 
> floating point add reduction.
> 
> On SVE targets (UseSVE > 0): Generates the `fadda` instruction which computes 
> add reduction for floating point in strict order.
> 
> **For MulReduction :**
> Both Neon and SVE do not provide a direct instruction for computing strictly 
> ordered floating point multiply reduction. For vector lengths of 8B and 16B, 
> a scalarized sequence of scalar `fmul` instructions is generated and multiply 
> reduction for vector lengths > 16B is not supported.
> 
> Below is the performance of the two newly added microbenchmarks in 
> `Float16OperationsBenchmark.java` tested on three different aarch64 machines 
> and with varying `MaxVectorSize` -
> 
> Note: On all machines, the score (ops/ms) is compared with the master branch 
> without this patch which generates a sequence of loads (`ldrsh`) to load the 
> FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded 
> value to the running sum/product. The ratios given below are the ratios 
> between the throughput with this patch and the throughput without this patch.
> Ratio > 1 indicates the performance with this patch is better than the master 
> branch.
> 
> **N1 (UseSVE = 0, max vector length = 16B):**
> 
> Benchmark         vectorDim  Mode   Cnt  8B     16B
> ReductionAddFP16  256        thrpt  9    1.41   1.40
> ReductionAddFP16  512        thrpt  9    1.41   1.41
> ReductionAddFP16  1024       thrpt  9    1.43   1.40
> ReductionAddFP16  2048       thrpt  9    1.43   1.40
> ReductionMulFP16  256        thrpt  9    1.22   1.22
> ReductionMulFP16  512        thrpt  9    1.21   1.23
> ReductionMulFP16  1024       thrpt  9    1.21   1.22
> ReductionMulFP16  2048       thrpt  9    1.20   1.22
> 
> 
> On N1, the scalarized sequence of `fadd/fmul` are generated for both 
> `MaxVectorSize` of 8B and 16B for add reduction ...
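
To illustrate why the strict ordering mentioned in the description above matters, here is a minimal standalone sketch. It uses plain `float` rather than Float16 so that it runs without any incubator module. The class and method names are made up for this example and are not part of the patch. It compares a strictly ordered sum with a pairwise (tree-shaped) sum over the same inputs.

```java
final class StrictOrderReductionDemo {
    // Strictly ordered reduction: ((0 + a[0]) + a[1]) + a[2] + ...
    // This is the order a scalar loop (or SVE `fadda`) has to preserve.
    static float sequentialSum(float[] a) {
        float sum = 0.0f;
        for (float v : a) {
            sum += v;
        }
        return sum;
    }

    // Pairwise (tree) reduction over a[from, to): a different association
    // order, which is what an unordered vector reduction would be free to use.
    static float pairwiseSum(float[] a, int from, int to) {
        if (to - from == 1) {
            return a[from];
        }
        int mid = (from + to) >>> 1;
        return pairwiseSum(a, from, mid) + pairwiseSum(a, mid, to);
    }

    public static void main(String[] args) {
        // 1e8f + 1.0f rounds back to 1e8f, so the grouping decides
        // whether the small terms survive.
        float[] data = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println(sequentialSum(data));               // 1.0
        System.out.println(pairwiseSum(data, 0, data.length)); // 0.0
    }
}
```

The two results differ only because of the association order, which is why the scalar-equivalent order has to be kept when the loop is vectorized.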

Apologies, I missed addressing the assertion failure you pointed out in my 
previous comment. It seems to occur because, as gdb showed, the combined stress 
flags somehow set the vector length to 4B, which is not allowed. The assertion 
failure itself can be fixed by adding `length < 8` to this condition in the 
aarch64_vector.ad file -
```
if (length < 8 || length_in_bytes > 16 || !is_feat_fp16_supported()) {
  return false;
}
```
which would avoid vectorization for 4B vector length.
But after this change, the IR rules for reduction fail because the vector 
reduction nodes are no longer generated while the IR rules still expect them. 
I'll look into this, but I also noticed that this test fails even on the master 
branch with the following IR failures -


One or more @IR rules failed:

Failed IR Rules (11) of Methods (11)
------------------------------------
1) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorAddConstInputFloat16()"
 - [Failed IR rules: 1]:
   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#ADD_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: 
"(\d+(\s){2}(AddVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

2) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorAddFloat16()" - 
[Failed IR rules: 1]:
   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#ADD_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: 
"(\d+(\s){2}(AddVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

3) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorDivFloat16()" - 
[Failed IR rules: 1]:
   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#DIV_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: 
"(\d+(\s){2}(DivVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

4) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorFmaFloat16()" - 
[Failed IR rules: 1]:
   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#FMA_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: 
"(\d+(\s){2}(FmaVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

5) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorFmaFloat16MixedConstants()"
 - [Failed IR rules: 1]:
   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#FMA_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: 
"(\d+(\s){2}(FmaVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

6) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorFmaFloat16ScalarMixedConstants()"
 - [Failed IR rules: 1]:
   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#FMA_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: 
"(\d+(\s){2}(FmaVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

7) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorMaxFloat16()" - 
[Failed IR rules: 1]:
   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#MAX_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: 
"(\d+(\s){2}(MaxVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

8) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorMinFloat16()" - 
[Failed IR rules: 1]:
   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#MIN_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: 
"(\d+(\s){2}(MinVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

9) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorMulFloat16()" - 
[Failed IR rules: 1]:
   * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#MUL_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: 
"(\d+(\s){2}(MulVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

10) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorSqrtFloat16()" - 
[Failed IR rules: 1]:
    * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#SQRT_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
      > Phase "PrintIdeal":
        - counts: Graph contains wrong number of nodes:
          * Constraint 1: 
"(\d+(\s){2}(SqrtVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
            - Failed comparison: [found] 0 > 0 [given]
            - No nodes matched!

11) Method "public void 
compiler.vectorization.TestFloat16VectorOperations.vectorSubFloat16()" - 
[Failed IR rules: 1]:
    * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, 
applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#V#SUB_VHF#_", " >0 
"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, 
applyIfCPUFeatureAnd={"fphp", "true", "asimdhp", "true"}, applyIf={}, 
applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
      > Phase "PrintIdeal":
        - counts: Graph contains wrong number of nodes:
          * Constraint 1: 
"(\d+(\s){2}(SubVHF.*)+(\s){2}===.*vector[A-Za-z]<S,8>)"
            - Failed comparison: [found] 0 > 0 [given]
            - No nodes matched!



It mostly looks like the expected vector shape (the default is 
`VECTOR_SIZE_MAX`) is not found in the IR graph, possibly because the stress 
flags changed the vector length. These failures seem to exist on both aarch64 
and x86_64.
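
As a rough illustration of this mismatch, the sketch below applies the constraint regex from the failure output above to two node lines. The node lines are made-up stand-ins for `PrintIdeal` output, not copied from a real run, and the type name `vectors<S,2>` used for a 4B vector is an assumption. The class and method names are hypothetical too.

```java
import java.util.regex.Pattern;

final class IrShapeConstraintDemo {
    // Constraint regex as printed in the IR failure output (AddVHF variant).
    static final Pattern ADD_VHF_MAX_SHAPE = Pattern.compile(
        "(\\d+(\\s){2}(AddVHF.*)+(\\s){2}===.*vector[A-Za-z]<S,8>)");

    // True if the given node line satisfies the constraint.
    static boolean matchesConstraint(String nodeLine) {
        return ADD_VHF_MAX_SHAPE.matcher(nodeLine).find();
    }

    public static void main(String[] args) {
        // Hypothetical node with the expected 16B shape (8 half-float lanes).
        String maxShape = "123  AddVHF  === _ 120 121  [[ 130 ]]  #vectorx<S,8>";
        // Hypothetical node with a 4B shape (2 lanes), as stress flags might produce.
        String smallShape = "123  AddVHF  === _ 120 121  [[ 130 ]]  #vectors<S,2>";

        System.out.println(matchesConstraint(maxShape));   // true
        System.out.println(matchesConstraint(smallShape)); // false
    }
}
```

If the graph only contains nodes of the second kind, or no vector nodes at all, the `> 0` count in the rule cannot be satisfied, which would be consistent with the "No nodes matched!" messages.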

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27526#issuecomment-3641660105
