On Mon, 22 Dec 2025 16:28:45 GMT, Bhavana Kilambi <[email protected]> wrote:
>> This patch adds mid-end support for vectorized add/mul reduction operations
>> for half floats. It also includes backend aarch64 support for these
>> operations. Only vectorization support through autovectorization is added as
>> VectorAPI currently does not support Float16 vector species.
>>
>> Both add and mul reduction vectorized through autovectorization mandate the
>> implementation to be strictly ordered. The following is how each of these
>> reductions is implemented for different aarch64 targets -
>>
>> **For AddReduction :**
>> On Neon only targets (UseSVE = 0): Generates scalarized additions using the
>> scalar `fadd` instruction for both 8B and 16B vector lengths. This is
>> because Neon does not provide a direct instruction for computing strictly
>> ordered floating point add reduction.
>>
>> On SVE targets (UseSVE > 0): Generates the `fadda` instruction which
>> computes add reduction for floating point in strict order.
>>
>> **For MulReduction :**
>> Both Neon and SVE do not provide a direct instruction for computing strictly
>> ordered floating point multiply reduction. For vector lengths of 8B and 16B,
>> a scalarized sequence of scalar `fmul` instructions is generated and
>> multiply reduction for vector lengths > 16B is not supported.
>>
>> Below is the performance of the two newly added microbenchmarks in
>> `Float16OperationsBenchmark.java` tested on three different aarch64 machines
>> and with varying `MaxVectorSize` -
>>
>> Note: On all machines, the score (ops/ms) is compared with the master branch
>> without this patch which generates a sequence of loads (`ldrsh`) to load the
>> FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded
>> value to the running sum/product. The ratios given below are the ratios
>> between the throughput with this patch and the throughput without this patch.
>> Ratio > 1 indicates the performance with this patch is better than the
>> master branch.
>>
>> **N1 (UseSVE = 0, max vector length = 16B):**
>>
>> Benchmark vectorDim Mode Cnt 8B 16B
>> ReductionAddFP16 256 thrpt 9 1.41 1.40
>> ReductionAddFP16 512 thrpt 9 1.41 1.41
>> ReductionAddFP16 1024 thrpt 9 1.43 1.40
>> ReductionAddFP16 2048 thrpt 9 1.43 1.40
>> ReductionMulFP16 256 thrpt 9 1.22 1.22
>> ReductionMulFP16 512 thrpt 9 1.21 1.23
>> ReductionMulFP16 1024 thrpt 9 1.21 1.22
>> ReductionMulFP16 2048 thrpt 9 1.20 1.22
>>
>>
>> On N1, the scalarized sequence of `fadd/fmul` are gener...
>
> Bhavana Kilambi has updated the pull request incrementally with one
> additional commit since the last revision:
>
> Address review comments
Common IR changes looks good to me, adding some minor comments.
test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java line
81:
> 79: if(!expected_fp16.equals(actual_fp16)) {
> 80: throw new AssertionError("Result Mismatch!, reduction type =
> " + reductionFunc + " actual = " + actual_fp16 + " expected = " +
> expected_fp16);
> 81: }
Please use Verify.checkEQ, Float16 support has been added to
[Verifier](https://github.com/openjdk/jdk/pull/28095) recently
test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java
line 323:
> 321: short result = (short) 0;
> 322: for (int i = 0; i < vectorDim; i++) {
> 323: result =
> float16ToRawShortBits(Float16.add(Float16.shortBitsToFloat16(result),
> Float16.shortBitsToFloat16(vector1[i])));
Float16 class is statically imported, we don't need fully qualified names here.
Suggestion:
result =
float16ToRawShortBits(Float16.add(shortBitsToFloat16(result),
shortBitsToFloat16(vector1[i])));
test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java
line 332:
> 330: short result = floatToFloat16(1.0f);
> 331: for (int i = 0; i < vectorDim; i++) {
> 332: result =
> float16ToRawShortBits(Float16.multiply(Float16.shortBitsToFloat16(result),
> Float16.shortBitsToFloat16(vector1[i])));
Suggestion:
result =
float16ToRawShortBits(Float16.multiply(shortBitsToFloat16(result),
shortBitsToFloat16(vector1[i])));
-------------
PR Review: https://git.openjdk.org/jdk/pull/27526#pullrequestreview-3610393442
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2645201339
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2645214816
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2645216202