On Thu, 8 Jan 2026 15:27:01 GMT, Emanuel Peter <[email protected]> wrote:
>> Bhavana Kilambi has updated the pull request with a new target base due to a
>> merge or a rebase. The incremental webrev excludes the unrelated changes
>> brought in by the merge/rebase. The pull request contains seven additional
>> commits since the last revision:
>>
>> - Address review comments for the JTREG test and microbenchmark
>> - Merge branch 'master'
>> - Address review comments
>> - Fix build failures on Mac
>> - Address review comments
>> - Merge 'master'
>> - 8366444: Add support for add/mul reduction operations for Float16
>>
>> This patch adds mid-end support for vectorized add/mul reduction
>> operations for half floats. It also includes backend aarch64 support for
>> these operations. Only vectorization support through autovectorization
>> is added as VectorAPI currently does not support Float16 vector species.
>>
>> Both add and mul reductions vectorized through autovectorization must be
>> implemented in strict order. The following is how each of these reductions
>> is implemented for different aarch64 targets -
>>
>> For AddReduction :
>> On Neon-only targets (UseSVE = 0): Generates scalarized additions
>> using the scalar "fadd" instruction for both 8B and 16B vector lengths.
>> This is because Neon does not provide a direct instruction for computing
>> strictly ordered floating point add reduction.
>>
>> On SVE targets (UseSVE > 0): Generates the "fadda" instruction which
>> computes add reduction for floating point in strict order.
>>
>> For MulReduction :
>> Neither Neon nor SVE provides a direct instruction for computing
>> strictly ordered floating point multiply reduction. For vector lengths
>> of 8B and 16B, a sequence of scalar "fmul" instructions is generated;
>> multiply reduction for vector lengths > 16B is not supported.
>>
>> Below is the performance of the two newly added microbenchmarks in
>> Float16OperationsBenchmark.java tested on three different aarch64
>> machines and with varying MaxVectorSize -
>>
>> Note: On all machines, the score (ops/ms) is compared with the master
>> branch without this patch which generates a sequence of loads ("ldrsh")
>> to load the FP16 value into an FPR and a scalar "fadd/fmul" to
>> add/multiply the loaded value to the running sum/product. The ratios
>> given below are the ratios between the throughput with this patch and
>> the throughput without this patch.
>> Ratio > 1 indicate...
>
> test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java
> line 459:
>
>> 457: short result = (short) 0;
>> 458: for (int i = 0; i < LEN; i++) {
>> 459: result = float16ToRawShortBits(add(shortBitsToFloat16(result), shortBitsToFloat16(input1[i])));
>
> Why all the conversions from and to `short` / `Float16`?
> Is there any benefit to use `short` for the intermediate results? Why not
> make `result` a `Float16`?
If I remember correctly, I tried that initially but the loop did not get
vectorized. The Ideal graph showed a lot of nodes related to object creation
(probably for the intermediate `Float16` result), which bloated the loop body
so much that the loop was not unrolled (and consequently not vectorized). I
also tried a standalone loop that does not return the intermediate result,
hoping that escape analysis could avoid the object creation, but that did not
help either.
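For illustration, here is a minimal standalone sketch (not the actual test
code) of the two accumulation styles I compared; it assumes the
`jdk.incubator.vector.Float16` helpers `add`, `float16ToRawShortBits` and
`shortBitsToFloat16` that the test statically imports, and the class name and
`LEN` are just placeholders:

```java
import jdk.incubator.vector.Float16;
import static jdk.incubator.vector.Float16.*;

class Fp16AccumulationSketch {
    static final int LEN = 1024;

    // Accumulator kept as raw FP16 bits in a short: the loop body is
    // allocation-free, so C2 can unroll and auto-vectorize it.
    static short sumAsShortBits(short[] input) {
        short result = (short) 0;
        for (int i = 0; i < LEN; i++) {
            result = float16ToRawShortBits(add(shortBitsToFloat16(result),
                                               shortBitsToFloat16(input[i])));
        }
        return result;
    }

    // Accumulator kept as a Float16: every add() yields a new Float16
    // object, and the allocation-related nodes bloat the loop body,
    // which is what kept the loop from unrolling/vectorizing for me.
    static Float16 sumAsFloat16(short[] input) {
        Float16 result = shortBitsToFloat16((short) 0);
        for (int i = 0; i < LEN; i++) {
            result = add(result, shortBitsToFloat16(input[i]));
        }
        return result;
    }
}
```

The first variant keeps the running value as FP16 bits in a `short`, so the
loop stays free of allocations; the second accumulates into a `Float16` and
allocates per iteration, which is the bloat I saw in the Ideal graph.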
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2681225725