On Mon, 26 Jan 2026 16:51:56 GMT, Emanuel Peter <[email protected]> wrote:

>> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1940:
>> 
>>> 1938:       }
>>> 1939:   BLOCK_COMMENT("} neon_reduce_add_fp16");
>>> 1940: }
>> 
>> Given that the reduction order is sequential: why do you see any speedup in 
>> your benchmarks, comparing scalar to vector performance? How do you explain 
>> it? I'm just curious ;)
>
> Also: why not allow a vector with only 2 elements? Is there some restriction 
> here?

Hi @eme64. That's probably not the only contributing factor, but there's a 
significant latency difference between a chain of scalar `fadd` instructions 
and the SVE F16 `fadda` instruction. According to the [Neoverse 
V1](https://developer.arm.com/documentation/109897/latest/) and [Neoverse 
V2](https://developer.arm.com/documentation/109898/latest/) SWOGs, `fadda` has 
an execution latency of 19 and 10 cycles for vector registers holding 16 and 8 
elements, respectively. Scalar `fadd` has an execution latency of 2 cycles, and 
since a sequential reduction forms a dependency chain, that adds up to 32 and 
16 cycles for 16 and 8 values, respectively. I hope this explanation makes 
sense and helps.
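For illustration, here is a small sketch of that latency arithmetic, assuming a strictly sequential dependency chain for the scalar reduction (the latency figures are the Neoverse V1/V2 SWOG numbers quoted above; the class name and structure are just for this example):

```java
public class ReductionLatencySketch {
    public static void main(String[] args) {
        // Latencies from the Neoverse V1/V2 software optimization guides
        int scalarFaddLatency = 2;   // scalar fadd
        int fadda16Latency = 19;     // SVE fadda, 16-element F16 vector
        int fadda8Latency = 10;      // SVE fadda, 8-element F16 vector

        // A strictly ordered scalar reduction of n values is a chain of
        // n dependent adds, so the latencies accumulate.
        int scalar16 = 16 * scalarFaddLatency; // 32 cycles
        int scalar8 = 8 * scalarFaddLatency;   // 16 cycles

        System.out.println("16 values: scalar " + scalar16
                + " cycles vs fadda " + fadda16Latency + " cycles");
        System.out.println(" 8 values: scalar " + scalar8
                + " cycles vs fadda " + fadda8Latency + " cycles");
    }
}
```

In both cases `fadda` wins on the critical-path latency alone, even though it performs the additions in the same sequential order.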

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2769152062
