On Thu, 2 Oct 2025 10:21:06 GMT, Bhavana Kilambi <[email protected]> wrote:
>> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1900:
>>
>>> 1898: fmulh(dst, dst, vtmp);
>>> 1899: ins(vtmp, H, vsrc, 0, 7);
>>> 1900: fmulh(dst, dst, vtmp);
>>
>> Do you know why the performance is not improved significantly for multiply
>> reduction? Seems instructions between different `ins` instructions will have
>> a data-dependence, which is not expected? Could you please use other
>> instructions instead or clear the register `vtmp` before `ins` and check the
>> performance changes?
>>
>> Note that a clear of `mov` such as `MOVI Vd.2D, #0` has zero cost from V2's
>> guide.
>
> Are you referring to the N1 numbers? The add reduction operation has gains
> around ~40% while the mul reduction is around ~20% on N1. On V1 and V2 they
> look comparable (not considering the cases where we generate `fadda`
> instructions for add reduction).
>
>> Seems instructions between different ins instructions will have a
>> data-dependence, which is not expected
>
> Why do you think it's not expected? We have the exact same sequence for Neon
> add reduction as well. There's back to back dependency there as well and yet
> it shows better performance. The N1 optimization guide shows 2 cyc latency
> for `fadd` and 3 cyc latency for `fmul`. Could this be the reason? WDYT?
I mean we do not expect there is data-dependence between two `ins` operations,
but it has now. We do not recommend use the instructions that just write part
of a register. This might involve un-expected dependence between. I suggest to
use `ext` instead, and I can observe about 20% performance improvement compared
with current version on V2. I did not check the correctness, but it looks right
to me. Could you please help check on other machines? Thanks!
The change might look like:
Suggestion:
fmulh(dst, fsrc, vsrc);
ext(vtmp, T8B, vsrc, vsrc, 2);
fmulh(dst, dst, vtmp);
ext(vtmp, T8B, vsrc, vsrc, 4);
fmulh(dst, dst, vtmp);
ext(vtmp, T8B, vsrc, vsrc, 6);
fmulh(dst, dst, vtmp);
if (isQ) {
ext(vtmp, T16B, vsrc, vsrc, 8);
fmulh(dst, dst, vtmp);
ext(vtmp, T16B, vsrc, vsrc, 10);
fmulh(dst, dst, vtmp);
ext(vtmp, T16B, vsrc, vsrc, 12);
fmulh(dst, dst, vtmp);
ext(vtmp, T16B, vsrc, vsrc, 14);
fmulh(dst, dst, vtmp);
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2409190652