On Thu, 2 Oct 2025 10:21:06 GMT, Bhavana Kilambi <[email protected]> wrote:

>> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1900:
>> 
>>> 1898:           fmulh(dst, dst, vtmp);
>>> 1899:           ins(vtmp, H, vsrc, 0, 7);
>>> 1900:           fmulh(dst, dst, vtmp);
>> 
>> Do you know why the performance is not improved significantly for multiply 
>> reduction? Seems instructions between different `ins` instructions will have 
>> a data-dependence, which is not expected? Could you please use other 
>> instructions instead or clear the register `vtmp` before `ins` and check the 
>> performance changes?
>> 
>> Note that a clear of `mov` such as `MOVI Vd.2D, #0` has zero cost from V2's 
>> guide.
>
> Are you referring to the N1 numbers? The add reduction operation has gains 
> around ~40% while the mul reduction is around ~20% on N1. On V1 and V2 they 
> look comparable (not considering the cases where we generate `fadda` 
> instructions for add reduction).
> 
>> Seems instructions between different ins instructions will have a 
>> data-dependence, which is not expected
> 
> Why do you think it's not expected? We have the exact same sequence for Neon 
> add reduction as well. There's back to back dependency there as well and yet 
> it shows better performance. The N1 optimization guide shows 2 cyc latency 
> for `fadd` and 3 cyc latency for `fmul`. Could this be the reason? WDYT?

I mean we do not expect there is data-dependence between two `ins` operations, 
but it has now. We do not recommend use the instructions that just write part 
of a register. This might involve un-expected dependence between. I suggest to 
use `ext` instead, and I can observe about 20% performance improvement compared 
with current version on V2. I did not check the correctness, but it looks right 
to me. Could you please help check on other machines? Thanks!

The change might look like:
Suggestion:

        fmulh(dst, fsrc, vsrc);
        ext(vtmp, T8B, vsrc, vsrc, 2);
        fmulh(dst, dst, vtmp);
        ext(vtmp, T8B, vsrc, vsrc, 4);
        fmulh(dst, dst, vtmp);
        ext(vtmp, T8B, vsrc, vsrc, 6);
        fmulh(dst, dst, vtmp);
        if (isQ) {
          ext(vtmp, T16B, vsrc, vsrc, 8);
          fmulh(dst, dst, vtmp);
          ext(vtmp, T16B, vsrc, vsrc, 10);
          fmulh(dst, dst, vtmp);
          ext(vtmp, T16B, vsrc, vsrc, 12);
          fmulh(dst, dst, vtmp);
          ext(vtmp, T16B, vsrc, vsrc, 14);
          fmulh(dst, dst, vtmp);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2409190652

Reply via email to