https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290

--- Comment #29 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <[email protected]>:

https://gcc.gnu.org/g:2d1b0191b6bf71c12f876abea909d9a05200f406

commit r16-7240-g2d1b0191b6bf71c12f876abea909d9a05200f406
Author: Tamar Christina <[email protected]>
Date:   Mon Feb 2 10:59:11 2026 +0000

    AArch64: remove setting scalar dup analysis flag back to false [PR121290]

    The final loop in the PR, s313, is fixed by removing the else branch that
    shouldn't be there now that we look at a ratio of dups instead of a yes/no.

    By setting m_loop_fully_scalar_dup back to false, that branch made the
    final costing think there is no dup at all in the inner loop.
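
    As a rough illustration of why such an else branch loses information, the
    sketch below models a per-statement costing hook in plain C.  It is not the
    code in aarch64.cc: the struct, its fields and both visit functions are
    hypothetical names made up for this example.  It only shows that a yes/no
    flag reset in an else branch reflects the last statement seen, while the
    dup count needed for a ratio is unaffected.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-loop costing state; not GCC's data
   structure.  */
struct dup_state {
    bool fully_scalar_dup;  /* yes/no flag */
    unsigned dups;          /* statements classified as scalar dups */
    unsigned total;         /* all statements seen */
};

/* Variant with the else branch: any later non-dup statement resets the
   flag, so the final value depends only on the last statement visited.  */
void visit_with_else(struct dup_state *s, bool is_scalar_dup)
{
    s->total++;
    if (is_scalar_dup) {
        s->dups++;
        s->fully_scalar_dup = true;
    } else {
        s->fully_scalar_dup = false;
    }
}

/* Variant without the else branch: the flag stays set once a dup has
   been seen, and dups/total is available for a ratio-based decision.  */
void visit_without_else(struct dup_state *s, bool is_scalar_dup)
{
    s->total++;
    if (is_scalar_dup) {
        s->dups++;
        s->fully_scalar_dup = true;
    }
}
```

    Feeding both variants a dup followed by a non-dup statement leaves the
    flag false in the first and true in the second, with identical counts.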

    This returns the codegen to what it was for GCC 15.  However, as the PR
    mentions, -O3 did improve with the broken costing.  The reason was that the
    broken costing was preventing unrolling at -O3.

    The loop generates at -O3:

    .L5:
            add     w2, w2, w7
            ld1w    z26.s, p7/z, [x1]
            ld1w    z29.s, p7/z, [x0]
            ld1w    z2.s, p7/z, [x1, #1, mul vl]
            ld1w    z27.s, p7/z, [x0, #1, mul vl]
            ld1w    z1.s, p7/z, [x1, #2, mul vl]
            ld1w    z30.s, p7/z, [x0, #2, mul vl]
            ld1w    z0.s, p7/z, [x1, #3, mul vl]
            ld1w    z28.s, p7/z, [x0, #3, mul vl]
            fmul    z29.s, z26.s, z29.s
            fmul    z27.s, z2.s, z27.s
            fadda   s31, p7, s31, z29.s
            fmul    z30.s, z1.s, z30.s
            fadda   s31, p7, s31, z27.s
            fmul    z28.s, z0.s, z28.s
            fadda   s31, p7, s31, z30.s
            add     x1, x1, x3
            add     x0, x0, x3
            fadda   s31, p7, s31, z28.s
            cmp     w2, w6
            bls     .L5

    This is silly given the limited throughput of the instructions in this
    loop.  The left fold reduction is a single cycle reduction, since the
    unrolling needs the accumulation to happen in the same scalar.
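
    The serial dependency can be sketched at the source level.  The example
    assumes the kernel is a plain float dot product, which is consistent with
    the fmul/fadda pairs above but is an assumption; the function names are
    invented for this illustration and this is not the PR's test case.
    Without reassociation, unrolling by four makes the multiplies independent
    but still funnels all four additions through one accumulator.

```c
#include <stddef.h>

/* In-order dot product: every addition depends on the previous value
   of dot, which is what the fadda instructions preserve.  */
float dot_in_order(const float *a, const float *b, size_t n)
{
    float dot = 0.0f;
    for (size_t i = 0; i < n; i++)
        dot += a[i] * b[i];
    return dot;
}

/* Hand-unrolled by four, keeping the original addition order.  The four
   products can be computed independently, but the additions still form
   one chain through dot, so the unrolling adds no parallelism to the
   reduction itself.  */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float dot = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float p0 = a[i] * b[i];
        float p1 = a[i + 1] * b[i + 1];
        float p2 = a[i + 2] * b[i + 2];
        float p3 = a[i + 3] * b[i + 3];
        dot = dot + p0;
        dot = dot + p1;
        dot = dot + p2;
        dot = dot + p3;
    }
    /* Remainder elements, still in order.  */
    for (; i < n; i++)
        dot += a[i] * b[i];
    return dot;
}
```

    Both functions add the products in the same order, so the unrolled form
    only lengthens the loop body around the same linear chain.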

    However, aarch64_force_single_cycle doesn't detect this, so the costing of
    the loop is wrong.  Even with that fixed, the cost model still thinks the
    unrolling is beneficial because the throughput restrictions for the
    reductions aren't modeled, and to get it to reject the unrolling the costs
    would have to be increased unrealistically.

    The better approach, I think, would be for us to model pipeline
    restrictions more, so that we have the ability to say that the above is a
    linear reduction chain.

    However, since that's not a regression and needs quite a bit of work, this
    is punted to GCC 17.

    gcc/ChangeLog:

            PR target/121290
            * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
            Remove else.

    gcc/testsuite/ChangeLog:

            PR target/121290
            * gcc.target/aarch64/pr121290_3.c: New test.
