https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290
--- Comment #29 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <[email protected]>:

https://gcc.gnu.org/g:2d1b0191b6bf71c12f876abea909d9a05200f406

commit r16-7240-g2d1b0191b6bf71c12f876abea909d9a05200f406
Author: Tamar Christina <[email protected]>
Date:   Mon Feb 2 10:59:11 2026 +0000

    AArch64: remove setting scalar dup analysis flag back to false [PR121290]

    The final loop in the PR, s313, is fixed by removing the else branch that
    shouldn't be there now that we look at a ratio of dups instead of a yes/no.
    By setting m_loop_fully_scalar_dup back to false, the final costing doesn't
    think there's any dup at all in the inner loop costing.

    This returns the codegen to what it was for GCC 15.

    However, as the PR mentions, with the broken costing -O3 did improve.  The
    reason for this was that it was preventing unrolling at -O3.

    The loop generates at -O3:

    .L5:
            add     w2, w2, w7
            ld1w    z26.s, p7/z, [x1]
            ld1w    z29.s, p7/z, [x0]
            ld1w    z2.s, p7/z, [x1, #1, mul vl]
            ld1w    z27.s, p7/z, [x0, #1, mul vl]
            ld1w    z1.s, p7/z, [x1, #2, mul vl]
            ld1w    z30.s, p7/z, [x0, #2, mul vl]
            ld1w    z0.s, p7/z, [x1, #3, mul vl]
            ld1w    z28.s, p7/z, [x0, #3, mul vl]
            fmul    z29.s, z26.s, z29.s
            fmul    z27.s, z2.s, z27.s
            fadda   s31, p7, s31, z29.s
            fmul    z30.s, z1.s, z30.s
            fadda   s31, p7, s31, z27.s
            fmul    z28.s, z0.s, z28.s
            fadda   s31, p7, s31, z30.s
            add     x1, x1, x3
            add     x0, x0, x3
            fadda   s31, p7, s31, z28.s
            cmp     w2, w6
            bls     .L5

    Which is silly due to the limited throughput of the instructions in this
    loop.  The left fold reduction is a single cycle reduction since the
    unrolling needs the accumulation to happen in the same scalar.  However,
    aarch64_force_single_cycle doesn't detect this and so the costing of the
    loop is wrong.

    Even with that fixed, though, the cost model still thinks the unrolling is
    beneficial because the throughput restrictions for the reductions aren't
    modeled, and to get it to reject the unrolling the costs have to be
    increased unrealistically.

    The better approach, I think, would be for us to model pipeline
    restrictions more, such that we have the ability to say that the above is
    a linear reduction chain.  However, since that's not a regression and needs
    quite a bit of work, this is punted to GCC 17.

    gcc/ChangeLog:

            PR target/121290
            * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
            Remove else.

    gcc/testsuite/ChangeLog:

            PR target/121290
            * gcc.target/aarch64/pr121290_3.c: New test.
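
For readers who want the "linear reduction chain" spelled out, here is a rough C sketch. It assumes the loop has the shape the assembly above suggests (two float loads, a multiply, and a strictly ordered add into one scalar); it is not the pr121290_3.c testcase, and the function names are hypothetical. Without -ffast-math or -fassociative-math the compiler may not reassociate the float additions, so every add has to wait for the previous value of the accumulator. Unrolling by four, as in the second function, only lengthens the loop body; the four additions still form one serial chain, which is what the four back-to-back fadda instructions on s31 show.

```c
#include <stddef.h>

/* Hypothetical reduced loop: an in-order float dot product.
   Each addition depends on the previous value of sum.  */
float dot_in_order(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* The same loop unrolled by 4 with a single accumulator.  The order of
   the additions is unchanged, so the dependency chain through sum is
   just as long as before; only the loop overhead is reduced.  */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += a[i] * b[i];          /* step 1 of the chain */
        sum += a[i + 1] * b[i + 1];  /* waits for step 1 */
        sum += a[i + 2] * b[i + 2];  /* waits for step 2 */
        sum += a[i + 3] * b[i + 3];  /* waits for step 3 */
    }
    for (; i < n; i++)               /* remainder iterations */
        sum += a[i] * b[i];
    return sum;
}
```

Both functions perform the additions in the same order, so they compute the same value; the unrolled form gains nothing on the reduction itself, which is the effect the commit message says the cost model does not yet capture.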
