https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290

--- Comment #11 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <tnfch...@gcc.gnu.org>:

https://gcc.gnu.org/g:28ab83367e8710a78fffa2513e6e008ebdfbee3e

commit r16-3394-g28ab83367e8710a78fffa2513e6e008ebdfbee3e
Author: Tamar Christina <tamar.christ...@arm.com>
Date:   Tue Aug 26 13:10:10 2025 +0100

    AArch64: extend cost model to cost outer loop vect where the inner loop is invariant [PR121290]

    Consider the example:

    void
    f (int *restrict x, int *restrict y, int *restrict z, int n)
    {
      for (int i = 0; i < 4; ++i)
        {
          int res = 0;
          for (int j = 0; j < 100; ++j)
            res += y[j] * z[i];
          x[i] = res;
        }
    }

    we currently vectorize as

    f:
            movi    v30.4s, 0
            ldr     q31, [x2]
            add     x2, x1, 400
    .L2:
            ld1r    {v29.4s}, [x1], 4
            mla     v30.4s, v29.4s, v31.4s
            cmp     x2, x1
            bne     .L2
            str     q30, [x0]
            ret

    which is not useful because by doing outer-loop vectorization we're
    performing less work per iteration than we would had we done inner-loop
    vectorization and simply unrolled the inner loop.
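
    As a rough illustration of the difference, the two strategies can be
    modelled in plain C.  This is only a sketch: the function names and the
    four-lane accumulator arrays are made up for illustration and are not
    what the vectorizer emits.  The first function mirrors the ld1r/mla loop
    above (one y element broadcast per step); the second mirrors inner-loop
    vectorization (four y elements per step, z[i] broadcast once).

```c
#define N_OUTER 4    /* trip count of the outer loop in the example */
#define N_INNER 100  /* trip count of the inner loop in the example */

/* Model of outer-loop vectorization: each step loads ONE y[j],
   replicates it across the four lanes (the ld1r), and accumulates
   against z[0..3].  100 steps, one new y element per step.  */
static void
outer_loop_model (int *x, const int *y, const int *z)
{
  int acc[N_OUTER] = { 0, 0, 0, 0 };
  for (int j = 0; j < N_INNER; ++j)
    for (int lane = 0; lane < N_OUTER; ++lane)
      acc[lane] += y[j] * z[lane];
  for (int lane = 0; lane < N_OUTER; ++lane)
    x[lane] = acc[lane];
}

/* Model of inner-loop vectorization: z[i] is broadcast once per outer
   iteration, each step consumes FOUR consecutive y elements, and the
   lanes are summed at the end.  25 steps per outer iteration.  */
static void
inner_loop_model (int *x, const int *y, const int *z)
{
  for (int i = 0; i < N_OUTER; ++i)
    {
      int acc[4] = { 0, 0, 0, 0 };
      for (int j = 0; j < N_INNER; j += 4)
        for (int lane = 0; lane < 4; ++lane)
          acc[lane] += y[j + lane] * z[i];
      x[i] = acc[0] + acc[1] + acc[2] + acc[3];
    }
}
```

    Both models compute the same x (for inputs that do not overflow int);
    they differ only in how many distinct y elements each vector step
    consumes.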

    This patch teaches the cost model that if all the leaves are invariant, it
    should scale the loop cost by VF, since every vector iteration has at least
    one lane that is really just doing 1 scalar.
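
    The shape of that heuristic can be sketched in a few lines of C.  This
    is a toy model, not the code in aarch64.cc: the struct, the function
    names and the way leaves are reported are all hypothetical, and only
    the idea (track whether every leaf was invariant, then multiply the
    body cost by VF) is taken from the description above.

```c
#include <stdbool.h>

/* Toy stand-in for per-loop cost state.  Hypothetical; it only loosely
   echoes the m_loop_fully_scalar_dup flag named in the ChangeLog.  */
struct toy_costs
{
  bool saw_leaf;              /* at least one leaf was recorded */
  bool all_leaves_invariant;  /* no non-invariant leaf seen so far */
};

static void
toy_init (struct toy_costs *c)
{
  c->saw_leaf = false;
  c->all_leaves_invariant = true;
}

/* Record one leaf operand; IS_INVARIANT says whether the (assumed)
   analysis classified it as invariant.  */
static void
toy_record_leaf (struct toy_costs *c, bool is_invariant)
{
  c->saw_leaf = true;
  if (!is_invariant)
    c->all_leaves_invariant = false;
}

/* Scale the body cost by VF only when every recorded leaf was
   invariant; otherwise return the cost unchanged.  */
static unsigned
toy_adjust_body_cost (const struct toy_costs *c, unsigned body_cost,
                      unsigned vf)
{
  if (c->saw_leaf && c->all_leaves_invariant)
    return body_cost * vf;
  return body_cost;
}
```

    A single non-invariant leaf is enough to switch the penalty off in this
    sketch; whether the real implementation is that strict is not stated
    here.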

    There are a couple of ways we could have solved this.  One is to increase
    the unroll factor to process more iterations of the inner loop.  This
    removes the need for the broadcast; however, we don't support unrolling the
    inner loop within the outer loop.  We only support unrolling by increasing
    the VF, which would affect the outer loop as well as the inner loop.

    We also don't directly support costing inner-loop vs outer-loop
    vectorization, and as such we're left trying to predict/steer the cost
    model ahead of time to what we think should be profitable.  This patch
    attempts to do so using a heuristic which penalizes the outer-loop
    vectorization.

    We now cost the loop as

    note:  Cost model analysis:
      Vector inside of loop cost: 2000
      Vector prologue cost: 4
      Vector epilogue cost: 0
      Scalar iteration cost: 300
      Scalar outside cost: 0
      Vector outside cost: 4
      prologue iterations: 0
      epilogue iterations: 0
    missed:  cost model: the vector iteration cost = 2000 divided by the scalar iteration cost = 300 is greater or equal to the vectorization factor = 4.
    missed:  not vectorized: vectorization not profitable.
    missed:  not vectorized: vector version will never be profitable.
    missed:  Loop costings may not be worthwhile.
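
    The rejection in the dump follows from simple arithmetic on the numbers
    shown: 2000 / 300 is about 6.67, which is not below the VF of 4.  A
    minimal C sketch of that comparison is below.  The function name is
    hypothetical and the check is written from the wording of the dump
    message, not copied from the vectorizer; the value 500 used in the
    usage note is an assumption (2000 divided by VF), not a figure from the
    dump.

```c
#include <stdbool.h>

/* Returns true when one vector iteration costs at least as much as VF
   scalar iterations, i.e. vec_inside_cost / scalar_iter_cost >= vf,
   rewritten as a multiplication to avoid integer-division rounding.  */
static bool
toy_never_profitable (unsigned vec_inside_cost, unsigned scalar_iter_cost,
                      unsigned vf)
{
  return vec_inside_cost >= scalar_iter_cost * vf;
}
```

    With the dumped values, toy_never_profitable (2000, 300, 4) is true
    because 2000 >= 1200.  With a hypothetical unscaled body cost of 500
    the same check is false (500 < 1200), which is consistent with the
    loop having been vectorized before the adjustment.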

    And subsequently generate:

    .L5:
            add     w4, w4, w7
            ld1w    z24.s, p6/z, [x0, #1, mul vl]
            ld1w    z23.s, p6/z, [x0, #2, mul vl]
            ld1w    z22.s, p6/z, [x0, #3, mul vl]
            ld1w    z29.s, p6/z, [x0]
            mla     z26.s, p6/m, z24.s, z30.s
            add     x0, x0, x8
            mla     z27.s, p6/m, z23.s, z30.s
            mla     z28.s, p6/m, z22.s, z30.s
            mla     z25.s, p6/m, z29.s, z30.s
            cmp     w4, w6
            bls     .L5

    and avoids the load and replicate if it knows it has enough vector pipes to
    do so.

    gcc/ChangeLog:

            PR target/121290
            * config/aarch64/aarch64.cc
            (class aarch64_vector_costs ): Add m_loop_fully_scalar_dup.
            (aarch64_vector_costs::add_stmt_cost): Detect invariant inner
            loops.
            (adjust_body_cost): Adjust final costing if
            m_loop_fully_scalar_dup.

    gcc/testsuite/ChangeLog:

            PR target/121290
            * gcc.target/aarch64/pr121290.c: New test.
