LIM and loop vector width (ZMM vs XMM)

raghesh.aloor at amd dot com via Gcc-bugs Tue, 26 May 2026 04:06:09 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125461


            Bug ID: 125461
           Summary: znver5: 14.2 vs trunk differ on SRA store shape/LIM
                    and loop vector width (ZMM vs XMM)
           Product: gcc
           Version: 17.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: raghesh.aloor at amd dot com
                CC: jamborm at gcc dot gnu.org, rguenth at gcc dot gnu.org,
                    venkataramanan.kumar at amd dot com
  Target Milestone: ---

Created attachment 64554
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=64554&action=edit
The preprocessed input file to be used for the command mentioned in the
description

GCC 14.2 vs trunk: SRA/LIM and loop-vectorization differences (znver5)
======================================================================

We are posting this report to get an initial opinion and advice on how to
proceed. We can share the full reduced microbenchmark source and the tree
dumps if required.

Context
-------
We compared GCC 14.2 and GCC trunk on a shortened version of a hot loop from
an application. The loop uses a user-defined vectorized array type,
SimdBlock8<double>. We are experimenting on an AMD znver5 (AVX-512) machine.

We observe that trunk is slower than 14.2 on this build: the code generated
by trunk uses 128-bit XMM instructions for much of the loop, while 14.2 uses
512-bit ZMM instructions.

Reduced Testcase
----------------
Function: simd_coeff_apply_kernel_impl

  inline void
  simd_coeff_apply_kernel_impl (
    const double * __restrict coeffs,
    const SimdBlock8<double> *src,
    SimdBlock8<double> *dst)
  {
    using Vec = SimdBlock8<double>;
    constexpr int inner_n = 2;

    for (int block_i = 0; block_i < 6; ++block_i)
      {
        Vec v_p0 = src[0] + src[30];
        Vec v_m0 = src[0] - src[30];
        Vec v_p1 = src[6] + src[24];
        Vec v_m1 = src[6] - src[24];
        Vec v_p2 = src[12] + src[18];
        Vec v_m2 = src[12] - src[18];

        for (int k = 0; k < inner_n; ++k)
          {
            Vec acc0 = coeffs[k * 3] * v_p0;
            Vec acc1 = coeffs[(4 - k) * 3] * v_m0;
            acc0 += coeffs[k * 3 + 1] * v_p1;
            acc1 += coeffs[(4 - k) * 3 + 1] * v_m1;
            acc0 += coeffs[k * 3 + 2] * v_p2;
            acc1 += coeffs[(4 - k) * 3 + 2] * v_m2;
            dst[6 * k] = acc0 + acc1;
            dst[6 * (4 - k)] = acc0 - acc1;
          }

        Vec acc_tail = coeffs[6] * v_p0;
        acc_tail += coeffs[7] * v_p1;
        acc_tail += coeffs[8] * v_p2;
        dst[12] = acc_tail;

        src += 1;
        dst += 1;
      }
    src += 6 * 5;
    dst += 6 * 4;
  }
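
The definition of SimdBlock8 is not shown in this report. For readers who
want to compile the kernel above on its own, a minimal stand-in might look
like the sketch below. Only the member name data is taken from the IR quoted
under Issue A (.data[k]); the operator set and the plain elementwise loops are
assumptions and may differ from the application's real type.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the application's SimdBlock8<T>.
// Assumption: eight contiguous elements in a member named `data`.
template <typename T>
struct SimdBlock8
{
  T data[8];

  // Elementwise accumulate, used by `acc0 += ...` in the kernel.
  SimdBlock8 &
  operator+= (const SimdBlock8 &o)
  {
    for (std::size_t k = 0; k < 8; ++k)
      data[k] += o.data[k];
    return *this;
  }
};

// Elementwise sum of two blocks.
template <typename T>
SimdBlock8<T>
operator+ (const SimdBlock8<T> &a, const SimdBlock8<T> &b)
{
  SimdBlock8<T> r;
  for (std::size_t k = 0; k < 8; ++k)
    r.data[k] = a.data[k] + b.data[k];
  return r;
}

// Elementwise difference of two blocks.
template <typename T>
SimdBlock8<T>
operator- (const SimdBlock8<T> &a, const SimdBlock8<T> &b)
{
  SimdBlock8<T> r;
  for (std::size_t k = 0; k < 8; ++k)
    r.data[k] = a.data[k] - b.data[k];
  return r;
}

// Scalar times block, used by `coeffs[i] * v_p0` in the kernel.
template <typename T>
SimdBlock8<T>
operator* (T s, const SimdBlock8<T> &a)
{
  SimdBlock8<T> r;
  for (std::size_t k = 0; k < 8; ++k)
    r.data[k] = s * a.data[k];
  return r;
}
```

With this stand-in, each per-element loop over data[k] is the kind of short
loop that SRA and the vectorizer see after inlining into the kernel.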

Compiler command used:

  g++ -S -m64 -O3 -march=znver5 -std=c++17 \
      -fdump-tree-sra -fdump-tree-lim2 -fdump-tree-vect \
      simd_coeff_apply_microbench.i -fopt-info-vec -fopt-info-vec-missed \
      -o simd_coeff_apply_microbench-gcc.s

We see two differences between GCC 14.2 and trunk on this test program.
They may be unrelated, but both matter for the final code. We describe them
as Issue A and Issue B below.

Issue A — SRA and LIM (before the loop vectorizer runs)
-------------------------------------------------------
This looks like a different problem from Issue B below, but both show up in
the same test program.

After pass_sra, the store dst[12] = acc_tail appears in the compiler IR as
follows on 14.2 vs trunk:

  GCC 14.2:
    MEM[(struct SimdBlock8 *)dst + 768B].data[k]

  GCC trunk (after PR118924):
    MEM<double> [(struct SimdBlock8 *)dst + (768 + 8*k)B]
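
The offsets in the two forms can be cross-checked with a little arithmetic.
Assuming SimdBlock8<double> is a 64-byte struct holding double data[8] (the
layout is not shown in this report, so the Block type below is a hypothetical
stand-in), dst + 768B is dst[12], and 768 + 8*k is the byte offset of
dst[12].data[k]. Both forms therefore name the same memory; they differ only
in whether the struct access path is kept.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in; the layout is assumed, not taken from the
// application.
struct Block
{
  double data[8];
};

static_assert (sizeof (Block) == 64, "assumed 64-byte block");
static_assert (12 * sizeof (Block) == 768, "dst[12] starts at byte 768");

// The coeffs offsets mentioned in the report: coeffs[6], [7], [8].
static_assert (6 * sizeof (double) == 48, "coeffs[6] is at +48B");
static_assert (7 * sizeof (double) == 56, "coeffs[7] is at +56B");
static_assert (8 * sizeof (double) == 64, "coeffs[8] is at +64B");

// Byte offset of dst[idx].data[k] from dst, computed from real addresses.
inline std::ptrdiff_t
byte_offset (const Block *dst, int idx, int k)
{
  const char *base = reinterpret_cast<const char *> (dst);
  const char *elem = reinterpret_cast<const char *> (&dst[idx].data[k]);
  return elem - base;
}
```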

That difference affects pass_lim (lim2). The loads from coeffs[6], coeffs[7],
and coeffs[8] (byte offsets +48, +56, +64 from coeffs) do not change inside
the outer loop, but:

  - On 14.2, lim2 determines that they do not alias the first dst store and
    moves them before the loop. The dump shows:

    Moving statement
    _47 = MEM[(const double *)coeffs_71(D) + 48B];
    (cost 20) out of loop 1.

    Moving statement
    _49 = MEM[(const double *)coeffs_71(D) + 56B];
    (cost 20) out of loop 1.

    Moving statement
    _51 = MEM[(const double *)coeffs_71(D) + 64B];
    (cost 20) out of loop 1.

  - On trunk, lim2 treats them as dependent on the first flattened dst store
    at +768B and leaves them inside the loop.
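
At source level, the hoisting that lim2 performs on 14.2 amounts to reading
the three tail coefficients once before the outer loop. The sketch below shows
that transformation on a simplified version of the tail computation. Block,
tail_in_loop and tail_hoisted are hypothetical names introduced for
illustration, and whether writing the kernel this way changes what trunk
generates is untested.

```cpp
#include <cassert>

// Hypothetical stand-in for SimdBlock8<double>.
struct Block
{
  double data[8];
};

// Tail as written in the kernel: coeffs[6..8] are re-read on every
// iteration unless the compiler proves the dst stores cannot modify them.
inline void
tail_in_loop (const double *coeffs, const Block *p0, const Block *p1,
              const Block *p2, Block *dst, int n)
{
  for (int i = 0; i < n; ++i)
    for (int k = 0; k < 8; ++k)
      dst[i].data[k] = coeffs[6] * p0[i].data[k]
                       + coeffs[7] * p1[i].data[k]
                       + coeffs[8] * p2[i].data[k];
}

// Manually hoisted form: the three loads happen once, before the loop.
// This is only equivalent when dst does not overlap coeffs.
inline void
tail_hoisted (const double *coeffs, const Block *p0, const Block *p1,
              const Block *p2, Block *dst, int n)
{
  const double c6 = coeffs[6];
  const double c7 = coeffs[7];
  const double c8 = coeffs[8];

  for (int i = 0; i < n; ++i)
    for (int k = 0; k < 8; ++k)
      dst[i].data[k] = c6 * p0[i].data[k]
                       + c7 * p1[i].data[k]
                       + c8 * p2[i].data[k];
}
```

The two functions agree whenever the output blocks do not overlap the
coefficient array, which is exactly the property lim2 has to establish before
it can hoist the loads itself.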

Further analysis shows that trunk SRA sets grp_same_access_path = 0 on
acc_tail.data, while on 14.2 it stays 1.

Reverting these three commits on trunk brings back 14.2-like store shapes and
lim2 behaviour in our small tests:

  40445711b8a  sra: Clear grp_same_access_path ... (PR118924)
  07d24367002  sra: Avoid creating TBAA hazards (PR118924)
  0c286ea4006  sra: Dont use build_reconstructed_reference ... (PR122976)


Issue B — Loop vectorization (ZMM vs XMM)
-----------------------------------------
We are not sure whether Issue A and Issue B are related. What we do see is
that reverting only the three SRA commits above makes the IR before the loop
vectorizer match 14.2 again, but does not bring back 14.2-style
vectorization (ZMM) on the inner k loop.

At -O3:

  - GCC 14.2: the loop vectorizer uses wide AVX-512 (ZMM). The -fopt-info
    output mentions trying again with SLP turned off.

    simd_coeff_apply_microbench.i:63:24: optimized: basic block part vectorized using 64 byte vectors
    simd_coeff_apply_microbench.i:63:24: optimized: basic block part vectorized using 64 byte vectors
    simd_coeff_apply_microbench.i:69:15: optimized: basic block part vectorized using 64 byte vectors
    simd_coeff_apply_microbench.i:64:30: optimized: basic block part vectorized using 64 byte vectors
    simd_coeff_apply_microbench.i:64:30: optimized: basic block part vectorized using 64 byte vectors

  - GCC trunk: the same loop is often vectorized with narrower XMM code and
    different SLP / loop-vectorizer choices.

    simd_coeff_apply_microbench.i:41:1: optimized: basic block part vectorized using 16 byte vectors
    simd_coeff_apply_microbench.i:41:1: optimized: basic block part vectorized using 16 byte vectors
    <...More dumps here...>
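
One rough way to quantify the width difference between the two compilers is
to count vector register mentions in the generated assembly. The helper below
is a hypothetical sketch and not part of the workflow described in this
report. It assumes AT&T syntax as emitted by g++ -S, where vector registers
appear as %xmmN, %ymmN and %zmmN, and it counts textual occurrences rather
than instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Number of times each vector register class is mentioned.
struct RegCounts
{
  std::size_t xmm = 0;
  std::size_t ymm = 0;
  std::size_t zmm = 0;
};

// Count non-overlapping occurrences of `needle` in `line`.
inline std::size_t
count_occurrences (const std::string &line, const std::string &needle)
{
  std::size_t n = 0;
  for (std::size_t pos = line.find (needle); pos != std::string::npos;
       pos = line.find (needle, pos + needle.size ()))
    ++n;
  return n;
}

// Scan an assembly listing line by line and tally register mentions.
inline RegCounts
count_vector_regs (std::istream &in)
{
  RegCounts c;
  std::string line;
  while (std::getline (in, line))
    {
      c.xmm += count_occurrences (line, "%xmm");
      c.ymm += count_occurrences (line, "%ymm");
      c.zmm += count_occurrences (line, "%zmm");
    }
  return c;
}

// Convenience wrapper for assembly text that is already in memory.
inline RegCounts
count_vector_regs_in_text (const std::string &text)
{
  std::istringstream in (text);
  return count_vector_regs (in);
}
```

To use it on a real build, open the .s file produced by the command above
with std::ifstream and pass the stream to count_vector_regs, once per
compiler, then compare the xmm and zmm totals.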

We found several 2025 vectorizer commits that might explain this, including
PR115895 (1b5d2ccd060) and commits that removed some non-SLP loop-vector
paths (cfeee375ecc, da012141c28, 1ae9e3c88ea). Reverting that group might
help, but we could not complete the revert: the changes are large and the
reverts conflict in tree-vect-loop.cc.

Both issues come from the same test program and the same application loop.
We think they may have different causes, but we are not sure. We are not
asking for a fix right now; we would like guidance on whether to focus on
SRA/TBAA, LIM, or the loop vectorizer.
