https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125750

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2026-06-12
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
There are quite a few things going on in those examples.

For instance,

        movi    d27, #0
        mov     z26.d, z27.d
        mov     z25.d, z27.d
        mov     z24.d, z27.d
        mov     z23.d, z27.d
        mov     z22.d, z27.d
        mov     z21.d, z27.d
        mov     z20.d, z27.d
        mov     z19.d, z27.d
        mov     z18.d, z27.d
        mov     z17.d, z27.d
        mov     z16.d, z27.d

in the outer loop of compute_region_directions is blatantly dumb.  I have a
patch that fixes this.

I'll break these out in separate tickets next week, but a quick couple:

Our cost model is rejecting BB vectorization of compute_region_means (it works
with -mmax-vectorization and shows good codegen).

As for the main reported problem, the unrolling: this is the SLP build failing.

https://godbolt.org/z/873Ene4eW

focuses on this.

Note that LLVM vectorized this using Adv. SIMD.

In GCC multi-lane SLP build is failing:

missed:   SLP induction not supported for variable-length vectors.

and we fall back to single lane SLP.
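
As a rough illustration of the kind of loop involved, here is a hypothetical
reduced example (not the testcase from the godbolt link; the function name and
constants are made up).  The two stores form a two-lane SLP group whose
operands are both inductions on the loop counter.  With fixed-length vectors
the per-lane start and step values can be built element by element; with
variable-length vectors the number of lanes is not known at compile time.
Whether this exact loop produces the message above is an assumption.

```c
#include <stddef.h>

/* Hypothetical example: two interleaved output streams, each derived
   from the loop counter.  The pair of stores is a two-lane SLP group
   whose operands are the inductions i and 3*i + 1.  */
void fill_pairs(int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (int)i;
        out[2 * i + 1] = (int)(3 * i + 1);
    }
}
```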

For single lane SLP to succeed each stream becomes a LOAD_LANES, i.e. we load
and permute.
So we didn't unroll, we just vectorized every stream individually.  LLVM has
done the same thing, however they do so using smaller scalar loads and creating
vectors from them:

        ldp     d30, d31, [x11]
        movprfx z28, z21
        add     z28.d, z28.d, #4
        ushll   v22.8h, v22.8b, #0
        ldr     d29, [x11, #880]
        ldr     d8, [x11, #888]
        add     x9, x9, #4
        cmgt    v25.2s, v3.2s, v25.2s
        zip1    v9.2s, v30.2s, v29.2s
        zip2    v29.2s, v30.2s, v29.2s
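
For reference, a scalar model of what the zip1/zip2 pair in that snippet
computes on two-lane .2s vectors.  The type and function names are invented
for illustration; this is only a sketch of the lane semantics, not of how
either compiler generates the code.

```c
#include <stdint.h>

/* Two 32-bit lanes, standing in for a .2s vector register.  */
typedef struct { uint32_t lane[2]; } v2s;

/* ZIP1 on .2s: take lane 0 of each input -> { a[0], b[0] }.  */
v2s zip1_2s(v2s a, v2s b)
{
    v2s r = { { a.lane[0], b.lane[0] } };
    return r;
}

/* ZIP2 on .2s: take lane 1 of each input -> { a[1], b[1] }.  */
v2s zip2_2s(v2s a, v2s b)
{
    v2s r = { { a.lane[1], b.lane[1] } };
    return r;
}
```

Applied to two 64-bit loads from different rows, the pair amounts to a 2x2
transpose of the 32-bit elements.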

Using -mautovec-preference=asimd-only gives us much better code as well (though
still suboptimal).

We pick SVE because the cost model thinks that load with gaps using LD4 is
beneficial.  Part of it is the broken load lanes costing that I was arguing
with Richard about.
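
A minimal sketch of what "load with gaps" means here, assuming a hypothetical
array of four-int records of which the loop reads only two fields (the layout
and names are invented, not taken from the testcase).  Mapping such an access
onto LD4 loads all four fields of every record and throws away the unused
ones, which is why the costing of that choice matters.

```c
#include <stddef.h>

/* Hypothetical layout: records of four ints stored back to back.  */
enum { STRIDE = 4 };

/* Reads only fields 0 and 2 of each record; fields 1 and 3 are the
   gaps in the access group.  */
long sum_fields_0_and_2(const int *rec, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += rec[STRIDE * i + 0] + rec[STRIDE * i + 2];
    return sum;
}
```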

So for the loop above, from a quick look:

1. Fix the costing; need to revive the patches
2. See if we can support SLP inductions with VLA
3. See why we didn't optimize the permutes when using Adv. SIMD

I'll try to break these down into smaller examples and subtasks next week.
