https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125201

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> I'll note the change affects BB vectorization only.

Aye, the loop has a cross-iteration dependency, so loop vectorization fails
but SLP succeeds.

BB vectorization can now vectorize it using

 vect__3.15_45 = VEC_PERM_EXPR <vect__3.13_4, vect__3.13_4, { 0, 0, 1, 2 }>;

which is fine, LIM then lifts it above both loops.
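
For readers less familiar with the GIMPLE, here is a minimal scalar sketch in C
of what that permute computes. It is only a model of the semantics, not
vectorizer output, and the names vec_perm_model, perm_0012 and self_check are
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NLANES 4

/* Scalar model of VEC_PERM_EXPR <a, b, sel> for 4-lane vectors: lane i of
   the result is element sel[i] of the concatenation of a and b, with the
   selector taken modulo 2 * NLANES.  Hypothetical helper, not a GCC API.  */
void vec_perm_model(const float *a, const float *b,
                    const unsigned *sel, float *out)
{
  for (size_t i = 0; i < NLANES; i++)
    {
      unsigned idx = sel[i] % (2 * NLANES);
      out[i] = idx < NLANES ? a[idx] : b[idx - NLANES];
    }
}

/* The selector from the dump, { 0, 0, 1, 2 }, with both inputs the same
   vector.  out must not alias v.  */
void perm_0012(const float *v, float *out)
{
  static const unsigned sel[NLANES] = { 0, 0, 1, 2 };
  vec_perm_model(v, v, sel, out);
}

/* Lane 0 is duplicated and the remaining lanes move up by one.  */
void self_check(void)
{
  const float v[NLANES] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float r[NLANES];
  perm_0012(v, r);
  assert(r[0] == 1.0f && r[1] == 1.0f && r[2] == 2.0f && r[3] == 3.0f);
}
```

The { 0, 0, 1, 2 } selector is the constant that has to sit in a vector
register when the permute is done with TBL.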

The problem is register allocation.

reload allocated the index array to v26, which is a caller-saved register, so
the call (bl dummy) forces it to be reloaded afterwards.

It decides to just rematerialize it as a load.

So not a vectorizer issue in and of itself.

On SVE-enabled cores we can replace the TBL here with an INSR to itself.
On non-SVE cores we can use EXT + INS to avoid the constant, but this only
makes sense if we know the index will be spilled; otherwise it's slower than
the TBL.
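
A rough scalar sketch in C of why both alternatives give the same lanes as the
{ 0, 0, 1, 2 } permute without needing an index constant. This models the lane
movement only, at element granularity; the helper names are hypothetical and
the EXT start lane (NLANES - 1) is my assumption about how the sequence would
be formed, not something taken from a patch.

```c
#include <assert.h>
#include <string.h>

#define NLANES 4

/* INSR-style model: shift every lane up by one and put x in lane 0.  */
void insr_model(float *v, float x)
{
  memmove(v + 1, v, (NLANES - 1) * sizeof *v);
  v[0] = x;
}

/* EXT-style model: take NLANES consecutive elements from the concatenation
   of lo and hi, starting at lane 'start' (0 <= start <= NLANES).  */
void ext_model(const float *lo, const float *hi, unsigned start, float *out)
{
  for (unsigned i = 0; i < NLANES; i++)
    {
      unsigned idx = start + i;
      out[i] = idx < NLANES ? lo[idx] : hi[idx - NLANES];
    }
}

/* "INSR to itself": insert the vector's own lane 0.  */
void perm_via_insr(const float *v, float *out)
{
  memcpy(out, v, NLANES * sizeof *v);
  insr_model(out, v[0]);
}

/* EXT rotates to { v3, v0, v1, v2 }, then INS overwrites lane 0 with v0.  */
void perm_via_ext_ins(const float *v, float *out)
{
  ext_model(v, v, NLANES - 1, out);
  out[0] = v[0];
}

/* Both sequences should produce { v0, v0, v1, v2 }.  */
void check_equivalence(void)
{
  const float v[NLANES] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float a[NLANES], b[NLANES];
  perm_via_insr(v, a);
  perm_via_ext_ins(v, b);
  assert(a[0] == 1.0f && a[1] == 1.0f && a[2] == 2.0f && a[3] == 3.0f);
  assert(memcmp(a, b, sizeof a) == 0);
}
```

Neither sequence needs the selector vector, which is what removes the value
that currently has to stay live across the call.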

The generated code may still not be better than scalar, though, because of the
loop-carried value that predcom inserts.

Indeed, -fdisable-tree-pcom generates much more reasonable code. The only
letdown is that the SLP build with V4SF fails, but it avoids all the costly and
serial vector constructors inside the loop.

So we should probably increase the cost of

note: node (external) 0x23544438 (max_nunits=1, refcnt=1) vector(4) float
note:   { a_I_lsm0.8_24, _7, _11, _15 }

when we are doing BB SLP inside a loop nest, and account for the cost of
keeping the loop invariant value live.

The TBL issue can be fixed with a simple pattern, but INSR is not a very fast
instruction on older cores.
