[Bug tree-optimization/116463] [15 Regression] complex multiply vectorizer detection failures after r15-3087-gb07f8a301158e5

tnfchris at gcc dot gnu.org via Gcc-bugs Wed, 04 Dec 2024 01:56:51 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463


--- Comment #32 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #31)
> (In reply to Tamar Christina from comment #29)
> > (In reply to Tamar Christina from comment #27)
> > > > > 
> > > > > We DO already impose any order on them, but the other operand is 
> > > > > oddodd, so
> > > > > the overall order ends up being oddodd because any known permute 
> > > > > overrides
> > > > > unknown ones.
> > > > 
> > > > So what's the desired outcome?  I guess PERM_UNKNOWN?  I guess it's
> > > > the "other operand" of an add?  What's the (bad) effect of classifying
> > > > it as ODDODD (optimistically)?
> > > > 
> > > > > So the question is, can we not follow externals in a constructor to 
> > > > > figure
> > > > > out if how they are used they all read from the same base and in 
> > > > > which order?
> > > > 
> > > > I don't see how it makes sense to do this.  For the above example, 
> > > > what's
> > > > the testcase exhibiting this (and on which arch)?
> > > 
> > > I've been working on a fix from a different angle for this, which also
> > > covers another GCC 14 regression that went unnoticed. I'll post after
> > > regressions finish.
> > 
> > So I've formalized the handling of TOP a bit better.  Which gets it to
> > recognize it again, however, it will be dropped as it's not profitable.
> > 
> > The reason it's not profitable is the canonicalization issue mentioned
> > above.  This has split the imaginary and real nodes into different
> > computations.
> >
> > So no matter what you do in the SLP tree, the attached digraph won't make
> > the loads of _5 linear.  Are you ok with me trying that Richi?
> 
> I can't make sense of that graph - the node feeding the store seems to have
> wrong scalar stmts?
> 
> What's the testcase for this (and on what arch?).
> 

void fms_elemconjsnd(_Complex TYPE a[restrict N], _Complex TYPE b,
                     _Complex TYPE c[restrict N]) {
  for (int i = 0; i < N; i++)
    c[i] -= a[i] * ~b;
}

compiled with -Ofast -march=armv8.3-a

#define TYPE double
#define I 1.0i
#define N 200
void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  for (int i=0; i < N; i++)
    c[i] -= a[i] * (b[i] * I * I);
}

void fms180snd_1 (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  _Complex TYPE t = I;
  for (int i=0; i < N; i++)
    c[i] -= a[i] * (b[i] * t * t);
}

is another one, where they are the same things, but 1st one is matched and
second one doesn't.

> But yes, the loads of *5 won't get linear here, but at least the
> permute node feeding the complex-add-rot270 can be elided, eventually
> even making the external _53, b$real_11 match the other with different
> order (though we don't model that, cost-wise).

But without the loads getting linearize the match will never work as multi-lane
SLP will be immediately cancelled because it assumed load-lanes is cheaper
(it's not, but load lanes doesn't get costed) and that's why there's a load
permute optimization step after complex pattern matching.

The point is however, that no permute is needed. *not even for the loads*.

GCC 13 generated:

fms_elemconjsnd:
        fneg    d1, d1
        mov     x2, 0
        dup     v4.2d, v0.d[0]
        dup     v3.2d, v1.d[0]
.L2:
        ldr     q1, [x0, x2]
        ldr     q0, [x1, x2]
        fmul    v2.2d, v3.2d, v1.2d
        fcadd   v0.2d, v0.2d, v2.2d, #270
        fmls    v0.2d, v4.2d, v1.2d
        str     q0, [x1, x2]
        add     x2, x2, 16
        cmp     x2, 3200
        bne     .L2
        ret

which was the optimal sequence.

[Bug tree-optimization/116463] [15 Regression] complex multiply vectorizer detection failures after r15-3087-gb07f8a301158e5

Reply via email to