https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463
--- Comment #32 from Tamar Christina <tnfchris at gcc dot gnu.org> --- (In reply to Richard Biener from comment #31) > (In reply to Tamar Christina from comment #29) > > (In reply to Tamar Christina from comment #27) > > > > > > > > > > We DO already impose any order on them, but the other operand is > > > > > oddodd, so > > > > > the overall order ends up being oddodd because any known permute > > > > > overrides > > > > > unknown ones. > > > > > > > > So what's the desired outcome? I guess PERM_UNKNOWN? I guess it's > > > > the "other operand" of an add? What's the (bad) effect of classifying > > > > it as ODDODD (optimistically)? > > > > > > > > > So the question is, can we not follow externals in a constructor to > > > > > figure > > > > > out if how they are used they all read from the same base and in > > > > > which order? > > > > > > > > I don't see how it makes sense to do this. For the above example, > > > > what's > > > > the testcase exhibiting this (and on which arch)? > > > > > > I've been working on a fix from a different angle for this, which also > > > covers another GCC 14 regression that went unnoticed. I'll post after > > > regressions finish. > > > > So I've formalized the handling of TOP a bit better. Which gets it to > > recognize it again, however, it will be dropped as it's not profitable. > > > > The reason it's not profitable is the canonicalization issue mentioned > > above. This has split the imaginary and real nodes into different > > computations. > > > > So no matter what you do in the SLP tree, the attached digraph won't make > > the loads of _5 linear. Are you ok with me trying that Richi? > > I can't make sense of that graph - the node feeding the store seems to have > wrong scalar stmts? > > What's the testcase for this (and on what arch?). > void fms_elemconjsnd(_Complex TYPE a[restrict N], _Complex TYPE b, _Complex TYPE c[restrict N]) { for (int i = 0; i < N; i++) c[i] -= a[i] * ~b; } compiled with -Ofast -march=armv8.3-a #define TYPE double #define I 1.0i #define N 200 void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { for (int i=0; i < N; i++) c[i] -= a[i] * (b[i] * I * I); } void fms180snd_1 (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N], _Complex TYPE c[restrict N]) { _Complex TYPE t = I; for (int i=0; i < N; i++) c[i] -= a[i] * (b[i] * t * t); } is another one, where they are the same things, but 1st one is matched and second one doesn't. > But yes, the loads of *5 won't get linear here, but at least the > permute node feeding the complex-add-rot270 can be elided, eventually > even making the external _53, b$real_11 match the other with different > order (though we don't model that, cost-wise). But without the loads getting linearize the match will never work as multi-lane SLP will be immediately cancelled because it assumed load-lanes is cheaper (it's not, but load lanes doesn't get costed) and that's why there's a load permute optimization step after complex pattern matching. The point is however, that no permute is needed. *not even for the loads*. GCC 13 generated: fms_elemconjsnd: fneg d1, d1 mov x2, 0 dup v4.2d, v0.d[0] dup v3.2d, v1.d[0] .L2: ldr q1, [x0, x2] ldr q0, [x1, x2] fmul v2.2d, v3.2d, v1.2d fcadd v0.2d, v0.2d, v2.2d, #270 fmls v0.2d, v4.2d, v1.2d str q0, [x1, x2] add x2, x2, 16 cmp x2, 3200 bne .L2 ret which was the optimal sequence.