> On Wed, Aug 6, 2025 at 9:40 AM Jakub Jelinek <ja...@redhat.com> wrote:
> > On Wed, Aug 06, 2025 at 08:48:55AM +0200, Richard Biener via Gcc wrote:
> > > For loops the canonical place to perform such optimization is the loop
> > > distribution pass which already recognizes
> > > memcpy but also strlen (strcmp is more like strlen).
> > >
> > > For straight-line code there's also a bugreport about supporting
> > > vectorization of such
> > > sequences from the basic-block vectorizer. Passes with related
> > > transforms are
> > > ifcombine (it now also does limited load merging), store-merging and
> > > phiopt.
> >
> > I think store-merging doesn't have useful infrastructure for such kind of
> > optimization. This optimization needs to analyze many short basic blocks.
> > Maybe also reassoc, though that one is for the case where
> > some conditions can use |/& and others ||/&& in separate basic blocks.
> > Or perhaps widening mul, though that one is too late for vectorization.
> >
> > The question is what the pass should be able to pattern recognize, whether
> > just non-inlined functions which perform the per-member comparison (and
> > which, whether just all equality or all non-equality comparisons (note, one
> > would need to handle both in any case and just verify they jump to the same
> > destination with the same return values), or also <=> cases (for unsigned
> > char members or little-endian larger unsigned vars)), or also when they are
> > inlined and say stored etc.

I've looked at the pattern LLVM recognizes and there are indeed a lot of different ways we could recognize the chains and generalize the optimization.

The way I do the pattern recognition now is by first normalizing the conditions to ==, redirecting the edges if needed; since memcmp only tells us whether all fields were equal or not, a mix of these operators or of their outgoing edges wouldn't work. I then discover chains where:

1. Every block in the chain performs two loads followed by an equality comparison of the loaded values. The PHI argument corresponding to each block in the chain must be the same, except possibly for the TRUE edge out of the final block in the chain (which may reach the PHI indirectly, through a path of empty blocks with fallthru edges).

2. If the condition is false, an edge is followed (possibly indirectly) into the PHI node. If the condition is true, an edge is taken into the next block in the chain, which we follow to keep expanding the known chain.

3. The final comparison block in the chain does not need to lead to the PHI on its TRUE edge. I believe LLVM does require this, but I don't see why it's needed: either way we can record the edge, merge the comparisons, and then reattach the edge to the new merged comparison block. The condition for that edge to be taken is still that all compared fields were equal, so control flow is not affected. Relaxing this helps with recognizing partial chains of comparisons that precede e.g. an array comparison that loops (which the pass currently does not recognize).

Since a chain initially has no clear start or end, I pick candidate blocks in the order they are listed in the PHI node and then discover their successors and predecessors. This doesn't recognize the pattern generated by <=>, but I tried to implement it as generally as possible, so that chains are recognized in any case where merging them does not affect the program's logic (given that all sizes are known and sufficient memory is allocated).
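
To make that concrete, here is a small illustrative sketch of my own (not taken from the pass; struct P, eq_chain and eq_merged are made-up names) of the kind of chain this is meant to catch and the form it is merged into, assuming there is no padding between the compared fields and both objects are fully allocated:

struct P { unsigned int a, b, c, d; };

/* Each comparison becomes a block with two loads and an equality test;
   all false edges feed the same PHI argument (the "return false" value)
   and each true edge leads to the next comparison in the chain.  */
bool eq_chain (const P *x, const P *y)
{
  if (x->a != y->a) return false;
  if (x->b != y->b) return false;
  if (x->c != y->c) return false;
  if (x->d != y->d) return false;
  return true;
}

/* After merging, the whole chain is equivalent to one memcmp over the
   contiguous range covered by the compared fields.  */
bool eq_merged (const P *x, const P *y)
{
  return __builtin_memcmp (x, y, sizeof (P)) == 0;
}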

I think it's best to keep the recognition implementation itself general and to add any clear restrictions around its use separately (for example, only applying it to defaulted equality operators), or at least to keep those restrictions distinct from the general pattern matching. Limiting its application by folding restricting logic into the pattern recognizer itself could prove unpredictable.

> >
> > Some food for thought, have a look at say
> > -O3 -fkeep-inline-functions -fdump-tree-all
> > #include <compare>
> >
> > struct S { unsigned char a, b, c, d; unsigned e; bool operator== (const S
> > &) const = default; };
> > struct T { unsigned char a, b, c, d; unsigned e; auto operator<=> (const T
> > &) const = default; };
> > void foo (const S *a, const S *b, int *c) { *c = *a == *b ? 42 : -13; }
> > void bar (const T *a, const T *b, std::strong_ordering *c) { *c = *a <=>
> > *b; }
> > bool baz (const S *a, const S *b) {
> >   if (a->a != b->a) return false;
> >   if (a->b != b->b) return false;
> >   if (a->c != b->c) return false;
> >   if (a->d != b->d) return false;
> >   if (a->e != b->e) return false;
> >   return true;
> > }
> > std::strong_ordering qux (const T *a, const T *b) {
> >   if (auto c = a->a <=> b->a; c != 0) return c;
> >   if (auto c = a->b <=> b->b; c != 0) return c;
> >   if (auto c = a->c <=> b->c; c != 0) return c;
> >   if (auto c = a->d <=> b->d; c != 0) return c;
> >   return a->e <=> b->e;
> > }
> > void freddy ();
> > void garply (const S *a, const S *b) { if (*a == *b) freddy (); }
> > void corge (const T *a, const T *b) { if ((*a <=> *b) < 0) freddy (); }
> > In some cases it will be something like:
> >   <bb 2> :
> >   _1 = this_13(D)->a;
> >   _2 = _14(D)->a;
> >   if (_1 == _2)
> >     goto <bb 4>; [INV]
> >   else
> >     goto <bb 3>; [INV]
> >
> >   <bb 3> :
> >   // predicted unlikely by early return (on trees) predictor.
> >   goto <bb 12>; [INV]
> >
> >   <bb 4> :
> >   _3 = this_13(D)->b;
> >   _4 = _14(D)->b;
> >   if (_3 == _4)
> >     goto <bb 6>; [INV]
> >   else
> >     goto <bb 5>; [INV]
> >
> >   <bb 5> :
> >   // predicted unlikely by early return (on trees) predictor.
> >   goto <bb 12>; [INV]
> > ...
> >   <bb 12> :
> >   # _11 = PHI <0(5), 1(10), 0(3), 0(11), 0(9), 0(7)>
> >   return _11;
> > i.e. end up with a PHI and sometimes with the forwarder bbs, sometimes
> > without. I see that fnsplit + some IPA optimizations for the
> > S::operator== case split the last part into a separate function and
> > something during IPA? questionably moved the loads earlier:
> > ...
> >   <bb 4> [local count: 268435456]:
> >   _5 = this_9(D)->c;
> >   _6 = _10(D)->c;
> >   if (_5 == _6)
> >     goto <bb 5>; [50.00%]
> >   else
> >     goto <bb 9>; [50.00%]
> >
> >   <bb 5> [local count: 134217728]:
> >   _12 = MEM[(unsigned char *)this_9(D) + 3B];
> >   _13 = MEM[(unsigned int *)this_9(D) + 4B];
> >   _14 = MEM[(unsigned char *)_10(D) + 3B];
> >   _15 = MEM[(unsigned int *)_10(D) + 4B];
> >   if (_12 == _14)
> >     goto <bb 6>; [50.00%]
> >   else
> >     goto <bb 8>; [50.00%]
> >
> >   <bb 6> [local count: 67108864]:
> >   if (_13 == _15)
> >     goto <bb 8>; [50.00%]
> >   else
> >     goto <bb 7>; [50.00%]
> >
> >   <bb 7> [local count: 33554432]:
> >
> >   <bb 8> [local count: 134217728]:
> >   # _20 = PHI <0(5), 1(6), 0(7)>
> >
> >   <bb 9> [local count: 1073741824]:
> >   # _7 = PHI <0(3), 0(4), 0(2), _20(8)>
> >   return _7;
> >
> > But in other cases it won't be about PHIs but just about where exactly it
> > jumps (in particular, have a look at garply and corge).

I initially put the optimization after this point, so I'm familiar with this behaviour; it's why I now run it among the early optimizations, since even more wacky stuff appeared later on. I'm still not sure why it splits off the last few comparisons and handles them differently, sometimes loading both operands of several comparisons, comparing and storing the results, then AND'ing them. I tried to get this behaviour to show up in the output binary as well, but it always comes out in the exact same read, cmp, branch structure, so I'm not sure what the point of it is.

> > The spaceship cases are even more difficult to pattern recognize and to
> > transform that to memcmp it needs to be TYPE_UNSIGNED char (or
> > unsigned short, unsigned int, unsigned long etc. for little endian)
> > comparison.
> > And as I've tried to mention in the PR, I'm actually not sure when
> > the vectorization of memcmp is a safe thing to do vs. pointers to
> > incompletely allocated structs.
> > Consider say
> > struct U { char a; char b[63]; }; where the b is actually poor man's
> > flexible array member (because e.g. standard C++ doesn't have that),
> > where allocations are done with offsetof (U, b) + n sizes or so.
> > So, the fact that say b[3] is mapped doesn't mean b[4] will be mapped.
> > Now, if you do some comparison of such structs (== or <=>), if say
> > b[3] is always different between them, the original source will never
> > read b[4], but perhaps vectorized memcmp might want to and crash.
> >
> > This isn't a problem when both arguments of the memcmp are pointers
> > to variables, just when they are pointers to something where we don't
> > know the size. It can be vectorized if we make sure to always align
> > it, but that might result in too large code.

> If the accesses can possibly fault then it's like vectorization of
> early breaks - you'd either need alignment guarantees or something
> like first-fault-loads (aka loads that do not fault).
> Depending on how exactly pointers are formed you can't use
> the guarantee that a pointer has to point to a valid object to
> ignore ordering either, so for char *, p[0] || p[-1] cannot be
> re-ordered (you might argue p[1] || p[0] can because you have
> that pointer to p[0])
> Richard.

I see what you mean. I think there is then indeed probably no way to apply the optimization to all functions, since there is no way to know whether the early break on inequality of a field was arbitrary or whether it indicates that the following data will be out of range.

My main goal currently, though, is applying the optimization to the default comparison operator, and is it not undefined behaviour to call it on an incompletely allocated struct? At -O2, this:

#include <array>
#include <cstdint>
#include <cstdlib>

bool __attribute__((noinline)) f(std::array<int32_t, 4>* x, std::array<int32_t, 4>* y) {
  return *x == *y;
}

int main() {
  std::array<int32_t, 4>* a;
  std::array<int32_t, 4>* b;
  a = (std::array<int32_t, 4>*)malloc(sizeof(std::array<int32_t, 3>));
  b = (std::array<int32_t, 4>*)malloc(sizeof(std::array<int32_t, 3>));
  return f(a, b);
}

also generates a memcmp call which reads out of bounds of the allocated memory; doesn't this show that for default equality we can assume all of the structure's data was allocated?
https://www.open-std.org/JTC1/SC22/WG21/docs/papers/2014/n4114.htm does specify "a comparison of the subobjects that have the same position in both objects against each other until one subobject is not equal to the other", but I don't see anything regarding when the memory should be loaded.

Either way, I think the run-time impact of this issue does demand, at the very least, some way of applying this eventual vectorization of comparisons (possibly through different means). Even if it's through a flag or attribute etc., when properly implemented this would speed up default comparisons and decrease code size (https://docs.google.com/document/d/1CKp8cIfURXbPLSap0jFio7LW4suzR10u5gX4RBV0k7c/edit?tab=t.0#heading=h.mub2vspvhz42). It would also help make code that currently uses memcmp to get vectorized struct comparisons much cleaner.

>
> > > > 2. When a struct A has a struct B field and we compare 2 A
> > > > instances, when getting to this pass, A::operator== still contains a
> > > > call to B::operator==, making it hard to merge the comparisons of B
> > > > together with those of A. What is weird to me is that even after
> > > > einline and IPA inline passes it seems it is still there as a call,
> > > > instead of being inlined.
> > >
> > > I'd not do this as an early pass (and I'd not do it as a separate pass
> > > anyway).
> > >
> > > > These are of course 2 cases that could be handled by implementing
> > > > special cases (e.g. detecting the loop structure, recursing into the
> > > > called function, then coming back and merging anyway), but I feel as
> > > > though tackling the problems in this way will result in infinite
> > > > complexity as I discover more cases. That's why I was hoping for some
> > > > feedback on this, does the general approach I am taking seem logical
> > > > for GCC, should I handle these as special cases, is there some other
> > > > way of going about it?
> > > > Any tips very much appreciated, thank you!
> > >
> > > As for the case with the call I'd figure why we do not inline.
> >
> > Jakub