> On Wed, Aug 6, 2025 at 9:40 AM Jakub Jelinek <ja...@redhat.com> wrote:
> > On Wed, Aug 06, 2025 at 08:48:55AM +0200, Richard Biener via Gcc wrote:
> > > For loops the canonical place to perform such optimization is the loop
> > > distribution pass which already recognizes
> > > memcpy but also strlen (strcmp is more like strlen).
> > >
> > > For straight-line code there's also a bugreport about supporting
> > > vectorization of such
> > > sequences from the basic-block vectorizer. Passes with related
> > > transforms are
> > > ifcombine (it now also does limited load merging), store-merging and
> > > phiopt.
> >
> > I think store-merging doesn't have useful infrastructure for such kind of
> > optimization. This optimization needs to analyze many short basic blocks.
> > Maybe also reassoc, though that one is for the case where
> > some conditions can use |/& and others ||/&& in separate basic blocks.
> > Or perhaps widening mul, though that one is too late for vectorization.
> >
> > The question is what the pass should be able to pattern recognize, whether
> > just non-inlined functions which perform the per-member comparison (and
> > which, whether just all equality or all non-equality comparisons (note, one
> > would need to handle both in any case and just verify they jump to the same
> > destination with the same return values), or also <=> cases (for unsigned
> > char members or little-endian larger unsigned vars)), or also when they are
> > inlined and say stored etc.

I've looked at the pattern LLVM recognizes and there are indeed a lot of different ways we could recognize the chains and generalize the optimization.

The way I do the pattern recognition now is by first normalizing the conditions to ==, redirecting the edges if needed; since memcmp only tells us whether all fields were equal or not, a mix of these operators or of their outgoing edges wouldn't work. I then discover chains where:

1. Every block in the chain performs two loads followed by an equality comparison of the loaded values. The PHI argument corresponding to each block in the chain must be the same, except possibly for the TRUE edge out of the final block in the chain (which may reach the PHI indirectly, through a path of empty blocks with fallthru edges).

2. If the condition is false, an edge is followed (possibly indirectly) into the PHI node. If the condition is true, an edge is taken into the next block in the chain, which we follow to keep expanding the known chain.

3. The final comparison block in the chain does not need to lead to the PHI on its TRUE edge. I believe LLVM does require this, but I don't see why it's needed: either way we can record the edge, merge the comparisons, and then reattach the edge to the new merged comparison block. The condition for that edge to be taken is still that all compared fields were equal, so control flow is not affected. Relaxing this helps with recognizing partial chains of comparisons that precede e.g. an array comparison that loops (which the pass currently does not recognize).

Since a chain initially has no clear start or end, I pick candidate blocks in the order they are listed in the PHI node and then discover their successors and predecessors. This doesn't recognize the pattern generated by <=>, but I tried to implement it as generally as possible, so that chains are recognized in any case where merging them does not affect the program's logic (given that all sizes are known and sufficient memory is allocated).
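
To make that concrete, here is a small illustrative sketch of my own (not taken from the pass; struct P, eq_chain and eq_merged are made-up names) of the kind of chain this is meant to catch and the form it is merged into, assuming there is no padding between the compared fields and both objects are fully allocated:

struct P { unsigned int a, b, c, d; };

/* Each comparison becomes a block with two loads and an equality test;
   all false edges feed the same PHI argument (the "return false" value)
   and each true edge leads to the next comparison in the chain.  */
bool eq_chain (const P *x, const P *y)
{
  if (x->a != y->a) return false;
  if (x->b != y->b) return false;
  if (x->c != y->c) return false;
  if (x->d != y->d) return false;
  return true;
}

/* After merging, the whole chain is equivalent to one memcmp over the
   contiguous range covered by the compared fields.  */
bool eq_merged (const P *x, const P *y)
{
  return __builtin_memcmp (x, y, sizeof (P)) == 0;
}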

I think it's best to keep the recognition implementation itself general and to add any clear restrictions around its use separately (for example, only applying it to defaulted equality operators), or at least to keep those restrictions distinct from the general pattern matching. Limiting its application by folding restricting logic into the pattern recognizer itself could prove unpredictable.

> >
> > Some food for thought, have a look at say
> > -O3 -fkeep-inline-functions -fdump-tree-all
> > #include <compare>
> >
> > struct S { unsigned char a, b, c, d; unsigned e; bool operator== (const S
> > &) const = default; };
> > struct T { unsigned char a, b, c, d; unsigned e; auto operator<=> (const T
> > &) const = default; };
> > void foo (const S *a, const S *b, int *c) { *c = *a == *b ? 42 : -13; }
> > void bar (const T *a, const T *b, std::strong_ordering *c) { *c = *a <=>
> > *b; }
> > bool baz (const S *a, const S *b) {
> >   if (a->a != b->a) return false;
> >   if (a->b != b->b) return false;
> >   if (a->c != b->c) return false;
> >   if (a->d != b->d) return false;
> >   if (a->e != b->e) return false;
> >   return true;
> > }
> > std::strong_ordering qux (const T *a, const T *b) {
> >   if (auto c = a->a <=> b->a; c != 0) return c;
> >   if (auto c = a->b <=> b->b; c != 0) return c;
> >   if (auto c = a->c <=> b->c; c != 0) return c;
> >   if (auto c = a->d <=> b->d; c != 0) return c;
> >   return a->e <=> b->e;
> > }
> > void freddy ();
> > void garply (const S *a, const S *b) { if (*a == *b) freddy (); }
> > void corge (const T *a, const T *b) { if ((*a <=> *b) < 0) freddy (); }
> > In some cases it will be something like:
> >   <bb 2> :
> >   _1 = this_13(D)->a;
> >   _2 = _14(D)->a;
> >   if (_1 == _2)
> >     goto <bb 4>; [INV]
> >   else
> >     goto <bb 3>; [INV]
> >
> >   <bb 3> :
> >   // predicted unlikely by early return (on trees) predictor.
> >   goto <bb 12>; [INV]
> >
> >   <bb 4> :
> >   _3 = this_13(D)->b;
> >   _4 = _14(D)->b;
> >   if (_3 == _4)
> >     goto <bb 6>; [INV]
> >   else
> >     goto <bb 5>; [INV]
> >
> >   <bb 5> :
> >   // predicted unlikely by early return (on trees) predictor.
> >   goto <bb 12>; [INV]
> > ...
> >   <bb 12> :
> >   # _11 = PHI <0(5), 1(10), 0(3), 0(11), 0(9), 0(7)>
> >   return _11;
> > i.e. end up with a PHI and sometimes with the forwarder bbs, sometimes
> > without. I see that fnsplit + some IPA optimizations for the
> > S::operator== case split the last part into a separate function and
> > something during IPA? questionably moved the loads earlier:
> > ...
> >   <bb 4> [local count: 268435456]:
> >   _5 = this_9(D)->c;
> >   _6 = _10(D)->c;
> >   if (_5 == _6)
> >     goto <bb 5>; [50.00%]
> >   else
> >     goto <bb 9>; [50.00%]
> >
> >   <bb 5> [local count: 134217728]:
> >   _12 = MEM[(unsigned char *)this_9(D) + 3B];
> >   _13 = MEM[(unsigned int *)this_9(D) + 4B];
> >   _14 = MEM[(unsigned char *)_10(D) + 3B];
> >   _15 = MEM[(unsigned int *)_10(D) + 4B];
> >   if (_12 == _14)
> >     goto <bb 6>; [50.00%]
> >   else
> >     goto <bb 8>; [50.00%]
> >
> >   <bb 6> [local count: 67108864]:
> >   if (_13 == _15)
> >     goto <bb 8>; [50.00%]
> >   else
> >     goto <bb 7>; [50.00%]
> >
> >   <bb 7> [local count: 33554432]:
> >
> >   <bb 8> [local count: 134217728]:
> >   # _20 = PHI <0(5), 1(6), 0(7)>
> >
> >   <bb 9> [local count: 1073741824]:
> >   # _7 = PHI <0(3), 0(4), 0(2), _20(8)>
> >   return _7;
> >
> > But in other cases it won't be about PHIs but just about where exactly it
> > jumps (in particular, have a look at garply and corge).

I initially put the optimization after this point, so I'm familiar with this behaviour; it's why I now run it among the early optimizations, since even more wacky stuff appeared later on. I'm still not sure why it splits off the last few comparisons and handles them differently, sometimes loading both operands of several comparisons, comparing and storing the results, then AND'ing them. I tried to get this behaviour to show up in the output binary as well, but it always comes out in the exact same read, cmp, branch structure, so I'm not sure what the point of it is.

> > The spaceship cases are even more difficult to pattern recognize and to
> > transform that to memcmp it needs to be TYPE_UNSIGNED char (or
> > unsigned short, unsigned int, unsigned long etc. for little endian)
> > comparison.
> > And as I've tried to mention in the PR, I'm actually not sure when
> > the vectorization of memcmp is a safe thing to do vs. pointers to
> > incompletely allocated structs.
> > Consider say
> > struct U { char a; char b[63]; }; where the b is actually poor man's
> > flexible array member (because e.g. standard C++ doesn't have that),
> > where allocations are done with offsetof (U, b) + n sizes or so.
> > So, the fact that say b[3] is mapped doesn't mean b[4] will be mapped.
> > Now, if you do some comparison of such structs (== or <=>), if say
> > b[3] is always different between them, the original source will never
> > read b[4], but perhaps vectorized memcmp might want to and crash.
> >
> > This isn't a problem when both arguments of the memcmp are pointers
> > to variables, just when they are pointers to something where we don't
> > know the size. It can be vectorized if we make sure to always align
> > it, but that might result in too large code.

> If the accesses can possibly fault then it's like vectorization of
> early breaks - you'd either need alignment guarantees or something
> like first-fault-loads (aka loads that do not fault).
> Depending on how exactly pointers are formed you can't use
> the guarantee that a pointer has to point to a valid object to
> ignore ordering either, so for char *, p[0] || p[-1] cannot be
> re-ordered (you might argue p[1] || p[0] can because you have
> that pointer to p[0])
> Richard.

I see what you mean. I think there is then indeed probably no way to apply the optimization to all functions, since there is no way to know whether the early break on inequality of a field was arbitrary or whether it indicates that the following data will be out of range.

My main goal currently, though, is applying the optimization to the default comparison operator, and is it not undefined behaviour to call it on an incompletely allocated struct? At -O2, this:

#include <array>
#include <cstdint>
#include <cstdlib>

bool __attribute__((noinline)) f(std::array<int32_t, 4>* x, std::array<int32_t, 4>* y) {
  return *x == *y;
}

int main() {
  std::array<int32_t, 4>* a;
  std::array<int32_t, 4>* b;
  a = (std::array<int32_t, 4>*)malloc(sizeof(std::array<int32_t, 3>));
  b = (std::array<int32_t, 4>*)malloc(sizeof(std::array<int32_t, 3>));
  return f(a, b);
}

also generates a memcmp call which reads out of bounds of the allocated memory; doesn't this show that for default equality we can assume all of the structure's data was allocated?
https://www.open-std.org/JTC1/SC22/WG21/docs/papers/2014/n4114.htm does specify "a comparison of the subobjects that have the same position in both objects against each other until one subobject is not equal to the other", but I don't see anything regarding when the memory should be loaded.

Either way, I think the run-time impact of this issue does demand, at the very least, some way of applying this eventual vectorization of comparisons (possibly through different means). Even if it's through a flag or attribute etc., when properly implemented this would speed up default comparisons and decrease code size (https://docs.google.com/document/d/1CKp8cIfURXbPLSap0jFio7LW4suzR10u5gX4RBV0k7c/edit?tab=t.0#heading=h.mub2vspvhz42). It would also help make code that currently uses memcmp to get vectorized struct comparisons much cleaner.

>
> > > > 2. When a struct A has a struct B field and we compare 2 A
> > > > instances, when getting to this pass, A::operator== still contains a
> > > > call to B::operator==, making it hard to merge the comparisons of B
> > > > together with those of A. What is weird to me is that even after
> > > > einline and IPA inline passes it seems it is still there as a call,
> > > > instead of being inlined.
> > >
> > > I'd not do this as an early pass (and I'd not do it as a separate pass
> > > anyway).
> > >
> > > > These are of course 2 cases that could be handled by implementing
> > > > special cases (e.g. detecting the loop structure, recursing into the
> > > > called function, then coming back and merging anyway), but I feel as
> > > > though tackling the problems in this way will result in infinite
> > > > complexity as I discover more cases. That's why I was hoping for some
> > > > feedback on this, does the general approach I am taking seem logical
> > > > for GCC, should I handle these as special cases, is there some other
> > > > way of going about it?
> > > > Any tips very much appreciated, thank you!
> > >
> > > As for the case with the call I'd figure why we do not inline.
> >
> > Jakub