https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85406
Bug ID: 85406
Summary: Unnecessary blend when vectorizing short-cutted calculations
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: linux at carewolf dot com
Target Milestone: ---

If you have something like this:

inline unsigned qPremultiply(unsigned x)
{
    const unsigned a = x >> 24;
    if (a == 255)
        return x;

    unsigned t = (x & 0xff00ff) * a;
    t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8;
    t &= 0xff00ff;

    x = ((x >> 8) & 0xff) * a;
    x = (x + ((x >> 8) & 0xff) + 0x80);
    x &= 0xff00;
    return x | t | (a << 24);
}

GCC will vectorize it so that the longer calculation is always performed, with a blend added at the end to merge the two return values. The blend is unnecessary, however: when a == 255 the long calculation returns x unchanged, so both paths give the same result and the blend can be dropped.

In any case, vectorizing this is somewhat risky, because the performance difference between the two branches is substantial. In this case the short-cut is likely to be taken most of the time, so a non-vectorized loop might be faster than a vectorized one simply by doing far less work.

The latter could be fixed if the short-cut were also vectorized, for instance by testing 4 values at a time and skipping the long route if none of them needs it.
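
The claim that the long calculation gives the same result as the short-cut can be checked by brute force. The following is a minimal sketch, not part of the original report: premultiply_long, count_opaque_mismatches and check_opaque_identity are hypothetical names, and uint32_t stands in for the report's unsigned on the assumption that unsigned is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* The long path of qPremultiply from the report, with the
 * "a == 255" early return removed (hypothetical helper name). */
static uint32_t premultiply_long(uint32_t x)
{
    const uint32_t a = x >> 24;

    /* Red and blue are scaled together in two 16-bit lanes. */
    uint32_t t = (x & 0xff00ff) * a;
    t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8;
    t &= 0xff00ff;

    /* Green is scaled on its own and left in bits 8..15. */
    x = ((x >> 8) & 0xff) * a;
    x = (x + ((x >> 8) & 0xff) + 0x80);
    x &= 0xff00;
    return x | t | (a << 24);
}

/* Counts the fully opaque inputs (alpha == 255) for which the long
 * path does not return its argument unchanged. Walks all 2^24 colours. */
static uint32_t count_opaque_mismatches(void)
{
    uint32_t mismatches = 0;
    for (uint32_t rgb = 0; rgb <= 0xffffffu; ++rgb) {
        const uint32_t x = 0xff000000u | rgb;
        if (premultiply_long(x) != x)
            ++mismatches;
    }
    return mismatches;
}

/* Aborts if any opaque input is changed by the long path. */
static void check_opaque_identity(void)
{
    assert(count_opaque_mismatches() == 0);
}
```

Calling check_opaque_identity() from a test program should pass silently if the long path is the identity for every opaque pixel, which is the property that would make the blend redundant.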
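
The suggestion to test 4 values at a time can be sketched in scalar C as below. This only illustrates the control flow, not what GCC generates or should generate: premultiply_span is a hypothetical helper, operating in place on an array is an assumption, and qPremultiply is declared static inline so that the sketch links as plain C.

```c
#include <assert.h>
#include <stddef.h>

/* qPremultiply as given in the report (assumes 32-bit unsigned). */
static inline unsigned qPremultiply(unsigned x)
{
    const unsigned a = x >> 24;
    if (a == 255)
        return x;

    unsigned t = (x & 0xff00ff) * a;
    t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8;
    t &= 0xff00ff;

    x = ((x >> 8) & 0xff) * a;
    x = (x + ((x >> 8) & 0xff) + 0x80);
    x &= 0xff00;
    return x | t | (a << 24);
}

/* Hypothetical helper: premultiplies n pixels in place, testing the
 * short-cut once per group of four pixels. */
static void premultiply_span(unsigned *px, size_t n)
{
    assert(px != NULL || n == 0);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* The AND of four pixels has alpha 0xff only if all four do. */
        const unsigned all = px[i] & px[i + 1] & px[i + 2] & px[i + 3];
        if ((all >> 24) == 0xff)
            continue; /* all opaque: skip the long route entirely */

        /* At least one pixel needs the long route. */
        for (size_t k = 0; k < 4; ++k)
            px[i + k] = qPremultiply(px[i + k]);
    }

    /* Remaining 0..3 pixels are handled one at a time. */
    for (; i < n; ++i)
        px[i] = qPremultiply(px[i]);
}
```

In a real vectorized loop the group test would presumably be a single vector compare plus a mask test, and the inner loop would be the vectorized long route; the scalar form here only shows where the branch would sit.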