Segher Boessenkool wrote:
> On Thu, Sep 03, 2015 at 12:43:34PM +0100, Wilco Dijkstra wrote:
> > > > Combine canonicalizes certain AND masks in a comparison with zero
> > > > into extracts of the widest register type. During matching these
> > > > are expanded into a very inefficient sequence that fails to match.
> > > > For example (x & 2) == 0 is matched in combine like this:
> > > >
> > > > Failed to match this instruction:
> > > > (set (reg:CC 66 cc)
> > > >     (compare:CC (zero_extract:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
> > > >             (const_int 1 [0x1])
> > > >             (const_int 1 [0x1]))
> > > >         (const_int 0 [0])))
> > >
> > > Yes.  Some processors even have specific instructions to do this.
> >
> > However there are 2 issues with this, one is the spurious subreg,
>
> Combine didn't make that up out of thin air; something already used
> DImode here.  It could simplify it to SImode in this case, that is
> true, don't know why it doesn't; it isn't necessarily faster code to
> do so, it can be slower, it might not match, etc.
The relevant RTL instructions on AArch64 are:

(insn 8 3 25 2 (set (reg:SI 77 [ D.2705 ])
        (and:SI (reg/v:SI 76 [ xD.2641 ])
            (const_int 2 [0x2]))) tmp5.c:122 452 {andsi3}
     (nil))

(insn 26 25 27 2 (set (reg:CC 66 cc)
        (compare:CC (reg:SI 77 [ D.2705 ])
            (const_int 0 [0]))) tmp5.c:122 377 {*cmpsi}
     (expr_list:REG_DEAD (reg:SI 77 [ D.2705 ])
        (nil)))

I don't see anything using DI...

> > the other is that only a subset of legal zero_extracts are tried
> > (only single bit and the degenerate case of zero_extract with shift
> > of 0 - which I think should not be a zero_extract). All other AND
> > immediates remain as AND.
>
> Yes.  I'm happy to see this weird special case "optimisation",
> "canonicalisation" gone.
>
> > So to emit an AND on targets without such specific instructions, or
> > where such instructions are more expensive than a simple AND (*),
> > you now need at least 3 different backend patterns for any
> > instruction that can emit AND immediate...
>
> It's only a problem for AND-and-compare, no?

Yes, so it looks like some other backends match the odd pattern and
then have another pattern to change it back into the canonical AND/TST
form during the split phase (maybe the subreg confuses register
allocation or blocks other optimizations). This all seems a lot of
unnecessary complexity for a few special immediates when there is a
much simpler solution...

> > (*) I think that is another issue in combine - when both
> > alternatives match you want to select the lowest cost one, not the
> > first one that matches.
>
> That's recog, not combine.  And quite a few backends rely on "first
> match wins", because it always has been that way.  It also is very
> easy to write such patterns accidentally (sometimes even the exact
> same one twice in the same machine description, etc.)
> > > > Failed to match this instruction:
> > > > (set (reg:CC 66 cc)
> > > >     (compare:CC (and:DI (lshiftrt:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
> > > >                 (const_int 1 [0x1]))
> > > >             (const_int 1 [0x1]))
> > > >         (const_int 0 [0])))
> > >
> > > This is after r223067.  Combine tests only one "final" instruction;
> > > that revision rewrites zero_ext* if it doesn't match and tries
> > > again.  This helps for processors that can do more generic masks
> > > (like rs6000, and I believe also aarch64?): without it, you need
> > > many more patterns to match all the zero_ext{ract,end} cases.
> >
> > But there are more efficient ways to emit single-bit and mask tests
> > that apply to most CPUs rather than doing something specific that
> > works for just one target. For example a single-bit test is a simple
> > shift into the carry flag or into the sign bit, and for mask tests
> > you shift out all the non-mask bits.
>
> Most of those are quite target-specific.  Some others are already
> done, and/or done by other passes.

But what combine does here is even more target-specific. Shifts and AND
setting the flags are universally supported on targets with a condition
code register. Bitfield test/extract instructions are rarer, and when
they are supported, they may well be more expensive.

> > So my question is, is it combine's job to try all possible
> > permutations that constitute a bit or mask test?
>
> Combine converts the merged instructions to what it thinks is the
> canonical or cheapest form, and uses that.  It does not try multiple
> options (the zero_ext* -> and+shift rewriting is not changing the
> semantics of the pattern at all).

But the change from AND to zero_extract already changes the semantics...

> > Or would it be better to let each target decide on how to
> > canonicalize bit tests and only try that alternative?
>
> The question is how to write the pattern to be most convenient for all
> targets.
The obvious choice is to try the 2 original instructions merged.

> > > > Neither matches the AArch64 patterns for ANDS/TST (which is just
> > > > compare and AND). If the immediate is not a power of 2 or a power
> > > > of 2 minus 1 then it matches correctly as expected.
> > > >
> > > > I don't understand how ((x >> 1) & 1) != 0 could be a useful
> > > > expansion
> > >
> > > It is zero_extract(x,1,1) really.  This is convenient for (old and
> > > embedded) processors that have special bit-test instructions.  If
> > > we now want combine to not do this, we'll have to update all
> > > backends that rely on it.
> >
> > Would any backend actually rely on this given it only does some
> > specific masks, has a redundant shift with 0 for the mask case and
> > the odd subreg as well?
>
> Such backends match the zero_extract patterns, of course.  Random
> example: the h8300 patterns for the "btst" instruction.
>
> > > They are common, and many processors had instructions for them,
> > > which is (supposedly) why combine special-cased them.
> >
> > Yes, but that doesn't mean (x & C) != 0 shouldn't be tried as well...
>
> Combine does not try multiple options.

I'm not following - combine tries zero_extract and shift+AND - that's 2
options. If that is feasible then adding a 3rd option should be possible.

> > > > It's trivial to change change_zero_ext to expand extracts always
> > > > into AND and remove the redundant subreg.
> > >
> > > Not really trivial...  Think about comparisons...
> > >
> > > int32_t x;
> > > ((x >> 31) & 1) > 0;
> > > // is not the same as
> > > (x & 0x80000000) > 0;  // signed comparison
> > >
> > > (You do not easily know what the COMPARE is used for).
> >
> > Indeed if you had a zero_extract that included the sign bit then you
> > may have to adjust the compare condition. However it looks like that
> > case can't happen - (x & 0x80000000) comparisons have the AND
> > optimized away much earlier.
>
> A long time before combine, yes.  But it is *possible* to give such
> insns to combine, and it should not do the wrong thing with them.

Yes. So that suggests we'd need to block the canonicalization rather
than undo it.

> > > > However wouldn't it make more sense to never special-case certain
> > > > AND immediates in the first place?
> > >
> > > Yes, but we need to make sure no targets regress (i.e. by rewriting
> > > patterns for those that do).  We need to know the impact of such a
> > > change, at the least.
> >
> > The alternative would be to let the target decide how to canonicalize
> > bit tests. That seems like a better solution than trying multiple
> > possibilities which can never match on most targets.
>
> You will end up with a *lot* of target hooks like this.  It will also
> make testing harder (less coverage).  I am not sure that is a good
> idea.

We certainly need a lot more target hooks in general so GCC can do the
right thing (rather than using costs inconsistently all over the
place). But that's a different discussion...

Wilco