> Segher Boessenkool wrote:
> On Thu, Sep 03, 2015 at 12:43:34PM +0100, Wilco Dijkstra wrote:
> > > > Combine canonicalizes certain AND masks in a comparison with zero into
> > > > extracts of the widest register type. During matching these are
> > > > expanded into a very inefficient sequence that fails to match. For
> > > > example (x & 2) == 0 is matched in combine like this:
> > > >
> > > > Failed to match this instruction:
> > > > (set (reg:CC 66 cc)
> > > >     (compare:CC (zero_extract:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
> > > >             (const_int 1 [0x1])
> > > >             (const_int 1 [0x1]))
> > > >         (const_int 0 [0])))
> > >
> > > Yes.  Some processors even have specific instructions to do this.
> >
> > However there are 2 issues with this: one is the spurious subreg,
> 
> Combine didn't make that up out of thin air; something already used
> DImode here.  It could simplify it to SImode in this case, that is
> true, don't know why it doesn't; it isn't necessarily faster code to
> do so, it can be slower, it might not match, etc.

The relevant RTL instructions on AArch64 are:

(insn 8 3 25 2 (set (reg:SI 77 [ D.2705 ])
        (and:SI (reg/v:SI 76 [ xD.2641 ])
            (const_int 2 [0x2]))) tmp5.c:122 452 {andsi3}
     (nil))
 (insn 26 25 27 2 (set (reg:CC 66 cc)
        (compare:CC (reg:SI 77 [ D.2705 ])
            (const_int 0 [0]))) tmp5.c:122 377 {*cmpsi}
     (expr_list:REG_DEAD (reg:SI 77 [ D.2705 ])
        (nil)))

I don't see anything using DI...
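
For completeness, the kind of source that produces these two insns - my
reconstruction, since the relevant line of tmp5.c isn't shown - is simply:

  int
  f (int x)
  {
    return (x & 2) == 0;
  }

compiled for aarch64 with optimization enabled.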

> > the other is that only a subset of legal zero_extracts are tried (only
> > single-bit tests and the degenerate case of a zero_extract with a shift
> > of 0 - which I think should not be a zero_extract). All other AND
> > immediates remain as AND.
> 
> Yes.  I'm happy to see this weird special case "optimisation",
> "canonicalisation" gone.
> 
> > So to emit an AND on targets without such specific instructions, or where
> > such instructions are more expensive than a simple AND (*), you now need
> > at least 3 different backend patterns for any instruction that can emit
> > AND immediate...
> 
> It's only a problem for AND-and-compare, no?

Yes, so it looks like some other backends match the odd pattern and then have
another pattern that changes it back into the canonical AND/TST form during
the split phase (maybe the subreg confuses register allocation or blocks
other optimizations). This all seems like a lot of unnecessary complexity for
a few special immediates when there is a much simpler solution...

> > (*) I think that is another issue in combine - when both alternatives
> > match you want to select the lowest cost one, not the first one that
> > matches.
> 
> That's recog, not combine.  And quite a few backends rely on "first match
> wins", because it always has been that way.  It also is very easy to write
> such patterns accidentally (sometimes even the exact same one twice in the
> same machine description, etc.)
> 
> > > > Failed to match this instruction:
> > > > (set (reg:CC 66 cc)
> > > >     (compare:CC (and:DI (lshiftrt:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
> > > >                 (const_int 1 [0x1]))
> > > >             (const_int 1 [0x1]))
> > > >         (const_int 0 [0])))
> > >
> > > This is after r223067.  Combine tests only one "final" instruction; that
> > > revision rewrites zero_ext* if it doesn't match and tries again.  This
> > > helps for processors that can do more generic masks (like rs6000, and I
> > > believe also aarch64?): without it, you need many more patterns to match
> > > all the zero_ext{ract,end} cases.
> >
> > But there are more efficient ways to emit single bit and mask tests that
> > apply to most CPUs rather than doing something specific that works for
> > just one target. For example a single bit test is a simple shift into the
> > carry flag or into the sign bit, and for mask tests, you shift out all
> > the non-mask bits.
> 
> Most of those are quite target-specific.  Some others are already done,
> and/or done by other passes.

But what combine does here is even more target-specific. Shifts and AND
setting the flags are universally supported on targets with a condition code
register. Bitfield test/extract instructions are rarer, and when they are
supported, they may well be more expensive.
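
To make the shift-based alternatives concrete, here is a C-level sketch of
the idea (bit_test/mask_test are names I made up, and the int32_t cast relies
on GCC's two's-complement behaviour; this shows the transformation, not what
any backend literally emits):

  #include <assert.h>
  #include <stdint.h>

  /* (x & (1u << k)) != 0: shift bit k into the sign bit, so one shift
     plus a sign test replaces the AND + compare.  */
  static int
  bit_test (uint32_t x, unsigned k)
  {
    return (int32_t) (x << (31 - k)) < 0;
  }

  /* (x & m) != 0 for a contiguous mask m covering bits lo..hi:
     shift out all the non-mask bits.  */
  static int
  mask_test (uint32_t x, unsigned lo, unsigned hi)
  {
    return (x << (31 - hi)) >> (31 - hi + lo) != 0;
  }

  int
  main (void)
  {
    for (uint32_t x = 0; x < 4096; x++)
      {
        assert (bit_test (x, 1) == ((x & 2) != 0));
        assert (mask_test (x, 4, 7) == ((x & 0xf0) != 0));
      }
    return 0;
  }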

> > So my question is, is it combine's job to try all possible permutations that
> > constitute a bit or mask test?
> 
> Combine converts the merged instructions to what it thinks is the
> canonical or cheapest form, and uses that.  It does not try multiple
> options (the zero_ext* -> and+shift rewriting is not changing the
> semantics of the pattern at all).

But the change from AND to zero_extract is already changing semantics...

> > Or would it be better to let each target decide
> > on how to canonicalize bit tests and only try that alternative?
> 
> The question is how to write the pattern to be most convenient for all
> targets.

The obvious choice is to try the 2 original instructions merged.
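
That is, for the dump above the merged insn would simply be (my sketch of the
combined pattern, not an actual combine dump):

(set (reg:CC 66 cc)
    (compare:CC (and:SI (reg/v:SI 76 [ xD.2641 ])
            (const_int 2 [0x2]))
        (const_int 0 [0])))

which is exactly the AND-and-compare shape a TST/ANDS pattern can match
directly.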

> > > > Neither matches the AArch64 patterns for ANDS/TST (which is just
> > > > compare and AND). If the immediate is not a power of 2 or a power of
> > > > 2 minus 1 then it matches correctly as expected.
> > > >
> > > > I don't understand how ((x >> 1) & 1) != 0 could be a useful expansion
> > >
> > > It is zero_extract(x,1,1) really.  This is convenient for (old and
> > > embedded) processors that have special bit-test instructions.  If we
> > > now want combine to not do this, we'll have to update all backends that
> > > rely on it.
> >
> > Would any backend actually rely on this given it only does some specific
> > masks, has a redundant shift with 0 for the mask case and the odd subreg
> > as well?
> 
> Such backends match the zero_extract patterns, of course.  Random example:
> the h8300 patterns for the "btst" instruction.
> 
> > > They are common, and many processors had instructions for them, which is
> > > (supposedly) why combine special-cased them.
> >
> > Yes, but that doesn't mean (x & C) != 0 shouldn't be tried as well...
> 
> Combine does not try multiple options.

I'm not following - combine tries zero_extract and shift+AND - that's 2 options.
If that is feasible then adding a 3rd option should be possible.

> > > > It's trivial to change change_zero_ext to expand extracts always into
> > > > AND and remove the redundant subreg.
> > >
> > > Not really trivial...  Think about comparisons...
> > >
> > > int32_t x;
> > > ((x >> 31) & 1) > 0;
> > > // is not the same as
> > > (x & 0x80000000) > 0; // signed comparison
> > >
> > > (You do not easily know what the COMPARE is used for).
> >
> > Indeed if you had a zero_extract that included the sign bit then you may
> > have to adjust the compare condition. However it looks like that case
> > can't happen - (x & 0x80000000) comparisons will have the AND optimized
> > away much earlier.
> 
> A long time before combine, yes.  But it is *possible* to give such insns
> to combine, and it should not do the wrong thing with them.

Yes. So that suggests we'd need to block the canonicalization rather than
undo it.
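
To spell out the sign-bit case with Segher's example (assuming GCC's
arithmetic right shift for signed types):

  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    int32_t x = -1;
    /* zero_extract of the sign bit, compared signed > 0: true.  */
    assert (((x >> 31) & 1) > 0);
    /* Masking the sign bit and comparing signed > 0: false, since the
       masked value is INT32_MIN; the condition would have to become
       != 0 (or a sign test) instead.  */
    assert (!((int32_t) (x & 0x80000000) > 0));
    return 0;
  }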

> > > > However wouldn't it make more sense to never special-case certain AND
> > > > immediates in the first place?
> > >
> > > Yes, but we need to make sure no targets regress (i.e. by rewriting
> > > patterns for those that do).  We need to know the impact of such a
> > > change, at the least.
> >
> > The alternative would be to let the target decide how to canonicalize bit
> > tests. That seems like a better solution than trying multiple
> > possibilities which can never match on most targets.
> 
> You will end up with a *lot* of target hooks like this.  It will also
> make testing harder (less coverage).  I am not sure that is a good idea.

We certainly need a lot more target hooks in general so GCC can do the right
thing (rather than using costs inconsistently all over the place). But that's
a different discussion...

Wilco

