Dhruv Chawla <dhr...@nvidia.com> writes:
> On 01/08/25 22:10, Richard Sandiford wrote:
>> External email: Use caution opening links or attachments
>> 
>> 
>> Dhruv Chawla <dhr...@nvidia.com> writes:
>>> On 24/07/25 11:21, Andrew Pinski wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> On Wed, Jul 23, 2025 at 10:16 PM <dhr...@nvidia.com> wrote:
>>>>>
>>>>> From: Dhruv Chawla <dhr...@nvidia.com>
>>>>>
>>>>> This patch folds the following patterns:
>>>>> - max (a, add (a, b)) -> [sum, ovf] = adds (a, b); !ovf ? sum : a
>>>>> - min (a, add (a, b)) -> [sum, ovf] = adds (a, b); !ovf ? a : sum
>>>>> - max (a, sub (a, b)) -> [sum, ovf] = subs (a, b); !ovf ? a : sum
>>>>> - min (a, sub (a, b)) -> [sum, ovf] = subs (a, b); !ovf ? sum : a
>>>>>
>>>>> Where ovf is the overflow flag. adds and subs are generated by
>>>>> generating a parallel compare+plus/minus which maps to the pattern
>>>>> add<mode>3_compareC. sub<mode>3_compareC is also created to have an
>>>>> equivalent pattern for the subs instruction.
>>>>>
>>>>> This patch is a respin of the patch posted at
>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2025-May/685021.html as per
>>>>> the suggestion to turn it into a target-specific transform by Richard
>>>>> Biener.
>>>>>
>>>>> Bootstrapped and regtested on aarch64-unknown-linux-gnu.
>>>>>
>>>>> Signed-off-by: Dhruv Chawla <dhr...@nvidia.com>
>>>>>
>>>>>           PR middle-end/116815
>>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>>           * config/aarch64/aarch64.md (sub<mode>3_compareC): New pattern.
>>>>>           (*aarch64_plus_within_<optab><mode>3_<ovf_commutate>): Likewise.
>>>>>           (*aarch64_minus_within_<optab><mode>3): Likewise.
>>>>>           * config/aarch64/iterators.md (ovf_add_cmp): New code attribute.
>>>>>           (ovf_sub_cmp): Likewise.
>>>>>           (ovf_commutate): New iterator.
>>>>>           (ovf_comm_opp): New int attribute.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>>           * gcc.target/aarch64/pr116815-1.c: New test.
>>>>>           * gcc.target/aarch64/pr116815-2.c: Likewise.
>>>>>           * gcc.target/aarch64/pr116815-3.c: Likewise.
>>>>>           * gcc.target/aarch64/pr116815-4.c: Likewise.
>>>>>           * gcc.target/aarch64/pr116815-5.c: Likewise.
>>>>> ---
>>>>>    gcc/config/aarch64/aarch64.md                 |  73 +++++++++++
>>>>>    gcc/config/aarch64/iterators.md               |   9 ++
>>>>>    gcc/testsuite/gcc.target/aarch64/pr116815-1.c | 119 ++++++++++++++++++
>>>>>    gcc/testsuite/gcc.target/aarch64/pr116815-2.c |  94 ++++++++++++++
>>>>>    gcc/testsuite/gcc.target/aarch64/pr116815-3.c |  63 ++++++++++
>>>>>    gcc/testsuite/gcc.target/aarch64/pr116815-4.c |  49 ++++++++
>>>>>    gcc/testsuite/gcc.target/aarch64/pr116815-5.c |  45 +++++++
>>>>>    7 files changed, 452 insertions(+)
>>>>>    create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-1.c
>>>>>    create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-2.c
>>>>>    create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-3.c
>>>>>    create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-4.c
>>>>>    create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-5.c
>>>>>
>>>>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>>>>> index a4ae6859da0..c9f88c40473 100644
>>>>> --- a/gcc/config/aarch64/aarch64.md
>>>>> +++ b/gcc/config/aarch64/aarch64.md
>>>>> @@ -3741,6 +3741,20 @@
>>>>>      [(set_attr "type" "alus_sreg")]
>>>>>    )
>>>>>
>>>>> +;; An equivalent to add<mode>3_compareC
>>>>> +(define_insn "sub<mode>3_compareC"
>>>>> +  [(set (reg:CC_C CC_REGNUM)
>>>>> +       (compare:CC_C
>>>>> +         (minus:GPI
>>>>> +           (match_operand:GPI 1 "register_operand" "r")
>>>>> +           (match_operand:GPI 2 "register_operand" "r"))
>>>>> +         (match_dup 1)))
>>>>> +   (set (match_operand:GPI 0 "register_operand" "=r")
>>>>> +       (minus:GPI (match_dup 1) (match_dup 2)))]
>>>>> +  ""
>>>>> +  "subs\t%<w>0, %<w>1, %<w>2"
>>>>> +)
>>>>> +
>> 
>> I think I'm going to campaign for renaming CC_C to CC_C_INV.  I spent a
>> while trying to understand how this pattern could be correct, before
>> realising that CC_C uses the opposite mapping from CC.  The mapping is:
>> 
>> CC:    RTL GEU == AArch64 HS == AArch64 CS  (good)
>> CC_C:  RTL GEU == AArch64 LO == AArch64 CC
>> 
>> I think the later patterns get caught by this.
>
> Hi,
>
> Wow, this is really confusing. It looks like I missed/misunderstood the
> comment in aarch64-modes.def that explains this, my bad. Do you think
> exchanging the operand order (for 1 and 2) would be an okay fix for
> this (albeit not any less confusing)? Or should I switch this around to
> using CC directly?

I'm not sure if you're asking about the pattern above or the later patterns.
The pattern above is correct as-is because of the meaning of CC_C.
(Like I say, it just took me a long time to realise why it was correct,
since I was expecting CC_C to have the same mapping as CC for carry tests.)

If the pattern above used CC instead of CC_C then the carry flag would
have the opposite value from the expected one.

In the later patterns, it would work to test the condition in CC_C,
or (alternatively) to continue testing in CC and either flip the
comparison or swap the if_then_else operands.  I think using CC_Z
is more correct, since it reflects how SELECT_CC_MODE is normally used.

> I also don't really understand why CC_C is used instead of CC for patterns
> in general? Is it for more accurate modelling of the NZCV register for
> scheduling or regalloc purposes? Or (from what seems more likely from
> aarch64-modes.def), is it to map overflowing addition and subtraction to
> condition flags in the same way?

In general, the point of the different condition modes is to say which
subset of the NZCV flags are valid.  This isn't so much for scheduling
or regalloc, but instead to prevent invalid DCE/CSE.  CC guarantees that
all of N, Z, C, and V are set as for a comparison.  Thus if a bb contains:

   (parallel
     [(set (reg:CC_Z CC_REGNUM) (compare:CC_Z (reg:SI x) (reg:SI y)))
      ...])
   (set (reg:CC CC_REGNUM) (compare:CC (reg:SI x) (reg:SI y)))
   (set (if_then_else (gt (reg:CC CC_REGNUM (const_int 0))) ...)

then the middle instruction cannot be deleted, because the first
instruction only sets Z, whereas gt requires NZV.  If instead it was:

   (parallel
     [(set (reg:CC_Z CC_REGNUM) (compare:CC_Z (reg:SI x) (reg:SI y)))
      ...])
   (set (reg:CC CC_REGNUM) (compare:CC (reg:SI x) (reg:SI y)))
   (set (if_then_else (eq (reg:CC CC_REGNUM (const_int 0))) ...)

then the sequence can be optimised to:

   (parallel
     [(set (reg:CC_Z CC_REGNUM) (compare:CC_Z (reg:SI x) (reg:SI y)))
      ...])
   (set (if_then_else (eq (reg:CC_Z CC_REGNUM (const_int 0))) ...)

because eq only requires Z, and CC_Z guarantess that the Z flag
is valid.

>>>>>    (define_peephole2
>>>>>      [(set (match_operand:GPI 0 "aarch64_general_reg")
>>>>>           (minus:GPI (match_operand:GPI 1 "aarch64_reg_or_zero")
>>>>> @@ -4481,6 +4495,65 @@
>>>>>      [(set_attr "type" "<su>div")]
>>>>>    )
>>>>>
>>>>> +;; umax (a, add (a, b)) => [sum, ovf] = adds (a, b); !ovf ? sum : a
>>>>> +;; umin (a, add (a, b)) => [sum, ovf] = adds (a, b); !ovf ? a : sum
>>>>> +;; ... along with the commutative version of add (a, b) i.e. add (b, a)
>>>>> +(define_insn_and_split 
>>>>> "*aarch64_plus_within_<optab><mode>3_<ovf_commutate>"
>>>>> +  [(set (match_operand:GPI 0 "register_operand")
>>>>> +       (UMAXMIN:GPI
>>>>> +         (plus:GPI (match_operand:GPI 1 "register_operand")
>>>>> +                   (match_operand:GPI 2 "register_operand"))
>>>>> +         (match_dup <ovf_commutate>)))
>>>>> +   (clobber (match_scratch:GPI 3))]
>>>>> +  "!TARGET_CSSC"
>>>>> +  "#"
>>>>> +  "&& !reload_completed"
>>>>
>>>> Since this is a define_insn_and_split, I think there should be
>>>> constraints and not just predicates on it. I don't think you can
>>>> depend on it being split before RA (though I could be wrong) or
>>>> matching post RA.
>> 
>> Agreed.
>> 
>>>>> +  [(parallel
>>>>> +      [(set (reg:CC_C CC_REGNUM)
>>>>> +           (compare:CC_C (plus:GPI (match_dup ovf_commutate)
>>>>
>>>> I am thinking you are missing "<>" around ovf_commutate here and other
>>>> places below. Or did I misunderstand how this works?
>>>
>>> Hi,
>>>
>>> Yeah, this is a bit weird. It looks like the split part of a pattern does
>>> not require the <>. Infact, I was getting undefined iterator errors when
>>> I put them. Without them, the md gets correctly generated (I verified using
>>> "make mddump" in build/gcc/).
>> 
>> Yeah, since it's an iterator, using it without "<" and ">" is correct.
>> I think it's the (match_dup <ovf_commutate>) on the define_insn part
>> that's wrong -- it should be (match_dup ovf_commutate) too.
>> I'll look into why the parsers accept it.
>
> What is the difference between surrounding an iterator with <> (such as in the
> standard pattern name) versus its use in the pattern without <>?

What I meant was that, AFAIR, (match_dup <ovf_commutate>) was not
intentionally allowed.  I think it was an accidental side-effect of
allowing "...<ovf_commutate>..." to be used in C++ code (which in turn
was allowed to avoid having to define an int attribute that mapped
numbers to themselves).

(match_dup <...>) is for attributes rather than iterators.  The iterator
name can be used as a disambiguator in:

    (match_dup <iterator:attribute>)

but iterators weren't AFAIR supposed to be used standalone in <...>.

>>>>> +                                   (match_dup <ovf_comm_opp>))
>>>>> +                         (match_dup ovf_commutate)))
>>>>> +       (set (match_dup 3) (plus:GPI (match_dup ovf_commutate)
>>>>> +                                   (match_dup <ovf_comm_opp>)))])
>>>>> +   (set (match_dup 0)
>>>>> +       (if_then_else:GPI (<ovf_add_cmp> (reg:CC CC_REGNUM)
>>>>> +                                           (const_int 0))
>>>>> +                         (match_dup 3)
>>>>> +                         (match_dup ovf_commutate)))]
>> 
>> This sets the carry flag in CC_Cmode but tests it in CCmode, with:
>> 
>>    (define_code_attr ovf_add_cmp [(umax "geu") (umin "ltu")])
>> 
>> So UMAX uses GEU, which for CC maps to the natural HS/CS, i.e. to
>> ovf rather than !ovf.  So for UMAX, this chooses the sum on overflow
>> and the input otherwise.  Therefore:
>> 
>>> +/*
>>> +** umaxadd1:
>>> +**   adds    (w[0-9]+), w0, w1
>>> +**   csel    w0, \1, w0, cs
>>> +**   ret
>>> +*/
>>> +OPERATION (max, add, 1, a, a + b)
>> 
>> ...I don't think this is correct.  It chooses the addition result if
>> the ADD overflows and chooses a otherwise.  For example, after the patch:
>> 
>> [[gnu::noipa]] unsigned
>> foo (unsigned a, unsigned b)
>> {
>>    return a + b > b ? a + b : b;
>> }
>> 
>> int
>> main ()
>> {
>>    __builtin_printf ("%x\n", foo (0x90000000, 0xa0000000));
>> }
>> 
>> prints 30000000 when compiled at -O2 but a0000000 when compiled at -O0.
>
> Yeah, that makes sense. I thought that CC_C mapped directly to CC, so it
> makes sense that the pattern breaks. Thanks for catching this! I think I
> should add runtime tests as well to make sure that the behaviour is
> modeled correctly.

Thanks, sounds good.

Richard

Reply via email to