sub(a, b)) [PR116815]

Alex Coplan Tue, 25 Nov 2025 06:29:55 -0800

On 05/11/2025 10:39, Dhruv Chawla wrote:
> On 28/10/25 22:03, Alex Coplan wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > Hi Dhruv,
> > 
> > Sorry for the long wait on this one.  Comments below ...
> 
> Hi Alex,
> 
> Thanks for the review. I have attached a new version of the patch
> to this email.


Thanks, and sorry to be slow again...

> 
> > 
> > On 18/08/2025 21:31, [email protected] wrote:
> > > From: Dhruv Chawla <[email protected]>
> > > 
> > > This patch folds the following patterns:
> > > - umax (a, add (a, b)) -> [sum,  ovf] = adds (a, b); !ovf ? sum : a
> > > - umin (a, add (a, b)) -> [sum,  ovf] = adds (a, b); !ovf ? a : sum
> > > - umax (a, sub (a, b)) -> [diff, ovf] = subs (a, b); ovf ? diff : a
> > > - umin (a, sub (a, b)) -> [diff, ovf] = subs (a, b); ovf ? a : diff
> > > 
> > > Where ovf is the overflow flag. adds and subs are generated by
> > > generating a parallel compare+plus/minus which maps to the pattern
> > > add<mode>3_compareC. sub<mode>3_compareC is also created to have an
> > > equivalent pattern for the subs instruction. In the case of subs, the
> > > overflow flag represents underflow.
> > > 
> > > This patch is a respin of the patch posted at
> > > https://gcc.gnu.org/pipermail/gcc-patches/2025-May/685021.html as per
> > > the suggestion to turn it into a target-specific transform by Richard
> > > Biener.
> > > 
> > > Bootstrapped and regtested on aarch64-unknown-linux-gnu.
> > > 
> > > Signed-off-by: Dhruv Chawla <[email protected]>
> > > 
> > >        PR middle-end/116815
> > > 
> > > gcc/ChangeLog:
> > > 
> > >        * config/aarch64/aarch64.md (sub<mode>3_compareC): New pattern.
> > >        (*aarch64_plus_within_<optab><mode>3_<ovf_commutate>): Likewise.
> > >        (*aarch64_minus_within_<optab><mode>3): Likewise.
> > >        * config/aarch64/iterators.md (ovf_add_cmp): New code attribute.
> > >        (ovf_sub_cmp): Likewise.
> > >        (ovf_commutate): New iterator.
> > >        (ovf_comm_opp): New int attribute.
> > > 
> > > gcc/testsuite/ChangeLog:
> > > 
> > >        * gcc.target/aarch64/pr116815-1.c: New test.
> > >        * gcc.target/aarch64/pr116815-2.c: Likewise.
> > >        * gcc.target/aarch64/pr116815-3.c: Likewise.
> > >        * gcc.target/aarch64/pr116815-4.c: Likewise.
> > >        * gcc.target/aarch64/pr116815-5.c: Likewise.
> > >        * gcc.target/aarch64/pr116815-6.c: Likewise.
> > > ---
> > >   gcc/config/aarch64/aarch64.md                 |  76 +++++++++++
> > >   gcc/config/aarch64/iterators.md               |   9 ++
> > >   gcc/testsuite/gcc.target/aarch64/pr116815-1.c | 120 ++++++++++++++++++
> > >   gcc/testsuite/gcc.target/aarch64/pr116815-2.c |  93 ++++++++++++++
> > >   gcc/testsuite/gcc.target/aarch64/pr116815-3.c |  62 +++++++++
> > >   gcc/testsuite/gcc.target/aarch64/pr116815-4.c |  48 +++++++
> > >   gcc/testsuite/gcc.target/aarch64/pr116815-5.c |  44 +++++++
> > >   gcc/testsuite/gcc.target/aarch64/pr116815-6.c |  60 +++++++++
> > >   8 files changed, 512 insertions(+)
> > >   create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-1.c
> > >   create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-2.c
> > >   create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-3.c
> > >   create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-4.c
> > >   create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-5.c
> > >   create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-6.c
> > > 
> > > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> > > index 8e431a10cb1..a55fe5c85d8 100644
> > > --- a/gcc/config/aarch64/aarch64.md
> > > +++ b/gcc/config/aarch64/aarch64.md
> > > @@ -3715,6 +3715,20 @@
> > >     [(set_attr "type" "alus_sreg")]
> > >   )
> > > 
> > > +;; An equivalent to add<mode>3_compareC
> > > +(define_insn "sub<mode>3_compareC"
> > > +  [(set (reg:CC_C CC_REGNUM)
> > > +     (compare:CC_C
> > > +       (minus:GPI
> > > +         (match_operand:GPI 1 "register_operand" "r")
> > > +         (match_operand:GPI 2 "register_operand" "r"))
> > > +       (match_dup 1)))
> > > +   (set (match_operand:GPI 0 "register_operand" "=r")
> > > +     (minus:GPI (match_dup 1) (match_dup 2)))]
> > > +  ""
> > > +  "subs\t%<w>0, %<w>1, %<w>2"
> > > +)
> > > +
> > 
> > I don't think we want/need this pattern.  subs(a,b) underflows precisely
> > when a < b, i.e. when a - b < 0 so you can just use a plain CCmode
> > comparison.  I think you can make use of the existing
> > sub<mode>3_compare1 pattern here.
> > 
> > This should simplify the define_insn_and_split for the subs case below.
> > 
> > >   (define_peephole2
> > >     [(set (match_operand:GPI 0 "aarch64_general_reg")
> > >        (minus:GPI (match_operand:GPI 1 "aarch64_reg_or_zero")
> > > @@ -4455,6 +4469,68 @@
> > >     [(set_attr "type" "<su>div")]
> > >   )
> > > 
> > > +;; umax (a, add (a, b)) => [sum, ovf] = adds (a, b); !ovf ? sum : a
> > > +;; umin (a, add (a, b)) => [sum, ovf] = adds (a, b); !ovf ? a : sum
> > > +;; ... along with the commutative version of add (a, b) i.e. add (b, a).
> > 
> > For the sake of saving one comment line, it might be better to actually
> > spell out the commutated variant, AIUI e.g. for umax that's:
> > umax (b, add (a, b)) => [sum, ovf] = adds (b, a); !ovf ? sum : b
> 
> Done. I also added this to the patch description.
> 
> > 
> > > +(define_insn_and_split 
> > > "*aarch64_plus_within_<optab><mode>3_<ovf_commutate>"
> > > +  [(set (match_operand:GPI 0 "register_operand" "=r")
> > > +     (UMAXMIN:GPI
> > > +       (plus:GPI (match_operand:GPI 1 "register_operand" "r")
> > > +                 (match_operand:GPI 2 "register_operand" "r"))
> > > +       (match_dup ovf_commutate)))
> > > +   (clobber (match_scratch:GPI 3))]
> > 
> > I think you might want an "=r" constraint on the match_scratch here.
> 
> I am a bit confused here - I expect this pattern to only execute pre-RA
> because it requires being able to call gen_reg_rtx (which only works
> pre-RA). Does RA populate scratch registers by itself? Otherwise, I
> don't really see any other way for this split to run post-RA.

So having done some digging on this, the point of this idiom:

> +    if (GET_CODE (operands[3]) == SCRATCH)
> +      operands[3] = gen_reg_rtx (<MODE>mode);

is to allow the splitter to work both before and after RA.  Indeed, the
RA will allocate hard registers in place of scratch registers, provided
there is a suitable constraint on the operand.

I think whatever we do the split condition should be "&& 1"; as Uros
pointed out, allowing the insn to match after RA but not allowing the
insn to be split is an ICE trap.

We then need to decide whether to restrict the define_insn_and_splits to
pre-RA only, or allow them both before and after RA.  I don't see a good
reason to forbid them from matching after RA, so I think the patch is OK
with the constraints on the scratch registers added, and the split
conditions changed to "&& 1".  I've added comments to this effect below.

Note that if we _were_ to restrict the pattern to pre-RA only, I agree
with all of Uros's comments, and we would have to define
aarch64_pre_reload_split and use that in the insn condition for the
define_insn_and_splits, but incrementally it seems easier to me to just
allow the pattern to work both before and after RA.

> 
> Uros: IIRC I ran into issues where unconditionally calling gen_reg_rtx
> still caused an ICE, maybe because the code ran during reload itself.
> I guess I could replace "!reload_completed" with "can_create_pseudo_p ()"?
> 
> > 
> > > +  "!TARGET_CSSC"
> > > +  "#"
> > > +  "&& !reload_completed"
> > > +  [(parallel
> > > +      [(set (reg:CC_C CC_REGNUM)
> > > +         (compare:CC_C (plus:GPI (match_dup ovf_commutate)
> > > +                                 (match_dup <ovf_comm_opp>))
> > > +                       (match_dup ovf_commutate)))
> > > +       (set (match_dup 3) (plus:GPI (match_dup ovf_commutate)
> > > +                                 (match_dup <ovf_comm_opp>)))])
> > > +   (set (match_dup 0)
> > > +     (if_then_else:GPI (<ovf_add_cmp> (reg:CC_C CC_REGNUM)
> > > +                                      (const_int 0))
> > > +                       (match_dup 3)
> > > +                       (match_dup ovf_commutate)))]
> > > +  {
> > > +    if (GET_CODE (operands[3]) == SCRATCH)
> > > +      operands[3] = gen_reg_rtx (<MODE>mode);
> > 
> > At first I was a bit unsure about this idiom, since md.texi:9358 has
> > this to say about define_split (and therefore about
> > define_insn_and_split):
> > 
> >      The @var{preparation-statements} are similar to those statements that
> >      are specified for @code{define_expand} (@pxref{Expander Definitions})
> >      and are executed before the new RTL is generated to prepare for the
> >      generated code or emit some insns whose pattern is not fixed.  Unlike
> >      those in @code{define_expand}, however, these statements must not
> >      generate any new pseudo-registers.  Once reload has completed, they 
> > also
> >      must not allocate any space in the stack frame.
> > 
> > but I see that we already use this idiom for a few patterns in 
> > aarch64-sve.md.
> > Would appreciate a second opinion on this.  Either the existing SVE 
> > patterns are
> > wrong or (perhaps more likely) the md.texi docs are out of date.
> > 
> > > +  }
> > > +)
> > > +
> > > +;; In this case the ovf represents an underflow.
> > > +;; umax (a, sub (a, b)) => [diff, ovf] = subs (a, b); ovf ? diff : a
> > > +;; umin (a, sub (a, b)) => [diff, ovf] = subs (a, b); ovf ? a : diff
> > > +(define_insn_and_split "*aarch64_minus_within_<optab><mode>3"
> > > +  [(set (match_operand:GPI 0 "register_operand" "=r")
> > > +     (UMAXMIN:GPI
> > > +       (minus:GPI (match_operand:GPI 1 "register_operand" "r")
> > > +                  (match_operand:GPI 2 "register_operand" "r"))
> > > +       (match_dup 1)))
> > > +   (clobber (match_scratch:GPI 3))]
> > > +  "!TARGET_CSSC"
> > > +  "#"
> > > +  "&& !reload_completed"
> > > +  [(parallel
> > > +      [(set (reg:CC_C CC_REGNUM)
> > > +         (compare:CC_C (minus:GPI (match_dup 1) (match_dup 2))
> > > +                       (match_dup 1)))
> > > +       (set (match_dup 3) (minus:GPI (match_dup 1) (match_dup 2)))])
> > > +   ;; There is an inverse mapping here from CC_C -> CC as this requires 
> > > the
> > > +   ;; inverse of the comparison from the above pattern.
> > 
> > As mentioned above you should be able to do the whole thing in CCmode, which
> > should simplify things here.
> > 
> > > +   (set (match_dup 0)
> > > +         (if_then_else:GPI (<ovf_sub_cmp> (reg:CC CC_REGNUM)
> > > +                                             (const_int 0))
> > > +                           (match_dup 3)
> > > +                           (match_dup 1)))]
> > > +  {
> > > +    if (GET_CODE (operands[3]) == SCRATCH)
> > > +      operands[3] = gen_reg_rtx (<MODE>mode);
> > > +  }
> > > +)
> > > +
> > >   ;; -------------------------------------------------------------------
> > >   ;; Comparison insns
> > >   ;; -------------------------------------------------------------------
> > > diff --git a/gcc/config/aarch64/iterators.md 
> > > b/gcc/config/aarch64/iterators.md
> > > index c3771d9402b..a29243247bd 100644
> > > --- a/gcc/config/aarch64/iterators.md
> > > +++ b/gcc/config/aarch64/iterators.md
> > > @@ -2804,6 +2804,8 @@
> > > 
> > >   (define_code_iterator FMAXMIN [smax smin])
> > > 
> > > +(define_code_iterator UMAXMIN [umax umin])
> > > +
> > >   ;; Signed and unsigned max operations.
> > >   (define_code_iterator USMAX [smax umax])
> > > 
> > > @@ -3092,6 +3094,9 @@
> > > 
> > >   (define_code_attr maxminand [(smax "bic") (smin "and")])
> > > 
> > > +(define_code_attr ovf_add_cmp [(umax "geu") (umin "ltu")])
> > > +(define_code_attr ovf_sub_cmp [(umax "ltu") (umin "geu")])
> > > +
> > >   ;; MLA/MLS attributes.
> > >   (define_code_attr as [(ss_plus "a") (ss_minus "s")])
> > > 
> > > @@ -5007,3 +5012,7 @@
> > >      (UNSPEC_F2CVT "f2cvt")
> > >      (UNSPEC_F1CVTLT "f1cvtlt")
> > >      (UNSPEC_F2CVTLT "f2cvtlt")])
> > > +
> > > +;; Operand numbers for commutative operations
> > > +(define_int_iterator ovf_commutate [1 2])
> > 
> > Note that this iterator already exists with the name BSL_DUP.  It might
> > be better to rename it to something more general and re-purpose it here,
> > rather than duplicating it.
> 
> Yeah, I saw that as well - using BSL_DUP and bsl_mov would achieve the same
> purpose, but I am not sure what commonality between my usage and BSL can be
> used for naming? ONE_TWO / two_one are the obvious names, but they don't seem
> very good.

Yeah, those are the names that came to mind for me, too, but it's
arguably less readable in context, so IMO it's OK to duplicate the
iterator for the sake of better readability (as the current patch does),
if you prefer.

> 
> > 
> > > +(define_int_attr ovf_comm_opp [(1 "2") (2 "1")])
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr116815-1.c 
> > > b/gcc/testsuite/gcc.target/aarch64/pr116815-1.c
> > > new file mode 100644
> > > index 00000000000..f3bdb794318
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr116815-1.c
> > > @@ -0,0 +1,120 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +/* PR middle-end/116815 */
> > > +
> > > +/* Single-use tests.  */
> > > +
> > > +static inline unsigned __attribute__ ((always_inline))
> > > +max (unsigned a, unsigned b)
> > > +{
> > > +  return a > b ? a : b;
> > > +}
> > > +
> > > +static inline unsigned __attribute__ ((always_inline))
> > > +min (unsigned a, unsigned b)
> > > +{
> > > +  return a < b ? a : b;
> > > +}
> > > +
> > > +#define OPERATION(op, type, N, exp1, exp2)                               
> > >       \
> > > +  unsigned u##op##type##N (unsigned a, unsigned b) { return op (exp1, 
> > > exp2); }
> > > +
> > > +/*
> > > +** umaxadd1:
> > > +**   adds    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cc
> > > +**   ret
> > > +*/
> > > +OPERATION (max, add, 1, a, a + b)
> > > +
> > > +/*
> > > +** umaxadd2:
> > > +**   adds    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cc
> > > +**   ret
> > > +*/
> > > +OPERATION (max, add, 2, a, b + a)
> > > +
> > > +/*
> > > +** umaxadd3:
> > > +**   adds    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cc
> > > +**   ret
> > > +*/
> > > +OPERATION (max, add, 3, a + b, a)
> > > +
> > > +/*
> > > +** umaxadd4:
> > > +**   adds    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cc
> > > +**   ret
> > > +*/
> > > +OPERATION (max, add, 4, b + a, a)
> > > +
> > > +/*
> > > +** uminadd1:
> > > +**   adds    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cs
> > > +**   ret
> > > +*/
> > > +OPERATION (min, add, 1, a, a + b)
> > > +
> > > +/*
> > > +** uminadd2:
> > > +**   adds    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cs
> > > +**   ret
> > > +*/
> > > +OPERATION (min, add, 2, a, b + a)
> > > +
> > > +/*
> > > +** uminadd3:
> > > +**   adds    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cs
> > > +**   ret
> > > +*/
> > > +OPERATION (min, add, 3, a + b, a)
> > > +
> > > +/*
> > > +** uminadd4:
> > > +**   adds    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cs
> > > +**   ret
> > > +*/
> > > +OPERATION (min, add, 4, b + a, a)
> > > +
> > > +/* sub requires the inverse of the comparison from add.  */
> > > +
> > > +/*
> > > +** umaxsub1:
> > > +**   subs    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cc
> > > +**   ret
> > > +*/
> > > +OPERATION (max, sub, 1, a, a - b)
> > > +
> > > +/*
> > > +** umaxsub2:
> > > +**   subs    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cc
> > > +**   ret
> > > +*/
> > > +OPERATION (max, sub, 2, a - b, a)
> > > +
> > > +/*
> > > +** uminsub1:
> > > +**   subs    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cs
> > > +**   ret
> > > +*/
> > > +OPERATION (min, sub, 1, a, a - b)
> > > +
> > > +/*
> > > +** uminsub2:
> > > +**   subs    (w[0-9]+), w0, w1
> > > +**   csel    w0, \1, w0, cs
> > > +**   ret
> > > +*/
> > > +OPERATION (min, sub, 2, a - b, a)
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr116815-2.c 
> > > b/gcc/testsuite/gcc.target/aarch64/pr116815-2.c
> > > new file mode 100644
> > > index 00000000000..f2dd7b0bbb3
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr116815-2.c
> > > @@ -0,0 +1,93 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2" } */
> > > +
> > > +/* PR middle-end/116815 */
> > > +
> > > +/* Negative tests.  */
> > 
> > I don't see the benefit in including negative tests here, the tests will
> > get optimized out by the middle-end anyway due to C overflow rules.
> 
> Sure. I have removed this testcase in the new revision.
> 
> > 
> > > +
> > > +static inline int __attribute__ ((always_inline))
> > > +smax (int a, int b)
> > > +{
> > > +  return a > b ? a : b;
> > > +}
> > > +
> > > +static inline int __attribute__ ((always_inline))
> > > +smin (int a, int b)
> > > +{
> > > +  return a < b ? a : b;
> > > +}
> > > +
> > > +static inline unsigned __attribute__ ((always_inline))
> > > +umax (unsigned a, unsigned b)
> > > +{
> > > +  return a > b ? a : b;
> > > +}
> > > +
> > > +static inline unsigned __attribute__ ((always_inline))
> > > +umin (unsigned a, unsigned b)
> > > +{
> > > +  return a < b ? a : b;
> > > +}
> > > +
> > > +#define ASSUME(cond) if (!(cond)) __builtin_unreachable ();
> > > +
> > > +/* This transformation does not trigger on signed types.  */
> > > +
> > > +int
> > > +smax_add (int a, int b)
> > > +{
> > > +  ASSUME (b >= 0);
> > > +  return smax (a, a + b);
> > > +}
> > > +
> > > +int
> > > +smin_add (int a, int b)
> > > +{
> > > +  ASSUME (b >= 0);
> > > +  return smin (a, a + b);
> > > +}
> > > +
> > > +int
> > > +smax_sub (int a, int b)
> > > +{
> > > +  ASSUME (b >= 0);
> > > +  return smax (a, a - b);
> > > +}
> > > +
> > > +int
> > > +smin_sub (int a, int b)
> > > +{
> > > +  ASSUME (b >= 0);
> > > +  return smin (a, a - b);
> > > +}
> > > +
> > > +/* Invalid patterns.  */
> > > +
> > > +/* This can potentially be matched, but the RHS gets factored to
> > > +   (a + b) * b.  */
> > > +unsigned
> > > +umax_factored (unsigned a, unsigned b)
> > > +{
> > > +  return umax (a * b, a * b + b * b);
> > > +}
> > > +
> > > +unsigned
> > > +umin_mult (unsigned a, unsigned b)
> > > +{
> > > +  return umin (a, a * b);
> > > +}
> > > +
> > > +unsigned
> > > +umax_sub (unsigned a, unsigned b)
> > > +{
> > > +  return umax (a, b - a);
> > > +}
> > > +
> > > +unsigned
> > > +umin_sub (unsigned a, unsigned b)
> > > +{
> > > +  return umin (a, b - a);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-not "adds\\t" } } */
> > > +/* { dg-final { scan-assembler-not "subs\\t" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr116815-3.c 
> > > b/gcc/testsuite/gcc.target/aarch64/pr116815-3.c
> > > new file mode 100644
> > > index 00000000000..f8050377116
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr116815-3.c
> > > @@ -0,0 +1,62 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +/* PR middle-end/116815 */
> > > +
> > > +/* Multi-use tests.  */
> > > +
> > > +static inline unsigned __attribute__ ((always_inline))
> > > +max (unsigned a, unsigned b)
> > > +{
> > > +  return a > b ? a : b;
> > > +}
> > > +
> > > +static inline unsigned __attribute__ ((always_inline))
> > > +min (unsigned a, unsigned b)
> > > +{
> > > +  return a < b ? a : b;
> > > +}
> > > +
> > > +/* FIXME: This should only generate one adds.  */
> > 
> > I appreciate you raising the shortcomings of the patch, but perhaps better 
> > just
> > to mention it in the cover letter rather than adding a test that explicitly
> > defends the suboptimal codegen.  AIUI, check-function-bodies tests are 
> > intended
> > for when we're confident there is one optimal sequence of codegen that we 
> > want
> > to generate.
> 
> I have added the FIXME as a comment in the commit message, does that seem 
> okay?

Yep, that's fine.

> This testcase is actually a leftover from when the patch was a GIMPLE 
> transform,
> because this used to work before and is a regression after switching to a
> target-specific transform. I have removed this testcase now.

I see, thanks.

<snip>

> From 588dca91b719168f8ff98b7e8ee4b395bccc7cda Mon Sep 17 00:00:00 2001
> From: Dhruv Chawla <[email protected]>
> Date: Wed, 23 Jul 2025 01:41:51 -0700
> Subject: [PATCH] [aarch64] Make better use of overflowing operations in
>  max/min(a, add/sub(a, b)) [PR116815]
> 
> This patch folds the following patterns:
> - For add:
>   - umax (a, add (a, b)) -> [sum,  ovf] = adds (a, b); !ovf ? sum : a
>   - umin (a, add (a, b)) -> [sum,  ovf] = adds (a, b); !ovf ? a : sum
>     ... along with the commutated versions:
>   - umax (a, add (b, a)) -> [sum,  ovf] = adds (b, a); !ovf ? sum : a
>   - umin (a, add (b, a)) -> [sum,  ovf] = adds (b, a); !ovf ? a : sum
> - For sub:
>   - umax (a, sub (a, b)) -> [diff, udf] = subs (a, b); udf ? diff : a
>   - umin (a, sub (a, b)) -> [diff, udf] = subs (a, b); udf ? a : diff
> 
> Where ovf is the overflow flag and udf is the underflow flag. adds and subs 
> are
> generated by generating parallel compare+plus/minus which map to
> add<mode>3_compareC and sub<mode>3_compare1.
> 
> This patch is a respin of the patch posted at
> https://gcc.gnu.org/pipermail/gcc-patches/2025-May/685021.html as per
> the suggestion to turn it into a target-specific transform by Richard
> Biener.
> 
> FIXME: This pattern cannot currently factor multiple occurences of the
>        add expression into a single adds, eg: max (a, a + b) + min (a + b, b)
>        ends up generating two adds instructions. This is something that
>        was lost when going from GIMPLE to target-specific transforms.
> 
> Bootstrapped and regtested on aarch64-unknown-linux-gnu.
> 
> Signed-off-by: Dhruv Chawla <[email protected]>
> 
>       PR middle-end/116815
> 
> gcc/ChangeLog:
> 
>       * config/aarch64/aarch64.md
>       (*aarch64_plus_within_<optab><mode>3_<ovf_commutate>): New pattern.
>       (*aarch64_minus_within_<optab><mode>3): Likewise.
>       * config/aarch64/iterators.md (ovf_add_cmp): New code attribute.
>       (udf_sub_cmp): Likewise.
>       (UMAXMIN): New code iterator.
>       (ovf_commutate): New iterator.
>       (ovf_comm_opp): New int attribute.
> 
> gcc/testsuite/ChangeLog:
> 
>       * gcc.target/aarch64/pr116815-1.c: New test.
>       * gcc.target/aarch64/pr116815-2.c: Likewise.
>       * gcc.target/aarch64/pr116815-3.c: Likewise.
>       * gcc.target/aarch64/pr116815-4.c: Likewise.
> ---
>  gcc/config/aarch64/aarch64.md                 |  60 +++++++++
>  gcc/config/aarch64/iterators.md               |   9 ++
>  gcc/testsuite/gcc.target/aarch64/pr116815-1.c | 120 ++++++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/pr116815-2.c |  48 +++++++
>  gcc/testsuite/gcc.target/aarch64/pr116815-3.c |  44 +++++++
>  gcc/testsuite/gcc.target/aarch64/pr116815-4.c |  60 +++++++++
>  6 files changed, 341 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-3.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr116815-4.c
> 
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 98c65a74c8e..307df4e92ba 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -4488,6 +4488,66 @@
>    [(set_attr "type" "<su>div")]
>  )
> +;; umax (a, add (a, b)) => [sum, ovf] = adds (a, b); !ovf ? sum : a
> +;; umin (a, add (a, b)) => [sum, ovf] = adds (a, b); !ovf ? a : sum
> +;; ... and the commutated versions:
> +;; umax (a, add (b, a)) => [sum, ovf] = adds (b, a); !ovf ? sum : a
> +;; umin (a, add (b, a)) => [sum, ovf] = adds (b, a); !ovf ? a : sum
> +(define_insn_and_split "*aarch64_plus_within_<optab><mode>3_<ovf_commutate>"
> +  [(set (match_operand:GPI 0 "register_operand" "=r")
> +     (UMAXMIN:GPI
> +       (plus:GPI (match_operand:GPI 1 "register_operand" "r")
> +                 (match_operand:GPI 2 "register_operand" "r"))
> +       (match_dup ovf_commutate)))
> +   (clobber (match_scratch:GPI 3))]

Please add an "=r" constraint here.

> +  "!TARGET_CSSC"
> +  "#"
> +  "&& !reload_completed"

This should be "&& 1".

> +  [(parallel
> +      [(set (reg:CC_C CC_REGNUM)
> +         (compare:CC_C (plus:GPI (match_dup ovf_commutate)
> +                                 (match_dup <ovf_comm_opp>))
> +                       (match_dup ovf_commutate)))
> +       (set (match_dup 3) (plus:GPI (match_dup ovf_commutate)
> +                                 (match_dup <ovf_comm_opp>)))])
> +   (set (match_dup 0)
> +     (if_then_else:GPI (<ovf_add_cmp> (reg:CC_C CC_REGNUM)
> +                                      (const_int 0))
> +                       (match_dup 3)
> +                       (match_dup ovf_commutate)))]
> +  {
> +    if (GET_CODE (operands[3]) == SCRATCH)
> +      operands[3] = gen_reg_rtx (<MODE>mode);
> +  }
> +)
> +
> +;; umax (a, sub (a, b)) => [diff, udf] = subs (a, b); udf ? diff : a
> +;; umin (a, sub (a, b)) => [diff, udf] = subs (a, b); udf ? a : diff
> +(define_insn_and_split "*aarch64_minus_within_<optab><mode>3"
> +  [(set (match_operand:GPI 0 "register_operand" "=r")
> +     (UMAXMIN:GPI
> +       (minus:GPI (match_operand:GPI 1 "register_operand" "r")
> +                  (match_operand:GPI 2 "register_operand" "r"))
> +       (match_dup 1)))
> +   (clobber (match_scratch:GPI 3))]

Please add an "=r" constraint here.

> +  "!TARGET_CSSC"
> +  "#"
> +  "&& !reload_completed"

This should be "&& 1".

> +  [(parallel
> +      [(set (reg:CC CC_REGNUM)
> +         (compare:CC (match_dup 1) (match_dup 2)))
> +       (set (match_dup 3) (minus:GPI (match_dup 1) (match_dup 2)))])
> +   (set (match_dup 0)
> +         (if_then_else:GPI (<udf_sub_cmp> (reg:CC CC_REGNUM)
> +                                             (const_int 0))
> +                           (match_dup 3)
> +                           (match_dup 1)))]
> +  {
> +    if (GET_CODE (operands[3]) == SCRATCH)
> +      operands[3] = gen_reg_rtx (<MODE>mode);
> +  }
> +)
> +
>  ;; -------------------------------------------------------------------
>  ;; Comparison insns
>  ;; -------------------------------------------------------------------
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 517b2808b5f..74186cfae8c 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -2827,6 +2827,8 @@
>  (define_code_iterator FMAXMIN [smax smin])
> +(define_code_iterator UMAXMIN [umax umin])
> +
>  ;; Signed and unsigned max operations.
>  (define_code_iterator USMAX [smax umax])
> @@ -3115,6 +3117,9 @@
>  (define_code_attr maxminand [(smax "bic") (smin "and")])
> +(define_code_attr ovf_add_cmp [(umax "geu") (umin "ltu")])
> +(define_code_attr udf_sub_cmp [(umax "ltu") (umin "geu")])
> +
>  ;; MLA/MLS attributes.
>  (define_code_attr as [(ss_plus "a") (ss_minus "s")])
> @@ -5126,3 +5131,7 @@
>     (UNSPEC_F2CVT "f2cvt")
>     (UNSPEC_F1CVTLT "f1cvtlt")
>     (UNSPEC_F2CVTLT "f2cvtlt")])
> +
> +;; Operand numbers for commutative operations
> +(define_int_iterator ovf_commutate [1 2])
> +(define_int_attr ovf_comm_opp [(1 "2") (2 "1")])
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr116815-1.c 
> b/gcc/testsuite/gcc.target/aarch64/pr116815-1.c
> new file mode 100644
> index 00000000000..f3bdb794318
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr116815-1.c
> @@ -0,0 +1,120 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +/* PR middle-end/116815 */
> +
> +/* Single-use tests.  */
> +
> +static inline unsigned __attribute__ ((always_inline))
> +max (unsigned a, unsigned b)
> +{
> +  return a > b ? a : b;
> +}
> +
> +static inline unsigned __attribute__ ((always_inline))
> +min (unsigned a, unsigned b)
> +{
> +  return a < b ? a : b;
> +}
> +
> +#define OPERATION(op, type, N, exp1, exp2)                                   
>   \
> +  unsigned u##op##type##N (unsigned a, unsigned b) { return op (exp1, exp2); 
> }
> +
> +/*
> +** umaxadd1:
> +**   adds    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cc
> +**   ret
> +*/
> +OPERATION (max, add, 1, a, a + b)
> +
> +/*
> +** umaxadd2:
> +**   adds    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cc
> +**   ret
> +*/
> +OPERATION (max, add, 2, a, b + a)
> +
> +/*
> +** umaxadd3:
> +**   adds    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cc
> +**   ret
> +*/
> +OPERATION (max, add, 3, a + b, a)
> +
> +/*
> +** umaxadd4:
> +**   adds    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cc
> +**   ret
> +*/
> +OPERATION (max, add, 4, b + a, a)
> +
> +/*
> +** uminadd1:
> +**   adds    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cs
> +**   ret
> +*/
> +OPERATION (min, add, 1, a, a + b)
> +
> +/*
> +** uminadd2:
> +**   adds    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cs
> +**   ret
> +*/
> +OPERATION (min, add, 2, a, b + a)
> +
> +/*
> +** uminadd3:
> +**   adds    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cs
> +**   ret
> +*/
> +OPERATION (min, add, 3, a + b, a)
> +
> +/*
> +** uminadd4:
> +**   adds    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cs
> +**   ret
> +*/
> +OPERATION (min, add, 4, b + a, a)
> +
> +/* sub requires the inverse of the comparison from add.  */
> +
> +/*
> +** umaxsub1:
> +**   subs    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cc
> +**   ret
> +*/
> +OPERATION (max, sub, 1, a, a - b)
> +
> +/*
> +** umaxsub2:
> +**   subs    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cc
> +**   ret
> +*/
> +OPERATION (max, sub, 2, a - b, a)
> +
> +/*
> +** uminsub1:
> +**   subs    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cs
> +**   ret
> +*/
> +OPERATION (min, sub, 1, a, a - b)
> +
> +/*
> +** uminsub2:
> +**   subs    (w[0-9]+), w0, w1
> +**   csel    w0, \1, w0, cs
> +**   ret
> +*/
> +OPERATION (min, sub, 2, a - b, a)
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr116815-2.c 
> b/gcc/testsuite/gcc.target/aarch64/pr116815-2.c
> new file mode 100644
> index 00000000000..eb1fdb94797
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr116815-2.c
> @@ -0,0 +1,48 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +/* PR middle-end/116815 */
> +
> +/* Multi-use testcase across basic blocks.  */
> +
> +static inline unsigned __attribute__ ((always_inline))
> +max (unsigned a, unsigned b)
> +{
> +  return a > b ? a : b;
> +}
> +
> +static inline unsigned __attribute__ ((always_inline))
> +min (unsigned a, unsigned b)
> +{
> +  return a < b ? a : b;
> +}
> +
> +#define OPERATION(op, type, N, exp1, exp2)                                   
>   \
> +  unsigned u##op##type##N (unsigned a, unsigned b, unsigned c, unsigned d,   
>   \
> +                        unsigned e)                                         \
> +  {                                                                          
>   \
> +    unsigned result = op (exp1, exp2);                                       
>   \
> +    if (result == c || result == c * 2)                                      
>   \
> +      return d;                                                              
>   \
> +    else                                                                     
>   \
> +      return e;                                                              
>   \
> +  }

What is the purpose/intent of this test?  IMO it seems superfluous.

The patch is OK without it and with the above changes to the patterns.

Thanks!
Alex

> +
> +OPERATION (max, add, 1, a, a + b)
> +OPERATION (max, add, 2, a, b + a)
> +OPERATION (max, add, 3, a + b, a)
> +OPERATION (max, add, 4, b + a, a)
> +
> +OPERATION (min, add, 1, a, a + b)
> +OPERATION (min, add, 2, a, b + a)
> +OPERATION (min, add, 3, a + b, a)
> +OPERATION (min, add, 4, b + a, a)
> +
> +OPERATION (max, sub, 1, a, a - b)
> +OPERATION (max, sub, 2, a - b, a)
> +
> +OPERATION (min, sub, 1, a, a - b)
> +OPERATION (min, sub, 2, a - b, a)
> +
> +/* { dg-final { scan-assembler-times "adds\\t" 8 } } */
> +/* { dg-final { scan-assembler-times "subs\\t" 4 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr116815-3.c 
> b/gcc/testsuite/gcc.target/aarch64/pr116815-3.c
> new file mode 100644
> index 00000000000..015c868aec2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr116815-3.c
> @@ -0,0 +1,44 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#pragma GCC target "+cssc"
> +
> +/* PR middle-end/116815 */
> +
> +/* Make sure that umax/umin instructions are generated with CSSC.  */
> +
> +static inline unsigned __attribute__ ((always_inline))
> +max (unsigned a, unsigned b)
> +{
> +  return a > b ? a : b;
> +}
> +
> +static inline unsigned __attribute__ ((always_inline))
> +min (unsigned a, unsigned b)
> +{
> +  return a < b ? a : b;
> +}
> +
> +#define OPERATION(op, type, N, exp1, exp2)                                   
>   \
> +  unsigned u##op##type##N (unsigned a, unsigned b) { return op (exp1, exp2); 
> }
> +
> +OPERATION (max, add, 1, a, a + b)
> +OPERATION (max, add, 2, a, b + a)
> +OPERATION (max, add, 3, a + b, a)
> +OPERATION (max, add, 4, b + a, a)
> +
> +OPERATION (min, add, 1, a, a + b)
> +OPERATION (min, add, 2, a, b + a)
> +OPERATION (min, add, 3, a + b, a)
> +OPERATION (min, add, 4, b + a, a)
> +
> +OPERATION (max, sub, 1, a, a - b)
> +OPERATION (max, sub, 2, a - b, a)
> +
> +OPERATION (min, sub, 1, a, a - b)
> +OPERATION (min, sub, 2, a - b, a)
> +
> +/* { dg-final { scan-assembler-times "umax\\t" 6 } } */
> +/* { dg-final { scan-assembler-times "umin\\t" 6 } } */
> +/* { dg-final { scan-assembler-not "adds\\t" } } */
> +/* { dg-final { scan-assembler-not "subs\\t" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr116815-4.c 
> b/gcc/testsuite/gcc.target/aarch64/pr116815-4.c
> new file mode 100644
> index 00000000000..d262d2170f3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr116815-4.c
> @@ -0,0 +1,60 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2" } */
> +
> +/* PR middle-end/116815 */
> +
> +/* Verify that the transformation gives correct results */
> +
> +static inline unsigned __attribute__ ((always_inline))
> +min (unsigned a, unsigned b)
> +{
> +  return (a < b) ? a : b;
> +}
> +
> +static inline unsigned __attribute__ ((always_inline))
> +max (unsigned a, unsigned b)
> +{
> +  return (a > b) ? a : b;
> +}
> +
> +__attribute__ ((noipa)) unsigned
> +umaxadd (unsigned a, unsigned b)
> +{
> +  return max (a + b, a);
> +}
> +
> +__attribute__ ((noipa)) unsigned
> +umaxsub (unsigned a, unsigned b)
> +{
> +  return max (a - b, a);
> +}
> +
> +__attribute__ ((noipa)) unsigned
> +uminadd (unsigned a, unsigned b)
> +{
> +  return min (a + b, a);
> +}
> +
> +__attribute__ ((noipa)) unsigned
> +uminsub (unsigned a, unsigned b)
> +{
> +  return min (a - b, a);
> +}
> +
> +int
> +main ()
> +{
> +  /* Overflows to 0x30000000.  */
> +  if (umaxadd (0x90000000, 0xa0000000) != 0x90000000)
> +    __builtin_abort ();
> +
> +  if (uminadd (0x90000000, 0xa0000000) != 0x30000000)
> +    __builtin_abort ();
> +
> +  /* Underflows to 0x60000000.  */
> +  if (umaxsub (0x00000000, 0xa0000000) != 0x60000000)
> +    __builtin_abort ();
> +
> +  if (uminsub (0x00000000, 0xa0000000) != 0x00000000)
> +    __builtin_abort ();
> +}
> -- 
> 2.44.0

Re: [PATCH v5] [aarch64] Make better use of overflowing operations in max/min(a, add/sub(a, b)) [PR116815]

Reply via email to