Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-12-12 Thread Richard Earnshaw




On 30/11/2023 12:55, Stamatis Markianos-Wright wrote:

Hi Andre,

Thanks for the comments, see latest revision attached.

On 27/11/2023 12:47, Andre Vieira (lists) wrote:

Hi Stam,

Just some comments.

+/* Recursively scan through the DF chain backwards within the basic 
block and
+   determine if any of the USEs of the original insn (or the USEs of 
the insns
s/Recursively scan/Scan/ as you no longer recurse, thanks for that by 
the way :) +   where thy were DEF-ed, etc., recursively) were affected 
by implicit VPT

remove recursively for the same reasons.

+  if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P 
(cond_temp_iv.step))

+    return NULL;
+  /* Look at the steps and swap around the rtx's if needed. Error 
out if

+ one of them cannot be identified as constant.  */
+  if (INTVAL (cond_counter_iv.step) != 0 && INTVAL 
(cond_temp_iv.step) != 0)

+    return NULL;

Move the comment above the if before, as the erroring out it talks 
about is there.

Done


+  emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
 space after 'insn_note)'

@@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 -  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
   || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !CONST_INT_P (XEXP (inc_src, 1)))

Do we ever check that inc_src is negative? We used to check if it was 
-1, now we only check it's a constnat, but not a negative one, so I 
suspect this needs a:

|| INTVAL (XEXP (inc_src, 1)) >= 0

Good point. Done


@@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc 
*desc,

 case GE:
   /* Currently only GE tests against zero are supported.  */
   gcc_assert (XEXP (condition, 1) == const0_rtx);
-
+  /* FALLTHRU */
+    case GTU:
   noloop = constm1_rtx;

I spent a very long time staring at this trying to understand why 
noloop = constm1_rtx for GTU, where I thought it should've been (count 
& (n-1)). For the current use of doloop it doesn't matter because ARM 
is the only target using it and you set desc->noloop_assumptions to 
null_rtx in 'arm_attempt_dlstp_transform' so noloop is never used. 
However, if a different target accepts this GTU pattern then this 
target agnostic code will do the wrong thing.  I suggest we either:
 - set noloop to what we think might be the correct value, which if 
you ask me should be 'count & (XEXP (condition, 1))',
 - or add a gcc_assert (GET_CODE (condition) != GTU); under the if 
(desc->noloop_assumption); part and document why.  I have a slight 
preference for the assert given otherwise we are adding code that we 
can't test.


Yea, that's true tbh. I've done the latter, but also separated out the 
"case GTU:" and added a comment, so that it's more clear that the noloop 
things aren't used in the only implemented GTU case (Arm)


Thank you :)



LGTM otherwise (but I don't have the power to approve this ;)).

Kind regards,
Andre

From: Stamatis Markianos-Wright 
Sent: Thursday, November 16, 2023 11:36 AM
To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; 
Richard Sandiford; Kyrylo Tkachov
Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated 
Low Overhead Loops


Pinging back to the top of reviewers' inboxes due to worry about Stage 1
End in a few days :)


See the last email for the latest version of the 2/2 patch. The 1/2
patch is A-Ok from Kyrill's earlier target-backend review.


On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:


On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:


On 06/11/2023 11:24, Richard Sandiford wrote:

Stamatis Markianos-Wright  writes:
One of the main reasons for reading the arm bits was to try to 
answer

the question: if we switch to a downcounting loop with a GE
condition,
how do we make sure that the start value is not a large unsigned
number that is interpreted as negative by GE?  E.g. if the loop
originally counted up in steps of N and used an LTU condition,
it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
But the loop might never iterate if we start counting down from
most values in that range.

Does the patch handle that?

So AFAICT this is actually handled in the generic code in
`doloop_valid_p`:

This kind of loops fail because of they are "desc->infinite", then no
loop-doloop conversion is attempted at all (even for standard
dls/le loops)

Thanks to that check I haven't been able to trigger anything like the
behaviour you describe, do you think the doloop_valid_p checks are
robust enough?

The loops I was thinking of are provably no

Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-12-09 Thread Richard Sandiford
Sorry for the slow review.

Stamatis Markianos-Wright  writes:
> [...]
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 
> 44a04b86cb5806fcf50917826512fd203d42106c..c083f965fa9a40781bc86beb6e63654afd14eac4
>  100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -6922,23 +6922,24 @@
>  ;; Originally expanded by 'predicated_doloop_end'.
>  ;; In the rare situation where the branch is too far, we do also need to
>  ;; revert FPSCR.LTPSIZE back to 0x100 after the last iteration.
> -(define_insn "*predicated_doloop_end_internal"
> +(define_insn "predicated_doloop_end_internal"
>[(set (pc)
>   (if_then_else
> -(ge (plus:SI (reg:SI LR_REGNUM)
> - (match_operand:SI 0 "const_int_operand" ""))
> - (const_int 0))
> -  (label_ref (match_operand 1 "" ""))
> +(gtu (unspec:SI [(plus:SI (match_operand:SI 0 "s_register_operand" 
> "=r")
> +  (const_int ))]
> + LETP)
> + (const_int ))

Is there any need for the unspec?  I couldn't see why this wasn't simply:

  (gtu (match_operand:SI 0 "s_register_operand" "=r")
   (const_int ))

But I agree that using gtu rather than ge is nicer if it's what the
instruction does.

> diff --git a/gcc/df-core.cc b/gcc/df-core.cc
> index 
> d4812b04a7cb97ea1606082e26e910472da5bcc1..4fcc14bf790d43e792b3c926fe1f80073d908c17
>  100644
> --- a/gcc/df-core.cc
> +++ b/gcc/df-core.cc
> @@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned 
> int regno)
>return NULL;
>  }
>  
> +/* Return the one and only def of REGNO within BB.  If there is no def or
> +   there are multiple defs, return NULL.  */
> +
> +df_ref
> +df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
> +{
> +  df_ref temp = df_bb_regno_first_def_find (bb, regno);
> +  if (!temp)
> +return NULL;
> +  else if (temp == df_bb_regno_last_def_find (bb, regno))
> +return temp;
> +  else
> +return NULL;
> +}
> +
>  /* Finds the reference corresponding to the definition of REG in INSN.
> DF is the dataflow object.  */
>  
> diff --git a/gcc/df.h b/gcc/df.h
> index 
> 402657a7076f1bcad24e9c50682e033e57f432f9..98623637f9c839c799222e99df2a7173a770b2ac
>  100644
> --- a/gcc/df.h
> +++ b/gcc/df.h
> @@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
>  #endif
>  extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
>  extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
> +extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
>  extern df_ref df_find_def (rtx_insn *, rtx);
>  extern bool df_reg_defined (rtx_insn *, rtx);
>  extern df_ref df_find_use (rtx_insn *, rtx);
> diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
> index 
> 4feb0a25ab9331b7124df900f73c9fc6fb3eb10b..d919207505c472c8a54a2c9c982a09061584177b
>  100644
> --- a/gcc/loop-doloop.cc
> +++ b/gcc/loop-doloop.cc
> @@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
>   forms:
>  
>   1)  (parallel [(set (pc) (if_then_else (condition)
> - (label_ref (label))
> - (pc)))
> -  (set (reg) (plus (reg) (const_int -1)))
> -  (additional clobbers and uses)])
> + (label_ref (label))
> + (pc)))
> +  (set (reg) (plus (reg) (const_int -1)))
> +  (additional clobbers and uses)])
>  
>   The branch must be the first entry of the parallel (also required
>   by jump.cc), and the second entry of the parallel must be a set of
> @@ -96,19 +96,34 @@ doloop_condition_get (rtx_insn *doloop_pat)
>   the loop counter in an if_then_else too.
>  
>   2)  (set (reg) (plus (reg) (const_int -1))
> - (set (pc) (if_then_else (reg != 0)
> -  (label_ref (label))
> -  (pc))).  
> +  (set (pc) (if_then_else (reg != 0)
> +  (label_ref (label))
> +  (pc))).
>  
>   Some targets (ARM) do the comparison before the branch, as in the
>   following form:
>  
> - 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
> -   (set (reg) (plus (reg) (const_int -1)))])
> -(set (pc) (if_then_else (cc == NE)
> -(label_ref (label))
> -(pc))) */
> -
> + 3) (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
> +(set (reg) (plus (reg) (const_int -1)))])
> + (set (pc) (if_then_else (cc == NE)
> + (label_ref (label))
> + (pc)))
> +
> +  The ARM target also supports a special case of a counter that 
> decrements
> +  by `n` and terminating in a GTU condition.  In that case, the compare 
> and
> +  

Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-12-07 Thread Andre Vieira (lists)
Thanks for addressing my comments. I have reviewed this and the other 
patch before and they LGTM. I however do not have approval rights so you 
will need the OK from a maintainer.


Thanks for doing this :)

Andre

On 30/11/2023 12:55, Stamatis Markianos-Wright wrote:

Hi Andre,

Thanks for the comments, see latest revision attached.

On 27/11/2023 12:47, Andre Vieira (lists) wrote:

Hi Stam,

Just some comments.

+/* Recursively scan through the DF chain backwards within the basic 
block and
+   determine if any of the USEs of the original insn (or the USEs of 
the insns
s/Recursively scan/Scan/ as you no longer recurse, thanks for that by 
the way :) +   where thy were DEF-ed, etc., recursively) were affected 
by implicit VPT

remove recursively for the same reasons.

+  if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P 
(cond_temp_iv.step))

+    return NULL;
+  /* Look at the steps and swap around the rtx's if needed. Error 
out if

+ one of them cannot be identified as constant.  */
+  if (INTVAL (cond_counter_iv.step) != 0 && INTVAL 
(cond_temp_iv.step) != 0)

+    return NULL;

Move the comment above the if before, as the erroring out it talks 
about is there.

Done


+  emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
 space after 'insn_note)'

@@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 -  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
   || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !CONST_INT_P (XEXP (inc_src, 1)))

Do we ever check that inc_src is negative? We used to check if it was 
-1, now we only check it's a constnat, but not a negative one, so I 
suspect this needs a:

|| INTVAL (XEXP (inc_src, 1)) >= 0

Good point. Done


@@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc 
*desc,

 case GE:
   /* Currently only GE tests against zero are supported.  */
   gcc_assert (XEXP (condition, 1) == const0_rtx);
-
+  /* FALLTHRU */
+    case GTU:
   noloop = constm1_rtx;

I spent a very long time staring at this trying to understand why 
noloop = constm1_rtx for GTU, where I thought it should've been (count 
& (n-1)). For the current use of doloop it doesn't matter because ARM 
is the only target using it and you set desc->noloop_assumptions to 
null_rtx in 'arm_attempt_dlstp_transform' so noloop is never used. 
However, if a different target accepts this GTU pattern then this 
target agnostic code will do the wrong thing.  I suggest we either:
 - set noloop to what we think might be the correct value, which if 
you ask me should be 'count & (XEXP (condition, 1))',
 - or add a gcc_assert (GET_CODE (condition) != GTU); under the if 
(desc->noloop_assumption); part and document why.  I have a slight 
preference for the assert given otherwise we are adding code that we 
can't test.


Yea, that's true tbh. I've done the latter, but also separated out the 
"case GTU:" and added a comment, so that it's more clear that the noloop 
things aren't used in the only implemented GTU case (Arm)


Thank you :)



LGTM otherwise (but I don't have the power to approve this ;)).

Kind regards,
Andre

From: Stamatis Markianos-Wright 
Sent: Thursday, November 16, 2023 11:36 AM
To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; 
Richard Sandiford; Kyrylo Tkachov
Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated 
Low Overhead Loops


Pinging back to the top of reviewers' inboxes due to worry about Stage 1
End in a few days :)


See the last email for the latest version of the 2/2 patch. The 1/2
patch is A-Ok from Kyrill's earlier target-backend review.


On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:


On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:


On 06/11/2023 11:24, Richard Sandiford wrote:

Stamatis Markianos-Wright  writes:
One of the main reasons for reading the arm bits was to try to 
answer

the question: if we switch to a downcounting loop with a GE
condition,
how do we make sure that the start value is not a large unsigned
number that is interpreted as negative by GE?  E.g. if the loop
originally counted up in steps of N and used an LTU condition,
it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
But the loop might never iterate if we start counting down from
most values in that range.

Does the patch handle that?

So AFAICT this is actually handled in the generic code in
`doloop_valid_p`:

This kind of loops fail because of they are "desc->infinite", then no
loop-doloop conversion is attempted at all (even for standa

Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-11-27 Thread Andre Vieira (lists)

Hi Stam,

Just some comments.

+/* Recursively scan through the DF chain backwards within the basic 
block and
+   determine if any of the USEs of the original insn (or the USEs of 
the insns
s/Recursively scan/Scan/ as you no longer recurse, thanks for that by 
the way :) +   where thy were DEF-ed, etc., recursively) were affected 
by implicit VPT

remove recursively for the same reasons.

+  if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P 
(cond_temp_iv.step))

+   return NULL;
+  /* Look at the steps and swap around the rtx's if needed.  Error 
out if

+one of them cannot be identified as constant.  */
+  if (INTVAL (cond_counter_iv.step) != 0 && INTVAL 
(cond_temp_iv.step) != 0)

+   return NULL;

Move the comment above the if before, as the erroring out it talks about 
is there.


+  emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
 space after 'insn_note)'

@@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 -  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
   || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !CONST_INT_P (XEXP (inc_src, 1)))

Do we ever check that inc_src is negative? We used to check if it was 
-1, now we only check it's a constnat, but not a negative one, so I 
suspect this needs a:

|| INTVAL (XEXP (inc_src, 1)) >= 0

@@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
 case GE:
   /* Currently only GE tests against zero are supported.  */
   gcc_assert (XEXP (condition, 1) == const0_rtx);
-
+  /* FALLTHRU */
+case GTU:
   noloop = constm1_rtx;

I spent a very long time staring at this trying to understand why noloop 
= constm1_rtx for GTU, where I thought it should've been (count & 
(n-1)). For the current use of doloop it doesn't matter because ARM is 
the only target using it and you set desc->noloop_assumptions to 
null_rtx in 'arm_attempt_dlstp_transform' so noloop is never used. 
However, if a different target accepts this GTU pattern then this target 
agnostic code will do the wrong thing.  I suggest we either:
 - set noloop to what we think might be the correct value, which if you 
ask me should be 'count & (XEXP (condition, 1))',
 - or add a gcc_assert (GET_CODE (condition) != GTU); under the if 
(desc->noloop_assumption); part and document why.  I have a slight 
preference for the assert given otherwise we are adding code that we 
can't test.


LGTM otherwise (but I don't have the power to approve this ;)).

Kind regards,
Andre

From: Stamatis Markianos-Wright 
Sent: Thursday, November 16, 2023 11:36 AM
To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; Richard 
Sandiford; Kyrylo Tkachov
Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low 
Overhead Loops


Pinging back to the top of reviewers' inboxes due to worry about Stage 1
End in a few days :)


See the last email for the latest version of the 2/2 patch. The 1/2
patch is A-Ok from Kyrill's earlier target-backend review.


On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:


On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:


On 06/11/2023 11:24, Richard Sandiford wrote:

Stamatis Markianos-Wright  writes:

One of the main reasons for reading the arm bits was to try to answer
the question: if we switch to a downcounting loop with a GE
condition,
how do we make sure that the start value is not a large unsigned
number that is interpreted as negative by GE?  E.g. if the loop
originally counted up in steps of N and used an LTU condition,
it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
But the loop might never iterate if we start counting down from
most values in that range.

Does the patch handle that?

So AFAICT this is actually handled in the generic code in
`doloop_valid_p`:

This kind of loops fail because of they are "desc->infinite", then no
loop-doloop conversion is attempted at all (even for standard
dls/le loops)

Thanks to that check I haven't been able to trigger anything like the
behaviour you describe, do you think the doloop_valid_p checks are
robust enough?

The loops I was thinking of are provably not infinite though. E.g.:

   for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
 ...

is known to terminate.  And doloop conversion is safe with the normal
count-down-by-1 approach, so I don't think current code would need
to reject it.  I.e. a conversion to:

   unsigned int i = UINT_MAX - 101;
   do
 ...
   while (--i != ~0U);

would be safe, but a conversion to:

   int i = UINT_MAX - 101;
   do
 ...

[PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-11-16 Thread Stamatis Markianos-Wright
Pinging back to the top of reviewers' inboxes due to worry about Stage 1 
End in a few days :)



See the last email for the latest version of the 2/2 patch. The 1/2 
patch is A-Ok from Kyrill's earlier target-backend review.



On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:


On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:


On 06/11/2023 11:24, Richard Sandiford wrote:

Stamatis Markianos-Wright  writes:

One of the main reasons for reading the arm bits was to try to answer
the question: if we switch to a downcounting loop with a GE 
condition,

how do we make sure that the start value is not a large unsigned
number that is interpreted as negative by GE?  E.g. if the loop
originally counted up in steps of N and used an LTU condition,
it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
But the loop might never iterate if we start counting down from
most values in that range.

Does the patch handle that?
So AFAICT this is actually handled in the generic code in 
`doloop_valid_p`:


This kind of loops fail because of they are "desc->infinite", then no
loop-doloop conversion is attempted at all (even for standard 
dls/le loops)


Thanks to that check I haven't been able to trigger anything like the
behaviour you describe, do you think the doloop_valid_p checks are
robust enough?

The loops I was thinking of are provably not infinite though. E.g.:

   for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
 ...

is known to terminate.  And doloop conversion is safe with the normal
count-down-by-1 approach, so I don't think current code would need
to reject it.  I.e. a conversion to:

   unsigned int i = UINT_MAX - 101;
   do
 ...
   while (--i != ~0U);

would be safe, but a conversion to:

   int i = UINT_MAX - 101;
   do
 ...
   while ((i -= step, i > 0));

wouldn't, because the loop body would only be executed once.

I'm only going off the name "infinite" though :)  It's possible that
it has more connotations than that.

Thanks,
Richard


Ack, yep, I see what you mean now, and yep, that kind of loop does 
indeed pass through doloop_valid_p


Interestingly , in the v8-M Arm ARM this is done with:

```

boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
// This does not check whether a loop is currently active.
// If the PE were in a loop, would this be the last one?
return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));

```

So architecturally the asm we output would be ok (except maybe the 
"branch too far subs;bgt;lctp" fallback at 
`predicated_doloop_end_internal` (maybe that should be `bhi`))... But 
now GE: isn't looking like an accurate representation of this 
operation in the compiler.


I'm wondering if I should try to make 
`predicated_doloop_end_internal` contain a comparison along the lines 
of:

(gtu: (plus: (LR) (const_int -num_lanes)) (const_int num_lanes_minus_1))

I'll give that a try :)

The only reason I'd chosen to go with GE earlier, tbh, was because of 
the existing handling of GE in loop-doloop.cc


Let me know if any other ideas come to your mind!


Cheers,

Stam



It looks like I've had success with the below (diff to previous patch),
trimmed a bit to only the functionally interesting things::




diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index 368d5138ca1..54dd4ee564b 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1649,16 +1649,28 @@
   && (decrement_num = arm_attempt_dlstp_transform (operands[1]))
   && (INTVAL (decrement_num) != 1))
     {
-      insn = emit_insn
-          (gen_thumb2_addsi3_compare0
-              (s0, s0, GEN_INT ((-1) * (INTVAL (decrement_num);
-      cmp = XVECEXP (PATTERN (insn), 0, 0);
-      cc_reg = SET_DEST (cmp);
-      bcomp = gen_rtx_GE (VOIDmode, cc_reg, const0_rtx);
   loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands[1]);
-      emit_jump_insn (gen_rtx_SET (pc_rtx,
-                   gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                             loc_ref, pc_rtx)));
+      switch (INTVAL (decrement_num))
+        {
+          case 2:
+            insn = emit_jump_insn (gen_predicated_doloop_end_internal2
+                        (s0, loc_ref));
+            break;
+          case 4:
+            insn = emit_jump_insn (gen_predicated_doloop_end_internal4
+                        (s0, loc_ref));
+            break;
+          case 8:
+            insn = emit_jump_insn (gen_predicated_doloop_end_internal8
+                        (s0, loc_ref));
+            break;
+          case 16:
+            insn = emit_jump_insn (gen_predicated_doloop_end_internal16
+                        (s0, loc_ref));
+            break;
+          default:
+            gcc_unreachable ();
+        }
   DONE;
     }
 }

diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 93905583b18..c083f965fa9 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6922,23 +6922,24 @@
 ;; 

Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-11-06 Thread Stamatis Markianos-Wright



On 06/11/2023 11:24, Richard Sandiford wrote:

Stamatis Markianos-Wright  writes:

One of the main reasons for reading the arm bits was to try to answer
the question: if we switch to a downcounting loop with a GE condition,
how do we make sure that the start value is not a large unsigned
number that is interpreted as negative by GE?  E.g. if the loop
originally counted up in steps of N and used an LTU condition,
it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
But the loop might never iterate if we start counting down from
most values in that range.

Does the patch handle that?

So AFAICT this is actually handled in the generic code in `doloop_valid_p`:

This kind of loops fail because of they are "desc->infinite", then no
loop-doloop conversion is attempted at all (even for standard dls/le loops)

Thanks to that check I haven't been able to trigger anything like the
behaviour you describe, do you think the doloop_valid_p checks are
robust enough?

The loops I was thinking of are provably not infinite though.  E.g.:

   for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
 ...

is known to terminate.  And doloop conversion is safe with the normal
count-down-by-1 approach, so I don't think current code would need
to reject it.  I.e. a conversion to:

   unsigned int i = UINT_MAX - 101;
   do
 ...
   while (--i != ~0U);

would be safe, but a conversion to:

   int i = UINT_MAX - 101;
   do
 ...
   while ((i -= step, i > 0));

wouldn't, because the loop body would only be executed once.

I'm only going off the name "infinite" though :)  It's possible that
it has more connotations than that.

Thanks,
Richard


Ack, yep, I see what you mean now, and yep, that kind of loop does 
indeed pass through doloop_valid_p


Interestingly , in the v8-M Arm ARM this is done with:

```

boolean IsLastLowOverheadLoop(INSTR_EXEC_STATE_Type state)
// This does not check whether a loop is currently active.
// If the PE were in a loop, would this be the last one?
return UInt(state.LoopCount) <= (1 << (4 - LTPSIZE));

```

So architecturally the asm we output would be ok (except maybe the 
"branch too far subs;bgt;lctp" fallback at 
`predicated_doloop_end_internal` (maybe that should be `bhi`))... But 
now GE: isn't looking like an accurate representation of this operation 
in the compiler.


I'm wondering if I should try to make `predicated_doloop_end_internal` 
contain a comparison along the lines of:

(gtu: (plus: (LR) (const_int -num_lanes)) (const_int num_lanes_minus_1))

I'll give that a try :)

The only reason I'd chosen to go with GE earlier, tbh, was because of 
the existing handling of GE in loop-doloop.cc


Let me know if any other ideas come to your mind!


Cheers,

Stam




Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-11-06 Thread Richard Sandiford
Stamatis Markianos-Wright  writes:
>> One of the main reasons for reading the arm bits was to try to answer
>> the question: if we switch to a downcounting loop with a GE condition,
>> how do we make sure that the start value is not a large unsigned
>> number that is interpreted as negative by GE?  E.g. if the loop
>> originally counted up in steps of N and used an LTU condition,
>> it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
>> But the loop might never iterate if we start counting down from
>> most values in that range.
>>
>> Does the patch handle that?
>
> So AFAICT this is actually handled in the generic code in `doloop_valid_p`:
>
> This kind of loops fail because of they are "desc->infinite", then no 
> loop-doloop conversion is attempted at all (even for standard dls/le loops)
>
> Thanks to that check I haven't been able to trigger anything like the 
> behaviour you describe, do you think the doloop_valid_p checks are 
> robust enough?

The loops I was thinking of are provably not infinite though.  E.g.:

  for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
...

is known to terminate.  And doloop conversion is safe with the normal
count-down-by-1 approach, so I don't think current code would need
to reject it.  I.e. a conversion to:

  unsigned int i = UINT_MAX - 101;
  do
...
  while (--i != ~0U);

would be safe, but a conversion to:

  int i = UINT_MAX - 101;
  do
...
  while ((i -= step, i > 0));

wouldn't, because the loop body would only be executed once.

I'm only going off the name "infinite" though :)  It's possible that
it has more connotations than that.

Thanks,
Richard


Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-10-24 Thread Richard Sandiford
Sorry for the slow review.  I had a look at the arm bits too, to get
some context for the target-independent bits.

Stamatis Markianos-Wright via Gcc-patches  writes:
> [...]
> diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
> index 77e76336e94..74186930f0b 100644
> --- a/gcc/config/arm/arm-protos.h
> +++ b/gcc/config/arm/arm-protos.h
> @@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
>  extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
>  extern bool arm_q_bit_access (void);
>  extern bool arm_ge_bits_access (void);
> -extern bool arm_target_insn_ok_for_lob (rtx);
> -
> +extern bool arm_target_bb_ok_for_lob (basic_block);
> +extern rtx arm_attempt_dlstp_transform (rtx);
>  #ifdef RTX_CODE
>  enum reg_class
>  arm_mode_base_reg_class (machine_mode);
> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
> index 6e933c80183..39d97ba5e4d 100644
> --- a/gcc/config/arm/arm.cc
> +++ b/gcc/config/arm/arm.cc
> @@ -659,6 +659,12 @@ static const struct attribute_spec arm_attribute_table[]
> [...]
> +/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
> +   something only if the VPR reg is an input operand to the insn.  */
> +
> +static rtx
> +ALWAYS_INLINE

Probably best to leave out the ALWAYS_INLINE.  That's generally only
appropriate for things that need to be inlined for correctness.

> +arm_get_required_vpr_reg_param (rtx_insn *insn)
> +{
> +  return arm_get_required_vpr_reg (insn, 1);
> +}
> [...]
> +/* Recursively scan through the DF chain backwards within the basic block and
> +   determine if any of the USEs of the original insn (or the USEs of the 
> insns
> +   where thy were DEF-ed, etc., recursively) were affected by implicit VPT
> +   predication of an MVE_VPT_UNPREDICATED_INSN_P in a dlstp/letp loop.
> +   This function returns true if the insn is affected implicit predication
> +   and false otherwise.
> +   Having such implicit predication on an unpredicated insn wouldn't in 
> itself
> +   block tail predication, because the output of that insn might then be used
> +   in a correctly predicated store insn, where the disabled lanes will be
> +   ignored.  To verify this we later call:
> +   `arm_mve_check_df_chain_fwd_for_implic_predic_impact`, which will check 
> the
> +   DF chains forward to see if any implicitly-predicated operand gets used in
> +   an improper way.  */
> +
> +static bool
> +arm_mve_check_df_chain_back_for_implic_predic
> +  (hash_map, bool>* safe_insn_map, rtx_insn *insn,
> +   rtx vctp_vpr_generated)
> +{
> +  bool* temp = NULL;
> +  if ((temp = safe_insn_map->get (INSN_UID (insn
> +return *temp;
> +
> +  basic_block body = BLOCK_FOR_INSN (insn);
> +  /* The circumstances under which an instruction is affected by "implicit
> + predication" are as follows:
> +  * It is an UNPREDICATED_INSN_P:
> + * That loads/stores from/to memory.
> + * Where any one of its operands is an MVE vector from outside the
> +   loop body bb.
> + Or:
> +  * Any of it's operands, recursively backwards, are affected.  */
> +  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
> +  && (arm_is_mve_load_store_insn (insn)
> +   || (arm_is_mve_across_vector_insn (insn)
> +   && !arm_mve_is_allowed_unpredic_across_vector_insn (insn
> +{
> +  safe_insn_map->put (INSN_UID (insn), true);
> +  return true;
> +}
> +
> +  df_ref insn_uses = NULL;
> +  FOR_EACH_INSN_USE (insn_uses, insn)
> +  {
> +/* If the operand is in the input reg set to the the basic block,
> +   (i.e. it has come from outside the loop!), consider it unsafe if:
> +  * It's being used in an unpredicated insn.
> +  * It is a predicable MVE vector.  */
> +if (MVE_VPT_UNPREDICATED_INSN_P (insn)
> + && VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_uses)))
> + && REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
> +  {
> + safe_insn_map->put (INSN_UID (insn), true);
> + return true;
> +  }
> +/* Scan backwards from the current INSN through the instruction chain
> +   until the start of the basic block.  */
> +for (rtx_insn *prev_insn = PREV_INSN (insn);
> +  prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
> +  prev_insn = PREV_INSN (prev_insn))
> +  {
> + /* If a previous insn defines a register that INSN uses, then recurse
> +in order to check that insn's USEs.
> +If any of these insns return true as MVE_VPT_UNPREDICATED_INSN_Ps,
> +then the whole chain is affected by the change in behaviour from
> +being placed in dlstp/letp loop.  */
> + df_ref prev_insn_defs = NULL;
> + FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
> + {
> +   if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
> +   && !arm_mve_vec_insn_is_predicated_with_this_predicate
> +(insn, vctp_vpr_generated)
> +   && 

Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-10-23 Thread Andre Vieira (lists)
Ping for Jeff or another global maintainer to review the target agnostic 
bits of this, that's:

loop-doloop.cc
df-core.{c,h}

I do have a nitpick myself that I missed last time around:
  /* We expect the condition to be of the form (reg != 0)  */
  cond = XEXP (SET_SRC (cmp), 0);
- if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+ if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+ || XEXP (cond, 1) != const0_rtx)
return 0;
}
Could do with updating the comment to reflect allowing >= now. But happy 
for you to change this once approved by a maintainer.


Kind regards,
Andre

On 11/10/2023 12:34, Stamatis Markianos-Wright wrote:

Hi all,

On 28/09/2023 13:51, Andre Vieira (lists) wrote:

Hi,

On 14/09/2023 13:10, Kyrylo Tkachov via Gcc-patches wrote:

Hi Stam,





The arm parts look sensible but we'd need review for the df-core.h 
and df-core.cc changes.

Maybe Jeff can help or can recommend someone to take a look?


Just thought I'd do a follow-up "ping" on this :)



Thanks,
Kyrill



FWIW the changes LGTM, if we don't want these in df-core we can always 
implement the extra utility locally. It's really just a helper 
function to check if df_bb_regno_first_def_find and 
df_bb_regno_last_def_find yield the same result, meaning we only have 
a single definition.


Kind regards,
Andre


Thanks,

Stam



Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-10-11 Thread Stamatis Markianos-Wright

Hi all,

On 28/09/2023 13:51, Andre Vieira (lists) wrote:

Hi,

On 14/09/2023 13:10, Kyrylo Tkachov via Gcc-patches wrote:

Hi Stam,





The arm parts look sensible but we'd need review for the df-core.h 
and df-core.cc changes.

Maybe Jeff can help or can recommend someone to take a look?


Just thought I'd do a follow-up "ping" on this :)



Thanks,
Kyrill



FWIW the changes LGTM, if we don't want these in df-core we can always 
implement the extra utility locally. It's really just a helper 
function to check if df_bb_regno_first_def_find and 
df_bb_regno_last_def_find yield the same result, meaning we only have 
a single definition.


Kind regards,
Andre


Thanks,

Stam



Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-09-28 Thread Andre Vieira (lists)

Hi,

On 14/09/2023 13:10, Kyrylo Tkachov via Gcc-patches wrote:

Hi Stam,





The arm parts look sensible but we'd need review for the df-core.h and 
df-core.cc changes.
Maybe Jeff can help or can recommend someone to take a look?
Thanks,
Kyrill



FWIW the changes LGTM, if we don't want these in df-core we can always 
implement the extra utility locally. It's really just a helper function 
to check if df_bb_regno_first_def_find and df_bb_regno_last_def_find 
yield the same result, meaning we only have a single definition.


Kind regards,
Andre


RE: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-09-14 Thread Kyrylo Tkachov via Gcc-patches
Hi Stam,

> -Original Message-
> From: Stam Markianos-Wright 
> Sent: Wednesday, September 6, 2023 6:19 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov ; Richard Earnshaw
> 
> Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low
> Overhead Loops
> 
> Hi all,
> 
> This is the 2/2 patch that contains the functional changes needed
> for MVE Tail Predicated Low Overhead Loops.  See my previous email
> for a general introduction of MVE LOLs.
> 
> This support is added through the already existing loop-doloop
> mechanisms that are used for non-MVE dls/le looping.
> 
> Mid-end changes are:
> 
> 1) Relax the loop-doloop mechanism in the mid-end to allow for
>     decrement numbers other that -1 and for `count` to be an
>     rtx containing a simple REG (which in this case will contain
>     the number of elements to be processed), rather
>     than an expression for calculating the number of iterations.
> 2) Added a new df utility function: `df_bb_regno_only_def_find` that
>     will return the DEF of a REG if it is DEF-ed only once within the
>     basic block.
> 
> And many things in the backend to implement the above optimisation:
> 
> 3)  Implement the `arm_predict_doloop_p` target hook to instruct the
>      mid-end about Low Overhead Loops (MVE or not), as well as
>      `arm_loop_unroll_adjust` which will prevent unrolling of any loops
>      that are valid for becoming MVE Tail_Predicated Low Overhead Loops
>      (unrolling can transform a loop in ways that invalidate the dlstp/
>      letp tranformation logic and the benefit of the dlstp/letp loop
>      would be considerably higher than that of unrolling)
> 4)  Appropriate changes to the define_expand of doloop_end, new
>      patterns for dlstp and letp, new iterators,  unspecs, etc.
> 5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
>     * `arm_mve_dlstp_check_dec_counter`
>     * `arm_mve_dlstp_check_inc_counter`
>     * `arm_mve_check_reg_origin_is_num_elems`
>     * `arm_mve_check_df_chain_back_for_implic_predic`
>     * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
>     This all, in smoe way or another, are running checks on the loop
>     structure in order to determine if the loop is valid for dlstp/letp
>     transformation.
> 6) `arm_attempt_dlstp_transform`: (called from the define_expand of
>      doloop_end) this function re-checks for the loop's suitability for
>      dlstp/letp transformation and then implements it, if possible.
> 7) Various utility functions:
>     *`arm_mve_get_vctp_lanes` to map
>     from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
>     to check an insn to see if it requires the VPR or not.
>     * `arm_mve_get_loop_vctp`
>     * `arm_mve_get_vctp_lanes`
>     * `arm_emit_mve_unpredicated_insn_to_seq`
>     * `arm_get_required_vpr_reg`
>     * `arm_get_required_vpr_reg_param`
>     * `arm_get_required_vpr_reg_ret_val`
>     * `arm_mve_is_across_vector_insn`
>     * `arm_is_mve_load_store_insn`
>     * `arm_mve_vec_insn_is_predicated_with_this_predicate`
>     * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
> 
> No regressions on arm-none-eabi with various targets and on
> aarch64-none-elf. Thoughts on getting this into trunk?

The arm parts look sensible but we'd need review for the df-core.h and 
df-core.cc changes.
Maybe Jeff can help or can recommend someone to take a look?
Thanks,
Kyrill

> 
> Thank you,
> Stam Markianos-Wright
> 
> gcc/ChangeLog:
> 
>      * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
>      (arm_target_bb_ok_for_lob): ...this
>      (arm_attempt_dlstp_transform): New.
>      * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
>      (TARGET_PREDICT_DOLOOP_P): New.
>      (arm_block_set_vect):
>      (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
>      (arm_target_bb_ok_for_lob): New.
>      (arm_mve_get_vctp_lanes): New.
>      (arm_get_required_vpr_reg): New.
>      (arm_get_required_vpr_reg_param): New.
>      (arm_get_required_vpr_reg_ret_val): New.
>      (arm_mve_get_loop_vctp): New.
>      (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
>      (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
>      (arm_mve_check_df_chain_back_for_implic_predic): New.
>      (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
>      (arm_mve_check_reg_origin_is_num_elems): New.
>      (arm_mve_dlstp_check_inc_counter): New.
>      (arm_mve_dlstp_check_dec_counter): New.
>      (arm_mve_loop_valid_for_dlstp): New.
>      (arm_mve_is_across_vector_insn): New.
>      (arm_is_mve_load_store_insn): New.
>      (arm_predict_doloop_p): Ne