[gcc r15-1547] xstormy16: Fix xs_hi_nonmemory_operand

2024-06-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:5320bcbd342a985a6e1db60bff2918f73dcad1a0

commit r15-1547-g5320bcbd342a985a6e1db60bff2918f73dcad1a0
Author: Richard Sandiford 
Date:   Fri Jun 21 15:40:11 2024 +0100

xstormy16: Fix xs_hi_nonmemory_operand

All uses of xs_hi_nonmemory_operand allow constraint "i",
which means that they allow consts, symbol_refs and label_refs.
The definition of xs_hi_nonmemory_operand accounted for consts,
but not for symbol_refs and label_refs.
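
For illustration, the predicate/constraint pairing looks something
like this (a simplified, hypothetical pattern, not the exact
stormy16.md source):

  (define_insn "*example"
    [(set (match_operand:HI 0 "register_operand" "=r")
          (plus:HI (match_operand:HI 1 "register_operand" "0")
                   (match_operand:HI 2 "xs_hi_nonmemory_operand" "ri")))]
    ...)

Constraint "i" tells the register allocator that any immediate,
including a symbol_ref or label_ref, is acceptable, so the predicate
must accept the same set; otherwise the allocator can substitute an
operand that leaves the insn unrecognizable.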

gcc/
* config/stormy16/predicates.md (xs_hi_nonmemory_operand): Handle
symbol_ref and label_ref.

Diff:
---
 gcc/config/stormy16/predicates.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/stormy16/predicates.md 
b/gcc/config/stormy16/predicates.md
index 67c2ddc107c..085c9c5ed2d 100644
--- a/gcc/config/stormy16/predicates.md
+++ b/gcc/config/stormy16/predicates.md
@@ -152,7 +152,7 @@
 })
 
 (define_predicate "xs_hi_nonmemory_operand"
-  (match_code "const_int,reg,subreg,const")
+  (match_code "const_int,reg,subreg,const,symbol_ref,label_ref")
 {
   return nonmemory_operand (op, mode);
 })


[gcc r15-1546] iq2000: Fix test and branch instructions

2024-06-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:8f254cd4e40b692e5f01a3b40f2b5b60c8528a1e

commit r15-1546-g8f254cd4e40b692e5f01a3b40f2b5b60c8528a1e
Author: Richard Sandiford 
Date:   Fri Jun 21 15:40:10 2024 +0100

iq2000: Fix test and branch instructions

The iq2000 test and branch instructions had patterns like:

  [(set (pc)
(if_then_else
 (eq (and:SI (match_operand:SI 0 "register_operand" "r")
 (match_operand:SI 1 "power_of_2_operand" "I"))
  (const_int 0))
 (match_operand 2 "pc_or_label_operand" "")
 (match_operand 3 "pc_or_label_operand" "")))]

power_of_2_operand allows any 32-bit power of 2, whereas "I" only
accepts 16-bit signed constants.  This meant that any power of 2
greater than 32768 would cause an "insn does not satisfy its
constraints" ICE.

Also, the %p operand modifier barfed on 1<<31, which is sign-
rather than zero-extended to 64 bits.  The code is inherently
limited to 32-bit operands -- power_of_2_operand contains a test
involving "unsigned" -- so this patch just ands with 0x.

gcc/
* config/iq2000/iq2000.cc (iq2000_print_operand): Make %p handle 
1<<31.
* config/iq2000/iq2000.md: Remove "I" constraints on
power_of_2_operands.

Diff:
---
 gcc/config/iq2000/iq2000.cc | 2 +-
 gcc/config/iq2000/iq2000.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/iq2000/iq2000.cc b/gcc/config/iq2000/iq2000.cc
index f9f8c417841..136675d0fbb 100644
--- a/gcc/config/iq2000/iq2000.cc
+++ b/gcc/config/iq2000/iq2000.cc
@@ -3127,7 +3127,7 @@ iq2000_print_operand (FILE *file, rtx op, int letter)
 {
   int value;
   if (code != CONST_INT
- || (value = exact_log2 (INTVAL (op))) < 0)
+ || (value = exact_log2 (UINTVAL (op) & 0xffffffff)) < 0)
output_operand_lossage ("invalid %%p value");
   else
fprintf (file, "%d", value);
diff --git a/gcc/config/iq2000/iq2000.md b/gcc/config/iq2000/iq2000.md
index 8617efac3c6..e62c250ce8c 100644
--- a/gcc/config/iq2000/iq2000.md
+++ b/gcc/config/iq2000/iq2000.md
@@ -1175,7 +1175,7 @@
   [(set (pc)
(if_then_else
 (eq (and:SI (match_operand:SI 0 "register_operand" "r")
-(match_operand:SI 1 "power_of_2_operand" "I"))
+(match_operand:SI 1 "power_of_2_operand"))
  (const_int 0))
 (match_operand 2 "pc_or_label_operand" "")
 (match_operand 3 "pc_or_label_operand" "")))]
@@ -1189,7 +1189,7 @@
   [(set (pc)
(if_then_else
 (ne (and:SI (match_operand:SI 0 "register_operand" "r")
-(match_operand:SI 1 "power_of_2_operand" "I"))
+(match_operand:SI 1 "power_of_2_operand"))
 (const_int 0))
 (match_operand 2 "pc_or_label_operand" "")
 (match_operand 3 "pc_or_label_operand" "")))]


[gcc r15-1545] rtl-ssa: Don't cost no-op moves

2024-06-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:4a43a06c7b2bcc3402ac69d6e5ce7b8008acc69a

commit r15-1545-g4a43a06c7b2bcc3402ac69d6e5ce7b8008acc69a
Author: Richard Sandiford 
Date:   Fri Jun 21 15:40:10 2024 +0100

rtl-ssa: Don't cost no-op moves

No-op moves are given the code NOOP_MOVE_INSN_CODE if we plan
to delete them later.  Such insns shouldn't be costed, partly
because they're going to disappear, and partly because targets
won't recognise the insn code.
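
A minimal sketch of the caller-side consequence (illustrative code,
not part of the patch): a cost of 0 is now ambiguous, so callers
that care must check the insn code themselves.

  int cost = insn->cost ();
  if (cost == 0 && INSN_CODE (insn->rtl ()) == NOOP_MOVE_INSN_CODE)
    /* Zero because the insn is a to-be-deleted no-op move.  */;
  else if (cost == 0)
    /* insn_cost genuinely returned "don't know".  */;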

gcc/
* rtl-ssa/changes.cc (rtl_ssa::changes_are_worthwhile): Don't
cost no-op moves.
* rtl-ssa/insns.cc (insn_info::calculate_cost): Likewise.

Diff:
---
 gcc/rtl-ssa/changes.cc | 6 +-
 gcc/rtl-ssa/insns.cc   | 7 ++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
index 11639e81bb7..3101f2dc4fc 100644
--- a/gcc/rtl-ssa/changes.cc
+++ b/gcc/rtl-ssa/changes.cc
@@ -177,13 +177,17 @@ rtl_ssa::changes_are_worthwhile (array_slice<insn_change *> changes,
   auto entry_count = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count;
   for (insn_change *change : changes)
 {
+  // Count zero for the old cost if the old instruction was a no-op
+  // move or had an unknown cost.  This should reduce the chances of
+  // making an unprofitable change.
   old_cost += change->old_cost ();
   basic_block cfg_bb = change->bb ()->cfg_bb ();
   bool for_speed = optimize_bb_for_speed_p (cfg_bb);
   if (for_speed)
weighted_old_cost += (cfg_bb->count.to_sreal_scale (entry_count)
  * change->old_cost ());
-  if (!change->is_deletion ())
+  if (!change->is_deletion ()
+ && INSN_CODE (change->rtl ()) != NOOP_MOVE_INSN_CODE)
{
  change->new_cost = insn_cost (change->rtl (), for_speed);
  new_cost += change->new_cost;
diff --git a/gcc/rtl-ssa/insns.cc b/gcc/rtl-ssa/insns.cc
index 0171d93c357..68365e323ec 100644
--- a/gcc/rtl-ssa/insns.cc
+++ b/gcc/rtl-ssa/insns.cc
@@ -48,7 +48,12 @@ insn_info::calculate_cost () const
 {
   basic_block cfg_bb = BLOCK_FOR_INSN (m_rtl);
   temporarily_undo_changes (0);
-  m_cost_or_uid = insn_cost (m_rtl, optimize_bb_for_speed_p (cfg_bb));
+  if (INSN_CODE (m_rtl) == NOOP_MOVE_INSN_CODE)
+// insn_cost also uses 0 to mean "don't know".  Callers that
+// want to distinguish the cases will need to check INSN_CODE.
+m_cost_or_uid = 0;
+  else
+m_cost_or_uid = insn_cost (m_rtl, optimize_bb_for_speed_p (cfg_bb));
   redo_changes (0);
 }


[gcc r15-1531] sh: Make *minus_plus_one work after RA

2024-06-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:f49267e1636872128249431e9e5d20c0908b7e8e

commit r15-1531-gf49267e1636872128249431e9e5d20c0908b7e8e
Author: Richard Sandiford 
Date:   Fri Jun 21 09:52:42 2024 +0100

sh: Make *minus_plus_one work after RA

*minus_plus_one had no constraints, which meant that it could be
matched after RA with operands 0, 1 and 2 all being different.
The associated split instead requires operand 0 to be tied to
operand 1.
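
In other words, the split assumes a two-address sequence along these
lines (illustrative SH assembly, not the exact compiler output):

  ! r0 = r1 - r2 + 1, with r0 tied to r1 by the new "0" constraint
  sub   r2,r0    ! r0 -= r2
  add   #1,r0    ! r0 += 1

Without the tie, the register allocator was free to pick three
distinct registers, which this sequence cannot implement.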

gcc/
* config/sh/sh.md (*minus_plus_one): Add constraints.

Diff:
---
 gcc/config/sh/sh.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/sh/sh.md b/gcc/config/sh/sh.md
index 92a1efeb811..9491b49e55b 100644
--- a/gcc/config/sh/sh.md
+++ b/gcc/config/sh/sh.md
@@ -1642,9 +1642,9 @@
 ;; matched.  Split this up into a simple sub add sequence, as this will save
 ;; us one sett insn.
 (define_insn_and_split "*minus_plus_one"
-  [(set (match_operand:SI 0 "arith_reg_dest" "")
-   (plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "")
-  (match_operand:SI 2 "arith_reg_operand" ""))
+  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
+   (plus:SI (minus:SI (match_operand:SI 1 "arith_reg_operand" "0")
+  (match_operand:SI 2 "arith_reg_operand" "r"))
 (const_int 1)))]
   "TARGET_SH1"
   "#"


[gcc r15-1400] Make more use of force_lowpart_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:a573ed4367ee685fb1bc50b79239b8b4b69872ee

commit r15-1400-ga573ed4367ee685fb1bc50b79239b8b4b69872ee
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:32 2024 +0100

Make more use of force_lowpart_subreg

This patch makes target-independent code use force_lowpart_subreg
instead of simplify_gen_subreg and lowpart_subreg in some places.
The criteria were:

(1) The code is obviously specific to expand (where new pseudos
can be created), or at least would be invalid to call when
!can_create_pseudo_p () and temporaries are needed.

(2) The value is obviously an rvalue rather than an lvalue.

Doing this should reduce the likelihood of bugs like PR115464
occurring in other situations.
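
The substitution itself is mechanical; a hedged sketch of a typical
site:

  /* Before: returns null if the subreg cannot be formed directly.  */
  rtx lo = lowpart_subreg (imode, val, fmode);

  /* After: may first copy VAL into a fresh pseudo, which is why the
     calls are restricted to expand-time code.  */
  rtx lo = force_lowpart_subreg (imode, val, fmode);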

gcc/
* builtins.cc (expand_builtin_issignaling): Use force_lowpart_subreg
instead of simplify_gen_subreg and lowpart_subreg.
* expr.cc (convert_mode_scalar, expand_expr_real_2): Likewise.
* optabs.cc (expand_doubleword_mod): Likewise.

Diff:
---
 gcc/builtins.cc |  7 ++-
 gcc/expr.cc | 17 +
 gcc/optabs.cc   |  2 +-
 3 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 5b5307c67b8c..bde517b639e8 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -2940,8 +2940,7 @@ expand_builtin_issignaling (tree exp, rtx target)
  {
hi = simplify_gen_subreg (imode, temp, fmode,
  subreg_highpart_offset (imode, fmode));
-   lo = simplify_gen_subreg (imode, temp, fmode,
- subreg_lowpart_offset (imode, fmode));
+   lo = force_lowpart_subreg (imode, temp, fmode);
if (!hi || !lo)
  {
scalar_int_mode imode2;
@@ -2951,9 +2950,7 @@ expand_builtin_issignaling (tree exp, rtx target)
hi = simplify_gen_subreg (imode, temp2, imode2,
  subreg_highpart_offset (imode,
  imode2));
-   lo = simplify_gen_subreg (imode, temp2, imode2,
- subreg_lowpart_offset (imode,
-imode2));
+   lo = force_lowpart_subreg (imode, temp2, imode2);
  }
  }
if (!hi || !lo)
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 31a7346e33f0..ffbac5136923 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -423,7 +423,8 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
0).exists (&toi_mode))
{
  start_sequence ();
- rtx fromi = lowpart_subreg (fromi_mode, from, from_mode);
+ rtx fromi = force_lowpart_subreg (fromi_mode, from,
+   from_mode);
  rtx tof = NULL_RTX;
  if (fromi)
{
@@ -443,7 +444,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
  NULL_RTX, 1);
  if (toi)
{
- tof = lowpart_subreg (to_mode, toi, toi_mode);
+ tof = force_lowpart_subreg (to_mode, toi, toi_mode);
  if (tof)
emit_move_insn (to, tof);
}
@@ -475,7 +476,7 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
0).exists (&toi_mode))
{
  start_sequence ();
- rtx fromi = lowpart_subreg (fromi_mode, from, from_mode);
+ rtx fromi = force_lowpart_subreg (fromi_mode, from, from_mode);
  rtx tof = NULL_RTX;
  do
{
@@ -510,11 +511,11 @@ convert_mode_scalar (rtx to, rtx from, int unsignedp)
  temp4, shift, NULL_RTX, 1);
  if (!temp5)
break;
- rtx temp6 = lowpart_subreg (toi_mode, temp5, fromi_mode);
+ rtx temp6 = force_lowpart_subreg (toi_mode, temp5,
+   fromi_mode);
  if (!temp6)
break;
- tof = lowpart_subreg (to_mode, force_reg (toi_mode, temp6),
-   toi_mode);
+ tof = force_lowpart_subreg (to_mode, temp6, toi_mode);
  if (tof)
emit_move_insn (to, tof);
}
@@ -9784,9 +9785,9 @@ expand_expr_real_2 (const_sepops ops, rtx target, 
machine_mode tmode,
inner_mode = TYPE_MODE (inner_type);
 
  if (modifier == EXPAND_INITIALIZER)
-   op0 = lowpart_subreg (mode, op0, inner_mode);
+   op0 

[gcc r15-1402] aarch64: Add some uses of force_highpart_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:c67a9a9c8e934234b640a613b0ae3c15e7fa9733

commit r15-1402-gc67a9a9c8e934234b640a613b0ae3c15e7fa9733
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:33 2024 +0100

aarch64: Add some uses of force_highpart_subreg

This patch adds uses of force_highpart_subreg to places that
already use force_lowpart_subreg.

gcc/
* config/aarch64/aarch64.cc (aarch64_addti_scratch_regs): Use
force_highpart_subreg instead of gen_highpart and 
simplify_gen_subreg.
(aarch64_subvti_scratch_regs): Likewise.

Diff:
---
 gcc/config/aarch64/aarch64.cc | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index c952a7cdefec..026f8627a893 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26873,19 +26873,12 @@ aarch64_addti_scratch_regs (rtx op1, rtx op2, rtx 
*low_dest,
   *low_in1 = force_lowpart_subreg (DImode, op1, TImode);
   *low_in2 = force_lowpart_subreg (DImode, op2, TImode);
   *high_dest = gen_reg_rtx (DImode);
-  *high_in1 = gen_highpart (DImode, op1);
-  *high_in2 = simplify_gen_subreg (DImode, op2, TImode,
-  subreg_highpart_offset (DImode, TImode));
+  *high_in1 = force_highpart_subreg (DImode, op1, TImode);
+  *high_in2 = force_highpart_subreg (DImode, op2, TImode);
 }
 
 /* Generate DImode scratch registers for 128-bit (TImode) subtraction.
 
-   This function differs from 'arch64_addti_scratch_regs' in that
-   OP1 can be an immediate constant (zero). We must call
-   subreg_highpart_offset with DImode and TImode arguments, otherwise
-   VOIDmode will be used for the const_int which generates an internal
-   error from subreg_size_highpart_offset which does not expect a size of zero.
-
OP1 represents the TImode destination operand 1
OP2 represents the TImode destination operand 2
LOW_DEST represents the low half (DImode) of TImode operand 0
@@ -26907,10 +26900,8 @@ aarch64_subvti_scratch_regs (rtx op1, rtx op2, rtx 
*low_dest,
   *low_in2 = force_lowpart_subreg (DImode, op2, TImode);
   *high_dest = gen_reg_rtx (DImode);
 
-  *high_in1 = simplify_gen_subreg (DImode, op1, TImode,
-  subreg_highpart_offset (DImode, TImode));
-  *high_in2 = simplify_gen_subreg (DImode, op2, TImode,
-  subreg_highpart_offset (DImode, TImode));
+  *high_in1 = force_highpart_subreg (DImode, op1, TImode);
+  *high_in2 = force_highpart_subreg (DImode, op2, TImode);
 }
 
 /* Generate RTL for 128-bit (TImode) subtraction with overflow.


[gcc r15-1401] Add force_highpart_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:e0700fbe35286d31fe64782b255c8d2caec673dc

commit r15-1401-ge0700fbe35286d31fe64782b255c8d2caec673dc
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:32 2024 +0100

Add force_highpart_subreg

This patch adds a force_highpart_subreg to go along with the
recently added force_lowpart_subreg.
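
Together the two helpers split a value into its halves; a usage
sketch (DImode halves of a TImode rvalue, matching the aarch64 uses
in the next patch):

  rtx lo = force_lowpart_subreg  (DImode, op, TImode);
  rtx hi = force_highpart_subreg (DImode, op, TImode);
  /* Either call may force OP into a new register; both return
     null on failure.  */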

gcc/
* explow.h (force_highpart_subreg): Declare.
* explow.cc (force_highpart_subreg): New function.
* builtins.cc (expand_builtin_issignaling): Use it.
* expmed.cc (emit_store_flag_1): Likewise.

Diff:
---
 gcc/builtins.cc | 15 ---
 gcc/explow.cc   | 14 ++
 gcc/explow.h|  1 +
 gcc/expmed.cc   |  4 +---
 4 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index bde517b639e8..d467d1697b45 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -2835,9 +2835,7 @@ expand_builtin_issignaling (tree exp, rtx target)
 it is, working on the DImode high part is usually better.  */
  if (!MEM_P (temp))
{
- if (rtx t = simplify_gen_subreg (imode, temp, fmode,
-  subreg_highpart_offset (imode,
-  fmode)))
+ if (rtx t = force_highpart_subreg (imode, temp, fmode))
hi = t;
  else
{
@@ -2845,9 +2843,7 @@ expand_builtin_issignaling (tree exp, rtx target)
  if (int_mode_for_mode (fmode).exists (&imode2))
{
  rtx temp2 = gen_lowpart (imode2, temp);
- poly_uint64 off = subreg_highpart_offset (imode, imode2);
- if (rtx t = simplify_gen_subreg (imode, temp2,
-  imode2, off))
+ if (rtx t = force_highpart_subreg (imode, temp2, imode2))
hi = t;
}
}
@@ -2938,8 +2934,7 @@ expand_builtin_issignaling (tree exp, rtx target)
   it is, working on DImode parts is usually better.  */
if (!MEM_P (temp))
  {
-   hi = simplify_gen_subreg (imode, temp, fmode,
- subreg_highpart_offset (imode, fmode));
+   hi = force_highpart_subreg (imode, temp, fmode);
lo = force_lowpart_subreg (imode, temp, fmode);
if (!hi || !lo)
  {
@@ -2947,9 +2942,7 @@ expand_builtin_issignaling (tree exp, rtx target)
if (int_mode_for_mode (fmode).exists (&imode2))
  {
rtx temp2 = gen_lowpart (imode2, temp);
-   hi = simplify_gen_subreg (imode, temp2, imode2,
- subreg_highpart_offset (imode,
- imode2));
+   hi = force_highpart_subreg (imode, temp2, imode2);
lo = force_lowpart_subreg (imode, temp2, imode2);
  }
  }
diff --git a/gcc/explow.cc b/gcc/explow.cc
index 2a91cf76ea62..b4a0df89bc36 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -778,6 +778,20 @@ force_lowpart_subreg (machine_mode outermode, rtx op,
   return force_subreg (outermode, op, innermode, byte);
 }
 
+/* Try to return an rvalue expression for the OUTERMODE highpart of OP,
+   which has mode INNERMODE.  Allow OP to be forced into a new register
+   if necessary.
+
+   Return null on failure.  */
+
+rtx
+force_highpart_subreg (machine_mode outermode, rtx op,
+  machine_mode innermode)
+{
+  auto byte = subreg_highpart_offset (outermode, innermode);
+  return force_subreg (outermode, op, innermode, byte);
+}
+
 /* If X is a memory ref, copy its contents to a new temp reg and return
that reg.  Otherwise, return X.  */
 
diff --git a/gcc/explow.h b/gcc/explow.h
index dd654649b068..de89e9e2933e 100644
--- a/gcc/explow.h
+++ b/gcc/explow.h
@@ -44,6 +44,7 @@ extern rtx force_reg (machine_mode, rtx);
 
 extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64);
 extern rtx force_lowpart_subreg (machine_mode, rtx, machine_mode);
+extern rtx force_highpart_subreg (machine_mode, rtx, machine_mode);
 
 /* Return given rtx, copied into a new temp reg if it was in memory.  */
 extern rtx force_not_mem (rtx);
diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 1f68e7be721d..3b9475f5aa0b 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -5784,9 +5784,7 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx 
op0, rtx op1,
  rtx op0h;
 
  /* If testing the sign bit, can just test on high word.  */
- op0h = simplify_gen_subreg (word_mode, op0, int_mode,
- subreg_highpart_offset (word_mode,
- int_mode));
+ op0h = force_highpart_subreg 

[gcc r15-1399] aarch64: Add some uses of force_lowpart_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:6bd4fbae45d11795a9a6f54b866308d4d7134def

commit r15-1399-g6bd4fbae45d11795a9a6f54b866308d4d7134def
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:31 2024 +0100

aarch64: Add some uses of force_lowpart_subreg

This patch makes more use of force_lowpart_subreg, similarly
to the recent patch for force_subreg.  The criteria were:

(1) The code is obviously specific to expand (where new pseudos
can be created).

(2) The value is obviously an rvalue rather than an lvalue.

gcc/
PR target/115464
* config/aarch64/aarch64-builtins.cc (aarch64_expand_fcmla_builtin)
(aarch64_expand_rwsr_builtin): Use force_lowpart_subreg instead of
simplify_gen_subreg and lowpart_subreg.
* config/aarch64/aarch64-sve-builtins-base.cc
(svset_neonq_impl::expand): Likewise.
* config/aarch64/aarch64-sve-builtins-sme.cc
(add_load_store_slice_operand): Likewise.
* config/aarch64/aarch64.cc (aarch64_sve_reinterpret): Likewise.
(aarch64_addti_scratch_regs, aarch64_subvti_scratch_regs): Likewise.

gcc/testsuite/
PR target/115464
* gcc.target/aarch64/sve/acle/general/pr115464_2.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-builtins.cc | 11 +--
 gcc/config/aarch64/aarch64-sve-builtins-base.cc|  2 +-
 gcc/config/aarch64/aarch64-sve-builtins-sme.cc |  2 +-
 gcc/config/aarch64/aarch64.cc  | 14 +-
 .../gcc.target/aarch64/sve/acle/general/pr115464_2.c   | 11 +++
 5 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
b/gcc/config/aarch64/aarch64-builtins.cc
index 7d827cbc2ac0..30669f8aa182 100644
--- a/gcc/config/aarch64/aarch64-builtins.cc
+++ b/gcc/config/aarch64/aarch64-builtins.cc
@@ -2579,8 +2579,7 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int 
fcode)
   int lane = INTVAL (lane_idx);
 
   if (lane < nunits / 4)
-op2 = simplify_gen_subreg (d->mode, op2, quadmode,
-  subreg_lowpart_offset (d->mode, quadmode));
+op2 = force_lowpart_subreg (d->mode, op2, quadmode);
   else
 {
   /* Select the upper 64 bits, either a V2SF or V4HF, this however
@@ -2590,8 +2589,7 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int 
fcode)
 gen_highpart_mode generates code that isn't optimal.  */
   rtx temp1 = gen_reg_rtx (d->mode);
   rtx temp2 = gen_reg_rtx (DImode);
-  temp1 = simplify_gen_subreg (d->mode, op2, quadmode,
-  subreg_lowpart_offset (d->mode, quadmode));
+  temp1 = force_lowpart_subreg (d->mode, op2, quadmode);
   temp1 = force_subreg (V2DImode, temp1, d->mode, 0);
   if (BYTES_BIG_ENDIAN)
emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const0_rtx));
@@ -2836,7 +2834,7 @@ aarch64_expand_rwsr_builtin (tree exp, rtx target, int 
fcode)
case AARCH64_WSR64:
case AARCH64_WSRF64:
case AARCH64_WSR128:
- subreg = lowpart_subreg (sysreg_mode, input_val, mode);
+ subreg = force_lowpart_subreg (sysreg_mode, input_val, mode);
  break;
case AARCH64_WSRF:
  subreg = gen_lowpart_SUBREG (SImode, input_val);
@@ -2871,7 +2869,8 @@ aarch64_expand_rwsr_builtin (tree exp, rtx target, int 
fcode)
 case AARCH64_RSR64:
 case AARCH64_RSRF64:
 case AARCH64_RSR128:
-  return lowpart_subreg (TYPE_MODE (TREE_TYPE (exp)), target, sysreg_mode);
+  return force_lowpart_subreg (TYPE_MODE (TREE_TYPE (exp)),
+  target, sysreg_mode);
 case AARCH64_RSRF:
   subreg = gen_lowpart_SUBREG (SImode, target);
   return gen_lowpart_SUBREG (SFmode, subreg);
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 999320371247..aa26370d397f 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1183,7 +1183,7 @@ public:
 if (BYTES_BIG_ENDIAN)
   return e.use_exact_insn (code_for_aarch64_sve_set_neonq (mode));
 insn_code icode = code_for_vcond_mask (mode, mode);
-e.args[1] = lowpart_subreg (mode, e.args[1], GET_MODE (e.args[1]));
+e.args[1] = force_lowpart_subreg (mode, e.args[1], GET_MODE (e.args[1]));
 e.add_output_operand (icode);
 e.add_input_operand (icode, e.args[1]);
 e.add_input_operand (icode, e.args[0]);
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sme.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-sme.cc
index f4c91bcbb95d..b66b35ae60b7 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-sme.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-sme.cc
@@ -112,7 +112,7 @@ add_load_store_slice_operand (function_expander &e,
insn_code icode,
   rtx base = e.args[argno];
   if 

[gcc r15-1398] Add force_lowpart_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:5f40d1c0cc6ce91ef28d326b8707b3f05e6f239c

commit r15-1398-g5f40d1c0cc6ce91ef28d326b8707b3f05e6f239c
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:31 2024 +0100

Add force_lowpart_subreg

optabs had a local function called lowpart_subreg_maybe_copy
that is very similar to the lowpart version of force_subreg.
This patch adds a force_lowpart_subreg wrapper around
force_subreg and uses it in optabs.cc.

The only difference between the old and new functions is that
the old one asserted success while the new one doesn't.
It's common not to assert elsewhere when taking subregs;
normally a null result is enough.

Later patches will make more use of the new function.
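
A sketch of the difference at a call site (illustrative only):

  /* Old: asserted that the subreg could always be formed.  */
  target = lowpart_subreg_maybe_copy (mode, temp, imode);

  /* New: returns null on failure, like other subreg helpers,
     so callers can assert or bail out as appropriate.  */
  target = force_lowpart_subreg (mode, temp, imode);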

gcc/
* explow.h (force_lowpart_subreg): Declare.
* explow.cc (force_lowpart_subreg): New function.
* optabs.cc (lowpart_subreg_maybe_copy): Delete.
(expand_absneg_bit): Use force_lowpart_subreg instead of
lowpart_subreg_maybe_copy.
(expand_copysign_bit): Likewise.

Diff:
---
 gcc/explow.cc | 14 ++
 gcc/explow.h  |  1 +
 gcc/optabs.cc | 24 ++--
 3 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/gcc/explow.cc b/gcc/explow.cc
index bd93c8780649..2a91cf76ea62 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -764,6 +764,20 @@ force_subreg (machine_mode outermode, rtx op,
   return res;
 }
 
+/* Try to return an rvalue expression for the OUTERMODE lowpart of OP,
+   which has mode INNERMODE.  Allow OP to be forced into a new register
+   if necessary.
+
+   Return null on failure.  */
+
+rtx
+force_lowpart_subreg (machine_mode outermode, rtx op,
+ machine_mode innermode)
+{
+  auto byte = subreg_lowpart_offset (outermode, innermode);
+  return force_subreg (outermode, op, innermode, byte);
+}
+
 /* If X is a memory ref, copy its contents to a new temp reg and return
that reg.  Otherwise, return X.  */
 
diff --git a/gcc/explow.h b/gcc/explow.h
index cbd1fcb7eb34..dd654649b068 100644
--- a/gcc/explow.h
+++ b/gcc/explow.h
@@ -43,6 +43,7 @@ extern rtx copy_to_suggested_reg (rtx, rtx, machine_mode);
 extern rtx force_reg (machine_mode, rtx);
 
 extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64);
+extern rtx force_lowpart_subreg (machine_mode, rtx, machine_mode);
 
 /* Return given rtx, copied into a new temp reg if it was in memory.  */
 extern rtx force_not_mem (rtx);
diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index c54d275b8b7a..d569742beea9 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -3096,26 +3096,6 @@ expand_ffs (scalar_int_mode mode, rtx op0, rtx target)
   return 0;
 }
 
-/* Extract the OMODE lowpart from VAL, which has IMODE.  Under certain
-   conditions, VAL may already be a SUBREG against which we cannot generate
-   a further SUBREG.  In this case, we expect forcing the value into a
-   register will work around the situation.  */
-
-static rtx
-lowpart_subreg_maybe_copy (machine_mode omode, rtx val,
-  machine_mode imode)
-{
-  rtx ret;
-  ret = lowpart_subreg (omode, val, imode);
-  if (ret == NULL)
-{
-  val = force_reg (imode, val);
-  ret = lowpart_subreg (omode, val, imode);
-  gcc_assert (ret != NULL);
-}
-  return ret;
-}
-
 /* Expand a floating point absolute value or negation operation via a
logical operation on the sign bit.  */
 
@@ -3204,7 +3184,7 @@ expand_absneg_bit (enum rtx_code code, scalar_float_mode 
mode,
   gen_lowpart (imode, op0),
   immed_wide_int_const (mask, imode),
   gen_lowpart (imode, target), 1, OPTAB_LIB_WIDEN);
-  target = lowpart_subreg_maybe_copy (mode, temp, imode);
+  target = force_lowpart_subreg (mode, temp, imode);
 
   set_dst_reg_note (get_last_insn (), REG_EQUAL,
gen_rtx_fmt_e (code, mode, copy_rtx (op0)),
@@ -4043,7 +4023,7 @@ expand_copysign_bit (scalar_float_mode mode, rtx op0, rtx 
op1, rtx target,
 
   temp = expand_binop (imode, ior_optab, op0, op1,
   gen_lowpart (imode, target), 1, OPTAB_LIB_WIDEN);
-  target = lowpart_subreg_maybe_copy (mode, temp, imode);
+  target = force_lowpart_subreg (mode, temp, imode);
 }
 
   return target;


[gcc r15-1397] Make more use of force_subreg

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:d4047da6a070175aae7121c739d1cad6b08ff4b2

commit r15-1397-gd4047da6a070175aae7121c739d1cad6b08ff4b2
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:30 2024 +0100

Make more use of force_subreg

This patch makes target-independent code use force_subreg instead
of simplify_gen_subreg in some places.  The criteria were:

(1) The code is obviously specific to expand (where new pseudos
can be created), or at least would be invalid to call when
!can_create_pseudo_p () and temporaries are needed.

(2) The value is obviously an rvalue rather than an lvalue.

(3) The offset wasn't a simple lowpart or highpart calculation;
a later patch will deal with those.

Doing this should reduce the likelihood of bugs like PR115464
occurring in other situations.

gcc/
* expmed.cc (store_bit_field_using_insv): Use force_subreg
instead of simplify_gen_subreg.
(store_bit_field_1): Likewise.
(extract_bit_field_as_subreg): Likewise.
(extract_integral_bit_field): Likewise.
(emit_store_flag_1): Likewise.
* expr.cc (convert_move): Likewise.
(convert_modes): Likewise.
(emit_group_load_1): Likewise.
(emit_group_store): Likewise.
(expand_assignment): Likewise.

Diff:
---
 gcc/expmed.cc | 22 --
 gcc/expr.cc   | 27 ---
 2 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 9ba01695f538..1f68e7be721d 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -695,13 +695,7 @@ store_bit_field_using_insv (const extraction_insn *insv, 
rtx op0,
 if we must narrow it, be sure we do it correctly.  */
 
  if (GET_MODE_SIZE (value_mode) < GET_MODE_SIZE (op_mode))
-   {
- tmp = simplify_subreg (op_mode, value1, value_mode, 0);
- if (! tmp)
-   tmp = simplify_gen_subreg (op_mode,
-  force_reg (value_mode, value1),
-  value_mode, 0);
-   }
+   tmp = force_subreg (op_mode, value1, value_mode, 0);
  else
{
  if (targetm.mode_rep_extended (op_mode, value_mode) != UNKNOWN)
@@ -806,7 +800,7 @@ store_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, 
poly_uint64 bitnum,
   if (known_eq (bitnum, 0U)
  && known_eq (bitsize, GET_MODE_BITSIZE (GET_MODE (op0))))
{
- sub = simplify_gen_subreg (GET_MODE (op0), value, fieldmode, 0);
+ sub = force_subreg (GET_MODE (op0), value, fieldmode, 0);
  if (sub)
{
  if (reverse)
@@ -1633,7 +1627,7 @@ extract_bit_field_as_subreg (machine_mode mode, rtx op0,
   && known_eq (bitsize, GET_MODE_BITSIZE (mode))
   && lowpart_bit_field_p (bitnum, bitsize, op0_mode)
   && TRULY_NOOP_TRUNCATION_MODES_P (mode, op0_mode))
-return simplify_gen_subreg (mode, op0, op0_mode, bytenum);
+return force_subreg (mode, op0, op0_mode, bytenum);
   return NULL_RTX;
 }
 
@@ -2000,11 +1994,11 @@ extract_integral_bit_field (rtx op0, 
opt_scalar_int_mode op0_mode,
  return convert_extracted_bit_field (target, mode, tmode, unsignedp);
}
   /* If OP0 is a hard register, copy it to a pseudo before calling
-simplify_gen_subreg.  */
+force_subreg.  */
   if (REG_P (op0) && HARD_REGISTER_P (op0))
op0 = copy_to_reg (op0);
-  op0 = simplify_gen_subreg (word_mode, op0, op0_mode.require (),
-bitnum / BITS_PER_WORD * UNITS_PER_WORD);
+  op0 = force_subreg (word_mode, op0, op0_mode.require (),
+ bitnum / BITS_PER_WORD * UNITS_PER_WORD);
   op0_mode = word_mode;
   bitnum %= BITS_PER_WORD;
 }
@@ -5774,8 +5768,8 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx 
op0, rtx op1,
 
  /* Do a logical OR or AND of the two words and compare the
 result.  */
- op00 = simplify_gen_subreg (word_mode, op0, int_mode, 0);
- op01 = simplify_gen_subreg (word_mode, op0, int_mode, UNITS_PER_WORD);
+ op00 = force_subreg (word_mode, op0, int_mode, 0);
+ op01 = force_subreg (word_mode, op0, int_mode, UNITS_PER_WORD);
  tem = expand_binop (word_mode,
  op1 == const0_rtx ? ior_optab : and_optab,
  op00, op01, NULL_RTX, unsignedp,
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 9cecc1758f5c..31a7346e33f0 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -301,7 +301,7 @@ convert_move (rtx to, rtx from, int unsignedp)
GET_MODE_BITSIZE (to_mode)));
 
   if (VECTOR_MODE_P (to_mode))
-   from = simplify_gen_subreg (to_mode, from, GET_MODE (from), 0);
+   from = force_subreg (to_mode, from, GET_MODE (from), 0);
   else
   

[gcc r15-1396] aarch64: Use force_subreg in more places

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:1474a8eead4ab390e59ee014befa8c40346679f4

commit r15-1396-g1474a8eead4ab390e59ee014befa8c40346679f4
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:30 2024 +0100

aarch64: Use force_subreg in more places

This patch makes the aarch64 code use force_subreg instead of
simplify_gen_subreg in more places.  The criteria were:

(1) The code is obviously specific to expand (where new pseudos
can be created).

(2) The value is obviously an rvalue rather than an lvalue.

(3) The offset wasn't a simple lowpart or highpart calculation;
a later patch will deal with those.

gcc/
* config/aarch64/aarch64-builtins.cc (aarch64_expand_fcmla_builtin):
Use force_subreg instead of simplify_gen_subreg.
* config/aarch64/aarch64-simd.md (ctz<mode>2): Likewise.
* config/aarch64/aarch64-sve-builtins-base.cc
(svget_impl::expand): Likewise.
(svget_neonq_impl::expand): Likewise.
* config/aarch64/aarch64-sve-builtins-functions.h
(multireg_permute::expand): Likewise.

Diff:
---
 gcc/config/aarch64/aarch64-builtins.cc  | 4 ++--
 gcc/config/aarch64/aarch64-simd.md  | 4 ++--
 gcc/config/aarch64/aarch64-sve-builtins-base.cc | 8 +++-
 gcc/config/aarch64/aarch64-sve-builtins-functions.h | 6 +++---
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
b/gcc/config/aarch64/aarch64-builtins.cc
index d589e59defc2..7d827cbc2ac0 100644
--- a/gcc/config/aarch64/aarch64-builtins.cc
+++ b/gcc/config/aarch64/aarch64-builtins.cc
@@ -2592,12 +2592,12 @@ aarch64_expand_fcmla_builtin (tree exp, rtx target, int 
fcode)
   rtx temp2 = gen_reg_rtx (DImode);
   temp1 = simplify_gen_subreg (d->mode, op2, quadmode,
   subreg_lowpart_offset (d->mode, quadmode));
-  temp1 = simplify_gen_subreg (V2DImode, temp1, d->mode, 0);
+  temp1 = force_subreg (V2DImode, temp1, d->mode, 0);
   if (BYTES_BIG_ENDIAN)
emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const0_rtx));
   else
emit_insn (gen_aarch64_get_lanev2di (temp2, temp1, const1_rtx));
-  op2 = simplify_gen_subreg (d->mode, temp2, GET_MODE (temp2), 0);
+  op2 = force_subreg (d->mode, temp2, GET_MODE (temp2), 0);
 
   /* And recalculate the index.  */
   lane -= nunits / 4;
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 0bb39091a385..01b084d8ccb5 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -389,8 +389,8 @@
   "TARGET_SIMD"
   {
 emit_insn (gen_bswap<mode>2 (operands[0], operands[1]));
- rtx op0_castsi2qi = simplify_gen_subreg(mode, operands[0],
-mode, 0);
+ rtx op0_castsi2qi = force_subreg (mode, operands[0],
+  mode, 0);
  emit_insn (gen_aarch64_rbit (op0_castsi2qi, op0_castsi2qi));
 emit_insn (gen_clz<mode>2 (operands[0], operands[0]));
  DONE;
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 823d60040f9a..999320371247 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1121,9 +1121,8 @@ public:
   expand (function_expander ) const override
   {
 /* Fold the access into a subreg rvalue.  */
-return simplify_gen_subreg (e.vector_mode (0), e.args[0],
-   GET_MODE (e.args[0]),
-   INTVAL (e.args[1]) * BYTES_PER_SVE_VECTOR);
+return force_subreg (e.vector_mode (0), e.args[0], GET_MODE (e.args[0]),
+INTVAL (e.args[1]) * BYTES_PER_SVE_VECTOR);
   }
 };
 
@@ -1157,8 +1156,7 @@ public:
e.add_fixed_operand (indices);
return e.generate_insn (icode);
   }
-return simplify_gen_subreg (e.result_mode (), e.args[0],
-   GET_MODE (e.args[0]), 0);
+return force_subreg (e.result_mode (), e.args[0], GET_MODE (e.args[0]), 0);
   }
 };
 
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-functions.h 
b/gcc/config/aarch64/aarch64-sve-builtins-functions.h
index 3b8e575e98e7..7d06a57ff834 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-functions.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins-functions.h
@@ -639,9 +639,9 @@ public:
   {
machine_mode elt_mode = e.vector_mode (0);
rtx arg = e.args[0];
-   e.args[0] = simplify_gen_subreg (elt_mode, arg, GET_MODE (arg), 0);
-   e.args.safe_push (simplify_gen_subreg (elt_mode, arg, GET_MODE (arg),
-  GET_MODE_SIZE (elt_mode)));
+   e.args[0] = force_subreg (elt_mode, arg, GET_MODE (arg), 0);
+   e.args.safe_push (force_subreg (elt_mode, arg, GET_MODE (arg),
+  

[gcc r15-1395] Make force_subreg emit nothing on failure

2024-06-18 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:01044471ea39f9be4803c583ef2a946abc657f99

commit r15-1395-g01044471ea39f9be4803c583ef2a946abc657f99
Author: Richard Sandiford 
Date:   Tue Jun 18 12:22:30 2024 +0100

Make force_subreg emit nothing on failure

While adding more uses of force_subreg, I realised that it should
be more careful to emit no instructions on failure.  This kind of
failure should be very rare, so I don't think it's a case worth
optimising for.

gcc/
* explow.cc (force_subreg): Emit no instructions on failure.

Diff:
---
 gcc/explow.cc | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/explow.cc b/gcc/explow.cc
index f6843398c4b0..bd93c8780649 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -756,8 +756,12 @@ force_subreg (machine_mode outermode, rtx op,
   if (x)
 return x;
 
+  auto *start = get_last_insn ();
   op = copy_to_mode_reg (innermode, op);
-  return simplify_gen_subreg (outermode, op, innermode, byte);
+  rtx res = simplify_gen_subreg (outermode, op, innermode, byte);
+  if (!res)
+delete_insns_since (start);
+  return res;
 }
 
 /* If X is a memory ref, copy its contents to a new temp reg and return


[gcc r15-1244] aarch64: Fix invalid nested subregs [PR115464]

2024-06-13 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:0970ff46ba6330fc80e8736fc05b2eaeeae0b6a0

commit r15-1244-g0970ff46ba6330fc80e8736fc05b2eaeeae0b6a0
Author: Richard Sandiford 
Date:   Thu Jun 13 12:48:21 2024 +0100

aarch64: Fix invalid nested subregs [PR115464]

The testcase extracts one arm_neon.h vector from a pair (one subreg)
and then reinterprets the result as an SVE vector (another subreg).
Each subreg makes sense individually, but we can't fold them together
into a single subreg: it's 32 bytes -> 16 bytes -> 16*N bytes,
but the interpretation of 32 bytes -> 16*N bytes depends on
whether N==1 or N>1.
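
To make the size mismatch concrete (illustrative notation, sizes in
bytes):

  (subreg (subreg (reg:32bytes) 0) 0)   ; 32 -> 16 -> 16*N
  (subreg (reg:32bytes) 0)              ; 32 -> 16*N

The nested form takes the low 16 bytes and reinterprets them as an
SVE vector; the folded form would instead take the low 16*N bytes
of the 32-byte register, which coincides with the nested form only
when N == 1.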

Since the second subreg makes sense individually, simplify_subreg
should bail out rather than ICE on it.  simplify_gen_subreg will
then do the same (because it already checks validate_subreg).
This leaves simplify_gen_subreg returning null, requiring the
caller to take appropriate action.

I think this is relatively likely to occur elsewhere, so the patch
adds a helper for forcing a subreg, allowing a temporary pseudo to
be created where necessary.

I'll follow up by using force_subreg in more places.  This patch
is intended to be a minimal backportable fix for the PR.

gcc/
PR target/115464
* simplify-rtx.cc (simplify_context::simplify_subreg): Don't try
to fold two subregs together if their relationship isn't known
at compile time.
* explow.h (force_subreg): Declare.
* explow.cc (force_subreg): New function.
* config/aarch64/aarch64-sve-builtins-base.cc
(svset_neonq_impl::expand): Use it instead of simplify_gen_subreg.

gcc/testsuite/
PR target/115464
* gcc.target/aarch64/sve/acle/general/pr115464.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc   |  2 +-
 gcc/explow.cc | 15 +++
 gcc/explow.h  |  2 ++
 gcc/simplify-rtx.cc   |  5 +
 .../gcc.target/aarch64/sve/acle/general/pr115464.c| 13 +
 5 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index dea2f6e6bfc4..823d60040f9a 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -1174,7 +1174,7 @@ public:
Advanced SIMD argument as an SVE vector.  */
 if (!BYTES_BIG_ENDIAN
&& is_undef (CALL_EXPR_ARG (e.call_expr, 0)))
-  return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0);
+  return force_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0);
 
 rtx_vector_builder builder (VNx16BImode, 16, 2);
 for (unsigned int i = 0; i < 16; i++)
diff --git a/gcc/explow.cc b/gcc/explow.cc
index 8e5f6b8e6804..f6843398c4b0 100644
--- a/gcc/explow.cc
+++ b/gcc/explow.cc
@@ -745,6 +745,21 @@ force_reg (machine_mode mode, rtx x)
   return temp;
 }
 
+/* Like simplify_gen_subreg, but force OP into a new register if the
+   subreg cannot be formed directly.  */
+
+rtx
+force_subreg (machine_mode outermode, rtx op,
+ machine_mode innermode, poly_uint64 byte)
+{
+  rtx x = simplify_gen_subreg (outermode, op, innermode, byte);
+  if (x)
+return x;
+
+  op = copy_to_mode_reg (innermode, op);
+  return simplify_gen_subreg (outermode, op, innermode, byte);
+}
+
 /* If X is a memory ref, copy its contents to a new temp reg and return
that reg.  Otherwise, return X.  */
 
diff --git a/gcc/explow.h b/gcc/explow.h
index 16aa02cfb689..cbd1fcb7eb34 100644
--- a/gcc/explow.h
+++ b/gcc/explow.h
@@ -42,6 +42,8 @@ extern rtx copy_to_suggested_reg (rtx, rtx, machine_mode);
Args are mode (in case value is a constant) and the value.  */
 extern rtx force_reg (machine_mode, rtx);
 
+extern rtx force_subreg (machine_mode, rtx, machine_mode, poly_uint64);
+
 /* Return given rtx, copied into a new temp reg if it was in memory.  */
 extern rtx force_not_mem (rtx);
 
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 3ee95f74d3db..35ba54c62921 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -7737,6 +7737,11 @@ simplify_context::simplify_subreg (machine_mode 
outermode, rtx op,
   poly_uint64 innermostsize = GET_MODE_SIZE (innermostmode);
   rtx newx;
 
+  /* Make sure that the relationship between the two subregs is
+known at compile time.  */
+  if (!ordered_p (outersize, innermostsize))
+   return NULL_RTX;
+
   if (outermode == innermostmode
  && known_eq (byte, 0U)
  && known_eq (SUBREG_BYTE (op), 0))
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr115464.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr115464.c
new file mode 100644
index ..d728d1325edb
--- /dev/null

[gcc r14-10303] ira: Fix go_through_subreg offset calculation [PR115281]

2024-06-11 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:7d64bc0990381221c480ba15cb9cc950e51e2cef

commit r14-10303-g7d64bc0990381221c480ba15cb9cc950e51e2cef
Author: Richard Sandiford 
Date:   Tue Jun 11 09:58:48 2024 +0100

ira: Fix go_through_subreg offset calculation [PR115281]

go_through_subreg used:

  else if (!can_div_trunc_p (SUBREG_BYTE (x),
 REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))

to calculate the register offset for a pseudo subreg x.  In the blessed
days before poly-int, this was:

*offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x)));

But I think this is testing the wrong natural size.  If we exclude
paradoxical subregs (which will get an offset of zero regardless),
it's the inner register that is being split, so it should be the
inner register's natural size that we use.

This matters in the testcase because we have an SFmode lowpart
subreg into the last of three variable-sized vectors.  The
SUBREG_BYTE is therefore equal to the size of two variable-sized
vectors.  Dividing by the vector size gives a register offset of 2,
as expected, but dividing by the size of a scalar FPR would give
a variable offset.

I think something similar could happen for fixed-size targets if
REGMODE_NATURAL_SIZE is different for vectors and integers (say),
although that case would trade an ICE for an incorrect offset.
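
Plugging the testcase numbers in (S = the number of bytes in one
variable-length vector; illustrative):

  SUBREG_BYTE (x)                          = 2 * S
  REGMODE_NATURAL_SIZE (inner vector mode) = S      -> offset = 2
  REGMODE_NATURAL_SIZE (SFmode)            = a fixed scalar size

(2 * S) divided by a fixed size is not a compile-time constant, so
can_div_trunc_p fails and we hit the gcc_unreachable.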

gcc/
PR rtl-optimization/115281
* ira-conflicts.cc (go_through_subreg): Use the natural size of
the inner mode rather than the outer mode.

gcc/testsuite/
PR rtl-optimization/115281
* gfortran.dg/pr115281.f90: New test.

(cherry picked from commit 46d931b3dd31cbba7c3355ada63f155aa24a4e2b)

Diff:
---
 gcc/ira-conflicts.cc   |  3 ++-
 gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc
index 83274c53330..15ac42d8848 100644
--- a/gcc/ira-conflicts.cc
+++ b/gcc/ira-conflicts.cc
@@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset)
   if (REGNO (reg) < FIRST_PSEUDO_REGISTER)
 *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg),
   SUBREG_BYTE (x), GET_MODE (x));
+  /* The offset is always 0 for paradoxical subregs.  */
   else if (!can_div_trunc_p (SUBREG_BYTE (x),
-REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))
+REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset))
 /* Checked by validate_subreg.  We must know at compile time which
inner hard registers are being accessed.  */
 gcc_unreachable ();
diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 
b/gcc/testsuite/gfortran.dg/pr115281.f90
new file mode 100644
index 000..80aa822e745
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr115281.f90
@@ -0,0 +1,39 @@
+! { dg-options "-O3" }
+! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } }
+
+SUBROUTINE fn0(ma, mb, nt)
+  CHARACTER ca
+  REAL r0(ma)
+  INTEGER i0(mb)
+  REAL r1(3,mb)
+  REAL r2(3,mb)
+  REAL r3(3,3)
+  zero=0.0
+  do na = 1, nt
+ nt = i0(na)
+ do l = 1, 3
+r1 (l, na) =   r0 (nt)
+r2(l, na) = zero
+ enddo
+  enddo
+  if (ca  .ne.'z') then
+ do j = 1, 3
+do i = 1, 3
+   r4  = zero
+enddo
+ enddo
+ do na = 1, nt
+do k =  1, 3
+   do l = 1, 3
+  do m = 1, 3
+ r3 = r4 * v
+  enddo
+   enddo
+enddo
+ do i = 1, 3
+   do k = 1, ifn (r3)
+   enddo
+enddo
+ enddo
+ endif
+END


[gcc r11-11468] rtl-ssa: Fix -fcompare-debug failure [PR100303]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:a1fb76e041740e7dd8cdf71dff3ae7aa31b3ea9b

commit r11-11468-ga1fb76e041740e7dd8cdf71dff3ae7aa31b3ea9b
Author: Richard Sandiford 
Date:   Tue Jun 4 13:47:36 2024 +0100

rtl-ssa: Fix -fcompare-debug failure [PR100303]

This patch fixes an oversight in the handling of debug instructions
in rtl-ssa.  At the moment (and whether this is a good idea or not
remains to be seen), we maintain a linear RPO sequence of definitions
and non-debug uses.  If a register is defined more than once, we use
a degenerate phi to reestablish a previous definition where necessary.

However, debug instructions shouldn't of course affect codegen,
so we can't create a new definition just for them.  In those situations
we instead hang the debug use off the real definition (meaning that
debug uses do not follow a linear order wrt definitions).  Again,
it remains to be seen whether that's a good idea.

The problem in the PR was that we weren't taking this into account
when increasing (or potentially increasing) the live range of an
existing definition.  We'd create the phi even if it would only
be used by debug instructions.

The patch goes for the simple but inelegant approach of passing
a bool to say whether the use is a debug use or not.  I imagine
this area will need some tweaking based on experience in future.

gcc/
PR rtl-optimization/100303
* rtl-ssa/accesses.cc (function_info::make_use_available): Take a
boolean that indicates whether the use will only be used in
debug instructions.  Treat it in the same way that existing
cross-EBB debug references would be handled if so.
(function_info::make_uses_available): Likewise.
* rtl-ssa/functions.h (function_info::make_uses_available): Update
prototype accordingly.
(function_info::make_uses_available): Likewise.
* fwprop.c (try_fwprop_subst): Update call accordingly.

(cherry picked from commit c97351c0cf4872cc0e99e73ed17fb16659fd38b3)

Diff:
---
 gcc/fwprop.c|   3 +-
 gcc/rtl-ssa/accesses.cc |  15 +++--
 gcc/rtl-ssa/functions.h |   7 +-
 gcc/testsuite/g++.dg/torture/pr100303.C | 112 
 4 files changed, 129 insertions(+), 8 deletions(-)

diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index d7203672886..73284a7ae3e 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -606,7 +606,8 @@ try_fwprop_subst (use_info *use, set_info *def,
   if (def_insn->bb () != use_insn->bb ())
 {
   src_uses = crtl->ssa->make_uses_available (attempt, src_uses,
-use_insn->bb ());
+use_insn->bb (),
+use_insn->is_debug_insn ());
   if (!src_uses.is_valid ())
return false;
 }
diff --git a/gcc/rtl-ssa/accesses.cc b/gcc/rtl-ssa/accesses.cc
index af7b568fa98..0621ea22880 100644
--- a/gcc/rtl-ssa/accesses.cc
+++ b/gcc/rtl-ssa/accesses.cc
@@ -1290,7 +1290,10 @@ function_info::insert_temp_clobber (obstack_watermark &watermark,
 }
 
 // A subroutine of make_uses_available.  Try to make USE's definition
-// available at the head of BB.  On success:
+// available at the head of BB.  WILL_BE_DEBUG_USE is true if the
+// definition will be used only in debug instructions.
+//
+// On success:
 //
 // - If the use would have the same def () as USE, return USE.
 //
@@ -1302,7 +1305,8 @@ function_info::insert_temp_clobber (obstack_watermark &watermark,
 //
 // Return null on failure.
 use_info *
-function_info::make_use_available (use_info *use, bb_info *bb)
+function_info::make_use_available (use_info *use, bb_info *bb,
+  bool will_be_debug_use)
 {
   set_info *def = use->def ();
   if (!def)
@@ -1318,7 +1322,7 @@ function_info::make_use_available (use_info *use, bb_info 
*bb)
   && single_pred (cfg_bb) == use_bb->cfg_bb ()
   && remains_available_on_exit (def, use_bb))
 {
-  if (def->ebb () == bb->ebb ())
+  if (def->ebb () == bb->ebb () || will_be_debug_use)
return use;
 
   resource_info resource = use->resource ();
@@ -1362,7 +1366,8 @@ function_info::make_use_available (use_info *use, bb_info 
*bb)
 // See the comment above the declaration.
 use_array
 function_info::make_uses_available (obstack_watermark &watermark,
-   use_array uses, bb_info *bb)
+   use_array uses, bb_info *bb,
+   bool will_be_debug_uses)
 {
   unsigned int num_uses = uses.size ();
   if (num_uses == 0)
@@ -1371,7 +1376,7 @@ function_info::make_uses_available (obstack_watermark &watermark,
   auto **new_uses = XOBNEWVEC (watermark, access_info *, num_uses);
   for (unsigned int i = 0; i < num_uses; ++i)
 {
-  use_info *use = 

[gcc r11-11467] rtl-ssa: Extend m_num_defs to a full unsigned int [PR108086]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:66d01cc3f4a248ccc471a978f0bfe3615c3f3a30

commit r11-11467-g66d01cc3f4a248ccc471a978f0bfe3615c3f3a30
Author: Richard Sandiford 
Date:   Tue Jun 4 13:47:35 2024 +0100

rtl-ssa: Extend m_num_defs to a full unsigned int [PR108086]

insn_info tried to save space by storing the number of
definitions in a 16-bit bitfield.  The justification was:

  // ...  FIRST_PSEUDO_REGISTER + 1
  // is the maximum number of accesses to hard registers and memory, and
  // MAX_RECOG_OPERANDS is the maximum number of pseudos that can be
  // defined by an instruction, so the number of definitions should fit
  // easily in 16 bits.

But while that reasoning holds (I think) for real instructions,
it doesn't hold for artificial instructions.  I don't think there's
any sensible higher limit we can use, so this patch goes for a full
unsigned int.

gcc/
PR rtl-optimization/108086
* rtl-ssa/insns.h (insn_info): Make m_num_defs a full unsigned int.
Adjust size-related commentary accordingly.

(cherry picked from commit cd41085a37b8288dbdfe0f81027ce04b978578f1)

Diff:
---
 gcc/rtl-ssa/insns.h | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/gcc/rtl-ssa/insns.h b/gcc/rtl-ssa/insns.h
index e4aa6d1d5ce..ab715adc151 100644
--- a/gcc/rtl-ssa/insns.h
+++ b/gcc/rtl-ssa/insns.h
@@ -141,7 +141,7 @@ using insn_call_clobbers_tree = default_splay_tree<insn_call_clobbers_note *>;
 // of "notes", a bit like REG_NOTES for the underlying RTL insns.
 class insn_info
 {
-  // Size: 8 LP64 words.
+  // Size: 9 LP64 words.
   friend class ebb_info;
   friend class function_info;
 
@@ -401,10 +401,11 @@ private:
   // The number of definitions and the number of uses.  FIRST_PSEUDO_REGISTER + 1
   // is the maximum number of accesses to hard registers and memory, and
   // MAX_RECOG_OPERANDS is the maximum number of pseudos that can be
-  // defined by an instruction, so the number of definitions should fit
-  // easily in 16 bits.
+  // defined by an instruction, so the number of definitions in a real
+  // instruction should fit easily in 16 bits.  However, there are no
+  // limits on the number of definitions in artificial instructions.
   unsigned int m_num_uses;
-  unsigned int m_num_defs : 16;
+  unsigned int m_num_defs;
 
   // Flags returned by the accessors above.
   unsigned int m_is_debug_insn : 1;
@@ -414,7 +415,7 @@ private:
   unsigned int m_has_volatile_refs : 1;
 
   // For future expansion.
-  unsigned int m_spare : 11;
+  unsigned int m_spare : 27;
 
   // The program point at which the instruction occurs.
   //
@@ -431,6 +432,9 @@ private:
   // instruction.
   mutable int m_cost_or_uid;
 
+  // On LP64 systems, there's a gap here that could be used for future
+  // expansion.
+
   // The list of notes that have been attached to the instruction.
   insn_note *m_first_note;
 };


[gcc r11-11466] vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62

commit r11-11466-g95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62
Author: Richard Sandiford 
Date:   Tue Jun 4 13:47:35 2024 +0100

vect: Tighten vect_determine_precisions_from_range [PR113281]

This was another PR caused by the way that
vect_determine_precisions_from_range handles shifts.  We tried to
narrow 32768 >> x to a 16-bit shift based on range information for
the inputs and outputs, with vect_recog_over_widening_pattern
(after PR110828) adjusting the shift amount.  But this doesn't
work for the case where x is in [16, 31], since then 32-bit
32768 >> x is a well-defined zero, whereas no well-defined
16-bit 32768 >> y will produce 0.
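
A worked illustration (shift amount from the problem range):

  32-bit:  32768 >> 20  ==  0             (well defined)
  16-bit:  32768 >> 20                    (undefined: the shift
                                           amount is >= the 16-bit
                                           precision, so nothing the
                                           pattern emits can
                                           reproduce the zero)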

We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
but since vect_determine_precisions_from_range was never really
supposed to rely on fix-ups, it seems better to fix that instead.

The patch also makes the code more selective about which codes
can be narrowed based on input and output ranges.  This showed
that vect_truncatable_operation_p was missing cases for
BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
(equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).

pr113281-1.c is the original testcase.  pr113281-[23].c failed
before the patch due to overly optimistic narrowing.  pr113281-[45].c
previously passed and are meant to protect against accidental
optimisation regressions.

gcc/
PR target/113281
* tree-vect-patterns.c (vect_recog_over_widening_pattern): Remove
workaround for right shifts.
(vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR.
(vect_determine_precisions_from_range): Be more selective about
which codes can be narrowed based on their input and output ranges.
For shifts, require at least one more bit of precision than the
maximum shift amount.

gcc/testsuite/
PR target/113281
* gcc.dg/vect/pr113281-1.c: New test.
* gcc.dg/vect/pr113281-2.c: Likewise.
* gcc.dg/vect/pr113281-3.c: Likewise.
* gcc.dg/vect/pr113281-4.c: Likewise.
* gcc.dg/vect/pr113281-5.c: Likewise.

(cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr113281-1.c |  17 ++
 gcc/testsuite/gcc.dg/vect/pr113281-2.c |  50 +++
 gcc/testsuite/gcc.dg/vect/pr113281-3.c |  39 
 gcc/testsuite/gcc.dg/vect/pr113281-4.c |  55 +
 gcc/testsuite/gcc.dg/vect/pr113281-5.c |  66 
 gcc/tree-vect-patterns.c   | 107 -
 6 files changed, 305 insertions(+), 29 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
new file mode 100644
index 000..6df4231cb5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
@@ -0,0 +1,17 @@
+#include "tree-vect.h"
+
+unsigned char a;
+
+int main() {
+  check_vect ();
+
+  short b = a = 0;
+  for (; a != 19; a++)
+if (a)
+  b = 32872 >> a;
+
+  if (b == 0)
+return 0;
+  else
+return 1;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
new file mode 100644
index 000..3a1170c28b6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
@@ -0,0 +1,50 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= y[i];
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 32 ? y[i] : 32);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 31 ? y[i] : 31);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] & 31);
+}
+
+void
+f5 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> y[i];
+}
+
+void
+f6 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> (y[i] & 31);
+}
+
+/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
new file mode 100644
index 000..5982dd2d16f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 30 ? y[i] : 30);
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= ((y[i] & 15) + 2);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 16 ? y[i] : 16);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] = 32768 >> ((y[i] & 15) + 3);
+}
+
+/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} 
"vect" } } */
+/* { dg-final { scan-tree-dump {can 

[gcc r11-11465] vect: Fix access size alignment assumption [PR115192]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:741ea10418987ac02eb8e680f2946a6e5928eb23

commit r11-11465-g741ea10418987ac02eb8e680f2946a6e5928eb23
Author: Richard Sandiford 
Date:   Tue Jun 4 13:47:34 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.
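
A made-up numeric example (not taken from the PR) of how the old
inclusive-bound calculation goes wrong when the access size is not
a multiple of the pointer alignment:

    start_a = start_b = 0, seg_len = 0, access_size = 1,
    both pointers 4-byte aligned

    exclusive end = start + seg_len + access_size = 1

    before: min_align = MIN (align_a, align_b) = 4
            inclusive max = 1 - 4 = -3
            "-3 < 0" holds, so the ranges are wrongly declared
            alias-free even though both cover byte 0

    after:  min_align is also clamped by known_alignment (access_size) = 1
            inclusive max = 1 - 1 = 0
            "0 < 0" fails, so the overlap is detected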

gcc/
PR tree-optimization/115192
* tree-data-ref.c (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.c  |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index b3dd2f0ca41..d127aba8792 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2629,7 +2630,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[gcc r12-10489] vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:dfaa13455d67646805bc611aa4373728a460a37d

commit r12-10489-gdfaa13455d67646805bc611aa4373728a460a37d
Author: Richard Sandiford 
Date:   Tue Jun 4 08:47:48 2024 +0100

vect: Tighten vect_determine_precisions_from_range [PR113281]

This was another PR caused by the way that
vect_determine_precisions_from_range handles shifts.  We tried to
narrow 32768 >> x to a 16-bit shift based on range information for
the inputs and outputs, with vect_recog_over_widening_pattern
(after PR110828) adjusting the shift amount.  But this doesn't
work for the case where x is in [16, 31], since then 32-bit
32768 >> x is a well-defined zero, whereas no well-defined
16-bit 32768 >> y will produce 0.

We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
but since vect_determine_precisions_from_range was never really
supposed to rely on fix-ups, it seems better to fix that instead.

The patch also makes the code more selective about which codes
can be narrowed based on input and output ranges.  This showed
that vect_truncatable_operation_p was missing cases for
BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
(equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).

pr113281-1.c is the original testcase.  pr113281-[23].c failed
before the patch due to overly optimistic narrowing.  pr113281-[45].c
previously passed and are meant to protect against accidental
optimisation regressions.
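
A worked instance of the shift problem (32768 and the range [16, 31]
are from the description above; the concrete shift amount is picked
for illustration):

    32-bit:  32768 >> 20 == 0                      (well defined)
    16-bit:  32768 >> y  >= 1    for y in [0, 15]
             32768 >> y  is undefined for y >= 16

so no single well-defined 16-bit shift can reproduce the 32-bit
result over the whole input range.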

gcc/
PR target/113281
* tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove
workaround for right shifts.
(vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR.
(vect_determine_precisions_from_range): Be more selective about
which codes can be narrowed based on their input and output ranges.
For shifts, require at least one more bit of precision than the
maximum shift amount.

gcc/testsuite/
PR target/113281
* gcc.dg/vect/pr113281-1.c: New test.
* gcc.dg/vect/pr113281-2.c: Likewise.
* gcc.dg/vect/pr113281-3.c: Likewise.
* gcc.dg/vect/pr113281-4.c: Likewise.
* gcc.dg/vect/pr113281-5.c: Likewise.

(cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr113281-1.c |  17 ++
 gcc/testsuite/gcc.dg/vect/pr113281-2.c |  50 +++
 gcc/testsuite/gcc.dg/vect/pr113281-3.c |  39 
 gcc/testsuite/gcc.dg/vect/pr113281-4.c |  55 +
 gcc/testsuite/gcc.dg/vect/pr113281-5.c |  66 
 gcc/tree-vect-patterns.cc  | 107 -
 6 files changed, 305 insertions(+), 29 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
new file mode 100644
index 000..6df4231cb5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
@@ -0,0 +1,17 @@
+#include "tree-vect.h"
+
+unsigned char a;
+
+int main() {
+  check_vect ();
+
+  short b = a = 0;
+  for (; a != 19; a++)
+if (a)
+  b = 32872 >> a;
+
+  if (b == 0)
+return 0;
+  else
+return 1;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
new file mode 100644
index 000..3a1170c28b6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
@@ -0,0 +1,50 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= y[i];
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 32 ? y[i] : 32);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 31 ? y[i] : 31);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] & 31);
+}
+
+void
+f5 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> y[i];
+}
+
+void
+f6 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> (y[i] & 31);
+}
+
+/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
new file mode 100644
index 000..5982dd2d16f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 30 ? y[i] : 30);
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= ((y[i] & 15) + 2);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 16 ? y[i] : 16);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] = 32768 >> ((y[i] & 15) + 3);
+}
+
+/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} 
"vect" } } */
+/* { dg-final { scan-tree-dump {can 

[gcc r12-10488] vect: Fix access size alignment assumption [PR115192]

2024-06-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:f510e59db482456160b8a63dc083c78b0c1f6c09

commit r12-10488-gf510e59db482456160b8a63dc083c78b0c1f6c09
Author: Richard Sandiford 
Date:   Tue Jun 4 08:47:47 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index 0df4a3525f4..706a49f226e 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2627,7 +2628,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[gcc r13-8813] vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-05-31 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:2602b71103d5ef2ef86000cac832b31dad3dfe2b

commit r13-8813-g2602b71103d5ef2ef86000cac832b31dad3dfe2b
Author: Richard Sandiford 
Date:   Fri May 31 15:56:05 2024 +0100

vect: Tighten vect_determine_precisions_from_range [PR113281]

This was another PR caused by the way that
vect_determine_precisions_from_range handles shifts.  We tried to
narrow 32768 >> x to a 16-bit shift based on range information for
the inputs and outputs, with vect_recog_over_widening_pattern
(after PR110828) adjusting the shift amount.  But this doesn't
work for the case where x is in [16, 31], since then 32-bit
32768 >> x is a well-defined zero, whereas no well-defined
16-bit 32768 >> y will produce 0.

We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
but since vect_determine_precisions_from_range was never really
supposed to rely on fix-ups, it seems better to fix that instead.

The patch also makes the code more selective about which codes
can be narrowed based on input and output ranges.  This showed
that vect_truncatable_operation_p was missing cases for
BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
(equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).

pr113281-1.c is the original testcase.  pr113281-[23].c failed
before the patch due to overly optimistic narrowing.  pr113281-[45].c
previously passed and are meant to protect against accidental
optimisation regressions.

gcc/
PR target/113281
* tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove
workaround for right shifts.
(vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR.
(vect_determine_precisions_from_range): Be more selective about
which codes can be narrowed based on their input and output ranges.
For shifts, require at least one more bit of precision than the
maximum shift amount.

gcc/testsuite/
PR target/113281
* gcc.dg/vect/pr113281-1.c: New test.
* gcc.dg/vect/pr113281-2.c: Likewise.
* gcc.dg/vect/pr113281-3.c: Likewise.
* gcc.dg/vect/pr113281-4.c: Likewise.
* gcc.dg/vect/pr113281-5.c: Likewise.

(cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr113281-1.c |  17 ++
 gcc/testsuite/gcc.dg/vect/pr113281-2.c |  50 +++
 gcc/testsuite/gcc.dg/vect/pr113281-3.c |  39 
 gcc/testsuite/gcc.dg/vect/pr113281-4.c |  55 +
 gcc/testsuite/gcc.dg/vect/pr113281-5.c |  66 
 gcc/tree-vect-patterns.cc  | 107 -
 6 files changed, 305 insertions(+), 29 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-1.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
new file mode 100644
index 000..6df4231cb5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-1.c
@@ -0,0 +1,17 @@
+#include "tree-vect.h"
+
+unsigned char a;
+
+int main() {
+  check_vect ();
+
+  short b = a = 0;
+  for (; a != 19; a++)
+if (a)
+  b = 32872 >> a;
+
+  if (b == 0)
+return 0;
+  else
+return 1;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-2.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
new file mode 100644
index 000..3a1170c28b6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-2.c
@@ -0,0 +1,50 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= y[i];
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 32 ? y[i] : 32);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 31 ? y[i] : 31);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] & 31);
+}
+
+void
+f5 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> y[i];
+}
+
+void
+f6 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= 0x8000 >> (y[i] & 31);
+}
+
+/* { dg-final { scan-tree-dump-not {can narrow[^\n]+>>} "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr113281-3.c 
b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
new file mode 100644
index 000..5982dd2d16f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr113281-3.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+
+#define N 128
+
+short x[N];
+short y[N];
+
+void
+f1 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 30 ? y[i] : 30);
+}
+
+void
+f2 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= ((y[i] & 15) + 2);
+}
+
+void
+f3 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] >>= (y[i] < 16 ? y[i] : 16);
+}
+
+void
+f4 (void)
+{
+  for (int i = 0; i < N; ++i)
+x[i] = 32768 >> ((y[i] & 15) + 3);
+}
+
+/* { dg-final { scan-tree-dump {can narrow to signed:31 without loss [^\n]+>>} 
"vect" } } */
+/* { dg-final { scan-tree-dump {can 

[gcc r13-8812] vect: Fix access size alignment assumption [PR115192]

2024-05-31 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:0836216693749f3b0b383d015bd36c004754f1da

commit r13-8812-g0836216693749f3b0b383d015bd36c004754f1da
Author: Richard Sandiford 
Date:   Fri May 31 15:56:04 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index 6cd5f7aa3cf..96934addff1 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2629,7 +2630,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[gcc r14-10263] vect: Fix access size alignment assumption [PR115192]

2024-05-31 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:36575f5fe491d86b6851ff3f47cbfb7dad0fc8ae

commit r14-10263-g36575f5fe491d86b6851ff3f47cbfb7dad0fc8ae
Author: Richard Sandiford 
Date:   Fri May 31 08:22:55 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index f37734b5340..654a8220214 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[gcc r15-929] ira: Fix go_through_subreg offset calculation [PR115281]

2024-05-30 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:46d931b3dd31cbba7c3355ada63f155aa24a4e2b

commit r15-929-g46d931b3dd31cbba7c3355ada63f155aa24a4e2b
Author: Richard Sandiford 
Date:   Thu May 30 16:17:58 2024 +0100

ira: Fix go_through_subreg offset calculation [PR115281]

go_through_subreg used:

  else if (!can_div_trunc_p (SUBREG_BYTE (x),
 REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))

to calculate the register offset for a pseudo subreg x.  In the blessed
days before poly-int, this was:

*offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x)));

But I think this is testing the wrong natural size.  If we exclude
paradoxical subregs (which will get an offset of zero regardless),
it's the inner register that is being split, so it should be the
inner register's natural size that we use.

This matters in the testcase because we have an SFmode lowpart
subreg into the last of three variable-sized vectors.  The
SUBREG_BYTE is therefore equal to the size of two variable-sized
vectors.  Dividing by the vector size gives a register offset of 2,
as expected, but dividing by the size of a scalar FPR would give
a variable offset.

I think something similar could happen for fixed-size targets if
REGMODE_NATURAL_SIZE is different for vectors and integers (say),
although that case would trade an ICE for an incorrect offset.
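
A sketch of the poly_int arithmetic for the SVE case above, writing
the variable vector length as 16 + 16x bytes (sizes illustrative):

    SUBREG_BYTE = 2 * (16 + 16x) = 32 + 32x

    divide by the inner register's natural size (the vector size):
        (32 + 32x) / (16 + 16x) = 2       constant offset, OK

    divide by the outer SFmode's natural size (a compile-time
    constant scalar size):
        (32 + 32x) / scalar size          depends on x, so
        can_div_trunc_p fails and we hit gcc_unreachable.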

gcc/
PR rtl-optimization/115281
* ira-conflicts.cc (go_through_subreg): Use the natural size of
the inner mode rather than the outer mode.

gcc/testsuite/
PR rtl-optimization/115281
* gfortran.dg/pr115281.f90: New test.

Diff:
---
 gcc/ira-conflicts.cc   |  3 ++-
 gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc
index 83274c53330..15ac42d8848 100644
--- a/gcc/ira-conflicts.cc
+++ b/gcc/ira-conflicts.cc
@@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset)
   if (REGNO (reg) < FIRST_PSEUDO_REGISTER)
 *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg),
   SUBREG_BYTE (x), GET_MODE (x));
+  /* The offset is always 0 for paradoxical subregs.  */
   else if (!can_div_trunc_p (SUBREG_BYTE (x),
-REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))
+REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset))
 /* Checked by validate_subreg.  We must know at compile time which
inner hard registers are being accessed.  */
 gcc_unreachable ();
diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 
b/gcc/testsuite/gfortran.dg/pr115281.f90
new file mode 100644
index 000..80aa822e745
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr115281.f90
@@ -0,0 +1,39 @@
+! { dg-options "-O3" }
+! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } }
+
+SUBROUTINE fn0(ma, mb, nt)
+  CHARACTER ca
+  REAL r0(ma)
+  INTEGER i0(mb)
+  REAL r1(3,mb)
+  REAL r2(3,mb)
+  REAL r3(3,3)
+  zero=0.0
+  do na = 1, nt
+ nt = i0(na)
+ do l = 1, 3
+r1 (l, na) =   r0 (nt)
+r2(l, na) = zero
+ enddo
+  enddo
+  if (ca  .ne.'z') then
+ do j = 1, 3
+do i = 1, 3
+   r4  = zero
+enddo
+ enddo
+ do na = 1, nt
+do k =  1, 3
+   do l = 1, 3
+  do m = 1, 3
+ r3 = r4 * v
+  enddo
+   enddo
+enddo
+ do i = 1, 3
+   do k = 1, ifn (r3)
+   enddo
+enddo
+ enddo
+ endif
+END


[gcc r15-906] aarch64: Split aarch64_combinev16qi before RA [PR115258]

2024-05-29 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec

commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
Author: Richard Sandiford 
Date:   Wed May 29 16:43:33 2024 +0100

aarch64: Split aarch64_combinev16qi before RA [PR115258]

Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
purpose is to put the two input data vectors into consecutive registers.
This aarch64_combinev16qi was then split after reload into individual
moves (from the first input to the first half of the output, and from
the second input to the second half of the output).

In the worst case, the RA might allocate things so that the destination
of the aarch64_combinev16qi is the second input followed by the first
input.  In that case, the split form of aarch64_combinev16qi uses three
eors to swap the registers around.
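
The three-eor fallback is the classic XOR swap; for the worst-case
allocation it would look something like this (register numbers
invented for illustration):

    eor     v0.16b, v0.16b, v1.16b
    eor     v1.16b, v0.16b, v1.16b
    eor     v0.16b, v0.16b, v1.16b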

This PR is about a test where this worst case occurred.  And given the
insn description, that allocation doesn't seem unreasonable.

early-ra should (hopefully) mean that we're now better at allocating
subregs of vector registers.  The upcoming RA subreg patches should
improve things further.  The best fix for the PR therefore seems
to be to split the combination before RA, so that the RA can see
the underlying moves.

Perhaps it even makes sense to do this at expand time, avoiding the need
for aarch64_combinev16qi entirely.  That deserves more experimentation
though.

gcc/
PR target/115258
* config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
the split before reload.
* config/aarch64/aarch64.cc (aarch64_split_combinev16qi): Generalize
into a form that handles pseudo registers.

gcc/testsuite/
PR target/115258
* gcc.target/aarch64/pr115258.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-simd.md  |  2 +-
 gcc/config/aarch64/aarch64.cc   | 29 ++---
 gcc/testsuite/gcc.target/aarch64/pr115258.c | 19 +++
 3 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index c311888e4bd..868f4486218 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -8474,7 +8474,7 @@
UNSPEC_CONCAT))]
   "TARGET_SIMD"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(const_int 0)]
 {
   aarch64_split_combinev16qi (operands);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ee12d8897a8..13191ec8e34 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25333,27 +25333,26 @@ aarch64_output_sve_ptrues (rtx const_unspec)
 void
 aarch64_split_combinev16qi (rtx operands[3])
 {
-  unsigned int dest = REGNO (operands[0]);
-  unsigned int src1 = REGNO (operands[1]);
-  unsigned int src2 = REGNO (operands[2]);
   machine_mode halfmode = GET_MODE (operands[1]);
-  unsigned int halfregs = REG_NREGS (operands[1]);
-  rtx destlo, desthi;
 
   gcc_assert (halfmode == V16QImode);
 
-  if (src1 == dest && src2 == dest + halfregs)
+  rtx destlo = simplify_gen_subreg (halfmode, operands[0],
+   GET_MODE (operands[0]), 0);
+  rtx desthi = simplify_gen_subreg (halfmode, operands[0],
+   GET_MODE (operands[0]),
+   GET_MODE_SIZE (halfmode));
+
+  bool skiplo = rtx_equal_p (destlo, operands[1]);
+  bool skiphi = rtx_equal_p (desthi, operands[2]);
+
+  if (skiplo && skiphi)
 {
   /* No-op move.  Can't split to nothing; emit something.  */
   emit_note (NOTE_INSN_DELETED);
   return;
 }
 
-  /* Preserve register attributes for variable tracking.  */
-  destlo = gen_rtx_REG_offset (operands[0], halfmode, dest, 0);
-  desthi = gen_rtx_REG_offset (operands[0], halfmode, dest + halfregs,
-  GET_MODE_SIZE (halfmode));
-
   /* Special case of reversed high/low parts.  */
   if (reg_overlap_mentioned_p (operands[2], destlo)
   && reg_overlap_mentioned_p (operands[1], desthi))
@@ -25366,16 +25365,16 @@ aarch64_split_combinev16qi (rtx operands[3])
 {
   /* Try to avoid unnecessary moves if part of the result
 is in the right place already.  */
-  if (src1 != dest)
+  if (!skiplo)
emit_move_insn (destlo, operands[1]);
-  if (src2 != dest + halfregs)
+  if (!skiphi)
emit_move_insn (desthi, operands[2]);
 }
   else
 {
-  if (src2 != dest + halfregs)
+  if (!skiphi)
emit_move_insn (desthi, operands[2]);
-  if (src1 != dest)
+  if (!skiplo)
emit_move_insn (destlo, operands[1]);
 }
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258.c 
b/gcc/testsuite/gcc.target/aarch64/pr115258.c
new file mode 100644
index 000..9a489d4604c
--- /dev/null
+++ 

[gcc r15-820] vect: Fix access size alignment assumption [PR115192]

2024-05-24 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba

commit r15-820-ga0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba
Author: Richard Sandiford 
Date:   Fri May 24 13:47:21 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index db15ddb43de..7c4049faf34 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[gcc r15-752] Cache the set of EH_RETURN_DATA_REGNOs

2024-05-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:7f35863ebbf7ba63e2f075edfbec105de272578a

commit r15-752-g7f35863ebbf7ba63e2f075edfbec105de272578a
Author: Richard Sandiford 
Date:   Tue May 21 10:21:16 2024 +0100

Cache the set of EH_RETURN_DATA_REGNOs

While reviewing Andrew's fix for PR114843, it seemed like it would
be convenient to have a HARD_REG_SET of EH_RETURN_DATA_REGNOs.
This patch adds one and uses it to simplify a couple of use sites.

gcc/
* hard-reg-set.h (target_hard_regs::x_eh_return_data_regs): New 
field.
(eh_return_data_regs): New macro.
* reginfo.cc (init_reg_sets_1): Initialize x_eh_return_data_regs.
* df-scan.cc (df_get_exit_block_use_set): Use it.
* ira-lives.cc (process_out_of_region_eh_regs): Likewise.

Diff:
---
 gcc/df-scan.cc |  8 +---
 gcc/hard-reg-set.h |  5 +
 gcc/ira-lives.cc   | 10 ++
 gcc/reginfo.cc | 10 ++
 4 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/gcc/df-scan.cc b/gcc/df-scan.cc
index 1bade2cd71e..c8ab3c09cee 100644
--- a/gcc/df-scan.cc
+++ b/gcc/df-scan.cc
@@ -3702,13 +3702,7 @@ df_get_exit_block_use_set (bitmap exit_block_uses)
 
   /* Mark the registers that will contain data for the handler.  */
   if (reload_completed && crtl->calls_eh_return)
-for (i = 0; ; ++i)
-  {
-   unsigned regno = EH_RETURN_DATA_REGNO (i);
-   if (regno == INVALID_REGNUM)
- break;
-   bitmap_set_bit (exit_block_uses, regno);
-  }
+IOR_REG_SET_HRS (exit_block_uses, eh_return_data_regs);
 
 #ifdef EH_RETURN_STACKADJ_RTX
   if ((!targetm.have_epilogue () || ! epilogue_completed)
diff --git a/gcc/hard-reg-set.h b/gcc/hard-reg-set.h
index 8c1d1512ca2..340eb425c10 100644
--- a/gcc/hard-reg-set.h
+++ b/gcc/hard-reg-set.h
@@ -421,6 +421,9 @@ struct target_hard_regs {
  with the local stack frame are safe, but scant others.  */
   HARD_REG_SET x_regs_invalidated_by_call;
 
+  /* The set of registers that are used by EH_RETURN_DATA_REGNO.  */
+  HARD_REG_SET x_eh_return_data_regs;
+
   /* Table of register numbers in the order in which to try to use them.  */
   int x_reg_alloc_order[FIRST_PSEUDO_REGISTER];
 
@@ -485,6 +488,8 @@ extern struct target_hard_regs *this_target_hard_regs;
 #define call_used_or_fixed_regs \
   (regs_invalidated_by_call | fixed_reg_set)
 #endif
+#define eh_return_data_regs \
+  (this_target_hard_regs->x_eh_return_data_regs)
 #define reg_alloc_order \
   (this_target_hard_regs->x_reg_alloc_order)
 #define inv_reg_alloc_order \
diff --git a/gcc/ira-lives.cc b/gcc/ira-lives.cc
index e07d3dc3e89..958eabb9708 100644
--- a/gcc/ira-lives.cc
+++ b/gcc/ira-lives.cc
@@ -1260,14 +1260,8 @@ process_out_of_region_eh_regs (basic_block bb)
   for (int n = ALLOCNO_NUM_OBJECTS (a) - 1; n >= 0; n--)
{
  ira_object_t obj = ALLOCNO_OBJECT (a, n);
- for (int k = 0; ; k++)
-   {
- unsigned int regno = EH_RETURN_DATA_REGNO (k);
- if (regno == INVALID_REGNUM)
-   break;
- SET_HARD_REG_BIT (OBJECT_CONFLICT_HARD_REGS (obj), regno);
- SET_HARD_REG_BIT (OBJECT_TOTAL_CONFLICT_HARD_REGS (obj), regno);
-   }
+ OBJECT_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
+ OBJECT_TOTAL_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
}
 }
 }
diff --git a/gcc/reginfo.cc b/gcc/reginfo.cc
index a0baeb90e12..73121365c47 100644
--- a/gcc/reginfo.cc
+++ b/gcc/reginfo.cc
@@ -420,6 +420,16 @@ init_reg_sets_1 (void)
}
 }
 
+  /* Recalculate eh_return_data_regs.  */
+  CLEAR_HARD_REG_SET (eh_return_data_regs);
+  for (i = 0; ; ++i)
+{
+  unsigned int regno = EH_RETURN_DATA_REGNO (i);
+  if (regno == INVALID_REGNUM)
+   break;
+  SET_HARD_REG_BIT (eh_return_data_regs, regno);
+}
+
   memset (have_regs_of_mode, 0, sizeof (have_regs_of_mode));
   memset (contains_reg_of_mode, 0, sizeof (contains_reg_of_mode));
   for (m = 0; m < (unsigned int) MAX_MACHINE_MODE; m++)


Re: [RFC] Merge strategy for all-SLP vectorizer

2024-05-17 Thread Richard Sandiford via Gcc
Richard Biener via Gcc  writes:
> Hi,
>
> I'd like to discuss how to go forward with getting the vectorizer to
> all-SLP for this stage1.  While there is a personal branch with my
> ongoing work (users/rguenth/vect-force-slp), branches haven't proved
> themselves to work well for collaboration.

Speaking for myself, the problem hasn't been so much the branch as
lack of time.  I've been pretty swamped the last eight months or so
(except for the time that I took off, which admittedly was quite a
bit!), and so I never even got around to properly reading and replying
to your message after the Cauldron.  It's been on the "this is important,
I should make time to read and understand it properly" list all this time.
Sorry about that. :(

I'm hoping to have time to work/help out on SLP stuff soon.

> The branch isn't ready to be merged in full but I have been picking
> improvements to trunk last stage1 and some remaining bits in the past
> weeks.  I have refrained from merging code paths that cannot be
> exercised on trunk.
>
> There are two important set of changes on the branch, both critical
> to get more testing on non-x86 targets.
>
>  1. enable single-lane SLP discovery
>  2. avoid splitting store groups (9315bfc661432c3 and 4336060fe2db8ec
> if you fetch the branch)
>
> The first point is also most annoying on the testsuite since doing
> SLP instead of interleaving changes what we dump and thus tests
> start to fail in random ways when you switch between both modes.
> On the branch single-lane SLP discovery is gated with
> --param vect-single-lane-slp.
>
> The branch has numerous changes to enable single-lane SLP for some
> code paths that have SLP not implemented and where I did not bother
> to try supporting multi-lane SLP at this point.  It also adds more
> SLP discovery entry points.
>
> I'm not sure how to try merging these pieces to allow others to
> more easily help out.  One possibility is to merge
> --param vect-single-lane-slp defaulted off and pick dependent
> changes even when they cause testsuite regressions with
> vect-single-lane-slp=1.  Alternatively adjust the testsuite by
> adding --param vect-single-lane-slp=0 and default to 1
> (or keep the default).

FWIW, this one sounds good to me (the default to 1 version).
I.e. mechanically add --param vect-single-lane-slp=0 to any tests
that fail with the new default.  That means that the test that need
fixing are easily greppable for anyone who wants to help.  Sometimes
it'll just be a test update.  Sometimes it will be new vectoriser code.
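
(For reference, the mechanical update would presumably be a one-line
directive per failing test, something like:

  /* { dg-additional-options "--param vect-single-lane-slp=0" } */

assuming the branch's --param name is kept as-is.)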

Thanks,
Richard

> Or require a clean testsuite with
> --param vect-single-lane-slp defaulted to 1 but keep the --param
> for debugging (and allow FAILs with 0).
>
> For fun I merged just single-lane discovery of non-grouped stores
> and have that enabled by default.  On x86_64 this results in the
> set of FAILs below.
>
> Any suggestions?
>
> Thanks,
> Richard.
>
> FAIL: gcc.dg/vect/O3-pr39675-2.c scan-tree-dump-times vect "vectorizing 
> stmts using SLP" 1
> XPASS: gcc.dg/vect/no-scevccp-outer-12.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 1
> FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-68.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-68.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/slp-12a.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-12a.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 1
> FAIL: gcc.dg/vect/slp-19a.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19a.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 1
> FAIL: gcc.dg/vect/slp-19b.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19b.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 1
> FAIL: gcc.dg/vect/slp-19c.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorized 1 loops" 1
> FAIL: gcc.dg/vect/slp-19c.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19c.c scan-tree-dump-times vect "vectorized 1 loops" 
> 1
> FAIL: 

[gcc r14-9925] aarch64: Fix _BitInt testcases

2024-04-11 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:b87ba79200f2a727aa5c523abcc5c03fa11fc007

commit r14-9925-gb87ba79200f2a727aa5c523abcc5c03fa11fc007
Author: Andre Vieira (lists) 
Date:   Thu Apr 11 17:54:37 2024 +0100

aarch64: Fix _BitInt testcases

This patch fixes some testisms introduced by:

commit 5aa3fec38cc6f52285168b161bab1a869d864b44
Author: Andre Vieira 
Date:   Wed Apr 10 16:29:46 2024 +0100

 aarch64: Add support for _BitInt

The testcases were relying on an unnecessary sign-extend that is no longer
generated.

The tested version was just slightly behind top of trunk when the patch
was committed, and the codegen had changed, for the better, by then.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/bitfield-bitint-abi-align16.c (g1, g8, g16, 
g1p, g8p,
g16p): Remove unnecessary sbfx.
* gcc.target/aarch64/bitfield-bitint-abi-align8.c (g1, g8, g16, 
g1p, g8p,
g16p): Likewise.

Diff:
---
 .../aarch64/bitfield-bitint-abi-align16.c  | 30 +-
 .../aarch64/bitfield-bitint-abi-align8.c   | 30 +-
 2 files changed, 24 insertions(+), 36 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c 
b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
index 3f292a45f95..4a228b0a1ce 100644
--- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
@@ -55,9 +55,8 @@
 ** g1:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f1
 */
@@ -66,9 +65,8 @@
 ** g8:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f8
 */
@@ -76,9 +74,8 @@
 ** g16:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16
 */
@@ -107,9 +104,8 @@
 /*
 ** g1p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1p
@@ -117,9 +113,8 @@
 /*
 ** g8p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f8p
@@ -128,9 +123,8 @@
 ** g16p:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16p
 */
diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c 
b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
index da3c23550ba..e7f773640f0 100644
--- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
@@ -54,9 +54,8 @@
 /*
 ** g1:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1
@@ -65,9 +64,8 @@
 /*
 ** g8:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f8
@@ -76,9 +74,8 @@
 ** g16:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16
 */
@@ -107,9 +104,8 @@
 /*
 ** g1p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1p
@@ -117,9 +113,8 @@
 /*
 ** g8p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and 

[gcc r14-9836] aarch64: Fix expansion of svsudot [PR114607]

2024-04-08 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:2c1c2485a4b1aca746ac693041e51ea6da5c64ca

commit r14-9836-g2c1c2485a4b1aca746ac693041e51ea6da5c64ca
Author: Richard Sandiford 
Date:   Mon Apr 8 16:53:32 2024 +0100

aarch64: Fix expansion of svsudot [PR114607]

Not sure how this happened, but: svsudot is supposed to be expanded
as USDOT with the operands swapped.  However, a thinko in the
expansion of svsudot meant that the arguments weren't in fact
swapped; the attempted swap was just a no-op.  And the testcases
blithely accepted that.
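
The thinko is easier to see given the (assumed) semantics of
rotate_inputs_left (start, end), which rotates the arguments in the
half-open range [start, end) left by one position:

    e.rotate_inputs_left (1, 2);  /* range holds one element: no-op */
    e.rotate_inputs_left (1, 3);  /* swaps args[1] and args[2] */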

gcc/
PR target/114607
* config/aarch64/aarch64-sve-builtins-base.cc
(svusdot_impl::expand): Fix botched attempt to swap the operands
for svsudot.

gcc/testsuite/
PR target/114607
* gcc.target/aarch64/sve/acle/asm/sudot_s32.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc   | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 5be2315a3c6..0d2edf3f19e 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -2809,7 +2809,7 @@ public:
version) is through the USDOT instruction but with the second and third
inputs swapped.  */
 if (m_su)
-  e.rotate_inputs_left (1, 2);
+  e.rotate_inputs_left (1, 3);
 /* The ACLE function has the same order requirements as for svdot.
While there's no requirement for the RTL pattern to have the same sort
of order as that for dot_prod, it's easier to read.
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
index 4b452619eee..e06b69affab 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
@@ -6,7 +6,7 @@
 
 /*
 ** sudot_s32_tied1:
-** usdot   z0\.s, z2\.b, z4\.b
+** usdot   z0\.s, z4\.b, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t,
@@ -17,7 +17,7 @@ TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, 
svuint8_t,
 ** sudot_s32_tied2:
 ** mov (z[0-9]+)\.d, z0\.d
 ** movprfx z0, z4
-** usdot   z0\.s, z2\.b, \1\.b
+** usdot   z0\.s, \1\.b, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, svuint8_t,
@@ -27,7 +27,7 @@ TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, 
svuint8_t,
 /*
 ** sudot_w0_s32_tied:
 ** mov (z[0-9]+\.b), w0
-** usdot   z0\.s, z2\.b, \1
+** usdot   z0\.s, \1, z2\.b
 ** ret
 */
 TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, uint8_t,
@@ -37,7 +37,7 @@ TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, 
uint8_t,
 /*
 ** sudot_9_s32_tied:
 ** mov (z[0-9]+\.b), #9
-** usdot   z0\.s, z2\.b, \1
+** usdot   z0\.s, \1, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z (sudot_9_s32_tied, svint32_t, svint8_t, uint8_t,


[gcc r14-9833] aarch64: Fix vld1/st1_x4 intrinsic test

2024-04-08 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:278cad85077509b73b1faf32d36f3889c2a5524b

commit r14-9833-g278cad85077509b73b1faf32d36f3889c2a5524b
Author: Swinney, Jonathan 
Date:   Mon Apr 8 14:02:33 2024 +0100

aarch64: Fix vld1/st1_x4 intrinsic test

The test for this intrinsic was failing silently, so it never
reported the bug described in PR 114521.  This patch modifies the
test to report the result.

Bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521

Signed-off-by: Jonathan Swinney 

gcc/testsuite/
* gcc.target/aarch64/advsimd-intrinsics/vld1x4.c: Exit with a 
nonzero
code if the test fails.

Diff:
---
 gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c 
b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
index 89b289bb21d..17db262a31a 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
@@ -3,6 +3,7 @@
 /* { dg-skip-if "unimplemented" { arm*-*-* } } */
 /* { dg-options "-O3" } */
 
+#include 
 #include 
 #include "arm-neon-ref.h"
 
@@ -71,13 +72,16 @@ VARIANT (float64, 2, q_f64)
 VARIANTS (TESTMETH)
 
 #define CHECKS(BASE, ELTS, SUFFIX) \
-  if (test_vld1##SUFFIX##_x4 () != 0)  \
-fprintf (stderr, "test_vld1##SUFFIX##_x4");
+  if (test_vld1##SUFFIX##_x4 () != 0) {\
+fprintf (stderr, "test_vld1" #SUFFIX "_x4 failed\n"); \
+failed = true; \
+  }
 
 int
 main (int argc, char **argv)
 {
+  bool failed = false;
   VARIANTS (CHECKS)
 
-  return 0;
+  return (failed) ? 1 : 0;
 }


[gcc r14-9811] aarch64: Fix bogus cnot optimisation [PR114603]

2024-04-05 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:67cbb1c638d6ab3a9cb77e674541e2b291fb67df

commit r14-9811-g67cbb1c638d6ab3a9cb77e674541e2b291fb67df
Author: Richard Sandiford 
Date:   Fri Apr 5 14:47:15 2024 +0100

aarch64: Fix bogus cnot optimisation [PR114603]

aarch64-sve.md had a pattern that combined:

cmpeq   pb.T, pa/z, zc.T, #0
mov zd.T, pb/z, #1

into:

cnotzd.T, pa/m, zc.T

But this is only valid if pa.T is a ptrue.  In other cases, the
original would set inactive elements of zd.T to 0, whereas the
combined form would copy elements from zc.T.

gcc/
PR target/114603
* config/aarch64/aarch64-sve.md (@aarch64_pred_cnot): Replace
with...
(@aarch64_ptrue_cnot): ...this, requiring operand 1 to be
a ptrue.
(*cnot): Require operand 1 to be a ptrue.
* config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand):
Use aarch64_ptrue_cnot for _x operations that are predicated
with a ptrue.  Represent other _x operations as fully-defined _m
operations.

gcc/testsuite/
PR target/114603
* gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc| 25 ++
 gcc/config/aarch64/aarch64-sve.md  | 22 +--
 .../gcc.target/aarch64/sve/acle/general/cnot_1.c   | 23 
 3 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 257ca5bf6ad..5be2315a3c6 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -517,15 +517,22 @@ public:
   expand (function_expander ) const override
   {
 machine_mode mode = e.vector_mode (0);
-if (e.pred == PRED_x)
-  {
-   /* The pattern for CNOT includes an UNSPEC_PRED_Z, so needs
-  a ptrue hint.  */
-   e.add_ptrue_hint (0, e.gp_mode (0));
-   return e.use_pred_x_insn (code_for_aarch64_pred_cnot (mode));
-  }
-
-return e.use_cond_insn (code_for_cond_cnot (mode), 0);
+machine_mode pred_mode = e.gp_mode (0);
+/* The underlying _x pattern is effectively:
+
+dst = src == 0 ? 1 : 0
+
+   rather than an UNSPEC_PRED_X.  Using this form allows autovec
+   constructs to be matched by combine, but it means that the
+   predicate on the src == 0 comparison must be all-true.
+
+   For simplicity, represent other _x operations as fully-defined _m
+   operations rather than using a separate bespoke pattern.  */
+if (e.pred == PRED_x
+   && gen_lowpart (pred_mode, e.args[0]) == CONSTM1_RTX (pred_mode))
+  return e.use_pred_x_insn (code_for_aarch64_ptrue_cnot (mode));
+return e.use_cond_insn (code_for_cond_cnot (mode),
+   e.pred == PRED_x ? 1 : 0);
   }
 };
 
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index eca8623e587..0434358122d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -3363,24 +3363,24 @@
 ;; - CNOT
 ;; -
 
-;; Predicated logical inverse.
-(define_expand "@aarch64_pred_cnot"
+;; Logical inverse, predicated with a ptrue.
+(define_expand "@aarch64_ptrue_cnot"
   [(set (match_operand:SVE_FULL_I 0 "register_operand")
(unspec:SVE_FULL_I
  [(unspec:
 [(match_operand: 1 "register_operand")
- (match_operand:SI 2 "aarch64_sve_ptrue_flag")
+ (const_int SVE_KNOWN_PTRUE)
  (eq:
-   (match_operand:SVE_FULL_I 3 "register_operand")
-   (match_dup 4))]
+   (match_operand:SVE_FULL_I 2 "register_operand")
+   (match_dup 3))]
 UNSPEC_PRED_Z)
-  (match_dup 5)
-  (match_dup 4)]
+  (match_dup 4)
+  (match_dup 3)]
  UNSPEC_SEL))]
   "TARGET_SVE"
   {
-operands[4] = CONST0_RTX (mode);
-operands[5] = CONST1_RTX (mode);
+operands[3] = CONST0_RTX (mode);
+operands[4] = CONST1_RTX (mode);
   }
 )
 
@@ -3389,7 +3389,7 @@
(unspec:SVE_I
  [(unspec:
 [(match_operand: 1 "register_operand")
- (match_operand:SI 5 "aarch64_sve_ptrue_flag")
+ (const_int SVE_KNOWN_PTRUE)
  (eq:
(match_operand:SVE_I 2 "register_operand")
(match_operand:SVE_I 3 "aarch64_simd_imm_zero"))]
@@ -11001,4 +11001,4 @@
   GET_MODE (operands[2]));
 return "sel\t%0., %3, %2., %1.";
   }
-)
\ No newline at end of file
+)
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cnot_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cnot_1.c
new file mode 

[gcc r14-9787] aarch64: Recognise svundef idiom [PR114577]

2024-04-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:86dce005a1d440154dbf585dde5a2dd4cfac7a05

commit r14-9787-g86dce005a1d440154dbf585dde5a2dd4cfac7a05
Author: Richard Sandiford 
Date:   Thu Apr 4 14:15:49 2024 +0100

aarch64: Recognise svundef idiom [PR114577]

GCC 14 adds the header file arm_neon_sve_bridge.h to help interface
SVE and Advanced SIMD code.  One of the defined idioms is:

  svset_neonq (svundef_TYPE (), advsimd_vector)

which simply reinterprets advsimd_vector as an SVE vector without
regard for what's in the upper bits.

GCC was failing to recognise this idiom, which was likely to
significantly hamper adoption.

There is (AFAIK) no good way of representing an extension with
undefined bits in gimple.  We could add an internal-only builtin
to represent it, but the current framework makes that somewhat
awkward.  It also doesn't seem very forward-looking.

This patch instead goes for the simpler approach of recognising
undefined arguments at expansion time.
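
    For reference, a minimal example of the idiom (the function name is
    illustrative, the intrinsics are the real bridge API):

      #include <arm_neon_sve_bridge.h>

      /* Reinterpret a 128-bit Advanced SIMD vector as an SVE vector,
         leaving the upper bits of wider SVE registers undefined.  */
      svint32_t
      as_sve (int32x4_t v)
      {
        return svset_neonq (svundef_s32 (), v);
      }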

gcc/
PR target/114577
* config/aarch64/aarch64-sve-builtins.h (aarch64_sve::lookup_fndecl):
Declare.
* config/aarch64/aarch64-sve-builtins.cc (aarch64_sve::lookup_fndecl):
New function.
* config/aarch64/aarch64-sve-builtins-base.cc (is_undef): Likewise.
(svset_neonq_impl::expand): Optimise expansions whose first argument
is undefined.

gcc/testsuite/
PR target/114577
* gcc.target/aarch64/sve/acle/general/pr114577_1.c: New test.
* gcc.target/aarch64/sve/acle/general/pr114577_2.c: Likewise.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc| 27 +++
 gcc/config/aarch64/aarch64-sve-builtins.cc | 16 
 gcc/config/aarch64/aarch64-sve-builtins.h  |  1 +
 .../aarch64/sve/acle/general/pr114577_1.c  | 94 ++
 .../aarch64/sve/acle/general/pr114577_2.c  | 46 +++
 5 files changed, 184 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index a8c3f84a70b..257ca5bf6ad 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -47,11 +47,31 @@
 #include "aarch64-builtins.h"
 #include "ssa.h"
 #include "gimple-fold.h"
+#include "tree-ssa.h"
 
 using namespace aarch64_sve;
 
 namespace {
 
+/* Return true if VAL is an undefined value.  */
+static bool
+is_undef (tree val)
+{
+  if (TREE_CODE (val) == SSA_NAME)
+{
+  if (ssa_undefined_value_p (val, false))
+   return true;
+
+  gimple *def = SSA_NAME_DEF_STMT (val);
+  if (gcall *call = dyn_cast <gcall *> (def))
+   if (tree fndecl = gimple_call_fndecl (call))
+ if (const function_instance *instance = lookup_fndecl (fndecl))
+   if (instance->base == functions::svundef)
+ return true;
+}
+  return false;
+}
+
 /* Return the UNSPEC_CMLA* unspec for rotation amount ROT.  */
 static int
 unspec_cmla (int rot)
@@ -1142,6 +1162,13 @@ public:
   expand (function_expander &e) const override
   {
 machine_mode mode = e.vector_mode (0);
+
+/* If the SVE argument is undefined, we just need to reinterpret the
+   Advanced SIMD argument as an SVE vector.  */
+if (!BYTES_BIG_ENDIAN
+   && is_undef (CALL_EXPR_ARG (e.call_expr, 0)))
+  return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0);
+
 rtx_vector_builder builder (VNx16BImode, 16, 2);
 for (unsigned int i = 0; i < 16; i++)
   builder.quick_push (CONST1_RTX (BImode));
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 11f5c5c500c..e124d1f90a5 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -1055,6 +1055,22 @@ get_vector_type (sve_type type)
   return acle_vector_types[type.num_vectors - 1][vector_type];
 }
 
+/* If FNDECL is an SVE builtin, return its function instance, otherwise
+   return null.  */
+const function_instance *
+lookup_fndecl (tree fndecl)
+{
+  if (!fndecl_built_in_p (fndecl, BUILT_IN_MD))
+return nullptr;
+
+  unsigned int code = DECL_MD_FUNCTION_CODE (fndecl);
+  if ((code & AARCH64_BUILTIN_CLASS) != AARCH64_BUILTIN_SVE)
+return nullptr;
+
+  unsigned int subcode = code >> AARCH64_BUILTIN_SHIFT;
+  return &(*registered_functions)[subcode]->instance;
+}
+
 /* Report an error against LOCATION that the user has tried to use
function FNDECL when extension EXTENSION is disabled.  */
 static void
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h b/gcc/config/aarch64/aarch64-sve-builtins.h
index e66729ed635..053006776a9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins.h
@@ -810,6 +810,7 @@ extern tree acle_svprfop;
 
 bool vector_cst_all_same (tree, unsigned int);
 bool 

[gcc r11-11296] asan: Handle poly-int sizes in ASAN_MARK [PR97696]

2024-03-27 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:d98467091bfc23522fefd32f1253e1c9e80331d3

commit r11-11296-gd98467091bfc23522fefd32f1253e1c9e80331d3
Author: Richard Sandiford 
Date:   Wed Mar 27 19:26:57 2024 +

asan: Handle poly-int sizes in ASAN_MARK [PR97696]

This patch makes the expansion of IFN_ASAN_MARK let through
poly-int-sized objects.  The expansion itself was already generic
enough, but the tests for the fast path were too strict.
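
    As a concrete illustration (names invented, modelled on the new
    test): a block-scoped SVE variable gives the ASAN_MARK call a
    scalable length such as POLY_INT_CST [16, 16], which the old
    tree_fits_shwi_p assert rejected:

      void
      f (char *p)
      {
        {
          __SVInt8_t v;                  /* size is 16 + 16 * n bytes */
          __builtin_memcpy (p, &v, 10);
        }   /* scope exit emits ASAN_MARK with a poly_int length */
      }

    Such lengths now simply skip the direct-emission fast path.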

gcc/
PR sanitizer/97696
* asan.c (asan_expand_mark_ifn): Allow the length to be a poly_int.

gcc/testsuite/
PR sanitizer/97696
* gcc.target/aarch64/sve/pr97696.c: New test.

(cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227)

Diff:
---
 gcc/asan.c |  9 
 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/gcc/asan.c b/gcc/asan.c
index ca3020f463c..2aa2be13bf6 100644
--- a/gcc/asan.c
+++ b/gcc/asan.c
@@ -3723,9 +3723,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
 }
   tree len = gimple_call_arg (g, 2);
 
-  gcc_assert (tree_fits_shwi_p (len));
-  unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len);
-  gcc_assert (size_in_bytes);
+  gcc_assert (poly_int_tree_p (len));
 
   g = gimple_build_assign (make_ssa_name (pointer_sized_int_node),
   NOP_EXPR, base);
@@ -3734,9 +3732,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
   tree base_addr = gimple_assign_lhs (g);
 
   /* Generate direct emission if size_in_bytes is small.  */
-  if (size_in_bytes
-  <= (unsigned)param_use_after_scope_direct_emission_threshold)
+  unsigned threshold = param_use_after_scope_direct_emission_threshold;
+  if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold)
 {
+  unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len);
   const unsigned HOST_WIDE_INT shadow_size
= shadow_mem_size (size_in_bytes);
   const unsigned int shadow_align
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
new file mode 100644
index 000..8b7de18a07d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
@@ -0,0 +1,29 @@
+/* { dg-skip-if "" { no_fsanitize_address } } */
+/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */
+
+#include 
+
+__attribute__((noinline, noclone)) int
+foo (char *a)
+{
+  int i, j = 0;
+  asm volatile ("" : "+r" (a) : : "memory");
+  for (i = 0; i < 12; i++)
+j += a[i];
+  return j;
+}
+
+int
+main ()
+{
+  int i, j = 0;
+  for (i = 0; i < 4; i++)
+{
+  char a[12];
+  __SVInt8_t freq;
+  __builtin_bcmp (&freq, a, 10);
+  __builtin_memset (a, 0, sizeof (a));
+  j += foo (a);
+}
+  return j;
+}


[gcc r11-11295] aarch64: Fix vld1/st1_x4 intrinsic definitions

2024-03-27 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:daee0409d195d346562e423da783d5d1cf8ea175

commit r11-11295-gdaee0409d195d346562e423da783d5d1cf8ea175
Author: Richard Sandiford 
Date:   Wed Mar 27 19:26:56 2024 +

aarch64: Fix vld1/st1_x4 intrinsic definitions

The vld1_x4 and vst1_x4 patterns use XI registers for both 64-bit and
128-bit vectors.  This has the nice property that each individual
vector is within a separate 16-byte subreg of the XI, which should
reduce the number of memory spills needed.  However, it means that the
64-bit vector forms must convert between the native 4x64-bit structure
layout and the padded 4x128-bit XI layout.

The vld4 and vst4 functions did this correctly.  But the vld1x4 and
vst1x4 functions used a union between the native and padded layouts,
even though the layouts are different sizes.

This patch makes vld1x4 and vst1x4 use the same approach as vld4
and vst4.  It also fixes some uses of variables in the user namespace.
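
    The layout mismatch, for reference (a sketch, not part of the patch):

      /* XI places each 64-bit vector in the low half of a 16-byte slot,
         whereas the int8x8x4_t layout is contiguous:

           XI (64 bytes):         [ v0 | pad | v1 | pad | v2 | pad | v3 | pad ]
           int8x8x4_t (32 bytes): [ v0 | v1  | v2  | v3  ]

         so reading the union's smaller member back after writing the XI
         member yields v0, <pad>, v1, <pad> rather than v0..v3.  */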

gcc/
* config/aarch64/arm_neon.h (vld1_s8_x4, vld1_s16_x4, vld1_s32_x4):
(vld1_u8_x4, vld1_u16_x4, vld1_u32_x4, vld1_f16_x4, vld1_f32_x4):
(vld1_p8_x4, vld1_p16_x4, vld1_s64_x4, vld1_u64_x4, vld1_p64_x4):
(vld1_f64_x4): Avoid using a union of a 256-bit structure and 512-bit
XImode integer.  Instead use the same approach as the vld4 intrinsics.
(vst1_s8_x4, vst1_s16_x4, vst1_s32_x4, vst1_u8_x4, vst1_u16_x4):
(vst1_u32_x4, vst1_f16_x4, vst1_f32_x4, vst1_p8_x4, vst1_p16_x4):
(vst1_s64_x4, vst1_u64_x4, vst1_p64_x4, vst1_f64_x4, vld1_bf16_x4):
(vst1_bf16_x4): Likewise for stores.
(vst1q_s8_x4, vst1q_s16_x4, vst1q_s32_x4, vst1q_u8_x4, vst1q_u16_x4):
(vst1q_u32_x4, vst1q_f16_x4, vst1q_f32_x4, vst1q_p8_x4, vst1q_p16_x4):
(vst1q_s64_x4, vst1q_u64_x4, vst1q_p64_x4, vst1q_f64_x4)
(vst1q_bf16_x4): Rename val parameter to __val.

Diff:
---
 gcc/config/aarch64/arm_neon.h | 469 ++
 1 file changed, 334 insertions(+), 135 deletions(-)

diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index baa30bd5a9d..8f53f4e1559 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -16498,10 +16498,14 @@ __extension__ extern __inline int8x8x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_s8_x4 (const int8_t *__a)
 {
-  union { int8x8x4_t __i; __builtin_aarch64_simd_xi __o; } __au;
-  __au.__o
-= __builtin_aarch64_ld1x4v8qi ((const __builtin_aarch64_simd_qi *) __a);
-  return __au.__i;
+  int8x8x4_t ret;
+  __builtin_aarch64_simd_xi __o;
+  __o = __builtin_aarch64_ld1x4v8qi ((const __builtin_aarch64_simd_qi *) __a);
+  ret.val[0] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 0);
+  ret.val[1] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 1);
+  ret.val[2] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 2);
+  ret.val[3] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 3);
+  return ret;
 }
 
 __extension__ extern __inline int8x16x4_t
@@ -16518,10 +16522,14 @@ __extension__ extern __inline int16x4x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_s16_x4 (const int16_t *__a)
 {
-  union { int16x4x4_t __i; __builtin_aarch64_simd_xi __o; } __au;
-  __au.__o
-= __builtin_aarch64_ld1x4v4hi ((const __builtin_aarch64_simd_hi *) __a);
-  return __au.__i;
+  int16x4x4_t ret;
+  __builtin_aarch64_simd_xi __o;
+  __o = __builtin_aarch64_ld1x4v4hi ((const __builtin_aarch64_simd_hi *) __a);
+  ret.val[0] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 0);
+  ret.val[1] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 1);
+  ret.val[2] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 2);
+  ret.val[3] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 3);
+  return ret;
 }
 
 __extension__ extern __inline int16x8x4_t
@@ -16538,10 +16546,14 @@ __extension__ extern __inline int32x2x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_s32_x4 (const int32_t *__a)
 {
-  union { int32x2x4_t __i; __builtin_aarch64_simd_xi __o; } __au;
-  __au.__o
-  = __builtin_aarch64_ld1x4v2si ((const __builtin_aarch64_simd_si *) __a);
-  return __au.__i;
+  int32x2x4_t ret;
+  __builtin_aarch64_simd_xi __o;
+  __o = __builtin_aarch64_ld1x4v2si ((const __builtin_aarch64_simd_si *) __a);
+  ret.val[0] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 0);
+  ret.val[1] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 1);
+  ret.val[2] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 2);
+  ret.val[3] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 3);
+  return ret;
 }
 
 __extension__ extern __inline int32x4x4_t
@@ -16558,10 +16570,14 @@ __extension__ extern __inline uint8x8x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_u8_x4 (const uint8_t *__a)
 {
-  

[gcc r12-10296] asan: Handle poly-int sizes in ASAN_MARK [PR97696]

2024-03-27 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:51e1629bc11f0ae4b8050712b26521036ed360aa

commit r12-10296-g51e1629bc11f0ae4b8050712b26521036ed360aa
Author: Richard Sandiford 
Date:   Wed Mar 27 17:38:09 2024 +

asan: Handle poly-int sizes in ASAN_MARK [PR97696]

This patch makes the expansion of IFN_ASAN_MARK let through
poly-int-sized objects.  The expansion itself was already generic
enough, but the tests for the fast path were too strict.

gcc/
PR sanitizer/97696
* asan.cc (asan_expand_mark_ifn): Allow the length to be a poly_int.

gcc/testsuite/
PR sanitizer/97696
* gcc.target/aarch64/sve/pr97696.c: New test.

(cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227)

Diff:
---
 gcc/asan.cc|  9 
 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/gcc/asan.cc b/gcc/asan.cc
index 20e5ef9d378..72d1ef28be8 100644
--- a/gcc/asan.cc
+++ b/gcc/asan.cc
@@ -3746,9 +3746,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
 }
   tree len = gimple_call_arg (g, 2);
 
-  gcc_assert (tree_fits_shwi_p (len));
-  unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len);
-  gcc_assert (size_in_bytes);
+  gcc_assert (poly_int_tree_p (len));
 
   g = gimple_build_assign (make_ssa_name (pointer_sized_int_node),
   NOP_EXPR, base);
@@ -3757,9 +3755,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
   tree base_addr = gimple_assign_lhs (g);
 
   /* Generate direct emission if size_in_bytes is small.  */
-  if (size_in_bytes
-  <= (unsigned)param_use_after_scope_direct_emission_threshold)
+  unsigned threshold = param_use_after_scope_direct_emission_threshold;
+  if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold)
 {
+  unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len);
   const unsigned HOST_WIDE_INT shadow_size
= shadow_mem_size (size_in_bytes);
   const unsigned int shadow_align
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
new file mode 100644
index 000..8b7de18a07d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
@@ -0,0 +1,29 @@
+/* { dg-skip-if "" { no_fsanitize_address } } */
+/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */
+
+#include 
+
+__attribute__((noinline, noclone)) int
+foo (char *a)
+{
+  int i, j = 0;
+  asm volatile ("" : "+r" (a) : : "memory");
+  for (i = 0; i < 12; i++)
+j += a[i];
+  return j;
+}
+
+int
+main ()
+{
+  int i, j = 0;
+  for (i = 0; i < 4; i++)
+{
+  char a[12];
+  __SVInt8_t freq;
+  __builtin_bcmp (&freq, a, 10);
+  __builtin_memset (a, 0, sizeof (a));
+  j += foo (a);
+}
+  return j;
+}


[gcc r13-8501] asan: Handle poly-int sizes in ASAN_MARK [PR97696]

2024-03-27 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:86b80b049167d28a9ef43aebdfbb80ae5deb0888

commit r13-8501-g86b80b049167d28a9ef43aebdfbb80ae5deb0888
Author: Richard Sandiford 
Date:   Wed Mar 27 15:30:19 2024 +

asan: Handle poly-int sizes in ASAN_MARK [PR97696]

This patch makes the expansion of IFN_ASAN_MARK let through
poly-int-sized objects.  The expansion itself was already generic
enough, but the tests for the fast path were too strict.

gcc/
PR sanitizer/97696
* asan.cc (asan_expand_mark_ifn): Allow the length to be a poly_int.

gcc/testsuite/
PR sanitizer/97696
* gcc.target/aarch64/sve/pr97696.c: New test.

(cherry picked from commit fca6f6fddb22b8665e840f455a7d0318d4575227)

Diff:
---
 gcc/asan.cc|  9 
 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c | 29 ++
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/gcc/asan.cc b/gcc/asan.cc
index df732c02150..1a443afedc0 100644
--- a/gcc/asan.cc
+++ b/gcc/asan.cc
@@ -3801,9 +3801,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
 }
   tree len = gimple_call_arg (g, 2);
 
-  gcc_assert (tree_fits_shwi_p (len));
-  unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len);
-  gcc_assert (size_in_bytes);
+  gcc_assert (poly_int_tree_p (len));
 
   g = gimple_build_assign (make_ssa_name (pointer_sized_int_node),
   NOP_EXPR, base);
@@ -3812,9 +3810,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
   tree base_addr = gimple_assign_lhs (g);
 
   /* Generate direct emission if size_in_bytes is small.  */
-  if (size_in_bytes
-  <= (unsigned)param_use_after_scope_direct_emission_threshold)
+  unsigned threshold = param_use_after_scope_direct_emission_threshold;
+  if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold)
 {
+  unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len);
   const unsigned HOST_WIDE_INT shadow_size
= shadow_mem_size (size_in_bytes);
   const unsigned int shadow_align
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
new file mode 100644
index 000..8b7de18a07d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
@@ -0,0 +1,29 @@
+/* { dg-skip-if "" { no_fsanitize_address } } */
+/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */
+
+#include 
+
+__attribute__((noinline, noclone)) int
+foo (char *a)
+{
+  int i, j = 0;
+  asm volatile ("" : "+r" (a) : : "memory");
+  for (i = 0; i < 12; i++)
+j += a[i];
+  return j;
+}
+
+int
+main ()
+{
+  int i, j = 0;
+  for (i = 0; i < 4; i++)
+{
+  char a[12];
+  __SVInt8_t freq;
+  __builtin_bcmp (&freq, a, 10);
+  __builtin_memset (a, 0, sizeof (a));
+  j += foo (a);
+}
+  return j;
+}


[gcc r14-9678] aarch64: Use constexpr for out-of-line statics

2024-03-26 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:5be2313bceea7b482c17ee730efe604b910800bd

commit r14-9678-g5be2313bceea7b482c17ee730efe604b910800bd
Author: Richard Sandiford 
Date:   Tue Mar 26 17:27:56 2024 +

aarch64: Use constexpr for out-of-line statics

GCC 4.8 complained about the use of const rather than constexpr
for out-of-line static constexprs.

gcc/
* config/aarch64/aarch64-feature-deps.h: Use constexpr for
out-of-line statics.

Diff:
---
 gcc/config/aarch64/aarch64-feature-deps.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-feature-deps.h b/gcc/config/aarch64/aarch64-feature-deps.h
index 3641badb82f..79126db8825 100644
--- a/gcc/config/aarch64/aarch64-feature-deps.h
+++ b/gcc/config/aarch64/aarch64-feature-deps.h
@@ -71,9 +71,9 @@ template struct info;
 static constexpr auto enable = flag | get_enable REQUIRES; \
 static constexpr auto explicit_on = enable | get_enable EXPLICIT_ON; \
   };   \
-  const aarch64_feature_flags info::flag;  \
-  const aarch64_feature_flags info::enable;\
-  const aarch64_feature_flags info::explicit_on; \
+  constexpr aarch64_feature_flags info::flag;  \
+  constexpr aarch64_feature_flags info::enable;\
+  constexpr aarch64_feature_flags info::explicit_on; \
   constexpr info IDENT ()  \
   {\
 return info ();\


[gcc r14-9333] aarch64: Define out-of-class static constants

2024-03-06 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:c7a9883663a888617b6e3584233aa756b30519f8

commit r14-9333-gc7a9883663a888617b6e3584233aa756b30519f8
Author: Richard Sandiford 
Date:   Wed Mar 6 10:04:56 2024 +

aarch64: Define out-of-class static constants

While reworking the aarch64 feature descriptions, I forgot
to add out-of-class definitions of some static constants.
This could lead to a build failure with some compilers.

This was seen with some WIP to increase the number of extensions
beyond 64.  It's latent on trunk though, and a regression from
before the rework.

gcc/
* config/aarch64/aarch64-feature-deps.h (feature_deps::info): Add
out-of-class definitions of static constants.

Diff:
---
 gcc/config/aarch64/aarch64-feature-deps.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-feature-deps.h b/gcc/config/aarch64/aarch64-feature-deps.h
index a1b81f9070b..3641badb82f 100644
--- a/gcc/config/aarch64/aarch64-feature-deps.h
+++ b/gcc/config/aarch64/aarch64-feature-deps.h
@@ -71,6 +71,9 @@ template struct info;
 static constexpr auto enable = flag | get_enable REQUIRES; \
 static constexpr auto explicit_on = enable | get_enable EXPLICIT_ON; \
   };   \
+  const aarch64_feature_flags info::flag;  \
+  const aarch64_feature_flags info::enable;\
+  const aarch64_feature_flags info::explicit_on; \
   constexpr info IDENT ()  \
   {\
 return info ();\


Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673

2023-11-28 Thread Richard Sandiford via Gcc
Richard Earnshaw  writes:
> On 28/11/2023 12:52, Surya Kumari Jangala wrote:
>> Hi Richard,
>> Thanks a lot for your response!
>> 
>> Another failure reported by the Linaro CI is as follows :
>> (Note: I am planning to send a separate mail for each failure, as this will 
>> make
>> the discussion easy to track)
>> 
>> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve 
>> -moverride=tune=none  check-function-bodies dup_x0_m
>> 
>> Expected code:
>> 
>>...
>>add (x[0-9]+), x0, #?1
>>mov (p[0-7])\.b, p15\.b
>>mov z0\.d, \2/m, \1
>>...
>>ret
>> 
>> 
>> Code obtained w/o patch:
>>  addvl   sp, sp, #-1
>>  str p15, [sp]
>>  add x0, x0, 1
>>  mov p3.b, p15.b
>>  mov z0.d, p3/m, x0
>>  ldr p15, [sp]
>>  addvl   sp, sp, #1
>>  ret
>> 
>> Code obtained w/ patch:
>>  addvl   sp, sp, #-1
>>  str p15, [sp]
>>  mov p3.b, p15.b
>>  add x0, x0, 1
>>  mov z0.d, p3/m, x0
>>  ldr p15, [sp]
>>  addvl   sp, sp, #1
>>  ret
>> 
>> As we can see, with the patch, the following two instructions are 
>> interchanged:
>>  add x0, x0, 1
>>  mov p3.b, p15.b
>
> Indeed, both look acceptable results to me, especially given that we 
> don't schedule results at -O1.
>
> There's two ways of fixing this:
> 1) Simply swap the order to what the compiler currently generates (which 
> is a little fragile, since it might flip back someday).
> 2) Write the test as
>
>
> ** (
> **   add (x[0-9]+), x0, #?1
> **   mov (p[0-7])\.b, p15\.b
> **   mov z0\.d, \2/m, \1
> ** |
> **   mov (p[0-7])\.b, p15\.b
> **   add (x[0-9]+), x0, #?1
> **   mov z0\.d, \1/m, \2
> ** )
>
> Note, we need to swap the match names in the third insn to account for 
> the different order of the earlier instructions.
>
> Neither is ideal, but the second is perhaps a little more bomb proof.
>
> I don't really have a strong feeling either way, but perhaps the second 
> is slightly preferable.
>
> Richard S: thoughts?

Yeah, I agree the second is probably better.  The | doesn't reset the
capture numbers, so I think the final instruction needs to be:

**   mov z0\.d, \3/m, \4

Thanks,
Richard
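
For reference, the combined pattern with the corrected captures
would be:

** (
**   add (x[0-9]+), x0, #?1
**   mov (p[0-7])\.b, p15\.b
**   mov z0\.d, \2/m, \1
** |
**   mov (p[0-7])\.b, p15\.b
**   add (x[0-9]+), x0, #?1
**   mov z0\.d, \3/m, \4
** )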

>
> R.
>
>> I believe that this is fine and the test can be modified to allow it to pass 
>> on
>> aarch64. Please let me know what you think.
>> 
>> Regards,
>> Surya
>> 
>> 
>> On 24/11/23 4:18 pm, Richard Earnshaw wrote:
>>>
>>>
>>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
 Hi Richard,
 Ping. Please let me know if the test failure that I mentioned in the mail 
 below can be handled by changing the expected generated code. I am not 
 conversant with arm, and hence would appreciate your help.

 Regards,
 Surya

 On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
> Hi Richard,
> I had submitted a patch for review 
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
> regarding scaling save/restore costs of callee save registers with block
> frequency in the IRA pass (PR111673).
>
> This patch has been approved by VMakarov
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>
> With this patch, we are seeing performance improvements with spec on x86
> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>
> I received a mail from Linaro about some failures seen in the CI pipeline 
> with
> this patch. I have analyzed the failures and I wish to discuss the 
> analysis with you.
>
> One failure reported by the Linaro CI is:
>
> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, 
> r[0-9]+, \\[r[0-9]+\\] 2
>
> The diff in the assembly between trunk and patch is:
>
> 93c93
> <   push    {r4, r5}
> ---
>>     push    {fp}
> 95c95
> <   ldrexd  r4, r5, [r0]
> ---
>>     ldrexd  fp, ip, [r0]
> 99c99
> <   pop {r4, r5}
> ---
>>     ldr fp, [sp], #4
>
>
> The test fails with patch because the ldrexd insn uses fp & ip registers 
> instead
> of r[0-9]+
>
> But the code produced by patch is better because it is pushing and 
> restoring only
> one register (fp) instead of two registers (r4, r5). Hence, this test can 
> be
> modified to allow it to pass on arm. Please let me know what you think.
>
> If you need more information, please let me know. I will be sending 
> separate mails
> for the other test failures.
>
>>>
>>> Thanks for looking at this.
>>>
>>>
>>> The key part of this test is that the compiler generates LDREXD.  The 
>>> registers used for that are pretty much irrelevant as we don't match them 
>>> to any other 

Re: Arm assembler crc issue

2023-10-19 Thread Richard Sandiford via Gcc
Iain Sandoe  writes:
> Hi Richard,
>
>
> I am being bitten by a problem that falls out from the code that emits
>
>   .arch Armv8.n-a+crc
>
> when the arch is less than Armv8-r.
> The code that does this,  in gcc/common/config/aarch64 is quite recent 
> (2022-09).

Heh.  A workaround for one assembler bug triggers another assembler bug.

The special treatment of CRC is much older than 2022-09 though.  I think
it dates back to 04a99ebecee885e42e56b6e0c832570e2a91c196 (2016-04),
with 4ca82fc9f86fc1187ee112e3a637cb3ca5d2ef2a providing the more
complete explanation.

>
> --
>
> (I admit the permutations are complex and I might have miss-analyzed) - but 
> it appears that llvm assembler (for mach-o, at least) sees an explict mention 
> of an attribute for a feature which is mandatory at a specified arch level as 
> demoting that arch to the minimum that made the explicit feature mandatory.  
> Of course, it could just be a bug in the handling of transitive feature 
> enables...
>
> the problem is that, for example:
>
>   .arch Armv8.4-a+crc
>
> no longer recognises fp16 insns. (and appending +fp16 does not fix this).
>
> 
>
> Even if upstream LLVM is deemed to be buggy (it does not do what I would 
> expect, at least), and fixed - I will still have a bunch of assembler 
> versions that are broken (before the fix percolates through to downstream 
> xcode) - and the LLVM assembler is the only current option for Darwin.
>
> So, it seems that this ought to be a reasonable configure test:
>
>   .arch armv8.2-a
>   .text
> m:
>   crc32b w0, w1, w2 
>
> and then emit HAS_GAS_AARCH64_CRC_BUG (for example) if that fails to assemble 
> which can be used to make the +crc emit conditional on a broken assembler.

AIUI the problem was in the CPU descriptions, so I don't think this
would test for the old gas bug that is being worked around.

Perhaps instead we could have a configure test for the bug that you've
found, and disable the crc workaround if so?

Thanks,
Richard
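
(For concreteness, such a probe might look like the following, on the
assumption that the affected assemblers accept fp16 instructions at
plain .arch armv8.4-a but reject them once +crc is appended:

  .arch armv8.4-a+crc
  .text
m:
  fadd h0, h1, h2

with the workaround disabled if that fails to assemble.)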

>
> - I am asking here before constructing the patch, in case there’s some reason 
> that doing this at configure time is not acceptable.
>
> thanks
> Iain


Re: ipa-inline & what TARGET_CAN_INLINE_P can assume

2023-09-25 Thread Richard Sandiford via Gcc
Andrew Pinski  writes:
> On Mon, Sep 25, 2023 at 10:16 AM Richard Sandiford via Gcc
>  wrote:
>>
>> Hi,
>>
>> I have a couple of questions about what TARGET_CAN_INLINE_P is
>> alllowed to assume when called from ipa-inline.  (Callers from the
>> front-end don't matter for the moment.)
>>
>> I'm working on an extension where a function F1 without attribute A
>> can't be inlined into a function F2 with attribute A.  That part is
>> easy and standard.
>>
>> But it's expected that many functions won't have attribute A,
>> even if they could.  So we'd like to detect automatically whether
>> F1's implementation is compatible with attribute A.  This is something
>> we can do by scanning the gimple code.
>>
>> However, even if we detect that F1's code is compatible with attribute A,
>> we don't want to add attribute A to F1 itself because (a) it would change
>> F1's ABI and (b) it would restrict the optimisation of any non-inlined
>> copy of F1.  So this is a test for inlining only.
>>
>> TARGET_CAN_INLINE_P (F2, F1) can check whether F1's current code
>> is compatible with attribute A.  But:
>>
>> (a) Is it safe to assume (going forward) that F1 won't change before
>> it is inlined into F2?  Specifically, is it safe to assume that
>> nothing will be inlined into F1 between the call to TARGET_CAN_INLINE_P
>> and the inlining of F1 into F2?
>>
>> (b) For compile-time reasons, I'd like to cache the result in
>> machine_function.  The cache would be a three-state:
>>
>> - not tested
>> - compatible with A
>> - incompatible with A
>>
>> The cache would be reset to "not tested" whenever TARGET_CAN_INLINE_P
>> is called with F1 as the *caller* rather than the callee.  The idea
>> is to handle cases where something is inlined into F1 after F1 has
>> been inlined into F2.  (This would include calls from the main
>> inlining pass, after the early pass has finished.)
>>
>> Is resetting the cache in this way sufficient?  Or should we have a
>> new interface for this?
>>
>> Sorry for the long question :)  I have something that seems to work,
>> but I'm not sure whether it's misusing the interface.
>
>
> The rs6000 backend has a similar issue and defined the following
> target hooks which seems exactly what you need in this case
> TARGET_NEED_IPA_FN_TARGET_INFO
> TARGET_UPDATE_IPA_FN_TARGET_INFO
>
> And then use that information in can_inline_p target hook to mask off
> the ISA bits:
>   unsigned int info = ipa_fn_summaries->get (callee_node)->target_info;
>   if ((info & RS6000_FN_TARGET_INFO_HTM) == 0)
> {
>   callee_isa &= ~OPTION_MASK_HTM;
>   explicit_isa &= ~OPTION_MASK_HTM;
> }

Thanks!  Like you say, it looks like a perfect fit.

The optimisation of having TARGET_UPDATE_IPA_FN_TARGET_INFO return false
to stop further analysis probably won't trigger for this use case.
I need to track two conditions and the second one is very rare.
But that's still going to be much better than potentially scanning
the same (inlined) stmts multiple times.

Richard


ipa-inline & what TARGET_CAN_INLINE_P can assume

2023-09-25 Thread Richard Sandiford via Gcc
Hi,

I have a couple of questions about what TARGET_CAN_INLINE_P is
alllowed to assume when called from ipa-inline.  (Callers from the
front-end don't matter for the moment.)

I'm working on an extension where a function F1 without attribute A
can't be inlined into a function F2 with attribute A.  That part is
easy and standard.

But it's expected that many functions won't have attribute A,
even if they could.  So we'd like to detect automatically whether
F1's implementation is compatible with attribute A.  This is something
we can do by scanning the gimple code.

However, even if we detect that F1's code is compatible with attribute A,
we don't want to add attribute A to F1 itself because (a) it would change
F1's ABI and (b) it would restrict the optimisation of any non-inlined
copy of F1.  So this is a test for inlining only.

TARGET_CAN_INLINE_P (F2, F1) can check whether F1's current code
is compatible with attribute A.  But:

(a) Is it safe to assume (going forward) that F1 won't change before
it is inlined into F2?  Specifically, is it safe to assume that
nothing will be inlined into F1 between the call to TARGET_CAN_INLINE_P
and the inlining of F1 into F2?

(b) For compile-time reasons, I'd like to cache the result in
machine_function.  The cache would be a three-state:

- not tested
- compatible with A
- incompatible with A

The cache would be reset to "not tested" whenever TARGET_CAN_INLINE_P
is called with F1 as the *caller* rather than the callee.  The idea
is to handle cases where something is inlined into F1 after F1 has
been inlined into F2.  (This would include calls from the main
inlining pass, after the early pass has finished.)

Is resetting the cache in this way sufficient?  Or should we have a
new interface for this?

Sorry for the long question :)  I have something that seems to work,
but I'm not sure whether it's misusing the interface.

Thanks,
Richard
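
For what it's worth, a sketch of the cache described in (b), with
invented names (scan_body_for_a stands in for the gimple scan):

  enum a_compat { A_NOT_TESTED, A_COMPATIBLE, A_INCOMPATIBLE };

  /* Assumed to be a field in the target's machine_function.  */
  struct machine_function { enum a_compat a_cache; };

  /* Placeholder for the real gimple scan.  */
  static bool
  scan_body_for_a (struct machine_function *mf)
  {
    (void) mf;
    return true;
  }

  static bool
  can_inline_p (struct machine_function *caller_mf,
                struct machine_function *callee_mf,
                bool caller_has_a, bool callee_has_a)
  {
    /* Anything later inlined into the caller invalidates any verdict
       cached for the caller in its role as a callee.  */
    caller_mf->a_cache = A_NOT_TESTED;

    if (!caller_has_a || callee_has_a)
      return true;

    if (callee_mf->a_cache == A_NOT_TESTED)
      callee_mf->a_cache = scan_body_for_a (callee_mf)
                           ? A_COMPATIBLE : A_INCOMPATIBLE;
    return callee_mf->a_cache == A_COMPATIBLE;
  }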


Re: [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost

2023-09-18 Thread Richard Sandiford via Gcc-patches
Kewen Lin  writes:
> This costing adjustment patch series exposes one issue in
> aarch64 specific costing adjustment for STP sequence.  It
> causes the below test cases to fail:
>
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
>
> Take the below function extracted from ldp_stp_15.c as
> example:
>
> void
> dup_8_int32_t (int32_t *x, int32_t val)
> {
> for (int i = 0; i < 8; ++i)
>   x[i] = val;
> }
>
> Without my patch series, during slp1 it gets:
>
>   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
>   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
>
> then the final vector cost is 3.
>
> With my patch series, during slp1 it gets:
>
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
>
> but the final vector cost is 17.  The unaligned_store count is
> actually unchanged, but the final vector costs become different,
> it's because the below aarch64 special handling makes the
> different costs:
>
>   /* Apply the heuristic described above m_stp_sequence_cost.  */
>   if (m_stp_sequence_cost != ~0U)
> {
>   uint64_t cost = aarch64_stp_sequence_cost (count, kind,
>stmt_info, vectype);
>   m_stp_sequence_cost = MIN (m._stp_sequence_cost + cost, ~0U);
> }
>
> For the former, since the count is 2, function
> aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
> While for the latter, it's separated into twice calls with
> count 1, aarch64_stp_sequence_cost returns 2 for each time,
> so it returns 4 in total.
>
> For this case, the stmt with scalar_to_vec also contributes
> 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
> are 6 (2+4) vs. 8 (4+4).
>
> Considering scalar_costs->m_stp_sequence_cost is 8 and below
> checking and re-assigning:
>
>   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
> m_costs[vect_body] = 2 * scalar_costs->total_cost ();
>
> For the former, the body cost of vector isn't changed; but
> for the latter, the body cost of vector is double of scalar
> cost which is 8 for this case, then it becomes 16 which is
> bigger than what we expect.
>
> I'm not sure why it adopts CEIL for the return value for
> case unaligned_store in function aarch64_stp_sequence_cost,
> but I tried to modify it with "return count;" (as it can
> get back to previous cost), there is no failures exposed
> in regression testing.  I expected that if the previous
> unaligned_store count is even, this adjustment doesn't
> change anything, if it's odd, the adjustment may reduce
> it by one, but I'd guess it would be few.  Besides, as
> the comments for m_stp_sequence_cost, the current
> handlings seems temporary, maybe a tweak like this can be
> accepted, so I posted this RFC/PATCH to request comments.
> this one line change is considered.

It's unfortunate that doing this didn't show up a regression.
I guess it's not a change we explicitly added tests to guard against.

But the point of the condition is to estimate how many single stores
(STRs) and how many paired stores (STPs) would be generated.  As far
as this heuristic goes, STP (storing two values) is as cheap as STR
(storing only one value).  So the point of the CEIL is to count 1 store
as having equal cost to 2, 3 as having equal cost to 4, etc.
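
That is:

  count                1  2  3  4  5  6
  CEIL (count, 2) * 2  2  2  4  4  6  6

so an odd count is rounded up, on the basis that the leftover STR
is no cheaper than an STP.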

For a heuristic like that, costing a vector stmt once with count 2
is different from costing 2 vector stmts with count 1.  The former
makes it obvious that the 2 vector stmts are associated with the
same scalar stmt, and are highly likely to be consecutive.  The latter
(costing 2 stmts with count 1) could also happen for unrelated stmts.

ISTM that costing once with count N provides strictly more information
to targets than costing N time with count 1.  Is there no way we can
keep the current behaviour?  E.g. rather than costing a stmt immediately
within a loop, could we just increment a counter and cost once at the end?

Thanks,
Richard

> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
>   count directly instead of the adjusted value computed with CEIL.
> ---
>  gcc/config/aarch64/aarch64.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 37d414021ca..9fb4fbd883d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, 
> vect_cost_for_stmt kind,
> if (!aarch64_aligned_constant_offset_p (stmt_info, size))
>   return count * 2;
>   }
> -  return CEIL (count, 2) * 2;
> +  return count;
>  
>  case 

Re: [PATCH V2] internal-fn: Support undefined rtx for uninitialized SSA_NAME

2023-09-17 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> According to PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110751
>
> As Richard and Richi suggested, we recognize uninitialized SSA_NAME and 
> convert it
> into SCRATCH rtx if the target predicate allows SCRATCH.
>
> It can help to reduce redundant data move instructions of targets like RISC-V.
>
> gcc/ChangeLog:
>
>   * internal-fn.cc (expand_fn_using_insn): Support undefined rtx.
>   * optabs.cc (maybe_legitimize_operand): Ditto.
>   (can_reuse_operands_p): Ditto.
>   * optabs.h (enum expand_operand_type): Ditto.
>   (create_undefined_input_operand): Ditto.
>
> ---
>  gcc/internal-fn.cc |  4 
>  gcc/optabs.cc  | 16 
>  gcc/optabs.h   | 14 +-
>  3 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 0fd34359247..61d5a9e4772 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -247,6 +247,10 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, 
> unsigned int noutputs,
>   create_convert_operand_from ([opno], rhs_rtx,
>TYPE_MODE (rhs_type),
>TYPE_UNSIGNED (rhs_type));
> +  else if (TREE_CODE (rhs) == SSA_NAME
> +&& SSA_NAME_IS_DEFAULT_DEF (rhs)
> +&& VAR_P (SSA_NAME_VAR (rhs)))
> + create_undefined_input_operand ([opno], TYPE_MODE (rhs_type));
>else
>   create_input_operand ([opno], rhs_rtx, TYPE_MODE (rhs_type));
>opno += 1;
> diff --git a/gcc/optabs.cc b/gcc/optabs.cc
> index 32ff379ffc3..d8c771547a3 100644
> --- a/gcc/optabs.cc
> +++ b/gcc/optabs.cc
> @@ -8102,6 +8102,21 @@ maybe_legitimize_operand (enum insn_code icode, 
> unsigned int opno,
> goto input;
>   }
>break;
> +
> +case EXPAND_UNDEFINED:
> +  {
> + mode = insn_data[(int) icode].operand[opno].mode;
> + rtx scratch = gen_rtx_SCRATCH (mode);

A scratch of the right mode should already be available in op->value,
since it was created by create_undefined_input_operand.

If that doesn't work for some reason, then it would be better for
create_undefined_input_operand to pass NULL_RTX as the "value"
argument to create_expand_operand.

> + /* For SCRATCH rtx which is converted from uninitialized
> +SSA, we convert it as fresh pseudo when target doesn't
> +allow scratch rtx in predicate. Otherwise, return true.  */
> + if (!insn_operand_matches (icode, opno, scratch))
> +   {
> + op->value = gen_reg_rtx (mode);

The mode should come from op->mode.

> + goto input;
> +   }
> + return true;
> +  }
>  }
>return insn_operand_matches (icode, opno, op->value);
>  }
> @@ -8147,6 +8162,7 @@ can_reuse_operands_p (enum insn_code icode,
>  case EXPAND_INPUT:
>  case EXPAND_ADDRESS:
>  case EXPAND_INTEGER:
> +case EXPAND_UNDEFINED:
>return true;

I think this should be in the "return false" block instead.

>  
>  case EXPAND_CONVERT_TO:
> diff --git a/gcc/optabs.h b/gcc/optabs.h
> index c80b7f4dc1b..4eb1f9ee09a 100644
> --- a/gcc/optabs.h
> +++ b/gcc/optabs.h
> @@ -37,7 +37,8 @@ enum expand_operand_type {
>EXPAND_CONVERT_TO,
>EXPAND_CONVERT_FROM,
>EXPAND_ADDRESS,
> -  EXPAND_INTEGER
> +  EXPAND_INTEGER,
> +  EXPAND_UNDEFINED

Sorry, this was my bad suggestion.  I should have suggested
EXPAND_UNDEFINED_INPUT, to match the name of the function.

Thanks,
Richard

>  };
>  
>  /* Information about an operand for instruction expansion.  */
> @@ -117,6 +118,17 @@ create_input_operand (class expand_operand *op, rtx 
> value,
>create_expand_operand (op, EXPAND_INPUT, value, mode, false);
>  }
>  
> +/* Make OP describe an undefined input operand for uninitialized
> +   SSA.  It's the scratch operand with mode MODE; MODE cannot be
> +   VOIDmode.  */
> +
> +inline void
> +create_undefined_input_operand (class expand_operand *op, machine_mode mode)
> +{
> +  create_expand_operand (op, EXPAND_UNDEFINED, gen_rtx_SCRATCH (mode), mode,
> +  false);
> +}
> +
>  /* Like create_input_operand, except that VALUE must first be converted
> to mode MODE.  UNSIGNED_P says whether VALUE is unsigned.  */


Re: [AArch64][testsuite] Adjust vect_copy_lane_1.c for new code-gen

2023-09-17 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> Hi,
> After 27de9aa152141e7f3ee66372647d0f2cd94c4b90, there's a following 
> regression:
> FAIL: gcc.target/aarch64/vect_copy_lane_1.c scan-assembler-times
> ins\\tv0.s\\[1\\], v1.s\\[0\\] 3
>
> This happens because for the following function from vect_copy_lane_1.c:
> float32x2_t
> __attribute__((noinline, noclone)) test_copy_lane_f32 (float32x2_t a,
> float32x2_t b)
> {
>   return vcopy_lane_f32 (a, 1, b, 0);
> }
>
> Before 27de9aa152141e7f3ee66372647d0f2cd94c4b90,
> it got lowered to following sequence in .optimized dump:
>[local count: 1073741824]:
>   _4 = BIT_FIELD_REF ;
>   __a_5 = BIT_INSERT_EXPR ;
>   return __a_5;
>
> The above commit simplifies BIT_FIELD_REF + BIT_INSERT_EXPR
> to vector permutation and now thus gets lowered to:
>
>[local count: 1073741824]:
>   __a_4 = VEC_PERM_EXPR ;
>   return __a_4;
>
> Since we give higher priority to aarch64_evpc_zip over aarch64_evpc_ins
> in aarch64_expand_vec_perm_const_1, it now generates:
>
> test_copy_lane_f32:
> zip1v0.2s, v0.2s, v1.2s
> ret
>
> Similarly for test_copy_lane_[us]32.

Yeah, I suppose this choice is at least as good as INS.  It has the advantage
that the source and destination don't need to be tied.  For example:

int32x2_t f(int32x2_t a, int32x2_t b, int32x2_t c) {
return vcopy_lane_s32 (b, 1, c, 0);
}

used to be:

f:
mov v0.8b, v1.8b
ins v0.s[1], v2.s[0]
ret

but is now:

f:
zip1v0.2s, v1.2s, v2.2s
ret

> The attached patch adjusts the tests to reflect the change in code-gen
> and the tests pass.
> OK to commit ?
>
> Thanks,
> Prathamesh
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c 
> b/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> index 2848be564d5..811dc678b92 100644
> --- a/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> @@ -22,7 +22,7 @@ BUILD_TEST (uint16x4_t, uint16x4_t, , , u16, 3, 2)
>  BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>  BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
>  BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
> -/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[1\\\], v1.s\\\[0\\\]" 3 
> } } */
> +/* { dg-final { scan-assembler-times "zip1\\tv0.2s, v0.2s, v1.2s" 3 } } */
>  BUILD_TEST (int64x1_t,   int64x1_t,   , , s64, 0, 0)
>  BUILD_TEST (uint64x1_t,  uint64x1_t,  , , u64, 0, 0)
>  BUILD_TEST (float64x1_t, float64x1_t, , , f64, 0, 0)

OK, thanks.

Richard


Re: [PATCH] AArch64: Improve immediate expansion [PR105928]

2023-09-17 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> Support immediate expansion of immediates which can be created from 2 MOVKs
> and a shifted ORR or BIC instruction.  Change aarch64_split_dimode_const_store
> to apply if we save one instruction.
>
> This reduces the number of 4-instruction immediates in SPECINT/FP by 5%.
>
> Passes regress, OK for commit?
>
> gcc/ChangeLog:
> PR target/105928
> * config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
> Add support for immediates using shifted ORR/BIC.
> (aarch64_split_dimode_const_store): Apply if we save one instruction.
> * config/aarch64/aarch64.md (_3):
> Make pattern global.
>
> gcc/testsuite:
> PR target/105928
> * gcc.target/aarch64/pr105928.c: Add new test.
> * gcc.target/aarch64/vect-cse-codegen.c: Fix test.

Looks good apart from a comment below about the test.

I was worried that reusing "dest" for intermediate results would
prevent CSE for cases like:

void g (long long, long long);
void
f (long long *ptr)
{
  g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL);
}

where the same 32-bit lowpart pattern is used for two immediates.
In principle, that could be avoided using:

if (generate)
  {
rtx tmp = aarch64_target_reg (dest, DImode);
emit_insn (gen_rtx_SET (tmp, GEN_INT (val2 & 0x)));
emit_insn (gen_insv_immdi (tmp, GEN_INT (16),
   GEN_INT (val2 >> 16)));
set_unique_reg_note (get_last_insn (), REG_EQUAL,
 GEN_INT (val2));
emit_insn (gen_ior_ashldi3 (dest, tmp, GEN_INT (i), tmp));
  }
return 3;

But it doesn't work, since we only expose the individual immediates
during split1, and nothing between split1 and ira is able to remove
redundancies.  There's no point complicating the code for a theoretical
future optimisation.
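
For the record, with the patch the first constant above would be
synthesised along these lines (register choice illustrative; this is
the i == 32 case of the new loop):

	mov	w1, #0xee22
	movk	w1, #0xee11, lsl 16	// w1 = 0xee11ee22
	orr	x1, x1, x1, lsl 32	// x1 = 0xee11ee22ee11ee22

i.e. three instructions rather than a four-instruction MOV/MOVK chain.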

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> c44c0b979d0cc3755c61dcf566cfddedccebf1ea..832f8197ac8d1a04986791e6f3e51861e41944b2
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -5639,7 +5639,7 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
> generate,
> machine_mode mode)
>  {
>int i;
> -  unsigned HOST_WIDE_INT val, val2, mask;
> +  unsigned HOST_WIDE_INT val, val2, val3, mask;
>int one_match, zero_match;
>int num_insns;
>
> @@ -5721,6 +5721,35 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, 
> bool generate,
> }
>   return 3;
> }
> +
> +  /* Try shifting and inserting the bottom 32-bits into the top bits.  */
> +  val2 = val & 0x;
> +  val3 = 0x;
> +  val3 = val2 | (val3 << 32);
> +  for (i = 17; i < 48; i++)
> +   if ((val2 | (val2 << i)) == val)
> + {
> +   if (generate)
> + {
> +   emit_insn (gen_rtx_SET (dest, GEN_INT (val2 & 0x)));
> +   emit_insn (gen_insv_immdi (dest, GEN_INT (16),
> +  GEN_INT (val2 >> 16)));
> +   emit_insn (gen_ior_ashldi3 (dest, dest, GEN_INT (i), dest));
> + }
> +   return 3;
> + }
> +   else if ((val3 & ~(val3 << i)) == val)
> + {
> +   if (generate)
> + {
> +   emit_insn (gen_rtx_SET (dest, GEN_INT (val3 | 0x)));
> +   emit_insn (gen_insv_immdi (dest, GEN_INT (16),
> +  GEN_INT (val2 >> 16)));
> +   emit_insn (gen_and_one_cmpl_ashldi3 (dest, dest, GEN_INT (i),
> + dest));
> + }
> +   return 3;
> + }
>  }
>
>/* Generate 2-4 instructions, skipping 16 bits of all zeroes or ones which
> @@ -25506,8 +25535,6 @@ aarch64_split_dimode_const_store (rtx dst, rtx src)
>rtx lo = gen_lowpart (SImode, src);
>rtx hi = gen_highpart_mode (SImode, DImode, src);
>
> -  bool size_p = optimize_function_for_size_p (cfun);
> -
>if (!rtx_equal_p (lo, hi))
>  return false;
>
> @@ -25526,14 +25553,8 @@ aarch64_split_dimode_const_store (rtx dst, rtx src)
>   MOV   w1, 49370
>   MOVK  w1, 0x140, lsl 16
>   STP   w1, w1, [x0]
> -   So we want to perform this only when we save two instructions
> -   or more.  When optimizing for size, however, accept any code size
> -   savings we can.  */
> -  if (size_p && orig_cost <= lo_cost)
> -return false;
> -
> -  if (!size_p
> -  && (orig_cost <= lo_cost + 1))
> +   So we want to perform this when we save at least one instruction.  */
> +  if (orig_cost <= lo_cost)
>  return false;
>
>rtx mem_lo = adjust_address (dst, SImode, 0);
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> 

Re: [PATCH] internal-fn: Convert uninitialized SSA_NAME into SCRATCH rtx[PR110751]

2023-09-17 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> According to PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110751
>
> As Richard and Richi suggested, we recognize uninitialized SSA_NAME and 
> convert it
> into SCRATCH rtx if the target predicate allows SCRATCH.
>
> It can help to reduce redundant data move instructions of targets like RISC-V.
>
> Here we add the condition "insn_operand_matches (icode, opno, scratch)"
> Then, we will only create scratch rtx that target allow scratch rtx in 
> predicate.
> When the target doesn't allow scratch rtx in predicate, the later "else" 
> condtion
> will create fresh pseudo for uninitialized SSA.
>
> I have verify it in RISC-V port and it works well.
>
> Bootstrap and Regression on X86 passed.
>
> Ok for trunk ?
>  
> gcc/ChangeLog:
>
>   * internal-fn.cc (expand_fn_using_insn): Convert uninitialized SSA into 
> scratch.
>
> ---
>  gcc/internal-fn.cc | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 0fd34359247..fe4d86b3dbd 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -243,10 +243,16 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, 
> unsigned int noutputs,
>tree rhs = gimple_call_arg (stmt, i);
>tree rhs_type = TREE_TYPE (rhs);
>rtx rhs_rtx = expand_normal (rhs);
> +  rtx scratch = gen_rtx_SCRATCH (TYPE_MODE (rhs_type));
>if (INTEGRAL_TYPE_P (rhs_type))
>   create_convert_operand_from ([opno], rhs_rtx,
>TYPE_MODE (rhs_type),
>TYPE_UNSIGNED (rhs_type));
> +  else if (TREE_CODE (rhs) == SSA_NAME
> +&& SSA_NAME_IS_DEFAULT_DEF (rhs)
> +&& VAR_P (SSA_NAME_VAR (rhs))
> +&& insn_operand_matches (icode, opno, scratch))

Rather than check insn_operand_matches here, I think we should create
the scratch operand regardless and leave optabs.cc to deal with it.
(This will need changes to optabs.cc.)

How about adding:

  create_undefined_input_operand (expand_operand *op, machine_mode mode)

that maps to a new EXPAND_UNDEFINED, then handle EXPAND_UNDEFINED in the
two case statements in optabs.cc.

Thanks,
Richard

> + create_input_operand ([opno], scratch, TYPE_MODE (rhs_type));
>else
>   create_input_operand ([opno], rhs_rtx, TYPE_MODE (rhs_type));
>opno += 1;


[PATCH] aarch64: Fix loose ldpstp check [PR111411]

2023-09-15 Thread Richard Sandiford via Gcc-patches
aarch64_operands_ok_for_ldpstp contained the code:

  /* One of the memory accesses must be a mempair operand.
 If it is not the first one, they need to be swapped by the
 peephole.  */
  if (!aarch64_mem_pair_operand (mem_1, GET_MODE (mem_1))
   && !aarch64_mem_pair_operand (mem_2, GET_MODE (mem_2)))
return false;

But the requirement isn't just that one of the accesses must be a
valid mempair operand.  It's that the lower access must be, since
that's the access that will be used for the instruction operand.

Tested on aarch64-linux-gnu & pushed.  The patch applies cleanly
to GCC 12 and 13, so I'll backport there next week.  GCC 11 will
need a bespoke fix if the problem shows up there, but I doubt it will.

Richard
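
For concreteness, in terms of the new test below: the two stores are
to x0 + 508 and x0 + 512, and the fused STP must be addressed from the
lower access:

	str	wzr, [x0, 508]		// not a valid mem-pair address
	str	wzr, [x1, -256]		// x1 = x0 + 768, i.e. x0 + 512

508 is outside the SImode STP offset range of [-256, 252], so only the
higher access was a valid mem-pair operand, and the old check wrongly
let the pair through.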


gcc/
PR target/111411
* config/aarch64/aarch64.cc (aarch64_operands_ok_for_ldpstp): Require
the lower memory access to a mem-pair operand.

gcc/testsuite/
PR target/111411
* gcc.dg/rtl/aarch64/pr111411.c: New test.
---
 gcc/config/aarch64/aarch64.cc   |  8 ++-
 gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c | 57 +
 2 files changed, 60 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 0962fc4f56e..7bb1161f943 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26503,11 +26503,9 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool 
load,
   gcc_assert (known_eq (GET_MODE_SIZE (GET_MODE (mem_1)),
			GET_MODE_SIZE (GET_MODE (mem_2))));
 
-  /* One of the memory accesses must be a mempair operand.
- If it is not the first one, they need to be swapped by the
- peephole.  */
-  if (!aarch64_mem_pair_operand (mem_1, GET_MODE (mem_1))
-   && !aarch64_mem_pair_operand (mem_2, GET_MODE (mem_2)))
+  /* The lower memory access must be a mem-pair operand.  */
+  rtx lower_mem = reversed ? mem_2 : mem_1;
+  if (!aarch64_mem_pair_operand (lower_mem, GET_MODE (lower_mem)))
 return false;
 
   if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1)))
diff --git a/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c b/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c
new file mode 100644
index 000..ad07e9c6c89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c
@@ -0,0 +1,57 @@
+/* { dg-do compile { target aarch64*-*-* } } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-O -fdisable-rtl-postreload -fpeephole2 -fno-schedule-fusion" } */
+
+extern int data[];
+
+void __RTL (startwith ("ira")) foo (void *ptr)
+{
+  (function "foo"
+(param "ptr"
+  (DECL_RTL (reg/v:DI <0> [ ptr ]))
+  (DECL_RTL_INCOMING (reg/v:DI x0 [ ptr ]))
+) ;; param "ptr"
+(insn-chain
+  (block 2
+   (edge-from entry (flags "FALLTHRU"))
+   (cnote 3 [bb 2] NOTE_INSN_BASIC_BLOCK)
+   (insn 4 (set (reg:DI <0>) (reg:DI x0)))
+   (insn 5 (set (reg:DI <1>)
+(plus:DI (reg:DI <0>) (const_int 768
+   (insn 6 (set (mem:SI (plus:DI (reg:DI <0>)
+ (const_int 508)) [1 +508 S4 A4])
+(const_int 0)))
+   (insn 7 (set (mem:SI (plus:DI (reg:DI <1>)
+ (const_int -256)) [1 +512 S4 A4])
+(const_int 0)))
+   (edge-to exit (flags "FALLTHRU"))
+  ) ;; block 2
+) ;; insn-chain
+  ) ;; function
+}
+
+void __RTL (startwith ("ira")) bar (void *ptr)
+{
+  (function "bar"
+(param "ptr"
+  (DECL_RTL (reg/v:DI <0> [ ptr ]))
+  (DECL_RTL_INCOMING (reg/v:DI x0 [ ptr ]))
+) ;; param "ptr"
+(insn-chain
+  (block 2
+   (edge-from entry (flags "FALLTHRU"))
+   (cnote 3 [bb 2] NOTE_INSN_BASIC_BLOCK)
+   (insn 4 (set (reg:DI <0>) (reg:DI x0)))
+   (insn 5 (set (reg:DI <1>)
+(plus:DI (reg:DI <0>) (const_int 768
+   (insn 6 (set (mem:SI (plus:DI (reg:DI <1>)
+ (const_int -256)) [1 +512 S4 A4])
+(const_int 0)))
+   (insn 7 (set (mem:SI (plus:DI (reg:DI <0>)
+ (const_int 508)) [1 +508 S4 A4])
+(const_int 0)))
+   (edge-to exit (flags "FALLTHRU"))
+  ) ;; block 2
+) ;; insn-chain
+  ) ;; function
+}
-- 
2.25.1



[PATCH] aarch64: Restore SVE WHILE costing

2023-09-14 Thread Richard Sandiford via Gcc-patches
AArch64 previously costed WHILELO instructions on the first call
to add_stmt_cost.  This was because, at the time, only add_stmt_cost
had access to the loop_vec_info.

However, after the AVX512 changes, we only calculate the masks later.
This patch moves the WHILELO costing to finish_cost, which is in any
case a more logical place for it to be.  It also means that we can
check the final decision about whether to use predicated loops.

Tested on aarch64-linux-gnu & applied.

Richard


gcc/
* config/aarch64/aarch64.cc (aarch64_vector_costs::analyze_loop_info):
Move WHILELO handling to...
(aarch64_vector_costs::finish_cost): ...here.  Check whether the
vectorizer has decided to use a predicated loop.

gcc/testsuite/
* gcc.target/aarch64/sve/cost_model_15.c: New test.
---
 gcc/config/aarch64/aarch64.cc | 36 ++-
 .../gcc.target/aarch64/sve/cost_model_15.c| 13 +++
 2 files changed, 32 insertions(+), 17 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 3739a44bfd9..0962fc4f56e 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16310,22 +16310,6 @@ aarch64_vector_costs::analyze_loop_vinfo (loop_vec_info loop_vinfo)
   /* Detect whether we're vectorizing for SVE and should apply the unrolling
  heuristic described above m_unrolled_advsimd_niters.  */
   record_potential_advsimd_unrolling (loop_vinfo);
-
-  /* Record the issue information for any SVE WHILE instructions that the
- loop needs.  */
-  if (!m_ops.is_empty () && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
-{
-  unsigned int num_masks = 0;
-  rgroup_controls *rgm;
-  unsigned int num_vectors_m1;
-  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec,
-   num_vectors_m1, rgm)
-   if (rgm->type)
- num_masks += num_vectors_m1 + 1;
-  for (auto &ops : m_ops)
-   if (auto *issue = ops.sve_issue_info ())
- ops.pred_ops += num_masks * issue->while_pred_ops;
-}
 }
 
 /* Implement targetm.vectorize.builtin_vectorization_cost.  */
@@ -17507,9 +17491,27 @@ adjust_body_cost (loop_vec_info loop_vinfo,
 void
 aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
 {
+  /* Record the issue information for any SVE WHILE instructions that the
+ loop needs.  */
+  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
+  if (!m_ops.is_empty ()
+  && loop_vinfo
+  && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+{
+  unsigned int num_masks = 0;
+  rgroup_controls *rgm;
+  unsigned int num_vectors_m1;
+  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec,
+   num_vectors_m1, rgm)
+   if (rgm->type)
+ num_masks += num_vectors_m1 + 1;
+  for (auto &ops : m_ops)
+   if (auto *issue = ops.sve_issue_info ())
+ ops.pred_ops += num_masks * issue->while_pred_ops;
+}
+
   auto *scalar_costs
    = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
-  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
   if (loop_vinfo
   && m_vec_flags
   && aarch64_use_new_vector_costs_p ())
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c 
b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c
new file mode 100644
index 000..b9e6306bb59
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c
@@ -0,0 +1,13 @@
+/* { dg-options "-Ofast -mtune=neoverse-v1" } */
+
+double f(double *restrict x, double *restrict y, int *restrict z)
+{
+  double res = 0.0;
+  for (int i = 0; i < 100; ++i)
+res += x[i] * y[z[i]];
+  return res;
+}
+
+/* { dg-final { scan-assembler-times {\tld1sw\tz[0-9]+\.d,} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d,} 2 } } */
+/* { dg-final { scan-assembler-times {\tfmla\tz[0-9]+\.d,} 1 } } */
-- 
2.25.1



[PATCH] aarch64: Coerce addresses to be suitable for LD1RQ

2023-09-14 Thread Richard Sandiford via Gcc-patches
In the following test:

  svuint8_t ld(uint8_t *ptr) { return svld1rq(svptrue_b8(), ptr + 2); }

ptr + 2 is a valid address for an Advanced SIMD load, but not for
an SVE load.  We therefore ended up generating:

ldr q0, [x0, 2]
dup z0.q, z0.q[0]

This patch makes us generate LD1RQ for that case too.  It takes the
slightly old-school approach of making the predicate broader than
the constraint.  That is: any valid memory address is accepted as
an operand before RA.  If the instruction remains during RA, LRA will
coerce the address to match the constraint.  If the instruction gets
split before RA, the splitter will load invalid addresses into a
scratch register.
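
For the example above, the expected effect (a sketch of plausible
output, not quoted from the patch or its tests) is that the
out-of-range LD1RQ address is first loaded into a base register:

    add     x0, x0, 2
    ptrue   p0.b
    ld1rqb  z0.b, p0/z, [x0]

instead of the LDR Qn + DUP sequence shown earlier.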

Tested on aarch64-linux-gnu & pushed.

Richard

gcc/
* config/aarch64/aarch64-sve.md (@aarch64_vec_duplicate_vq_le):
Accept all nonimmediate_operands, but keep the existing constraints.
If the instruction is split before RA, load invalid addresses into
a temporary register.
* config/aarch64/predicates.md (aarch64_sve_dup_ld1rq_operand): Delete.

gcc/testsuite/
* gcc.target/aarch64/sve/acle/general/ld1rq_1.c: New test.
---
 gcc/config/aarch64/aarch64-sve.md | 15 -
 gcc/config/aarch64/predicates.md  |  4 ---
 .../aarch64/sve/acle/general/ld1rq_1.c| 33 +++
 3 files changed, 47 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c

diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index da5534c3e32..b223e7d3c9d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -2611,11 +2611,18 @@ (define_insn_and_split "*vec_duplicate_reg"
 )
 
 ;; Duplicate an Advanced SIMD vector to fill an SVE vector (LE version).
+;;
+;; The addressing mode range of LD1RQ does not match the addressing mode
+;; range of LDR Qn.  If the predicate enforced the LD1RQ range, we would
+;; not be able to combine LDR Qns outside that range.  The predicate
+;; therefore accepts all memory operands, with only the constraints
+;; enforcing the actual restrictions.  If the instruction is split
+;; before RA, we need to load invalid addresses into a temporary.
 
 (define_insn_and_split "@aarch64_vec_duplicate_vq_le"
   [(set (match_operand:SVE_FULL 0 "register_operand" "=w, w")
(vec_duplicate:SVE_FULL
- (match_operand: 1 "aarch64_sve_dup_ld1rq_operand" "w, UtQ")))
+ (match_operand: 1 "nonimmediate_operand" "w, UtQ")))
(clobber (match_scratch:VNx16BI 2 "=X, Upl"))]
   "TARGET_SVE && !BYTES_BIG_ENDIAN"
   {
@@ -2633,6 +2640,12 @@ (define_insn_and_split 
"@aarch64_vec_duplicate_vq_le"
   "&& MEM_P (operands[1])"
   [(const_int 0)]
   {
+if (can_create_pseudo_p ()
+&& !aarch64_sve_ld1rq_operand (operands[1], mode))
+  {
+   rtx addr = force_reg (Pmode, XEXP (operands[1], 0));
+   operands[1] = replace_equiv_address (operands[1], addr);
+  }
 if (GET_CODE (operands[2]) == SCRATCH)
   operands[2] = gen_reg_rtx (VNx16BImode);
 emit_move_insn (operands[2], CONSTM1_RTX (VNx16BImode));
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index 2d8d1fe25c1..01de4743974 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -732,10 +732,6 @@ (define_predicate "aarch64_sve_dup_operand"
   (ior (match_operand 0 "register_operand")
(match_operand 0 "aarch64_sve_ld1r_operand")))
 
-(define_predicate "aarch64_sve_dup_ld1rq_operand"
-  (ior (match_operand 0 "register_operand")
-   (match_operand 0 "aarch64_sve_ld1rq_operand")))
-
 (define_predicate "aarch64_sve_ptrue_svpattern_immediate"
   (and (match_code "const")
(match_test "aarch64_sve_ptrue_svpattern_p (op, NULL)")))
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c
new file mode 100644
index 000..9242c639731
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c
@@ -0,0 +1,33 @@
+/* { dg-options "-O2" } */
+
+#include 
+
+#define TEST_OFFSET(TYPE, SUFFIX, OFFSET) \
+  sv##TYPE##_t \
+  test_##TYPE##_##SUFFIX (TYPE##_t *ptr) \
+  { \
+return svld1rq(svptrue_b8(), ptr + OFFSET); \
+  }
+
+#define TEST(TYPE) \
+  TEST_OFFSET (TYPE, 0, 0) \
+  TEST_OFFSET (TYPE, 1, 1) \
+  TEST_OFFSET (TYPE, 2, 2) \
+  TEST_OFFSET (TYPE, 16, 16) \
+  TEST_OFFSET (TYPE, 0x1, 0x1) \
+  TEST_OFFSET (TYPE, 0x10001, 0x10001) \
+  TEST_OFFSET (TYPE, m1, -1) \
+  TEST_OFFSET (TYPE, m2, -2) \
+  TEST_OFFSET (TYPE, m16, -16) \
+  TEST_OFFSET (TYPE, m0x1, -0x1) \
+  TEST_OFFSET (TYPE, m0x10001, -0x10001)
+
+TEST (int8)
+TEST (int16)
+TEST (uint32)
+TEST (uint64)
+
+/* { dg-final { scan-assembler-times {\tld1rqb\t} 11 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rqh\t} 11 { target 
aarch64_little_endian } } } 

Re: [PATCH] AArch64: List official cores before codenames

2023-09-13 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> List official cores first so that -mcpu=native does not show a codename with -v
> or in errors/warnings.

Nice spot.

> Passes regress, OK for commit?
>
> gcc/ChangeLog:
> * config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares.
> (neoverse-v1): Place before zeus.
> (neoverse-v2): Place before demeter.
> * config/aarch64/aarch64-tune.md: Regenerate.

OK, thanks.  OK for backports too from my POV.

Richard

> ---
>
> diff --git a/gcc/config/aarch64/aarch64-cores.def 
> b/gcc/config/aarch64/aarch64-cores.def
> index 
> dbac497ef3aab410eb81db185b2e9532186888bb..3894f2afc27e71523e5a413fa45c144222082934
>  100644
> --- a/gcc/config/aarch64/aarch64-cores.def
> +++ b/gcc/config/aarch64/aarch64-cores.def
> @@ -115,8 +115,8 @@ AARCH64_CORE("cortex-a65",  cortexa65, cortexa53, V8_2A,  
> (F16, RCPC, DOTPROD, S
>  AARCH64_CORE("cortex-a65ae",  cortexa65ae, cortexa53, V8_2A,  (F16, RCPC, 
> DOTPROD, SSBS), cortexa73, 0x41, 0xd43, -1)
>  AARCH64_CORE("cortex-x1",  cortexx1, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
> SSBS, PROFILE), neoversen1, 0x41, 0xd44, -1)
>  AARCH64_CORE("cortex-x1c",  cortexx1c, cortexa57, V8_2A,  (F16, RCPC, 
> DOTPROD, SSBS, PROFILE, PAUTH), neoversen1, 0x41, 0xd4c, -1)
> -AARCH64_CORE("ares",  ares, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
> PROFILE), neoversen1, 0x41, 0xd0c, -1)
>  AARCH64_CORE("neoverse-n1",  neoversen1, cortexa57, V8_2A,  (F16, RCPC, 
> DOTPROD, PROFILE), neoversen1, 0x41, 0xd0c, -1)
> +AARCH64_CORE("ares",  ares, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
> PROFILE), neoversen1, 0x41, 0xd0c, -1)
>  AARCH64_CORE("neoverse-e1",  neoversee1, cortexa53, V8_2A,  (F16, RCPC, 
> DOTPROD, SSBS), cortexa73, 0x41, 0xd4a, -1)
>
>  /* Cavium ('C') cores. */
> @@ -143,8 +143,8 @@ AARCH64_CORE("thunderx3t110",  thunderx3t110,  
> thunderx3t110, V8_3A,  (CRYPTO, S
>  /* ARMv8.4-A Architecture Processors.  */
>
>  /* Arm ('A') cores.  */
> -AARCH64_CORE("zeus", zeus, cortexa57, V8_4A,  (SVE, I8MM, BF16, PROFILE, 
> SSBS, RNG), neoversev1, 0x41, 0xd40, -1)
>  AARCH64_CORE("neoverse-v1", neoversev1, cortexa57, V8_4A,  (SVE, I8MM, BF16, 
> PROFILE, SSBS, RNG), neoversev1, 0x41, 0xd40, -1)
> +AARCH64_CORE("zeus", zeus, cortexa57, V8_4A,  (SVE, I8MM, BF16, PROFILE, 
> SSBS, RNG), neoversev1, 0x41, 0xd40, -1)
>  AARCH64_CORE("neoverse-512tvb", neoverse512tvb, cortexa57, V8_4A,  (SVE, 
> I8MM, BF16, PROFILE, SSBS, RNG), neoverse512tvb, INVALID_IMP, INVALID_CORE, 
> -1)
>
>  /* Qualcomm ('Q') cores. */
> @@ -182,7 +182,7 @@ AARCH64_CORE("cortex-x3",  cortexx3, cortexa57, V9A,  
> (SVE2_BITPERM, MEMTAG, I8M
>
>  AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, 
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1)
>
> -AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
> RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
>  AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, 
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
> +AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
> RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
>
>  #undef AARCH64_CORE
> diff --git a/gcc/config/aarch64/aarch64-tune.md 
> b/gcc/config/aarch64/aarch64-tune.md
> index 
> 2170980dddb0d5d410a49631ad26ff2e346b39dd..69e5357fa814e4733b05f7164bfa11e4aa04
>  100644
> --- a/gcc/config/aarch64/aarch64-tune.md
> +++ b/gcc/config/aarch64/aarch64-tune.md
> @@ -1,5 +1,5 @@
>  ;; -*- buffer-read-only: t -*-
>  ;; Generated automatically by gentune.sh from aarch64-cores.def
>  (define_attr "tune"
> -   
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,demeter,neoversev2"
> +   
> 

[PATCH 17/19] aarch64: Explicitly record probe registers in frame info

2023-09-12 Thread Richard Sandiford via Gcc-patches
The stack frame is currently divided into three areas:

A: the area above the hard frame pointer
B: the SVE saves below the hard frame pointer
C: the outgoing arguments

If the stack frame is allocated in one chunk, the allocation needs a
probe if the frame size is >= guard_size - 1KiB.  In addition, if the
function is not a leaf function, it must probe an address no more than
1KiB above the outgoing SP.  We ensured the second condition by

(1) using single-chunk allocations for non-leaf functions only if
the link register save slot is within 512 bytes of the bottom
of the frame; and

(2) using the link register save as a probe (meaning, for instance,
that it can't be individually shrink-wrapped).

If instead the stack is allocated in multiple chunks, then:

* an allocation involving only the outgoing arguments (C above) requires
  a probe if the allocation size is > 1KiB

* any other allocation requires a probe if the allocation size
  is >= guard_size - 1KiB

* second and subsequent allocations require the previous allocation
  to probe at the bottom of the allocated area, regardless of the size
  of that previous allocation

The final point means that, unlike for single allocations,
it can be necessary to have both a non-SVE register probe and
an SVE register probe.  For example:

* allocate A, probe using a non-SVE register save
* allocate B, probe using an SVE register save
* allocate C

The non-SVE register used in this case was again the link register.
It was previously used even if the link register save slot was some
bytes above the bottom of the non-SVE register saves, but an earlier
patch avoided that by putting the link register save slot first.

As a belt-and-braces fix, this patch explicitly records which
probe registers we're using and allows the non-SVE probe to be
whichever register comes first (as for SVE).

The patch also avoids unnecessary probes in sve/pcs/stack_clash_3.c.
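
As a worked instance of the rules above (hypothetical numbers, assuming
the default 64KiB guard and the 1KiB caller guard):

* a single-chunk allocation needs a probe once it reaches
  64KiB - 1KiB = 64512 bytes;

* an outgoing-arguments-only allocation needs a probe once it exceeds
  1024 bytes;

* allocating A, then B, then C needs a probe in A's non-SVE saves and
  another in B's SVE saves, since each allocation must probe at the
  bottom of the previous one, regardless of the size of that previous
  allocation.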

gcc/
* config/aarch64/aarch64.h (aarch64_frame::sve_save_and_probe)
(aarch64_frame::hard_fp_save_and_probe): New fields.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize them.
Rather than asserting that a leaf function saves LR, instead assert
that a leaf function saves something.
(aarch64_get_separate_components): Prevent the chosen probe
registers from being individually shrink-wrapped.
(aarch64_allocate_and_probe_stack_space): Remove workaround for
probe registers that aren't at the bottom of the previous allocation.

gcc/testsuite/
* gcc.target/aarch64/sve/pcs/stack_clash_3.c: Avoid redundant probes.
---
 gcc/config/aarch64/aarch64.cc | 68 +++
 gcc/config/aarch64/aarch64.h  |  8 +++
 .../aarch64/sve/pcs/stack_clash_3.c   |  6 +-
 3 files changed, 64 insertions(+), 18 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index bcb879ba94b..3c7c476c4c6 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8510,15 +8510,11 @@ aarch64_layout_frame (void)
&& !crtl->abi->clobbers_full_reg_p (regno))
   frame.reg_offset[regno] = SLOT_REQUIRED;
 
-  /* With stack-clash, LR must be saved in non-leaf functions.  The saving of
- LR counts as an implicit probe which allows us to maintain the invariant
- described in the comment at expand_prologue.  */
-  gcc_assert (crtl->is_leaf
- || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED));
 
   poly_int64 offset = crtl->outgoing_args_size;
   gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
   frame.bytes_below_saved_regs = offset;
+  frame.sve_save_and_probe = INVALID_REGNUM;
 
   /* Now assign stack slots for the registers.  Start with the predicate
  registers, since predicate LDR and STR have a relatively small
@@ -8526,6 +8522,8 @@ aarch64_layout_frame (void)
   for (regno = P0_REGNUM; regno <= P15_REGNUM; regno++)
 if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
   {
+   if (frame.sve_save_and_probe == INVALID_REGNUM)
+ frame.sve_save_and_probe = regno;
frame.reg_offset[regno] = offset;
offset += BYTES_PER_SVE_PRED;
   }
@@ -8563,6 +8561,8 @@ aarch64_layout_frame (void)
 for (regno = V0_REGNUM; regno <= V31_REGNUM; regno++)
   if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
{
+ if (frame.sve_save_and_probe == INVALID_REGNUM)
+   frame.sve_save_and_probe = regno;
  frame.reg_offset[regno] = offset;
  offset += vector_save_size;
}
@@ -8572,10 +8572,18 @@ aarch64_layout_frame (void)
   frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
   bool saves_below_hard_fp_p
 = maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
+  gcc_assert (!saves_below_hard_fp_p
+ || (frame.sve_save_and_probe != INVALID_REGNUM
+ && known_eq 

[PATCH 19/19] aarch64: Make stack smash canary protect saved registers

2023-09-12 Thread Richard Sandiford via Gcc-patches
AArch64 normally puts the saved registers near the bottom of the frame,
immediately above any dynamic allocations.  But this means that a
stack-smash attack on those dynamic allocations could overwrite the
saved registers without needing to reach as far as the stack smash
canary.

The same thing could also happen for variable-sized arguments that are
passed by value, since those are allocated before a call and popped on
return.

This patch avoids that by putting the locals (and thus the canary) below
the saved registers when stack smash protection is active.

The patch fixes CVE-2023-4039.
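
A minimal sketch of the kind of function this protects (hypothetical,
not taken from the new tests): the VLA is dynamically allocated below
the saved registers, so before this patch an overflow of it could reach
the saved LR/FP without crossing the canary:

  void consume (char *);

  void
  f (int n)
  {
    char buf[n];   /* dynamic allocation below the saved registers */
    consume (buf); /* an overflow here could previously clobber the
                      saved LR/FP without touching the canary */
  }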

gcc/
* config/aarch64/aarch64.cc (aarch64_save_regs_above_locals_p):
New function.
(aarch64_layout_frame): Use it to decide whether locals should
go above or below the saved registers.
(aarch64_expand_prologue): Update stack layout comment.
Emit a stack tie after the final adjustment.

gcc/testsuite/
* gcc.target/aarch64/stack-protector-8.c: New test.
* gcc.target/aarch64/stack-protector-9.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc | 46 +++--
 .../gcc.target/aarch64/stack-protector-8.c| 95 +++
 .../gcc.target/aarch64/stack-protector-9.c| 33 +++
 3 files changed, 168 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-9.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 51e57370807..3739a44bfd9 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8433,6 +8433,20 @@ aarch64_needs_frame_chain (void)
   return aarch64_use_frame_pointer;
 }
 
+/* Return true if the current function should save registers above
+   the locals area, rather than below it.  */
+
+static bool
+aarch64_save_regs_above_locals_p ()
+{
+  /* When using stack smash protection, make sure that the canary slot
+ comes between the locals and the saved registers.  Otherwise,
+ it would be possible for a carefully sized smash attack to change
+ the saved registers (particularly LR and FP) without reaching the
+ canary.  */
+  return crtl->stack_protect_guard;
+}
+
 /* Mark the registers that need to be saved by the callee and calculate
the size of the callee-saved registers area and frame record (both FP
and LR may be omitted).  */
@@ -8444,6 +8458,7 @@ aarch64_layout_frame (void)
   poly_int64 vector_save_size = GET_MODE_SIZE (vector_save_mode);
   bool frame_related_fp_reg_p = false;
   aarch64_frame  = cfun->machine->frame;
+  poly_int64 top_of_locals = -1;
 
   frame.emit_frame_chain = aarch64_needs_frame_chain ();
 
@@ -8510,9 +8525,16 @@ aarch64_layout_frame (void)
&& !crtl->abi->clobbers_full_reg_p (regno))
   frame.reg_offset[regno] = SLOT_REQUIRED;
 
+  bool regs_at_top_p = aarch64_save_regs_above_locals_p ();
 
   poly_int64 offset = crtl->outgoing_args_size;
   gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
+  if (regs_at_top_p)
+{
+  offset += get_frame_size ();
+  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
+  top_of_locals = offset;
+}
   frame.bytes_below_saved_regs = offset;
   frame.sve_save_and_probe = INVALID_REGNUM;
 
@@ -8652,15 +8674,18 @@ aarch64_layout_frame (void)
  at expand_prologue.  */
   gcc_assert (crtl->is_leaf || maybe_ne (saved_regs_size, 0));
 
-  offset += get_frame_size ();
-  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
-  auto top_of_locals = offset;
-
+  if (!regs_at_top_p)
+{
+  offset += get_frame_size ();
+  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
+  top_of_locals = offset;
+}
   offset += frame.saved_varargs_size;
   gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
   frame.frame_size = offset;
 
   frame.bytes_above_hard_fp = frame.frame_size - frame.bytes_below_hard_fp;
+  gcc_assert (known_ge (top_of_locals, 0));
   frame.bytes_above_locals = frame.frame_size - top_of_locals;
 
   frame.initial_adjust = 0;
@@ -9979,10 +10004,10 @@ aarch64_epilogue_uses (int regno)
|  for register varargs |
|   |
+---+
-   |  local variables  | <-- frame_pointer_rtx
+   |  local variables (1)  | <-- frame_pointer_rtx
|   |
+---+
-   |  padding  |
+   |  padding (1)  |
+---+
|  callee-saved registers   |
+---+
@@ -9994,6 +10019,10 @@ aarch64_epilogue_uses (int regno)
+---+
|  SVE predicate registers  |
+---+
+   |  local variables (2)  

[PATCH 16/19] aarch64: Simplify probe of final frame allocation

2023-09-12 Thread Richard Sandiford via Gcc-patches
Previous patches ensured that the final frame allocation only needs
a probe when the size is strictly greater than 1KiB.  It's therefore
safe to use the normal 1024 probe offset in all cases.

The main motivation for doing this is to simplify the code and
reduce the number of special cases.
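
Concretely (restating the test updates below rather than adding new
behaviour), a 1040-byte residual allocation changes from

  sub sp, sp, #1040
  str xzr, [sp]

to

  sub sp, sp, #1040
  str xzr, [sp, #1024]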

gcc/
* config/aarch64/aarch64.cc (aarch64_allocate_and_probe_stack_space):
Always probe the residual allocation at offset 1024, asserting
that that is in range.

gcc/testsuite/
* gcc.target/aarch64/stack-check-prologue-17.c: Expect the probe
to be at offset 1024 rather than offset 0.
* gcc.target/aarch64/stack-check-prologue-18.c: Likewise.
* gcc.target/aarch64/stack-check-prologue-19.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc| 12 
 .../gcc.target/aarch64/stack-check-prologue-17.c |  2 +-
 .../gcc.target/aarch64/stack-check-prologue-18.c |  4 ++--
 .../gcc.target/aarch64/stack-check-prologue-19.c |  4 ++--
 4 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 383b32f2078..bcb879ba94b 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -9887,16 +9887,12 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx 
temp2,
  are still safe.  */
   if (residual)
 {
-  HOST_WIDE_INT residual_probe_offset = guard_used_by_caller;
+  gcc_assert (guard_used_by_caller + byte_sp_alignment <= size);
+
   /* If we're doing final adjustments, and we've done any full page
 allocations then any residual needs to be probed.  */
   if (final_adjustment_p && rounded_size != 0)
min_probe_threshold = 0;
-  /* If doing a small final adjustment, we always probe at offset 0.
-This is done to avoid issues when the final adjustment is smaller
-than the probing offset.  */
-  else if (final_adjustment_p && rounded_size == 0)
-   residual_probe_offset = 0;
 
   aarch64_sub_sp (temp1, temp2, residual, frame_related_p);
   if (residual >= min_probe_threshold)
@@ -9907,8 +9903,8 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx 
temp2,
 HOST_WIDE_INT_PRINT_DEC " bytes, probing will be required."
 "\n", residual);
 
-   emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
-residual_probe_offset));
+ emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+  guard_used_by_caller));
  emit_insn (gen_blockage ());
}
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
index 0d8a25d73a2..f0ec1389771 100644
--- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
@@ -33,7 +33,7 @@ int test1(int z) {
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #1040
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
index 82447d20fff..6383bec5ebc 100644
--- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
@@ -9,7 +9,7 @@ void g();
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #4064
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
@@ -50,7 +50,7 @@ int test1(int z) {
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #1040
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
index 73ac3e4e4eb..562039b5e9b 100644
--- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
@@ -9,7 +9,7 @@ void g();
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #4064
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
@@ -50,7 +50,7 @@ int test1(int z) {
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #1040
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
-- 
2.25.1



[PATCH 08/19] aarch64: Rename locals_offset to bytes_above_locals

2023-09-12 Thread Richard Sandiford via Gcc-patches
locals_offset was described as:

  /* Offset from the base of the frame (incomming SP) to the
 top of the locals area.  This value is always a multiple of
 STACK_BOUNDARY.  */

This is implicitly an “upside down” view of the frame: the incoming
SP is at offset 0, and anything N bytes below the incoming SP is at
offset N (rather than -N).

However, reg_offset instead uses a “right way up” view; that is,
it views offsets in address terms.  Something above X is at a
positive offset from X and something below X is at a negative
offset from X.

Also, even on FRAME_GROWS_DOWNWARD targets like AArch64,
target-independent code views offsets in address terms too:
locals are allocated at negative offsets to virtual_stack_vars.

It seems confusing to have *_offset fields of the same structure
using different polarities like this.  This patch tries to avoid
that by renaming locals_offset to bytes_above_locals.
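
A small worked example of the two polarities (illustrative numbers
only): if the incoming SP is at address 0x1000 and the locals area tops
out at 0xf80, both views store 0x80.  The old name read as "the locals
are at offset 0x80", implicitly below SP, whereas bytes_above_locals
reads in address terms: there are 0x80 bytes between the top of the
locals and the top of the frame.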

gcc/
* config/aarch64/aarch64.h (aarch64_frame::locals_offset): Rename to...
(aarch64_frame::bytes_above_locals): ...this.
* config/aarch64/aarch64.cc (aarch64_layout_frame)
(aarch64_initial_elimination_offset): Update accordingly.
---
 gcc/config/aarch64/aarch64.cc | 6 +++---
 gcc/config/aarch64/aarch64.h  | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 25b5fb243a6..bcd1dec6f51 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8637,7 +8637,7 @@ aarch64_layout_frame (void)
  STACK_BOUNDARY / BITS_PER_UNIT));
   frame.frame_size = saved_regs_and_above + frame.bytes_below_saved_regs;
 
-  frame.locals_offset = frame.saved_varargs_size;
+  frame.bytes_above_locals = frame.saved_varargs_size;
 
   frame.initial_adjust = 0;
   frame.final_adjust = 0;
@@ -12854,13 +12854,13 @@ aarch64_initial_elimination_offset (unsigned from, 
unsigned to)
return frame.hard_fp_offset;
 
   if (from == FRAME_POINTER_REGNUM)
-   return frame.hard_fp_offset - frame.locals_offset;
+   return frame.hard_fp_offset - frame.bytes_above_locals;
 }
 
   if (to == STACK_POINTER_REGNUM)
 {
   if (from == FRAME_POINTER_REGNUM)
-   return frame.frame_size - frame.locals_offset;
+   return frame.frame_size - frame.bytes_above_locals;
 }
 
   return frame.frame_size;
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 46dd981b85c..3382f819e72 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -790,10 +790,10 @@ struct GTY (()) aarch64_frame
  always a multiple of STACK_BOUNDARY.  */
   poly_int64 bytes_below_hard_fp;
 
-  /* Offset from the base of the frame (incomming SP) to the
- top of the locals area.  This value is always a multiple of
+  /* The number of bytes between the top of the locals area and the top
+ of the frame (the incomming SP).  This value is always a multiple of
  STACK_BOUNDARY.  */
-  poly_int64 locals_offset;
+  poly_int64 bytes_above_locals;
 
   /* Offset from the base of the frame (incomming SP) to the
  hard_frame_pointer.  This value is always a multiple of
-- 
2.25.1



[PATCH 18/19] aarch64: Remove below_hard_fp_saved_regs_size

2023-09-12 Thread Richard Sandiford via Gcc-patches
After previous patches, it's no longer necessary to store
saved_regs_size and below_hard_fp_saved_regs_size in the frame info.
All measurements instead use the top or bottom of the frame as
reference points.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::saved_regs_size)
(aarch64_frame::below_hard_fp_saved_regs_size): Delete.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Update accordingly.
---
 gcc/config/aarch64/aarch64.cc | 45 ---
 gcc/config/aarch64/aarch64.h  |  7 --
 2 files changed, 21 insertions(+), 31 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 3c7c476c4c6..51e57370807 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8569,9 +8569,8 @@ aarch64_layout_frame (void)
 
   /* OFFSET is now the offset of the hard frame pointer from the bottom
  of the callee save area.  */
-  frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
-  bool saves_below_hard_fp_p
-= maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
+  auto below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
+  bool saves_below_hard_fp_p = maybe_ne (below_hard_fp_saved_regs_size, 0);
   gcc_assert (!saves_below_hard_fp_p
  || (frame.sve_save_and_probe != INVALID_REGNUM
  && known_eq (frame.reg_offset[frame.sve_save_and_probe],
@@ -8641,9 +8640,8 @@ aarch64_layout_frame (void)
 
   offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
 
-  frame.saved_regs_size = offset - frame.bytes_below_saved_regs;
-  gcc_assert (known_eq (frame.saved_regs_size,
-   frame.below_hard_fp_saved_regs_size)
+  auto saved_regs_size = offset - frame.bytes_below_saved_regs;
+  gcc_assert (known_eq (saved_regs_size, below_hard_fp_saved_regs_size)
  || (frame.hard_fp_save_and_probe != INVALID_REGNUM
  && known_eq (frame.reg_offset[frame.hard_fp_save_and_probe],
   frame.bytes_below_hard_fp)));
@@ -8652,7 +8650,7 @@ aarch64_layout_frame (void)
  The saving of the bottommost register counts as an implicit probe,
  which allows us to maintain the invariant described in the comment
  at expand_prologue.  */
-  gcc_assert (crtl->is_leaf || maybe_ne (frame.saved_regs_size, 0));
+  gcc_assert (crtl->is_leaf || maybe_ne (saved_regs_size, 0));
 
   offset += get_frame_size ();
   offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
@@ -8709,7 +8707,7 @@ aarch64_layout_frame (void)
 
   HOST_WIDE_INT const_size, const_below_saved_regs, const_above_fp;
   HOST_WIDE_INT const_saved_regs_size;
-  if (known_eq (frame.saved_regs_size, 0))
+  if (known_eq (saved_regs_size, 0))
 frame.initial_adjust = frame.frame_size;
   else if (frame.frame_size.is_constant (&const_size)
   && const_size < max_push_offset
@@ -8722,7 +8720,7 @@ aarch64_layout_frame (void)
   frame.callee_adjust = const_size;
 }
   else if (frame.bytes_below_saved_regs.is_constant (&const_below_saved_regs)
-  && frame.saved_regs_size.is_constant (&const_saved_regs_size)
+  && saved_regs_size.is_constant (&const_saved_regs_size)
   && const_below_saved_regs + const_saved_regs_size < 512
   /* We could handle this case even with data below the saved
  registers, provided that that data left us with valid offsets
@@ -8741,8 +8739,7 @@ aarch64_layout_frame (void)
   frame.initial_adjust = frame.frame_size;
 }
   else if (saves_below_hard_fp_p
-  && known_eq (frame.saved_regs_size,
-   frame.below_hard_fp_saved_regs_size))
+  && known_eq (saved_regs_size, below_hard_fp_saved_regs_size))
 {
   /* Frame in which all saves are SVE saves:
 
@@ -8764,7 +8761,7 @@ aarch64_layout_frame (void)
 [save SVE registers relative to SP]
 sub sp, sp, bytes_below_saved_regs  */
   frame.callee_adjust = const_above_fp;
-  frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size;
+  frame.sve_callee_adjust = below_hard_fp_saved_regs_size;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
   else
@@ -8779,7 +8776,7 @@ aarch64_layout_frame (void)
 [save SVE registers relative to SP]
 sub sp, sp, bytes_below_saved_regs  */
   frame.initial_adjust = frame.bytes_above_hard_fp;
-  frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size;
+  frame.sve_callee_adjust = below_hard_fp_saved_regs_size;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
 
@@ -9985,17 +9982,17 @@ aarch64_epilogue_uses (int regno)
|  local variables  | <-- frame_pointer_rtx
|   |
+---+
-   |  padding  | \
-   +---+  |
-   |  callee-saved registers   |  | frame.saved_regs_size
-   

[PATCH 14/19] aarch64: Tweak stack clash boundary condition

2023-09-12 Thread Richard Sandiford via Gcc-patches
The AArch64 ABI says that, when stack clash protection is used,
there can be a maximum of 1KiB of unprobed space at sp on entry
to a function.  Therefore, we need to probe when allocating
>= guard_size - 1KiB of data (>= rather than >).  This is what
GCC does.

If an allocation is exactly guard_size bytes, it is enough to allocate
those bytes and probe once at offset 1024.  It isn't possible to use a
single probe at any other offset: higher would conmplicate later code,
by leaving more unprobed space than usual, while lower would risk
leaving an entire page unprobed.  For simplicity, the code probes all
allocations at offset 1024.

Some register saves also act as probes.  If we need to allocate
more space below the last such register save probe, we need to
probe the allocation if it is > 1KiB.  Again, this allocation is
then sometimes (but not always) probed at offset 1024.  This sort of
allocation is currently only used for outgoing arguments, which are
rarely this big.

However, the code also probed if this final outgoing-arguments
allocation was == 1KiB, rather than just > 1KiB.  This isn't
necessary, since the register save then probes at offset 1024
as required.  Continuing to probe allocations of exactly 1KiB
would complicate later patches.
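
In numbers, for the 4KiB guard size used by the new test: the
final-adjustment probe threshold becomes 1024 + 16 (the stack
alignment) = 1040 bytes, so test1's 1024-byte allocation below is no
longer probed, while test2's 1040-byte allocation still is.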

gcc/
* config/aarch64/aarch64.cc (aarch64_allocate_and_probe_stack_space):
Don't probe final allocations that are exactly 1KiB in size (after
unprobed space above the final allocation has been deducted).

gcc/testsuite/
* gcc.target/aarch64/stack-check-prologue-17.c: New test.
---
 gcc/config/aarch64/aarch64.cc |  4 +-
 .../aarch64/stack-check-prologue-17.c | 55 +++
 2 files changed, 58 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e40ccc7d1cf..b942bf3de4a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -9697,9 +9697,11 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx 
temp2,
   HOST_WIDE_INT guard_size
 = 1 << param_stack_clash_protection_guard_size;
   HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
+  HOST_WIDE_INT byte_sp_alignment = STACK_BOUNDARY / BITS_PER_UNIT;
+  gcc_assert (multiple_p (poly_size, byte_sp_alignment));
   HOST_WIDE_INT min_probe_threshold
 = (final_adjustment_p
-   ? guard_used_by_caller
+   ? guard_used_by_caller + byte_sp_alignment
: guard_size - guard_used_by_caller);
   /* When doing the final adjustment for the outgoing arguments, take into
  account any unprobed space there is above the current SP.  There are
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
new file mode 100644
index 000..0d8a25d73a2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
@@ -0,0 +1,55 @@
+/* { dg-options "-O2 -fstack-clash-protection -fomit-frame-pointer --param 
stack-clash-protection-guard-size=12" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+void f(int, ...);
+void g();
+
+/*
+** test1:
+** ...
+** str x30, \[sp\]
+** sub sp, sp, #1024
+** cbnz w0, .*
+** bl  g
+** ...
+*/
+int test1(int z) {
+  __uint128_t x = 0;
+  int y[0x400];
+  if (z)
+{
+  f(0, 0, 0, 0, 0, 0, 0, ,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x);
+}
+  g();
+  return 1;
+}
+
+/*
+** test2:
+** ...
+** str x30, \[sp\]
+** sub sp, sp, #1040
+** str xzr, \[sp\]
+** cbnz w0, .*
+** bl  g
+** ...
+*/
+int test2(int z) {
+  __uint128_t x = 0;
+  int y[0x400];
+  if (z)
+{
+  f(0, 0, 0, 0, 0, 0, 0, ,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x);
+}
+  g();
+  return 1;
+}
-- 
2.25.1



[PATCH 04/19] aarch64: Add bytes_below_saved_regs to frame info

2023-09-12 Thread Richard Sandiford via Gcc-patches
The frame layout code currently hard-codes the assumption that
the number of bytes below the saved registers is equal to the
size of the outgoing arguments.  This patch abstracts that
value into a new field of aarch64_frame.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::bytes_below_saved_regs): New
field.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize it,
and use it instead of crtl->outgoing_args_size.
(aarch64_get_separate_components): Use bytes_below_saved_regs instead
of outgoing_args_size.
(aarch64_process_components): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 71 ++-
 gcc/config/aarch64/aarch64.h  |  5 +++
 2 files changed, 41 insertions(+), 35 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 34d0ccc9a67..49c2fbedd14 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8517,6 +8517,8 @@ aarch64_layout_frame (void)
   gcc_assert (crtl->is_leaf
  || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED));
 
+  frame.bytes_below_saved_regs = crtl->outgoing_args_size;
+
   /* Now assign stack slots for the registers.  Start with the predicate
  registers, since predicate LDR and STR have a relatively small
  offset range.  These saves happen below the hard frame pointer.  */
@@ -8621,18 +8623,18 @@ aarch64_layout_frame (void)
 
   poly_int64 varargs_and_saved_regs_size = offset + frame.saved_varargs_size;
 
-  poly_int64 above_outgoing_args
+  poly_int64 saved_regs_and_above
 = aligned_upper_bound (varargs_and_saved_regs_size
   + get_frame_size (),
   STACK_BOUNDARY / BITS_PER_UNIT);
 
   frame.hard_fp_offset
-= above_outgoing_args - frame.below_hard_fp_saved_regs_size;
+= saved_regs_and_above - frame.below_hard_fp_saved_regs_size;
 
   /* Both these values are already aligned.  */
-  gcc_assert (multiple_p (crtl->outgoing_args_size,
+  gcc_assert (multiple_p (frame.bytes_below_saved_regs,
  STACK_BOUNDARY / BITS_PER_UNIT));
-  frame.frame_size = above_outgoing_args + crtl->outgoing_args_size;
+  frame.frame_size = saved_regs_and_above + frame.bytes_below_saved_regs;
 
   frame.locals_offset = frame.saved_varargs_size;
 
@@ -8676,7 +8678,7 @@ aarch64_layout_frame (void)
   else if (frame.wb_pop_candidate1 != INVALID_REGNUM)
 max_push_offset = 256;
 
-  HOST_WIDE_INT const_size, const_outgoing_args_size, const_fp_offset;
+  HOST_WIDE_INT const_size, const_below_saved_regs, const_fp_offset;
   HOST_WIDE_INT const_saved_regs_size;
   if (known_eq (frame.saved_regs_size, 0))
 frame.initial_adjust = frame.frame_size;
@@ -8684,31 +8686,31 @@ aarch64_layout_frame (void)
   && const_size < max_push_offset
   && known_eq (frame.hard_fp_offset, const_size))
 {
-  /* Simple, small frame with no outgoing arguments:
+  /* Simple, small frame with no data below the saved registers.
 
 stp reg1, reg2, [sp, -frame_size]!
 stp reg3, reg4, [sp, 16]  */
   frame.callee_adjust = const_size;
 }
-  else if (crtl->outgoing_args_size.is_constant (&const_outgoing_args_size)
+  else if (frame.bytes_below_saved_regs.is_constant (&const_below_saved_regs)
   && frame.saved_regs_size.is_constant (&const_saved_regs_size)
-  && const_outgoing_args_size + const_saved_regs_size < 512
-  /* We could handle this case even with outgoing args, provided
- that the number of args left us with valid offsets for all
- predicate and vector save slots.  It's such a rare case that
- it hardly seems worth the effort though.  */
-  && (!saves_below_hard_fp_p || const_outgoing_args_size == 0)
+  && const_below_saved_regs + const_saved_regs_size < 512
+  /* We could handle this case even with data below the saved
+ registers, provided that that data left us with valid offsets
+ for all predicate and vector save slots.  It's such a rare
+ case that it hardly seems worth the effort though.  */
+  && (!saves_below_hard_fp_p || const_below_saved_regs == 0)
   && !(cfun->calls_alloca
&& frame.hard_fp_offset.is_constant (&const_fp_offset)
&& const_fp_offset < max_push_offset))
 {
-  /* Frame with small outgoing arguments:
+  /* Frame with small area below the saved registers:
 
 sub sp, sp, frame_size
-stp reg1, reg2, [sp, outgoing_args_size]
-stp reg3, reg4, [sp, outgoing_args_size + 16]  */
+stp reg1, reg2, [sp, bytes_below_saved_regs]
+stp reg3, reg4, [sp, bytes_below_saved_regs + 16]  */
   frame.initial_adjust = frame.frame_size;
-  frame.callee_offset = const_outgoing_args_size;
+  frame.callee_offset = const_below_saved_regs;
 }
   else if (saves_below_hard_fp_p
   && known_eq 

[PATCH 15/19] aarch64: Put LR save probe in first 16 bytes

2023-09-12 Thread Richard Sandiford via Gcc-patches
-fstack-clash-protection uses the save of LR as a probe for the next
allocation.  The next allocation could be:

* another part of the static frame, e.g. when allocating SVE save slots
  or outgoing arguments

* an alloca in the same function

* an allocation made by a callee function

However, when -fomit-frame-pointer is used, the LR save slot is placed
above the other GPR save slots.  It could therefore be up to 80 bytes
above the base of the GPR save area (which is also the hard fp address).

aarch64_allocate_and_probe_stack_space took this into account when
deciding how much subsequent space could be allocated without needing
a probe.  However, it interacted badly with:

  /* If doing a small final adjustment, we always probe at offset 0.
 This is done to avoid issues when LR is not at position 0 or when
 the final adjustment is smaller than the probing offset.  */
  else if (final_adjustment_p && rounded_size == 0)
residual_probe_offset = 0;

which forces any allocation that is smaller than the guard page size
to be probed at offset 0 rather than the usual offset 1024.  It was
therefore possible to construct cases in which we had:

* a probe using LR at SP + 80 bytes (or some other value >= 16)
* an allocation of the guard page size - 16 bytes
* a probe at SP + 0

which allocates guard page size + 64 consecutive unprobed bytes.
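
With a 64KiB guard, for instance, that is: a probe at old SP + 80, an
allocation of 65536 - 16 = 65520 bytes, and a probe at new SP + 0,
leaving 65520 + 80 = 65600 bytes (guard size + 64) between consecutive
probes.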

This patch requires the LR probe to be in the first 16 bytes of the
save area when stack clash protection is active.  Doing it
unconditionally would cause code-quality regressions.

Putting LR before other registers prevents push/pop allocation
when shadow call stacks are enabled, since LR is restored
separately from the other callee-saved registers.

The new comment doesn't say that the probe register is required
to be LR, since a later patch removes that restriction.

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Ensure that
the LR save slot is in the first 16 bytes of the register save area.
Only form STP/LDP push/pop candidates if both registers are valid.
(aarch64_allocate_and_probe_stack_space): Remove workaround for
when LR was not in the first 16 bytes.

gcc/testsuite/
* gcc.target/aarch64/stack-check-prologue-18.c: New test.
* gcc.target/aarch64/stack-check-prologue-19.c: Likewise.
* gcc.target/aarch64/stack-check-prologue-20.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc |  72 ++---
 .../aarch64/stack-check-prologue-18.c | 100 ++
 .../aarch64/stack-check-prologue-19.c | 100 ++
 .../aarch64/stack-check-prologue-20.c |   3 +
 4 files changed, 233 insertions(+), 42 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-20.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index b942bf3de4a..383b32f2078 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8573,26 +8573,34 @@ aarch64_layout_frame (void)
   bool saves_below_hard_fp_p
 = maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
   frame.bytes_below_hard_fp = offset;
+
+  auto allocate_gpr_slot = [&](unsigned int regno)
+{
+  frame.reg_offset[regno] = offset;
+  if (frame.wb_push_candidate1 == INVALID_REGNUM)
+   frame.wb_push_candidate1 = regno;
+  else if (frame.wb_push_candidate2 == INVALID_REGNUM)
+   frame.wb_push_candidate2 = regno;
+  offset += UNITS_PER_WORD;
+};
+
   if (frame.emit_frame_chain)
 {
   /* FP and LR are placed in the linkage record.  */
-  frame.reg_offset[R29_REGNUM] = offset;
-  frame.wb_push_candidate1 = R29_REGNUM;
-  frame.reg_offset[R30_REGNUM] = offset + UNITS_PER_WORD;
-  frame.wb_push_candidate2 = R30_REGNUM;
-  offset += 2 * UNITS_PER_WORD;
+  allocate_gpr_slot (R29_REGNUM);
+  allocate_gpr_slot (R30_REGNUM);
 }
+  else if (flag_stack_clash_protection
+  && known_eq (frame.reg_offset[R30_REGNUM], SLOT_REQUIRED))
+/* Put the LR save slot first, since it makes a good choice of probe
+   for stack clash purposes.  The idea is that the link register usually
+   has to be saved before a call anyway, and so we lose little by
+   stopping it from being individually shrink-wrapped.  */
+allocate_gpr_slot (R30_REGNUM);
 
   for (regno = R0_REGNUM; regno <= R30_REGNUM; regno++)
 if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
-  {
-   frame.reg_offset[regno] = offset;
-   if (frame.wb_push_candidate1 == INVALID_REGNUM)
- frame.wb_push_candidate1 = regno;
-   else if (frame.wb_push_candidate2 == INVALID_REGNUM)
- frame.wb_push_candidate2 = regno;
-   offset += UNITS_PER_WORD;
-  }
+  allocate_gpr_slot 

[PATCH 13/19] aarch64: Minor initial adjustment tweak

2023-09-12 Thread Richard Sandiford via Gcc-patches
This patch just changes a calculation of initial_adjust
to one that makes it slightly more obvious that the total
adjustment is frame.frame_size.
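
The equivalence is just a restatement of the layout invariant:

  frame_size = bytes_above_hard_fp
               + below_hard_fp_saved_regs_size
               + bytes_below_saved_regs

and therefore

  bytes_above_hard_fp + below_hard_fp_saved_regs_size
    = frame_size - bytes_below_saved_regs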

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Tweak
calculation of initial_adjust for frames in which all saves
are SVE saves.
---
 gcc/config/aarch64/aarch64.cc | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9578592d256..e40ccc7d1cf 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8714,11 +8714,10 @@ aarch64_layout_frame (void)
 {
   /* Frame in which all saves are SVE saves:
 
-sub sp, sp, hard_fp_offset + below_hard_fp_saved_regs_size
+sub sp, sp, frame_size - bytes_below_saved_regs
 save SVE registers relative to SP
 sub sp, sp, bytes_below_saved_regs  */
-  frame.initial_adjust = (frame.bytes_above_hard_fp
- + frame.below_hard_fp_saved_regs_size);
+  frame.initial_adjust = frame.frame_size - frame.bytes_below_saved_regs;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
   else if (frame.bytes_above_hard_fp.is_constant (&const_above_fp)
-- 
2.25.1



[PATCH 10/19] aarch64: Tweak frame_size comment

2023-09-12 Thread Richard Sandiford via Gcc-patches
This patch fixes another case in which a value was described with
an “upside-down” view.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::frame_size): Tweak comment.
---
 gcc/config/aarch64/aarch64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 4a4de9c044e..92965eced0a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -800,8 +800,8 @@ struct GTY (()) aarch64_frame
  STACK_BOUNDARY.  */
   poly_int64 bytes_above_hard_fp;
 
-  /* The size of the frame.  This value is the offset from base of the
- frame (incomming SP) to the stack_pointer.  This value is always
+  /* The size of the frame, i.e. the number of bytes between the bottom
+ of the outgoing arguments and the incoming SP.  This value is always
  a multiple of STACK_BOUNDARY.  */
   poly_int64 frame_size;
 
-- 
2.25.1



[PATCH 03/19] aarch64: Explicitly handle frames with no saved registers

2023-09-12 Thread Richard Sandiford via Gcc-patches
If a frame has no saved registers, it can be allocated in one go.
There is no need to treat the areas below and above the saved
registers as separate.

And if we allocate the frame in one go, it should be allocated
as the initial_adjust rather than the final_adjust.  This allows the
frame size to grow to guard_size - guard_used_by_caller before a stack
probe is needed.  (A frame with no register saves is necessarily a
leaf frame.)

This is a no-op as things stand, since a leaf function will have
no outgoing arguments, and so all the frame will be above where
the saved registers normally go.
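
A sketch of the kind of frame in question (hypothetical example): a
leaf function whose only frame contents are locals, so there are no
saved registers and no outgoing arguments:

  int
  f (void)
  {
    volatile char buf[512];   /* frame is just locals; it can be
                                 allocated in one go as the
                                 initial_adjust */
    buf[0] = 1;
    return buf[0];
  }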

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Explicitly
allocate the frame in one go if there are no saved registers.
---
 gcc/config/aarch64/aarch64.cc | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9fb94623693..34d0ccc9a67 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8678,9 +8678,11 @@ aarch64_layout_frame (void)
 
   HOST_WIDE_INT const_size, const_outgoing_args_size, const_fp_offset;
   HOST_WIDE_INT const_saved_regs_size;
-  if (frame.frame_size.is_constant (&const_size)
-  && const_size < max_push_offset
-  && known_eq (frame.hard_fp_offset, const_size))
+  if (known_eq (frame.saved_regs_size, 0))
+frame.initial_adjust = frame.frame_size;
+  else if (frame.frame_size.is_constant (&const_size)
+  && const_size < max_push_offset
+  && known_eq (frame.hard_fp_offset, const_size))
 {
   /* Simple, small frame with no outgoing arguments:
 
-- 
2.25.1



[PATCH 11/19] aarch64: Measure reg_offset from the bottom of the frame

2023-09-12 Thread Richard Sandiford via Gcc-patches
reg_offset was measured from the bottom of the saved register area.
This made perfect sense with the original layout, since the bottom
of the saved register area was also the hard frame pointer address.
It became slightly less obvious with SVE, since we save SVE
registers below the hard frame pointer, but it still made sense.

However, if we want to allow different frame layouts, it's more
convenient and obvious to measure reg_offset from the bottom of
the frame.  After previous patches, it's also a slight simplification
in its own right.
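
As an illustration (made-up numbers): with 32 bytes of outgoing
arguments and a single predicate save, the old scheme gave
reg_offset[P4_REGNUM] = 0 (the bottom of the save area), while the new
scheme gives reg_offset[P4_REGNUM] = 32, i.e. the old offset plus
bytes_below_saved_regs.  Save and restore code can then compute
sp-relative addresses directly as reg_offset - bytes_below_sp.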

gcc/
* config/aarch64/aarch64.h (aarch64_frame): Add comment above
reg_offset.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Walk offsets
from the bottom of the frame, rather than the bottom of the saved
register area.  Measure reg_offset from the bottom of the frame
rather than the bottom of the saved register area.
(aarch64_save_callee_saves): Update accordingly.
(aarch64_restore_callee_saves): Likewise.
(aarch64_get_separate_components): Likewise.
(aarch64_process_components): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 53 ---
 gcc/config/aarch64/aarch64.h  |  3 ++
 2 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 7d642d06871..ca2e6af5d12 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8439,7 +8439,6 @@ aarch64_needs_frame_chain (void)
 static void
 aarch64_layout_frame (void)
 {
-  poly_int64 offset = 0;
   int regno, last_fp_reg = INVALID_REGNUM;
   machine_mode vector_save_mode = aarch64_reg_save_mode (V8_REGNUM);
   poly_int64 vector_save_size = GET_MODE_SIZE (vector_save_mode);
@@ -8517,7 +8516,9 @@ aarch64_layout_frame (void)
   gcc_assert (crtl->is_leaf
  || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED));
 
-  frame.bytes_below_saved_regs = crtl->outgoing_args_size;
+  poly_int64 offset = crtl->outgoing_args_size;
+  gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
+  frame.bytes_below_saved_regs = offset;
 
   /* Now assign stack slots for the registers.  Start with the predicate
  registers, since predicate LDR and STR have a relatively small
@@ -8529,7 +8530,8 @@ aarch64_layout_frame (void)
offset += BYTES_PER_SVE_PRED;
   }
 
-  if (maybe_ne (offset, 0))
+  poly_int64 saved_prs_size = offset - frame.bytes_below_saved_regs;
+  if (maybe_ne (saved_prs_size, 0))
 {
   /* If we have any vector registers to save above the predicate registers,
 the offset of the vector register save slots need to be a multiple
@@ -8547,10 +8549,10 @@ aarch64_layout_frame (void)
offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
   else
{
- if (known_le (offset, vector_save_size))
-   offset = vector_save_size;
- else if (known_le (offset, vector_save_size * 2))
-   offset = vector_save_size * 2;
+ if (known_le (saved_prs_size, vector_save_size))
+   offset = frame.bytes_below_saved_regs + vector_save_size;
+ else if (known_le (saved_prs_size, vector_save_size * 2))
+   offset = frame.bytes_below_saved_regs + vector_save_size * 2;
  else
gcc_unreachable ();
}
@@ -8567,9 +8569,10 @@ aarch64_layout_frame (void)
 
   /* OFFSET is now the offset of the hard frame pointer from the bottom
  of the callee save area.  */
-  bool saves_below_hard_fp_p = maybe_ne (offset, 0);
-  frame.below_hard_fp_saved_regs_size = offset;
-  frame.bytes_below_hard_fp = offset + frame.bytes_below_saved_regs;
+  frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
+  bool saves_below_hard_fp_p
+= maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
+  frame.bytes_below_hard_fp = offset;
   if (frame.emit_frame_chain)
 {
   /* FP and LR are placed in the linkage record.  */
@@ -8620,9 +8623,10 @@ aarch64_layout_frame (void)
 
   offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
 
-  frame.saved_regs_size = offset;
+  frame.saved_regs_size = offset - frame.bytes_below_saved_regs;
 
-  poly_int64 varargs_and_saved_regs_size = offset + frame.saved_varargs_size;
+  poly_int64 varargs_and_saved_regs_size
+= frame.saved_regs_size + frame.saved_varargs_size;
 
   poly_int64 saved_regs_and_above
 = aligned_upper_bound (varargs_and_saved_regs_size
@@ -9144,9 +9148,7 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = (frame.reg_offset[regno]
-   + frame.bytes_below_saved_regs
-   - bytes_below_sp);
+  offset = frame.reg_offset[regno] - bytes_below_sp;
   rtx base_rtx = stack_pointer_rtx;
   poly_int64 sp_offset = offset;
 
@@ -9253,9 +9255,7 @@ 

[PATCH 09/19] aarch64: Rename hard_fp_offset to bytes_above_hard_fp

2023-09-12 Thread Richard Sandiford via Gcc-patches
Similarly to the previous locals_offset patch, hard_fp_offset
was described as:

  /* Offset from the base of the frame (incomming SP) to the
 hard_frame_pointer.  This value is always a multiple of
 STACK_BOUNDARY.  */
  poly_int64 hard_fp_offset;

which again took an “upside-down” view: higher offsets meant lower
addresses.  This patch renames the field to bytes_above_hard_fp instead.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::hard_fp_offset): Rename
to...
(aarch64_frame::bytes_above_hard_fp): ...this.
* config/aarch64/aarch64.cc (aarch64_layout_frame)
(aarch64_expand_prologue): Update accordingly.
(aarch64_initial_elimination_offset): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 26 +-
 gcc/config/aarch64/aarch64.h  |  6 +++---
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index bcd1dec6f51..7d642d06871 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8629,7 +8629,7 @@ aarch64_layout_frame (void)
   + get_frame_size (),
   STACK_BOUNDARY / BITS_PER_UNIT);
 
-  frame.hard_fp_offset
+  frame.bytes_above_hard_fp
 = saved_regs_and_above - frame.below_hard_fp_saved_regs_size;
 
   /* Both these values are already aligned.  */
@@ -8678,13 +8678,13 @@ aarch64_layout_frame (void)
   else if (frame.wb_pop_candidate1 != INVALID_REGNUM)
 max_push_offset = 256;
 
-  HOST_WIDE_INT const_size, const_below_saved_regs, const_fp_offset;
+  HOST_WIDE_INT const_size, const_below_saved_regs, const_above_fp;
   HOST_WIDE_INT const_saved_regs_size;
   if (known_eq (frame.saved_regs_size, 0))
 frame.initial_adjust = frame.frame_size;
   else if (frame.frame_size.is_constant (&const_size)
   && const_size < max_push_offset
-  && known_eq (frame.hard_fp_offset, const_size))
+  && known_eq (frame.bytes_above_hard_fp, const_size))
 {
   /* Simple, small frame with no data below the saved registers.
 
@@ -8701,8 +8701,8 @@ aarch64_layout_frame (void)
  case that it hardly seems worth the effort though.  */
   && (!saves_below_hard_fp_p || const_below_saved_regs == 0)
   && !(cfun->calls_alloca
-   && frame.hard_fp_offset.is_constant (&const_fp_offset)
-   && const_fp_offset < max_push_offset))
+   && frame.bytes_above_hard_fp.is_constant (&const_above_fp)
+   && const_above_fp < max_push_offset))
 {
   /* Frame with small area below the saved registers:
 
@@ -8720,12 +8720,12 @@ aarch64_layout_frame (void)
 sub sp, sp, hard_fp_offset + below_hard_fp_saved_regs_size
 save SVE registers relative to SP
 sub sp, sp, bytes_below_saved_regs  */
-  frame.initial_adjust = (frame.hard_fp_offset
+  frame.initial_adjust = (frame.bytes_above_hard_fp
  + frame.below_hard_fp_saved_regs_size);
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
-  else if (frame.hard_fp_offset.is_constant (&const_fp_offset)
-  && const_fp_offset < max_push_offset)
+  else if (frame.bytes_above_hard_fp.is_constant (&const_above_fp)
+  && const_above_fp < max_push_offset)
 {
   /* Frame with large area below the saved registers, or with SVE saves,
 but with a small area above:
@@ -8735,7 +8735,7 @@ aarch64_layout_frame (void)
 [sub sp, sp, below_hard_fp_saved_regs_size]
 [save SVE registers relative to SP]
 sub sp, sp, bytes_below_saved_regs  */
-  frame.callee_adjust = const_fp_offset;
+  frame.callee_adjust = const_above_fp;
   frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
@@ -8750,7 +8750,7 @@ aarch64_layout_frame (void)
 [sub sp, sp, below_hard_fp_saved_regs_size]
 [save SVE registers relative to SP]
 sub sp, sp, bytes_below_saved_regs  */
-  frame.initial_adjust = frame.hard_fp_offset;
+  frame.initial_adjust = frame.bytes_above_hard_fp;
   frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
@@ -10118,7 +10118,7 @@ aarch64_expand_prologue (void)
 {
   /* The offset of the frame chain record (if any) from the current SP.  */
   poly_int64 chain_offset = (initial_adjust + callee_adjust
-- frame.hard_fp_offset);
+- frame.bytes_above_hard_fp);
   gcc_assert (known_ge (chain_offset, 0));
 
   if (callee_adjust == 0)
@@ -12851,10 +12851,10 @@ aarch64_initial_elimination_offset (unsigned from, 
unsigned to)
   if (to == HARD_FRAME_POINTER_REGNUM)
 {
   if (from == ARG_POINTER_REGNUM)
-   return frame.hard_fp_offset;
+   return frame.bytes_above_hard_fp;
 
   if (from == FRAME_POINTER_REGNUM)

[PATCH 06/19] aarch64: Tweak aarch64_save/restore_callee_saves

2023-09-12 Thread Richard Sandiford via Gcc-patches
aarch64_save_callee_saves and aarch64_restore_callee_saves took
a parameter called start_offset that gives the offset of the
bottom of the saved register area from the current stack pointer.
However, it's more convenient for later patches if we use the
bottom of the entire frame as the reference point, rather than
the bottom of the saved registers.

Doing that removes the need for the callee_offset field.
Other than that, this is not a win on its own.  It only really
makes sense in combination with the follow-on patches.
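
As a sanity check that the old and new interfaces describe the same
address, here is a stand-alone sketch (plain `long` in place of
poly_int64; all numbers invented for illustration):

```
#include <cassert>

int main ()
{
  // Toy layout: the register is saved 8 bytes into the save area, whose
  // bottom is 32 bytes above the bottom of the frame; the current SP is
  // 16 bytes above the bottom of the frame.
  long reg_offset = 8;
  long bytes_below_saved_regs = 32;
  long bytes_below_sp = 16;

  // Old interface: the caller passed the save area's offset from SP.
  long start_offset = bytes_below_saved_regs - bytes_below_sp;
  long old_offset = start_offset + reg_offset;

  // New interface: the caller passes bytes_below_sp and the function
  // recovers the same SP-relative offset from frame data.
  long new_offset = reg_offset + bytes_below_saved_regs - bytes_below_sp;

  assert (old_offset == new_offset);
  return 0;
}
```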

gcc/
* config/aarch64/aarch64.h (aarch64_frame::callee_offset): Delete.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Remove
callee_offset handling.
(aarch64_save_callee_saves): Replace the start_offset parameter
with a bytes_below_sp parameter.
(aarch64_restore_callee_saves): Likewise.
(aarch64_expand_prologue): Update accordingly.
(aarch64_expand_epilogue): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 56 +--
 gcc/config/aarch64/aarch64.h  |  4 ---
 2 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 58dd8946232..2c218c90906 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8643,7 +8643,6 @@ aarch64_layout_frame (void)
   frame.final_adjust = 0;
   frame.callee_adjust = 0;
   frame.sve_callee_adjust = 0;
-  frame.callee_offset = 0;
 
   frame.wb_pop_candidate1 = frame.wb_push_candidate1;
   frame.wb_pop_candidate2 = frame.wb_push_candidate2;
@@ -8711,7 +8710,6 @@ aarch64_layout_frame (void)
 stp reg1, reg2, [sp, bytes_below_saved_regs]
 stp reg3, reg4, [sp, bytes_below_saved_regs + 16]  */
   frame.initial_adjust = frame.frame_size;
-  frame.callee_offset = const_below_saved_regs;
 }
   else if (saves_below_hard_fp_p
   && known_eq (frame.saved_regs_size,
@@ -9112,12 +9110,13 @@ aarch64_add_cfa_expression (rtx_insn *insn, rtx reg,
 }
 
 /* Emit code to save the callee-saved registers from register number START
-   to LIMIT to the stack at the location starting at offset START_OFFSET,
-   skipping any write-back candidates if SKIP_WB is true.  HARD_FP_VALID_P
-   is true if the hard frame pointer has been set up.  */
+   to LIMIT to the stack.  The stack pointer is currently BYTES_BELOW_SP
+   bytes above the bottom of the static frame.  Skip any write-back
+   candidates if SKIP_WB is true.  HARD_FP_VALID_P is true if the hard
+   frame pointer has been set up.  */
 
 static void
-aarch64_save_callee_saves (poly_int64 start_offset,
+aarch64_save_callee_saves (poly_int64 bytes_below_sp,
   unsigned start, unsigned limit, bool skip_wb,
   bool hard_fp_valid_p)
 {
@@ -9145,7 +9144,9 @@ aarch64_save_callee_saves (poly_int64 start_offset,
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = start_offset + frame.reg_offset[regno];
+  offset = (frame.reg_offset[regno]
+   + frame.bytes_below_saved_regs
+   - bytes_below_sp);
   rtx base_rtx = stack_pointer_rtx;
   poly_int64 sp_offset = offset;
 
@@ -9156,9 +9157,7 @@ aarch64_save_callee_saves (poly_int64 start_offset,
   else if (GP_REGNUM_P (regno)
   && (!offset.is_constant (&const_offset) || const_offset >= 512))
{
- gcc_assert (known_eq (start_offset, 0));
- poly_int64 fp_offset
-   = frame.below_hard_fp_saved_regs_size;
+ poly_int64 fp_offset = frame.bytes_below_hard_fp - bytes_below_sp;
  if (hard_fp_valid_p)
base_rtx = hard_frame_pointer_rtx;
  else
@@ -9222,12 +9221,13 @@ aarch64_save_callee_saves (poly_int64 start_offset,
 }
 
 /* Emit code to restore the callee registers from register number START
-   up to and including LIMIT.  Restore from the stack offset START_OFFSET,
-   skipping any write-back candidates if SKIP_WB is true.  Write the
-   appropriate REG_CFA_RESTORE notes into CFI_OPS.  */
+   up to and including LIMIT.  The stack pointer is currently BYTES_BELOW_SP
+   bytes above the bottom of the static frame.  Skip any write-back
+   candidates if SKIP_WB is true.  Write the appropriate REG_CFA_RESTORE
+   notes into CFI_OPS.  */
 
 static void
-aarch64_restore_callee_saves (poly_int64 start_offset, unsigned start,
+aarch64_restore_callee_saves (poly_int64 bytes_below_sp, unsigned start,
  unsigned limit, bool skip_wb, rtx *cfi_ops)
 {
   aarch64_frame &frame = cfun->machine->frame;
@@ -9253,7 +9253,9 @@ aarch64_restore_callee_saves (poly_int64 start_offset, unsigned start,
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = start_offset + frame.reg_offset[regno];
+  offset = (frame.reg_offset[regno]
+   + frame.bytes_below_saved_regs
+   - bytes_below_sp);

[PATCH 02/19] aarch64: Avoid a use of callee_offset

2023-09-12 Thread Richard Sandiford via Gcc-patches
When we emit the frame chain, i.e. when we reach Here in this statement
of aarch64_expand_prologue:

  if (emit_frame_chain)
{
  // Here
  ...
}

the stack is in one of two states:

- We've allocated up to the frame chain, but no more.

- We've allocated the whole frame, and the frame chain is within easy
  reach of the new SP.

The offset of the frame chain from the current SP is available
in aarch64_frame as callee_offset.  It is also available as the
chain_offset local variable, where the latter is calculated from other
data.  (However, chain_offset is not always equal to callee_offset when
!emit_frame_chain, so chain_offset isn't redundant.)

In c600df9a4060da3c6121ff4d0b93f179eafd69d1 I switched to using
chain_offset for the initialisation of the hard frame pointer:

   aarch64_add_offset (Pmode, hard_frame_pointer_rtx,
- stack_pointer_rtx, callee_offset,
+ stack_pointer_rtx, chain_offset,
  tmp1_rtx, tmp0_rtx, frame_pointer_needed);

But the later REG_CFA_ADJUST_CFA handling still used callee_offset.

I think the difference is harmless, but it's more logical for the
CFA note to be in sync, and it's more convenient for later patches
if it uses chain_offset.
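
For concreteness, here is a stand-alone sketch of the chain_offset
calculation used in aarch64_expand_prologue (invented numbers; the real
code uses poly_int64):

```
#include <cassert>

int main ()
{
  // Toy prologue: SP has moved down by initial_adjust + callee_adjust
  // from the incoming SP, and the frame chain record lives
  // hard_fp_offset bytes below the incoming SP.
  long initial_adjust = 32;
  long callee_adjust = 16;
  long hard_fp_offset = 48;

  // Hence the chain record's offset from the *current* SP.
  long chain_offset = initial_adjust + callee_adjust - hard_fp_offset;

  // aarch64_expand_prologue asserts this is non-negative.
  assert (chain_offset >= 0);
  return 0;
}
```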

gcc/
* config/aarch64/aarch64.cc (aarch64_expand_prologue): Use
chain_offset rather than callee_offset.
---
 gcc/config/aarch64/aarch64.cc | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index b91f77d7b1f..9fb94623693 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -10034,7 +10034,6 @@ aarch64_expand_prologue (void)
   poly_int64 initial_adjust = frame.initial_adjust;
   HOST_WIDE_INT callee_adjust = frame.callee_adjust;
   poly_int64 final_adjust = frame.final_adjust;
-  poly_int64 callee_offset = frame.callee_offset;
   poly_int64 sve_callee_adjust = frame.sve_callee_adjust;
   poly_int64 below_hard_fp_saved_regs_size
 = frame.below_hard_fp_saved_regs_size;
@@ -10147,8 +10146,7 @@ aarch64_expand_prologue (void)
 implicit.  */
  if (!find_reg_note (insn, REG_CFA_ADJUST_CFA, NULL_RTX))
{
- rtx src = plus_constant (Pmode, stack_pointer_rtx,
-  callee_offset);
+ rtx src = plus_constant (Pmode, stack_pointer_rtx, chain_offset);
  add_reg_note (insn, REG_CFA_ADJUST_CFA,
gen_rtx_SET (hard_frame_pointer_rtx, src));
}
-- 
2.25.1



[PATCH 12/19] aarch64: Simplify top of frame allocation

2023-09-12 Thread Richard Sandiford via Gcc-patches
After previous patches, it no longer really makes sense to allocate
the top of the frame in terms of varargs_and_saved_regs_size and
saved_regs_and_above.

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Simplify
the allocation of the top of the frame.
---
 gcc/config/aarch64/aarch64.cc | 23 ---
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ca2e6af5d12..9578592d256 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8625,23 +8625,16 @@ aarch64_layout_frame (void)
 
   frame.saved_regs_size = offset - frame.bytes_below_saved_regs;
 
-  poly_int64 varargs_and_saved_regs_size
-= frame.saved_regs_size + frame.saved_varargs_size;
-
-  poly_int64 saved_regs_and_above
-= aligned_upper_bound (varargs_and_saved_regs_size
-  + get_frame_size (),
-  STACK_BOUNDARY / BITS_PER_UNIT);
-
-  frame.bytes_above_hard_fp
-= saved_regs_and_above - frame.below_hard_fp_saved_regs_size;
+  offset += get_frame_size ();
+  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
+  auto top_of_locals = offset;
 
-  /* Both these values are already aligned.  */
-  gcc_assert (multiple_p (frame.bytes_below_saved_regs,
- STACK_BOUNDARY / BITS_PER_UNIT));
-  frame.frame_size = saved_regs_and_above + frame.bytes_below_saved_regs;
+  offset += frame.saved_varargs_size;
+  gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
+  frame.frame_size = offset;
 
-  frame.bytes_above_locals = frame.saved_varargs_size;
+  frame.bytes_above_hard_fp = frame.frame_size - frame.bytes_below_hard_fp;
+  frame.bytes_above_locals = frame.frame_size - top_of_locals;
 
   frame.initial_adjust = 0;
   frame.final_adjust = 0;
-- 
2.25.1



[PATCH 07/19] aarch64: Only calculate chain_offset if there is a chain

2023-09-12 Thread Richard Sandiford via Gcc-patches
After previous patches, it is no longer necessary to calculate
a chain_offset in cases where there is no chain record.

gcc/
* config/aarch64/aarch64.cc (aarch64_expand_prologue): Move the
calculation of chain_offset into the emit_frame_chain block.
---
 gcc/config/aarch64/aarch64.cc | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 2c218c90906..25b5fb243a6 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -10111,16 +10111,16 @@ aarch64_expand_prologue (void)
   if (callee_adjust != 0)
 aarch64_push_regs (reg1, reg2, callee_adjust);
 
-  /* The offset of the frame chain record (if any) from the current SP.  */
-  poly_int64 chain_offset = (initial_adjust + callee_adjust
-- frame.hard_fp_offset);
-  gcc_assert (known_ge (chain_offset, 0));
-
   /* The offset of the current SP from the bottom of the static frame.  */
   poly_int64 bytes_below_sp = frame_size - initial_adjust - callee_adjust;
 
   if (emit_frame_chain)
 {
+  /* The offset of the frame chain record (if any) from the current SP.  */
+  poly_int64 chain_offset = (initial_adjust + callee_adjust
+- frame.hard_fp_offset);
+  gcc_assert (known_ge (chain_offset, 0));
+
   if (callee_adjust == 0)
{
  reg1 = R29_REGNUM;
-- 
2.25.1



[PATCH 05/19] aarch64: Add bytes_below_hard_fp to frame info

2023-09-12 Thread Richard Sandiford via Gcc-patches
Following on from the previous bytes_below_saved_regs patch, this one
records the number of bytes that are below the hard frame pointer.
This eventually replaces below_hard_fp_saved_regs_size.

If a frame pointer is not needed, the epilogue adds final_adjust
to the stack pointer before restoring registers:

 aarch64_add_sp (tmp1_rtx, tmp0_rtx, final_adjust, true);

Therefore, if the epilogue needs to restore the stack pointer from
the hard frame pointer, the directly corresponding offset is:

 -bytes_below_hard_fp + final_adjust

i.e. go from the hard frame pointer to the bottom of the frame,
then add the same amount as if we were using the stack pointer
from the outset.
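
A stand-alone sketch of that arithmetic (plain `long` in place of
poly_int64; numbers invented for illustration):

```
#include <cassert>

int main ()
{
  // Toy frame: the hard frame pointer is 48 bytes above the bottom of
  // the frame, and the epilogue's final SP adjustment is 32 bytes.
  long bytes_below_hard_fp = 48;
  long final_adjust = 32;

  // Restoring SP from the hard frame pointer: go down to the bottom of
  // the frame, then back up by final_adjust, landing where an SP-based
  // epilogue would be just before restoring the registers.
  long offset_from_hard_fp = -bytes_below_hard_fp + final_adjust;
  assert (offset_from_hard_fp == -16);
  return 0;
}
```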

gcc/
* config/aarch64/aarch64.h (aarch64_frame::bytes_below_hard_fp): New
field.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize it.
(aarch64_expand_epilogue): Use it instead of
below_hard_fp_saved_regs_size.
---
 gcc/config/aarch64/aarch64.cc | 6 +++---
 gcc/config/aarch64/aarch64.h  | 5 +
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 49c2fbedd14..58dd8946232 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8569,6 +8569,7 @@ aarch64_layout_frame (void)
  of the callee save area.  */
   bool saves_below_hard_fp_p = maybe_ne (offset, 0);
   frame.below_hard_fp_saved_regs_size = offset;
+  frame.bytes_below_hard_fp = offset + frame.bytes_below_saved_regs;
   if (frame.emit_frame_chain)
 {
   /* FP and LR are placed in the linkage record.  */
@@ -10220,8 +10221,7 @@ aarch64_expand_epilogue (bool for_sibcall)
   poly_int64 final_adjust = frame.final_adjust;
   poly_int64 callee_offset = frame.callee_offset;
   poly_int64 sve_callee_adjust = frame.sve_callee_adjust;
-  poly_int64 below_hard_fp_saved_regs_size
-= frame.below_hard_fp_saved_regs_size;
+  poly_int64 bytes_below_hard_fp = frame.bytes_below_hard_fp;
   unsigned reg1 = frame.wb_pop_candidate1;
   unsigned reg2 = frame.wb_pop_candidate2;
   unsigned int last_gpr = (frame.is_scs_enabled
@@ -10279,7 +10279,7 @@ aarch64_expand_epilogue (bool for_sibcall)
is restored on the instruction doing the writeback.  */
 aarch64_add_offset (Pmode, stack_pointer_rtx,
hard_frame_pointer_rtx,
-   -callee_offset - below_hard_fp_saved_regs_size,
+   -bytes_below_hard_fp + final_adjust,
tmp1_rtx, tmp0_rtx, callee_adjust == 0);
   else
  /* The case where we need to re-use the register here is very rare, so
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 83939991eb1..75fd3b59b0d 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -785,6 +785,11 @@ struct GTY (()) aarch64_frame
  are saved below the hard frame pointer.  */
   poly_int64 below_hard_fp_saved_regs_size;
 
+  /* The number of bytes between the bottom of the static frame (the bottom
+ of the outgoing arguments) and the hard frame pointer.  This value is
+ always a multiple of STACK_BOUNDARY.  */
+  poly_int64 bytes_below_hard_fp;
+
   /* Offset from the base of the frame (incomming SP) to the
  top of the locals area.  This value is always a multiple of
  STACK_BOUNDARY.  */
-- 
2.25.1



[PATCH 01/19] aarch64: Use local frame vars in shrink-wrapping code

2023-09-12 Thread Richard Sandiford via Gcc-patches
aarch64_layout_frame uses a shorthand for referring to
cfun->machine->frame:

  aarch64_frame &frame = cfun->machine->frame;

This patch does the same for some other heavy users of the structure.
No functional change intended.

gcc/
* config/aarch64/aarch64.cc (aarch64_save_callee_saves): Use
a local shorthand for cfun->machine->frame.
(aarch64_restore_callee_saves, aarch64_get_separate_components):
(aarch64_process_components): Likewise.
(aarch64_allocate_and_probe_stack_space): Likewise.
(aarch64_expand_prologue, aarch64_expand_epilogue): Likewise.
(aarch64_layout_frame): Use existing shorthand for one more case.
---
 gcc/config/aarch64/aarch64.cc | 123 ++
 1 file changed, 64 insertions(+), 59 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 37d414021ca..b91f77d7b1f 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8651,7 +8651,7 @@ aarch64_layout_frame (void)
   frame.is_scs_enabled
 = (!crtl->calls_eh_return
&& sanitize_flags_p (SANITIZE_SHADOW_CALL_STACK)
-   && known_ge (cfun->machine->frame.reg_offset[LR_REGNUM], 0));
+   && known_ge (frame.reg_offset[LR_REGNUM], 0));
 
   /* When shadow call stack is enabled, the scs_pop in the epilogue will
  restore x30, and we don't need to pop x30 again in the traditional
@@ -9117,6 +9117,7 @@ aarch64_save_callee_saves (poly_int64 start_offset,
   unsigned start, unsigned limit, bool skip_wb,
   bool hard_fp_valid_p)
 {
+  aarch64_frame &frame = cfun->machine->frame;
   rtx_insn *insn;
   unsigned regno;
   unsigned regno2;
@@ -9131,8 +9132,8 @@ aarch64_save_callee_saves (poly_int64 start_offset,
   bool frame_related_p = aarch64_emit_cfi_for_reg_p (regno);
 
   if (skip_wb
- && (regno == cfun->machine->frame.wb_push_candidate1
- || regno == cfun->machine->frame.wb_push_candidate2))
+ && (regno == frame.wb_push_candidate1
+ || regno == frame.wb_push_candidate2))
continue;
 
   if (cfun->machine->reg_is_wrapped_separately[regno])
@@ -9140,7 +9141,7 @@ aarch64_save_callee_saves (poly_int64 start_offset,
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = start_offset + cfun->machine->frame.reg_offset[regno];
+  offset = start_offset + frame.reg_offset[regno];
   rtx base_rtx = stack_pointer_rtx;
   poly_int64 sp_offset = offset;
 
@@ -9153,7 +9154,7 @@ aarch64_save_callee_saves (poly_int64 start_offset,
{
  gcc_assert (known_eq (start_offset, 0));
  poly_int64 fp_offset
-   = cfun->machine->frame.below_hard_fp_saved_regs_size;
+   = frame.below_hard_fp_saved_regs_size;
  if (hard_fp_valid_p)
base_rtx = hard_frame_pointer_rtx;
  else
@@ -9175,8 +9176,7 @@ aarch64_save_callee_saves (poly_int64 start_offset,
  && (regno2 = aarch64_next_callee_save (regno + 1, limit)) <= limit
  && !cfun->machine->reg_is_wrapped_separately[regno2]
  && known_eq (GET_MODE_SIZE (mode),
-  cfun->machine->frame.reg_offset[regno2]
-  - cfun->machine->frame.reg_offset[regno]))
+  frame.reg_offset[regno2] - frame.reg_offset[regno]))
{
  rtx reg2 = gen_rtx_REG (mode, regno2);
  rtx mem2;
@@ -9226,6 +9226,7 @@ static void
 aarch64_restore_callee_saves (poly_int64 start_offset, unsigned start,
  unsigned limit, bool skip_wb, rtx *cfi_ops)
 {
+  aarch64_frame &frame = cfun->machine->frame;
   unsigned regno;
   unsigned regno2;
   poly_int64 offset;
@@ -9242,13 +9243,13 @@ aarch64_restore_callee_saves (poly_int64 start_offset, unsigned start,
   rtx reg, mem;
 
   if (skip_wb
- && (regno == cfun->machine->frame.wb_pop_candidate1
- || regno == cfun->machine->frame.wb_pop_candidate2))
+ && (regno == frame.wb_pop_candidate1
+ || regno == frame.wb_pop_candidate2))
continue;
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = start_offset + cfun->machine->frame.reg_offset[regno];
+  offset = start_offset + frame.reg_offset[regno];
   rtx base_rtx = stack_pointer_rtx;
   if (mode == VNx2DImode && BYTES_BIG_ENDIAN)
aarch64_adjust_sve_callee_save_base (mode, base_rtx, anchor_reg,
@@ -9259,8 +9260,7 @@ aarch64_restore_callee_saves (poly_int64 start_offset, unsigned start,
  && (regno2 = aarch64_next_callee_save (regno + 1, limit)) <= limit
  && !cfun->machine->reg_is_wrapped_separately[regno2]
  && known_eq (GET_MODE_SIZE (mode),
-  cfun->machine->frame.reg_offset[regno2]
-  - cfun->machine->frame.reg_offset[regno]))
+  frame.reg_offset[regno2] - frame.reg_offset[regno]))

[PATCH 00/19] aarch64: Fix -fstack-protector issue

2023-09-12 Thread Richard Sandiford via Gcc-patches
This series of patches fixes deficiencies in GCC's -fstack-protector
implementation for AArch64 when using dynamically allocated stack space.
This is CVE-2023-4039.  See:

https://developer.arm.com/Arm%20Security%20Center/GCC%20Stack%20Protector%20Vulnerability%20AArch64
https://github.com/metaredteam/external-disclosures/security/advisories/GHSA-x7ch-h5rf-w2mf

for more details.

The fix is to put the saved registers above the locals area when
-fstack-protector is used.

The series also fixes a stack-clash problem that I found while working
on the CVE.  In unpatched sources, the stack-clash problem would only
trigger for unrealistic numbers of arguments (8K 64-bit arguments, or an
equivalent).  But it would be a more significant issue with the new
-fstack-protector frame layout.  It's therefore important that both
problems are fixed together.

Some reorganisation of the code seemed necessary to fix the problems in a
cleanish way.  The series is therefore quite long, but only a handful of
patches should have any effect on code generation.

See the individual patches for a detailed description.

Tested on aarch64-linux-gnu. Pushed to trunk and to all active branches.
I've also pushed backports to GCC 7+ to vendors/ARM/heads/CVE-2023-4039.

Richard Sandiford (19):
  aarch64: Use local frame vars in shrink-wrapping code
  aarch64: Avoid a use of callee_offset
  aarch64: Explicitly handle frames with no saved registers
  aarch64: Add bytes_below_saved_regs to frame info
  aarch64: Add bytes_below_hard_fp to frame info
  aarch64: Tweak aarch64_save/restore_callee_saves
  aarch64: Only calculate chain_offset if there is a chain
  aarch64: Rename locals_offset to bytes_above_locals
  aarch64: Rename hard_fp_offset to bytes_above_hard_fp
  aarch64: Tweak frame_size comment
  aarch64: Measure reg_offset from the bottom of the frame
  aarch64: Simplify top of frame allocation
  aarch64: Minor initial adjustment tweak
  aarch64: Tweak stack clash boundary condition
  aarch64: Put LR save probe in first 16 bytes
  aarch64: Simplify probe of final frame allocation
  aarch64: Explicitly record probe registers in frame info
  aarch64: Remove below_hard_fp_saved_regs_size
  aarch64: Make stack smash canary protect saved registers

 gcc/config/aarch64/aarch64.cc | 518 ++
 gcc/config/aarch64/aarch64.h  |  44 +-
 .../aarch64/stack-check-prologue-17.c |  55 ++
 .../aarch64/stack-check-prologue-18.c | 100 
 .../aarch64/stack-check-prologue-19.c | 100 
 .../aarch64/stack-check-prologue-20.c |   3 +
 .../gcc.target/aarch64/stack-protector-8.c|  95 
 .../gcc.target/aarch64/stack-protector-9.c|  33 ++
 .../aarch64/sve/pcs/stack_clash_3.c   |   6 +-
 9 files changed, 699 insertions(+), 255 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-20.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-9.c

-- 
2.25.1



Re: [PATCH] pretty-print: Fix up pp_wide_int [PR111329]

2023-09-11 Thread Richard Sandiford via Gcc-patches
Jakub Jelinek  writes:
> Hi!
>
> The recent pp_wide_int changes for _BitInt support (because not all
> wide_ints fit into the small fixed size digit_buffer anymore) apparently
> broke
> +FAIL: gcc.dg/analyzer/out-of-bounds-diagram-1-debug.c (test for excess errors)
> +FAIL: gcc.dg/analyzer/out-of-bounds-diagram-1-debug.c 2 blank line(s) in output
> +FAIL: gcc.dg/analyzer/out-of-bounds-diagram-1-debug.c expected multiline pattern lines 17-39
> (and I couldn't reproduce that in bisect seed (which is -O0 compiled) and
> thought it would be some analyzer diagnostic bug).
>
> The problem is that analyzer uses pp_wide_int with a function call in the
> second argument.  Previously, when pp_wide_int macro just did
>   print_dec (W, pp_buffer (PP)->digit_buffer, SGN);
>   pp_string (PP, pp_buffer (PP)->digit_buffer);
> it worked, because the const wide_int_ref & first argument to print_dec
> bound to a temporary, which was only destructed at the end of the full
> statement after print_dec was called.
> But with my changes where I need to first compare the precision of the
> const wide_int_ref & to decide whether to use digit_buffer or XALLOCAVEC
> something larger, this means that pp_wide_int_ref binds to a temporary
> which is destroyed at the end of full statement which is the
>   const wide_int_ref &pp_wide_int_ref = (W);
> declaration, so then invokes UB accessing a destructed temporary.
>
> The following patch fixes it by rewriting pp_wide_int into an inline
> function, so that the end of the full statement is the end of the inline
> function call.  As functions using alloca aren't normally inlined, I've
> also split that part into a separate out of line function.  Putting that
> into pretty-print.cc didn't work, e.g. the gm2 binary doesn't link,
> because pretty-print.o is in libcommon.a, but wide-int-print.o which
> defines print_dec is not.  So I've put that out of line function into
> wide-int-print.cc instead.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> 2023-09-08  Jakub Jelinek  
>
>   PR middle-end/111329
>   * pretty-print.h (pp_wide_int): Rewrite from macro into inline
>   function.  For printing values which don't fit into digit_buffer
>   use out-of-line function.
>   * wide-int-print.h (pp_wide_int_large): Declare.
>   * wide-int-print.cc: Include pretty-print.h.
>   (pp_wide_int_large): Define.

OK, thanks.

Richard
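
To see the lifetime problem outside GCC, here is a minimal stand-alone
illustration; `view` and `make_value` are hypothetical stand-ins for
wide_int_ref and the function-call argument W, not GCC code:

```
#include <cstdio>
#include <string>

// A cheap non-owning view of another object, standing in for
// wide_int_ref.
struct view
{
  const std::string *str;
  view (const std::string &s) : str (&s) {}
};

std::string make_value () { return "42"; }

int main ()
{
  // OK: the temporary from make_value () lives until the end of the
  // full statement, so the view is still valid inside printf.  This is
  // what the old macro relied on.
  std::printf ("%s\n", view (make_value ()).str->c_str ());

  // Broken: the temporary dies at the end of *this* declaration
  // statement, so v dangles afterwards -- the situation created by
  // "const wide_int_ref &pp_wide_int_ref = (W);" in the reworked macro.
  view v = make_value ();
  std::printf ("%s\n", v.str->c_str ());  // undefined behaviour
  return 0;
}
```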

> --- gcc/pretty-print.h.jj 2023-09-06 14:36:53.485246347 +0200
> +++ gcc/pretty-print.h	2023-09-08 11:11:21.173649942 +0200
> @@ -333,28 +333,6 @@ pp_get_prefix (const pretty_printer *pp)
>  #define pp_decimal_int(PP, I)  pp_scalar (PP, "%d", I)
>  #define pp_unsigned_wide_integer(PP, I) \
> pp_scalar (PP, HOST_WIDE_INT_PRINT_UNSIGNED, (unsigned HOST_WIDE_INT) I)
> -#define pp_wide_int(PP, W, SGN)  \
> -  do \
> -{\
> -  const wide_int_ref &pp_wide_int_ref = (W); \
> -  unsigned int pp_wide_int_prec  \
> - = pp_wide_int_ref.get_precision (); \
> -  if ((pp_wide_int_prec + 3) / 4 \
> -   > sizeof (pp_buffer (PP)->digit_buffer) - 3)  \
> - {   \
> -   char *pp_wide_int_buf \
> - = XALLOCAVEC (char, (pp_wide_int_prec + 3) / 4 + 3);\
> -   print_dec (pp_wide_int_ref, pp_wide_int_buf, SGN);\
> -   pp_string (PP, pp_wide_int_buf);  \
> - }   \
> -  else   \
> - {   \
> -   print_dec (pp_wide_int_ref,   \
> -  pp_buffer (PP)->digit_buffer, SGN);\
> -   pp_string (PP, pp_buffer (PP)->digit_buffer); \
> - }   \
> -}\
> -  while (0)
>  #define pp_vrange(PP, R) \
>do \
>  {\
> @@ -453,6 +431,19 @@ pp_wide_integer (pretty_printer *pp, HOST_WIDE_INT i)
>pp_scalar (pp, HOST_WIDE_INT_PRINT_DEC, i);
>  }
>  
> +inline void
> +pp_wide_int (pretty_printer *pp, const wide_int_ref &w, signop sgn)
> +{
> +  unsigned int prec = w.get_precision ();
> +  if (UNLIKELY ((prec + 3) / 4 > sizeof (pp_buffer (pp)->digit_buffer) - 3))
> +pp_wide_int_large (pp, w, sgn);
> +  else
> +{
> +  print_dec (w, pp_buffer (pp)->digit_buffer, sgn);
> +  pp_string (pp, pp_buffer (pp)->digit_buffer);
> +}
> +}
> +
>  template
>  void 

[PATCH] Allow target attributes in non-gnu namespaces

2023-09-08 Thread Richard Sandiford via Gcc-patches
Currently there are four static sources of attributes:

- LANG_HOOKS_ATTRIBUTE_TABLE
- LANG_HOOKS_COMMON_ATTRIBUTE_TABLE
- LANG_HOOKS_FORMAT_ATTRIBUTE_TABLE
- TARGET_ATTRIBUTE_TABLE

All of the attributes in these tables go in the "gnu" namespace.
This means that they can use the traditional GNU __attribute__((...))
syntax and the standard [[gnu::...]] syntax.

Standard attributes are registered dynamically with a null namespace.
There are no supported attributes in other namespaces (clang, vendor
namespaces, etc.).

This patch tries to generalise things by making the namespace
part of the attribute specification.

It's usual for multiple attributes to be defined in the same namespace,
so rather than adding the namespace to each individual definition,
it seemed better to group attributes in the same namespace together.
This would also allow us to reuse the same table for clang attributes
that are written with the GNU syntax, or other similar situations
where the attribute can be accessed via multiple "spellings".

The patch therefore adds a scoped_attribute_specs that contains
a namespace and a list of attributes in that namespace.

It's still possible to have multiple scoped_attribute_specs
for the same namespace.  E.g. it makes sense to keep the
C++-specific, C/C++-common, and format-related attributes in
separate tables, even though they're all GNU attributes.

Current lists of attributes are terminated by a null name.
Rather than keep that for the new structure, it seemed neater
to use an array_slice.  This also makes the tables slightly more
compact.

In general, a target might want to support attributes in multiple
namespaces.  Rather than have a separate hook for each possibility
(like the three langhooks above), it seemed better to make
TARGET_ATTRIBUTE_TABLE a table of tables.  Specifically, it's
an array_slice of scoped_attribute_specs.

We can do the same thing for langhooks, which allows the three hooks
above to be merged into a single LANG_HOOKS_ATTRIBUTE_TABLE.
It also allows the standard attributes to be registered statically
and checked by the usual attribs.cc checks.

The patch adds a TARGET_GNU_ATTRIBUTES helper for the common case
in which a target wants a single table of gnu attributes.  It can
only be used if the table is free of preprocessor directives.

There are probably other things we need to do to make vendor namespaces
work smoothly.  E.g. in principle it would be good to make exclusion
sets namespace-aware.  But to some extent we have that with standard
vs. gnu attributes too.  This patch is just supposed to be a first step.
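
As a rough sketch of the shape this gives a port, with simplified
stand-in type definitions and a hypothetical port name (none of the
declarations below are the real GCC ones):

```
#include <cstddef>

// Simplified stand-ins for the GCC types (illustrative only).
struct attribute_spec
{
  const char *name;
  // handler, argument counts, flags, exclusions, ... omitted
};

template<typename T>
struct array_slice
{
  const T *base;
  std::size_t count;
  template<std::size_t N>
  array_slice (T (&array)[N]) : base (array), count (N) {}
};

// The new grouping: a namespace plus the attributes defined in it.
struct scoped_attribute_specs
{
  const char *ns;  // e.g. "gnu"; a null namespace for standard attributes
  array_slice<const attribute_spec> attributes;
};

// A hypothetical port's gnu-namespace attributes...
static const attribute_spec my_port_gnu_attributes[] = {
  { "interrupt" },
  { "naked" },
};

static const scoped_attribute_specs my_port_gnu_table = {
  "gnu", my_port_gnu_attributes
};

// ...and the hook's new shape: a table of tables, so further
// namespaces (clang, vendor, ...) can sit alongside "gnu".
static const scoped_attribute_specs *const my_port_attribute_table[] = {
  &my_port_gnu_table,
};

int main () { return 0; }
```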

Bootstrapped & regtested on aarch64-linux-gnu and x86_64-linux-gnu.
Also tested on the full target list in config-list.mk.  OK to install?

Richard


gcc/
* attribs.h (scoped_attribute_specs): New structure.
(register_scoped_attributes): Take a reference to a
scoped_attribute_specs instead of separate namespace and array
parameters.
* plugin.h (register_scoped_attributes): Likewise.
* attribs.cc (register_scoped_attributes): Likewise.
(attribute_tables): Change into an array of scoped_attribute_specs
pointers.  Reduce to 1 element for frontends and 1 element for targets.
(empty_attribute_table): Delete.
(check_attribute_tables): Update for changes to attribute_tables.
Use a hash_set to identify duplicates.
(handle_ignored_attributes_option): Update for above changes.
(init_attributes): Likewise.
(excl_pair): Delete.
(test_attribute_exclusions): Update for above changes.  Don't
enforce symmetry for standard attributes in the top-level namespace.
* langhooks-def.h (LANG_HOOKS_COMMON_ATTRIBUTE_TABLE): Delete.
(LANG_HOOKS_FORMAT_ATTRIBUTE_TABLE): Likewise.
(LANG_HOOKS_INITIALIZER): Update accordingly.
(LANG_HOOKS_ATTRIBUTE_TABLE): Define to an empty constructor.
* langhooks.h (lang_hooks::common_attribute_table): Delete.
(lang_hooks::format_attribute_table): Likewise.
(lang_hooks::attribute_table): Redefine to an array of
scoped_attribute_specs pointers.
* target-def.h (TARGET_GNU_ATTRIBUTES): New macro.
* target.def (attribute_spec): Redefine to return an array of
scoped_attribute_specs pointers.
* tree-inline.cc (function_attribute_inlinable_p): Update accordingly.
* doc/tm.texi: Regenerate.
* config/aarch64/aarch64.cc (aarch64_attribute_table): Define using
TARGET_GNU_ATTRIBUTES.
* config/alpha/alpha.cc (vms_attribute_table): Likewise.
* config/avr/avr.cc (avr_attribute_table): Likewise.
* config/bfin/bfin.cc (bfin_attribute_table): Likewise.
* config/bpf/bpf.cc (bpf_attribute_table): Likewise.
* config/csky/csky.cc (csky_attribute_table): Likewise.
* config/epiphany/epiphany.cc (epiphany_attribute_table): Likewise.
* config/gcn/gcn.cc (gcn_attribute_table): Likewise.

Re: [PATCH V3] Support folding min(poly,poly) to const

2023-09-08 Thread Richard Sandiford via Gcc-patches
Lehua Ding  writes:
> V3 change: Address Richard's comments.
>
> Hi,
>
> This patch adds support that tries to fold `MIN (poly, poly)` to
> a constant. Consider the following C Code:
>
> ```
> void foo2 (int* restrict a, int* restrict b, int n)
> {
> for (int i = 0; i < 3; i += 1)
>   a[i] += b[i];
> }
> ```
>
> Before this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>   unsigned long _32;
>
>   <bb 2> [local count: 268435456]:
>   _32 = MIN_EXPR <3, POLY_INT_CST [4, 4]>;
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, _32, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, _32, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, _32, 0, vect__7.27_9); [tail 
> call]
>   return;
>
> }
> ```
>
> After this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>
>   <bb 2> [local count: 268435456]:
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, 3, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, 3, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, 3, 0, vect__7.27_9); [tail call]
>   return;
>
> }
> ```
>
> For RISC-V RVV, csrr and branch instructions can be reduced:
>
> Before this patch:
>
> ```
> foo2:
> csrra4,vlenb
> srlia4,a4,2
> li  a5,3
> bleua5,a4,.L5
> mv  a5,a4
> .L5:
> vsetvli zero,a5,e32,m1,ta,ma
> ...
> ```
>
> After this patch.
>
> ```
> foo2:
>   vsetivlizero,3,e32,m1,ta,ma
> ...
> ```
>
> Best,
> Lehua
>
> gcc/ChangeLog:
>
>   * fold-const.cc (can_min_p): New function.
>   (poly_int_binop): Try fold MIN_EXPR.

OK, thanks.

Richard

> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/vls/div-1.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/vls/shift-3.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/fold-min-poly.c: New test.
>
> ---
>  gcc/fold-const.cc | 24 +++
>  .../riscv/rvv/autovec/fold-min-poly.c | 24 +++
>  .../gcc.target/riscv/rvv/autovec/vls/div-1.c  |  2 +-
>  .../riscv/rvv/autovec/vls/shift-3.c   |  2 +-
>  4 files changed, 50 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1da498a3152..d19b4666c65 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -1213,6 +1213,25 @@ wide_int_binop (wide_int &res,
>return true;
>  }
>  
> +/* Returns true if we know who is smaller or equal, ARG1 or ARG2, and set the
> +   min value to RES.  */
> +bool
> +can_min_p (const_tree arg1, const_tree arg2, poly_wide_int &res)
> +{
> +  if (known_le (wi::to_poly_widest (arg1), wi::to_poly_widest (arg2)))
> +{
> +  res = wi::to_poly_wide (arg1);
> +  return true;
> +}
> +  else if (known_le (wi::to_poly_widest (arg2), wi::to_poly_widest (arg1)))
> +{
> +  res = wi::to_poly_wide (arg2);
> +  return true;
> +}
> +
> +  return false;
> +}
> +
>  /* Combine two poly int's ARG1 and ARG2 under operation CODE to
> produce a new constant in RES.  Return FALSE if we don't know how
> to evaluate CODE at compile-time.  */
> @@ -1261,6 +1280,11 @@ poly_int_binop (poly_wide_int &res, enum tree_code code,
>   return false;
>break;
>  
> +case MIN_EXPR:
> +  if (!can_min_p (arg1, arg2, res))
> + return false;
> +  break;
> +
>  default:
>return false;
>  }
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> new file mode 100644
> index 000..de4c472c76e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options " -march=rv64gcv_zvl128b -mabi=lp64d -O3 --param 
> riscv-autovec-preference=scalable --param riscv-autovec-lmul=m1 
> -fno-vect-cost-model" } */
> +
> +void foo1 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 4; i += 1)
> +  a[i] += b[i];
> +}
> +
> +void foo2 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 3; i += 1)
> +  a[i] += b[i];
> +}
> +
> +void foo3 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 5; i += 1)
> +  a[i] += b[i];
> +}
> +
> +/* { dg-final { scan-assembler-not {\tcsrr\t} } } */
> +/* { dg-final { scan-assembler {\tvsetivli\tzero,4,e32,m1,t[au],m[au]} } } */
> +/* { dg-final { scan-assembler {\tvsetivli\tzero,3,e32,m1,t[au],m[au]} } } */
> diff --git 

Re: [PATCH V2] Support folding min(poly,poly) to const

2023-09-08 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> Lehua Ding  writes:
>> Hi,
>>
>> This patch adds support that tries to fold `MIN (poly, poly)` to
>> a constant. Consider the following C Code:
>>
>> ```
>> void foo2 (int* restrict a, int* restrict b, int n)
>> {
>> for (int i = 0; i < 3; i += 1)
>>   a[i] += b[i];
>> }
>> ```
>>
>> Before this patch:
>>
>> ```
>> void foo2 (int * restrict a, int * restrict b, int n)
>> {
>>   vector([4,4]) int vect__7.27;
>>   vector([4,4]) int vect__6.26;
>>   vector([4,4]) int vect__4.23;
>>   unsigned long _32;
>>
>>   <bb 2> [local count: 268435456]:
>>   _32 = MIN_EXPR <3, POLY_INT_CST [4, 4]>;
>>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, _32, 0);
>>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, _32, 0);
>>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, _32, 0, vect__7.27_9); [tail 
>> call]
>>   return;
>>
>> }
>> ```
>>
>> After this patch:
>>
>> ```
>> void foo2 (int * restrict a, int * restrict b, int n)
>> {
>>   vector([4,4]) int vect__7.27;
>>   vector([4,4]) int vect__6.26;
>>   vector([4,4]) int vect__4.23;
>>
>>   <bb 2> [local count: 268435456]:
>>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, 3, 0);
>>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, 3, 0);
>>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, 3, 0, vect__7.27_9); [tail 
>> call]
>>   return;
>>
>> }
>> ```
>>
>> For RISC-V RVV, csrr and branch instructions can be reduced:
>>
>> Before this patch:
>>
>> ```
>> foo2:
>> csrra4,vlenb
>> srlia4,a4,2
>> li  a5,3
>> bleua5,a4,.L5
>> mv  a5,a4
>> .L5:
>> vsetvli zero,a5,e32,m1,ta,ma
>> ...
>> ```
>>
>> After this patch.
>>
>> ```
>> foo2:
>>  vsetivlizero,3,e32,m1,ta,ma
>> ...
>> ```
>>
>> Best,
>> Lehua
>>
>> gcc/ChangeLog:
>>
>>  * fold-const.cc (can_min_p): New function.
>>  (poly_int_binop): Try fold MIN_EXPR.
>
> OK, thanks.

Sorry, just realised that the poly_int_tree_p tests are redundant.
The caller has already checked that.

Richard

> Richard
>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/riscv/rvv/autovec/vls/div-1.c: Adjust.
>>  * gcc.target/riscv/rvv/autovec/vls/shift-3.c: Adjust.
>>  * gcc.target/riscv/rvv/autovec/fold-min-poly.c: New test.
>>
>> ---
>>  gcc/fold-const.cc | 27 +++
>>  .../riscv/rvv/autovec/fold-min-poly.c | 24 +
>>  .../gcc.target/riscv/rvv/autovec/vls/div-1.c  |  2 +-
>>  .../riscv/rvv/autovec/vls/shift-3.c   |  2 +-
>>  4 files changed, 53 insertions(+), 2 deletions(-)
>>  create mode 100644 
>> gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>>
>> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> index 1da498a3152..ba4b6f3f3a3 100644
>> --- a/gcc/fold-const.cc
>> +++ b/gcc/fold-const.cc
>> @@ -1213,6 +1213,28 @@ wide_int_binop (wide_int &res,
>>return true;
>>  }
>>  
>> +/* Returns true if we know who is smaller or equal, ARG1 or ARG2, and set the
>> +   min value to RES.  */
>> +bool
>> +can_min_p (const_tree arg1, const_tree arg2, poly_wide_int &res)
>> +{
>> +  if (!poly_int_tree_p (arg1) || !poly_int_tree_p (arg2))
>> +return false;
>> +
>> +  if (known_le (wi::to_poly_widest (arg1), wi::to_poly_widest (arg2)))
>> +{
>> +  res = wi::to_poly_wide (arg1);
>> +  return true;
>> +}
>> +  else if (known_le (wi::to_poly_widest (arg2), wi::to_poly_widest (arg1)))
>> +{
>> +  res = wi::to_poly_wide (arg2);
>> +  return true;
>> +}
>> +
>> +  return false;
>> +}
>> +
>>  /* Combine two poly int's ARG1 and ARG2 under operation CODE to
>> produce a new constant in RES.  Return FALSE if we don't know how
>> to evaluate CODE at compile-time.  */
>> @@ -1261,6 +1283,11 @@ poly_int_binop (poly_wide_int &res, enum tree_code code,
>>  return false;
>>break;
>>  
>> +case MIN_EXPR:
>> +  if (!can_min_p (arg1, arg2, res))
>> +return false;
>> +  break;
>> +
>>  default:
>>return false;
>>  }
>> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c 
>> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>> new file mode 100644
>> index 000..de4c472c76e
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>> @@ -0,0 +1,24 @@
>> +/* { dg-do compile } */
>> +/* { dg-options " -march=rv64gcv_zvl128b -mabi=lp64d -O3 --param 
>> riscv-autovec-preference=scalable --param riscv-autovec-lmul=m1 
>> -fno-vect-cost-model" } */
>> +
>> +void foo1 (int* restrict a, int* restrict b, int n)
>> +{
>> +for (int i = 0; i < 4; i += 1)
>> +  a[i] += b[i];
>> +}
>> +
>> +void foo2 (int* restrict a, int* restrict b, int n)
>> +{
>> +for (int i = 0; i < 3; i += 1)
>> +  a[i] += b[i];
>> +}
>> +
>> +void foo3 (int* restrict a, int* restrict b, int n)

Re: [PATCH V2] Support folding min(poly,poly) to const

2023-09-08 Thread Richard Sandiford via Gcc-patches
Lehua Ding  writes:
> Hi,
>
> This patch adds support that tries to fold `MIN (poly, poly)` to
> a constant. Consider the following C Code:
>
> ```
> void foo2 (int* restrict a, int* restrict b, int n)
> {
> for (int i = 0; i < 3; i += 1)
>   a[i] += b[i];
> }
> ```
>
> Before this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>   unsigned long _32;
>
>   <bb 2> [local count: 268435456]:
>   _32 = MIN_EXPR <3, POLY_INT_CST [4, 4]>;
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, _32, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, _32, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, _32, 0, vect__7.27_9); [tail 
> call]
>   return;
>
> }
> ```
>
> After this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>
>   <bb 2> [local count: 268435456]:
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, 3, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, 3, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, 3, 0, vect__7.27_9); [tail call]
>   return;
>
> }
> ```
>
> For RISC-V RVV, csrr and branch instructions can be reduced:
>
> Before this patch:
>
> ```
> foo2:
> csrra4,vlenb
> srlia4,a4,2
> li  a5,3
> bleua5,a4,.L5
> mv  a5,a4
> .L5:
> vsetvli zero,a5,e32,m1,ta,ma
> ...
> ```
>
> After this patch.
>
> ```
> foo2:
>   vsetivlizero,3,e32,m1,ta,ma
> ...
> ```
>
> Best,
> Lehua
>
> gcc/ChangeLog:
>
>   * fold-const.cc (can_min_p): New function.
>   (poly_int_binop): Try fold MIN_EXPR.

OK, thanks.

Richard

> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/vls/div-1.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/vls/shift-3.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/fold-min-poly.c: New test.
>
> ---
>  gcc/fold-const.cc | 27 +++
>  .../riscv/rvv/autovec/fold-min-poly.c | 24 +
>  .../gcc.target/riscv/rvv/autovec/vls/div-1.c  |  2 +-
>  .../riscv/rvv/autovec/vls/shift-3.c   |  2 +-
>  4 files changed, 53 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1da498a3152..ba4b6f3f3a3 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -1213,6 +1213,28 @@ wide_int_binop (wide_int &res,
>return true;
>  }
>  
> +/* Returns true if we know who is smaller or equal, ARG1 or ARG2, and set the
> +   min value to RES.  */
> +bool
> +can_min_p (const_tree arg1, const_tree arg2, poly_wide_int &res)
> +{
> +  if (!poly_int_tree_p (arg1) || !poly_int_tree_p (arg2))
> +return false;
> +
> +  if (known_le (wi::to_poly_widest (arg1), wi::to_poly_widest (arg2)))
> +{
> +  res = wi::to_poly_wide (arg1);
> +  return true;
> +}
> +  else if (known_le (wi::to_poly_widest (arg2), wi::to_poly_widest (arg1)))
> +{
> +  res = wi::to_poly_wide (arg2);
> +  return true;
> +}
> +
> +  return false;
> +}
> +
>  /* Combine two poly int's ARG1 and ARG2 under operation CODE to
> produce a new constant in RES.  Return FALSE if we don't know how
> to evaluate CODE at compile-time.  */
> @@ -1261,6 +1283,11 @@ poly_int_binop (poly_wide_int &res, enum tree_code code,
>   return false;
>break;
>  
> +case MIN_EXPR:
> +  if (!can_min_p (arg1, arg2, res))
> + return false;
> +  break;
> +
>  default:
>return false;
>  }
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> new file mode 100644
> index 000..de4c472c76e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options " -march=rv64gcv_zvl128b -mabi=lp64d -O3 --param 
> riscv-autovec-preference=scalable --param riscv-autovec-lmul=m1 
> -fno-vect-cost-model" } */
> +
> +void foo1 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 4; i += 1)
> +  a[i] += b[i];
> +}
> +
> +void foo2 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 3; i += 1)
> +  a[i] += b[i];
> +}
> +
> +void foo3 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 5; i += 1)
> +  a[i] += b[i];
> +}
> +
> +/* { dg-final { scan-assembler-not {\tcsrr\t} } } */
> +/* { dg-final { scan-assembler {\tvsetivli\tzero,4,e32,m1,t[au],m[au]} } } */
> +/* { dg-final { scan-assembler {\tvsetivli\tzero,3,e32,m1,t[au],m[au]} } } */
> diff --git 

Re: [PATCH] Support folding min(poly,poly) to const

2023-09-08 Thread Richard Sandiford via Gcc-patches
Lehua Ding  writes:
> Hi,
>
> This patch adds support that tries to fold `MIN (poly, poly)` to
> a constant. Consider the following C Code:
>
> ```
> void foo2 (int* restrict a, int* restrict b, int n)
> {
> for (int i = 0; i < 3; i += 1)
>   a[i] += b[i];
> }
> ```
>
> Before this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>   unsigned long _32;
>
>   <bb 2> [local count: 268435456]:
>   _32 = MIN_EXPR <3, POLY_INT_CST [4, 4]>;
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, _32, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, _32, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, _32, 0, vect__7.27_9); [tail 
> call]
>   return;
>
> }
> ```
>
> After this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>
>   <bb 2> [local count: 268435456]:
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, 3, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, 3, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, 3, 0, vect__7.27_9); [tail call]
>   return;
>
> }
> ```
>
> For RISC-V RVV, one branch instruction can be reduced:
>
> Before this patch:
>
> ```
> foo2:
> csrra4,vlenb
> srlia4,a4,2
> li  a5,3
> bleua5,a4,.L5
> mv  a5,a4
> .L5:
> vsetvli zero,a5,e32,m1,ta,ma
> ...
> ```
>
> After this patch.
>
> ```
> foo2:
>   vsetivlizero,3,e32,m1,ta,ma
> ...
> ```
>
> Best,
> Lehua
>
> gcc/ChangeLog:
>
>   * fold-const.cc (can_min_p): New function.
>   (poly_int_binop): Try fold MIN_EXPR.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/vls/div-1.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/vls/shift-3.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/fold-min-poly.c: New test.
>
> ---
>  gcc/fold-const.cc | 33 +++
>  .../riscv/rvv/autovec/fold-min-poly.c | 24 ++
>  .../gcc.target/riscv/rvv/autovec/vls/div-1.c  |  2 +-
>  .../riscv/rvv/autovec/vls/shift-3.c   |  2 +-
>  4 files changed, 59 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1da498a3152..f7f793cc326 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -1213,6 +1213,34 @@ wide_int_binop (wide_int &res,
>return true;
>  }
>  
> +/* Returns true if we know who is smaller or equal, ARG1 or ARG2, and set the
> +   min value to RES.  */
> +bool
> +can_min_p (const_tree arg1, const_tree arg2, poly_wide_int &res)
> +{
> +  if (tree_fits_poly_int64_p (arg1) && tree_fits_poly_int64_p (arg2))
> +{
> +  if (known_le (tree_to_poly_int64 (arg1), tree_to_poly_int64 (arg2)))
> + res = wi::to_poly_wide (arg1);
> +  else if (known_le (tree_to_poly_int64 (arg2), tree_to_poly_int64 (arg1)))
> + res = wi::to_poly_wide (arg2);
> +  else
> + return false;
> +}
> +  else if (tree_fits_poly_uint64_p (arg1) && tree_fits_poly_uint64_p (arg2))
> +{
> +  if (known_le (tree_to_poly_uint64 (arg1), tree_to_poly_uint64 (arg2)))
> + res = wi::to_poly_wide (arg1);
> +  else if (known_le (tree_to_poly_int64 (arg2), tree_to_poly_int64 (arg1)))
> + res = wi::to_poly_wide (arg2);
> +  else
> + return false;
> +}
> +  else
> +return false;
> +  return true;
> +}

I think this should instead use poly_int_tree_p and wi::to_poly_widest.
There's no need to handle int64 and uint64 separately.  (And there's
no need to handle just 64-bit types.)

Thanks,
Richard

> +
>  /* Combine two poly int's ARG1 and ARG2 under operation CODE to
> produce a new constant in RES.  Return FALSE if we don't know how
> to evaluate CODE at compile-time.  */
> @@ -1261,6 +1289,11 @@ poly_int_binop (poly_wide_int &res, enum tree_code code,
>   return false;
>break;
>  
> +case MIN_EXPR:
> +  if (!can_min_p (arg1, arg2, res))
> + return false;
> +  break;
> +
>  default:
>return false;
>  }
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> new file mode 100644
> index 000..de4c472c76e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options " -march=rv64gcv_zvl128b -mabi=lp64d -O3 --param 
> riscv-autovec-preference=scalable --param riscv-autovec-lmul=m1 
> -fno-vect-cost-model" } */
> +
> +void foo1 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 5; i += 1)

Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-07 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
> Hi Richard,
>
> I did some testing with the attached v2 that does not restrict to UNARY
> anymore.  As feared ;) there is some more fallout that I'm detailing below.
>
> On Power there is one guality fail (pr43051-1.c) that I would take
> the liberty of ignoring for now.
>
> On x86 there are four fails:
>
>  - cond_op_addsubmuldiv__Float16-2.c: assembler error
>unsupported masking for `vmovsh'.  I guess that's a latent backend
>problem.
>
>  - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
>(cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2)) 0).
>That looks like a costing problem and can hopefully be solved by making
>the second compare more expensive, preventing the propagation.
>i386 costing (or every costing?) is brittle so that could well break other
>things. 
>
>  - pr88873.c: This is interesting because even before this patch we
>propagated with different register classes (V2DF vs DI).  With the patch
>we check the register pressure, find the class NO_REGS for V2DF and
>abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
>of keeping the old behavior for reg-reg propagations and only checking
>the pressure for more complex operations.
>
> aarch64 has the most fails:
>
>  - One guality fail (same as Power).
>  - shrn-combine-[123].c as before.
>
>  - A class of (hopefully, I only checked some) similar cases where we
>propagate an unspec_whilelo into an unspec_ptest.  Before we would only
>set a REG_EQUALS note.
>Before we managed to create a while_ultsivnx16bi_cc whereas now we have
>while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
>We create redundant whilelos and I'm not sure how to improve that. I
>guess a peephole is out of the question :)

Yeah, I think this is potentially a blocker for propagating A into B
when A is used elsewhere.  Combine is able to combine A and B while
keeping A in parallel with the result.  I think either fwprop would
need to try that too, or it would need to be restricted to cases where A
is only used in B.

I don't think that's really a UNARY_P-or-not thing.  The same problem
would in principle apply to plain unary operations.

>  - pred-combine-and.c: Here the new propagation appears useful at first.

I'm not sure about that, because...

>We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
>individual and registers remain live up to the propagation site (while
>being dead before the patch).
>With the registers dead, combine could create a single fcmgt before.
>Now it only manages a 2->2 combination because we still need the registers
>and end up with two fcmgts.
>The code is worse but this seems more bad luck than anything.

...the fwprop1 change is:

(insn 26 25 27 4 (set (reg:VNx4BI 102 [ vec_mask_and_64 ])
(and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
(reg:VNx4BI 98 [ loop_mask_58 ]))) 8420 {andvnx4bi3}
 (nil))
...
(insn 31 30 32 4 (set (reg:VNx4BI 106 [ mask__24.18 ])
(and:VNx4BI (reg:VNx4BI 118 [ mask__25.17 ])
(reg:VNx4BI 102 [ vec_mask_and_64 ]))) 8420 {andvnx4bi3}
 (nil))

to:

(insn 26 25 27 4 (set (reg:VNx4BI 102 [ vec_mask_and_64 ])
(and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
(reg:VNx4BI 98 [ loop_mask_58 ]))) 8420 {andvnx4bi3}
 (expr_list:REG_DEAD (reg:VNx4BI 116 [ mask__30.13 ])
(expr_list:REG_DEAD (reg:VNx4BI 98 [ loop_mask_58 ])
(nil))))
...
(insn 31 30 32 4 (set (reg:VNx4BI 106 [ mask__24.18 ])
(and:VNx4BI (and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
(reg:VNx4BI 98 [ loop_mask_58 ]))
(reg:VNx4BI 118 [ mask__25.17 ]))) 8428 {aarch64_pred_andvnx4bi_z}
 (nil))

On its own this isn't worse.  But it's also not a win.  The before and
after sequences have equal cost, but the after sequence is more complex.

That would probably be OK for something that runs near the end of
the pre-RA pipeline, since it could in principle increase parallelism.
But it's probably a bad idea when we know that the main instruction
combination pass is still to run.  By making insn 31 more complex,
we're making it a less likely combination candidate.

So this isn't necessarily the wrong thing to do.  But I think it is
the wrong time to do it.

>  - Addressing fails from before:  I looked into these and suspect all of
>them are a similar.
>What happens is that we have a poly_int offset that we shift, negate
>and then add to x0.  The result is used as load address.
>Before, we would pull (combine) the (plus x0 reg) into the load keeping
>the neg and shift.
>Now we propagate everything into a single (set (minus x0 offset)).
>The propagation itself seems worthwhile because we save one insn.
>However as we got rid of the base/offset split by lumping everything
>together, combine cannot pull the (plus) into the 

[PATCH] Tweak language choice in config-list.mk

2023-09-07 Thread Richard Sandiford via Gcc-patches
When I tried to use config-list.mk, the build for every triple except
the build machine's failed for m2.  This is because, unlike other
languages, m2 builds target objects during all-gcc.  The build will
therefore fail unless you have access to an appropriate binutils
(or an equivalent).  That's quite a big ask for over 100 targets. :)

This patch therefore makes m2 an optional inclusion.

Doing that wasn't entirely straightforward though.  The current
configure line includes "--enable-languages=all,...", which means
that the "..." can only force languages to be added that otherwise
wouldn't have been.  (I.e. the only effect of the "..." is to
override configure autodetection.)

The choice of all,ada and:

  # Make sure you have a recent enough gcc (with ada support) in your path so
  # that --enable-werror-always will work.

make it clear that lack of GNAT should be a build failure rather than
silently ignored.  This predates the D frontend, which requires GDC
in the same way that Ada requires GNAT.  I don't know of a reason
why D should be treated differently.

The patch therefore expands the "all" into a specific list of
languages.

That in turn meant that Fortran had to be handled specially,
since bpf and mmix don't support Fortran.

Perhaps there's an argument that m2 shouldn't build target objects
during all-gcc, but (a) it works for practical usage and (b) the
patch is an easy workaround.  I'd be happy for the patch to be
reverted if the build system changes.

OK to install?

Richard


gcc/
* contrib/config-list.mk (OPT_IN_LANGUAGES): New variable.
($(LIST)): Replace --enable-languages=all with a specific list.
Disable fortran on bpf and mmix.  Enable the languages in
OPT_IN_LANGUAGES.
---
 contrib/config-list.mk | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/contrib/config-list.mk b/contrib/config-list.mk
index e570b13c71b..50ecb014bc0 100644
--- a/contrib/config-list.mk
+++ b/contrib/config-list.mk
@@ -12,6 +12,11 @@ TEST=all-gcc
 # supply an absolute path.
 GCC_SRC_DIR=../../gcc
 
+# Define this to ,m2 if you want to build Modula-2.  Modula-2 builds target
+# objects during all-gcc, so it can only be included if you've installed
+# binutils (or an equivalent) for each target.
+OPT_IN_LANGUAGES=
+
 # Use -j / -l make arguments and nice to assure a smooth resource-efficient
 # load on the build machine, e.g. for 24 cores:
 # svn co svn://gcc.gnu.org/svn/gcc/branches/foo-branch gcc
@@ -126,17 +131,23 @@ $(LIST): make-log-dir
 	TGT=`echo $@ | awk 'BEGIN { FS = "OPT" }; { print $$1 }'` &&		\
 	TGT=`$(GCC_SRC_DIR)/config.sub $$TGT` &&				\
 	case $$TGT in								\
-		*-*-darwin* | *-*-cygwin* | *-*-mingw* | *-*-aix* | bpf-*-*)	\
+		bpf-*-*)							\
 			ADDITIONAL_LANGUAGES="";				\
 			;;							\
-		*)								\
+		*-*-darwin* | *-*-cygwin* | *-*-mingw* | *-*-aix* | bpf-*-*)	\
+			ADDITIONAL_LANGUAGES=",fortran";			\
+			;;							\
+		mmix-*-*)							\
 			ADDITIONAL_LANGUAGES=",go";				\
 			;;							\
+		*)								\
+			ADDITIONAL_LANGUAGES=",fortran,go";			\
+			;;							\
 	esac &&									\
 	$(GCC_SRC_DIR)/configure						\
 		--target=$(subst SCRIPTS,`pwd`/../scripts/,$(subst OPT,$(empty) -,$@)) \
 		--enable-werror-always ${host_options}				\
-		--enable-languages=all,ada$$ADDITIONAL_LANGUAGES;		\
+		--enable-languages=c,ada,c++,d,lto,objc,obj-c++,rust$$ADDITIONAL_LANGUAGES$(OPT_IN_LANGUAGES); \
 	) > log/$@-config.out 2>&1
 
 $(LOGFILES) : log/%-make.out : %
-- 
2.25.1



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-06 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
> Hi Richard,
>
> I did some testing with the attached v2 that does not restrict to UNARY
> anymore.  As feared ;) there is some more fallout that I'm detailing below.
>
> On Power there is one guality fail (pr43051-1.c) that I would take
> the liberty of ignoring for now.
>
> On x86 there are four fails:
>
>  - cond_op_addsubmuldiv__Float16-2.c: assembler error
>unsupported masking for `vmovsh'.  I guess that's a latent backend
>problem.
>
>  - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
>(cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0).
>That looks like a costing problem and can hopefully be solved by making
>the second compare more expensive, preventing the propagation.
>i386 costing (or every costing?) is brittle so that could well break other
>things. 
>
>  - pr88873.c: This is interesting because even before this patch we
>propagated with different register classes (V2DF vs DI).  With the patch
>we check the register pressure, find the class NO_REGS for V2DF and
>abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
>of keeping the old behavior for reg-reg propagations and only checking
>the pressure for more complex operations.
>
> aarch64 has the most fails:
>
>  - One guality fail (same as Power).
>  - shrn-combine-[123].c as before.
>
>  - A class of (hopefully, I only checked some) similar cases where we
>propagate an unspec_whilelo into an unspec_ptest.  Before we would only
>set a REG_EQUALS note.
>Before we managed to create a while_ultsivnx16bi_cc whereas now we have
>while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
>We create redundant whilelos and I'm not sure how to improve that. I
>guess a peephole is out of the question :)
>
>  - pred-combine-and.c: Here the new propagation appears useful at first.
>We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
>individual and registers remain live up to the propagation site (while
>being dead before the patch).
>With the registers dead, combine could create a single fcmgt before.
>Now it only manages a 2->2 combination because we still need the registers
>and end up with two fcmgts.
>The code is worse but this seems more bad luck than anything.
>
>  - Addressing fails from before:  I looked into these and suspect all of
>them are similar.
>What happens is that we have a poly_int offset that we shift, negate
>and then add to x0.  The result is used as load address.
>Before, we would pull (combine) the (plus x0 reg) into the load keeping
>the neg and shift.
>Now we propagate everything into a single (set (minus x0 offset)).
>The propagation itself seems worthwhile because we save one insn.
>However as we got rid of the base/offset split by lumping everything
>together, combine cannot pull the (plus) into the address load and
>we require an aarch64_split_add_offset.  This will emit the longer
>sequence of ashiftl and subtract.  The "base" address is x0 here so
>we cannot convert (minus x0 ...)) into neg.
>I didn't go through all of aarch64_split_add_offset.  I suppose we
>could re-add the separation of base/offset there but that might be
>a loss when the result is not used as an address. 
>
> Again, all in all no fatal problems but pretty annoying :)  It's not much
> but just gradually worse than with just UNARY.  Any idea on how/whether to
> continue?

Thanks for giving it a go.  Can you post the latest version of the
regpressure patch too?  The previous on-list version I could find
seems to be too old.

Thanks,
Richard

> Regards
>  Robin
>
> gcc/ChangeLog:
>
>   * fwprop.cc (fwprop_propagation::profitable_p): Add unary
>   handling.
>   (fwprop_propagation::update_register_pressure): New function.
>   (fwprop_propagation::register_pressure_high_p): New function
>   (reg_single_def_for_src_p): Look through unary expressions.
>   (try_fwprop_subst_pattern): Check register pressure.
>   (forward_propagate_into): Call new function.
>   (fwprop_init): Init register pressure.
>   (fwprop_done): Clean up register pressure.
>   (fwprop_insn): Add comment.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
> ---
>  gcc/fwprop.cc | 359 +-
>  .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 
>  2 files changed, 419 insertions(+), 4 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
>
> diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
> index 0707a234726..ce6f5a74b00 100644
> --- a/gcc/fwprop.cc
> +++ b/gcc/fwprop.cc
> @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "rtl-iter.h"
>  #include "target.h"
> +#include "dominance.h"

Re: Question on aarch64 prologue code.

2023-09-06 Thread Richard Sandiford via Gcc
Iain Sandoe  writes:
> Hi Folks,
>
> On the Darwin aarch64 port, we have a number of cleanup test fails (pretty 
> much corresponding to the [still open] 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39244).  However, let’s assume 
> that bug could be a red herring..
>
> the underlying reason is missing CFI for the set of the FP which [with 
> Darwin’s LLVM libunwind impl.] breaks the unwind through the function that 
> triggers a signal.

Just curious, do you have more details about why that is?  If the unwinder
is sophisticated enough to process CFI, it seems odd that it requires the
CFA to be defined in terms of the frame pointer.
>
> ———
>
> taking one of the functions in cleanup-8.C (say fn1) which contains calls.
>
> what I am seeing is something like:
>
> __ZL3fn1v:
> LFB28:
> ; BLOCK 2, count:1073741824 (estimated locally) seq:0
> ; PRED: ENTRY [always]  count:1073741824 (estimated locally, freq 1.) 
> (FALLTHRU)
>   stp x29, x30, [sp, -32]!
> // LCFI; or .cfi_xxx is present
>   mov x29, sp
> // *** NO  LCFI (or .cfi_cfa_ when that is enabled)
>   str x19, [sp, 16]
> // LCFI / .cfi_ is present.
>   adrpx19, __ZL3fn4i@PAGE
>   add x19, x19, __ZL3fn4i@PAGEOFF;momd
>   mov x1, x19
>   mov w0, 11
>   bl  _signal
> 
>
> ———
>
> The reason seems to be that, in expand_prolog, emit_frame_chain is true (as 
> we would expect, given that this function makes calls).  However 
> ‘frame_pointer_needed' is false, so that the call to aarch64_add_offset() 
> [line aarch64.cc:10405] does not add CFA adjustments to the load of x29.

Right.

> ———
>
> I have currently worked around this by defining a 
> TARGET_FRAME_POINTER_REQUIRED which returns true unless the function is a 
> leaf (if that’s the correct solution, then all is fine).

I suppose it depends on why the frame-pointer-based CFA is important
for Darwin.  If it's due to a more general requirement for a frame
pointer to be used, then yeah, that's probably the right fix.  If it's
more a quirk of the unwinder, then we could probably expose whatever
that quirk is as a new status bit.  Target-independent code in
dwarf2cfi.cc would then need to be aware as well.

> ———
>
> However, it does seem odd that the existing code sets up the FP, but never 
> produces any CFA for it.
>
> So is this a possible bug, or just that I misunderstand the relevant set of 
> circumstances?

emit_frame_chain fulfills an ABI requirement that every non-leaf function
set up a frame-chain record.  When emit_frame_chain && !frame_pointer_needed,
we set up the FP for ABI purposes only.  GCC can still access everything
relative to the stack pointer, and it can still describe the CFI based
purely on the stack pointer.

glibc-based systems only need the CFA to be based on the frame pointer
if the stack pointer moves during the body of the function (usually due
to alloca or VLAs).
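
For example (a minimal illustration, not taken from the thread):

  /* The stack pointer moves by a runtime amount here, so the CFA can
     no longer be described as sp + constant and has to be based on
     the frame pointer instead.  */
  void
  f (int n)
  {
    char buf[n];	/* VLA: sp is adjusted at run time.  */
    extern void use (char *);
    use (buf);
  }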

Thanks,
Richard


Re: [PATCH] LoongArch: Fix unintentional bash-ism in r14-3665.

2023-09-06 Thread Richard Sandiford via Gcc-patches
Yang Yujie  writes:
> gcc/ChangeLog:
>
>   * config.gcc: remove non-POSIX syntax "<<<".

OK.  Thanks for the quick fix.

Richard.

> ---
>  gcc/config.gcc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index b2fe7c7ceef..6d4c8becd28 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -5189,7 +5189,7 @@ case "${target}" in
>   if test x${parse_state} = x"abi-base"; then
>   # Base ABI type
>   case ${component} in
> - lp64d | lp64f | lp64s) 
> elem_tmp="ABI_BASE_$(tr a-z A-Z <<< ${component}),";;
> + lp64d | lp64f | lp64s) 
> elem_tmp="ABI_BASE_$(echo ${component} | tr a-z A-Z),";;
>   *)
>   echo "Unknown base ABI 
> \"${component}\" in --with-multilib-list." 1>&2
>   exit 1


Re: [PATCH v1 2/6] LoongArch: improved target configuration interface

2023-09-06 Thread Richard Sandiford via Gcc-patches
Yang Yujie  writes:
> @@ -5171,25 +5213,21 @@ case "${target}" in
>   # ${with_multilib_list} should not contain whitespaces,
>   # consecutive commas or slashes.
>   if echo "${with_multilib_list}" \
> - | grep -E -e "[[:space:]]" -e '[,/][,/]' -e '[,/]$' -e '^[,/]' 
> > /dev/null; then
> + | grep -E -e "[[:space:]]" -e '[,/][,/]' -e '[,/]$' -e '^[,/]' 
> > /dev/null 2>&1; then
>   echo "Invalid argument to --with-multilib-list." 1>&2
>   exit 1
>   fi
>  
> - unset component idx elem_abi_base elem_abi_ext elem_tmp
> + unset component elem_abi_base elem_abi_ext elem_tmp parse_state 
> all_abis
>   for elem in $(echo "${with_multilib_list}" | tr ',' ' '); do
> - idx=0
> - while true; do
> - idx=$((idx + 1))
> - component=$(echo "${elem}" | awk -F'/' '{print 
> $'"${idx}"'}')
> -
> - case ${idx} in
> - 1)
> - # Component 1: Base ABI type
> + unset elem_abi_base elem_abi_ext
> + parse_state="abi-base"
> +
> + for component in $(echo "${elem}" | tr '/' ' '); do
> + if test x${parse_state} = x"abi-base"; then
> + # Base ABI type
>   case ${component} in
> - lp64d) elem_tmp="ABI_BASE_LP64D,";;
> - lp64f) elem_tmp="ABI_BASE_LP64F,";;
> - lp64s) elem_tmp="ABI_BASE_LP64S,";;
> + lp64d | lp64f | lp64s) 
> elem_tmp="ABI_BASE_$(tr a-z A-Z <<< ${component}),";;

"<<<" isn't portable shell.  Could you try with:

  echo ${component} | tr ...

instead?

As it stands, this causes a bootstrap failure with non-bash shells
such as dash, even on non-Loongson targets.

(Part of me wishes that we'd just standardise on bash.  But since that
isn't the policy, I sometimes use dash to pick up my own lapses.)

Thanks,
Richard


Re: [PATCH 10/11] aarch64: Fix branch-protection error message tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> Update tests for the new branch-protection parser errors.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/branch-protection-attr.c: Update.
>   * gcc.target/aarch64/branch-protection-option.c: Update.

OK, thanks.  (And I agree these are better messages. :))

I think that's the last of the AArch64-specific ones.  The others
will need to be reviewed by Kyrill or Richard.

Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c   | 6 +++---
>  gcc/testsuite/gcc.target/aarch64/branch-protection-option.c | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c 
> b/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c
> index 272000c2747..dae2a758a56 100644
> --- a/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c
> +++ b/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c
> @@ -4,19 +4,19 @@ void __attribute__ ((target("branch-protection=leaf")))
>  foo1 ()
>  {
>  }
> -/* { dg-error {invalid protection type 'leaf' in 
> 'target\("branch-protection="\)' pragma or attribute} "" { target *-*-* } 5 } 
> */
> +/* { dg-error {invalid argument 'leaf' for 'target\("branch-protection="\)'} 
> "" { target *-*-* } 5 } */
>  /* { dg-error {pragma or attribute 'target\("branch-protection=leaf"\)' is 
> not valid} "" { target *-*-* } 5 } */
>  
>  void __attribute__ ((target("branch-protection=none+pac-ret")))
>  foo2 ()
>  {
>  }
> -/* { dg-error "unexpected 'pac-ret' after 'none'" "" { target *-*-* } 12 } */
> +/* { dg-error {argument 'none' can only appear alone in 
> 'target\("branch-protection="\)'} "" { target *-*-* } 12 } */
>  /* { dg-error {pragma or attribute 
> 'target\("branch-protection=none\+pac-ret"\)' is not valid} "" { target *-*-* 
> } 12 } */
>  
>  void __attribute__ ((target("branch-protection=")))
>  foo3 ()
>  {
>  }
> -/* { dg-error {missing argument to 'target\("branch-protection="\)' pragma 
> or attribute} "" { target *-*-* } 19 } */
> +/* { dg-error {invalid argument '' for 'target\("branch-protection="\)'} "" 
> { target *-*-* } 19 } */
>  /* { dg-error {pragma or attribute 'target\("branch-protection="\)' is not 
> valid} "" { target *-*-* } 19 } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c 
> b/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c
> index 1b3bf4ee2b8..e2f847a31c4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c
> +++ b/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c
> @@ -1,4 +1,4 @@
>  /* { dg-do "compile" } */
>  /* { dg-options "-mbranch-protection=leaf -mbranch-protection=none+pac-ret" 
> } */
>  
> -/* { dg-error "unexpected 'pac-ret' after 'none'"  "" { target *-*-* } 0 } */
> +/* { dg-error "argument 'none' can only appear alone in 
> '-mbranch-protection='" "" { target *-*-* } 0 } */


Re: [PATCH 07/11] aarch64: Disable branch-protection for pcs tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> The tests manipulate the return address in abitest-2.h and thus are not
> compatible with -mbranch-protection=pac-ret+leaf or
> -mbranch-protection=gcs.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/aapcs64/func-ret-1.c: Disable branch-protection.
>   * gcc.target/aarch64/aapcs64/func-ret-2.c: Likewise.
>   * gcc.target/aarch64/aapcs64/func-ret-3.c: Likewise.
>   * gcc.target/aarch64/aapcs64/func-ret-4.c: Likewise.
>   * gcc.target/aarch64/aapcs64/func-ret-64x1_1.c: Likewise.

OK, thanks.

Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c | 1 +
>  5 files changed, 5 insertions(+)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c
> index 5405e1e4920..7bd7757efe6 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c
> @@ -4,6 +4,7 @@
> AAPCS64 \S 4.1.  */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  
>  #ifndef IN_FRAMEWORK
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c
> index 6b171c46fbb..85a822ace4a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c
> @@ -4,6 +4,7 @@
> Homogeneous floating-point aggregate types are covered in func-ret-3.c.  
> */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  
>  #ifndef IN_FRAMEWORK
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c
> index ad312b675b9..1d35ebf14b4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c
> @@ -4,6 +4,7 @@
> in AAPCS64 \S 4.3.5.  */
>  
>  /* { dg-do run { target aarch64-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  /* { dg-require-effective-target aarch64_big_endian } */
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c
> index af05fbe9fdf..15e1408c62d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c
> @@ -5,6 +5,7 @@
> are treated as general composite types.  */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  /* { dg-require-effective-target aarch64_big_endian } */
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c
> index 05957e2dcae..fe7bbb6a835 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c
> @@ -3,6 +3,7 @@
>Test 64-bit singleton vector types which should be in FP/SIMD registers.  
> */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  
>  #ifndef IN_FRAMEWORK


Re: [PATCH 06/11] aarch64: Fix pac-ret eh_return tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> This is needed since eh_return no longer prevents pac-ret in the
> normal return path.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/return_address_sign_1.c: Move func4 to ...
>   * gcc.target/aarch64/return_address_sign_2.c: ... here and fix the
>   scan asm check.
>   * gcc.target/aarch64/return_address_sign_b_1.c: Move func4 to ...
>   * gcc.target/aarch64/return_address_sign_b_2.c: ... here and fix the
>   scan asm check.
> ---
>  .../gcc.target/aarch64/return_address_sign_1.c  | 13 +
>  .../gcc.target/aarch64/return_address_sign_2.c  | 17 +++--
>  .../aarch64/return_address_sign_b_1.c   | 11 ---
>  .../aarch64/return_address_sign_b_2.c   | 17 +++--
>  4 files changed, 31 insertions(+), 27 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c
> index 232ba67ade0..114a9dacb3f 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c
> @@ -37,16 +37,5 @@ func3 (int a, int b, int c)
>/* autiasp */
>  }
>  
> -/* eh_return.  */
> -void __attribute__ ((target ("arch=armv8.3-a")))
> -func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> -{
> -  /* no paciasp */
> -  *ptr = imm1 + foo (imm1) + imm2;
> -  __builtin_eh_return (offset, handler);
> -  /* no autiasp */
> -  return;
> -}
> -
> -/* { dg-final { scan-assembler-times "autiasp" 3 } } */
>  /* { dg-final { scan-assembler-times "paciasp" 3 } } */
> +/* { dg-final { scan-assembler-times "autiasp" 3 } } */

I suppose there is no normal return path here.  I don't know how quickly
we'd realise that though, in the sense that the flag register becomes known-1.
But a quick-and-dirty check would be whether the exit block has a single
predecessor, which in a function that calls eh_return should mean
that the eh_return is unconditional.
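
A sketch of that check (illustrative only; no_normal_return_p is a
made-up name, not an existing aarch64 function):

  /* If the exit block has a single predecessor in a function that
     calls eh_return, the eh_return is unconditional, so there is no
     normal return path to protect.  */
  static bool
  no_normal_return_p (void)
  {
    return (crtl->calls_eh_return
	    && single_pred_p (EXIT_BLOCK_PTR_FOR_FN (cfun)));
  }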

But that might not be worth worrying about, given the builtin's limited
use case.  And even if it is worth worrying about, keeping the test in
this file would mix correctness with optimisation, which isn't a good
thing for scan-assembler-times.

So yeah, I agree this is OK.  It should probably be part of 03 though,
so that no individual commit causes a regression.

Thanks,
Richard

> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c
> index a4bc5b45333..d93492c3c43 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c
> @@ -14,5 +14,18 @@ func1 (int a, int b, int c)
>/* retaa */
>  }
>  
> -/* { dg-final { scan-assembler-times "paciasp" 1 } } */
> -/* { dg-final { scan-assembler-times "retaa" 1 } } */
> +/* eh_return.  */
> +void __attribute__ ((target ("arch=armv8.3-a")))
> +func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> +{
> +  /* paciasp */
> +  *ptr = imm1 + foo (imm1) + imm2;
> +  if (handler)
> +/* br */
> +__builtin_eh_return (offset, handler);
> +  /* retaa */
> +  return;
> +}
> +
> +/* { dg-final { scan-assembler-times "paciasp" 2 } } */
> +/* { dg-final { scan-assembler-times "retaa" 2 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c
> index 43e32ab6cb7..697fa30dc5a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c
> @@ -37,16 +37,5 @@ func3 (int a, int b, int c)
>/* autibsp */
>  }
>  
> -/* eh_return.  */
> -void __attribute__ ((target ("arch=armv8.3-a")))
> -func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> -{
> -  /* no pacibsp */
> -  *ptr = imm1 + foo (imm1) + imm2;
> -  __builtin_eh_return (offset, handler);
> -  /* no autibsp */
> -  return;
> -}
> -
>  /* { dg-final { scan-assembler-times "pacibsp" 3 } } */
>  /* { dg-final { scan-assembler-times "autibsp" 3 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c
> index 9ed64ce0591..748924c72f3 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c
> @@ -14,5 +14,18 @@ func1 (int a, int b, int c)
>/* retab */
>  }
>  
> -/* { dg-final { scan-assembler-times "pacibsp" 1 } } */
> -/* { dg-final { scan-assembler-times "retab" 1 } } */
> +/* eh_return.  */
> +void __attribute__ ((target ("arch=armv8.3-a")))
> +func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> +{
> +  /* paciasp */
> +  *ptr = imm1 + foo (imm1) + imm2;
> +  if (handler)
> +/* br */
> +__builtin_eh_return (offset, handler);
> +  /* retab */
> +  return;
> +}
> +
> +/* { dg-final { 

Re: [PATCH 05/11] aarch64: Add eh_return compile tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/eh_return-2.c: New test.
>   * gcc.target/aarch64/eh_return-3.c: New test.

OK.

I wonder if it's worth using check-function-bodies for -3.c though.
It would then be easy to verify that the autiasp only occurs on the
normal return path.

Just a suggestion -- the current test is fine too.
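
For illustration, it might look something like this (an untested
sketch; the exact body lines, and the ordering of the normal and
eh_return epilogues, would need checking against the real output):

  /* { dg-options "-O2 -mbranch-protection=pac-ret+leaf" } */
  /* { dg-final { check-function-bodies "**" "" } } */

  /*
  ** foo:
  **	hint	25 // paciasp
  **	...
  **	hint	29 // autiasp
  **	ret
  **	...
  */
  void
  foo (unsigned long off, void *handler, int c)
  {
    if (c)
      return;
    __builtin_eh_return (off, handler);
  }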

Thanks,
Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/eh_return-2.c |  9 +
>  gcc/testsuite/gcc.target/aarch64/eh_return-3.c | 14 ++
>  2 files changed, 23 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/eh_return-2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/eh_return-3.c
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/eh_return-2.c 
> b/gcc/testsuite/gcc.target/aarch64/eh_return-2.c
> new file mode 100644
> index 000..4a9d124e891
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/eh_return-2.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-final { scan-assembler "add\tsp, sp, x5" } } */
> +/* { dg-final { scan-assembler "br\tx6" } } */
> +
> +void
> +foo (unsigned long off, void *handler)
> +{
> +  __builtin_eh_return (off, handler);
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/eh_return-3.c 
> b/gcc/testsuite/gcc.target/aarch64/eh_return-3.c
> new file mode 100644
> index 000..35989eee806
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/eh_return-3.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mbranch-protection=pac-ret+leaf" } */
> +/* { dg-final { scan-assembler "add\tsp, sp, x5" } } */
> +/* { dg-final { scan-assembler "br\tx6" } } */
> +/* { dg-final { scan-assembler "hint\t25 // paciasp" } } */
> +/* { dg-final { scan-assembler "hint\t29 // autiasp" } } */
> +
> +void
> +foo (unsigned long off, void *handler, int c)
> +{
> +  if (c)
> +return;
> +  __builtin_eh_return (off, handler);
> +}


Re: [PATCH 04/11] aarch64: Do not force a stack frame for EH returns

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> EH returns no longer rely on clobbering the return address on the stack
> so forcing a stack frame is not necessary.
>
> This does not actually change the code gen for the unwinder since there
> are calls before the EH return.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_needs_frame_chain): Do not
>   force frame chain for eh_return.

OK once we've agreed on something for 03/11.

Thanks,
Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 36cd172d182..afdbf4213c1 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -8417,8 +8417,7 @@ aarch64_output_probe_sve_stack_clash (rtx base, rtx 
> adjustment,
>  static bool
>  aarch64_needs_frame_chain (void)
>  {
> -  /* Force a frame chain for EH returns so the return address is at FP+8.  */
> -  if (frame_pointer_needed || crtl->calls_eh_return)
> +  if (frame_pointer_needed)
>  return true;
>  
>/* A leaf function cannot have calls or write LR.  */


Re: [PATCH 01/11] aarch64: AARCH64_ISA_RCPC was defined twice

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.h (AARCH64_ISA_RCPC): Remove dup.

OK, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 2b0fc97bb71..c783cb96c48 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -222,7 +222,6 @@ enum class aarch64_feature : unsigned char {
>  #define AARCH64_ISA_MOPS(aarch64_isa_flags & AARCH64_FL_MOPS)
>  #define AARCH64_ISA_LS64(aarch64_isa_flags & AARCH64_FL_LS64)
>  #define AARCH64_ISA_CSSC(aarch64_isa_flags & AARCH64_FL_CSSC)
> -#define AARCH64_ISA_RCPC   (aarch64_isa_flags & AARCH64_FL_RCPC)
>  
>  /* Crypto is an optional extension to AdvSIMD.  */
>  #define TARGET_CRYPTO (AARCH64_ISA_CRYPTO)


Re: testsuite: Port 'check-function-bodies' to nvptx

2023-09-05 Thread Richard Sandiford via Gcc-patches
Thomas Schwinge  writes:
> Hi!
>
> On 2023-09-04T23:05:05+0200, I wrote:
>> On 2019-07-16T15:04:49+0100, Richard Sandiford  
>> wrote:
>>> This patch therefore adds a new check-function-bodies dg-final test
>
>>> The regexps in parse_function_bodies are fairly general, but might
>>> still need to be extended in future for targets like Darwin or AIX.
>>
>> ..., or nvptx.  [...]
>
>> number of TODO items.
>>
>> In particular how to parameterize regular expressions for the different
>> syntax used by nvptx: for example, parameterize via global variables,
>> initialized accordingly (where?)?  Thinking about it, maybe simply
>> conditionalizing the current local initializations by
>> 'if { [istarget nvptx-*-*] } { [...] } else { [...] }' will do, simple
>> enough!
>
> Indeed that works fine.
>
>> Regarding whitespace prefixed, I think I'll go with the current
>> 'append function_regexp "\t" $line "\n"', that is, prefix expected output
>> lines with '\t' (as done in 'gcc.target/nvptx/abort.c'), and also for
>> nvptx handle labels as "fluff" (until we solve that issue generally).
>
> I changed my mind about that: instead of '\t', use '\t*' for nvptx, which
> means that both instructions emitted with additional whitespace prefixed
> and labels in column zero work nicely.
>
>> --- a/gcc/testsuite/lib/scanasm.exp
>> +++ b/gcc/testsuite/lib/scanasm.exp
>
>> @@ -907,7 +911,8 @@ proc check-function-bodies { args } {
>>
>>  set count 0
>>  set function_regexp ""
>> -set label {^(\S+):$}
>> +#TODO
>> +set label {^// BEGIN GLOBAL FUNCTION DEF: ([a-zA-Z_]\S+)$}
>
> There's actually no reason that the expected output syntax (this one) has
> to match the assembly -- so I restored that, to use the same syntax for
> nvptx here, too.
>
> Any comments before I push the attached
> "testsuite: Port 'check-function-bodies' to nvptx"?
>
>
> Regards
>  Thomas
>
>
>
> From bdaf7572d9d4c1988274405840de4071ded3733f Mon Sep 17 00:00:00 2001
> From: Thomas Schwinge 
> Date: Mon, 4 Sep 2023 22:28:12 +0200
> Subject: [PATCH] testsuite: Port 'check-function-bodies' to nvptx
>
> This extends commit 4d706ff86ea86868615558e92407674a4f4b4af9
> "Add dg test for matching function bodies" for nvptx.
>
>   gcc/testsuite/
>   * lib/scanasm.exp (configure_check-function-bodies): New proc.
>   (parse_function_bodies, check-function-bodies): Use it.
>   * gcc.target/nvptx/abort.c: Use 'check-function-bodies'.
>   gcc/
>   * doc/sourcebuild.texi (check-function-bodies): Update.

LGTM.  Just a minor comment:

> ---
>  gcc/doc/sourcebuild.texi   |  9 ++-
>  gcc/testsuite/gcc.target/nvptx/abort.c | 19 ++-
>  gcc/testsuite/lib/scanasm.exp  | 76 --
>  3 files changed, 83 insertions(+), 21 deletions(-)
>
> diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
> index 1a78b3c1abb..8aec6b6592c 100644
> --- a/gcc/doc/sourcebuild.texi
> +++ b/gcc/doc/sourcebuild.texi
> @@ -3327,9 +3327,12 @@ The first line of the expected output for a function 
> @var{fn} has the form:
>  Subsequent lines of the expected output also start with @var{prefix}.
>  In both cases, whitespace after @var{prefix} is not significant.
>  
> -The test discards assembly directives such as @code{.cfi_startproc}
> -and local label definitions such as @code{.LFB0} from the compiler's
> -assembly output.  It then matches the result against the expected
> +Depending on the configuration (see
> +@code{gcc/testsuite/lib/scanasm.exp:configure_check-function-bodies}),

I can imagine such a long string wouldn't format well in the output.
How about: @code{configure_check-function-bodies} in
@file{gcc/testsuite/lib/scanasm.exp}?

OK from my POV with that change.

Thanks,
Richard

> +the test may discard from the compiler's assembly output
> +directives such as @code{.cfi_startproc},
> +local label definitions such as @code{.LFB0}, and more.
> +It then matches the result against the expected
>  output for a function as a single regular expression.  This means that
>  later lines can use backslashes to refer back to @samp{(@dots{})}
>  captures on earlier lines.  For example:
> diff --git a/gcc/testsuite/gcc.target/nvptx/abort.c 
> b/gcc/testsuite/gcc.target/nvptx/abort.c
> index d3220687400..ae9dbf45a9b 100644
> --- a/gcc/testsuite/gcc.target/nvptx/abort.c
> +++ b/gcc/testsuite/gcc.target/nvptx/abort.c
> @@ -1,4 +1,6 @@
>  /* { dg-do compile} */
> +/* { dg-final { check-function-bodies {**} {} } } */
> +
>  /* Annotate no return functions with a trailing 'trap'.  */
>  
>  extern void abort ();
> @@ -9,5 +11,18 @@ int main (int argc, char **argv)
>  abort ();
>return 0;
>  }
> -
> -/* { dg-final { scan-assembler "call 

Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-05 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
>> So I don't think I have a good feel for the advantages and disadvantages
>> of doing this.  Robin's analysis of the aarch64 changes was nice and
>> detailed though.  I think the one that worries me most is the addressing
>> mode one.  fwprop is probably the first chance we get to propagate adds
>> into addresses, and virtual register elimination means that some of
>> those opportunities won't show up in gimple.
>> 
>> There again, virtual register elimination wouldn't be the reason for
>> the ld4_s8.c failure.  Perhaps there's something missing in expand.
>> 
>> Other than that, I think my main question is: why just unary operations?
>> Is the underlying assumption that we only want to propagate a maximum of
>> one register?  If so, then I think we should check for that directly, by
>> iterating over subrtxes.
>
> The main reason for stopping at unary operations was to limit the scope
> and change as little as possible (not restricting the change to one
> register).  I'm currently testing a v2 that iterates over subrtxs.

Thanks.  Definitely no problem with doing things in small steps, but IMO
it's better if each choice of step can still be justified in its own terms.

>> Perhaps we should allow the optimisation without register-pressure
>> information if (a) the source register and destination register are
>> in the same pressure class and (b) all uses of the destination are
>> being replaced.  (FWIW, rtl-ssa should make it easier to try to
>> replace all definitions at once, with an all-or-nothing choice,
>> if we ever wanted to do that.)
>
> I presume you're referring to replacing one register (dest) in all using
> insns?  Source and destination are somewhat overloaded in fwprop context
> because I'm thinking of the "to be replaced" register as dest when it's
> actually the replacement register.

Yeah.

> AFAICT fwprop currently iterates over insns, going through all their uses
> and trying if an individual use can be substituted.  Do you suggest to
> change this general iteration order to iterate over the defs of an insn
> and then try to replace all the uses at once (e.g. using ssa->change_insns)?

No, I was just noting in passing that we could try do that if we wanted to.
The current code is a fairly mechanical conversion of the original DF-based
code, but there's no reason why it has to continue to work the way it
does now.

> When keeping the current order, wouldn't we need to store all potential
> changes instead of committing them and later apply them in bulk, e.g.
> grouped by use?  This order would also help to pick the propagation
> with the most number of uses (i.e. propagation potential) but maybe
> I'm misunderstanding?

I imagine doing it in reverse postorder would still make sense.

But my point was that, for the current fwprop limitation of substituting
into exactly one use of a register, we can check whether that use is
the *only* use of register.

I.e. if we substitute:

  A: (set (reg R1) (foo (reg R2)))

into:

  B: (set ... (reg R1) ...)

if R1 and R2 are likely to be in the same register class, and if B
is the only user of R2, then we don't need to calculate register
pressure.  The change is either neutral (if R2 died in A) or an
improvement (if R2 doesn't die in A, and so R1 and R2 were previously
live at the same time).
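
In code form, the shortcut might look something like this (a sketch
only; both helpers are hypothetical stand-ins rather than fwprop's
actual interface):

  /* Substituting A into B cannot increase pressure if (a) R1 and R2
     compete for the same registers and (b) B is the only user of R2,
     so the substitution does not extend R2's live range.  */
  static bool
  can_skip_pressure_check_p (rtx r1, rtx r2, rtx_insn *use_b)
  {
    return (pressure_class_of (r1) == pressure_class_of (r2)
	    && only_use_of_reg_p (use_b, r2));
  }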

Thanks,
Richard


Re: RFC: Introduce -fhardened to enable security-related flags

2023-09-04 Thread Richard Sandiford via Gcc-patches
Qing Zhao via Gcc-patches  writes:
>> On Aug 29, 2023, at 3:42 PM, Marek Polacek via Gcc-patches 
>>  wrote:
>> 
>> Improving the security of software has been a major trend in the recent
>> years.  Fortunately, GCC offers a wide variety of flags that enable extra
>> hardening.  These flags aren't enabled by default, though.  And since
>> there are a lot of hardening flags, with more to come, it's been difficult
>> to keep on top of them; more so for the users of GCC who ought not to be
>> expected to keep track of all the new options.
>> 
>> To alleviate some of the problems I mentioned, we thought it would
>> be useful to provide a new umbrella option that enables a reasonable set
>> of hardening flags.  What's "reasonable" in this context is not easy to
>> pin down.  Surely, there must be no ABI impact, the option cannot cause
>> severe performance issues, and, I suspect, it should not cause build
>> errors by enabling stricter compile-time errors (such as, -Wimplicit-int,
>> -Wint-conversion).  Including a controversial option in -fhardened
>> would likely cause that users would not use -fhardened at all.  It's
>> roughly akin to -Wall or -O2 -- those also enable a reasonable set of
>> options, and evolve over time, and are not kept in sync with other
>> compilers.
>> 
>> Currently, -fhardened enables:
>> 
>>  -D_FORTIFY_SOURCE=3 (or =2 for older glibcs)
>>  -D_GLIBCXX_ASSERTIONS
>>  -ftrivial-auto-var-init=zero
>>  -fPIE  -pie  -Wl,-z,relro,-z,now
>>  -fstack-protector-strong
>>  -fstack-clash-protection
>>  -fcf-protection=full (x86 GNU/Linux only)
>> 
>> -fsanitize=undefined is specifically not enabled.  -fstrict-flex-arrays is
>> also liable to break a lot of code so I didn't include it.
>> 
>> Appended is a proof-of-concept patch.  It doesn't implement --help=hardened
>> yet.  A fairly crucial point is that -fhardened will not override options
>> that were specified on the command line (before or after -fhardened).  For
>> example,
>> 
>> -D_FORTIFY_SOURCE=1 -fhardened
>> 
>> means that _FORTIFY_SOURCE=1 will be used.  Similarly,
>> 
>>  -fhardened -fstack-protector
>> 
>> will not enable -fstack-protector-strong.
>> 
>> Thoughts?
>
> In general, I think that it is a very good idea to provide umbrella options
>  for software security purpose.  Thanks a lot for this work!
>
> 1. I do agree with Martin, multiple-level control for this purpose might be 
> needed,
> similar as multiple levels for warnings, and multiple levels for 
> optimizations.
>
> Similar as optimization options, can we organize all the security options 
> together 
> In our manual, then the user will have a good central place to get more and 
> complete
> Information of the security features our compiler provides? 
>
> 2. What’s the major criteria to decide which security feature should go into 
> this list?
> Later, when we have new security features, how to decide whether to add them 
> to
> This list or not?
> I am wondering why -fzero-call-used-regs is not included in the list and also

FWIW, I wondered the same thing.  Not a strong conviction that it should
be included -- maybe the code bloat is too much on some targets.  But it
might be acceptable for the -fhardened equivalent of -O3, at least if
restricted to GPRs.
 
> Why chose -ftrivial-auto-var-init=zero instead of 
> -ftrivial-auto-var-init=pattern? 

Yeah, IIRC -ftrivial-auto-var-init=zero was controversial with some
Clang maintainers because it effectively creates a language dialect.
-ftrivial-auto-var-init=pattern wasn't controversial in the same way.
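
A small example of the dialect concern (my illustration, not from that
discussion):

  /* Undefined behaviour in standard C, but reliably returns 0 under
     -ftrivial-auto-var-init=zero, so code can silently come to depend
     on the flag being used.  */
  int
  returns_zero_only_with_the_flag (void)
  {
    int x;
    return x;
  }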

Thanks,
Richard


Re: [PATCH] Bug 111071: fix the subr with -1 to not due to the simplify.

2023-09-04 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> "yanzhang.wang--- via Gcc-patches"  writes:
>> From: Yanzhang Wang 
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/aarch64/sve/acle/asm/subr_s8.c: Modify subr with -1
>> to not.
>>
>> Signed-off-by: Yanzhang Wang 
>> ---
>>
>> Tested on my local arm environment and passed.  Thanks to Andrew Pinski's
>> comment; the code is the same as that.
>>
>>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
>> index b9615de6655..1cf6916a5e0 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
>> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_s8_m_untied, svint8_t,
>>  
>>  /*
>>  ** subr_m1_s8_m:
>> -**  mov (z[0-9]+\.b), #-1
>> -**  subrz0\.b, p0/m, z0\.b, \1
>> +**  not z0.b, p0/m, z0.b
>>  **  ret
>>  */
>>  TEST_UNIFORM_Z (subr_m1_s8_m, svint8_t,
>
> I think we need this for subr_u8.c too.  OK with that change,
> and thanks for the fix!

Actually, never mind.  I just saw a patch from Thiago Jung Bauermann
for the same issue, which is now in trunk.  Sorry for the confusion,
and thanks again for posting the fix.

Richard


Re: [PATCH] testsuite: aarch64: Adjust SVE ACLE tests to new generated code

2023-09-04 Thread Richard Sandiford via Gcc-patches
Thiago Jung Bauermann via Gcc-patches  writes:
> Since commit e7a36e4715c7 "[PATCH] RISC-V: Support simplify (-1-x) for
> vector." these tests fail on aarch64-linux:
>
>   === g++ tests ===
>
> Running g++:g++.target/aarch64/sve/acle/aarch64-sve-acle-asm.exp ...
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_u8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_u8_m
>
>   === gcc tests ===
>
> Running gcc:gcc.target/aarch64/sve/acle/aarch64-sve-acle-asm.exp ...
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_u8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_u8_m
>
> Andrew Pinski's analysis in PR testsuite/111071 is that the new code is
> better and the testcase should be updated. I also asked Prathamesh Kulkarni
> in private and he agreed.
>
> Here is the update. With this change, all tests in
> gcc.target/aarch64/sve/acle/aarch64-sve-acle-asm.exp pass on aarch64-linux.
>
> gcc/testsuite/
>   PR testsuite/111071
>   * gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c: Adjust to 
> new code.
>   * gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c: Likewise.

Thanks, pushed to trunk.  And sorry for the delay.  I somehow
missed this earlier. :(

Richard

> Suggested-by: Andrew Pinski 
> ---
>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c | 3 +--
>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c | 3 +--
>  2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> index b9615de6655f..3e521bc9ae32 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_s8_m_untied, svint8_t,
>  
>  /*
>  ** subr_m1_s8_m:
> -**   mov (z[0-9]+\.b), #-1
> -**   subrz0\.b, p0/m, z0\.b, \1
> +**   not z0\.b, p0/m, z0\.b
>  **   ret
>  */
>  TEST_UNIFORM_Z (subr_m1_s8_m, svint8_t,
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c
> index 65606b6dda03..4922bdbacc47 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c
> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_u8_m_untied, svuint8_t,
>  
>  /*
>  ** subr_m1_u8_m:
> -**   mov (z[0-9]+\.b), #-1
> -**   subrz0\.b, p0/m, z0\.b, \1
> +**   not z0\.b, p0/m, z0\.b
>  **   ret
>  */
>  TEST_UNIFORM_Z (subr_m1_u8_m, svuint8_t,


Re: [PATCH v3] mklog: handle Signed-off-by, minor cleanup

2023-09-04 Thread Richard Sandiford via Gcc-patches
Marc Poulhiès via Gcc-patches  writes:
> Richard Sandiford via Gcc-patches  writes:
>>> +# this regex matches the first line of the "end" in the initial commit message
>>> +FIRST_LINE_OF_END_RE = re.compile('(?i)^(signed-off-by|co-authored-by|#): ')
>>
>> The current code only requires an initial "#", rather than an initial "#: ".
>> Is that a deliberate change?
>>
>> The patch LGTM apart from that.
>
> Hello Richard,
>
> Thanks for the review and sorry for the delayed answer as I was away the
> past weeks. This issue was catched early this month
> (https://github.com/Rust-GCC/gccrs/pull/2504), but I didn't want to send
> something here before leaving. Here's a fixed patched.
>
> Ok for master?
>
> Thanks,
> Marc
>
> ---
>  contrib/mklog.py   | 34 +-
>  contrib/prepare-commit-msg | 20 ++--
>  2 files changed, 39 insertions(+), 15 deletions(-)
>
> diff --git a/contrib/mklog.py b/contrib/mklog.py
> index 26230b9b4f2..496780883fb 100755
> --- a/contrib/mklog.py
> +++ b/contrib/mklog.py
> @@ -41,7 +41,34 @@ from unidiff import PatchSet
>  
>  LINE_LIMIT = 100
>  TAB_WIDTH = 8
> -CO_AUTHORED_BY_PREFIX = 'co-authored-by: '
> +
> +# Initial commit:
> +#   +---------------------------------------------+
> +#   | gccrs: Some title                           |
> +#   |                                             | This is the "start"
> +#   | This is some text explaining the commit.    |
> +#   | There can be several lines.                 |
> +#   |                                             |<--->
> +#   | Signed-off-by: My Name                      | This is the "end"
> +#   +---------------------------------------------+
> +#
> +# Results in:
> +#   +---------------------------------------------+
> +#   | gccrs: Some title                           |
> +#   |                                             |
> +#   | This is some text explaining the commit.    | This is the "start"
> +#   | There can be several lines.                 |
> +#   |                                             |<--->
> +#   | gcc/rust/ChangeLog:                         |
> +#   |                                             | This is the generated
> +#   |         * some_file (bla):                  | ChangeLog part
> +#   |         (foo):                              |
> +#   |                                             |<--->
> +#   | Signed-off-by: My Name                      | This is the "end"
> +#   +---------------------------------------------+
> +
> +# this regex matches the first line of the "end" in the initial commit message
> +FIRST_LINE_OF_END_RE = re.compile('(?i)^(signed-off-by:|co-authored-by:|#) ')

Personally I think it would be safer to drop the final space in the regexp.

OK with that change if you agree.

Thanks,
Richard

>  
>  pr_regex = re.compile(r'(\/(\/|\*)|[Cc*!])\s+(?PPR [a-z+-]+\/[0-9]+)')
>  prnum_regex = re.compile(r'PR (?P[a-z+-]+)/(?P[0-9]+)')
> @@ -330,10 +357,7 @@ def update_copyright(data):
>  
>  
>  def skip_line_in_changelog(line):
> -    if line.lower().startswith(CO_AUTHORED_BY_PREFIX) or line.startswith('#'):
> -        return False
> -    return True
> -
> +    return FIRST_LINE_OF_END_RE.match(line) == None
>  
>  if __name__ == '__main__':
>  extra_args = os.getenv('GCC_MKLOG_ARGS')
> diff --git a/contrib/prepare-commit-msg b/contrib/prepare-commit-msg
> index 48c9dad3c6f..1e94706ba40 100755
> --- a/contrib/prepare-commit-msg
> +++ b/contrib/prepare-commit-msg
> @@ -32,11 +32,11 @@ if ! [ -f "$COMMIT_MSG_FILE" ]; then exit 0; fi
>  # Don't do anything unless requested to.
>  if [ -z "$GCC_FORCE_MKLOG" ]; then exit 0; fi
>  
> -if [ -z "$COMMIT_SOURCE" ] || [ $COMMIT_SOURCE = template ]; then
> +if [ -z "$COMMIT_SOURCE" ] || [ "$COMMIT_SOURCE" = template ]; then
>  # No source or "template" means new commit.
>  cmd="diff --cached"
>  
> -elif [ $COMMIT_SOURCE = message ]; then
> +elif [ "$COMMIT_SOURCE" = message ]; then
>  # "message" means -m; assume a new commit if there are any changes 
> staged.
>  if ! git diff --cached --quiet; then
>   cmd="diff --cached"
> @@ -44,23 +44,23 @@ elif [ $COMMIT_SOURCE = message ]; then
>   cmd="diff --cached HEAD^"
>  fi
>  
> -elif [ $COMMIT_SOURCE = commit ]; t

Re: [PATCH] testsuite: Remove unwanted 'dg-do run' from gcc.dg/vect tests

2023-09-04 Thread Richard Sandiford via Gcc-patches
Christophe Lyon via Gcc-patches  writes:
> Tests under gcc.dg/vect use check_vect_support_and_set_flags to set
> compilation flags as appropriate for the target, but they also set
> dg-do-what-default to 'run' or 'compile', depending on the actual
> target hardware (or simulator) capabilities.
>
> For instance on arm, we use options to enable Neon, but set
> dg-do-what-default to 'run' only if we can actually execute Neon
> instructions.
>
> Therefore, we would always try to link and execute tests containing
> 'dg-do run', although dg-do-what-default says otherwise, leading to
> uninteresting failures.
>
> Therefore, this patch removes all such unconditional 'dg-do run',
> thus avoiding link errors for instance if GCC has been configured with
> multilibs disabled and some --with-{float|cpu|hard} option
> incompatible with what check_vect_support_and_set_flags selects.
>
> For example, GCC configured with:
> --disable-multilib --with-mode=thumb --with-cpu=cortex-m7 --with-float=hard
> and check_vect_support_and_set_flags uses
> -mfpu=neon -mfloat-abi=softfp -march=armv7-a
> (thus incompatible float-abi options)
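
(Illustration of the directive interaction, with a made-up fragment
rather than one of the affected tests: check_vect_support_and_set_flags
may set dg-do-what-default to "compile" because the target cannot
execute the vector code, but an explicit

  /* { dg-do run } */

in a test overrides that default and forces a link and execute anyway.)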
>
> Tested on native aarch64-linux-gnu (no change) and several arm-eabi
> cases where the FAIL/UNRESOLVED disappear (and we keep only the
> 'compilation' tests).
>
> 2023-09-04  Christophe Lyon  
>
>   gcc/testsuite/
>   * gcc.dg/vect/bb-slp-44.c: Remove 'dg-do run'.
>   * gcc.dg/vect/bb-slp-71.c: Likewise.
>   * gcc.dg/vect/bb-slp-72.c: Likewise.
>   * gcc.dg/vect/bb-slp-73.c: Likewise.
>   * gcc.dg/vect/bb-slp-74.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101207.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101615-1.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101615-2.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101668.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr54400.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr98516-1.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr98516-2.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr98544.c: Likewise.
>   * gcc.dg/vect/pr101445.c: Likewise.
>   * gcc.dg/vect/pr105219.c: Likewise.
>   * gcc.dg/vect/pr107160.c: Likewise.
>   * gcc.dg/vect/pr107212-1.c: Likewise.
>   * gcc.dg/vect/pr107212-2.c: Likewise.
>   * gcc.dg/vect/pr109502.c: Likewise.
>   * gcc.dg/vect/pr110381.c: Likewise.
>   * gcc.dg/vect/pr110838.c: Likewise.
>   * gcc.dg/vect/pr88497-1.c: Likewise.
>   * gcc.dg/vect/pr88497-7.c: Likewise.
>   * gcc.dg/vect/pr96783-1.c: Likewise.
>   * gcc.dg/vect/pr96783-2.c: Likewise.
>   * gcc.dg/vect/pr97558-2.c: Likewise.
>   * gcc.dg/vect/pr99253.c: Likewise.
>   * gcc.dg/vect/slp-mask-store-1.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-10.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-11.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-2.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-3.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-4.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-5.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-6.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-8.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-9.c: Likewise.
>   * gcc.dg/vect/vect-cond-13.c: Likewise.
>   * gcc.dg/vect/vect-recurr-1.c: Likewise.
>   * gcc.dg/vect/vect-recurr-2.c: Likewise.
>   * gcc.dg/vect/vect-recurr-3.c: Likewise.
>   * gcc.dg/vect/vect-recurr-4.c: Likewise.
>   * gcc.dg/vect/vect-recurr-5.c: Likewise.
>   * gcc.dg/vect/vect-recurr-6.c: Likewise.

OK, thanks.

Richard

> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-44.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-71.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-72.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-73.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-74.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101207.c | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101615-1.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101615-2.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101668.c | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c  | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr98516-1.c| 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr98516-2.c| 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr98544.c  | 2 --
>  gcc/testsuite/gcc.dg/vect/pr101445.c| 2 --
>  gcc/testsuite/gcc.dg/vect/pr105219.c| 1 -
>  gcc/testsuite/gcc.dg/vect/pr107160.c| 2 --
>  gcc/testsuite/gcc.dg/vect/pr107212-1.c  | 2 --
>  gcc/testsuite/gcc.dg/vect/pr107212-2.c  | 2 --
>  gcc/testsuite/gcc.dg/vect/pr109502.c| 1 -
>  gcc/testsuite/gcc.dg/vect/pr110381.c| 1 -
>  gcc/testsuite/gcc.dg/vect/pr110838.c| 2 --
>  gcc/testsuite/gcc.dg/vect/pr88497-1.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/pr88497-7.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/pr96783-1.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/pr96783-2.c   | 2 --
>  

Re: [PATCH] Bug 111071: fix the subr with -1 to not due to the simplify.

2023-09-04 Thread Richard Sandiford via Gcc-patches
"yanzhang.wang--- via Gcc-patches"  writes:
> From: Yanzhang Wang 
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/acle/asm/subr_s8.c: Modify subr with -1
> to not.
>
> Signed-off-by: Yanzhang Wang 
> ---
>
> Tested on my local arm environment and passed.  Thanks to Andrew Pinski's
> comment; the code is the same as that.
>
>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> index b9615de6655..1cf6916a5e0 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_s8_m_untied, svint8_t,
>  
>  /*
>  ** subr_m1_s8_m:
> -**   mov (z[0-9]+\.b), #-1
> -**   subrz0\.b, p0/m, z0\.b, \1
> +**   not z0.b, p0/m, z0.b
>  **   ret
>  */
>  TEST_UNIFORM_Z (subr_m1_s8_m, svint8_t,

I think we need this for subr_u8.c too.  OK with that change,
and thanks for the fix!

Richard


Re: [PATCH]AArch64 xorsign: Fix scalar xorsign lowering

2023-09-01 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Friday, September 1, 2023 2:36 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; Kyrylo Tkachov 
>> Subject: Re: [PATCH]AArch64 xorsign: Fix scalar xorsign lowering
>> 
>> Tamar Christina  writes:
>> > Hi All,
>> >
>> > In GCC-9 our scalar xorsign pattern broke and we didn't notice it
>> > because the testcase was not strong enough.  With this commit
>> >
>> > 8d2d39587d941a40f25ea0144cceb677df115040 is the first bad commit
>> > commit 8d2d39587d941a40f25ea0144cceb677df115040
>> > Author: Segher Boessenkool 
>> > Date:   Mon Oct 22 22:23:39 2018 +0200
>> >
>> > combine: Do not combine moves from hard registers
>> >
>> > combine started introducing useless moves on hard registers; when one
>> > of the arguments to our scalar xorsign is a hardreg we get an additional
>> > move inserted.
>> >
>> > This leads to combine forming an AND with the immediate inside and
>> > using the superfluous move to do the r->w move, instead of what we
>> > wanted before, which was for the `and` to be a vector AND and have
>> > reload pick the right alternative.
>> 
>> IMO, the xorsign optab ought to go away.  IIRC it was just a stop-gap measure
>> that (like most stop-gap measures) never got cleaned up later.
>> 
>> But that's not important now. :)
>> 
>> > To fix this the patch just forces the use of the vector version
>> > directly and so combine has no chance to mess it up.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> >	* config/aarch64/aarch64-simd.md (xorsign<mode>3): Renamed to...
>> >	(@xorsign<mode>3): ...This.
>> >	* config/aarch64/aarch64.md (xorsign<mode>3): Renamed to...
>> >	(@xorsign<mode>3): ...This and emit vectors directly.
>> >	* config/aarch64/iterators.md (VCONQ): Add SF and DF.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >* gcc.target/aarch64/xorsign.c:
>> >
>> > --- inline copy of patch --
>> > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>> > index f67eb70577d0c2d9911d8c867d38a4d0b390337c..e955691f1be8830efacc237465119764ce2a4942 100644
>> > --- a/gcc/config/aarch64/aarch64-simd.md
>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > @@ -500,7 +500,7 @@ (define_expand "ctz<mode>2"
>> >    }
>> >  )
>> >
>> > -(define_expand "xorsign<mode>3"
>> > +(define_expand "@xorsign<mode>3"
>> >    [(match_operand:VHSDF 0 "register_operand")
>> >     (match_operand:VHSDF 1 "register_operand")
>> >     (match_operand:VHSDF 2 "register_operand")]
>> > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>> > index 01cf989641fce8e6c3828f6cfef62e101c4142df..9db82347bf891f9bc40aedecdc8462c94bf1a769 100644
>> > --- a/gcc/config/aarch64/aarch64.md
>> > +++ b/gcc/config/aarch64/aarch64.md
>> > @@ -6953,31 +6953,20 @@ (define_insn "copysign<mode>3_insn"
>> >  ;; EOR   v0.8B, v0.8B, v3.8B
>> >  ;;
>> >
>> > -(define_expand "xorsign<mode>3"
>> > +(define_expand "@xorsign<mode>3"
>> >    [(match_operand:GPF 0 "register_operand")
>> >     (match_operand:GPF 1 "register_operand")
>> >     (match_operand:GPF 2 "register_operand")]
>> >    "TARGET_SIMD"
>> >  {
>> > -
>> > -  machine_mode imode = <V_INT_EQUIV>mode;
>> > -  rtx mask = gen_reg_rtx (imode);
>> > -  rtx op1x = gen_reg_rtx (imode);
>> > -  rtx op2x = gen_reg_rtx (imode);
>> > -
>> > -  int bits = GET_MODE_BITSIZE (<MODE>mode) - 1;
>> > -  emit_move_insn (mask, GEN_INT (trunc_int_for_mode (HOST_WIDE_INT_M1U << bits,
>> > -                                                     imode)));
>> > -
>> > -  emit_insn (gen_and<v_int_equiv>3 (op2x, mask,
>> > -                                    lowpart_subreg (imode, operands[2],
>> > -                                                    <MODE>mode)));
>> > -  emit_insn (gen_xor<v_int_equiv>3 (op1x,
>> > -                                    lowpart_subreg (imode, operands[1],
>> > -                                                    <MODE>mode),
>> > -                                    op2x));
>> > +  rtx tmp = gen_reg_rtx (<VCONQ>mode);
>> > +  rtx op1 = gen_reg_rtx (<VCONQ>mode);
>> > +  rtx op2 = gen_reg_rtx (<VCONQ>mode);
>> > +  emit_move_insn (op1, lowpart_subreg (<VCONQ>mode, operands[1], <MODE>mode));
>> > +  emit_move_insn (op2, lowpart_subreg (<VCONQ>mode, operands[2], <MODE>mode));
>> > +  emit_insn (gen_xorsign3(<VCONQ>mode, tmp, op1, op2));
>> 
>> Do we need the extra moves into op1 and op2?  I would have expected the
>> subregs to be acceptable as direct operands of the xorsign<mode>3.  Making them
>> direct operands should be better, since there's then less risk of having the
>> same value live in different registers at the same time.
>> 
>
> That was the first thing I tried but it doesn't work because validate_subreg 
> seems
> to have the invariant that you can either change mode between the same size
> or make it paradoxical but not both at the same time.
>
> i.e. it rejects subreg:V2DI (subreg:DI (reg:DF))), and lowpart_subreg folds 
> it to
> 

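For reference, xorsign (a, b) computes a * copysign (1.0, b) without a
multiply: only the sign bit of b is ANDed out and EORed into a, exactly the
AND/EOR sequence the pattern's comment shows.  A minimal portable C sketch of
the operation itself (an illustration, not GCC code; memcpy is used for the
well-defined float<->integer punning):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static double
xorsign (double a, double b)
{
  uint64_t ia, ib;
  memcpy (&ia, &a, sizeof ia);
  memcpy (&ib, &b, sizeof ib);
  ia ^= ib & 0x8000000000000000ull; /* flip a's sign iff b is negative */
  memcpy (&a, &ia, sizeof a);
  return a;
}

int
main (void)
{
  printf ("%g %g\n", xorsign (3.0, -2.0), xorsign (-3.0, -2.0)); /* -3 3 */
  return 0;
}
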
Re: [PATCH]AArch64 xorsign: Fix scalar xorsign lowering

2023-09-01 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> In GCC-9 our scalar xorsign pattern broke and we didn't notice it because the
> testcase was not strong enough.  With this commit
>
> 8d2d39587d941a40f25ea0144cceb677df115040 is the first bad commit
> commit 8d2d39587d941a40f25ea0144cceb677df115040
> Author: Segher Boessenkool 
> Date:   Mon Oct 22 22:23:39 2018 +0200
>
> combine: Do not combine moves from hard registers
>
> combine started introducing useless moves on hard registers; when one of the
> arguments to our scalar xorsign is a hardreg we get an additional move
> inserted.
>
> This leads to combine forming an AND with the immediate inside and using the
> superfluous move to do the r->w move, instead of what we wanted before, which
> was for the `and` to be a vector AND and have reload pick the right
> alternative.

IMO, the xorsign optab ought to go away.  IIRC it was just a stop-gap
measure that (like most stop-gap measures) never got cleaned up later.

But that's not important now. :)

> To fix this the patch just forces the use of the vector version directly and
> so combine has no chance to mess it up.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md (xorsign<mode>3): Renamed to...
>   (@xorsign<mode>3): ...This.
>   * config/aarch64/aarch64.md (xorsign<mode>3): Renamed to...
>   (@xorsign<mode>3): ...This and emit vectors directly.
>   * config/aarch64/iterators.md (VCONQ): Add SF and DF.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/xorsign.c:
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index f67eb70577d0c2d9911d8c867d38a4d0b390337c..e955691f1be8830efacc237465119764ce2a4942 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -500,7 +500,7 @@ (define_expand "ctz<mode>2"
>    }
>  )
>  
> -(define_expand "xorsign<mode>3"
> +(define_expand "@xorsign<mode>3"
>    [(match_operand:VHSDF 0 "register_operand")
>     (match_operand:VHSDF 1 "register_operand")
>     (match_operand:VHSDF 2 "register_operand")]
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 01cf989641fce8e6c3828f6cfef62e101c4142df..9db82347bf891f9bc40aedecdc8462c94bf1a769 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -6953,31 +6953,20 @@ (define_insn "copysign<mode>3_insn"
>  ;; EOR   v0.8B, v0.8B, v3.8B
>  ;;
>  
> -(define_expand "xorsign<mode>3"
> +(define_expand "@xorsign<mode>3"
>    [(match_operand:GPF 0 "register_operand")
>     (match_operand:GPF 1 "register_operand")
>     (match_operand:GPF 2 "register_operand")]
>    "TARGET_SIMD"
>  {
> -
> -  machine_mode imode = <V_INT_EQUIV>mode;
> -  rtx mask = gen_reg_rtx (imode);
> -  rtx op1x = gen_reg_rtx (imode);
> -  rtx op2x = gen_reg_rtx (imode);
> -
> -  int bits = GET_MODE_BITSIZE (<MODE>mode) - 1;
> -  emit_move_insn (mask, GEN_INT (trunc_int_for_mode (HOST_WIDE_INT_M1U << bits,
> -                                                     imode)));
> -
> -  emit_insn (gen_and<v_int_equiv>3 (op2x, mask,
> -                                    lowpart_subreg (imode, operands[2],
> -                                                    <MODE>mode)));
> -  emit_insn (gen_xor<v_int_equiv>3 (op1x,
> -                                    lowpart_subreg (imode, operands[1],
> -                                                    <MODE>mode),
> -                                    op2x));
> +  rtx tmp = gen_reg_rtx (<VCONQ>mode);
> +  rtx op1 = gen_reg_rtx (<VCONQ>mode);
> +  rtx op2 = gen_reg_rtx (<VCONQ>mode);
> +  emit_move_insn (op1, lowpart_subreg (<VCONQ>mode, operands[1], <MODE>mode));
> +  emit_move_insn (op2, lowpart_subreg (<VCONQ>mode, operands[2], <MODE>mode));
> +  emit_insn (gen_xorsign3(<VCONQ>mode, tmp, op1, op2));

Do we need the extra moves into op1 and op2?  I would have expected the
subregs to be acceptable as direct operands of the xorsign<mode>3.  Making
them direct operands should be better, since there's then less risk of
having the same value live in different registers at the same time.

OK with that change if it works.

Also, nit: missing space before "(".

Thanks,
Richard

>    emit_move_insn (operands[0],
> -                 lowpart_subreg (<MODE>mode, op1x, imode));
> +                 lowpart_subreg (<MODE>mode, tmp, <VCONQ>mode));
>    DONE;
>  }
>  )
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 9398d713044433cd89b2a83db5ae7969feb1dcf7..2451d8c2cd8e2da6ac8339eed9bc975cf203fa4c 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -1428,7 +1428,8 @@ (define_mode_attr VCONQ [(V8QI "V16QI") (V16QI "V16QI")
>(V4HF "V8HF") (V8HF "V8HF")
>(V2SF "V4SF") (V4SF "V4SF")
>(V2DF "V2DF") (SI   "V4SI")
> -  (HI   "V8HI") (QI   "V16QI")])
> +  (HI   "V8HI") (QI   "V16QI")
> +  (SF   "V4SF") (DF   "V2DF")])
>  
>  
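
To see what "emit vectors directly" means here: the fixed expander
reinterprets the scalar as lane 0 of the corresponding 128-bit vector mode
(hence the new SF->V4SF and DF->V2DF entries in VCONQ) and performs the
AND/EOR there.  A rough user-level analogue using NEON intrinsics -- an
illustrative sketch only, AArch64-specific, and not the expander itself:

#include <arm_neon.h>
#include <stdio.h>

static float
xorsign_via_vector (float a, float b)
{
  /* Broadcast the scalars into 128-bit vectors (the VCONQ idea).  */
  float32x4_t va = vdupq_n_f32 (a);
  float32x4_t vb = vdupq_n_f32 (b);
  /* Keep only b's sign bit, then EOR it into a.  */
  uint32x4_t mask = vdupq_n_u32 (0x80000000u);
  uint32x4_t sign = vandq_u32 (vreinterpretq_u32_f32 (vb), mask);
  uint32x4_t res = veorq_u32 (vreinterpretq_u32_f32 (va), sign);
  return vgetq_lane_f32 (vreinterpretq_f32_u32 (res), 0);
}

int
main (void)
{
  printf ("%g %g\n", xorsign_via_vector (3.0f, -2.0f),
          xorsign_via_vector (-3.0f, -2.0f)); /* -3 3 */
  return 0;
}
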
