Re: [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost

2023-09-18 Thread Richard Sandiford via Gcc-patches
Kewen Lin  writes:
> This costing adjustment patch series exposes one issue in
> aarch64-specific costing adjustment for the STP sequence.  It
> causes the below test cases to fail:
>
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
>
> Take the below function extracted from ldp_stp_15.c as
> example:
>
> void
> dup_8_int32_t (int32_t *x, int32_t val)
> {
> for (int i = 0; i < 8; ++i)
>   x[i] = val;
> }
>
> Without my patch series, during slp1 it gets:
>
>   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
>   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
>
> then the final vector cost is 3.
>
> With my patch series, during slp1 it gets:
>
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
>
> but the final vector cost is 17.  The unaligned_store count is
> actually unchanged, but the final vector costs differ because of
> the aarch64-specific handling below:
>
>   /* Apply the heuristic described above m_stp_sequence_cost.  */
>   if (m_stp_sequence_cost != ~0U)
> {
>   uint64_t cost = aarch64_stp_sequence_cost (count, kind,
>stmt_info, vectype);
>   m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
> }
>
> For the former, since the count is 2, function
> aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
> For the latter, it's split into two calls with count 1 each;
> aarch64_stp_sequence_cost returns 2 each time, so it returns
> 4 in total.
>
> For this case, the stmt with scalar_to_vec also contributes
> 4 to m_stp_sequence_cost, so the final m_stp_sequence_cost
> values are 6 (2+4) vs. 8 (4+4).
>
> Considering that scalar_costs->m_stp_sequence_cost is 8, and the
> check and re-assignment below:
>
>   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
> m_costs[vect_body] = 2 * scalar_costs->total_cost ();
>
> For the former, the vector body cost isn't changed; but for
> the latter, the vector body cost becomes double the scalar
> cost (8 for this case), i.e. 16, which is bigger than what
> we expect.
>
> I'm not sure why CEIL is used for the return value in the
> unaligned_store case of aarch64_stp_sequence_cost, but I
> tried changing it to "return count;" (which restores the
> previous cost) and no failures showed up in regression
> testing.  I'd expect that if the previous unaligned_store
> count is even, this adjustment changes nothing; if it's odd,
> the adjustment may reduce it by one, but I'd guess such
> cases are few.  Besides, as the comments for
> m_stp_sequence_cost say, the current handling seems
> temporary, so maybe a tweak like this can be accepted.
> I'm posting this RFC/PATCH to request comments on whether
> this one-line change can be considered.

It's unfortunate that doing this didn't show up as a regression.
I guess it's not a change we explicitly added tests to guard against.

But the point of the condition is to estimate how many single stores
(STRs) and how many paired stores (STPs) would be generated.  As far
as this heuristic goes, STP (storing two values) is as cheap as STR
(storing only one value).  So the point of the CEIL is to count 1 store
as having equal cost to 2, 3 as having equal cost to 4, etc.
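
To make that concrete, here's a stand-alone sketch of the rounding
(illustration only, not GCC code):

  /* CEIL (count, 2) * 2: 1 store is costed like 2, 3 like 4, etc.  */
  static unsigned int stp_cost (unsigned int count)
  {
    return (count + 1) / 2 * 2;
  }

One call with count 2 gives 2, but two calls with count 1 give 2 + 2 = 4.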

For a heuristic like that, costing a vector stmt once with count 2
is different from costing 2 vector stmts with count 1.  The former
makes it obvious that the 2 vector stmts are associated with the
same scalar stmt, and are highly likely to be consecutive.  The latter
(costing 2 stmts with count 1) could also happen for unrelated stmts.

ISTM that costing once with count N provides strictly more information
to targets than costing N times with count 1.  Is there no way we can
keep the current behaviour?  E.g. rather than costing a stmt immediately
within a loop, could we just increment a counter and cost once at the end?

Thanks,
Richard

> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
>   count directly instead of the adjusted value computed with CEIL.
> ---
>  gcc/config/aarch64/aarch64.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 37d414021ca..9fb4fbd883d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, 
> vect_cost_for_stmt kind,
> if (!aarch64_aligned_constant_offset_p (stmt_info, size))
>   return count * 2;
>   }
> -  return CEIL (count, 2) * 2;
> +  return count;
>  
>  case 

Re: [PATCH V2] internal-fn: Support undefined rtx for uninitialized SSA_NAME

2023-09-17 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> According to PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110751
>
> As Richard and Richi suggested, we recognize an uninitialized SSA_NAME
> and convert it into a SCRATCH rtx if the target predicate allows SCRATCH.
>
> It can help to reduce redundant data-move instructions on targets like
> RISC-V.
>
> gcc/ChangeLog:
>
>   * internal-fn.cc (expand_fn_using_insn): Support undefined rtx.
>   * optabs.cc (maybe_legitimize_operand): Ditto.
>   (can_reuse_operands_p): Ditto.
>   * optabs.h (enum expand_operand_type): Ditto.
>   (create_undefined_input_operand): Ditto.
>
> ---
>  gcc/internal-fn.cc |  4 
>  gcc/optabs.cc  | 16 
>  gcc/optabs.h   | 14 +-
>  3 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 0fd34359247..61d5a9e4772 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -247,6 +247,10 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, 
> unsigned int noutputs,
>   create_convert_operand_from (&ops[opno], rhs_rtx,
>TYPE_MODE (rhs_type),
>TYPE_UNSIGNED (rhs_type));
> +  else if (TREE_CODE (rhs) == SSA_NAME
> +&& SSA_NAME_IS_DEFAULT_DEF (rhs)
> +&& VAR_P (SSA_NAME_VAR (rhs)))
> + create_undefined_input_operand (&ops[opno], TYPE_MODE (rhs_type));
>else
>   create_input_operand (&ops[opno], rhs_rtx, TYPE_MODE (rhs_type));
>opno += 1;
> diff --git a/gcc/optabs.cc b/gcc/optabs.cc
> index 32ff379ffc3..d8c771547a3 100644
> --- a/gcc/optabs.cc
> +++ b/gcc/optabs.cc
> @@ -8102,6 +8102,21 @@ maybe_legitimize_operand (enum insn_code icode, 
> unsigned int opno,
> goto input;
>   }
>break;
> +
> +case EXPAND_UNDEFINED:
> +  {
> + mode = insn_data[(int) icode].operand[opno].mode;
> + rtx scratch = gen_rtx_SCRATCH (mode);

A scratch of the right mode should already be available in op->value,
since it was created by create_undefined_input_operand.

If that doesn't work for some reason, then it would be better for
create_undefined_input_operand to pass NULL_RTX as the "value"
argument to create_expand_operand.
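
I.e. something like this (untested sketch):

  inline void
  create_undefined_input_operand (class expand_operand *op, machine_mode mode)
  {
    /* Leave the value empty and let maybe_legitimize_operand create the
       scratch or a fresh pseudo itself.  */
    create_expand_operand (op, EXPAND_UNDEFINED, NULL_RTX, mode, false);
  }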

> + /* For SCRATCH rtx which is converted from uninitialized
> +SSA, we convert it as fresh pseudo when target doesn't
> +allow scratch rtx in predicate. Otherwise, return true.  */
> + if (!insn_operand_matches (icode, opno, scratch))
> +   {
> + op->value = gen_reg_rtx (mode);

The mode should come from op->mode.

> + goto input;
> +   }
> + return true;
> +  }
>  }
>return insn_operand_matches (icode, opno, op->value);
>  }
> @@ -8147,6 +8162,7 @@ can_reuse_operands_p (enum insn_code icode,
>  case EXPAND_INPUT:
>  case EXPAND_ADDRESS:
>  case EXPAND_INTEGER:
> +case EXPAND_UNDEFINED:
>return true;

I think this should be in the "return false" block instead.

>  
>  case EXPAND_CONVERT_TO:
> diff --git a/gcc/optabs.h b/gcc/optabs.h
> index c80b7f4dc1b..4eb1f9ee09a 100644
> --- a/gcc/optabs.h
> +++ b/gcc/optabs.h
> @@ -37,7 +37,8 @@ enum expand_operand_type {
>EXPAND_CONVERT_TO,
>EXPAND_CONVERT_FROM,
>EXPAND_ADDRESS,
> -  EXPAND_INTEGER
> +  EXPAND_INTEGER,
> +  EXPAND_UNDEFINED

Sorry, this was my bad suggestion.  I should have suggested
EXPAND_UNDEFINED_INPUT, to match the name of the function.

Thanks,
Richard

>  };
>  
>  /* Information about an operand for instruction expansion.  */
> @@ -117,6 +118,17 @@ create_input_operand (class expand_operand *op, rtx 
> value,
>create_expand_operand (op, EXPAND_INPUT, value, mode, false);
>  }
>  
> +/* Make OP describe an undefined input operand for uninitialized
> +   SSA.  It's the scratch operand with mode MODE; MODE cannot be
> +   VOIDmode.  */
> +
> +inline void
> +create_undefined_input_operand (class expand_operand *op, machine_mode mode)
> +{
> +  create_expand_operand (op, EXPAND_UNDEFINED, gen_rtx_SCRATCH (mode), mode,
> +  false);
> +}
> +
>  /* Like create_input_operand, except that VALUE must first be converted
> to mode MODE.  UNSIGNED_P says whether VALUE is unsigned.  */


Re: [AArch64][testsuite] Adjust vect_copy_lane_1.c for new code-gen

2023-09-17 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> Hi,
> After 27de9aa152141e7f3ee66372647d0f2cd94c4b90, there's the following
> regression:
> FAIL: gcc.target/aarch64/vect_copy_lane_1.c scan-assembler-times
> ins\\tv0.s\\[1\\], v1.s\\[0\\] 3
>
> This happens because for the following function from vect_copy_lane_1.c:
> float32x2_t
> __attribute__((noinline, noclone)) test_copy_lane_f32 (float32x2_t a,
> float32x2_t b)
> {
>   return vcopy_lane_f32 (a, 1, b, 0);
> }
>
> Before 27de9aa152141e7f3ee66372647d0f2cd94c4b90,
> it got lowered to following sequence in .optimized dump:
>[local count: 1073741824]:
>   _4 = BIT_FIELD_REF ;
>   __a_5 = BIT_INSERT_EXPR ;
>   return __a_5;
>
> The above commit simplifies BIT_FIELD_REF + BIT_INSERT_EXPR
> to a vector permutation, so it now gets lowered to:
>
>[local count: 1073741824]:
>   __a_4 = VEC_PERM_EXPR ;
>   return __a_4;
>
> Since we give higher priority to aarch64_evpc_zip over aarch64_evpc_ins
> in aarch64_expand_vec_perm_const_1, it now generates:
>
> test_copy_lane_f32:
> zip1    v0.2s, v0.2s, v1.2s
> ret
>
> Similarly for test_copy_lane_[us]32.

Yeah, I suppose this choice is at least as good as INS.  It has the advantage
that the source and destination don't need to be tied.  For example:

int32x2_t f(int32x2_t a, int32x2_t b, int32x2_t c) {
return vcopy_lane_s32 (b, 1, c, 0);
}

used to be:

f:
mov v0.8b, v1.8b
ins v0.s[1], v2.s[0]
ret

but is now:

f:
zip1    v0.2s, v1.2s, v2.2s
ret

> The attached patch adjusts the tests to reflect the change in code-gen
> and the tests pass.
> OK to commit ?
>
> Thanks,
> Prathamesh
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c 
> b/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> index 2848be564d5..811dc678b92 100644
> --- a/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vect_copy_lane_1.c
> @@ -22,7 +22,7 @@ BUILD_TEST (uint16x4_t, uint16x4_t, , , u16, 3, 2)
>  BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>  BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
>  BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
> -/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[1\\\], v1.s\\\[0\\\]" 3 
> } } */
> +/* { dg-final { scan-assembler-times "zip1\\tv0.2s, v0.2s, v1.2s" 3 } } */
>  BUILD_TEST (int64x1_t,   int64x1_t,   , , s64, 0, 0)
>  BUILD_TEST (uint64x1_t,  uint64x1_t,  , , u64, 0, 0)
>  BUILD_TEST (float64x1_t, float64x1_t, , , f64, 0, 0)

OK, thanks.

Richard


Re: [PATCH] AArch64: Improve immediate expansion [PR105928]

2023-09-17 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> Support immediate expansion of immediates which can be created from 2 MOVKs
> and a shifted ORR or BIC instruction.  Change aarch64_split_dimode_const_store
> to apply if we save one instruction.
>
> This reduces the number of 4-instruction immediates in SPECINT/FP by 5%.
>
> Passes regress, OK for commit?
>
> gcc/ChangeLog:
> PR target/105928
> * config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
> Add support for immediates using shifted ORR/BIC.
> (aarch64_split_dimode_const_store): Apply if we save one instruction.
> * config/aarch64/aarch64.md (_3):
> Make pattern global.
>
> gcc/testsuite:
> PR target/105928
> * gcc.target/aarch64/pr105928.c: Add new test.
> * gcc.target/aarch64/vect-cse-codegen.c: Fix test.

Looks good apart from a comment below about the test.

I was worried that reusing "dest" for intermediate results would
prevent CSE for cases like:

void g (long long, long long);
void
f (long long *ptr)
{
  g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL);
}

where the same 32-bit lowpart pattern is used for two immediates.
In principle, that could be avoided using:

if (generate)
  {
rtx tmp = aarch64_target_reg (dest, DImode);
emit_insn (gen_rtx_SET (tmp, GEN_INT (val2 & 0x)));
emit_insn (gen_insv_immdi (tmp, GEN_INT (16),
   GEN_INT (val2 >> 16)));
set_unique_reg_note (get_last_insn (), REG_EQUAL,
 GEN_INT (val2));
emit_insn (gen_ior_ashldi3 (dest, tmp, GEN_INT (i), tmp));
  }
return 3;

But it doesn't work, since we only expose the individual immediates
during split1, and nothing between split1 and ira is able to remove
redundancies.  There's no point complicating the code for a theoretical
future optimisation.
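
For the record, the kind of immediate this helps is one whose low
32 bits repeat, e.g. 0x1234567812345678, which can now be built in
three instructions instead of four (illustrative sequence; exact
syntax and register choice may differ):

        mov     x0, 0x5678
        movk    x0, 0x1234, lsl 16
        orr     x0, x0, x0, lsl 32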

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> c44c0b979d0cc3755c61dcf566cfddedccebf1ea..832f8197ac8d1a04986791e6f3e51861e41944b2
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -5639,7 +5639,7 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
> generate,
> machine_mode mode)
>  {
>int i;
> -  unsigned HOST_WIDE_INT val, val2, mask;
> +  unsigned HOST_WIDE_INT val, val2, val3, mask;
>int one_match, zero_match;
>int num_insns;
>
> @@ -5721,6 +5721,35 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, 
> bool generate,
> }
>   return 3;
> }
> +
> +  /* Try shifting and inserting the bottom 32-bits into the top bits.  */
> +  val2 = val & 0x;
> +  val3 = 0x;
> +  val3 = val2 | (val3 << 32);
> +  for (i = 17; i < 48; i++)
> +   if ((val2 | (val2 << i)) == val)
> + {
> +   if (generate)
> + {
> +   emit_insn (gen_rtx_SET (dest, GEN_INT (val2 & 0x)));
> +   emit_insn (gen_insv_immdi (dest, GEN_INT (16),
> +  GEN_INT (val2 >> 16)));
> +   emit_insn (gen_ior_ashldi3 (dest, dest, GEN_INT (i), dest));
> + }
> +   return 3;
> + }
> +   else if ((val3 & ~(val3 << i)) == val)
> + {
> +   if (generate)
> + {
> +   emit_insn (gen_rtx_SET (dest, GEN_INT (val3 | 0x)));
> +   emit_insn (gen_insv_immdi (dest, GEN_INT (16),
> +  GEN_INT (val2 >> 16)));
> +   emit_insn (gen_and_one_cmpl_ashldi3 (dest, dest, GEN_INT (i),
> + dest));
> + }
> +   return 3;
> + }
>  }
>
>/* Generate 2-4 instructions, skipping 16 bits of all zeroes or ones which
> @@ -25506,8 +25535,6 @@ aarch64_split_dimode_const_store (rtx dst, rtx src)
>rtx lo = gen_lowpart (SImode, src);
>rtx hi = gen_highpart_mode (SImode, DImode, src);
>
> -  bool size_p = optimize_function_for_size_p (cfun);
> -
>if (!rtx_equal_p (lo, hi))
>  return false;
>
> @@ -25526,14 +25553,8 @@ aarch64_split_dimode_const_store (rtx dst, rtx src)
>   MOV   w1, 49370
>   MOVK  w1, 0x140, lsl 16
>   STP   w1, w1, [x0]
> -   So we want to perform this only when we save two instructions
> -   or more.  When optimizing for size, however, accept any code size
> -   savings we can.  */
> -  if (size_p && orig_cost <= lo_cost)
> -return false;
> -
> -  if (!size_p
> -  && (orig_cost <= lo_cost + 1))
> +   So we want to perform this when we save at least one instruction.  */
> +  if (orig_cost <= lo_cost)
>  return false;
>
>rtx mem_lo = adjust_address (dst, SImode, 0);
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> 

Re: [PATCH] internal-fn: Convert uninitialized SSA_NAME into SCRATCH rtx[PR110751]

2023-09-17 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> According to PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110751
>
> As Richard and Richi suggested, we recognize an uninitialized SSA_NAME
> and convert it into a SCRATCH rtx if the target predicate allows SCRATCH.
>
> It can help to reduce redundant data-move instructions on targets like
> RISC-V.
>
> Here we add the condition "insn_operand_matches (icode, opno, scratch)".
> Then we will only create a scratch rtx when the target allows a scratch
> rtx in the predicate.  When the target doesn't allow a scratch rtx in
> the predicate, the later "else" condition will create a fresh pseudo
> for the uninitialized SSA.
>
> I have verified it in the RISC-V port and it works well.
>
> Bootstrap and Regression on X86 passed.
>
> Ok for trunk ?
>  
> gcc/ChangeLog:
>
>   * internal-fn.cc (expand_fn_using_insn): Convert uninitialized SSA into 
> scratch.
>
> ---
>  gcc/internal-fn.cc | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 0fd34359247..fe4d86b3dbd 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -243,10 +243,16 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, 
> unsigned int noutputs,
>tree rhs = gimple_call_arg (stmt, i);
>tree rhs_type = TREE_TYPE (rhs);
>rtx rhs_rtx = expand_normal (rhs);
> +  rtx scratch = gen_rtx_SCRATCH (TYPE_MODE (rhs_type));
>if (INTEGRAL_TYPE_P (rhs_type))
>   create_convert_operand_from (&ops[opno], rhs_rtx,
>TYPE_MODE (rhs_type),
>TYPE_UNSIGNED (rhs_type));
> +  else if (TREE_CODE (rhs) == SSA_NAME
> +&& SSA_NAME_IS_DEFAULT_DEF (rhs)
> +&& VAR_P (SSA_NAME_VAR (rhs))
> +&& insn_operand_matches (icode, opno, scratch))

Rather than check insn_operand_matches here, I think we should create
the scratch operand regardless and leave optabs.cc to deal with it.
(This will need changes to optabs.cc.)

How about adding:

  create_undefined_input_operand (expand_operand *op, machine_mode mode)

that maps to a new EXPAND_UNDEFINED, then handle EXPAND_UNDEFINED in the
two case statements in optabs.cc.

Thanks,
Richard

> + create_input_operand (&ops[opno], scratch, TYPE_MODE (rhs_type));
>else
>   create_input_operand (&ops[opno], rhs_rtx, TYPE_MODE (rhs_type));
>opno += 1;


[PATCH] aarch64: Fix loose ldpstp check [PR111411]

2023-09-15 Thread Richard Sandiford via Gcc-patches
aarch64_operands_ok_for_ldpstp contained the code:

  /* One of the memory accesses must be a mempair operand.
 If it is not the first one, they need to be swapped by the
 peephole.  */
  if (!aarch64_mem_pair_operand (mem_1, GET_MODE (mem_1))
   && !aarch64_mem_pair_operand (mem_2, GET_MODE (mem_2)))
return false;

But the requirement isn't just that one of the accesses must be a
valid mempair operand.  It's that the lower access must be, since
that's the access that will be used for the instruction operand.

Tested on aarch64-linux-gnu & pushed.  The patch applies cleanly
to GCC 12 and 13, so I'll backport there next week.  GCC 11 will
need a bespoke fix if the problem shows up there, but I doubt it will.

Richard


gcc/
PR target/111411
* config/aarch64/aarch64.cc (aarch64_operands_ok_for_ldpstp): Require
the lower memory access to a mem-pair operand.

gcc/testsuite/
PR target/111411
* gcc.dg/rtl/aarch64/pr111411.c: New test.
---
 gcc/config/aarch64/aarch64.cc   |  8 ++-
 gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c | 57 +
 2 files changed, 60 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 0962fc4f56e..7bb1161f943 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26503,11 +26503,9 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool 
load,
   gcc_assert (known_eq (GET_MODE_SIZE (GET_MODE (mem_1)),
GET_MODE_SIZE (GET_MODE (mem_2))));
 
-  /* One of the memory accesses must be a mempair operand.
- If it is not the first one, they need to be swapped by the
- peephole.  */
-  if (!aarch64_mem_pair_operand (mem_1, GET_MODE (mem_1))
-   && !aarch64_mem_pair_operand (mem_2, GET_MODE (mem_2)))
+  /* The lower memory access must be a mem-pair operand.  */
+  rtx lower_mem = reversed ? mem_2 : mem_1;
+  if (!aarch64_mem_pair_operand (lower_mem, GET_MODE (lower_mem)))
 return false;
 
   if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1)))
diff --git a/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c 
b/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c
new file mode 100644
index 000..ad07e9c6c89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/rtl/aarch64/pr111411.c
@@ -0,0 +1,57 @@
+/* { dg-do compile { target aarch64*-*-* } } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-O -fdisable-rtl-postreload -fpeephole2 -fno-schedule-fusion" 
} */
+
+extern int data[];
+
+void __RTL (startwith ("ira")) foo (void *ptr)
+{
+  (function "foo"
+(param "ptr"
+  (DECL_RTL (reg/v:DI <0> [ ptr ]))
+  (DECL_RTL_INCOMING (reg/v:DI x0 [ ptr ]))
+) ;; param "ptr"
+(insn-chain
+  (block 2
+   (edge-from entry (flags "FALLTHRU"))
+   (cnote 3 [bb 2] NOTE_INSN_BASIC_BLOCK)
+   (insn 4 (set (reg:DI <0>) (reg:DI x0)))
+   (insn 5 (set (reg:DI <1>)
+(plus:DI (reg:DI <0>) (const_int 768
+   (insn 6 (set (mem:SI (plus:DI (reg:DI <0>)
+ (const_int 508)) [1 +508 S4 A4])
+(const_int 0)))
+   (insn 7 (set (mem:SI (plus:DI (reg:DI <1>)
+ (const_int -256)) [1 +512 S4 A4])
+(const_int 0)))
+   (edge-to exit (flags "FALLTHRU"))
+  ) ;; block 2
+) ;; insn-chain
+  ) ;; function
+}
+
+void __RTL (startwith ("ira")) bar (void *ptr)
+{
+  (function "bar"
+(param "ptr"
+  (DECL_RTL (reg/v:DI <0> [ ptr ]))
+  (DECL_RTL_INCOMING (reg/v:DI x0 [ ptr ]))
+) ;; param "ptr"
+(insn-chain
+  (block 2
+   (edge-from entry (flags "FALLTHRU"))
+   (cnote 3 [bb 2] NOTE_INSN_BASIC_BLOCK)
+   (insn 4 (set (reg:DI <0>) (reg:DI x0)))
+   (insn 5 (set (reg:DI <1>)
+(plus:DI (reg:DI <0>) (const_int 768
+   (insn 6 (set (mem:SI (plus:DI (reg:DI <1>)
+ (const_int -256)) [1 +512 S4 A4])
+(const_int 0)))
+   (insn 7 (set (mem:SI (plus:DI (reg:DI <0>)
+ (const_int 508)) [1 +508 S4 A4])
+(const_int 0)))
+   (edge-to exit (flags "FALLTHRU"))
+  ) ;; block 2
+) ;; insn-chain
+  ) ;; function
+}
-- 
2.25.1



[PATCH] aarch64: Restore SVE WHILE costing

2023-09-14 Thread Richard Sandiford via Gcc-patches
AArch64 previously costed WHILELO instructions on the first call
to add_stmt_cost.  This was because, at the time, only add_stmt_cost
had access to the loop_vec_info.

However, after the AVX512 changes, we only calculate the masks later.
This patch moves the WHILELO costing to finish_cost, which is in any
case a more logical place for it to be.  It also means that we can
check the final decision about whether to use predicated loops.

Tested on aarch64-linux-gnu & applied.

Richard


gcc/
* config/aarch64/aarch64.cc (aarch64_vector_costs::analyze_loop_info):
Move WHILELO handling to...
(aarch64_vector_costs::finish_cost): ...here.  Check whether the
vectorizer has decided to use a predicated loop.

gcc/testsuite/
* gcc.target/aarch64/sve/cost_model_15.c: New test.
---
 gcc/config/aarch64/aarch64.cc | 36 ++-
 .../gcc.target/aarch64/sve/cost_model_15.c| 13 +++
 2 files changed, 32 insertions(+), 17 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 3739a44bfd9..0962fc4f56e 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16310,22 +16310,6 @@ aarch64_vector_costs::analyze_loop_vinfo 
(loop_vec_info loop_vinfo)
   /* Detect whether we're vectorizing for SVE and should apply the unrolling
  heuristic described above m_unrolled_advsimd_niters.  */
   record_potential_advsimd_unrolling (loop_vinfo);
-
-  /* Record the issue information for any SVE WHILE instructions that the
- loop needs.  */
-  if (!m_ops.is_empty () && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
-{
-  unsigned int num_masks = 0;
-  rgroup_controls *rgm;
-  unsigned int num_vectors_m1;
-  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec,
-   num_vectors_m1, rgm)
-   if (rgm->type)
- num_masks += num_vectors_m1 + 1;
-  for (auto &ops : m_ops)
-   if (auto *issue = ops.sve_issue_info ())
- ops.pred_ops += num_masks * issue->while_pred_ops;
-}
 }
 
 /* Implement targetm.vectorize.builtin_vectorization_cost.  */
@@ -17507,9 +17491,27 @@ adjust_body_cost (loop_vec_info loop_vinfo,
 void
 aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
 {
+  /* Record the issue information for any SVE WHILE instructions that the
+ loop needs.  */
+  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
+  if (!m_ops.is_empty ()
+  && loop_vinfo
+  && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+{
+  unsigned int num_masks = 0;
+  rgroup_controls *rgm;
+  unsigned int num_vectors_m1;
+  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo).rgc_vec,
+   num_vectors_m1, rgm)
+   if (rgm->type)
+ num_masks += num_vectors_m1 + 1;
+  for (auto &ops : m_ops)
+   if (auto *issue = ops.sve_issue_info ())
+ ops.pred_ops += num_masks * issue->while_pred_ops;
+}
+
   auto *scalar_costs
    = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
-  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
   if (loop_vinfo
   && m_vec_flags
   && aarch64_use_new_vector_costs_p ())
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c 
b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c
new file mode 100644
index 000..b9e6306bb59
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_15.c
@@ -0,0 +1,13 @@
+/* { dg-options "-Ofast -mtune=neoverse-v1" } */
+
+double f(double *restrict x, double *restrict y, int *restrict z)
+{
+  double res = 0.0;
+  for (int i = 0; i < 100; ++i)
+res += x[i] * y[z[i]];
+  return res;
+}
+
+/* { dg-final { scan-assembler-times {\tld1sw\tz[0-9]+\.d,} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d,} 2 } } */
+/* { dg-final { scan-assembler-times {\tfmla\tz[0-9]+\.d,} 1 } } */
-- 
2.25.1



[PATCH] aarch64: Coerce addresses to be suitable for LD1RQ

2023-09-14 Thread Richard Sandiford via Gcc-patches
In the following test:

  svuint8_t ld(uint8_t *ptr) { return svld1rq(svptrue_b8(), ptr + 2); }

ptr + 2 is a valid address for an Advanced SIMD load, but not for
an SVE load.  We therefore ended up generating:

ldr q0, [x0, 2]
dup z0.q, z0.q[0]

This patch makes us generate LD1RQ for that case too.  It takes the
slightly old-school approach of making the predicate broader than
the constraint.  That is: any valid memory address is accepted as
an operand before RA.  If the instruction remains during RA, LRA will
coerce the address to match the constraint.  If the instruction gets
split before RA, the splitter will load invalid addresses into a
scratch register.
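
With the patch, the example above becomes something like (exact code
generation and register numbers may vary):

        ptrue   p0.b, all
        add     x1, x0, 2
        ld1rqb  z0.b, p0/z, [x1]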

Tested on aarch64-linux-gnu & pushed.

Richard

gcc/
* config/aarch64/aarch64-sve.md (@aarch64_vec_duplicate_vq_le):
Accept all nonimmediate_operands, but keep the existing constraints.
If the instruction is split before RA, load invalid addresses into
a temporary register.
* config/aarch64/predicates.md (aarch64_sve_dup_ld1rq_operand): Delete.

gcc/testsuite/
* gcc.target/aarch64/sve/acle/general/ld1rq_1.c: New test.
---
 gcc/config/aarch64/aarch64-sve.md | 15 -
 gcc/config/aarch64/predicates.md  |  4 ---
 .../aarch64/sve/acle/general/ld1rq_1.c| 33 +++
 3 files changed, 47 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c

diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index da5534c3e32..b223e7d3c9d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -2611,11 +2611,18 @@ (define_insn_and_split "*vec_duplicate_reg"
 )
 
 ;; Duplicate an Advanced SIMD vector to fill an SVE vector (LE version).
+;;
+;; The addressing mode range of LD1RQ does not match the addressing mode
+;; range of LDR Qn.  If the predicate enforced the LD1RQ range, we would
+;; not be able to combine LDR Qns outside that range.  The predicate
+;; therefore accepts all memory operands, with only the constraints
+;; enforcing the actual restrictions.  If the instruction is split
+;; before RA, we need to load invalid addresses into a temporary.
 
 (define_insn_and_split "@aarch64_vec_duplicate_vq_le"
   [(set (match_operand:SVE_FULL 0 "register_operand" "=w, w")
(vec_duplicate:SVE_FULL
- (match_operand: 1 "aarch64_sve_dup_ld1rq_operand" "w, UtQ")))
+ (match_operand: 1 "nonimmediate_operand" "w, UtQ")))
(clobber (match_scratch:VNx16BI 2 "=X, Upl"))]
   "TARGET_SVE && !BYTES_BIG_ENDIAN"
   {
@@ -2633,6 +2640,12 @@ (define_insn_and_split 
"@aarch64_vec_duplicate_vq_le"
   "&& MEM_P (operands[1])"
   [(const_int 0)]
   {
+if (can_create_pseudo_p ()
+&& !aarch64_sve_ld1rq_operand (operands[1], mode))
+  {
+   rtx addr = force_reg (Pmode, XEXP (operands[1], 0));
+   operands[1] = replace_equiv_address (operands[1], addr);
+  }
 if (GET_CODE (operands[2]) == SCRATCH)
   operands[2] = gen_reg_rtx (VNx16BImode);
 emit_move_insn (operands[2], CONSTM1_RTX (VNx16BImode));
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index 2d8d1fe25c1..01de4743974 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -732,10 +732,6 @@ (define_predicate "aarch64_sve_dup_operand"
   (ior (match_operand 0 "register_operand")
(match_operand 0 "aarch64_sve_ld1r_operand")))
 
-(define_predicate "aarch64_sve_dup_ld1rq_operand"
-  (ior (match_operand 0 "register_operand")
-   (match_operand 0 "aarch64_sve_ld1rq_operand")))
-
 (define_predicate "aarch64_sve_ptrue_svpattern_immediate"
   (and (match_code "const")
(match_test "aarch64_sve_ptrue_svpattern_p (op, NULL)")))
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c
new file mode 100644
index 000..9242c639731
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/ld1rq_1.c
@@ -0,0 +1,33 @@
+/* { dg-options "-O2" } */
+
+#include <arm_sve.h>
+
+#define TEST_OFFSET(TYPE, SUFFIX, OFFSET) \
+  sv##TYPE##_t \
+  test_##TYPE##_##SUFFIX (TYPE##_t *ptr) \
+  { \
+return svld1rq(svptrue_b8(), ptr + OFFSET); \
+  }
+
+#define TEST(TYPE) \
+  TEST_OFFSET (TYPE, 0, 0) \
+  TEST_OFFSET (TYPE, 1, 1) \
+  TEST_OFFSET (TYPE, 2, 2) \
+  TEST_OFFSET (TYPE, 16, 16) \
+  TEST_OFFSET (TYPE, 0x1, 0x1) \
+  TEST_OFFSET (TYPE, 0x10001, 0x10001) \
+  TEST_OFFSET (TYPE, m1, -1) \
+  TEST_OFFSET (TYPE, m2, -2) \
+  TEST_OFFSET (TYPE, m16, -16) \
+  TEST_OFFSET (TYPE, m0x1, -0x1) \
+  TEST_OFFSET (TYPE, m0x10001, -0x10001)
+
+TEST (int8)
+TEST (int16)
+TEST (uint32)
+TEST (uint64)
+
+/* { dg-final { scan-assembler-times {\tld1rqb\t} 11 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tld1rqh\t} 11 { target 
aarch64_little_endian } } } 

Re: [PATCH] AArch64: List official cores before codenames

2023-09-13 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> List official cores first so that -mcpu=native does not show a codename with -v
> or in errors/warnings.

Nice spot.

> Passes regress, OK for commit?
>
> gcc/ChangeLog:
> * config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares.
> (neoverse-v1): Place before zeus.
> (neoverse-v2): Place before demeter.
> * config/aarch64/aarch64-tune.md: Regenerate.

OK, thanks.  OK for backports too from my POV.

Richard

> ---
>
> diff --git a/gcc/config/aarch64/aarch64-cores.def 
> b/gcc/config/aarch64/aarch64-cores.def
> index 
> dbac497ef3aab410eb81db185b2e9532186888bb..3894f2afc27e71523e5a413fa45c144222082934
>  100644
> --- a/gcc/config/aarch64/aarch64-cores.def
> +++ b/gcc/config/aarch64/aarch64-cores.def
> @@ -115,8 +115,8 @@ AARCH64_CORE("cortex-a65",  cortexa65, cortexa53, V8_2A,  
> (F16, RCPC, DOTPROD, S
>  AARCH64_CORE("cortex-a65ae",  cortexa65ae, cortexa53, V8_2A,  (F16, RCPC, 
> DOTPROD, SSBS), cortexa73, 0x41, 0xd43, -1)
>  AARCH64_CORE("cortex-x1",  cortexx1, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
> SSBS, PROFILE), neoversen1, 0x41, 0xd44, -1)
>  AARCH64_CORE("cortex-x1c",  cortexx1c, cortexa57, V8_2A,  (F16, RCPC, 
> DOTPROD, SSBS, PROFILE, PAUTH), neoversen1, 0x41, 0xd4c, -1)
> -AARCH64_CORE("ares",  ares, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
> PROFILE), neoversen1, 0x41, 0xd0c, -1)
>  AARCH64_CORE("neoverse-n1",  neoversen1, cortexa57, V8_2A,  (F16, RCPC, 
> DOTPROD, PROFILE), neoversen1, 0x41, 0xd0c, -1)
> +AARCH64_CORE("ares",  ares, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
> PROFILE), neoversen1, 0x41, 0xd0c, -1)
>  AARCH64_CORE("neoverse-e1",  neoversee1, cortexa53, V8_2A,  (F16, RCPC, 
> DOTPROD, SSBS), cortexa73, 0x41, 0xd4a, -1)
>
>  /* Cavium ('C') cores. */
> @@ -143,8 +143,8 @@ AARCH64_CORE("thunderx3t110",  thunderx3t110,  
> thunderx3t110, V8_3A,  (CRYPTO, S
>  /* ARMv8.4-A Architecture Processors.  */
>
>  /* Arm ('A') cores.  */
> -AARCH64_CORE("zeus", zeus, cortexa57, V8_4A,  (SVE, I8MM, BF16, PROFILE, 
> SSBS, RNG), neoversev1, 0x41, 0xd40, -1)
>  AARCH64_CORE("neoverse-v1", neoversev1, cortexa57, V8_4A,  (SVE, I8MM, BF16, 
> PROFILE, SSBS, RNG), neoversev1, 0x41, 0xd40, -1)
> +AARCH64_CORE("zeus", zeus, cortexa57, V8_4A,  (SVE, I8MM, BF16, PROFILE, 
> SSBS, RNG), neoversev1, 0x41, 0xd40, -1)
>  AARCH64_CORE("neoverse-512tvb", neoverse512tvb, cortexa57, V8_4A,  (SVE, 
> I8MM, BF16, PROFILE, SSBS, RNG), neoverse512tvb, INVALID_IMP, INVALID_CORE, 
> -1)
>
>  /* Qualcomm ('Q') cores. */
> @@ -182,7 +182,7 @@ AARCH64_CORE("cortex-x3",  cortexx3, cortexa57, V9A,  
> (SVE2_BITPERM, MEMTAG, I8M
>
>  AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, 
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1)
>
> -AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
> RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
>  AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, 
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
> +AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
> RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
>
>  #undef AARCH64_CORE
> diff --git a/gcc/config/aarch64/aarch64-tune.md 
> b/gcc/config/aarch64/aarch64-tune.md
> index 
> 2170980dddb0d5d410a49631ad26ff2e346b39dd..69e5357fa814e4733b05f7164bfa11e4aa04
>  100644
> --- a/gcc/config/aarch64/aarch64-tune.md
> +++ b/gcc/config/aarch64/aarch64-tune.md
> @@ -1,5 +1,5 @@
>  ;; -*- buffer-read-only: t -*-
>  ;; Generated automatically by gentune.sh from aarch64-cores.def
>  (define_attr "tune"
> -   
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,demeter,neoversev2"
> +   
> 

[PATCH 17/19] aarch64: Explicitly record probe registers in frame info

2023-09-12 Thread Richard Sandiford via Gcc-patches
The stack frame is currently divided into three areas:

A: the area above the hard frame pointer
B: the SVE saves below the hard frame pointer
C: the outgoing arguments

If the stack frame is allocated in one chunk, the allocation needs a
probe if the frame size is >= guard_size - 1KiB.  In addition, if the
function is not a leaf function, it must probe an address no more than
1KiB above the outgoing SP.  We ensured the second condition by

(1) using single-chunk allocations for non-leaf functions only if
the link register save slot is within 512 bytes of the bottom
of the frame; and

(2) using the link register save as a probe (meaning, for instance,
that it can't be individually shrink wrapped)

If instead the stack is allocated in multiple chunks, then:

* an allocation involving only the outgoing arguments (C above) requires
  a probe if the allocation size is > 1KiB

* any other allocation requires a probe if the allocation size
  is >= guard_size - 1KiB

* second and subsequent allocations require the previous allocation
  to probe at the bottom of the allocated area, regardless of the size
  of that previous allocation

The final point means that, unlike for single allocations,
it can be necessary to have both a non-SVE register probe and
an SVE register probe.  For example:

* allocate A, probe using a non-SVE register save
* allocate B, probe using an SVE register save
* allocate C

The non-SVE register used in this case was again the link register.
It was previously used even if the link register save slot was some
bytes above the bottom of the non-SVE register saves, but an earlier
patch avoided that by putting the link register save slot first.

As a belt-and-braces fix, this patch explicitly records which
probe registers we're using and allows the non-SVE probe to be
whichever register comes first (as for SVE).

The patch also avoids unnecessary probes in sve/pcs/stack_clash_3.c.
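
As a stand-alone illustration of the rules above (not GCC code; the real
logic also has to account for unprobed space inherited from the caller
and from earlier allocations):

  /* Decide whether a single allocation chunk needs a probe.  */
  static int
  allocation_needs_probe (long long size, int outgoing_args_only_p,
                          long long guard_size)
  {
    if (outgoing_args_only_p)
      return size > 1024;               /* area C: probe only if > 1KiB  */
    return size >= guard_size - 1024;   /* otherwise: >= guard_size - 1KiB  */
  }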

gcc/
* config/aarch64/aarch64.h (aarch64_frame::sve_save_and_probe)
(aarch64_frame::hard_fp_save_and_probe): New fields.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize them.
Rather than asserting that a leaf function saves LR, instead assert
that a leaf function saves something.
(aarch64_get_separate_components): Prevent the chosen probe
registers from being individually shrink-wrapped.
(aarch64_allocate_and_probe_stack_space): Remove workaround for
probe registers that aren't at the bottom of the previous allocation.

gcc/testsuite/
* gcc.target/aarch64/sve/pcs/stack_clash_3.c: Avoid redundant probes.
---
 gcc/config/aarch64/aarch64.cc | 68 +++
 gcc/config/aarch64/aarch64.h  |  8 +++
 .../aarch64/sve/pcs/stack_clash_3.c   |  6 +-
 3 files changed, 64 insertions(+), 18 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index bcb879ba94b..3c7c476c4c6 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8510,15 +8510,11 @@ aarch64_layout_frame (void)
&& !crtl->abi->clobbers_full_reg_p (regno))
   frame.reg_offset[regno] = SLOT_REQUIRED;
 
-  /* With stack-clash, LR must be saved in non-leaf functions.  The saving of
- LR counts as an implicit probe which allows us to maintain the invariant
- described in the comment at expand_prologue.  */
-  gcc_assert (crtl->is_leaf
- || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED));
 
   poly_int64 offset = crtl->outgoing_args_size;
   gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
   frame.bytes_below_saved_regs = offset;
+  frame.sve_save_and_probe = INVALID_REGNUM;
 
   /* Now assign stack slots for the registers.  Start with the predicate
  registers, since predicate LDR and STR have a relatively small
@@ -8526,6 +8522,8 @@ aarch64_layout_frame (void)
   for (regno = P0_REGNUM; regno <= P15_REGNUM; regno++)
 if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
   {
+   if (frame.sve_save_and_probe == INVALID_REGNUM)
+ frame.sve_save_and_probe = regno;
frame.reg_offset[regno] = offset;
offset += BYTES_PER_SVE_PRED;
   }
@@ -8563,6 +8561,8 @@ aarch64_layout_frame (void)
 for (regno = V0_REGNUM; regno <= V31_REGNUM; regno++)
   if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
{
+ if (frame.sve_save_and_probe == INVALID_REGNUM)
+   frame.sve_save_and_probe = regno;
  frame.reg_offset[regno] = offset;
  offset += vector_save_size;
}
@@ -8572,10 +8572,18 @@ aarch64_layout_frame (void)
   frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
   bool saves_below_hard_fp_p
 = maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
+  gcc_assert (!saves_below_hard_fp_p
+ || (frame.sve_save_and_probe != INVALID_REGNUM
+ && known_eq 

[PATCH 19/19] aarch64: Make stack smash canary protect saved registers

2023-09-12 Thread Richard Sandiford via Gcc-patches
AArch64 normally puts the saved registers near the bottom of the frame,
immediately above any dynamic allocations.  But this means that a
stack-smash attack on those dynamic allocations could overwrite the
saved registers without needing to reach as far as the stack smash
canary.
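
For example (illustrative only, not from the testsuite):

  void consume (char *);

  void
  f (int n)
  {
    char buf[n];     /* dynamic allocation, placed below the saved
                        registers in the old layout */
    consume (buf);   /* an overflow of buf could then reach the saved
                        LR/FP without touching the canary */
  }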

The same thing could also happen for variable-sized arguments that are
passed by value, since those are allocated before a call and popped on
return.

This patch avoids that by putting the locals (and thus the canary) below
the saved registers when stack smash protection is active.

The patch fixes CVE-2023-4039.

gcc/
* config/aarch64/aarch64.cc (aarch64_save_regs_above_locals_p):
New function.
(aarch64_layout_frame): Use it to decide whether locals should
go above or below the saved registers.
(aarch64_expand_prologue): Update stack layout comment.
Emit a stack tie after the final adjustment.

gcc/testsuite/
* gcc.target/aarch64/stack-protector-8.c: New test.
* gcc.target/aarch64/stack-protector-9.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc | 46 +++--
 .../gcc.target/aarch64/stack-protector-8.c| 95 +++
 .../gcc.target/aarch64/stack-protector-9.c| 33 +++
 3 files changed, 168 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-9.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 51e57370807..3739a44bfd9 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8433,6 +8433,20 @@ aarch64_needs_frame_chain (void)
   return aarch64_use_frame_pointer;
 }
 
+/* Return true if the current function should save registers above
+   the locals area, rather than below it.  */
+
+static bool
+aarch64_save_regs_above_locals_p ()
+{
+  /* When using stack smash protection, make sure that the canary slot
+ comes between the locals and the saved registers.  Otherwise,
+ it would be possible for a carefully sized smash attack to change
+ the saved registers (particularly LR and FP) without reaching the
+ canary.  */
+  return crtl->stack_protect_guard;
+}
+
 /* Mark the registers that need to be saved by the callee and calculate
the size of the callee-saved registers area and frame record (both FP
and LR may be omitted).  */
@@ -8444,6 +8458,7 @@ aarch64_layout_frame (void)
   poly_int64 vector_save_size = GET_MODE_SIZE (vector_save_mode);
   bool frame_related_fp_reg_p = false;
   aarch64_frame  = cfun->machine->frame;
+  poly_int64 top_of_locals = -1;
 
   frame.emit_frame_chain = aarch64_needs_frame_chain ();
 
@@ -8510,9 +8525,16 @@ aarch64_layout_frame (void)
&& !crtl->abi->clobbers_full_reg_p (regno))
   frame.reg_offset[regno] = SLOT_REQUIRED;
 
+  bool regs_at_top_p = aarch64_save_regs_above_locals_p ();
 
   poly_int64 offset = crtl->outgoing_args_size;
   gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
+  if (regs_at_top_p)
+{
+  offset += get_frame_size ();
+  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
+  top_of_locals = offset;
+}
   frame.bytes_below_saved_regs = offset;
   frame.sve_save_and_probe = INVALID_REGNUM;
 
@@ -8652,15 +8674,18 @@ aarch64_layout_frame (void)
  at expand_prologue.  */
   gcc_assert (crtl->is_leaf || maybe_ne (saved_regs_size, 0));
 
-  offset += get_frame_size ();
-  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
-  auto top_of_locals = offset;
-
+  if (!regs_at_top_p)
+{
+  offset += get_frame_size ();
+  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
+  top_of_locals = offset;
+}
   offset += frame.saved_varargs_size;
   gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
   frame.frame_size = offset;
 
   frame.bytes_above_hard_fp = frame.frame_size - frame.bytes_below_hard_fp;
+  gcc_assert (known_ge (top_of_locals, 0));
   frame.bytes_above_locals = frame.frame_size - top_of_locals;
 
   frame.initial_adjust = 0;
@@ -9979,10 +10004,10 @@ aarch64_epilogue_uses (int regno)
|  for register varargs |
|   |
+---+
-   |  local variables  | <-- frame_pointer_rtx
+   |  local variables (1)  | <-- frame_pointer_rtx
|   |
+---+
-   |  padding  |
+   |  padding (1)  |
+---+
|  callee-saved registers   |
+---+
@@ -9994,6 +10019,10 @@ aarch64_epilogue_uses (int regno)
+---+
|  SVE predicate registers  |
+---+
+   |  local variables (2)  

[PATCH 16/19] aarch64: Simplify probe of final frame allocation

2023-09-12 Thread Richard Sandiford via Gcc-patches
Previous patches ensured that the final frame allocation only needs
a probe when the size is strictly greater than 1KiB.  It's therefore
safe to use the normal 1024 probe offset in all cases.

The main motivation for doing this is to simplify the code and
remove the number of special cases.

gcc/
* config/aarch64/aarch64.cc (aarch64_allocate_and_probe_stack_space):
Always probe the residual allocation at offset 1024, asserting
that that is in range.

gcc/testsuite/
* gcc.target/aarch64/stack-check-prologue-17.c: Expect the probe
to be at offset 1024 rather than offset 0.
* gcc.target/aarch64/stack-check-prologue-18.c: Likewise.
* gcc.target/aarch64/stack-check-prologue-19.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc| 12 
 .../gcc.target/aarch64/stack-check-prologue-17.c |  2 +-
 .../gcc.target/aarch64/stack-check-prologue-18.c |  4 ++--
 .../gcc.target/aarch64/stack-check-prologue-19.c |  4 ++--
 4 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 383b32f2078..bcb879ba94b 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -9887,16 +9887,12 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx 
temp2,
  are still safe.  */
   if (residual)
 {
-  HOST_WIDE_INT residual_probe_offset = guard_used_by_caller;
+  gcc_assert (guard_used_by_caller + byte_sp_alignment <= size);
+
   /* If we're doing final adjustments, and we've done any full page
 allocations then any residual needs to be probed.  */
   if (final_adjustment_p && rounded_size != 0)
min_probe_threshold = 0;
-  /* If doing a small final adjustment, we always probe at offset 0.
-This is done to avoid issues when the final adjustment is smaller
-than the probing offset.  */
-  else if (final_adjustment_p && rounded_size == 0)
-   residual_probe_offset = 0;
 
   aarch64_sub_sp (temp1, temp2, residual, frame_related_p);
   if (residual >= min_probe_threshold)
@@ -9907,8 +9903,8 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx 
temp2,
 HOST_WIDE_INT_PRINT_DEC " bytes, probing will be required."
 "\n", residual);
 
-   emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
-residual_probe_offset));
+ emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+  guard_used_by_caller));
  emit_insn (gen_blockage ());
}
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
index 0d8a25d73a2..f0ec1389771 100644
--- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
@@ -33,7 +33,7 @@ int test1(int z) {
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #1040
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
index 82447d20fff..6383bec5ebc 100644
--- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
@@ -9,7 +9,7 @@ void g();
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #4064
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
@@ -50,7 +50,7 @@ int test1(int z) {
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #1040
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
index 73ac3e4e4eb..562039b5e9b 100644
--- a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
@@ -9,7 +9,7 @@ void g();
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #4064
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
 ** cbnzw0, .*
 ** bl  g
 ** ...
@@ -50,7 +50,7 @@ int test1(int z) {
 ** ...
 ** str x30, \[sp\]
 ** sub sp, sp, #1040
-** str xzr, \[sp\]
+** str xzr, \[sp, #?1024\]
** cbnz w0, .*
 ** bl  g
 ** ...
-- 
2.25.1



[PATCH 08/19] aarch64: Rename locals_offset to bytes_above_locals

2023-09-12 Thread Richard Sandiford via Gcc-patches
locals_offset was described as:

  /* Offset from the base of the frame (incomming SP) to the
 top of the locals area.  This value is always a multiple of
 STACK_BOUNDARY.  */

This is implicitly an “upside down” view of the frame: the incoming
SP is at offset 0, and anything N bytes below the incoming SP is at
offset N (rather than -N).

However, reg_offset instead uses a “right way up” view; that is,
it views offsets in address terms.  Something above X is at a
positive offset from X and something below X is at a negative
offset from X.

Also, even on FRAME_GROWS_DOWNWARD targets like AArch64,
target-independent code views offsets in address terms too:
locals are allocated at negative offsets to virtual_stack_vars.

It seems confusing to have *_offset fields of the same structure
using different polarities like this.  This patch tries to avoid
that by renaming locals_offset to bytes_above_locals.
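
For example, if the only thing above the locals is a 16-byte varargs
save area, the field's value is 16 either way; the rename just makes it
clear that this means "the top of the locals is 16 bytes below the
incoming SP" rather than suggesting a positive address offset.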

gcc/
* config/aarch64/aarch64.h (aarch64_frame::locals_offset): Rename to...
(aarch64_frame::bytes_above_locals): ...this.
* config/aarch64/aarch64.cc (aarch64_layout_frame)
(aarch64_initial_elimination_offset): Update accordingly.
---
 gcc/config/aarch64/aarch64.cc | 6 +++---
 gcc/config/aarch64/aarch64.h  | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 25b5fb243a6..bcd1dec6f51 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8637,7 +8637,7 @@ aarch64_layout_frame (void)
  STACK_BOUNDARY / BITS_PER_UNIT));
   frame.frame_size = saved_regs_and_above + frame.bytes_below_saved_regs;
 
-  frame.locals_offset = frame.saved_varargs_size;
+  frame.bytes_above_locals = frame.saved_varargs_size;
 
   frame.initial_adjust = 0;
   frame.final_adjust = 0;
@@ -12854,13 +12854,13 @@ aarch64_initial_elimination_offset (unsigned from, 
unsigned to)
return frame.hard_fp_offset;
 
   if (from == FRAME_POINTER_REGNUM)
-   return frame.hard_fp_offset - frame.locals_offset;
+   return frame.hard_fp_offset - frame.bytes_above_locals;
 }
 
   if (to == STACK_POINTER_REGNUM)
 {
   if (from == FRAME_POINTER_REGNUM)
-   return frame.frame_size - frame.locals_offset;
+   return frame.frame_size - frame.bytes_above_locals;
 }
 
   return frame.frame_size;
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 46dd981b85c..3382f819e72 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -790,10 +790,10 @@ struct GTY (()) aarch64_frame
  always a multiple of STACK_BOUNDARY.  */
   poly_int64 bytes_below_hard_fp;
 
-  /* Offset from the base of the frame (incomming SP) to the
- top of the locals area.  This value is always a multiple of
+  /* The number of bytes between the top of the locals area and the top
+ of the frame (the incomming SP).  This value is always a multiple of
  STACK_BOUNDARY.  */
-  poly_int64 locals_offset;
+  poly_int64 bytes_above_locals;
 
   /* Offset from the base of the frame (incomming SP) to the
  hard_frame_pointer.  This value is always a multiple of
-- 
2.25.1



[PATCH 18/19] aarch64: Remove below_hard_fp_saved_regs_size

2023-09-12 Thread Richard Sandiford via Gcc-patches
After previous patches, it's no longer necessary to store
saved_regs_size and below_hard_fp_saved_regs_size in the frame info.
All measurements instead use the top or bottom of the frame as
reference points.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::saved_regs_size)
(aarch64_frame::below_hard_fp_saved_regs_size): Delete.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Update accordingly.
---
 gcc/config/aarch64/aarch64.cc | 45 ---
 gcc/config/aarch64/aarch64.h  |  7 --
 2 files changed, 21 insertions(+), 31 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 3c7c476c4c6..51e57370807 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8569,9 +8569,8 @@ aarch64_layout_frame (void)
 
   /* OFFSET is now the offset of the hard frame pointer from the bottom
  of the callee save area.  */
-  frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
-  bool saves_below_hard_fp_p
-= maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
+  auto below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
+  bool saves_below_hard_fp_p = maybe_ne (below_hard_fp_saved_regs_size, 0);
   gcc_assert (!saves_below_hard_fp_p
  || (frame.sve_save_and_probe != INVALID_REGNUM
  && known_eq (frame.reg_offset[frame.sve_save_and_probe],
@@ -8641,9 +8640,8 @@ aarch64_layout_frame (void)
 
   offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
 
-  frame.saved_regs_size = offset - frame.bytes_below_saved_regs;
-  gcc_assert (known_eq (frame.saved_regs_size,
-   frame.below_hard_fp_saved_regs_size)
+  auto saved_regs_size = offset - frame.bytes_below_saved_regs;
+  gcc_assert (known_eq (saved_regs_size, below_hard_fp_saved_regs_size)
  || (frame.hard_fp_save_and_probe != INVALID_REGNUM
  && known_eq (frame.reg_offset[frame.hard_fp_save_and_probe],
   frame.bytes_below_hard_fp)));
@@ -8652,7 +8650,7 @@ aarch64_layout_frame (void)
  The saving of the bottommost register counts as an implicit probe,
  which allows us to maintain the invariant described in the comment
  at expand_prologue.  */
-  gcc_assert (crtl->is_leaf || maybe_ne (frame.saved_regs_size, 0));
+  gcc_assert (crtl->is_leaf || maybe_ne (saved_regs_size, 0));
 
   offset += get_frame_size ();
   offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
@@ -8709,7 +8707,7 @@ aarch64_layout_frame (void)
 
   HOST_WIDE_INT const_size, const_below_saved_regs, const_above_fp;
   HOST_WIDE_INT const_saved_regs_size;
-  if (known_eq (frame.saved_regs_size, 0))
+  if (known_eq (saved_regs_size, 0))
 frame.initial_adjust = frame.frame_size;
   else if (frame.frame_size.is_constant (_size)
   && const_size < max_push_offset
@@ -8722,7 +8720,7 @@ aarch64_layout_frame (void)
   frame.callee_adjust = const_size;
 }
   else if (frame.bytes_below_saved_regs.is_constant (_below_saved_regs)
-  && frame.saved_regs_size.is_constant (_saved_regs_size)
+  && saved_regs_size.is_constant (_saved_regs_size)
   && const_below_saved_regs + const_saved_regs_size < 512
   /* We could handle this case even with data below the saved
  registers, provided that that data left us with valid offsets
@@ -8741,8 +8739,7 @@ aarch64_layout_frame (void)
   frame.initial_adjust = frame.frame_size;
 }
   else if (saves_below_hard_fp_p
-  && known_eq (frame.saved_regs_size,
-   frame.below_hard_fp_saved_regs_size))
+  && known_eq (saved_regs_size, below_hard_fp_saved_regs_size))
 {
   /* Frame in which all saves are SVE saves:
 
@@ -8764,7 +8761,7 @@ aarch64_layout_frame (void)
 [save SVE registers relative to SP]
 sub sp, sp, bytes_below_saved_regs  */
   frame.callee_adjust = const_above_fp;
-  frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size;
+  frame.sve_callee_adjust = below_hard_fp_saved_regs_size;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
   else
@@ -8779,7 +8776,7 @@ aarch64_layout_frame (void)
 [save SVE registers relative to SP]
 sub sp, sp, bytes_below_saved_regs  */
   frame.initial_adjust = frame.bytes_above_hard_fp;
-  frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size;
+  frame.sve_callee_adjust = below_hard_fp_saved_regs_size;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
 
@@ -9985,17 +9982,17 @@ aarch64_epilogue_uses (int regno)
|  local variables  | <-- frame_pointer_rtx
|   |
+---+
-   |  padding  | \
-   +---+  |
-   |  callee-saved registers   |  | frame.saved_regs_size
-   

[PATCH 14/19] aarch64: Tweak stack clash boundary condition

2023-09-12 Thread Richard Sandiford via Gcc-patches
The AArch64 ABI says that, when stack clash protection is used,
there can be a maximum of 1KiB of unprobed space at sp on entry
to a function.  Therefore, we need to probe when allocating
>= guard_size - 1KiB of data (>= rather than >).  This is what
GCC does.

If an allocation is exactly guard_size bytes, it is enough to allocate
those bytes and probe once at offset 1024.  It isn't possible to use a
single probe at any other offset: higher would complicate later code,
by leaving more unprobed space than usual, while lower would risk
leaving an entire page unprobed.  For simplicity, the code probes all
allocations at offset 1024.

Some register saves also act as probes.  If we need to allocate
more space below the last such register save probe, we need to
probe the allocation if it is > 1KiB.  Again, this allocation is
then sometimes (but not always) probed at offset 1024.  This sort of
allocation is currently only used for outgoing arguments, which are
rarely this big.

However, the code also probed if this final outgoing-arguments
allocation was == 1KiB, rather than just > 1KiB.  This isn't
necessary, since the register save then probes at offset 1024
as required.  Continuing to probe allocations of exactly 1KiB
would complicate later patches.
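
As a rough worked example (my numbers, not part of the patch: assuming the
default 64KiB guard page and the 1KiB STACK_CLASH_CALLER_GUARD):

  guard_size            = 65536  (64KiB)
  guard_used_by_caller  = 1024   (1KiB)
  byte_sp_alignment     = 16

  initial allocations:  probe once the size reaches
                        guard_size - guard_used_by_caller = 64512 bytes
  final allocations:    old threshold = 1024, so a 1KiB allocation was probed
                        new threshold = 1024 + 16 = 1040, so with 16-byte
                        stack alignment only allocations > 1KiB are probed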

gcc/
* config/aarch64/aarch64.cc (aarch64_allocate_and_probe_stack_space):
Don't probe final allocations that are exactly 1KiB in size (after
unprobed space above the final allocation has been deducted).

gcc/testsuite/
* gcc.target/aarch64/stack-check-prologue-17.c: New test.
---
 gcc/config/aarch64/aarch64.cc |  4 +-
 .../aarch64/stack-check-prologue-17.c | 55 +++
 2 files changed, 58 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e40ccc7d1cf..b942bf3de4a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -9697,9 +9697,11 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx 
temp2,
   HOST_WIDE_INT guard_size
 = 1 << param_stack_clash_protection_guard_size;
   HOST_WIDE_INT guard_used_by_caller = STACK_CLASH_CALLER_GUARD;
+  HOST_WIDE_INT byte_sp_alignment = STACK_BOUNDARY / BITS_PER_UNIT;
+  gcc_assert (multiple_p (poly_size, byte_sp_alignment));
   HOST_WIDE_INT min_probe_threshold
 = (final_adjustment_p
-   ? guard_used_by_caller
+   ? guard_used_by_caller + byte_sp_alignment
: guard_size - guard_used_by_caller);
   /* When doing the final adjustment for the outgoing arguments, take into
  account any unprobed space there is above the current SP.  There are
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c 
b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
new file mode 100644
index 000..0d8a25d73a2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
@@ -0,0 +1,55 @@
+/* { dg-options "-O2 -fstack-clash-protection -fomit-frame-pointer --param 
stack-clash-protection-guard-size=12" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+void f(int, ...);
+void g();
+
+/*
+** test1:
+** ...
+** str x30, \[sp\]
+** sub sp, sp, #1024
+** cbnz w0, .*
+** bl  g
+** ...
+*/
+int test1(int z) {
+  __uint128_t x = 0;
+  int y[0x400];
+  if (z)
+{
+  f(0, 0, 0, 0, 0, 0, 0, &y,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x);
+}
+  g();
+  return 1;
+}
+
+/*
+** test2:
+** ...
+** str x30, \[sp\]
+** sub sp, sp, #1040
+** str xzr, \[sp\]
+** cbnz w0, .*
+** bl  g
+** ...
+*/
+int test2(int z) {
+  __uint128_t x = 0;
+  int y[0x400];
+  if (z)
+{
+  f(0, 0, 0, 0, 0, 0, 0, &y,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x,
+   x);
+}
+  g();
+  return 1;
+}
-- 
2.25.1



[PATCH 04/19] aarch64: Add bytes_below_saved_regs to frame info

2023-09-12 Thread Richard Sandiford via Gcc-patches
The frame layout code currently hard-codes the assumption that
the number of bytes below the saved registers is equal to the
size of the outgoing arguments.  This patch abstracts that
value into a new field of aarch64_frame.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::bytes_below_saved_regs): New
field.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize it,
and use it instead of crtl->outgoing_args_size.
(aarch64_get_separate_components): Use bytes_below_saved_regs instead
of outgoing_args_size.
(aarch64_process_components): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 71 ++-
 gcc/config/aarch64/aarch64.h  |  5 +++
 2 files changed, 41 insertions(+), 35 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 34d0ccc9a67..49c2fbedd14 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8517,6 +8517,8 @@ aarch64_layout_frame (void)
   gcc_assert (crtl->is_leaf
  || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED));
 
+  frame.bytes_below_saved_regs = crtl->outgoing_args_size;
+
   /* Now assign stack slots for the registers.  Start with the predicate
  registers, since predicate LDR and STR have a relatively small
  offset range.  These saves happen below the hard frame pointer.  */
@@ -8621,18 +8623,18 @@ aarch64_layout_frame (void)
 
   poly_int64 varargs_and_saved_regs_size = offset + frame.saved_varargs_size;
 
-  poly_int64 above_outgoing_args
+  poly_int64 saved_regs_and_above
 = aligned_upper_bound (varargs_and_saved_regs_size
   + get_frame_size (),
   STACK_BOUNDARY / BITS_PER_UNIT);
 
   frame.hard_fp_offset
-= above_outgoing_args - frame.below_hard_fp_saved_regs_size;
+= saved_regs_and_above - frame.below_hard_fp_saved_regs_size;
 
   /* Both these values are already aligned.  */
-  gcc_assert (multiple_p (crtl->outgoing_args_size,
+  gcc_assert (multiple_p (frame.bytes_below_saved_regs,
  STACK_BOUNDARY / BITS_PER_UNIT));
-  frame.frame_size = above_outgoing_args + crtl->outgoing_args_size;
+  frame.frame_size = saved_regs_and_above + frame.bytes_below_saved_regs;
 
   frame.locals_offset = frame.saved_varargs_size;
 
@@ -8676,7 +8678,7 @@ aarch64_layout_frame (void)
   else if (frame.wb_pop_candidate1 != INVALID_REGNUM)
 max_push_offset = 256;
 
-  HOST_WIDE_INT const_size, const_outgoing_args_size, const_fp_offset;
+  HOST_WIDE_INT const_size, const_below_saved_regs, const_fp_offset;
   HOST_WIDE_INT const_saved_regs_size;
   if (known_eq (frame.saved_regs_size, 0))
 frame.initial_adjust = frame.frame_size;
@@ -8684,31 +8686,31 @@ aarch64_layout_frame (void)
   && const_size < max_push_offset
   && known_eq (frame.hard_fp_offset, const_size))
 {
-  /* Simple, small frame with no outgoing arguments:
+  /* Simple, small frame with no data below the saved registers.
 
 stp reg1, reg2, [sp, -frame_size]!
 stp reg3, reg4, [sp, 16]  */
   frame.callee_adjust = const_size;
 }
-  else if (crtl->outgoing_args_size.is_constant (&const_outgoing_args_size)
+  else if (frame.bytes_below_saved_regs.is_constant (&const_below_saved_regs)
   && frame.saved_regs_size.is_constant (&const_saved_regs_size)
-  && const_outgoing_args_size + const_saved_regs_size < 512
-  /* We could handle this case even with outgoing args, provided
- that the number of args left us with valid offsets for all
- predicate and vector save slots.  It's such a rare case that
- it hardly seems worth the effort though.  */
-  && (!saves_below_hard_fp_p || const_outgoing_args_size == 0)
+  && const_below_saved_regs + const_saved_regs_size < 512
+  /* We could handle this case even with data below the saved
+ registers, provided that that data left us with valid offsets
+ for all predicate and vector save slots.  It's such a rare
+ case that it hardly seems worth the effort though.  */
+  && (!saves_below_hard_fp_p || const_below_saved_regs == 0)
   && !(cfun->calls_alloca
&& frame.hard_fp_offset.is_constant (&const_fp_offset)
&& const_fp_offset < max_push_offset))
 {
-  /* Frame with small outgoing arguments:
+  /* Frame with small area below the saved registers:
 
 sub sp, sp, frame_size
-stp reg1, reg2, [sp, outgoing_args_size]
-stp reg3, reg4, [sp, outgoing_args_size + 16]  */
+stp reg1, reg2, [sp, bytes_below_saved_regs]
+stp reg3, reg4, [sp, bytes_below_saved_regs + 16]  */
   frame.initial_adjust = frame.frame_size;
-  frame.callee_offset = const_outgoing_args_size;
+  frame.callee_offset = const_below_saved_regs;
 }
   else if (saves_below_hard_fp_p
   && known_eq 

[PATCH 15/19] aarch64: Put LR save probe in first 16 bytes

2023-09-12 Thread Richard Sandiford via Gcc-patches
-fstack-clash-protection uses the save of LR as a probe for the next
allocation.  The next allocation could be:

* another part of the static frame, e.g. when allocating SVE save slots
  or outgoing arguments

* an alloca in the same function

* an allocation made by a callee function

However, when -fomit-frame-pointer is used, the LR save slot is placed
above the other GPR save slots.  It could therefore be up to 80 bytes
above the base of the GPR save area (which is also the hard fp address).

aarch64_allocate_and_probe_stack_space took this into account when
deciding how much subsequent space could be allocated without needing
a probe.  However, it interacted badly with:

  /* If doing a small final adjustment, we always probe at offset 0.
 This is done to avoid issues when LR is not at position 0 or when
 the final adjustment is smaller than the probing offset.  */
  else if (final_adjustment_p && rounded_size == 0)
residual_probe_offset = 0;

which forces any allocation that is smaller than the guard page size
to be probed at offset 0 rather than the usual offset 1024.  It was
therefore possible to construct cases in which we had:

* a probe using LR at SP + 80 bytes (or some other value >= 16)
* an allocation of the guard page size - 16 bytes
* a probe at SP + 0

which allocates guard page size + 64 consecutive unprobed bytes.

This patch requires the LR probe to be in the first 16 bytes of the
save area when stack clash protection is active.  Doing it
unconditionally would cause code-quality regressions.

Putting LR before other registers prevents push/pop allocation
when shadow call stacks are enabled, since LR is restored
separately from the other callee-saved registers.

The new comment doesn't say that the probe register is required
to be LR, since a later patch removes that restriction.
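
Putting numbers on the example above (G = guard page size; this arithmetic is
mine, not part of the patch):

  probe via the LR save at SP + 80
  allocation of G - 16 bytes, probed at offset 0
  => consecutive unprobed bytes = 80 + (G - 16) = G + 64

With the LR save restricted to the first 16 bytes, the same sequence leaves at
most 16 + (G - 16) = G unprobed bytes, removing the 64-byte overshoot.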

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Ensure that
the LR save slot is in the first 16 bytes of the register save area.
Only form STP/LDP push/pop candidates if both registers are valid.
(aarch64_allocate_and_probe_stack_space): Remove workaround for
when LR was not in the first 16 bytes.

gcc/testsuite/
* gcc.target/aarch64/stack-check-prologue-18.c: New test.
* gcc.target/aarch64/stack-check-prologue-19.c: Likewise.
* gcc.target/aarch64/stack-check-prologue-20.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc |  72 ++---
 .../aarch64/stack-check-prologue-18.c | 100 ++
 .../aarch64/stack-check-prologue-19.c | 100 ++
 .../aarch64/stack-check-prologue-20.c |   3 +
 4 files changed, 233 insertions(+), 42 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-20.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index b942bf3de4a..383b32f2078 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8573,26 +8573,34 @@ aarch64_layout_frame (void)
   bool saves_below_hard_fp_p
 = maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
   frame.bytes_below_hard_fp = offset;
+
+  auto allocate_gpr_slot = [&](unsigned int regno)
+{
+  frame.reg_offset[regno] = offset;
+  if (frame.wb_push_candidate1 == INVALID_REGNUM)
+   frame.wb_push_candidate1 = regno;
+  else if (frame.wb_push_candidate2 == INVALID_REGNUM)
+   frame.wb_push_candidate2 = regno;
+  offset += UNITS_PER_WORD;
+};
+
   if (frame.emit_frame_chain)
 {
   /* FP and LR are placed in the linkage record.  */
-  frame.reg_offset[R29_REGNUM] = offset;
-  frame.wb_push_candidate1 = R29_REGNUM;
-  frame.reg_offset[R30_REGNUM] = offset + UNITS_PER_WORD;
-  frame.wb_push_candidate2 = R30_REGNUM;
-  offset += 2 * UNITS_PER_WORD;
+  allocate_gpr_slot (R29_REGNUM);
+  allocate_gpr_slot (R30_REGNUM);
 }
+  else if (flag_stack_clash_protection
+  && known_eq (frame.reg_offset[R30_REGNUM], SLOT_REQUIRED))
+/* Put the LR save slot first, since it makes a good choice of probe
+   for stack clash purposes.  The idea is that the link register usually
+   has to be saved before a call anyway, and so we lose little by
+   stopping it from being individually shrink-wrapped.  */
+allocate_gpr_slot (R30_REGNUM);
 
   for (regno = R0_REGNUM; regno <= R30_REGNUM; regno++)
 if (known_eq (frame.reg_offset[regno], SLOT_REQUIRED))
-  {
-   frame.reg_offset[regno] = offset;
-   if (frame.wb_push_candidate1 == INVALID_REGNUM)
- frame.wb_push_candidate1 = regno;
-   else if (frame.wb_push_candidate2 == INVALID_REGNUM)
- frame.wb_push_candidate2 = regno;
-   offset += UNITS_PER_WORD;
-  }
+  allocate_gpr_slot 

[PATCH 13/19] aarch64: Minor initial adjustment tweak

2023-09-12 Thread Richard Sandiford via Gcc-patches
This patch just changes a calculation of initial_adjust
to one that makes it slightly more obvious that the total
adjustment is frame.frame_size.

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Tweak
calculation of initial_adjust for frames in which all saves
are SVE saves.
---
 gcc/config/aarch64/aarch64.cc | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9578592d256..e40ccc7d1cf 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8714,11 +8714,10 @@ aarch64_layout_frame (void)
 {
   /* Frame in which all saves are SVE saves:
 
-sub sp, sp, hard_fp_offset + below_hard_fp_saved_regs_size
+sub sp, sp, frame_size - bytes_below_saved_regs
 save SVE registers relative to SP
 sub sp, sp, bytes_below_saved_regs  */
-  frame.initial_adjust = (frame.bytes_above_hard_fp
- + frame.below_hard_fp_saved_regs_size);
+  frame.initial_adjust = frame.frame_size - frame.bytes_below_saved_regs;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
   else if (frame.bytes_above_hard_fp.is_constant (&const_above_fp)
-- 
2.25.1



[PATCH 10/19] aarch64: Tweak frame_size comment

2023-09-12 Thread Richard Sandiford via Gcc-patches
This patch fixes another case in which a value was described with
an “upside-down” view.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::frame_size): Tweak comment.
---
 gcc/config/aarch64/aarch64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 4a4de9c044e..92965eced0a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -800,8 +800,8 @@ struct GTY (()) aarch64_frame
  STACK_BOUNDARY.  */
   poly_int64 bytes_above_hard_fp;
 
-  /* The size of the frame.  This value is the offset from base of the
- frame (incomming SP) to the stack_pointer.  This value is always
+  /* The size of the frame, i.e. the number of bytes between the bottom
+ of the outgoing arguments and the incoming SP.  This value is always
  a multiple of STACK_BOUNDARY.  */
   poly_int64 frame_size;
 
-- 
2.25.1



[PATCH 03/19] aarch64: Explicitly handle frames with no saved registers

2023-09-12 Thread Richard Sandiford via Gcc-patches
If a frame has no saved registers, it can be allocated in one go.
There is no need to treat the areas below and above the saved
registers as separate.

And if we allocate the frame in one go, it should be allocated
as the initial_adjust rather than the final_adjust.  This allows the
frame size to grow to guard_size - guard_used_by_caller before a stack
probe is needed.  (A frame with no register saves is necessarily a
leaf frame.)

This is a no-op as things stand, since a leaf function will have
no outgoing arguments, and so all the frame will be above where
the saved registers normally go.

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Explicitly
allocate the frame in one go if there are no saved registers.
---
 gcc/config/aarch64/aarch64.cc | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9fb94623693..34d0ccc9a67 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8678,9 +8678,11 @@ aarch64_layout_frame (void)
 
   HOST_WIDE_INT const_size, const_outgoing_args_size, const_fp_offset;
   HOST_WIDE_INT const_saved_regs_size;
-  if (frame.frame_size.is_constant (&const_size)
-  && const_size < max_push_offset
-  && known_eq (frame.hard_fp_offset, const_size))
+  if (known_eq (frame.saved_regs_size, 0))
+frame.initial_adjust = frame.frame_size;
+  else if (frame.frame_size.is_constant (&const_size)
+  && const_size < max_push_offset
+  && known_eq (frame.hard_fp_offset, const_size))
 {
   /* Simple, small frame with no outgoing arguments:
 
-- 
2.25.1



[PATCH 11/19] aarch64: Measure reg_offset from the bottom of the frame

2023-09-12 Thread Richard Sandiford via Gcc-patches
reg_offset was measured from the bottom of the saved register area.
This made perfect sense with the original layout, since the bottom
of the saved register area was also the hard frame pointer address.
It became slightly less obvious with SVE, since we save SVE
registers below the hard frame pointer, but it still made sense.

However, if we want to allow different frame layouts, it's more
convenient and obvious to measure reg_offset from the bottom of
the frame.  After previous patches, it's also a slight simplification
in its own right.
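
Written as a worked equation (my restatement of the change, using the field
added by an earlier patch in the series):

  reg_offset[regno] (new) = reg_offset[regno] (old) + bytes_below_saved_regs

so an SP-relative save/restore offset becomes

  reg_offset[regno] - bytes_below_sp

instead of

  reg_offset[regno] + bytes_below_saved_regs - bytes_below_sp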

gcc/
* config/aarch64/aarch64.h (aarch64_frame): Add comment above
reg_offset.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Walk offsets
from the bottom of the frame, rather than the bottom of the saved
register area.  Measure reg_offset from the bottom of the frame
rather than the bottom of the saved register area.
(aarch64_save_callee_saves): Update accordingly.
(aarch64_restore_callee_saves): Likewise.
(aarch64_get_separate_components): Likewise.
(aarch64_process_components): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 53 ---
 gcc/config/aarch64/aarch64.h  |  3 ++
 2 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 7d642d06871..ca2e6af5d12 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8439,7 +8439,6 @@ aarch64_needs_frame_chain (void)
 static void
 aarch64_layout_frame (void)
 {
-  poly_int64 offset = 0;
   int regno, last_fp_reg = INVALID_REGNUM;
   machine_mode vector_save_mode = aarch64_reg_save_mode (V8_REGNUM);
   poly_int64 vector_save_size = GET_MODE_SIZE (vector_save_mode);
@@ -8517,7 +8516,9 @@ aarch64_layout_frame (void)
   gcc_assert (crtl->is_leaf
  || maybe_ne (frame.reg_offset[R30_REGNUM], SLOT_NOT_REQUIRED));
 
-  frame.bytes_below_saved_regs = crtl->outgoing_args_size;
+  poly_int64 offset = crtl->outgoing_args_size;
+  gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
+  frame.bytes_below_saved_regs = offset;
 
   /* Now assign stack slots for the registers.  Start with the predicate
  registers, since predicate LDR and STR have a relatively small
@@ -8529,7 +8530,8 @@ aarch64_layout_frame (void)
offset += BYTES_PER_SVE_PRED;
   }
 
-  if (maybe_ne (offset, 0))
+  poly_int64 saved_prs_size = offset - frame.bytes_below_saved_regs;
+  if (maybe_ne (saved_prs_size, 0))
 {
   /* If we have any vector registers to save above the predicate registers,
 the offset of the vector register save slots need to be a multiple
@@ -8547,10 +8549,10 @@ aarch64_layout_frame (void)
offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
   else
{
- if (known_le (offset, vector_save_size))
-   offset = vector_save_size;
- else if (known_le (offset, vector_save_size * 2))
-   offset = vector_save_size * 2;
+ if (known_le (saved_prs_size, vector_save_size))
+   offset = frame.bytes_below_saved_regs + vector_save_size;
+ else if (known_le (saved_prs_size, vector_save_size * 2))
+   offset = frame.bytes_below_saved_regs + vector_save_size * 2;
  else
gcc_unreachable ();
}
@@ -8567,9 +8569,10 @@ aarch64_layout_frame (void)
 
   /* OFFSET is now the offset of the hard frame pointer from the bottom
  of the callee save area.  */
-  bool saves_below_hard_fp_p = maybe_ne (offset, 0);
-  frame.below_hard_fp_saved_regs_size = offset;
-  frame.bytes_below_hard_fp = offset + frame.bytes_below_saved_regs;
+  frame.below_hard_fp_saved_regs_size = offset - frame.bytes_below_saved_regs;
+  bool saves_below_hard_fp_p
+= maybe_ne (frame.below_hard_fp_saved_regs_size, 0);
+  frame.bytes_below_hard_fp = offset;
   if (frame.emit_frame_chain)
 {
   /* FP and LR are placed in the linkage record.  */
@@ -8620,9 +8623,10 @@ aarch64_layout_frame (void)
 
   offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
 
-  frame.saved_regs_size = offset;
+  frame.saved_regs_size = offset - frame.bytes_below_saved_regs;
 
-  poly_int64 varargs_and_saved_regs_size = offset + frame.saved_varargs_size;
+  poly_int64 varargs_and_saved_regs_size
+= frame.saved_regs_size + frame.saved_varargs_size;
 
   poly_int64 saved_regs_and_above
 = aligned_upper_bound (varargs_and_saved_regs_size
@@ -9144,9 +9148,7 @@ aarch64_save_callee_saves (poly_int64 bytes_below_sp,
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = (frame.reg_offset[regno]
-   + frame.bytes_below_saved_regs
-   - bytes_below_sp);
+  offset = frame.reg_offset[regno] - bytes_below_sp;
   rtx base_rtx = stack_pointer_rtx;
   poly_int64 sp_offset = offset;
 
@@ -9253,9 +9255,7 @@ 

[PATCH 09/19] aarch64: Rename hard_fp_offset to bytes_above_hard_fp

2023-09-12 Thread Richard Sandiford via Gcc-patches
Similarly to the previous locals_offset patch, hard_fp_offset
was described as:

  /* Offset from the base of the frame (incomming SP) to the
 hard_frame_pointer.  This value is always a multiple of
 STACK_BOUNDARY.  */
  poly_int64 hard_fp_offset;

which again took an “upside-down” view: higher offsets meant lower
addresses.  This patch renames the field to bytes_above_hard_fp instead.

gcc/
* config/aarch64/aarch64.h (aarch64_frame::hard_fp_offset): Rename
to...
(aarch64_frame::bytes_above_hard_fp): ...this.
* config/aarch64/aarch64.cc (aarch64_layout_frame)
(aarch64_expand_prologue): Update accordingly.
(aarch64_initial_elimination_offset): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 26 +-
 gcc/config/aarch64/aarch64.h  |  6 +++---
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index bcd1dec6f51..7d642d06871 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8629,7 +8629,7 @@ aarch64_layout_frame (void)
   + get_frame_size (),
   STACK_BOUNDARY / BITS_PER_UNIT);
 
-  frame.hard_fp_offset
+  frame.bytes_above_hard_fp
 = saved_regs_and_above - frame.below_hard_fp_saved_regs_size;
 
   /* Both these values are already aligned.  */
@@ -8678,13 +8678,13 @@ aarch64_layout_frame (void)
   else if (frame.wb_pop_candidate1 != INVALID_REGNUM)
 max_push_offset = 256;
 
-  HOST_WIDE_INT const_size, const_below_saved_regs, const_fp_offset;
+  HOST_WIDE_INT const_size, const_below_saved_regs, const_above_fp;
   HOST_WIDE_INT const_saved_regs_size;
   if (known_eq (frame.saved_regs_size, 0))
 frame.initial_adjust = frame.frame_size;
   else if (frame.frame_size.is_constant (&const_size)
   && const_size < max_push_offset
-  && known_eq (frame.hard_fp_offset, const_size))
+  && known_eq (frame.bytes_above_hard_fp, const_size))
 {
   /* Simple, small frame with no data below the saved registers.
 
@@ -8701,8 +8701,8 @@ aarch64_layout_frame (void)
  case that it hardly seems worth the effort though.  */
   && (!saves_below_hard_fp_p || const_below_saved_regs == 0)
   && !(cfun->calls_alloca
-   && frame.hard_fp_offset.is_constant (&const_fp_offset)
-   && const_fp_offset < max_push_offset))
+   && frame.bytes_above_hard_fp.is_constant (&const_above_fp)
+   && const_above_fp < max_push_offset))
 {
   /* Frame with small area below the saved registers:
 
@@ -8720,12 +8720,12 @@ aarch64_layout_frame (void)
 sub sp, sp, hard_fp_offset + below_hard_fp_saved_regs_size
 save SVE registers relative to SP
 sub sp, sp, bytes_below_saved_regs  */
-  frame.initial_adjust = (frame.hard_fp_offset
+  frame.initial_adjust = (frame.bytes_above_hard_fp
  + frame.below_hard_fp_saved_regs_size);
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
-  else if (frame.hard_fp_offset.is_constant (&const_fp_offset)
-  && const_fp_offset < max_push_offset)
+  else if (frame.bytes_above_hard_fp.is_constant (&const_above_fp)
+  && const_above_fp < max_push_offset)
 {
   /* Frame with large area below the saved registers, or with SVE saves,
 but with a small area above:
@@ -8735,7 +8735,7 @@ aarch64_layout_frame (void)
 [sub sp, sp, below_hard_fp_saved_regs_size]
 [save SVE registers relative to SP]
 sub sp, sp, bytes_below_saved_regs  */
-  frame.callee_adjust = const_fp_offset;
+  frame.callee_adjust = const_above_fp;
   frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
@@ -8750,7 +8750,7 @@ aarch64_layout_frame (void)
 [sub sp, sp, below_hard_fp_saved_regs_size]
 [save SVE registers relative to SP]
 sub sp, sp, bytes_below_saved_regs  */
-  frame.initial_adjust = frame.hard_fp_offset;
+  frame.initial_adjust = frame.bytes_above_hard_fp;
   frame.sve_callee_adjust = frame.below_hard_fp_saved_regs_size;
   frame.final_adjust = frame.bytes_below_saved_regs;
 }
@@ -10118,7 +10118,7 @@ aarch64_expand_prologue (void)
 {
   /* The offset of the frame chain record (if any) from the current SP.  */
   poly_int64 chain_offset = (initial_adjust + callee_adjust
-- frame.hard_fp_offset);
+- frame.bytes_above_hard_fp);
   gcc_assert (known_ge (chain_offset, 0));
 
   if (callee_adjust == 0)
@@ -12851,10 +12851,10 @@ aarch64_initial_elimination_offset (unsigned from, 
unsigned to)
   if (to == HARD_FRAME_POINTER_REGNUM)
 {
   if (from == ARG_POINTER_REGNUM)
-   return frame.hard_fp_offset;
+   return frame.bytes_above_hard_fp;
 
   if (from == FRAME_POINTER_REGNUM)

[PATCH 06/19] aarch64: Tweak aarch64_save/restore_callee_saves

2023-09-12 Thread Richard Sandiford via Gcc-patches
aarch64_save_callee_saves and aarch64_restore_callee_saves took
a parameter called start_offset that gives the offset of the
bottom of the saved register area from the current stack pointer.
However, it's more convenient for later patches if we use the
bottom of the entire frame as the reference point, rather than
the bottom of the saved registers.

Doing that removes the need for the callee_offset field.
Other than that, this is not a win on its own.  It only really
makes sense in combination with the follow-on patches.
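
As a worked equation (my restatement): the old and new parameters are related
by

  start_offset = bytes_below_saved_regs - bytes_below_sp

so the per-register offset

  start_offset + reg_offset[regno]

becomes

  reg_offset[regno] + bytes_below_saved_regs - bytes_below_sp

which is what the hunks below compute.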

gcc/
* config/aarch64/aarch64.h (aarch64_frame::callee_offset): Delete.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Remove
callee_offset handling.
(aarch64_save_callee_saves): Replace the start_offset parameter
with a bytes_below_sp parameter.
(aarch64_restore_callee_saves): Likewise.
(aarch64_expand_prologue): Update accordingly.
(aarch64_expand_epilogue): Likewise.
---
 gcc/config/aarch64/aarch64.cc | 56 +--
 gcc/config/aarch64/aarch64.h  |  4 ---
 2 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 58dd8946232..2c218c90906 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8643,7 +8643,6 @@ aarch64_layout_frame (void)
   frame.final_adjust = 0;
   frame.callee_adjust = 0;
   frame.sve_callee_adjust = 0;
-  frame.callee_offset = 0;
 
   frame.wb_pop_candidate1 = frame.wb_push_candidate1;
   frame.wb_pop_candidate2 = frame.wb_push_candidate2;
@@ -8711,7 +8710,6 @@ aarch64_layout_frame (void)
 stp reg1, reg2, [sp, bytes_below_saved_regs]
 stp reg3, reg4, [sp, bytes_below_saved_regs + 16]  */
   frame.initial_adjust = frame.frame_size;
-  frame.callee_offset = const_below_saved_regs;
 }
   else if (saves_below_hard_fp_p
   && known_eq (frame.saved_regs_size,
@@ -9112,12 +9110,13 @@ aarch64_add_cfa_expression (rtx_insn *insn, rtx reg,
 }
 
 /* Emit code to save the callee-saved registers from register number START
-   to LIMIT to the stack at the location starting at offset START_OFFSET,
-   skipping any write-back candidates if SKIP_WB is true.  HARD_FP_VALID_P
-   is true if the hard frame pointer has been set up.  */
+   to LIMIT to the stack.  The stack pointer is currently BYTES_BELOW_SP
+   bytes above the bottom of the static frame.  Skip any write-back
+   candidates if SKIP_WB is true.  HARD_FP_VALID_P is true if the hard
+   frame pointer has been set up.  */
 
 static void
-aarch64_save_callee_saves (poly_int64 start_offset,
+aarch64_save_callee_saves (poly_int64 bytes_below_sp,
   unsigned start, unsigned limit, bool skip_wb,
   bool hard_fp_valid_p)
 {
@@ -9145,7 +9144,9 @@ aarch64_save_callee_saves (poly_int64 start_offset,
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = start_offset + frame.reg_offset[regno];
+  offset = (frame.reg_offset[regno]
+   + frame.bytes_below_saved_regs
+   - bytes_below_sp);
   rtx base_rtx = stack_pointer_rtx;
   poly_int64 sp_offset = offset;
 
@@ -9156,9 +9157,7 @@ aarch64_save_callee_saves (poly_int64 start_offset,
   else if (GP_REGNUM_P (regno)
   && (!offset.is_constant (_offset) || const_offset >= 512))
{
- gcc_assert (known_eq (start_offset, 0));
- poly_int64 fp_offset
-   = frame.below_hard_fp_saved_regs_size;
+ poly_int64 fp_offset = frame.bytes_below_hard_fp - bytes_below_sp;
  if (hard_fp_valid_p)
base_rtx = hard_frame_pointer_rtx;
  else
@@ -9222,12 +9221,13 @@ aarch64_save_callee_saves (poly_int64 start_offset,
 }
 
 /* Emit code to restore the callee registers from register number START
-   up to and including LIMIT.  Restore from the stack offset START_OFFSET,
-   skipping any write-back candidates if SKIP_WB is true.  Write the
-   appropriate REG_CFA_RESTORE notes into CFI_OPS.  */
+   up to and including LIMIT.  The stack pointer is currently BYTES_BELOW_SP
+   bytes above the bottom of the static frame.  Skip any write-back
+   candidates if SKIP_WB is true.  Write the appropriate REG_CFA_RESTORE
+   notes into CFI_OPS.  */
 
 static void
-aarch64_restore_callee_saves (poly_int64 start_offset, unsigned start,
+aarch64_restore_callee_saves (poly_int64 bytes_below_sp, unsigned start,
  unsigned limit, bool skip_wb, rtx *cfi_ops)
 {
   aarch64_frame &frame = cfun->machine->frame;
@@ -9253,7 +9253,9 @@ aarch64_restore_callee_saves (poly_int64 start_offset, 
unsigned start,
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = start_offset + frame.reg_offset[regno];
+  offset = (frame.reg_offset[regno]
+   + frame.bytes_below_saved_regs
+   - 

[PATCH 02/19] aarch64: Avoid a use of callee_offset

2023-09-12 Thread Richard Sandiford via Gcc-patches
When we emit the frame chain, i.e. when we reach Here in this statement
of aarch64_expand_prologue:

  if (emit_frame_chain)
{
  // Here
  ...
}

the stack is in one of two states:

- We've allocated up to the frame chain, but no more.

- We've allocated the whole frame, and the frame chain is within easy
  reach of the new SP.

The offset of the frame chain from the current SP is available
in aarch64_frame as callee_offset.  It is also available as the
chain_offset local variable, where the latter is calculated from other
data.  (However, chain_offset is not always equal to callee_offset when
!emit_frame_chain, so chain_offset isn't redundant.)

In c600df9a4060da3c6121ff4d0b93f179eafd69d1 I switched to using
chain_offset for the initialisation of the hard frame pointer:

   aarch64_add_offset (Pmode, hard_frame_pointer_rtx,
- stack_pointer_rtx, callee_offset,
+ stack_pointer_rtx, chain_offset,
  tmp1_rtx, tmp0_rtx, frame_pointer_needed);

But the later REG_CFA_ADJUST_CFA handling still used callee_offset.

I think the difference is harmless, but it's more logical for the
CFA note to be in sync, and it's more convenient for later patches
if it uses chain_offset.
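
For reference (taken from the surrounding prologue code rather than from this
patch), chain_offset is computed as

  chain_offset = initial_adjust + callee_adjust - frame.hard_fp_offset

and, when emit_frame_chain is true, this evaluates to the same value as
callee_offset.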

gcc/
* config/aarch64/aarch64.cc (aarch64_expand_prologue): Use
chain_offset rather than callee_offset.
---
 gcc/config/aarch64/aarch64.cc | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index b91f77d7b1f..9fb94623693 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -10034,7 +10034,6 @@ aarch64_expand_prologue (void)
   poly_int64 initial_adjust = frame.initial_adjust;
   HOST_WIDE_INT callee_adjust = frame.callee_adjust;
   poly_int64 final_adjust = frame.final_adjust;
-  poly_int64 callee_offset = frame.callee_offset;
   poly_int64 sve_callee_adjust = frame.sve_callee_adjust;
   poly_int64 below_hard_fp_saved_regs_size
 = frame.below_hard_fp_saved_regs_size;
@@ -10147,8 +10146,7 @@ aarch64_expand_prologue (void)
 implicit.  */
  if (!find_reg_note (insn, REG_CFA_ADJUST_CFA, NULL_RTX))
{
- rtx src = plus_constant (Pmode, stack_pointer_rtx,
-  callee_offset);
+ rtx src = plus_constant (Pmode, stack_pointer_rtx, chain_offset);
  add_reg_note (insn, REG_CFA_ADJUST_CFA,
gen_rtx_SET (hard_frame_pointer_rtx, src));
}
-- 
2.25.1



[PATCH 12/19] aarch64: Simplify top of frame allocation

2023-09-12 Thread Richard Sandiford via Gcc-patches
After previous patches, it no longer really makes sense to allocate
the top of the frame in terms of varargs_and_saved_regs_size and
saved_regs_and_above.

gcc/
* config/aarch64/aarch64.cc (aarch64_layout_frame): Simplify
the allocation of the top of the frame.
---
 gcc/config/aarch64/aarch64.cc | 23 ---
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ca2e6af5d12..9578592d256 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8625,23 +8625,16 @@ aarch64_layout_frame (void)
 
   frame.saved_regs_size = offset - frame.bytes_below_saved_regs;
 
-  poly_int64 varargs_and_saved_regs_size
-= frame.saved_regs_size + frame.saved_varargs_size;
-
-  poly_int64 saved_regs_and_above
-= aligned_upper_bound (varargs_and_saved_regs_size
-  + get_frame_size (),
-  STACK_BOUNDARY / BITS_PER_UNIT);
-
-  frame.bytes_above_hard_fp
-= saved_regs_and_above - frame.below_hard_fp_saved_regs_size;
+  offset += get_frame_size ();
+  offset = aligned_upper_bound (offset, STACK_BOUNDARY / BITS_PER_UNIT);
+  auto top_of_locals = offset;
 
-  /* Both these values are already aligned.  */
-  gcc_assert (multiple_p (frame.bytes_below_saved_regs,
- STACK_BOUNDARY / BITS_PER_UNIT));
-  frame.frame_size = saved_regs_and_above + frame.bytes_below_saved_regs;
+  offset += frame.saved_varargs_size;
+  gcc_assert (multiple_p (offset, STACK_BOUNDARY / BITS_PER_UNIT));
+  frame.frame_size = offset;
 
-  frame.bytes_above_locals = frame.saved_varargs_size;
+  frame.bytes_above_hard_fp = frame.frame_size - frame.bytes_below_hard_fp;
+  frame.bytes_above_locals = frame.frame_size - top_of_locals;
 
   frame.initial_adjust = 0;
   frame.final_adjust = 0;
-- 
2.25.1



[PATCH 07/19] aarch64: Only calculate chain_offset if there is a chain

2023-09-12 Thread Richard Sandiford via Gcc-patches
After previous patches, it is no longer necessary to calculate
a chain_offset in cases where there is no chain record.

gcc/
* config/aarch64/aarch64.cc (aarch64_expand_prologue): Move the
calculation of chain_offset into the emit_frame_chain block.
---
 gcc/config/aarch64/aarch64.cc | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 2c218c90906..25b5fb243a6 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -10111,16 +10111,16 @@ aarch64_expand_prologue (void)
   if (callee_adjust != 0)
 aarch64_push_regs (reg1, reg2, callee_adjust);
 
-  /* The offset of the frame chain record (if any) from the current SP.  */
-  poly_int64 chain_offset = (initial_adjust + callee_adjust
-- frame.hard_fp_offset);
-  gcc_assert (known_ge (chain_offset, 0));
-
   /* The offset of the current SP from the bottom of the static frame.  */
   poly_int64 bytes_below_sp = frame_size - initial_adjust - callee_adjust;
 
   if (emit_frame_chain)
 {
+  /* The offset of the frame chain record (if any) from the current SP.  */
+  poly_int64 chain_offset = (initial_adjust + callee_adjust
+- frame.hard_fp_offset);
+  gcc_assert (known_ge (chain_offset, 0));
+
   if (callee_adjust == 0)
{
  reg1 = R29_REGNUM;
-- 
2.25.1



[PATCH 05/19] aarch64: Add bytes_below_hard_fp to frame info

2023-09-12 Thread Richard Sandiford via Gcc-patches
Following on from the previous bytes_below_saved_regs patch, this one
records the number of bytes that are below the hard frame pointer.
This eventually replaces below_hard_fp_saved_regs_size.

If a frame pointer is not needed, the epilogue adds final_adjust
to the stack pointer before restoring registers:

 aarch64_add_sp (tmp1_rtx, tmp0_rtx, final_adjust, true);

Therefore, if the epilogue needs to restore the stack pointer from
the hard frame pointer, the directly corresponding offset is:

 -bytes_below_hard_fp + final_adjust

i.e. go from the hard frame pointer to the bottom of the frame,
then add the same amount as if we were using the stack pointer
from the outset.
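
With hypothetical numbers (mine, purely to illustrate the sign convention):
if bytes_below_hard_fp = 48 and final_adjust = 32, the epilogue sets

  SP = hard FP - 48 + 32 = hard FP - 16

i.e. 16 bytes below the hard frame pointer, which is the same position a
non-frame-pointer epilogue would reach after adding final_adjust to an SP
that started at the bottom of the frame.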

gcc/
* config/aarch64/aarch64.h (aarch64_frame::bytes_below_hard_fp): New
field.
* config/aarch64/aarch64.cc (aarch64_layout_frame): Initialize it.
(aarch64_expand_epilogue): Use it instead of
below_hard_fp_saved_regs_size.
---
 gcc/config/aarch64/aarch64.cc | 6 +++---
 gcc/config/aarch64/aarch64.h  | 5 +
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 49c2fbedd14..58dd8946232 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8569,6 +8569,7 @@ aarch64_layout_frame (void)
  of the callee save area.  */
   bool saves_below_hard_fp_p = maybe_ne (offset, 0);
   frame.below_hard_fp_saved_regs_size = offset;
+  frame.bytes_below_hard_fp = offset + frame.bytes_below_saved_regs;
   if (frame.emit_frame_chain)
 {
   /* FP and LR are placed in the linkage record.  */
@@ -10220,8 +10221,7 @@ aarch64_expand_epilogue (bool for_sibcall)
   poly_int64 final_adjust = frame.final_adjust;
   poly_int64 callee_offset = frame.callee_offset;
   poly_int64 sve_callee_adjust = frame.sve_callee_adjust;
-  poly_int64 below_hard_fp_saved_regs_size
-= frame.below_hard_fp_saved_regs_size;
+  poly_int64 bytes_below_hard_fp = frame.bytes_below_hard_fp;
   unsigned reg1 = frame.wb_pop_candidate1;
   unsigned reg2 = frame.wb_pop_candidate2;
   unsigned int last_gpr = (frame.is_scs_enabled
@@ -10279,7 +10279,7 @@ aarch64_expand_epilogue (bool for_sibcall)
is restored on the instruction doing the writeback.  */
 aarch64_add_offset (Pmode, stack_pointer_rtx,
hard_frame_pointer_rtx,
-   -callee_offset - below_hard_fp_saved_regs_size,
+   -bytes_below_hard_fp + final_adjust,
tmp1_rtx, tmp0_rtx, callee_adjust == 0);
   else
  /* The case where we need to re-use the register here is very rare, so
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 83939991eb1..75fd3b59b0d 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -785,6 +785,11 @@ struct GTY (()) aarch64_frame
  are saved below the hard frame pointer.  */
   poly_int64 below_hard_fp_saved_regs_size;
 
+  /* The number of bytes between the bottom of the static frame (the bottom
+ of the outgoing arguments) and the hard frame pointer.  This value is
+ always a multiple of STACK_BOUNDARY.  */
+  poly_int64 bytes_below_hard_fp;
+
   /* Offset from the base of the frame (incomming SP) to the
  top of the locals area.  This value is always a multiple of
  STACK_BOUNDARY.  */
-- 
2.25.1



[PATCH 01/19] aarch64: Use local frame vars in shrink-wrapping code

2023-09-12 Thread Richard Sandiford via Gcc-patches
aarch64_layout_frame uses a shorthand for referring to
cfun->machine->frame:

  aarch64_frame &frame = cfun->machine->frame;

This patch does the same for some other heavy users of the structure.
No functional change intended.

gcc/
* config/aarch64/aarch64.cc (aarch64_save_callee_saves): Use
a local shorthand for cfun->machine->frame.
(aarch64_restore_callee_saves, aarch64_get_separate_components):
(aarch64_process_components): Likewise.
(aarch64_allocate_and_probe_stack_space): Likewise.
(aarch64_expand_prologue, aarch64_expand_epilogue): Likewise.
(aarch64_layout_frame): Use existing shorthand for one more case.
---
 gcc/config/aarch64/aarch64.cc | 123 ++
 1 file changed, 64 insertions(+), 59 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 37d414021ca..b91f77d7b1f 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8651,7 +8651,7 @@ aarch64_layout_frame (void)
   frame.is_scs_enabled
 = (!crtl->calls_eh_return
&& sanitize_flags_p (SANITIZE_SHADOW_CALL_STACK)
-   && known_ge (cfun->machine->frame.reg_offset[LR_REGNUM], 0));
+   && known_ge (frame.reg_offset[LR_REGNUM], 0));
 
   /* When shadow call stack is enabled, the scs_pop in the epilogue will
  restore x30, and we don't need to pop x30 again in the traditional
@@ -9117,6 +9117,7 @@ aarch64_save_callee_saves (poly_int64 start_offset,
   unsigned start, unsigned limit, bool skip_wb,
   bool hard_fp_valid_p)
 {
+  aarch64_frame &frame = cfun->machine->frame;
   rtx_insn *insn;
   unsigned regno;
   unsigned regno2;
@@ -9131,8 +9132,8 @@ aarch64_save_callee_saves (poly_int64 start_offset,
   bool frame_related_p = aarch64_emit_cfi_for_reg_p (regno);
 
   if (skip_wb
- && (regno == cfun->machine->frame.wb_push_candidate1
- || regno == cfun->machine->frame.wb_push_candidate2))
+ && (regno == frame.wb_push_candidate1
+ || regno == frame.wb_push_candidate2))
continue;
 
   if (cfun->machine->reg_is_wrapped_separately[regno])
@@ -9140,7 +9141,7 @@ aarch64_save_callee_saves (poly_int64 start_offset,
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = start_offset + cfun->machine->frame.reg_offset[regno];
+  offset = start_offset + frame.reg_offset[regno];
   rtx base_rtx = stack_pointer_rtx;
   poly_int64 sp_offset = offset;
 
@@ -9153,7 +9154,7 @@ aarch64_save_callee_saves (poly_int64 start_offset,
{
  gcc_assert (known_eq (start_offset, 0));
  poly_int64 fp_offset
-   = cfun->machine->frame.below_hard_fp_saved_regs_size;
+   = frame.below_hard_fp_saved_regs_size;
  if (hard_fp_valid_p)
base_rtx = hard_frame_pointer_rtx;
  else
@@ -9175,8 +9176,7 @@ aarch64_save_callee_saves (poly_int64 start_offset,
  && (regno2 = aarch64_next_callee_save (regno + 1, limit)) <= limit
  && !cfun->machine->reg_is_wrapped_separately[regno2]
  && known_eq (GET_MODE_SIZE (mode),
-  cfun->machine->frame.reg_offset[regno2]
-  - cfun->machine->frame.reg_offset[regno]))
+  frame.reg_offset[regno2] - frame.reg_offset[regno]))
{
  rtx reg2 = gen_rtx_REG (mode, regno2);
  rtx mem2;
@@ -9226,6 +9226,7 @@ static void
 aarch64_restore_callee_saves (poly_int64 start_offset, unsigned start,
  unsigned limit, bool skip_wb, rtx *cfi_ops)
 {
+  aarch64_frame &frame = cfun->machine->frame;
   unsigned regno;
   unsigned regno2;
   poly_int64 offset;
@@ -9242,13 +9243,13 @@ aarch64_restore_callee_saves (poly_int64 start_offset, 
unsigned start,
   rtx reg, mem;
 
   if (skip_wb
- && (regno == cfun->machine->frame.wb_pop_candidate1
- || regno == cfun->machine->frame.wb_pop_candidate2))
+ && (regno == frame.wb_pop_candidate1
+ || regno == frame.wb_pop_candidate2))
continue;
 
   machine_mode mode = aarch64_reg_save_mode (regno);
   reg = gen_rtx_REG (mode, regno);
-  offset = start_offset + cfun->machine->frame.reg_offset[regno];
+  offset = start_offset + frame.reg_offset[regno];
   rtx base_rtx = stack_pointer_rtx;
   if (mode == VNx2DImode && BYTES_BIG_ENDIAN)
aarch64_adjust_sve_callee_save_base (mode, base_rtx, anchor_reg,
@@ -9259,8 +9260,7 @@ aarch64_restore_callee_saves (poly_int64 start_offset, 
unsigned start,
  && (regno2 = aarch64_next_callee_save (regno + 1, limit)) <= limit
  && !cfun->machine->reg_is_wrapped_separately[regno2]
  && known_eq (GET_MODE_SIZE (mode),
-  cfun->machine->frame.reg_offset[regno2]
-  - cfun->machine->frame.reg_offset[regno]))
+  

[PATCH 00/19] aarch64: Fix -fstack-protector issue

2023-09-12 Thread Richard Sandiford via Gcc-patches
This series of patches fixes deficiencies in GCC's -fstack-protector
implementation for AArch64 when using dynamically allocated stack space.
This is CVE-2023-4039.  See:

https://developer.arm.com/Arm%20Security%20Center/GCC%20Stack%20Protector%20Vulnerability%20AArch64
https://github.com/metaredteam/external-disclosures/security/advisories/GHSA-x7ch-h5rf-w2mf

for more details.

The fix is to put the saved registers above the locals area when
-fstack-protector is used.

The series also fixes a stack-clash problem that I found while working
on the CVE.  In unpatched sources, the stack-clash problem would only
trigger for unrealistic numbers of arguments (8K 64-bit arguments, or an
equivalent).  But it would be a more significant issue with the new
-fstack-protector frame layout.  It's therefore important that both
problems are fixed together.

Some reorganisation of the code seemed necessary to fix the problems in a
cleanish way.  The series is therefore quite long, but only a handful of
patches should have any effect on code generation.

See the individual patches for a detailed description.

Tested on aarch64-linux-gnu. Pushed to trunk and to all active branches.
I've also pushed backports to GCC 7+ to vendors/ARM/heads/CVE-2023-4039.

Richard Sandiford (19):
  aarch64: Use local frame vars in shrink-wrapping code
  aarch64: Avoid a use of callee_offset
  aarch64: Explicitly handle frames with no saved registers
  aarch64: Add bytes_below_saved_regs to frame info
  aarch64: Add bytes_below_hard_fp to frame info
  aarch64: Tweak aarch64_save/restore_callee_saves
  aarch64: Only calculate chain_offset if there is a chain
  aarch64: Rename locals_offset to bytes_above_locals
  aarch64: Rename hard_fp_offset to bytes_above_hard_fp
  aarch64: Tweak frame_size comment
  aarch64: Measure reg_offset from the bottom of the frame
  aarch64: Simplify top of frame allocation
  aarch64: Minor initial adjustment tweak
  aarch64: Tweak stack clash boundary condition
  aarch64: Put LR save probe in first 16 bytes
  aarch64: Simplify probe of final frame allocation
  aarch64: Explicitly record probe registers in frame info
  aarch64: Remove below_hard_fp_saved_regs_size
  aarch64: Make stack smash canary protect saved registers

 gcc/config/aarch64/aarch64.cc | 518 ++
 gcc/config/aarch64/aarch64.h  |  44 +-
 .../aarch64/stack-check-prologue-17.c |  55 ++
 .../aarch64/stack-check-prologue-18.c | 100 
 .../aarch64/stack-check-prologue-19.c | 100 
 .../aarch64/stack-check-prologue-20.c |   3 +
 .../gcc.target/aarch64/stack-protector-8.c|  95 
 .../gcc.target/aarch64/stack-protector-9.c|  33 ++
 .../aarch64/sve/pcs/stack_clash_3.c   |   6 +-
 9 files changed, 699 insertions(+), 255 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-17.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-18.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-19.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-check-prologue-20.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/stack-protector-9.c

-- 
2.25.1



Re: [PATCH] pretty-print: Fix up pp_wide_int [PR111329]

2023-09-11 Thread Richard Sandiford via Gcc-patches
Jakub Jelinek  writes:
> Hi!
>
> The recent pp_wide_int changes for _BitInt support (because not all
> wide_ints fit into the small fixed size digit_buffer anymore) apparently
> broke
> +FAIL: gcc.dg/analyzer/out-of-bounds-diagram-1-debug.c (test for excess 
> errors)
> +FAIL: gcc.dg/analyzer/out-of-bounds-diagram-1-debug.c 2 blank line(s) in 
> output
> +FAIL: gcc.dg/analyzer/out-of-bounds-diagram-1-debug.c expected multiline 
> pattern lines 17-39
> (and I couldn't reproduce that in bisect seed (which is -O0 compiled) and
> thought it would be some analyzer diagnostic bug).
>
> The problem is that analyzer uses pp_wide_int with a function call in the
> second argument.  Previously, when pp_wide_int macro just did
>   print_dec (W, pp_buffer (PP)->digit_buffer, SGN);
>   pp_string (PP, pp_buffer (PP)->digit_buffer);
> it worked, because the const wide_int_ref & first argument to print_dec
> bound to a temporary, which was only destructed at the end of the full
> statement after print_dec was called.
> But with my changes where I need to first compare the precision of the
> const wide_int_ref & to decide whether to use digit_buffer or XALLOCAVEC
> something larger, this means that pp_wide_int_ref binds to a temporary
> which is destroyed at the end of full statement which is the
>   const wide_int_ref &pp_wide_int_ref = (W);
> declaration, so then invokes UB accessing a destructed temporary.
>
> The following patch fixes it by rewriting pp_wide_int into an inline
> function, so that the end of the full statement is the end of the inline
> function call.  As functions using alloca aren't normally inlined, I've
> also split that part into a separate out of line function.  Putting that
> into pretty-print.cc didn't work, e.g. the gm2 binary doesn't link,
> because pretty-print.o is in libcommon.a, but wide-int-print.o which
> defines print_dec is not.  So I've put that out of line function into
> wide-int-print.cc instead.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> 2023-09-08  Jakub Jelinek  
>
>   PR middle-end/111329
>   * pretty-print.h (pp_wide_int): Rewrite from macro into inline
>   function.  For printing values which don't fit into digit_buffer
>   use out-of-line function.
>   * wide-int-print.h (pp_wide_int_large): Declare.
>   * wide-int-print.cc: Include pretty-print.h.
>   (pp_wide_int_large): Define.

OK, thanks.

Richard
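
(For anyone skimming the thread: the lifetime issue being fixed is the usual
"reference-like temporary outlives the object it points into" trap.  A minimal
standalone illustration — not GCC code, with std::string_view standing in for
wide_int_ref:)

  #include <cstdio>
  #include <string>
  #include <string_view>

  static std::string make () { return "hello"; }

  int main ()
  {
    /* Fine: the std::string temporary lives until the end of the full
       statement, so the view built from it is still valid here.  */
    std::printf ("%zu\n", std::string_view (make ()).size ());

    /* Broken: the view itself is lifetime-extended by the reference, but
       the std::string temporary it points into is destroyed at the end of
       this declaration statement.  */
    const std::string_view &sv = make ();
    (void) sv;
    /* std::printf ("%zu\n", sv.size ());  -- would read freed memory */
    return 0;
  }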

> --- gcc/pretty-print.h.jj 2023-09-06 14:36:53.485246347 +0200
> +++ gcc/pretty-print.h2023-09-08 11:11:21.173649942 +0200
> @@ -333,28 +333,6 @@ pp_get_prefix (const pretty_printer *pp)
>  #define pp_decimal_int(PP, I)  pp_scalar (PP, "%d", I)
>  #define pp_unsigned_wide_integer(PP, I) \
> pp_scalar (PP, HOST_WIDE_INT_PRINT_UNSIGNED, (unsigned HOST_WIDE_INT) I)
> -#define pp_wide_int(PP, W, SGN)  \
> -  do \
> -{\
> -  const wide_int_ref &pp_wide_int_ref = (W); \
> -  unsigned int pp_wide_int_prec  \
> - = pp_wide_int_ref.get_precision (); \
> -  if ((pp_wide_int_prec + 3) / 4 \
> -   > sizeof (pp_buffer (PP)->digit_buffer) - 3)  \
> - {   \
> -   char *pp_wide_int_buf \
> - = XALLOCAVEC (char, (pp_wide_int_prec + 3) / 4 + 3);\
> -   print_dec (pp_wide_int_ref, pp_wide_int_buf, SGN);\
> -   pp_string (PP, pp_wide_int_buf);  \
> - }   \
> -  else   \
> - {   \
> -   print_dec (pp_wide_int_ref,   \
> -  pp_buffer (PP)->digit_buffer, SGN);\
> -   pp_string (PP, pp_buffer (PP)->digit_buffer); \
> - }   \
> -}\
> -  while (0)
>  #define pp_vrange(PP, R) \
>do \
>  {\
> @@ -453,6 +431,19 @@ pp_wide_integer (pretty_printer *pp, HOS
>pp_scalar (pp, HOST_WIDE_INT_PRINT_DEC, i);
>  }
>  
> +inline void
> +pp_wide_int (pretty_printer *pp, const wide_int_ref &w, signop sgn)
> +{
> +  unsigned int prec = w.get_precision ();
> +  if (UNLIKELY ((prec + 3) / 4 > sizeof (pp_buffer (pp)->digit_buffer) - 3))
> +pp_wide_int_large (pp, w, sgn);
> +  else
> +{
> +  print_dec (w, pp_buffer (pp)->digit_buffer, sgn);
> +  pp_string (pp, pp_buffer (pp)->digit_buffer);
> +}
> +}
> +
>  template
>  void 

[PATCH] Allow target attributes in non-gnu namespaces

2023-09-08 Thread Richard Sandiford via Gcc-patches
Currently there are four static sources of attributes:

- LANG_HOOKS_ATTRIBUTE_TABLE
- LANG_HOOKS_COMMON_ATTRIBUTE_TABLE
- LANG_HOOKS_FORMAT_ATTRIBUTE_TABLE
- TARGET_ATTRIBUTE_TABLE

All of the attributes in these tables go in the "gnu" namespace.
This means that they can use the traditional GNU __attribute__((...))
syntax and the standard [[gnu::...]] syntax.

Standard attributes are registered dynamically with a null namespace.
There are no supported attributes in other namespaces (clang, vendor
namespaces, etc.).

This patch tries to generalise things by making the namespace
part of the attribute specification.

It's usual for multiple attributes to be defined in the same namespace,
so rather than adding the namespace to each individual definition,
it seemed better to group attributes in the same namespace together.
This would also allow us to reuse the same table for clang attributes
that are written with the GNU syntax, or other similar situations
where the attribute can be accessed via multiple "spellings".

The patch therefore adds a scoped_attribute_specs that contains
a namespace and a list of attributes in that namespace.

It's still possible to have multiple scoped_attribute_specs
for the same namespace.  E.g. it makes sense to keep the
C++-specific, C/C++-common, and format-related attributes in
separate tables, even though they're all GNU attributes.

Current lists of attributes are terminated by a null name.
Rather than keep that for the new structure, it seemed neater
to use an array_slice.  This also makes the tables slightly more
compact.

In general, a target might want to support attributes in multiple
namespaces.  Rather than have a separate hook for each possibility
(like the three langhooks above), it seemed better to make
TARGET_ATTRIBUTE_TABLE a table of tables.  Specifically, it's
an array_slice of scoped_attribute_specs.

We can do the same thing for langhooks, which allows the three hooks
above to be merged into a single LANG_HOOKS_ATTRIBUTE_TABLE.
It also allows the standard attributes to be registered statically
and checked by the usual attribs.cc checks.

The patch adds a TARGET_GNU_ATTRIBUTES helper for the common case
in which a target wants a single table of gnu attributes.  It can
only be used if the table is free of preprocessor directives.
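
As a rough sketch of the resulting shape (the spellings below are my own
illustration based on the description above, not copied from the patch):

  /* One table per namespace...  */
  static const scoped_attribute_specs my_gnu_attributes =
    { "gnu", { /* array of attribute_spec entries */ } };

  /* ...and the target hook returns an array of such tables.  */
  static const scoped_attribute_specs *const my_attribute_table[] =
    { &my_gnu_attributes };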

There are probably other things we need to do to make vendor namespaces
work smoothly.  E.g. in principle it would be good to make exclusion
sets namespace-aware.  But to some extent we have that with standard
vs. gnu attributes too.  This patch is just supposed to be a first step.

Bootstrapped & regtested on aarch64-linux-gnu and x86_64-linux-gnu.
Also tested on the full target list in config-list.mk.  OK to install?

Richard


gcc/
* attribs.h (scoped_attribute_specs): New structure.
(register_scoped_attributes): Take a reference to a
scoped_attribute_specs instead of separate namespace and array
parameters.
* plugin.h (register_scoped_attributes): Likewise.
* attribs.cc (register_scoped_attributes): Likewise.
(attribute_tables): Change into an array of scoped_attribute_specs
pointers.  Reduce to 1 element for frontends and 1 element for targets.
(empty_attribute_table): Delete.
(check_attribute_tables): Update for changes to attribute_tables.
Use a hash_set to identify duplicates.
(handle_ignored_attributes_option): Update for above changes.
(init_attributes): Likewise.
(excl_pair): Delete.
(test_attribute_exclusions): Update for above changes.  Don't
enforce symmetry for standard attributes in the top-level namespace.
* langhooks-def.h (LANG_HOOKS_COMMON_ATTRIBUTE_TABLE): Delete.
(LANG_HOOKS_FORMAT_ATTRIBUTE_TABLE): Likewise.
(LANG_HOOKS_INITIALIZER): Update accordingly.
(LANG_HOOKS_ATTRIBUTE_TABLE): Define to an empty constructor.
* langhooks.h (lang_hooks::common_attribute_table): Delete.
(lang_hooks::format_attribute_table): Likewise.
(lang_hooks::attribute_table): Redefine to an array of
scoped_attribute_specs pointers.
* target-def.h (TARGET_GNU_ATTRIBUTES): New macro.
* target.def (attribute_spec): Redefine to return an array of
scoped_attribute_specs pointers.
* tree-inline.cc (function_attribute_inlinable_p): Update accordingly.
* doc/tm.texi: Regenerate.
* config/aarch64/aarch64.cc (aarch64_attribute_table): Define using
TARGET_GNU_ATTRIBUTES.
* config/alpha/alpha.cc (vms_attribute_table): Likewise.
* config/avr/avr.cc (avr_attribute_table): Likewise.
* config/bfin/bfin.cc (bfin_attribute_table): Likewise.
* config/bpf/bpf.cc (bpf_attribute_table): Likewise.
* config/csky/csky.cc (csky_attribute_table): Likewise.
* config/epiphany/epiphany.cc (epiphany_attribute_table): Likewise.
* config/gcn/gcn.cc (gcn_attribute_table): 

Re: [PATCH V3] Support folding min(poly,poly) to const

2023-09-08 Thread Richard Sandiford via Gcc-patches
Lehua Ding  writes:
> V3 change: Address Richard's comments.
>
> Hi,
>
> This patch adds support that tries to fold `MIN (poly, poly)` to
> a constant. Consider the following C Code:
>
> ```
> void foo2 (int* restrict a, int* restrict b, int n)
> {
> for (int i = 0; i < 3; i += 1)
>   a[i] += b[i];
> }
> ```
>
> Before this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>   unsigned long _32;
>
>[local count: 268435456]:
>   _32 = MIN_EXPR <3, POLY_INT_CST [4, 4]>;
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, _32, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, _32, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, _32, 0, vect__7.27_9); [tail 
> call]
>   return;
>
> }
> ```
>
> After this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>
>[local count: 268435456]:
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, 3, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, 3, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, 3, 0, vect__7.27_9); [tail call]
>   return;
>
> }
> ```
>
> For RISC-V RVV, csrr and branch instructions can be reduced:
>
> Before this patch:
>
> ```
> foo2:
> csrr    a4,vlenb
> srli    a4,a4,2
> li      a5,3
> bleu    a5,a4,.L5
> mv  a5,a4
> .L5:
> vsetvli zero,a5,e32,m1,ta,ma
> ...
> ```
>
> After this patch.
>
> ```
> foo2:
>   vsetivli zero,3,e32,m1,ta,ma
> ...
> ```
>
> Best,
> Lehua
>
> gcc/ChangeLog:
>
>   * fold-const.cc (can_min_p): New function.
>   (poly_int_binop): Try fold MIN_EXPR.

OK, thanks.

Richard

> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/vls/div-1.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/vls/shift-3.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/fold-min-poly.c: New test.
>
> ---
>  gcc/fold-const.cc | 24 +++
>  .../riscv/rvv/autovec/fold-min-poly.c | 24 +++
>  .../gcc.target/riscv/rvv/autovec/vls/div-1.c  |  2 +-
>  .../riscv/rvv/autovec/vls/shift-3.c   |  2 +-
>  4 files changed, 50 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1da498a3152..d19b4666c65 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -1213,6 +1213,25 @@ wide_int_binop (wide_int &res,
>return true;
>  }
>  
> +/* Returns true if we know who is smaller or equal, ARG1 or ARG2, and set the
> +   min value to RES.  */
> +bool
> +can_min_p (const_tree arg1, const_tree arg2, poly_wide_int &res)
> +{
> +  if (known_le (wi::to_poly_widest (arg1), wi::to_poly_widest (arg2)))
> +{
> +  res = wi::to_poly_wide (arg1);
> +  return true;
> +}
> +  else if (known_le (wi::to_poly_widest (arg2), wi::to_poly_widest (arg1)))
> +{
> +  res = wi::to_poly_wide (arg2);
> +  return true;
> +}
> +
> +  return false;
> +}
> +
>  /* Combine two poly int's ARG1 and ARG2 under operation CODE to
> produce a new constant in RES.  Return FALSE if we don't know how
> to evaluate CODE at compile-time.  */
> @@ -1261,6 +1280,11 @@ poly_int_binop (poly_wide_int &res, enum tree_code code,
>   return false;
>break;
>  
> +case MIN_EXPR:
> +  if (!can_min_p (arg1, arg2, res))
> + return false;
> +  break;
> +
>  default:
>return false;
>  }
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> new file mode 100644
> index 000..de4c472c76e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options " -march=rv64gcv_zvl128b -mabi=lp64d -O3 --param 
> riscv-autovec-preference=scalable --param riscv-autovec-lmul=m1 
> -fno-vect-cost-model" } */
> +
> +void foo1 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 4; i += 1)
> +  a[i] += b[i];
> +}
> +
> +void foo2 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 3; i += 1)
> +  a[i] += b[i];
> +}
> +
> +void foo3 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 5; i += 1)
> +  a[i] += b[i];
> +}
> +
> +/* { dg-final { scan-assembler-not {\tcsrr\t} } } */
> +/* { dg-final { scan-assembler {\tvsetivli\tzero,4,e32,m1,t[au],m[au]} } } */
> +/* { dg-final { scan-assembler {\tvsetivli\tzero,3,e32,m1,t[au],m[au]} } } */
> diff --git 

Re: [PATCH V2] Support folding min(poly,poly) to const

2023-09-08 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> Lehua Ding  writes:
>> Hi,
>>
>> This patch adds support that tries to fold `MIN (poly, poly)` to
>> a constant. Consider the following C Code:
>>
>> ```
>> void foo2 (int* restrict a, int* restrict b, int n)
>> {
>> for (int i = 0; i < 3; i += 1)
>>   a[i] += b[i];
>> }
>> ```
>>
>> Before this patch:
>>
>> ```
>> void foo2 (int * restrict a, int * restrict b, int n)
>> {
>>   vector([4,4]) int vect__7.27;
>>   vector([4,4]) int vect__6.26;
>>   vector([4,4]) int vect__4.23;
>>   unsigned long _32;
>>
>>[local count: 268435456]:
>>   _32 = MIN_EXPR <3, POLY_INT_CST [4, 4]>;
>>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, _32, 0);
>>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, _32, 0);
>>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, _32, 0, vect__7.27_9); [tail 
>> call]
>>   return;
>>
>> }
>> ```
>>
>> After this patch:
>>
>> ```
>> void foo2 (int * restrict a, int * restrict b, int n)
>> {
>>   vector([4,4]) int vect__7.27;
>>   vector([4,4]) int vect__6.26;
>>   vector([4,4]) int vect__4.23;
>>
>>[local count: 268435456]:
>>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, 3, 0);
>>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, 3, 0);
>>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, 3, 0, vect__7.27_9); [tail 
>> call]
>>   return;
>>
>> }
>> ```
>>
>> For RISC-V RVV, csrr and branch instructions can be reduced:
>>
>> Before this patch:
>>
>> ```
>> foo2:
>> csrr    a4,vlenb
>> srli    a4,a4,2
>> li      a5,3
>> bleu    a5,a4,.L5
>> mv  a5,a4
>> .L5:
>> vsetvli zero,a5,e32,m1,ta,ma
>> ...
>> ```
>>
>> After this patch.
>>
>> ```
>> foo2:
>>  vsetivli zero,3,e32,m1,ta,ma
>> ...
>> ```
>>
>> Best,
>> Lehua
>>
>> gcc/ChangeLog:
>>
>>  * fold-const.cc (can_min_p): New function.
>>  (poly_int_binop): Try fold MIN_EXPR.
>
> OK, thanks.

Sorry, just realised that the poly_int_tree_p tests are redundant.
The caller has already checked that.

Richard

> Richard
>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/riscv/rvv/autovec/vls/div-1.c: Adjust.
>>  * gcc.target/riscv/rvv/autovec/vls/shift-3.c: Adjust.
>>  * gcc.target/riscv/rvv/autovec/fold-min-poly.c: New test.
>>
>> ---
>>  gcc/fold-const.cc | 27 +++
>>  .../riscv/rvv/autovec/fold-min-poly.c | 24 +
>>  .../gcc.target/riscv/rvv/autovec/vls/div-1.c  |  2 +-
>>  .../riscv/rvv/autovec/vls/shift-3.c   |  2 +-
>>  4 files changed, 53 insertions(+), 2 deletions(-)
>>  create mode 100644 
>> gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>>
>> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> index 1da498a3152..ba4b6f3f3a3 100644
>> --- a/gcc/fold-const.cc
>> +++ b/gcc/fold-const.cc
>> @@ -1213,6 +1213,28 @@ wide_int_binop (wide_int &res,
>>return true;
>>  }
>>  
>> +/* Returns true if we know who is smaller or equal, ARG1 or ARG2, and set 
>> the
>> +   min value to RES.  */
>> +bool
>> +can_min_p (const_tree arg1, const_tree arg2, poly_wide_int &res)
>> +{
>> +  if (!poly_int_tree_p (arg1) || !poly_int_tree_p (arg2))
>> +return false;
>> +
>> +  if (known_le (wi::to_poly_widest (arg1), wi::to_poly_widest (arg2)))
>> +{
>> +  res = wi::to_poly_wide (arg1);
>> +  return true;
>> +}
>> +  else if (known_le (wi::to_poly_widest (arg2), wi::to_poly_widest (arg1)))
>> +{
>> +  res = wi::to_poly_wide (arg2);
>> +  return true;
>> +}
>> +
>> +  return false;
>> +}
>> +
>>  /* Combine two poly int's ARG1 and ARG2 under operation CODE to
>> produce a new constant in RES.  Return FALSE if we don't know how
>> to evaluate CODE at compile-time.  */
>> @@ -1261,6 +1283,11 @@ poly_int_binop (poly_wide_int &res, enum tree_code code,
>>  return false;
>>break;
>>  
>> +case MIN_EXPR:
>> +  if (!can_min_p (arg1, arg2, res))
>> +return false;
>> +  break;
>> +
>>  default:
>>return false;
>>  }
>> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c 
>> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>> new file mode 100644
>> index 000..de4c472c76e
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>> @@ -0,0 +1,24 @@
>> +/* { dg-do compile } */
>> +/* { dg-options " -march=rv64gcv_zvl128b -mabi=lp64d -O3 --param 
>> riscv-autovec-preference=scalable --param riscv-autovec-lmul=m1 
>> -fno-vect-cost-model" } */
>> +
>> +void foo1 (int* restrict a, int* restrict b, int n)
>> +{
>> +for (int i = 0; i < 4; i += 1)
>> +  a[i] += b[i];
>> +}
>> +
>> +void foo2 (int* restrict a, int* restrict b, int n)
>> +{
>> +for (int i = 0; i < 3; i += 1)
>> +  a[i] += b[i];
>> +}
>> +
>> +void foo3 (int* restrict a, int* restrict b, 

Re: [PATCH V2] Support folding min(poly,poly) to const

2023-09-08 Thread Richard Sandiford via Gcc-patches
Lehua Ding  writes:
> Hi,
>
> This patch adds support that tries to fold `MIN (poly, poly)` to
> a constant. Consider the following C Code:
>
> ```
> void foo2 (int* restrict a, int* restrict b, int n)
> {
> for (int i = 0; i < 3; i += 1)
>   a[i] += b[i];
> }
> ```
>
> Before this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>   unsigned long _32;
>
>[local count: 268435456]:
>   _32 = MIN_EXPR <3, POLY_INT_CST [4, 4]>;
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, _32, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, _32, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, _32, 0, vect__7.27_9); [tail 
> call]
>   return;
>
> }
> ```
>
> After this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>
>[local count: 268435456]:
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, 3, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, 3, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, 3, 0, vect__7.27_9); [tail call]
>   return;
>
> }
> ```
>
> For RISC-V RVV, csrr and branch instructions can be reduced:
>
> Before this patch:
>
> ```
> foo2:
> csrr    a4,vlenb
> srli    a4,a4,2
> li      a5,3
> bleu    a5,a4,.L5
> mv  a5,a4
> .L5:
> vsetvli zero,a5,e32,m1,ta,ma
> ...
> ```
>
> After this patch.
>
> ```
> foo2:
>   vsetivli zero,3,e32,m1,ta,ma
> ...
> ```
>
> Best,
> Lehua
>
> gcc/ChangeLog:
>
>   * fold-const.cc (can_min_p): New function.
>   (poly_int_binop): Try fold MIN_EXPR.

OK, thanks.

Richard

> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/vls/div-1.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/vls/shift-3.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/fold-min-poly.c: New test.
>
> ---
>  gcc/fold-const.cc | 27 +++
>  .../riscv/rvv/autovec/fold-min-poly.c | 24 +
>  .../gcc.target/riscv/rvv/autovec/vls/div-1.c  |  2 +-
>  .../riscv/rvv/autovec/vls/shift-3.c   |  2 +-
>  4 files changed, 53 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1da498a3152..ba4b6f3f3a3 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -1213,6 +1213,28 @@ wide_int_binop (wide_int &res,
>return true;
>  }
>  
> +/* Returns true if we know who is smaller or equal, ARG1 or ARG2, and set the
> +   min value to RES.  */
> +bool
> +can_min_p (const_tree arg1, const_tree arg2, poly_wide_int &res)
> +{
> +  if (!poly_int_tree_p (arg1) || !poly_int_tree_p (arg2))
> +return false;
> +
> +  if (known_le (wi::to_poly_widest (arg1), wi::to_poly_widest (arg2)))
> +{
> +  res = wi::to_poly_wide (arg1);
> +  return true;
> +}
> +  else if (known_le (wi::to_poly_widest (arg2), wi::to_poly_widest (arg1)))
> +{
> +  res = wi::to_poly_wide (arg2);
> +  return true;
> +}
> +
> +  return false;
> +}
> +
>  /* Combine two poly int's ARG1 and ARG2 under operation CODE to
> produce a new constant in RES.  Return FALSE if we don't know how
> to evaluate CODE at compile-time.  */
> @@ -1261,6 +1283,11 @@ poly_int_binop (poly_wide_int &res, enum tree_code code,
>   return false;
>break;
>  
> +case MIN_EXPR:
> +  if (!can_min_p (arg1, arg2, res))
> + return false;
> +  break;
> +
>  default:
>return false;
>  }
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> new file mode 100644
> index 000..de4c472c76e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options " -march=rv64gcv_zvl128b -mabi=lp64d -O3 --param 
> riscv-autovec-preference=scalable --param riscv-autovec-lmul=m1 
> -fno-vect-cost-model" } */
> +
> +void foo1 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 4; i += 1)
> +  a[i] += b[i];
> +}
> +
> +void foo2 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 3; i += 1)
> +  a[i] += b[i];
> +}
> +
> +void foo3 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 5; i += 1)
> +  a[i] += b[i];
> +}
> +
> +/* { dg-final { scan-assembler-not {\tcsrr\t} } } */
> +/* { dg-final { scan-assembler {\tvsetivli\tzero,4,e32,m1,t[au],m[au]} } } */
> +/* { dg-final { scan-assembler {\tvsetivli\tzero,3,e32,m1,t[au],m[au]} } } */
> diff --git 

Re: [PATCH] Support folding min(poly,poly) to const

2023-09-08 Thread Richard Sandiford via Gcc-patches
Lehua Ding  writes:
> Hi,
>
> This patch adds support that tries to fold `MIN (poly, poly)` to
> a constant. Consider the following C Code:
>
> ```
> void foo2 (int* restrict a, int* restrict b, int n)
> {
> for (int i = 0; i < 3; i += 1)
>   a[i] += b[i];
> }
> ```
>
> Before this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>   unsigned long _32;
>
>[local count: 268435456]:
>   _32 = MIN_EXPR <3, POLY_INT_CST [4, 4]>;
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, _32, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, _32, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, _32, 0, vect__7.27_9); [tail 
> call]
>   return;
>
> }
> ```
>
> After this patch:
>
> ```
> void foo2 (int * restrict a, int * restrict b, int n)
> {
>   vector([4,4]) int vect__7.27;
>   vector([4,4]) int vect__6.26;
>   vector([4,4]) int vect__4.23;
>
>[local count: 268435456]:
>   vect__4.23_20 = .MASK_LEN_LOAD (a_11(D), 32B, { -1, ... }, 3, 0);
>   vect__6.26_15 = .MASK_LEN_LOAD (b_12(D), 32B, { -1, ... }, 3, 0);
>   vect__7.27_9 = vect__6.26_15 + vect__4.23_20;
>   .MASK_LEN_STORE (a_11(D), 32B, { -1, ... }, 3, 0, vect__7.27_9); [tail call]
>   return;
>
> }
> ```
>
> For RISC-V RVV, one branch instruction can be reduced:
>
> Before this patch:
>
> ```
> foo2:
> csrr    a4,vlenb
> srli    a4,a4,2
> li      a5,3
> bleu    a5,a4,.L5
> mv  a5,a4
> .L5:
> vsetvli zero,a5,e32,m1,ta,ma
> ...
> ```
>
> After this patch.
>
> ```
> foo2:
>   vsetivli zero,3,e32,m1,ta,ma
> ...
> ```
>
> Best,
> Lehua
>
> gcc/ChangeLog:
>
>   * fold-const.cc (can_min_p): New function.
>   (poly_int_binop): Try fold MIN_EXPR.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/vls/div-1.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/vls/shift-3.c: Adjust.
>   * gcc.target/riscv/rvv/autovec/fold-min-poly.c: New test.
>
> ---
>  gcc/fold-const.cc | 33 +++
>  .../riscv/rvv/autovec/fold-min-poly.c | 24 ++
>  .../gcc.target/riscv/rvv/autovec/vls/div-1.c  |  2 +-
>  .../riscv/rvv/autovec/vls/shift-3.c   |  2 +-
>  4 files changed, 59 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1da498a3152..f7f793cc326 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -1213,6 +1213,34 @@ wide_int_binop (wide_int &res,
>return true;
>  }
>  
> +/* Returns true if we know who is smaller or equal, ARG1 or ARG2., and set 
> the
> +   min value to RES.  */
> +bool
> +can_min_p (const_tree arg1, const_tree arg2, poly_wide_int &res)
> +{
> +  if (tree_fits_poly_int64_p (arg1) && tree_fits_poly_int64_p (arg2))
> +{
> +  if (known_le (tree_to_poly_int64 (arg1), tree_to_poly_int64 (arg2)))
> + res = wi::to_poly_wide (arg1);
> +  else if (known_le (tree_to_poly_int64 (arg2), tree_to_poly_int64 
> (arg1)))
> + res = wi::to_poly_wide (arg2);
> +  else
> + return false;
> +}
> +  else if (tree_fits_poly_uint64_p (arg1) && tree_fits_poly_uint64_p (arg2))
> +{
> +  if (known_le (tree_to_poly_uint64 (arg1), tree_to_poly_uint64 (arg2)))
> + res = wi::to_poly_wide (arg1);
> +  else if (known_le (tree_to_poly_int64 (arg2), tree_to_poly_int64 
> (arg1)))
> + res = wi::to_poly_wide (arg2);
> +  else
> + return false;
> +}
> +  else
> +return false;
> +  return true;
> +}

I think this should instead use poly_int_tree_p and wi::to_poly_widest.
There's no need to handle int64 and uint64 separately.  (And there's
no need to handle just 64-bit types.)
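
For reference, a sketch of the suggested form (this is essentially the
can_min_p that the V2/V3 patches quoted earlier in this digest end up
with).  The intuition: for MIN_EXPR <3, POLY_INT_CST [4, 4]>, known_le
shows 3 <= 4 + 4*n for all n >= 0, so the MIN folds to 3, whereas
MIN <5, [4, 4]> cannot be folded because neither ordering is known:

    /* Return true if we know which of ARG1 and ARG2 is smaller or
       equal, and set RES to that value.  */
    bool
    can_min_p (const_tree arg1, const_tree arg2, poly_wide_int &res)
    {
      if (known_le (wi::to_poly_widest (arg1), wi::to_poly_widest (arg2)))
        {
          res = wi::to_poly_wide (arg1);
          return true;
        }
      else if (known_le (wi::to_poly_widest (arg2), wi::to_poly_widest (arg1)))
        {
          res = wi::to_poly_wide (arg2);
          return true;
        }
      return false;
    }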

Thanks,
Richard

> +
>  /* Combine two poly int's ARG1 and ARG2 under operation CODE to
> produce a new constant in RES.  Return FALSE if we don't know how
> to evaluate CODE at compile-time.  */
> @@ -1261,6 +1289,11 @@ poly_int_binop (poly_wide_int &res, enum tree_code code,
>   return false;
>break;
>  
> +case MIN_EXPR:
> +  if (!can_min_p (arg1, arg2, res))
> + return false;
> +  break;
> +
>  default:
>return false;
>  }
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> new file mode 100644
> index 000..de4c472c76e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/fold-min-poly.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options " -march=rv64gcv_zvl128b -mabi=lp64d -O3 --param 
> riscv-autovec-preference=scalable --param riscv-autovec-lmul=m1 
> -fno-vect-cost-model" } */
> +
> +void foo1 (int* restrict a, int* restrict b, int n)
> +{
> +for (int i = 0; i < 

Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-07 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
> Hi Richard,
>
> I did some testing with the attached v2 that does not restrict to UNARY
> anymore.  As feared ;) there is some more fallout that I'm detailing below.
>
> On Power there is one guality fail (pr43051-1.c) that I would take
> the liberty of ignoring for now.
>
> On x86 there are four fails:
>
>  - cond_op_addsubmuldiv__Float16-2.c: assembler error
>unsupported masking for `vmovsh'.  I guess that's a latent backend
>problem.
>
>  - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
>(cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0).
>That looks like a costing problem and can hopefully solveable by making
>the second compare more expensive, preventing the propagation.
>i386 costing (or every costing?) is brittle so that could well break other
>things. 
>
>  - pr88873.c: This is interesting because even before this patch we
>propagated with different register classes (V2DF vs DI).  With the patch
>we check the register pressure, find the class NO_REGS for V2DF and
>abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
>of keeping the old behavior for reg-reg propagations and only checking
>the pressure for more complex operations.
>
> aarch64 has the most fails:
>
>  - One guality fail (same as Power).
>  - shrn-combine-[123].c as before.
>
>  - A class of (hopefully, I only checked some) similar cases where we
>propagate an unspec_whilelo into an unspec_ptest.  Before we would only
>set a REG_EQUALS note.
>Before we managed to create a while_ultsivnx16bi_cc whereas now we have
>while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
>We create redundant whilelos and I'm not sure how to improve that. I
>guess a peephole is out of the question :)

Yeah, I think this is potentially a blocker for propagating A into B
when A is used elsewhere.  Combine is able to combine A and B while
keeping A in parallel with the result.  I think either fwprop would
need to try that too, or it would need to be restricted to cases where A
is only used in B.

I don't think that's really a UNARY_P-or-not thing.  The same problem
would in principle apply to plain unary operations.

>  - pred-combine-and.c: Here the new propagation appears useful at first.

I'm not sure about that, because...

>We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
>individual and registers remain live up to the propagation site (while
>being dead before the patch).
>With the registers dead, combine could create a single fcmgt before.
>Now it only manages a 2->2 combination because we still need the registers
>and end up with two fcmgts.
>The code is worse but this seems more bad luck than anything.

...the fwprop1 change is:

(insn 26 25 27 4 (set (reg:VNx4BI 102 [ vec_mask_and_64 ])
(and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
(reg:VNx4BI 98 [ loop_mask_58 ]))) 8420 {andvnx4bi3}
 (nil))
...
(insn 31 30 32 4 (set (reg:VNx4BI 106 [ mask__24.18 ])
(and:VNx4BI (reg:VNx4BI 118 [ mask__25.17 ])
(reg:VNx4BI 102 [ vec_mask_and_64 ]))) 8420 {andvnx4bi3}
 (nil))

to:

(insn 26 25 27 4 (set (reg:VNx4BI 102 [ vec_mask_and_64 ])
(and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
(reg:VNx4BI 98 [ loop_mask_58 ]))) 8420 {andvnx4bi3}
 (expr_list:REG_DEAD (reg:VNx4BI 116 [ mask__30.13 ])
(expr_list:REG_DEAD (reg:VNx4BI 98 [ loop_mask_58 ])
(nil
...
(insn 31 30 32 4 (set (reg:VNx4BI 106 [ mask__24.18 ])
(and:VNx4BI (and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
(reg:VNx4BI 98 [ loop_mask_58 ]))
(reg:VNx4BI 118 [ mask__25.17 ]))) 8428 {aarch64_pred_andvnx4bi_z}
 (nil))

On its own this isn't worse.  But it's also not a win.  The before and
after sequences have equal cost, but the after sequence is more complex.

That would probably be OK for something that runs near the end of
the pre-RA pipeline, since it could in principle increase parallelism.
But it's probably a bad idea when we know that the main instruction
combination pass is still to run.  By making insn 31 more complex,
we're making it a less likely combination candidate.

So this isn't necessarily the wrong thing to do.  But I think it is
the wrong time to do it.

>  - Addressing fails from before:  I looked into these and suspect all of
>them are a similar.
>What happens is that we have a poly_int offset that we shift, negate
>and then add to x0.  The result is used as load address.
>Before, we would pull (combine) the (plus x0 reg) into the load keeping
>the neg and shift.
>Now we propagate everything into a single (set (minus x0 offset)).
>The propagation itself seems worthwhile because we save one insn.
>However as we got rid of the base/offset split by lumping everything
>together, combine cannot pull the (plus) into the 

[PATCH] Tweak language choice in config-list.mk

2023-09-07 Thread Richard Sandiford via Gcc-patches
When I tried to use config-list.mk, the build for every triple except
the build machine's failed for m2.  This is because, unlike other
languages, m2 builds target objects during all-gcc.  The build will
therefore fail unless you have access to an appropriate binutils
(or an equivalent).  That's quite a big ask for over 100 targets. :)

This patch therefore makes m2 an optional inclusion.

Doing that wasn't entirely straightforward though.  The current
configure line includes "--enable-languages=all,...", which means
that the "..." can only force languages to be added that otherwise
wouldn't have been.  (I.e. the only effect of the "..." is to
override configure autodetection.)

The choice of all,ada and:

  # Make sure you have a recent enough gcc (with ada support) in your path so
  # that --enable-werror-always will work.

make it clear that lack of GNAT should be a build failure rather than
silently ignored.  This predates the D frontend, which requires GDC
in the same way that Ada requires GNAT.  I don't know of a reason
why D should be treated differently.

The patch therefore expands the "all" into a specific list of
languages.

That in turn meant that Fortran had to be handled specially,
since bpf and mmix don't support Fortran.

Perhaps there's an argument that m2 shouldn't build target objects
during all-gcc, but (a) it works for practical usage and (b) the
patch is an easy workaround.  I'd be happy for the patch to be
reverted if the build system changes.

OK to install?

Richard


gcc/
* contrib/config-list.mk (OPT_IN_LANGUAGES): New variable.
($(LIST)): Replace --enable-languages=all with a specific list.
Disable fortran on bpf and mmix.  Enable the languages in
OPT_IN_LANGUAGES.
---
 contrib/config-list.mk | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/contrib/config-list.mk b/contrib/config-list.mk
index e570b13c71b..50ecb014bc0 100644
--- a/contrib/config-list.mk
+++ b/contrib/config-list.mk
@@ -12,6 +12,11 @@ TEST=all-gcc
 # supply an absolute path.
 GCC_SRC_DIR=../../gcc
 
+# Define this to ,m2 if you want to build Modula-2.  Modula-2 builds target
+# objects during all-gcc, so it can only be included if you've installed
+# binutils (or an equivalent) for each target.
+OPT_IN_LANGUAGES=
+
 # Use -j / -l make arguments and nice to assure a smooth resource-efficient
 # load on the build machine, e.g. for 24 cores:
 # svn co svn://gcc.gnu.org/svn/gcc/branches/foo-branch gcc
@@ -126,17 +131,23 @@ $(LIST): make-log-dir
 		TGT=`echo $@ | awk 'BEGIN { FS = "OPT" }; { print $$1 }'` &&	\
 		TGT=`$(GCC_SRC_DIR)/config.sub $$TGT` &&			\
 		case $$TGT in							\
-			*-*-darwin* | *-*-cygwin* | *-*-mingw* | *-*-aix* | bpf-*-*)\
+			bpf-*-*)						\
 				ADDITIONAL_LANGUAGES="";			\
 				;;						\
-			*)							\
+			*-*-darwin* | *-*-cygwin* | *-*-mingw* | *-*-aix* | bpf-*-*)\
+				ADDITIONAL_LANGUAGES=",fortran";		\
+				;;						\
+			mmix-*-*)						\
 				ADDITIONAL_LANGUAGES=",go";			\
 				;;						\
+			*)							\
+				ADDITIONAL_LANGUAGES=",fortran,go";		\
+				;;						\
 		esac &&								\
 		$(GCC_SRC_DIR)/configure					\
 			--target=$(subst SCRIPTS,`pwd`/../scripts/,$(subst OPT,$(empty) -,$@))	\
 			--enable-werror-always ${host_options}			\
-			--enable-languages=all,ada$$ADDITIONAL_LANGUAGES;	\
+			--enable-languages=c,ada,c++,d,lto,objc,obj-c++,rust$$ADDITIONAL_LANGUAGES$(OPT_IN_LANGUAGES); \
) > log/$@-config.out 2>&1
 
 $(LOGFILES) : log/%-make.out : %
-- 
2.25.1



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-06 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
> Hi Richard,
>
> I did some testing with the attached v2 that does not restrict to UNARY
> anymore.  As feared ;) there is some more fallout that I'm detailing below.
>
> On Power there is one guality fail (pr43051-1.c) that I would take
> the liberty of ignoring for now.
>
> On x86 there are four fails:
>
>  - cond_op_addsubmuldiv__Float16-2.c: assembler error
>unsupported masking for `vmovsh'.  I guess that's a latent backend
>problem.
>
>  - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
>(cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0).
>That looks like a costing problem and can hopefully solveable by making
>the second compare more expensive, preventing the propagation.
>i386 costing (or every costing?) is brittle so that could well break other
>things. 
>
>  - pr88873.c: This is interesting because even before this patch we
>propagated with different register classes (V2DF vs DI).  With the patch
>we check the register pressure, find the class NO_REGS for V2DF and
>abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
>of keeping the old behavior for reg-reg propagations and only checking
>the pressure for more complex operations.
>
> aarch64 has the most fails:
>
>  - One guality fail (same as Power).
>  - shrn-combine-[123].c as before.
>
>  - A class of (hopefully, I only checked some) similar cases where we
>propagate an unspec_whilelo into an unspec_ptest.  Before we would only
>set a REG_EQUALS note.
>Before we managed to create a while_ultsivnx16bi_cc whereas now we have
>while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
>We create redundant whilelos and I'm not sure how to improve that. I
>guess a peephole is out of the question :)
>
>  - pred-combine-and.c: Here the new propagation appears useful at first.
>We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
>individual and registers remain live up to the propagation site (while
>being dead before the patch).
>With the registers dead, combine could create a single fcmgt before.
>Now it only manages a 2->2 combination because we still need the registers
>and end up with two fcmgts.
>The code is worse but this seems more bad luck than anything.
>
>  - Addressing fails from before:  I looked into these and suspect all of
>them are a similar.
>What happens is that we have a poly_int offset that we shift, negate
>and then add to x0.  The result is used as load address.
>Before, we would pull (combine) the (plus x0 reg) into the load keeping
>the neg and shift.
>Now we propagate everything into a single (set (minus x0 offset)).
>The propagation itself seems worthwhile because we save one insn.
>However as we got rid of the base/offset split by lumping everything
>together, combine cannot pull the (plus) into the address load and
>we require an aarch64_split_add_offset.  This will emit the longer
>sequence of ashiftl and subtract.  The "base" address is x0 here so
>we cannot convert (minus x0 ...)) into neg.
>I didn't go through all of aarch64_split_add_offset.  I suppose we
>could re-add the separation of base/offset there but that might be
>a loss when the result is not used as an address. 
>
> Again, all in all no fatal problems but pretty annoying :)  It's not much
> but just gradually worse than with just UNARY.  Any idea on how/whether to
> continue?

Thanks for giving it a go.  Can you post the latest version of the
regpressure patch too?  The previous on-list version I could find
seems to be too old.

Thanks,
Richard

> Regards
>  Robin
>
> gcc/ChangeLog:
>
>   * fwprop.cc (fwprop_propagation::profitable_p): Add unary
>   handling.
>   (fwprop_propagation::update_register_pressure): New function.
>   (fwprop_propagation::register_pressure_high_p): New function
>   (reg_single_def_for_src_p): Look through unary expressions.
>   (try_fwprop_subst_pattern): Check register pressure.
>   (forward_propagate_into): Call new function.
>   (fwprop_init): Init register pressure.
>   (fwprop_done): Clean up register pressure.
>   (fwprop_insn): Add comment.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
> ---
>  gcc/fwprop.cc | 359 +-
>  .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 
>  2 files changed, 419 insertions(+), 4 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
>
> diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
> index 0707a234726..ce6f5a74b00 100644
> --- a/gcc/fwprop.cc
> +++ b/gcc/fwprop.cc
> @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "rtl-iter.h"
>  #include "target.h"
> +#include "dominance.h"

Re: [PATCH] LoongArch: Fix unintentional bash-ism in r14-3665.

2023-09-06 Thread Richard Sandiford via Gcc-patches
Yang Yujie  writes:
> gcc/ChangeLog:
>
>   * config.gcc: remove non-POSIX syntax "<<<".

OK.  Thanks for the quick fix.

Richard.

> ---
>  gcc/config.gcc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index b2fe7c7ceef..6d4c8becd28 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -5189,7 +5189,7 @@ case "${target}" in
>   if test x${parse_state} = x"abi-base"; then
>   # Base ABI type
>   case ${component} in
> - lp64d | lp64f | lp64s) 
> elem_tmp="ABI_BASE_$(tr a-z A-Z <<< ${component}),";;
> + lp64d | lp64f | lp64s) 
> elem_tmp="ABI_BASE_$(echo ${component} | tr a-z A-Z),";;
>   *)
>   echo "Unknown base ABI 
> \"${component}\" in --with-multilib-list." 1>&2
>   exit 1


Re: [PATCH v1 2/6] LoongArch: improved target configuration interface

2023-09-06 Thread Richard Sandiford via Gcc-patches
Yang Yujie  writes:
> @@ -5171,25 +5213,21 @@ case "${target}" in
>   # ${with_multilib_list} should not contain whitespaces,
>   # consecutive commas or slashes.
>   if echo "${with_multilib_list}" \
> - | grep -E -e "[[:space:]]" -e '[,/][,/]' -e '[,/]$' -e '^[,/]' 
> > /dev/null; then
> + | grep -E -e "[[:space:]]" -e '[,/][,/]' -e '[,/]$' -e '^[,/]' 
> > /dev/null 2>&1; then
>   echo "Invalid argument to --with-multilib-list." 1>&2
>   exit 1
>   fi
>  
> - unset component idx elem_abi_base elem_abi_ext elem_tmp
> + unset component elem_abi_base elem_abi_ext elem_tmp parse_state 
> all_abis
>   for elem in $(echo "${with_multilib_list}" | tr ',' ' '); do
> - idx=0
> - while true; do
> - idx=$((idx + 1))
> - component=$(echo "${elem}" | awk -F'/' '{print 
> $'"${idx}"'}')
> -
> - case ${idx} in
> - 1)
> - # Component 1: Base ABI type
> + unset elem_abi_base elem_abi_ext
> + parse_state="abi-base"
> +
> + for component in $(echo "${elem}" | tr '/' ' '); do
> + if test x${parse_state} = x"abi-base"; then
> + # Base ABI type
>   case ${component} in
> - lp64d) elem_tmp="ABI_BASE_LP64D,";;
> - lp64f) elem_tmp="ABI_BASE_LP64F,";;
> - lp64s) elem_tmp="ABI_BASE_LP64S,";;
> + lp64d | lp64f | lp64s) 
> elem_tmp="ABI_BASE_$(tr a-z A-Z <<< ${component}),";;

"<<<" isn't portable shell.  Could you try with:

  echo ${component} | tr ...

instead?

As it stands, this causes a bootstrap failure with non-bash shells
such as dash, even on non-Loongson targets.

(Part of me wishes that we'd just standardise on bash.  But since that
isn't the policy, I sometimes use dash to pick up my own lapses.)

Thanks,
Richard


Re: [PATCH 10/11] aarch64: Fix branch-protection error message tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> Update tests for the new branch-protection parser errors.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/branch-protection-attr.c: Update.
>   * gcc.target/aarch64/branch-protection-option.c: Update.

OK, thanks.  (And I agree these are better messages. :))

I think that's the last of the AArch64-specific ones.  The others
will need to be reviewed by Kyrill or Richard.

Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c   | 6 +++---
>  gcc/testsuite/gcc.target/aarch64/branch-protection-option.c | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c 
> b/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c
> index 272000c2747..dae2a758a56 100644
> --- a/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c
> +++ b/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c
> @@ -4,19 +4,19 @@ void __attribute__ ((target("branch-protection=leaf")))
>  foo1 ()
>  {
>  }
> -/* { dg-error {invalid protection type 'leaf' in 
> 'target\("branch-protection="\)' pragma or attribute} "" { target *-*-* } 5 } 
> */
> +/* { dg-error {invalid argument 'leaf' for 'target\("branch-protection="\)'} 
> "" { target *-*-* } 5 } */
>  /* { dg-error {pragma or attribute 'target\("branch-protection=leaf"\)' is 
> not valid} "" { target *-*-* } 5 } */
>  
>  void __attribute__ ((target("branch-protection=none+pac-ret")))
>  foo2 ()
>  {
>  }
> -/* { dg-error "unexpected 'pac-ret' after 'none'" "" { target *-*-* } 12 } */
> +/* { dg-error {argument 'none' can only appear alone in 
> 'target\("branch-protection="\)'} "" { target *-*-* } 12 } */
>  /* { dg-error {pragma or attribute 
> 'target\("branch-protection=none\+pac-ret"\)' is not valid} "" { target *-*-* 
> } 12 } */
>  
>  void __attribute__ ((target("branch-protection=")))
>  foo3 ()
>  {
>  }
> -/* { dg-error {missing argument to 'target\("branch-protection="\)' pragma 
> or attribute} "" { target *-*-* } 19 } */
> +/* { dg-error {invalid argument '' for 'target\("branch-protection="\)'} "" 
> { target *-*-* } 19 } */
>  /* { dg-error {pragma or attribute 'target\("branch-protection="\)' is not 
> valid} "" { target *-*-* } 19 } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c 
> b/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c
> index 1b3bf4ee2b8..e2f847a31c4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c
> +++ b/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c
> @@ -1,4 +1,4 @@
>  /* { dg-do "compile" } */
>  /* { dg-options "-mbranch-protection=leaf -mbranch-protection=none+pac-ret" 
> } */
>  
> -/* { dg-error "unexpected 'pac-ret' after 'none'"  "" { target *-*-* } 0 } */
> +/* { dg-error "argument 'none' can only appear alone in 
> '-mbranch-protection='" "" { target *-*-* } 0 } */


Re: [PATCH 07/11] aarch64: Disable branch-protection for pcs tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> The tests manipulate the return address in abitest-2.h and thus not
> compatible with -mbranch-protection=pac-ret+leaf or
> -mbranch-protection=gcs.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/aapcs64/func-ret-1.c: Disable branch-protection.
>   * gcc.target/aarch64/aapcs64/func-ret-2.c: Likewise.
>   * gcc.target/aarch64/aapcs64/func-ret-3.c: Likewise.
>   * gcc.target/aarch64/aapcs64/func-ret-4.c: Likewise.
>   * gcc.target/aarch64/aapcs64/func-ret-64x1_1.c: Likewise.

OK, thanks.

Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c | 1 +
>  5 files changed, 5 insertions(+)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c
> index 5405e1e4920..7bd7757efe6 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c
> @@ -4,6 +4,7 @@
> AAPCS64 \S 4.1.  */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  
>  #ifndef IN_FRAMEWORK
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c
> index 6b171c46fbb..85a822ace4a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c
> @@ -4,6 +4,7 @@
> Homogeneous floating-point aggregate types are covered in func-ret-3.c.  
> */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  
>  #ifndef IN_FRAMEWORK
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c
> index ad312b675b9..1d35ebf14b4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c
> @@ -4,6 +4,7 @@
> in AAPCS64 \S 4.3.5.  */
>  
>  /* { dg-do run { target aarch64-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  /* { dg-require-effective-target aarch64_big_endian } */
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c
> index af05fbe9fdf..15e1408c62d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c
> @@ -5,6 +5,7 @@
> are treated as general composite types.  */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  /* { dg-require-effective-target aarch64_big_endian } */
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c
> index 05957e2dcae..fe7bbb6a835 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c
> @@ -3,6 +3,7 @@
>Test 64-bit singleton vector types which should be in FP/SIMD registers.  
> */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  
>  #ifndef IN_FRAMEWORK


Re: [PATCH 06/11] aarch64: Fix pac-ret eh_return tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> This is needed since eh_return no longer prevents pac-ret in the
> normal return path.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/return_address_sign_1.c: Move func4 to ...
>   * gcc.target/aarch64/return_address_sign_2.c: ... here and fix the
>   scan asm check.
>   * gcc.target/aarch64/return_address_sign_b_1.c: Move func4 to ...
>   * gcc.target/aarch64/return_address_sign_b_2.c: ... here and fix the
>   scan asm check.
> ---
>  .../gcc.target/aarch64/return_address_sign_1.c  | 13 +
>  .../gcc.target/aarch64/return_address_sign_2.c  | 17 +++--
>  .../aarch64/return_address_sign_b_1.c   | 11 ---
>  .../aarch64/return_address_sign_b_2.c   | 17 +++--
>  4 files changed, 31 insertions(+), 27 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c
> index 232ba67ade0..114a9dacb3f 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c
> @@ -37,16 +37,5 @@ func3 (int a, int b, int c)
>/* autiasp */
>  }
>  
> -/* eh_return.  */
> -void __attribute__ ((target ("arch=armv8.3-a")))
> -func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> -{
> -  /* no paciasp */
> -  *ptr = imm1 + foo (imm1) + imm2;
> -  __builtin_eh_return (offset, handler);
> -  /* no autiasp */
> -  return;
> -}
> -
> -/* { dg-final { scan-assembler-times "autiasp" 3 } } */
>  /* { dg-final { scan-assembler-times "paciasp" 3 } } */
> +/* { dg-final { scan-assembler-times "autiasp" 3 } } */

I suppose there is no normal return path here.  I don't know how quickly
we'd realise that though, in the sense that the flag register becomes known-1.
But a quick-and-dirty check would be whether the exit block has a single
predecessor, which in a function that calls eh_return should mean
that the eh_return is unconditional.
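
Something like the following, as a minimal sketch of that quick-and-dirty
check (the helper name is made up and this is not part of the series
under review):

    /* Hypothetical helper: true if __builtin_eh_return is, as far as the
       CFG is concerned, the only way out of the function, i.e. the exit
       block has a single predecessor.  */
    static bool
    eh_return_only_exit_p (void)
    {
      return (crtl->calls_eh_return
              && single_pred_p (EXIT_BLOCK_PTR_FOR_FN (cfun)));
    }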

But that might not be worth worrying about, given the builtin's limited
use case.  And even if it is worth worrying about, keeping the test in
this file would mix correctness with optimisation, which isn't a good
thing for scan-assembler-times.

So yeah, I agree this is OK.  It should probably be part of 03 though,
so that no individual commit causes a regression.

Thanks,
Richard

> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c
> index a4bc5b45333..d93492c3c43 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c
> @@ -14,5 +14,18 @@ func1 (int a, int b, int c)
>/* retaa */
>  }
>  
> -/* { dg-final { scan-assembler-times "paciasp" 1 } } */
> -/* { dg-final { scan-assembler-times "retaa" 1 } } */
> +/* eh_return.  */
> +void __attribute__ ((target ("arch=armv8.3-a")))
> +func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> +{
> +  /* paciasp */
> +  *ptr = imm1 + foo (imm1) + imm2;
> +  if (handler)
> +/* br */
> +__builtin_eh_return (offset, handler);
> +  /* retaa */
> +  return;
> +}
> +
> +/* { dg-final { scan-assembler-times "paciasp" 2 } } */
> +/* { dg-final { scan-assembler-times "retaa" 2 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c
> index 43e32ab6cb7..697fa30dc5a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c
> @@ -37,16 +37,5 @@ func3 (int a, int b, int c)
>/* autibsp */
>  }
>  
> -/* eh_return.  */
> -void __attribute__ ((target ("arch=armv8.3-a")))
> -func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> -{
> -  /* no pacibsp */
> -  *ptr = imm1 + foo (imm1) + imm2;
> -  __builtin_eh_return (offset, handler);
> -  /* no autibsp */
> -  return;
> -}
> -
>  /* { dg-final { scan-assembler-times "pacibsp" 3 } } */
>  /* { dg-final { scan-assembler-times "autibsp" 3 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c
> index 9ed64ce0591..748924c72f3 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c
> @@ -14,5 +14,18 @@ func1 (int a, int b, int c)
>/* retab */
>  }
>  
> -/* { dg-final { scan-assembler-times "pacibsp" 1 } } */
> -/* { dg-final { scan-assembler-times "retab" 1 } } */
> +/* eh_return.  */
> +void __attribute__ ((target ("arch=armv8.3-a")))
> +func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> +{
> +  /* paciasp */
> +  *ptr = imm1 + foo (imm1) + imm2;
> +  if (handler)
> +/* br */
> +__builtin_eh_return (offset, handler);
> +  /* retab */
> +  return;
> +}
> +
> +/* { dg-final { 

Re: [PATCH 05/11] aarch64: Add eh_return compile tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/eh_return-2.c: New test.
>   * gcc.target/aarch64/eh_return-3.c: New test.

OK.

I wonder if it's worth using check-function-bodies for -3.c though.
It would then be easy to verify that the autiasp only occurs on the
normal return path.

Just a suggestion -- the current test is fine too.

Thanks,
Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/eh_return-2.c |  9 +
>  gcc/testsuite/gcc.target/aarch64/eh_return-3.c | 14 ++
>  2 files changed, 23 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/eh_return-2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/eh_return-3.c
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/eh_return-2.c 
> b/gcc/testsuite/gcc.target/aarch64/eh_return-2.c
> new file mode 100644
> index 000..4a9d124e891
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/eh_return-2.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-final { scan-assembler "add\tsp, sp, x5" } } */
> +/* { dg-final { scan-assembler "br\tx6" } } */
> +
> +void
> +foo (unsigned long off, void *handler)
> +{
> +  __builtin_eh_return (off, handler);
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/eh_return-3.c 
> b/gcc/testsuite/gcc.target/aarch64/eh_return-3.c
> new file mode 100644
> index 000..35989eee806
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/eh_return-3.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mbranch-protection=pac-ret+leaf" } */
> +/* { dg-final { scan-assembler "add\tsp, sp, x5" } } */
> +/* { dg-final { scan-assembler "br\tx6" } } */
> +/* { dg-final { scan-assembler "hint\t25 // paciasp" } } */
> +/* { dg-final { scan-assembler "hint\t29 // autiasp" } } */
> +
> +void
> +foo (unsigned long off, void *handler, int c)
> +{
> +  if (c)
> +return;
> +  __builtin_eh_return (off, handler);
> +}


Re: [PATCH 04/11] aarch64: Do not force a stack frame for EH returns

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> EH returns no longer rely on clobbering the return address on the stack
> so forcing a stack frame is not necessary.
>
> This does not actually change the code gen for the unwinder since there
> are calls before the EH return.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_needs_frame_chain): Do not
>   force frame chain for eh_return.

OK once we've agreed on something for 03/11.

Thanks,
Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 36cd172d182..afdbf4213c1 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -8417,8 +8417,7 @@ aarch64_output_probe_sve_stack_clash (rtx base, rtx 
> adjustment,
>  static bool
>  aarch64_needs_frame_chain (void)
>  {
> -  /* Force a frame chain for EH returns so the return address is at FP+8.  */
> -  if (frame_pointer_needed || crtl->calls_eh_return)
> +  if (frame_pointer_needed)
>  return true;
>  
>/* A leaf function cannot have calls or write LR.  */


Re: [PATCH 01/11] aarch64: AARCH64_ISA_RCPC was defined twice

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.h (AARCH64_ISA_RCPC): Remove dup.

OK, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 2b0fc97bb71..c783cb96c48 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -222,7 +222,6 @@ enum class aarch64_feature : unsigned char {
>  #define AARCH64_ISA_MOPS(aarch64_isa_flags & AARCH64_FL_MOPS)
>  #define AARCH64_ISA_LS64(aarch64_isa_flags & AARCH64_FL_LS64)
>  #define AARCH64_ISA_CSSC(aarch64_isa_flags & AARCH64_FL_CSSC)
> -#define AARCH64_ISA_RCPC   (aarch64_isa_flags & AARCH64_FL_RCPC)
>  
>  /* Crypto is an optional extension to AdvSIMD.  */
>  #define TARGET_CRYPTO (AARCH64_ISA_CRYPTO)


Re: testsuite: Port 'check-function-bodies' to nvptx

2023-09-05 Thread Richard Sandiford via Gcc-patches
Thomas Schwinge  writes:
> Hi!
>
> On 2023-09-04T23:05:05+0200, I wrote:
>> On 2019-07-16T15:04:49+0100, Richard Sandiford  
>> wrote:
>>> This patch therefore adds a new check-function-bodies dg-final test
>
>>> The regexps in parse_function_bodies are fairly general, but might
>>> still need to be extended in future for targets like Darwin or AIX.
>>
>> ..., or nvptx.  [...]
>
>> number of TODO items.
>>
>> In particular how to parameterize regular expressions for the different
>> syntax used by nvptx: for example, parameterize via global variables,
>> initialized accordingly (where?)?  Thinking about it, maybe simply
>> conditionalizing the current local initializations by
>> 'if { [istarget nvptx-*-*] } { [...] } else { [...] }' will do, simple
>> enough!
>
> Indeed that works fine.
>
>> Regarding whitespace prefixed, I think I'll go with the current
>> 'append function_regexp "\t" $line "\n"', that is, prefix expected output
>> lines with '\t' (as done in 'gcc.target/nvptx/abort.c'), and also for
>> nvptx handle labels as "fluff" (until we solve that issue generally).
>
> I changed my mind about that: instead of '\t', use '\t*' for nvptx, which
> means that both instructions emitted with additional whitespace prefixed
> and labels in column zero work nicely.
>
>> --- a/gcc/testsuite/lib/scanasm.exp
>> +++ b/gcc/testsuite/lib/scanasm.exp
>
>> @@ -907,7 +911,8 @@ proc check-function-bodies { args } {
>>
>>  set count 0
>>  set function_regexp ""
>> -set label {^(\S+):$}
>> +#TODO
>> +set label {^// BEGIN GLOBAL FUNCTION DEF: ([a-zA-Z_]\S+)$}
>
> There's actually no reason that the expected output syntax (this one) has
> to match the assembly -- so I restored that, to use the same syntax for
> nvptx here, too.
>
> Any comments before I push the attached
> "testsuite: Port 'check-function-bodies' to nvptx"?
>
>
> Grüße
>  Thomas
>
>
> -
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
> München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
> Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
> München, HRB 106955
>
> From bdaf7572d9d4c1988274405840de4071ded3733f Mon Sep 17 00:00:00 2001
> From: Thomas Schwinge 
> Date: Mon, 4 Sep 2023 22:28:12 +0200
> Subject: [PATCH] testsuite: Port 'check-function-bodies' to nvptx
>
> This extends commit 4d706ff86ea86868615558e92407674a4f4b4af9
> "Add dg test for matching function bodies" for nvptx.
>
>   gcc/testsuite/
>   * lib/scanasm.exp (configure_check-function-bodies): New proc.
>   (parse_function_bodies, check-function-bodies): Use it.
>   * gcc.target/nvptx/abort.c: Use 'check-function-bodies'.
>   gcc/
>   * doc/sourcebuild.texi (check-function-bodies): Update.

LGTM.  Just a minor comment:

> ---
>  gcc/doc/sourcebuild.texi   |  9 ++-
>  gcc/testsuite/gcc.target/nvptx/abort.c | 19 ++-
>  gcc/testsuite/lib/scanasm.exp  | 76 --
>  3 files changed, 83 insertions(+), 21 deletions(-)
>
> diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
> index 1a78b3c1abb..8aec6b6592c 100644
> --- a/gcc/doc/sourcebuild.texi
> +++ b/gcc/doc/sourcebuild.texi
> @@ -3327,9 +3327,12 @@ The first line of the expected output for a function 
> @var{fn} has the form:
>  Subsequent lines of the expected output also start with @var{prefix}.
>  In both cases, whitespace after @var{prefix} is not significant.
>  
> -The test discards assembly directives such as @code{.cfi_startproc}
> -and local label definitions such as @code{.LFB0} from the compiler's
> -assembly output.  It then matches the result against the expected
> +Depending on the configuration (see
> +@code{gcc/testsuite/lib/scanasm.exp:configure_check-function-bodies}),

I can imagine such a long string wouldn't format well in the output.
How about: @code{configure_check-function-bodies} in
@filename{gcc/testsuite/lib/scanasm.exp}?

OK from my POV with that change.

Thanks,
Richard

> +the test may discard from the compiler's assembly output
> +directives such as @code{.cfi_startproc},
> +local label definitions such as @code{.LFB0}, and more.
> +It then matches the result against the expected
>  output for a function as a single regular expression.  This means that
>  later lines can use backslashes to refer back to @samp{(@dots{})}
>  captures on earlier lines.  For example:
> diff --git a/gcc/testsuite/gcc.target/nvptx/abort.c 
> b/gcc/testsuite/gcc.target/nvptx/abort.c
> index d3220687400..ae9dbf45a9b 100644
> --- a/gcc/testsuite/gcc.target/nvptx/abort.c
> +++ b/gcc/testsuite/gcc.target/nvptx/abort.c
> @@ -1,4 +1,6 @@
>  /* { dg-do compile} */
> +/* { dg-final { check-function-bodies {**} {} } } */
> +
>  /* Annotate no return functions with a trailing 'trap'.  */
>  
>  extern void abort ();
> @@ -9,5 +11,18 @@ int main (int argc, char **argv)
>  abort ();
>return 0;
>  }
> -
> -/* { dg-final { scan-assembler "call 

Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-05 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
>> So I don't think I have a good feel for the advantages and disadvantages
>> of doing this.  Robin's analysis of the aarch64 changes was nice and
>> detailed though.  I think the one that worries me most is the addressing
>> mode one.  fwprop is probably the first chance we get to propagate adds
>> into addresses, and virtual register elimination means that some of
>> those opportunities won't show up in gimple.
>> 
>> There again, virtual register elimination wouldn't be the reason for
>> the ld4_s8.c failure.  Perhaps there's something missing in expand.
>> 
>> Other than that, I think my main question is: why just unary operations?
>> Is the underlying assumption that we only want to propagate a maximum of
>> one register?  If so, then I think we should check for that directly, by
>> iterating over subrtxes.
>
> The main reason for stopping at unary operations was to limit the scope
> and change as little as possible (not restricting the change to one
> register).  I'm currently testing a v2 that iterates over subrtxs.

Thanks.  Definitely no problem with doing things in small steps, but IMO
it's better if each choice of step can still be justified in its own terms.

>> Perhaps we should allow the optimisation without register-pressure
>> information if (a) the source register and destination register are
>> in the same pressure class and (b) all uses of the destination are
>> being replaced.  (FWIW, rtl-ssa should make it easier to try to
>> replace all definitions at once, with an all-or-nothing choice,
>> if we ever wanted to do that.)
>
> I presume you're referring to replacing one register (dest) in all using
> insns?  Source and destination are somewhat overloaded in fwprop context
> because I'm thinking of the "to be replaced" register as dest when it's
> actually the replacement register.

Yeah.

> AFAICT fwprop currently iterates over insns, going through all their uses
> and trying if an individual use can be substituted.  Do you suggest to
> change this general iteration order to iterate over the defs of an insn
> and then try to replace all the uses at once (e.g. using ssa->change_insns)?

No, I was just noting in passing that we could try do that if we wanted to.
The current code is a fairly mechanical conversion of the original DF-based
code, but there's no reason why it has to continue to work the way it
does now.

> When keeping the current order, wouldn't we need to store all potential
> changes instead of committing them and later apply them in bulk, e.g.
> grouped by use?  This order would also help to pick the propagation
> with the most number of uses (i.e. propagation potential) but maybe
> I'm misunderstanding?

I imagine doing it in reverse postorder would still make sense.

But my point was that, for the current fwprop limitation of substituting
into exactly one use of a register, we can check whether that use is
the *only* use of the register.

I.e. if we substitute:

  A: (set (reg R1) (foo (reg R2)))

into:

  B: (set ... (reg R1) ...)

if R1 and R2 are likely to be in the same register class, and if B
is the only user of R2, then we don't need to calculate register
pressure.  The change is either neutral (if R2 died in A) or an
improvement (if R2 doesn't die in A, and so R1 and R2 were previously
live at the same time).
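
As a concrete illustration of that rule, the decision could be phrased like
this; the function and parameter names are invented for the example and are
not part of the fwprop interface:

  /* Sketch only: when can we skip the register-pressure query?  */
  static bool
  can_skip_pressure_check (bool same_pressure_class, bool b_is_only_use_of_r2)
  {
    /* If R1 and R2 compete for the same registers and B is the only user
       of R2, the substitution is at worst neutral (R2 died in A) and at
       best frees a register (R2 stays live past A, so R1 and R2 already
       overlapped).  */
    return same_pressure_class && b_is_only_use_of_r2;
  }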

Thanks,
Richard


Re: RFC: Introduce -fhardened to enable security-related flags

2023-09-04 Thread Richard Sandiford via Gcc-patches
Qing Zhao via Gcc-patches  writes:
>> On Aug 29, 2023, at 3:42 PM, Marek Polacek via Gcc-patches 
>>  wrote:
>> 
>> Improving the security of software has been a major trend in the recent
>> years.  Fortunately, GCC offers a wide variety of flags that enable extra
>> hardening.  These flags aren't enabled by default, though.  And since
>> there are a lot of hardening flags, with more to come, it's been difficult
>> to keep on top of them; more so for the users of GCC who ought not to be
>> expected to keep track of all the new options.
>> 
>> To alleviate some of the problems I mentioned, we thought it would
>> be useful to provide a new umbrella option that enables a reasonable set
>> of hardening flags.  What's "reasonable" in this context is not easy to
>> pin down.  Surely, there must be no ABI impact, the option cannot cause
>> severe performance issues, and, I suspect, it should not cause build
>> errors by enabling stricter compile-time errors (such as, -Wimplicit-int,
>> -Wint-conversion).  Including a controversial option in -fhardened
>> would likely cause that users would not use -fhardened at all.  It's
>> roughly akin to -Wall or -O2 -- those also enable a reasonable set of
>> options, and evolve over time, and are not kept in sync with other
>> compilers.
>> 
>> Currently, -fhardened enables:
>> 
>>  -D_FORTIFY_SOURCE=3 (or =2 for older glibcs)
>>  -D_GLIBCXX_ASSERTIONS
>>  -ftrivial-auto-var-init=zero
>>  -fPIE  -pie  -Wl,-z,relro,-z,now
>>  -fstack-protector-strong
>>  -fstack-clash-protection
>>  -fcf-protection=full (x86 GNU/Linux only)
>> 
>> -fsanitize=undefined is specifically not enabled.  -fstrict-flex-arrays is
>> also liable to break a lot of code so I didn't include it.
>> 
>> Appended is a proof-of-concept patch.  It doesn't implement --help=hardened
>> yet.  A fairly crucial point is that -fhardened will not override options
>> that were specified on the command line (before or after -fhardened).  For
>> example,
>> 
>> -D_FORTIFY_SOURCE=1 -fhardened
>> 
>> means that _FORTIFY_SOURCE=1 will be used.  Similarly,
>> 
>>  -fhardened -fstack-protector
>> 
>> will not enable -fstack-protector-strong.
>> 
>> Thoughts?
>
> In general, I think that it is a very good idea to provide umbrella options
>  for software security purpose.  Thanks a lot for this work!
>
> 1. I do agree with Martin, multiple-level control for this purpose might be
> needed, similar to the multiple levels for warnings and for optimizations.
>
> Similar to the optimization options, can we organize all the security
> options together in our manual?  Then the user will have a good central
> place to get more complete information on the security features our
> compiler provides.
>
> 2. What are the major criteria for deciding which security feature should
> go into this list?  Later, when we have new security features, how do we
> decide whether to add them to this list or not?
> I am wondering why -fzero-call-used-regs is not included in the list and also

FWIW, I wondered the same thing.  Not a strong conviction that it should
be included -- maybe the code bloat is too much on some targets.  But it
might be acceptable for the -fhardened equivalent of -O3, at least if
restricted to GPRs.
 
> Why choose -ftrivial-auto-var-init=zero instead of
> -ftrivial-auto-var-init=pattern?

Yeah, IIRC -ftrivial-auto-var-init=zero was controversial with some
Clang maintainers because it effectively creates a language dialect.
-ftrivial-auto-var-init=pattern wasn't controversial in the same way.

Thanks,
Richard


Re: [PATCH] Bug 111071: fix the subr with -1 to not due to the simplify.

2023-09-04 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> "yanzhang.wang--- via Gcc-patches"  writes:
>> From: Yanzhang Wang 
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/aarch64/sve/acle/asm/subr_s8.c: Modify subr with -1
>> to not.
>>
>> Signed-off-by: Yanzhang Wang 
>> ---
>>
> Tested in my local arm environment and it passed.  Thanks to Andrew Pinski's
> comment; the code is now the same as what he suggested.
>>
>>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
>> index b9615de6655..1cf6916a5e0 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
>> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_s8_m_untied, svint8_t,
>>  
>>  /*
>>  ** subr_m1_s8_m:
>> -**  mov (z[0-9]+\.b), #-1
>> -**  subrz0\.b, p0/m, z0\.b, \1
>> +**  not z0.b, p0/m, z0.b
>>  **  ret
>>  */
>>  TEST_UNIFORM_Z (subr_m1_s8_m, svint8_t,
>
> I think we need this for subr_u8.c too.  OK with that change,
> and thanks for the fix!

Actually, never mind.  I just saw a patch from Thiago Jung Bauermann
for the same issue, which is now in trunk.  Sorry for the confusion,
and thanks again for posting the fix.

Richard


Re: [PATCH] testsuite: aarch64: Adjust SVE ACLE tests to new generated code

2023-09-04 Thread Richard Sandiford via Gcc-patches
Thiago Jung Bauermann via Gcc-patches  writes:
> Since commit e7a36e4715c7 "[PATCH] RISC-V: Support simplify (-1-x) for
> vector." these tests fail on aarch64-linux:
>
>   === g++ tests ===
>
> Running g++:g++.target/aarch64/sve/acle/aarch64-sve-acle-asm.exp ...
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_u8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_u8_m
>
>   === gcc tests ===
>
> Running gcc:gcc.target/aarch64/sve/acle/aarch64-sve-acle-asm.exp ...
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_u8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_u8_m
>
> Andrew Pinski's analysis in PR testsuite/111071 is that the new code is
> better and the testcase should be updated. I also asked Prathamesh Kulkarni
> in private and he agreed.
>
> Here is the update. With this change, all tests in
> gcc.target/aarch64/sve/acle/aarch64-sve-acle-asm.exp pass on aarch64-linux.
>
> gcc/testsuite/
>   PR testsuite/111071
>   * gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c: Adjust to 
> new code.
>   * gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c: Likewise.

Thanks, pushed to trunk.  And sorry for the delay.  I somehow
missed this earlier. :(

Richard

> Suggested-by: Andrew Pinski 
> ---
>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c | 3 +--
>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c | 3 +--
>  2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> index b9615de6655f..3e521bc9ae32 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_s8_m_untied, svint8_t,
>  
>  /*
>  ** subr_m1_s8_m:
> -**   mov (z[0-9]+\.b), #-1
> -**   subrz0\.b, p0/m, z0\.b, \1
> +**   not z0\.b, p0/m, z0\.b
>  **   ret
>  */
>  TEST_UNIFORM_Z (subr_m1_s8_m, svint8_t,
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c
> index 65606b6dda03..4922bdbacc47 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c
> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_u8_m_untied, svuint8_t,
>  
>  /*
>  ** subr_m1_u8_m:
> -**   mov (z[0-9]+\.b), #-1
> -**   subrz0\.b, p0/m, z0\.b, \1
> +**   not z0\.b, p0/m, z0\.b
>  **   ret
>  */
>  TEST_UNIFORM_Z (subr_m1_u8_m, svuint8_t,


Re: [PATCH v3] mklog: handle Signed-off-by, minor cleanup

2023-09-04 Thread Richard Sandiford via Gcc-patches
Marc Poulhiès via Gcc-patches  writes:
> Richard Sandiford via Gcc-patches  writes:
>>> +# this regex matches the first line of the "end" in the initial commit 
>>> message
>>> +FIRST_LINE_OF_END_RE = re.compile('(?i)^(signed-off-by|co-authored-by|#): 
>>> ')
>>
>> The current code only requires an initial "#", rather than an initial "#: ".
>> Is that a deliberate change?
>>
>> The patch LGTM apart from that.
>
> Hello Richard,
>
> Thanks for the review and sorry for the delayed answer as I was away the
> past weeks. This issue was caught early this month
> (https://github.com/Rust-GCC/gccrs/pull/2504), but I didn't want to send
> something here before leaving. Here's a fixed patch.
>
> Ok for master?
>
> Thanks,
> Marc
>
> ---
>  contrib/mklog.py   | 34 +-
>  contrib/prepare-commit-msg | 20 ++--
>  2 files changed, 39 insertions(+), 15 deletions(-)
>
> diff --git a/contrib/mklog.py b/contrib/mklog.py
> index 26230b9b4f2..496780883fb 100755
> --- a/contrib/mklog.py
> +++ b/contrib/mklog.py
> @@ -41,7 +41,34 @@ from unidiff import PatchSet
>  
>  LINE_LIMIT = 100
>  TAB_WIDTH = 8
> -CO_AUTHORED_BY_PREFIX = 'co-authored-by: '
> +
> +# Initial commit:
> +#   +--+
> +#   | gccrs: Some title|
> +#   |  | This is the "start"
> +#   | This is some text explaining the commit. |
> +#   | There can be several lines.  |
> +#   |  |<--->
> +#   | Signed-off-by: My Name  | This is the "end"
> +#   +--+
> +#
> +# Results in:
> +#   +--+
> +#   | gccrs: Some title|
> +#   |  |
> +#   | This is some text explaining the commit. | This is the "start"
> +#   | There can be several lines.  |
> +#   |  |<--->
> +#   | gcc/rust/ChangeLog:  |
> +#   |  | This is the 
> generated
> +#   | * some_file (bla):   | ChangeLog part
> +#   | (foo):   |
> +#   |  |<--->
> +#   | Signed-off-by: My Name  | This is the "end"
> +#   +--+
> +
> +# this regex matches the first line of the "end" in the initial commit 
> message
> +FIRST_LINE_OF_END_RE = re.compile('(?i)^(signed-off-by:|co-authored-by:|#) ')

Personally I think it would be safer to drop the final space in the regexp.

OK with that change if you agree.

Thanks,
Richard

>  
>  pr_regex = re.compile(r'(\/(\/|\*)|[Cc*!])\s+(?PPR [a-z+-]+\/[0-9]+)')
>  prnum_regex = re.compile(r'PR (?P[a-z+-]+)/(?P[0-9]+)')
> @@ -330,10 +357,7 @@ def update_copyright(data):
>  
>  
>  def skip_line_in_changelog(line):
> -if line.lower().startswith(CO_AUTHORED_BY_PREFIX) or 
> line.startswith('#'):
> -return False
> -return True
> -
> +return FIRST_LINE_OF_END_RE.match(line) == None
>  
>  if __name__ == '__main__':
>  extra_args = os.getenv('GCC_MKLOG_ARGS')
> diff --git a/contrib/prepare-commit-msg b/contrib/prepare-commit-msg
> index 48c9dad3c6f..1e94706ba40 100755
> --- a/contrib/prepare-commit-msg
> +++ b/contrib/prepare-commit-msg
> @@ -32,11 +32,11 @@ if ! [ -f "$COMMIT_MSG_FILE" ]; then exit 0; fi
>  # Don't do anything unless requested to.
>  if [ -z "$GCC_FORCE_MKLOG" ]; then exit 0; fi
>  
> -if [ -z "$COMMIT_SOURCE" ] || [ $COMMIT_SOURCE = template ]; then
> +if [ -z "$COMMIT_SOURCE" ] || [ "$COMMIT_SOURCE" = template ]; then
>  # No source or "template" means new commit.
>  cmd="diff --cached"
>  
> -elif [ $COMMIT_SOURCE = message ]; then
> +elif [ "$COMMIT_SOURCE" = message ]; then
>  # "message" means -m; assume a new commit if there are any changes 
> staged.
>  if ! git diff --cached --quiet; then
>   cmd="diff --cached"
> @@ -44,23 +44,23 @@ elif [ $COMMIT_SOURCE = message ]; then
>   cmd="diff --cached HEAD^"
>  fi
>  
> -elif [ $COMMIT_SOURCE = commit ]; t

Re: [PATCH] testsuite: Remove unwanted 'dg-do run' from gcc.dg/vect tests

2023-09-04 Thread Richard Sandiford via Gcc-patches
Christophe Lyon via Gcc-patches  writes:
> Tests under gcc.dg/vect use check_vect_support_and_set_flags to set
> compilation flags as appropriate for the target, but they also set
> dg-do-what-default to 'run' or 'compile', depending on the actual
> target hardware (or simulator) capabilities.
>
> For instance on arm, we use options to enable Neon, but set
> dg-do-what-default to 'run' only if we can actually execute Neon
> instructions.
>
> Therefore, we would always try to link and execute tests containing
> 'dg-do run', although dg-do-what-default says otherwise, leading to
> uninteresting failures.
>
> Therefore, this patch removes all such unconditional 'dg-do run',
> thus avoiding link errors, for instance if GCC has been configured with
> multilibs disabled and some --with-{float|cpu|hard} option
> incompatible with what check_vect_support_and_set_flags selects.
>
> For example, GCC configured with:
> --disable-multilib --with-mode=thumb --with-cpu=cortex-m7 --with-float=hard
> and check_vect_support_and_set_flags uses
> -mfpu=neon -mfloat-abi=softfp -march=armv7-a
> (thus incompatible float-abi options)
>
> Tested on native aarch64-linux-gnu (no change) and several arm-eabi
> cases where the FAIL/UNRESOLVED disappear (and we keep only the
> 'compilation' tests).
>
> 2023-09-04  Christophe Lyon  
>
>   gcc/testsuite/
>   * gcc.dg/vect/bb-slp-44.c: Remove 'dg-do run'.
>   * gcc.dg/vect/bb-slp-71.c: Likewise.
>   * gcc.dg/vect/bb-slp-72.c: Likewise.
>   * gcc.dg/vect/bb-slp-73.c: Likewise.
>   * gcc.dg/vect/bb-slp-74.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101207.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101615-1.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101615-2.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101668.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr54400.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr98516-1.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr98516-2.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr98544.c: Likewise.
>   * gcc.dg/vect/pr101445.c: Likewise.
>   * gcc.dg/vect/pr105219.c: Likewise.
>   * gcc.dg/vect/pr107160.c: Likewise.
>   * gcc.dg/vect/pr107212-1.c: Likewise.
>   * gcc.dg/vect/pr107212-2.c: Likewise.
>   * gcc.dg/vect/pr109502.c: Likewise.
>   * gcc.dg/vect/pr110381.c: Likewise.
>   * gcc.dg/vect/pr110838.c: Likewise.
>   * gcc.dg/vect/pr88497-1.c: Likewise.
>   * gcc.dg/vect/pr88497-7.c: Likewise.
>   * gcc.dg/vect/pr96783-1.c: Likewise.
>   * gcc.dg/vect/pr96783-2.c: Likewise.
>   * gcc.dg/vect/pr97558-2.c: Likewise.
>   * gcc.dg/vect/pr99253.c: Likewise.
>   * gcc.dg/vect/slp-mask-store-1.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-10.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-11.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-2.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-3.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-4.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-5.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-6.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-8.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-9.c: Likewise.
>   * gcc.dg/vect/vect-cond-13.c: Likewise.
>   * gcc.dg/vect/vect-recurr-1.c: Likewise.
>   * gcc.dg/vect/vect-recurr-2.c: Likewise.
>   * gcc.dg/vect/vect-recurr-3.c: Likewise.
>   * gcc.dg/vect/vect-recurr-4.c: Likewise.
>   * gcc.dg/vect/vect-recurr-5.c: Likewise.
>   * gcc.dg/vect/vect-recurr-6.c: Likewise.

OK, thanks.

Richard

> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-44.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-71.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-72.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-73.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-74.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101207.c | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101615-1.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101615-2.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101668.c | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c  | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr98516-1.c| 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr98516-2.c| 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr98544.c  | 2 --
>  gcc/testsuite/gcc.dg/vect/pr101445.c| 2 --
>  gcc/testsuite/gcc.dg/vect/pr105219.c| 1 -
>  gcc/testsuite/gcc.dg/vect/pr107160.c| 2 --
>  gcc/testsuite/gcc.dg/vect/pr107212-1.c  | 2 --
>  gcc/testsuite/gcc.dg/vect/pr107212-2.c  | 2 --
>  gcc/testsuite/gcc.dg/vect/pr109502.c| 1 -
>  gcc/testsuite/gcc.dg/vect/pr110381.c| 1 -
>  gcc/testsuite/gcc.dg/vect/pr110838.c| 2 --
>  gcc/testsuite/gcc.dg/vect/pr88497-1.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/pr88497-7.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/pr96783-1.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/pr96783-2.c   | 2 --
>  

Re: [PATCH] Bug 111071: fix the subr with -1 to not due to the simplify.

2023-09-04 Thread Richard Sandiford via Gcc-patches
"yanzhang.wang--- via Gcc-patches"  writes:
> From: Yanzhang Wang 
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/acle/asm/subr_s8.c: Modify subr with -1
> to not.
>
> Signed-off-by: Yanzhang Wang 
> ---
>
> Tested in my local arm environment and it passed.  Thanks to Andrew Pinski's
> comment; the code is now the same as what he suggested.
>
>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> index b9615de6655..1cf6916a5e0 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_s8_m_untied, svint8_t,
>  
>  /*
>  ** subr_m1_s8_m:
> -**   mov (z[0-9]+\.b), #-1
> -**   subrz0\.b, p0/m, z0\.b, \1
> +**   not z0.b, p0/m, z0.b
>  **   ret
>  */
>  TEST_UNIFORM_Z (subr_m1_s8_m, svint8_t,

I think we need this for subr_u8.c too.  OK with that change,
and thanks for the fix!

Richard


Re: [PATCH]AArch64 xorsign: Fix scalar xorsign lowering

2023-09-01 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Friday, September 1, 2023 2:36 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; Kyrylo Tkachov 
>> Subject: Re: [PATCH]AArch64 xorsign: Fix scalar xorsign lowering
>> 
>> Tamar Christina  writes:
>> > Hi All,
>> >
>> > In GCC-9 our scalar xorsign pattern broke and we didn't notice it
>> > because the testcase was not strong enough.  With this commit
>> >
>> > 8d2d39587d941a40f25ea0144cceb677df115040 is the first bad commit
>> > commit 8d2d39587d941a40f25ea0144cceb677df115040
>> > Author: Segher Boessenkool 
>> > Date:   Mon Oct 22 22:23:39 2018 +0200
>> >
>> > combine: Do not combine moves from hard registers
>> >
>> > combine started introducing useless moves on hard registers.  When one
>> > of the arguments to our scalar xorsign is a hardreg we get an additional 
>> > move
>> inserted.
>> >
>> > This leads to combine forming an AND with the immediate inside and
>> > using the superfluous move to do the r->w move, instead of what we
>> > wanted before which was for the `and` to be a vector and and have reload
>> pick the right alternative.
>> 
>> IMO, the xorsign optab ought to go away.  IIRC it was just a stop-gap measure
>> that (like most stop-gap measures) never got cleaned up later.
>> 
>> But that's not important now. :)
>> 
>> > To fix this the patch just forces the use of the vector version
>> > directly and so combine has no chance to mess it up.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> >* config/aarch64/aarch64-simd.md (xorsign3): Renamed to..
>> >(@xorsign3): ...This.
>> >* config/aarch64/aarch64.md (xorsign3): Renamed to...
>> >(@xorsign3): ..This and emit vectors directly
>> >* config/aarch64/iterators.md (VCONQ): Add SF and DF.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >* gcc.target/aarch64/xorsign.c:
>> >
>> > --- inline copy of patch --
>> > diff --git a/gcc/config/aarch64/aarch64-simd.md
>> > b/gcc/config/aarch64/aarch64-simd.md
>> > index
>> >
>> f67eb70577d0c2d9911d8c867d38a4d0b390337c..e955691f1be8830efacc2
>> 3746511
>> > 9764ce2a4942 100644
>> > --- a/gcc/config/aarch64/aarch64-simd.md
>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > @@ -500,7 +500,7 @@ (define_expand "ctz2"
>> >}
>> >  )
>> >
>> > -(define_expand "xorsign3"
>> > +(define_expand "@xorsign3"
>> >[(match_operand:VHSDF 0 "register_operand")
>> > (match_operand:VHSDF 1 "register_operand")
>> > (match_operand:VHSDF 2 "register_operand")] diff --git
>> > a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index
>> >
>> 01cf989641fce8e6c3828f6cfef62e101c4142df..9db82347bf891f9bc40aede
>> cdc84
>> > 62c94bf1a769 100644
>> > --- a/gcc/config/aarch64/aarch64.md
>> > +++ b/gcc/config/aarch64/aarch64.md
>> > @@ -6953,31 +6953,20 @@ (define_insn "copysign3_insn"
>> >  ;; EOR   v0.8B, v0.8B, v3.8B
>> >  ;;
>> >
>> > -(define_expand "xorsign3"
>> > +(define_expand "@xorsign3"
>> >[(match_operand:GPF 0 "register_operand")
>> > (match_operand:GPF 1 "register_operand")
>> > (match_operand:GPF 2 "register_operand")]
>> >"TARGET_SIMD"
>> >  {
>> > -
>> > -  machine_mode imode = mode;
>> > -  rtx mask = gen_reg_rtx (imode);
>> > -  rtx op1x = gen_reg_rtx (imode);
>> > -  rtx op2x = gen_reg_rtx (imode);
>> > -
>> > -  int bits = GET_MODE_BITSIZE (mode) - 1;
>> > -  emit_move_insn (mask, GEN_INT (trunc_int_for_mode
>> (HOST_WIDE_INT_M1U << bits,
>> > -   imode)));
>> > -
>> > -  emit_insn (gen_and3 (op2x, mask,
>> > -  lowpart_subreg (imode, operands[2],
>> > -  mode)));
>> > -  emit_insn (gen_xor3 (op1x,
>> > -  lowpart_subreg (imode, operands[1],
>> > -  mode),
>> > -  op2x));
>> > +  rtx tmp = gen_reg_rtx (mode);  rtx op1 = gen_reg_rtx
>> > + (mode);  rtx op2 = gen_reg_rtx (mode);
>> emit_move_insn
>> > + (op1, lowpart_subreg (mode, operands[1], mode));
>> > + emit_move_insn (op2, lowpart_subreg (mode, operands[2],
>> > + mode));  emit_insn (gen_xorsign3(mode, tmp, op1,
>> op2));
>> 
>> Do we need the extra moves into op1 and op2?  I would have expected the
>> subregs to be acceptable as direct operands of the xorsign3.  Making them
>> direct operands should be better, since there's then less risk of having the
>> same value live in different registers at the same time.
>> 
>
> That was the first thing I tried but it doesn't work because validate_subreg 
> seems
> to have the invariant that you can either change mode between the same size
> or make it paradoxical but not both at the same time.
>
> i.e. it rejects subreg:V2DI (subreg:DI (reg:DF))), and lowpart_subreg folds 
> it to
> 

Re: [PATCH]AArch64 xorsign: Fix scalar xorsign lowering

2023-09-01 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> In GCC-9 our scalar xorsign pattern broke and we didn't notice it because the
> testcase was not strong enough.  With this commit
>
> 8d2d39587d941a40f25ea0144cceb677df115040 is the first bad commit
> commit 8d2d39587d941a40f25ea0144cceb677df115040
> Author: Segher Boessenkool 
> Date:   Mon Oct 22 22:23:39 2018 +0200
>
> combine: Do not combine moves from hard registers
>
> combine started introducing useless moves on hard registers.  When one of the
> arguments to our scalar xorsign is a hardreg we get an additional move 
> inserted.
>
> This leads to combine forming an AND with the immediate inside and using the
> superfluous move to do the r->w move, instead of what we wanted before which 
> was
> for the `and` to be a vector and and have reload pick the right alternative.

IMO, the xorsign optab ought to go away.  IIRC it was just a stop-gap
measure that (like most stop-gap measures) never got cleaned up later.

But that's not important now. :)

> To fix this the patch just forces the use of the vector version directly and
> so combine has no chance to mess it up.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md (xorsign3): Renamed to..
>   (@xorsign3): ...This.
>   * config/aarch64/aarch64.md (xorsign3): Renamed to...
>   (@xorsign3): ..This and emit vectors directly
>   * config/aarch64/iterators.md (VCONQ): Add SF and DF.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/xorsign.c:
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> f67eb70577d0c2d9911d8c867d38a4d0b390337c..e955691f1be8830efacc237465119764ce2a4942
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -500,7 +500,7 @@ (define_expand "ctz2"
>}
>  )
>  
> -(define_expand "xorsign3"
> +(define_expand "@xorsign3"
>[(match_operand:VHSDF 0 "register_operand")
> (match_operand:VHSDF 1 "register_operand")
> (match_operand:VHSDF 2 "register_operand")]
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> 01cf989641fce8e6c3828f6cfef62e101c4142df..9db82347bf891f9bc40aedecdc8462c94bf1a769
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -6953,31 +6953,20 @@ (define_insn "copysign3_insn"
>  ;; EOR   v0.8B, v0.8B, v3.8B
>  ;;
>  
> -(define_expand "xorsign3"
> +(define_expand "@xorsign3"
>[(match_operand:GPF 0 "register_operand")
> (match_operand:GPF 1 "register_operand")
> (match_operand:GPF 2 "register_operand")]
>"TARGET_SIMD"
>  {
> -
> -  machine_mode imode = mode;
> -  rtx mask = gen_reg_rtx (imode);
> -  rtx op1x = gen_reg_rtx (imode);
> -  rtx op2x = gen_reg_rtx (imode);
> -
> -  int bits = GET_MODE_BITSIZE (mode) - 1;
> -  emit_move_insn (mask, GEN_INT (trunc_int_for_mode (HOST_WIDE_INT_M1U << 
> bits,
> -  imode)));
> -
> -  emit_insn (gen_and3 (op2x, mask,
> - lowpart_subreg (imode, operands[2],
> - mode)));
> -  emit_insn (gen_xor3 (op1x,
> - lowpart_subreg (imode, operands[1],
> - mode),
> - op2x));
> +  rtx tmp = gen_reg_rtx (mode);
> +  rtx op1 = gen_reg_rtx (mode);
> +  rtx op2 = gen_reg_rtx (mode);
> +  emit_move_insn (op1, lowpart_subreg (mode, operands[1], 
> mode));
> +  emit_move_insn (op2, lowpart_subreg (mode, operands[2], 
> mode));
> +  emit_insn (gen_xorsign3(mode, tmp, op1, op2));

Do we need the extra moves into op1 and op2?  I would have expected the
subregs to be acceptable as direct operands of the xorsign3.  Making
them direct operands should be better, since there's then less risk of
having the same value live in different registers at the same time.

OK with that change if it works.

Also, nit: missing space before "(".

Thanks,
Richard

>emit_move_insn (operands[0],
> -   lowpart_subreg (mode, op1x, imode));
> +   lowpart_subreg (mode, tmp, mode));
>DONE;
>  }
>  )
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 
> 9398d713044433cd89b2a83db5ae7969feb1dcf7..2451d8c2cd8e2da6ac8339eed9bc975cf203fa4c
>  100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -1428,7 +1428,8 @@ (define_mode_attr VCONQ [(V8QI "V16QI") (V16QI "V16QI")
>(V4HF "V8HF") (V8HF "V8HF")
>(V2SF "V4SF") (V4SF "V4SF")
>(V2DF "V2DF") (SI   "V4SI")
> -  (HI   "V8HI") (QI   "V16QI")])
> +  (HI   "V8HI") (QI   "V16QI")
> +  (SF   "V4SF") (DF   "V2DF")])
>  
>  

Re: [PATCH 06/13] [APX EGPR] Map reg/mem constraints in inline asm to non-EGPR constraint.

2023-09-01 Thread Richard Sandiford via Gcc-patches
Uros Bizjak via Gcc-patches  writes:
> On Thu, Aug 31, 2023 at 11:18 AM Jakub Jelinek via Gcc-patches
>  wrote:
>>
>> On Thu, Aug 31, 2023 at 04:20:17PM +0800, Hongyu Wang via Gcc-patches wrote:
>> > From: Kong Lingling 
>> >
>> > In inline asm, we do not know if the insn can use EGPR, so disable EGPR
>> > usage by default from mapping the common reg/mem constraint to non-EGPR
>> > constraints. Use a flag mapx-inline-asm-use-gpr32 to enable EGPR usage
>> > for inline asm.
>> >
>> > gcc/ChangeLog:
>> >
>> >   * config/i386/i386.cc (INCLUDE_STRING): Add include for
>> >   ix86_md_asm_adjust.
>> >   (ix86_md_asm_adjust): When APX EGPR enabled without specifying the
>> >   target option, map reg/mem constraints to non-EGPR constraints.
>> >   * config/i386/i386.opt: Add option mapx-inline-asm-use-gpr32.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >   * gcc.target/i386/apx-inline-gpr-norex2.c: New test.
>> > ---
>> >  gcc/config/i386/i386.cc   |  44 +++
>> >  gcc/config/i386/i386.opt  |   5 +
>> >  .../gcc.target/i386/apx-inline-gpr-norex2.c   | 107 ++
>> >  3 files changed, 156 insertions(+)
>> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-inline-gpr-norex2.c
>> >
>> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
>> > index d26d9ab0d9d..9460ebbfda4 100644
>> > --- a/gcc/config/i386/i386.cc
>> > +++ b/gcc/config/i386/i386.cc
>> > @@ -17,6 +17,7 @@ You should have received a copy of the GNU General 
>> > Public License
>> >  along with GCC; see the file COPYING3.  If not see
>> >  .  */
>> >
>> > +#define INCLUDE_STRING
>> >  #define IN_TARGET_CODE 1
>> >
>> >  #include "config.h"
>> > @@ -23077,6 +23078,49 @@ ix86_md_asm_adjust (vec , vec & 
>> > /*inputs*/,
>> >bool saw_asm_flag = false;
>> >
>> >start_sequence ();
>> > +  /* TODO: Here we just mapped the general r/m constraints to non-EGPR
>> > +   constraints, will eventually map all the usable constraints in the 
>> > future. */
>>
>> I think there should be some constraint which explicitly has all the 32
>> GPRs, like there is one for just all 16 GPRs (h), so that regardless of
>> -mapx-inline-asm-use-gpr32 one can be explicit what the inline asm wants.
>>
>> Also, what about the "g" constraint?  Shouldn't there be another for "g"
>> without r16..r31?  What about the various other memory
>> constraints ("<", "o", ...)?
>
> I think we should leave all existing constraints as they are, so "r"
> covers only GPR16, "m" and "o" to only use GPR16. We can then
> introduce "h" to instructions that have the ability to handle EGPR.

Yeah.  I'm jumping in without having read the full thread, sorry,
but the current mechanism for handling this is TARGET_MEM_CONSTRAINT
(added for s390).  That is, TARGET_MEM_CONSTRAINT can be defined to some
new constraint that is more general than the traditional "m" constraint.
This constraint is then the one that is associated with memory_operand
etc.  "m" can then be defined explicitly to the old definition,
so that existing asms continue to work.

So if the port wants generic internal memory addresses to use the
EGPR set (sounds reasonable), then TARGET_MEM_CONSTRAINT would be
a new constraint that maps to those addresses.
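
Roughly, and with an invented constraint letter, the wiring would look
something like this in the port (only a sketch of the mechanism, not a
worked-out i386 change):

  /* In the port's header: make the most general memory constraint a new
     letter instead of "m".  */
  #define TARGET_MEM_CONSTRAINT 'B'

The port's constraints.md would then give 'B' a define_memory_constraint
that accepts EGPR-based addresses, while "m" is defined explicitly to the
old GPR16-only form so that existing asms keep their current meaning.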

Thanks,
Richard


Re: [PATCH] expmed: Allow extract_bit_field via mem for low-precision modes.

2023-09-01 Thread Richard Sandiford via Gcc-patches
Robin Dapp via Gcc-patches  writes:
>> It's not just a question of which byte though.  It's also a question
>> of which bit.
>> 
>> One option would be to code-generate for even X and for odd X, and select
>> between them at runtime.  But that doesn't scale well to 2+2X and 1+1X.
>> 
>> Otherwise I think we need to treat the bit position as a variable,
>> with bitpos % 8 and bitpos / 8 being calculated at runtime.
>
> Thanks.  I worked around it with a backend vec_extractQI expander
> so we don't run into that situation directly anymore.  The problem is of
> course still latent and I'm going to look at it again after some other things
> on my plate.

Yeah, sounds like a good workaround.  If the target has an efficient way
of coping with the VLAness then the optab will probably be better than
whatever the generic code ends up being.

Thanks,
Richard


[PATCH] lra: Avoid unfolded plus-0

2023-08-31 Thread Richard Sandiford via Gcc-patches
While backporting another patch to an earlier release, I hit a
situation in which lra_eliminate_regs_1 would eliminate an address to:

(plus (reg:P R) (const_int 0))

This address compared not-equal to plain:

(reg:P R)

which caused an ICE in a later peephole2.  (The ICE showed up in
gfortran.fortran-torture/compile/pr80464.f90 on the branch but seems
to be latent on trunk.)

These unfolded PLUSes shouldn't occur in the insn stream, and later code
in the same function tried to avoid them.

Tested on aarch64-linux-gnu so far, but I'll test on x86_64-linux-gnu too.
Does this look OK?

There are probably other instances of the same thing elsewhere,
but it seemed safer to stick to the one that caused the issue.

Thanks,
Richard


gcc/
* lra-eliminations.cc (lra_eliminate_regs_1): Use simplify_gen_binary
rather than gen_rtx_PLUS.
---
 gcc/lra-eliminations.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/lra-eliminations.cc b/gcc/lra-eliminations.cc
index df613cdda76..4daaff1a124 100644
--- a/gcc/lra-eliminations.cc
+++ b/gcc/lra-eliminations.cc
@@ -406,7 +406,7 @@ lra_eliminate_regs_1 (rtx_insn *insn, rtx x, machine_mode 
mem_mode,
elimination_fp2sp_occured_p = true;
 
  if (! update_p && ! full_p)
-   return gen_rtx_PLUS (Pmode, to, XEXP (x, 1));
+   return simplify_gen_binary (PLUS, Pmode, to, XEXP (x, 1));
 
  if (maybe_ne (update_sp_offset, 0))
offset = ep->to_rtx == stack_pointer_rtx ? update_sp_offset : 0;
-- 
2.25.1



[PATCH] aarch64: Fix return register handling in untyped_call

2023-08-31 Thread Richard Sandiford via Gcc-patches
While working on another patch, I hit a problem with the aarch64
expansion of untyped_call.  The expander emits the usual:

  (set (mem ...) (reg resN))

instructions to store the result registers to memory, but it didn't
say in RTL where those resN results came from.  This eventually led
to a failure of gcc.dg/torture/stackalign/builtin-return-2.c,
via regrename.

This patch turns the untyped call from a plain call to a call_value,
to represent that the call returns (or might return) a useful value.
The patch also uses a PARALLEL return rtx to represent all the possible
return registers.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
* config/aarch64/aarch64.md (untyped_call): Emit a call_value
rather than a call.  List each possible destination register
in the call pattern.
---
 gcc/config/aarch64/aarch64.md | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 01cf989641f..6f7827bd8c9 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1170,9 +1170,27 @@ (define_expand "untyped_call"
 {
   int i;
 
+  /* Generate a PARALLEL that contains all of the register results.
+ The offsets are somewhat arbitrary, since we don't know the
+ actual return type.  The main thing we need to avoid is having
+ overlapping byte ranges, since those might give the impression
+ that two registers are known to have data in common.  */
+  rtvec rets = rtvec_alloc (XVECLEN (operands[2], 0));
+  poly_int64 offset = 0;
+  for (i = 0; i < XVECLEN (operands[2], 0); i++)
+{
+  rtx reg = SET_SRC (XVECEXP (operands[2], 0, i));
+  gcc_assert (REG_P (reg));
+  rtx offset_rtx = gen_int_mode (offset, Pmode);
+  rtx piece = gen_rtx_EXPR_LIST (VOIDmode, reg, offset_rtx);
+  RTVEC_ELT (rets, i) = piece;
+  offset += GET_MODE_SIZE (GET_MODE (reg));
+}
+  rtx ret = gen_rtx_PARALLEL (VOIDmode, rets);
+
   /* Untyped calls always use the default ABI.  It's only possible to use
  ABI variants if we know the type of the target function.  */
-  emit_call_insn (gen_call (operands[0], const0_rtx, const0_rtx));
+  emit_call_insn (gen_call_value (ret, operands[0], const0_rtx, const0_rtx));
 
   for (i = 0; i < XVECLEN (operands[2], 0); i++)
 {
-- 
2.25.1



Re: [PATCH] expmed: Allow extract_bit_field via mem for low-precision modes.

2023-08-30 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
>> But in the VLA case, doesn't it instead have precision 4+4X?
>> The problem then is that we can't tell at compile time which
>> byte that corresponds to.  So...
>
> Yes 4 + 4x.  I keep getting confused with poly modes :)
> In this case we want to extract the bitnum [3 4] = 3 + 4x which
> would be in byte 0 for x = 0 or x = 1 and in byte 1 for x = 2, 3 and
> so on.
>
> Can't we still make that work somehow?  As far as I can tell we're looking
> for the byte range to be accessed.  It's not like we have a precision or
> bitnum of e.g. [3 17] where the access could be anywhere but still a pow2
> fraction of BITS_PER_UNIT.
>
> I'm just having trouble writing that down.
>
> What about something like
>
> int factor = BITS_PER_UNIT / prec.coeffs[0];
> bytenum = force_align_down_and_div (bitnum, prec.coeffs[0]);
> bytenum *= factor;
>
> (or a similar thing done manually without helpers) guarded by the
> proper condition?
> Or do we need something more generic for the factor (i.e. prec.coeffs[0] is
> not enough when we have a precision like [8 16])?  Does that even exist?

It's not just a question of which byte though.  It's also a question
of which bit.

One option would be to code-generate for even X and for odd X, and select
between them at runtime.  But that doesn't scale well to 2+2X and 1+1X.

Otherwise I think we need to treat the bit position as a variable,
with bitpos % 8 and bitpos / 8 being calculated at runtime.
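
Purely for illustration (plain C, not the expmed code itself), the run-time
split would then be the obvious one:

  #include <stdint.h>

  /* Sketch: split a run-time bit index into the byte to access and the
     bit within that byte, i.e. the run-time analogue of
     bits_to_bytes_round_down and num_trailing_bits.  */
  static inline void
  split_bitpos (uint64_t bitpos, uint64_t *bytenum, unsigned *bitnum)
  {
    *bytenum = bitpos / 8;
    *bitnum = bitpos % 8;
  }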

Thanks,
Richard




RE: [PATCH] expmed: Allow extract_bit_field via mem for low-precision modes.

2023-08-30 Thread Richard Sandiford via Gcc-patches
[Sorry for any weird MUA issues, don't have access to my usual set-up.]

> when looking at a riscv ICE in vect-live-6.c I noticed that we
> assume that the variable part (coeffs[1] * x1) of the to-be-extracted
> bit number in extract_bit_field_1 is a multiple of BITS_PER_UNIT.
>
> This means that bits_to_bytes_round_down and num_trailing_bits
> cannot handle e.g. extracting from a "VNx4BI"-mode vector which has
> 4-bit precision on riscv.

But in the VLA case, doesn't it instead have precision 4+4X?
The problem then is that we can't tell at compile time which
byte that corresponds to.  So...

> This patch adds a special case for that situation and sets bytenum to
> zero as well as bitnum to its proper value.  It works for the riscv
> case because in all other situations we can align to a byte boundary.
> If x1 were 3 for some reason, however, the above assertion would still
> fail.  I don't think this can happen for riscv as we only ever double
> the number of chunks for larger vector sizes but not sure about the
> general case.
>
> If there's another, correct way to work around feel free to suggest.
>
> Bootstrap/testsuite on aarch64 and x86 is running but I would be
> surprised if there were any changes as riscv is the only target that
> uses modes with precision < 8.
>
> Regards
>  Robin
>
> gcc/ChangeLog:
>
>   * expmed.cc (extract_bit_field_1): Handle bitnum with variable
>   part less than BITS_PER_UNIT.
> ---
>  gcc/expmed.cc | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index e22e43c8505..1b0119f9cfc 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -1858,8 +1858,22 @@ extract_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, 
> poly_uint64 bitnum,
>   but is useful for things like vector booleans.  */
>if (MEM_P (op0) && !bitnum.is_constant ())
>  {
> -  bytenum = bits_to_bytes_round_down (bitnum);
> -  bitnum = num_trailing_bits (bitnum);
> +  /* bits_to_bytes_round_down tries to align to a byte (BITS_PER_UNIT)
> +  boundary and asserts that bitnum.coeffs[1] % BITS_PER_UNIT == 0.
> +  For modes with precision < BITS_PER_UNIT this fails but we can
> +  still extract from the first byte.  */
> +  poly_uint16 prec = GET_MODE_PRECISION (outermode);
> +  if (prec.coeffs[1] < BITS_PER_UNIT && bitnum.coeffs[1] < BITS_PER_UNIT)
> + {
> +   bytenum = 0;
> +   bitnum = bitnum.coeffs[0] & (BITS_PER_UNIT - 1);

...this doesn't look right.  We can't drop bitnum.coeffs[1] when it's
nonzero, because it says that for some runtime vector sizes, the bit
position might be higher than bitnum.coeffs[0].

Also, it's not possible to access coeffs[1] unconditionally in
target-independent code.
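
For reference, the usual idiom in target-independent code is to ask whether
the poly_int is a compile-time constant rather than indexing coeffs[]
directly; something along these lines (only a sketch, with the surrounding
context assumed):

  unsigned HOST_WIDE_INT const_bitnum;
  if (bitnum.is_constant (&const_bitnum))
    {
      /* Compile-time-constant position: use const_bitnum directly.  */
    }
  else
    {
      /* Variable position: handle it without looking at coeffs[1].  */
    }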

Thanks,
Richard

> + }
> +  else
> + {
> +   bytenum = bits_to_bytes_round_down (bitnum);
> +   bitnum = num_trailing_bits (bitnum);
> + }
> +
>poly_uint64 bytesize = bits_to_bytes_round_up (bitnum + bitsize);
>op0 = adjust_bitfield_address_size (op0, BLKmode, bytenum, bytesize);
>op0_mode = opt_scalar_int_mode ();



[PATCH] attribs: Use existing traits for excl_hash_traits

2023-08-29 Thread Richard Sandiford via Gcc-patches
excl_hash_traits can be defined more simply by reusing existing traits.

Tested on aarch64-linux-gnu.  OK to install?

Richard


gcc/
* attribs.cc (excl_hash_traits): Delete.
(test_attribute_exclusions): Use pair_hash and nofree_string_hash
instead.
---
 gcc/attribs.cc | 45 +++--
 1 file changed, 3 insertions(+), 42 deletions(-)

diff --git a/gcc/attribs.cc b/gcc/attribs.cc
index b8cb55b97df..0d4ab23aeb6 100644
--- a/gcc/attribs.cc
+++ b/gcc/attribs.cc
@@ -2640,47 +2640,6 @@ namespace selftest
 
 typedef std::pair excl_pair;
 
-struct excl_hash_traits: typed_noop_remove
-{
-  typedef excl_pair  value_type;
-  typedef value_type compare_type;
-
-  static hashval_t hash (const value_type )
-  {
-hashval_t h1 = htab_hash_string (x.first);
-hashval_t h2 = htab_hash_string (x.second);
-return h1 ^ h2;
-  }
-
-  static bool equal (const value_type , const value_type )
-  {
-return !strcmp (x.first, y.first) && !strcmp (x.second, y.second);
-  }
-
-  static void mark_deleted (value_type )
-  {
-x = value_type (NULL, NULL);
-  }
-
-  static const bool empty_zero_p = false;
-
-  static void mark_empty (value_type )
-  {
-x = value_type ("", "");
-  }
-
-  static bool is_deleted (const value_type )
-  {
-return !x.first && !x.second;
-  }
-
-  static bool is_empty (const value_type )
-  {
-return !*x.first && !*x.second;
-  }
-};
-
-
 /* Self-test to verify that each attribute exclusion is symmetric,
meaning that if attribute A is encoded as incompatible with
attribute B then the opposite relationship is also encoded.
@@ -2690,13 +2649,15 @@ struct excl_hash_traits: typed_noop_remove
 static void
 test_attribute_exclusions ()
 {
+  using excl_hash_traits = pair_hash;
+
   /* Iterate over the array of attribute tables first (with TI0 as
  the index) and over the array of attribute_spec in each table
  (with SI0 as the index).  */
   const size_t ntables = ARRAY_SIZE (attribute_tables);
 
   /* Set of pairs of mutually exclusive attributes.  */
-  typedef hash_set exclusion_set;
+  typedef hash_set exclusion_set;
   exclusion_set excl_set;
 
   for (size_t ti0 = 0; ti0 != ntables; ++ti0)
-- 
2.25.1



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-08-29 Thread Richard Sandiford via Gcc-patches
Jeff Law  writes:
> On 8/24/23 08:06, Robin Dapp via Gcc-patches wrote:
>> Ping.  I refined the code and some comments a bit and added a test
>> case.
>> 
>> My question in general would still be:  Is this something we want
>> given that we potentially move some of combine's work a bit towards
>> the front of the RTL pipeline?
>> 
>> Regards
>>   Robin
>> 
>> Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure.
>> 
>> This patch enables the forwarding of UNARY_P sources.  As this
>> involves potentially replacing a vector register with a scalar register
>> the ira_hoist_pressure machinery is used to calculate the change in
>> register pressure.  If the propagation would increase the pressure
>> beyond the number of hard regs, we don't perform it.
>> 
>> gcc/ChangeLog:
>> 
>>  * fwprop.cc (fwprop_propagation::profitable_p): Add unary
>>  handling.
>>  (fwprop_propagation::update_register_pressure): New function.
>>  (fwprop_propagation::register_pressure_high_p): New function
>>  (reg_single_def_for_src_p): Look through unary expressions.
>>  (try_fwprop_subst_pattern): Check register pressure.
>>  (forward_propagate_into): Call new function.
>>  (fwprop_init): Init register pressure.
>>  (fwprop_done): Clean up register pressure.
>>  (fwprop_insn): Add comment.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>  * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
> So I was hoping that Richard S. would chime in here as he knows this 
> code better than anyone.

Heh, I'm not sure about that.  I rewrote the code to use rtl-ssa,
so in that sense I'm OK with the framework side.  But I tried to
preserve the decisions that the old pass made as closely as possible.
I don't know why most of those decisions were made (which is why I just
kept them).

So I don't think I have a good feel for the advantages and disadvantages
of doing this.  Robin's analysis of the aarch64 changes was nice and
detailed though.  I think the one that worries me most is the addressing
mode one.  fwprop is probably the first chance we get to propagate adds
into addresses, and virtual register elimination means that some of
those opportunities won't show up in gimple.

There again, virtual register elimination wouldn't be the reason for
the ld4_s8.c failure.  Perhaps there's something missing in expand.

Other than that, I think my main question is: why just unary operations?
Is the underlying assumption that we only want to propagate a maximum of
one register?  If so, then I think we should check for that directly, by
iterating over subrtxes.

That way we can handle things like binary operations involving a
register and a constant, and unspecs with a single non-constant operand.

I imagine the check would be something like:

  unsigned int nregs = 0;
  for (each subrtx x)
{
  if (MEM_P (x))
return false;
  if (SUBREG_P (x) && .../*current conditions */...)
return false;
  if (REG_P (x))
{
  nregs += 1;
  if (nregs > 1)
return false;
}
}
  return true;
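
(Purely as an untested sketch of the shape that walk could take with the
rtl-iter.h helpers; the function name is invented and the SUBREG test is
left as a placeholder for whatever conditions the pass already applies:)

  static bool
  single_reg_source_p (rtx src)
  {
    unsigned int nregs = 0;
    subrtx_iterator::array_type array;
    FOR_EACH_SUBRTX (iter, array, src, NONCONST)
      {
        const_rtx x = *iter;
        if (MEM_P (x))
          return false;
        if (SUBREG_P (x) /* && ...current conditions...  */)
          return false;
        if (REG_P (x) && ++nregs > 1)
          return false;
      }
    return true;
  }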

Perhaps we should allow the optimisation without register-pressure
information if (a) the source register and destination register are
in the same pressure class and (b) all uses of the destination are
being replaced.  (FWIW, rtl-ssa should make it easier to try to
replace all definitions at once, with an all-or-nothing choice,
if we ever wanted to do that.)

Thanks,
Richard

>
> This looks like a much better implementation of something I've done 
> before :-)  Basically imagine a target where a sign/zero extension can 
> be folded into arithmetic for free.  We put in various hacks to this 
> code to encourage more propagations of extensions.
>
> I still think this is valuable.  As we lower from gimple->RTL we're 
> going to still have artifacts in the RTL that we're going to want to 
> optimize away.  fwprop has certain advantages over combine, including 
> the fact that it runs earlier, pre-loop.
>
>
> It looks generally sensible to me.  But give Richard S. another week to 
> chime in.  He seems to be around, but may be slammed with stuff right now.
>
> jeff


Re: [RFC] > WIDE_INT_MAX_PREC support in wide-int

2023-08-29 Thread Richard Sandiford via Gcc-patches
Just some off-the-cuff thoughts.  Might think differently when
I've had more time...

Richard Biener  writes:
> On Mon, 28 Aug 2023, Jakub Jelinek wrote:
>
>> Hi!
>> 
>> While the _BitInt series isn't committed yet, I had a quick look at
>> lifting the current lowest limitation on maximum _BitInt precision,
>> that wide_int can only support precisions up to WIDE_INT_MAX_PRECISION - 1.
>> 
>> Note, other limits if that is lifted are INTEGER_CST currently using 3
>> unsigned char members and so being able to only hold up to 255 * 64 = 16320
>> bit numbers and then TYPE_PRECISION being 16-bit, so limiting us to 65535
>> bits.  The INTEGER_CST limit could be dealt with by dropping the
>> int_length.offset "cache" and making int_length.extended and
>> int_length.unextended members unsinged short rather than unsigned char.
>> 
>> The following so far just compile tested patch changes wide_int_storage
>> to be a union, for precisions up to WIDE_INT_MAX_PRECISION inclusive it
>> will work as before (just being no longer trivially copyable type and
>> having an inline destructor), while larger precision instead use a pointer
>> to heap allocated array.
>> For wide_int this is fairly easy (of course, I'd need to see what the
>> patch does to gcc code size and compile time performance, some
>> growth/slowdown is certain), but I'd like to brainstorm on
>> widest_int/widest2_int.
>> 
>> Currently it is a constant precision storage with WIDE_INT_MAX_PRECISION
>> precision (widest2_int twice that), so memory layout-wise on at least 64-bit
>> hosts identical to wide_int, just it doesn't have precision member and so
>> 32 bits smaller on 32-bit hosts.  It is used in lots of places.
>> 
>> I think the most common is what is done e.g. in tree_int_cst* comparisons
>> and similarly, using wi::to_widest () to just compare INTEGER_CSTs.
>> That case actually doesn't even use wide_int but widest_extended_tree
>> as storage, unless stored into widest_int in between (that happens in
>> various spots as well).  For comparisons, it would be fine if
>> widest_int_storage/widest_extended_tree storages had a dynamic precision,
>> WIDE_INT_MAX_PRECISION for most of the cases (if only
>> precision < WIDE_INT_MAX_PRECISION is involved), otherwise the needed
>> precision (e.g. for binary ops) which would be what we say have in
>> INTEGER_CST or some type, rounded up to whole multiples of HOST_WIDE_INTs
>> and if unsigned with multiple of HOST_WIDE_INT precision, have another
>> HWI to make it always sign-extended.
>> 
>> Another common case is how e.g. tree-ssa-ccp.cc uses them, that is mostly
>> for bitwise ops and so I think the above would be just fine for that case.
>> 
>> Another case is how tree-ssa-loop-niter.cc uses it, I think for such a usage
>> it really wants something widest, perhaps we could just try to punt for
>> _BitInt(N) for N >= WIDE_INT_MAX_PRECISION in there, so that we never care
>> about bits beyond that limit?
>
> I'll note tree-ssa-loop-niter.cc also uses GMP in some cases, widest_int
> is really trying to be poor-mans GMP by limiting the maximum precision.

I'd characterise widest_int as "a wide_int that is big enough to hold
all supported integer types, without losing sign information".  It's
not big enough to do arbitrary arithmetic without losing precision
(in the way that GMP is).

If the new limit on integer sizes is 65535 bits for all targets,
then I think that means that widest_int needs to become a 65536-bit type.
(But not with all bits represented all the time, of course.)

[ And at that point I think widest_int should ideally become a GMP wrapper.
  The wide_int stuff isn't optimised for such large sizes, even accepting
  that large sizes will be a worst case.  That might not be easy to do with
  the current infrastructure though.  Especially not if widest_ints are
  stored in GC-ed structures. ]

That seems like it would stand the biggest chance of preserving
existing semantics.  But we might want to define new typedefs for
narrower limits.  E.g. the current widest_int limit probably still
makes sense for operations on scalar_int_modes.  (But then most
RTL arithmetic should use wide_int rather than widest_int.)

Perhaps some widest_int uses are really restricted to address-like
things and could instead use offset_int.  Until now there hasn't been
much incentive to make the distinction.

And perhaps we could identify other similar cases where the limit is
known (statically) to be the current limit, rather than 65536.

I think one of the worst things we could do is push the requirement
up to users of the API to have one path for _BitInts and one for "normal"
integers.  That's bound to lead to a whack-a-mole effect.

Thanks,
Richard


Re: [PATCH V2] gimple_fold: Support COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold

2023-08-24 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> Hi, Richard and Richi.
>
> Currently, GCC supports COND_LEN_FMA for floating point even without
> -ffast-math.  It's supported in tree-ssa-math-opts.cc.  However, GCC fails
> to support COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS.
>
> Consider this following case:
> #define TEST_TYPE(TYPE)                                                \
>   __attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst,    \
>                                                TYPE *__restrict a,     \
>                                                TYPE *__restrict b,     \
>                                                int n)                  \
>   {                                                                    \
>     for (int i = 0; i < n; i++)                                        \
>       dst[i] -= a[i] * b[i];                                           \
>   }
>
> #define TEST_ALL()                                                     \
>   TEST_TYPE (float)
>
> TEST_ALL ()
>
> Gimple IR for RVV:
>
> ...
> _39 = -vect__8.14_26;
> vect__10.16_21 = .COND_LEN_FMA ({ -1, ... }, vect__6.11_30, _39, 
> vect__4.8_34, vect__4.8_34, _46, 0);
> ...
>
> This is because of the following piece of code in tree-ssa-math-opts.cc:
>
>   if (len)
>   fma_stmt
> = gimple_build_call_internal (IFN_COND_LEN_FMA, 7, cond, mulop1, op2,
>   addop, else_value, len, bias);
>   else if (cond)
>   fma_stmt = gimple_build_call_internal (IFN_COND_FMA, 5, cond, mulop1,
>  op2, addop, else_value);
>   else
>   fma_stmt = gimple_build_call_internal (IFN_FMA, 3, mulop1, op2, addop);
>   gimple_set_lhs (fma_stmt, gimple_get_lhs (use_stmt));
>   gimple_call_set_nothrow (fma_stmt, !stmt_can_throw_internal (cfun,
>  use_stmt));
>   gsi_replace (, fma_stmt, true);
>   /* Follow all SSA edges so that we generate FMS, FNMA and FNMS
>regardless of where the negation occurs.  */
>   gimple *orig_stmt = gsi_stmt (gsi);
>   if (fold_stmt (, follow_all_ssa_edges))
>   {
> if (maybe_clean_or_replace_eh_stmt (orig_stmt, gsi_stmt (gsi)))
>   gcc_unreachable ();
> update_stmt (gsi_stmt (gsi));
>   }
>
> 'fold_stmt' failed to fold NEGATE_EXPR + COND_LEN_FMA into COND_LEN_FNMA.
>
> This patch supports folding the statement into:
>
> vect__10.16_21 = .COND_LEN_FNMA ({ -1, ... }, vect__8.14_26, vect__6.11_30, 
> vect__4.8_34, { 0.0, ... }, _46, 0);
>
> Note that COND_LEN_FNMA has 7 arguments and COND_LEN_ADD has 6 arguments.
>
> Extend maximum num ops:
> -  static const unsigned int MAX_NUM_OPS = 5;
> +  static const unsigned int MAX_NUM_OPS = 7;
>
> Bootstrap and Regtest on X86 passed.
> Tested on aarch64 Qemu.
>
> Fully tested COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS on RISC-V backend.
>
>
> gcc/ChangeLog:
>
> * genmatch.cc (decision_tree::gen): Support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
> * gimple-match-exports.cc (gimple_simplify): Ditto.
> (gimple_resimplify6): New function.
> (gimple_resimplify7): New function.
> (gimple_match_op::resimplify): Support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
> (convert_conditional_op): Ditto.
> (build_call_internal): Ditto.
> (try_conditional_simplification): Ditto.
> (gimple_extract): Ditto.
> * gimple-match.h (gimple_match_cond::gimple_match_cond): Ditto.
> * internal-fn.cc (CASE): Ditto.

OK, thanks.

Richard

>
> ---
>  gcc/genmatch.cc |   2 +-
>  gcc/gimple-match-exports.cc | 123 ++--
>  gcc/gimple-match.h  |  16 -
>  gcc/internal-fn.cc  |   7 +-
>  4 files changed, 138 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
> index f46d2e1520d..a1925a747a7 100644
> --- a/gcc/genmatch.cc
> +++ b/gcc/genmatch.cc
> @@ -4052,7 +4052,7 @@ decision_tree::gen (vec  , bool gimple)
>  }
>fprintf (stderr, "removed %u duplicate tails\n", rcnt);
>  
> -  for (unsigned n = 1; n <= 5; ++n)
> +  for (unsigned n = 1; n <= 7; ++n)
>  {
>bool has_kids_p = false;
>  
> diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
> index 7aeb4ddb152..b36027b0bad 100644
> --- a/gcc/gimple-match-exports.cc
> +++ b/gcc/gimple-match-exports.cc
> @@ -60,6 +60,12 @@ extern bool gimple_simplify (gimple_match_op *, gimple_seq 
> *, tree (*)(tree),
>code_helper, tree, tree, tree, tree, tree);
>  extern bool gimple_simplify (gimple_match_op *, gimple_seq *, tree (*)(tree),
>code_helper, tree, tree, tree, tree, tree, tree);
> +extern bool gimple_simplify (gimple_match_op *, gimple_seq *, tree (*)(tree),
> +  code_helper, tree, tree, 

Re: [PATCH] RISC-V: Add conditional unary neg/abs/not autovec patterns

2023-08-24 Thread Richard Sandiford via Gcc-patches
Jeff Law  writes:
> On 8/22/23 02:08, juzhe.zh...@rivai.ai wrote:
>> Yes, I agree long-term we want every-thing be optimized as early as 
>> possible.
>> 
>> However, IMHO, it's impossible we can support every conditional patterns 
>> in the middle-end (match.pd).
>> It's a really big number.
>> 
>> For example, for sign_extend conversion, we have vsext.vf2 (vector SI -> 
>> vector DI),... vsext.vf4 (vector HI -> vector DI), vsext.vf8 (vector QI 
>> -> vector DI)..
>> Not only the conversion, every auto-vectorization patterns can have 
>> conditional format.
>> For example, abs,..rotate, sqrt, floor, ceil,etc.
>> I bet it could be over 100+ conditional optabs/internal FNs. It's huge 
>> number.
>> I don't see necessity that we should support them in middle-end 
>> (match.pd) since we known RTL back-end combine PASS can do the good job 
>> here.
>> 
>> Besides, LLVM doesn't such many conditional pattern. LLVM just has "add" 
>> and "select" separate IR then do the combine in the back-end:
>> https://godbolt.org/z/rYcMMG1eT 
>> 
>> You can see LLVM didn't do the op + select optimization in generic IR, 
>> they do the optimization in combine PASS.
>> 
>> So I prefer this patch solution and apply such solution for the future 
>> more support : sign extend, zero extend, float extend, abs, sqrt, ceil, 
>> floor, etc.
> It's certainly got the potential to get out of hand.  And it's not just 
> the vectorizer operations.  I know of an architecture that can execute 
> most of its ALU and loads/stores conditionally (not predication, but 
> actual conditional ops) like target = (x COND Y) ? a << b : a
>
> I'd tend to lean towards synthesizing these conditional ops around a 
> conditional move/select primitive in gimple through the RTL expanders. 
> That would in turn set things up so that if the target had various 
> conditional operations like conditional shift it could be trivially 
> discovered by the combiner.

FWIW, one of the original motivations behind the COND_* internal
functions was to represent the fact that the operation is suppressed
(rather than being performed and discarded) when the predicate is false.
This allows if-conversion for FP operations even in strict FP modes,
since inactive lanes are guaranteed not to generate an exception.

I think it makes sense to add COND_* functions for anything that can
reasonably be done on FP types, and that could generate an FP exception.
E.g. sqrt was one of the examples mentioned, and I think COND_SQRT is
something that we should have.
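
For example, a minimal sketch of the kind of loop this would let us
if-convert even without -ffast-math (not taken from an existing testcase):

  void
  f (double *restrict x, double *restrict b, int n)
  {
    for (int i = 0; i < n; ++i)
      if (b[i] > 0.0)
        x[i] = __builtin_sqrt (b[i]);
  }

A .COND_SQRT would perform the sqrt only for lanes in which b[i] > 0.0
holds, so the inactive lanes cannot raise a spurious FP exception.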

I agree it's less clear-cut for purely integer stuff, or for FP operations
like neg and abs that are pure bit manipulation.  But perhaps there's a
question of how many operations are only defined for integers, and
whether the number is high enough for them to be treated differently.

I wouldn't have expected an explosion of operations to be a significant
issue, since (a) the underlying infrastructure is pretty mechanical and
(b) any operation that a target supports is going to need an .md pattern
whatever happens.

Thanks,
Richard


Re: [PATCH 03/11] aarch64: Use br instead of ret for eh_return

2023-08-24 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> Rather than hiding this in target code, perhaps we should add a
> target-independent concept of an "eh_return taken" flag, say
> EH_RETURN_TAKEN_RTX.
>
> We could define it so that, on targets that define EH_RETURN_TAKEN_RTX,
> a register EH_RETURN_STACKADJ_RTX and a register EH_RETURN_HANDLER_RTX
> are only meaningful when the flag is true.  E.g. we could have:
>
> #ifdef EH_RETURN_HANDLER_RTX

Gah, I meant #ifdef EH_RETURN_TAKEN_RTX here

>   for (rtx tmp : { EH_RETURN_STACKADJ_RTX, EH_RETURN_HANDLER_RTX })
> if (tmp && REG_P (tmp))
>   emit_clobber (tmp);
> #endif
>
> in the "normal return" part of expand_eh_return.  (If some other target
> wants a flag with different semantics, it'd be up to them to add it.)
>
> That should avoid most of the bad code-quality effects, since the
> specialness of x4-x6 will be confined to the code immediately before
> the pre-epilogue exit edges.
>
> Thanks,
> Richard


Re: [PATCH] tree-optimization/111115 - SLP of masked stores

2023-08-24 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> The following adds the capability to do SLP on .MASK_STORE, I do not
> plan to add interleaving support.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?

LGTM, thanks.

Richard

> Thanks,
> Richard.
>
>   PR tree-optimization/111115
> gcc/
>   * tree-vectorizer.h (vect_slp_child_index_for_operand): New.
>   * tree-vect-data-refs.cc (can_group_stmts_p): Also group
>   .MASK_STORE.
>   * tree-vect-slp.cc (arg3_arg2_map): New.
>   (vect_get_operand_map): Handle IFN_MASK_STORE.
>   (vect_slp_child_index_for_operand): New function.
>   (vect_build_slp_tree_1): Handle statements with no LHS,
>   masked store ifns.
>   (vect_remove_slp_scalar_calls): Likewise.
>   * tree-vect-stmts.c (vect_check_store_rhs): Lookup the
>   SLP child corresponding to the ifn value index.
>   (vectorizable_store): Likewise for the mask index.  Support
>   masked stores.
>   (vectorizable_load): Lookup the SLP child corresponding to the
>   ifn mask index.
>
> gcc/testsuite/
>   * lib/target-supports.exp (check_effective_target_vect_masked_store):
>   Supported with check_avx_available.
>   * gcc.dg/vect/slp-mask-store-1.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c | 39 +
>  gcc/testsuite/lib/target-supports.exp|  3 +-
>  gcc/tree-vect-data-refs.cc   |  3 +-
>  gcc/tree-vect-slp.cc | 46 +---
>  gcc/tree-vect-stmts.cc   | 23 +-
>  gcc/tree-vectorizer.h|  1 +
>  6 files changed, 94 insertions(+), 21 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c 
> b/gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c
> new file mode 100644
> index 000..50b7066778e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c
> @@ -0,0 +1,39 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-mavx2" { target avx2 } } */
> +
> +#include "tree-vect.h"
> +
> +void __attribute__((noipa))
> +foo (unsigned * __restrict x, int * __restrict flag)
> +{
> +  for (int i = 0; i < 32; ++i)
> +{
> +  if (flag[2*i+0])
> +x[2*i+0] = x[2*i+0] + 3;
> +  if (flag[2*i+1])
> +x[2*i+1] = x[2*i+1] + 177;
> +}
> +}
> +
> +unsigned x[16];
> +int flag[32] = { 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0,
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> +unsigned res[16] = { 3, 177, 0, 0, 0, 177, 3, 0, 3, 177, 0, 0, 0, 177, 3, 0 
> };
> +
> +int
> +main ()
> +{
> +  check_vect ();
> +
> +  foo (x, flag);
> +
> +  if (__builtin_memcmp (x, res, sizeof (x)) != 0)
> +abort ();
> +  for (int i = 0; i < 32; ++i)
> +if (flag[i] != 0 && flag[i] != 1)
> +  abort ();
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 1 "vect" { target { 
> vect_masked_store && vect_masked_load } } } } */
> diff --git a/gcc/testsuite/lib/target-supports.exp 
> b/gcc/testsuite/lib/target-supports.exp
> index d4623ee6b45..d353cc0aaf0 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -8400,7 +8400,8 @@ proc check_effective_target_vect_masked_load { } {
>  # Return 1 if the target supports vector masked stores.
>  
>  proc check_effective_target_vect_masked_store { } {
> -return [expr { [check_effective_target_aarch64_sve]
> +return [expr { [check_avx_available]
> +|| [check_effective_target_aarch64_sve]
>  || [istarget amdgcn*-*-*] }]
>  }
>  
> diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> index 3e9a284666c..a2caf6cb1c7 100644
> --- a/gcc/tree-vect-data-refs.cc
> +++ b/gcc/tree-vect-data-refs.cc
> @@ -3048,8 +3048,7 @@ can_group_stmts_p (stmt_vec_info stmt1_info, 
> stmt_vec_info stmt2_info,
>like those created by build_mask_conversion.  */
>tree mask1 = gimple_call_arg (call1, 2);
>tree mask2 = gimple_call_arg (call2, 2);
> -  if (!operand_equal_p (mask1, mask2, 0)
> -  && (ifn == IFN_MASK_STORE || !allow_slp_p))
> +  if (!operand_equal_p (mask1, mask2, 0) && !allow_slp_p)
>   {
> mask1 = strip_conversion (mask1);
> if (!mask1)
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index b5f9333fc22..cc799b6ebcd 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -503,6 +503,7 @@ static const int cond_expr_maps[3][5] = {
>  static const int arg1_map[] = { 1, 1 };
>  static const int arg2_map[] = { 1, 2 };
>  static const int arg1_arg4_map[] = { 2, 1, 4 };
> +static const int arg3_arg2_map[] = { 2, 3, 2 };
>  static const int op1_op0_map[] = { 2, 1, 0 };
>  
>  /* For most SLP statements, there is a one-to-one mapping between
> @@ -543,6 +544,9 @@ vect_get_operand_map (const gimple *stmt, unsigned char 
> swap = 0)
> case 

[PATCH] aarch64: Account for different Advanced SIMD fusing options

2023-08-24 Thread Richard Sandiford via Gcc-patches
The scalar FNMADD/FNMSUB and SVE FNMLA/FNMLS instructions mean
that either side of a subtraction can start an accumulator chain.
However, Advanced SIMD doesn't have an equivalent instruction.
This means that, for Advanced SIMD, a subtraction can only be
fused if the second operand is a multiplication.

Also, if both sides of a subtraction are multiplications,
and if the second operand is used multiple times, such as:

 c * d - a * b
 e * f - a * b

then the first rather than second multiplication operand will tend
to be fused.  On Advanced SIMD, this leads to:

 tmp1 = a * b
 tmp2 = -tmp1
  ... = tmp2 + c * d   // FMLA
  ... = tmp2 + e * f   // FMLA

where one of the FMLAs also requires a MOV.

This patch tries to account for this in the vector cost model.
It improves roms performance by 2-3% on Neoverse V1.  It's also
needed to avoid a regression in fotonik for Neoverse N2 and
Neoverse V2 with the patch for PR110625.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
* config/aarch64/aarch64.cc: Include ssa.h.
(aarch64_multiply_add_p): Require the second operand of an
Advanced SIMD subtraction to be a multiplication.  Assume that
such an operation won't be fused if the second operand is used
multiple times and if the first operand is also a multiplication.

gcc/testsuite/
* gcc.target/aarch64/neoverse_v1_2.c: New test.
* gcc.target/aarch64/neoverse_v1_3.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc | 24 ++-
 .../gcc.target/aarch64/neoverse_v1_2.c| 15 
 .../gcc.target/aarch64/neoverse_v1_3.c| 14 +++
 3 files changed, 47 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/neoverse_v1_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/neoverse_v1_3.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 034628148ef..37d414021ca 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -84,6 +84,7 @@
 #include "aarch64-feature-deps.h"
 #include "config/arm/aarch-common.h"
 #include "config/arm/aarch-common-protos.h"
+#include "ssa.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -16411,20 +16412,20 @@ aarch64_multiply_add_p (vec_info *vinfo, 
stmt_vec_info stmt_info,
   if (code != PLUS_EXPR && code != MINUS_EXPR)
 return false;
 
-  for (int i = 1; i < 3; ++i)
+  auto is_mul_result = [&](int i)
 {
   tree rhs = gimple_op (assign, i);
   /* ??? Should we try to check for a single use as well?  */
   if (TREE_CODE (rhs) != SSA_NAME)
-   continue;
+   return false;
 
   stmt_vec_info def_stmt_info = vinfo->lookup_def (rhs);
   if (!def_stmt_info
  || STMT_VINFO_DEF_TYPE (def_stmt_info) != vect_internal_def)
-   continue;
+   return false;
   gassign *rhs_assign = dyn_cast (def_stmt_info->stmt);
   if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
-   continue;
+   return false;
 
   if (vec_flags & VEC_ADVSIMD)
{
@@ -16444,8 +16445,19 @@ aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info 
stmt_info,
}
 
   return true;
-}
-  return false;
+};
+
+  if (code == MINUS_EXPR && (vec_flags & VEC_ADVSIMD))
+/* Advanced SIMD doesn't have FNMADD/FNMSUB/FNMLA/FNMLS, so the
+   multiplication must be on the second operand (to form an FMLS).
+   But if both operands are multiplications and the second operand
+   is used more than once, we'll instead negate the second operand
+   and use it as an accumulator for the first operand.  */
+return (is_mul_result (2)
+   && (has_single_use (gimple_assign_rhs2 (assign))
+   || !is_mul_result (1)));
+
+  return is_mul_result (1) || is_mul_result (2);
 }
 
 /* Return true if STMT_INFO is the second part of a two-statement boolean AND
diff --git a/gcc/testsuite/gcc.target/aarch64/neoverse_v1_2.c 
b/gcc/testsuite/gcc.target/aarch64/neoverse_v1_2.c
new file mode 100644
index 000..45d7e81c78e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/neoverse_v1_2.c
@@ -0,0 +1,15 @@
+/* { dg-options "-O2 -mcpu=neoverse-v1 --param aarch64-autovec-preference=1 
-fdump-tree-vect-details" } */
+
+void
+f (float x[restrict][100], float y[restrict][100])
+{
+  for (int i = 0; i < 100; ++i)
+{
+  x[0][i] = y[0][i] * y[1][i] - y[3][i] * y[4][i];
+  x[1][i] = y[1][i] * y[2][i] - y[3][i] * y[4][i];
+}
+}
+
+/* { dg-final { scan-tree-dump {_[0-9]+ - _[0-9]+ 1 times vector_stmt costs 2 
} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector_stmt costs 0 } "vect" } } */
+/* { dg-final { scan-tree-dump {_[0-9]+ - _[0-9]+ 1 times scalar_stmt costs 0 
} "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/neoverse_v1_3.c 
b/gcc/testsuite/gcc.target/aarch64/neoverse_v1_3.c
new file mode 100644
index 000..de31fc13b28
--- /dev/null
+++ 

Re: [PATCH] AArch64: Fix MOPS memmove operand corruption [PR111121]

2023-08-23 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> Hi Richard,
>
> (that's quick!)
>
>> +  if (size > max_copy_size || size > max_mops_size)
>> +return aarch64_expand_cpymem_mops (operands, is_memmove);
>>
>> Could you explain this a bit more?  If I've followed the logic correctly,
>> max_copy_size will always be 0 for movmem, so this "if" condition will
>> always be true for movmem (given that the caller can be relied on to
>> optimise away zero-length copies).  So doesn't this function reduce to:
>
> In this patch it is zero yes, but there is no real reason for that. The goal 
> is to
> share as much code as possible. I have a patch that inlines memmove like
> memcpy.

But I think this part of the patch belongs in that future series.
The current patch should just concentrate on fixing the bug.

It's difficult to evaluate the change at the moment, without the follow-on
change that it's preparing for.  I don't think it stands as an independent
improvement in its own right.

>> when is_memmove is true?  If so, I think it would be clearer to do that
>> directly, rather than go through aarch64_expand_cpymem.  max_copy_size
>> is really an optimisation threshold, whereas the above seems to be
>> leaning on it for correctness.
>
> In principle we could for the time being add a assert (!is_memmove) if that
> makes it clearer memmove isn't yet handled.

I think for this patch movmemdi should just call aarch64_expand_cpymem_mops
directly.  Let's leave the aarch64_expand_cpymem changes to other patches.

>> ...I think we might as well keep this pattern conditional on TARGET_MOPS.
>
> But then we have inconsistencies in the conditions of the expanders, which
> is what led to all these bugs in the first place (I lost count, there are 4 
> or 5
> different bugs I fixed). Ensuring everything is 100% identical between
> memcpy and memmove makes the code much easier to follow.

I think that too should be part of your follow-on changes to do inline
movmem expansions without TARGET_MOPS.  While all supported movmemdis
require TARGET_MOPS, I think the expander should too.

>> I think we can then also split:
>>
>>   /* All three registers are changed by the instruction, so each one
>>  must be a fresh pseudo.  */
>>   rtx dst_addr = copy_to_mode_reg (Pmode, XEXP (operands[0], 0));
>>   rtx src_addr = copy_to_mode_reg (Pmode, XEXP (operands[1], 0));
>>   rtx dst_mem = replace_equiv_address (operands[0], dst_addr);
>>   rtx src_mem = replace_equiv_address (operands[1], src_addr);
>>   rtx sz_reg = copy_to_mode_reg (DImode, operands[2]);
>>
>> out of aarch64_expand_cpymem_mops into a new function (say
>> aarch64_prepare_mops_operands) and call it from the movmemdi
>> expander.  There should then be no need for the extra staging
>> expander (aarch64_movmemdi).
>
> So you're saying we could remove aarch64_cpymemdi/movmemdi if
> aarch64_expand_cpymem_mops did massage the operands in the
> right way so that we can immediately match the underlying instruction?

Yeah.  But I'd forgotten about the pesky fourth (alignment) operand
to movmemdi and cpymemdi, which we don't need for the mops patterns.
So I take that part back.  I agree it's clearer to have a separate
aarch64_movmemdi expander.

> Hmm, does that actually work, as in we don't lose the extra alias info that
> gets lost in the current memmove expander? (another bug/inconsistency)
>
> And the MOPS code would be separated from aarch64_expand_cpymem
> so we'd do all the MOPS size tests inside aarch64_expand_cpymem_mops
> and the expander tries using MOPS first and if it fails try inline expansion?
>
> So something like:
>
> (define_expand "movmemdi"
> 
>   if (aarch64_try_mops_expansion (operands, is_memmove))
> DONE;
>   if (aarch64_try_inline_copy_expansion (operands, is_memmove))
> DONE;
>   FAIL;
> )
>
>> IMO the STRICT_ALIGNMENT stuff should be a separate patch,
>> with its own testcases.
>
> We will need backports to fix all these bugs, so the question is whether it
> is worth doing a lot of cleanups now?

But I think what I'm asking for is significantly simpler than the
original patch.  That should make it more backportable rather than less.

Thanks,
Richard


Re: [PATCH] rtl: Forward declare rtx_code

2023-08-23 Thread Richard Sandiford via Gcc-patches
"Richard Earnshaw (lists)"  writes:
> On 23/08/2023 16:49, Richard Sandiford via Gcc-patches wrote:
>> Richard Earnshaw via Gcc-patches  writes:
>>> Now that we require C++ 11, we can safely forward declare rtx_code
>>> so that we can use it in target hooks.
>>>
>>> gcc/ChangeLog
>>> * coretypes.h (rtx_code): Add forward declaration.
>>> * rtl.h (rtx_code): Make compatible with forward declaration.
>>> ---
>>>  gcc/coretypes.h | 4 
>>>  gcc/rtl.h   | 2 +-
>>>  2 files changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/gcc/coretypes.h b/gcc/coretypes.h
>>> index ca8837cef67..51e9ce0 100644
>>> --- a/gcc/coretypes.h
>>> +++ b/gcc/coretypes.h
>>> @@ -100,6 +100,10 @@ struct gimple;
>>>  typedef gimple *gimple_seq;
>>>  struct gimple_stmt_iterator;
>>>  
>>> +/* Forward declare rtx_code, so that we can use it in target hooks without
>>> +   needing to pull in rtl.h.  */
>>> +enum rtx_code : unsigned;
>>> +
>>>  /* Forward decls for leaf gimple subclasses (for individual gimple codes).
>>> Keep this in the same order as the corresponding codes in gimple.def.  
>>> */
>>>  
>>> diff --git a/gcc/rtl.h b/gcc/rtl.h
>>> index e1c51156f90..0e9491b89b4 100644
>>> --- a/gcc/rtl.h
>>> +++ b/gcc/rtl.h
>>> @@ -45,7 +45,7 @@ class predefined_function_abi;
>>>  /* Register Transfer Language EXPRESSIONS CODES */
>>>  
>>>  #define RTX_CODE   enum rtx_code
>>> -enum rtx_code  {
>>> +enum rtx_code : unsigned {
>>>  
>>>  #define DEF_RTL_EXPR(ENUM, NAME, FORMAT, CLASS)   ENUM ,
>>>  #include "rtl.def" /* rtl expressions are documented here */
>> 
>> Given:
>> 
>>   #define RTX_CODE_BITSIZE 8
>> 
>> there might be some value in making it uint8_t rather than unsigned.
>> Preapproved if you agree.
>> 
>> But the patch as posted is a strict improvement over the status quo,
>> so it's also OK as-is.
>> 
>> Thanks,
>> Richard
>
> I did think about that, but there were two reasons for not doing so:
> - it presumes we would never want more than 8 bits for rtx_code (well, not 
> quite, 
> but it would make it more work to change this).

The rtx_def structure itself provides a significant barrier to that though.

If we ever think that we need to represent more than 256 separate
operations, I think the natural way would be to treat the less well-used
ones in a similar way to unspecs.

> - it would probably lead to more zero-extension operations happening in the 
> compiler

Yeah, that's true.  The upside though is that we could then declare
arrays of codes directly, without having to resort to "unsigned char"
tricks.  That's unlikely to help codes much, but the same principle
would apply to modes, which are more frequently put into arrays.

E.g. one of the issues with bumping the machine_mode bitfield from 8 to
16 bits was finding all the places where "unsigned char" was used to
hold modes, and changing them to "unsigned short".  If machine_mode was
instead the "right" size, we could just call a spade a spade.

But like I say, that's mostly reasoning by analogy rather than because
the size of rtx_code itself is important.

Richard


Re: [PATCH] rtl: use rtx_code for gen_ccmp_first and gen_ccmp_next

2023-08-23 Thread Richard Sandiford via Gcc-patches
Richard Earnshaw via Gcc-patches  writes:
> Note, this patch is dependent on the patch I posted yesterday to
> forward declare rtx_code in coretypes.h.
>
> --
> Now that we have a forward declaration of rtx_code in coretypes.h, we
> can adjust these hooks to take rtx_code arguments rather than an int.
>
> gcc/ChangeLog:
>
>   * target.def (gen_ccmp_first, gen_ccmp_next): Use rtx_code for
>   CODE, CMP_CODE and BIT_CODE arguments.
>   * config/aarch64/aarch64.cc (aarch64_gen_ccmp_first): Likewise.
>   (aarch64_gen_ccmp_next): Likewise.
>   * doc/tm.texi: Regenerated.

OK, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 5 +++--
>  gcc/doc/tm.texi   | 4 ++--
>  gcc/target.def| 4 ++--
>  3 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..bc09185b8ec 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -25585,7 +25585,7 @@ aarch64_asan_shadow_offset (void)
>  
>  static rtx
>  aarch64_gen_ccmp_first (rtx_insn **prep_seq, rtx_insn **gen_seq,
> - int code, tree treeop0, tree treeop1)
> + rtx_code code, tree treeop0, tree treeop1)
>  {
>machine_mode op_mode, cmp_mode, cc_mode = CCmode;
>rtx op0, op1;
> @@ -25659,7 +25659,8 @@ aarch64_gen_ccmp_first (rtx_insn **prep_seq, rtx_insn 
> **gen_seq,
>  
>  static rtx
>  aarch64_gen_ccmp_next (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev,
> -int cmp_code, tree treeop0, tree treeop1, int bit_code)
> +rtx_code cmp_code, tree treeop0, tree treeop1,
> +rtx_code bit_code)
>  {
>rtx op0, op1, target;
>machine_mode op_mode, cmp_mode, cc_mode = CCmode;
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 95ba56e05ae..75cb8e3417c 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -12005,7 +12005,7 @@ This target hook is required only when the target has 
> several different
>  modes and they have different conditional execution capability, such as ARM.
>  @end deftypefn
>  
> -@deftypefn {Target Hook} rtx TARGET_GEN_CCMP_FIRST (rtx_insn 
> **@var{prep_seq}, rtx_insn **@var{gen_seq}, int @var{code}, tree @var{op0}, 
> tree @var{op1})
> +@deftypefn {Target Hook} rtx TARGET_GEN_CCMP_FIRST (rtx_insn 
> **@var{prep_seq}, rtx_insn **@var{gen_seq}, rtx_code @var{code}, tree 
> @var{op0}, tree @var{op1})
>  This function prepares to emit a comparison insn for the first compare in a
>   sequence of conditional comparisions.  It returns an appropriate comparison
>   with @code{CC} for passing to @code{gen_ccmp_next} or @code{cbranch_optab}.
> @@ -12015,7 +12015,7 @@ This function prepares to emit a comparison insn for 
> the first compare in a
>   @var{code} is the @code{rtx_code} of the compare for @var{op0} and 
> @var{op1}.
>  @end deftypefn
>  
> -@deftypefn {Target Hook} rtx TARGET_GEN_CCMP_NEXT (rtx_insn 
> **@var{prep_seq}, rtx_insn **@var{gen_seq}, rtx @var{prev}, int 
> @var{cmp_code}, tree @var{op0}, tree @var{op1}, int @var{bit_code})
> +@deftypefn {Target Hook} rtx TARGET_GEN_CCMP_NEXT (rtx_insn 
> **@var{prep_seq}, rtx_insn **@var{gen_seq}, rtx @var{prev}, rtx_code 
> @var{cmp_code}, tree @var{op0}, tree @var{op1}, rtx_code @var{bit_code})
>  This function prepares to emit a conditional comparison within a sequence
>   of conditional comparisons.  It returns an appropriate comparison with
>   @code{CC} for passing to @code{gen_ccmp_next} or @code{cbranch_optab}.
> diff --git a/gcc/target.def b/gcc/target.def
> index 7d684296c17..3ad0bde3ece 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2735,7 +2735,7 @@ DEFHOOK
>   insns are saved in @var{gen_seq}.  They will be emitted when all the\n\
>   compares in the conditional comparision are generated without error.\n\
>   @var{code} is the @code{rtx_code} of the compare for @var{op0} and 
> @var{op1}.",
> - rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, int code, tree op0, tree 
> op1),
> + rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx_code code, tree op0, 
> tree op1),
>   NULL)
>  
>  DEFHOOK
> @@ -2752,7 +2752,7 @@ DEFHOOK
>   be appropriate for passing to @code{gen_ccmp_next} or 
> @code{cbranch_optab}.\n\
>   @var{code} is the @code{rtx_code} of the compare for @var{op0} and 
> @var{op1}.\n\
>   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the 
> compares.",
> - rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree 
> op0, tree op1, int bit_code),
> + rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, rtx_code cmp_code, 
> tree op0, tree op1, rtx_code bit_code),
>   NULL)
>  
>  /* Return a new value for loop unroll size.  */


Re: [PATCH] rtl: Forward declare rtx_code

2023-08-23 Thread Richard Sandiford via Gcc-patches
Richard Earnshaw via Gcc-patches  writes:
> Now that we require C++ 11, we can safely forward declare rtx_code
> so that we can use it in target hooks.
>
> gcc/ChangeLog
>   * coretypes.h (rtx_code): Add forward declaration.
>   * rtl.h (rtx_code): Make compatible with forward declaration.
> ---
>  gcc/coretypes.h | 4 
>  gcc/rtl.h   | 2 +-
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/coretypes.h b/gcc/coretypes.h
> index ca8837cef67..51e9ce0 100644
> --- a/gcc/coretypes.h
> +++ b/gcc/coretypes.h
> @@ -100,6 +100,10 @@ struct gimple;
>  typedef gimple *gimple_seq;
>  struct gimple_stmt_iterator;
>  
> +/* Forward declare rtx_code, so that we can use it in target hooks without
> +   needing to pull in rtl.h.  */
> +enum rtx_code : unsigned;
> +
>  /* Forward decls for leaf gimple subclasses (for individual gimple codes).
> Keep this in the same order as the corresponding codes in gimple.def.  */
>  
> diff --git a/gcc/rtl.h b/gcc/rtl.h
> index e1c51156f90..0e9491b89b4 100644
> --- a/gcc/rtl.h
> +++ b/gcc/rtl.h
> @@ -45,7 +45,7 @@ class predefined_function_abi;
>  /* Register Transfer Language EXPRESSIONS CODES */
>  
>  #define RTX_CODE enum rtx_code
> -enum rtx_code  {
> +enum rtx_code : unsigned {
>  
>  #define DEF_RTL_EXPR(ENUM, NAME, FORMAT, CLASS)   ENUM ,
>  #include "rtl.def"   /* rtl expressions are documented here */

Given:

  #define RTX_CODE_BITSIZE 8

there might be some value in making it uint8_t rather than unsigned.
Preapproved if you agree.

But the patch as posted is a strict improvement over the status quo,
so it's also OK as-is.

Thanks,
Richard


Re: [PATCH] AArch64: Fix MOPS memmove operand corruption [PR111121]

2023-08-23 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> A MOPS memmove may corrupt registers since there is no copy of the input 
> operands to temporary
> registers.  Fix this by calling aarch64_expand_cpymem which does this.  Also 
> fix an issue with
> STRICT_ALIGNMENT being ignored if TARGET_MOPS is true, and avoid crashing or 
> generating a huge
> expansion if aarch64_mops_memcpy_size_threshold is large.
>
> Passes regress/bootstrap, OK for commit?
>
> gcc/ChangeLog/
> PR target/111121
> * config/aarch64/aarch64.md (cpymemdi): Remove STRICT_ALIGNMENT, add 
> param for memmove.
> (aarch64_movmemdi): Add new expander similar to aarch64_cpymemdi.
> (movmemdi): Like cpymemdi call aarch64_expand_cpymem for correct 
> expansion.
> * config/aarch64/aarch64.cc (aarch64_expand_cpymem_mops): Add support 
> for memmove.
> (aarch64_expand_cpymem): Add support for memmove. Handle 
> STRICT_ALIGNMENT correctly.
> Handle TARGET_MOPS size selection correctly.
> * config/aarch64/aarch64-protos.h (aarch64_expand_cpymem): Update 
> prototype.
>
> gcc/testsuite/ChangeLog/
> PR target/111121
> * gcc.target/aarch64/mops_4.c: Add memmove testcases.
>
> ---
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 
> 70303d6fd953e0c397b9138ede8858c2db2e53db..97375e81cbda078847af83bf5dd4e0d7673d6af4
>  100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -765,7 +765,7 @@ bool aarch64_emit_approx_div (rtx, rtx, rtx);
>  bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
>  tree aarch64_vector_load_decl (tree);
>  void aarch64_expand_call (rtx, rtx, rtx, bool);
> -bool aarch64_expand_cpymem (rtx *);
> +bool aarch64_expand_cpymem (rtx *, bool);
>  bool aarch64_expand_setmem (rtx *);
>  bool aarch64_float_const_zero_rtx_p (rtx);
>  bool aarch64_float_const_rtx_p (rtx);
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> eba5d4a7e04b7af82437453a691d5607d98133c9..5e8d0a0c91bc7719de2a8c5627b354cf905a4db0
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -25135,10 +25135,11 @@ aarch64_copy_one_block_and_progress_pointers (rtx 
> *src, rtx *dst,
>*dst = aarch64_progress_pointer (*dst);
>  }
>
> -/* Expand a cpymem using the MOPS extension.  OPERANDS are taken
> -   from the cpymem pattern.  Return true iff we succeeded.  */
> +/* Expand a cpymem/movmem using the MOPS extension.  OPERANDS are taken
> +   from the cpymem/movmem pattern.  IS_MEMMOVE is true if this is a memmove
> +   rather than memcpy.  Return true iff we succeeded.  */
>  static bool
> -aarch64_expand_cpymem_mops (rtx *operands)
> +aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove)
>  {
>if (!TARGET_MOPS)
>  return false;
> @@ -25150,17 +25151,19 @@ aarch64_expand_cpymem_mops (rtx *operands)
>rtx dst_mem = replace_equiv_address (operands[0], dst_addr);
>rtx src_mem = replace_equiv_address (operands[1], src_addr);
>rtx sz_reg = copy_to_mode_reg (DImode, operands[2]);
> -  emit_insn (gen_aarch64_cpymemdi (dst_mem, src_mem, sz_reg));
> -
> +  if (is_memmove)
> +emit_insn (gen_aarch64_movmemdi (dst_mem, src_mem, sz_reg));
> +  else
> +emit_insn (gen_aarch64_cpymemdi (dst_mem, src_mem, sz_reg));
>return true;
>  }
>
> -/* Expand cpymem, as if from a __builtin_memcpy.  Return true if
> -   we succeed, otherwise return false, indicating that a libcall to
> -   memcpy should be emitted.  */
> -
> +/* Expand cpymem/movmem, as if from a __builtin_memcpy/memmove.
> +   OPERANDS are taken from the cpymem/movmem pattern.  IS_MEMMOVE is true
> +   if this is a memmove rather than memcpy.  Return true if we succeed,
> +   otherwise return false, indicating that a libcall should be emitted.  */
>  bool
> -aarch64_expand_cpymem (rtx *operands)
> +aarch64_expand_cpymem (rtx *operands, bool is_memmove)
>  {
>int mode_bits;
>rtx dst = operands[0];
> @@ -25168,25 +25171,23 @@ aarch64_expand_cpymem (rtx *operands)
>rtx base;
>machine_mode cur_mode = BLKmode;
>
> -  /* Variable-sized memcpy can go through the MOPS expansion if available.  
> */
> -  if (!CONST_INT_P (operands[2]))
> -return aarch64_expand_cpymem_mops (operands);
> +  /* Variable-sized or strict align copies may use the MOPS expansion.  */
> +  if (!CONST_INT_P (operands[2]) || STRICT_ALIGNMENT)
> +return aarch64_expand_cpymem_mops (operands, is_memmove);
>
>unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
>
> -  /* Try to inline up to 256 bytes or use the MOPS threshold if available.  
> */
> -  unsigned HOST_WIDE_INT max_copy_size
> -= TARGET_MOPS ? aarch64_mops_memcpy_size_threshold : 256;
> +  /* Set inline limits for memmove/memcpy.  MOPS has a separate threshold.  
> */
> +  unsigned HOST_WIDE_INT max_copy_size = is_memmove ? 0 : 256;
> +  unsigned HOST_WIDE_INT max_mops_size = max_copy_size;
>
> -  bool size_p = 

Re: [PATCH 03/11] aarch64: Use br instead of ret for eh_return

2023-08-23 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> The expected way to handle eh_return is to pass the stack adjustment
> offset and landing pad address via
>
>   EH_RETURN_STACKADJ_RTX
>   EH_RETURN_HANDLER_RTX
>
> to the epilogue that is shared between normal return paths and the
> eh_return paths.  EH_RETURN_HANDLER_RTX is the stack slot of the
> return address that is overwritten with the landing pad in the
> eh_return case and EH_RETURN_STACKADJ_RTX is a register added to sp
> right before return and it is set to 0 in the normal return case.
>
> The issue with this design is that eh_return and normal return may
> require different return sequence but there is no way to distinguish
> the two cases in the epilogue (the stack adjustment may be 0 in the
> eh_return case too).
>
> The reason eh_return and normal return requires different return
> sequence is that control flow integrity hardening may need to treat
> eh_return as a forward-edge transfer (it is not returning to the
> previous stack frame) and normal return as a backward-edge one.
> In case of AArch64 forward-edge is protected by BTI and requires br
> instruction and backward-edge is protected by PAUTH or GCS and
> requires ret (or authenticated ret) instruction.
>
> This patch resolves the issue by using the EH_RETURN_STACKADJ_RTX
> register only as a flag that is set to 1 in the eh_return paths
> (it is 0 in normal return paths) and introduces
>
>   AARCH64_EH_RETURN_STACKADJ_RTX
>   AARCH64_EH_RETURN_HANDLER_RTX
>
> to pass the actual stack adjustment and landing pad address to the
> epilogue in the eh_return case. Then the epilogue can use the right
> return sequence based on the EH_RETURN_STACKADJ_RTX flag.
>
> The handler could be passed the old way via clobbering the return
> address, but since now the eh_return case can be distinguished, the
> handler can be in a different register than x30 and no stack frame
> is needed for eh_return.

I don't think there's any specific target-independent requirement for
EH_RETURN_HANDLER_RTX to be a stack slot.  df-scan.cc has code to handle
registers.

So couldn't we just use EH_RETURN_HANDLER_RTX for this, rather than
making it AARCH64_EH_RETURN_HANDLER_RTX?

> The new code generation for functions with eh_return is not amazing,
> since x5 and x6 is assumed to be used by the epilogue even in the
> normal return path, not just for eh_return.  But only the unwinder
> is expected to use eh_return so this is fine.

I guess the problem here is that x5 and x6 are upwards-exposed on
the non-eh_return paths, and so are treated as live for most of the
function.  Is that right?

The patch seems to be using the existing interfaces to implement
a slightly different model.  E.g. if feels like a hack (but a neat hack)
that EH_RETURN_STACKADJ_RTX is now a flag rather than an adjustment,
with AARCH64_EH_RETURN_STACKADJ_RTX then being the "real" stack
adjustment.  And the reason for the upwards exposure of the new
registers on normal return paths is that the existing model has
no hook into the normal return path.

Rather than hiding this in target code, perhaps we should add a
target-independent concept of an "eh_return taken" flag, say
EH_RETURN_TAKEN_RTX.

We could define it so that, on targets that define EH_RETURN_TAKEN_RTX,
a register EH_RETURN_STACKADJ_RTX and a register EH_RETURN_HANDLER_RTX
are only meaningful when the flag is true.  E.g. we could have:

#ifdef EH_RETURN_HANDLER_RTX
  for (rtx tmp : { EH_RETURN_STACKADJ_RTX, EH_RETURN_HANDLER_RTX })
if (tmp && REG_P (tmp))
  emit_clobber (tmp);
#endif

in the "normal return" part of expand_eh_return.  (If some other target
wants a flag with different semantics, it'd be up to them to add it.)

That should avoid most of the bad code-quality effects, since the
specialness of x4-x6 will be confined to the code immediately before
the pre-epilogue exit edges.

Thanks,
Richard

> This patch fixes a return to anywhere gadget in the unwinder with
> existing standard branch protection as well as makes EH return
> compatible with the Guarded Control Stack (GCS) extension.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-protos.h (aarch64_eh_return_handler_rtx):
>   Remove.
>   (aarch64_eh_return): New.
>   * config/aarch64/aarch64.cc (aarch64_return_address_signing_enabled):
>   Sign return address even in functions with eh_return.
>   (aarch64_epilogue_uses): Mark two registers as used.
>   (aarch64_expand_epilogue): Conditionally return with br or ret.
>   (aarch64_eh_return_handler_rtx): Remove.
>   (aarch64_eh_return): New.
>   * config/aarch64/aarch64.h (EH_RETURN_HANDLER_RTX): Remove.
>   (AARCH64_EH_RETURN_STACKADJ_REGNUM): Define.
>   (AARCH64_EH_RETURN_STACKADJ_RTX): Define.
>   (AARCH64_EH_RETURN_HANDLER_REGNUM): Define.
>   (AARCH64_EH_RETURN_HANDLER_RTX): Define.
>   * config/aarch64/aarch64.md (eh_return): New.
> ---
>  gcc/config/aarch64/aarch64-protos.h |   2 +-
>  gcc/config/aarch64/aarch64.cc   | 106 

Re: [PATCH v2] mklog: handle Signed-Off-By, minor cleanup

2023-08-22 Thread Richard Sandiford via Gcc-patches
Marc Poulhiès via Gcc-patches  writes:
> Consider Signed-Off-By lines as part of the ending of the initial
> commit to avoid having these in the middle of the log when the
> changelog part is injected after.
>
> This is particularly usefull with:
>
>  $ git gcc-commit-mklog --amend -s
>
> that can be used to create the changelog and add the Signed-Off-By line.
>
> Also applies most of the shellcheck suggestions on the
> prepare-commit-msg hook.
>
> contrib/ChangeLog:
>
>   * mklog.py: Leave SOB lines after changelog.
>   * prepare-commit-msg: Apply most shellcheck suggestions.
>
> Signed-off-by: Marc Poulhiès 
> ---
> Previous version was missing the ChangeLog.
>
> This command is used in particular during the dev of the frontend
> for the Rust language (see r13-7099-g4b25fc15b925f8 as an example
> of a SoB ending in the middle of the commit message).
>
> Ok for master?
>
>  contrib/mklog.py   | 34 +-
>  contrib/prepare-commit-msg | 20 ++--
>  2 files changed, 39 insertions(+), 15 deletions(-)
>
> diff --git a/contrib/mklog.py b/contrib/mklog.py
> index 777212c98d7..e5cc69e0d0a 100755
> --- a/contrib/mklog.py
> +++ b/contrib/mklog.py
> @@ -41,7 +41,34 @@ from unidiff import PatchSet
>  
>  LINE_LIMIT = 100
>  TAB_WIDTH = 8
> -CO_AUTHORED_BY_PREFIX = 'co-authored-by: '
> +
> +# Initial commit:
> +#   +--+
> +#   | gccrs: Some title|
> +#   |  | This is the "start"
> +#   | This is some text explaining the commit. |
> +#   | There can be several lines.  |
> +#   |  |<--->
> +#   | Signed-off-by: My Name  | This is the "end"
> +#   +--+
> +#
> +# Results in:
> +#   +--+
> +#   | gccrs: Some title|
> +#   |  |
> +#   | This is some text explaining the commit. | This is the "start"
> +#   | There can be several lines.  |
> +#   |  |<--->
> +#   | gcc/rust/ChangeLog:  |
> +#   |  | This is the 
> generated
> +#   | * some_file (bla):   | ChangeLog part
> +#   | (foo):   |
> +#   |  |<--->
> +#   | Signed-off-by: My Name  | This is the "end"
> +#   +--+
> +
> +# this regex matches the first line of the "end" in the initial commit 
> message
> +FIRST_LINE_OF_END_RE = re.compile('(?i)^(signed-off-by|co-authored-by|#): ')

The current code only requires an initial "#", rather than an initial "#: ".
Is that a deliberate change?

The patch LGTM apart from that.

Thanks,
Richard

>  pr_regex = re.compile(r'(\/(\/|\*)|[Cc*!])\s+(?PPR [a-z+-]+\/[0-9]+)')
>  prnum_regex = re.compile(r'PR (?P[a-z+-]+)/(?P[0-9]+)')
> @@ -330,10 +357,7 @@ def update_copyright(data):
>  
>  
>  def skip_line_in_changelog(line):
> -if line.lower().startswith(CO_AUTHORED_BY_PREFIX) or 
> line.startswith('#'):
> -return False
> -return True
> -
> +return FIRST_LINE_OF_END_RE.match(line) == None
>  
>  if __name__ == '__main__':
>  extra_args = os.getenv('GCC_MKLOG_ARGS')
> diff --git a/contrib/prepare-commit-msg b/contrib/prepare-commit-msg
> index 48c9dad3c6f..1e94706ba40 100755
> --- a/contrib/prepare-commit-msg
> +++ b/contrib/prepare-commit-msg
> @@ -32,11 +32,11 @@ if ! [ -f "$COMMIT_MSG_FILE" ]; then exit 0; fi
>  # Don't do anything unless requested to.
>  if [ -z "$GCC_FORCE_MKLOG" ]; then exit 0; fi
>  
> -if [ -z "$COMMIT_SOURCE" ] || [ $COMMIT_SOURCE = template ]; then
> +if [ -z "$COMMIT_SOURCE" ] || [ "$COMMIT_SOURCE" = template ]; then
>  # No source or "template" means new commit.
>  cmd="diff --cached"
>  
> -elif [ $COMMIT_SOURCE = message ]; then
> +elif [ "$COMMIT_SOURCE" = message ]; then
>  # "message" means -m; assume a new commit if there are any changes 
> staged.
>  if ! git diff --cached --quiet; then
>   cmd="diff --cached"
> @@ -44,23 +44,23 @@ elif [ $COMMIT_SOURCE = message ]; then
>   cmd="diff --cached HEAD^"
>  fi
>  
> -elif [ $COMMIT_SOURCE = commit ]; then
> +elif [ "$COMMIT_SOURCE" = commit ]; then
>  # The message of an existing commit.  If it's HEAD, assume --amend;
>  # otherwise, assume a new commit with -C.
> -if [ $SHA1 = HEAD ]; then
> +if [ "$SHA1" = HEAD ]; then
>   cmd="diff --cached HEAD^"
>   if [ "$(git config gcc-config.mklog-hook-type)" = "smart-amend" ]; then
>   # Check if the existing message 

Re: [PATCH] Remove XFAIL from gcc/testsuite/gcc.dg/unroll-7.c

2023-08-21 Thread Richard Sandiford via Gcc-patches
Thiago Jung Bauermann via Gcc-patches  writes:
> This test passes since commit e41103081bfa "Fix undefined behaviour in
> profile_count::differs_from_p", so remove the xfail annotation.
>
> Tested on aarch64-linux-gnu, armv8l-linux-gnueabihf and x86_64-linux-gnu.
>
> gcc/testsuite/ChangeLog:
>   * gcc.dg/unroll-7.c: Remove xfail.

Thanks, pushed to trunk.  Sorry for the slow response.

Richard

> ---
>  gcc/testsuite/gcc.dg/unroll-7.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/gcc.dg/unroll-7.c b/gcc/testsuite/gcc.dg/unroll-7.c
> index 650448df5db1..17c5e533c2cb 100644
> --- a/gcc/testsuite/gcc.dg/unroll-7.c
> +++ b/gcc/testsuite/gcc.dg/unroll-7.c
> @@ -15,4 +15,4 @@ int t(void)
>  /* { dg-final { scan-rtl-dump "upper bound: 99" "loop2_unroll" } } */
>  /* { dg-final { scan-rtl-dump "realistic bound: 99" "loop2_unroll" } } */
>  /* { dg-final { scan-rtl-dump "considering unrolling loop with constant 
> number of iterations" "loop2_unroll" } } */
> -/* { dg-final { scan-rtl-dump-not "Invalid sum" "loop2_unroll" {xfail *-*-* 
> } } } */
> +/* { dg-final { scan-rtl-dump-not "Invalid sum" "loop2_unroll" } } */
>
> base-commit: 5da4c0b85a97727e6802eaf3a0d47bcdb8da5f51


Re: [PATCH] gimple_fold: Support COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold

2023-08-21 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Wed, 16 Aug 2023, Juzhe-Zhong wrote:
>
>> Hi, Richard and Richi.
>> 
>> Currently, GCC support COND_LEN_FMA for floating-point **NO** -ffast-math.
>> It's supported in tree-ssa-math-opts.cc. However, GCC failed to support 
>> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS.
>> 
>> Consider this following case:
>> #define TEST_TYPE(TYPE)  
>>   \
>>   __attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst,  
>>   \
>>TYPE *__restrict a,  \
>>TYPE *__restrict b, int n)   \
>>   {  
>>   \
>> for (int i = 0; i < n; i++)  
>>   \
>>   dst[i] -= a[i] * b[i];   \
>>   }
>> 
>> #define TEST_ALL()   
>>   \
>>   TEST_TYPE (float)  
>>   \
>> 
>> TEST_ALL ()
>> 
>> Gimple IR for RVV:
>> 
>> ...
>> _39 = -vect__8.14_26;
>> vect__10.16_21 = .COND_LEN_FMA ({ -1, ... }, vect__6.11_30, _39, 
>> vect__4.8_34, vect__4.8_34, _46, 0);
>> ...
>> 
>> This is because this following piece of codes in tree-ssa-math-opts.cc:
>> 
>>   if (len)
>>  fma_stmt
>>= gimple_build_call_internal (IFN_COND_LEN_FMA, 7, cond, mulop1, op2,
>>  addop, else_value, len, bias);
>>   else if (cond)
>>  fma_stmt = gimple_build_call_internal (IFN_COND_FMA, 5, cond, mulop1,
>> op2, addop, else_value);
>>   else
>>  fma_stmt = gimple_build_call_internal (IFN_FMA, 3, mulop1, op2, addop);
>>   gimple_set_lhs (fma_stmt, gimple_get_lhs (use_stmt));
>>   gimple_call_set_nothrow (fma_stmt, !stmt_can_throw_internal (cfun,
>> use_stmt));
>>   gsi_replace (, fma_stmt, true);
>>   /* Follow all SSA edges so that we generate FMS, FNMA and FNMS
>>   regardless of where the negation occurs.  */
>>   gimple *orig_stmt = gsi_stmt (gsi);
>>   if (fold_stmt (, follow_all_ssa_edges))
>>  {
>>if (maybe_clean_or_replace_eh_stmt (orig_stmt, gsi_stmt (gsi)))
>>  gcc_unreachable ();
>>update_stmt (gsi_stmt (gsi));
>>  }
>> 
>> 'fold_stmt' failed to fold NEGATE_EXPR + COND_LEN_FMA > COND_LEN_FNMA.
>> 
>> This patch support STMT fold into:
>> 
>> vect__10.16_21 = .COND_LEN_FNMA ({ -1, ... }, vect__8.14_26, vect__6.11_30, 
>> vect__4.8_34, { 0.0, ... }, _46, 0);
>> 
>> Note that COND_LEN_FNMA has 7 arguments and COND_LEN_ADD has 6 arguments.
>> 
>> Extend maximum num ops:
>> -  static const unsigned int MAX_NUM_OPS = 5;
>> +  static const unsigned int MAX_NUM_OPS = 7;
>> 
>> Bootstrap and Regtest on X86 passed.
>> 
>> Fully tested COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS on RISC-V backend.
>> 
>> Testing on aarch64 is on progress.
>> 
>> gcc/ChangeLog:
>> 
>> * genmatch.cc (decision_tree::gen): Support 
>> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
>> * gimple-match-exports.cc (gimple_simplify): Ditto.
>> (gimple_resimplify6): New function.
>> (gimple_resimplify7): New function.
>> (gimple_match_op::resimplify): Support 
>> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
>> (convert_conditional_op): Ditto.
>> (build_call_internal): Ditto.
>> (try_conditional_simplification): Ditto.
>> (gimple_extract): Ditto.
>> * gimple-match.h (gimple_match_cond::gimple_match_cond): Ditto.
>> * internal-fn.cc (CASE): Ditto.
>> 
>> ---
>>  gcc/genmatch.cc |   2 +-
>>  gcc/gimple-match-exports.cc | 124 ++--
>>  gcc/gimple-match.h  |  19 +-
>>  gcc/internal-fn.cc  |  11 ++--
>>  4 files changed, 144 insertions(+), 12 deletions(-)
>> 
>> diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
>> index f46d2e1520d..a1925a747a7 100644
>> --- a/gcc/genmatch.cc
>> +++ b/gcc/genmatch.cc
>> @@ -4052,7 +4052,7 @@ decision_tree::gen (vec  , bool gimple)
>>  }
>>fprintf (stderr, "removed %u duplicate tails\n", rcnt);
>>  
>> -  for (unsigned n = 1; n <= 5; ++n)
>> +  for (unsigned n = 1; n <= 7; ++n)
>>  {
>>bool has_kids_p = false;
>>  
>> diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
>> index 7aeb4ddb152..895950309b7 100644
>> --- a/gcc/gimple-match-exports.cc
>> +++ b/gcc/gimple-match-exports.cc
>> @@ -60,6 +60,12 @@ extern bool gimple_simplify (gimple_match_op *, 
>> gimple_seq *, tree (*)(tree),
>>   code_helper, tree, tree, tree, tree, tree);
>>  extern bool gimple_simplify (gimple_match_op *, gimple_seq *, tree 
>> (*)(tree),
>>   code_helper, tree, 

Re: [PATCH] gimple_fold: Support COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold

2023-08-21 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> Hi, Richard and Richi.
>
> Currently, GCC support COND_LEN_FMA for floating-point **NO** -ffast-math.
> It's supported in tree-ssa-math-opts.cc. However, GCC failed to support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS.
>
> Consider this following case:
> #define TEST_TYPE(TYPE)   
>  \
>   __attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst,   
>  \
> TYPE *__restrict a,  \
> TYPE *__restrict b, int n)   \
>   {   
>  \
> for (int i = 0; i < n; i++)   
>  \
>   dst[i] -= a[i] * b[i];   \
>   }
>
> #define TEST_ALL()
>  \
>   TEST_TYPE (float)   
>  \
>
> TEST_ALL ()
>
> Gimple IR for RVV:
>
> ...
> _39 = -vect__8.14_26;
> vect__10.16_21 = .COND_LEN_FMA ({ -1, ... }, vect__6.11_30, _39, 
> vect__4.8_34, vect__4.8_34, _46, 0);
> ...
>
> This is because this following piece of codes in tree-ssa-math-opts.cc:
>
>   if (len)
>   fma_stmt
> = gimple_build_call_internal (IFN_COND_LEN_FMA, 7, cond, mulop1, op2,
>   addop, else_value, len, bias);
>   else if (cond)
>   fma_stmt = gimple_build_call_internal (IFN_COND_FMA, 5, cond, mulop1,
>  op2, addop, else_value);
>   else
>   fma_stmt = gimple_build_call_internal (IFN_FMA, 3, mulop1, op2, addop);
>   gimple_set_lhs (fma_stmt, gimple_get_lhs (use_stmt));
>   gimple_call_set_nothrow (fma_stmt, !stmt_can_throw_internal (cfun,
>  use_stmt));
>   gsi_replace (, fma_stmt, true);
>   /* Follow all SSA edges so that we generate FMS, FNMA and FNMS
>regardless of where the negation occurs.  */
>   gimple *orig_stmt = gsi_stmt (gsi);
>   if (fold_stmt (, follow_all_ssa_edges))
>   {
> if (maybe_clean_or_replace_eh_stmt (orig_stmt, gsi_stmt (gsi)))
>   gcc_unreachable ();
> update_stmt (gsi_stmt (gsi));
>   }
>
> 'fold_stmt' failed to fold NEGATE_EXPR + COND_LEN_FMA > COND_LEN_FNMA.
>
> This patch support STMT fold into:
>
> vect__10.16_21 = .COND_LEN_FNMA ({ -1, ... }, vect__8.14_26, vect__6.11_30, 
> vect__4.8_34, { 0.0, ... }, _46, 0);
>
> Note that COND_LEN_FNMA has 7 arguments and COND_LEN_ADD has 6 arguments.
>
> Extend maximum num ops:
> -  static const unsigned int MAX_NUM_OPS = 5;
> +  static const unsigned int MAX_NUM_OPS = 7;
>
> Bootstrap and Regtest on X86 passed.
>
> Fully tested COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS on RISC-V backend.
>
> Testing on aarch64 is on progress.
>
> gcc/ChangeLog:
>
> * genmatch.cc (decision_tree::gen): Support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
> * gimple-match-exports.cc (gimple_simplify): Ditto.
> (gimple_resimplify6): New function.
> (gimple_resimplify7): New function.
> (gimple_match_op::resimplify): Support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
> (convert_conditional_op): Ditto.
> (build_call_internal): Ditto.
> (try_conditional_simplification): Ditto.
> (gimple_extract): Ditto.
> * gimple-match.h (gimple_match_cond::gimple_match_cond): Ditto.
> * internal-fn.cc (CASE): Ditto.
>
> ---
>  gcc/genmatch.cc |   2 +-
>  gcc/gimple-match-exports.cc | 124 ++--
>  gcc/gimple-match.h  |  19 +-
>  gcc/internal-fn.cc  |  11 ++--
>  4 files changed, 144 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
> index f46d2e1520d..a1925a747a7 100644
> --- a/gcc/genmatch.cc
> +++ b/gcc/genmatch.cc
> @@ -4052,7 +4052,7 @@ decision_tree::gen (vec  , bool gimple)
>  }
>fprintf (stderr, "removed %u duplicate tails\n", rcnt);
>  
> -  for (unsigned n = 1; n <= 5; ++n)
> +  for (unsigned n = 1; n <= 7; ++n)
>  {
>bool has_kids_p = false;
>  
> diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
> index 7aeb4ddb152..895950309b7 100644
> --- a/gcc/gimple-match-exports.cc
> +++ b/gcc/gimple-match-exports.cc
> @@ -60,6 +60,12 @@ extern bool gimple_simplify (gimple_match_op *, gimple_seq 
> *, tree (*)(tree),
>code_helper, tree, tree, tree, tree, tree);
>  extern bool gimple_simplify (gimple_match_op *, gimple_seq *, tree (*)(tree),
>code_helper, tree, tree, tree, tree, tree, tree);
> +extern bool gimple_simplify (gimple_match_op *, gimple_seq *, tree (*)(tree),
> +  code_helper, tree, tree, tree, tree, 

Re: [PATCH] tree-optimization/111048 - avoid flawed logic in fold_vec_perm

2023-08-21 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Mon, 21 Aug 2023 at 12:26, Richard Biener  wrote:
>>
>> On Sat, 19 Aug 2023, Prathamesh Kulkarni wrote:
>>
>> > On Fri, 18 Aug 2023 at 14:52, Richard Biener  wrote:
>> > >
>> > > On Fri, 18 Aug 2023, Richard Sandiford wrote:
>> > >
>> > > > Richard Biener  writes:
>> > > > > The following avoids running into somehow flawed logic in 
>> > > > > fold_vec_perm
>> > > > > for non-VLA vectors.
>> > > > >
>> > > > > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
>> > > > >
>> > > > > Richard.
>> > > > >
>> > > > > PR tree-optimization/111048
>> > > > > * fold-const.cc (fold_vec_perm_cst): Check for non-VLA
>> > > > > vectors first.
>> > > > >
>> > > > > * gcc.dg/torture/pr111048.c: New testcase.
>> > > >
>> > > > Please don't do this as a permanent thing.  It was a deliberate choice
>> > > > to have the is_constant be the fallback, so that the "generic" 
>> > > > (VLA+VLS)
>> > > > logic gets more coverage.  Like you say, if something is wrong for VLS
>> > > > then the chances are that it's also wrong for VLA.
>> > >
>> > > Sure, feel free to undo this change together with the fix for the
>> > > VLA case.
>> > Hi,
>> > The attached patch reverts the workaround, and fixes the issue.
>> > Bootstrapped+tested on aarch64-linux-gnu with and without SVE, and
>> > x64_64-linux-gnu.
>> > OK to commit ?
>>
>> OK.
> Thanks, committed to trunk in 649388462e9a3c2de0b90ce525de8044704cc521

Thanks for the patch.  Please remember to close the PR too.

Richard


Re: [PATCH] c: Add support for [[__extension__ ...]]

2023-08-18 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> Joseph Myers  writes:
>> On Wed, 16 Aug 2023, Richard Sandiford via Gcc-patches wrote:
>>
>>> Would it be OK to add support for:
>>> 
>>>   [[__extension__ ...]]
>>> 
>>> to suppress the pedwarn about using [[]] prior to C2X?  Then we can
>>
>> That seems like a plausible feature to add.
>
> Thanks.  Of course, once I actually tried it, I hit a snag:
> :: isn't a single lexing token prior to C2X, and so something like:
>
>   [[__extension__ arm::streaming]]
>
> would not be interpreted as a scoped attribute in C11.  The patch
> gets around that by allowing two colons in place of :: when
> __extension__ is used.  I realise that's pushing the bounds of
> acceptability though...
>
> I wondered about trying to require the two colons to be immediately
> adjacent.  But:
>
> (a) There didn't appear to be an existing API to check that, which seemed
> like a red flag.  The closest I could find was get_source_text_between.
>
> Similarly to that, it would in principle be possible to compare
> two expanded locations.  But...
>
> (b) I had a vague impression that locations were allowed to drop column
> information for very large inputs (maybe I'm wrong).
>
> (c) It wouldn't cope with token pasting.
>
> So in the end I just used a simple two-token test, like for [[ and ]].
>
> Bootstrapped & regression-tested on aarch64-linux-gnu.

Gah, as mentioned yesterday, the patch was peeking the wrong token.
I've fixed that, and added corresponding tests.  Sorry for missing
it first time.

Richard

-

[[]] attributes are a recent addition to C, but as a GNU extension,
GCC allows them to be used in C11 and earlier.  Normally this use
would trigger a pedwarn (for -pedantic, -Wc11-c2x-compat, etc.).

This patch allows the pedwarn to be suppressed by starting the
attribute-list with __extension__.

Also, :: is not a single lexing token prior to C2X, so it wasn't
possible to use scoped attributes in C11, even as a GNU extension.
The patch allows two colons to be used in place of :: when
__extension__ is used.  No attempt is made to check whether the
two colons are immediately adjacent.
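
For illustration (mine, using the gnu::vector_size attribute from the
new tests purely as an example), both of the following forms should be
accepted without the pre-C2X pedwarn:

  typedef int [[__extension__ gnu::vector_size (4)]] v1;
  typedef int [[__extension__ gnu : : vector_size (4)]] v2;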

gcc/
* doc/extend.texi: Document the C [[__extension__ ...]] construct.

gcc/c/
* c-parser.cc (c_parser_std_attribute): Conditionally allow
two colons to be used in place of ::.
(c_parser_std_attribute_list): New function, split out from...
(c_parser_std_attribute_specifier): ...here.  Allow the attribute-list
to start with __extension__.  When it does, also allow two colons
to be used in place of ::.

gcc/testsuite/
* gcc.dg/c2x-attr-syntax-6.c: New test.
* gcc.dg/c2x-attr-syntax-7.c: Likewise.
---
 gcc/c/c-parser.cc| 64 ++--
 gcc/doc/extend.texi  | 27 --
 gcc/testsuite/gcc.dg/c2x-attr-syntax-6.c | 62 +++
 gcc/testsuite/gcc.dg/c2x-attr-syntax-7.c | 60 ++
 4 files changed, 193 insertions(+), 20 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/c2x-attr-syntax-6.c
 create mode 100644 gcc/testsuite/gcc.dg/c2x-attr-syntax-7.c

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 33fe7b115ff..ca60c51ddb2 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -5390,10 +5390,18 @@ c_parser_balanced_token_sequence (c_parser *parser)
  ( balanced-token-sequence[opt] )
 
Keywords are accepted as identifiers for this purpose.
-*/
+
+   As an extension, we permit an attribute-specifier to be:
+
+ [ [ __extension__ attribute-list ] ]
+
+   Two colons are then accepted as a synonym for ::.  No attempt is made
+   to check whether the colons are immediately adjacent.  LOOSE_SCOPE_P
+   indicates whether this relaxation is in effect.  */
 
 static tree
-c_parser_std_attribute (c_parser *parser, bool for_tm)
+c_parser_std_attribute (c_parser *parser, bool for_tm,
+   bool loose_scope_p = false)
 {
   c_token *token = c_parser_peek_token (parser);
   tree ns, name, attribute;
@@ -5406,9 +5414,14 @@ c_parser_std_attribute (c_parser *parser, bool for_tm)
 }
   name = canonicalize_attr_name (token->value);
   c_parser_consume_token (parser);
-  if (c_parser_next_token_is (parser, CPP_SCOPE))
+  if (c_parser_next_token_is (parser, CPP_SCOPE)
+  || (loose_scope_p
+ && c_parser_next_token_is (parser, CPP_COLON)
+ && c_parser_peek_2nd_token (parser)->type == CPP_COLON))
 {
   ns = name;
+  if (c_parser_next_token_is (parser, CPP_COLON))
+   c_parser_consume_token (parser);
   c_parser_consume_token (parser);
   token = c_parser_peek_token (parser);
   if (token->type != CPP_NAME && token->type != CPP_KEYWORD)
@@ -5481,19 +5494,9 @@ c_

Re: [PATCH] tree-optimization/111048 - avoid flawed logic in fold_vec_perm

2023-08-18 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> The following avoids running into somehow flawed logic in fold_vec_perm
> for non-VLA vectors.
>
> Bootstrap & regtest running on x86_64-unknown-linux-gnu.
>
> Richard.
>
>   PR tree-optimization/111048
>   * fold-const.cc (fold_vec_perm_cst): Check for non-VLA
>   vectors first.
>
>   * gcc.dg/torture/pr111048.c: New testcase.

Please don't do this as a permanent thing.  It was a deliberate choice
to have the is_constant be the fallback, so that the "generic" (VLA+VLS)
logic gets more coverage.  Like you say, if something is wrong for VLS
then the chances are that it's also wrong for VLA.
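
For example (an illustration of mine, not the PR testcase), a
fixed-length permutation such as:

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  f (void)
  {
    v4si a = { 0, 1, 2, 3 };
    v4si b = { 4, 5, 6, 7 };
    v4si m = { 0, 5, 2, 7 };
    return __builtin_shuffle (a, b, m);  /* should fold to { 0, 5, 2, 7 } */
  }

should fold to the same constant whichever of the two paths handles it,
so keeping the generic VLA+VLS logic first gives it the extra coverage.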

Thanks,
Richard


> ---
>  gcc/fold-const.cc   | 12 ++--
>  gcc/testsuite/gcc.dg/torture/pr111048.c | 24 
>  2 files changed, 30 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr111048.c
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 5c51c9d91be..144fd7481b3 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10625,6 +10625,11 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, 
> const vec_perm_indices &sel,
>unsigned res_npatterns, res_nelts_per_pattern;
>unsigned HOST_WIDE_INT res_nelts;
>  
> +  if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> +{
> +  res_npatterns = res_nelts;
> +  res_nelts_per_pattern = 1;
> +}
>/* (1) If SEL is a suitable mask as determined by
>   valid_mask_for_fold_vec_perm_cst_p, then:
>   res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> @@ -10634,7 +10639,7 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, 
> const vec_perm_indices &sel,
>   res_npatterns = nelts in result vector.
>   res_nelts_per_pattern = 1.
>   This exception is made so that VLS ARG0, ARG1 and SEL work as before.  
> */
> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> +  else if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
>  {
>res_npatterns
>   = std::max (VECTOR_CST_NPATTERNS (arg0),
> @@ -10648,11 +10653,6 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, 
> const vec_perm_indices &sel,
>  
>res_nelts = res_npatterns * res_nelts_per_pattern;
>  }
> -  else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> -{
> -  res_npatterns = res_nelts;
> -  res_nelts_per_pattern = 1;
> -}
>else
>  return NULL_TREE;
>  
> diff --git a/gcc/testsuite/gcc.dg/torture/pr111048.c 
> b/gcc/testsuite/gcc.dg/torture/pr111048.c
> new file mode 100644
> index 000..475978aae2b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr111048.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-mavx2" { target avx2_runtime } } */
> +
> +typedef unsigned char u8;
> +
> +__attribute__((noipa))
> +static void check(const u8 * v) {
> +if (*v != 15) __builtin_trap();
> +}
> +
> +__attribute__((noipa))
> +static void bug(void) {
> +u8 in_lanes[32];
> +for (unsigned i = 0; i < 32; i += 2) {
> +  in_lanes[i + 0] = 0;
> +  in_lanes[i + 1] = ((u8)0xff) >> (i & 7);
> +}
> +
> +check(&in_lanes[13]);
> +  }
> +
> +int main() {
> +bug();
> +}


Re: [PATCH] c: Add support for [[__extension__ ...]]

2023-08-17 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
>> Am 17.08.2023 um 13:25 schrieb Richard Sandiford via Gcc-patches 
>> :
>> 
>> Joseph Myers  writes:
>>>> On Wed, 16 Aug 2023, Richard Sandiford via Gcc-patches wrote:
>>>> 
>>>> Would it be OK to add support for:
>>>> 
>>>>  [[__extension__ ...]]
>>>> 
>>>> to suppress the pedwarn about using [[]] prior to C2X?  Then we can
>>> 
>>> That seems like a plausible feature to add.
>> 
>> Thanks.  Of course, once I actually tried it, I hit a snag:
>> :: isn't a single lexing token prior to C2X, and so something like:
>> 
>>  [[__extension__ arm::streaming]]
>> 
>> would not be interpreted as a scoped attribute in C11.  The patch
>> gets around that by allowing two colons in place of :: when
>> __extension__ is used.  I realise that's pushing the bounds of
>> acceptability though...
>> 
>> I wondered about trying to require the two colons to be immediately
>> adjacent.  But:
>> 
>> (a) There didn't appear to be an existing API to check that, which seemed
>>like a red flag.  The closest I could find was get_source_text_between.
>
> ISTR a cpp token has ->prev_white or so

Ah, thanks.

  if (c_parser_next_token_is (parser, CPP_SCOPE)
  || (loose_scope_p
  && c_parser_next_token_is (parser, CPP_COLON)
  && c_parser_peek_2nd_token (parser)->type == CPP_COLON
  && !(c_parser_peek_2nd_token (parser)->flags & PREV_WHITE)))

seems to work for (i.e. reject):

typedef int [[__extension__ gnu : : vector_size (4)]] g3;
typedef int [[__extension__ gnu :/**/: vector_size (4)]] g13;

but not:

#define BAR :
typedef int [[__extension__ gnu BAR BAR vector_size (4)]] g5;

#define JOIN(A, B) A/**/B
typedef int [[__extension__ gnu JOIN(:,:) vector_size (4)]] g14;

I now realise the patch was peeking at the wrong token.  Will fix,
and add more tests.

Richard


[PATCH] c: Add support for [[__extension__ ...]]

2023-08-17 Thread Richard Sandiford via Gcc-patches
Joseph Myers  writes:
> On Wed, 16 Aug 2023, Richard Sandiford via Gcc-patches wrote:
>
>> Would it be OK to add support for:
>> 
>>   [[__extension__ ...]]
>> 
>> to suppress the pedwarn about using [[]] prior to C2X?  Then we can
>
> That seems like a plausible feature to add.

Thanks.  Of course, once I actually tried it, I hit a snag:
:: isn't a single lexing token prior to C2X, and so something like:

  [[__extension__ arm::streaming]]

would not be interpreted as a scoped attribute in C11.  The patch
gets around that by allowing two colons in place of :: when
__extension__ is used.  I realise that's pushing the bounds of
acceptability though...

I wondered about trying to require the two colons to be immediately
adjacent.  But:

(a) There didn't appear to be an existing API to check that, which seemed
like a red flag.  The closest I could find was get_source_text_between.

Similarly to that, it would in principle be possible to compare
two expanded locations.  But...

(b) I had a vague impression that locations were allowed to drop column
information for very large inputs (maybe I'm wrong).

(c) It wouldn't cope with token pasting.

So in the end I just used a simple two-token test, like for [[ and ]].

Bootstrapped & regression-tested on aarch64-linux-gnu.

Richard



[[]] attributes are a recent addition to C, but as a GNU extension,
GCC allows them to be used in C11 and earlier.  Normally this use
would trigger a pedwarn (for -pedantic, -Wc11-c2x-compat, etc.).

This patch allows the pedwarn to be suppressed by starting the
attribute-list with __extension__.

Also, :: is not a single lexing token prior to C2X, so it wasn't
possible to use scoped attributes in C11, even as a GNU extension.
The patch allows two colons to be used in place of :: when
__extension__ is used.  No attempt is made to check whether the
two colons are immediately adjacent.

gcc/
* doc/extend.texi: Document the C [[__extension__ ...]] construct.

gcc/c/
* c-parser.cc (c_parser_std_attribute): Conditionally allow
two colons to be used in place of ::.
(c_parser_std_attribute_list): New function, split out from...
(c_parser_std_attribute_specifier): ...here.  Allow the attribute-list
to start with __extension__.  When it does, also allow two colons
to be used in place of ::.

gcc/testsuite/
* gcc.dg/c2x-attr-syntax-6.c: New test.
* gcc.dg/c2x-attr-syntax-7.c: Likewise.
---
 gcc/c/c-parser.cc| 68 ++--
 gcc/doc/extend.texi  | 27 --
 gcc/testsuite/gcc.dg/c2x-attr-syntax-6.c | 50 +
 gcc/testsuite/gcc.dg/c2x-attr-syntax-7.c | 48 +
 4 files changed, 173 insertions(+), 20 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/c2x-attr-syntax-6.c
 create mode 100644 gcc/testsuite/gcc.dg/c2x-attr-syntax-7.c

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 33fe7b115ff..82e56b28446 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -5390,10 +5390,18 @@ c_parser_balanced_token_sequence (c_parser *parser)
  ( balanced-token-sequence[opt] )
 
Keywords are accepted as identifiers for this purpose.
-*/
+
+   As an extension, we permit an attribute-specifier to be:
+
+ [ [ __extension__ attribute-list ] ]
+
+   Two colons are then accepted as a synonym for ::.  No attempt is made
+   to check whether the colons are immediately adjacent.  LOOSE_SCOPE_P
+   indicates whether this relaxation is in effect.  */
 
 static tree
-c_parser_std_attribute (c_parser *parser, bool for_tm)
+c_parser_std_attribute (c_parser *parser, bool for_tm,
+   bool loose_scope_p = false)
 {
   c_token *token = c_parser_peek_token (parser);
   tree ns, name, attribute;
@@ -5406,9 +5414,18 @@ c_parser_std_attribute (c_parser *parser, bool for_tm)
 }
   name = canonicalize_attr_name (token->value);
   c_parser_consume_token (parser);
-  if (c_parser_next_token_is (parser, CPP_SCOPE))
+  if (c_parser_next_token_is (parser, CPP_SCOPE)
+  || (loose_scope_p
+ && c_parser_next_token_is (parser, CPP_COLON)
+ && c_parser_peek_token (parser)->type == CPP_COLON))
 {
   ns = name;
+  if (c_parser_next_token_is (parser, CPP_COLON))
+   {
+ c_parser_consume_token (parser);
+ if (!c_parser_next_token_is (parser, CPP_COLON))
+   gcc_unreachable ();
+   }
   c_parser_consume_token (parser);
   token = c_parser_peek_token (parser);
   if (token->type != CPP_NAME && token->type != CPP_KEYWORD)
@@ -5481,19 +5498,9 @@ c_parser_std_attribute (c_parser *parser, bool for_tm)
 }
 
 static tree
-c_parser_std_attribute_specifier (c_parser *parser, bool for_tm)
+c_parser_std_attribute_list (c_parser *parser, bool for_tm,
+bool loose_scope_p = false)
 {
-  locatio

Re: [PATCH] doc: Fixes to RTL-SSA sample code

2023-08-17 Thread Richard Sandiford via Gcc-patches
Alex Coplan  writes:
> Hi,
>
> This patch fixes up the code examples in the RTL-SSA documentation (the
> sections on making insn changes) to reflect the current API.
>
> The main issues are as follows:
>  - rtl_ssa::recog takes an obstack_watermark & as the first parameter.
>Presumably this is intended to be the change attempt, so I've updated
>the examples to pass this through.
>  - The variants of recog and restrict_movement that take an ignore
>predicate have been renamed with an _ignoring suffix, so I've
>updated callers to use those names.
>  - A couple of minor "obvious" fixes to add a missing address-of
>operator and correct a variable name.
>
> OK for trunk?

OK.  Thanks for doing this.  I'm pretty sure the examples did
compile with one version of the API, but like you say, I forgot
to update it later. :(

Richard

> Thanks,
> Alex
>
> gcc/ChangeLog:
>
>   * doc/rtl.texi: Fix up sample code for RTL-SSA insn changes.
>
> diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> index 76aeafb8f15..0ed88f58821 100644
> --- a/gcc/doc/rtl.texi
> +++ b/gcc/doc/rtl.texi
> @@ -4964,7 +4964,7 @@ the pass should check whether the new pattern matches a 
> target
>  instruction or satisfies the requirements of an inline asm:
>  
>  @smallexample
> -if (!rtl_ssa::recog (change))
> +if (!rtl_ssa::recog (attempt, change))
>return false;
>  @end smallexample
>  
> @@ -5015,7 +5015,7 @@ if (!rtl_ssa::restrict_movement (change))
>  insn_change_watermark watermark;
>  // Use validate_change etc. to change INSN's pattern.
>  @dots{}
> -if (!rtl_ssa::recog (change)
> +if (!rtl_ssa::recog (attempt, change)
>  || !rtl_ssa::change_is_worthwhile (change))
>return false;
>  
> @@ -5048,7 +5048,7 @@ For example, if a pass is changing exactly two 
> instructions,
>  it might do:
>  
>  @smallexample
> -rtl_ssa::insn_change *changes[] = @{ &change1, change2 @};
> +rtl_ssa::insn_change *changes[] = @{ &change1, &change2 @};
>  @end smallexample
>  
>  where @code{change1}'s instruction must come before @code{change2}'s.
> @@ -5066,7 +5066,7 @@ in the correct order with respect to each other.
>  The way to do this is:
>  
>  @smallexample
> -if (!rtl_ssa::restrict_movement (change, insn_is_changing (changes)))
> +if (!rtl_ssa::restrict_movement_ignoring (change, insn_is_changing 
> (changes)))
>return false;
>  @end smallexample
>  
> @@ -5078,7 +5078,7 @@ changing instructions (which might, for example, no 
> longer need
>  to clobber the flags register).  The way to do this is:
>  
>  @smallexample
> -if (!rtl_ssa::recog (change, insn_is_changing (changes)))
> +if (!rtl_ssa::recog_ignoring (attempt, change, insn_is_changing (changes)))
>return false;
>  @end smallexample
>  
> @@ -5118,28 +5118,28 @@ Putting all this together, the process for a 
> two-instruction change is:
>  @smallexample
>  auto attempt = crtl->ssa->new_change_attempt ();
>  
> -rtl_ssa::insn_change change (insn1);
> +rtl_ssa::insn_change change1 (insn1);
>  change1.new_defs = @dots{};
>  change1.new_uses = @dots{};
>  change1.move_range = @dots{};
>  
> -rtl_ssa::insn_change change (insn2);
> +rtl_ssa::insn_change change2 (insn2);
>  change2.new_defs = @dots{};
>  change2.new_uses = @dots{};
>  change2.move_range = @dots{};
>  
> -rtl_ssa::insn_change *changes[] = @{ &change1, change2 @};
> +rtl_ssa::insn_change *changes[] = @{ &change1, &change2 @};
>  
>  auto is_changing = insn_is_changing (changes);
> -if (!rtl_ssa::restrict_movement (change1, is_changing)
> -|| !rtl_ssa::restrict_movement (change2, is_changing))
> +if (!rtl_ssa::restrict_movement_ignoring (change1, is_changing)
> +|| !rtl_ssa::restrict_movement_ignoring (change2, is_changing))
>return false;
>  
>  insn_change_watermark watermark;
>  // Use validate_change etc. to change INSN1's and INSN2's patterns.
>  @dots{}
> -if (!rtl_ssa::recog (change1, is_changing)
> -|| !rtl_ssa::recog (change2, is_changing)
> +if (!rtl_ssa::recog_ignoring (attempt, change1, is_changing)
> +|| !rtl_ssa::recog_ignoring (attempt, change2, is_changing)
>  || !rtl_ssa::changes_are_worthwhile (changes)
>  || !crtl->ssa->verify_insn_changes (changes))
>return false;


Re: [WIP RFC] Add support for keyword-based attributes

2023-08-16 Thread Richard Sandiford via Gcc-patches
Joseph Myers  writes:
> On Mon, 17 Jul 2023, Michael Matz via Gcc-patches wrote:
>
>> So, essentially you want unignorable attributes, right?  Then implement 
>> exactly that: add one new keyword "__known_attribute__" (invent a better 
>> name, maybe :) ), semantics exactly as with __attribute__ (including using 
>> the same underlying lists in our data structures), with only one single 
>> deviation: instead of the warning you give an error for unhandled 
>> attributes.  Done.
>
> Assuming you also want the better-defined standard rules about how [[]] 
> attributes appertain to particular entities, rather than the different 
> __attribute__ rules, that would suggest something like [[!some::attr]] for 
> the case of attributes that can't be ignored but otherwise are handled 
> like standard [[]] attributes.

Yeah, that would work.  But I'd rather not gate the SME work on getting
an extension like that into C and C++.

As it stands, some clang maintainers pushed back against the use of
attributes for important semantics, and preferred keywords instead.
It's clear from these threads that the GCC maintainers prefer attributes
to keywords.  (And it turns out that some other clang maintainers do too,
though not as strongly.)

So I think the easiest way of keeping both constituencies happy(-ish)
is to provide both standard attributes and "keywords", but allow
the "keywords" to be macros that expand to standard attributes.

Would it be OK to add support for:

  [[__extension__ ...]]

to suppress the pedwarn about using [[]] prior to C2X?  Then we can
predefine __arm_streaming to [[__extension__ arm::streaming]], etc.
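
Sketching what that would look like (the function declaration is just a
made-up example):

  #define __arm_streaming [[__extension__ arm::streaming]]

  void f (void) __arm_streaming;  /* no pedwarn even with -std=c11 -pedantic */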

Thanks,
Richard



Re: [PATCH v2][GCC] aarch64: Add support for Cortex-A720 CPU

2023-08-16 Thread Richard Sandiford via Gcc-patches
Richard Ball  writes:
> v2: Add missing PROFILE feature flag.
>
> This patch adds support for the Cortex-A720 CPU to GCC.
>
> No regressions on aarch64-none-elf.
>
> Ok for master?
>
> gcc/ChangeLog:
>
>  * config/aarch64/aarch64-cores.def (AARCH64_CORE): Add Cortex-
>   A720 CPU.
>  * config/aarch64/aarch64-tune.md: Regenerate.
>  * doc/invoke.texi: Document Cortex-A720 CPU.

OK, thanks.

Richard

>
> diff --git a/gcc/config/aarch64/aarch64-cores.def 
> b/gcc/config/aarch64/aarch64-cores.def
> index 
> dbac497ef3aab410eb81db185b2e9532186888bb..73976e9a4c5e4f0b5c04bc7974e2006ddfd02fff
>  100644
> --- a/gcc/config/aarch64/aarch64-cores.def
> +++ b/gcc/config/aarch64/aarch64-cores.def
> @@ -176,6 +176,8 @@ AARCH64_CORE("cortex-a710",  cortexa710, cortexa57, V9A,  
> (SVE2_BITPERM, MEMTAG,
>  
>  AARCH64_CORE("cortex-a715",  cortexa715, cortexa57, V9A,  (SVE2_BITPERM, 
> MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4d, -1)
>  
> +AARCH64_CORE("cortex-a720",  cortexa720, cortexa57, V9_2A,  (SVE2_BITPERM, 
> MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1)
> +
>  AARCH64_CORE("cortex-x2",  cortexx2, cortexa57, V9A,  (SVE2_BITPERM, MEMTAG, 
> I8MM, BF16), neoversen2, 0x41, 0xd48, -1)
>  
>  AARCH64_CORE("cortex-x3",  cortexx3, cortexa57, V9A,  (SVE2_BITPERM, MEMTAG, 
> I8MM, BF16), neoversen2, 0x41, 0xd4e, -1)
> diff --git a/gcc/config/aarch64/aarch64-tune.md 
> b/gcc/config/aarch64/aarch64-tune.md
> index 
> 2170980dddb0d5d410a49631ad26ff2e346b39dd..12d610f0f6580096eed9cf3de8ad3239efde5e4b
>  100644
> --- a/gcc/config/aarch64/aarch64-tune.md
> +++ b/gcc/config/aarch64/aarch64-tune.md
> @@ -1,5 +1,5 @@
>  ;; -*- buffer-read-only: t -*-
>  ;; Generated automatically by gentune.sh from aarch64-cores.def
>  (define_attr "tune"
> - 
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,demeter,neoversev2"
> + 
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,neoversen2,demeter,neoversev2"
>   (const (symbol_ref "((enum attr_tune) aarch64_tune)")))
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 
> 2c870d3c34b587ffc721b1f18f99ecd66d4217be..62537d9d09e25f864c27534b7ac2ec467ea24789
>  100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -20517,7 +20517,8 @@ performance of the code.  Permissible values for this 
> option are:
>  @samp{cortex-a75.cortex-a55}, @samp{cortex-a76.cortex-a55},
>  @samp{cortex-r82}, @samp{cortex-x1}, @samp{cortex-x1c}, @samp{cortex-x2},
>  @samp{cortex-x3}, @samp{cortex-a510}, @samp{cortex-a520}, @samp{cortex-a710},
> -@samp{cortex-a715}, @samp{ampere1}, @samp{ampere1a}, and @samp{native}.
> +@samp{cortex-a715}, @samp{cortex-a720}, @samp{ampere1}, @samp{ampere1a},
> +and @samp{native}.
>  
>  The values @samp{cortex-a57.cortex-a53}, @samp{cortex-a72.cortex-a53},
>  @samp{cortex-a73.cortex-a35}, @samp{cortex-a73.cortex-a53},


Re: [PATCH] IFN: Fix vector extraction into promoted subreg.

2023-08-16 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
>> However:
>> 
>> | #define vec_extract_direct { 3, 3, false }
>> 
>> This looks wrong.  The numbers are argument numbers (or -1 for a return
>> value).  vec_extract only takes 2 arguments, so 3 looks to be out-of-range.
>> 
>> | #define direct_vec_extract_optab_supported_p direct_optab_supported_p
>> 
>> I would expect this to be convert_optab_supported_p.
>> 
>> On the promoted subreg thing, I think expand_vec_extract_optab_fn
>> should use expand_fn_using_insn.
>
> Thanks, really easier that way.  Attached a new version that's currently
> bootstrapping.  Does that look better?

LGTM, thanks.  OK if testing passes.

Richard

> Regards
>  Robin
>
> Subject: [PATCH v2] internal-fn: Fix vector extraction into promoted subreg.
>
> This patch fixes the case where vec_extract gets passed a promoted
> subreg (e.g. from a return value).  This is achieved by using
> expand_convert_optab_fn instead of a separate expander function.
>
> gcc/ChangeLog:
>
>   * internal-fn.cc (vec_extract_direct): Change type argument
>   numbers.
>   (expand_vec_extract_optab_fn): Call convert_optab_fn.
>   (direct_vec_extract_optab_supported_p): Use
>   convert_optab_supported_p.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c: New test.
>   * gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-2u.c: New test.
>   * gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-3u.c: New test.
>   * gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-4u.c: New test.
>   * gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-runu.c: New test.
> ---
>  gcc/internal-fn.cc|  44 +-
>  .../rvv/autovec/vls-vlmax/vec_extract-1u.c|  63 
>  .../rvv/autovec/vls-vlmax/vec_extract-2u.c|  69 +
>  .../rvv/autovec/vls-vlmax/vec_extract-3u.c|  69 +
>  .../rvv/autovec/vls-vlmax/vec_extract-4u.c|  70 +
>  .../rvv/autovec/vls-vlmax/vec_extract-runu.c  | 137 ++
>  6 files changed, 413 insertions(+), 39 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-2u.c
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-3u.c
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-4u.c
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-runu.c
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 4f2b20a79e5..5cce36a789b 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -175,7 +175,7 @@ init_internal_fns ()
>  #define len_store_direct { 3, 3, false }
>  #define mask_len_store_direct { 4, 5, false }
>  #define vec_set_direct { 3, 3, false }
> -#define vec_extract_direct { 3, 3, false }
> +#define vec_extract_direct { 0, -1, false }
>  #define unary_direct { 0, 0, true }
>  #define unary_convert_direct { -1, 0, true }
>  #define binary_direct { 0, 0, true }
> @@ -3127,43 +3127,6 @@ expand_vec_set_optab_fn (internal_fn, gcall *stmt, 
> convert_optab optab)
>gcc_unreachable ();
>  }
>  
> -/* Expand VEC_EXTRACT optab internal function.  */
> -
> -static void
> -expand_vec_extract_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
> -{
> -  tree lhs = gimple_call_lhs (stmt);
> -  tree op0 = gimple_call_arg (stmt, 0);
> -  tree op1 = gimple_call_arg (stmt, 1);
> -
> -  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> -
> -  machine_mode outermode = TYPE_MODE (TREE_TYPE (op0));
> -  machine_mode extract_mode = TYPE_MODE (TREE_TYPE (lhs));
> -
> -  rtx src = expand_normal (op0);
> -  rtx pos = expand_normal (op1);
> -
> -  class expand_operand ops[3];
> -  enum insn_code icode = convert_optab_handler (optab, outermode,
> - extract_mode);
> -
> -  if (icode != CODE_FOR_nothing)
> -{
> -  create_output_operand (&ops[0], target, extract_mode);
> -  create_input_operand (&ops[1], src, outermode);
> -  create_convert_operand_from (&ops[2], pos,
> -TYPE_MODE (TREE_TYPE (op1)), true);
> -  if (maybe_expand_insn (icode, 3, ops))
> - {
> -   if (!rtx_equal_p (target, ops[0].value))
> - emit_move_insn (target, ops[0].value);
> -   return;
> - }
> -}
> -  gcc_unreachable ();
> -}
> -
>  static void
>  expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
>  {
> @@ -3917,6 +3880,9 @@ expand_convert_optab_fn (internal_fn fn, gcall *stmt, 
> convert_optab optab,
>  #define expand_unary_convert_optab_fn(FN, STMT, OPTAB) \
>expand_convert_optab_fn (FN, STMT, OPTAB, 1)
>  
> +#define expand_vec_extract_optab_fn(FN, STMT, OPTAB) \
> +  expand_convert_optab_fn (FN, STMT, OPTAB, 2)
> +
>  /* RETURN_TYPE and ARGS are a return type and argument list that are
> in principle compatible with FN 

Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors

2023-08-16 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
>> Unfortunately, the patch regressed following tests on ppc64le and
>> armhf respectively:
>> gcc.target/powerpc/vec-perm-ctor.c scan-tree-dump-not optimized
>> "VIEW_CONVERT_EXPR"
>> gcc.dg/tree-ssa/forwprop-20.c scan-tree-dump-not forwprop1 "VEC_PERM_EXPR"
>>
>> This happens because of the change to vect_cst_ctor_to_array, which
>> removes handling of VECTOR_CST,
>> and thus we return NULL_TREE for cases where VEC_PERM_EXPR has
>> vector_cst, ctor input operands.
>>
>> For eg we fail to fold VEC_PERM_EXPR for the following test taken from
>> forwprop-20.c:
>> void f (double d, vecf* r)
>> {
>>   vecf x = { -d, 5 };
>>   vecf y = {  1, 4 };
>>   veci m = {  2, 0 };
>>   *r = __builtin_shuffle (x, y, m); // { 1, -d }
>> }
>> because vect_cst_ctor_to_array will now return NULL_TREE for vector_cst {1, 
>> 4}.
>>
>> The attached patch thus reverts the changes to vect_cst_ctor_to_array,
>> which makes the tests pass again.
>> I have put the patch for another round of bootstrap+test on the above
>> targets (aarch64, aarch64-sve, x86_64, armhf, ppc64le).
>> OK to commit if it passes ?
> The patch now passes bootstrap+test on all these targets.

OK, thanks.

Richard


Re: [PATCH] IFN: Fix vector extraction into promoted subreg.

2023-08-16 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Hi, Robin, Richard and Richi.
>
> I am wondering whether we can just simply replace the VEC_EXTRACT expander 
> with binary?
>
> Like this :?
>
> DEF_INTERNAL_OPTAB_FN (VEC_EXTRACT, ECF_CONST | ECF_NOTHROW,
> -  vec_extract, vec_extract)
> +  vec_extract, binary)
>
> to fix the sign extend issue.
>
> And remove the vec_extract explicit expander in internal-fn.cc ?

I'm not sure how that would work.  The vec_extract optab takes two
modes whereas binary optabs take one mode.

However:

| #define vec_extract_direct { 3, 3, false }

This looks wrong.  The numbers are argument numbers (or -1 for a return
value).  vec_extract only takes 2 arguments, so 3 looks to be out-of-range.

| #define direct_vec_extract_optab_supported_p direct_optab_supported_p

I would expect this to be convert_optab_supported_p.

On the promoted subreg thing, I think expand_vec_extract_optab_fn
should use expand_fn_using_insn.
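
That is, something along these lines (just a sketch of the suggestion,
not a tested patch):

  #define vec_extract_direct { 0, -1, false }
  #define direct_vec_extract_optab_supported_p convert_optab_supported_p

with the type arguments being the vector input (0) and the return
value (-1).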

Thanks,
Richard


Re: [PATCH] Handle TYPE_OVERFLOW_UNDEFINED vectorized BB reductions

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> The following changes the gate to perform vectorization of BB reductions
> to use needs_fold_left_reduction_p which in turn requires handling
> TYPE_OVERFLOW_UNDEFINED types in the epilogue code generation by
> promoting any operations generated there to use unsigned arithmetic.
>
> The following does this, there's currently only v16qi where x86
> supports a .REDUC_PLUS reduction for integral modes so I had to
> add a x86 specific testcase using GIMPLE IL.
>
> Bootstrap and regtest ongoing on x86_64-unknown-linux-gnu.

LGTM FWIW.

> The next plan is to remove the restriction to .REDUC_PLUS, factoring
> out some of the general non-ifn way of doing a reduction epilog
> from loop reduction handling.  I had a stab at doing in-order
> reductions already but then those are really too similar to
> having general SLP discovery from N scalar defs (and then replacing
> those with extracts), at least since there's no
> fold_left_plus that doesn't add to an existing scalar I can't
> seem to easily just handle that case, possibly discovering
> { x_0, x_1, ..., x_n-1 }, extracting x_0, shifting the vector
> to { x_1, ..., x_n-1,  } and using mask_fold_left_plus
> with accumulating to x_0 and the  element masked would do.
> But I'm not sure that's worth the trouble?

Yeah, I doubt it.  I don't think SVE's FADDA is expected to be an
optimisation in its own right.  It's more of an enabler.

Another reason to use it in loops is that it's VLA-friendly.
But that wouldn't be an issue here.
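
The loop case it enables is the usual in-order FP reduction, e.g.
(my illustration):

  float
  fold_left_sum (float *x, int n)
  {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
      s += x[i];  /* must stay in order without -ffast-math;
                     FADDA lets the loop vectoriser keep that order.  */
    return s;
  }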

Thanks,
Richard

> In principle with generic N scalar defs we could do a forward
> discovery from grouped loads and see where that goes (and of
> course handle in-order reductions that way).
>
>   * tree-vect-slp.cc (vect_slp_check_for_roots): Use
>   !needs_fold_left_reduction_p to decide whether we can
>   handle the reduction with association.
>   (vectorize_slp_instance_root_stmt): For TYPE_OVERFLOW_UNDEFINED
>   reductions perform all arithmetic in an unsigned type.
>
>   * gcc.target/i386/vect-reduc-2.c: New testcase.
> ---
>  gcc/testsuite/gcc.target/i386/vect-reduc-2.c | 77 
>  gcc/tree-vect-slp.cc | 44 +++
>  2 files changed, 107 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-reduc-2.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-2.c 
> b/gcc/testsuite/gcc.target/i386/vect-reduc-2.c
> new file mode 100644
> index 000..62559ef8e7b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-2.c
> @@ -0,0 +1,77 @@
> +/* { dg-do compile } */
> +/* { dg-options "-fgimple -O2 -msse2 -fdump-tree-slp2-optimized" } */
> +
> +signed char x[16];
> +
> +signed char __GIMPLE (ssa,guessed_local(1073741824))
> +foo ()
> +{
> +  signed char _1;
> +  signed char _3;
> +  signed char _5;
> +  signed char _6;
> +  signed char _8;
> +  signed char _9;
> +  signed char _11;
> +  signed char _12;
> +  signed char _14;
> +  signed char _15;
> +  signed char _17;
> +  signed char _18;
> +  signed char _20;
> +  signed char _21;
> +  signed char _23;
> +  signed char _24;
> +  signed char _26;
> +  signed char _27;
> +  signed char _29;
> +  signed char _30;
> +  signed char _32;
> +  signed char _33;
> +  signed char _35;
> +  signed char _36;
> +  signed char _38;
> +  signed char _39;
> +  signed char _41;
> +  signed char _42;
> +  signed char _44;
> +  signed char _45;
> +  signed char _47;
> +
> +  __BB(2,guessed_local(1073741824)):
> +  _1 = x[15];
> +  _3 = x[1];
> +  _5 = _1 + _3;
> +  _6 = x[2];
> +  _8 = _5 + _6;
> +  _9 = x[3];
> +  _11 = _8 + _9;
> +  _12 = x[4];
> +  _14 = _11 + _12;
> +  _15 = x[5];
> +  _17 = _14 + _15;
> +  _18 = x[6];
> +  _20 = _17 + _18;
> +  _21 = x[7];
> +  _23 = _20 + _21;
> +  _24 = x[8];
> +  _26 = _23 + _24;
> +  _27 = x[9];
> +  _29 = _26 + _27;
> +  _30 = x[10];
> +  _32 = _29 + _30;
> +  _33 = x[11];
> +  _35 = _32 + _33;
> +  _36 = x[12];
> +  _38 = _35 + _36;
> +  _39 = x[13];
> +  _41 = _38 + _39;
> +  _42 = x[14];
> +  _44 = _41 + _42;
> +  _45 = x[0];
> +  _47 = _44 + _45;
> +  return _47;
> +
> +}
> +
> +/* { dg-final { scan-tree-dump "optimized: basic block part vectorized" 
> "slp2" } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 7020bd9fa0e..07d68f2052b 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -7217,13 +7217,10 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo)
>   }
>else if (!VECTOR_TYPE_P (TREE_TYPE (rhs))
>  && (associative_tree_code (code) || code == MINUS_EXPR)
> -/* ???  The flag_associative_math and TYPE_OVERFLOW_WRAPS
> -   checks pessimize a two-element reduction.  PR54400.
> +/* ???  This pessimizes a two-element reduction.  PR54400.
> ???  In-order reduction could be handled if we only
> traverse one operand chain in vect_slp_linearize_chain.  */
> -&& ((FLOAT_TYPE_P 

Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
>> OK, fair enough.  So the idea is: see where we end up and then try to
>> improve/factor the APIs in a less peephole way?
>
> Yeah, I think that's the only good way forward.

OK, no objection from me.  Sorry for holding the patch up.

Richard


Re: [PATCH V4] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Tue, 15 Aug 2023, Richard Sandiford wrote:
>
>> Richard Biener  writes:
>> > On Tue, 15 Aug 2023, Kewen.Lin wrote:
>> >
>> >> Hi Stefan,
>> >> 
>> >> on 2023/8/15 02:51, Stefan Schulze Frielinghaus wrote:
>> >> > Hi everyone,
>> >> > 
>> >> > I have bootstrapped and regtested the patch below on s390.  For the
>> >> > 64-bit target I do not see any changes regarding the testsuite.  For the
>> >> > 31-bit target I see the following failures:
>> >> > 
>> >> > FAIL: gcc.dg/vect/no-scevccp-outer-14.c (internal compiler error: in 
>> >> > require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/no-scevccp-outer-14.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr50451.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr50451.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (internal compiler 
>> >> > error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/pr53773.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr53773.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (internal compiler 
>> >> > error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/pr71407.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr71407.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (internal compiler 
>> >> > error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/pr71416-1.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr71416-1.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (internal 
>> >> > compiler error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/pr94443.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr94443.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (internal compiler 
>> >> > error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/pr97558.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr97558.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (internal compiler 
>> >> > error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects 
>> >> > (internal compiler error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects (test 
>> >> > for excess errors)
>> >> > UNRESOLVED: gcc.dg/vect/no-scevccp-outer-14.c compilation failed to 
>> >> > produce executable
>> >> > UNRESOLVED: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects  
>> >> > scan-tree-dump-times optimized "\\* 10" 2
>> >> > UNRESOLVED: gcc.dg/vect/pr53773.c scan-tree-dump-times optimized "\\* 
>> >> > 10" 2
>> >> > UNRESOLVED: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects compilation 
>> >> > failed to produce executable
>> >> > UNRESOLVED: gcc.dg/vect/pr71416-1.c compilation failed to produce 
>> >> > executable
>> >> > UNRESOLVED: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects 
>> >> > compilation failed to produce executable
>> >> > 
>> >> > I've randomly picked pr50451.c and ran gcc against it which results in:
>> >> > 
>> >> > during GIMPLE pass: vect
>> >> > dump file: pr50451.c.174t.vect
>> >> > /gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c: In 
>> >> > function 'foo':
>> >> > /gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c:5:1:
>> >> >  internal compiler error: in require, at machmode.h:313
>> >> > 0x1265d21 opt_mode::require() const
>> >> > /gcc-verify-workdir/patched/src/gcc/machmode.h:313
>> >> > 0x1d7e4e9 opt_mode::require() const
>> >> > /gcc-verify-workdir/patched/src/gcc/vec.h:955
>> >> > 0x1d7e4e9 vect_verify_loop_lens
>> >> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:1471
>> >> > 0x1da29ab vect_analyze_loop_2
>> >> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:2929
>> >> > 0x1da40c7 vect_analyze_loop_1
>> >> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3330
>> >> > 0x1da499d vect_analyze_loop(loop*, vec_info_shared*)
>> >> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3484
>> >> 

Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:
>> On Mon, 7 Aug 2023 at 13:19, Richard Biener  
>> wrote:
>> > It doesn't seem to make a difference for x86.  That said, the "fix" is
>> > probably sticking the correct target on the dump-check, it seems
>> > that vect_fold_extract_last is no longer correct here.
>> Um sorry, I did go thru various checks in target-supports.exp, but not
>> sure which one will be appropriate for this case,
>> and am stuck here :/ Could you please suggest how to proceed ?
>
> Maybe Richard S. knows the magic thing to test, he originally
> implemented the direct conversion support.  I suggest to implement
> such dg-checks if they are not present (I can't find them),
> possibly quite specific to the modes involved (like we have
> other checks with _qi_to_hi suffixes, for float modes maybe
> just _float).

Yeah, can't remember specific selectors for that feature.  TBH I think
most (all?) of the tests were AArch64-specific.

Thanks,
Richard


Re: [PATCH V4] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Tue, 15 Aug 2023, Kewen.Lin wrote:
>
>> Hi Stefan,
>> 
>> on 2023/8/15 02:51, Stefan Schulze Frielinghaus wrote:
>> > Hi everyone,
>> > 
>> > I have bootstrapped and regtested the patch below on s390.  For the
>> > 64-bit target I do not see any changes regarding the testsuite.  For the
>> > 31-bit target I see the following failures:
>> > 
>> > FAIL: gcc.dg/vect/no-scevccp-outer-14.c (internal compiler error: in 
>> > require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/no-scevccp-outer-14.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr50451.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr50451.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/pr53773.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr53773.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/pr71407.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr71407.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/pr71416-1.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr71416-1.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/pr94443.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr94443.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/pr97558.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr97558.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects (internal 
>> > compiler error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects (test for 
>> > excess errors)
>> > UNRESOLVED: gcc.dg/vect/no-scevccp-outer-14.c compilation failed to 
>> > produce executable
>> > UNRESOLVED: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects  
>> > scan-tree-dump-times optimized "\\* 10" 2
>> > UNRESOLVED: gcc.dg/vect/pr53773.c scan-tree-dump-times optimized "\\* 10" 2
>> > UNRESOLVED: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects compilation 
>> > failed to produce executable
>> > UNRESOLVED: gcc.dg/vect/pr71416-1.c compilation failed to produce 
>> > executable
>> > UNRESOLVED: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects 
>> > compilation failed to produce executable
>> > 
>> > I've randomly picked pr50451.c and ran gcc against it which results in:
>> > 
>> > during GIMPLE pass: vect
>> > dump file: pr50451.c.174t.vect
>> > /gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c: In 
>> > function 'foo':
>> > /gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c:5:1: 
>> > internal compiler error: in require, at machmode.h:313
>> > 0x1265d21 opt_mode::require() const
>> > /gcc-verify-workdir/patched/src/gcc/machmode.h:313
>> > 0x1d7e4e9 opt_mode::require() const
>> > /gcc-verify-workdir/patched/src/gcc/vec.h:955
>> > 0x1d7e4e9 vect_verify_loop_lens
>> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:1471
>> > 0x1da29ab vect_analyze_loop_2
>> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:2929
>> > 0x1da40c7 vect_analyze_loop_1
>> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3330
>> > 0x1da499d vect_analyze_loop(loop*, vec_info_shared*)
>> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3484
>> > 0x1deed27 try_vectorize_loop_1
>> > /gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1064
>> > 0x1deed27 try_vectorize_loop
>> > /gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1180
>> > 0x1def5c1 execute
>> > /gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1296
>> > Please submit a full bug report, with preprocessed 

Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Tue, Aug 15, 2023 at 4:44 AM Kewen.Lin  wrote:
>>
>> on 2023/8/14 22:16, Richard Sandiford wrote:
>> > No, it was more that 219-142=77, so it seems like a lot of lines
>> > are being duplicated rather than simply being moved.  (Unlike for
>> > VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
>> > was a clear improvement.)
>> >
>> > So I was just wondering if there was any obvious factoring-out that
>> > could be done to reduce the duplication.
>>
>> ah, thanks for the clarification!
>>
>> I think the main duplication are on the loop body beginning and end,
>> let's take a look at them in details:
>>
>> +  if (memory_access_type == VMAT_GATHER_SCATTER)
>> +{
>> +  gcc_assert (alignment_support_scheme == dr_aligned
>> + || alignment_support_scheme == dr_unaligned_supported);
>> +  gcc_assert (!grouped_load && !slp_perm);
>> +
>> +  unsigned int inside_cost = 0, prologue_cost = 0;
>>
>> // These above are newly added.
>>
>> +  for (j = 0; j < ncopies; j++)
>> +   {
>> + /* 1. Create the vector or array pointer update chain.  */
>> + if (j == 0 && !costing_p)
>> +   {
>> + if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +   vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>> +slp_node, &gs_info, &dataref_ptr,
>> +&vec_offsets);
>> + else
>> +   dataref_ptr
>> + = vect_create_data_ref_ptr (vinfo, first_stmt_info, 
>> aggr_type,
>> + at_loop, offset, &dummy, gsi,
>> + &ptr_incr, false, bump);
>> +   }
>> + else if (!costing_p)
>> +   {
>> + gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>> + if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>> +  gsi, stmt_info, bump);
>> +   }
>>
>> // These are for dataref_ptr, in the final loop nest we deal with more cases
>> on simd_lane_access_p and diff_first_stmt_info, but don't handle
>> STMT_VINFO_GATHER_SCATTER_P any more, very few (one case) can be shared 
>> between,
>> IMHO factoring it out seems like overkill.
>>
>> +
>> + if (mask && !costing_p)
>> +   vec_mask = vec_masks[j];
>>
>> // It's merged out from j == 0 and j != 0
>>
>> +
>> + gimple *new_stmt = NULL;
>> + for (i = 0; i < vec_num; i++)
>> +   {
>> + tree final_mask = NULL_TREE;
>> + tree final_len = NULL_TREE;
>> + tree bias = NULL_TREE;
>> + if (!costing_p)
>> +   {
>> + if (loop_masks)
>> +   final_mask
>> + = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
>> +   vec_num * ncopies, vectype,
>> +   vec_num * j + i);
>> + if (vec_mask)
>> +   final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
>> +  final_mask, vec_mask, 
>> gsi);
>> +
>> + if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, 
>> ptr_incr,
>> +  gsi, stmt_info, bump);
>> +   }
>>
>> // This part is directly copied from the original, the original gets updated 
>> by
>> removing && !STMT_VINFO_GATHER_SCATTER_P.  Due to its size, I didn't consider
>> this before, do you prefer me to factor this part out?
>>
>> + if (gs_info.ifn != IFN_LAST)
>> +   {
>> ...
>> +   }
>> + else
>> +   {
>> + /* Emulated gather-scatter.  */
>> ...
>>
>> // This part is just moved from the original.
>>
>> + vec_dest = vect_create_destination_var (scalar_dest, vectype);
>> + /* DATA_REF is null if we've already built the statement.  */
>> + if (data_ref)
>> +   {
>> + vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
>> + new_stmt = gimple_build_assign (vec_dest, data_ref);
>> +   }
>> + new_temp = make_ssa_name (vec_dest, new_stmt);
>> + gimple_set_lhs (new_stmt, new_temp);
>> + vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
>> +
>> + /* Store vector loads in the corresponding SLP_NODE.  */
>> + if (slp)
>> +   slp_node->push_vec_def (new_stmt);
>> +
>> + if (!slp && !costing_p)
>> +   STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +   }
>> +
>> +  if (!slp && !costing_p)
>> +   

Re: [PATCH] Add support for vector conitional not

2023-08-14 Thread Richard Sandiford via Gcc-patches
Andrew Pinski via Gcc-patches  writes:
> Like the support conditional neg (r12-4470-g20dcda98ed376cb61c74b2c71),
> this just adds conditional not too.
> Also we should be able to turn `(a ? -1 : 0) ^ b` into a conditional
> not.
>
> OK? Bootstrapped and tested on x86_64-linux-gnu and aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>   * internal-fn.def (COND_NOT): New internal function.
>   * match.pd (UNCOND_UNARY, COND_UNARY): Add bit_not/not
>   to the lists.
>   (`vec (a ? -1 : 0) ^ b`): New pattern to convert
>   into conditional not.
>   * optabs.def (cond_one_cmpl): New optab.
>   (cond_len_one_cmpl): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/110986
>   * gcc.target/aarch64/sve/cond_unary_9.c: New test.
> ---
>  gcc/internal-fn.def   |  2 ++
>  gcc/match.pd  | 15 --
>  gcc/optabs.def|  2 ++
>  .../gcc.target/aarch64/sve/cond_unary_9.c | 20 +++
>  4 files changed, 37 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cond_unary_9.c
>
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index b3c410f4b6a..3e8693dfddb 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -69,6 +69,7 @@ along with GCC; see the file COPYING3.  If not see
>   lround2.
>  
> - cond_binary: a conditional binary optab, such as cond_add
> +   - cond_unary: a conditional unary optab, such as cond_neg
> - cond_ternary: a conditional ternary optab, such as cond_fma_rev
>  
> - fold_left: for scalar = FN (scalar, vector), keyed off the vector mode
> @@ -276,6 +277,7 @@ DEF_INTERNAL_COND_FN (FNMA, ECF_CONST, fnma, ternary)
>  DEF_INTERNAL_COND_FN (FNMS, ECF_CONST, fnms, ternary)
>  
>  DEF_INTERNAL_COND_FN (NEG, ECF_CONST, neg, unary)
> +DEF_INTERNAL_COND_FN (NOT, ECF_CONST, one_cmpl, unary)
>  
>  DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
>  
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 6791060891d..2ee6d24ccee 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -84,9 +84,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  
>  /* Unary operations and their associated IFN_COND_* function.  */
>  (define_operator_list UNCOND_UNARY
> -  negate)
> +  negate bit_not)
>  (define_operator_list COND_UNARY
> -  IFN_COND_NEG)
> +  IFN_COND_NEG IFN_COND_NOT)
>  
>  /* Binary operations and their associated IFN_COND_* function.  */
>  (define_operator_list UNCOND_BINARY
> @@ -8482,6 +8482,17 @@ and,
>  && is_truth_type_for (op_type, TREE_TYPE (@0)))
>   (cond_op (bit_not @0) @2 @1)
>  
> +/* `(a ? -1 : 0) ^ b` can be converted into a conditional not.  */
> +(simplify
> + (bit_xor:c (vec_cond @0 uniform_integer_cst_p@1 uniform_integer_cst_p@2) @3)
> + (if (canonicalize_math_after_vectorization_p ()
> +  && vectorized_internal_fn_supported_p (IFN_COND_NOT, type)
> +  && is_truth_type_for (type, TREE_TYPE (@0)))
> + (if (integer_all_onesp (@1) && integer_zerop (@2))
> +  (IFN_COND_NOT @0 @3 @3))
> +  (if (integer_all_onesp (@2) && integer_zerop (@1))
> +   (vec_cond (bit_not @0) @3 @3

Looks like this should be IFN_COND_NOT rather than vec_cond.

LGTM otherwise, but please give Richi 24hrs to comment.

Thanks,
Richard

> +
>  /* Simplify:
>  
>   a = a1 op a2
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 1ea1947b3b5..a58819bc665 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -254,6 +254,7 @@ OPTAB_D (cond_fms_optab, "cond_fms$a")
>  OPTAB_D (cond_fnma_optab, "cond_fnma$a")
>  OPTAB_D (cond_fnms_optab, "cond_fnms$a")
>  OPTAB_D (cond_neg_optab, "cond_neg$a")
> +OPTAB_D (cond_one_cmpl_optab, "cond_one_cmpl$a")
>  OPTAB_D (cond_len_add_optab, "cond_len_add$a")
>  OPTAB_D (cond_len_sub_optab, "cond_len_sub$a")
>  OPTAB_D (cond_len_smul_optab, "cond_len_mul$a")
> @@ -278,6 +279,7 @@ OPTAB_D (cond_len_fms_optab, "cond_len_fms$a")
>  OPTAB_D (cond_len_fnma_optab, "cond_len_fnma$a")
>  OPTAB_D (cond_len_fnms_optab, "cond_len_fnms$a")
>  OPTAB_D (cond_len_neg_optab, "cond_len_neg$a")
> +OPTAB_D (cond_len_one_cmpl_optab, "cond_len_one_cmpl$a")
>  OPTAB_D (cmov_optab, "cmov$a6")
>  OPTAB_D (cstore_optab, "cstore$a4")
>  OPTAB_D (ctrap_optab, "ctrap$a4")
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_9.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_9.c
> new file mode 100644
> index 000..d6bc0409630
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_9.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -moverride=sve_width=256 
> -fdump-tree-optimized" } */
> +
> +/* This is a reduced version of cond_unary_5.c */
> +
> +void __attribute__ ((noipa))
> +f (short *__restrict r,
> +   short *__restrict a,
> +   short *__restrict pred)
> +{
> +  for (int i = 0; i < 1024; ++i)
> +r[i] = pred[i] != 0 ? ~(a[i]) : a[i];
> +}
> +
> +/* { dg-final { scan-assembler-times 

Re: [RFC] GCC Security policy

2023-08-14 Thread Richard Sandiford via Gcc-patches
I think it would help to clarify what the aim of the security policy is.
Specifically:

(1) What service do we want to provide to users by classifying one thing
as a security bug and another thing as not a security bug?

(2) What service do we want to provide to the GNU community by the same
classification?

I think it will be easier to agree on the classification if we first
agree on that.

Siddhesh Poyarekar  writes:
> Hi,
>
> Here's the updated draft of the top part of the security policy with all 
> of the recommendations incorporated.
>
> Thanks,
> Sid
>
>
> What is a GCC security bug?
> ===
>
>  A security bug is one that threatens the security of a system or
>  network, or might compromise the security of data stored on it.
>  In the context of GCC there are multiple ways in which this might
>  happen and they're detailed below.
>
> Compiler drivers, programs, libgccjit and support libraries
> ------------------------------------------------------------
>
>  The compiler driver processes source code, invokes other programs
>  such as the assembler and linker and generates the output result,
>  which may be assembly code or machine code.  It is necessary that
>  all source code inputs to the compiler are trusted, since it is
>  impossible for the driver to validate input source code beyond
>  conformance to a programming language standard.
>
>  The GCC JIT implementation, libgccjit, is intended to be plugged
>  into applications to translate input source code in the application
>  context.  The limitations that apply to the compiler driver apply
>  here too in terms of sanitizing inputs, so it is recommended that
>  inputs are either sanitized by an external program to allow only
>  trusted, safe execution in the context of the application, or that
>  the JIT execution context is appropriately sandboxed
>  to contain the effects of any bugs in the JIT or its generated code
>  to the sandboxed environment.
>
>  Support libraries such as libiberty, libcc1, libvtv and libcpp have
>  been developed separately to share code with other tools such as
>  binutils and gdb.  These libraries again have similar challenges to
>  compiler drivers.  While they are expected to be robust against
>  arbitrary input, they should only be used with trusted inputs.
>
>  Libraries such as zlib that are bundled into GCC to build it will
>  be treated the same as the compiler drivers and programs as far as
>  security coverage is concerned.  However, if you find an issue in
>  these libraries independent of their use in GCC, you should reach
>  out to their upstream projects to report it.
>
>  As a result, the only potential security issue in all these cases
>  is when they end up generating vulnerable output for valid input
>  source code.
>
>  As a result, the only case for a potential security issue in the
>  compiler is when it generates vulnerable application code for
>  trusted input source code that is conforming to the relevant
>  programming standard or extensions documented as supported by GCC
>  and the algorithm expressed in the source code does not have the
>  vulnerability.  The output application code could be considered
>  vulnerable if it produces an actual vulnerability in the target
>  application, specifically in the following cases:
>
>  - The application dereferences an invalid memory location despite
>    the application sources being valid.
>  - The application reads from or writes to a valid but incorrect
>    memory location, resulting in an information integrity issue or an
>    information leak.
>  - The application ends up running in an infinite loop or with
>    severe degradation in performance despite the input sources having
>    no such issue, resulting in a Denial of Service.  Note that
>    correct but non-performant code is not a security issue candidate;
>    this only applies to incorrect code that may result in performance
>    degradation severe enough to amount to a denial of service.
>  - The application crashes due to the generated incorrect code,
>    resulting in a Denial of Service.

One difficulty is that wrong-code bugs are rarely confined to
a particular source code structure.  Something that causes a
miscompilation of a bounds check could later be discovered to cause a
miscompilation of something that is less obviously security-sensitive.
Or the same thing could happen in reverse.  And it's common for the
same bug to be reported multiple times, against different testcases.
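
To make that concrete, here is a hedged, hypothetical C example (not
taken from any GCC bug report): the same miscompiled comparison could
hit either function below, but only one of them is obviously
security-sensitive.

/* Hypothetical illustration: a wrong-code bug that drops or inverts
   the `len > dst_size' check turns copy_field into a buffer overflow,
   while the same bug hitting the equivalent check in clamp_index
   merely produces a wrong return value.  */
#include <stddef.h>
#include <string.h>

int
copy_field (char *dst, size_t dst_size, const char *src, size_t len)
{
  if (len > dst_size)
    return -1;
  memcpy (dst, src, len);
  return 0;
}

size_t
clamp_index (size_t i, size_t limit)
{
  if (i > limit)
    return limit;
  return i;
}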

The proposal says that certain kinds of wrong code could be a security
bug.  But what will be the criteria for deciding whether a wrong code
bug that *could* be classified as a security bug is in fact a security
bug?  Does someone have to show that at least one 

Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-14 Thread Richard Sandiford via Gcc-patches
"Kewen.Lin"  writes:
> Hi Richard,
>
> on 2023/8/14 20:20, Richard Sandiford wrote:
>> Thanks for the clean-ups.  But...
>> 
>> "Kewen.Lin"  writes:
>>> Hi,
>>>
>>> Following Richi's suggestion [1], this patch moves the
>>> handlings on VMAT_GATHER_SCATTER in the final loop nest
>>> of function vectorizable_load to their own loop.  Basically
>>> it duplicates the final loop nest, cleans up some useless
>>> set-up code for the case of VMAT_GATHER_SCATTER, and removes
>>> some unreachable code.  It also removes the corresponding
>>> handlings from the original final loop nest.
>>>
>>> Bootstrapped and regtested on x86_64-redhat-linux,
>>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>>>
>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
>>>
>>> Is it ok for trunk?
>>>
>>> BR,
>>> Kewen
>>> -
>>>
>>> gcc/ChangeLog:
>>>
>>> * tree-vect-stmts.cc (vectorizable_load): Move the handlings on
>>> VMAT_GATHER_SCATTER in the final loop nest to its own loop,
>>> and update the final nest accordingly.
>>> ---
>>>  gcc/tree-vect-stmts.cc | 361 +
>>>  1 file changed, 219 insertions(+), 142 deletions(-)
>> 
>> ...that seems like quite a lot of +s.  Is there nothing we can do to
>> avoid the cut-&-paste?
>
> Thanks for the comments!  I'm not sure I get your question: if we
> want to move out the handlings of VMAT_GATHER_SCATTER, the new +s seem
> inevitable?  Is your concern mainly about git blame history?

No, it was more that 219-142=77, so it seems like a lot of lines
are being duplicated rather than simply being moved.  (Unlike for
VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
was a clear improvement.)

So I was just wondering if there was any obvious factoring-out that
could be done to reduce the duplication.

Thanks,
Richard



