Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-05-31 Thread Richard Sandiford
Reviewing my review :)

Richard Sandiford  writes:
>> +
>> +  for (auto def : info->defs ())
>> +{
>> +  auto set = dyn_cast<set_info *> (def);
>> +  if (set && set->has_any_uses ())
>> +{
>> +  for (auto use : set->all_uses())
>
> Nit: has_any_uses isn't necessary: the inner loop will simply do nothing
> in that case.  Also, we can/should restrict the scan to non-debug uses.
>
> This can then be:
>
>   for (auto def : info->defs ())
> if (auto set = dyn_cast<set_info *> (def))
>   for (auto use : set->nondebug_insn_uses())

I forgot the space before "()" in the line above.

>
>> +{
>> +  if (use->insn ()->is_artificial ())
>> +return false;
>> +
>> +   insn_info *info = use->insn ();
>> +
>> +   if (info
>> +   && info->rtl ()
>
> This test shouldn't be necessary.
>
>> +   && info->is_real ())
>> +  {
>> +rtx_insn *rtl_insn = info->rtl ();
>> +rtx set = single_set (rtl_insn);
>> +
>> +if (set == NULL_RTX)
>> +  return false;
>> +
>> +rtx op0 = SET_SRC (set);
>> +if (GET_CODE (op0) != UNSPEC)
>> +  return false;
> [...]
> Also, using single_set means that the function still lets through
> parallels of two sets in which the sources are unspecs.  Is that
> intentional?

I got this wrong, sorry.  You return false for non-single_set,
so that particular problem doesn't arise.  But why do we want to
reject uses of registers that are set by parallel sets?

Thanks,
Richard


Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-05-31 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello All:
>
> Common infrastructure using generic code for pair mem fusion of different
> targets.
>
> rs6000 target specific code implements virtual functions defined
> by generic code.
>
> Code is implemented with pure virtual functions to interface with target
> code.
>
> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
> function implementation required for rs6000 are added in 
> aarch64-ldp-fusion.cc.
>
> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>
> Thanks & Regards
> Ajit
>
>
> aarch64, rs6000, middle-end: Add implementation for different targets for 
> pair mem fusion
>
> Common infrastructure using generic code for pair mem fusion of different
> targets.
>
> rs6000 target specific code implements virtual functions defined
> by generic code.
>
> Code is implemented with pure virtual functions to interface with target
> code.
>
> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
> function implementation required for rs6000 are added in 
> aarch64-ldp-fusion.cc.
>
> 2024-05-31  Ajit Kumar Agarwal  
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>   implementation of additional virtual functions added in pair_fusion
>   struct.
>   * config/rs6000/rs6000-passes.def: New mem fusion pass
>   before pass_early_remat.
>   * config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>   Add target specific implementation using pure virtual
>   functions.
>   * config.gcc: Add new object file.
>   * config/rs6000/rs6000-protos.h: Add new prototype for mem
>   fusion pass.
>   * config/rs6000/t-rs6000: Add new rule.
>   * rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>   from private.
>
> gcc/testsuite/ChangeLog:
>
>   * g++.target/powerpc/me-fusion.C: New test.
>   * g++.target/powerpc/mem-fusion-1.C: New test.
>   * gcc.target/powerpc/mma-builtin-1.c: Modify test.
> ---

This isn't a complete review, just some initial questions & comments
about selected parts.

> [...]
> +/* Check whether load can be fusable or not.
> +   Return true if dependent use is UNSPEC otherwise false.  */
> +bool
> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
> +{
> +  rtx_insn *insn = info->rtl ();
> +
> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
> +if (REG_NOTE_KIND (note) == REG_EQUAL
> + || REG_NOTE_KIND (note) == REG_EQUIV)
> +  return false;

It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
note.  What's the reason for doing this?  Are you trying to avoid
fusing pairs before reload that are equivalent to a MEM (i.e. have
a natural spill slot)?  I think Alex hit a similar situation.

> +
> +  for (auto def : info->defs ())
> +{
> +  auto set = dyn_cast<set_info *> (def);
> +  if (set && set->has_any_uses ())
> + {
> +   for (auto use : set->all_uses())

Nit: has_any_uses isn't necessary: the inner loop will simply do nothing
in that case.  Also, we can/should restrict the scan to non-debug uses.

This can then be:

  for (auto def : info->defs ())
> if (auto set = dyn_cast<set_info *> (def))
  for (auto use : set->nondebug_insn_uses())

> + {
> +   if (use->insn ()->is_artificial ())
> + return false;
> +
> +insn_info *info = use->insn ();
> +
> +if (info
> +&& info->rtl ()

This test shouldn't be necessary.

> +&& info->is_real ())
> +   {
> + rtx_insn *rtl_insn = info->rtl ();
> + rtx set = single_set (rtl_insn);
> +
> + if (set == NULL_RTX)
> +   return false;
> +
> + rtx op0 = SET_SRC (set);
> + if (GET_CODE (op0) != UNSPEC)
> +   return false;

What's the motivation for rejecting unspecs?  It's unusual to treat
all unspecs as a distinct group.

Also, using single_set means that the function still lets through
parallels of two sets in which the sources are unspecs.  Is that
intentional?

The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
need to be described in comments, so that other people coming to this
code later can understand the motivation.  The same thing applies to
other decisions in the patch.

> +   }
> +   }
> +   }
> +}
> +  return true;
> +}
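
Putting those comments together, the scan would then look something like
this (a sketch only, not tested; whether rejecting UNSPEC sources is the
right condition is still the open question above):

  for (auto def : info->defs ())
    if (auto set = dyn_cast<set_info *> (def))
      for (auto use : set->nondebug_insn_uses ())
        {
          insn_info *use_insn = use->insn ();
          if (use_insn->is_artificial ())
            return false;

          // Not a debug or artificial use, so use_insn->rtl () is a real insn.
          rtx use_set = single_set (use_insn->rtl ());
          if (!use_set || GET_CODE (SET_SRC (use_set)) != UNSPEC)
            return false;
        }
  return true;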
> [...]
> diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
> index 9f897ac04e2..2dbe9f854ef 100644
> --- a/gcc/pair-fusion.cc
> +++ b/gcc/pair-fusion.cc
> @@ -312,7 +312,7 @@ static int
>  encode_lfs (lfs_fields fields)
>  {
>int size_log2 = exact_log2 (fields.size);
> -  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
> +  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 6);
>return ((int)fields.load_p << 3)
>  | ((int)fields.fpsimd_p << 2)
>  | (size_log2 - 2);

The point of the assert 
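
A minimal standalone illustration of the packing that the assert protects,
with made-up field values:

#include <cassert>

// Mirror of encode_lfs above: bit 3 is load_p, bit 2 is fpsimd_p and
// bits 0-1 hold size_log2 - 2, so size_log2 can only go up to 5 before
// the fields start to overlap.
static int
encode_lfs_model (bool load_p, bool fpsimd_p, int size_log2)
{
  return ((int) load_p << 3) | ((int) fpsimd_p << 2) | (size_log2 - 2);
}

int
main ()
{
  // With the assert relaxed to size_log2 <= 6, a 64-byte non-fpsimd access
  // becomes indistinguishable from a 4-byte fpsimd access.
  assert (encode_lfs_model (false, false, 6)
          == encode_lfs_model (false, true, 2));
  return 0;
}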

Re: [PATCH 2/4] resource.cc: Replace calls to find_basic_block with cfgrtl BLOCK_FOR_INSN

2024-05-31 Thread Richard Sandiford
Hans-Peter Nilsson  writes:
> [...]
> (Not-so-)fun fact: add_insn_after takes a bb parameter which
> reorg.cc always passes as NULL.  But - the argument is
> *always ignored* and the bb in the "after" insn is used.
> I traced that ignored parameter as far as
> r0-81421-g6fb5fa3cbc0d78 "Merge dataflow branch into
> mainline" when it was added.  I *guess* it's an artifact
> left over from some idea explored on that branch.  Ripe for
> obvious cleanup by removal everywhere.

Heh.  I wondered whether there'd be some direct callers of
add_insn_after_nobb that relied on the block *not* being updated
for some reason, but thankfully not.  The only two callers seem
to be add_insn_after and emit_note_after.  But then emit_note_after
handles notes slightly differently from add_insn_after, even though
logically, emitting an existing note should work in the same way
as emitting a new note.

So yeah, like you say, ripe for cleanup :)

Richard


Re: [PATCH 01/11] OpenMP/PolyInt: Pass poly-int structures by address to OMP libs.

2024-05-31 Thread Richard Sandiford
Jakub Jelinek  writes:
> On Fri, May 31, 2024 at 08:45:54AM +0100, Richard Sandiford wrote:
>> > When you say same way, do you mean the way SVE ABI defines the rules for 
>> > SVE types?
>> 
>> No, sorry, I meant that if the choice isn't purely local to a source
>> code function, the condition should be something like sizeless_type_p
>> (suitably abstracted) rather than POLY_INT_CST_P.  That way, the "ABI"
>> stays the same regardless of -msve-vector-bits.
>
> There is no ABI, it is how the caller and indirect callee communicate,
> but both parts are compiled with the same compiler, so it can choose
> differently based on different compiler version etc.
> It is effectively simplified:
> struct whatever { ... };
> void callee (void *x) { struct whatever *w = *x; use *w; }
> void caller (void) { struct whatever w; fill in w; ABI_call (callee, &w); }
> (plus in some cases the callee can also update values and propagate that
> back to caller).
> In any case, it is a similar "ABI" to e.g. tree-nested.cc communication
> between caller and nested callee, how exactly are the variables laid out
> in a struct depends on compiler version and whatever it decides, same
> compiler then emits both sides.

Ah, ok, thanks.  In that case I guess POLY_INT_CST_P should be
safe/correct after all.

Richard


Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

2024-05-31 Thread Richard Sandiford
Segher Boessenkool  writes:
> Hi!
>
> On Fri, May 31, 2024 at 01:21:44AM +0530, Ajit Agarwal wrote:
>> Code is implemented with pure virtual functions to interface with target
>> code.
>
> It's not a pure function.  A pure function -- by definition -- has no
> side effects.  These things have side effects.
>
> What you mean is this is *an implementation* for C++ functions without
> a generic implementation.  An obfuscation some people (like me) would
> say.  But please call things what they are!  So not "pure function".
> That has a meaning, and this isn't it.

"pure virtual function" is an established term.  The "pure" modifies
"virtual", not "function".

The description is correct because the patch adds pure virtual functions
to the base class and expects the derived class to override and implement
them.
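
For the avoidance of doubt, a minimal example of the construct being
described (names borrowed from the patch, but the signature is simplified
for illustration):

struct pair_fusion
{
  // Pure virtual: the "= 0" means there is no generic implementation and
  // every target-specific derived class must override it.
  virtual bool fuseable_load_p (int) = 0;
  virtual ~pair_fusion () = default;
};

struct rs6000_pair_fusion : pair_fusion
{
  bool fuseable_load_p (int) override { return true; }
};

int
main ()
{
  rs6000_pair_fusion rs6000;
  pair_fusion &base = rs6000;
  return base.fuseable_load_p (0) ? 0 : 1;
}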

>>  * config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>  implementation of additional virtual functions added in pair_fusion
>>  struct.
>
> This does not belong in this patch.  Do not send "rs6000" patches that
> touch anything outside of config/rs6000/ and similar, certainly not in
> config/something-else/!
>
> This would be WAY easier to review (read: AT ALL POSSIBLE) if you
> included some detailed rationale and design document.

Please don't shout.

I don't think this kind of aggressive review is helpful to the project.

Richard


Re: [PATCH 01/11] OpenMP/PolyInt: Pass poly-int structures by address to OMP libs.

2024-05-31 Thread Richard Sandiford
Tejas Belagod  writes:
> On 5/30/24 6:28 PM, Richard Sandiford wrote:
>> Tejas Belagod  writes:
>>> Currently poly-int type structures are passed by value to OpenMP runtime
>>> functions for shared clauses etc.  This patch improves on this by passing
>>> around poly-int structures by address to avoid copy-overhead.
>>>
>>> gcc/ChangeLog
>>> * omp-low.c (use_pointer_for_field): Use pointer if the OMP data
>>> structure's field type is a poly-int.
>>> ---
>>>   gcc/omp-low.cc | 3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
>>> index 1a65229cc37..b15607f4ef5 100644
>>> --- a/gcc/omp-low.cc
>>> +++ b/gcc/omp-low.cc
>>> @@ -466,7 +466,8 @@ static bool
>>>   use_pointer_for_field (tree decl, omp_context *shared_ctx)
>>>   {
>>> if (AGGREGATE_TYPE_P (TREE_TYPE (decl))
>>> -  || TYPE_ATOMIC (TREE_TYPE (decl)))
>>> +  || TYPE_ATOMIC (TREE_TYPE (decl))
>>> +  || POLY_INT_CST_P (DECL_SIZE (decl)))
>>>   return true;
>>>   
>>> /* We can only use copy-in/copy-out semantics for shared variables
>> 
>
> Thanks for the reviews.
>
>> Realise this is also true of my original patch, but:
>> 
>> I suppose a question here is whether this function is only ever used for
>> local interfaces between code generated by the same source code function,
>> or whether it's ABI in a more general sense.  
>
> I'm not 100% sure, but AFAICS, 'use_pointer_for_field' seems to be
> used only for local interface between source and generated functions. I 
> don't see any backend hooks into this or backend hooking into this 
> function for general ABI. Of course, I'm not the expert on OMP lowering,
> so it would be great to get an expert opinion on this.
>
>> If the latter, I suppose
>> we should make sure to handle ACLE types the same way regardless of
>> whether the SVE vector size is known.
>> 
>
> When you say same way, do you mean the way SVE ABI defines the rules for 
> SVE types?

No, sorry, I meant that if the choice isn't purely local to a source
code function, the condition should be something like sizeless_type_p
(suitably abstracted) rather than POLY_INT_CST_P.  That way, the "ABI"
stays the same regardless of -msve-vector-bits.

Thanks,
Richard


[gcc r14-10263] vect: Fix access size alignment assumption [PR115192]

2024-05-31 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:36575f5fe491d86b6851ff3f47cbfb7dad0fc8ae

commit r14-10263-g36575f5fe491d86b6851ff3f47cbfb7dad0fc8ae
Author: Richard Sandiford 
Date:   Fri May 31 08:22:55 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

(cherry picked from commit a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba)

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index f37734b5340..654a8220214 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree *cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }
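
As a side note, a small self-contained sketch of the failure mode being
fixed, with made-up values rather than the ones from the PR:

#include <algorithm>
#include <cassert>

int
main ()
{
  // Both segments 16-byte aligned, each access only 4 bytes wide.
  int align = 16, seg_len = 16, access_size = 4;
  int start_a = 0, start_b = 16;

  // Exclusive end of segment A; its last byte is 19, which overlaps
  // segment B starting at 16.
  int excl_end_a = start_a + seg_len + access_size;   // 20

  // Old code: subtract the pointer alignment.  20 is not 16-aligned, so
  // 20 - 16 = 4 is not a valid inclusive maximum and the overlap is missed.
  int old_incl_max_a = excl_end_a - align;
  assert (old_incl_max_a < start_b);   // wrongly concludes "no alias"

  // New code: also clamp min_align by the alignment of the access size.
  int min_align = std::min (align, access_size);
  int new_incl_max_a = excl_end_a - min_align;
  assert (new_incl_max_a >= start_b);  // alias correctly detected
  return 0;
}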


Re: [PATCH 4/4]AArch64: enable new predicate tuning for Neoverse cores.

2024-05-30 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This enables the new tuning flag for Neoverse V1, Neoverse V2 and Neoverse N2.
> It is kept off for generic codegen.
>
> Note the reason for the +sve even though they are in aarch64-sve.exp is if the
> testsuite is ran with a forced SVE off option, e.g. -march=armv8-a+nosve then
> the intrinsics end up being disabled because the -march is preferred over the
> -mcpu even though the -mcpu comes later.
>
> This prevents the tests from failing in such runs.

IMO we should just skip aarch64-sve.exp if the options explicitly disable
SVE.  But that's separate work.  I'll try it once this patch is in.

> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/tuning_models/neoversen2.h (neoversen2_tunings): Add
>   AARCH64_EXTRA_TUNE_AVOID_PRED_RMW.
>   * config/aarch64/tuning_models/neoversev1.h (neoversev1_tunings): Add
>   AARCH64_EXTRA_TUNE_AVOID_PRED_RMW.
>   * config/aarch64/tuning_models/neoversev2.h (neoversev2_tunings): Add
>   AARCH64_EXTRA_TUNE_AVOID_PRED_RMW.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/pred_clobber_1.c: New test.
>   * gcc.target/aarch64/sve/pred_clobber_2.c: New test.
>   * gcc.target/aarch64/sve/pred_clobber_3.c: New test.
>   * gcc.target/aarch64/sve/pred_clobber_4.c: New test.
>
> ---
> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h 
> b/gcc/config/aarch64/tuning_models/neoversen2.h
> index 
> 7e799bbe762fe862e31befed50e54040a7fd1f2f..be9a48ac3adc097f967c217fe09dcac194d7d14f
>  100644
> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> @@ -236,7 +236,8 @@ static const struct tune_params neoversen2_tunings =
>(AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> -   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> +   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> +   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags.  */
>_prefetch_tune,
>AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h 
> b/gcc/config/aarch64/tuning_models/neoversev1.h
> index 
> 9363f2ad98a5279cc99f2f9b1509ba921d582e84..0fc41ce6a41b3135fa06d2bda1f517fdf4f8dbcf
>  100644
> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> @@ -227,7 +227,8 @@ static const struct tune_params neoversev1_tunings =
>(AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> -   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
> +   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> +   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags.  */
>_prefetch_tune,
>AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>AARCH64_LDP_STP_POLICY_ALWAYS/* stp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
> b/gcc/config/aarch64/tuning_models/neoversev2.h
> index 
> bc01ed767c9b690504eb98456402df5d9d64eee3..f76e4ef358f7dfb9c7d7b470ea7240eaa2120f8e
>  100644
> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> @@ -236,7 +236,8 @@ static const struct tune_params neoversev2_tunings =
>(AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> -   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),  /* tune_flags.  */
> +   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> +   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags.  */
>_prefetch_tune,
>AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
>AARCH64_LDP_STP_POLICY_ALWAYS /* stp_policy_model.  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c
> new file mode 100644
> index 
> ..934a00a38531c5fd4139d99ff33414904b2c104f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pred_clobber_1.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mcpu=neoverse-n2" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +#pragma GCC target "+sve"
> +
> +#include <arm_sve.h>
> +
> +extern void use(svbool_t);
> +
> +/*
> +** foo:
> +**   ...
> +**   ptrue   p([1-9][0-9]?).b, all

Might be better to make this p([1-3]), so that we disallow any registers
that would cause a spill.
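I.e. something like:

**	ptrue	p([1-3]).b, all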

OK with that change, thanks.

Richard

> +**   cmplo   p0.h, p\1/z, z0.h, z[0-9]+.h
> +**   ...
> +*/
> +void foo (svuint16_t a, uint16_t b)
> +{
> +svbool_t p0 = svcmplt_n_u16 (svptrue_b16 (), a, b);
> +

Re: [PATCH 3/4]AArch64: add new alternative with early clobber to patterns

2024-05-30 Thread Richard Sandiford
Tamar Christina  writes:
> [...]
> @@ -6651,8 +6661,10 @@ (define_insn "and3"
>   (and:PRED_ALL (match_operand:PRED_ALL 1 "register_operand")
> (match_operand:PRED_ALL 2 "register_operand")))]
>"TARGET_SVE"
> -  {@ [ cons: =0, 1  , 2   ]
> - [ Upa , Upa, Upa ] and\t%0.b, %1/z, %2.b, %2.b
> +  {@ [ cons: =0, 1  , 2  ; attrs: pred_clobber ]
> + [ , Upa, Upa; yes ] and\t%0.b, %1/z, %2.b, %2.b
> + [ ?Upa, 0  , Upa; yes ] ^
> + [ Upa , Upa, Upa; no  ] ^

I think this ought to be:

> +  {@ [ cons: =0, 1  ,  2   ; attrs: pred_clobber ]
> + [ , Upa,  Upa ; yes ] and\t%0.b, %1/z, %2.b, 
> %2.b
> + [ ?Upa, 0Upa, 0Upa; yes ] ^
> + [ Upa , Upa,  Upa ; no  ] ^

so that operand 2 can be tied to operand 0 in the worst case.  Similarly:

>}
>  )
>  
> @@ -6679,8 +6691,10 @@ (define_insn "@aarch64_pred__z"
>   (match_operand:PRED_ALL 3 "register_operand"))
> (match_operand:PRED_ALL 1 "register_operand")))]
>"TARGET_SVE"
> -  {@ [ cons: =0, 1  , 2  , 3   ]
> - [ Upa , Upa, Upa, Upa ] \t%0.b, %1/z, %2.b, %3.b
> +  {@ [ cons: =0, 1  , 2  , 3  ; attrs: pred_clobber ]
> + [ , Upa, Upa, Upa; yes ] \t%0.b, %1/z, 
> %2.b, %3.b
> + [ ?Upa, 0  , Upa, Upa; yes ] ^
> + [ Upa , Upa, Upa, Upa; no  ] ^
>}
>  )

this would be:

  {@ [ cons: =0, 1   , 2   , 3   ; attrs: pred_clobber ]
 [ , Upa , Upa , Upa ; yes ] \t%0.b, %1/z, 
%2.b, %3.b
 [ ?Upa, 0Upa, 0Upa, 0Upa; yes ] ^
 [ Upa , Upa , Upa,  Upa ; no  ] ^
  }

Same idea for the rest.

I tried this on:

--
#include <arm_sve.h>

void use (svbool_t, svbool_t, svbool_t);

void
f1 (svbool_t p0, svbool_t p1, svbool_t p2, int n, svbool_t *ptr)
{
  while (n--)
p2 = svand_z (p0, p1, p2);
  *ptr = p2;
}

void
f2 (svbool_t p0, svbool_t p1, svbool_t p2, svbool_t *ptr)
{
  *ptr = svand_z (p0, p1, p2);
}

void
f3 (svbool_t p0, svbool_t p1, svbool_t p2, svbool_t *ptr)
{
  use (svand_z (p0, p1, p2), p1, p2);
}

void
f4 (svbool_t p0, svbool_t p1, svbool_t p2, svbool_t *ptr)
{
  use (p0, svand_z (p0, p1, p2), p2);
}

void
f5 (svbool_t p0, svbool_t p1, svbool_t p2, svbool_t *ptr)
{
  use (p0, p1, svand_z (p0, p1, p2));
}
--

and it seemed to produce the right output:

--
f1:
cbz w0, .L2
sub w0, w0, #1
.p2align 5,,15
.L3:
and p2.b, p0/z, p1.b, p2.b
sub w0, w0, #1
cmn w0, #1
bne .L3
.L2:
str p2, [x1]
ret

f2:
and p3.b, p0/z, p1.b, p2.b
str p3, [x0]
ret

f3:
and p0.b, p0/z, p1.b, p2.b
b   use

f4:
and p1.b, p0/z, p1.b, p2.b
b   use

f5:
and p2.b, p0/z, p1.b, p2.b
b   use
--

(with that coming directly from RA, rather than being cleaned
up later)

> [...]
> @@ -10046,8 +10104,10 @@ (define_insn_and_rewrite "*aarch64_brkn_cc"
>  (match_dup 3)]
> UNSPEC_BRKN))]
>"TARGET_SVE"
> -  {@ [ cons: =0, 1  , 2  , 3 ]
> - [ Upa , Upa, Upa, 0 ] brkns\t%0.b, %1/z, %2.b, %0.b
> +  {@ [ cons: =0, 1  , 2  , 3; attrs: pred_clobber ]
> + [ , Upa, Upa, 0; yes ] brkns\t%0.b, %1/z, %2.b, 
> %0.b
> + [ ?Upa, 0  , Upa, 0; yes ] ^
> + [ Upa , Upa, Upa, 0; no  ] ^
>}
>"&& (operands[4] != CONST0_RTX (VNx16BImode)
> || operands[5] != CONST0_RTX (VNx16BImode))"

Probably best to leave this out.  All alternatives require operand 3
to match operand 0.  So operands 1 and 2 will only match operand 0
if they're the same as operand 3.  In that case it'd be better to
allow the sharing rather than force the same value to be stored
in two registers.

That is, if op1 != op3 && op2 != op3 then we get what we want
naturally, regardless of tuning.

The same thing would apply to the BRKN instances of :

> @@ -10020,8 +10076,10 @@ (define_insn "@aarch64_brk"
>  (match_operand:VNx16BI 3 "register_operand")]
> SVE_BRK_BINARY))]
>"TARGET_SVE"
> -  {@ [ cons: =0, 1  , 2  , 3 ]
> - [ Upa , Upa, Upa,  ] brk\t%0.b, %1/z, %2.b, 
> %.b
> +  {@ [ cons: =0,  1 , 2  , 3; attrs: pred_clobber ]
> + [ , Upa, Upa, ; yes ] 
> brk\t%0.b, %1/z, %2.b, %.b
> + [ ?Upa, 0  , Upa, ; yes ] ^
> + [ Upa , Upa, Upa, ; no  ] ^
>}
>  )

but I think we should keep this factoring/abstraction and just add
the extra alternatives 

[gcc r15-929] ira: Fix go_through_subreg offset calculation [PR115281]

2024-05-30 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:46d931b3dd31cbba7c3355ada63f155aa24a4e2b

commit r15-929-g46d931b3dd31cbba7c3355ada63f155aa24a4e2b
Author: Richard Sandiford 
Date:   Thu May 30 16:17:58 2024 +0100

ira: Fix go_through_subreg offset calculation [PR115281]

go_through_subreg used:

  else if (!can_div_trunc_p (SUBREG_BYTE (x),
 REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))

to calculate the register offset for a pseudo subreg x.  In the blessed
days before poly-int, this was:

*offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x)));

But I think this is testing the wrong natural size.  If we exclude
paradoxical subregs (which will get an offset of zero regardless),
it's the inner register that is being split, so it should be the
inner register's natural size that we use.

This matters in the testcase because we have an SFmode lowpart
subreg into the last of three variable-sized vectors.  The
SUBREG_BYTE is therefore equal to the size of two variable-sized
vectors.  Dividing by the vector size gives a register offset of 2,
as expected, but dividing by the size of a scalar FPR would give
a variable offset.

I think something similar could happen for fixed-size targets if
REGMODE_NATURAL_SIZE is different for vectors and integers (say),
although that case would trade an ICE for an incorrect offset.

gcc/
PR rtl-optimization/115281
* ira-conflicts.cc (go_through_subreg): Use the natural size of
the inner mode rather than the outer mode.

gcc/testsuite/
PR rtl-optimization/115281
* gfortran.dg/pr115281.f90: New test.

Diff:
---
 gcc/ira-conflicts.cc   |  3 ++-
 gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc
index 83274c53330..15ac42d8848 100644
--- a/gcc/ira-conflicts.cc
+++ b/gcc/ira-conflicts.cc
@@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset)
   if (REGNO (reg) < FIRST_PSEUDO_REGISTER)
 *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg),
   SUBREG_BYTE (x), GET_MODE (x));
+  /* The offset is always 0 for paradoxical subregs.  */
   else if (!can_div_trunc_p (SUBREG_BYTE (x),
-REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))
+REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset))
 /* Checked by validate_subreg.  We must know at compile time which
inner hard registers are being accessed.  */
 gcc_unreachable ();
diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 b/gcc/testsuite/gfortran.dg/pr115281.f90
new file mode 100644
index 000..80aa822e745
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr115281.f90
@@ -0,0 +1,39 @@
+! { dg-options "-O3" }
+! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } }
+
+SUBROUTINE fn0(ma, mb, nt)
+  CHARACTER ca
+  REAL r0(ma)
+  INTEGER i0(mb)
+  REAL r1(3,mb)
+  REAL r2(3,mb)
+  REAL r3(3,3)
+  zero=0.0
+  do na = 1, nt
+ nt = i0(na)
+ do l = 1, 3
+r1 (l, na) =   r0 (nt)
+r2(l, na) = zero
+ enddo
+  enddo
+  if (ca  .ne.'z') then
+ do j = 1, 3
+do i = 1, 3
+   r4  = zero
+enddo
+ enddo
+ do na = 1, nt
+do k =  1, 3
+   do l = 1, 3
+  do m = 1, 3
+ r3 = r4 * v
+  enddo
+   enddo
+enddo
+ do i = 1, 3
+   do k = 1, ifn (r3)
+   enddo
+enddo
+ enddo
+ endif
+END


Re: [PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber

2024-05-30 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Tamar Christina 
>> Sent: Wednesday, May 22, 2024 10:29 AM
>> To: Richard Sandiford 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; ktkac...@gcc.gnu.org
>> Subject: RE: [PATCH 2/4]AArch64: add new tuning param and attribute for
>> enabling conditional early clobber
>>
>> >
>> > Sorry for the bike-shedding, but how about something like "avoid_pred_rmw"?
>> > (I'm open to other suggestions.)  Just looking for something that describes
>> > either the architecture or the end result that we want to achieve.
>> > And preferable something fairly short :)
>> >
>> > avoid_* would be consistent with the existing "avoid_cross_loop_fma".
>> >
>> > > +
>> > >  #undef AARCH64_EXTRA_TUNING_OPTION
>> > > diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
>> > > index
>> >
>> bbf11faaf4b4340956094a983f8b0dc2649b2d27..76a18dd511f40ebb58ed12d5
>> > 6b46c74084ba7c3c 100644
>> > > --- a/gcc/config/aarch64/aarch64.h
>> > > +++ b/gcc/config/aarch64/aarch64.h
>> > > @@ -495,6 +495,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE =
>> > AARCH64_FL_SM_OFF;
>> > >  enabled through +gcs.  */
>> > >  #define TARGET_GCS (AARCH64_ISA_GCS)
>> > >
>> > > +/*  Prefer different predicate registers for the output of a predicated 
>> > > operation
>> > over
>> > > +re-using an existing input predicate.  */
>> > > +#define TARGET_SVE_PRED_CLOBBER (TARGET_SVE \
>> > > +  && (aarch64_tune_params.extra_tuning_flags \
>> > > +  &
>> > AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST))
>> > >
>> > >  /* Standard register usage.  */
>> > >
>> > > diff --git a/gcc/config/aarch64/aarch64.md 
>> > > b/gcc/config/aarch64/aarch64.md
>> > > index
>> >
>> dbde066f7478bec51a8703b017ea553aa98be309..1ecd1a2812969504bd5114a
>> > 53473b478c5ddba82 100644
>> > > --- a/gcc/config/aarch64/aarch64.md
>> > > +++ b/gcc/config/aarch64/aarch64.md
>> > > @@ -445,6 +445,10 @@ (define_enum_attr "arch" "arches" (const_string
>> > "any"))
>> > >  ;; target-independent code.
>> > >  (define_attr "is_call" "no,yes" (const_string "no"))
>> > >
>> > > +;; Indicates whether we want to enable the pattern with an optional 
>> > > early
>> > > +;; clobber for SVE predicates.
>> > > +(define_attr "pred_clobber" "no,yes" (const_string "no"))
>> > > +
>> > >  ;; [For compatibility with Arm in pipeline models]
>> > >  ;; Attribute that specifies whether or not the instruction touches fp
>> > >  ;; registers.
>> > > @@ -461,7 +465,8 @@ (define_attr "fp" "no,yes"
>> > >  (define_attr "arch_enabled" "no,yes"
>> > >(if_then_else
>> > >  (ior
>> > > - (eq_attr "arch" "any")
>> > > + (and (eq_attr "arch" "any")
>> > > +  (eq_attr "pred_clobber" "no"))
>> > >
>> > >   (and (eq_attr "arch" "rcpc8_4")
>> > >(match_test "AARCH64_ISA_RCPC8_4"))
>> > > @@ -488,7 +493,10 @@ (define_attr "arch_enabled" "no,yes"
>> > >(match_test "TARGET_SVE"))
>> > >
>> > >   (and (eq_attr "arch" "sme")
>> > > -  (match_test "TARGET_SME")))
>> > > +  (match_test "TARGET_SME"))
>> > > +
>> > > + (and (eq_attr "pred_clobber" "yes")
>> > > +  (match_test "TARGET_SVE_PRED_CLOBBER")))
>> >
>> > IMO it'd be bettero handle pred_clobber separately from arch, as a new
>> > top-level AND:
>> >
>> >   (and
>> > (ior
>> >   (eq_attr "pred_clobber" "no")
>> >   (match_test "!TARGET_..."))
>> > (ior
>> >   ...existing arch tests...))
>> >
>>
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>

Re: [PATCH 00/11] AArch64/OpenMP: Test SVE ACLE types with various OpenMP constructs.

2024-05-30 Thread Richard Sandiford
Tejas Belagod  writes:
> Note: This patch series is based on Richard's initial patch
>   https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606741.html
> and Jakub's suggestion
>   https://gcc.gnu.org/pipermail/gcc-patches/2023-February/611892.html
>
> The following patch series handles various scenarios with OpenMP and SVE 
> types.
> The starting point for the series follows a suggestion from Jakub to cover 
> all 
> the possible scenarios that could arise when OMP constructs/clauses etc are 
> used with SVE ACLE types. Here are a few instances that this patch series 
> tests
> and in some cases fixes the expected output.  This patch series does not 
> follow
> a formal definition or a spec of how OMP interacts with SVE ACLE types, so 
> it's 
> more of a proposed behaviour.  Comments and discussion welcome.

Thanks for doing this.  I've left some comments on individual patches,
but generally the series looks good from my limited ability to evaluate it.
Hopefully Jakub can say whether this catches all the cases that matter.

Richard

> This list is not exhaustive, but covers most scenarios of how SVE ACLE types
> ought to interact with OMP constructs/clauses.
>
> 1. Poly-int structures that represent variable-sized objects and OMP runtime.
>
> Currently poly-int type structures are passed by value to OpenMP runtime
> functions for shared clauses etc.  This patch improves on this by passing
> around poly-int structures by address to avoid copy-overhead.
>
> 2. SVE ACLE types in OMP Shared clauses.
>
> We test the behaviour where SVE ACLE type objects are shared in the following
> methods into an OMP region:
>   a. Explicit Shared clause on SVE ACLE type objects.
>   b. Implicit shared clause.
>   c. Implicit shared with default clause.
>   d. SVE ACLE types in the presence of predetermined (static) shared objects.
>
> The associated tests ensure that all such shared objects are passed by address
> into the OMP runtime.  There are runtime tests to verify the functional
> correctness of the change.
>
> 3. Offloading and SVE ACLE types.
>
> The target clause in OpenMP is used to offload loop kernels to accelerator
> peripherals.  target's 'map' clause is used to move data from and to the
> accelerator.  When the data is SVE type, it may not be suitable because of
> various reasons i.e. the two SVE targets may not agree on vector size or
> some targets don't support variable vector size.  This makes SVE unsuitable
> for use in OMP's 'map' clause.  We diagnose all such cases and issue errors
> where appropriate.  The cases we cover in this patch are:
>
>   a. Implicitly-mapped SVE ACLE types in OMP target regions are diagnosed.
>   b. Explicitly-mapped SVE ACLE types in OMP target regions using map clause
>  are diagnosed.
>   c. Explicitly-mapped SVE ACLE types of various directions - to, from,
> tofrom
>  in the map clause are diagnosed.
>   d. target enter and exit data clauses with map on SVE ACLE types are 
>  diagnosed.
>   e. target data map with alloc on SVE ACLE types are diagnosed.
>   f. target update from clause on SVE ACLE types are diagnosed.
>   g. target private firstprivate with SVE ACLE types are diagnosed.
>   h. All combinations of target with work-sharing constructs like parallel,
>  loop, simd, teams, distribute etc are also diagnosed when SVE ACLE types
>  are involved.
>
> 4. Lastprivate and SVE ACLE types.
>
> Various OpenMP lastprivate clause scenarios with SVE object types are 
> diagnosed.  Worksharing constructs like sections, for, distribute bind to an
> implicit outer parallel region in whose scope SVE ACLE types are declared and 
> are therefore default private.  The lastprivate clause list with SVE ACLE type
> object items are diagnosed in this scenario.
>
> 5. Threadprivate on SVE ACLE type objects.
>
> We ensure threadprivate SVE ACLE type objects are supported. We also ensure
> copyin clause is also supported.
>
> 6. User-Defined Reductions on SVE ACLE types.
>
> We define a reduction using OMP declare reduction using SVE ACLE intrinsics 
> and
> ensure its functional correctness with various work-sharing constructs like
> for, simd, parallel, task, taskloop.
>
> 7. Uniform and Aligned Clause with SVE ACLE
>
> We ensure the uniform clause's functional correctness with simd construct and
> associated SVE ACLE intrinsics in the simd region.  There is no direct
> interaction between uniform and SVE ACLE type objects, but we ensure the 
> uniform
> clause applies correctly to a region where SVE ACLE intrinsics are present.
> Similarly for the aligned clause.
>
> 8. Linear clause and SVE ACLE type.
>
> We diagnose if a linear clause list item has SVE ACLE type objects present.
> Its doesn't mean much if the linear clause is applied to SVE ACLE types.
>
> 9. Depend clause and SVE ACLE objects.
>
> We test for functional correctness many combinations of dependency of shared
> SVE ACLE type objects in parallel regions.  We test if in, out dependencies 
> 

Re: [PATCH 01/11] OpenMP/PolyInt: Pass poly-int structures by address to OMP libs.

2024-05-30 Thread Richard Sandiford
Tejas Belagod  writes:
> Currently poly-int type structures are passed by value to OpenMP runtime
> functions for shared clauses etc.  This patch improves on this by passing
> around poly-int structures by address to avoid copy-overhead.
>
> gcc/ChangeLog
>   * omp-low.c (use_pointer_for_field): Use pointer if the OMP data
>   structure's field type is a poly-int.
> ---
>  gcc/omp-low.cc | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
> index 1a65229cc37..b15607f4ef5 100644
> --- a/gcc/omp-low.cc
> +++ b/gcc/omp-low.cc
> @@ -466,7 +466,8 @@ static bool
>  use_pointer_for_field (tree decl, omp_context *shared_ctx)
>  {
>if (AGGREGATE_TYPE_P (TREE_TYPE (decl))
> -  || TYPE_ATOMIC (TREE_TYPE (decl)))
> +  || TYPE_ATOMIC (TREE_TYPE (decl))
> +  || POLY_INT_CST_P (DECL_SIZE (decl)))
>  return true;
>  
>/* We can only use copy-in/copy-out semantics for shared variables

Realise this is also true of my original patch, but:

I suppose a question here is whether this function is only ever used for
local interfaces between code generated by the same source code function,
or whether it's ABI in a more general sense.  If the latter, I suppose
we should make sure to handle ACLE types the same way regardless of
whether the SVE vector size is known.

(At the moment, the vector size is fixed for a TU, not just a function,
but we should probably plan for relaxing that in future.)

Thanks,
Richard


Re: [PATCH 03/11] AArch64: Diagnose OpenMP offloading when SVE types involved.

2024-05-30 Thread Richard Sandiford
Tejas Belagod  writes:
> The target clause in OpenMP is used to offload loop kernels to accelerator
> peripherals.  target's 'map' clause is used to move data from and to the
> accelerator.  When the data is SVE type, it may not be suitable because of
> various reasons i.e. the two SVE targets may not agree on vector size or
> some targets don't support variable vector size.  This makes SVE unsuitable
> for use in OMP's 'map' clause.  This patch diagnoses all such cases and issues
> an error where SVE types are not suitable.
>
> Co-authored-by: Andrea Corallo 
>
> gcc/ChangeLog:
>
>   * target.h (type_context_kind): Add new context kinds for target 
> clauses.
>   * config/aarch64/aarch64-sve-builtins.cc (verify_type_context): Diagnose
>   SVE types for a given OpenMP context.
>   * gimplify.cc (omp_notice_variable):  Diagnose implicitly-mapped SVE
>   objects in OpenMP regions.
>   (gimplify_scan_omp_clauses): Diagnose SVE types for various target
>   clauses.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/omp/offload-1.c: New test.
>   * gcc.target/aarch64/sve/omp/offload-2.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-parallel-loop.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-parallel.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-simd.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-teams-distribute-simd.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-teams-distribute.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-teams-loop.c: Likewise.
>   * gcc.target/aarch64/sve/omp/offload-teams.c: Likewise.
>   * gcc.target/aarch64/sve/omp/target-device.c: Likewise.
>   * gcc.target/aarch64/sve/omp/target-link.c: Likewise.
> ---
>  gcc/config/aarch64/aarch64-sve-builtins.cc|  31 +++
>  gcc/gimplify.cc   |  34 ++-
>  gcc/target.h  |  19 +-
>  .../gcc.target/aarch64/sve/omp/offload-1.c| 237 ++
>  .../gcc.target/aarch64/sve/omp/offload-2.c| 198 +++
>  .../aarch64/sve/omp/offload-parallel-loop.c   | 236 +
>  .../aarch64/sve/omp/offload-parallel.c| 195 ++
>  .../gcc.target/aarch64/sve/omp/offload-simd.c | 236 +
>  .../sve/omp/offload-teams-distribute-simd.c   | 237 ++
>  .../sve/omp/offload-teams-distribute.c| 236 +
>  .../aarch64/sve/omp/offload-teams-loop.c  | 237 ++
>  .../aarch64/sve/omp/offload-teams.c   | 195 ++
>  .../aarch64/sve/omp/target-device.c   |  97 +++
>  .../gcc.target/aarch64/sve/omp/target-link.c  |  48 
>  14 files changed, 2234 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/offload-1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/offload-2.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/omp/offload-parallel-loop.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/omp/offload-parallel.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/offload-simd.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/omp/offload-teams-distribute-simd.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/omp/offload-teams-distribute.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/sve/omp/offload-teams-loop.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/offload-teams.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/target-device.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/omp/target-link.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index f3983a123e3..ee1064c3bb7 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -5000,6 +5000,29 @@ bool
>  verify_type_context (location_t loc, type_context_kind context,
>const_tree type, bool silent_p)
>  {
> +  if (aarch64_sve::builtin_type_p (type)
> +  || (POINTER_TYPE_P (type)
> +   && aarch64_sve::builtin_type_p (TREE_TYPE (type))))

Could you say in more detail why we check for zero or one levels
of pointer indirection but not for more?

Also, was there a reason for checking builtin_type_p rather than
sizeless_type_p?  Things like svbool_t remain sizeless even for
-msve-vector-bits=128 etc., so sizeless_type_p would still cover
that case.  But arm_sve_vector_bits makes it possible to define
fixed-length vector types that are treated for ABI & ACLE purposes
like SVE types.  I don't think those should be treated differently
from normal vectors by omp, since the size is fixed by the attribute
(and types with different attributes are distinct).
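
For example, assuming -msve-vector-bits=256, something like this defines
a fixed-size 256-bit type for layout purposes, even though it is built
from an SVE ACLE type:

#include <arm_sve.h>

/* Fixed length by the attribute (which must match -msve-vector-bits),
   so there is no sizeless-type obstacle to mapping objects of this type.  */
typedef svint32_t fixed_int32_t __attribute__ ((arm_sve_vector_bits (256)));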

Thanks,
Richard

> +switch (context)
> +{
> +  case TCTX_OMP_MAP:
> + error_at (loc, "SVE type %qT not allowed in 

Re: [PATCH 02/11] AArch64: Add test cases for SVE types in OpenMP shared clause.

2024-05-30 Thread Richard Sandiford
Tejas Belagod  writes:
> This patch tests various shared clauses with SVE types.  It also adds a test
> scaffold to run OpenMP tests in under the gcc.target testsuite.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/omp/aarch64-sve-omp.exp: New scaffold.

Hopefully Jakub can comment on whether we should test this in the
GCC testsuite or libgomp testsuite.

On the test:

> [...]
> +int
> +main ()
> +{
> +  svint32_t x = svindex_s32 (0 ,1);
> +  svint32_t y = svindex_s32 (8, 1);
> +  svint32_t a, b;
> +  svbool_t p;
> +
> +  /* Implicit shared.  */
> +  a = foo (x, y, p);
> +  b = implicit_shared_default (x, y, p);

It looks like p is used uninitialised here.  Can you check locally
that using svptrue_b8 () (or whatever) as an initialiser allows the
test to pass while svpfalse_b () causes it to fail?

Thanks,
Richard

> +  compare_vec (a, b);
> +
> +  /* Explicit shared.  */
> +  a = foo (x ,y, p);
> +  b = explicit_shared (x, y, p);
> +  compare_vec (a, b);
> +
> +  /* Implicit shared with no default clause.  */
> +  a = foo (x ,y, p);
> +  b = implicit_shared_no_default (x, y, p);
> +  compare_vec (a, b);
> +
> +  /* Mix shared.  */
> +  a = foo (x ,y, p);
> +  b = mix_shared (y, p);
> +  compare_vec (a, b);
> +
> +  /* Predetermined shared.  */
> +  predetermined_shared_static (true);
> +  predetermined_shared_static (false);
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "value-expr: \*.omp_data_i->a" 10 
> "ompexp" } } */


Re: [Patch, aarch64, middle-end\ v4: Move pair_fusion pass from aarch64 to middle-end

2024-05-30 Thread Richard Sandiford
Thanks for the update.  Some comments below, but looks very close
to ready.

Ajit Agarwal  writes:
> diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
> new file mode 100644
> index 000..060fd95
> --- /dev/null
> +++ b/gcc/pair-fusion.cc
> @@ -0,0 +1,3012 @@
> +// Pass to fuse adjacent loads/stores into paired memory accesses.
> +// Copyright (C) 2024 Free Software Foundation, Inc.

This should probably be 2023-2024, since it's based on code
contributed in 2023.

> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it
> +// under the terms of the GNU General Public License as published by
> +// the Free Software Foundation; either version 3, or (at your option)
> +// any later version.
> +//
> +// GCC is distributed in the hope that it will be useful, but
> +// WITHOUT ANY WARRANTY; without even the implied warranty of
> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +// General Public License for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// .
> +
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#define INCLUDE_LIST
> +#define INCLUDE_TYPE_TRAITS
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "rtl-iter.h"
> +#include "rtl-ssa.h"
> +#include "cfgcleanup.h"
> +#include "tree-pass.h"
> +#include "ordered-hash-map.h"
> +#include "tree-dfa.h"
> +#include "fold-const.h"
> +#include "tree-hash-traits.h"
> +#include "print-tree.h"
> +#include "pair-fusion.h"
> +
> +using namespace rtl_ssa;
> +
> +// We pack these fields (load_p, fpsimd_p, and size) into an integer
> +// (LFS) which we use as part of the key into the main hash tables.
> +//
> +// The idea is that we group candidates together only if they agree on
> +// the fields below.  Candidates that disagree on any of these
> +// properties shouldn't be merged together.
> +struct lfs_fields
> +{
> +  bool load_p;
> +  bool fpsimd_p;
> +  unsigned size;
> +};
> +
> +using insn_list_t = std::list<insn_info *>;
> +
> +// Information about the accesses at a given offset from a particular
> +// base.  Stored in an access_group, see below.
> +struct access_record
> +{
> +  poly_int64 offset;
> +  std::list<insn_info *> cand_insns;
> +  std::list<access_record>::iterator place;
> +
> +  access_record (poly_int64 off) : offset (off) {}
> +};
> +
> +// A group of accesses where adjacent accesses could be ldp/stp
> +// candidates.  The splay tree supports efficient insertion,
> +// while the list supports efficient iteration.
> +struct access_group
> +{
> +  splay_tree<access_record *> tree;
> +  std::list<access_record> list;
> +
> +  template<typename Alloc>
> +  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *insn);
> +};
> +
> +// Test if this base candidate is viable according to HAZARDS.
> +bool base_cand::viable () const

Formating nit, should be:

bool
base_cand::viable () const

> +{
> +  return !hazards[0] || !hazards[1] || (*hazards[0] > *hazards[1]);
> +}
> [...]
> +void
> +pair_fusion_bb_info::transform ()
> +{
> +  traverse_base_map (expr_map);
> +  traverse_base_map (def_map);
> +}
> +
> +// the base register which we can fold in to make this pair use
> +// a writeback addressing mode.

The first line of this comment is missing.  It should be:

// Given an existing pair insn INSN, look for a trailing update of

> [...]
> diff --git a/gcc/pair-fusion.h b/gcc/pair-fusion.h
> new file mode 100644
> index 000..f295fdbdb8f
> --- /dev/null
> +++ b/gcc/pair-fusion.h
> @@ -0,0 +1,195 @@
> +// Pass to fuse adjacent loads/stores into paired memory accesses.
> +//
> +// This file contains the definition of the virtual base class which is
> +// overriden by targets that make use of the pass.
> +//
> +// Copyright (C) 2024 Free Software Foundation, Inc.

2023-2024 here too

> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it
> +// under the terms of the GNU General Public License as published by
> +// the Free Software Foundation; either version 3, or (at your option)
> +// any later version.
> +//
> +// GCC is distributed in the hope that it will be useful, but
> +// WITHOUT ANY WARRANTY; without even the implied warranty of
> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +// General Public License for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// .
> +
> +namespace rtl_ssa {
> +  class def_info;
> +  class insn_info;
> +  class insn_range_info;
> +  class bb_info;
> +}
> +
> +// Information about a potential base candidate, used in try_fuse_pair.
> +// There may be zero, one, or two viable RTL bases for a given pair.
> +struct base_cand
> +{
> +  // DEF is the 

Re: [PATCH] aarch64: testsuite: Explicitly add -mlittle-endian to vget_low_2.c

2024-05-30 Thread Richard Sandiford
Pengxuan Zheng  writes:
> vget_low_2.c is a test case for little-endian, but we missed the 
> -mlittle-endian
> flag in r15-697-ga2e4fe5a53cf75.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/vget_low_2.c: Add -mlittle-endian.

Ok, thanks.

If you'd like write access, please follow the instructions on
https://gcc.gnu.org/gitwrite.html (I'll sponsor).

Richard

> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/testsuite/gcc.target/aarch64/vget_low_2.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/vget_low_2.c 
> b/gcc/testsuite/gcc.target/aarch64/vget_low_2.c
> index 44414e1c043..93e9e664ee9 100644
> --- a/gcc/testsuite/gcc.target/aarch64/vget_low_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vget_low_2.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O3 -fdump-tree-optimized" } */
> +/* { dg-options "-O3 -fdump-tree-optimized -mlittle-endian" } */
>  
>  #include <arm_neon.h>


Re: [PATCH] aarch64: Add vector floating point extend patterns [PR113880, PR113869]

2024-05-30 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch improves vectorization of certain floating point widening 
> operations
> for the aarch64 target by adding vector floating point extend patterns for
> V2SF->V2DF and V4HF->V4SF conversions.
>
>   PR target/113880
>   PR target/113869
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md (extend<mode><Vwide>2): New expand.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/extend-vec.c: New test.
>
> Signed-off-by: Pengxuan Zheng 

Thanks for doing this.  Could we instead rename
aarch64_float_extend_lo_ to extend2 and
use something similar to:

---
/* The builtins below should be expanded through the standard optabs
   CODE_FOR_[u]avg<mode>3_[floor,ceil].  However the mapping scheme in
   aarch64-simd-builtins.def does not easily allow us to have a pre-mode
   ("uavg") and post-mode string ("_ceil") in the CODE_FOR_* construction.
   So the builtins use a name that is natural for AArch64 instructions
   e.g. "aarch64_srhadd" and we re-map these to the optab-related
   CODE_FOR_ here.  */
#undef VAR1
#define VAR1(F,T1,T2,I,M) \
constexpr insn_code CODE_FOR_aarch64_##F##M = CODE_FOR_##T1##M##3##T2;

BUILTIN_VDQ_BHSI (srhadd, avg, _ceil, 0)
BUILTIN_VDQ_BHSI (urhadd, uavg, _ceil, 0)
BUILTIN_VDQ_BHSI (shadd, avg, _floor, 0)
BUILTIN_VDQ_BHSI (uhadd, uavg, _floor, 0)

#undef VAR1
---

(from aarch64-builtins.cc) to handle the intrinsics?  The idea is
to try to avoid adding new patterns just to satisfy the internal
naming convention.
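
Concretely, I'd expect something along these lines to work here (an
untested sketch; the exact macro shape and mode suffixes are guesses,
not a worked-out patch):

#undef VAR1
#define VAR1(F,T1,T2,I,M) \
  constexpr insn_code CODE_FOR_aarch64_##F##M = CODE_FOR_##T1##T2##M##2;

/* aarch64_float_extend_lo_v2df -> extendv2sfv2df2 and
   aarch64_float_extend_lo_v4sf -> extendv4hfv4sf2.  */
VAR1 (float_extend_lo_, extend, v2sf, 0, v2df)
VAR1 (float_extend_lo_, extend, v4hf, 0, v4sf)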

Richard

> ---
>  gcc/config/aarch64/aarch64-simd.md|  7 +++
>  gcc/testsuite/gcc.target/aarch64/extend-vec.c | 21 +++
>  2 files changed, 28 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/extend-vec.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 868f4486218..8febb411d06 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3141,6 +3141,13 @@ (define_insn "aarch64_float_extend_lo_<Vwide>"
>[(set_attr "type" "neon_fp_cvt_widen_s")]
>  )
>  
> +(define_expand "extend<mode><Vwide>2"
> +  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
> +	(float_extend:<VWIDE>
> +  (match_operand:VDF 1 "register_operand" "w")))]
> +  "TARGET_SIMD"
> +)
> +
>  ;; Float narrowing operations.
>  
>  (define_insn "aarch64_float_trunc_rodd_df"
> diff --git a/gcc/testsuite/gcc.target/aarch64/extend-vec.c 
> b/gcc/testsuite/gcc.target/aarch64/extend-vec.c
> new file mode 100644
> index 000..f6241d5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/extend-vec.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +/* { dg-final { scan-assembler-times {fcvtl\tv[0-9]+.2d, v[0-9]+.2s} 1 } } */
> +void
> +f (float *__restrict a, double *__restrict b)
> +{
> +  b[0] = a[0];
> +  b[1] = a[1];
> +}
> +
> +/* { dg-final { scan-assembler-times {fcvtl\tv[0-9]+.4s, v[0-9]+.4h} 1 } } */
> +void
> +f1 (_Float16 *__restrict a, float *__restrict b)
> +{
> +
> +  b[0] = a[0];
> +  b[1] = a[1];
> +  b[2] = a[2];
> +  b[3] = a[3];
> +}


[PATCH] ira: Fix go_through_subreg offset calculation [PR115281]

2024-05-30 Thread Richard Sandiford
go_through_subreg used:

  else if (!can_div_trunc_p (SUBREG_BYTE (x),
 REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))

to calculate the register offset for a pseudo subreg x.  In the blessed
days before poly-int, this was:

*offset = (SUBREG_BYTE (x) / REGMODE_NATURAL_SIZE (GET_MODE (x)));

But I think this is testing the wrong natural size.  If we exclude
paradoxical subregs (which will get an offset of zero regardless),
it's the inner register that is being split, so it should be the
inner register's natural size that we use.

This matters in the testcase because we have an SFmode lowpart
subreg into the last of three variable-sized vectors.  The
SUBREG_BYTE is therefore equal to the size of two variable-sized
vectors.  Dividing by the vector size gives a register offset of 2,
as expected, but dividing by the size of a scalar FPR would give
a variable offset.

I think something similar could happen for fixed-size targets if
REGMODE_NATURAL_SIZE is different for vectors and integers (say).

Tested on aarch64-linux-gnu & x86_64-linux-gnu.  OK to install?

Richard


gcc/
PR rtl-optimization/115281
* ira-conflicts.cc (go_through_subreg): Use the natural size of
the inner mode rather than the outer mode.

gcc/testsuite/
PR rtl-optimization/115281
* gfortran.dg/pr115281.f90: New test.
---
 gcc/ira-conflicts.cc   |  3 +-
 gcc/testsuite/gfortran.dg/pr115281.f90 | 39 ++
 2 files changed, 41 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gfortran.dg/pr115281.f90

diff --git a/gcc/ira-conflicts.cc b/gcc/ira-conflicts.cc
index 83274c53330..15ac42d8848 100644
--- a/gcc/ira-conflicts.cc
+++ b/gcc/ira-conflicts.cc
@@ -227,8 +227,9 @@ go_through_subreg (rtx x, int *offset)
   if (REGNO (reg) < FIRST_PSEUDO_REGISTER)
 *offset = subreg_regno_offset (REGNO (reg), GET_MODE (reg),
   SUBREG_BYTE (x), GET_MODE (x));
+  /* The offset is always 0 for paradoxical subregs.  */
   else if (!can_div_trunc_p (SUBREG_BYTE (x),
-REGMODE_NATURAL_SIZE (GET_MODE (x)), offset))
+REGMODE_NATURAL_SIZE (GET_MODE (reg)), offset))
 /* Checked by validate_subreg.  We must know at compile time which
inner hard registers are being accessed.  */
 gcc_unreachable ();
diff --git a/gcc/testsuite/gfortran.dg/pr115281.f90 b/gcc/testsuite/gfortran.dg/pr115281.f90
new file mode 100644
index 000..80aa822e745
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr115281.f90
@@ -0,0 +1,39 @@
+! { dg-options "-O3" }
+! { dg-additional-options "-mcpu=neoverse-v1" { target aarch64*-*-* } }
+
+SUBROUTINE fn0(ma, mb, nt)
+  CHARACTER ca
+  REAL r0(ma)
+  INTEGER i0(mb)
+  REAL r1(3,mb)
+  REAL r2(3,mb)
+  REAL r3(3,3)
+  zero=0.0
+  do na = 1, nt
+ nt = i0(na)
+ do l = 1, 3
+r1 (l, na) =   r0 (nt)
+r2(l, na) = zero
+ enddo
+  enddo
+  if (ca  .ne.'z') then
+ do j = 1, 3
+do i = 1, 3
+   r4  = zero
+enddo
+ enddo
+ do na = 1, nt
+do k =  1, 3
+   do l = 1, 3
+  do m = 1, 3
+ r3 = r4 * v
+  enddo
+   enddo
+enddo
+ do i = 1, 3
+   do k = 1, ifn (r3)
+   enddo
+enddo
+ enddo
+ endif
+END
-- 
2.25.1



[PATCH] aarch64: Split aarch64_combinev16qi before RA [PR115258]

2024-05-29 Thread Richard Sandiford
Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
purpose is to put the two input data vectors into consecutive registers.
This aarch64_combinev16qi was then split after reload into individual
moves (from the first input to the first half of the output, and from
the second input to the second half of the output).

In the worst case, the RA might allocate things so that the destination
of the aarch64_combinev16qi is the second input followed by the first
input.  In that case, the split form of aarch64_combinev16qi uses three
eors to swap the registers around.

This PR is about a test where this worst case occurred.  And given the
insn description, that allocation doesn't seem unreasonable.

early-ra should (hopefully) mean that we're now better at allocating
subregs of vector registers.  The upcoming RA subreg patches should
improve things further.  The best fix for the PR therefore seems
to be to split the combination before RA, so that the RA can see
the underlying moves.

Perhaps it even makes sense to do this at expand time, avoiding the need
for aarch64_combinev16qi entirely.  That deserves more experimentation
though.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
PR target/115258
* config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
the split before reload.
* config/aarch64/aarch64.cc (aarch64_split_combinev16qi): Generalize
into a form that handles pseudo registers.

gcc/testsuite/
PR target/115258
* gcc.target/aarch64/pr115258.c: New test.
---
 gcc/config/aarch64/aarch64-simd.md  |  2 +-
 gcc/config/aarch64/aarch64.cc   | 29 ++---
 gcc/testsuite/gcc.target/aarch64/pr115258.c | 19 ++
 3 files changed, 34 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr115258.c

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index c311888e4bd..868f4486218 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -8474,7 +8474,7 @@ (define_insn_and_split "aarch64_combinev16qi"
UNSPEC_CONCAT))]
   "TARGET_SIMD"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(const_int 0)]
 {
   aarch64_split_combinev16qi (operands);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ee12d8897a8..13191ec8e34 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25333,27 +25333,26 @@ aarch64_output_sve_ptrues (rtx const_unspec)
 void
 aarch64_split_combinev16qi (rtx operands[3])
 {
-  unsigned int dest = REGNO (operands[0]);
-  unsigned int src1 = REGNO (operands[1]);
-  unsigned int src2 = REGNO (operands[2]);
   machine_mode halfmode = GET_MODE (operands[1]);
-  unsigned int halfregs = REG_NREGS (operands[1]);
-  rtx destlo, desthi;
 
   gcc_assert (halfmode == V16QImode);
 
-  if (src1 == dest && src2 == dest + halfregs)
+  rtx destlo = simplify_gen_subreg (halfmode, operands[0],
+   GET_MODE (operands[0]), 0);
+  rtx desthi = simplify_gen_subreg (halfmode, operands[0],
+   GET_MODE (operands[0]),
+   GET_MODE_SIZE (halfmode));
+
+  bool skiplo = rtx_equal_p (destlo, operands[1]);
+  bool skiphi = rtx_equal_p (desthi, operands[2]);
+
+  if (skiplo && skiphi)
 {
   /* No-op move.  Can't split to nothing; emit something.  */
   emit_note (NOTE_INSN_DELETED);
   return;
 }
 
-  /* Preserve register attributes for variable tracking.  */
-  destlo = gen_rtx_REG_offset (operands[0], halfmode, dest, 0);
-  desthi = gen_rtx_REG_offset (operands[0], halfmode, dest + halfregs,
-  GET_MODE_SIZE (halfmode));
-
   /* Special case of reversed high/low parts.  */
   if (reg_overlap_mentioned_p (operands[2], destlo)
   && reg_overlap_mentioned_p (operands[1], desthi))
@@ -25366,16 +25365,16 @@ aarch64_split_combinev16qi (rtx operands[3])
 {
   /* Try to avoid unnecessary moves if part of the result
 is in the right place already.  */
-  if (src1 != dest)
+  if (!skiplo)
emit_move_insn (destlo, operands[1]);
-  if (src2 != dest + halfregs)
+  if (!skiphi)
emit_move_insn (desthi, operands[2]);
 }
   else
 {
-  if (src2 != dest + halfregs)
+  if (!skiphi)
emit_move_insn (desthi, operands[2]);
-  if (src1 != dest)
+  if (!skiplo)
emit_move_insn (destlo, operands[1]);
 }
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258.c 
b/gcc/testsuite/gcc.target/aarch64/pr115258.c
new file mode 100644
index 000..9a489d4604c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr115258.c
@@ -0,0 +1,19 @@
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+/*
+** fun:
+** (ldr|adrp)  [^\n]+
+** (ldr|adrp)  [^\n]+
+** (ldr|adrp)  

[gcc r15-906] aarch64: Split aarch64_combinev16qi before RA [PR115258]

2024-05-29 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec

commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
Author: Richard Sandiford 
Date:   Wed May 29 16:43:33 2024 +0100

aarch64: Split aarch64_combinev16qi before RA [PR115258]

Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
purpose is to put the two input data vectors into consecutive registers.
This aarch64_combinev16qi was then split after reload into individual
moves (from the first input to the first half of the output, and from
the second input to the second half of the output).

In the worst case, the RA might allocate things so that the destination
of the aarch64_combinev16qi is the second input followed by the first
input.  In that case, the split form of aarch64_combinev16qi uses three
eors to swap the registers around.

This PR is about a test where this worst case occurred.  And given the
insn description, that allocation doesn't seem unreasonable.

early-ra should (hopefully) mean that we're now better at allocating
subregs of vector registers.  The upcoming RA subreg patches should
improve things further.  The best fix for the PR therefore seems
to be to split the combination before RA, so that the RA can see
the underlying moves.

Perhaps it even makes sense to do this at expand time, avoiding the need
for aarch64_combinev16qi entirely.  That deserves more experimentation
though.

gcc/
PR target/115258
* config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
the split before reload.
* config/aarch64/aarch64.cc (aarch64_split_combinev16qi): Generalize
into a form that handles pseudo registers.

gcc/testsuite/
PR target/115258
* gcc.target/aarch64/pr115258.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-simd.md  |  2 +-
 gcc/config/aarch64/aarch64.cc   | 29 ++---
 gcc/testsuite/gcc.target/aarch64/pr115258.c | 19 +++
 3 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index c311888e4bd..868f4486218 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -8474,7 +8474,7 @@
UNSPEC_CONCAT))]
   "TARGET_SIMD"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(const_int 0)]
 {
   aarch64_split_combinev16qi (operands);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ee12d8897a8..13191ec8e34 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25333,27 +25333,26 @@ aarch64_output_sve_ptrues (rtx const_unspec)
 void
 aarch64_split_combinev16qi (rtx operands[3])
 {
-  unsigned int dest = REGNO (operands[0]);
-  unsigned int src1 = REGNO (operands[1]);
-  unsigned int src2 = REGNO (operands[2]);
   machine_mode halfmode = GET_MODE (operands[1]);
-  unsigned int halfregs = REG_NREGS (operands[1]);
-  rtx destlo, desthi;
 
   gcc_assert (halfmode == V16QImode);
 
-  if (src1 == dest && src2 == dest + halfregs)
+  rtx destlo = simplify_gen_subreg (halfmode, operands[0],
+   GET_MODE (operands[0]), 0);
+  rtx desthi = simplify_gen_subreg (halfmode, operands[0],
+   GET_MODE (operands[0]),
+   GET_MODE_SIZE (halfmode));
+
+  bool skiplo = rtx_equal_p (destlo, operands[1]);
+  bool skiphi = rtx_equal_p (desthi, operands[2]);
+
+  if (skiplo && skiphi)
 {
   /* No-op move.  Can't split to nothing; emit something.  */
   emit_note (NOTE_INSN_DELETED);
   return;
 }
 
-  /* Preserve register attributes for variable tracking.  */
-  destlo = gen_rtx_REG_offset (operands[0], halfmode, dest, 0);
-  desthi = gen_rtx_REG_offset (operands[0], halfmode, dest + halfregs,
-  GET_MODE_SIZE (halfmode));
-
   /* Special case of reversed high/low parts.  */
   if (reg_overlap_mentioned_p (operands[2], destlo)
   && reg_overlap_mentioned_p (operands[1], desthi))
@@ -25366,16 +25365,16 @@ aarch64_split_combinev16qi (rtx operands[3])
 {
   /* Try to avoid unnecessary moves if part of the result
 is in the right place already.  */
-  if (src1 != dest)
+  if (!skiplo)
emit_move_insn (destlo, operands[1]);
-  if (src2 != dest + halfregs)
+  if (!skiphi)
emit_move_insn (desthi, operands[2]);
 }
   else
 {
-  if (src2 != dest + halfregs)
+  if (!skiphi)
emit_move_insn (desthi, operands[2]);
-  if (src1 != dest)
+  if (!skiplo)
emit_move_insn (destlo, operands[1]);
 }
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258.c 
b/gcc/testsuite/gcc.target/aarc

Re: [PATCH] tree-optimization/115252 - enhance peeling for gaps avoidance

2024-05-29 Thread Richard Sandiford
Richard Biener  writes:
> Code generation for contiguous load vectorization can already deal
> with generalized avoidance of loading from a gap.  The following
> extends detection of peeling for gaps requirement with that,
> gets rid of the old special casing of a half load and makes sure
> when we do access the gap we have peeling for gaps enabled.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> This is the first patch in a series to improve peeling for gaps,
> it turned out into an improvement for code rather than just doing
> the (delayed from stage3) removal of the "old" half-vector codepath.
>
> I'll wait for the pre-CI testing for pushing so you also have time
> for some comments.

LGTM FWIW (some trivia below).

Out of interest, how far are we off being able to load:

a[i*8+0]
a[i*8+1]
a[i*8+3]
a[i*8+4]

as two half vectors?  It doesn't look like we're quite there yet,
but I might have misread.
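
(FWIW, a possible testcase for that, purely illustrative:

  void
  f (short *__restrict out, unsigned char *__restrict in)
  {
    for (int i = 0; i < 4; i++)
      {
        out[i * 4 + 0] = in[i * 8 + 0];
        out[i * 4 + 1] = in[i * 8 + 1];
        out[i * 4 + 2] = in[i * 8 + 3];
        out[i * 4 + 3] = in[i * 8 + 4];
      }
  }

i.e. a group of 8 with gaps at lanes 2 and 5-7.)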

It would be nice if we could eventually integrate the overrun_p checks
with the vectorizable_load code whose behaviour they are trying to predict.
E.g. we could run through the vectorizable_load code during the
analysis phase and record overruns, similarly to Kewen's costing
patches.  As it stands, it seems difficult to make sure that the two
checks are exactly in sync, especially when the structure is so
different.

> Richard.
>
>   PR tree-optimization/115252
>   * tree-vect-stmts.cc (get_group_load_store_type): Enhance
>   detecting the number of cases where we can avoid accessing a gap
>   during code generation.
>   (vectorizable_load): Remove old half-vector peeling for gap
>   avoidance which is now redundant.  Add gap-aligned case where
>   it's OK to access the gap.  Add assert that we have peeling for
>   gaps enabled when we access a gap.
>
>   * gcc.dg/vect/slp-gap-1.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +
>  gcc/tree-vect-stmts.cc| 58 +--
>  2 files changed, 46 insertions(+), 30 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c 
> b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> new file mode 100644
> index 000..36463ca22c5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3" } */
> +
> +typedef unsigned char uint8_t;
> +typedef short int16_t;
> +void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
> +  for (int y = 0; y < 4; y++) {
> +for (int x = 0; x < 4; x++)
> +  diff[x + y * 4] = pix1[x] - pix2[x];
> +pix1 += 16;
> +pix2 += 32;
> +  }
> +}
> +
> +/* We can vectorize this without peeling for gaps and thus without epilogue,
> +   but the only thing we can reliably scan is the zero-padding trick for the
> +   partial loads.  */
> +/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target 
> vect64 } } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index a01099d3456..b26cc74f417 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, 
> stmt_vec_info stmt_info,
> dr_alignment_support alss;
> int misalign = dr_misalignment (first_dr_info, vectype);
> tree half_vtype;
> +   poly_uint64 remain;
> +   unsigned HOST_WIDE_INT tem, num;
> if (overrun_p
> && !masked_p
> && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
> vectype, misalign)))
>  == dr_aligned
> || alss == dr_unaligned_supported)
> -   && known_eq (nunits, (group_size - gap) * 2)
> -   && known_eq (nunits, group_size)
> -   && (vector_vector_composition_type (vectype, 2, &half_vtype)
> -   != NULL_TREE))
> +   && can_div_trunc_p (group_size
> +   * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
> +   nunits, &tem, &remain)
> +   && (known_eq (remain, 0u)
> +   || (constant_multiple_p (nunits, remain, &num)
> +   && (vector_vector_composition_type (vectype, num,
> +   &half_vtype)
> +   != NULL_TREE
>   overrun_p = false;

Might be worth renaming half_vtype now that it isn't necessarily
a strict half.

>  
> if (overrun_p && !can_overrun_p)
> @@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
>   unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
>   unsigned int vect_align
> = vect_known_alignment_in_bytes (first_dr_info, vectype);
> - unsigned int scalar_dr_size
> -   = vect_get_scalar_dr_size (first_dr_info);
> 

Re: [PATCH 1/5] Do single-lane SLP discovery for reductions

2024-05-29 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, 24 May 2024, Richard Biener wrote:
>
>> This is the second merge proposed from the SLP vectorizer branch.
>> I have again managed without adding and using --param vect-single-lane-slp
>> but instead this provides always enabled functionality.
>> 
>> This makes us use SLP reductions (a group of reductions) for the
>> case where the group size is one.  This basically means we try
>> to use SLP for all reductions.
>> 
>> I've kept the series close to changes how they are on the branch
>> but in the end I'll squash it, having separate commits for review
>> eventually helps identifying common issues we will run into.  In
>> particular we lack full SLP support for several reduction kinds
>> and the branch has more enabling patches than in this series.
>> For example 4/5 makes sure we use shifts and direct opcode
>> reductions in the reduction epilog for SLP reductions but doesn't
>> bother to try covering the general case but enables it only
>> for the single-element group case to avoid regressions
>> in gcc.dg/vect/reduc-{mul,or}_[12].c testcases.
>> 
>> Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
>> successfully built SPEC CPU 2017.  This posting should trigger
>> arm & riscv pre-checkin CI.
>> 
>> There's one ICE in gcc.target/i386/pr51235.c I discovered late
>> that I will investigate and address after the weekend.
>
> I've fixed this now.
>
> On aarch64 and arm there's
>
> FAIL: gcc.dg/vect/slp-reduc-3.c scan-tree-dump-times vect "VEC_PERM_EXPR" 
> 0
>
> which is a testism, I _think_ due to a bogus vect_load_lanes check
> in that line.  The code is as expected not using a SLP reduction of
> two lanes due to the widen-sum pattern used.  It might be that we
> somehow fail to use load-lanes when vectorizing the load with SLP
> which means that for SLP reductions we fail to consider
> load-lanes as override.  I think we should leave this FAIL, we need to
> work to get load-lanes vectorization from SLP anyway.  To fix this
> the load-permutation followup I have in the works will be necessary.

Sounds good to me FWIW.

> I also see
>
> FAIL: gcc.target/aarch64/sve/dot_1.c scan-assembler-times \\twhilelo\\t 8
> FAIL: gcc.target/aarch64/sve/reduc_4.c scan-assembler-not \\tfadd\\t
> FAIL: gcc.target/aarch64/sve/sad_1.c scan-assembler-times 
> \\tudot\\tz[0-9]+\\.s, z[0-9]+\\.b, z[0-9]+\\.b\\n 2
>
> but scan-assemblers are not my favorite.  For example dot_1.c has
> twice as many whilelo, but I'm not sure what goes wrong.
>
> There are quite some regressions reported for RISC-V, I looked at the
> ICEs and fixed them but I did not investigate any of the assembly
> scanning FAILs.
>
> I'll re-spin the series with the fixes tomorrow.
> If anybody wants to point out something I should investigate please
> speak up.

Thanks for checking the aarch64 results.  I'll look at the three SVE
failures once the patch is in.  Many of the tests are supposed to ensure
that we generate correct code for a given set of choices.  Sometimes
it's necessary to update the flags to retain the same set of choices,
e.g. due to costing changes or general vectoriser improvements.

That is, the point of these tests isn't necessarily to make sure that we
get the "best" SVE code for the source -- especially since there isn't
really an abstract, objective "best" that applies to all targets.
The tests are instead recognising that we have multiple techniques for
doing some things, and are trying to make sure that each of those
techniques works individually.

I realise that kind of test isn't popular with everyone.  The quid
pro quo is that we (AArch64 folks) get to look at the tests when
failures show up :)

Richard

>
> Thanks,
> Richard.
>
>> This change should be more straight-forward than the previous one,
>> still comments are of course welcome.  After pushed I will followup
>> with changes to enable single-lane SLP reductions for various
>> COND_EXPR reductions as well as double-reduction support and
>> in-order reduction support (also all restricted to single-lane
>> for the moment).
>>
>> Thanks,
>> Richard.
>> 
>> --
>> 
>> The following performs single-lane SLP discovery for reductions.
>> This exposes a latent issue with reduction SLP in outer loop
>> vectorization and makes gcc.dg/vect/vect-outer-4[fgkl].c FAIL
>> execution.
>> 
>>  * tree-vect-slp.cc (vect_build_slp_tree_2): Only multi-lane
>>  discoveries are reduction chains and need special backedge
>>  treatment.
>>  (vect_analyze_slp): Fall back to single-lane SLP discovery
>>  for reductions.  Make sure to try single-lane SLP reduction
>>  for all reductions as fallback.
>> ---
>>  gcc/tree-vect-slp.cc | 71 +---
>>  1 file changed, 54 insertions(+), 17 deletions(-)
>> 
>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> index c7ed520b629..73cc69d85ce 100644
>> --- a/gcc/tree-vect-slp.cc
>> +++ b/gcc/tree-vect-slp.cc
>> @@ -1907,7 +1907,8 @@ vect_build_slp_tree_2 

Re: [PATCHv3] Optab: add isfinite_optab for __builtin_isfinite

2024-05-28 Thread Richard Sandiford
HAO CHEN GUI  writes:
> Hi,
>   This patch adds an optab for __builtin_isfinite. The finite check can be
> implemented on rs6000 by a single instruction. It needs an optab to be
> expanded to the certain sequence of instructions.
>
>   The subsequent patches will implement the expand on rs6000.
>
>   Compared to previous version, the main change is to specify acceptable
> modes for the optab.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652170.html
>
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?
>
> Thanks
> Gui Haochen
>
> ChangeLog
> optab: Add isfinite_optab for isfinite builtin
>
> gcc/
>   * builtins.cc (interclass_mathfn_icode): Set optab to isfinite_optab
>   for isfinite builtin.
>   * optabs.def (isfinite_optab): New.
>   * doc/md.texi (isfinite): Document.
>
>
> patch.diff
> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> index f8d94c4b435..b8432f84020 100644
> --- a/gcc/builtins.cc
> +++ b/gcc/builtins.cc
> @@ -2459,8 +2459,9 @@ interclass_mathfn_icode (tree arg, tree fndecl)
>errno_set = true; builtin_optab = ilogb_optab; break;
>  CASE_FLT_FN (BUILT_IN_ISINF):
>builtin_optab = isinf_optab; break;
> -case BUILT_IN_ISNORMAL:
>  case BUILT_IN_ISFINITE:
> +  builtin_optab = isfinite_optab; break;
> +case BUILT_IN_ISNORMAL:
>  CASE_FLT_FN (BUILT_IN_FINITE):
>  case BUILT_IN_FINITED32:
>  case BUILT_IN_FINITED64:
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 5730bda80dc..67407fad37d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -8557,6 +8557,15 @@ operand 2, greater than operand 2 or is unordered with 
> operand 2.
>
>  This pattern is not allowed to @code{FAIL}.
>
> +@cindex @code{isfinite@var{m}2} instruction pattern
> +@item @samp{isfinite@var{m}2}
> +Set operand 0 to nonzero if operand 1 is a finite @code{SFmode},
> +@code{DFmode}, or @code{TFmode} floating point number and to 0
> +otherwise.

This has probably already been discussed, sorry, but how about defining
the optab to return a strict 0/1 result, rather than just zero/nonzero?
I realise that's stricter than the underlying math.h routines, but it
would in principle avoid the need to expand extra instructions in
a setcc-like operation.
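
(As a concrete illustration of what I mean, for something like

  int g (double x) { return !__builtin_isfinite (x); }

a zero/nonzero result has to be normalised with an extra comparison
before the logical negation, whereas a strict 0/1 result could simply
be XORed with 1.  Only a sketch of the kind of saving, of course.)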

Richard

> +
> +If this pattern @code{FAIL}, a call to the library function
> +@code{isfinite} is used.
> +
>  @end table
>
>  @end ifset
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index ad14f9328b9..dcd77315c2a 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -352,6 +352,7 @@ OPTAB_D (fmod_optab, "fmod$a3")
>  OPTAB_D (hypot_optab, "hypot$a3")
>  OPTAB_D (ilogb_optab, "ilogb$a2")
>  OPTAB_D (isinf_optab, "isinf$a2")
> +OPTAB_D (isfinite_optab, "isfinite$a2")
>  OPTAB_D (issignaling_optab, "issignaling$a2")
>  OPTAB_D (ldexp_optab, "ldexp$a3")
>  OPTAB_D (log10_optab, "log10$a2")


Re: [PATCH v3] tree-ssa-pre.c/115214(ICE in find_or_generate_expression, at tree-ssa-pre.c:2780): Return NULL_TREE when deal special cases.

2024-05-28 Thread Richard Sandiford
Richard Biener  writes:
> On Mon, May 27, 2024 at 9:48 AM Jiawei  wrote:
>>
>> Return NULL_TREE when genop3 equals EXACT_DIV_EXPR.
>> https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652641.html
>>
>> version log v3: remove additional POLY_INT_CST check.
>> https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652795.html
>
> OK.
>
> Thanks,
> Richard.
>
>> gcc/ChangeLog:
>>
>> * tree-ssa-pre.cc (create_component_ref_by_pieces_1): New conditions.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.target/riscv/rvv/vsetvl/pr115214.c: New test.
>>
>> ---
>>  .../gcc.target/riscv/rvv/vsetvl/pr115214.c| 52 +++
>>  gcc/tree-ssa-pre.cc   | 10 ++--
>>  2 files changed, 59 insertions(+), 3 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr115214.c
>>
>> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr115214.c 
>> b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr115214.c
>> new file mode 100644
>> index 000..fce2e9da766
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr115214.c
>> @@ -0,0 +1,52 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-mrvv-vector-bits=scalable -march=rv64gcv -mabi=lp64d -O3 
>> -w" } */
>> +/* { dg-skip-if "" { *-*-* } { "-flto" } } */
>> +
>> +#include <riscv_vector.h>
>> +
>> +static inline __attribute__(()) int vaddq_f32();
>> +static inline __attribute__(()) int vload_tillz_f32(int nlane) {
>> +  vint32m1_t __trans_tmp_9;
>> +  {
>> +int __trans_tmp_0 = nlane;
>> +{
>> +  vint64m1_t __trans_tmp_1;
>> +  vint64m1_t __trans_tmp_2;
>> +  vint64m1_t __trans_tmp_3;
>> +  vint64m1_t __trans_tmp_4;
>> +  if (__trans_tmp_0 == 1) {
>> +{
>> +  __trans_tmp_3 =
>> +  __riscv_vslideup_vx_i64m1(__trans_tmp_1, __trans_tmp_2, 1, 2);
>> +}
>> +__trans_tmp_4 = __trans_tmp_2;
>> +  }
>> +  __trans_tmp_4 = __trans_tmp_3;
>> +  __trans_tmp_9 = __riscv_vreinterpret_v_i64m1_i32m1(__trans_tmp_3);
>> +}
>> +  }
>> +  return vaddq_f32(__trans_tmp_9); /* { dg-error {RVV type 'vint32m1_t' 
>> cannot be passed to an unprototyped function} } */
>> +}
>> +
>> +char CFLOAT_add_args[3];
>> +const int *CFLOAT_add_steps;
>> +const int CFLOAT_steps;
>> +
>> +__attribute__(()) void CFLOAT_add() {
>> +  char *b_src0 = &CFLOAT_add_args[0], *b_src1 = &CFLOAT_add_args[1],
>> +   *b_dst = &CFLOAT_add_args[2];
>> +  const float *src1 = (float *)b_src1;
>> +  float *dst = (float *)b_dst;
>> +  const int ssrc1 = CFLOAT_add_steps[1] / sizeof(float);
>> +  const int sdst = CFLOAT_add_steps[2] / sizeof(float);
>> +  const int hstep = 4 / 2;
>> +  vfloat32m1x2_t a;
>> +  int len = 255;
>> +  for (; len > 0; len -= hstep, src1 += 4, dst += 4) {
>> +int b = vload_tillz_f32(len);
>> +int r = vaddq_f32(a.__val[0], b); /* { dg-error {RVV type 
>> '__rvv_float32m1_t' cannot be passed to an unprototyped function} } */
>> +  }
>> +  for (; len > 0; --len, b_src0 += CFLOAT_steps,
>> +  b_src1 += CFLOAT_add_steps[1], b_dst += 
>> CFLOAT_add_steps[2])
>> +;
>> +}
>> diff --git a/gcc/tree-ssa-pre.cc b/gcc/tree-ssa-pre.cc
>> index 75217f5cde1..5cf1968bc26 100644
>> --- a/gcc/tree-ssa-pre.cc
>> +++ b/gcc/tree-ssa-pre.cc
>> @@ -2685,11 +2685,15 @@ create_component_ref_by_pieces_1 (basic_block block, 
>> vn_reference_t ref,
>>here as the element alignment may be not visible.  See
>>PR43783.  Simply drop the element size for constant
>>sizes.  */
>> -   if (TREE_CODE (genop3) == INTEGER_CST
>> +   if ((TREE_CODE (genop3) == INTEGER_CST
>> && TREE_CODE (TYPE_SIZE_UNIT (elmt_type)) == INTEGER_CST
>> && wi::eq_p (wi::to_offset (TYPE_SIZE_UNIT (elmt_type)),
>> -(wi::to_offset (genop3)
>> - * vn_ref_op_align_unit (currop
>> +(wi::to_offset (genop3) * vn_ref_op_align_unit 
>> (currop

Sorry for the nits, but the original formatting was correct here.
The new one instead goes over 80 columns.

>> + || (TREE_CODE (genop3) == EXACT_DIV_EXPR
>> +   && TREE_CODE (TREE_OPERAND (genop3, 1)) == INTEGER_CST
>> +   && operand_equal_p (TREE_OPERAND (genop3, 0), TYPE_SIZE_UNIT 
>> (elmt_type))

Similarly this line is too long.

Thanks for fixing this.

Richard

>> +   && wi::eq_p (wi::to_offset (TREE_OPERAND (genop3, 1)),
>> +vn_ref_op_align_unit (currop
>>   genop3 = NULL_TREE;
>> else
>>   {
>> --
>> 2.25.1
>>


Re: [PATCH v2 1/2] driver: Use -as/ld/objcopy as final fallback instead of native ones for cross

2024-05-28 Thread Richard Sandiford
YunQiang Su  writes:
> If `find_a_program` cannot find `as/ld/objcopy` and we are a cross toolchain,
> the final fallback is the system's `as/ld`.  In fact, we can try
> -as/ld/objcopy before falling back to the native as/ld/objcopy.
>
> This patch is derivatived from Debian's patch:
>   gcc-search-prefixed-as-ld.diff

I'm probably making you repeat a previous discussion, sorry, but could
you describe the use case in more detail?  The current approach to
handling cross toolchains has been used for many years.  Presumably
this patch is supporting a different way of organising things,
but I wasn't sure from the description what it was.

AIUI, we currently assume that cross as, ld and objcopy will be
installed under those names in $prefix/$target_alias/bin (aka $tooldir/bin).
E.g.:

   bin/aarch64-elf-as = aarch64-elf/bin/as

GCC should then find as in aarch64-elf/bin.

Is that not true in your case?

To be clear, I'm not saying the patch is wrong.  I'm just trying to
understand why the patch is needed.

Thanks,
Richard

>
> gcc
>   * gcc.cc(execute): Looks for -as/ld/objcopy before fallback
>   to native as/ld/objcopy.
> ---
>  gcc/gcc.cc | 20 
>  1 file changed, 20 insertions(+)
>
> diff --git a/gcc/gcc.cc b/gcc/gcc.cc
> index 830a4700a87..3dc6348d761 100644
> --- a/gcc/gcc.cc
> +++ b/gcc/gcc.cc
> @@ -3293,6 +3293,26 @@ execute (void)
>string = find_a_program(commands[0].prog);
>if (string)
>   commands[0].argv[0] = string;
> +  else if (*cross_compile != '0'
> + && !strcmp (commands[0].argv[0], commands[0].prog)
> + && (!strcmp (commands[0].prog, "as")
> + || !strcmp (commands[0].prog, "ld")
> + || !strcmp (commands[0].prog, "objcopy")))
> + {
> +   string = concat (DEFAULT_REAL_TARGET_MACHINE, "-",
> + commands[0].prog, NULL);
> +   const char *string_args[] = {string, "--version", NULL};
> +   int exit_status = 0;
> +   int err = 0;
> +   const char *errmsg = pex_one (PEX_SEARCH, string,
> +   CONST_CAST (char **, string_args), string,
> +   NULL, NULL, &exit_status, &err);
> +   if (errmsg == NULL && exit_status == 0 && err == 0)
> + {
> +   commands[0].argv[0] = string;
> +   commands[0].prog = string;
> + }
> + }
>  }
>  
>for (n_commands = 1, i = 0; argbuf.iterate (i, ); i++)


Re: [PATCH] attribs: Fix and refactor diag_attr_exclusions

2024-05-28 Thread Richard Sandiford
Andrew Carlotti  writes:
> The existing implementation of this function was convoluted, and had
> multiple control flow errors that became apparent to me while reading
> the code:
>
> 1. The initial early return only checked the properties of the first
> exclusion in the list, when these properties could be different for
> subsequent exclusions.
>
> 2. excl was not reset within the outer loop, so the inner loop body
> would only execute during the first iteration of the outer loop.  This
> effectively meant that the value of attrs[1] was ignored.
>
> 3. The function called itself recursively twice, with both last_decl and
> TREE_TYPE (last_decl) as parameters. The second recursive call should
> have been redundant, since attrs[1] = TREE_TYPE (last_decl) during the
> first recursive call.

Thanks for doing this.  Agree with the above.

> This patch eliminates the early return, and combines the checks with
> those present within the inner loop.  It also fixes the inner loop
> initialisation, and modifies the outer loop to iterate over nodes
> instead of their attributes. This latter change allows the recursion to
> be eliminated, by extending the new nodes array to include last_decl
> (and its type) as well.
>
> This patch provides an alternative fix for PR114634, although I wasn't
> aware of that issue until rebasing on top of Jakub's fix.
>
> I am not aware of any other compiler bugs resulting from these issues.
> However, if the exclusions for target_clones were listed in the opposite
> order, then it would have broken detection of the always_inline
> exclusion on aarch64 (where TARGET_HAS_FMV_TARGET_ATTRIBUTE is false).
>
> Is this ok for master?
>
> gcc/ChangeLog:
>
>   * attribs.cc (diag_attr_exclusions): Fix and refactor.
>
>
> diff --git a/gcc/attribs.cc b/gcc/attribs.cc
> index 
> 3ab0b0fd87a4404a593b2de365ea5226e31fe24a..431dd4255e68e92dd8d10bbb21ea079e50811faa
>  100644
> --- a/gcc/attribs.cc
> +++ b/gcc/attribs.cc
> @@ -433,84 +433,69 @@ get_attribute_namespace (const_tree attr)
> or a TYPE.  */
>  
>  static bool
> -diag_attr_exclusions (tree last_decl, tree node, tree attrname,
> +diag_attr_exclusions (tree last_decl, tree base_node, tree attrname,
> const attribute_spec *spec)
>  {
> -  const attribute_spec::exclusions *excl = spec->exclude;
>  
> -  tree_code code = TREE_CODE (node);
> +  /* BASE_NODE is either the current decl to which the attribute is being
> + applied, or its type.  For the former, consider the attributes on both 
> the
> + decl and its type.  Check both LAST_DECL and its type as well.  */
>  
> -  if ((code == FUNCTION_DECL && !excl->function
> -   && (!excl->type || !spec->affects_type_identity))
> -  || (code == VAR_DECL && !excl->variable
> -   && (!excl->type || !spec->affects_type_identity))
> -  || (((code == TYPE_DECL || RECORD_OR_UNION_TYPE_P (node)) && 
> !excl->type)))
> -return false;
> +  tree nodes[4] = { NULL_TREE, NULL_TREE, NULL_TREE, NULL_TREE };
>  
> -  /* True if an attribute that's mutually exclusive with ATTRNAME
> - has been found.  */
> -  bool found = false;
> +  nodes[0] = base_node;
> +  if (DECL_P (base_node))
> +  nodes[1] = (TREE_TYPE (base_node));

Nit: too much indentation.

> -  if (last_decl && last_decl != node && TREE_TYPE (last_decl) != node)
> +  if (last_decl)
>  {
> -  /* Check both the last DECL and its type for conflicts with
> -  the attribute being added to the current decl or type.  */
> -  found |= diag_attr_exclusions (last_decl, last_decl, attrname, spec);
> -  tree decl_type = TREE_TYPE (last_decl);
> -  found |= diag_attr_exclusions (last_decl, decl_type, attrname, spec);
> +  nodes[2] = last_decl;
> +  if (DECL_P (last_decl))
> +   nodes[3] = TREE_TYPE (last_decl);
>  }
>  
> -  /* NODE is either the current DECL to which the attribute is being
> - applied or its TYPE.  For the former, consider the attributes on
> - both the DECL and its type.  */
> -  tree attrs[2];
> -
> -  if (DECL_P (node))
> -{
> -  attrs[0] = DECL_ATTRIBUTES (node);
> -  if (TREE_TYPE (node))
> - attrs[1] = TYPE_ATTRIBUTES (TREE_TYPE (node));
> -  else
> - /* TREE_TYPE can be NULL e.g. while processing attributes on
> -enumerators.  */
> - attrs[1] = NULL_TREE;
> -}
> -  else
> -{
> -  attrs[0] = TYPE_ATTRIBUTES (node);
> -  attrs[1] = NULL_TREE;
> -}
> +  /* True if an attribute that's mutually exclusive with ATTRNAME
> + has been found.  */
> +  bool found = false;
>  
>/* Iterate over the mutually exclusive attribute names and verify
>   that the symbol doesn't contain it.  */
> -  for (unsigned i = 0; i != ARRAY_SIZE (attrs); ++i)
> +  for (unsigned i = 0; i != ARRAY_SIZE (nodes); ++i)
>  {
> -  if (!attrs[i])
> +  tree node = nodes[i];
> +
> +  if (!node)
>   continue;
>  
> -  for ( ; excl->name; ++excl)
> +  tree attr;
> +  if DECL_P 

[gcc r15-820] vect: Fix access size alignment assumption [PR115192]

2024-05-24 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:a0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba

commit r15-820-ga0fe4fb1c8d7804515845dd5d2a814b3c7a1ccba
Author: Richard Sandiford 
Date:   Fri May 24 13:47:21 2024 +0100

vect: Fix access size alignment assumption [PR115192]

create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.

Diff:
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index db15ddb43de..7c4049faf34 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }


[PATCH] vect: Fix access size alignment assumption [PR115192]

2024-05-24 Thread Richard Sandiford
create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.
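
For illustration (numbers invented): suppose both starts are 16-byte
aligned, seg_len_a is 16 and the access size is 4, so that the
exclusive end_a is start_a + 20 and can be equal to start_b + 4.
The exact check end_a <= start_b correctly reports a possible alias,
but subtracting min_align == 16 gives an inclusive maximum of
start_b - 12, which is below start_b, so the aligned variant wrongly
reports no alias.  Capping min_align by the alignment of the access
sizes (4 here) keeps the two forms equivalent.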

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

Tested on aarch64-linux-gnu & x86_64-linux-gnu.  OK for trunk
and backports?

Richard


gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr115192.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index db15ddb43de..7c4049faf34 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }
 
-- 
2.25.1



Re: [PATCH] aarch64: Fold vget_high_* intrinsics to BIT_FIELD_REF [PR102171]

2024-05-22 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch is a follow-up of r15-697-ga2e4fe5a53cf75 to also fold vget_high_*
> intrinsics to BIT_FIELD_REF and remove the vget_high_* definitions from
> arm_neon.h to use the new intrinsics framework.
>
>   PR target/102171
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (AARCH64_SIMD_VGET_HIGH_BUILTINS):
>   New macro to create definitions for all vget_high intrinsics.
>   (VGET_HIGH_BUILTIN): Likewise.
>   (enum aarch64_builtins): Add vget_high function codes.
>   (AARCH64_SIMD_VGET_LOW_BUILTINS): Delete duplicate macro.
>   (aarch64_general_fold_builtin): Fold vget_high calls.
>   * config/aarch64/aarch64-simd-builtins.def: Delete vget_high builtins.
>   * config/aarch64/aarch64-simd.md (aarch64_get_high): Delete.
>   (aarch64_vget_hi_halfv8bf): Likewise.
>   * config/aarch64/arm_neon.h (__attribute__): Delete.
>   (vget_high_f16): Likewise.
>   (vget_high_f32): Likewise.
>   (vget_high_f64): Likewise.
>   (vget_high_p8): Likewise.
>   (vget_high_p16): Likewise.
>   (vget_high_p64): Likewise.
>   (vget_high_s8): Likewise.
>   (vget_high_s16): Likewise.
>   (vget_high_s32): Likewise.
>   (vget_high_s64): Likewise.
>   (vget_high_u8): Likewise.
>   (vget_high_u16): Likewise.
>   (vget_high_u32): Likewise.
>   (vget_high_u64): Likewise.
>   (vget_high_bf16): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/vget_high_2.c: New test.
>   * gcc.target/aarch64/vget_high_2_be.c: New test.

OK, thanks.

Richard

> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-builtins.cc|  59 +++---
>  gcc/config/aarch64/aarch64-simd-builtins.def  |   6 -
>  gcc/config/aarch64/aarch64-simd.md|  22 
>  gcc/config/aarch64/arm_neon.h | 105 --
>  .../gcc.target/aarch64/vget_high_2.c  |  30 +
>  .../gcc.target/aarch64/vget_high_2_be.c   |  31 ++
>  6 files changed, 104 insertions(+), 149 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vget_high_2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vget_high_2_be.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 11b888016ed..f8eeccb554d 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -675,6 +675,23 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>VGET_LOW_BUILTIN(u64) \
>VGET_LOW_BUILTIN(bf16)
>  
> +#define AARCH64_SIMD_VGET_HIGH_BUILTINS \
> +  VGET_HIGH_BUILTIN(f16) \
> +  VGET_HIGH_BUILTIN(f32) \
> +  VGET_HIGH_BUILTIN(f64) \
> +  VGET_HIGH_BUILTIN(p8) \
> +  VGET_HIGH_BUILTIN(p16) \
> +  VGET_HIGH_BUILTIN(p64) \
> +  VGET_HIGH_BUILTIN(s8) \
> +  VGET_HIGH_BUILTIN(s16) \
> +  VGET_HIGH_BUILTIN(s32) \
> +  VGET_HIGH_BUILTIN(s64) \
> +  VGET_HIGH_BUILTIN(u8) \
> +  VGET_HIGH_BUILTIN(u16) \
> +  VGET_HIGH_BUILTIN(u32) \
> +  VGET_HIGH_BUILTIN(u64) \
> +  VGET_HIGH_BUILTIN(bf16)
> +
>  typedef struct
>  {
>const char *name;
> @@ -717,6 +734,9 @@ typedef struct
>  #define VGET_LOW_BUILTIN(A) \
>AARCH64_SIMD_BUILTIN_VGET_LOW_##A,
>  
> +#define VGET_HIGH_BUILTIN(A) \
> +  AARCH64_SIMD_BUILTIN_VGET_HIGH_##A,
> +
>  #undef VAR1
>  #define VAR1(T, N, MAP, FLAG, A) \
>AARCH64_SIMD_BUILTIN_##T##_##N##A,
> @@ -753,6 +773,7 @@ enum aarch64_builtins
>/* SIMD intrinsic builtins.  */
>AARCH64_SIMD_VREINTERPRET_BUILTINS
>AARCH64_SIMD_VGET_LOW_BUILTINS
> +  AARCH64_SIMD_VGET_HIGH_BUILTINS
>/* ARMv8.3-A Pointer Authentication Builtins.  */
>AARCH64_PAUTH_BUILTIN_AUTIA1716,
>AARCH64_PAUTH_BUILTIN_PACIA1716,
> @@ -855,26 +876,21 @@ static aarch64_fcmla_laneq_builtin_datum 
> aarch64_fcmla_lane_builtin_data[] = {
> false \
>},
>  
> -#define AARCH64_SIMD_VGET_LOW_BUILTINS \
> -  VGET_LOW_BUILTIN(f16) \
> -  VGET_LOW_BUILTIN(f32) \
> -  VGET_LOW_BUILTIN(f64) \
> -  VGET_LOW_BUILTIN(p8) \
> -  VGET_LOW_BUILTIN(p16) \
> -  VGET_LOW_BUILTIN(p64) \
> -  VGET_LOW_BUILTIN(s8) \
> -  VGET_LOW_BUILTIN(s16) \
> -  VGET_LOW_BUILTIN(s32) \
> -  VGET_LOW_BUILTIN(s64) \
> -  VGET_LOW_BUILTIN(u8) \
> -  VGET_LOW_BUILTIN(u16) \
> -  VGET_LOW_BUILTIN(u32) \
> -  VGET_LOW_BUILTIN(u64) \
> -  VGET_LOW_BUILTIN(bf16)
> +#undef VGET_HIGH_BUILTIN
> +#define VGET_HIGH_BUILTIN(A) \
> +  {"vget_high_" #A, \
> +   AARCH64_SIMD_BUILTIN_VGET_HIGH_##A, \
> +   2, \
> +   { SIMD_INTR_MODE(A, d), SIMD_INTR_MODE(A, q) }, \
> +   { SIMD_INTR_QUAL(A), SIMD_INTR_QUAL(A) }, \
> +   FLAG_AUTO_FP, \
> +   false \
> +  },
>  
>  static const aarch64_simd_intrinsic_datum aarch64_simd_intrinsic_data[] = {
>AARCH64_SIMD_VREINTERPRET_BUILTINS
>AARCH64_SIMD_VGET_LOW_BUILTINS
> +  AARCH64_SIMD_VGET_HIGH_BUILTINS
>  };
>  
>  
> @@ -3270,6 +3286,10 @@ aarch64_fold_builtin_lane_check (tree arg0, tree arg1, 
> tree arg2)
>  #define 

Re: [PATCH v1 5/6] Adjust DLL import/export implementation for AArch64

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> The DLL import/export mingw implementation, originally from ix86, requires
> minor adjustments to be compatible with AArch64.
>
> gcc/ChangeLog:
>
>   * config/mingw/mingw32.h (defined): Use the correct DllMainCRTStartup
>   entry function.
>   * config/mingw/winnt-dll.cc (defined): Exclude ix86-related code.
> ---
>  gcc/config/mingw/mingw32.h| 2 +-
>  gcc/config/mingw/winnt-dll.cc | 4 
>  2 files changed, 5 insertions(+), 1 deletion(-)

Could we provide some abstractions here, rather than testing
CPU-specific macros directly?  E.g.:

>
> diff --git a/gcc/config/mingw/mingw32.h b/gcc/config/mingw/mingw32.h
> index 08f1b5f0696..efe777051b4 100644
> --- a/gcc/config/mingw/mingw32.h
> +++ b/gcc/config/mingw/mingw32.h
> @@ -79,7 +79,7 @@ along with GCC; see the file COPYING3.  If not see
>  #endif
>  
>  #undef SUB_LINK_ENTRY
> -#if TARGET_64BIT_DEFAULT
> +#if TARGET_64BIT_DEFAULT || defined (TARGET_AARCH64_MS_ABI)

it looks like this is equivalent to something like "HAVE_64BIT_POINTERS"
or something, which aarch64 could define to 1 and x86 could define
to TARGET_64BIT_DEFAULT.

The name is just a suggestion, based on not really knowing what the
macro selects.  Please use whatever makes most sense :)
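
To be concrete, I was imagining something like (the name and file
placement below are just guesses):

  /* config/aarch64/cygming.h */
  #define HAVE_64BIT_POINTERS 1

  /* config/i386/cygming.h */
  #define HAVE_64BIT_POINTERS TARGET_64BIT_DEFAULT

with mingw32.h then testing HAVE_64BIT_POINTERS instead of the
CPU-specific macros.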

>  #define SUB_LINK_ENTRY SUB_LINK_ENTRY64
>  #else
>  #define SUB_LINK_ENTRY SUB_LINK_ENTRY32
> diff --git a/gcc/config/mingw/winnt-dll.cc b/gcc/config/mingw/winnt-dll.cc
> index 349ade6f5c0..294361fab4c 100644
> --- a/gcc/config/mingw/winnt-dll.cc
> +++ b/gcc/config/mingw/winnt-dll.cc
> @@ -206,9 +206,13 @@ legitimize_pe_coff_symbol (rtx addr, bool inreg)
>   }
>  }
>  
> +#if !defined (TARGET_AARCH64_MS_ABI)
> +
>if (ix86_cmodel != CM_LARGE_PIC && ix86_cmodel != CM_MEDIUM_PIC)
>  return NULL_RTX;
>  
> +#endif
> +

Similarly here, it feels like there is a concept underlying this check.
Could we just use:

  if (!NEW_MACRO)
return NULL_RTX;

with NEW_MACRO describing the underlying property that is common to
medium x86 PIC, large x86 PIC, and aarch64.
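
E.g. (sketch only, the macro name is invented):

  /* i386: */
  #define PECOFF_LEGITIMIZE_EXTERN_DECL \
    (ix86_cmodel == CM_LARGE_PIC || ix86_cmodel == CM_MEDIUM_PIC)

  /* aarch64: */
  #define PECOFF_LEGITIMIZE_EXTERN_DECL 1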

Thanks,
Richard

>if (GET_CODE (addr) == SYMBOL_REF
>&& !is_imported_p (addr)
>&& SYMBOL_REF_EXTERNAL_P (addr)


Re: [PATCH v1 4/6] aarch64: Add selectany attribute handling

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> This patch extends the aarch64 attributes list with the selectany
> attribute for the aarch64-w64-mingw32 target and reuses the mingw
> implementation to handle it.
>
>   * config/aarch64/aarch64.cc:
>   Extend the aarch64 attributes list.
>   * config/aarch64/cygming.h (SUBTARGET_ATTRIBUTE_TABLE):
>   Define the selectany attribute.

Now that TARGET_ATTRIBUTE_TABLE is an array, it should in principle
be possible to define the attribute in winnt.cc and so avoid
duplicating the table entry.  That'd be a separate clean-up though.
I agree that for this series we should stick with the current approach.

So the patch is ok, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 5 -
>  gcc/config/aarch64/cygming.h  | 3 +++
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index c763a8a6298..19205927430 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -855,7 +855,10 @@ static const attribute_spec aarch64_gnu_attributes[] =
> NULL },
>{ "Advanced SIMD type", 1, 1, false, true,  false, true,  NULL, NULL },
>{ "SVE type",3, 3, false, true,  false, true,  NULL, NULL 
> },
> -  { "SVE sizeless type",  0, 0, false, true,  false, true,  NULL, NULL }
> +  { "SVE sizeless type",  0, 0, false, true,  false, true,  NULL, NULL },
> +#ifdef SUBTARGET_ATTRIBUTE_TABLE
> +  SUBTARGET_ATTRIBUTE_TABLE
> +#endif
>  };
>  
>  static const scoped_attribute_specs aarch64_gnu_attribute_table =
> diff --git a/gcc/config/aarch64/cygming.h b/gcc/config/aarch64/cygming.h
> index 0d048879311..76623153080 100644
> --- a/gcc/config/aarch64/cygming.h
> +++ b/gcc/config/aarch64/cygming.h
> @@ -154,6 +154,9 @@ still needed for compilation.  */
>  flag_stack_check = STATIC_BUILTIN_STACK_CHECK;   \
>} while (0)
>  
> +#define SUBTARGET_ATTRIBUTE_TABLE \
> +  { "selectany", 0, 0, true, false, false, false, \
> +mingw_handle_selectany_attribute, NULL }
>  
>  #define SUPPORTS_ONE_ONLY 1


Re: [PATCH v1 3/6] Rename functions for reuse in AArch64

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> This patch renames functions related to dllimport/dllexport
> and selectany functionality. These functions will be reused
> in the aarch64-w64-mingw32 target.
>
> gcc/ChangeLog:
>
>   * config/i386/cygming.h (mingw_pe_record_stub):
>   Rename functions in mingw folder which will be reused for
>   aarch64.
>   (TARGET_ASM_FILE_END): Update to new target-independent name.
>   (SUBTARGET_ATTRIBUTE_TABLE): Likewise.
>   (TARGET_VALID_DLLIMPORT_ATTRIBUTE_P): Likewise.
>   (SUB_TARGET_RECORD_STUB): Likewise.
>   * config/i386/i386-protos.h (ix86_handle_selectany_attribute): Likewise.
>   (mingw_handle_selectany_attribute): Likewise.
>   (i386_pe_valid_dllimport_attribute_p): Likewise.
>   (mingw_pe_valid_dllimport_attribute_p): Likewise.
>   (i386_pe_file_end): Likewise.
>   (mingw_pe_file_end): Likewise.
>   (i386_pe_record_stub): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   * config/mingw/winnt.cc (ix86_handle_selectany_attribute): Likewise.
>   (mingw_handle_selectany_attribute): Likewise.
>   (i386_pe_valid_dllimport_attribute_p): Likewise.
>   (mingw_pe_valid_dllimport_attribute_p): Likewise.
>   (i386_pe_record_stub): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   (i386_pe_file_end): Likewise.
>   (mingw_pe_file_end): Likewise.
>   * config/mingw/winnt.h (mingw_handle_selectany_attribute):
>   Declare functionality that will be reused by multiple targets.
>   (mingw_pe_file_end): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   (mingw_pe_valid_dllimport_attribute_p): Likewise.

Ok, but...

> [...]
> diff --git a/gcc/config/mingw/winnt.cc b/gcc/config/mingw/winnt.cc
> index 9901576ade0..a0b5950be2e 100644
> --- a/gcc/config/mingw/winnt.cc
> +++ b/gcc/config/mingw/winnt.cc
> @@ -71,7 +71,7 @@ ix86_handle_shared_attribute (tree *node, tree name, tree, 
> int,
>  /* Handle a "selectany" attribute;
> arguments as in struct attribute_spec.handler.  */
>  tree
> -ix86_handle_selectany_attribute (tree *node, tree name, tree, int,
> +mingw_handle_selectany_attribute (tree *node, tree name, tree, int,
>bool *no_add_attrs)

please reindent the parameters for the new name length.

Thanks,
Richard


Re: [PATCH 4/4] Testsuite updates

2024-05-22 Thread Richard Sandiford
Richard Biener  writes:
> On Tue, 21 May 2024, Richard Biener wrote:
>
>> The gcc.dg/vect/slp-12a.c case is interesting as we currently split
>> the 8 store group into lanes 0-5 which we SLP with an unroll factor
>> of two (on x86-64 with SSE) and the remaining two lanes are using
>> interleaving vectorization with a final unroll factor of four.  Thus
>> we're using hybrid SLP within a single store group.  After the change
>> we discover the same 0-5 lane SLP part as well as two single-lane
>> parts feeding the full store group.  But that results in a load
>> permutation that isn't supported (I have WIP patchs to rectify that).
>> So we end up cancelling SLP and vectorizing the whole loop with
>> interleaving which is IMO good and results in better code.
>> 
>> This is similar for gcc.target/i386/pr52252-atom.c where interleaving
>> generates much better code than hybrid SLP.  I'm unsure how to update
>> the testcase though.
>> 
>> gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
>> analyzing SLP operations, if we discard an instance we currently
>> force the full loop to have no SLP because hybrid detection is
>> broken.  It's probably not worth fixing this at this moment.
>> 
>> For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
>> into two but merge the two 8 lane loads into one before doing the
>> store and thus have only a single SLP instance.  A similar situation
>> happens in gcc.dg/vect/slp-11c.c but the branches feeding the
>> single SLP store only have a single lane.  Likewise for
>> gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.
>> 
>> gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
>> with a SLP store group of size two but two single-lane branches.
>> 
>> gcc.target/i386/pr98928.c ICEs in SLP permute optimization
>> because we don't expect a constant and internal branch to be
>> merged with a permute node in
>> vect_optimize_slp_pass::change_vec_perm_layout:4859 (the only
>> permutes merging two SLP nodes are two-operator nodes right now).
>> This still requires fixing.
>> 
>> The whole series has been bootstrapped and tested on 
>> x86_64-unknown-linux-gnu with the gcc.target/i386/pr98928.c FAIL
>> unfixed.
>> 
>> Comments welcome (and hello ARM CI), RISC-V and other arch
>> testing appreciated.  Unless there are comments to the contrary
>> I plan to push patch 1 and 2 tomorrow.
>
> RISC-V CI didn't trigger (not sure what magic is required).  Both
> ARM and AARCH64 show that the "Vectorizing stmts using SLP" scans are a bit
> fragile because we sometimes cancel SLP because we want to use
> load/store-lanes.
>
> I have locally scrapped the SLP scanning for gcc.dg/vect/slp-21.c where
> it doesn't really matter (and once we are finished with all-SLP it
> won't matter anywhere).  I've conditionalized the outcome based on
> vect_load_lanes for gcc.dg/vect/slp-11c.c and
> gcc.dg/vect/slp-cond-1.c
>
> On AARCH64 additionally gcc.target/aarch64/sve/mask_struct_store_4.c
> ICEs; I have a fix for that.
>
> gcc.target/aarch64/pr99873_2.c FAILs because with a single
> SLP store group merged from two two-lane load groups we cancel
> the SLP and want to use load/store-lanes.  Shall I leave this
> FAILing or XFAIL it?

Yeah, agree it's probably worth leaving it FAILing for now, since it
is something we should try to fix for GCC 15.

Thanks,
Richard

>
> Thanks,
> Richard.
>
>> Thanks,
>> Richard.
>> 
>>  * gcc.dg/vect/pr97428.c: Expect a single store SLP group.
>>  * gcc.dg/vect/slp-11c.c: Likewise.
>>  * gcc.dg/vect/vect-complex-5.c: Likewise.
>>  * gcc.dg/vect/slp-12a.c: Do not expect SLP.
>>  * gcc.dg/vect/slp-21.c: Likewise.
>>  * gcc.dg/vect/slp-cond-1.c: Expect one more SLP.
>>  * gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
>>  * gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
>> ---
>>  gcc/testsuite/gcc.dg/vect/pr97428.c  |  2 +-
>>  gcc/testsuite/gcc.dg/vect/slp-11c.c  |  5 +++--
>>  gcc/testsuite/gcc.dg/vect/slp-12a.c  |  6 +-
>>  gcc/testsuite/gcc.dg/vect/slp-21.c   | 19 +--
>>  gcc/testsuite/gcc.dg/vect/slp-cond-1.c   |  2 +-
>>  gcc/testsuite/gcc.dg/vect/vect-complex-5.c   |  2 +-
>>  gcc/testsuite/gcc.dg/vect/vect-gather-2.c|  1 -
>>  gcc/testsuite/gcc.target/i386/pr52252-atom.c |  3 ++-
>>  8 files changed, 18 insertions(+), 22 deletions(-)
>> 
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr97428.c 
>> b/gcc/testsuite/gcc.dg/vect/pr97428.c
>> index 60dd984cfd3..3cc9976c00c 100644
>> --- a/gcc/testsuite/gcc.dg/vect/pr97428.c
>> +++ b/gcc/testsuite/gcc.dg/vect/pr97428.c
>> @@ -44,5 +44,5 @@ void foo_i2(dcmlx4_t dst[], const dcmlx_t src[], int n)
>>  /* { dg-final { scan-tree-dump "Detected interleaving store of size 16" 
>> "vect" } } */
>>  /* We're not able to peel & apply re-aligning to make accesses well-aligned 
>> for !vect_hw_misalign,
>> but we could by peeling the stores for alignment and applying 
>> re-aligning loads.  

Re: [PATCH 3/4]AArch64: add new alternative with early clobber to patterns

2024-05-22 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Wednesday, May 22, 2024 10:48 AM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; ktkac...@gcc.gnu.org
>> Subject: Re: [PATCH 3/4]AArch64: add new alternative with early clobber to
>> patterns
>> 
>> Tamar Christina  writes:
>> > Hi All,
>> >
>> > This patch adds new alternatives to the patterns which are affected.  The 
>> > new
>> > alternatives with the conditional early clobbers are added before the 
>> > normal
>> > ones in order for LRA to prefer them in the event that we have enough free
>> > registers to accommodate them.
>> >
>> > In case register pressure is too high the normal alternatives will be 
>> > preferred
>> > before a reload is considered as we rather have the tie than a spill.
>> >
>> > Tests are in the next patch.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> >* config/aarch64/aarch64-sve.md (and3,
>> >@aarch64_pred__z, *3_cc,
>> >*3_ptest, aarch64_pred__z,
>> >*3_cc, *3_ptest,
>> >aarch64_pred__z, *3_cc,
>> >*3_ptest, @aarch64_pred_cmp,
>> >*cmp_cc, *cmp_ptest,
>> >@aarch64_pred_cmp_wide,
>> >*aarch64_pred_cmp_wide_cc,
>> >*aarch64_pred_cmp_wide_ptest,
>> @aarch64_brk,
>> >*aarch64_brk_cc, *aarch64_brk_ptest,
>> >@aarch64_brk, *aarch64_brkn_cc, *aarch64_brkn_ptest,
>> >*aarch64_brk_cc, *aarch64_brk_ptest,
>> >aarch64_rdffr_z, *aarch64_rdffr_z_ptest, *aarch64_rdffr_ptest,
>> >*aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Add new early clobber
>> >alternative.
>> >* config/aarch64/aarch64-sve2.md
>> >(@aarch64_pred_): Likewise.
>> >
>> > ---
>> > diff --git a/gcc/config/aarch64/aarch64-sve.md 
>> > b/gcc/config/aarch64/aarch64-
>> sve.md
>> > index
>> e3085c0c636f1317409bbf3b5fbaf5342a2df1f6..8fdc1bc3cd43acfcd675a18350c
>> 297428c85fe46 100644
>> > --- a/gcc/config/aarch64/aarch64-sve.md
>> > +++ b/gcc/config/aarch64/aarch64-sve.md
>> > @@ -1161,8 +1161,10 @@ (define_insn "aarch64_rdffr_z"
>> >  (reg:VNx16BI FFRT_REGNUM)
>> >  (match_operand:VNx16BI 1 "register_operand")))]
>> >"TARGET_SVE && TARGET_NON_STREAMING"
>> > -  {@ [ cons: =0, 1   ]
>> > - [ Upa , Upa ] rdffr\t%0.b, %1/z
>> > +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
>> > + [ , Upa; yes ] rdffr\t%0.b, %1/z
>> > + [ ?Upa, Upa; yes ] ^
>> > + [ Upa , Upa; *   ] ^
>> >}
>> >  )
>> 
>> Sorry for not explaining it very well, but in the previous review I 
>> suggested:
>> 
>> > The gather-like approach would be something like:
>> >
>> >  [  , Upl , w , ; yes ]
>> cmp\t%0., %1/z, %3., #%4
>> >  [ ?Upl , 0   , w , ; yes ] ^
>> >  [ Upa  , Upl , w , ; no  ] ^
>> >  [  , Upl , w , w; yes ] 
>> > cmp\t%0., %1/z,
>> %3., %4.
>> >  [ ?Upl , 0   , w , w; yes ] ^
>> >  [ Upa  , Upl , w , w; no  ] ^
>> >
>> > with:
>> >
>> >   (define_attr "pred_clobber" "any,no,yes" (const_string "any"))
>> 
>> (with emphasis on the last line).  What I didn't say explicitly is
>> that "no" should require !TARGET_SVE_PRED_CLOBBER.
>> 
>> The premise of that review was that we shouldn't enable things like:
>> 
>>  [ Upa  , Upl , w , w; no  ] ^
>> 
>> for TARGET_SVE_PRED_CLOBBER since it contradicts the earlyclobber
>> alternative.  So we should enable either the pred_clobber=yes
>> alternatives or the pred_clobber=no alternatives, but not both.
>> 
>> The default "any" is then for other non-predicate instructions that
>> don't care about TARGET_SVE_PRED_CLOBBER either way.
>> 
>> In contrast, this patch makes pred_clobber=yes enable the alternatives
>

Re: [PATCH v1 2/6] Extract ix86 dllimport implementation to mingw

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> This patch extracts the ix86 implementation for expanding a SYMBOL
> into its corresponding dllimport, far-address, or refptr symbol.
> It will be reused in the aarch64-w64-mingw32 target.
> The implementation is copied as is from i386/i386.cc with
> minor changes to follow the code style.
>
> Also this patch replaces the original DLL import/export
> implementation in ix86 with mingw.
>
> gcc/ChangeLog:
>
>   * config.gcc: Add winnt-dll.o, which contains the DLL
>   import/export implementation.
>   * config/i386/cygming.h (SUB_TARGET_RECORD_STUB): Remove the
>   old implementation. Rename the required function to MinGW.
>   Rename it to a conditional function that will reuse the
>   MinGW implementation for COFF and nothing otherwise.
>   * config/i386/i386-expand.cc (ix86_expand_move): Likewise.
>   * config/i386/i386-expand.h (is_imported_p): Likewise.
>   (mingw_GOT_alias_set): Likewise.
>   (ix86_legitimize_pe_coff_symbol): Likewise.
>   * config/i386/i386-protos.h: Likewise.
>   * config/i386/i386.cc (is_imported_p): Likewise.
>   (ix86_legitimize_pe_coff_symbol): Likewise.
>   (ix86_GOT_alias_set): Likewise.
>   (legitimize_pic_address): Likewise.
>   (struct dllimport_hasher):
>   (GTY): Likewise.
>   (get_dllimport_decl): Likewise.
>   (legitimize_pe_coff_extern_decl): Likewise.
>   (legitimize_dllimport_symbol): Likewise.
>   (legitimize_pe_coff_symbol): Likewise.
>   (ix86_legitimize_address): Likewise.
>   * config/mingw/winnt.h (mingw_pe_record_stub): Likewise.
>   * config/mingw/winnt.cc (i386_pe_record_stub): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   * config/mingw/t-cygming: Add the winnt-dll.o compilation.
>   * config/mingw/winnt-dll.cc: New file.

This looks good to me apart from a couple of very minor comments below,
but please get approval from the x86 maintainers as well.  In particular,
they might prefer to handle ix86_legitimize_pe_coff_symbol in some other way.

> [...]
> diff --git a/gcc/config/mingw/winnt-dll.cc b/gcc/config/mingw/winnt-dll.cc
> new file mode 100644
> index 000..349ade6f5c0
> --- /dev/null
> +++ b/gcc/config/mingw/winnt-dll.cc
> @@ -0,0 +1,229 @@
> +/* Expand a SYMBOL into its corresponding dllimport, far-address,
> +or refptr symbol.
> +Copyright (C) 2024 Free Software Foundation, Inc.

I suppose this should retain the range from the i386 file that the
code is moving from:

   Copyright (C) 1988-2024 Free Software Foundation, Inc.

> [...]
> diff --git a/gcc/config/mingw/winnt-dll.h b/gcc/config/mingw/winnt-dll.h
> new file mode 100644
> index 000..19c16e747a2
> --- /dev/null
> +++ b/gcc/config/mingw/winnt-dll.h
> @@ -0,0 +1,26 @@
> +/* Expand a SYMBOL into its corresponding dllimport, far-address,
> +or refptr symbol.
> +Copyright (C) 2024 Free Software Foundation, Inc.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +http://www.gnu.org/licenses/.  */
> +
> +#ifndef GCC_MINGW_WINNT_DLL_H
> +#define GCC_MINGW_WINNT_DLL_H
> +
> +extern bool is_imported_p (rtx x);
> +extern alias_set_type mingw_GOT_alias_set (void);
> +extern rtx legitimize_pe_coff_symbol (rtx addr, bool inreg);
> +
> +#endif
> \ No newline at end of file

Would be good to add the newline.

Thanks,
Richard


Re: [PATCH 3/4]AArch64: add new alternative with early clobber to patterns

2024-05-22 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This patch adds new alternatives to the patterns which are affected.  The new
> alternatives with the conditional early clobbers are added before the normal
> ones in order for LRA to prefer them in the event that we have enough free
> registers to accommodate them.
>
> In case register pressure is too high the normal alternatives will be 
> preferred
> before a reload is considered as we rather have the tie than a spill.
>
> Tests are in the next patch.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-sve.md (and3,
>   @aarch64_pred__z, *3_cc,
>   *3_ptest, aarch64_pred__z,
>   *3_cc, *3_ptest,
>   aarch64_pred__z, *3_cc,
>   *3_ptest, @aarch64_pred_cmp,
>   *cmp_cc, *cmp_ptest,
>   @aarch64_pred_cmp_wide,
>   *aarch64_pred_cmp_wide_cc,
>   *aarch64_pred_cmp_wide_ptest, @aarch64_brk,
>   *aarch64_brk_cc, *aarch64_brk_ptest,
>   @aarch64_brk, *aarch64_brkn_cc, *aarch64_brkn_ptest,
>   *aarch64_brk_cc, *aarch64_brk_ptest,
>   aarch64_rdffr_z, *aarch64_rdffr_z_ptest, *aarch64_rdffr_ptest,
>   *aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Add new early clobber
>   alternative.
>   * config/aarch64/aarch64-sve2.md
>   (@aarch64_pred_): Likewise.
>
> ---
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index 
> e3085c0c636f1317409bbf3b5fbaf5342a2df1f6..8fdc1bc3cd43acfcd675a18350c297428c85fe46
>  100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -1161,8 +1161,10 @@ (define_insn "aarch64_rdffr_z"
> (reg:VNx16BI FFRT_REGNUM)
> (match_operand:VNx16BI 1 "register_operand")))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  {@ [ cons: =0, 1   ]
> - [ Upa , Upa ] rdffr\t%0.b, %1/z
> +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
> + [ , Upa; yes ] rdffr\t%0.b, %1/z
> + [ ?Upa, Upa; yes ] ^
> + [ Upa , Upa; *   ] ^
>}
>  )

Sorry for not explaining it very well, but in the previous review I suggested:

> The gather-like approach would be something like:
>
>  [  , Upl , w , ; yes ] 
> cmp\t%0., %1/z, %3., #%4
>  [ ?Upl , 0   , w , ; yes ] ^
>  [ Upa  , Upl , w , ; no  ] ^
>  [  , Upl , w , w; yes ] 
> cmp\t%0., %1/z, %3., %4.
>  [ ?Upl , 0   , w , w; yes ] ^
>  [ Upa  , Upl , w , w; no  ] ^
>
> with:
>
>   (define_attr "pred_clobber" "any,no,yes" (const_string "any"))

(with emphasis on the last line).  What I didn't say explicitly is
that "no" should require !TARGET_SVE_PRED_CLOBBER.

The premise of that review was that we shouldn't enable things like:

 [ Upa  , Upl , w , w; no  ] ^

for TARGET_SVE_PRED_CLOBBER since it contradicts the earlyclobber
alternative.  So we should enable either the pred_clobber=yes
alternatives or the pred_clobber=no alternatives, but not both.

The default "any" is then for other non-predicate instructions that
don't care about TARGET_SVE_PRED_CLOBBER either way.

In contrast, this patch makes pred_clobber=yes enable the alternatives
that correctly describe the restriction (good!) but then also enables
the normal alternatives too, which IMO makes the semantics unclear.

Thanks,
Richard

>  
> @@ -1179,8 +1181,10 @@ (define_insn "*aarch64_rdffr_z_ptest"
> UNSPEC_PTEST))
> (clobber (match_scratch:VNx16BI 0))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  {@ [ cons: =0, 1   ]
> - [ Upa , Upa ] rdffrs\t%0.b, %1/z
> +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
> + [ , Upa; yes ] rdffrs\t%0.b, %1/z
> + [ ?Upa, Upa; yes ] ^
> + [ Upa , Upa; *   ] ^
>}
>  )
>  
> @@ -1195,8 +1199,10 @@ (define_insn "*aarch64_rdffr_ptest"
> UNSPEC_PTEST))
> (clobber (match_scratch:VNx16BI 0))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  {@ [ cons: =0, 1   ]
> - [ Upa , Upa ] rdffrs\t%0.b, %1/z
> +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
> + [ , Upa; yes ] rdffrs\t%0.b, %1/z
> + [ ?Upa, Upa; yes ] ^
> + [ Upa , Upa; *   ] ^
>}
>  )
>  
> @@ -1216,8 +1222,10 @@ (define_insn "*aarch64_rdffr_z_cc"
> (reg:VNx16BI FFRT_REGNUM)
> (match_dup 1)))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  {@ [ cons: =0, 1   ]
> - [ Upa , Upa ] rdffrs\t%0.b, %1/z
> +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
> + [ , Upa; yes ] rdffrs\t%0.b, %1/z
> + [ ?Upa, Upa; yes ] ^
> + [ Upa , Upa; *   ] ^
>}
>  )
>  
> @@ -1233,8 +1241,10 @@ 

Re: [PATCH] Fix mixed input kind permute optimization

2024-05-22 Thread Richard Sandiford
Richard Sandiford  writes:
> Richard Biener  writes:
>> When change_vec_perm_layout runs into a permute combining two
>> nodes where one is invariant and one internal the partition of
>> one input can be -1 but the other might not be.  The following
>> supports this case by simply ignoring inputs with input partition -1.
>>
>> I'm not sure this is correct but it avoids ICEing when accessing
>> that partitions layout for gcc.target/i386/pr98928.c with the
>> change to avoid splitting store dataref groups during SLP discovery.
>>
>> Bootstrap and regtest running on x86_64-unknown-linux-gnu (ontop of
>> the SLP series).  The change can't break anything that's already
>> broken but I'm not sure this does the right thing - the testcase
>> has an uniform constant.  I'll try to come up with a better runtime
>> testcase tomorrow.  Hints as to where to correctly fix such case
>> appreciated.
>
> Famous last words, but yeah, it looks correct to me.  I think the
> routine in principle should have a free choice of which layout to
> choose for invariants (as long as it's consistent for all queries
> about the same node).  So it should just be a question of whether
> keeping the original layout is more likely to give a valid
> permutation, or whether going with out_layout_i would be better.
> I don't have a strong intuition either way.

BTW, I should have said that using a different layout from 0
would require compensating code in the materialize function.
So this is definitely the simplest and most direct fix.

Thanks,
Richard


Re: [PATCH 3/4] Avoid splitting store dataref groups during SLP discovery

2024-05-21 Thread Richard Sandiford
Richard Biener  writes:
> The following avoids splitting store dataref groups during SLP
> discovery but instead forces (eventually single-lane) consecutive
> lane SLP discovery for all lanes of the group, creating VEC_PERM
> SLP nodes merging them so the store will always cover the whole group.
>
> With this for example
>
> int x[1024], y[1024], z[1024], w[1024];
> void foo (void)
> {
>   for (int i = 0; i < 256; i++)
> {
>   x[4*i+0] = y[2*i+0];
>   x[4*i+1] = y[2*i+1];
>   x[4*i+2] = z[i];
>   x[4*i+3] = w[i];
> }
> }
>
> which was previously using hybrid SLP can now be fully SLPed and

Nice!

> SSE code generated looks better (but of course you never know,
> I didn't actually benchmark).  We of course need a VF of four here.
>
> .L2:
> movdqa  z(%rax), %xmm0
> movdqa  w(%rax), %xmm4
> movdqa  y(%rax,%rax), %xmm2
> movdqa  y+16(%rax,%rax), %xmm1
> movdqa  %xmm0, %xmm3
> punpckhdq   %xmm4, %xmm0
> punpckldq   %xmm4, %xmm3
> movdqa  %xmm2, %xmm4
> shufps  $238, %xmm3, %xmm2
> movaps  %xmm2, x+16(,%rax,4)
> movdqa  %xmm1, %xmm2
> shufps  $68, %xmm3, %xmm4
> shufps  $68, %xmm0, %xmm2
> movaps  %xmm4, x(,%rax,4)
> shufps  $238, %xmm0, %xmm1
> movaps  %xmm2, x+32(,%rax,4)
> movaps  %xmm1, x+48(,%rax,4)
> addq$16, %rax
> cmpq$1024, %rax
> jne .L2
>
> The extra permute nodes merging distinct branches of the SLP
> tree might be unexpected for some code, esp. since
> SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we
> cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS
> consistently as we can have a mix of both.
>
> The patch keeps the sub-trees form consecutive lanes but that's
> in principle not necessary if we for example have an even/odd
> split which now would result in N single-lane sub-trees.  That's
> left for future improvements.
>
> The interesting part is how VLA vector ISAs handle merging of
> two vectors that's not trivial even/odd merging.  The strategy
> of how to build the permute tree might need adjustments for that
> (in the end splitting each branch to single lanes and then doing
> even/odd merging would be the brute-force fallback).  Not sure
> how much we can or should rely on the SLP optimize pass to handle
> this.

Yeah, I think we'll have to play it by ear.  It might involve tweaking
the order in which we "reduce" the VEC_PERM_EXPRs.  E.g. in the above
example, my guess is that it would be better to reduce the z/w part
first and then permute that with y, whereas it looks like the patch
always goes left-to-right.

The patch LGTM FWIW.

I suppose this does further hard-code the assumption that the vector
type is uniquely determined by the element type (and so we can safely
assume that everything has the same vector type as the first split node).
But that's pretty much pervasive, and not easy to solve until we're
serious about putting some infrastructre in place for it.  It just
caught me out when reading vector code for the first time in a while :)

(E.g. in the above example, the y vector could eventually be double the
z & w vectors.)

Thanks,
Richard

>   * tree-vect-slp.cc (vect_build_slp_instance): Do not split
>   store dataref groups on loop SLP discovery failure but create
>   a single SLP instance for the stores but branch to SLP sub-trees
>   and merge with a series of VEC_PERM nodes.
> ---
>  gcc/tree-vect-slp.cc | 240 ++-
>  1 file changed, 214 insertions(+), 26 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 43f2c153bf0..873748b0a72 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -3468,12 +3468,7 @@ vect_build_slp_instance (vec_info *vinfo,
> return true;
>   }
>  }
> -  else
> -{
> -  /* Failed to SLP.  */
> -  /* Free the allocated memory.  */
> -  scalar_stmts.release ();
> -}
> +  /* Failed to SLP.  */
>  
>stmt_vec_info stmt_info = stmt_info_;
>/* Try to break the group up into pieces.  */
> @@ -3491,6 +3486,9 @@ vect_build_slp_instance (vec_info *vinfo,
>if (is_a  (vinfo)
> && (i > 1 && i < group_size))
>   {
> +   /* Free the allocated memory.  */
> +   scalar_stmts.release ();
> +
> tree scalar_type
>   = TREE_TYPE (DR_REF (STMT_VINFO_DATA_REF (stmt_info)));
> tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type,
> @@ -3535,38 +3533,228 @@ vect_build_slp_instance (vec_info *vinfo,
>   }
>   }
>  
> -  /* For loop vectorization split into arbitrary pieces of size > 1.  */
> -  if (is_a  (vinfo)
> -   && (i > 1 && i < group_size)
> -   && !vect_slp_prefer_store_lanes_p (vinfo, stmt_info, group_size, i))
> +  /* For loop vectorization split the RHS into arbitrary pieces of
> +  size >= 1.  */
> +  else 

Re: [PATCH] Fix mixed input kind permute optimization

2024-05-21 Thread Richard Sandiford
Richard Biener  writes:
> When change_vec_perm_layout runs into a permute combining two
> nodes where one is invariant and one internal the partition of
> one input can be -1 but the other might not be.  The following
> supports this case by simply ignoring inputs with input partition -1.
>
> I'm not sure this is correct but it avoids ICEing when accessing
> that partitions layout for gcc.target/i386/pr98928.c with the
> change to avoid splitting store dataref groups during SLP discovery.
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu (ontop of
> the SLP series).  The change can't break anything that's already
> broken but I'm not sure this does the right thing - the testcase
> has an uniform constant.  I'll try to come up with a better runtime
> testcase tomorrow.  Hints as to where to correctly fix such case
> appreciated.

Famous last words, but yeah, it looks correct to me.  I think the
routine in principle should have a free choice of which layout to
choose for invariants (as long as it's consistent for all queries
about the same node).  So it should just be a question of whether
keeping the original layout is more likely to give a valid
permutation, or whether going with out_layout_i would be better.
I don't have a strong intuition either way.

Thanks,
Richard

>
>   * tree-vect-slp.cc (change_vec_perm_layout): Ignore an
>   input partition of -1.
> ---
>  gcc/tree-vect-slp.cc | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 873748b0a72..f6ec1a81c96 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -4828,6 +4828,8 @@ change_vec_perm_layout (slp_tree node, 
> lane_permutation_t ,
>   {
> slp_tree in_node = SLP_TREE_CHILDREN (node)[entry.first];
> unsigned int in_partition_i = m_vertices[in_node->vertex].partition;
> +   if (in_partition_i == -1u)
> + continue;
> this_in_layout_i = m_partitions[in_partition_i].layout;
>   }
>if (this_in_layout_i > 0)


Re: [PATCH v3] aarch64: Fix normal returns inside functions which use eh_returns [PR114843]

2024-05-21 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Hi Andrew,
>
> A few comments on the implementation, I think it can be simplified a lot:

FWIW, I agree with Wilco's comments, except:

>> +++ b/gcc/config/aarch64/aarch64.h
>> @@ -700,8 +700,9 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = 
>> AARCH64_FL_SM_OFF;
>> #define DWARF2_UNWIND_INFO 1
>>  
>>  /* Use R0 through R3 to pass exception handling information.  */
>> +#define EH_RETURN_DATA_REGISTERS_N 4
>>  #define EH_RETURN_DATA_REGNO(N) \
>> -  ((N) < 4 ? ((unsigned int) R0_REGNUM + (N)) : INVALID_REGNUM)
>> +  ((N) < EH_RETURN_DATA_REGISTERS_N ? ((unsigned int) R0_REGNUM + (N)) : 
>> INVALID_REGNUM)
>  
> It would be useful to add a macro IS_EH_RETURN_REGNUM(regnum) that just checks
> the range R0_REGNUM to R0_REGNUM + EH_RETURN_DATA_REGISTERS_N.

I've just pushed a patch that adds a global eh_return_data_regs set,
so I think we can test that instead.
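
For illustration, the test against that cached set could be wrapped in a
small helper along these lines (purely a sketch; the helper name is made
up here and is not part of any posted patch):

  /* Hypothetical helper: true iff REGNO is one of the EH return data
     registers, using the set cached by r15-752.  */
  static inline bool
  aarch64_eh_return_regno_p (unsigned int regno)
  {
    return TEST_HARD_REG_BIT (eh_return_data_regs, regno);
  }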

>> @@ -929,6 +928,7 @@ struct GTY (()) aarch64_frame
>>  outgoing arguments) of each register save slot, or -2 if no save is
>>  needed.  */
>>   poly_int64 reg_offset[LAST_SAVED_REGNUM + 1];
>> +  bool eh_return_allocated[EH_RETURN_DATA_REGISTERS_N];
>
> This doesn't make much sense - besides X0-X3, we also need X5 and X6 for 
> eh_return.
> If these or any of the other temporaries used by epilog are callee-saved 
> somehow,
> things are going horribly wrong already... So what do we gain by doing this?
>
>
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -7792,6 +7792,7 @@ aarch64_layout_frame (void)
>> 
>>  #define SLOT_NOT_REQUIRED (-2)
>>  #define SLOT_REQUIRED (-1)
>> +#define SLOT_EH_RETURN_REQUIRED (-3)
>  
> I don't see a need for this.
>
>
>> @@ -7949,6 +7950,18 @@ aarch64_layout_frame (void)
>> stopping it from being individually shrink-wrapped.  */
>>  allocate_gpr_slot (R30_REGNUM);
>>  
>> +  /* Allocate the eh_return first. */
>> +  if (crtl->calls_eh_return)
>> +for (regno = 0; EH_RETURN_DATA_REGNO (regno) != INVALID_REGNUM; regno++)
>> +  {
>> +int realregno = EH_RETURN_DATA_REGNO (regno);
>> +if (known_eq (frame.reg_offset[realregno], SLOT_EH_RETURN_REQUIRED))
>> +  {
>> +frame.eh_return_allocated[regno] = true;
>> +allocate_gpr_slot (realregno);
>> +  }
>> +  }
>
> This change is unnecessary if we just mark the slots with SLOT_REQUIRED.

Also, is it necessary to allocate EH data registers first?

>> @@ -8035,6 +8048,23 @@ aarch64_layout_frame (void)
>>   frame.wb_pop_candidate1 = frame.wb_push_candidate1;
>>   frame.wb_pop_candidate2 = frame.wb_push_candidate2;
>>  
>> +  /* EH data registers are not pop canidates. */
>> +  if (crtl->calls_eh_return)
>> +for (regno = 0; EH_RETURN_DATA_REGNO (regno) != INVALID_REGNUM; 
>> regno++)> 
>> +  {
>> +if (frame.eh_return_allocated[regno]
>> +&& frame.wb_pop_candidate1 == EH_RETURN_DATA_REGNO (regno))
>> +{
>> +  frame.wb_pop_candidate1 = frame.wb_pop_candidate2;
>> +  frame.wb_pop_candidate2 = INVALID_REGNUM;
>> +}
>> +if (frame.eh_return_allocated[regno]
>> +&& frame.wb_pop_candidate2 == EH_RETURN_DATA_REGNO (regno))
>> +{
>> +  frame.wb_pop_candidate2 = INVALID_REGNUM;
>> +}
>> +  }
>
> This is unnecessary since we can just avoid making them push candidates
> if there is no frame chain, eg:
>
> if ((!crtl->calls_eh_return || frame.emit_frame_chain) && !push_regs.empty ()
>   && known_eq (frame.reg_offset[push_regs[0]], frame.bytes_below_hard_fp))

I agree we should do the check here (and similarly for the second register),
rather than fixing it up later.  But IMO we should test the register directly:

  if (!push_regs.empty ()
  && known_eq (frame.reg_offset[push_regs[0]], frame.bytes_below_hard_fp)
  && (!crtl->calls_eh_return
  || !TEST_HARD_REG_BIT (eh_return_data_regs, push_regs[0])))

In some ways it seems unfortunate that we're generating two different
copies of the epilogue in order to skip two LDPs that (with a bit of
work) could easily be done before entering a combined epilogue.
But we already have a branch on EH_RETURN_TAKEN_RTX as well,
so maybe this is the tipping point at which duplication is worthwhile.

Thanks,
Richard

> @@ -8681,6 +8712,20 @@ aarch64_restore_callee_saves (poly_int64 
> bytes_below_sp,
>if (frame.is_scs_enabled && regno == LR_REGNUM)
>   return true;
>  
> +  /* Skip the eh return data registers if we are
> +  returning normally rather than via eh_return. */
> +  if (!was_eh_return && crtl->calls_eh_return)
> + {
> +   for (unsigned ehregno = 0;
> +EH_RETURN_DATA_REGNO (ehregno) != INVALID_REGNUM;
> +ehregno++)
> + {
> +   if (EH_RETURN_DATA_REGNO (ehregno) == regno
> +   && frame.eh_return_allocated[ehregno])
> + return true;
> + }
> + }
> +
>
> So this could be something like:
>
>   if (!was_eh_return && crtl->calls_eh_return && IS_EH_RETURN_REGNUM 
> (regno))
>   

[gcc r15-752] Cache the set of EH_RETURN_DATA_REGNOs

2024-05-21 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:7f35863ebbf7ba63e2f075edfbec105de272578a

commit r15-752-g7f35863ebbf7ba63e2f075edfbec105de272578a
Author: Richard Sandiford 
Date:   Tue May 21 10:21:16 2024 +0100

Cache the set of EH_RETURN_DATA_REGNOs

While reviewing Andrew's fix for PR114843, it seemed like it would
be convenient to have a HARD_REG_SET of EH_RETURN_DATA_REGNOs.
This patch adds one and uses it to simplify a couple of use sites.

gcc/
* hard-reg-set.h (target_hard_regs::x_eh_return_data_regs): New 
field.
(eh_return_data_regs): New macro.
* reginfo.cc (init_reg_sets_1): Initialize x_eh_return_data_regs.
* df-scan.cc (df_get_exit_block_use_set): Use it.
* ira-lives.cc (process_out_of_region_eh_regs): Likewise.

Diff:
---
 gcc/df-scan.cc |  8 +---
 gcc/hard-reg-set.h |  5 +
 gcc/ira-lives.cc   | 10 ++
 gcc/reginfo.cc | 10 ++
 4 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/gcc/df-scan.cc b/gcc/df-scan.cc
index 1bade2cd71e..c8ab3c09cee 100644
--- a/gcc/df-scan.cc
+++ b/gcc/df-scan.cc
@@ -3702,13 +3702,7 @@ df_get_exit_block_use_set (bitmap exit_block_uses)
 
   /* Mark the registers that will contain data for the handler.  */
   if (reload_completed && crtl->calls_eh_return)
-for (i = 0; ; ++i)
-  {
-   unsigned regno = EH_RETURN_DATA_REGNO (i);
-   if (regno == INVALID_REGNUM)
- break;
-   bitmap_set_bit (exit_block_uses, regno);
-  }
+IOR_REG_SET_HRS (exit_block_uses, eh_return_data_regs);
 
 #ifdef EH_RETURN_STACKADJ_RTX
   if ((!targetm.have_epilogue () || ! epilogue_completed)
diff --git a/gcc/hard-reg-set.h b/gcc/hard-reg-set.h
index 8c1d1512ca2..340eb425c10 100644
--- a/gcc/hard-reg-set.h
+++ b/gcc/hard-reg-set.h
@@ -421,6 +421,9 @@ struct target_hard_regs {
  with the local stack frame are safe, but scant others.  */
   HARD_REG_SET x_regs_invalidated_by_call;
 
+  /* The set of registers that are used by EH_RETURN_DATA_REGNO.  */
+  HARD_REG_SET x_eh_return_data_regs;
+
   /* Table of register numbers in the order in which to try to use them.  */
   int x_reg_alloc_order[FIRST_PSEUDO_REGISTER];
 
@@ -485,6 +488,8 @@ extern struct target_hard_regs *this_target_hard_regs;
 #define call_used_or_fixed_regs \
   (regs_invalidated_by_call | fixed_reg_set)
 #endif
+#define eh_return_data_regs \
+  (this_target_hard_regs->x_eh_return_data_regs)
 #define reg_alloc_order \
   (this_target_hard_regs->x_reg_alloc_order)
 #define inv_reg_alloc_order \
diff --git a/gcc/ira-lives.cc b/gcc/ira-lives.cc
index e07d3dc3e89..958eabb9708 100644
--- a/gcc/ira-lives.cc
+++ b/gcc/ira-lives.cc
@@ -1260,14 +1260,8 @@ process_out_of_region_eh_regs (basic_block bb)
   for (int n = ALLOCNO_NUM_OBJECTS (a) - 1; n >= 0; n--)
{
  ira_object_t obj = ALLOCNO_OBJECT (a, n);
- for (int k = 0; ; k++)
-   {
- unsigned int regno = EH_RETURN_DATA_REGNO (k);
- if (regno == INVALID_REGNUM)
-   break;
- SET_HARD_REG_BIT (OBJECT_CONFLICT_HARD_REGS (obj), regno);
- SET_HARD_REG_BIT (OBJECT_TOTAL_CONFLICT_HARD_REGS (obj), regno);
-   }
+ OBJECT_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
+ OBJECT_TOTAL_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
}
 }
 }
diff --git a/gcc/reginfo.cc b/gcc/reginfo.cc
index a0baeb90e12..73121365c47 100644
--- a/gcc/reginfo.cc
+++ b/gcc/reginfo.cc
@@ -420,6 +420,16 @@ init_reg_sets_1 (void)
}
 }
 
+  /* Recalculate eh_return_data_regs.  */
+  CLEAR_HARD_REG_SET (eh_return_data_regs);
+  for (i = 0; ; ++i)
+{
+  unsigned int regno = EH_RETURN_DATA_REGNO (i);
+  if (regno == INVALID_REGNUM)
+   break;
+  SET_HARD_REG_BIT (eh_return_data_regs, regno);
+}
+
   memset (have_regs_of_mode, 0, sizeof (have_regs_of_mode));
   memset (contains_reg_of_mode, 0, sizeof (contains_reg_of_mode));
   for (m = 0; m < (unsigned int) MAX_MACHINE_MODE; m++)


[PATCH] Cache the set of EH_RETURN_DATA_REGNOs

2024-05-21 Thread Richard Sandiford
While reviewing Andrew's fix for PR114843, it seemed like it would
be convenient to have a HARD_REG_SET of EH_RETURN_DATA_REGNOs.
This patch adds one and uses it to simplify a couple of use sites.

Tested on aarch64-linux-gnu & x86_64-linux-gnu.  OK to install?

Richard


gcc/
* hard-reg-set.h (target_hard_regs::x_eh_return_data_regs): New field.
(eh_return_data_regs): New macro.
* reginfo.cc (init_reg_sets_1): Initialize x_eh_return_data_regs.
* df-scan.cc (df_get_exit_block_use_set): Use it.
* ira-lives.cc (process_out_of_region_eh_regs): Likewise.
---
 gcc/df-scan.cc |  8 +---
 gcc/hard-reg-set.h |  5 +
 gcc/ira-lives.cc   | 10 ++
 gcc/reginfo.cc | 10 ++
 4 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/gcc/df-scan.cc b/gcc/df-scan.cc
index 1bade2cd71e..c8ab3c09cee 100644
--- a/gcc/df-scan.cc
+++ b/gcc/df-scan.cc
@@ -3702,13 +3702,7 @@ df_get_exit_block_use_set (bitmap exit_block_uses)
 
   /* Mark the registers that will contain data for the handler.  */
   if (reload_completed && crtl->calls_eh_return)
-for (i = 0; ; ++i)
-  {
-   unsigned regno = EH_RETURN_DATA_REGNO (i);
-   if (regno == INVALID_REGNUM)
- break;
-   bitmap_set_bit (exit_block_uses, regno);
-  }
+IOR_REG_SET_HRS (exit_block_uses, eh_return_data_regs);
 
 #ifdef EH_RETURN_STACKADJ_RTX
   if ((!targetm.have_epilogue () || ! epilogue_completed)
diff --git a/gcc/hard-reg-set.h b/gcc/hard-reg-set.h
index 8c1d1512ca2..340eb425c10 100644
--- a/gcc/hard-reg-set.h
+++ b/gcc/hard-reg-set.h
@@ -421,6 +421,9 @@ struct target_hard_regs {
  with the local stack frame are safe, but scant others.  */
   HARD_REG_SET x_regs_invalidated_by_call;
 
+  /* The set of registers that are used by EH_RETURN_DATA_REGNO.  */
+  HARD_REG_SET x_eh_return_data_regs;
+
   /* Table of register numbers in the order in which to try to use them.  */
   int x_reg_alloc_order[FIRST_PSEUDO_REGISTER];
 
@@ -485,6 +488,8 @@ extern struct target_hard_regs *this_target_hard_regs;
 #define call_used_or_fixed_regs \
   (regs_invalidated_by_call | fixed_reg_set)
 #endif
+#define eh_return_data_regs \
+  (this_target_hard_regs->x_eh_return_data_regs)
 #define reg_alloc_order \
   (this_target_hard_regs->x_reg_alloc_order)
 #define inv_reg_alloc_order \
diff --git a/gcc/ira-lives.cc b/gcc/ira-lives.cc
index e07d3dc3e89..958eabb9708 100644
--- a/gcc/ira-lives.cc
+++ b/gcc/ira-lives.cc
@@ -1260,14 +1260,8 @@ process_out_of_region_eh_regs (basic_block bb)
   for (int n = ALLOCNO_NUM_OBJECTS (a) - 1; n >= 0; n--)
{
  ira_object_t obj = ALLOCNO_OBJECT (a, n);
- for (int k = 0; ; k++)
-   {
- unsigned int regno = EH_RETURN_DATA_REGNO (k);
- if (regno == INVALID_REGNUM)
-   break;
- SET_HARD_REG_BIT (OBJECT_CONFLICT_HARD_REGS (obj), regno);
- SET_HARD_REG_BIT (OBJECT_TOTAL_CONFLICT_HARD_REGS (obj), regno);
-   }
+ OBJECT_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
+ OBJECT_TOTAL_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
}
 }
 }
diff --git a/gcc/reginfo.cc b/gcc/reginfo.cc
index a0baeb90e12..73121365c47 100644
--- a/gcc/reginfo.cc
+++ b/gcc/reginfo.cc
@@ -420,6 +420,16 @@ init_reg_sets_1 (void)
}
 }
 
+  /* Recalculate eh_return_data_regs.  */
+  CLEAR_HARD_REG_SET (eh_return_data_regs);
+  for (i = 0; ; ++i)
+{
+  unsigned int regno = EH_RETURN_DATA_REGNO (i);
+  if (regno == INVALID_REGNUM)
+   break;
+  SET_HARD_REG_BIT (eh_return_data_regs, regno);
+}
+
   memset (have_regs_of_mode, 0, sizeof (have_regs_of_mode));
   memset (contains_reg_of_mode, 0, sizeof (contains_reg_of_mode));
   for (m = 0; m < (unsigned int) MAX_MACHINE_MODE; m++)
-- 
2.25.1



Re: [PATCH] middle-end: Expand {u|s}dot product support in autovectorizer

2024-05-20 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, May 17, 2024 at 11:56 AM Tamar Christina
>  wrote:
>>
>> > -Original Message-
>> > From: Richard Biener 
>> > Sent: Friday, May 17, 2024 10:46 AM
>> > To: Tamar Christina 
>> > Cc: Victor Do Nascimento ; gcc-
>> > patc...@gcc.gnu.org; Richard Sandiford ; Richard
>> > Earnshaw ; Victor Do Nascimento
>> > 
>> > Subject: Re: [PATCH] middle-end: Expand {u|s}dot product support in
>> > autovectorizer
>> >
>> > On Fri, May 17, 2024 at 11:05 AM Tamar Christina
>> >  wrote:
>> > >
>> > > > -Original Message-
>> > > > From: Richard Biener 
>> > > > Sent: Friday, May 17, 2024 6:51 AM
>> > > > To: Victor Do Nascimento 
>> > > > Cc: gcc-patches@gcc.gnu.org; Richard Sandiford
>> > ;
>> > > > Richard Earnshaw ; Victor Do Nascimento
>> > > > 
>> > > > Subject: Re: [PATCH] middle-end: Expand {u|s}dot product support in
>> > > > autovectorizer
>> > > >
>> > > > On Thu, May 16, 2024 at 4:40 PM Victor Do Nascimento
>> > > >  wrote:
>> > > > >
>> > > > > From: Victor Do Nascimento 
>> > > > >
>> > > > > At present, the compiler offers the `{u|s|us}dot_prod_optab' direct
>> > > > > optabs for dealing with vectorizable dot product code sequences.  The
>> > > > > consequence of using a direct optab for this is that backend-pattern
>> > > > > selection is only ever able to match against one datatype - Either
>> > > > > that of the operands or of the accumulated value, never both.
>> > > > >
>> > > > > With the introduction of the 2-way (un)signed dot-product insn [1][2]
>> > > > > in AArch64 SVE2, the existing direct opcode approach is no longer
>> > > > > sufficient for full specification of all the possible dot product
>> > > > > machine instructions to be matched to the code sequence; a dot 
>> > > > > product
>> > > > > resulting in VNx4SI may result from either dot products on VNx16QI or
>> > > > > VNx8HI values for the 4- and 2-way dot product operations, 
>> > > > > respectively.
>> > > > >
>> > > > > This means that the following example fails autovectorization:
>> > > > >
>> > > > > uint32_t foo(int n, uint16_t* data) {
>> > > > >   uint32_t sum = 0;
>> > > > >   for (int i=0; i<n; i++) {
>> > > > > sum += data[i] * data[i];
>> > > > >   }
>> > > > >   return sum;
>> > > > > }
>> > > > >
>> > > > > To remedy the issue a new optab is added, tentatively named
>> > > > > `udot_prod_twoway_optab', whose selection is dependent upon checking
>> > > > > of both input and output types involved in the operation.
>> > > >
>> > > > I don't like this too much.  I'll note we document dot_prod as
>> > > >
>> > > > @cindex @code{sdot_prod@var{m}} instruction pattern
>> > > > @item @samp{sdot_prod@var{m}}
>> > > >
>> > > > Compute the sum of the products of two signed elements.
>> > > > Operand 1 and operand 2 are of the same mode. Their
>> > > > product, which is of a wider mode, is computed and added to operand 3.
>> > > > Operand 3 is of a mode equal or wider than the mode of the product. The
>> > > > result is placed in operand 0, which is of the same mode as operand 3.
>> > > > @var{m} is the mode of operand 1 and operand 2.
>> > > >
>> > > > with no restriction on the wider mode but we don't specify it which is
>> > > > bad design.  This should have been a convert optab with two modes
>> > > > from the start - adding a _twoway variant is just a hack.
>> > >
>> > > We did discuss this at the time we started implementing it.  There was 
>> > > two
>> > > options, one was indeed to change it to a convert dot_prod optab, but 
>> > > doing
>> > > this means we have to update every target that uses it.
>> > >
>> > > Now that means 3 ISAs for AArch64, Arm, Arc, c6x, 2 for x86, loongson and
>> > altivec.
>> > >
>> > > Which sure could be possible, but there's also every use in the backends 
>> > > that
>> > need
>> > > to be updated, and tested, which for some targets we don't even know how 
>> > > to
>> > begin.
>> > >
>> > > So it seems very hard to correct dotprod to a convert optab now.
>> >
>> > It's still the correct way to go.  At _least_ your new pattern should
>> > have been this,
>> > otherwise what do you do when you have two-way, four-way and eight-way
>> > variants?
>> > Add yet another optab?
>>
>> I guess that's fair, but having the new optab only be convert resulted in 
>> messy
>> code as everywhere you must check for both variants.
>>
>> Additionally that optab would then overlap with the existing optabs as, as 
>> you
>> Say, the documentation only says it's of a wider type and doesn't indicate
>> precision.
>>
>> So to avoid issues down the line then If the new optab isn't acceptable then
>> we'll have to do a wholesale conversion then..
>
> Yep.  It shouldn't be difficult though.

Still catching up, but FWIW, I agree this is the way to go.  (Convert all
existing dot_prods to convert optabs first, and then add the new AArch64
ones.)  Having two mechanisms feels like storing up trouble for later. :)
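
To make the practical difference concrete (illustration only, reusing the
existing optab query API rather than anything from the patch): a direct
optab is keyed on a single mode, while a convert optab is keyed on two
modes, which is exactly what is needed to tell a 2-way VNx8HI dot product
apart from a 4-way VNx16QI one.

  /* Direct optab: a single mode keys the lookup, so a VNx4SI result
     cannot say whether the inputs were VNx16QI or VNx8HI.  */
  insn_code icode = optab_handler (sdot_prod_optab, VNx4SImode);

  /* Convert optab (the proposed direction, sketched here only as an
     assumed eventual interface): both modes key the lookup.  */
  icode = convert_optab_handler (sdot_prod_optab, VNx4SImode, VNx8HImode);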

Richard


Re: [PATCH 00/12] aarch64: Extend aarch64_feature_flags to 128 bits

2024-05-20 Thread Richard Sandiford
Andrew Carlotti  writes:
> On Fri, May 17, 2024 at 04:45:05PM +0100, Richard Sandiford wrote:
>> Andrew Carlotti  writes:
>> > The end goal of the series is to change the definition of 
>> > aarch64_feature_flags
>> > from a uint64_t typedef to a class with 128 bits of storage.  This class 
>> > uses
>> > operator overloading to mimic the existing integer interface as much as
>> > possible, but with added restrictions to facilate type checking and
>> > extensibility.
>> >
>> > Patches 01-10 are preliminary enablement work, and have passed regression
>> > testing.  Are these ok for master?
>> >
>> > Patch 11 is an RFC, and the only patch that touches the middle end.  I am
>> > seeking clarity on which part(s) of the compiler should be expected to 
>> > handle
>> > or prevent non-bool types in instruction pattern conditions.  The actual 
>> > patch
>> > does not compile by itself (though it does in combination with 12/12), but 
>> > that
>> > is not important to the questions I'm asking.
>> >
>> > Patch 12 is then a small patch that actually replaces the uint64_t typedef 
>> > with
>> > a class.  I think this patch is fine in it's current form, but it depends 
>> > on a
>> > resolution to the issues in patch 11/12 first.
>> 
>> Thanks for doing this.
>> 
>> Rather than disallowing flags == 0, etc., I think we should allow
>> aarch64_feature_flags to be constructed from a single uint64_t.
>> It's a lossless conversion.  The important thing is that we don't
>> allow conversions the other way (and the patch doesn't allow them).
>
> I agree that allowing conversion from a single int should be safe (albeit it
> was probably helpful to disallow it during the development of this series).
> It does feel a little bit strange to have a separate mechanism for
> setting the first 64 bits (and zeroing the rest).

With a templated class, I think it makes sense.  The constructor would
take a variable number of arguments and any unspecified elements would
implicitly be zero.  In that sense, a single uint64_t isn't a special
case.  It's just an instance of a generic rule.
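
Something like the following is the kind of shape being described here
(only a sketch under assumed names; the real class, its name and its
operator set are whatever the series ends up with):

template<int N>
class flag_set
{
public:
  // Any number of 64-bit chunks; unspecified trailing elements are zero.
  template<typename ...Ts>
  constexpr flag_set (Ts ...values) : m_val { uint64_t (values)... } {}

private:
  uint64_t m_val[N];
};

// A single uint64_t is then just the one-argument instance of the rule:
//   using aarch64_feature_flags = flag_set<2>;
//   constexpr aarch64_feature_flags flags (0x1);  // high half is 0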

> Do you consider the existing code in some places to be clearer than the new
> versions in this patch series?  If so, it would be helpful to know which
> patches (or parts of patches) I should drop.

Probably patches 3, 4, and (for unrelated reasons) 9.  (9 feels like
a microoptimisation, given that the underlying issue has been fixed.)

>> Also, I think we should make the new class in 12/12 be a templated
>>  type that provides an N-bit bitmask.  It should arguably
>> also be target-independent code.  aarch64_feature_flags would then be
>> an alias with the appropriate number of bits.
>
> I think the difficult part is to do this for generic N while still satisfying
> C++11 constexpr function requirements (we can't use a loop, for example).
> However, while writing this response, I've realised that I can do this using
> recursion, with an N-bit bitmask being implemented as a class containing an
> M-bit integer and (recursively) and (N-M)-bit bitmask.

I think it'd be better to keep a flat object, not least for debugging.

Things like operator| could be handled using code like:


template<int N>
struct operators
{
  template<typename Result, typename Operator, typename Arg,
	   typename ...Rest>
  static constexpr Result binary(Operator op, const Arg &x, const Arg &y,
				 Rest ...rest)
  {
    return operators<N - 1>::template binary<Result>
      (op, x, y, op (x[N - 1], y[N - 1]), rest...);
  }
};

template<>
struct operators<0>
{
  template<typename Result, typename Operator, typename Arg,
	   typename ...Rest>
  static constexpr Result binary(Operator op, const Arg &x, const Arg &y,
				 Rest ...rest)
  {
    return Result { rest... };
  }
};

using T = std::array<uint64_t, 2>;

template<typename T>
constexpr T f(T x, T y) { return x | y; }
constexpr T x = { 1, 2 };
constexpr T y = { 0x100, 0x400 };
constexpr T z = operators<2>::binary<T> (f<uint64_t>, x, y);


(Unfortunately, constexpr lambdas are also not supported in C++11.)

>> For the RFC in 11/12, how about, as another prepatch before 12/12,
>> removing all the mechanical:
>> 
>> #define AARCH64_ISA_LS64(aarch64_isa_flags & AARCH64_FL_LS64)
>> 
>> style macros and replacing uses with something like:
>> 
>>   AARCH64_HAVE_ISA (LS64)
>
> This sounds like a good approach, and is roughly what I was already planning 
> to
> do (although I hadn't worked out the details yet).  I think that can entirely
> replace 11/12 in the context of this series, but the questions about
> instruction pattern condition type checking still ought to be addressed
> separately.

Yeah, stronger typing would be good.  I think in practice the generators
should add the "bool (...)" wrapper.
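
As one concrete (but only illustrative) reading of that suggestion, the
generic macro could reduce to a single flag test plus the bool wrapper;
the exact definition would of course be up to the prepatch:

  #define AARCH64_HAVE_ISA(FEATURE) \
    (bool (aarch64_isa_flags & AARCH64_FL_##FEATURE))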

Thanks,
Richard


Re: [Patch, aarch64] Further renaming of generic code

2024-05-20 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Alex/Richard:
>
> Renaming of generic code is done to make target independent
> and target dependent code to support multiple targets.
>
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> Bootstrapped and regtested on aarch64-linux-gnu.
>
> Thanks & Regards
> Ajit
>
> aarch64: Further renaming of generic code
>
> Renaming of generic code is done to make target independent
> and target dependent code to support multiple targets.
>
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> 2024-05-20  Ajit Kumar Agarwal  
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-ldp-fusion.cc: Renaming of generic code

* config/aarch64/aarch64-ldp-fusion.cc: Rename generic parts of code
to avoid "ldp" and "stp".

> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 55 
>  1 file changed, 28 insertions(+), 27 deletions(-)
>
> [...]
> @@ -1126,7 +1127,7 @@ struct stp_change_builder
>  
>bool done () const { return m_state == state::DONE; }
>  
> -  stp_change_builder (insn_info *insns[2],
> +  store_change_builder (insn_info *insns[2],
> insn_info *repurpose,
> insn_info *dest)

Please reindent the parameters for the new longer name.

>  : m_state (state::FIRST), m_insns { insns[0], insns[1] },
> [...]
> @@ -1916,7 +1917,7 @@ fixup_debug_uses (obstack_watermark ,
>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
>  // a singleton range which says where to place the pair.
>  bool
> -ldp_bb_info::fuse_pair (bool load_p,
> +pair_fusion_bb_info::fuse_pair (bool load_p,
>   unsigned access_size,
>   int writeback,
>   insn_info *i1, insn_info *i2,

Same here.

> @@ -2687,7 +2688,7 @@ pair_fusion::get_viable_bases (insn_info *insns[2],
>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
>  // if the accesses are both loads, otherwise they are both stores.
>  bool
> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>   insn_info *i1, insn_info *i2)
>  {
>if (dump_file)

And here.

OK with those changes, thanks.

Richard


Re: [PATCH] AArch64: Improve costing of ctz

2024-05-20 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Improve costing of ctz - both TARGET_CSSC and vector cases were not handled 
> yet.
>
> Passes regress & bootstrap - OK for commit?
>
> gcc:
> * config/aarch64/aarch64.cc (aarch64_rtx_costs): Improve CTZ costing.

Ok, thanks.

Richard

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> fe13c9a0d4863041eb9101882ea57c2094240d16..2a6f76f4008839bf0aa158504430af9b971c
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -14309,10 +14309,24 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int 
> outer ATTRIBUTE_UNUSED,
>return false;
>  
>  case CTZ:
> -  *cost = COSTS_N_INSNS (2);
> -
> -  if (speed)
> - *cost += extra_cost->alu.clz + extra_cost->alu.rev;
> +  if (VECTOR_MODE_P (mode))
> + {
> +   *cost = COSTS_N_INSNS (3);
> +   if (speed)
> + *cost += extra_cost->vect.alu * 3;
> + }
> +  else if (TARGET_CSSC)
> + {
> +   *cost = COSTS_N_INSNS (1);
> +   if (speed)
> + *cost += extra_cost->alu.clz;
> + }
> +  else
> + {
> +   *cost = COSTS_N_INSNS (2);
> +   if (speed)
> + *cost += extra_cost->alu.clz + extra_cost->alu.rev;
> + }
>return false;
>  
>  case COMPARE:


Re: [PATCH] AArch64: Fix printing of 2-instruction alternatives

2024-05-20 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Add missing '\' in 2-instruction movsi/di alternatives so that they are
> printed on separate lines.
>
> Passes bootstrap and regress, OK for commit once stage 1 reopens?
>
> gcc:
> * config/aarch64/aarch64.md (movsi_aarch64): Use '\;' to force
> newline in 2-instruction pattern.
> (movdi_aarch64): Likewise.

Oops, good catch.  Ok for trunk, thanks.

Richard

>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> 1a2e01284249223565cd12cf1bfd5db5475e56fb..5416c2e3b2002d0e53baf23e7c0048ddf683
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1447,7 +1447,7 @@ (define_insn_and_split "*movsi_aarch64"
>   [w  , m  ; load_4   , fp  , 4] ldr\t%s0, %1
>   [m  , r Z; store_4  , *   , 4] str\t%w1, %0
>   [m  , w  ; store_4  , fp  , 4] str\t%s1, %0
> - [r  , Usw; load_4   , *   , 8] adrp\t%x0, %A1;ldr\t%w0, [%x0, %L1]
> + [r  , Usw; load_4   , *   , 8] adrp\t%x0, %A1\;ldr\t%w0, [%x0, %L1]
>   [r  , Usa; adr  , *   , 4] adr\t%x0, %c1
>   [r  , Ush; adr  , *   , 4] adrp\t%x0, %A1
>   [w  , r Z; f_mcr, fp  , 4] fmov\t%s0, %w1
> @@ -1484,7 +1484,7 @@ (define_insn_and_split "*movdi_aarch64"
>   [w, m  ; load_8   , fp  , 4] ldr\t%d0, %1
>   [m, r Z; store_8  , *   , 4] str\t%x1, %0
>   [m, w  ; store_8  , fp  , 4] str\t%d1, %0
> - [r, Usw; load_8   , *   , 8] << TARGET_ILP32 ? "adrp\t%0, %A1;ldr\t%w0, 
> [%0, %L1]" : "adrp\t%0, %A1;ldr\t%0, [%0, %L1]";
> + [r, Usw; load_8   , *   , 8] << TARGET_ILP32 ? "adrp\t%0, 
> %A1\;ldr\t%w0, [%0, %L1]" : "adrp\t%0, %A1\;ldr\t%0, [%0, %L1]";
>   [r, Usa; adr  , *   , 4] adr\t%x0, %c1
>   [r, Ush; adr  , *   , 4] adrp\t%x0, %A1
>   [w, r Z; f_mcr, fp  , 4] fmov\t%d0, %x1


Re: [PATCH] aarch64: Fold vget_low_* intrinsics to BIT_FIELD_REF [PR102171]

2024-05-20 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch folds vget_low_* intrinsics to BIT_FIELD_REF to open up more
> optimization opportunities for gimple optimizers.
>
> While we are here, we also remove the vget_low_* definitions from arm_neon.h 
> and
> use the new intrinsics framework.
>
> PR target/102171
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (AARCH64_SIMD_VGET_LOW_BUILTINS):
>   New macro to create definitions for all vget_low intrinsics.
>   (VGET_LOW_BUILTIN): Likewise.
>   (enum aarch64_builtins): Add vget_low function codes.
>   (aarch64_general_fold_builtin): Fold vget_low calls.
>   * config/aarch64/aarch64-simd-builtins.def: Delete vget_low builtins.
>   * config/aarch64/aarch64-simd.md (aarch64_get_low): Delete.
>   (aarch64_vget_lo_halfv8bf): Likewise.
>   * config/aarch64/arm_neon.h (__attribute__): Delete.
>   (vget_low_f16): Likewise.
>   (vget_low_f32): Likewise.
>   (vget_low_f64): Likewise.
>   (vget_low_p8): Likewise.
>   (vget_low_p16): Likewise.
>   (vget_low_p64): Likewise.
>   (vget_low_s8): Likewise.
>   (vget_low_s16): Likewise.
>   (vget_low_s32): Likewise.
>   (vget_low_s64): Likewise.
>   (vget_low_u8): Likewise.
>   (vget_low_u16): Likewise.
>   (vget_low_u32): Likewise.
>   (vget_low_u64): Likewise.
>   (vget_low_bf16): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/pr113573.c: Replace __builtin_aarch64_get_lowv8hi
>   with vget_low_s16.
>   * gcc.target/aarch64/vget_low_2.c: New test.
>   * gcc.target/aarch64/vget_low_2_be.c: New test.

Ok, thanks.  I suppose the patch has the side effect of allowing
vget_low_bf16 to be called without +bf16.  IMO that's the correct
behaviour though, and is consistent with how we handle reinterprets.

Richard

> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-builtins.cc|  60 ++
>  gcc/config/aarch64/aarch64-simd-builtins.def  |   5 +-
>  gcc/config/aarch64/aarch64-simd.md|  23 +---
>  gcc/config/aarch64/arm_neon.h | 105 --
>  gcc/testsuite/gcc.target/aarch64/pr113573.c   |   2 +-
>  gcc/testsuite/gcc.target/aarch64/vget_low_2.c |  30 +
>  .../gcc.target/aarch64/vget_low_2_be.c|  31 ++
>  7 files changed, 124 insertions(+), 132 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vget_low_2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vget_low_2_be.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 75d21de1401..4afe7c86ae3 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -658,6 +658,23 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>VREINTERPRET_BUILTINS \
>VREINTERPRETQ_BUILTINS
>  
> +#define AARCH64_SIMD_VGET_LOW_BUILTINS \
> +  VGET_LOW_BUILTIN(f16) \
> +  VGET_LOW_BUILTIN(f32) \
> +  VGET_LOW_BUILTIN(f64) \
> +  VGET_LOW_BUILTIN(p8) \
> +  VGET_LOW_BUILTIN(p16) \
> +  VGET_LOW_BUILTIN(p64) \
> +  VGET_LOW_BUILTIN(s8) \
> +  VGET_LOW_BUILTIN(s16) \
> +  VGET_LOW_BUILTIN(s32) \
> +  VGET_LOW_BUILTIN(s64) \
> +  VGET_LOW_BUILTIN(u8) \
> +  VGET_LOW_BUILTIN(u16) \
> +  VGET_LOW_BUILTIN(u32) \
> +  VGET_LOW_BUILTIN(u64) \
> +  VGET_LOW_BUILTIN(bf16)
> +
>  typedef struct
>  {
>const char *name;
> @@ -697,6 +714,9 @@ typedef struct
>  #define VREINTERPRET_BUILTIN(A, B, L) \
>AARCH64_SIMD_BUILTIN_VREINTERPRET##L##_##A##_##B,
>  
> +#define VGET_LOW_BUILTIN(A) \
> +  AARCH64_SIMD_BUILTIN_VGET_LOW_##A,
> +
>  #undef VAR1
>  #define VAR1(T, N, MAP, FLAG, A) \
>AARCH64_SIMD_BUILTIN_##T##_##N##A,
> @@ -732,6 +752,7 @@ enum aarch64_builtins
>AARCH64_CRC32_BUILTIN_MAX,
>/* SIMD intrinsic builtins.  */
>AARCH64_SIMD_VREINTERPRET_BUILTINS
> +  AARCH64_SIMD_VGET_LOW_BUILTINS
>/* ARMv8.3-A Pointer Authentication Builtins.  */
>AARCH64_PAUTH_BUILTIN_AUTIA1716,
>AARCH64_PAUTH_BUILTIN_PACIA1716,
> @@ -823,8 +844,37 @@ static aarch64_fcmla_laneq_builtin_datum 
> aarch64_fcmla_lane_builtin_data[] = {
>   && SIMD_INTR_QUAL(A) == SIMD_INTR_QUAL(B) \
>},
>  
> +#undef VGET_LOW_BUILTIN
> +#define VGET_LOW_BUILTIN(A) \
> +  {"vget_low_" #A, \
> +   AARCH64_SIMD_BUILTIN_VGET_LOW_##A, \
> +   2, \
> +   { SIMD_INTR_MODE(A, d), SIMD_INTR_MODE(A, q) }, \
> +   { SIMD_INTR_QUAL(A), SIMD_INTR_QUAL(A) }, \
> +   FLAG_AUTO_FP, \
> +   false \
> +  },
> +
> +#define AARCH64_SIMD_VGET_LOW_BUILTINS \
> +  VGET_LOW_BUILTIN(f16) \
> +  VGET_LOW_BUILTIN(f32) \
> +  VGET_LOW_BUILTIN(f64) \
> +  VGET_LOW_BUILTIN(p8) \
> +  VGET_LOW_BUILTIN(p16) \
> +  VGET_LOW_BUILTIN(p64) \
> +  VGET_LOW_BUILTIN(s8) \
> +  VGET_LOW_BUILTIN(s16) \
> +  VGET_LOW_BUILTIN(s32) \
> +  VGET_LOW_BUILTIN(s64) \
> +  VGET_LOW_BUILTIN(u8) \
> +  VGET_LOW_BUILTIN(u16) \
> +  VGET_LOW_BUILTIN(u32) \
> +  VGET_LOW_BUILTIN(u64) \
> +  

Re: [Patch, aarch64] v7: Preparatory patch to place target independent and dependent changed code in one file

2024-05-20 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Alex/Richard:
>
> All comments are addressed.
>
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
>
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> Bootstrapped and regtested on aarch64-linux-gnu.
>
> Thanks & Regards
> Ajit
>
>
> aarch64: Preparatory patch to place target independent and
> dependent changed code in one file
>
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
>
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> 2024-05-18  Ajit Kumar Agarwal  
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-ldp-fusion.cc: Factor out a
>   target-independent interface and move it to the head of the file
> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 555 +++
>  1 file changed, 373 insertions(+), 182 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 1d9caeab05d..e4e55b84f8b 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -138,6 +138,235 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int ) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const = 0;
> +  virtual void advance () = 0;
> +};
> +
> +// When querying should_handle_writeback, this enum is used to
> +// qualify which opportunities we are asking about.
> +enum class writeback {
> +  // Only those writeback opportunities that arise from existing
> +  // auto-increment accesses.
> +  EXISTING,
> +
> +  // All writeback opportunities including those that involve folding
> +  // base register updates into a non-writeback pair.

This misses:

> There should be a comma after "opportunities"

from the previous review.  I.e.:

  // All writeback opportunities, including those that involve folding
  // base register updates into a non-writeback pair.

OK with that change, thanks.

Richard


Re: [Patch, aarch64] v6: Preparatory patch to place target independent and, dependent changed code in one file

2024-05-17 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Alex/Richard:
>
> All review comments are addressed.
>
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
>
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> Bootstrapped and regtested on aarch64-linux-gnu.
>
> Thanks & Regards
> Ajit

Thanks for the patch and thanks to Alex for the reviews.  The patch
looks good to me apart from the minor nits below and the comments that
Alex had.  Please post the updated patch for a final ok though.

> aarch64: Preparatory patch to place target independent and
> dependent changed code in one file
>
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
>
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> 2024-05-15  Ajit Kumar Agarwal  
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-ldp-fusion.cc: Place target
>   independent and dependent changed code.

Not sure this is a complete sentence.  Maybe:

* config/aarch64/aarch64-ldp-fusion.cc: Factor out a
target-independent interface and move it to the head of the file.

That technically isn't detailed enough for a changelog entry,
but IMO we should use it anyway.  It's pointless to write the usual
amount of detail when the code is going to move soon.

> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 533 +++
>  1 file changed, 357 insertions(+), 176 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 1d9caeab05d..429e532ea3b 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -138,6 +138,225 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int &budget) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const = 0;
> +  virtual void advance () = 0;
> +};
> +
> +// When querying handle_writeback_opportunities, this enum is used to
> +// qualify which opportunities we are asking about.
> +enum class writeback {
> +  // Only those writeback opportunities that arise from existing
> +  // auto-increment accesses.
> +  EXISTING,
> +  // All writeback opportunities including those that involve folding

There should be a comma after "opportunities"

> +  // base register updates into a non-writeback pair.
> +  ALL
> +};
> +
> +struct pair_fusion {
> +  pair_fusion ()
> +  {
> +calculate_dominance_info (CDI_DOMINATORS);
> +df_analyze ();
> +crtl->ssa = new rtl_ssa::function_info (cfun);
> +  };

Unnecessary trailing ";".  I think it'd be better to define this and
the destructor out-of-line though.  For one thing, it'll reduce the number
of header file dependencies, once the code is moved to its own header file.
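
A rough sketch of the out-of-line form being suggested (the constructor
body is taken from the quoted patch; the destructor body is an
assumption, shown only to illustrate the shape):

  // In the shared header: declarations only, so the header does not
  // need the dominance/df/rtl-ssa includes.
  struct pair_fusion
  {
    pair_fusion ();
    ~pair_fusion ();
    // ... virtual interface as in the patch ...
  };

  // In the .cc file:
  pair_fusion::pair_fusion ()
  {
    calculate_dominance_info (CDI_DOMINATORS);
    df_analyze ();
    crtl->ssa = new rtl_ssa::function_info (cfun);
  }

  pair_fusion::~pair_fusion ()
  {
    free_dominance_info (CDI_DOMINATORS);
    delete crtl->ssa;
    crtl->ssa = nullptr;
  }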

> +
> +  // Given:
> +  // - an rtx REG_OP, the non-memory operand in a load/store insn,
> +  // - a machine_mode MEM_MODE, the mode of the MEM in that insn, and
> +  // - a boolean LOAD_P (true iff the insn is a load), then:
> +  // return true if the access should be considered an FP/SIMD access.
> +  // Such accesses are segregated from GPR accesses, since we only want
> +  // to form pairs for accesses that use the same register file.
> +  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
> +  {
> +return false;
> +  }
> +
> +  // Return true if we should consider forming pairs from memory
> +  // accesses with operand mode MODE at this stage in compilation.
> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> +
> +  // Return true iff REG_OP is a suitable register operand for a paired
> +  // memory access, where LOAD_P is true if we're asking about loads and
> +  // false for stores.  MODE gives the mode of the operand.
> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> +   machine_mode mode) = 0;
> +
> +  // Return alias check limit.
> +  // This is needed to avoid unbounded quadratic behaviour when
> +  // performing alias analysis.
> +  virtual int pair_mem_alias_check_limit () = 0;

I think the end result should be to make this a target-independent
--param, but this is ok/good as an intermediate step.

> +
> +  // Returns true if we should try to handle writeback opportunities.

s/Returns/Return/

> +  // WHICH determines the kinds of writeback opportunities the caller
> +  // is asking about.
> +  

Re: [PATCH 00/12] aarch64: Extend aarch64_feature_flags to 128 bits

2024-05-17 Thread Richard Sandiford
Andrew Carlotti  writes:
> The end goal of the series is to change the definition of 
> aarch64_feature_flags
> from a uint64_t typedef to a class with 128 bits of storage.  This class uses
> operator overloading to mimic the existing integer interface as much as
> possible, but with added restrictions to facilate type checking and
> extensibility.
>
> Patches 01-10 are preliminary enablement work, and have passed regression
> testing.  Are these ok for master?
>
> Patch 11 is an RFC, and the only patch that touches the middle end.  I am
> seeking clarity on which part(s) of the compiler should be expected to handle
> or prevent non-bool types in instruction pattern conditions.  The actual patch
> does not compile by itself (though it does in combination with 12/12), but 
> that
> is not important to the questions I'm asking.
>
> Patch 12 is then a small patch that actually replaces the uint64_t typedef 
> with
> a class.  I think this patch is fine in its current form, but it depends on a
> resolution to the issues in patch 11/12 first.

Thanks for doing this.

Rather than disallowing flags == 0, etc., I think we should allow
aarch64_feature_flags to be constructed from a single uint64_t.
It's a lossless conversion.  The important thing is that we don't
allow conversions the other way (and the patch doesn't allow them).

Also, I think we should make the new class in 12/12 be a templated
 type that provides an N-bit bitmask.  It should arguably
also be target-independent code.  aarch64_feature_flags would then be
an alias with the appropriate number of bits.
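
A minimal sketch of that shape (every name here is a placeholder, not
something taken from the series):

  #include <cstdint>

  // Hypothetical target-independent N-bit bitmask.
  template<unsigned int N>
  struct nbit_mask
  {
    static constexpr unsigned int num_words = (N + 63) / 64;

    // Lossless construction from a single uint64_t is allowed...
    constexpr nbit_mask (uint64_t lowest = 0) : val {lowest} {}

    nbit_mask operator| (const nbit_mask &other) const
    {
      nbit_mask res;
      for (unsigned int i = 0; i < num_words; ++i)
        res.val[i] = val[i] | other.val[i];
      return res;
    }
    // ...but there is deliberately no conversion back to an integer.

    uint64_t val[num_words];
  };

  // The target then just picks the width it needs:
  using aarch64_feature_flags = nbit_mask<128>;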

For the RFC in 11/12, how about, as another prepatch before 12/12,
removing all the mechanical:

#define AARCH64_ISA_LS64   (aarch64_isa_flags & AARCH64_FL_LS64)

style macros and replacing uses with something like:

  AARCH64_HAVE_ISA (LS64)

Uses outside aarch64.h should arguably be changed to TARGET_* instead,
since the convention seems to be that TARGET_* checks the underlying
ISA flag and also any other relevant conditions (where applicable).
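
One possible shape for that macro (an assumption about the eventual
form, not something posted in the series; the explicit bool conversion
is what keeps it working once the flags stop being a plain integer):

  #define AARCH64_HAVE_ISA(X) \
    (bool (aarch64_isa_flags & AARCH64_FL_##X))

  /* So the example above would become AARCH64_HAVE_ISA (LS64).  */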

Thanks,
Richard


Re: [RFC] Merge strategy for all-SLP vectorizer

2024-05-17 Thread Richard Sandiford via Gcc
Richard Biener via Gcc  writes:
> Hi,
>
> I'd like to discuss how to go forward with getting the vectorizer to
> all-SLP for this stage1.  While there is a personal branch with my
> ongoing work (users/rguenth/vect-force-slp) branches haven't proved
> themselves working well for collaboration.

Speaking for myself, the problem hasn't been so much the branch as
lack of time.  I've been pretty swamped the last eight months or so
(except for the time that I took off, which admittedly was quite a
bit!), and so I never even got around to properly reading and replying
to your message after the Cauldron.  It's been on the "this is important,
I should make time to read and understand it properly" list all this time.
Sorry about that. :(

I'm hoping to have time to work/help out on SLP stuff soon.

> The branch isn't ready to be merged in full but I have been picking
> improvements to trunk last stage1 and some remaining bits in the past
> weeks.  I have refrained from merging code paths that cannot be
> exercised on trunk.
>
> There are two important set of changes on the branch, both critical
> to get more testing on non-x86 targets.
>
>  1. enable single-lane SLP discovery
>  2. avoid splitting store groups (9315bfc661432c3 and 4336060fe2db8ec
> if you fetch the branch)
>
> The first point is also most annoying on the testsuite since doing
> SLP instead of interleaving changes what we dump and thus tests
> start to fail in random ways when you switch between both modes.
> On the branch single-lane SLP discovery is gated with
> --param vect-single-lane-slp.
>
> The branch has numerous changes to enable single-lane SLP for some
> code paths that have SLP not implemented and where I did not bother
> to try supporting multi-lane SLP at this point.  It also adds more
> SLP discovery entry points.
>
> I'm not sure how to try merging these pieces to allow others to
> more easily help out.  One possibility is to merge
> --param vect-single-lane-slp defaulted off and pick dependent
> changes even when they cause testsuite regressions with
> vect-single-lane-slp=1.  Alternatively adjust the testsuite by
> adding --param vect-single-lane-slp=0 and default to 1
> (or keep the default).

FWIW, this one sounds good to me (the default to 1 version).
I.e. mechanically add --param vect-single-lane-slp=0 to any tests
that fail with the new default.  That means that the test that need
fixing are easily greppable for anyone who wants to help.  Sometimes
it'll just be a test update.  Sometimes it will be new vectoriser code.

Thanks,
Richard

> Or require a clean testsuite with
> --param vect-single-lane-slp defaulted to 1 but keep the --param
> for debugging (and allow FAILs with 0).
>
> For fun I merged just single-lane discovery of non-grouped stores
> and have that enabled by default.  On x86_64 this results in the
> set of FAILs below.
>
> Any suggestions?
>
> Thanks,
> Richard.
>
> FAIL: gcc.dg/vect/O3-pr39675-2.c scan-tree-dump-times vect "vectorizing 
> stmts using SLP" 1
> XPASS: gcc.dg/vect/no-scevccp-outer-12.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 1
> FAIL: gcc.dg/vect/no-section-anchors-vect-66.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/no-section-anchors-vect-68.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-68.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 0
> FAIL: gcc.dg/vect/slp-12a.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-12a.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 1
> FAIL: gcc.dg/vect/slp-19a.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19a.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 1
> FAIL: gcc.dg/vect/slp-19b.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19b.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 1
> FAIL: gcc.dg/vect/slp-19c.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorized 1 loops" 1
> FAIL: gcc.dg/vect/slp-19c.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 1
> FAIL: gcc.dg/vect/slp-19c.c scan-tree-dump-times vect "vectorized 1 loops" 
> 1
> FAIL: 

Re: [PATCH] AArch64: Use LDP/STP for large struct types

2024-05-16 Thread Richard Sandiford
Richard Sandiford  writes:
> Wilco Dijkstra  writes:
>> Use LDP/STP for large struct types as they have useful immediate offsets and 
>> are typically faster.
>> This removes differences between little and big endian and allows use of 
>> LDP/STP without UNSPEC.
>>
>> Passes regress and bootstrap, OK for commit?
>>
>> gcc:
>> * config/aarch64/aarch64.cc (aarch64_classify_address): Treat SIMD 
>> structs identically
>> in little and bigendian.
>> * config/aarch64/aarch64.md (aarch64_mov): Remove VSTRUCT 
>> instructions.
>> (aarch64_be_mov): Allow little-endian, rename to 
>> aarch64_mov.
>> (aarch64_be_movoi): Allow little-endian, rename to aarch64_movoi.
>> (aarch64_be_movci): Allow little-endian, rename to aarch64_movci.
>> (aarch64_be_movxi): Allow little-endian, rename to aarch64_movxi.
>> Remove big-endian special case in define_split variants.
>>
>> gcc/testsuite:
>> * gcc.target/aarch64/torture/simd-abi-8.c: Update to check for 
>> LDP/STP.
>
> [...]
> So another alternative would be to go with the patch as-is,
> but add a new mechanism for gimple to query the valid addresses
> for IFN_(MASK_)LOAD_LANES and IFN_(MASK_)STORE_LANES, rather than
> relying purely on the legitimate address mechanism.  Ideally, the new
> interface would be generic enough that we could use it for target (md)
> builtins as well, to better optimise ACLE code.

Gah, just realised after sending that there's another potential problem.
Currently inline asms can assume that "m" will only include the LD1/ST1
range for little-endian.  We might need to consider using
TARGET_MEM_CONSTRAINT, so that we continue to present the same
interface to asms, but can use the wider range internally.

Thanks,
Richard


Re: [PATCH] AArch64: Use LDP/STP for large struct types

2024-05-16 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Use LDP/STP for large struct types as they have useful immediate offsets and 
> are typically faster.
> This removes differences between little and big endian and allows use of 
> LDP/STP without UNSPEC.
>
> Passes regress and bootstrap, OK for commit?
>
> gcc:
> * config/aarch64/aarch64.cc (aarch64_classify_address): Treat SIMD 
> structs identically
> in little and bigendian.
> * config/aarch64/aarch64.md (aarch64_mov): Remove VSTRUCT 
> instructions.
> (aarch64_be_mov): Allow little-endian, rename to 
> aarch64_mov.
> (aarch64_be_movoi): Allow little-endian, rename to aarch64_movoi.
> (aarch64_be_movci): Allow little-endian, rename to aarch64_movci.
> (aarch64_be_movxi): Allow little-endian, rename to aarch64_movxi.
> Remove big-endian special case in define_split variants.
>
> gcc/testsuite:
> * gcc.target/aarch64/torture/simd-abi-8.c: Update to check for 
> LDP/STP.

I'm nervous about approving the removal of something that was deliberately
added by the initial commits. :)  But, even ignoring the extra offset range,
using LDP/STP makes strong intuitive sense for 2-register modes.  And for
3- and 4-registers modes, it's not surprising if the split that the
patch performs is (at worst) equivalent to what the hardware would do
itself or (at best) something that the hardware handles slightly better.

It's also a significant clean-up.

My only concern is that the main uses of these modes are for LD[234] and
ST[234].  By imposing the LD1/ST1 restrictions, the current little-endian
definition of "m" also corresponds to what LD[234] and ST[234] expect.
This in turn means that ivopts will optimise induction variable selection
to account for the fact that LD[234] and ST[234] do not support offsets.

I think the effect of the patch will be to make ivopts optimise LD[234]
and ST[234] on the assumption that they have the same range as LDP/STP.
We could avoid that if we

(1) Keep:

> @@ -10482,14 +10481,6 @@ aarch64_classify_address (struct 
> aarch64_address_info *info,
>&& (code != REG && code != PLUS))
>  return false;
>  
> -  /* On LE, for AdvSIMD, don't support anything other than POST_INC or
> - REG addressing.  */
> -  if (advsimd_struct_p
> -  && TARGET_SIMD
> -  && !BYTES_BIG_ENDIAN
> -  && (code != POST_INC && code != REG))
> -return false;
> -
>gcc_checking_assert (GET_MODE (x) == VOIDmode
>  || SCALAR_INT_MODE_P (GET_MODE (x)));
>  

but drop the !BYTES_BIG_ENDIAN condition (see the sketch after this list).

(2) Make Ump a defined_relaxed_memory_constraint (so that it accepts
more than "m" does).

(3) Use Ump instead of "o" in the move patterns.
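
For step (1), the retained test would then look roughly like this (the
quoted hunk with only the endianness condition dropped; illustrative,
not part of the posted patch):

  /* For AdvSIMD structure modes, only support POST_INC or REG
     addressing, regardless of endianness.  */
  if (advsimd_struct_p
      && TARGET_SIMD
      && (code != POST_INC && code != REG))
    return false;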

Of course, this might make pure gimple-level data-shuffling worse.
I suppose it could also make RTL passes handle your memcpy use case
more pessimistically, although I'm not sure whether that would be for
legitimate reasons.

So another alternative would be to go with the patch as-is,
but add a new mechanism for gimple to query the valid addresses
for IFN_(MASK_)LOAD_LANES and IFN_(MASK_)STORE_LANES, rather than
relying purely on the legitimate address mechanism.  Ideally, the new
interface would be generic enough that we could use it for target (md)
builtins as well, to better optimise ACLE code.

So the patch is OK as-is from my POV, but I think it's relatively
important that we try to fix the ivopts handling before GCC 15.

Thanks,
Richard

> ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 16b7445d9f72f77a98ab262e21fd24e6cc97eba0..bb8b6963fd5117be82afe6ccd7154ae5302c3691
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -7917,32 +7917,6 @@
>[(set_attr "type" "neon_store1_4reg")]
>  )
>  
> -(define_insn "*aarch64_mov"
> -  [(set (match_operand:VSTRUCT_QD 0 "aarch64_simd_nonimmediate_operand")
> - (match_operand:VSTRUCT_QD 1 "aarch64_simd_general_operand"))]
> -  "TARGET_SIMD && !BYTES_BIG_ENDIAN
> -   && (register_operand (operands[0], mode)
> -   || register_operand (operands[1], mode))"
> -  {@ [ cons: =0 , 1   ; attrs: type, length]
> - [ w, w   ; multiple   ,   ] #
> - [ Utv  , w   ; neon_store_reg_q , 4 ] 
> st1\t{%S1. - %1.}, %0
> - [ w, Utv ; neon_load_reg_q  , 4 ] 
> ld1\t{%S0. - %0.}, %1
> -  }
> -)
> -
> -(define_insn "*aarch64_mov"
> -  [(set (match_operand:VSTRUCT 0 "aarch64_simd_nonimmediate_operand")
> - (match_operand:VSTRUCT 1 "aarch64_simd_general_operand"))]
> -  "TARGET_SIMD && !BYTES_BIG_ENDIAN
> -   && (register_operand (operands[0], mode)
> -   || register_operand (operands[1], mode))"
> -  {@ [ cons: =0 , 1   ; attrs: type, length]
> - [ w, w   ; multiple   ,   ] #
> - [ Utv  , w   ; neon_store_reg_q , 4 ] 

Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Richard Sandiford
Tamar Christina  writes:
>> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
>> >>  wrote:
>> >> >
>> >> > Hi All,
>> >> >
>> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause that 
>> >> > states
>> >> > that for predicated operations that also produce a predicate it is 
>> >> > preferred
>> >> > that the codegen should use a different register for the destination 
>> >> > than that
>> >> > of the input predicate in order to avoid a performance overhead.
>> >> >
>> >> > This of course has the problem that it increases register pressure and 
>> >> > so
>> should
>> >> > be done with care.  Additionally not all micro-architectures have this
>> >> > consideration and so it shouldn't be done as a default thing.
>> >> >
>> >> > The patch series adds support for doing conditional early clobbers 
>> >> > through a
>> >> > combination of new alternatives and attributes to control their 
>> >> > availability.
>> >>
>> >> You could have two alternatives, one with early clobber and one with
>> >> a matching constraint where you'd disparage the matching constraint one?
>> >>
>> >
>> > Yeah, that's what I do, though there's no need to disparage the non-early 
>> > clobber
>> > alternative as the early clobber alternative will naturally get a penalty 
>> > if it needs a
>> > reload.
>> 
>> But I think Richard's suggestion was to disparage the one with a matching
>> constraint (not the earlyclobber), to reflect the increased cost of
>> reusing the register.
>> 
>> We did take that approach for gathers, e.g.:
>> 
>>  [&w, Z,   w, Ui1, Ui1, Upl] ld1\t%0.s, %5/z, [%2.s]
>>  [?w, Z,   0, Ui1, Ui1, Upl] ^
>> 
>> The (supposed) advantage is that, if register pressure is so tight
>> that using matching registers is the only alternative, we still
>> have the opportunity to do that, as a last resort.
>> 
>> Providing only an earlyclobber version means that using the same
>> register is prohibited outright.  If no other register is free, the RA
>> would need to spill something else to free up a temporary register.
>> And it might then do the equivalent of (pseudo-code):
>> 
>>   not p1.b, ..., p0.b
>>   mov p0.d, p1.d
>> 
>> after spilling what would otherwise have occupied p1.  In that
>> situation it would be better to use:
>> 
>>   not p0.b, ..., p0.b
>> 
>> and not introduce the spill of p1.
>
> I think I understood what Richi meant, but I thought it was already working 
> that way.

The suggestion was to use matching constraints (like "0") though,
whereas the patch doesn't.  I think your argument is that you don't
need to use matching constraints.  But that's different from the
suggestion (and from how we handle gathers).

I was going to say in response to patch 3 (but got distracted, sorry):
I don't think we should have:

   &Upa, Upa, ...
   Upa, Upa, ...

(taken from the pure logic ops) enabled at the same time.  Even though
it works for the testcases, I don't think it has well-defined semantics.

The problem is that, taken on its own, the second alternative says that
matching operands are free.  And fundamentally, I don't think the costs
*must* take the earlyclobber alternative over the non-earlyclobber one
(when costing during IRA, for instance).  In principle, the cheapest
is best.

The aim of the gather approach is to make each alternative correct in
isolation.  In:

  [&w, Z,   w, Ui1, Ui1, Upl] ld1\t%0.s, %5/z, [%2.s]
  [?w, Z,   0, Ui1, Ui1, Upl] ^

the second alternative says that it is possible to have operands 0
and 2 be the same vector register, but using that version has the
cost of an extra reload.  In that sense the alternatives are
(essentially) consistent about the restriction.

> i.e. as one of the testcases I had:
>
>> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 
>> -ffixed-p[1-15]
>
> foo:
> mov z31.h, w0
> ptrue   p0.b, all
> cmplo   p0.h, p0/z, z0.h, z31.h
> b   use
>
> and reload did not force a spill.
>
> My understanding of how this works, and how it seems to be working, is that
> since reload costs alternatives from front to back, the cheapest one wins
> and it stops evaluating the rest.
>
> The early clobber case is first and preferred, however when it's not 
> possible, i.e. requires a non-pseudo
> reload, the reload cost is added to the alternative.
>
> However you're right that in the following testcase:
>
> -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 
> -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 
> -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p14 -fdump-rtl-reload
>
> i.e. giving it an extra free register inexplicably causes a spill:
>
> foo:
> addvl   sp, sp, #-1
> mov z31.h, w0
> ptrue   p0.b, all
> str p15, [sp]
> cmplo   p15.h, p0/z, z0.h, z31.h
> mov p0.b, p15.b
> ldr p15, [sp]
> addvl   sp, sp, #1
> b   use
>
> so that's 

Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Biener 
>> Sent: Wednesday, May 15, 2024 12:20 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; ktkac...@gcc.gnu.org; Richard Sandiford
>> 
>> Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on 
>> certain
>> operations.
>> 
>> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
>>  wrote:
>> >
>> > Hi All,
>> >
>> > Some Neoverse Software Optimization Guides (SWoG) have a clause that states
>> > that for predicated operations that also produce a predicate it is 
>> > preferred
>> > that the codegen should use a different register for the destination than 
>> > that
>> > of the input predicate in order to avoid a performance overhead.
>> >
>> > This of course has the problem that it increases register pressure and so 
>> > should
>> > be done with care.  Additionally not all micro-architectures have this
>> > consideration and so it shouldn't be done as a default thing.
>> >
>> > The patch series adds support for doing conditional early clobbers through 
>> > a
>> > combination of new alternatives and attributes to control their 
>> > availability.
>> 
>> You could have two alternatives, one with early clobber and one with
>> a matching constraint where you'd disparage the matching constraint one?
>> 
>
> Yeah, that's what I do, though there's no need to disparage the non-early 
> clobber
> alternative as the early clobber alternative will naturally get a penalty if 
> it needs a
> reload.

But I think Richard's suggestion was to disparage the one with a matching
constraint (not the earlyclobber), to reflect the increased cost of
reusing the register.

We did take that approach for gathers, e.g.:

 [&w, Z,   w, Ui1, Ui1, Upl] ld1\t%0.s, %5/z, [%2.s]
 [?w, Z,   0, Ui1, Ui1, Upl] ^

The (supposed) advantage is that, if register pressure is so tight
that using matching registers is the only alternative, we still
have the opportunity to do that, as a last resort.

Providing only an earlyclobber version means that using the same
register is prohibited outright.  If no other register is free, the RA
would need to spill something else to free up a temporary register.
And it might then do the equivalent of (pseudo-code):

  not p1.b, ..., p0.b
  mov p0.d, p1.d

after spilling what would otherwise have occupied p1.  In that
situation it would be better to use:

  not p0.b, ..., p0.b

and not introduce the spill of p1.

Another case where using matching registers is natural is for
loop-carried dependencies.  Do we want to keep them in:

   loop:
  ...no other sets of p0
  not p0.b, ..., p0.b
  ...no other sets of p0
  bne loop

or should we split it to:

   loop:
  ...no other sets of p0
  not p1.b, ..., p0.b
  mov p0.d, p1.d
  ...no other sets of p0
  bne loop

?

Thanks,
Richard

>
> Cheers,
> Tamar
>
>> > On high register pressure we also use LRA's costing to prefer not to use 
>> > the
>> > alternative and instead just use the tie as this is preferable to a reload.
>> >
>> > Concretely this patch series does:
>> >
>> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2
>> >
>> > foo:
>> > mov z31.h, w0
>> > ptrue   p3.b, all
>> > cmplo   p0.h, p3/z, z0.h, z31.h
>> > b   use
>> >
>> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n1+sve
>> >
>> > foo:
>> > mov z31.h, w0
>> > ptrue   p0.b, all
>> > cmplo   p0.h, p0/z, z0.h, z31.h
>> > b   use
>> >
>> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -
>> ffixed-p[1-15]
>> >
>> > foo:
>> > mov z31.h, w0
>> > ptrue   p0.b, all
>> > cmplo   p0.h, p0/z, z0.h, z31.h
>> > b   use
>> >
>> > Testcases for the changes are in the last patch of the series.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Thanks,
>> > Tamar
>> >
>> > ---
>> >
>> > --


Re: [PATCH 1/4]AArch64: convert several predicate patterns to new compact syntax

2024-05-15 Thread Richard Sandiford
Thanks for doing this as a pre-patch.  Minor request below:

Tamar Christina  writes:
>  ;; Perform a logical operation on operands 2 and 3, using operand 1 as
> @@ -6676,38 +6690,42 @@ (define_insn "@aarch64_pred__z"
>  (define_insn "*3_cc"
>[(set (reg:CC_NZC CC_REGNUM)
>   (unspec:CC_NZC
> -   [(match_operand:VNx16BI 1 "register_operand" "Upa")
> +   [(match_operand:VNx16BI 1 "register_operand")
>  (match_operand 4)
>  (match_operand:SI 5 "aarch64_sve_ptrue_flag")
>  (and:PRED_ALL
>(LOGICAL:PRED_ALL
> -(match_operand:PRED_ALL 2 "register_operand" "Upa")
> -(match_operand:PRED_ALL 3 "register_operand" "Upa"))
> +(match_operand:PRED_ALL 2 "register_operand")
> +(match_operand:PRED_ALL 3 "register_operand"))
>(match_dup 4))]
> UNSPEC_PTEST))
> -   (set (match_operand:PRED_ALL 0 "register_operand" "=Upa")
> +   (set (match_operand:PRED_ALL 0 "register_operand")
>   (and:PRED_ALL (LOGICAL:PRED_ALL (match_dup 2) (match_dup 3))
> (match_dup 4)))]
>"TARGET_SVE"
> -  "s\t%0.b, %1/z, %2.b, %3.b"
> +  {@ [ cons: =0, 1  , 2  , 3  , 4, 5 ]
> + [ Upa , Upa, Upa, Upa,  ,   ] s\t%0.b, %1/z, %2.b, %3.b
> +  }
>  )

Could we leave out these empty trailing constraints?  They're quite
common in SVE & SME patterns and are specifically not meant to influence
instruction selection.  E.g. we've done the same thing for *cnot
(to pick a random example).

Agree with Kyrill's ok otherwise.

Richard


Re: [PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber

2024-05-15 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This adds a new tuning parameter EARLY_CLOBBER_SVE_PRED_DEST for AArch64 to
> allow us to conditionally enable the early clobber alternatives based on the
> tuning models.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-tuning-flags.def
>   (EARLY_CLOBBER_SVE_PRED_DEST): New.
>   * config/aarch64/aarch64.h (TARGET_SVE_PRED_CLOBBER): New.
>   * config/aarch64/aarch64.md (pred_clobber): New.
>   (arch_enabled): Use it.
>
> ---
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index 
> d5bcaebce770f0b217aac783063d39135f754c77..49fbad3ff28bc82b25c61ac501ccf533ec4b4c3f
>  100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -48,4 +48,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
> AVOID_CROSS_LOOP_FMA)
>  
>  AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA)
>  
> +/* Enable if the target prefers to use a fresh register for predicate outputs
> +   rather than re-use an input predicate register.  */
> +AARCH64_EXTRA_TUNING_OPTION ("early_clobber_sve_pred_dest", 
> EARLY_CLOBBER_SVE_PRED_DEST)

Sorry for the bike-shedding, but how about something like "avoid_pred_rmw"?
(I'm open to other suggestions.)  Just looking for something that describes
either the architecture or the end result that we want to achieve.
And preferably something fairly short :)

avoid_* would be consistent with the existing "avoid_cross_loop_fma".

> +
>  #undef AARCH64_EXTRA_TUNING_OPTION
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 
> bbf11faaf4b4340956094a983f8b0dc2649b2d27..76a18dd511f40ebb58ed12d56b46c74084ba7c3c
>  100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -495,6 +495,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = 
> AARCH64_FL_SM_OFF;
>  enabled through +gcs.  */
>  #define TARGET_GCS (AARCH64_ISA_GCS)
>  
> +/*  Prefer different predicate registers for the output of a predicated 
> operation over
> +re-using an existing input predicate.  */
> +#define TARGET_SVE_PRED_CLOBBER (TARGET_SVE \
> +  && (aarch64_tune_params.extra_tuning_flags \
> +  & 
> AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST))
>  
>  /* Standard register usage.  */
>  
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> dbde066f7478bec51a8703b017ea553aa98be309..1ecd1a2812969504bd5114a53473b478c5ddba82
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -445,6 +445,10 @@ (define_enum_attr "arch" "arches" (const_string "any"))
>  ;; target-independent code.
>  (define_attr "is_call" "no,yes" (const_string "no"))
>  
> +;; Indicates whether we want to enable the pattern with an optional early
> +;; clobber for SVE predicates.
> +(define_attr "pred_clobber" "no,yes" (const_string "no"))
> +
>  ;; [For compatibility with Arm in pipeline models]
>  ;; Attribute that specifies whether or not the instruction touches fp
>  ;; registers.
> @@ -461,7 +465,8 @@ (define_attr "fp" "no,yes"
>  (define_attr "arch_enabled" "no,yes"
>(if_then_else
>  (ior
> - (eq_attr "arch" "any")
> + (and (eq_attr "arch" "any")
> +  (eq_attr "pred_clobber" "no"))
>  
>   (and (eq_attr "arch" "rcpc8_4")
>(match_test "AARCH64_ISA_RCPC8_4"))
> @@ -488,7 +493,10 @@ (define_attr "arch_enabled" "no,yes"
>(match_test "TARGET_SVE"))
>  
>   (and (eq_attr "arch" "sme")
> -  (match_test "TARGET_SME")))
> +  (match_test "TARGET_SME"))
> +
> + (and (eq_attr "pred_clobber" "yes")
> +  (match_test "TARGET_SVE_PRED_CLOBBER")))

IMO it'd be better to handle pred_clobber separately from arch, as a new
top-level AND:

  (and
(ior
  (eq_attr "pred_clobber" "no")
  (match_test "!TARGET_..."))
(ior
  ...existing arch tests...))

Thanks,
Richard


Re: [PATCH] AArch64: Use UZP1 instead of INS

2024-05-15 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Use UZP1 instead of INS when combining low and high halves of vectors.
> UZP1 has 3 operands which improves register allocation, and is faster on
> some microarchitectures.
>
> Passes regress & bootstrap, OK for commit?

OK, thanks.  We can add core-specific tuning later if a supported core
strongly prefers INS for some reason, but I agree that the three-address
nature of UZP1 makes it the better default choice.

Richard

>
> gcc:
> * config/aarch64/aarch64-simd.md (aarch64_combine_internal):
> Use UZP1 instead of INS.
> (aarch64_combine_internal_be): Likewise.
>
> gcc/testsuite:
> * gcc.target/aarch64/ldp_stp_16.c: Update to check for UZP1.  
> * gcc.target/aarch64/pr109072_1.c: Likewise.
> * gcc.target/aarch64/vec-init-14.c: Likewise.
> * gcc.target/aarch64/vec-init-9.c: Likewise.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> f8bb973a278c7964f3e3a4f7154a0ab62214b7cf..16b7445d9f72f77a98ab262e21fd24e6cc97eba0
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4388,7 +4388,7 @@
> && (register_operand (operands[0], mode)
> || register_operand (operands[2], mode))"
>{@ [ cons: =0 , 1  , 2   ; attrs: type   , arch  ]
> - [ w, 0  , w   ; neon_ins, simd  ] 
> ins\t%0.[1], %2.[0]
> + [ w, w  , w   ; neon_permute, simd  ] 
> uzp1\t%0.2, %1.2, %2.2
>   [ w, 0  , ?r  ; neon_from_gp, simd  ] 
> ins\t%0.[1], %2
>   [ w, 0  , ?r  ; f_mcr , * ] 
> fmov\t%0.d[1], %2
>   [ w, 0  , Utv ; neon_load1_one_lane , simd  ] 
> ld1\t{%0.}[1], %2
> @@ -4407,7 +4407,7 @@
> && (register_operand (operands[0], mode)
> || register_operand (operands[2], mode))"
>{@ [ cons: =0 , 1  , 2   ; attrs: type   , arch  ]
> - [ w, 0  , w   ; neon_ins, simd  ] 
> ins\t%0.[1], %2.[0]
> + [ w, w  , w   ; neon_permute, simd  ] 
> uzp1\t%0.2, %1.2, %2.2
>   [ w, 0  , ?r  ; neon_from_gp, simd  ] 
> ins\t%0.[1], %2
>   [ w, 0  , ?r  ; f_mcr , * ] 
> fmov\t%0.d[1], %2
>   [ w, 0  , Utv ; neon_load1_one_lane , simd  ] 
> ld1\t{%0.}[1], %2
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c 
> b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> index 
> f1f46e051a86d160a7f7f14872108da87b444ca1..95835aa2eb41c289e7b74f19bb56cf6fa23a3045
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> @@ -80,16 +80,16 @@ CONS2_FN (2, float);
>  
>  /*
>  ** cons2_4_float:{ target aarch64_little_endian }
> -**   ins v0.s\[1\], v1.s\[0\]
> -**   stp d0, d0, \[x0\]
> -**   stp d0, d0, \[x0, #?16\]
> +**   uzp1v([0-9])\.2s, v0\.2s, v1\.2s
> +**   stp d\1, d\1, \[x0\]
> +**   stp d\1, d\1, \[x0, #?16\]
>  **   ret
>  */
>  /*
>  ** cons2_4_float:{ target aarch64_big_endian }
> -**   ins v1.s\[1\], v0.s\[0\]
> -**   stp d1, d1, \[x0\]
> -**   stp d1, d1, \[x0, #?16\]
> +**   uzp1v([0-9])\.2s, v1\.2s, v0\.2s
> +**   stp d\1, d\1, \[x0\]
> +**   stp d\1, d\1, \[x0, #?16\]
>  **   ret
>  */
>  CONS2_FN (4, float);
> @@ -125,8 +125,8 @@ CONS4_FN (2, float);
>  
>  /*
>  ** cons4_4_float:
> -**   ins v[0-9]+\.s[^\n]+
> -**   ins v[0-9]+\.s[^\n]+
> +**   uzp1v[0-9]+\.2s[^\n]+
> +**   uzp1v[0-9]+\.2s[^\n]+
>  **   zip1v([0-9]+).4s, [^\n]+
>  **   stp q\1, q\1, \[x0\]
>  **   stp q\1, q\1, \[x0, #?32\]
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c 
> b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
> index 
> 6c1d2b0bdccfb74b80d938a0d94413f0f9dda5ab..0fc195a598f3b82ff188b3151e77e1272254b78c
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
> @@ -54,7 +54,7 @@ f32x2_1 (float32_t x)
>  
>  /*
>  ** f32x2_2:
> -**   ins v0\.s\[1\], v1.s\[0\]
> +**   uzp1v0\.2s, v0\.2s, v1\.2s
>  **   ret
>  */
>  float32x2_t
> @@ -165,7 +165,7 @@ f64x2_1 (float64_t x)
>  
>  /*
>  ** f64x2_2:
> -**   ins v0\.d\[1\], v1.d\[0\]
> +**   uzp1v0\.2d, v0\.2d, v1\.2d
>  **   ret
>  */
>  float64x2_t
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-14.c 
> b/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
> index 
> 02875088cd98833882cdf15b14dcb426951e428f..1a2cc9fbf473ad0de2d8ef97d7efdbe40d959866
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
> @@ -67,7 +67,7 @@ int32x2_t s32_6(int32_t a0, int32_t a1) {
>  
>  /*
>  ** f32_1:
> -**   ins v0\.s\[1\], v1\.s\[0\]
> +**   uzp1v0\.2s, v0\.2s, v1\.2s
>  **   ret
>  */
>  float32x2_t f32_1(float32_t a0, float32_t a1) {
> @@ -90,7 +90,7 @@ float32x2_t 

[pushed] aarch64: Avoid using mismatched ZERO ZA sizes

2024-04-12 Thread Richard Sandiford
The svzero_mask_za intrinsic tried to use the shortest combination
of .b, .h, .s and .d tiles, allowing mixtures of sizes where necessary.
However, Iain S pointed out that LLVM instead requires the tiles to
have the same suffix.  GAS supports both versions, so this patch
generates the LLVM-friendly form.
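
For example (matching the updated test below), a call such as:

  svzero_mask_za (0xab);

now emits the single-suffix form
"zero { za0.d, za1.d, za3.d, za5.d, za7.d }" instead of the shorter
mixed form "zero { za1.h, za0.d }".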

Tested on aarch64-linux-gnu & pushed.

Please revert the patch if it causes any problems.

Richard


gcc/
* config/aarch64/aarch64.cc (aarch64_output_sme_zero_za): Require
all tiles to have the same suffix.

gcc/testsuite/
* gcc.target/aarch64/sme/acle-asm/zero_mask_za.c (zero_mask_za_ab)
(zero_mask_za_d7, zero_mask_za_bf): Expect a list of .d tiles instead
of a mixture.
---
 gcc/config/aarch64/aarch64.cc | 20 +++
 .../aarch64/sme/acle-asm/zero_mask_za.c   |  6 +++---
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index a2e3d208d76..1beec94629d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -13210,29 +13210,33 @@ aarch64_output_sme_zero_za (rtx mask)
   /* The last entry in the list has the form "za7.d }", but that's the
  same length as "za7.d, ".  */
   static char buffer[sizeof("zero\t{ ") + sizeof ("za7.d, ") * 8 + 1];
-  unsigned int i = 0;
-  i += snprintf (buffer + i, sizeof (buffer) - i, "zero\t");
-  const char *prefix = "{ ";
    for (auto &tile : tiles)
 {
   unsigned int tile_mask = tile.mask;
   unsigned int tile_index = 0;
+  unsigned int i = snprintf (buffer, sizeof (buffer), "zero\t");
+  const char *prefix = "{ ";
+  auto remaining_mask = mask_val;
   while (tile_mask < 0x100)
{
- if ((mask_val & tile_mask) == tile_mask)
+ if ((remaining_mask & tile_mask) == tile_mask)
{
  i += snprintf (buffer + i, sizeof (buffer) - i, "%sza%d.%c",
 prefix, tile_index, tile.letter);
  prefix = ", ";
- mask_val &= ~tile_mask;
+ remaining_mask &= ~tile_mask;
}
  tile_mask <<= 1;
  tile_index += 1;
}
+  if (remaining_mask == 0)
+   {
+ gcc_assert (i + 3 <= sizeof (buffer));
+ snprintf (buffer + i, sizeof (buffer) - i, " }");
+ return buffer;
+   }
 }
-  gcc_assert (mask_val == 0 && i + 3 <= sizeof (buffer));
-  snprintf (buffer + i, sizeof (buffer) - i, " }");
-  return buffer;
+  gcc_unreachable ();
 }
 
 /* Return size in bits of an arithmetic operand which is shifted/scaled and
diff --git a/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/zero_mask_za.c 
b/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/zero_mask_za.c
index 9ce7331ebdd..2ba8f8cc332 100644
--- a/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/zero_mask_za.c
+++ b/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/zero_mask_za.c
@@ -103,21 +103,21 @@ PROTO (zero_mask_za_aa, void, ()) { svzero_mask_za 
(0xaa); }
 
 /*
 ** zero_mask_za_ab:
-** zero{ za1\.h, za0\.d }
+** zero{ za0\.d, za1\.d, za3\.d, za5\.d, za7\.d }
 ** ret
 */
 PROTO (zero_mask_za_ab, void, ()) { svzero_mask_za (0xab); }
 
 /*
 ** zero_mask_za_d7:
-** zero{ za0\.h, za1\.d, za7\.d }
+** zero{ za0\.d, za1\.d, za2\.d, za4\.d, za6\.d, za7\.d }
 ** ret
 */
 PROTO (zero_mask_za_d7, void, ()) { svzero_mask_za (0xd7); }
 
 /*
 ** zero_mask_za_bf:
-** zero{ za1\.h, za0\.s, za2\.d }
+** zero{ za0\.d, za1\.d, za2\.d, za3\.d, za4\.d, za5\.d, za7\.d }
 ** ret
 */
 PROTO (zero_mask_za_bf, void, ()) { svzero_mask_za (0xbf); }
-- 
2.25.1



Re: [PATCH] docs: Update function multiversioning documentation

2024-04-12 Thread Richard Sandiford
Hi Andrew,

Thanks for doing this.  I think it improves the organisation of the
FMV documentation and adds some details that were previously missing.

I've made some suggestions below, but documentation is subjective
and I realise that not everyone will agree with them.

I've also added Sandra to cc: in case she has time to help with this.
[original patch: 
https://gcc.gnu.org/pipermail/gcc-patches/2024-April/649071.html]

Andrew Carlotti  writes:
> Add target_version attribute to Common Function Attributes and update
> target and target_clones documentation.  Move shared detail and examples
> to the Function Multiversioning page.  Add target-specific details to
> target-specific pages.
>
> ---
>
> I've built and checked the info and dvi outputs.  Ok for master?
>
> gcc/ChangeLog:
>
>   * doc/extend.texi (Common Function Attributes): Update target
>   and target_clones documentation, and add target_version.
>   (AArch64 Function Attributes): Add ACLE reference and list
>   supported features.
>   (PowerPC Function Attributes): List supported features.
>   (x86 Function Attributes): Mention function multiversioning.
>   (Function Multiversioning): Update, and move shared detail here.
>
>
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 
> 7b54a241a7bfde03ce86571be9486b30bcea6200..78cc7ad2903b61a06b618b82ba7ad52ed42d944a
>  100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -4178,18 +4178,27 @@ and @option{-Wanalyzer-tainted-size}.
>  Multiple target back ends implement the @code{target} attribute
>  to specify that a function is to
>  be compiled with different target options than specified on the
> -command line.  The original target command-line options are ignored.
> -One or more strings can be provided as arguments.
> -Each string consists of one or more comma-separated suffixes to
> -the @code{-m} prefix jointly forming the name of a machine-dependent
> -option.  @xref{Submodel Options,,Machine-Dependent Options}.
> -
> +command line.  One or more strings can be provided as arguments.
> +The attribute may override the original target command-line options, or it 
> may
> +be combined with them in a target-specific manner.

It's hard to tell from this what the conditions for "may" are,
e.g. whether it depends on the arguments, on the back end, or both.
Could you add a bit more text to clarify (even if it's just a forward
reference)?

With that extra text, and perhaps without, I think it's clearer to
say this after...

>  The @code{target} attribute can be used for instance to have a function
>  compiled with a different ISA (instruction set architecture) than the
> -default.  @samp{#pragma GCC target} can be used to specify target-specific
> +default.

...this.  I.e.:

  Multiple target back ends implement [...] command-line.  
  The @code{target} attribute can be used [...] the default.

  

> +
> +@samp{#pragma GCC target} can be used to specify target-specific
>  options for more than one function.  @xref{Function Specific Option Pragmas},
>  for details about the pragma.
>  
> +On x86, the @code{target} attribute can also be used to create multiple
> +versions of a function, compiled with different target-specific options.
> +@xref{Function Multiversioning} for more details.

It might be clearer to put this at the end, since the rest of the section
goes back to talking about the non-FMV usage.  Perhaps the same goes for
the pragma part.

Also, how about saying that, on AArch64, the equivalent functionality
is provided by the target_version attribute?

> +
> +The options supported by the @code{target} attribute are specific to each
> +target; refer to @ref{x86 Function Attributes}, @ref{PowerPC Function
> +Attributes}, @ref{ARM Function Attributes}, @ref{AArch64 Function 
> Attributes},
> +@ref{Nios II Function Attributes}, and @ref{S/390 Function Attributes}
> +for details.
> +
>  For instance, on an x86, you could declare one function with the
>  @code{target("sse4.1,arch=core2")} attribute and another with
>  @code{target("sse4a,arch=amdfam10")}.  This is equivalent to
> @@ -4211,39 +4220,18 @@ multiple options is equivalent to separating the 
> option suffixes with
>  a comma (@samp{,}) within a single string.  Spaces are not permitted
>  within the strings.
>  
> -The options supported are specific to each target; refer to @ref{x86
> -Function Attributes}, @ref{PowerPC Function Attributes},
> -@ref{ARM Function Attributes}, @ref{AArch64 Function Attributes},
> -@ref{Nios II Function Attributes}, and @ref{S/390 Function Attributes}
> -for details.
> -
>  @cindex @code{target_clones} function attribute
>  @item target_clones (@var{options})
>  The @code{target_clones} attribute is used to specify that a function
> -be cloned into multiple versions compiled with different target options
> -than specified on the command line.  The supported options and restrictions
> -are the same as for @code{target} attribute.
> -
> -For instance, on an 

Re: [PATCH] aarch64: Add rcpc3 dependency on rcpc2 and rcpc

2024-04-12 Thread Richard Sandiford
Andrew Carlotti  writes:
> We don't yet have a separate feature flag for FEAT_LRCPC2 (and adding
> one will require extending the feature bitmask).  Instead, make the
> FEAT_LRCPC patterns available when either armv8.4-a or +rcpc3 is
> specified.  On the other hand, we already have a +rcpc flag, so this
> dependency can be specified directly.
>
> The cpunative test needed updating because it used an invalid Features
> list, since lrcpc3 requires both ilrcpc and lrcpc to be present.
> Without this change, host_detect_local_cpu would return the architecture
> string 'armv8-a+dotprod+crc+crypto+rcpc3+norcpc'.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-option-extensions.def: Add RCPC to
>   RCPC3 dependencies.
>   * config/aarch64/aarch64.h (AARCH64_ISA_RCPC8_4): Add test for
>   RCPC3 bit
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/cpunative/info_24: Include lrcpc and ilrcpc.
>
> ---
>
> Bootstrapped and regression tested on aarch64.  I also verified that the
> atomic-store.c and ldapr-sext.c tests would pass when replacing 'armv8.4-a'
> with 'armv8-a+rcpc3'.
>
> Ok for master?
>
>
> diff --git a/gcc/config/aarch64/aarch64-option-extensions.def 
> b/gcc/config/aarch64/aarch64-option-extensions.def
> index 
> 3155eccd39c8e6825b7fc2bb0d0514c2e7e559bf..42ec0eec31e2ddb0cc6f83fdbaf0fd4eac5ca7f4
>  100644
> --- a/gcc/config/aarch64/aarch64-option-extensions.def
> +++ b/gcc/config/aarch64/aarch64-option-extensions.def
> @@ -153,7 +153,7 @@ AARCH64_FMV_FEATURE("fp16fml", FP16FML, (F16FML))
>  
>  AARCH64_OPT_FMV_EXTENSION("rcpc", RCPC, (), (), (), "lrcpc")
>  
> -AARCH64_OPT_FMV_EXTENSION("rcpc3", RCPC3, (), (), (), "lrcpc3")
> +AARCH64_OPT_FMV_EXTENSION("rcpc3", RCPC3, (RCPC), (), (), "lrcpc3")
>  
>  AARCH64_OPT_FMV_EXTENSION("i8mm", I8MM, (SIMD), (), (), "i8mm")
>  
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 
> 45e901cda644dbe4eaae709e685954f1a6f7dbcf..5870e3f812f6cb0674488b8e17ab7278003d2d54
>  100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -242,7 +242,8 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = 
> AARCH64_FL_SM_OFF;
>  #define AARCH64_ISA_SHA3(aarch64_isa_flags & AARCH64_FL_SHA3)
>  #define AARCH64_ISA_F16FML  (aarch64_isa_flags & AARCH64_FL_F16FML)
>  #define AARCH64_ISA_RCPC(aarch64_isa_flags & AARCH64_FL_RCPC)
> -#define AARCH64_ISA_RCPC8_4 (aarch64_isa_flags & AARCH64_FL_V8_4A)
> +#define AARCH64_ISA_RCPC8_4 (aarch64_isa_flags \
> + & (AARCH64_FL_V8_4A | AARCH64_FL_RCPC3))

It looks like the effect of these two changes is that:

* armv9-a+rcpc3+norcpc leaves TARGET_RCPC2 true and TARGET_RCPC and
  TARGET_RCPC3 false.

* armv8-a+rcpc3+norcpc correctly leaves all three false.

If we add the RCPC3->RCPC dependency then I think we should also
require FL_RCPC alongside FL_V8_4A.  I.e.:

#define AARCH64_ISA_RCPC8_4 (AARCH64_ISA_RCPC \
 && (aarch64_isa_flags \
 & (AARCH64_FL_V8_4A | AARCH64_FL_RCPC3)))

OK with that change, thanks.

Richard


>  #define AARCH64_ISA_RNG (aarch64_isa_flags & AARCH64_FL_RNG)
>  #define AARCH64_ISA_V8_5A   (aarch64_isa_flags & AARCH64_FL_V8_5A)
>  #define AARCH64_ISA_TME (aarch64_isa_flags & AARCH64_FL_TME)
> diff --git a/gcc/testsuite/gcc.target/aarch64/cpunative/info_24 
> b/gcc/testsuite/gcc.target/aarch64/cpunative/info_24
> index 
> 8d3c16a10910af977c560782f9d659c0e51286fd..3c64e00ca3a416ef565bc0b4a5b3e5bd9cfc41bc
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/cpunative/info_24
> +++ b/gcc/testsuite/gcc.target/aarch64/cpunative/info_24
> @@ -1,8 +1,8 @@
>  processor: 0
>  BogoMIPS : 100.00
> -Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 asimddp lrcpc3
> +Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 asimddp lrcpc 
> ilrcpc lrcpc3
>  CPU implementer  : 0xfe
>  CPU architecture: 8
>  CPU variant  : 0x0
>  CPU part : 0xd08
> -CPU revision : 2
> \ No newline at end of file
> +CPU revision : 2


Re: [PATCH] aarch64: Enable +cssc for armv8.9-a

2024-04-12 Thread Richard Sandiford
Andrew Carlotti  writes:
> FEAT_CSSC is mandatory in the architecture from Armv8.9.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-arches.def: Add CSSC to V8_9A
>   dependencies.

OK, thanks.

Richard

>
> ---
>
> Bootstrapped and regression tested on aarch64.  Ok for master?
>
>
> diff --git a/gcc/config/aarch64/aarch64-arches.def 
> b/gcc/config/aarch64/aarch64-arches.def
> index 
> 9bec30e9203bac01155281ef3474846c402bb29e..4634b272e28006b5c6c2d6705a2f1010cbd9ab9b
>  100644
> --- a/gcc/config/aarch64/aarch64-arches.def
> +++ b/gcc/config/aarch64/aarch64-arches.def
> @@ -39,7 +39,7 @@ AARCH64_ARCH("armv8.5-a", generic_armv8_a,   V8_5A, 
> 8,  (V8_4A, SB, SSBS
>  AARCH64_ARCH("armv8.6-a", generic_armv8_a,   V8_6A, 8,  (V8_5A, 
> I8MM, BF16))
>  AARCH64_ARCH("armv8.7-a", generic_armv8_a,   V8_7A, 8,  (V8_6A))
>  AARCH64_ARCH("armv8.8-a", generic_armv8_a,   V8_8A, 8,  (V8_7A, 
> MOPS))
> -AARCH64_ARCH("armv8.9-a", generic_armv8_a,   V8_9A, 8,  (V8_8A))
> +AARCH64_ARCH("armv8.9-a", generic_armv8_a,   V8_9A, 8,  (V8_8A, 
> CSSC))
>  AARCH64_ARCH("armv8-r",   generic_armv8_a,   V8R  , 8,  (V8_4A))
>  AARCH64_ARCH("armv9-a",   generic_armv9_a,   V9A  , 9,  (V8_5A, 
> SVE2))
>  AARCH64_ARCH("armv9.1-a", generic_armv9_a,   V9_1A, 9,  (V8_6A, V9A))


Re: [PATCH]middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403].

2024-04-12 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, 12 Apr 2024, Tamar Christina wrote:
>
>> Hi All,
>> 
>> This is a story all about how the peeling for gaps introduces a bug in the 
>> upper
>> bounds.
>> 
>> Before I go further, I'll first explain how I understand this to work for 
>> loops
>> with a single exit.
>> 
>> When peeling for gaps we peel N < VF iterations to scalar.
>> This happens by removing N iterations from the calculation of niters such 
>> that
>> vect_iters * VF == niters is always false.
>> 
>> In other words, when we exit the vector loop we always fall to the scalar 
>> loop.
>> The loop bounds adjustment guarantees this. Because of this we potentially
>> execute a vector loop iteration less.  That is, if you're at the boundary
>> condition where niters % VF by peeling one or more scalar iterations the 
>> vector
>> loop executes one less.
>> 
>> This is accounted for by the adjustments in vect_transform_loops.  This
>> adjustment happens differently based on whether the the vector loop can be
>> partial or not:
>> 
>> Peeling for gaps sets the bias to 0 and then:
>> 
>> when not partial:  we take the floor of (scalar_upper_bound / VF) - 1 to get 
>> the
>> vector latch iteration count.
>> 
>> when loop is partial:  For a single exit this means the loop is masked, we 
>> take
>>the ceil to account for the fact that the loop can 
>> handle
>> the final partial iteration using masking.
>> 
>> Note that there's no difference between ceil an floor on the boundary 
>> condition.
>> There is a difference however when you're slightly above it. i.e. if scalar
>> iterates 14 times and VF = 4 and we peel 1 iteration for gaps.
>> 
>> The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations. and in 
>> effect
>> the partial iteration is ignored and it's done as scalar.
>> 
>> This is fine because the niters modification has capped the vector iteration 
>> at
>> 2.  So that when we reduce the induction values you end up entering the 
>> scalar
>> code with ind_var.2 = ind_var.1 + 2 * VF.
>> 
>> Now lets look at early breaks.  To make it esier I'll focus on the specific
>> testcase:
>> 
>> char buffer[64];
>> 
>> __attribute__ ((noipa))
>> buff_t *copy (buff_t *first, buff_t *last)
>> {
>>   char *buffer_ptr = buffer;
>>   char *const buffer_end = &buffer[SZ-1];
>>   int store_size = sizeof(first->Val);
>>   while (first != last && (buffer_ptr + store_size) <= buffer_end)
>> {
>>   const char *value_data = (const char *)(&first->Val);
>>   __builtin_memcpy(buffer_ptr, value_data, store_size);
>>   buffer_ptr += store_size;
>>   ++first;
>> }
>> 
>>   if (first == last)
>> return 0;
>> 
>>   return first;
>> }
>> 
>> Here the first, early exit is on the condition:
>> 
>>   (buffer_ptr + store_size) <= buffer_end
>> 
>> and the main exit is on condition:
>> 
>>   first != last
>> 
>> This is important, as this bug only manifests itself when the first exit has 
>> a
>> known constant iteration count that's lower than the latch exit count.
>> 
>> because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing 
>> 16
>> bytes per iteration.  So the exit has a known bounds of 8 + 1.
>> 
>> The vectorizer correctly analyzes this:
>> 
>> Statement (exit)if (ivtmp_21 != 0)
>>  is executed at most 8 (bounded by 8) + 1 times in loop 1.
>> 
>> and as a consequence the IV is bound by 9:
>> 
>>   # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)>
>>   ...
>>   vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, 
>> 18446744073709551615, 18446744073709551615, 18446744073709551615 };
>>   mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 };
>>   if (mask_patt_22.17_126 == { -1, -1, -1, -1 })
>> goto ; [88.89%]
>>   else
>> goto ; [11.11%]
>> 
>> The important bits are these:
>> 
>> In this example the value of last - first = 416.
>> 
>> the calculated vector iteration count, is:
>> 
>> x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27
>> 
>> the bounds generated, adjusting for gaps:
>> 
>>x == (((x - 1) >> 2) << 2)
>> 
>> which means we'll always fall through to the scalar code. as intended.
>> 
>> Here are two key things to note:
>> 
>> 1. In this loop, the early exit will always be the one taken.  When it's 
>> taken
>>we enter the scalar loop with the correct induction value to apply the gap
>>peeling.
>> 
>> 2. If the main exit is taken, the induction values assumes you've finished 
>> all
>>vector iterations.  i.e. it assumes you have completed 24 iterations, as 
>> we
>>treat the main exit the same for normal loop vect and early break when not
>>PEELED.
>>This means the induction value is adjusted to ind_var.2 = ind_var.1 + 24 
>> * VF;
>> 
>> So what's going wrong.  The vectorizer's codegen is correct and efficient,
>> however when we adjust the upper bounds, that code knows that the loop's upper
>> bound is based on the early exit. i.e. 8 latch iterations. or in other 

Re: [PATCH v2] aarch64: Preserve mem info on change of base for ldp/stp [PR114674]

2024-04-12 Thread Richard Sandiford
Alex Coplan  writes:
> This is a v2 because I accidentally sent a WIP version of the patch last
> time round which used replace_equiv_address instead of
> replace_equiv_address_nv; that caused some ICEs (pointed out by the
> Linaro CI) since pair addressing modes aren't a subset of the addresses
> that are accepted by memory_operand for a given mode.
>
> This patch should otherwise be identical to v1.  Bootstrapped/regtested
> on aarch64-linux-gnu (indeed this is the patch I actually tested last
> time), is this version also OK for GCC 15?

OK, thanks.  Sorry for missing this in the first review.

Richard

> Thanks,
> Alex
>
> --- >8 ---
>
> The ldp/stp fusion pass can change the base of an access so that the two
> accesses end up using a common base register.  So far we have been using
> adjust_address_nv to do this, but this means that we don't preserve
> other properties of the mem we're replacing.  It seems better to use
> replace_equiv_address_nv, as this will preserve e.g. the MEM_ALIGN of the
> mem whose address we're changing.
>
> The PR shows that by adjusting the other mem we lose alignment
> information about the original access and therefore end up rejecting an
> otherwise viable pair when --param=aarch64-stp-policy=aligned is passed.
> This patch fixes that by using replace_equiv_address_nv instead.
>
> Notably this is the same approach as taken by
> aarch64_check_consecutive_mems when a change of base is required, so
> this at least makes things more consistent between the ldp fusion pass
> and the peepholes.
>
> gcc/ChangeLog:
>
>   PR target/114674
>   * config/aarch64/aarch64-ldp-fusion.cc (ldp_bb_info::fuse_pair):
>   Use replace_equiv_address_nv on a change of base instead of
>   adjust_address_nv on the other access.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/114674
>   * gcc.target/aarch64/pr114674.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 365dcf48b22..d07d79df06c 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -1730,11 +1730,11 @@ ldp_bb_info::fuse_pair (bool load_p,
>   adjust_amt *= -1;
>  
>rtx change_reg = XEXP (change_pat, !load_p);
> -  machine_mode mode_for_mem = GET_MODE (change_mem);
>rtx effective_base = drop_writeback (base_mem);
> -  rtx new_mem = adjust_address_nv (effective_base,
> -mode_for_mem,
> -adjust_amt);
> +  rtx adjusted_addr = plus_constant (Pmode,
> +  XEXP (effective_base, 0),
> +  adjust_amt);
> +  rtx new_mem = replace_equiv_address_nv (change_mem, adjusted_addr);
>rtx new_set = load_p
>   ? gen_rtx_SET (change_reg, new_mem)
>   : gen_rtx_SET (new_mem, change_reg);
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr114674.c 
> b/gcc/testsuite/gcc.target/aarch64/pr114674.c
> new file mode 100644
> index 000..944784fd008
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr114674.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 --param=aarch64-stp-policy=aligned" } */
> +typedef struct {
> + unsigned int f1;
> + unsigned int f2;
> +} test_struct;
> +
> +static test_struct ts = {
> + 123, 456
> +};
> +
> +void foo(void)
> +{
> + ts.f2 = 36969 * (ts.f2 & 65535) + (ts.f1 >> 16);
> + ts.f1 = 18000 * (ts.f2 & 65535) + (ts.f2 >> 16);
> +}
> +/* { dg-final { scan-assembler-times "stp" 1 } } */


Re: [PATCH] aarch64: Fix _BitInt testcases

2024-04-11 Thread Richard Sandiford
"Andre Vieira (lists)"  writes:
> This patch fixes some testisms introduced by:
>
> commit 5aa3fec38cc6f52285168b161bab1a869d864b44
> Author: Andre Vieira 
> Date:   Wed Apr 10 16:29:46 2024 +0100
>
>  aarch64: Add support for _BitInt
>
> The testcases were relying on an unnecessary sign-extend that is no longer
> generated.
>
> The tested version was just slightly behind top of trunk when the patch 
> was committed, and the codegen had changed, for the better, by then.
>
> OK for trunk? (I am away tomorrow, so if you want this in before the 
> weekend feel free to commit it on my behalf, if approved ofc...)
>
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/bitfield-bitint-abi-align16.c (g1, g8, g16, g1p, 
> g8p,
>   g16p): Remove unnecessary sbfx.
>   * gcc.target/aarch64/bitfield-bitint-abi-align8.c (g1, g8, g16, g1p, 
> g8p,
>   g16p): Likewise.

LGTM, thanks.  Pushed to trunk.

Richard

>
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c 
> b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> index 
> 3f292a45f955d35b802a0bd789cd39d5fa7b5860..4a228b0a1ce696dc80e32305162d58f01d44051d
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> @@ -55,9 +55,8 @@
>  ** g1:
>  **   mov (x[0-9]+), x0
>  **   mov w0, w1
> -**   sbfx(x[0-9]+), \1, 0, 63
> -**   and x4, \2, 9223372036854775807
> -**   and x2, \2, 1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
>  **   mov x3, 0
>  **   b   f1
>  */
> @@ -66,9 +65,8 @@
>  ** g8:
>  **   mov (x[0-9]+), x0
>  **   mov w0, w1
> -**   sbfx(x[0-9]+), \1, 0, 63
> -**   and x4, \2, 9223372036854775807
> -**   and x2, \2, 1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
>  **   mov x3, 0
>  **   b   f8
>  */
> @@ -76,9 +74,8 @@
>  ** g16:
>  **   mov (x[0-9]+), x0
>  **   mov w0, w1
> -**   sbfx(x[0-9]+), \1, 0, 63
> -**   and x4, \2, 9223372036854775807
> -**   and x2, \2, 1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
>  **   mov x3, 0
>  **   b   f16
>  */
> @@ -107,9 +104,8 @@
>  /*
>  ** g1p:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
>  **   mov x2, 0
>  **   mov w0, \1
>  **   b   f1p
> @@ -117,9 +113,8 @@
>  /*
>  ** g8p:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
>  **   mov x2, 0
>  **   mov w0, \1
>  **   b   f8p
> @@ -128,9 +123,8 @@
>  ** g16p:
>  **   mov (x[0-9]+), x0
>  **   mov w0, w1
> -**   sbfx(x[0-9]+), \1, 0, 63
> -**   and x4, \2, 9223372036854775807
> -**   and x2, \2, 1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
>  **   mov x3, 0
>  **   b   f16p
>  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c 
> b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> index 
> da3c23550bae6734f69e2baf0e8db741fb65cfda..e7f773640f04f56646e5e1a5fb91280ea7e4db98
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> @@ -54,9 +54,8 @@
>  /*
>  ** g1:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
>  **   mov x2, 0
>  **   mov w0, \1
>  **   b   f1
> @@ -65,9 +64,8 @@
>  /*
>  ** g8:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
>  **   mov x2, 0
>  **   mov w0, \1
>  **   b   f8
> @@ -76,9 +74,8 @@
>  ** g16:
>  **   mov (x[0-9]+), x0
>  **   mov w0, w1
> -**   sbfx(x[0-9]+), \1, 0, 63
> -**   and x4, \2, 9223372036854775807
> -**   and x2, \2, 1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
>  **   mov x3, 0
>  **   b   f16
>  */
> @@ -107,9 +104,8 @@
>  /*
>  ** g1p:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
>  **   mov x2, 0
>  **   mov w0, \1
>  **   b   f1p
> @@ -117,9 +113,8 @@
>  /*
>  ** g8p:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and

[gcc r14-9925] aarch64: Fix _BitInt testcases

2024-04-11 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:b87ba79200f2a727aa5c523abcc5c03fa11fc007

commit r14-9925-gb87ba79200f2a727aa5c523abcc5c03fa11fc007
Author: Andre Vieira (lists) 
Date:   Thu Apr 11 17:54:37 2024 +0100

aarch64: Fix _BitInt testcases

This patch fixes some testisms introduced by:

commit 5aa3fec38cc6f52285168b161bab1a869d864b44
Author: Andre Vieira 
Date:   Wed Apr 10 16:29:46 2024 +0100

 aarch64: Add support for _BitInt

The testcases were relying on an unnecessary sign-extend that is no longer
generated.

The tested version was just slightly behind top of trunk when the patch
was committed, and the codegen had changed, for the better, by then.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/bitfield-bitint-abi-align16.c (g1, g8, g16, 
g1p, g8p,
g16p): Remove unnecessary sbfx.
* gcc.target/aarch64/bitfield-bitint-abi-align8.c (g1, g8, g16, 
g1p, g8p,
g16p): Likewise.

Diff:
---
 .../aarch64/bitfield-bitint-abi-align16.c  | 30 +-
 .../aarch64/bitfield-bitint-abi-align8.c   | 30 +-
 2 files changed, 24 insertions(+), 36 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c 
b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
index 3f292a45f95..4a228b0a1ce 100644
--- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
@@ -55,9 +55,8 @@
 ** g1:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f1
 */
@@ -66,9 +65,8 @@
 ** g8:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f8
 */
@@ -76,9 +74,8 @@
 ** g16:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16
 */
@@ -107,9 +104,8 @@
 /*
 ** g1p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1p
@@ -117,9 +113,8 @@
 /*
 ** g8p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f8p
@@ -128,9 +123,8 @@
 ** g16p:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16p
 */
diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c 
b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
index da3c23550ba..e7f773640f0 100644
--- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
@@ -54,9 +54,8 @@
 /*
 ** g1:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1
@@ -65,9 +64,8 @@
 /*
 ** g8:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f8
@@ -76,9 +74,8 @@
 ** g16:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16
 */
@@ -107,9 +104,8 @@
 /*
 ** g1p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1p
@@ -117,9 +113,8 @@
 /*
 ** g8p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and 

Re: [PATCH v2 00/13] Add aarch64-w64-mingw32 target

2024-04-11 Thread Richard Sandiford
Evgeny Karpov  writes:
> Wednesday, April 10, 2024 8:40 PM
> Richard Sandiford  wrote:
>
>> Thanks for the updates and sorry again for the slow review.
>> I've replied to some of the patches in the series but otherwise it looks 
>> good to
>> me.
>> 
>> If you agree with the suggested changes then the series is OK for stage 1,
>> assuming no objections from those with an interest in the
>> x86 cygwin/mingw port.
>> 
>> Richard
>
> Thank you, Richard, for the review!
> v3 addresses all comments on v2.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-April/thread.html#649261

Thanks, the new series looks good to me.

Richard


Re: [PATCH] aarch64: Preserve mem info on change of base for ldp/stp [PR114674]

2024-04-11 Thread Richard Sandiford
Alex Coplan  writes:
> Hi,
>
> The ldp/stp fusion pass can change the base of an access so that the two
> accesses end up using a common base register.  So far we have been using
> adjust_address_nv to do this, but this means that we don't preserve
> other properties of the mem we're replacing.  It seems better to use
> replace_equiv_address_nv, as this will preserve e.g. the MEM_ALIGN of the
> mem whose address we're changing.
>
> The PR shows that by adjusting the other mem we lose alignment
> information about the original access and therefore end up rejecting an
> otherwise viable pair when --param=aarch64-stp-policy=aligned is passed.
> This patch fixes that by using replace_equiv_address_nv instead.
>
> Notably this is the same approach as taken by
> aarch64_check_consecutive_mems when a change of base is required, so
> this at least makes things more consistent between the ldp fusion pass
> and the peepholes.
>
> Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk when stage 1
> opens for GCC 15?

Yes, thanks.

Richard

>
> Thanks,
> Alex
>
>
> gcc/ChangeLog:
>
>   PR target/114674
>   * config/aarch64/aarch64-ldp-fusion.cc (ldp_bb_info::fuse_pair):
>   Use replace_equiv_address_nv on a change of base instead of
>   adjust_address_nv on the other access.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/114674
>   * gcc.target/aarch64/pr114674.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 365dcf48b22..4258a560c48 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -1730,11 +1730,11 @@ ldp_bb_info::fuse_pair (bool load_p,
>   adjust_amt *= -1;
>  
>rtx change_reg = XEXP (change_pat, !load_p);
> -  machine_mode mode_for_mem = GET_MODE (change_mem);
>rtx effective_base = drop_writeback (base_mem);
> -  rtx new_mem = adjust_address_nv (effective_base,
> -mode_for_mem,
> -adjust_amt);
> +  rtx adjusted_addr = plus_constant (Pmode,
> +  XEXP (effective_base, 0),
> +  adjust_amt);
> +  rtx new_mem = replace_equiv_address (change_mem, adjusted_addr);
>rtx new_set = load_p
>   ? gen_rtx_SET (change_reg, new_mem)
>   : gen_rtx_SET (new_mem, change_reg);
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr114674.c 
> b/gcc/testsuite/gcc.target/aarch64/pr114674.c
> new file mode 100644
> index 000..944784fd008
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr114674.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 --param=aarch64-stp-policy=aligned" } */
> +typedef struct {
> + unsigned int f1;
> + unsigned int f2;
> +} test_struct;
> +
> +static test_struct ts = {
> + 123, 456
> +};
> +
> +void foo(void)
> +{
> + ts.f2 = 36969 * (ts.f2 & 65535) + (ts.f1 >> 16);
> + ts.f1 = 18000 * (ts.f2 & 65535) + (ts.f2 >> 16);
> +}
> +/* { dg-final { scan-assembler-times "stp" 1 } } */


Re: [PATCH 0/5] aarch64: FMV feature list fixes

2024-04-10 Thread Richard Sandiford
Andrew Carlotti  writes:
> On Wed, Apr 10, 2024 at 05:42:05PM +0100, Richard Sandiford wrote:
>> Andrew Carlotti  writes:
>> > On Tue, Apr 09, 2024 at 04:43:16PM +0100, Richard Sandiford wrote:
>> >> Andrew Carlotti  writes:
>> >> > The first three patches are trivial changes to the feature list to 
>> >> > reflect
>> >> > recent changes in the ACLE.  Patch 4 removes most of the FMV 
>> >> > multiversioning
>> >> > features that don't work at the moment, and should be entirely 
>> >> > uncontroversial.
>> >> >
>> >> > Patch 5 handles the remaining cases, where there's an inconsistency in 
>> >> > how
>> >> > features are named in the current FMV specification compared to the 
>> >> > existing
>> >> > command line options.  It might be better to instead preserve the 
>> >> > "memtag2",
>> >> > "ssbs2" and "ls64_accdata" names for now; I'd be happy to commit either
>> >> > version.
>> >> 
>> >> Yeah, I suppose patch 5 leaves things in a somewhat awkward state,
>> >> since e.g.:
>> >> 
>> >> -AARCH64_OPT_FMV_EXTENSION("memtag", MEMTAG, (), (), (), "")
>> >> +AARCH64_OPT_EXTENSION("memtag", MEMTAG, (), (), (), "")
>> >>  
>> >> -AARCH64_FMV_FEATURE("memtag2", MEMTAG2, (MEMTAG))
>> >> +AARCH64_FMV_FEATURE("memtag", MEMTAG2, (MEMTAG))
>> >> 
>> >> seems to drop "memtag2" and FEAT_MEMTAG, but keep "memtag" and
>> >> FEAT_MEMTAG2.  Is that right?
>> >
>> > That's deliberate. The FEAT_MEMTAG bit in __aarch64_cpu_features is 
>> > defined to
>> > match the definition of FEAT_MTE in the architecture, and likewise for
>> > FEAT_MEMTAG2/FEAT_MTE2.  However, in Binutils the "+memtag" extension 
>> > enables
>> > both FEAT_MTE and FEAT_MTE2 instructions (although none of the FEAT_MTE2
>> > instructions can be generated from GCC without inline assembly).  The FMV
>> > specification in the ACLE currently uses names "memtag" and "memtag2" that
>> > match the architecture names, but arguably don't match the command line
>> > extension names.  I'm advocating for that to change to match the extension
>> > names in command line options.
>> 
>> Hmm, ok.  I agree it makes sense for the user-visible FMV names to match
>> the command line.  But shouldn't __aarch64_cpu_features either (a) use 
>> exactly
>> the same names as the architecture or (b) use exactly the same names as the
>> command-line (mangled where necessary)?  It seems that we're instead
>> using a third convention that doesn't exactly match the other two.
>
> I agree that the name isn't one I would choose now, but I don't think it 
> matters much that it's inconsistent.

I kind-of think it does though.  Given...

>> That is, I can see the rationale for "memtag" => FEAT_MTE2 and
>> "memtag" => FEAT_MEMTAG.  It just seems odd to have "memtag" => FEAT_MEMTAG2
>> (where MEMTAG2 is an alias of MTE2).
>> 
>> How much leeway do we have to change the __aarch64_cpu_features names?
>> Is it supposed to be a public API (as opposed to ABI)?
>
> I think we're designing it to be capable of being a public API, but we haven't
> yet made it one.  That's partly why I've kept the enum value names the same as
> in LLVM so far.

...this, I don't want to sleep-walk into a situation where we have
one naming convention for the architecture, one for the attributes,
and a third one for the API.  If we're not in a position to commit
to a consistent naming scheme for the API by GCC 14 then it might be
better to remove the FMV features in 5/5 for GCC 14 and revisit in GCC 15.

A patch to do that is pre-approved if you agree (but please say
if you don't).

Thanks,
Richard


Re: [PATCH v2 00/13] Add aarch64-w64-mingw32 target

2024-04-10 Thread Richard Sandiford
Evgeny Karpov  writes:
> Hello,
>
> v2 is ready for the review!
> Based on the v1 review: 
> https://gcc.gnu.org/pipermail/gcc-patches/2024-February/thread.html#646203
>
> Testing for the x86_64-w64-mingw32 target is in progress to avoid
> regression due to refactoring.

Thanks for the updates and sorry again for the slow review.
I've replied to some of the patches in the series but otherwise
it looks good to me.

If you agree with the suggested changes then the series is OK for
stage 1, assuming no objections from those with an interest in the
x86 cygwin/mingw port.

Richard

> Regards,
> Evgeny
>
>
> Changes from v1 to v2:
> Adjust the target name to aarch64-*-mingw* to exclude the big-endian
> target from support.
> Exclude 64-bit ISA.
> Rename enum calling_abi to aarch64_calling_abi.
> Move AArch64 MS ABI definitions FIXED_REGISTERS,
> CALL_REALLY_USED_REGISTERS, and STATIC_CHAIN_REGNUM from aarch64.h 
> to aarch64-abi-ms.h.
> Rename TARGET_ARM64_MS_ABI to TARGET_AARCH64_MS_ABI.
> Exclude TARGET_64BIT from the aarch64 target.
> Exclude HAVE_GAS_WEAK.
> Set HAVE_GAS_ALIGNED_COMM to 1 by default.
> Use a reference from "x86 Windows Options" to 
> "Cygwin and MinGW Options".
> Update commit descriptions to follow standard style.
> Rebase from 4th March 2024.


Re: [PATCH v2 10/13] Rename "x86 Windows Options" to "Cygwin and MinGW Options"

2024-04-10 Thread Richard Sandiford
Evgeny Karpov  writes:
> From: Zac Walker 
> Date: Fri, 1 Mar 2024 02:17:39 +0100
> Subject: [PATCH v2 10/13] Rename "x86 Windows Options" to "Cygwin and MinGW
>  Options"
>
> Rename "x86 Windows Options" to "Cygwin and MinGW Options".
> It will be used also for AArch64.
>
> gcc/ChangeLog:
>
>   * config/i386/mingw-w64.opt.urls: Rename options' name and
>   regenerate option URLs.
>   * config/lynx.opt.urls: Likewise.
>   * config/mingw/cygming.opt.urls: Likewise.
>   * config/mingw/mingw.opt.urls: Likewise.
>   * doc/invoke.texi: Likewise.
> ---
>  gcc/config/i386/mingw-w64.opt.urls |  2 +-
>  gcc/config/lynx.opt.urls   |  2 +-
>  gcc/config/mingw/cygming.opt.urls  | 18 +-
>  gcc/config/mingw/mingw.opt.urls|  2 +-
>  gcc/doc/invoke.texi| 12 ++--
>  5 files changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/gcc/config/i386/mingw-w64.opt.urls 
> b/gcc/config/i386/mingw-w64.opt.urls
> index 6bb53ef29b2..5cceba1d1a1 100644
> --- a/gcc/config/i386/mingw-w64.opt.urls
> +++ b/gcc/config/i386/mingw-w64.opt.urls
> @@ -1,5 +1,5 @@
>  ; Autogenerated by regenerate-opt-urls.py from gcc/config/i386/mingw-w64.opt 
> and generated HTML
>  
>  municode
> -UrlSuffix(gcc/x86-Windows-Options.html#index-municode)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-municode)
>  
> diff --git a/gcc/config/lynx.opt.urls b/gcc/config/lynx.opt.urls
> index 63e7b9c4b33..b547138f7ff 100644
> --- a/gcc/config/lynx.opt.urls
> +++ b/gcc/config/lynx.opt.urls
> @@ -1,5 +1,5 @@
>  ; Autogenerated by regenerate-opt-urls.py from gcc/config/lynx.opt and 
> generated HTML
>  
>  mthreads
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mthreads-1)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mthreads-1)
>  
> diff --git a/gcc/config/mingw/cygming.opt.urls 
> b/gcc/config/mingw/cygming.opt.urls
> index 87799befe3c..c624e22e442 100644
> --- a/gcc/config/mingw/cygming.opt.urls
> +++ b/gcc/config/mingw/cygming.opt.urls
> @@ -1,30 +1,30 @@
>  ; Autogenerated by regenerate-opt-urls.py from gcc/config/i386/cygming.opt 
> and generated HTML
>  
>  mconsole
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mconsole)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mconsole)
>  
>  mdll
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mdll)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mdll)
>  
>  mnop-fun-dllimport
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mnop-fun-dllimport)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mnop-fun-dllimport)
>  
>  ; skipping UrlSuffix for 'mthreads' due to multiple URLs:
> +;   duplicate: 'gcc/Cygwin-and-MinGW-Options.html#index-mthreads-1'
>  ;   duplicate: 'gcc/x86-Options.html#index-mthreads'
> -;   duplicate: 'gcc/x86-Windows-Options.html#index-mthreads-1'
>  
>  mwin32
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mwin32)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mwin32)
>  
>  mwindows
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mwindows)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mwindows)
>  
>  mpe-aligned-commons
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mpe-aligned-commons)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mpe-aligned-commons)
>  
>  fset-stack-executable
> -UrlSuffix(gcc/x86-Windows-Options.html#index-fno-set-stack-executable)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-fno-set-stack-executable)
>  
>  fwritable-relocated-rdata
> -UrlSuffix(gcc/x86-Windows-Options.html#index-fno-writable-relocated-rdata)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-fno-writable-relocated-rdata)
>  
> diff --git a/gcc/config/mingw/mingw.opt.urls b/gcc/config/mingw/mingw.opt.urls
> index 2cbbaadf310..f8ee5be6a53 100644
> --- a/gcc/config/mingw/mingw.opt.urls
> +++ b/gcc/config/mingw/mingw.opt.urls
> @@ -1,7 +1,7 @@
>  ; Autogenerated by regenerate-opt-urls.py from gcc/config/i386/mingw.opt and 
> generated HTML
>  
>  mcrtdll=
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mcrtdll)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mcrtdll)
>  
>  ; skipping UrlSuffix for 'pthread' due to multiple URLs:
>  ;   duplicate: 'gcc/Link-Options.html#index-pthread-1'
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index bdf05be387d..e2e473e095f 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -1493,6 +1493,8 @@ See RS/6000 and PowerPC Options.
>  -munroll-only-small-loops -mlam=@var{choice}}
>  
>  @emph{x86 Windows Options}
> +
> +@emph{Cygwin and MinGW Options}
>  @gccoptlist{-mconsole  -mcrtdll=@var{library}  -mdll
>  -mnop-fun-dllimport  -mthread
>  -municode  -mwin32  -mwindows  -fno-set-stack-executable}
> @@ -20976,6 +20978,7 @@ platform.
>  * C6X Options::
>  * CRIS Options::
>  * C-SKY Options::
> +* Cygwin and MinGW Options::
>  * Darwin Options::
>  * DEC Alpha Options::
>  * eBPF Options::
> @@ -36112,8 +36115,13 @@ positions 62:57 can be used for metadata.
>  
>  @node x86 Windows Options
>  @subsection 

Re: [PATCH v2 08/13] aarch64: Add Cygwin and MinGW environments for AArch64

2024-04-10 Thread Richard Sandiford
Evgeny Karpov  writes:
> From: Zac Walker 
> Date: Fri, 1 Mar 2024 10:49:28 +0100
> Subject: [PATCH v2 08/13] aarch64: Add Cygwin and MinGW environments for
>  AArch64
>
> Define Cygwin and MinGW environment such as types, SEH definitions,
> shared libraries, etc.
>
> gcc/ChangeLog:
>
>   * config.gcc: Add Cygwin and MinGW definitions.
>   * config/aarch64/aarch64-protos.h
>   (mingw_pe_maybe_record_exported_symbol): Declare functions
>   which are used in Cygwin and MinGW environment.
>   (mingw_pe_section_type_flags): Likewise.
>   (mingw_pe_unique_section): Likewise.
>   (mingw_pe_encode_section_info): Likewise.
>   * config/aarch64/cygming.h: New file.
> ---
>  gcc/config.gcc  |   4 +
>  gcc/config/aarch64/aarch64-protos.h |   5 +
>  gcc/config/aarch64/cygming.h| 175 
>  3 files changed, 184 insertions(+)
>  create mode 100644 gcc/config/aarch64/cygming.h
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 3aca257c322..4471599454b 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1267,7 +1267,11 @@ aarch64*-*-linux*)
>  aarch64-*-mingw*)
>   tm_file="${tm_file} aarch64/aarch64-abi-ms.h"
>   tm_file="${tm_file} aarch64/aarch64-coff.h"
> + tm_file="${tm_file} aarch64/cygming.h"
> + tm_file="${tm_file} mingw/mingw32.h"
> + tm_file="${tm_file} mingw/mingw-stdint.h"
>   tmake_file="${tmake_file} aarch64/t-aarch64"
> + target_gtfiles="$target_gtfiles \$(srcdir)/config/mingw/winnt.cc"
>   case ${enable_threads} in
> "" | yes | win32)
>   thread_file='win32'
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index bd719b992a5..759e1a0f9da 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1110,6 +1110,11 @@ extern void aarch64_output_patchable_area (unsigned 
> int, bool);
>  
>  extern void aarch64_adjust_reg_alloc_order ();
>  
> +extern void mingw_pe_maybe_record_exported_symbol (tree, const char *, int);
> +extern unsigned int mingw_pe_section_type_flags (tree, const char *, int);
> +extern void mingw_pe_unique_section (tree, int);
> +extern void mingw_pe_encode_section_info (tree, rtx, int);
> +
>  bool aarch64_optimize_mode_switching (aarch64_mode_entity);
>  void aarch64_restore_za (rtx);
>  
> diff --git a/gcc/config/aarch64/cygming.h b/gcc/config/aarch64/cygming.h
> new file mode 100644
> index 000..2f239c42a89
> --- /dev/null
> +++ b/gcc/config/aarch64/cygming.h
> @@ -0,0 +1,175 @@
> +/* Operating system specific defines to be used when targeting GCC for
> +   hosting on Windows32, using a Unix style C library and tools.
> +   Copyright (C) 1995-2024 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#ifndef GCC_AARCH64_CYGMING_H
> +#define GCC_AARCH64_CYGMING_H
> +
> +#undef PREFERRED_DEBUGGING_TYPE
> +#define PREFERRED_DEBUGGING_TYPE DINFO_TYPE_NONE
> +
> +#define FASTCALL_PREFIX '@'
> +
> +#define print_reg(rtx, code, file)

How about:

#define print_reg(rtx, code, file) (gcc_unreachable ())

so that attempts to use this are a noisy runtime failure?

> +#define SYMBOL_FLAG_DLLIMPORT 0
> +#define SYMBOL_FLAG_DLLEXPORT 0
> +
> +#define SYMBOL_REF_DLLEXPORT_P(X) \
> + ((SYMBOL_REF_FLAGS (X) & SYMBOL_FLAG_DLLEXPORT) != 0)
> +
> +/* Disable SEH and declare the required SEH-related macros that are
> +still needed for compilation.  */
> +#undef TARGET_SEH
> +#define TARGET_SEH 0
> +
> +#define SSE_REGNO_P(N) 0
> +#define GENERAL_REGNO_P(N) 0
> +#define SEH_MAX_FRAME_SIZE 0

Similarly here, how about:

#define SSE_REGNO_P(N) (gcc_unreachable (), 0)
#define GENERAL_REGNO_P(N) (gcc_unreachable (), 0)
#define SEH_MAX_FRAME_SIZE (gcc_unreachable (), 0)

Thanks,
Richard


Re: [PATCH v2 04/13] aarch64: Add aarch64-w64-mingw32 COFF

2024-04-10 Thread Richard Sandiford
Evgeny Karpov  writes:
> From: Zac Walker 
> Date: Fri, 1 Mar 2024 01:55:47 +0100
> Subject: [PATCH v2 04/13] aarch64: Add aarch64-w64-mingw32 COFF
>
> Define ASM specific for COFF format on AArch64.
>
> gcc/ChangeLog:
>
>   * config.gcc: Add COFF format support definitions.
>   * config/aarch64/aarch64-coff.h: New file.
> ---
>  gcc/config.gcc|  1 +
>  gcc/config/aarch64/aarch64-coff.h | 91 +++
>  2 files changed, 92 insertions(+)
>  create mode 100644 gcc/config/aarch64/aarch64-coff.h
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index b762393b64c..cb6661f44ef 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1266,6 +1266,7 @@ aarch64*-*-linux*)
>   ;;
>  aarch64-*-mingw*)
>   tm_file="${tm_file} aarch64/aarch64-abi-ms.h"
> + tm_file="${tm_file} aarch64/aarch64-coff.h"
>   tmake_file="${tmake_file} aarch64/t-aarch64"
>   case ${enable_threads} in
> "" | yes | win32)
> diff --git a/gcc/config/aarch64/aarch64-coff.h 
> b/gcc/config/aarch64/aarch64-coff.h
> new file mode 100644
> index 000..79c5a43b970
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-coff.h
> @@ -0,0 +1,91 @@
> +/* Machine description for AArch64 architecture.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef GCC_AARCH64_COFF_H
> +#define GCC_AARCH64_COFF_H
> +
> +#include "aarch64.h"

Is this needed?  It looks like aarch64-coff.h comes after aarch64.h
in the include list, so I'd have expected the #include to be a no-op.

If you want to emphasise that this file must be included after aarch64.h
then perhaps:

#if !defined(GCC_AARCH64_H)
#error This file must be included after aarch64.h
#endif

would work.  But it should also be ok just to drop the include without
replacing it with anything.

> +
> +#ifndef LOCAL_LABEL_PREFIX
> +# define LOCAL_LABEL_PREFIX  ""
> +#endif
> +
> +/* Using long long breaks -ansi and -std=c90, so these will need to be
> +   made conditional for an LLP64 ABI.  */
> +#undef SIZE_TYPE
> +#define SIZE_TYPE"long long unsigned int"
> +
> +#undef PTRDIFF_TYPE
> +#define PTRDIFF_TYPE "long long int"
> +
> +#undef LONG_TYPE_SIZE
> +#define LONG_TYPE_SIZE 32
> +
> +#ifndef ASM_GENERATE_INTERNAL_LABEL
> +# define ASM_GENERATE_INTERNAL_LABEL(STRING, PREFIX, NUM)  \
> +  sprintf (STRING, "*%s%s%u", LOCAL_LABEL_PREFIX, PREFIX, (unsigned 
> int)(NUM))
> +#endif
> +
> +#define ASM_OUTPUT_ALIGN(STREAM, POWER)  \
> +  fprintf (STREAM, "\t.align\t%d\n", (int)POWER)
> +
> +/* Output a common block.  */
> +#ifndef ASM_OUTPUT_COMMON
> +# define ASM_OUTPUT_COMMON(STREAM, NAME, SIZE, ROUNDED)  \
> +{\
> +  fprintf (STREAM, "\t.comm\t"); \
> +  assemble_name (STREAM, NAME);  \
> +  asm_fprintf (STREAM, ", %d, %d\n", \
> +  (int)(ROUNDED), (int)(SIZE));  \
> +}
> +#endif
> +
> +/* Output a local common block.  /bin/as can't do this, so hack a
> +   `.space' into the bss segment.  Note that this is *bad* practice,
> +   which is guaranteed NOT to work since it doesn't define STATIC
> +   COMMON space but merely STATIC BSS space.  */
> +#ifndef ASM_OUTPUT_ALIGNED_LOCAL
> +# define ASM_OUTPUT_ALIGNED_LOCAL(STREAM, NAME, SIZE, ALIGN) \
> +{
> \
> +  switch_to_section (bss_section);   
> \
> +  ASM_OUTPUT_ALIGN (STREAM, floor_log2 (ALIGN / BITS_PER_UNIT)); \
> +  ASM_OUTPUT_LABEL (STREAM, NAME);   
> \
> +  fprintf (STREAM, "\t.space\t%d\n", (int)(SIZE));   
> \
> +}
> +#endif
> +
> +#define ASM_OUTPUT_SKIP(STREAM, NBYTES)  \
> +  fprintf (STREAM, "\t.space\t%d  // skip\n", (int) (NBYTES))
> +
> +#define ASM_OUTPUT_TYPE_DIRECTIVE(STREAM, NAME, TYPE)
> +#define ASM_DECLARE_FUNCTION_SIZE(FILE, FNAME, DECL)

Just curious: are these empty definitions the final intended
definitions, or are they just temporary?  Might be worth a comment
either way.

Thanks,
Richard

> +
> +#define TEXT_SECTION_ASM_OP  "\t.text"
> +#define DATA_SECTION_ASM_OP  "\t.data"
> +#define BSS_SECTION_ASM_OP   

Re: [PATCH v2 03/13] aarch64: Mark x18 register as a fixed register for MS ABI

2024-04-10 Thread Richard Sandiford
Evgeny Karpov  writes:
> From: Zac Walker 
> Date: Fri, 1 Mar 2024 09:56:59 +0100
> Subject: [PATCH v2 03/13] aarch64: Mark x18 register as a fixed register for
>  MS ABI
>
> Define the MS ABI for aarch64-w64-mingw32.
> Adjust FIXED_REGISTERS, CALL_REALLY_USED_REGISTERS and
> STATIC_CHAIN_REGNUM for AArch64 MS ABI.
> The X18 register is reserved on Windows for the TEB.
>
> gcc/ChangeLog:
>
>   * config.gcc: Define TARGET_AARCH64_MS_ABI when
>   AArch64 MS ABI is used.
>   * config/aarch64/aarch64-abi-ms.h: New file. Adjust
>   FIXED_REGISTERS, CALL_REALLY_USED_REGISTERS and
>   STATIC_CHAIN_REGNUM for AArch64 MS ABI.
> ---
>  gcc/config.gcc  |  1 +
>  gcc/config/aarch64/aarch64-abi-ms.h | 64 +
>  2 files changed, 65 insertions(+)
>  create mode 100644 gcc/config/aarch64/aarch64-abi-ms.h
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 2756377e50b..b762393b64c 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1265,6 +1265,7 @@ aarch64*-*-linux*)
>   TM_MULTILIB_CONFIG=`echo $TM_MULTILIB_CONFIG | sed 's/^,//'`
>   ;;
>  aarch64-*-mingw*)
> + tm_file="${tm_file} aarch64/aarch64-abi-ms.h"
>   tmake_file="${tmake_file} aarch64/t-aarch64"
>   case ${enable_threads} in
> "" | yes | win32)
> diff --git a/gcc/config/aarch64/aarch64-abi-ms.h 
> b/gcc/config/aarch64/aarch64-abi-ms.h
> new file mode 100644
> index 000..90b0dcc5edf
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-abi-ms.h
> @@ -0,0 +1,64 @@
> +/* Machine description for AArch64 MS ABI.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#ifndef GCC_AARCH64_ABI_MS_H
> +#define GCC_AARCH64_ABI_MS_H
> +
> +/* X18 reserved for the TEB on Windows.  */
> +
> +#undef FIXED_REGISTERS
> +#define FIXED_REGISTERS  \
> +  {  \
> +0, 0, 0, 0,   0, 0, 0, 0,/* R0 - R7.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,/* R8 - R15.  */\
> +0, 0, 1, 0,   0, 0, 0, 0,/* R16 - R23.  */   \
> +0, 0, 0, 0,   0, 1, 0, 1,/* R24 - R30, SP.  */   \
> +0, 0, 0, 0,   0, 0, 0, 0,/* V0 - V7.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,   /* V8 - V15.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,   /* V16 - V23.  */\
> +0, 0, 0, 0,   0, 0, 0, 0,   /* V24 - V31.  */\
> +1, 1, 1, 1,  /* SFP, AP, CC, VG.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,/* P0 - P7.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,   /* P8 - P15.  */ \
> +1, 1,/* FFR and FFRT.  */\
> +1, 1, 1, 1, 1, 1, 1, 1   /* Fake registers.  */  \
> +  }
> +
> +#undef CALL_REALLY_USED_REGISTERS
> +#define CALL_REALLY_USED_REGISTERS   \
> +  {  \
> +1, 1, 1, 1,   1, 1, 1, 1,/* R0 - R7.  */ \
> +1, 1, 1, 1,   1, 1, 1, 1,/* R8 - R15.  */\
> +1, 1, 0, 0,   0, 0, 0, 0,   /* R16 - R23.  */\
> +0, 0, 0, 0,   0, 1, 1, 1,/* R24 - R30, SP.  */   \
> +1, 1, 1, 1,   1, 1, 1, 1,/* V0 - V7.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,/* V8 - V15.  */\
> +1, 1, 1, 1,   1, 1, 1, 1,   /* V16 - V23.  */\
> +1, 1, 1, 1,   1, 1, 1, 1,   /* V24 - V31.  */\
> +1, 1, 1, 0,  /* SFP, AP, CC, VG.  */ \
> +1, 1, 1, 1,   1, 1, 1, 1,/* P0 - P7.  */ \
> +1, 1, 1, 1,   1, 1, 1, 1,/* P8 - P15.  */\
> +1, 1,/* FFR and FFRT.  */\
> +0, 0, 0, 0, 0, 0, 0, 0   /* Fake registers.  */  \
> +  }
> +
> +#undef  STATIC_CHAIN_REGNUM
> +#define STATIC_CHAIN_REGNUM R17_REGNUM
> +
> +#endif /* GCC_AARCH64_ABI_MS_H.  */

Gah, I think there was a miscommunication, sorry.  The way I'd interpreted
Richard's comment:

> +/* X18 reserved for the TEB on Windows.  */
> +#ifdef TARGET_ARM64_MS_ABI
> +# define FIXED_X18 1
> +# define CALL_USED_X18 0
> +#else
> +# define FIXED_X18 0
> +# define CALL_USED_X18 1
> +#endif
>
> I'm not overly keen on ifdefs like this (and the one below), it can
> get quite confusing if we have to support more than a couple of ABIs.
> Perhaps we could create a couple of new headers, 

Re: [PATCH v2 02/13] aarch64: The aarch64-w64-mingw32 target implements

2024-04-10 Thread Richard Sandiford
Sorry for the slow reply.

Evgeny Karpov  writes:
> From: Zac Walker 
> Date: Fri, 1 Mar 2024 01:45:13 +0100
> Subject: [PATCH v2 02/13] aarch64: The aarch64-w64-mingw32 target implements
>  the MS ABI
>
> Two ABIs for aarch64 have been defined for different platforms.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-opts.h (enum aarch64_calling_abi):
>   Define two ABIs.
> ---
>  gcc/config/aarch64/aarch64-opts.h | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64-opts.h 
> b/gcc/config/aarch64/aarch64-opts.h
> index a05c0d3ded1..52c9e4596d6 100644
> --- a/gcc/config/aarch64/aarch64-opts.h
> +++ b/gcc/config/aarch64/aarch64-opts.h
> @@ -131,4 +131,11 @@ enum aarch64_early_ra_scope {
>AARCH64_EARLY_RA_NONE
>  };
>  
> +/* Available call ABIs.  */
> +enum aarch64_calling_abi
> +{
> +  AARCH64_CALLING_ABI_EABI,
> +  AARCH64_CALLING_ABI_MS
> +};
> +
>  #endif

Thanks for removing the MS_ABI uses.  However, I'm still a bit uneasy
about this.  We already have a way of categorising ABIs (arm_pcs)
and it's not clear how this new enum would interact with it.  We also
have infrastructure for recording the arm_pcs for each call, including
in RTL insns.  Would we need to do the same for this classification?

It seems like the enum is currently unused.  At least, I was able
to build successfully with:

diff --git a/gcc/config/aarch64/aarch64-opts.h 
b/gcc/config/aarch64/aarch64-opts.h
index 52c9e4596d6..a05c0d3ded1 100644
--- a/gcc/config/aarch64/aarch64-opts.h
+++ b/gcc/config/aarch64/aarch64-opts.h
@@ -131,11 +131,4 @@ enum aarch64_early_ra_scope {
   AARCH64_EARLY_RA_NONE
 };
 
-/* Available call ABIs.  */
-enum aarch64_calling_abi
-{
-  AARCH64_CALLING_ABI_EABI,
-  AARCH64_CALLING_ABI_MS
-};
-
 #endif
diff --git a/gcc/config/aarch64/cygming.h b/gcc/config/aarch64/cygming.h
index 2f239c42a89..902539763bd 100644
--- a/gcc/config/aarch64/cygming.h
+++ b/gcc/config/aarch64/cygming.h
@@ -43,9 +43,6 @@ still needed for compilation.  */
 #define GENERAL_REGNO_P(N) 0
 #define SEH_MAX_FRAME_SIZE 0
 
-#undef DEFAULT_ABI
-#define DEFAULT_ABI AARCH64_CALLING_ABI_MS
-
 #undef TARGET_PECOFF
 #define TARGET_PECOFF 1
 
diff --git a/gcc/config/mingw/mingw32.h b/gcc/config/mingw/mingw32.h
index 040c3e1e521..08f1b5f0696 100644
--- a/gcc/config/mingw/mingw32.h
+++ b/gcc/config/mingw/mingw32.h
@@ -19,9 +19,7 @@ along with GCC; see the file COPYING3.  If not see
 <http://www.gnu.org/licenses/>.  */
 
 #undef DEFAULT_ABI
-#if defined (TARGET_AARCH64_MS_ABI)
-# define DEFAULT_ABI AARCH64_CALLING_ABI_MS
-#else
+#if !defined (TARGET_AARCH64_MS_ABI)
 # define DEFAULT_ABI MS_ABI
 #endif
 
Would you be happy with that for now?  We can then revisit this
later when the information is needed.

Thanks,
Richard


Re: [PATCH 0/5] aarch64: FMV feature list fixes

2024-04-10 Thread Richard Sandiford
Andrew Carlotti  writes:
> On Tue, Apr 09, 2024 at 04:43:16PM +0100, Richard Sandiford wrote:
>> Andrew Carlotti  writes:
>> > The first three patches are trivial changes to the feature list to reflect
>> > recent changes in the ACLE.  Patch 4 removes most of the FMV 
>> > multiversioning
>> > features that don't work at the moment, and should be entirely 
>> > uncontroversial.
>> >
>> > Patch 5 handles the remaining cases, where there's an inconsistency in how
>> > features are named in the current FMV specification compared to the 
>> > existing
>> > command line options.  It might be better to instead preserve the 
>> > "memtag2",
>> > "ssbs2" and "ls64_accdata" names for now; I'd be happy to commit either
>> > version.
>> 
>> Yeah, I suppose patch 5 leaves things in a somewhat awkward state,
>> since e.g.:
>> 
>> -AARCH64_OPT_FMV_EXTENSION("memtag", MEMTAG, (), (), (), "")
>> +AARCH64_OPT_EXTENSION("memtag", MEMTAG, (), (), (), "")
>>  
>> -AARCH64_FMV_FEATURE("memtag2", MEMTAG2, (MEMTAG))
>> +AARCH64_FMV_FEATURE("memtag", MEMTAG2, (MEMTAG))
>> 
>> seems to drop "memtag2" and FEAT_MEMTAG, but keep "memtag" and
>> FEAT_MEMTAG2.  Is that right?
>
> That's deliberate. The FEAT_MEMTAG bit in __aarch64_cpu_features is defined to
> match the definition of FEAT_MTE in the architecture, and likewise for
> FEAT_MEMTAG2/FEAT_MTE2.  However, in Binutils the "+memtag" extension enables
> both FEAT_MTE and FEAT_MTE2 instructions (although none of the FEAT_MTE2
> instructions can be generated from GCC without inline assembly).  The FMV
> specification in the ACLE currently uses names "memtag" and "memtag2" that
> match the architecture names, but arguably don't match the command line
> extension names.  I'm advocating for that to change to match the extension
> names in command line options.

Hmm, ok.  I agree it makes sense for the user-visible FMV names to match
the command line.  But shouldn't __aarch64_cpu_features either (a) use exactly
the same names as the architecture or (b) use exactly the same names as the
command-line (mangled where necessary)?  It seems that we're instead
using a third convention that doesn't exactly match the other two.

That is, I can see the rationale for "memtag" => FEAT_MTE2 and
"memtag" => FEAT_MEMTAG.  It just seems odd to have "memtag" => FEAT_MEMTAG2
(where MEMTAG2 is an alias of MTE2).

How much leeway do we have to change the __aarch64_cpu_features names?
Is it supposed to be a public API (as opposed to ABI)?

> The LS64 example is definitely an inconsistency, since GCC uses "+ls64" to
> enable intrinsics for all of the FEAT_LS64/FEAT_LS64_V/FEAT_LS64_ACCDATA
> intrinsics.

Ok, thanks.  If we go for option (a) above then I agree that the ls64
change is correct.  If we go for option (b) then I suppose it should
stay as LS64.

> There were similar issues with "sha1", "pmull" and "sve2-pmull128", but in
> these cases their presence architecturally is implied by the presence of the
> features checked for "sha2", "aes" and "sve2-aes" so it's fine to just delete
> the ones without command line flags.
>
>> Apart from that and the comment on patch 2, the series looks good to me.
>> 
>> While rechecking aarch64-option-extensions.def against the ACLE list:
>> it seems that the .def doesn't treat mops as an FMV feature.  Is that
>> deliberate?
>
> "mops" was added to the ACLE list later, and libgcc doesn't yet support
> detecting it.  I didn't think it was sensible to add new FMV feature support 
> at
> this stage.

Ah, ok, makes sense.

Richard


Re: [PATCHv3 2/2] aarch64: Add support for _BitInt

2024-04-10 Thread Richard Sandiford
"Andre Vieira (lists)"  writes:
> Added the target check; I also had to change some of the assembly checking
> due to changes upstream.  The assembly is still valid, but we do extend
> where not necessary; I believe that's a general issue, though.
>
> The _BitInt(N > 64) codegen for non-powers of 2 did get worse; we see
> similar codegen with __int128 bitfields on aarch64.
> I suspect we need to improve the way we 'extend' TImode in the aarch64
> backend to be able to operate only on the affected DImode parts of it
> when relevant.  Though I also think we may need to change how _BitInt is
> currently expanded in such situations; right now it does the extension
> as two shifts.  Anyway, I did not have too much time to look deeper into this.
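
For illustration only (this is not the middle-end code, and the helper name
is made up): a minimal C sketch of the two-shift style of extension referred
to above, assuming GCC's usual behaviour of arithmetic right shifts on signed
types and modulo conversion to signed integers.  Here n stands for the
_BitInt width and __int128 for the container:

/* Sketch: extend the low n bits of a 128-bit container by shifting
   them to the top and arithmetically shifting them back down.  */
static inline __int128
sext_low_bits (__int128 x, unsigned n)
{
  unsigned shift = 128 - n;	/* e.g. 63 for _BitInt(65).  */
  return (__int128) ((unsigned __int128) x << shift) >> shift;
}
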
>
> Bootstrapped on aarch64-unknown-linux-gnu.
>
> OK for trunk?

OK, thanks.  In truth I've not gone through the tests very thoroughly
this time around, and just gone by the internal diff between this
version and the previous one.  But we can adjust them as necessary
based on any reports that come in.

Richard

>
> On 28/03/2024 15:21, Richard Sandiford wrote:
>> Jakub Jelinek  writes:
>>> On Thu, Mar 28, 2024 at 03:00:46PM +, Richard Sandiford wrote:
>>>>>   * gcc.target/aarch64/bitint-alignments.c: New test.
>>>>>   * gcc.target/aarch64/bitint-args.c: New test.
>>>>>   * gcc.target/aarch64/bitint-sizes.c: New test.
>>>>>   * gcc.target/aarch64/bitfield-bitint-abi.h: New header.
>>>>>   * gcc.target/aarch64/bitfield-bitint-abi-align16.c: New test.
>>>>>   * gcc.target/aarch64/bitfield-bitint-abi-align8.c: New test.
>>>>
>>>> Since we don't support big-endian yet, I assume the tests should be
>>>> conditional on aarch64_little_endian.
>>>
>>> Perhaps better on bitint effective target, then they'll become available
>>> automatically as soon as big endian aarch64 _BitInt support is turned on.
>> 
>> Ah, yeah, good point.
>> 
>> Richard
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 81400cc666472ffeff40df14e98ae00ebc774d31..c0af4ef151a8c46f78c0c3a43c2ab1318a3f610a
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -6583,6 +6583,7 @@ aarch64_return_in_memory_1 (const_tree type)
>int count;
>  
>if (!AGGREGATE_TYPE_P (type)
> +  && TREE_CODE (type) != BITINT_TYPE
>&& TREE_CODE (type) != COMPLEX_TYPE
>&& TREE_CODE (type) != VECTOR_TYPE)
>  /* Simple scalar types always returned in registers.  */
> @@ -21996,6 +21997,11 @@ aarch64_composite_type_p (const_tree type,
>if (type && (AGGREGATE_TYPE_P (type) || TREE_CODE (type) == COMPLEX_TYPE))
>  return true;
>  
> +  if (type
> +  && TREE_CODE (type) == BITINT_TYPE
> +  && int_size_in_bytes (type) > 16)
> +return true;
> +
>if (mode == BLKmode
>|| GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
>|| GET_MODE_CLASS (mode) == MODE_COMPLEX_INT)
> @@ -28477,6 +28483,42 @@ aarch64_excess_precision (enum excess_precision_type 
> type)
>return FLT_EVAL_METHOD_UNPREDICTABLE;
>  }
>  
> +/* Implement TARGET_C_BITINT_TYPE_INFO.
> +   Return true if _BitInt(N) is supported and fill its details into *INFO.  
> */
> +bool
> +aarch64_bitint_type_info (int n, struct bitint_info *info)
> +{
> +  if (TARGET_BIG_END)
> +return false;
> +
> +  if (n <= 8)
> +info->limb_mode = QImode;
> +  else if (n <= 16)
> +info->limb_mode = HImode;
> +  else if (n <= 32)
> +info->limb_mode = SImode;
> +  else if (n <= 64)
> +info->limb_mode = DImode;
> +  else if (n <= 128)
> +info->limb_mode = TImode;
> +  else
> +/* The AAPCS for AArch64 defines _BitInt(N > 128) as an array with
> +   type {signed,unsigned} __int128[M] where M*128 >= N.  However, to be
> +   able to use libgcc's implementation to support large _BitInt's we need
> +   to use a LIMB_MODE that is no larger than 'long long'.  This is why we
> +   use DImode for our internal LIMB_MODE and we define the ABI_LIMB_MODE 
> to
> +   be TImode to ensure we are ABI compliant.  */
> +info->limb_mode = DImode;
> +
> +  if (n > 128)
> +info->abi_limb_mode = TImode;
> +  else
> +info->abi_limb_mode = info->limb_mode;
> +  info->big_endian = TARGET_BIG_END;
> +  info->extended = false;
> +  return true;
> +}
> +
>  /* Implement TARGET_SCHED_CAN_SPECULATE_INSN.  Return true if INSN can be
> scheduled for speculative ex

Re: [PATCHv2 1/2] aarch64: Do not give ABI change diagnostics for _BitInt(N)

2024-04-10 Thread Richard Sandiford
"Andre Vieira (lists)"  writes:
> @@ -6907,6 +6938,11 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
> function_arg_info &arg)
> && (!alignment || abi_break_gcc_9 < alignment)
> && (!abi_break_gcc_13 || alignment < abi_break_gcc_13));
>  
> +  /* _BitInt(N) was only added in GCC 14.  */
> +  bool warn_pcs_change_le_gcc14
> += warn_pcs_change && !bitint_or_aggr_of_bitint_p (type);
> +
> +

Excess blank line.

OK with that removed, thanks (no need to retest).

Richard


Re: [PATCH]AArch64: Do not allow SIMD clones with simdlen 1 [PR113552][GCC 13/12/11 backport]

2024-04-09 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This is a backport of g:306713c953d509720dc394c43c0890548bb0ae07.
>
> The AArch64 vector PCS does not allow simd calls with simdlen 1,
> however due to a bug we currently do allow it for num == 0.
>
> This causes us to emit a symbol that doesn't exist and we fail to link.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> OK for the GCC 13, 12 and 11 branches?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   PR tree-optimization/113552
>   * config/aarch64/aarch64.cc
>   (aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1.
>
> gcc/testsuite/ChangeLog:
>
>   PR tree-optimization/113552
>   * gcc.target/aarch64/pr113552.c: New test.
>   * gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check.

OK, thanks.

Richard

> ---
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> f546c48ae2d2bad2e34c6b72e5e3e30aba3c3bd6..cec614596c5532b6341ab814f700c4ffc737340f
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -27027,7 +27027,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
> (struct cgraph_node *node,
>   bool explicit_p)
>  {
>tree t, ret_type;
> -  unsigned int elt_bits, count;
> +  unsigned int elt_bits, count = 0;
>unsigned HOST_WIDE_INT const_simdlen;
>poly_uint64 vec_bits;
>  
> @@ -27100,8 +27100,17 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
> (struct cgraph_node *node,
>elt_bits = GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type));
>if (known_eq (clonei->simdlen, 0U))
>  {
> -  count = 2;
> -  vec_bits = (num == 0 ? 64 : 128);
> +  /* We don't support simdlen == 1.  */
> +  if (known_eq (elt_bits, 64))
> + {
> +   count = 1;
> +   vec_bits = 128;
> + }
> +  else
> + {
> +   count = 2;
> +   vec_bits = (num == 0 ? 64 : 128);
> + }
>clonei->simdlen = exact_div (vec_bits, elt_bits);
>  }
>else
> @@ -27121,6 +27130,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
> (struct cgraph_node *node,
> return 0;
>   }
>  }
> +
>clonei->vecsize_int = vec_bits;
>clonei->vecsize_float = vec_bits;
>return count;
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c 
> b/gcc/testsuite/gcc.target/aarch64/pr113552.c
> new file mode 100644
> index 
> ..9c96b061ed2b4fcc57e58925277f74d14f79c51f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -march=armv8-a" } */
> +
> +__attribute__ ((__simd__ ("notinbranch"), const))
> +double cos (double);
> +
> +void foo (float *a, double *b)
> +{
> +for (int i = 0; i < 12; i+=3)
> +  {
> +b[i] = cos (5.0 * a[i]);
> +b[i+1] = cos (5.0 * a[i+1]);
> +b[i+2] = cos (5.0 * a[i+2]);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c 
> b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
> index 
> 95f6a6803e889c02177ef10972962ed62d2095eb..c6dac6b104c94c9de89ed88dc5a73e185d2be125
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
> @@ -18,7 +18,7 @@ double foo(double x)
>  }
>  
>  /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */
> -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */
> +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnM1v_foo} } } */
>  /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */
> -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */
> +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnN1v_foo} } } */
>  /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */


Re: [PATCH 0/5] aarch64: FMV feature list fixes

2024-04-09 Thread Richard Sandiford
Andrew Carlotti  writes:
> The first three patches are trivial changes to the feature list to reflect
> recent changes in the ACLE.  Patch 4 removes most of the FMV multiversioning
> features that don't work at the moment, and should be entirely 
> uncontroversial.
>
> Patch 5 handles the remaining cases, where there's an inconsistency in how
> features are named in the current FMV specification compared to the existing
> command line options.  It might be better to instead preserve the "memtag2",
> "ssbs2" and "ls64_accdata" names for now; I'd be happy to commit either
> version.

Yeah, I suppose patch 5 leaves things in a somewhat awkward state,
since e.g.:

-AARCH64_OPT_FMV_EXTENSION("memtag", MEMTAG, (), (), (), "")
+AARCH64_OPT_EXTENSION("memtag", MEMTAG, (), (), (), "")
 
-AARCH64_FMV_FEATURE("memtag2", MEMTAG2, (MEMTAG))
+AARCH64_FMV_FEATURE("memtag", MEMTAG2, (MEMTAG))

seems to drop "memtag2" and FEAT_MEMTAG, but keep "memtag" and
FEAT_MEMTAG2.  Is that right?

Apart from that and the comment on patch 2, the series looks good to me.

While rechecking aarch64-option-extensions.def against the ACLE list:
it seems that the .def doesn't treat mops as an FMV feature.  Is that
deliberate?

Thanks,
Richard


Re: [PATCH 2/5] aarch64: Don't use FEAT_MAX as array length

2024-04-09 Thread Richard Sandiford
Andrew Carlotti  writes:
> There was an assumption in some places that the aarch64_fmv_feature_data
> array contained FEAT_MAX elements.  While this assumption held up till
> now, it is safer and more flexible to use the array size directly.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (compare_feature_masks):
>   Use ARRAY_SIZE to determine iteration bounds.
>   (aarch64_mangle_decl_assembler_name): Ditto.
>
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 1ea84c8bd7386e399f6ffa3a5e36408cf8831fc6..5de842fcc212c78beba1fa99639e79562d718579
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -19899,7 +19899,8 @@ compare_feature_masks (aarch64_fmv_feature_mask mask1,
>auto diff_mask = mask1 ^ mask2;
>if (diff_mask == 0ULL)
>  return 0;
> -  for (int i = FEAT_MAX - 1; i > 0; i--)
> +  static const int num_features = ARRAY_SIZE (aarch64_fmv_feature_data);

There doesn't seem any need for this to be static (or const).  Same for
the second hunk.

> +  for (int i = num_features - 1; i > 0; i--)

Pre-existing, but is > 0 rather than >= 0 deliberate?  Shouldn't we look
at index 0 as well?

LGTM otherwise.

Thanks,
Richard

>  {
>auto bit_mask = aarch64_fmv_feature_data[i].feature_mask;
>if (diff_mask & bit_mask)
> @@ -19982,7 +19983,8 @@ aarch64_mangle_decl_assembler_name (tree decl, tree 
> id)
>  
>name += "._";
>  
> -  for (int i = 0; i < FEAT_MAX; i++)
> +  static const int num_features = ARRAY_SIZE (aarch64_fmv_feature_data);
> +  for (int i = 0; i < num_features; i++)
>   {
> if (feature_mask & aarch64_fmv_feature_data[i].feature_mask)
>   {


Re: [PATCH v2] aarch64: Fix ACLE SME streaming mode error in neon-sve-bridge

2024-04-09 Thread Richard Sandiford
Richard Ball  writes:
> When using LTO, handling the pragma for sme before the pragma
> for the neon-sve-bridge caused the following error on svset_neonq,
> in the neon-sve-bridge.c test.
>
> error: ACLE function '0' can only be called when SME streaming mode is 
> enabled.
>
> This has been resolved by changing the pragma handlers to accept two modes.
> One where they add functions normally and a second in which 
> registered_functions
> is filled with a placeholder value.
>
> By using this, the ordering of the functions can be maintained.
>
> No regressions on aarch64-none-elf.

OK, thanks.

Richard

> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-c.cc (aarch64_pragma_aarch64):
>   Add functions_nulls parameter to pragma_handlers.
> * config/aarch64/aarch64-protos.h: Likewise.
> * config/aarch64/aarch64-sve-builtins.h
>   (enum handle_pragma_index): Add enum to count
>   number of pragmas to be handled.
> * config/aarch64/aarch64-sve-builtins.cc
>   (GTY): Add global variable for initial indexes
>   and change overload_names to an array.
>   (function_builder::function_builder):
>   Add pragma handler information.
> (function_builder::add_function):
>   Add code for overwriting previous
>   registered_functions entries.
> (add_unique_function):
>   Use an array to register overload_names
>   for both pragma handler modes.
> (add_overloaded_function): Likewise.
> (init_builtins):
>   Add functions_nulls parameter to pragma_handlers.
> (handle_arm_sve_h):
>   Initialize pragma handler information.
> (handle_arm_neon_sve_bridge_h): Likewise.
> (handle_arm_sme_h): Likewise.
>
> diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
> index 
> b5a6917d06db41a845681ed6f13f2800c70ede10..fe1a20e4e546a68e5f7eddff3bbb0d3e831fbd9b
>  100644
> --- a/gcc/config/aarch64/aarch64-c.cc
> +++ b/gcc/config/aarch64/aarch64-c.cc
> @@ -344,15 +344,15 @@ aarch64_pragma_aarch64 (cpp_reader *)
>  
>const char *name = TREE_STRING_POINTER (x);
>if (strcmp (name, "arm_sve.h") == 0)
> -aarch64_sve::handle_arm_sve_h ();
> +aarch64_sve::handle_arm_sve_h (false);
>else if (strcmp (name, "arm_sme.h") == 0)
> -aarch64_sve::handle_arm_sme_h ();
> +aarch64_sve::handle_arm_sme_h (false);
>else if (strcmp (name, "arm_neon.h") == 0)
>  handle_arm_neon_h ();
>else if (strcmp (name, "arm_acle.h") == 0)
>  handle_arm_acle_h ();
>else if (strcmp (name, "arm_neon_sve_bridge.h") == 0)
> -aarch64_sve::handle_arm_neon_sve_bridge_h ();
> +aarch64_sve::handle_arm_neon_sve_bridge_h (false);
>else
>  error ("unknown %<#pragma GCC aarch64%> option %qs", name);
>  }
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 
> bd719b992a545630f9ad6a73753fad55c6ac5d7e..42639e9efcf1e0f9362f759ae63a31b8eeb0d581
>  100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1014,9 +1014,9 @@ bool aarch64_general_check_builtin_call (location_t, 
> vec,
>  
>  namespace aarch64_sve {
>void init_builtins ();
> -  void handle_arm_sve_h ();
> -  void handle_arm_sme_h ();
> -  void handle_arm_neon_sve_bridge_h ();
> +  void handle_arm_sve_h (bool);
> +  void handle_arm_sme_h (bool);
> +  void handle_arm_neon_sve_bridge_h (bool);
>tree builtin_decl (unsigned, bool);
>bool builtin_type_p (const_tree);
>bool builtin_type_p (const_tree, unsigned int *, unsigned int *);
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h 
> b/gcc/config/aarch64/aarch64-sve-builtins.h
> index 
> e66729ed63532811b3b16ab57ae11cb10518caca..3e0d2b8560c538d201c97fba3591ddf9893d664a
>  100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.h
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.h
> @@ -123,6 +123,15 @@ enum units_index
>UNITS_vectors
>  };
>  
> +/* Enumerates the pragma handlers.  */
> +enum handle_pragma_index
> +{
> +  arm_sve_handle,
> +  arm_sme_handle,
> +  arm_neon_sve_handle,
> +  NUM_PRAGMA_HANDLERS
> +};
> +
>  /* Describes the various uses of a governing predicate.  */
>  enum predication_index
>  {
> @@ -419,7 +428,7 @@ class registered_function;
>  class function_builder
>  {
>  public:
> -  function_builder ();
> +  function_builder (handle_pragma_index, bool);
>~function_builder ();
>  
>void add_unique_function (const function_instance &, tree,
> @@ -453,6 +462,12 @@ private:
>  
>/* Used for building up function names.  */
>obstack m_string_obstack;
> +
> +  /* Used to store the index for the current function.  */
> +  unsigned int m_function_index;
> +
> +  /* Stores the mode of the current pragma handler.  */
> +  bool m_function_nulls;
>  };
>  
>  /* A base class for handling calls to built-in functions.  */
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index 

[pushed] aarch64: Fix expansion of svsudot [PR114607]

2024-04-08 Thread Richard Sandiford
Not sure how this happened, but: svsudot is supposed to be expanded
as USDOT with the operands swapped.  However, a thinko in the
expansion of svsudot meant that the arguments weren't in fact
swapped; the attempted swap was just a no-op.  And the testcases
blithely accepted that.
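
As a standalone illustration (not GCC code, and assuming that
rotate_inputs_left (start, end) rotates the half-open range [start, end)
one step to the left, as the name and the fix suggest), the difference
between the two calls can be seen with std::rotate:

  #include <algorithm>
  #include <array>
  #include <cassert>

  int main ()
  {
    std::array<int, 3> args = {10, 20, 30};

    // Range [1, 2) holds a single element, so rotating it moves
    // nothing -- this is what the buggy (1, 2) call amounted to.
    auto a = args;
    std::rotate (a.begin () + 1, a.begin () + 2, a.begin () + 2);
    assert (a == args);

    // Range [1, 3) holds two elements, so rotating left by one swaps
    // them, which is the operand swap that svsudot needs.
    auto b = args;
    std::rotate (b.begin () + 1, b.begin () + 2, b.begin () + 3);
    assert (b[1] == 30 && b[2] == 20);
    return 0;
  }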

Tested on aarch64-linux-gnu and pushed to trunk.  I'll backport
in a few weeks if there's no fallout.

Richard


gcc/
PR target/114607
* config/aarch64/aarch64-sve-builtins-base.cc
(svusdot_impl::expand): Fix botched attempt to swap the operands
for svsudot.

gcc/testsuite/
PR target/114607
* gcc.target/aarch64/sve/acle/asm/sudot_s32.c: New test.
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc   | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 5be2315a3c6..0d2edf3f19e 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -2809,7 +2809,7 @@ public:
version) is through the USDOT instruction but with the second and third
inputs swapped.  */
 if (m_su)
-  e.rotate_inputs_left (1, 2);
+  e.rotate_inputs_left (1, 3);
 /* The ACLE function has the same order requirements as for svdot.
While there's no requirement for the RTL pattern to have the same sort
of order as that for dot_prod, it's easier to read.
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
index 4b452619eee..e06b69affab 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
@@ -6,7 +6,7 @@
 
 /*
 ** sudot_s32_tied1:
-** usdot   z0\.s, z2\.b, z4\.b
+** usdot   z0\.s, z4\.b, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t,
@@ -17,7 +17,7 @@ TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, 
svuint8_t,
 ** sudot_s32_tied2:
 ** mov (z[0-9]+)\.d, z0\.d
 ** movprfx z0, z4
-** usdot   z0\.s, z2\.b, \1\.b
+** usdot   z0\.s, \1\.b, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, svuint8_t,
@@ -27,7 +27,7 @@ TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, 
svuint8_t,
 /*
 ** sudot_w0_s32_tied:
 ** mov (z[0-9]+\.b), w0
-** usdot   z0\.s, z2\.b, \1
+** usdot   z0\.s, \1, z2\.b
 ** ret
 */
 TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, uint8_t,
@@ -37,7 +37,7 @@ TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, 
uint8_t,
 /*
 ** sudot_9_s32_tied:
 ** mov (z[0-9]+\.b), #9
-** usdot   z0\.s, z2\.b, \1
+** usdot   z0\.s, \1, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z (sudot_9_s32_tied, svint32_t, svint8_t, uint8_t,
-- 
2.25.1



[gcc r14-9836] aarch64: Fix expansion of svsudot [PR114607]

2024-04-08 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:2c1c2485a4b1aca746ac693041e51ea6da5c64ca

commit r14-9836-g2c1c2485a4b1aca746ac693041e51ea6da5c64ca
Author: Richard Sandiford 
Date:   Mon Apr 8 16:53:32 2024 +0100

aarch64: Fix expansion of svsudot [PR114607]

Not sure how this happened, but: svsudot is supposed to be expanded
as USDOT with the operands swapped.  However, a thinko in the
expansion of svsudot meant that the arguments weren't in fact
swapped; the attempted swap was just a no-op.  And the testcases
blithely accepted that.

gcc/
PR target/114607
* config/aarch64/aarch64-sve-builtins-base.cc
(svusdot_impl::expand): Fix botched attempt to swap the operands
for svsudot.

gcc/testsuite/
PR target/114607
* gcc.target/aarch64/sve/acle/asm/sudot_s32.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc   | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 5be2315a3c6..0d2edf3f19e 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -2809,7 +2809,7 @@ public:
version) is through the USDOT instruction but with the second and third
inputs swapped.  */
 if (m_su)
-  e.rotate_inputs_left (1, 2);
+  e.rotate_inputs_left (1, 3);
 /* The ACLE function has the same order requirements as for svdot.
While there's no requirement for the RTL pattern to have the same sort
of order as that for dot_prod, it's easier to read.
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
index 4b452619eee..e06b69affab 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
@@ -6,7 +6,7 @@
 
 /*
 ** sudot_s32_tied1:
-** usdot   z0\.s, z2\.b, z4\.b
+** usdot   z0\.s, z4\.b, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t,
@@ -17,7 +17,7 @@ TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, 
svuint8_t,
 ** sudot_s32_tied2:
 ** mov (z[0-9]+)\.d, z0\.d
 ** movprfx z0, z4
-** usdot   z0\.s, z2\.b, \1\.b
+** usdot   z0\.s, \1\.b, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, svuint8_t,
@@ -27,7 +27,7 @@ TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, 
svuint8_t,
 /*
 ** sudot_w0_s32_tied:
 ** mov (z[0-9]+\.b), w0
-** usdot   z0\.s, z2\.b, \1
+** usdot   z0\.s, \1, z2\.b
 ** ret
 */
 TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, uint8_t,
@@ -37,7 +37,7 @@ TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, 
uint8_t,
 /*
 ** sudot_9_s32_tied:
 ** mov (z[0-9]+\.b), #9
-** usdot   z0\.s, z2\.b, \1
+** usdot   z0\.s, \1, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z (sudot_9_s32_tied, svint32_t, svint8_t, uint8_t,


Re: [PATCH][wwwdocs] Add NEON-SVE bridge intrinsics to changes.html

2024-04-08 Thread Richard Sandiford
Richard Ball  writes:
> Hi all,
>
> Adding the NEON-SVE bridge intrinsics that were missed
> in the last patch.
>
> Thanks,
> Richard

OK, thanks.

Richard

> diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
> index 
> 9fd224c1df3f05eadcedaaa41c0859e712b93b78..df63af48298564de9c35bab1dd35891c2581e3d6
>  100644
> --- a/htdocs/gcc-14/changes.html
> +++ b/htdocs/gcc-14/changes.html
> @@ -420,6 +420,12 @@ a work-in-progress.
>-march=armv8.2-a or higher to be specified.  Likewise, the
>intrinsics enabled by +memtag no longer require
>-march=armv8.5-a.
> +  Support for the
> +   href="https://github.com/ARM-software/acle/blob/main/main/acle.md#neon-sve-bridge;>
> +  NEON-SVE Bridge intrinsics.
> +  These are intrinsics that allow conversions between NEON and SVE 
> vectors,
> +  enabled through the inclusion of the 
> arm_neon_sve_bridge.h header.
> +  
>  
>The option -mtp= is now supported for changing the TPIDR
> register used for TLS accesses.  For more details please refer to the


[gcc r14-9833] aarch64: Fix vld1/st1_x4 intrinsic test

2024-04-08 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:278cad85077509b73b1faf32d36f3889c2a5524b

commit r14-9833-g278cad85077509b73b1faf32d36f3889c2a5524b
Author: Swinney, Jonathan 
Date:   Mon Apr 8 14:02:33 2024 +0100

aarch64: Fix vld1/st1_x4 intrinsic test

The test for this intrinsic was failing silently and so it failed to
report the bug reported in 114521. This patch modifies the test to
report the result.

Bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521

Signed-off-by: Jonathan Swinney 

gcc/testsuite/
* gcc.target/aarch64/advsimd-intrinsics/vld1x4.c: Exit with a 
nonzero
code if the test fails.

Diff:
---
 gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c 
b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
index 89b289bb21d..17db262a31a 100644
--- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
+++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
@@ -3,6 +3,7 @@
 /* { dg-skip-if "unimplemented" { arm*-*-* } } */
 /* { dg-options "-O3" } */
 
+#include 
 #include 
 #include "arm-neon-ref.h"
 
@@ -71,13 +72,16 @@ VARIANT (float64, 2, q_f64)
 VARIANTS (TESTMETH)
 
 #define CHECKS(BASE, ELTS, SUFFIX) \
-  if (test_vld1##SUFFIX##_x4 () != 0)  \
-fprintf (stderr, "test_vld1##SUFFIX##_x4");
+  if (test_vld1##SUFFIX##_x4 () != 0) {\
+fprintf (stderr, "test_vld1" #SUFFIX "_x4 failed\n"); \
+failed = true; \
+  }
 
 int
 main (int argc, char **argv)
 {
+  bool failed = false;
   VARIANTS (CHECKS)
 
-  return 0;
+  return (failed) ? 1 : 0;
 }


Re: [PATCH] aarch64: Fix vld1/st1_x4 intrinsic test

2024-04-08 Thread Richard Sandiford
"Swinney, Jonathan"  writes:
> The test for this intrinsic was failing silently and so it failed to
> report the bug reported in 114521. This patch modifies the test to
> report the result.
>
> Bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521
>
> Signed-off-by: Jonathan Swinney 
> ---
>  .../gcc.target/aarch64/advsimd-intrinsics/vld1x4.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)

Thanks, pushed to trunk.

Richard

>
> diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c 
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
> index 89b289bb21d..17db262a31a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
> @@ -3,6 +3,7 @@
>  /* { dg-skip-if "unimplemented" { arm*-*-* } } */
>  /* { dg-options "-O3" } */
>  
> +#include 
>  #include 
>  #include "arm-neon-ref.h"
>  
> @@ -71,13 +72,16 @@ VARIANT (float64, 2, q_f64)
>  VARIANTS (TESTMETH)
>  
>  #define CHECKS(BASE, ELTS, SUFFIX)   \
> -  if (test_vld1##SUFFIX##_x4 () != 0)\
> -fprintf (stderr, "test_vld1##SUFFIX##_x4");
> +  if (test_vld1##SUFFIX##_x4 () != 0) {  \
> +fprintf (stderr, "test_vld1" #SUFFIX "_x4 failed\n"); \
> +failed = true; \
> +  }
>  
>  int
>  main (int argc, char **argv)
>  {
> +  bool failed = false;
>VARIANTS (CHECKS)
>  
> -  return 0;
> +  return (failed) ? 1 : 0;
>  }


Re: [PATCH] rtl-optimization/101523 - avoid re-combine after noop 2->2 combination

2024-04-08 Thread Richard Sandiford
Segher Boessenkool  writes:
> Hi!
>
> On Wed, Apr 03, 2024 at 01:07:41PM +0200, Richard Biener wrote:
>> The following avoids re-walking and re-combining the instructions
>> between i2 and i3 when the pattern of i2 doesn't change.
>> 
>> Bootstrap and regtest running on top of a reversal of 
>> r14-9692-g839bc42772ba7a.
>
> Please include that in the patch (or series, preferably).
>
>> It brings down memory use from 9GB to 400MB and compile-time from
>> 80s to 3.5s.  r14-9692-g839bc42772ba7a does better in both metrics
>> but has shown code generation regressions across architectures.
>> 
>> OK to revert r14-9692-g839bc42772ba7a?
>
> No.
>
> The patch solved a very real problem.  How does your replacement handle
> that?  You don't say.  It looks like it only battles symptoms a bit,
> instead :-(
>
> We had this before: 3->2 combinations that leave an instruction
> identical to what was there before.  This was just a combination with
> context as well.  The only reason this wasn't a huge problem then
> already was because this is a 3->2 combination, even if it really is a
> 2->1 one it still is beneficial in all the same cases.  But in the new
> case it can iterate indefinitely -- well not quite, but some polynomial
> number of times, for a polynomial at least of degree three, possibly
> more :-(
>
> With this patch you need to show combine still is linear.  I don't think
> it is, but some deeper analysis might show it still is.
>
>   ~ - ~ - ~
>
> What should *really* be done is something that has been on the wish list
> for decades: an uncse pass.
>
> The things that combine no longer works on after my patch are actually
> 1->1 combinations (which we never do currently, although we probably
> should); or alternatively, an un-CSE followed by a 2->1 combination.
>
> We can do the latter of course, but we need to do an actual uncse first!
> Somewhere before combine, and then redo a CSE after it.  An actual CSE,
> not doing ten gazillion other things.

Can you give a specific example of a 2->2 combination that we still
want to apply after r14-9692-g839bc42772ba7a?

2->2 combinations as I understand them were added by
c4c5ad1d6d1e1e1fe7a1c2b3bb097cc269dc7306:

Author: Segher Boessenkool 
Date:   Mon Jul 30 15:18:17 2018 +0200

combine: Allow combining two insns to two insns

This patch allows combine to combine two insns into two.  This helps
in many cases, by reducing instruction path length, and also allowing
further combinations to happen.  PR85160 is a typical example of code
that it can improve.

This patch does not allow such combinations if either of the original
instructions was a simple move instruction.  In those cases combining
the two instructions increases register pressure without improving the
code.  With this move test register pressure does no longer increase
noticably as far as I can tell.

(At first I also didn't allow either of the resulting insns to be a
move instruction.  But that is actually a very good thing to have, as
should have been obvious).

PR rtl-optimization/85160
* combine.c (is_just_move): New function.
(try_combine): Allow combining two instructions into two if neither 
of
the original instructions was a move.

That patch didn't have a testcase, but one was added later in
81bdfc1e2940fc93bcd0bba4416daff47f04f3b3:

testcase for 2-2 combine

gcc/testsuite/
PR rtl-optimization/85160
* gcc.target/powerpc/combine-2-2.c: New testcase.

But this is the powerpc test that regresses with the recent patch (PR114518).

The patches reference aarch64 bug PR85160.  If I check out and build
c4c5ad1d6d1e above, I can see that it does indeed remove two mvns from
the PR85160 testcase.  The diff from c4c5ad1d6d1e~ is:

@@ -10,12 +10,10 @@
.cfi_startproc
ldr w3, [x2, w3, sxtw 2]
ldr w2, [x2, w4, sxtw 2]
-   mvn w3, w3
-   mvn w2, w2
-   and w4, w3, w1
-   and w1, w2, w1
-   and w3, w3, w0
-   and w2, w2, w0
+   bic w4, w1, w3
+   bic w3, w0, w3
+   bic w1, w1, w2
+   bic w2, w0, w2
asr w4, w4, 9
asr w1, w1, 7
orr w3, w4, w3, asr 7

(which is great).  But if I apply 839bc42772ba on top of c4c5ad1d6d1e
then the optimisation is undone.

Is that the intention?  I.e. are we effectively removing the kind of
2->2 combinations added in c4c5ad1d6d1e1e1fe?  If so, why not simply
revert c4c5ad1d6d1e1e1fe itself?

Or is there a specific testcase that is still optimised with the
combination of c4c5ad1d6d1e1e1fe and 839bc42772ba7a that would not
be optimised without c4c5ad1d6d1e1e1fe?  If so, can you say what it is?

Thanks,
Richard


Re: [pushed] aarch64: Fix bogus cnot optimisation [PR114603]

2024-04-08 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, Apr 5, 2024 at 3:52 PM Richard Sandiford
>> This isn't a regression on a known testcase.  However, it's a nasty
>> wrong code bug that could conceivably trigger for autovec code (although
>> I've not been able to construct a reproducer so far).  That fix is also
>> quite localised to the buggy operation.  I'd therefore prefer to push
>> the fix now rather than wait for GCC 15.
>
> wrong-code bugs (and also rejects-valid or ice-on-valid) are always exempt
> from the regression-only fixing.  In practice every such bug will be a
> regression,
> in this case to when the combining pattern was introduced (unless that was
> with the version with the initial introduction of the port of course).

Ah, thanks, hadn't realised that.  Makes sense though.

It's good news of a sort since unfortunately I've another SVE wrong-code
fix in the works...

Richard


[pushed] aarch64: Fix bogus cnot optimisation [PR114603]

2024-04-05 Thread Richard Sandiford
aarch64-sve.md had a pattern that combined:

cmpeq   pb.T, pa/z, zc.T, #0
mov zd.T, pb/z, #1

into:

cnotzd.T, pa/m, zc.T

But this is only valid if pa.T is a ptrue.  In other cases, the
original would set inactive elements of zd.T to 0, whereas the
combined form would copy elements from zc.T.
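
In ACLE terms, the original sequence behaves like the following sketch
(the intrinsic names are real, but this is only an illustration of the
semantics, not the code that the pattern was matching):

  #include <arm_sve.h>

  /* zd[i] = (pa[i] && zc[i] == 0) ? 1 : 0; lanes where pa is false
     are zeroed by the zeroing MOV.  */
  svint32_t
  cnot_zeroing (svbool_t pa, svint32_t zc)
  {
    svbool_t pb = svcmpeq_n_s32 (pa, zc, 0);  /* cmpeq pb.s, pa/z, zc.s, #0 */
    return svdup_n_s32_z (pb, 1);             /* mov   zd.s, pb/z, #1 */
  }

A merging cnot does not give that zeroing behaviour for inactive lanes,
hence the restriction to ptrue predicates.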

This isn't a regression on a known testcase.  However, it's a nasty
wrong code bug that could conceivably trigger for autovec code (although
I've not been able to construct a reproducer so far).  That fix is also
quite localised to the buggy operation.  I'd therefore prefer to push
the fix now rather than wait for GCC 15.

Tested on aarch64-linux-gnu & pushed.  I'll backport to branches if
there is no fallout.

Richard

gcc/
PR target/114603
* config/aarch64/aarch64-sve.md (@aarch64_pred_cnot): Replace
with...
(@aarch64_ptrue_cnot): ...this, requiring operand 1 to be
a ptrue.
(*cnot): Require operand 1 to be a ptrue.
* config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand):
Use aarch64_ptrue_cnot for _x operations that are predicated
with a ptrue.  Represent other _x operations as fully-defined _m
operations.

gcc/testsuite/
PR target/114603
* gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.
---
 .../aarch64/aarch64-sve-builtins-base.cc  | 25 ---
 gcc/config/aarch64/aarch64-sve.md | 22 
 .../aarch64/sve/acle/general/cnot_1.c | 23 +
 3 files changed, 50 insertions(+), 20 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/cnot_1.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 257ca5bf6ad..5be2315a3c6 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -517,15 +517,22 @@ public:
   expand (function_expander ) const override
   {
 machine_mode mode = e.vector_mode (0);
-if (e.pred == PRED_x)
-  {
-   /* The pattern for CNOT includes an UNSPEC_PRED_Z, so needs
-  a ptrue hint.  */
-   e.add_ptrue_hint (0, e.gp_mode (0));
-   return e.use_pred_x_insn (code_for_aarch64_pred_cnot (mode));
-  }
-
-return e.use_cond_insn (code_for_cond_cnot (mode), 0);
+machine_mode pred_mode = e.gp_mode (0);
+/* The underlying _x pattern is effectively:
+
+dst = src == 0 ? 1 : 0
+
+   rather than an UNSPEC_PRED_X.  Using this form allows autovec
+   constructs to be matched by combine, but it means that the
+   predicate on the src == 0 comparison must be all-true.
+
+   For simplicity, represent other _x operations as fully-defined _m
+   operations rather than using a separate bespoke pattern.  */
+if (e.pred == PRED_x
+   && gen_lowpart (pred_mode, e.args[0]) == CONSTM1_RTX (pred_mode))
+  return e.use_pred_x_insn (code_for_aarch64_ptrue_cnot (mode));
+return e.use_cond_insn (code_for_cond_cnot (mode),
+   e.pred == PRED_x ? 1 : 0);
   }
 };
 
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index eca8623e587..0434358122d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -3363,24 +3363,24 @@ (define_insn_and_split 
"trunc2"
 ;; - CNOT
 ;; -
 
-;; Predicated logical inverse.
-(define_expand "@aarch64_pred_cnot"
+;; Logical inverse, predicated with a ptrue.
+(define_expand "@aarch64_ptrue_cnot"
   [(set (match_operand:SVE_FULL_I 0 "register_operand")
(unspec:SVE_FULL_I
  [(unspec:
 [(match_operand: 1 "register_operand")
- (match_operand:SI 2 "aarch64_sve_ptrue_flag")
+ (const_int SVE_KNOWN_PTRUE)
  (eq:
-   (match_operand:SVE_FULL_I 3 "register_operand")
-   (match_dup 4))]
+   (match_operand:SVE_FULL_I 2 "register_operand")
+   (match_dup 3))]
 UNSPEC_PRED_Z)
-  (match_dup 5)
-  (match_dup 4)]
+  (match_dup 4)
+  (match_dup 3)]
  UNSPEC_SEL))]
   "TARGET_SVE"
   {
-operands[4] = CONST0_RTX (mode);
-operands[5] = CONST1_RTX (mode);
+operands[3] = CONST0_RTX (mode);
+operands[4] = CONST1_RTX (mode);
   }
 )
 
@@ -3389,7 +3389,7 @@ (define_insn "*cnot"
(unspec:SVE_I
  [(unspec:
 [(match_operand: 1 "register_operand")
- (match_operand:SI 5 "aarch64_sve_ptrue_flag")
+ (const_int SVE_KNOWN_PTRUE)
  (eq:
(match_operand:SVE_I 2 "register_operand")
(match_operand:SVE_I 3 "aarch64_simd_imm_zero"))]
@@ -11001,4 +11001,4 @@ (define_insn "@aarch64_sve_set_neonq_"
   GET_MODE (operands[2]));
 return 

[gcc r14-9811] aarch64: Fix bogus cnot optimisation [PR114603]

2024-04-05 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:67cbb1c638d6ab3a9cb77e674541e2b291fb67df

commit r14-9811-g67cbb1c638d6ab3a9cb77e674541e2b291fb67df
Author: Richard Sandiford 
Date:   Fri Apr 5 14:47:15 2024 +0100

aarch64: Fix bogus cnot optimisation [PR114603]

aarch64-sve.md had a pattern that combined:

cmpeq   pb.T, pa/z, zc.T, #0
mov zd.T, pb/z, #1

into:

cnotzd.T, pa/m, zc.T

But this is only valid if pa.T is a ptrue.  In other cases, the
original would set inactive elements of zd.T to 0, whereas the
combined form would copy elements from zc.T.

gcc/
PR target/114603
* config/aarch64/aarch64-sve.md (@aarch64_pred_cnot): Replace
with...
(@aarch64_ptrue_cnot): ...this, requiring operand 1 to be
a ptrue.
(*cnot): Require operand 1 to be a ptrue.
* config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand):
Use aarch64_ptrue_cnot for _x operations that are predicated
with a ptrue.  Represent other _x operations as fully-defined _m
operations.

gcc/testsuite/
PR target/114603
* gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc| 25 ++
 gcc/config/aarch64/aarch64-sve.md  | 22 +--
 .../gcc.target/aarch64/sve/acle/general/cnot_1.c   | 23 
 3 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 257ca5bf6ad..5be2315a3c6 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -517,15 +517,22 @@ public:
   expand (function_expander ) const override
   {
 machine_mode mode = e.vector_mode (0);
-if (e.pred == PRED_x)
-  {
-   /* The pattern for CNOT includes an UNSPEC_PRED_Z, so needs
-  a ptrue hint.  */
-   e.add_ptrue_hint (0, e.gp_mode (0));
-   return e.use_pred_x_insn (code_for_aarch64_pred_cnot (mode));
-  }
-
-return e.use_cond_insn (code_for_cond_cnot (mode), 0);
+machine_mode pred_mode = e.gp_mode (0);
+/* The underlying _x pattern is effectively:
+
+dst = src == 0 ? 1 : 0
+
+   rather than an UNSPEC_PRED_X.  Using this form allows autovec
+   constructs to be matched by combine, but it means that the
+   predicate on the src == 0 comparison must be all-true.
+
+   For simplicity, represent other _x operations as fully-defined _m
+   operations rather than using a separate bespoke pattern.  */
+if (e.pred == PRED_x
+   && gen_lowpart (pred_mode, e.args[0]) == CONSTM1_RTX (pred_mode))
+  return e.use_pred_x_insn (code_for_aarch64_ptrue_cnot (mode));
+return e.use_cond_insn (code_for_cond_cnot (mode),
+   e.pred == PRED_x ? 1 : 0);
   }
 };
 
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index eca8623e587..0434358122d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -3363,24 +3363,24 @@
 ;; - CNOT
 ;; -
 
-;; Predicated logical inverse.
-(define_expand "@aarch64_pred_cnot"
+;; Logical inverse, predicated with a ptrue.
+(define_expand "@aarch64_ptrue_cnot"
   [(set (match_operand:SVE_FULL_I 0 "register_operand")
(unspec:SVE_FULL_I
  [(unspec:
 [(match_operand: 1 "register_operand")
- (match_operand:SI 2 "aarch64_sve_ptrue_flag")
+ (const_int SVE_KNOWN_PTRUE)
  (eq:
-   (match_operand:SVE_FULL_I 3 "register_operand")
-   (match_dup 4))]
+   (match_operand:SVE_FULL_I 2 "register_operand")
+   (match_dup 3))]
 UNSPEC_PRED_Z)
-  (match_dup 5)
-  (match_dup 4)]
+  (match_dup 4)
+  (match_dup 3)]
  UNSPEC_SEL))]
   "TARGET_SVE"
   {
-operands[4] = CONST0_RTX (mode);
-operands[5] = CONST1_RTX (mode);
+operands[3] = CONST0_RTX (mode);
+operands[4] = CONST1_RTX (mode);
   }
 )
 
@@ -3389,7 +3389,7 @@
(unspec:SVE_I
  [(unspec:
 [(match_operand: 1 "register_operand")
- (match_operand:SI 5 "aarch64_sve_ptrue_flag")
+ (const_int SVE_KNOWN_PTRUE)
  (eq:
(match_operand:SVE_I 2 "register_operand")
(match_operand:SVE_I 3 "aarch64_simd_imm_zero"))]
@@ -11001,4 +11001,4 @@
   GET_MODE (operands[2]));
 return "sel\t%0., %3, %2., %1.";
   }
-)
\ No newline at end of file
+)
diff --gi

[pushed] aarch64: Recognise svundef idiom [PR114577]

2024-04-04 Thread Richard Sandiford
GCC 14 adds the header file arm_neon_sve_bridge.h to help interface
SVE and Advanced SIMD code.  One of the defined idioms is:

  svset_neonq (svundef_TYPE (), advsimd_vector)

which simply reinterprets advsimd_vector as an SVE vector without
regard for what's in the upper bits.
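
For a concrete (purely illustrative) instance of the idiom:

  #include <arm_neon.h>
  #include <arm_sve.h>
  #include <arm_neon_sve_bridge.h>

  /* View a 128-bit Advanced SIMD vector as an SVE vector; if the SVE
     vector length is wider than 128 bits, the upper bits are undefined.  */
  svfloat32_t
  to_sve (float32x4_t v)
  {
    return svset_neonq_f32 (svundef_f32 (), v);
  }

(The extra includes may be redundant if the bridge header pulls them in
itself.)  With the patch, little-endian expansions of calls like this
become a plain reinterpretation (a subreg) of v.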

GCC was failing to recognise this idiom, which was likely to
significantly hamper adoption.

There is (AFAIK) no good way of representing an extension with
undefined bits in gimple.  We could add an internal-only builtin
to represent it, but the current framework makes that somewhat
awkward.  It also doesn't seem very forward-looking.

This patch instead goes for the simpler approach of recognising
undefined arguments at expansion time.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
PR target/114577
* config/aarch64/aarch64-sve-builtins.h (aarch64_sve::lookup_fndecl):
Declare.
* config/aarch64/aarch64-sve-builtins.cc (aarch64_sve::lookup_fndecl):
New function.
* config/aarch64/aarch64-sve-builtins-base.cc (is_undef): Likewise.
(svset_neonq_impl::expand): Optimise expansions whose first argument
is undefined.

gcc/testsuite/
PR target/114577
* gcc.target/aarch64/sve/acle/general/pr114577_1.c: New test.
* gcc.target/aarch64/sve/acle/general/pr114577_2.c: Likewise.
---
 .../aarch64/aarch64-sve-builtins-base.cc  | 27 ++
 gcc/config/aarch64/aarch64-sve-builtins.cc| 16 
 gcc/config/aarch64/aarch64-sve-builtins.h |  1 +
 .../aarch64/sve/acle/general/pr114577_1.c | 94 +++
 .../aarch64/sve/acle/general/pr114577_2.c | 46 +
 5 files changed, 184 insertions(+)
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr114577_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr114577_2.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index a8c3f84a70b..257ca5bf6ad 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -47,11 +47,31 @@
 #include "aarch64-builtins.h"
 #include "ssa.h"
 #include "gimple-fold.h"
+#include "tree-ssa.h"
 
 using namespace aarch64_sve;
 
 namespace {
 
+/* Return true if VAL is an undefined value.  */
+static bool
+is_undef (tree val)
+{
+  if (TREE_CODE (val) == SSA_NAME)
+{
+  if (ssa_undefined_value_p (val, false))
+   return true;
+
+  gimple *def = SSA_NAME_DEF_STMT (val);
+  if (gcall *call = dyn_cast (def))
+   if (tree fndecl = gimple_call_fndecl (call))
+ if (const function_instance *instance = lookup_fndecl (fndecl))
+   if (instance->base == functions::svundef)
+ return true;
+}
+  return false;
+}
+
 /* Return the UNSPEC_CMLA* unspec for rotation amount ROT.  */
 static int
 unspec_cmla (int rot)
@@ -1142,6 +1162,13 @@ public:
   expand (function_expander ) const override
   {
 machine_mode mode = e.vector_mode (0);
+
+/* If the SVE argument is undefined, we just need to reinterpret the
+   Advanced SIMD argument as an SVE vector.  */
+if (!BYTES_BIG_ENDIAN
+   && is_undef (CALL_EXPR_ARG (e.call_expr, 0)))
+  return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0);
+
 rtx_vector_builder builder (VNx16BImode, 16, 2);
 for (unsigned int i = 0; i < 16; i++)
   builder.quick_push (CONST1_RTX (BImode));
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 11f5c5c500c..e124d1f90a5 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -1055,6 +1055,22 @@ get_vector_type (sve_type type)
   return acle_vector_types[type.num_vectors - 1][vector_type];
 }
 
+/* If FNDECL is an SVE builtin, return its function instance, otherwise
+   return null.  */
+const function_instance *
+lookup_fndecl (tree fndecl)
+{
+  if (!fndecl_built_in_p (fndecl, BUILT_IN_MD))
+return nullptr;
+
+  unsigned int code = DECL_MD_FUNCTION_CODE (fndecl);
+  if ((code & AARCH64_BUILTIN_CLASS) != AARCH64_BUILTIN_SVE)
+return nullptr;
+
+  unsigned int subcode = code >> AARCH64_BUILTIN_SHIFT;
+  return &(*registered_functions)[subcode]->instance;
+}
+
 /* Report an error against LOCATION that the user has tried to use
function FNDECL when extension EXTENSION is disabled.  */
 static void
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h 
b/gcc/config/aarch64/aarch64-sve-builtins.h
index e66729ed635..053006776a9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins.h
@@ -810,6 +810,7 @@ extern tree acle_svprfop;
 
 bool vector_cst_all_same (tree, unsigned int);
 bool is_ptrue (tree, unsigned int);
+const function_instance *lookup_fndecl (tree);
 
 /* Try to find a mode with the given mode_suffix_info fields.  Return the
mode on success or MODE_none on 

[gcc r14-9787] aarch64: Recognise svundef idiom [PR114577]

2024-04-04 Thread Richard Sandiford via Gcc-cvs
https://gcc.gnu.org/g:86dce005a1d440154dbf585dde5a2dd4cfac7a05

commit r14-9787-g86dce005a1d440154dbf585dde5a2dd4cfac7a05
Author: Richard Sandiford 
Date:   Thu Apr 4 14:15:49 2024 +0100

aarch64: Recognise svundef idiom [PR114577]

GCC 14 adds the header file arm_neon_sve_bridge.h to help interface
SVE and Advanced SIMD code.  One of the defined idioms is:

  svset_neonq (svundef_TYPE (), advsimd_vector)

which simply reinterprets advsimd_vector as an SVE vector without
regard for what's in the upper bits.

GCC was failing to recognise this idiom, which was likely to
significantly hamper adoption.

There is (AFAIK) no good way of representing an extension with
undefined bits in gimple.  We could add an internal-only builtin
to represent it, but the current framework makes that somewhat
awkward.  It also doesn't seem very forward-looking.

This patch instead goes for the simpler approach of recognising
undefined arguments at expansion time.

gcc/
PR target/114577
* config/aarch64/aarch64-sve-builtins.h 
(aarch64_sve::lookup_fndecl):
Declare.
* config/aarch64/aarch64-sve-builtins.cc 
(aarch64_sve::lookup_fndecl):
New function.
* config/aarch64/aarch64-sve-builtins-base.cc (is_undef): Likewise.
(svset_neonq_impl::expand): Optimise expansions whose first argument
is undefined.

gcc/testsuite/
PR target/114577
* gcc.target/aarch64/sve/acle/general/pr114577_1.c: New test.
* gcc.target/aarch64/sve/acle/general/pr114577_2.c: Likewise.

Diff:
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc| 27 +++
 gcc/config/aarch64/aarch64-sve-builtins.cc | 16 
 gcc/config/aarch64/aarch64-sve-builtins.h  |  1 +
 .../aarch64/sve/acle/general/pr114577_1.c  | 94 ++
 .../aarch64/sve/acle/general/pr114577_2.c  | 46 +++
 5 files changed, 184 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index a8c3f84a70b..257ca5bf6ad 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -47,11 +47,31 @@
 #include "aarch64-builtins.h"
 #include "ssa.h"
 #include "gimple-fold.h"
+#include "tree-ssa.h"
 
 using namespace aarch64_sve;
 
 namespace {
 
+/* Return true if VAL is an undefined value.  */
+static bool
+is_undef (tree val)
+{
+  if (TREE_CODE (val) == SSA_NAME)
+{
+  if (ssa_undefined_value_p (val, false))
+   return true;
+
+  gimple *def = SSA_NAME_DEF_STMT (val);
+  if (gcall *call = dyn_cast (def))
+   if (tree fndecl = gimple_call_fndecl (call))
+ if (const function_instance *instance = lookup_fndecl (fndecl))
+   if (instance->base == functions::svundef)
+ return true;
+}
+  return false;
+}
+
 /* Return the UNSPEC_CMLA* unspec for rotation amount ROT.  */
 static int
 unspec_cmla (int rot)
@@ -1142,6 +1162,13 @@ public:
   expand (function_expander ) const override
   {
 machine_mode mode = e.vector_mode (0);
+
+/* If the SVE argument is undefined, we just need to reinterpret the
+   Advanced SIMD argument as an SVE vector.  */
+if (!BYTES_BIG_ENDIAN
+   && is_undef (CALL_EXPR_ARG (e.call_expr, 0)))
+  return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0);
+
 rtx_vector_builder builder (VNx16BImode, 16, 2);
 for (unsigned int i = 0; i < 16; i++)
   builder.quick_push (CONST1_RTX (BImode));
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 11f5c5c500c..e124d1f90a5 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -1055,6 +1055,22 @@ get_vector_type (sve_type type)
   return acle_vector_types[type.num_vectors - 1][vector_type];
 }
 
+/* If FNDECL is an SVE builtin, return its function instance, otherwise
+   return null.  */
+const function_instance *
+lookup_fndecl (tree fndecl)
+{
+  if (!fndecl_built_in_p (fndecl, BUILT_IN_MD))
+return nullptr;
+
+  unsigned int code = DECL_MD_FUNCTION_CODE (fndecl);
+  if ((code & AARCH64_BUILTIN_CLASS) != AARCH64_BUILTIN_SVE)
+return nullptr;
+
+  unsigned int subcode = code >> AARCH64_BUILTIN_SHIFT;
+  return &(*registered_functions)[subcode]->instance;
+}
+
 /* Report an error against LOCATION that the user has tried to use
function FNDECL when extension EXTENSION is disabled.  */
 static void
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h 
b/gcc/config/aarch64/aarch64-sve-builtins.h
index e66729ed635..053006776a9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins.h
@@ -810,6 +810,7 @@ extern tre

Re: [PATCH] libatomic: Cleanup macros in atomic_16.S

2024-04-04 Thread Richard Sandiford
Wilco Dijkstra  writes:
> As mentioned in 
> https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648397.html ,
> do some additional cleanup of the macros and aliases:
>
> Cleanup the macros to add the libat_ prefixes in atomic_16.S.  Emit the
> alias to __atomic_ when ifuncs are not enabled in the ENTRY macro.
>
> Passes regress and bootstrap, OK for commit?
>
> libatomic:
> * config/linux/aarch64/atomic_16.S: Add __libat_ prefix in the
> LSE2/LSE128/CORE macros, remove elsewhere.  Add ATOMIC macro.

Thanks for doing this.  LGTM, but one minor suggestion below:

> diff --git a/libatomic/config/linux/aarch64/atomic_16.S 
> b/libatomic/config/linux/aarch64/atomic_16.S
> index 
> 4e3fa870b0338da4cfcdb0879ab8bed8d041a0a3..d0343507120c06a483ffdae1a793b6b5263cfe98
>  100644
> --- a/libatomic/config/linux/aarch64/atomic_16.S
> +++ b/libatomic/config/linux/aarch64/atomic_16.S
> @@ -45,7 +45,7 @@
>  # define HAVE_FEAT_LSE128 0
>  #endif
>
> -#define HAVE_FEAT_LSE2  HAVE_IFUNC
> +#define HAVE_FEAT_LSE2 HAVE_IFUNC
>
>  #if HAVE_FEAT_LSE128
> .arch   armv9-a+lse128
> @@ -53,31 +53,37 @@
> .arch   armv8-a+lse
>  #endif
>
> -#define LSE128(NAME)   NAME##_i1
> -#define LSE2(NAME) NAME##_i2
> -#define CORE(NAME) NAME
> +#define LSE128(NAME)   libat_##NAME##_i1
> +#define LSE2(NAME) libat_##NAME##_i2
> +#define CORE(NAME) libat_##NAME
> +#define ATOMIC(NAME)   __atomic_##NAME
>
> -#define ENTRY_FEAT(NAME, FEAT)  \
> -   ENTRY (FEAT (NAME))
> +#if HAVE_IFUNC
> +# define ENTRY(NAME)   ENTRY2 (CORE (NAME), )
> +#else
> +/* Emit __atomic_* entrypoints if no ifuncs.  */
> +# define ENTRY(NAME)   ENTRY2 (CORE (NAME), ALIAS (NAME, ATOMIC, CORE))
> +#endif
> +#define ENTRY_FEAT(NAME, FEAT) ENTRY2 (FEAT (NAME), )

Perhaps we should define this only in the HAVE_IFUNC arm, so that it's
a noisy failure if we forget to protect an ENTRY_FEAT with an appropriate
#ifdef.
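
Concretely, something like this (illustrative only, based on the macros
in the patch):

  #if HAVE_IFUNC
  # define ENTRY(NAME)            ENTRY2 (CORE (NAME), )
  # define ENTRY_FEAT(NAME, FEAT) ENTRY2 (FEAT (NAME), )
  #else
  /* Emit __atomic_* entrypoints if no ifuncs.  ENTRY_FEAT is deliberately
     left undefined, so an unguarded use no longer expands silently.  */
  # define ENTRY(NAME)            ENTRY2 (CORE (NAME), ALIAS (NAME, ATOMIC, CORE))
  #endif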

OK for GCC 15 with or without that change.

Richard

> +
> +#define END(NAME)  END2 (CORE (NAME))
> +#define END_FEAT(NAME, FEAT)   END2 (FEAT (NAME))
>
> -#define ENTRY(NAME)\
> +#define ENTRY2(NAME, ALIASES)  \
> .global NAME;   \
> .hidden NAME;   \
> .type NAME,%function;   \
> .p2align 4; \
> +   ALIASES;\
>  NAME:  \
> -   .cfi_startproc; \
> -   hint34  // bti c
> -
> -#define END_FEAT(NAME, FEAT)   \
> -   END (FEAT (NAME))
> +   .cfi_startproc; \
> +   hint34; // bti c
>
> -#define END(NAME)  \
> +#define END2(NAME) \
> .cfi_endproc;   \
> .size NAME, .-NAME;
>
> -#define ALIAS(NAME, FROM, TO)  ALIAS1 (FROM (NAME),TO (NAME))
> -#define ALIAS2(NAME)   ALIAS1 (__atomic_##NAME, libat_##NAME)
> +#define ALIAS(NAME, FROM, TO)  ALIAS1 (FROM (NAME), TO (NAME))
>
>  #define ALIAS1(ALIAS, NAME)\
> .global ALIAS;  \
> @@ -116,7 +122,7 @@ NAME:   \
>  #define SEQ_CST 5
>
>
> -ENTRY (libat_load_16)
> +ENTRY (load_16)
> mov x5, x0
> cbnzw1, 2f
>
> @@ -131,11 +137,11 @@ ENTRY (libat_load_16)
> stxpw4, res0, res1, [x5]
> cbnzw4, 2b
> ret
> -END (libat_load_16)
> +END (load_16)
>
>
>  #if HAVE_FEAT_LSE2
> -ENTRY_FEAT (libat_load_16, LSE2)
> +ENTRY_FEAT (load_16, LSE2)
> cbnzw1, 1f
>
> /* RELAXED.  */
> @@ -155,11 +161,11 @@ ENTRY_FEAT (libat_load_16, LSE2)
> ldp res0, res1, [x0]
> dmb ishld
> ret
> -END_FEAT (libat_load_16, LSE2)
> +END_FEAT (load_16, LSE2)
>  #endif
>
>
> -ENTRY (libat_store_16)
> +ENTRY (store_16)
> cbnzw4, 2f
>
> /* RELAXED.  */
> @@ -173,11 +179,11 @@ ENTRY (libat_store_16)
> stlxp   w4, in0, in1, [x0]
> cbnzw4, 2b
> ret
> -END (libat_store_16)
> +END (store_16)
>
>
>  #if HAVE_FEAT_LSE2
> -ENTRY_FEAT (libat_store_16, LSE2)
> +ENTRY_FEAT (store_16, LSE2)
> cbnzw4, 1f
>
> /* RELAXED.  */
> @@ -189,11 +195,11 @@ ENTRY_FEAT (libat_store_16, LSE2)
> stlxp   w4, in0, in1, [x0]
> cbnzw4, 1b
> ret
> -END_FEAT (libat_store_16, LSE2)
> +END_FEAT (store_16, LSE2)
>  #endif
>
>
> -ENTRY (libat_exchange_16)
> +ENTRY (exchange_16)
> mov x5, x0
> cbnzw4, 2f
>
> @@ -217,11 +223,11 @@ ENTRY (libat_exchange_16)
> stlxp   w4, in0, in1, [x5]
> cbnzw4, 4b
> ret
> -END (libat_exchange_16)
> +END (exchange_16)
>
>
>  #if HAVE_FEAT_LSE128
> -ENTRY_FEAT (libat_exchange_16, LSE128)
> +ENTRY_FEAT (exchange_16, LSE128)
> mov tmp0, x0
> mov res0, in0
> mov res1, in1
> @@ -241,11 +247,11 @@ ENTRY_FEAT (libat_exchange_16, LSE128)
> /* RELEASE/ACQ_REL/SEQ_CST.  */
>  2: swppal  res0, res1, [tmp0]
> ret
> -END_FEAT (libat_exchange_16, LSE128)
> +END_FEAT 

Re: [PATCH] libatomic: Fix build for --disable-gnu-indirect-function [PR113986]

2024-04-04 Thread Richard Sandiford
Wilco Dijkstra  writes:
> v2: 
>
> Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
> Always build atomic_16.S, add aliases to the __atomic_ functions if 
> !HAVE_IFUNC. 
> Include auto-config.h in atomic_16.S to avoid having to pass defines via 
> makefiles.
> Fix build if HWCAP_ATOMICS/CPUID are not defined.
>
> Passes regress and bootstrap, OK for commit?
>
> libatomic:
> PR target/113986
> * Makefile.in: Regenerated.
> * Makefile.am: Make atomic_16.S not depend on HAVE_IFUNC.
> Remove predefine of HAVE_FEAT_LSE128.
> * acinclude.m4: Remove ARCH_AARCH64_HAVE_LSE128.
> * configure: Regenerated.
> * config/linux/aarch64/atomic_16.S: Add __atomic_ aliases if 
> !HAVE_IFUNC. 
> * config/linux/aarch64/host-config.h: Correctly handle !HAVE_IFUNC.  
> Add
> defines for HWCAP_ATOMICS and HWCAP_CPUID.

OK, thanks, but...

> @@ -35,7 +48,6 @@ typedef struct __ifunc_arg_t {
>  # define _IFUNC_ARG_HWCAP (1ULL << 62)
>  #endif
>  
> -#ifdef HWCAP_USCAT
>  # if N == 16
>  #  define IFUNC_COND_1   (has_lse128 (hwcap, features))
>  #  define IFUNC_COND_2   (has_lse2 (hwcap, features))
> @@ -44,19 +56,6 @@ typedef struct __ifunc_arg_t {
>  #  define IFUNC_COND_1   (hwcap & HWCAP_ATOMICS)
>  #  define IFUNC_NCOND(N) 1
>  # endif

...please reindent this block so that there is no space after the
"#" in the outermost directives.

Richard


Re: [PATCH V3 0/2] aarch64: Place target independent and dependent changed code in one file.

2024-04-03 Thread Richard Sandiford
Alex Coplan  writes:
> On 23/02/2024 16:41, Ajit Agarwal wrote:
>> Hello Richard/Alex/Segher:
>
> Hi Ajit,
>
> Sorry for the delay and thanks for working on this.
>
> Generally this looks like the right sort of approach (IMO) but I've left
> some comments below.
>
> I'll start with a meta comment: in the subject line you have marked this
> as 0/2, but usually 0/n is reserved for the cover letter of a patch
> series and wouldn't contain an actual patch.  I think this might have
> confused the Linaro CI suitably such that it didn't run regression tests
> on the patch.

Alex, thanks for the thorough and in-depth review.  I agree with all the
comments FWIW.  Just to add a couple of things:

> > @@ -138,8 +138,18 @@ struct alt_base
> >poly_int64 offset;
> >  };
> >  
> > +// Virtual base class for load/store walkers used in alias analysis.
> > +struct alias_walker
> > +{
> > +  virtual bool conflict_p (int ) const = 0;
> > +  virtual insn_info *insn () const = 0;
> > +  virtual bool valid () const  = 0;
> > +  virtual void advance () = 0;
> > +};
> > +
> > +
> >  // State used by the pass for a given basic block.
> > -struct ldp_bb_info
> > +struct pair_fusion
>
> As a comment on the high-level design, I think we want a generic class
> for the overall pass, not just for the BB-specific structure.
>
> That is because naturally we want the ldp_fusion_bb function itself to
> be a member of such a class, so that it can access virtual functions to
> query the target e.g. about the load/store pair policy, and whether to
> try and promote writeback pairs.
>
> If we keep all of the virtual functions in such an outer class, then we
> can keep the ldp_fusion_bb class generic (not needing an override for
> each target) and that inner class can perhaps be given a pointer or
> reference to the outer class when it is instantiated in ldp_fusion_bb.

I agree that in general, the new virtual methods should belong to a pass
class rather than the per-bb class.

In principle, if we need to virtualise existing members of ldp_bb_info
(or code contained within existing members of ldp_bb_info), and if that
code accesses members of the bb info, then it might make sense to have
target-specific derivatives of the bb info structure too, with a virtual
function to create the bb info structure for a given bb.

However, it looks like all but one of the virtual functions in the patch
are self-contained (in the sense of depending only on their arguments
and on globals).  The one exception is transform_for_base, but Alex
asked whether that needs to be virtualised.  If it doesn't, then like
Alex says, it seems that all virtuals could belong to the pass class
rather than to the bb info.
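
Very roughly, and with made-up hook names purely for illustration
(assuming the usual GCC and rtl-ssa headers), the shape being suggested
is:

  // Generic pass object: owns the target-specific hooks and the
  // target-independent driver code.
  struct pair_fusion
  {
    virtual ~pair_fusion () = default;
    virtual bool pair_mem_ok_p (rtx mem, bool load_p) const = 0;
    virtual bool should_handle_writeback () const = 0;
    // ...other target queries...
    void process_bb (rtl_ssa::bb_info *bb);  // generic driver
  };

  // The per-bb state stays generic and reaches the target hooks
  // through a reference to the pass object.
  struct pair_fusion_bb_info
  {
    pair_fusion &m_pass;
    rtl_ssa::bb_info *m_bb;
    pair_fusion_bb_info (pair_fusion &pass, rtl_ssa::bb_info *bb)
      : m_pass (pass), m_bb (bb) {}
  };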

>> [...]
>> +  }
>>  };
>>  
>> +bool
>> +store_modifies_mem_p (rtx mem, insn_info *store_insn, int );
>> +bool load_modified_by_store_p (insn_info *load,
>> +  insn_info *store,
>> +  int );
>> +extern insn_info *
>> +try_repurpose_store (insn_info *first,
>> + insn_info *second,
>> + const insn_range_info _range);
>> +
>> +void reset_debug_use (use_info *use);
>> +
>> +extern void
>> +fixup_debug_uses (obstack_watermark ,
>> +  insn_info *insns[2],
>> +  rtx orig_rtl[2],
>> +  insn_info *pair_dst,
>> +  insn_info *trailing_add,
>> +  bool load_p,
>> +  int writeback,
>> +  rtx writeback_effect,
>> +  unsigned base_regno);
>> +
>> +void
>> +fixup_debug_uses_trailing_add (obstack_watermark ,
>> +   insn_info *pair_dst,
>> +   insn_info *trailing_add,
>> +   rtx writeback_effect);
>> +
>> +
>> +extern void
>> +fixup_debug_use (obstack_watermark ,
>> + use_info *use,
>> + def_info *def,
>> + rtx base,
>> + poly_int64 wb_offset);
>> +
>> +extern insn_info *
>> +find_trailing_add (insn_info *insns[2],
>> +   const insn_range_info _range,
>> +   int initial_writeback,
>> +   rtx *writeback_effect,
>> +   def_info **add_def,
>> +   def_info *base_def,
>> +   poly_int64 initial_offset,
>> +   unsigned access_size);
>> +
>> +rtx drop_writeback (rtx mem);
>> +rtx pair_mem_strip_offset (rtx mem, poly_int64 *offset);
>> +bool any_pre_modify_p (rtx x);
>> +bool any_post_modify_p (rtx x);
>> +int encode_lfs (lfs_fields fields);
>> +extern insn_info * latest_hazard_before (insn_info *insn, rtx *ignore,
>> +  insn_info *ignore_insn = nullptr);
>> +insn_info * first_hazard_after (insn_info *insn, rtx *ignore);
>> +bool ranges_overlap_p (const insn_range_info , const insn_range_info 
>> );
>> +insn_range_info get_def_range (def_info *def);
>> +insn_range_info def_downwards_move_range (def_info *def);
>> +insn_range_info def_upwards_move_range (def_info *def);
>> +rtx gen_tombstone (void);
>> 

Re: [PATCH v2 2/3] aarch64: Add support for aarch64-gnu (GNU/Hurd on AArch64)

2024-04-02 Thread Richard Sandiford
Sergey Bugaev  writes:
> Coupled with a corresponding binutils patch, this produces a toolchain that 
> can
> successfully build working binaries targeting aarch64-gnu.
>
> gcc/Changelog:
>
>   * config.gcc: Recognize aarch64*-*-gnu* targets.
>   * config/aarch64/aarch64-gnu.h: New file.
>
> Signed-off-by: Sergey Bugaev 
> ---
>  gcc/config.gcc   |  6 +++
>  gcc/config/aarch64/aarch64-gnu.h | 68 
>  2 files changed, 74 insertions(+)
>  create mode 100644 gcc/config/aarch64/aarch64-gnu.h

I don't know if you're waiting on me, but just in case: this and patch 3
still LGTM if Thomas is OK with them.

Thanks,
Richard

> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 87a5c92b6..9d935164c 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1264,6 +1264,12 @@ aarch64*-*-linux*)
>   done
>   TM_MULTILIB_CONFIG=`echo $TM_MULTILIB_CONFIG | sed 's/^,//'`
>   ;;
> +aarch64*-*-gnu*)
> +tm_file="${tm_file} elfos.h gnu-user.h gnu.h glibc-stdint.h"
> +tm_file="${tm_file} aarch64/aarch64-elf.h aarch64/aarch64-errata.h 
> aarch64/aarch64-gnu.h"
> +tmake_file="${tmake_file} aarch64/t-aarch64"
> +tm_defines="${tm_defines}  TARGET_DEFAULT_ASYNC_UNWIND_TABLES=1"
> + ;;
>  aarch64*-wrs-vxworks*)
>  tm_file="${tm_file} elfos.h aarch64/aarch64-elf.h"
>  tm_file="${tm_file} vx-common.h vxworks.h aarch64/aarch64-vxworks.h"
> diff --git a/gcc/config/aarch64/aarch64-gnu.h 
> b/gcc/config/aarch64/aarch64-gnu.h
> new file mode 100644
> index 0..ee5494034
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-gnu.h
> @@ -0,0 +1,68 @@
> +/* Definitions for AArch64 running GNU/Hurd.
> +   Copyright (C) 2009-2024 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   .  */
> +
> +#ifndef GCC_AARCH64_GNU_H
> +#define GCC_AARCH64_GNU_H
> +
> +#define GNU_USER_DYNAMIC_LINKER 
> "/lib/ld-aarch64%{mbig-endian:_be}%{mabi=ilp32:_ilp32}.so.1"
> +
> +#define CPP_SPEC "%{pthread:-D_REENTRANT}"
> +
> +#define GNU_TARGET_LINK_SPEC  "%{h*} \
> +   %{static:-Bstatic}\
> +   %{shared:-shared} \
> +   %{symbolic:-Bsymbolic}\
> +   %{!static:%{!static-pie:  \
> + %{rdynamic:-export-dynamic} \
> + %{!shared:-dynamic-linker " GNU_USER_DYNAMIC_LINKER "}}} \
> +   %{static-pie:-Bstatic -pie --no-dynamic-linker -z text} \
> +   -X\
> +   %{mbig-endian:-EB} %{mlittle-endian:-EL} \
> +   -maarch64gnu%{mabi=ilp32:32}%{mbig-endian:b}"
> +
> +
> +#define LINK_SPEC GNU_TARGET_LINK_SPEC AARCH64_ERRATA_LINK_SPEC
> +
> +#define GNU_USER_TARGET_MATHFILE_SPEC \
> +  "%{Ofast|ffast-math|funsafe-math-optimizations:%{!shared:crtfastmath.o%s}}"
> +
> +#undef ENDFILE_SPEC
> +#define ENDFILE_SPEC   \
> +  GNU_USER_TARGET_MATHFILE_SPEC " " \
> +  GNU_USER_TARGET_ENDFILE_SPEC
> +
> +#define TARGET_OS_CPP_BUILTINS() \
> +  do \
> +{\
> + GNU_USER_TARGET_OS_CPP_BUILTINS();  \
> +}\
> +  while (0)
> +
> +#define TARGET_ASM_FILE_END aarch64_file_end_indicate_exec_stack
> +
> +/* Uninitialized common symbols in non-PIE executables, even with
> +   strong definitions in dependent shared libraries, will resolve
> +   to COPY relocated symbol in the executable.  See PR65780.  */
> +#undef TARGET_BINDS_LOCAL_P
> +#define TARGET_BINDS_LOCAL_P default_binds_local_p_2
> +
> +/* Define this to be nonzero if static stack checking is supported.  */
> +#define STACK_CHECK_STATIC_BUILTIN 1
> +
> +#endif  /* GCC_AARCH64_GNU_H */


Re: [PATCH] aarch64: Fix typo in comment about FEATURE_STRING

2024-04-02 Thread Richard Sandiford
Christophe Lyon  writes:
> Fix the comment to document FEATURE_STRING instead of FEAT_STRING.
>
> 2024-03-29  Christophe Lyon  
>
>   gcc/
>   * config/aarch64/aarch64-option-extensions.def: Fix comment.

OK, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64-option-extensions.def | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-option-extensions.def 
> b/gcc/config/aarch64/aarch64-option-extensions.def
> index 061a145e9e7..aa3cd99f791 100644
> --- a/gcc/config/aarch64/aarch64-option-extensions.def
> +++ b/gcc/config/aarch64/aarch64-option-extensions.def
> @@ -54,14 +54,14 @@
>   If a feature A appears in this list then the list implicitly includes
>   any features that are transitively dependent on A (according to 
> REQUIRES).
>  
> -   - FEAT_STRING is a string containing the entries in the 'Features' field 
> of
> - /proc/cpuinfo on a GNU/Linux system that correspond to this architecture
> - extension being available.  Sometimes multiple entries are needed to 
> enable
> - the extension (for example, the 'crypto' extension depends on four
> - entries: aes, pmull, sha1, sha2 being present).  In that case this field
> - should contain a space (" ") separated list of the strings in 'Features'
> - that are required.  Their order is not important.  An empty string means
> - do not detect this feature during auto detection.
> +   - FEATURE_STRING is a string containing the entries in the 'Features' 
> field
> + of /proc/cpuinfo on a GNU/Linux system that correspond to this
> + architecture extension being available.  Sometimes multiple entries are
> + needed to enable the extension (for example, the 'crypto' extension
> + depends on four entries: aes, pmull, sha1, sha2 being present).  In that
> + case this field should contain a space (" ") separated list of the 
> strings
> + in 'Features' that are required.  Their order is not important.  An 
> empty
> + string means do not detect this feature during auto detection.
>  
> - OPT_FLAGS is a list of feature IDENTS that should be enabled (along with
>   their transitive dependencies) when the specified FMV feature is 
> present.

