Re: arm: Add .type and .size to __gnu_cmse_nonsecure_call [PR115360]

2024-06-12 Thread Andre Vieira (lists)




On 06/06/2024 12:53, Richard Earnshaw (lists) wrote:

On 05/06/2024 17:07, Andre Vieira (lists) wrote:

Hi,

This patch adds missing assembly directives to the CMSE library wrapper used to 
call functions with attribute cmse_nonsecure_call.  Without the .type directive 
the linker will fail to produce the correct veneer if a call to this wrapper 
function is too far from the wrapper itself.  The .size directive was added for 
completeness, though we don't necessarily have a use case for it.

I did not add a testcase as I couldn't get DejaGnu to disassemble the linked 
binary to check that we used an appropriate branch instruction.  I did however 
test it locally, and with this change the GNU linker now generates an appropriate 
veneer and a call to that veneer when __gnu_cmse_nonsecure_call is too far away.
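
For reference, here is a minimal sketch (assuming compilation with -mcmse) 
of the kind of code that reaches this wrapper:

/* Any call through a cmse_nonsecure_call function pointer is routed via
   the __gnu_cmse_nonsecure_call libgcc wrapper; that wrapper's symbol is
   what needs the .type directive so the linker can emit a veneer.  */
typedef void __attribute__((cmse_nonsecure_call)) nsfunc (void);

void call_nonsecure (nsfunc *fp)
{
  fp ();  /* with -mcmse this branches to __gnu_cmse_nonsecure_call */
}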

OK for trunk and backport to any release branches still in support (after 
waiting a week or so)?

libgcc/ChangeLog:

 PR target/115360
 * config/arm/cmse_nonsecure_call.S: Add .type and .size directives.


OK.

R.


OK to backport? I was thinking of backporting it as far back as gcc-11 (we 
haven't done an 11.5 yet).


Kind Regards,
Andre


Re: [PATCH v3 1/2] arm: Zero/Sign extends for CMSE security on Armv8-M.baseline [PR115253]

2024-06-11 Thread Andre Vieira (lists)




On 11/06/2024 14:59, Richard Earnshaw (lists) wrote:

You effectively have an 'else if' split across a comment here, and the 
indentation looks weird.  Either write 'else if' on one line (and re-indent 
accordingly) or put this entire block inside braces.


Apologies here; Torbjorn had this as an 'else if' before, but I found that 
structure slightly more difficult to read because it wasn't entirely 
clear that the 'else if' and 'else' were both part of the 'SIGNED' handling, so 
I have a small preference for the 'put this entire block inside braces' option.


Kind Regards,
Andre


Re: [PATCH v2 1/2] arm: Zero/Sign extends for CMSE security on Armv8-M.baseline [PR115253]

2024-06-10 Thread Andre Vieira (lists)

Hi,

So, you talk about gen_thumb1_extendhisi2, but there is also 
gen_thumb1_extendqisi2. Will it actually be cleaner if the block is 
indented one level?
The comment can be added in the "if (TARGET_THUMB1)" block regardless to 
indicate that gen_rtx_SIGN_EXTEND can't be used.




gen_rtx_SIGN_EXTEND (I see I used the wrong caps above, sorry!) will 
work for the QImode -> SImode case on Thumb1. The reason it doesn't 
work for HImode -> SImode on Thumb1 is that thumb1_extendhisi2 uses a 
scratch register, see:

(define_insn "thumb1_extendhisi2"
  [(set (match_operand:SI 0 "register_operand" "=l,l")
(sign_extend:SI (match_operand:HI 1 "nonimmediate_operand" "l,m")))
   (clobber (match_scratch:SI 2 "=X,l"))]

 meaning that the pattern generated with gen_rtx_SIGN_EXTEND:

  [(set (reg:SI ...) (sign_extend:SI (reg:HI ...)))]

does not match this, and there are no other Thumb1 patterns that match 
it either, so the compiler ICEs. For thumb1_extendqisi2 the pattern 
doesn't have the scratch register, so a:

  gen_rtx_SET (<dest>, gen_rtx_SIGN_EXTEND (SImode, <src>))

will generate an RTL pattern that matches thumb1_extendqisi2.

Hope that makes it clearer.
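
As a concrete sketch of the distinction (illustrative only; si_reg and 
ret_reg are assumed names for the SImode and narrow-mode views of r0):

rtx extend;
if (GET_MODE (ret_reg) == QImode)
  /* thumb1_extendqisi2 has no scratch register, so a plain SIGN_EXTEND
     set matches it directly.  */
  extend = gen_rtx_SET (si_reg, gen_rtx_SIGN_EXTEND (SImode, ret_reg));
else
  /* thumb1_extendhisi2 clobbers a match_scratch, so we have to go
     through the named generator to get the clobber into the pattern.  */
  extend = gen_thumb1_extendhisi2 (si_reg, ret_reg);
emit_insn (extend);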



With the patch you have in mind, will it be possible to call 
gen_rtx_SIGN_EXTEND for Thumb1 too? Or do we need to keep calling 
thumb1_extendhisi2 and thumb1_extendqisi2?


With the patch I have in mind, which I will post soon, you could use 
gen_rtx_SIGN_EXTEND for HImode on Thumb1; as I explained before, for 
QImode it's already possible.


That series will have another patch that also removes all calls to 
gen_thumb1_extend* and replaces them with gen_rtx_SET + 
gen_rtx_SIGN_EXTEND, and renames the "thumb1_extend{hisi2,qisi2}" patterns 
to "*thumb1_extend{hisi2,qisi2}".


However, we won't backport that patch, whereas yours we should, hence 
why it's worth having your patch with the thumb1_extendhisi2 workaround.


Thanks,

Andre


Re: [PATCH v2 1/2] arm: Zero/Sign extends for CMSE security on Armv8-M.baseline [PR115253]

2024-06-10 Thread Andre Vieira (lists)

Hi Torbjorn,

Thanks for this, I have some comments below.

On 07/06/2024 09:56, Torbjörn SVENSSON wrote:

Properly handle zero and sign extension for Armv8-M.baseline, as
Cortex-M23 can have the security extension active.
Currently, there is an internal compiler error on Cortex-M23 for the
epilogue processing of sign extension.

This patch addresses CVE-2024-0151 for Armv8-M.baseline.
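
For illustration (this example is not part of the patch), the affected 
shape is a non-secure call whose narrow signed result must be re-widened:

/* Sketch, assuming -mcmse -mcpu=cortex-m23: sign-extending the narrow
   return value of a non-secure call is the case that used to ICE.  */
typedef signed char __attribute__((cmse_nonsecure_call)) nsfunc (void);

int call (nsfunc *fp)
{
  return fp ();  /* result is sign-extended from 8 to 32 bits */
}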

gcc/ChangeLog:

PR target/115253
* config/arm/arm.cc (cmse_nonsecure_call_inline_register_clear):
Sign extend for Thumb1.
(thumb1_expand_prologue): Add zero/sign extend.

Signed-off-by: Torbjörn SVENSSON 
Co-authored-by: Yvan ROUX 
---
  gcc/config/arm/arm.cc | 68 ++-
  1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index ea0c963a4d6..d1bb173c135 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -19220,17 +19220,23 @@ cmse_nonsecure_call_inline_register_clear (void)
  || TREE_CODE (ret_type) == BOOLEAN_TYPE)
  && known_lt (GET_MODE_SIZE (TYPE_MODE (ret_type)), 4))
{
- machine_mode ret_mode = TYPE_MODE (ret_type);
+ rtx ret_mode = gen_rtx_REG (TYPE_MODE (ret_type), R0_REGNUM);
+ rtx si_mode = gen_rtx_REG (SImode, R0_REGNUM);


I'd rename ret_mode and si_mode to ret_reg and si_reg, so it's clear they 
are registers and not actually machine modes.



  rtx extend;
  if (TYPE_UNSIGNED (ret_type))
-   extend = gen_rtx_ZERO_EXTEND (SImode,
- gen_rtx_REG (ret_mode, 
R0_REGNUM));
+   extend = gen_rtx_SET (si_mode, gen_rtx_ZERO_EXTEND (SImode,
+   ret_mode));
+ else if (TARGET_THUMB1)
+   {
+ if (known_lt (GET_MODE_SIZE (TYPE_MODE (ret_type)), 2))
+   extend = gen_thumb1_extendqisi2 (si_mode, ret_mode);
+ else
+   extend = gen_thumb1_extendhisi2 (si_mode, ret_mode);
+   }
  else
-   extend = gen_rtx_SIGN_EXTEND (SImode,
- gen_rtx_REG (ret_mode, 
R0_REGNUM));
- emit_insn_after (gen_rtx_SET (gen_rtx_REG (SImode, R0_REGNUM),
-extend), insn);
-
+   extend = gen_rtx_SET (si_mode, gen_rtx_SIGN_EXTEND (SImode,
+   ret_mode));
+ emit_insn_after (extend, insn);
}


Using gen_rtx_SIGN_EXTEND should work for both; the reason it doesn't is 
some weird code in thumb1_extendhisi2, which I'm actually going to look 
at removing. But I don't think we should block this fix on that, as we'd 
want to backport it ASAP.


But for clarity we should re-order this code so it's clear we only 
need it for that specific case.

Can you maybe do:

  if (TYPE_UNSIGNED (ret_type))
    {
      ...
    }
  else
    {
      /* Sign-extension is a special case because of
	 thumb1_extendhisi2.  */
      if (TARGET_THUMB1
	  && known_ge (GET_MODE_SIZE (TYPE_MODE (ret_type)), 2))
	{
	  /* call gen_thumb1_extendhisi2 */
	}
      else
	{
	  /* use gen_rtx_SIGN_EXTEND */
	}
    }
  
  
@@ -27250,6 +27256,52 @@ thumb1_expand_prologue (void)

live_regs_mask = offsets->saved_regs_mask;
lr_needs_saving = live_regs_mask & (1 << LR_REGNUM);
  
+  /* The AAPCS requires the callee to widen integral types narrower
+ than 32 bits to the full width of the register; but when handling
+ calls to non-secure space, we cannot trust the callee to have
+ correctly done so.  So forcibly re-widen the result here.  */
+  if (IS_CMSE_ENTRY (func_type))
+{
+  function_args_iterator args_iter;
+  CUMULATIVE_ARGS args_so_far_v;
+  cumulative_args_t args_so_far;
+  bool first_param = true;
+  tree arg_type;
+  tree fndecl = current_function_decl;
+  tree fntype = TREE_TYPE (fndecl);
+  arm_init_cumulative_args (_so_far_v, fntype, NULL_RTX, fndecl);
+  args_so_far = pack_cumulative_args (_so_far_v);
+  FOREACH_FUNCTION_ARGS (fntype, arg_type, args_iter)
+   {
+ rtx arg_rtx;
+
+ if (VOID_TYPE_P (arg_type))
+   break;
+
+ function_arg_info arg (arg_type, /*named=*/true);
+ if (!first_param)
+   /* We should advance after processing the argument and pass
+  the argument we're advancing past.  */
+   arm_function_arg_advance (args_so_far, arg);
+ first_param = false;
+ arg_rtx = arm_function_arg (args_so_far, arg);
+ gcc_assert (REG_P (arg_rtx));
+ if ((TREE_CODE (arg_type) == INTEGER_TYPE
+ || TREE_CODE (arg_type) == ENUMERAL_TYPE
+ || TREE_CODE (arg_type) == BOOLEAN_TYPE)
+ && known_lt (GET_MODE_SIZE (GET_MODE 

arm: Add .type and .size to __gnu_cmse_nonsecure_call [PR115360]

2024-06-05 Thread Andre Vieira (lists)

Hi,

This patch adds missing assembly directives to the CMSE library wrapper 
used to call functions with attribute cmse_nonsecure_call.  Without the .type 
directive the linker will fail to produce the correct veneer if a call 
to this wrapper function is too far from the wrapper itself.  The .size 
directive was added for completeness, though we don't necessarily have a 
use case for it.


I did not add a testcase as I couldn't get DejaGnu to disassemble the 
linked binary to check that we used an appropriate branch instruction.  I did 
however test it locally, and with this change the GNU linker now 
generates an appropriate veneer and a call to that veneer when 
__gnu_cmse_nonsecure_call is too far away.


OK for trunk and backport to any release branches still in support 
(after waiting a week or so)?


libgcc/ChangeLog:

PR target/115360
* config/arm/cmse_nonsecure_call.S: Add .type and .size directives.

diff --git a/libgcc/config/arm/cmse_nonsecure_call.S 
b/libgcc/config/arm/cmse_nonsecure_call.S
index 
f93ce6bb4f9f679020c9684c56f5916f1c0bf73c..fef37b955af673fb975dd64229d5e0647afe0d00
 100644
--- a/libgcc/config/arm/cmse_nonsecure_call.S
+++ b/libgcc/config/arm/cmse_nonsecure_call.S
@@ -33,6 +33,7 @@
 #endif
 
 .thumb
+.type __gnu_cmse_nonsecure_call, %function
 .global __gnu_cmse_nonsecure_call
 __gnu_cmse_nonsecure_call:
 #if defined(__ARM_ARCH_8M_MAIN__)
@@ -142,3 +143,4 @@ pop {r5-r7, pc}
 #else
 #error "This should only be used for armv8-m base- and mainline."
 #endif
+.size __gnu_cmse_nonsecure_call, .-__gnu_cmse_nonsecure_call


Re: RFC: Support for pragma clang loop interleave_count(N)

2024-06-04 Thread Andre Vieira (lists)




On 04/06/2024 12:50, Richard Biener wrote:

On Tue, 4 Jun 2024, Andre Vieira (lists) wrote:


Hi,

We got a question as to whether GCC had something similar to llvm's pragma
clang loop interleave_count(N), see
https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations

I did a quick hack, using 'GCC interleaves N', just as a proof of concept, to
see whether we could connect this to the suggested_unroll_factor in the
vectorizer and to test the waters regarding having something like this
upstream.

For the real thing I'd suggest we use the same pragma syntax as clang's so it's
easier to port code.  It is my understanding that the main use for this is for
doing performance tuning of HPC kernels and performance tuning of CPU cost
models.

This seems to work (TM), though with the move to slp-only I guess this will
stop working? Though I suspect we will want to have similar capabilities in
SLP, or maybe we have already and I didn't look hard enough.


suggested-unroll-factor also works with SLP, at least I don't see a
reason why it should not.



I think I may have misread what this (see below) was trying to say and 
assumed we didn't support it.


  /* If the slp decision is false when suggested unroll factor is worked
 out, and we are applying suggested unroll factor, we can simply skip
 all slp related analyses this time.  */
  bool slp = !applying_suggested_uf || slp_done_for_suggested_uf;



RFC: Support for pragma clang loop interleave_count(N)

2024-06-04 Thread Andre Vieira (lists)

Hi,

We got a question as to whether GCC had something similar to llvm's 
pragma clang loop interleave_count(N), see

https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations

I did a quick hack, using 'GCC interleaves N', just as a proof of 
concept, to see whether we could connect this to the 
suggested_unroll_factor in the
vectorizer and to test the waters regarding having something like this 
upstream.


For the real thing I'd suggest we use the same pragma syntax as clang's 
so it's easier to port code.  It is my understanding that the main use 
for this is for doing performance tuning of HPC kernels and performance 
tuning of CPU cost models.
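
As a usage sketch (the 'GCC interleaves' spelling below is the 
proof-of-concept hack from this RFC, not committed syntax):

/* Hint the vectorizer to interleave (unroll) the vectorized loop 4x.  */
void saxpy (int n, float a, const float *x, float *y)
{
#pragma GCC interleaves 4
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}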


This seems to work (TM), though with the move to slp-only I guess this 
will stop working? Though I suspect we will want to have similar 
capabilities in SLP, or maybe we have already and I didn't look hard enough.


Also only implemented it for C and C++, have not looked at Fortran.

WDYT?

Kind regards,
Andre

diff --git a/gcc/c-family/c-pragma.h b/gcc/c-family/c-pragma.h
index 
ce93a52fa578127f1eade05dbafdf52021fd61fe..945c314d31c715522e56141ef9b616e52b466261
 100644
--- a/gcc/c-family/c-pragma.h
+++ b/gcc/c-family/c-pragma.h
@@ -87,6 +87,7 @@ enum pragma_kind {
   PRAGMA_GCC_PCH_PREPROCESS,
   PRAGMA_IVDEP,
   PRAGMA_UNROLL,
+  PRAGMA_INTERLEAVES,
   PRAGMA_NOVECTOR,
 
   PRAGMA_FIRST_EXTERNAL
diff --git a/gcc/c-family/c-pragma.cc b/gcc/c-family/c-pragma.cc
index 
1237ee6e62b9d501a7f9ad1cc267061ed068b920..facfb75c1eb9f184f05f4476a3aa04b45c21fd9a
 100644
--- a/gcc/c-family/c-pragma.cc
+++ b/gcc/c-family/c-pragma.cc
@@ -1828,6 +1828,10 @@ init_pragma (void)
 cpp_register_deferred_pragma (parse_in, "GCC", "unroll", PRAGMA_UNROLL,
  false, false);
 
+  if (!flag_preprocess_only)
+cpp_register_deferred_pragma (parse_in, "GCC", "interleaves", 
PRAGMA_INTERLEAVES,
+ false, false);
+
   if (!flag_preprocess_only)
 cpp_register_deferred_pragma (parse_in, "GCC", "novector", PRAGMA_NOVECTOR,
  false, false);
diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 
00f8bf4376e537e04ea8e468a05dade3c7212d8b..69b36d196bd74b494ffb6186be5240e3d2431407
 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -1665,11 +1665,12 @@ static tree c_parser_c99_block_statement (c_parser *, 
bool *,
  location_t * = NULL);
 static void c_parser_if_statement (c_parser *, bool *, vec *);
 static void c_parser_switch_statement (c_parser *, bool *);
-static void c_parser_while_statement (c_parser *, bool, unsigned short, bool,
- bool *);
-static void c_parser_do_statement (c_parser *, bool, unsigned short, bool);
-static void c_parser_for_statement (c_parser *, bool, unsigned short, bool,
-   bool *);
+static void c_parser_while_statement (c_parser *, bool, unsigned short,
+ unsigned short, bool, bool *);
+static void c_parser_do_statement (c_parser *, bool, unsigned short,
+  unsigned short,bool);
+static void c_parser_for_statement (c_parser *, bool, unsigned short,
+   unsigned short, bool, bool *);
 static tree c_parser_asm_statement (c_parser *);
 static tree c_parser_asm_operands (c_parser *);
 static tree c_parser_asm_goto_operands (c_parser *);
@@ -7603,13 +7604,13 @@ c_parser_statement_after_labels (c_parser *parser, bool 
*if_p,
  c_parser_switch_statement (parser, if_p);
  break;
case RID_WHILE:
- c_parser_while_statement (parser, false, 0, false, if_p);
+ c_parser_while_statement (parser, false, 0, 0, false, if_p);
  break;
case RID_DO:
- c_parser_do_statement (parser, false, 0, false);
+ c_parser_do_statement (parser, false, 0, 0, false);
  break;
case RID_FOR:
- c_parser_for_statement (parser, false, 0, false, if_p);
+ c_parser_for_statement (parser, false, 0, 0, false, if_p);
  break;
case RID_GOTO:
  c_parser_consume_token (parser);
@@ -8105,7 +8106,7 @@ c_parser_switch_statement (c_parser *parser, bool *if_p)
 
 static void
 c_parser_while_statement (c_parser *parser, bool ivdep, unsigned short unroll,
- bool novector, bool *if_p)
+ unsigned short interleaves, bool novector, bool *if_p)
 {
   tree block, cond, body;
   unsigned char save_in_statement;
@@ -8135,6 +8136,11 @@ c_parser_while_statement (c_parser *parser, bool ivdep, 
unsigned short unroll,
   build_int_cst (integer_type_node,
  annot_expr_unroll_kind),
   build_int_cst (integer_type_node, unroll));
+  if (interleaves && cond != error_mark_node)
+cond = build3 (ANNOTATE_EXPR, TREE_TYPE (cond), cond,
+  

[PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-05-23 Thread Andre Vieira

This patch adds support for MVE Tail-Predicated Low Overhead Loops by using the
doloop functionality added to support predicated vectorized hardware loops.
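
For illustration (a sketch; the committed tests live under 
gcc.target/arm/mve/), this is the kind of intrinsics loop the transform 
turns into a dlstp/letp low-overhead loop:

#include <arm_mve.h>

/* Tail-predicated vector add: the vctp32q predicate masks off the
   excess lanes in the final iteration, so no scalar tail is needed.  */
void vadd (int32_t *a, int32_t *b, int32_t *c, int n)
{
  while (n > 0)
    {
      mve_pred16_t p = vctp32q (n);
      int32x4_t va = vldrwq_z_s32 (a, p);
      int32x4_t vb = vldrwq_z_s32 (b, p);
      vstrwq_p_s32 (c, vaddq_x_s32 (va, vb, p), p);
      a += 4; b += 4; c += 4;
      n -= 4;
    }
}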

gcc/ChangeLog:

* config/arm/arm-protos.h (arm_target_bb_ok_for_lob): Change
declaration to pass basic_block.
(arm_attempt_dlstp_transform): New declaration.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): Define targethook.
(TARGET_PREDICT_DOLOOP_P): Likewise.
(arm_target_bb_ok_for_lob): Adapt condition.
(arm_mve_get_vctp_lanes): New function.
(arm_dl_usage_type): New internal enum.
(arm_get_required_vpr_reg): New function.
(arm_get_required_vpr_reg_param): New function.
(arm_get_required_vpr_reg_ret_val): New function.
(arm_mve_get_loop_vctp): New function.
(arm_mve_insn_predicated_by): New function.
(arm_mve_across_lane_insn_p): New function.
(arm_mve_load_store_insn_p): New function.
(arm_mve_impl_pred_on_outputs_p): New function.
(arm_mve_impl_pred_on_inputs_p): New function.
(arm_last_vect_def_insn): New function.
(arm_mve_impl_predicated_p): New function.
(arm_mve_check_reg_origin_is_num_elems): New function.
(arm_mve_dlstp_check_inc_counter): New function.
(arm_mve_dlstp_check_dec_counter): New function.
(arm_mve_loop_valid_for_dlstp): New function.
(arm_predict_doloop_p): New function.
(arm_loop_unroll_adjust): New function.
(arm_emit_mve_unpredicated_insn_to_seq): New function.
(arm_attempt_dlstp_transform): New function.
* config/arm/arm.opt (mdlstp): New option.
* config/arm/iterators.md (dlstp_elemsize, letp_num_lanes,
letp_num_lanes_neg, letp_num_lanes_minus_1): New attributes.
(DLSTP, LETP): New iterators.
(predicated_doloop_end_internal): New pattern.
(dlstp_insn): New pattern.
* config/arm/thumb2.md (doloop_end): Adapt to support tail-predicated
loops.
(doloop_begin): Likewise.
* config/arm/types.md (mve_misc): New mve type to represent
predicated_loop_end insn sequences.
* config/arm/unspecs.md:
(DLSTP8, DLSTP16, DLSTP32, DLSTP64,
LETP8, LETP16, LETP32, LETP64): New unspecs for DLSTP and LETP.

gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Add new helpers.
* gcc.target/arm/lob1.c: Use new helpers.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/dlstp-compile-asm-1.c: New test.
* gcc.target/arm/dlstp-compile-asm-2.c: New test.
* gcc.target/arm/dlstp-compile-asm-3.c: New test.
* gcc.target/arm/dlstp-int8x16.c: New test.
* gcc.target/arm/dlstp-int8x16-run.c: New test.
* gcc.target/arm/dlstp-int16x8.c: New test.
* gcc.target/arm/dlstp-int16x8-run.c: New test.
* gcc.target/arm/dlstp-int32x4.c: New test.
* gcc.target/arm/dlstp-int32x4-run.c: New test.
* gcc.target/arm/dlstp-int64x2.c: New test.
* gcc.target/arm/dlstp-int64x2-run.c: New test.
* gcc.target/arm/dlstp-invalid-asm.c: New test.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/config/arm/arm-protos.h   |4 +-
 gcc/config/arm/arm.cc | 1249 -
 gcc/config/arm/arm.opt|3 +
 gcc/config/arm/iterators.md   |   15 +
 gcc/config/arm/mve.md |   50 +
 gcc/config/arm/thumb2.md  |  138 +-
 gcc/config/arm/types.md   |6 +-
 gcc/config/arm/unspecs.md |   14 +-
 gcc/testsuite/gcc.target/arm/lob.h|  128 +-
 gcc/testsuite/gcc.target/arm/lob1.c   |   23 +-
 gcc/testsuite/gcc.target/arm/lob6.c   |8 +-
 .../gcc.target/arm/mve/dlstp-compile-asm-1.c  |  146 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-2.c  |  749 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-3.c  |   46 +
 .../gcc.target/arm/mve/dlstp-int16x8-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int16x8.c|   31 +
 .../gcc.target/arm/mve/dlstp-int32x4-run.c|   45 +
 .../gcc.target/arm/mve/dlstp-int32x4.c|   31 +
 .../gcc.target/arm/mve/dlstp-int64x2-run.c|   48 +
 .../gcc.target/arm/mve/dlstp-int64x2.c|   28 +
 .../gcc.target/arm/mve/dlstp-int8x16-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int8x16.c|   32 +
 .../gcc.target/arm/mve/dlstp-invalid-asm.c|  521 +++
 23 files changed, 3321 insertions(+), 82 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-3.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
 create mode 100644 

[PATCH 1/2] doloop: Add support for predicated vectorized loops

2024-05-23 Thread Andre Vieira

This patch adds support in the target-agnostic doloop pass for the detection of
predicated vectorized hardware loops.  Arm is currently the only target that
will make use of this feature.
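
As a rough sketch of how a backend might use the new df helper (only 
df_bb_regno_only_def_find is from this patch; the surrounding names are 
assumed):

/* Find the single definition of the loop counter in the loop body;
   bail out if it is defined zero times or more than once.  */
static rtx_insn *
find_counter_decrement (basic_block body_bb, rtx counter)
{
  df_ref def = df_bb_regno_only_def_find (body_bb, REGNO (counter));
  if (!def)
    return NULL;  /* not a simple counted loop */
  return DF_REF_INSN (def);
}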

gcc/ChangeLog:

* df-core.cc (df_bb_regno_only_def_find): New helper function.
* df.h (df_bb_regno_only_def_find): Declare new function.
* loop-doloop.cc (doloop_condition_get): Add support for detecting
predicated vectorized hardware loops.
(doloop_modify): Add support for GTU condition checks.
(doloop_optimize): Update costing computation to support alterations to
desc->niter_expr by the backend.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/df-core.cc |  15 +
 gcc/df.h   |   1 +
 gcc/loop-doloop.cc | 164 +++--
 3 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index f0eb4c93957..b0e8a88d433 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+return temp;
+  else
+return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 84e5aa8b524..c4e690b40cf 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 529e810e530..8953e1de960 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
  forms:
 
  1)  (parallel [(set (pc) (if_then_else (condition)
-	  			(label_ref (label))
-(pc)))
-	 (set (reg) (plus (reg) (const_int -1)))
-	 (additional clobbers and uses)])
+	(label_ref (label))
+	(pc)))
+		 (set (reg) (plus (reg) (const_int -1)))
+		 (additional clobbers and uses)])
 
  The branch must be the first entry of the parallel (also required
  by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,33 @@ doloop_condition_get (rtx_insn *doloop_pat)
  the loop counter in an if_then_else too.
 
  2)  (set (reg) (plus (reg) (const_int -1))
- (set (pc) (if_then_else (reg != 0)
-	 (label_ref (label))
-			 (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+ (label_ref (label))
+ (pc))).
 
- Some targets (ARM) do the comparison before the branch, as in the
+ 3) Some targets (Arm) do the comparison before the branch, as in the
  following form:
 
- 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-   (set (reg) (plus (reg) (const_int -1)))])
-(set (pc) (if_then_else (cc == NE)
-(label_ref (label))
-(pc))) */
-
+ (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		(set (reg) (plus (reg) (const_int -1)))])
+ (set (pc) (if_then_else (cc == NE)
+			 (label_ref (label))
+			 (pc)))
+
+  4) This form supports a construct that is used to represent a vectorized
+  do loop with predication, however we do not need to care about the
+  details of the predication here.
+  Arm uses this construct to support MVE tail predication.
+
+  (parallel
+   [(set (pc)
+	 (if_then_else (gtu (plus (reg) (const_int -n))
+(const_int n-1))
+			   (label_ref)
+			   (pc)))
+	(set (reg) (plus (reg) (const_int -n)))
+	(additional clobbers and uses)])
+ */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -173,15 +187,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
-  || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !rtx_equal_p (XEXP (inc_src, 0), reg)
+  || 

[PATCH 0/2] arm, doloop: Add support for MVE Tail-Predicated Low Overhead Loops

2024-05-23 Thread Andre Vieira

Hi,

  We held these two patches back in stage 4 because they touched 
target-agnostic code, though I am quite confident they will not affect other 
targets.  Given stage 1 has reopened, I am reposting them; I rebased them and 
they apply cleanly on trunk.
  No changes from the previously reviewed patches.

  OK for trunk?

Andre Vieira (2):
  doloop: Add support for predicated vectorized loops
  arm: Add support for MVE Tail-Predicated Low Overhead Loops

 gcc/config/arm/arm-protos.h   |4 +-
 gcc/config/arm/arm.cc | 1249 -
 gcc/config/arm/arm.opt|3 +
 gcc/config/arm/iterators.md   |   15 +
 gcc/config/arm/mve.md |   50 +
 gcc/config/arm/thumb2.md  |  138 +-
 gcc/config/arm/types.md   |6 +-
 gcc/config/arm/unspecs.md |   14 +-
 gcc/df-core.cc|   15 +
 gcc/df.h  |1 +
 gcc/loop-doloop.cc|  164 ++-
 gcc/testsuite/gcc.target/arm/lob.h|  128 +-
 gcc/testsuite/gcc.target/arm/lob1.c   |   23 +-
 gcc/testsuite/gcc.target/arm/lob6.c   |8 +-
 .../gcc.target/arm/mve/dlstp-compile-asm-1.c  |  146 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-2.c  |  749 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-3.c  |   46 +
 .../gcc.target/arm/mve/dlstp-int16x8-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int16x8.c|   31 +
 .../gcc.target/arm/mve/dlstp-int32x4-run.c|   45 +
 .../gcc.target/arm/mve/dlstp-int32x4.c|   31 +
 .../gcc.target/arm/mve/dlstp-int64x2-run.c|   48 +
 .../gcc.target/arm/mve/dlstp-int64x2.c|   28 +
 .../gcc.target/arm/mve/dlstp-int8x16-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int8x16.c|   32 +
 .../gcc.target/arm/mve/dlstp-invalid-asm.c|  521 +++
 26 files changed, 3434 insertions(+), 149 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-3.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c

-- 
2.17.1


[wwwdocs] Specify AArch64 BitInt support for little-endian only

2024-05-07 Thread Andre Vieira (lists)

Hey Jakub,

Is this what you had in mind?

Kind regards,
Andre Vieira

diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index 
ca5174de991bb088f653468f77485c15a61526e6..924e045a15a78b5702a0d6997953f35c6b47efd1
 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -325,7 +325,7 @@ You may also want to check out our
   Bit-precise integer types (_BitInt (N)
   and unsigned _BitInt (N)): integer types with
   a specified number of bits.  These are only supported on
-  IA-32, x86-64 and AArch64 at present.
+  IA-32, x86-64 and AArch64 (little-endian) at present.
   Structure, union and enumeration types may be defined more
   than once in the same scope with the same contents and the same
   tag; if such types are defined with the same contents and the


[PATCH] aarch64: Fix _BitInt testcases

2024-04-11 Thread Andre Vieira (lists)

This patch fixes some testisms introduced by:

commit 5aa3fec38cc6f52285168b161bab1a869d864b44
Author: Andre Vieira 
Date:   Wed Apr 10 16:29:46 2024 +0100

aarch64: Add support for _BitInt

The testcases were relying on an unnecessary sign-extend that is no longer
generated.

The tested version was just slightly behind top of trunk when the patch 
was committed, and the codegen had changed, for the better, by then.


OK for trunk? (I am away tomorrow, so if you want this in before the 
weekend feel free to commit it on my behalf, if approved ofc...)



gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/bitfield-bitint-abi-align16.c (g1, g8, g16,
	g1p, g8p, g16p): Remove unnecessary sbfx.
	* gcc.target/aarch64/bitfield-bitint-abi-align8.c (g1, g8, g16,
	g1p, g8p, g16p): Likewise.


diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c 
b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
index 
3f292a45f955d35b802a0bd789cd39d5fa7b5860..4a228b0a1ce696dc80e32305162d58f01d44051d
 100644
--- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
@@ -55,9 +55,8 @@
 ** g1:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f1
 */
@@ -66,9 +65,8 @@
 ** g8:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f8
 */
@@ -76,9 +74,8 @@
 ** g16:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16
 */
@@ -107,9 +104,8 @@
 /*
 ** g1p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1p
@@ -117,9 +113,8 @@
 /*
 ** g8p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f8p
@@ -128,9 +123,8 @@
 ** g16p:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16p
 */
diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c 
b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
index 
da3c23550bae6734f69e2baf0e8db741fb65cfda..e7f773640f04f56646e5e1a5fb91280ea7e4db98
 100644
--- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
@@ -54,9 +54,8 @@
 /*
 ** g1:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1
@@ -65,9 +64,8 @@
 /*
 ** g8:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f8
@@ -76,9 +74,8 @@
 ** g16:
 ** mov (x[0-9]+), x0
 ** mov w0, w1
-** sbfx(x[0-9]+), \1, 0, 63
-** and x4, \2, 9223372036854775807
-** and x2, \2, 1
+** and x4, \1, 9223372036854775807
+** and x2, \1, 1
 ** mov x3, 0
 ** b   f16
 */
@@ -107,9 +104,8 @@
 /*
 ** g1p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f1p
@@ -117,9 +113,8 @@
 /*
 ** g8p:
 ** mov (w[0-9]+), w1
-** sbfx(x[0-9]+), x0, 0, 63
-** and x3, \2, 9223372036854775807
-** and x1, \2, 1
+** and x3, x0, 9223372036854775807
+** and x1, x0, 1
 ** mov x2, 0
 ** mov w0, \1
 ** b   f8p
@@ -128,9 +123,8 @@
 ** g16p:
 ** mov (x[0-9]+), x0
 ** mov w0, w1

[PATCH][wwwdocs] gcc-14/changes.html: Update _BitInt to include AArch64 (little-endian)

2024-04-10 Thread Andre Vieira (lists)

Hi,

Patch to add AArch64 to the list of supported _BitInt(N) in 
gcc-14/changes.html.


OK?

diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index 
a7ba957110183f906938d935bfa17aaed2ba20c8..55ab8c14c6d0b54e05a5f266f25c8ef1a4f959bf
 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -216,7 +216,7 @@ a work-in-progress.
   Bit-precise integer types (_BitInt (N)
   and unsigned _BitInt (N)): integer types with
   a specified number of bits.  These are only supported on
-  IA-32/x86-64 at present.
+  IA-32/x86-64 and AArch64 (little-endian) at present.
   Structure, union and enumeration types may be defined more
   than once in the same scope with the same contents and the same
   tag; if such types are defined with the same contents and the


Re: [PATCHv3 2/2] aarch64: Add support for _BitInt

2024-04-10 Thread Andre Vieira (lists)
Added the target check; I also had to change some of the assembly checking 
due to changes upstream.  The assembly is still valid, but we do extend 
where it is not necessary; I believe that's a general issue though.

The _BitInt(N > 64) codegen for non-powers of 2 did get worse; we see 
similar codegen with __int128 bit-fields on aarch64.
I suspect we need to improve the way we 'extend' TImode in the aarch64 
backend to be able to operate only on the affected DImode parts of it 
when relevant.  Though I also think we may need to change how _BitInt is 
currently expanded in such situations; right now it does the extension 
as two shifts.  Anyway, I did not have much time to look deeper into this.
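
For reference, a sketch of the affected shape (C23, compiled with 
-std=c23 on little-endian AArch64):

/* A non-power-of-2 _BitInt wider than 64 bits: operations work on
   DImode limbs and need the top limb extended, which is the codegen
   discussed above.  */
unsigned _BitInt(65) add65 (unsigned _BitInt(65) a, unsigned _BitInt(65) b)
{
  return a + b;
}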


Bootstrapped on aarch64-unknown-linux-gnu.

OK for trunk?

On 28/03/2024 15:21, Richard Sandiford wrote:

Jakub Jelinek  writes:

On Thu, Mar 28, 2024 at 03:00:46PM +, Richard Sandiford wrote:

* gcc.target/aarch64/bitint-alignments.c: New test.
* gcc.target/aarch64/bitint-args.c: New test.
* gcc.target/aarch64/bitint-sizes.c: New test.
* gcc.target/aarch64/bitfield-bitint-abi.h: New header.
* gcc.target/aarch64/bitfield-bitint-abi-align16.c: New test.
* gcc.target/aarch64/bitfield-bitint-abi-align8.c: New test.


Since we don't support big-endian yet, I assume the tests should be
conditional on aarch64_little_endian.


Perhaps better on bitint effective target, then they'll become available
automatically as soon as big endian aarch64 _BitInt support is turned on.


Ah, yeah, good point.

Richard

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
81400cc666472ffeff40df14e98ae00ebc774d31..c0af4ef151a8c46f78c0c3a43c2ab1318a3f610a
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6583,6 +6583,7 @@ aarch64_return_in_memory_1 (const_tree type)
   int count;
 
   if (!AGGREGATE_TYPE_P (type)
+  && TREE_CODE (type) != BITINT_TYPE
   && TREE_CODE (type) != COMPLEX_TYPE
   && TREE_CODE (type) != VECTOR_TYPE)
 /* Simple scalar types always returned in registers.  */
@@ -21996,6 +21997,11 @@ aarch64_composite_type_p (const_tree type,
   if (type && (AGGREGATE_TYPE_P (type) || TREE_CODE (type) == COMPLEX_TYPE))
 return true;
 
+  if (type
+  && TREE_CODE (type) == BITINT_TYPE
+  && int_size_in_bytes (type) > 16)
+return true;
+
   if (mode == BLKmode
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_INT)
@@ -28477,6 +28483,42 @@ aarch64_excess_precision (enum excess_precision_type 
type)
   return FLT_EVAL_METHOD_UNPREDICTABLE;
 }
 
+/* Implement TARGET_C_BITINT_TYPE_INFO.
+   Return true if _BitInt(N) is supported and fill its details into *INFO.  */
+bool
+aarch64_bitint_type_info (int n, struct bitint_info *info)
+{
+  if (TARGET_BIG_END)
+return false;
+
+  if (n <= 8)
+info->limb_mode = QImode;
+  else if (n <= 16)
+info->limb_mode = HImode;
+  else if (n <= 32)
+info->limb_mode = SImode;
+  else if (n <= 64)
+info->limb_mode = DImode;
+  else if (n <= 128)
+info->limb_mode = TImode;
+  else
+/* The AAPCS for AArch64 defines _BitInt(N > 128) as an array with
+   type {signed,unsigned} __int128[M] where M*128 >= N.  However, to be
+   able to use libgcc's implementation to support large _BitInt's we need
+   to use a LIMB_MODE that is no larger than 'long long'.  This is why we
+   use DImode for our internal LIMB_MODE and we define the ABI_LIMB_MODE to
+   be TImode to ensure we are ABI compliant.  */
+info->limb_mode = DImode;
+
+  if (n > 128)
+info->abi_limb_mode = TImode;
+  else
+info->abi_limb_mode = info->limb_mode;
+  info->big_endian = TARGET_BIG_END;
+  info->extended = false;
+  return true;
+}
+
 /* Implement TARGET_SCHED_CAN_SPECULATE_INSN.  Return true if INSN can be
scheduled for speculative execution.  Reject the long-running division
and square-root instructions.  */
@@ -30601,6 +30643,9 @@ aarch64_run_selftests (void)
 #undef TARGET_C_EXCESS_PRECISION
 #define TARGET_C_EXCESS_PRECISION aarch64_excess_precision
 
+#undef TARGET_C_BITINT_TYPE_INFO
+#define TARGET_C_BITINT_TYPE_INFO aarch64_bitint_type_info
+
 #undef  TARGET_EXPAND_BUILTIN
 #define TARGET_EXPAND_BUILTIN aarch64_expand_builtin
 
diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c 
b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
new file mode 100644
index 
..3f292a45f955d35b802a0bd789cd39d5fa7b5860
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
@@ -0,0 +1,384 @@
+/* { dg-do compile { target bitint } } */
+/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps 
-fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#define ALIGN 16
+#include "bitfield-bitint-abi.h"
+
+// f1-f16 are all the same
+

Re: [PATCHv2 1/2] aarch64: Do not give ABI change diagnostics for _BitInt(N)

2024-04-10 Thread Andre Vieira (lists)

Hey,

Added the warn_pcs_change_le_gcc14 variable and changed the uses of 
warn_pcs_change to use this new variable.
Also fixed an issue with the loop through TYPE_FIELDS to avoid an ICE 
during bootstrap.


OK for trunk?

Bootstrapped and regression tested on aarch64-unknown-linux-gnu.

Kind regards,
Andre

On 28/03/2024 12:54, Richard Sandiford wrote:

"Andre Vieira (lists)"  writes:

This patch makes sure we do not give ABI change diagnostics for the ABI
breaks of GCC 9, 13 and 14 for any type involving _BitInt(N), since that
type did not exist before this GCC version.

ChangeLog:

* config/aarch64/aarch64.cc (bitint_or_aggr_of_bitint_p): New function.
(aarch64_layout_arg): Don't emit diagnostics for types involving
_BitInt(N).

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
1ea84c8bd7386e399f6ffa3a5e36408cf8831fc6..b68cf3e7cb9a6fa89b4e5826a39ffa11f64ca20a
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6744,6 +6744,33 @@ aarch64_function_arg_alignment (machine_mode mode, 
const_tree type,
return alignment;
  }
  
+/* Return true if TYPE describes a _BitInt(N) or an aggregate that uses the
+   _BitInt(N) type.  These include ARRAY_TYPE's with an element that is a
+   _BitInt(N) or an aggregate that uses it, and a RECORD_TYPE or a UNION_TYPE
+   with a field member that is a _BitInt(N) or an aggregate that uses it.
+   Return false otherwise.  */
+
+static bool
+bitint_or_aggr_of_bitint_p (tree type)
+{
+  if (!type)
+return false;
+
+  if (TREE_CODE (type) == BITINT_TYPE)
+return true;
+
+  /* If ARRAY_TYPE, check its element type.  */
+  if (TREE_CODE (type) == ARRAY_TYPE)
+return bitint_or_aggr_of_bitint_p (TREE_TYPE (type));
+
+  /* If RECORD_TYPE or UNION_TYPE, check the fields' types.  */
+  if (RECORD_OR_UNION_TYPE_P (type))
+for (tree field = TYPE_FIELDS (type); field; field = TREE_CHAIN (field))
+  if (bitint_or_aggr_of_bitint_p (TREE_TYPE (field)))
+   return true;
+  return false;
+}
+
  /* Layout a function argument according to the AAPCS64 rules.  The rule
 numbers refer to the rule numbers in the AAPCS64.  ORIG_MODE is the
 mode that was originally given to us by the target hook, whereas the
@@ -6767,12 +6794,6 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
function_arg_info )
if (pcum->aapcs_arg_processed)
  return;
  
-  bool warn_pcs_change

-= (warn_psabi
-   && !pcum->silent_p
-   && (currently_expanding_function_start
-  || currently_expanding_gimple_stmt));
-
/* HFAs and HVAs can have an alignment greater than 16 bytes.  For example:
  
 typedef struct foo {

@@ -6907,6 +6928,18 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
function_arg_info )
  && (!alignment || abi_break_gcc_9 < alignment)
  && (!abi_break_gcc_13 || alignment < abi_break_gcc_13));
  
+

+  bool warn_pcs_change
+= (warn_psabi
+   && !pcum->silent_p
+   && (currently_expanding_function_start
+  || currently_expanding_gimple_stmt)
+  /* warn_pcs_change is currently used to gate diagnostics in case of
+abi_break_gcc_{9,13,14}.  These however, do not apply to _BitInt(N)
+types as they were only introduced in GCC 14.  */
+   && (!type || !bitint_or_aggr_of_bitint_p (type)));


How about making this a new variable such as:

   /* _BitInt(N) was only added in GCC 14.  */
   bool warn_pcs_change_le_gcc14
 = (warn_psabi && !bitint_or_aggr_of_bitint_p (type));

(and keeping warn_pcs_change where it is).  In principle, warn_pcs_change
is meaningful for any future ABI breaks, and we might forget that it
excludes bitints.  The name is just a suggestion.

OK with that change, thanks.

Richard


+
+
/* allocate_ncrn may be false-positive, but allocate_nvrn is quite reliable.
   The following code thus handles passing by SIMD/FP registers first.  */
  
@@ -21266,19 +21299,25 @@ aarch64_gimplify_va_arg_expr (tree valist, tree type, gimple_seq *pre_p,

rsize = ROUND_UP (size, UNITS_PER_WORD);
nregs = rsize / UNITS_PER_WORD;
  
-  if (align <= 8 && abi_break_gcc_13 && warn_psabi)

+  if (align <= 8
+ && abi_break_gcc_13
+ && warn_psabi
+ && !bitint_or_aggr_of_bitint_p (type))
inform (input_location, "parameter passing for argument of type "
"%qT changed in GCC 13.1", type);
  
if (warn_psabi

  && abi_break_gcc_14
- && (abi_break_gcc_14 > 8 * BITS_PER_UNIT) != (align > 8))
+ && (abi_break_gcc_14 > 8 * BITS_PER_UNIT) != (align > 8)
+ && !bitint_or_aggr_of_bitint_p (type))
inform (input_location, "parameter passing for argu

[PATCHv2 2/2] aarch64: Add support for _BitInt

2024-03-27 Thread Andre Vieira (lists)
This patch adds support for C23's _BitInt for the AArch64 port when 
compiling for little-endian.  Big-endian requires further 
target-agnostic support, and we therefore disable it for now.


The tests expose some suboptimal codegen, for which I'll create PRs for 
optimizations after this goes in.
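
For reference, a sketch of the large-_BitInt case (C23, little-endian 
AArch64 assumed); per this patch, _BitInt(N) with N > 128 is treated as 
a composite and returned in memory:

unsigned _BitInt(256) big_or (unsigned _BitInt(256) a, unsigned _BitInt(256) b)
{
  return a | b;
}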


gcc/ChangeLog:

* config/aarch64/aarch64.cc (TARGET_C_BITINT_TYPE_INFO): Declare MACRO.
(aarch64_bitint_type_info): New function.
(aarch64_return_in_memory_1): Return large _BitInt's in memory.
(aarch64_function_arg_alignment): Adapt to correctly return the ABI
mandated alignment of _BitInt(N) where N > 128 as the alignment of
TImode.
(aarch64_composite_type_p): Return true for _BitInt(N), where N > 128.

libgcc/ChangeLog:

* config/aarch64/t-softfp (softfp_extras): Add floatbitinthf,
floatbitintbf, floatbitinttf and fixtfbitint.
* config/aarch64/libgcc-softfp.ver (GCC_14.0.0): Add __floatbitinthf,
__floatbitintbf, __floatbitinttf and __fixtfbitint.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/bitint-alignments.c: New test.
* gcc.target/aarch64/bitint-args.c: New test.
* gcc.target/aarch64/bitint-sizes.c: New test.
* gcc.target/aarch64/bitfield-bitint-abi.h: New header.
* gcc.target/aarch64/bitfield-bitint-abi-align16.c: New test.
	* gcc.target/aarch64/bitfield-bitint-abi-align8.c: New test.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
b68cf3e7cb9a6fa89b4e5826a39ffa11f64ca20a..5fe55c6e980bc1ea66df0e4357932123cd049366
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6583,6 +6583,7 @@ aarch64_return_in_memory_1 (const_tree type)
   int count;
 
   if (!AGGREGATE_TYPE_P (type)
+  && TREE_CODE (type) != BITINT_TYPE
   && TREE_CODE (type) != COMPLEX_TYPE
   && TREE_CODE (type) != VECTOR_TYPE)
 /* Simple scalar types always returned in registers.  */
@@ -21991,6 +21992,11 @@ aarch64_composite_type_p (const_tree type,
   if (type && (AGGREGATE_TYPE_P (type) || TREE_CODE (type) == COMPLEX_TYPE))
 return true;
 
+  if (type
+  && TREE_CODE (type) == BITINT_TYPE
+  && int_size_in_bytes (type) > 16)
+return true;
+
   if (mode == BLKmode
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_INT)
@@ -28472,6 +28478,42 @@ aarch64_excess_precision (enum excess_precision_type type)
   return FLT_EVAL_METHOD_UNPREDICTABLE;
 }
 
+/* Implement TARGET_C_BITINT_TYPE_INFO.
+   Return true if _BitInt(N) is supported and fill its details into *INFO.  */
+bool
+aarch64_bitint_type_info (int n, struct bitint_info *info)
+{
+  if (TARGET_BIG_END)
+return false;
+
+  if (n <= 8)
+info->limb_mode = QImode;
+  else if (n <= 16)
+info->limb_mode = HImode;
+  else if (n <= 32)
+info->limb_mode = SImode;
+  else if (n <= 64)
+info->limb_mode = DImode;
+  else if (n <= 128)
+info->limb_mode = TImode;
+  else
+/* The AAPCS for AArch64 defines _BitInt(N > 128) as an array with
+   type {signed,unsigned} __int128[M] where M*128 >= N.  However, to be
+   able to use libgcc's implementation to support large _BitInt's we need
+   to use a LIMB_MODE that is no larger than 'long long'.  This is why we
+   use DImode for our internal LIMB_MODE and we define the ABI_LIMB_MODE to
+   be TImode to ensure we are ABI compliant.  */
+info->limb_mode = DImode;
+
+  if (n > 128)
+info->abi_limb_mode = TImode;
+  else
+info->abi_limb_mode = info->limb_mode;
+  info->big_endian = TARGET_BIG_END;
+  info->extended = false;
+  return true;
+}
+
 /* Implement TARGET_SCHED_CAN_SPECULATE_INSN.  Return true if INSN can be
scheduled for speculative execution.  Reject the long-running division
and square-root instructions.  */
@@ -30596,6 +30638,9 @@ aarch64_run_selftests (void)
 #undef TARGET_C_EXCESS_PRECISION
 #define TARGET_C_EXCESS_PRECISION aarch64_excess_precision
 
+#undef TARGET_C_BITINT_TYPE_INFO
+#define TARGET_C_BITINT_TYPE_INFO aarch64_bitint_type_info
+
 #undef  TARGET_EXPAND_BUILTIN
 #define TARGET_EXPAND_BUILTIN aarch64_expand_builtin
 
diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c 
b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
new file mode 100644
index 
..048d04e4c1bf90215892aa0173f6246a097d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
@@ -0,0 +1,378 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-stack-protector -save-temps -fno-schedule-insns 
-fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#define ALIGN 16
+#include "bitfield-bitint-abi.h"
+
+// f1-f16 are all the same
+
+/*
+** f1:
+** and x0, x2, 1
+** ret
+*/
+/*
+** f8:
+** and x0, x2, 1
+** ret
+*/
+/*
+** f16:
+** and x0, x2, 1

[PATCHv2 1/2] aarch64: Do not give ABI change diagnostics for _BitInt(N)

2024-03-27 Thread Andre Vieira (lists)
This patch makes sure we do not give ABI change diagnostics for the ABI 
breaks of GCC 9, 13 and 14 for any type involving _BitInt(N), since that 
type did not exist before this GCC version.
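
As a hand-written illustration (not one of the new tests), the kind of 
code this affects looks like:

```
/* Passing a type like this could previously trip the -Wpsabi notes
   ("parameter passing for argument of type ... changed in GCC 9.1",
   13.1 or 14.1), even though _BitInt(N) only exists from GCC 14 on,
   so no ABI with an older compiler could ever have been broken.  */
struct bi { unsigned _BitInt(63) b : 10; };

void g (int x, struct bi s);  /* with this patch: no note  */
```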


ChangeLog:

* config/aarch64/aarch64.cc (bitint_or_aggr_of_bitint_p): New function.
(aarch64_layout_arg): Don't emit diagnostics for types involving
_BitInt(N).diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 1ea84c8bd7386e399f6ffa3a5e36408cf8831fc6..b68cf3e7cb9a6fa89b4e5826a39ffa11f64ca20a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6744,6 +6744,33 @@ aarch64_function_arg_alignment (machine_mode mode, const_tree type,
   return alignment;
 }
 
+/* Return true if TYPE describes a _BitInt(N) or an aggregate that uses the
+   _BitInt(N) type.  These include ARRAY_TYPE's with an element that is a
+   _BitInt(N) or an aggregate that uses it, and a RECORD_TYPE or a UNION_TYPE
+   with a field member that is a _BitInt(N) or an aggregate that uses it.
+   Return false otherwise.  */
+
+static bool
+bitint_or_aggr_of_bitint_p (tree type)
+{
+  if (!type)
+return false;
+
+  if (TREE_CODE (type) == BITINT_TYPE)
+return true;
+
+  /* If ARRAY_TYPE, check its element type.  */
+  if (TREE_CODE (type) == ARRAY_TYPE)
+return bitint_or_aggr_of_bitint_p (TREE_TYPE (type));
+
+  /* If RECORD_TYPE or UNION_TYPE, check the fields' types.  */
+  if (RECORD_OR_UNION_TYPE_P (type))
+for (tree field = TYPE_FIELDS (type); field; field = TREE_CHAIN (field))
+  if (bitint_or_aggr_of_bitint_p (TREE_TYPE (field)))
+   return true;
+  return false;
+}
+
 /* Layout a function argument according to the AAPCS64 rules.  The rule
numbers refer to the rule numbers in the AAPCS64.  ORIG_MODE is the
mode that was originally given to us by the target hook, whereas the
@@ -6767,12 +6794,6 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const function_arg_info &arg)
   if (pcum->aapcs_arg_processed)
 return;
 
-  bool warn_pcs_change
-= (warn_psabi
-   && !pcum->silent_p
-   && (currently_expanding_function_start
-  || currently_expanding_gimple_stmt));
-
   /* HFAs and HVAs can have an alignment greater than 16 bytes.  For example:
 
typedef struct foo {
@@ -6907,6 +6928,18 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const function_arg_info &arg)
  && (!alignment || abi_break_gcc_9 < alignment)
  && (!abi_break_gcc_13 || alignment < abi_break_gcc_13));
 
+
+  bool warn_pcs_change
+= (warn_psabi
+   && !pcum->silent_p
+   && (currently_expanding_function_start
+  || currently_expanding_gimple_stmt)
+  /* warn_pcs_change is currently used to gate diagnostics in case of
+abi_break_gcc_{9,13,14}.  These however, do not apply to _BitInt(N)
+types as they were only introduced in GCC 14.  */
+   && (!type || !bitint_or_aggr_of_bitint_p (type)));
+
+
   /* allocate_ncrn may be false-positive, but allocate_nvrn is quite reliable.
  The following code thus handles passing by SIMD/FP registers first.  */
 
@@ -21266,19 +21299,25 @@ aarch64_gimplify_va_arg_expr (tree valist, tree type, gimple_seq *pre_p,
   rsize = ROUND_UP (size, UNITS_PER_WORD);
   nregs = rsize / UNITS_PER_WORD;
 
-  if (align <= 8 && abi_break_gcc_13 && warn_psabi)
+  if (align <= 8
+ && abi_break_gcc_13
+ && warn_psabi
+ && !bitint_or_aggr_of_bitint_p (type))
inform (input_location, "parameter passing for argument of type "
"%qT changed in GCC 13.1", type);
 
   if (warn_psabi
  && abi_break_gcc_14
- && (abi_break_gcc_14 > 8 * BITS_PER_UNIT) != (align > 8))
+ && (abi_break_gcc_14 > 8 * BITS_PER_UNIT) != (align > 8)
+ && !bitint_or_aggr_of_bitint_p (type))
inform (input_location, "parameter passing for argument of type "
"%qT changed in GCC 14.1", type);
 
   if (align > 8)
{
- if (abi_break_gcc_9 && warn_psabi)
+ if (abi_break_gcc_9
+ && warn_psabi
+ && !bitint_or_aggr_of_bitint_p (type))
inform (input_location, "parameter passing for argument of type "
"%qT changed in GCC 9.1", type);
  dw_align = true;


[PATCHv2 0/2] aarch64, bitint: Add support for _BitInt for AArch64 Little Endian

2024-03-27 Thread Andre Vieira (lists)

Hi,

Introduced a new patch to disable diagnostics for ABI breaks involving 
_BitInt(N) given the type didn't exist, let me know what you think of that.


Also added further testing to replicate the ABI diagnostic tests to use 
_BitInt(N).


Andre Vieira (2)
aarch64: Do not give ABI change diagnostics for _BitInt(N)
aarch64: Add support for _BitInt



Backport PR91838 and PR110838

2024-03-25 Thread Andre Vieira (lists)

Hi,

After the backport of PR target/112787 a failure was reported against 
x86_64, this would be fixed by backporting:
* tree-optimization/91838 - fix FAIL of g++.dg/opt/pr91838.C 
(d1c072a1c3411a6fe29900750b38210af8451eeb)
* tree-optimization/110838 - less aggressively fold out-of-bound shifts 
(04aa0edcace22a7815cfc57575f1f7b1f166ac10)


Patches apply cleanly, just one minor git context conflict with includes.

Bootstrapped and regression tested on aarch64-unknown-linux-gnu and 
x86_64-pc-linux-gnu for gcc-12 and gcc-13 branches.


OK to backport?

Kind regards,
Andre


Re: [PATCH] testsuite: Fix fallout of turning warnings into errors on 32-bit Arm

2024-03-01 Thread Andre Vieira (lists)

Hi Thiago,

Thanks for this, LGTM but I can't approve this, CC'ing Richard.

Do have a nitpick, in the gcc/testsuite/ChangeLog: remove 
'gcc/testsuite' from bullet points 2-4.


Kind regards,
Andre

On 13/01/2024 00:55, Thiago Jung Bauermann wrote:

Since commits 2c3db94d9fd ("c: Turn int-conversion warnings into
permerrors") and 55e94561e97e ("c: Turn -Wimplicit-function-declaration
into a permerror") these tests fail with errors such as:

   FAIL: gcc.target/arm/pr59858.c (test for excess errors)
   FAIL: gcc.target/arm/pr65647.c (test for excess errors)
   FAIL: gcc.target/arm/pr65710.c (test for excess errors)
   FAIL: gcc.target/arm/pr97969.c (test for excess errors)

Here's one example of the excess errors:

   FAIL: gcc.target/arm/pr65647.c (test for excess errors)
   Excess errors:
   /path/gcc.git/gcc/testsuite/gcc.target/arm/pr65647.c:6:17: error: 
initialization of 'int' from 'int *' makes integer from pointer without a cast 
[-Wint-conversion]
   /path/gcc.git/gcc/testsuite/gcc.target/arm/pr65647.c:6:51: error: 
initialization of 'int' from 'int *' makes integer from pointer without a cast 
[-Wint-conversion]
   /path/gcc.git/gcc/testsuite/gcc.target/arm/pr65647.c:6:62: error: 
initialization of 'int' from 'int *' makes integer from pointer without a cast 
[-Wint-conversion]
   /path/gcc.git/gcc/testsuite/gcc.target/arm/pr65647.c:7:48: error: 
initialization of 'int' from 'int *' makes integer from pointer without a cast 
[-Wint-conversion]
   /path/gcc.git/gcc/testsuite/gcc.target/arm/pr65647.c:8:9: error: 
initialization of 'int' from 'int *' makes integer from pointer without a cast 
[-Wint-conversion]
   /path/gcc.git/gcc/testsuite/gcc.target/arm/pr65647.c:24:5: error: 
initialization of 'int' from 'int *' makes integer from pointer without a cast 
[-Wint-conversion]
   /path/gcc.git/gcc/testsuite/gcc.target/arm/pr65647.c:25:5: error: 
initialization of 'int' from 'struct S1 *' makes integer from pointer without a 
cast [-Wint-conversion]
   /path/gcc.git/gcc/testsuite/gcc.target/arm/pr65647.c:41:3: error: implicit 
declaration of function 'fn3'; did you mean 'fn2'? 
[-Wimplicit-function-declaration]
   /path/gcc.git/gcc/testsuite/gcc.target/arm/pr65647.c:46:3: error: implicit 
declaration of function 'fn5'; did you mean 'fn4'? 
[-Wimplicit-function-declaration]
   /path/gcc.git/gcc/testsuite/gcc.target/arm/pr65647.c:57:16: error: implicit 
declaration of function 'fn6'; did you mean 'fn4'? 
[-Wimplicit-function-declaration]

PR rtl-optimization/59858 and PR target/65710 test the fix of an ICE.
PR target/65647 and PR target/97969 test for a compilation infinite loop.

Therefore, add -fpermissive so that the tests behave as they did previously.
Tested on armv8l-linux-gnueabihf.

gcc/testsuite/ChangeLog:
* gcc.target/arm/pr59858.c: Add -fpermissive.
* gcc/testsuite/gcc.target/arm/pr65647.c: Likewise.
* gcc/testsuite/gcc.target/arm/pr65710.c: Likewise.
* gcc/testsuite/gcc.target/arm/pr97969.c: Likewise.
---
  gcc/testsuite/gcc.target/arm/pr59858.c | 2 +-
  gcc/testsuite/gcc.target/arm/pr65647.c | 2 +-
  gcc/testsuite/gcc.target/arm/pr65710.c | 2 +-
  gcc/testsuite/gcc.target/arm/pr97969.c | 2 +-
  4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/gcc/testsuite/gcc.target/arm/pr59858.c 
b/gcc/testsuite/gcc.target/arm/pr59858.c
index 3360b48e8586..9336edfce277 100644
--- a/gcc/testsuite/gcc.target/arm/pr59858.c
+++ b/gcc/testsuite/gcc.target/arm/pr59858.c
@@ -1,5 +1,5 @@
  /* { dg-do compile } */
-/* { dg-options "-march=armv5te -fno-builtin -mfloat-abi=soft -mthumb 
-fno-stack-protector -Os -fno-tree-loop-optimize -fno-tree-dominator-opts -fPIC -w" 
} */
+/* { dg-options "-march=armv5te -fno-builtin -mfloat-abi=soft -mthumb 
-fno-stack-protector -Os -fno-tree-loop-optimize -fno-tree-dominator-opts -fPIC -w 
-fpermissive" } */
  /* { dg-require-effective-target fpic } */
  /* { dg-skip-if "Incompatible command line options: -mfloat-abi=soft -mfloat-abi=hard" { *-*-* } 
{ "-mfloat-abi=hard" } { "" } } */
  /* { dg-require-effective-target arm_arch_v5te_thumb_ok } */
diff --git a/gcc/testsuite/gcc.target/arm/pr65647.c 
b/gcc/testsuite/gcc.target/arm/pr65647.c
index 26b4e399f6be..3cbf6b804ec0 100644
--- a/gcc/testsuite/gcc.target/arm/pr65647.c
+++ b/gcc/testsuite/gcc.target/arm/pr65647.c
@@ -1,7 +1,7 @@
  /* { dg-do compile } */
  /* { dg-require-effective-target arm_arch_v6m_ok } */
  /* { dg-skip-if "do not override -mfloat-abi" { *-*-* } { "-mfloat-abi=*" } 
{"-mfloat-abi=soft" } } */
-/* { dg-options "-march=armv6-m -mthumb -O3 -w -mfloat-abi=soft" } */
+/* { dg-options "-march=armv6-m -mthumb -O3 -w -mfloat-abi=soft -fpermissive" 
} */
  
  a, b, c, e, g = , h, i = 7, l = 1, m, n, o, q = , r, s = , u, w = 9, x,

y = 6, z, t6 = 7, t8, t9 = 1, t11 = 5, t12 = , t13 = 3, t15,
diff --git a/gcc/testsuite/gcc.target/arm/pr65710.c 
b/gcc/testsuite/gcc.target/arm/pr65710.c
index 103ce1d45f77..4cbf7817af7e 100644
--- 

Re: [PATCH] tree-optimization/110221 - SLP and loop mask/len

2024-03-01 Thread Andre Vieira (lists)

Hi,

Bootstrapped and tested the gcc-13 backport of this on gcc-12 for 
aarch64-unknown-linux-gnu and x86_64-pc-linux-gnu and no regressions.


OK to push to gcc-12 branch?

Kind regards,
Andre Vieira

On 10/11/2023 13:16, Richard Biener wrote:

The following fixes the issue that when SLP stmts are internal defs
but appear invariant because they end up only using invariant defs
then they get scheduled outside of the loop.  This nice optimization
breaks down when loop masks or lens are applied since those are not
explicitly tracked as dependences.  The following makes sure to never
schedule internal defs outside of the vectorized loop when the
loop uses masks/lens.
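
To illustrate the failure mode with hand-written gimple (my sketch, not
compiler output):

```
/* All explicit operands of _5 are loop-invariant, so SLP scheduling
   treated it as hoistable; but loop_mask_7 is produced inside the loop
   and is not tracked as a dependence, so _5 must stay in the loop.  */
_5 = .COND_ADD (loop_mask_7, inv_1, inv_2, inv_3);
```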

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/110221
* tree-vect-slp.cc (vect_schedule_slp_node): When loop
masking / len is applied make sure to not schedule
	internal defs outside of the loop.

* gfortran.dg/pr110221.f: New testcase.
---
  gcc/testsuite/gfortran.dg/pr110221.f | 17 +
  gcc/tree-vect-slp.cc | 10 ++
  2 files changed, 27 insertions(+)
  create mode 100644 gcc/testsuite/gfortran.dg/pr110221.f

diff --git a/gcc/testsuite/gfortran.dg/pr110221.f 
b/gcc/testsuite/gfortran.dg/pr110221.f
new file mode 100644
index 000..8b57384313a
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pr110221.f
@@ -0,0 +1,17 @@
+C PR middle-end/68146
+C { dg-do compile }
+C { dg-options "-O2 -w" }
+C { dg-additional-options "-mavx512f --param vect-partial-vector-usage=2" { 
target avx512f } }
+  SUBROUTINE CJYVB(V,Z,V0,CBJ,CDJ,CBY,CYY)
+  IMPLICIT DOUBLE PRECISION (A,B,G,O-Y)
+  IMPLICIT COMPLEX*16 (C,Z)
+  DIMENSION CBJ(0:*),CDJ(0:*),CBY(0:*)
+  N=INT(V)
+  CALL GAMMA2(VG,GA)
+  DO 65 K=1,N
+CBY(K)=CYY
+65CONTINUE
+  CDJ(0)=V0/Z*CBJ(0)-CBJ(1)
+  DO 70 K=1,N
+70  CDJ(K)=-(K+V0)/Z*CBJ(K)+CBJ(K-1)
+  END
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3e5814c3a31..80e279d8f50 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -9081,6 +9081,16 @@ vect_schedule_slp_node (vec_info *vinfo,
/* Emit other stmts after the children vectorized defs which is
 earliest possible.  */
gimple *last_stmt = NULL;
+  if (auto loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+   if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+   || LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+ {
+   /* But avoid scheduling internal defs outside of the loop when
+  we might have only implicitly tracked loop mask/len defs.  */
+   gimple_stmt_iterator si
+ = gsi_after_labels (LOOP_VINFO_LOOP (loop_vinfo)->header);
+   last_stmt = *si;
+ }
bool seen_vector_def = false;
FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)


Re: [PATCH 1/3] vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE

2024-02-28 Thread Andre Vieira (lists)



On 27/02/2024 08:47, Richard Biener wrote:

On Mon, 26 Feb 2024, Andre Vieira (lists) wrote:




On 05/02/2024 09:56, Richard Biener wrote:

On Thu, 1 Feb 2024, Andre Vieira (lists) wrote:




On 01/02/2024 07:19, Richard Biener wrote:

On Wed, 31 Jan 2024, Andre Vieira (lists) wrote:


The patch didn't come with a testcase so it's really hard to tell
what goes wrong now and how it is fixed ...


My bad! I had a testcase locally but never added it...

However... now I look at it and ran it past Richard S, the codegen isn't
'wrong', but it does have the potential to lead to some pretty slow
codegen,
especially for inbranch simdclones where it transforms the SVE predicate
into
an Advanced SIMD vector by inserting the elements one at a time...

An example of which can be seen if you do:

gcc -O3 -march=armv8-a+sve -msve-vector-bits=128  -fopenmp-simd t.c -S

with the following t.c:
#pragma omp declare simd simdlen(4) inbranch
int __attribute__ ((const)) fn5(int);

void fn4 (int *a, int *b, int n)
{
  for (int i = 0; i < n; ++i)
  b[i] = fn5(a[i]);
}

Now I do have to say, for our main usecase of libmvec we won't have any
'inbranch' Advanced SIMD clones, so we avoid that issue... But of course
that
doesn't mean user-code will.


It seems to use SVE masks with vector(4)  and the
ABI says the mask is vector(4) int.  You say that's because we choose
a Adv SIMD clone for the SVE VLS vector code (it calls _ZGVnM4v_fn5).

The vectorizer creates

_44 = VEC_COND_EXPR ;

and then vector lowering decomposes this.  That means the vectorizer
lacks a check that the target handles this VEC_COND_EXPR.

Of course I would expect that SVE with VLS vectors is able to
code generate this operation, so it's missing patterns in the end.

Richard.



What should we do for GCC-14? Going forward I think the right thing to do is
to add these patterns. But I am not even going to try to do that right now and
even though we can codegen for this, the result doesn't feel like it would
ever be profitable which means I'd rather not vectorize, or well pick a
different vector mode if possible.

This would be achieved with the change to the targethook. If I change the hook
to take modes, using STMT_VINFO_VECTYPE (stmt_vinfo), is that OK for now?


Passing in a mode is OK.  I'm still not fully understanding why the
clone isn't fully specifying 'mode' and if it does not why the
vectorizer itself can not disregard it.



We could check that the modes of the parameters & return type are the 
same as the vector operands & result in the vectorizer. But then we'd 
also want to make sure we don't reject cases where we have simdclones 
with compatible modes, aka same element type, but a multiple element 
count.  Which is where we'd get in trouble again I think, because we'd 
want to accept V8SI -> 2x V4SI, but not V8SI -> 2x VNx4SI (with VLS and 
aarch64_sve_vg = 2), not because it's invalid, but because right now the 
codegen is bad. And it's easier to do this in the targethook, which we 
can technically also use to 'rank' simdclones by setting a 
target_badness value, so in the future we could decide to assign some 
'badness' to influence the rank a SVE simdclone for Advanced SIMD loops 
vs an Advanced SIMD clone for Advanced SIMD loops.
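
To make that concrete, here is a minimal sketch of what a mode-aware hook
can now reject (aarch64_sve_mode_p, TARGET_SIMD and cgraph_simd_clone are
real GCC entities, but the body below is my illustration, not the
committed implementation):

```
static int
aarch64_simd_clone_usable (struct cgraph_node *node, machine_mode vector_mode)
{
  switch (node->simdclone->vecsize_mangle)
    {
    case 'n':
      if (!TARGET_SIMD)
	return -1;
      /* Reject Advanced SIMD clones when the loop is being vectorized
	 with SVE modes: converting between the two predicate/vector
	 representations is currently too slow to be worthwhile.  */
      if (aarch64_sve_mode_p (vector_mode))
	return -1;
      return 0;
    default:
      gcc_unreachable ();
    }
}
```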


This does touch another issue of simdclone costing, which is a larger 
issue in general and one we (arm) might want to approach in the future. 
It's a complex issue, because the vectorizer doesn't know the 
performance impact of a simdclone, we assume (as we should) that its 
faster than the original scalar, though we currently don't record costs 
for either, but we don't know by how much or how much impact it has, so 
the vectorizer can't reason whether it's beneficial to use a simdclone 
if it has to do a lot of operand preparation, we can merely tell it to 
use it, or not and all the other operations in the loop will determine 
costing.




 From the past discussion I understood the existing situation isn't
as bad as initially thought and no bad things happen right now?
Nope, I thought the compiler would fall apart, but it seems to be able 
to transform the operands from one mode into the other, so without the 
targethook it just generates slower loops in certain cases, which we'd 
rather avoid given the usecase for simdclones is to speed things up ;)



Attached reworked patch.


This patch adds a machine_mode argument to TARGET_SIMD_CLONE_USABLE to 
make sure the target can reject a simd_clone based on the vector mode it 
is using.  This is needed because for VLS SVE vectorization the 
vectorizer accepts Advanced SIMD simd clones when vectorizing using SVE 
types because the simdlens might match, this currently leads to 
suboptimal codegen.


Other targets do not currently need to use this argument.

gcc/ChangeLog:

* target.def (TARGET_SIMD_CLONE_USABLE): Add argument.
* tree-vect-stmts.cc (vectorizable_simd_clone_call): Pass vector_mod

[PATCH v6 5/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-27 Thread Andre Vieira

This patch adds support for MVE Tail-Predicated Low Overhead Loops by using the
doloop functionality added to support predicated vectorized hardware loops.

gcc/ChangeLog:

* config/arm/arm-protos.h (arm_target_bb_ok_for_lob): Change
declaration to pass basic_block.
(arm_attempt_dlstp_transform): New declaration.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): Define targethook.
(TARGET_PREDICT_DOLOOP_P): Likewise.
(arm_target_bb_ok_for_lob): Adapt condition.
(arm_mve_get_vctp_lanes): New function.
(arm_dl_usage_type): New internal enum.
(arm_get_required_vpr_reg): New function.
(arm_get_required_vpr_reg_param): New function.
(arm_get_required_vpr_reg_ret_val): New function.
(arm_mve_get_loop_vctp): New function.
(arm_mve_insn_predicated_by): New function.
(arm_mve_across_lane_insn_p): New function.
(arm_mve_load_store_insn_p): New function.
(arm_mve_impl_pred_on_outputs_p): New function.
(arm_mve_impl_pred_on_inputs_p): New function.
(arm_last_vect_def_insn): New function.
(arm_mve_impl_predicated_p): New function.
(arm_mve_check_reg_origin_is_num_elems): New function.
(arm_mve_dlstp_check_inc_counter): New function.
(arm_mve_dlstp_check_dec_counter): New function.
(arm_mve_loop_valid_for_dlstp): New function.
(arm_predict_doloop_p): New function.
(arm_loop_unroll_adjust): New function.
(arm_emit_mve_unpredicated_insn_to_seq): New function.
(arm_attempt_dlstp_transform): New function.
* config/arm/arm.opt (mdlstp): New option.
	* config/arm/iterators.md (dlstp_elemsize, letp_num_lanes,
letp_num_lanes_neg, letp_num_lanes_minus_1): New attributes.
(DLSTP, LETP): New iterators.
(predicated_doloop_end_internal): New pattern.
(dlstp_insn): New pattern.
* config/arm/thumb2.md (doloop_end): Adapt to support tail-predicated
loops.
(doloop_begin): Likewise.
* config/arm/types.md (mve_misc): New mve type to represent
predicated_loop_end insn sequences.
* config/arm/unspecs.md:
(DLSTP8, DLSTP16, DLSTP32, DSLTP64,
LETP8, LETP16, LETP32, LETP64): New unspecs for DLSTP and LETP.

gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Add new helpers.
* gcc.target/arm/lob1.c: Use new helpers.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/dlstp-compile-asm-1.c: New test.
* gcc.target/arm/dlstp-compile-asm-2.c: New test.
* gcc.target/arm/dlstp-compile-asm-3.c: New test.
* gcc.target/arm/dlstp-int8x16.c: New test.
* gcc.target/arm/dlstp-int8x16-run.c: New test.
* gcc.target/arm/dlstp-int16x8.c: New test.
* gcc.target/arm/dlstp-int16x8-run.c: New test.
* gcc.target/arm/dlstp-int32x4.c: New test.
* gcc.target/arm/dlstp-int32x4-run.c: New test.
* gcc.target/arm/dlstp-int64x2.c: New test.
* gcc.target/arm/dlstp-int64x2-run.c: New test.
* gcc.target/arm/dlstp-invalid-asm.c: New test.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/config/arm/arm-protos.h   |4 +-
 gcc/config/arm/arm.cc | 1249 -
 gcc/config/arm/arm.opt|3 +
 gcc/config/arm/iterators.md   |   15 +
 gcc/config/arm/mve.md |   50 +
 gcc/config/arm/thumb2.md  |  138 +-
 gcc/config/arm/types.md   |6 +-
 gcc/config/arm/unspecs.md |   14 +-
 gcc/testsuite/gcc.target/arm/lob.h|  128 +-
 gcc/testsuite/gcc.target/arm/lob1.c   |   23 +-
 gcc/testsuite/gcc.target/arm/lob6.c   |8 +-
 .../gcc.target/arm/mve/dlstp-compile-asm-1.c  |  146 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-2.c  |  749 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-3.c  |   46 +
 .../gcc.target/arm/mve/dlstp-int16x8-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int16x8.c|   31 +
 .../gcc.target/arm/mve/dlstp-int32x4-run.c|   45 +
 .../gcc.target/arm/mve/dlstp-int32x4.c|   31 +
 .../gcc.target/arm/mve/dlstp-int64x2-run.c|   48 +
 .../gcc.target/arm/mve/dlstp-int64x2.c|   28 +
 .../gcc.target/arm/mve/dlstp-int8x16-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int8x16.c|   32 +
 .../gcc.target/arm/mve/dlstp-invalid-asm.c|  521 +++
 23 files changed, 3321 insertions(+), 82 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-3.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
 create mode 100644 

[PATCH v6 4/5] doloop: Add support for predicated vectorized loops

2024-02-27 Thread Andre Vieira

This patch adds support in the target agnostic doloop pass for the detection of
predicated vectorized hardware loops.  Arm is currently the only target that
will make use of this feature.
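
As a usage sketch of the new df helper (the helper itself is from this
patch; the caller and analyze_insn below are made up for illustration):

```
df_ref def = df_bb_regno_only_def_find (body, REGNO (reg));
if (def)
  /* REG is set exactly once in BODY, so the value it holds there is
     well defined and can be traced back through its defining insn.  */
  analyze_insn (DF_REF_INSN (def));
```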

gcc/ChangeLog:

* df-core.cc (df_bb_regno_only_def_find): New helper function.
* df.h (df_bb_regno_only_def_find): Declare new function.
* loop-doloop.cc (doloop_condition_get): Add support for detecting
predicated vectorized hardware loops.
(doloop_modify): Add support for GTU condition checks.
(doloop_optimize): Update costing computation to support alterations to
desc->niter_expr by the backend.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/df-core.cc |  15 +
 gcc/df.h   |   1 +
 gcc/loop-doloop.cc | 164 +++--
 3 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index f0eb4c93957..b0e8a88d433 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+return temp;
+  else
+return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 84e5aa8b524..c4e690b40cf 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 529e810e530..8953e1de960 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
  forms:
 
  1)  (parallel [(set (pc) (if_then_else (condition)
-	  			(label_ref (label))
-(pc)))
-	 (set (reg) (plus (reg) (const_int -1)))
-	 (additional clobbers and uses)])
+	(label_ref (label))
+	(pc)))
+		 (set (reg) (plus (reg) (const_int -1)))
+		 (additional clobbers and uses)])
 
  The branch must be the first entry of the parallel (also required
  by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,33 @@ doloop_condition_get (rtx_insn *doloop_pat)
  the loop counter in an if_then_else too.
 
  2)  (set (reg) (plus (reg) (const_int -1))
- (set (pc) (if_then_else (reg != 0)
-	 (label_ref (label))
-			 (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+ (label_ref (label))
+ (pc))).
 
- Some targets (ARM) do the comparison before the branch, as in the
+ 3) Some targets (Arm) do the comparison before the branch, as in the
  following form:
 
- 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-   (set (reg) (plus (reg) (const_int -1)))])
-(set (pc) (if_then_else (cc == NE)
-(label_ref (label))
-(pc))) */
-
+ (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		(set (reg) (plus (reg) (const_int -1)))])
+ (set (pc) (if_then_else (cc == NE)
+			 (label_ref (label))
+			 (pc)))
+
+  4) This form supports a construct that is used to represent a vectorized
+  do loop with predication, however we do not need to care about the
+  details of the predication here.
+  Arm uses this construct to support MVE tail predication.
+
+  (parallel
+   [(set (pc)
+	 (if_then_else (gtu (plus (reg) (const_int -n))
+(const_int n-1))
+			   (label_ref)
+			   (pc)))
+	(set (reg) (plus (reg) (const_int -n)))
+	(additional clobbers and uses)])
+ */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -173,15 +187,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
-  || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !rtx_equal_p (XEXP (inc_src, 0), reg)
+  || 

[PATCH v6 1/5] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns

2024-02-27 Thread Andre Vieira

This patch adds an attribute to the mve md patterns to be able to identify
predicable MVE instructions and what their predicated and unpredicated variants
are.  This attribute is used to encode the icode of the unpredicated variant of
an instruction in its predicated variant.

This will make it possible for us to transform VPT-predicated insns in
the insn chain into their unpredicated equivalents when transforming the loop
into an MVE Tail-Predicated Low Overhead Loop. For example:
`mve_vldrbq_z_<supf><mode> -> mve_vldrbq_<supf><mode>`.
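
As a sketch of how the attribute is consumed later in this series (the
macro and attribute names are from these patches; the surrounding code is
a simplification, not the committed implementation):

```
if (MVE_VPT_PREDICABLE_INSN_P (insn))
  {
    /* The attribute stores the icode of the unpredicated twin.  */
    int icode = get_attr_mve_unpredicated_insn (insn);
    /* ... rebuild the operand list without the VPR predicate and emit
       the unpredicated form via GEN_FCN (icode).  */
  }
```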

gcc/ChangeLog:

* config/arm/arm.md (mve_unpredicated_insn): New attribute.
* config/arm/arm.h (MVE_VPT_PREDICATED_INSN_P): New define.
(MVE_VPT_UNPREDICATED_INSN_P): Likewise.
(MVE_VPT_PREDICABLE_INSN_P): Likewise.
* config/arm/vec-common.md (mve_vshlq_): Add attribute.
* config/arm/mve.md (arm_vcx1q_p_v16qi): Add attribute.
(arm_vcx1qv16qi): Likewise.
(arm_vcx1qav16qi): Likewise.
(arm_vcx1qv16qi): Likewise.
(arm_vcx2q_p_v16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx2qav16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx3q_p_v16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(arm_vcx3qav16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(@mve_q_): Likewise.
(@mve_q_int_): Likewise.
(@mve_q_v4si): Likewise.
(@mve_q_n_): Likewise.
(@mve_q_r_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_n_): Likewise.
(@mve_q_m_r_): Likewise.
(@mve_q_m_f): Likewise.
(@mve_q_int_m_): Likewise.
(@mve_q_p_v4si): Likewise.
(@mve_q_p_): Likewise.
(@mve_q_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_f): Likewise.
(mve_vq_f): Likewise.
(mve_q): Likewise.
(mve_q_f): Likewise.
(mve_vadciq_v4si): Likewise.
(mve_vadciq_m_v4si): Likewise.
(mve_vadcq_v4si): Likewise.
(mve_vadcq_m_v4si): Likewise.
(mve_vandq_): Likewise.
(mve_vandq_f): Likewise.
(mve_vandq_m_): Likewise.
(mve_vandq_m_f): Likewise.
(mve_vandq_s): Likewise.
(mve_vandq_u): Likewise.
(mve_vbicq_): Likewise.
(mve_vbicq_f): Likewise.
(mve_vbicq_m_): Likewise.
(mve_vbicq_m_f): Likewise.
(mve_vbicq_m_n_): Likewise.
(mve_vbicq_n_): Likewise.
(mve_vbicq_s): Likewise.
(mve_vbicq_u): Likewise.
(@mve_vclzq_s): Likewise.
(mve_vclzq_u): Likewise.
(@mve_vcmp_q_): Likewise.
(@mve_vcmp_q_n_): Likewise.
(@mve_vcmp_q_f): Likewise.
(@mve_vcmp_q_n_f): Likewise.
(@mve_vcmp_q_m_f): Likewise.
(@mve_vcmp_q_m_n_): Likewise.
(@mve_vcmp_q_m_): Likewise.
(@mve_vcmp_q_m_n_f): Likewise.
(mve_vctpq): Likewise.
(mve_vctpq_m): Likewise.
(mve_vcvtaq_): Likewise.
(mve_vcvtaq_m_): Likewise.
(mve_vcvtbq_f16_f32v8hf): Likewise.
(mve_vcvtbq_f32_f16v4sf): Likewise.
(mve_vcvtbq_m_f16_f32v8hf): Likewise.
(mve_vcvtbq_m_f32_f16v4sf): Likewise.
(mve_vcvtmq_): Likewise.
(mve_vcvtmq_m_): Likewise.
(mve_vcvtnq_): Likewise.
(mve_vcvtnq_m_): Likewise.
(mve_vcvtpq_): Likewise.
(mve_vcvtpq_m_): Likewise.
(mve_vcvtq_from_f_): Likewise.
(mve_vcvtq_m_from_f_): Likewise.
(mve_vcvtq_m_n_from_f_): Likewise.
(mve_vcvtq_m_n_to_f_): Likewise.
(mve_vcvtq_m_to_f_): Likewise.
(mve_vcvtq_n_from_f_): Likewise.
(mve_vcvtq_n_to_f_): Likewise.
(mve_vcvtq_to_f_): Likewise.
(mve_vcvttq_f16_f32v8hf): Likewise.
(mve_vcvttq_f32_f16v4sf): Likewise.
(mve_vcvttq_m_f16_f32v8hf): Likewise.
(mve_vcvttq_m_f32_f16v4sf): Likewise.
(mve_vdwdupq_m_wb_u_insn): Likewise.
(mve_vdwdupq_wb_u_insn): Likewise.
	(mve_veorq_s<mode>): Likewise.
	(mve_veorq_u<mode>): Likewise.
(mve_veorq_f): Likewise.
(mve_vidupq_m_wb_u_insn): Likewise.
(mve_vidupq_u_insn): Likewise.
(mve_viwdupq_m_wb_u_insn): Likewise.
(mve_viwdupq_wb_u_insn): Likewise.
(mve_vldrbq_): Likewise.
(mve_vldrbq_gather_offset_): Likewise.
(mve_vldrbq_gather_offset_z_): Likewise.
(mve_vldrbq_z_): Likewise.
(mve_vldrdq_gather_base_v2di): Likewise.
(mve_vldrdq_gather_base_wb_v2di_insn): Likewise.
(mve_vldrdq_gather_base_wb_z_v2di_insn): Likewise.
(mve_vldrdq_gather_base_z_v2di): Likewise.
(mve_vldrdq_gather_offset_v2di): Likewise.
(mve_vldrdq_gather_offset_z_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_z_v2di): Likewise.
(mve_vldrhq_): Likewise.
(mve_vldrhq_fv8hf): Likewise.
(mve_vldrhq_gather_offset_): 

[PATCH v6 2/5] arm: Annotate instructions with mve_safe_imp_xlane_pred

2024-02-27 Thread Andre Vieira

This patch annotates some MVE across lane instructions with a new attribute.
We use this attribute to let the compiler know that these instructions can be
safely implicitly predicated when tail predicating if their operands are
guaranteed to have zeroed tail predicated lanes.  These instructions were
selected because having the value 0 in those lanes or 'tail-predicating' those
lanes has the same effect.
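
For instance (my own illustration of the reasoning, not code from the
patch), for an across-lane reduction such as VADDV:

```
/* vaddv over {a, b, 0, 0}                  computes a + b
   vaddvt with predicate {1, 1, 0, 0}
	  over {a, b, x, y}                 also computes a + b
   so once the tail lanes are known to be zero, the unpredicated form
   is interchangeable with the tail-predicated one.  */
```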

gcc/ChangeLog:

* config/arm/arm.md (mve_safe_imp_xlane_pred): New attribute.
* config/arm/iterators.md (mve_vmaxmin_safe_imp): New iterator
attribute.
* config/arm/mve.md (vaddvq_s, vaddvq_u, vaddlvq_s, vaddlvq_u,
vaddvaq_s, vaddvaq_u, vmaxavq_s, vmaxvq_u, vmladavq_s, vmladavq_u,
vmladavxq_s, vmlsdavq_s, vmlsdavxq_s, vaddlvaq_s, vaddlvaq_u,
vmlaldavq_u, vmlaldavq_s, vmlaldavq_u, vmlaldavxq_s, vmlsldavq_s,
vmlsldavxq_s, vrmlaldavhq_u, vrmlaldavhq_s, vrmlaldavhxq_s,
vrmlsldavhq_s, vrmlsldavhxq_s, vrmlaldavhaq_s, vrmlaldavhaq_u,
vrmlaldavhaxq_s, vrmlsldavhaq_s, vrmlsldavhaxq_s, vabavq_s, vabavq_u,
vmladavaq_u, vmladavaq_s, vmladavaxq_s, vmlsdavaq_s, vmlsdavaxq_s,
vmlaldavaq_s, vmlaldavaq_u, vmlaldavaxq_s, vmlsldavaq_s,
vmlsldavaxq_s): Added mve_safe_imp_xlane_pred.
---
 gcc/config/arm/arm.md   |  6 ++
 gcc/config/arm/iterators.md |  8 
 gcc/config/arm/mve.md   | 12 
 3 files changed, 26 insertions(+)

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 81290e83818..814e871acea 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -130,6 +130,12 @@ (define_attr "predicated" "yes,no" (const_string "no"))
 ; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
 
+; An attribute used by the loop-doloop pass when determining whether it is
+; safe to predicate a MVE instruction, that operates across lanes, and was
+; previously not predicated.  The pass will still check whether all inputs
+; are predicated by the VCTP predication mask.
+(define_attr "mve_safe_imp_xlane_pred" "yes,no" (const_string "no"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 7600bf62531..22b3ddf5637 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -869,6 +869,14 @@ (define_code_attr mve_addsubmul [
 		 (plus "vadd")
 		 ])
 
+(define_int_attr mve_vmaxmin_safe_imp [
+		 (VMAXVQ_U "yes")
+		 (VMAXVQ_S "no")
+		 (VMAXAVQ_S "yes")
+		 (VMINVQ_U "no")
+		 (VMINVQ_S "no")
+		 (VMINAVQ_S "no")])
+
 (define_int_attr mve_cmp_op1 [
 		 (VCMPCSQ_M_U "cs")
 		 (VCMPEQQ_M_S "eq") (VCMPEQQ_M_U "eq")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 8aa0bded7f0..d7bdcd862f8 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -393,6 +393,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -529,6 +530,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -802,6 +804,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1014,6 +1017,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "")
   (set_attr "type" "mve_move")
 ])
 
@@ -1033,6 +1037,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1219,6 +1224,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1450,6 +1456,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1588,6 +1595,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1725,6 +1733,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2, %q3"
  [(set 

[PATCH v6 3/5] arm: Fix a wrong attribute use and remove unused unspecs and iterators

2024-02-27 Thread Andre Vieira

This patch fixes the erroneous use of a mode attribute without a mode iterator
in the pattern and removes unused unspecs and iterators.

gcc/ChangeLog:

* config/arm/iterators.md (supf): Remove VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U cases.
(VMLALDAVXQ): Remove iterator.
(VMLALDAVXQ_P): Likewise.
(VMLALDAVAXQ): Likewise.
* config/arm/mve.md (mve_vstrwq_p_fv4sf): Replace use of 
mode iterator attribute with V4BI mode.
* config/arm/unspecs.md (VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U): Remove unused unspecs.
---
 gcc/config/arm/iterators.md | 9 +++--
 gcc/config/arm/mve.md   | 2 +-
 gcc/config/arm/unspecs.md   | 3 ---
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 22b3ddf5637..3206bcab4cf 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2370,7 +2370,7 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VSUBQ_S "s") (VSUBQ_U "u") (VADDVAQ_S "s")
 		   (VADDVAQ_U "u") (VADDLVAQ_S "s") (VADDLVAQ_U "u")
 		   (VBICQ_N_S "s") (VBICQ_N_U "u") (VMLALDAVQ_U "u")
-		   (VMLALDAVQ_S "s") (VMLALDAVXQ_U "u") (VMLALDAVXQ_S "s")
+		   (VMLALDAVQ_S "s") (VMLALDAVXQ_S "s")
 		   (VMOVNBQ_U "u") (VMOVNBQ_S "s") (VMOVNTQ_U "u")
 		   (VMOVNTQ_S "s") (VORRQ_N_S "s") (VORRQ_N_U "u")
 		   (VQMOVNBQ_U "u") (VQMOVNBQ_S "s") (VQMOVNTQ_S "s")
@@ -2412,8 +2412,8 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_M_S "s") (VREV16Q_M_U "u")
 		   (VQRSHRNTQ_N_U "u") (VMOVNTQ_M_U "u") (VMOVLBQ_M_U "u")
 		   (VMLALDAVAQ_U "u") (VQSHRNBQ_N_U "u") (VSHRNBQ_N_U "u")
-		   (VRSHRNBQ_N_U "u") (VMLALDAVXQ_P_U "u")
-		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u") (VMLALDAVAXQ_U "u")
+		   (VRSHRNBQ_N_U "u")
+		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u")
 		   (VQMOVNTQ_M_U "u") (VSHRNTQ_N_U "u") (VCVTMQ_M_S "s")
 		   (VCVTMQ_M_U "u") (VCVTNQ_M_S "s") (VCVTNQ_M_U "u")
 		   (VCVTPQ_M_S "s") (VCVTPQ_M_U "u") (VADDLVAQ_P_S "s")
@@ -2762,7 +2762,6 @@ (define_int_iterator VSUBQ_N [VSUBQ_N_S VSUBQ_N_U])
 (define_int_iterator VADDLVAQ [VADDLVAQ_S VADDLVAQ_U])
 (define_int_iterator VBICQ_N [VBICQ_N_S VBICQ_N_U])
 (define_int_iterator VMLALDAVQ [VMLALDAVQ_U VMLALDAVQ_S])
-(define_int_iterator VMLALDAVXQ [VMLALDAVXQ_U VMLALDAVXQ_S])
 (define_int_iterator VMOVNBQ [VMOVNBQ_U VMOVNBQ_S])
 (define_int_iterator VMOVNTQ [VMOVNTQ_S VMOVNTQ_U])
 (define_int_iterator VORRQ_N [VORRQ_N_U VORRQ_N_S])
@@ -2817,11 +2816,9 @@ (define_int_iterator VMLALDAVAQ [VMLALDAVAQ_S VMLALDAVAQ_U])
 (define_int_iterator VQSHRNBQ_N [VQSHRNBQ_N_U VQSHRNBQ_N_S])
 (define_int_iterator VSHRNBQ_N [VSHRNBQ_N_U VSHRNBQ_N_S])
 (define_int_iterator VRSHRNBQ_N [VRSHRNBQ_N_S VRSHRNBQ_N_U])
-(define_int_iterator VMLALDAVXQ_P [VMLALDAVXQ_P_U VMLALDAVXQ_P_S])
 (define_int_iterator VQMOVNTQ_M [VQMOVNTQ_M_U VQMOVNTQ_M_S])
 (define_int_iterator VMVNQ_M_N [VMVNQ_M_N_U VMVNQ_M_N_S])
 (define_int_iterator VQSHRNTQ_N [VQSHRNTQ_N_U VQSHRNTQ_N_S])
-(define_int_iterator VMLALDAVAXQ [VMLALDAVAXQ_S VMLALDAVAXQ_U])
 (define_int_iterator VSHRNTQ_N [VSHRNTQ_N_S VSHRNTQ_N_U])
 (define_int_iterator VCVTMQ_M [VCVTMQ_M_S VCVTMQ_M_U])
 (define_int_iterator VCVTNQ_M [VCVTNQ_M_S VCVTNQ_M_U])
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index d7bdcd862f8..9fe51298cdc 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -4605,7 +4605,7 @@ (define_insn "mve_vstrwq_p_fv4sf"
   [(set (match_operand:V4SI 0 "mve_memory_operand" "=Ux")
 	(unspec:V4SI
 	 [(match_operand:V4SF 1 "s_register_operand" "w")
-	  (match_operand: 2 "vpr_register_operand" "Up")
+	  (match_operand:V4BI 2 "vpr_register_operand" "Up")
 	  (match_dup 0)]
 	 VSTRWQ_F))]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index b9db306c067..46ac8b37157 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -717,7 +717,6 @@ (define_c_enum "unspec" [
   VCVTBQ_F16_F32
   VCVTTQ_F16_F32
   VMLALDAVQ_U
-  VMLALDAVXQ_U
   VMLALDAVXQ_S
   VMLALDAVQ_S
   VMLSLDAVQ_S
@@ -934,7 +933,6 @@ (define_c_enum "unspec" [
   VSHRNBQ_N_S
   VRSHRNBQ_N_S
   VRSHRNBQ_N_U
-  VMLALDAVXQ_P_U
   VMLALDAVXQ_P_S
   VQMOVNTQ_M_U
   VQMOVNTQ_M_S
@@ -943,7 +941,6 @@ (define_c_enum "unspec" [
   VQSHRNTQ_N_U
   VQSHRNTQ_N_S
   VMLALDAVAXQ_S
-  VMLALDAVAXQ_U
   VSHRNTQ_N_S
   VSHRNTQ_N_U
   VCVTBQ_M_F16_F32


[PATCH v5 0/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-27 Thread Andre Vieira
Hi,

Re-ordered patches, our latest plan is to only commit patches 1-3, and leave
4-5 for GCC 15, as we believe it is too late in Stage 4 to be making changes to
target agnostic parts, especially since these affect so many ports that we can
not easily test.

[1/5] arm: Add define_attr to create a mapping between MVE predicated and 
unpredicated insns
[2/5] arm: Annotate instructions with mve_safe_imp_xlane_pred
[3/5] arm: Fix a wrong attribute use and remove unused unspecs and iterators
[4/5] doloop: Add support for predicated vectorized loops
[5/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

Original cover letter:
This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop
feature.

The M-class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
Section B5.5.1 "Loop tail predication" describes the feature
we are adding support for with this patch (although
we only add codegen for DLSTP/LETP instruction loops).

Previously with commit d2ed233cb94 we'd added support for
non-MVE DLS/LE loops through the loop-doloop pass, which, given
a standard MVE loop like:

```
void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
{
  while (n > 0)
{
  mve_pred16_t p = vctp16q (n);
  int16x8_t va = vldrhq_z_s16 (a, p);
  int16x8_t vb = vldrhq_z_s16 (b, p);
  int16x8_t vc = vaddq_x_s16 (va, vb, p);
  vstrhq_p_s16 (c, vc, p);
  c+=8;
  a+=8;
  b+=8;
  n-=8;
}
}
```
.. would output:

```

	dls	lr, lr
.L3:
	vctp.16	r3
	vmrs	ip, P0	@ movhi
	sxth	ip, ip
	vmsr	P0, ip	@ movhi
	mov	r4, r0
	vpst
	vldrht.16	q2, [r4]
	mov	r4, r1
	vmov	q3, q0
	vpst
	vldrht.16	q1, [r4]
	mov	r4, r2
	vpst
	vaddt.i16	q3, q2, q1
	subs	r3, r3, #8
	vpst
	vstrht.16	q3, [r4]
	adds	r0, r0, #16
	adds	r1, r1, #16
	adds	r2, r2, #16
	le	lr, .L3
```

where the LE instruction will decrement LR by 1, compare and
branch if needed.

(there are also other inefficiencies with the above code, like the
pointless vmrs/sxth/vmsr on the VPR and the adds not being merged
into the vldrht/vstrht as #16 offsets, and some random movs!
But those are different problems...)

The MVE version is similar, except that:
* Instead of DLS/LE the instructions are DLSTP/LETP.
* Instead of pre-calculating the number of iterations of the
  loop, we place the number of elements to be processed by the
  loop into LR.
* Instead of decrementing the LR by one, LETP will decrement it
  by FPSCR.LTPSIZE, which is the number of elements being
  processed in each iteration: 16 for 8-bit elements, 8 for 16-bit
  elements, etc. (a worked example follows this list).
* On the final iteration, automatic Loop Tail Predication is
  performed, as if the instructions within the loop had been VPT
  predicated with a VCTP generating the VPR predicate in every
  loop iteration.
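
For example (illustrative arithmetic, not tool output): with 16-bit
elements LTPSIZE is 8, so starting from LR = 20 elements the loop body
runs three times, LETP decrementing LR 20 -> 12 -> 4; the final
iteration executes with only 4 active lanes, the remaining 4 being
tail-predicated off.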

The dlstp/letp loop now looks like:

```

	dlstp.16	lr, r3
.L14:
	mov	r3, r0
	vldrh.16	q3, [r3]
	mov	r3, r1
	vldrh.16	q2, [r3]
	mov	r3, r2
	vadd.i16	q3, q3, q2
	adds	r0, r0, #16
	vstrh.16	q3, [r3]
	adds	r1, r1, #16
	adds	r2, r2, #16
	letp	lr, .L14

```

Since the loop tail predication is automatic, we have eliminated
the VCTP that had been specified by the user in the intrinsic
and converted the VPT-predicated instructions into their
unpredicated equivalents (which also saves us from VPST insns).

The LE instruction here decrements LR by 8 in each iteration.

Stam Markianos-Wright (1):
  arm: Add define_attr to create a mapping between MVE predicated and
unpredicated insns

Andre Vieira (4):
  arm: Annotate instructions with mve_safe_imp_xlane_pred
  arm: Fix a wrong attribute use and remove unused unspecs and iterators
  doloop: Add support for predicated vectorized loops
  arm: Add support for MVE Tail-Predicated Low Overhead Loops


-- 
2.17.1


Re: [PATCH 2/2] aarch64: Add support for _BitInt

2024-02-27 Thread Andre Vieira (lists)

Hey,

Dropped the first patch and dealt with the comments above, hopefully I 
didn't miss any this time.


--

This patch adds support for C23's _BitInt for the AArch64 port when compiling
for little endianness.  Big Endianness requires further target-agnostic
support and we therefore disable it for now.

gcc/ChangeLog:

* config/aarch64/aarch64.cc (TARGET_C_BITINT_TYPE_INFO): Declare MACRO.
(aarch64_bitint_type_info): New function.
(aarch64_return_in_memory_1): Return large _BitInt's in memory.
(aarch64_function_arg_alignment): Adapt to correctly return the ABI
mandated alignment of _BitInt(N) where N > 128 as the alignment of
TImode.
(aarch64_composite_type_p): Return true for _BitInt(N), where N > 128.

libgcc/ChangeLog:

* config/aarch64/t-softfp (softfp_extras): Add floatbitinthf,
floatbitintbf, floatbitinttf and fixtfbitint.
* config/aarch64/libgcc-softfp.ver (GCC_14.0.0): Add __floatbitinthf,
__floatbitintbf, __floatbitinttf and __fixtfbitint.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/bitint-alignments.c: New test.
* gcc.target/aarch64/bitint-args.c: New test.
* gcc.target/aarch64/bitint-sizes.c: New test.


On 02/02/2024 14:46, Jakub Jelinek wrote:

On Thu, Jan 25, 2024 at 05:45:01PM +, Andre Vieira wrote:

This patch adds support for C23's _BitInt for the AArch64 port when compiling
for little endianness.  Big Endianness requires further target-agnostic
support and we therefore disable it for now.

gcc/ChangeLog:

* config/aarch64/aarch64.cc (TARGET_C_BITINT_TYPE_INFO): Declare MACRO.
(aarch64_bitint_type_info): New function.
(aarch64_return_in_memory_1): Return large _BitInt's in memory.
(aarch64_function_arg_alignment): Adapt to correctly return the ABI
mandated alignment of _BitInt(N) where N > 128 as the alignment of
TImode.
(aarch64_composite_type_p): Return true for _BitInt(N), where N > 128.

libgcc/ChangeLog:

* config/aarch64/t-softfp: Add fixtfbitint, floatbitinttf and
floatbitinthf to the softfp_extras variable to ensure the
runtime support is available for _BitInt.


I think this lacks some config/aarch64/t-whatever.ver
additions.
See PR113700 for some more details.
We want the support routines for binary floating point <-> _BitInt
conversions in both libgcc.a and libgcc_s.so.1 and exported from the latter
too at GCC_14.0.0 symver, while decimal floating point <-> _BitInt solely in
libgcc.a (as with all the huge dfp/bid stuff).

Jakub
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 16318bf925883ecedf9345e53fc0824a553b2747..9bd8d22f6edd9f6c77907ec383f9e8bf055cfb8b 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6583,6 +6583,7 @@ aarch64_return_in_memory_1 (const_tree type)
   int count;
 
   if (!AGGREGATE_TYPE_P (type)
+  && TREE_CODE (type) != BITINT_TYPE
   && TREE_CODE (type) != COMPLEX_TYPE
   && TREE_CODE (type) != VECTOR_TYPE)
 /* Simple scalar types always returned in registers.  */
@@ -21895,6 +21896,11 @@ aarch64_composite_type_p (const_tree type,
   if (type && (AGGREGATE_TYPE_P (type) || TREE_CODE (type) == COMPLEX_TYPE))
 return true;
 
+  if (type
+  && TREE_CODE (type) == BITINT_TYPE
+  && int_size_in_bytes (type) > 16)
+return true;
+
   if (mode == BLKmode
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_INT)
@@ -28400,6 +28406,42 @@ aarch64_excess_precision (enum excess_precision_type type)
   return FLT_EVAL_METHOD_UNPREDICTABLE;
 }
 
+/* Implement TARGET_C_BITINT_TYPE_INFO.
+   Return true if _BitInt(N) is supported and fill its details into *INFO.  */
+bool
+aarch64_bitint_type_info (int n, struct bitint_info *info)
+{
+  if (TARGET_BIG_END)
+return false;
+
+  if (n <= 8)
+info->limb_mode = QImode;
+  else if (n <= 16)
+info->limb_mode = HImode;
+  else if (n <= 32)
+info->limb_mode = SImode;
+  else if (n <= 64)
+info->limb_mode = DImode;
+  else if (n <= 128)
+info->limb_mode = TImode;
+  else
+/* The AAPCS for AArch64 defines _BitInt(N > 128) as an array with
+   type {signed,unsigned} __int128[M] where M*128 >= N.  However, to be
+   able to use libgcc's implementation to support large _BitInt's we need
+   to use a LIMB_MODE that is no larger than 'long long'.  This is why we
+   use DImode for our internal LIMB_MODE and we define the ABI_LIMB_MODE to
+   be TImode to ensure we are ABI compliant.  */
+info->limb_mode = DImode;
+
+  if (n > 128)
+info->abi_limb_mode = TImode;
+  else
+info->abi_limb_mode = info->limb_mode;
+  info->big_endian = TARGET_BIG_END;
+  info->extended = 

Re: [PATCH 1/3] vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE

2024-02-26 Thread Andre Vieira (lists)




On 05/02/2024 09:56, Richard Biener wrote:

On Thu, 1 Feb 2024, Andre Vieira (lists) wrote:




On 01/02/2024 07:19, Richard Biener wrote:

On Wed, 31 Jan 2024, Andre Vieira (lists) wrote:


The patch didn't come with a testcase so it's really hard to tell
what goes wrong now and how it is fixed ...


My bad! I had a testcase locally but never added it...

However... now I look at it and ran it past Richard S, the codegen isn't
'wrong', but it does have the potential to lead to some pretty slow codegen,
especially for inbranch simdclones where it transforms the SVE predicate into
an Advanced SIMD vector by inserting the elements one at a time...

An example of which can be seen if you do:

gcc -O3 -march=armv8-a+sve -msve-vector-bits=128  -fopenmp-simd t.c -S

with the following t.c:
#pragma omp declare simd simdlen(4) inbranch
int __attribute__ ((const)) fn5(int);

void fn4 (int *a, int *b, int n)
{
 for (int i = 0; i < n; ++i)
 b[i] = fn5(a[i]);
}

Now I do have to say, for our main usecase of libmvec we won't have any
'inbranch' Advanced SIMD clones, so we avoid that issue... But of course that
doesn't mean user-code will.


It seems to use SVE masks with vector(4)  and the
ABI says the mask is vector(4) int.  You say that's because we choose
a Adv SIMD clone for the SVE VLS vector code (it calls _ZGVnM4v_fn5).

The vectorizer creates

   _44 = VEC_COND_EXPR ;

and then vector lowering decomposes this.  That means the vectorizer
lacks a check that the target handles this VEC_COND_EXPR.

Of course I would expect that SVE with VLS vectors is able to
code generate this operation, so it's missing patterns in the end.

Richard.



What should we do for GCC-14? Going forward I think the right thing to 
do is to add these patterns. But I am not even going to try to do that 
right now and even though we can codegen for this, the result doesn't 
feel like it would ever be profitable which means I'd rather not 
vectorize, or well pick a different vector mode if possible.


This would be achieved with the change to the targethook. If I change 
the hook to take modes, using STMT_VINFO_VECTYPE (stmt_vinfo), is that 
OK for now?


Kind regards,
Andre


[PATCH v5 5/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-22 Thread Andre Vieira

This patch adds support for MVE Tail-Predicated Low Overhead Loops by using the
doloop functionality added to support predicated vectorized hardware loops.

gcc/ChangeLog:

* config/arm/arm-protos.h (arm_target_bb_ok_for_lob): Change
declaration to pass basic_block.
(arm_attempt_dlstp_transform): New declaration.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): Define targethook.
(TARGET_PREDICT_DOLOOP_P): Likewise.
(arm_target_bb_ok_for_lob): Adapt condition.
(arm_mve_get_vctp_lanes): New function.
(arm_dl_usage_type): New internal enum.
(arm_get_required_vpr_reg): New function.
(arm_get_required_vpr_reg_param): New function.
(arm_get_required_vpr_reg_ret_val): New function.
(arm_mve_get_loop_vctp): New function.
(arm_mve_insn_predicated_by): New function.
(arm_mve_across_lane_insn_p): New function.
(arm_mve_load_store_insn_p): New function.
(arm_mve_impl_pred_on_outputs_p): New function.
(arm_mve_impl_pred_on_inputs_p): New function.
(arm_last_vect_def_insn): New function.
(arm_mve_impl_predicated_p): New function.
(arm_mve_check_reg_origin_is_num_elems): New function.
(arm_mve_dlstp_check_inc_counter): New function.
(arm_mve_dlstp_check_dec_counter): New function.
(arm_mve_loop_valid_for_dlstp): New function.
(arm_predict_doloop_p): New function.
(arm_loop_unroll_adjust): New function.
(arm_emit_mve_unpredicated_insn_to_seq): New function.
(arm_attempt_dlstp_transform): New function.
* config/arm/arm.opt (mdlstp): New option.
	* config/arm/iterators.md (dlstp_elemsize, letp_num_lanes,
letp_num_lanes_neg, letp_num_lanes_minus_1): New attributes.
(DLSTP, LETP): New iterators.
(predicated_doloop_end_internal): New pattern.
(dlstp_insn): New pattern.
* config/arm/thumb2.md (doloop_end): Adapt to support tail-predicated
loops.
(doloop_begin): Likewise.
* config/arm/types.md (mve_misc): New mve type to represent
predicated_loop_end insn sequences.
* config/arm/unspecs.md:
(DLSTP8, DLSTP16, DLSTP32, DSLTP64,
LETP8, LETP16, LETP32, LETP64): New unspecs for DLSTP and LETP.

gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Add new helpers.
* gcc.target/arm/lob1.c: Use new helpers.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/dlstp-compile-asm-1.c: New test.
* gcc.target/arm/dlstp-compile-asm-2.c: New test.
* gcc.target/arm/dlstp-compile-asm-3.c: New test.
* gcc.target/arm/dlstp-int8x16.c: New test.
* gcc.target/arm/dlstp-int8x16-run.c: New test.
* gcc.target/arm/dlstp-int16x8.c: New test.
* gcc.target/arm/dlstp-int16x8-run.c: New test.
* gcc.target/arm/dlstp-int32x4.c: New test.
* gcc.target/arm/dlstp-int32x4-run.c: New test.
* gcc.target/arm/dlstp-int64x2.c: New test.
* gcc.target/arm/dlstp-int64x2-run.c: New test.
* gcc.target/arm/dlstp-invalid-asm.c: New test.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/config/arm/arm-protos.h   |4 +-
 gcc/config/arm/arm.cc | 1249 -
 gcc/config/arm/arm.opt|3 +
 gcc/config/arm/iterators.md   |   15 +
 gcc/config/arm/mve.md |   50 +
 gcc/config/arm/thumb2.md  |  138 +-
 gcc/config/arm/types.md   |6 +-
 gcc/config/arm/unspecs.md |   14 +-
 gcc/testsuite/gcc.target/arm/lob.h|  128 +-
 gcc/testsuite/gcc.target/arm/lob1.c   |   23 +-
 gcc/testsuite/gcc.target/arm/lob6.c   |8 +-
 .../gcc.target/arm/mve/dlstp-compile-asm-1.c  |  146 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-2.c  |  749 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-3.c  |   46 +
 .../gcc.target/arm/mve/dlstp-int16x8-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int16x8.c|   31 +
 .../gcc.target/arm/mve/dlstp-int32x4-run.c|   45 +
 .../gcc.target/arm/mve/dlstp-int32x4.c|   31 +
 .../gcc.target/arm/mve/dlstp-int64x2-run.c|   48 +
 .../gcc.target/arm/mve/dlstp-int64x2.c|   28 +
 .../gcc.target/arm/mve/dlstp-int8x16-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int8x16.c|   32 +
 .../gcc.target/arm/mve/dlstp-invalid-asm.c|  521 +++
 23 files changed, 3321 insertions(+), 82 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-3.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
 create mode 100644 

[PATCH v5 1/5] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns

2024-02-22 Thread Andre Vieira

This patch adds an attribute to the mve md patterns to be able to identify
predicable MVE instructions and what their predicated and unpredicated variants
are.  This attribute is used to encode the icode of the unpredicated variant of
an instruction in its predicated variant.

This will make it possible for us to transform VPT-predicated insns in
the insn chain into their unpredicated equivalents when transforming the loop
into a MVE Tail-Predicated Low Overhead Loop. For example:
`mve_vldrbq_z_ -> mve_vldrbq_`.
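
As a rough illustration of how the attribute can then be queried (a 
sketch only; the patch's actual macros live in arm.h):

```
/* An insn is VPT-predicable when its "mve_unpredicated_insn" attribute
   records a real icode rather than CODE_FOR_nothing.  */
static bool
mve_vpt_predicable_insn_p (rtx_insn *insn)
{
  return (recog_memoized (insn) >= 0
	  && get_attr_mve_unpredicated_insn (insn) != CODE_FOR_nothing);
}
```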

gcc/ChangeLog:

* config/arm/arm.md (mve_unpredicated_insn): New attribute.
* config/arm/arm.h (MVE_VPT_PREDICATED_INSN_P): New define.
(MVE_VPT_UNPREDICATED_INSN_P): Likewise.
(MVE_VPT_PREDICABLE_INSN_P): Likewise.
* config/arm/vec-common.md (mve_vshlq_): Add attribute.
* config/arm/mve.md (arm_vcx1q_p_v16qi): Add attribute.
(arm_vcx1qv16qi): Likewise.
(arm_vcx1qav16qi): Likewise.
(arm_vcx1qv16qi): Likewise.
(arm_vcx2q_p_v16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx2qav16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx3q_p_v16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(arm_vcx3qav16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(@mve_q_): Likewise.
(@mve_q_int_): Likewise.
(@mve_q_v4si): Likewise.
(@mve_q_n_): Likewise.
(@mve_q_r_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_n_): Likewise.
(@mve_q_m_r_): Likewise.
(@mve_q_m_f): Likewise.
(@mve_q_int_m_): Likewise.
(@mve_q_p_v4si): Likewise.
(@mve_q_p_): Likewise.
(@mve_q_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_f): Likewise.
(mve_vq_f): Likewise.
(mve_q): Likewise.
(mve_q_f): Likewise.
(mve_vadciq_v4si): Likewise.
(mve_vadciq_m_v4si): Likewise.
(mve_vadcq_v4si): Likewise.
(mve_vadcq_m_v4si): Likewise.
(mve_vandq_): Likewise.
(mve_vandq_f): Likewise.
(mve_vandq_m_): Likewise.
(mve_vandq_m_f): Likewise.
(mve_vandq_s): Likewise.
(mve_vandq_u): Likewise.
(mve_vbicq_): Likewise.
(mve_vbicq_f): Likewise.
(mve_vbicq_m_): Likewise.
(mve_vbicq_m_f): Likewise.
(mve_vbicq_m_n_): Likewise.
(mve_vbicq_n_): Likewise.
(mve_vbicq_s): Likewise.
(mve_vbicq_u): Likewise.
(@mve_vclzq_s): Likewise.
(mve_vclzq_u): Likewise.
(@mve_vcmp_q_): Likewise.
(@mve_vcmp_q_n_): Likewise.
(@mve_vcmp_q_f): Likewise.
(@mve_vcmp_q_n_f): Likewise.
(@mve_vcmp_q_m_f): Likewise.
(@mve_vcmp_q_m_n_): Likewise.
(@mve_vcmp_q_m_): Likewise.
(@mve_vcmp_q_m_n_f): Likewise.
(mve_vctpq): Likewise.
(mve_vctpq_m): Likewise.
(mve_vcvtaq_): Likewise.
(mve_vcvtaq_m_): Likewise.
(mve_vcvtbq_f16_f32v8hf): Likewise.
(mve_vcvtbq_f32_f16v4sf): Likewise.
(mve_vcvtbq_m_f16_f32v8hf): Likewise.
(mve_vcvtbq_m_f32_f16v4sf): Likewise.
(mve_vcvtmq_): Likewise.
(mve_vcvtmq_m_): Likewise.
(mve_vcvtnq_): Likewise.
(mve_vcvtnq_m_): Likewise.
(mve_vcvtpq_): Likewise.
(mve_vcvtpq_m_): Likewise.
(mve_vcvtq_from_f_): Likewise.
(mve_vcvtq_m_from_f_): Likewise.
(mve_vcvtq_m_n_from_f_): Likewise.
(mve_vcvtq_m_n_to_f_): Likewise.
(mve_vcvtq_m_to_f_): Likewise.
(mve_vcvtq_n_from_f_): Likewise.
(mve_vcvtq_n_to_f_): Likewise.
(mve_vcvtq_to_f_): Likewise.
(mve_vcvttq_f16_f32v8hf): Likewise.
(mve_vcvttq_f32_f16v4sf): Likewise.
(mve_vcvttq_m_f16_f32v8hf): Likewise.
(mve_vcvttq_m_f32_f16v4sf): Likewise.
(mve_vdwdupq_m_wb_u_insn): Likewise.
(mve_vdwdupq_wb_u_insn): Likewise.
(mve_veorq_s): Likewise.
(mve_veorq_u): Likewise.
(mve_veorq_f): Likewise.
(mve_vidupq_m_wb_u_insn): Likewise.
(mve_vidupq_u_insn): Likewise.
(mve_viwdupq_m_wb_u_insn): Likewise.
(mve_viwdupq_wb_u_insn): Likewise.
(mve_vldrbq_): Likewise.
(mve_vldrbq_gather_offset_): Likewise.
(mve_vldrbq_gather_offset_z_): Likewise.
(mve_vldrbq_z_): Likewise.
(mve_vldrdq_gather_base_v2di): Likewise.
(mve_vldrdq_gather_base_wb_v2di_insn): Likewise.
(mve_vldrdq_gather_base_wb_z_v2di_insn): Likewise.
(mve_vldrdq_gather_base_z_v2di): Likewise.
(mve_vldrdq_gather_offset_v2di): Likewise.
(mve_vldrdq_gather_offset_z_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_z_v2di): Likewise.
(mve_vldrhq_): Likewise.
(mve_vldrhq_fv8hf): Likewise.
(mve_vldrhq_gather_offset_): 

[PATCH v5 4/5] arm: Fix a wrong attribute use and remove unused unspecs and iterators

2024-02-22 Thread Andre Vieira

This patch fixes the erroneous use of a mode attribute without a mode iterator
in the pattern and removes unused unspecs and iterators.

gcc/ChangeLog:

* config/arm/iterators.md (supf): Remove VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U cases.
(VMLALDAVXQ): Remove iterator.
(VMLALDAVXQ_P): Likewise.
(VMLALDAVAXQ): Likewise.
* config/arm/mve.md (mve_vstrwq_p_fv4sf): Replace use of 
mode iterator attribute with V4BI mode.
* config/arm/unspecs.md (VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U): Remove unused unspecs.
---
 gcc/config/arm/iterators.md | 9 +++--
 gcc/config/arm/mve.md   | 2 +-
 gcc/config/arm/unspecs.md   | 3 ---
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 22b3ddf5637..3206bcab4cf 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2370,7 +2370,7 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VSUBQ_S "s") (VSUBQ_U "u") (VADDVAQ_S "s")
 		   (VADDVAQ_U "u") (VADDLVAQ_S "s") (VADDLVAQ_U "u")
 		   (VBICQ_N_S "s") (VBICQ_N_U "u") (VMLALDAVQ_U "u")
-		   (VMLALDAVQ_S "s") (VMLALDAVXQ_U "u") (VMLALDAVXQ_S "s")
+		   (VMLALDAVQ_S "s") (VMLALDAVXQ_S "s")
 		   (VMOVNBQ_U "u") (VMOVNBQ_S "s") (VMOVNTQ_U "u")
 		   (VMOVNTQ_S "s") (VORRQ_N_S "s") (VORRQ_N_U "u")
 		   (VQMOVNBQ_U "u") (VQMOVNBQ_S "s") (VQMOVNTQ_S "s")
@@ -2412,8 +2412,8 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_M_S "s") (VREV16Q_M_U "u")
 		   (VQRSHRNTQ_N_U "u") (VMOVNTQ_M_U "u") (VMOVLBQ_M_U "u")
 		   (VMLALDAVAQ_U "u") (VQSHRNBQ_N_U "u") (VSHRNBQ_N_U "u")
-		   (VRSHRNBQ_N_U "u") (VMLALDAVXQ_P_U "u")
-		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u") (VMLALDAVAXQ_U "u")
+		   (VRSHRNBQ_N_U "u")
+		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u")
 		   (VQMOVNTQ_M_U "u") (VSHRNTQ_N_U "u") (VCVTMQ_M_S "s")
 		   (VCVTMQ_M_U "u") (VCVTNQ_M_S "s") (VCVTNQ_M_U "u")
 		   (VCVTPQ_M_S "s") (VCVTPQ_M_U "u") (VADDLVAQ_P_S "s")
@@ -2762,7 +2762,6 @@ (define_int_iterator VSUBQ_N [VSUBQ_N_S VSUBQ_N_U])
 (define_int_iterator VADDLVAQ [VADDLVAQ_S VADDLVAQ_U])
 (define_int_iterator VBICQ_N [VBICQ_N_S VBICQ_N_U])
 (define_int_iterator VMLALDAVQ [VMLALDAVQ_U VMLALDAVQ_S])
-(define_int_iterator VMLALDAVXQ [VMLALDAVXQ_U VMLALDAVXQ_S])
 (define_int_iterator VMOVNBQ [VMOVNBQ_U VMOVNBQ_S])
 (define_int_iterator VMOVNTQ [VMOVNTQ_S VMOVNTQ_U])
 (define_int_iterator VORRQ_N [VORRQ_N_U VORRQ_N_S])
@@ -2817,11 +2816,9 @@ (define_int_iterator VMLALDAVAQ [VMLALDAVAQ_S VMLALDAVAQ_U])
 (define_int_iterator VQSHRNBQ_N [VQSHRNBQ_N_U VQSHRNBQ_N_S])
 (define_int_iterator VSHRNBQ_N [VSHRNBQ_N_U VSHRNBQ_N_S])
 (define_int_iterator VRSHRNBQ_N [VRSHRNBQ_N_S VRSHRNBQ_N_U])
-(define_int_iterator VMLALDAVXQ_P [VMLALDAVXQ_P_U VMLALDAVXQ_P_S])
 (define_int_iterator VQMOVNTQ_M [VQMOVNTQ_M_U VQMOVNTQ_M_S])
 (define_int_iterator VMVNQ_M_N [VMVNQ_M_N_U VMVNQ_M_N_S])
 (define_int_iterator VQSHRNTQ_N [VQSHRNTQ_N_U VQSHRNTQ_N_S])
-(define_int_iterator VMLALDAVAXQ [VMLALDAVAXQ_S VMLALDAVAXQ_U])
 (define_int_iterator VSHRNTQ_N [VSHRNTQ_N_S VSHRNTQ_N_U])
 (define_int_iterator VCVTMQ_M [VCVTMQ_M_S VCVTMQ_M_U])
 (define_int_iterator VCVTNQ_M [VCVTNQ_M_S VCVTNQ_M_U])
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index d7bdcd862f8..9fe51298cdc 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -4605,7 +4605,7 @@ (define_insn "mve_vstrwq_p_fv4sf"
   [(set (match_operand:V4SI 0 "mve_memory_operand" "=Ux")
 	(unspec:V4SI
 	 [(match_operand:V4SF 1 "s_register_operand" "w")
-	  (match_operand: 2 "vpr_register_operand" "Up")
+	  (match_operand:V4BI 2 "vpr_register_operand" "Up")
 	  (match_dup 0)]
 	 VSTRWQ_F))]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index b9db306c067..46ac8b37157 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -717,7 +717,6 @@ (define_c_enum "unspec" [
   VCVTBQ_F16_F32
   VCVTTQ_F16_F32
   VMLALDAVQ_U
-  VMLALDAVXQ_U
   VMLALDAVXQ_S
   VMLALDAVQ_S
   VMLSLDAVQ_S
@@ -934,7 +933,6 @@ (define_c_enum "unspec" [
   VSHRNBQ_N_S
   VRSHRNBQ_N_S
   VRSHRNBQ_N_U
-  VMLALDAVXQ_P_U
   VMLALDAVXQ_P_S
   VQMOVNTQ_M_U
   VQMOVNTQ_M_S
@@ -943,7 +941,6 @@ (define_c_enum "unspec" [
   VQSHRNTQ_N_U
   VQSHRNTQ_N_S
   VMLALDAVAXQ_S
-  VMLALDAVAXQ_U
   VSHRNTQ_N_S
   VSHRNTQ_N_U
   VCVTBQ_M_F16_F32


[PATCH v5 3/5] arm: Annotate instructions with mve_safe_imp_xlane_pred

2024-02-22 Thread Andre Vieira

This patch annotates some MVE across lane instructions with a new attribute.
We use this attribute to let the compiler know that these instructions can be
safely implicitly predicated when tail predicating if their operands are
guaranteed to have zeroed tail predicated lanes.  These instructions were
selected because having the value 0 in those lanes or 'tail-predicating' those
lanes has the same effect.
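
As a scalar model of why this is safe (illustrative only, not part of 
the patch):

```
#include <stdint.h>

/* For an across-lane reduction such as VADDV, a zeroed lane and a
   predicated-out lane contribute identically (nothing), so implicit
   tail predication cannot change the result.  */
int32_t
vaddv_model (const int16_t lanes[8], int active_lanes)
{
  int32_t sum = 0;
  for (int i = 0; i < 8; i++)
    if (i < active_lanes)	/* lanes >= active_lanes act as 0 */
      sum += lanes[i];
  return sum;
}
```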

gcc/ChangeLog:

* config/arm/arm.md (mve_safe_imp_xlane_pred): New attribute.
* config/arm/iterators.md (mve_vmaxmin_safe_imp): New iterator
attribute.
* config/arm/mve.md (vaddvq_s, vaddvq_u, vaddlvq_s, vaddlvq_u,
vaddvaq_s, vaddvaq_u, vmaxavq_s, vmaxvq_u, vmladavq_s, vmladavq_u,
vmladavxq_s, vmlsdavq_s, vmlsdavxq_s, vaddlvaq_s, vaddlvaq_u,
vmlaldavq_u, vmlaldavq_s, vmlaldavq_u, vmlaldavxq_s, vmlsldavq_s,
vmlsldavxq_s, vrmlaldavhq_u, vrmlaldavhq_s, vrmlaldavhxq_s,
vrmlsldavhq_s, vrmlsldavhxq_s, vrmlaldavhaq_s, vrmlaldavhaq_u,
vrmlaldavhaxq_s, vrmlsldavhaq_s, vrmlsldavhaxq_s, vabavq_s, vabavq_u,
vmladavaq_u, vmladavaq_s, vmladavaxq_s, vmlsdavaq_s, vmlsdavaxq_s,
vmlaldavaq_s, vmlaldavaq_u, vmlaldavaxq_s, vmlsldavaq_s,
vmlsldavaxq_s): Added mve_safe_imp_xlane_pred.
---
 gcc/config/arm/arm.md   |  6 ++
 gcc/config/arm/iterators.md |  8 
 gcc/config/arm/mve.md   | 12 
 3 files changed, 26 insertions(+)

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 81290e83818..814e871acea 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -130,6 +130,12 @@ (define_attr "predicated" "yes,no" (const_string "no"))
 ; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
 
+; An attribute used by the loop-doloop pass when determining whether it is
+; safe to predicate a MVE instruction, that operates across lanes, and was
+; previously not predicated.  The pass will still check whether all inputs
+; are predicated by the VCTP predication mask.
+(define_attr "mve_safe_imp_xlane_pred" "yes,no" (const_string "no"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 7600bf62531..22b3ddf5637 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -869,6 +869,14 @@ (define_code_attr mve_addsubmul [
 		 (plus "vadd")
 		 ])
 
+(define_int_attr mve_vmaxmin_safe_imp [
+		 (VMAXVQ_U "yes")
+		 (VMAXVQ_S "no")
+		 (VMAXAVQ_S "yes")
+		 (VMINVQ_U "no")
+		 (VMINVQ_S "no")
+		 (VMINAVQ_S "no")])
+
 (define_int_attr mve_cmp_op1 [
 		 (VCMPCSQ_M_U "cs")
 		 (VCMPEQQ_M_S "eq") (VCMPEQQ_M_U "eq")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 8aa0bded7f0..d7bdcd862f8 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -393,6 +393,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -529,6 +530,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -802,6 +804,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1014,6 +1017,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "")
   (set_attr "type" "mve_move")
 ])
 
@@ -1033,6 +1037,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1219,6 +1224,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1450,6 +1456,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1588,6 +1595,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1725,6 +1733,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2, %q3"
  [(set 

[PATCH v5 2/5] doloop: Add support for predicated vectorized loops

2024-02-22 Thread Andre Vieira

This patch adds support in the target agnostic doloop pass for the detection of
predicated vectorized hardware loops.  Arm is currently the only target that
will make use of this feature.
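
As a rough illustration of how a backend is expected to use the new df 
helper (hypothetical names, not code from the patch):

```
/* Find the unique definition of the predicate register within the loop
   body; with zero or multiple defs we cannot track its origin.  */
df_ref def = df_bb_regno_only_def_find (body, REGNO (vpr_reg));
if (!def)
  return false;
rtx_insn *def_insn = DF_REF_INSN (def);
```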

gcc/ChangeLog:

* df-core.cc (df_bb_regno_only_def_find): New helper function.
* df.h (df_bb_regno_only_def_find): Declare new function.
* loop-doloop.cc (doloop_condition_get): Add support for detecting
predicated vectorized hardware loops.
(doloop_modify): Add support for GTU condition checks.
(doloop_optimize): Update costing computation to support alterations to
desc->niter_expr by the backend.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/df-core.cc |  15 +
 gcc/df.h   |   1 +
 gcc/loop-doloop.cc | 164 +++--
 3 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index f0eb4c93957..b0e8a88d433 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+return temp;
+  else
+return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 84e5aa8b524..c4e690b40cf 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 529e810e530..8953e1de960 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
  forms:
 
  1)  (parallel [(set (pc) (if_then_else (condition)
-	  			(label_ref (label))
-(pc)))
-	 (set (reg) (plus (reg) (const_int -1)))
-	 (additional clobbers and uses)])
+	(label_ref (label))
+	(pc)))
+		 (set (reg) (plus (reg) (const_int -1)))
+		 (additional clobbers and uses)])
 
  The branch must be the first entry of the parallel (also required
  by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,33 @@ doloop_condition_get (rtx_insn *doloop_pat)
  the loop counter in an if_then_else too.
 
  2)  (set (reg) (plus (reg) (const_int -1))
- (set (pc) (if_then_else (reg != 0)
-	 (label_ref (label))
-			 (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+ (label_ref (label))
+ (pc))).
 
- Some targets (ARM) do the comparison before the branch, as in the
+ 3) Some targets (Arm) do the comparison before the branch, as in the
  following form:
 
- 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-   (set (reg) (plus (reg) (const_int -1)))])
-(set (pc) (if_then_else (cc == NE)
-(label_ref (label))
-(pc))) */
-
+ (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		(set (reg) (plus (reg) (const_int -1)))])
+ (set (pc) (if_then_else (cc == NE)
+			 (label_ref (label))
+			 (pc)))
+
+  4) This form supports a construct that is used to represent a vectorized
+  do loop with predication, however we do not need to care about the
+  details of the predication here.
+  Arm uses this construct to support MVE tail predication.
+
+  (parallel
+   [(set (pc)
+	 (if_then_else (gtu (plus (reg) (const_int -n))
+(const_int n-1))
+			   (label_ref)
+			   (pc)))
+	(set (reg) (plus (reg) (const_int -n)))
+	(additional clobbers and uses)])
+ */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -173,15 +187,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
-  || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !rtx_equal_p (XEXP (inc_src, 0), reg)
+  || 

[PATCH v5 0/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-22 Thread Andre Vieira
Hi,

This is a reworked patch series from v4.  The main differences are a further split
of patches, where:
[1/5] is arm specific and has been approved before,
[2/5] is target agnostic, has had no substantial changes from v3.
[3/5] new arm specific patch that is split from the original last patch and
annotates across lane instructions that are safe for tail predication if their
tail predicated operands are zeroed.
[4/5] new arm specific patch that could be committed independent of the series
to fix an obvious issue and remove unused unspecs & iterators.
[5/5] (v3-v4) reworked last patch refactoring the implicit predication and some
other validity checks, (v4-v5) removed the expectation that vctp instructions
are always zero extended after this was fixed on trunk.

Original cover letter:
This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop
feature.

The M-class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
Section B5.5.1 "Loop tail predication" describes the feature
we are adding support for with this patch (although
we only add codegen for DLSTP/LETP instruction loops).

Previously with commit d2ed233cb94 we'd added support for
non-MVE DLS/LE loops through the loop-doloop pass, which, given
a standard MVE loop like:

```
void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
{
  while (n > 0)
    {
      mve_pred16_t p = vctp16q (n);
      int16x8_t va = vldrhq_z_s16 (a, p);
      int16x8_t vb = vldrhq_z_s16 (b, p);
      int16x8_t vc = vaddq_x_s16 (va, vb, p);
      vstrhq_p_s16 (c, vc, p);
      c+=8;
      a+=8;
      b+=8;
      n-=8;
    }
}
```
.. would output:

```

        dls     lr, lr
.L3:
        vctp.16 r3
        vmrs    ip, P0  @ movhi
        sxth    ip, ip
        vmsr    P0, ip  @ movhi
        mov     r4, r0
        vpst
        vldrht.16       q2, [r4]
        mov     r4, r1
        vmov    q3, q0
        vpst
        vldrht.16       q1, [r4]
        mov     r4, r2
        vpst
        vaddt.i16       q3, q2, q1
        subs    r3, r3, #8
        vpst
        vstrht.16       q3, [r4]
        adds    r0, r0, #16
        adds    r1, r1, #16
        adds    r2, r2, #16
        le      lr, .L3
```

where the LE instruction will decrement LR by 1, compare and
branch if needed.

(there are also other inefficiencies with the above code, like the
pointless vmrs/sxth/vmsr on the VPR, the adds not being merged
into the vldrht/vstrht as #16 offsets, and some random movs!
But those are different problems...)

The MVE version is similar, except that:
* Instead of DLS/LE the instructions are DLSTP/LETP.
* Instead of pre-calculating the number of iterations of the
  loop, we place the number of elements to be processed by the
  loop into LR.
* Instead of decrementing the LR by one, LETP will decrement it
  by FPSCR.LTPSIZE, which is the number of elements being
  processed in each iteration: 16 for 8-bit elements, 8 for 16-bit
  elements, etc.
* On the final iteration, automatic Loop Tail Predication is
  performed, as if the instructions within the loop had been VPT
  predicated with a VCTP generating the VPR predicate in every
  loop iteration.

The dlstp/letp loop now looks like:

```

        dlstp.16        lr, r3
.L14:
        mov     r3, r0
        vldrh.16        q3, [r3]
        mov     r3, r1
        vldrh.16        q2, [r3]
        mov     r3, r2
        vadd.i16        q3, q3, q2
        adds    r0, r0, #16
        vstrh.16        q3, [r3]
        adds    r1, r1, #16
        adds    r2, r2, #16
        letp    lr, .L14

```

Since the loop tail predication is automatic, we have eliminated
the VCTP that had been specified by the user in the intrinsic
and converted the VPT-predicated instructions into their
unpredicated equivalents (which also saves us from VPST insns).

The LE instruction here decrements LR by 8 in each iteration.
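
As a scalar model of the dlstp.16/letp semantics above (illustrative 
only; 8 lanes per iteration for 16-bit elements):

```
#include <stdint.h>

void
model_dlstp16 (int16_t *a, int16_t *b, int16_t *c, int n)
{
  /* LR holds the element count; each iteration consumes up to 8
     elements, and the final iteration is implicitly tail-predicated.  */
  for (int remaining = n; remaining > 0; remaining -= 8)
    {
      int active = remaining < 8 ? remaining : 8;
      for (int lane = 0; lane < active; lane++)
	c[lane] = a[lane] + b[lane];
      a += 8;
      b += 8;
      c += 8;
    }
}
```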

Stam Markianos-Wright (1):
  arm: Add define_attr to create a mapping between MVE predicated and
unpredicated insns

Andre Vieira (4):
  doloop: Add support for predicated vectorized loops
  arm: Annotate instructions with mve_safe_imp_xlane_pred
  arm: Fix a wrong attribute use and remove unused unspecs and iterators
  arm: Add support for MVE Tail-Predicated Low Overhead Loops


-- 
2.17.1


[PATCH v4 5/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-21 Thread Andre Vieira

This patch adds support for MVE Tail-Predicated Low Overhead Loops by using the
doloop functionality added to support predicated vectorized hardware loops.

gcc/ChangeLog:

* config/arm/arm-protos.h (arm_target_bb_ok_for_lob): Change
declaration to pass basic_block.
(arm_attempt_dlstp_transform): New declaration.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): Define targethook.
(TARGET_PREDICT_DOLOOP_P): Likewise.
(arm_target_bb_ok_for_lob): Adapt condition.
(arm_mve_get_vctp_lanes): New function.
(arm_dl_usage_type): New internal enum.
(arm_get_required_vpr_reg): New function.
(arm_get_required_vpr_reg_param): New function.
(arm_get_required_vpr_reg_ret_val): New function.
(arm_mve_get_loop_vctp): New function.
(arm_mve_insn_predicated_by): New function.
(arm_mve_across_lane_insn_p): New function.
(arm_mve_load_store_insn_p): New function.
(arm_mve_impl_pred_on_outputs_p): New function.
(arm_mve_impl_pred_on_inputs_p): New function.
(arm_last_vect_def_insn): New function.
(arm_mve_impl_predicated_p): New function.
(arm_mve_check_reg_origin_is_num_elems): New function.
(arm_mve_dlstp_check_inc_counter): New function.
(arm_mve_dlstp_check_dec_counter): New function.
(arm_mve_loop_valid_for_dlstp): New function.
(arm_predict_doloop_p): New function.
(arm_loop_unroll_adjust): New function.
(arm_emit_mve_unpredicated_insn_to_seq): New function.
(arm_mve_get_vctp_vec_form): New function.
(arm_attempt_dlstp_transform): New function.
* config/arm/arm.opt (mdlstp): New option.
* config/arm/iterators.md (dlstp_elemsize, letp_num_lanes,
letp_num_lanes_neg, letp_num_lanes_minus_1): New attributes.
(DLSTP, LETP): New iterators.
(predicated_doloop_end_internal): New pattern.
(dlstp_insn): New pattern.
* config/arm/thumb2.md (doloop_end): Adapt to support tail-predicated
loops.
(doloop_begin): Likewise.
* config/arm/types.md (mve_misc): New mve type to represent
predicated_loop_end insn sequences.
* config/arm/unspecs.md:
(DLSTP8, DLSTP16, DLSTP32, DLSTP64,
LETP8, LETP16, LETP32, LETP64): New unspecs for DLSTP and LETP.

gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Add new helpers.
* gcc.target/arm/lob1.c: Use new helpers.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/dlstp-compile-asm-1.c: New test.
* gcc.target/arm/dlstp-compile-asm-2.c: New test.
* gcc.target/arm/dlstp-compile-asm-3.c: New test.
* gcc.target/arm/dlstp-int8x16.c: New test.
* gcc.target/arm/dlstp-int8x16-run.c: New test.
* gcc.target/arm/dlstp-int16x8.c: New test.
* gcc.target/arm/dlstp-int16x8-run.c: New test.
* gcc.target/arm/dlstp-int32x4.c: New test.
* gcc.target/arm/dlstp-int32x4-run.c: New test.
* gcc.target/arm/dlstp-int64x2.c: New test.
* gcc.target/arm/dlstp-int64x2-run.c: New test.
* gcc.target/arm/dlstp-invalid-asm.c: New test.

Co-authored-by: Stam Markianos-Wright 

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2cd560c9925..34d6be76e94 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern int arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index e5a944486d7..4fdcf5ed82a 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1279 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
-{
-  basic_block bb = BLOCK_FOR_INSN (insn);
-  /* Make sure the basic block of the target insn is a simple latch
- having as single predecessor and successor the body of the loop
- itself.  Only simple loops with a single basic block as body are
- supported for 'low over head loop' making sure that LE target is
- above LE 

[PATCH v4 4/5] arm: Fix a wrong attribute use and remove unused unspecs and iterators

2024-02-21 Thread Andre Vieira

This patch fixes the erroneous use of a mode attribute without a mode iterator
in the pattern and removes unused unspecs and iterators.

gcc/ChangeLog:

* config/arm/iterators.md (supf): Remove VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U cases.
(VMLALDAVXQ): Remove iterator.
(VMLALDAVXQ_P): Likewise.
(VMLALDAVAXQ): Likewise.
* config/arm/mve.md (mve_vstrwq_p_fv4sf): Replace use of 
mode iterator attribute with V4BI mode.
* config/arm/unspecs.md (VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U): Remove unused unspecs.

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 22b3ddf5637..3206bcab4cf 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2370,7 +2370,7 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VSUBQ_S "s") (VSUBQ_U "u") (VADDVAQ_S "s")
 		   (VADDVAQ_U "u") (VADDLVAQ_S "s") (VADDLVAQ_U "u")
 		   (VBICQ_N_S "s") (VBICQ_N_U "u") (VMLALDAVQ_U "u")
-		   (VMLALDAVQ_S "s") (VMLALDAVXQ_U "u") (VMLALDAVXQ_S "s")
+		   (VMLALDAVQ_S "s") (VMLALDAVXQ_S "s")
 		   (VMOVNBQ_U "u") (VMOVNBQ_S "s") (VMOVNTQ_U "u")
 		   (VMOVNTQ_S "s") (VORRQ_N_S "s") (VORRQ_N_U "u")
 		   (VQMOVNBQ_U "u") (VQMOVNBQ_S "s") (VQMOVNTQ_S "s")
@@ -2412,8 +2412,8 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_M_S "s") (VREV16Q_M_U "u")
 		   (VQRSHRNTQ_N_U "u") (VMOVNTQ_M_U "u") (VMOVLBQ_M_U "u")
 		   (VMLALDAVAQ_U "u") (VQSHRNBQ_N_U "u") (VSHRNBQ_N_U "u")
-		   (VRSHRNBQ_N_U "u") (VMLALDAVXQ_P_U "u")
-		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u") (VMLALDAVAXQ_U "u")
+		   (VRSHRNBQ_N_U "u")
+		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u")
 		   (VQMOVNTQ_M_U "u") (VSHRNTQ_N_U "u") (VCVTMQ_M_S "s")
 		   (VCVTMQ_M_U "u") (VCVTNQ_M_S "s") (VCVTNQ_M_U "u")
 		   (VCVTPQ_M_S "s") (VCVTPQ_M_U "u") (VADDLVAQ_P_S "s")
@@ -2762,7 +2762,6 @@ (define_int_iterator VSUBQ_N [VSUBQ_N_S VSUBQ_N_U])
 (define_int_iterator VADDLVAQ [VADDLVAQ_S VADDLVAQ_U])
 (define_int_iterator VBICQ_N [VBICQ_N_S VBICQ_N_U])
 (define_int_iterator VMLALDAVQ [VMLALDAVQ_U VMLALDAVQ_S])
-(define_int_iterator VMLALDAVXQ [VMLALDAVXQ_U VMLALDAVXQ_S])
 (define_int_iterator VMOVNBQ [VMOVNBQ_U VMOVNBQ_S])
 (define_int_iterator VMOVNTQ [VMOVNTQ_S VMOVNTQ_U])
 (define_int_iterator VORRQ_N [VORRQ_N_U VORRQ_N_S])
@@ -2817,11 +2816,9 @@ (define_int_iterator VMLALDAVAQ [VMLALDAVAQ_S VMLALDAVAQ_U])
 (define_int_iterator VQSHRNBQ_N [VQSHRNBQ_N_U VQSHRNBQ_N_S])
 (define_int_iterator VSHRNBQ_N [VSHRNBQ_N_U VSHRNBQ_N_S])
 (define_int_iterator VRSHRNBQ_N [VRSHRNBQ_N_S VRSHRNBQ_N_U])
-(define_int_iterator VMLALDAVXQ_P [VMLALDAVXQ_P_U VMLALDAVXQ_P_S])
 (define_int_iterator VQMOVNTQ_M [VQMOVNTQ_M_U VQMOVNTQ_M_S])
 (define_int_iterator VMVNQ_M_N [VMVNQ_M_N_U VMVNQ_M_N_S])
 (define_int_iterator VQSHRNTQ_N [VQSHRNTQ_N_U VQSHRNTQ_N_S])
-(define_int_iterator VMLALDAVAXQ [VMLALDAVAXQ_S VMLALDAVAXQ_U])
 (define_int_iterator VSHRNTQ_N [VSHRNTQ_N_S VSHRNTQ_N_U])
 (define_int_iterator VCVTMQ_M [VCVTMQ_M_S VCVTMQ_M_U])
 (define_int_iterator VCVTNQ_M [VCVTNQ_M_S VCVTNQ_M_U])
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index d7bdcd862f8..9fe51298cdc 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -4605,7 +4605,7 @@ (define_insn "mve_vstrwq_p_fv4sf"
   [(set (match_operand:V4SI 0 "mve_memory_operand" "=Ux")
 	(unspec:V4SI
 	 [(match_operand:V4SF 1 "s_register_operand" "w")
-	  (match_operand: 2 "vpr_register_operand" "Up")
+	  (match_operand:V4BI 2 "vpr_register_operand" "Up")
 	  (match_dup 0)]
 	 VSTRWQ_F))]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index b9db306c067..46ac8b37157 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -717,7 +717,6 @@ (define_c_enum "unspec" [
   VCVTBQ_F16_F32
   VCVTTQ_F16_F32
   VMLALDAVQ_U
-  VMLALDAVXQ_U
   VMLALDAVXQ_S
   VMLALDAVQ_S
   VMLSLDAVQ_S
@@ -934,7 +933,6 @@ (define_c_enum "unspec" [
   VSHRNBQ_N_S
   VRSHRNBQ_N_S
   VRSHRNBQ_N_U
-  VMLALDAVXQ_P_U
   VMLALDAVXQ_P_S
   VQMOVNTQ_M_U
   VQMOVNTQ_M_S
@@ -943,7 +941,6 @@ (define_c_enum "unspec" [
   VQSHRNTQ_N_U
   VQSHRNTQ_N_S
   VMLALDAVAXQ_S
-  VMLALDAVAXQ_U
   VSHRNTQ_N_S
   VSHRNTQ_N_U
   VCVTBQ_M_F16_F32


[PATCH v4 3/5] arm: Annotate instructions with mve_safe_imp_xlane_pred

2024-02-21 Thread Andre Vieira

This patch annotates some MVE across lane instructions with a new attribute.
We use this attribute to let the compiler know that these instructions can be
safely implicitly predicated when tail predicating if their operands are
guaranteed to have zeroed tail predicated lanes.  These instructions were
selected because having the value 0 in those lanes or 'tail-predicating' those
lanes has the same effect.

gcc/ChangeLog:

* config/arm/arm.md (mve_safe_imp_xlane_pred): New attribute.
* config/arm/iterators.md (mve_vmaxmin_safe_imp): New iterator
attribute.
* config/arm/mve.md (vaddvq_s, vaddvq_u, vaddlvq_s, vaddlvq_u,
vaddvaq_s, vaddvaq_u, vmaxavq_s, vmaxvq_u, vmladavq_s, vmladavq_u,
vmladavxq_s, vmlsdavq_s, vmlsdavxq_s, vaddlvaq_s, vaddlvaq_u,
vmlaldavq_u, vmlaldavq_s, vmlaldavq_u, vmlaldavxq_s, vmlsldavq_s,
vmlsldavxq_s, vrmlaldavhq_u, vrmlaldavhq_s, vrmlaldavhxq_s,
vrmlsldavhq_s, vrmlsldavhxq_s, vrmlaldavhaq_s, vrmlaldavhaq_u,
vrmlaldavhaxq_s, vrmlsldavhaq_s, vrmlsldavhaxq_s, vabavq_s, vabavq_u,
vmladavaq_u, vmladavaq_s, vmladavaxq_s, vmlsdavaq_s, vmlsdavaxq_s,
vmlaldavaq_s, vmlaldavaq_u, vmlaldavaxq_s, vmlsldavaq_s,
vmlsldavaxq_s): Added mve_safe_imp_xlane_pred.

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 3f2863adf44..58619393858 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -130,6 +130,12 @@ (define_attr "predicated" "yes,no" (const_string "no"))
 ; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
 
+; An attribute used by the loop-doloop pass when determining whether it is
+; safe to predicate a MVE instruction, that operates across lanes, and was
+; previously not predicated.  The pass will still check whether all inputs
+; are predicated by the VCTP predication mask.
+(define_attr "mve_safe_imp_xlane_pred" "yes,no" (const_string "no"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 7600bf62531..22b3ddf5637 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -869,6 +869,14 @@ (define_code_attr mve_addsubmul [
 		 (plus "vadd")
 		 ])
 
+(define_int_attr mve_vmaxmin_safe_imp [
+		 (VMAXVQ_U "yes")
+		 (VMAXVQ_S "no")
+		 (VMAXAVQ_S "yes")
+		 (VMINVQ_U "no")
+		 (VMINVQ_S "no")
+		 (VMINAVQ_S "no")])
+
 (define_int_attr mve_cmp_op1 [
 		 (VCMPCSQ_M_U "cs")
 		 (VCMPEQQ_M_S "eq") (VCMPEQQ_M_U "eq")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 8aa0bded7f0..d7bdcd862f8 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -393,6 +393,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -529,6 +530,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -802,6 +804,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1014,6 +1017,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "")
   (set_attr "type" "mve_move")
 ])
 
@@ -1033,6 +1037,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1219,6 +1224,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1450,6 +1456,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1588,6 +1595,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1725,6 +1733,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2, %q3"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1742,6 +1751,7 @@ 

[PATCH v4 1/5] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns

2024-02-21 Thread Andre Vieira

This patch adds an attribute to the mve md patterns to be able to identify
predicable MVE instructions and what their predicated and unpredicated variants
are.  This attribute is used to encode the icode of the unpredicated variant of
an instruction in its predicated variant.

This will make it possible for us to transform VPT-predicated insns in
the insn chain into their unpredicated equivalents when transforming the loop
into a MVE Tail-Predicated Low Overhead Loop. For example:
`mve_vldrbq_z_ -> mve_vldrbq_`.

gcc/ChangeLog:

* config/arm/arm.md (mve_unpredicated_insn): New attribute.
* config/arm/arm.h (MVE_VPT_PREDICATED_INSN_P): New define.
(MVE_VPT_UNPREDICATED_INSN_P): Likewise.
(MVE_VPT_PREDICABLE_INSN_P): Likewise.
* config/arm/vec-common.md (mve_vshlq_): Add attribute.
* config/arm/mve.md (arm_vcx1q_p_v16qi): Add attribute.
(arm_vcx1qv16qi): Likewise.
(arm_vcx1qav16qi): Likewise.
(arm_vcx1qv16qi): Likewise.
(arm_vcx2q_p_v16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx2qav16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx3q_p_v16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(arm_vcx3qav16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(@mve_q_): Likewise.
(@mve_q_int_): Likewise.
(@mve_q_v4si): Likewise.
(@mve_q_n_): Likewise.
(@mve_q_r_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_n_): Likewise.
(@mve_q_m_r_): Likewise.
(@mve_q_m_f): Likewise.
(@mve_q_int_m_): Likewise.
(@mve_q_p_v4si): Likewise.
(@mve_q_p_): Likewise.
(@mve_q_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_f): Likewise.
(mve_vq_f): Likewise.
(mve_q): Likewise.
(mve_q_f): Likewise.
(mve_vadciq_v4si): Likewise.
(mve_vadciq_m_v4si): Likewise.
(mve_vadcq_v4si): Likewise.
(mve_vadcq_m_v4si): Likewise.
(mve_vandq_): Likewise.
(mve_vandq_f): Likewise.
(mve_vandq_m_): Likewise.
(mve_vandq_m_f): Likewise.
(mve_vandq_s): Likewise.
(mve_vandq_u): Likewise.
(mve_vbicq_): Likewise.
(mve_vbicq_f): Likewise.
(mve_vbicq_m_): Likewise.
(mve_vbicq_m_f): Likewise.
(mve_vbicq_m_n_): Likewise.
(mve_vbicq_n_): Likewise.
(mve_vbicq_s): Likewise.
(mve_vbicq_u): Likewise.
(@mve_vclzq_s): Likewise.
(mve_vclzq_u): Likewise.
(@mve_vcmp_q_): Likewise.
(@mve_vcmp_q_n_): Likewise.
(@mve_vcmp_q_f): Likewise.
(@mve_vcmp_q_n_f): Likewise.
(@mve_vcmp_q_m_f): Likewise.
(@mve_vcmp_q_m_n_): Likewise.
(@mve_vcmp_q_m_): Likewise.
(@mve_vcmp_q_m_n_f): Likewise.
(mve_vctpq): Likewise.
(mve_vctpq_m): Likewise.
(mve_vcvtaq_): Likewise.
(mve_vcvtaq_m_): Likewise.
(mve_vcvtbq_f16_f32v8hf): Likewise.
(mve_vcvtbq_f32_f16v4sf): Likewise.
(mve_vcvtbq_m_f16_f32v8hf): Likewise.
(mve_vcvtbq_m_f32_f16v4sf): Likewise.
(mve_vcvtmq_): Likewise.
(mve_vcvtmq_m_): Likewise.
(mve_vcvtnq_): Likewise.
(mve_vcvtnq_m_): Likewise.
(mve_vcvtpq_): Likewise.
(mve_vcvtpq_m_): Likewise.
(mve_vcvtq_from_f_): Likewise.
(mve_vcvtq_m_from_f_): Likewise.
(mve_vcvtq_m_n_from_f_): Likewise.
(mve_vcvtq_m_n_to_f_): Likewise.
(mve_vcvtq_m_to_f_): Likewise.
(mve_vcvtq_n_from_f_): Likewise.
(mve_vcvtq_n_to_f_): Likewise.
(mve_vcvtq_to_f_): Likewise.
(mve_vcvttq_f16_f32v8hf): Likewise.
(mve_vcvttq_f32_f16v4sf): Likewise.
(mve_vcvttq_m_f16_f32v8hf): Likewise.
(mve_vcvttq_m_f32_f16v4sf): Likewise.
(mve_vdwdupq_m_wb_u_insn): Likewise.
(mve_vdwdupq_wb_u_insn): Likewise.
(mve_veorq_s): Likewise.
(mve_veorq_u): Likewise.
(mve_veorq_f): Likewise.
(mve_vidupq_m_wb_u_insn): Likewise.
(mve_vidupq_u_insn): Likewise.
(mve_viwdupq_m_wb_u_insn): Likewise.
(mve_viwdupq_wb_u_insn): Likewise.
(mve_vldrbq_): Likewise.
(mve_vldrbq_gather_offset_): Likewise.
(mve_vldrbq_gather_offset_z_): Likewise.
(mve_vldrbq_z_): Likewise.
(mve_vldrdq_gather_base_v2di): Likewise.
(mve_vldrdq_gather_base_wb_v2di_insn): Likewise.
(mve_vldrdq_gather_base_wb_z_v2di_insn): Likewise.
(mve_vldrdq_gather_base_z_v2di): Likewise.
(mve_vldrdq_gather_offset_v2di): Likewise.
(mve_vldrdq_gather_offset_z_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_z_v2di): Likewise.
(mve_vldrhq_): Likewise.
(mve_vldrhq_fv8hf): Likewise.
(mve_vldrhq_gather_offset_): 

[PATCH v4 2/5] doloop: Add support for predicated vectorized loops

2024-02-21 Thread Andre Vieira

This patch adds support in the target agnostic doloop pass for the detection of
predicated vectorized hardware loops.  Arm is currently the only target that
will make use of this feature.

The doloop_condition_get function is used to validate that the 'transformed'
jump instruction matches one of the condition forms that are correct for the
canonical loop format, where the loop counter is decremented by 1 each time.
The way Arm models predicated vectorized hardware loops makes the loop
counter 'step' by the element count each iteration rather than by one.  This
changes the condition test in the jump instruction, which means we had to add
a new condition form that doloop_condition_get accepts.

gcc/ChangeLog:

* df-core.cc (df_bb_regno_only_def_find): New helper function.
* df.h (df_bb_regno_only_def_find): Declare new function.
* loop-doloop.cc (doloop_condition_get): Accept conditions generated by
predicated vectorized hardware loops.
(doloop_modify): Add support for GTU condition checks.
(doloop_optimize): Update costing computation to support alterations to
desc->niter_expr by the backend.

Co-authored-by: Stam Markianos-Wright 

diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index f0eb4c93957..b0e8a88d433 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+return temp;
+  else
+return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 84e5aa8b524..c4e690b40cf 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 529e810e530..8953e1de960 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
  forms:
 
  1)  (parallel [(set (pc) (if_then_else (condition)
-	  			(label_ref (label))
-(pc)))
-	 (set (reg) (plus (reg) (const_int -1)))
-	 (additional clobbers and uses)])
+	(label_ref (label))
+	(pc)))
+		 (set (reg) (plus (reg) (const_int -1)))
+		 (additional clobbers and uses)])
 
  The branch must be the first entry of the parallel (also required
  by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,33 @@ doloop_condition_get (rtx_insn *doloop_pat)
  the loop counter in an if_then_else too.
 
  2)  (set (reg) (plus (reg) (const_int -1))
- (set (pc) (if_then_else (reg != 0)
-	 (label_ref (label))
-			 (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+ (label_ref (label))
+ (pc))).
 
- Some targets (ARM) do the comparison before the branch, as in the
+ 3) Some targets (Arm) do the comparison before the branch, as in the
  following form:
 
- 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-   (set (reg) (plus (reg) (const_int -1)))])
-(set (pc) (if_then_else (cc == NE)
-(label_ref (label))
-(pc))) */
-
+ (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		(set (reg) (plus (reg) (const_int -1)))])
+ (set (pc) (if_then_else (cc == NE)
+			 (label_ref (label))
+			 (pc)))
+
+  4) This form supports a construct that is used to represent a vectorized
+  do loop with predication, however we do not need to care about the
+  details of the predication here.
+  Arm uses this construct to support MVE tail predication.
+
+  (parallel
+   [(set (pc)
+	 (if_then_else (gtu (plus (reg) (const_int -n))
+(const_int n-1))
+			   (label_ref)
+			   (pc)))
+	(set (reg) (plus (reg) (const_int -n)))
+	(additional clobbers and uses)])
+ */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -173,15 +187,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int 

[PATCH v4 0/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-21 Thread Andre Vieira
Hi,

This is a reworked patch series from v3.  The main differences are a further split
of patches, where:
[1/5] is arm specific and has been approved before,
[2/5] is target agnostic, has had no substantial changes from v3.
[3/5] new arm specific patch that is split from the original last patch and
annotates across lane instructions that are safe for tail predication if their
tail predicated operands are zeroed.
[4/5] new arm specific patch that could be committed independent of the series to fix
an obvious issue and remove unused unspecs & iterators.
[5/5] reworked last patch refactoring the implicit predication and some other
validity checks.

Original cover letter:
This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop
feature.

The M-class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
Section B5.5.1 "Loop tail predication" describes the feature
we are adding support for with this patch (although
we only add codegen for DLSTP/LETP instruction loops).

Previously with commit d2ed233cb94 we'd added support for
non-MVE DLS/LE loops through the loop-doloop pass, which, given
a standard MVE loop like:

```
void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
{
  while (n > 0)
    {
      mve_pred16_t p = vctp16q (n);
      int16x8_t va = vldrhq_z_s16 (a, p);
      int16x8_t vb = vldrhq_z_s16 (b, p);
      int16x8_t vc = vaddq_x_s16 (va, vb, p);
      vstrhq_p_s16 (c, vc, p);
      c+=8;
      a+=8;
      b+=8;
      n-=8;
    }
}
```
.. would output:

```

        dls     lr, lr
.L3:
        vctp.16 r3
        vmrs    ip, P0  @ movhi
        sxth    ip, ip
        vmsr    P0, ip  @ movhi
        mov     r4, r0
        vpst
        vldrht.16       q2, [r4]
        mov     r4, r1
        vmov    q3, q0
        vpst
        vldrht.16       q1, [r4]
        mov     r4, r2
        vpst
        vaddt.i16       q3, q2, q1
        subs    r3, r3, #8
        vpst
        vstrht.16       q3, [r4]
        adds    r0, r0, #16
        adds    r1, r1, #16
        adds    r2, r2, #16
        le      lr, .L3
```

where the LE instruction will decrement LR by 1, compare and
branch if needed.

(there are also other inefficiencies with the above code, like the
pointless vmrs/sxth/vmsr on the VPR, the adds not being merged
into the vldrht/vstrht as #16 offsets, and some random movs!
But those are different problems...)

The MVE version is similar, except that:
* Instead of DLS/LE the instructions are DLSTP/LETP.
* Instead of pre-calculating the number of iterations of the
  loop, we place the number of elements to be processed by the
  loop into LR.
* Instead of decrementing the LR by one, LETP will decrement it
  by FPSCR.LTPSIZE, which is the number of elements being
  processed in each iteration: 16 for 8-bit elements, 8 for 16-bit
  elements, etc.
* On the final iteration, automatic Loop Tail Predication is
  performed, as if the instructions within the loop had been VPT
  predicated with a VCTP generating the VPR predicate in every
  loop iteration.

The dlstp/letp loop now looks like:

```

        dlstp.16        lr, r3
.L14:
        mov     r3, r0
        vldrh.16        q3, [r3]
        mov     r3, r1
        vldrh.16        q2, [r3]
        mov     r3, r2
        vadd.i16        q3, q3, q2
        adds    r0, r0, #16
        vstrh.16        q3, [r3]
        adds    r1, r1, #16
        adds    r2, r2, #16
        letp    lr, .L14

```

Since the loop tail predication is automatic, we have eliminated
the VCTP that had been specified by the user in the intrinsic
and converted the VPT-predicated instructions into their
unpredicated equivalents (which also saves us from VPST insns).

The LE instruction here decrements LR by 8 in each iteration.

Stam Markianos-Wright (1):
  arm: Add define_attr to create a mapping between MVE predicated and
unpredicated insns

Andre Vieira (4):
  doloop: Add support for predicated vectorized loops
  arm: Annotate instructions with mve_safe_imp_xlane_pred
  arm: Fix a wrong attribute use and remove unused unspecs and iterators
  arm: Add support for MVE Tail-Predicated Low Overhead Loops


-- 
2.17.1


Re: [committed] bitint: Fix testism where __seg_gs was being used for all targets

2024-02-19 Thread Andre Vieira (lists)




On 19/02/2024 16:17, Jakub Jelinek wrote:

On Mon, Feb 19, 2024 at 04:13:29PM +, Andre Vieira (lists) wrote:

Replaced uses of __seg_gs with the SEG macro defined in the testcase to pick
(if any) the right __seg_{gs,fs} keyword based on the target.

gcc/testsuite/ChangeLog:

* gcc.dg/bitint-86.c (__seg_gs): Replace with SEG MACRO.


ChangeLog should be
* gcc.dg/bitint-86.c (foo, bar, baz): Replace __seg_gs with SEG.
Otherwise, LGTM.
Sorry for forgetting to do that myself.


Jakub



That makes sense ... but I already pushed it upstream, as I thought it 
was obvious. Apologies for the ChangeLog mistake :(


[committed] bitint: Fix testism where __seg_gs was being used for all targets

2024-02-19 Thread Andre Vieira (lists)
Replaced uses of __seg_gs with the SEG macro defined in the testcase to 
pick (if any) the right __seg_{gs,fs} keyword based on the target.
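
For context, the guard in the testcase is shaped roughly like this 
(reconstructed for illustration; the exact target conditions may 
differ):

```
#if defined(__x86_64__)
#define SEG __seg_gs
#elif defined(__i386__)
#define SEG __seg_fs
#else
#define SEG /* no named address space on this target */
#endif
```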


gcc/testsuite/ChangeLog:

* gcc.dg/bitint-86.c (__seg_gs): Replace with SEG MACRO.

diff --git a/gcc/testsuite/gcc.dg/bitint-86.c b/gcc/testsuite/gcc.dg/bitint-86.c
index 4e5761a203bc39150540326df9c0d88544bb02ef..10a2392b6f530ae165252bdac750061e92d53131 100644
--- a/gcc/testsuite/gcc.dg/bitint-86.c
+++ b/gcc/testsuite/gcc.dg/bitint-86.c
@@ -15,14 +15,14 @@ struct T { struct S b[4]; };
 #endif
 
 void
-foo (__seg_gs struct T *p)
+foo (SEG struct T *p)
 {
   struct S s;
   p->b[0] = s;
 }
 
 void
-bar (__seg_gs struct T *p, _BitInt(710) x, int y, double z)
+bar (SEG struct T *p, _BitInt(710) x, int y, double z)
 {
   p->b[0].a = x + 42;
   p->b[1].a = x << y;
@@ -31,7 +31,7 @@ bar (__seg_gs struct T *p, _BitInt(710) x, int y, double z)
 }
 
 int
-baz (__seg_gs struct T *p, _BitInt(710) x, _BitInt(710) y)
+baz (SEG struct T *p, _BitInt(710) x, _BitInt(710) y)
 {
   return __builtin_add_overflow (x, y, >b[1].a);
 }


Re: veclower: improve selection of vector mode when lowering [PR 112787]

2024-02-19 Thread Andre Vieira (lists)

Hi all,

OK to backport this to gcc-12 and gcc-13? Patch applies cleanly, 
bootstrapped and regression tested on aarch64-unknown-linux-gnu. Only 
change is in the testcase as I had to use -march=armv9-a because 
-march=armv8-a+sve conflicts with -mcpu=neoverse-n2 in previous gcc 
versions.


Kind Regards,
Andre

On 20/12/2023 14:30, Richard Biener wrote:

On Wed, 20 Dec 2023, Andre Vieira (lists) wrote:


Thanks, fully agree with all comments.

gcc/ChangeLog:

PR target/112787
* tree-vect-generic.cc (type_for_widest_vector_mode): Change function
 to use original vector type and check widest vector mode has at most
 the same number of elements.
(get_compute_type): Pass original vector type rather than the element
 type to type_for_widest_vector_mode and remove now obsolete check
 for the number of elements.


OK.

Richard.


On 07/12/2023 07:45, Richard Biener wrote:

On Wed, 6 Dec 2023, Andre Vieira (lists) wrote:


Hi,

This patch addresses the issue reported in PR target/112787 by improving
the
compute type selection.  We do this by not considering types with more
elements
than the type we are lowering since we'd reject such types anyway.

gcc/ChangeLog:

  PR target/112787
  * tree-vect-generic.cc (type_for_widest_vector_mode): Add a parameter to
  control maximum amount of elements in resulting vector mode.
  (get_compute_type): Restrict vector_compute_type to a mode no wider
  than the original compute type.

gcc/testsuite/ChangeLog:

  * gcc.target/aarch64/pr112787.c: New test.

Bootstrapped and regression tested on aarch64-unknown-linux-gnu and
x86_64-pc-linux-gnu.

Is this OK for trunk?


@@ -1347,7 +1347,7 @@ optimize_vector_constructor (gimple_stmt_iterator *gsi)
  TYPE, or NULL_TREE if none is found.  */

Can you improve the function comment?  It also doesn't mention OP ...

   static tree
-type_for_widest_vector_mode (tree type, optab op)
+type_for_widest_vector_mode (tree type, optab op, poly_int64 max_nunits = 0)
   {
 machine_mode inner_mode = TYPE_MODE (type);
 machine_mode best_mode = VOIDmode, mode;
@@ -1371,7 +1371,9 @@ type_for_widest_vector_mode (tree type, optab op)
 FOR_EACH_MODE_FROM (mode, mode)
   if (GET_MODE_INNER (mode) == inner_mode
  && maybe_gt (GET_MODE_NUNITS (mode), best_nunits)
-   && optab_handler (op, mode) != CODE_FOR_nothing)
+   && optab_handler (op, mode) != CODE_FOR_nothing
+   && (known_eq (max_nunits, 0)
+   || known_lt (GET_MODE_NUNITS (mode), max_nunits)))

max_nunits suggests that known_le would be appropriate instead.

I see the only other caller with similar "problems":

	}
      /* Can't use get_compute_type here, as supportable_convert_operation
	 doesn't necessarily use an optab and needs two arguments.  */
      tree vec_compute_type
	= type_for_widest_vector_mode (TREE_TYPE (arg_type), mov_optab);
      if (vec_compute_type
	  && VECTOR_MODE_P (TYPE_MODE (vec_compute_type))
	  && subparts_gt (arg_type, vec_compute_type))

so please do not default to 0 but adjust this one as well.  It also
seems you then can remove the subparts_gt guards on both
vec_compute_type uses.

I think the API would be cleaner if we'd pass the original vector type
we can then extract TYPE_VECTOR_SUBPARTS from, avoiding the extra arg.

No?

Thanks,
Richard.
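
Combining the quoted review with the final ChangeLog above, the 
adjusted candidate-mode loop presumably ends up shaped like this (a 
sketch under those assumptions, not the committed code; `type' stands 
for the original vector type now passed in):

```
/* Only consider vector modes whose element count does not exceed that
   of the original vector type being lowered.  */
FOR_EACH_MODE_FROM (mode, mode)
  if (GET_MODE_INNER (mode) == inner_mode
      && maybe_gt (GET_MODE_NUNITS (mode), best_nunits)
      && optab_handler (op, mode) != CODE_FOR_nothing
      && known_le (GET_MODE_NUNITS (mode), TYPE_VECTOR_SUBPARTS (type)))
    {
      best_mode = mode;
      best_nunits = GET_MODE_NUNITS (mode);
    }
```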






Re: [PATCH 1/3] vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE

2024-02-01 Thread Andre Vieira (lists)




On 01/02/2024 07:19, Richard Biener wrote:

On Wed, 31 Jan 2024, Andre Vieira (lists) wrote:


The patch didn't come with a testcase so it's really hard to tell
what goes wrong now and how it is fixed ...


My bad! I had a testcase locally but never added it...

However... now that I've looked at it and run it past Richard S, the codegen isn't 
'wrong', but it does have the potential to lead to some pretty slow 
codegen, especially for inbranch simdclones where it transforms the SVE 
predicate into an Advanced SIMD vector by inserting the elements one at 
a time...


An example of which can be seen if you do:

gcc -O3 -march=armv8-a+sve -msve-vector-bits=128  -fopenmp-simd t.c -S

with the following t.c:
#pragma omp declare simd simdlen(4) inbranch
int __attribute__ ((const)) fn5(int);

void fn4 (int *a, int *b, int n)
{
  for (int i = 0; i < n; ++i)
    b[i] = fn5(a[i]);
}

Now I do have to say, for our main usecase of libmvec we won't have any 
'inbranch' Advanced SIMD clones, so we avoid that issue... But of course 
that doesn't mean user code will avoid it.


I'm going to remove this patch and run another regression test to see if 
it catches anything weird; if not, then I guess we have the option to not 
use this patch and aim to solve the costing or codegen issue in GCC-15. 
We don't currently do any simdclone costing, and I don't have a clear 
suggestion for how to, given OpenMP has no mechanism that I know of to 
expose the speedup of a simdclone over its scalar variant. So how would 
we 'compare' a simdclone call, with its extra overhead of argument 
preparation, against the scalar version? At least we could prefer a call 
to a different simdclone with less argument preparation. Anyway, I 
digress.


Other tests: these require --param aarch64-autovec-preference=2, so that 
also worries me less...


gcc -O3 -march=armv8-a+sve -msve-vector-bits=128 --param 
aarch64-autovec-preference=2 -fopenmp-simd t.c -S


t.c:
#pragma omp declare simd simdlen(2) notinbranch
float __attribute__ ((const)) fn1(double);

void fn0 (float *a, float *b, int n)
{
for (int i = 0; i < n; ++i)
b[i] = fn1((double) a[i]);
}

#pragma omp declare simd simdlen(2) notinbranch
float __attribute__ ((const)) fn3(float);

void fn2 (float *a, double *b, int n)
{
for (int i = 0; i < n; ++i)
b[i] = (double) fn3(a[i]);
}
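
Again for reference (my expansion, same ABI naming): the clones called
above would be __ZGVnN2v_fn1 and __ZGVnN2v_fn3 -- 64-bit Advanced SIMD,
unmasked, simdlen 2.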


Richard.



That said, I wonder how we end up mixing things up in the first place.

Richard.






Re: [PATCH 1/3] vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE

2024-01-31 Thread Andre Vieira (lists)




On 31/01/2024 14:35, Richard Biener wrote:

On Wed, 31 Jan 2024, Andre Vieira (lists) wrote:




On 31/01/2024 13:58, Richard Biener wrote:

On Wed, 31 Jan 2024, Andre Vieira (lists) wrote:




On 31/01/2024 12:13, Richard Biener wrote:

On Wed, 31 Jan 2024, Richard Biener wrote:


On Tue, 30 Jan 2024, Andre Vieira wrote:



This patch adds stmt_vec_info to TARGET_SIMD_CLONE_USABLE to make sure
the
target can reject a simd_clone based on the vector mode it is using.
This is needed because for VLS SVE vectorization the vectorizer accepts
Advanced SIMD simd clones when vectorizing using SVE types because the
simdlens
might match.  This will cause type errors later on.

Other targets do not currently need to use this argument.


Can you instead pass down the mode?


Thinking about that again the cgraph_simd_clone info in the clone
should have sufficient information to disambiguate.  If it doesn't
then we should amend it.

Richard.


Hi Richard,

Thanks for the review, I don't think cgraph_simd_clone_info is the right
place
to pass down this information, since this is information about the caller
rather than the simdclone itself. What we are trying to achieve here is
making
the vectorizer being able to accept or reject simdclones based on the ISA
we
are vectorizing for. To distinguish between SVE and Advanced SIMD ISAs we
use
modes, I am also not sure that's ideal but it is what we currently use. So
to
answer your earlier question, yes I can also pass down mode if that's
preferable.


Note cgraph_simd_clone_info has simdlen and we seem to check elsewhere
whether that's POLY or constant.  I wonder how aarch64_sve_mode_p
comes into play here which in the end classifies VLS SVE modes as
non-SVE?



Using -msve-vector-bits=128
(gdb) p TYPE_MODE (STMT_VINFO_VECTYPE (stmt_vinfo))
$4 = E_VNx4SImode
(gdb) p  TYPE_SIZE (STMT_VINFO_VECTYPE (stmt_vinfo))
$5 = (tree) 0xf741c1b0
(gdb) p debug (TYPE_SIZE (STMT_VINFO_VECTYPE (stmt_vinfo)))
128
(gdb) p aarch64_sve_mode_p (TYPE_MODE (STMT_VINFO_VECTYPE (stmt_vinfo)))
$5 = true

and for reference without vls codegen:
(gdb) p TYPE_MODE (STMT_VINFO_VECTYPE (stmt_vinfo))
$1 = E_VNx4SImode
(gdb) p  debug (TYPE_SIZE (STMT_VINFO_VECTYPE (stmt_vinfo)))
POLY_INT_CST [128, 128]

Having said that I believe that the USABLE targethook implementation for
aarch64 should also block other uses, like an Advanced SIMD mode being used as
input for a SVE VLS SIMDCLONE. The reason being that for instance 'half'
registers like VNx2SI are packed differently from V2SI.

We could teach the vectorizer to support these of course, but that requires
more work and is not extremely useful just yet. I'll add the extra check
to the patch once we agree on how to pass down the information we need. Happy
to use either mode, or stmt_vec_info and extract the mode from it like it does
now.


As said, please pass down 'mode'.  But I wonder how to document it,
which mode is that supposed to be?  Any of result or any argument
mode that happens to be a vector?  I think that we might be able
to mix Advanced SIMD modes and SVE modes with -msve-vector-bits=128
in the same loop?

Are the simd clones you don't want to use with -msve-vector-bits=128
having constant simdlen?  If so why do you generate them in the first
place?


So this is where things get a bit confusing and I will write up some 
text for these cases to put in our ABI document (currently in Beta and 
in need of some tlc).


Our intended behaviour is for a 'declare simd' without a simdlen to 
generate simdclones for:
* Advanced SIMD 128 and 64-bit vectors, where possible (we don't allow 
for simdlen 1, Tamar fixed that in gcc recently),

* SVE VLA vectors.

Let me illustrate this with an example:

__attribute__ ((simd (notinbranch), const)) float cosf(float);

Should tell the compiler the following simd clones are available:
__ZGVnN4v_cosf 128-bit 4x4 float Advanced SIMD clone
__ZGVnN2v_cosf 64-bit  4x2 float Advanced SIMD clone
__ZGVsMxv_cosf [128, 128]-bit 4x4xN SVE SIMD clone

[To save you looking into the ABI, let me break this down: _ZGV is the 
prefix, then 'n' or 's' picks between Advanced SIMD and SVE, 'N' or 'M' 
picks between Not Masked and Masked (SVE is always masked even if we ask 
for notinbranch), then a digit or 'x' picks between a fixed vector length 
or VLA, and after that you get a letter per argument, where v = vector mapped]
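
As a worked decode of the VLA name (my breakdown, following those rules):

__ZGVsMxv_cosf
  s -> SVE
  M -> Masked (SVE clones are always masked)
  x -> VLA simdlen
  v -> argument 0 is vector mapped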


Regardless of -msve-vector-bits, however, the vectorizer (and any other 
part of the compiler) may assume that the VL of the VLA SVE clone is 
that specified by -msve-vector-bits, which if the clone is written in a 
VLA way will still work.


If the attribute is used with a function definition rather than 
declaration, so:


__attribute__ ((simd (notinbranch), const)) float fn0(float a)
{
  return a + 1.0f;
}

the compiler should again generate the three simd clones:
__ZGVnN4v_fn0 128-bit 4x4 float Advanced SIMD clone
__ZGVnN2v_fn0 64-bit  4x2 float Advanced SIMD clone
__ZGVsMxv_fn0 [128, 128]-bit 4x4xN SVE SIMD clone

Re: [PATCH 1/3] vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE

2024-01-31 Thread Andre Vieira (lists)




On 31/01/2024 14:03, Richard Biener wrote:

On Wed, 31 Jan 2024, Richard Biener wrote:


On Wed, 31 Jan 2024, Andre Vieira (lists) wrote:




On 31/01/2024 12:13, Richard Biener wrote:

On Wed, 31 Jan 2024, Richard Biener wrote:


On Tue, 30 Jan 2024, Andre Vieira wrote:



This patch adds stmt_vec_info to TARGET_SIMD_CLONE_USABLE to make sure the
target can reject a simd_clone based on the vector mode it is using.
This is needed because for VLS SVE vectorization the vectorizer accepts
Advanced SIMD simd clones when vectorizing using SVE types because the
simdlens
might match.  This will cause type errors later on.

Other targets do not currently need to use this argument.


Can you instead pass down the mode?


Thinking about that again the cgraph_simd_clone info in the clone
should have sufficient information to disambiguate.  If it doesn't
then we should amend it.

Richard.


Hi Richard,

Thanks for the review, I don't think cgraph_simd_clone_info is the right place
to pass down this information, since this is information about the caller
rather than the simdclone itself. What we are trying to achieve here is making
the vectorizer being able to accept or reject simdclones based on the ISA we
are vectorizing for. To distinguish between SVE and Advanced SIMD ISAs we use
modes, I am also not sure that's ideal but it is what we currently use. So to
answer your earlier question, yes I can also pass down mode if that's
preferable.


Note cgraph_simd_clone_info has simdlen and we seem to check elsewhere
whether that's POLY or constant.  I wonder how aarch64_sve_mode_p
comes into play here which in the end classifies VLS SVE modes as
non-SVE?


Maybe it's just a bit non-obvious as you key on mangling:

  static int
-aarch64_simd_clone_usable (struct cgraph_node *node)
+aarch64_simd_clone_usable (struct cgraph_node *node, stmt_vec_info stmt_vinfo)
  {
switch (node->simdclone->vecsize_mangle)
  {
  case 'n':
if (!TARGET_SIMD)
 return -1;
+  if (STMT_VINFO_VECTYPE (stmt_vinfo)
+ && aarch64_sve_mode_p (TYPE_MODE (STMT_VINFO_VECTYPE (stmt_vinfo))))
+   return -1;

?  What does 'n' mean?  It's documented as

   /* The mangling character for a given vector size.  This is used
  to determine the ISA mangling bit as specified in the Intel
  Vector ABI.  */
   unsigned char vecsize_mangle;


I'll update the comment, but yeh 'n' is for Advanced SIMD, 's' is for SVE.


which is slightly misleading.


Re: [PATCH 1/3] vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE

2024-01-31 Thread Andre Vieira (lists)




On 31/01/2024 13:58, Richard Biener wrote:

On Wed, 31 Jan 2024, Andre Vieira (lists) wrote:




On 31/01/2024 12:13, Richard Biener wrote:

On Wed, 31 Jan 2024, Richard Biener wrote:


On Tue, 30 Jan 2024, Andre Vieira wrote:



This patch adds stmt_vec_info to TARGET_SIMD_CLONE_USABLE to make sure the
target can reject a simd_clone based on the vector mode it is using.
This is needed because for VLS SVE vectorization the vectorizer accepts
Advanced SIMD simd clones when vectorizing using SVE types because the
simdlens
might match.  This will cause type errors later on.

Other targets do not currently need to use this argument.


Can you instead pass down the mode?


Thinking about that again the cgraph_simd_clone info in the clone
should have sufficient information to disambiguate.  If it doesn't
then we should amend it.

Richard.


Hi Richard,

Thanks for the review, I don't think cgraph_simd_clone_info is the right place
to pass down this information, since this is information about the caller
rather than the simdclone itself. What we are trying to achieve here is making
the vectorizer being able to accept or reject simdclones based on the ISA we
are vectorizing for. To distinguish between SVE and Advanced SIMD ISAs we use
modes, I am also not sure that's ideal but it is what we currently use. So to
answer your earlier question, yes I can also pass down mode if that's
preferable.


Note cgraph_simd_clone_info has simdlen and we seem to check elsewhere
whether that's POLY or constant.  I wonder how aarch64_sve_mode_p
comes into play here which in the end classifies VLS SVE modes as
non-SVE?



Using -msve-vector-bits=128
(gdb) p TYPE_MODE (STMT_VINFO_VECTYPE (stmt_vinfo))
$4 = E_VNx4SImode
(gdb) p  TYPE_SIZE (STMT_VINFO_VECTYPE (stmt_vinfo))
$5 = (tree) 0xf741c1b0
(gdb) p debug (TYPE_SIZE (STMT_VINFO_VECTYPE (stmt_vinfo)))
128
(gdb) p aarch64_sve_mode_p (TYPE_MODE (STMT_VINFO_VECTYPE (stmt_vinfo)))
$5 = true

and for reference without vls codegen:
(gdb) p TYPE_MODE (STMT_VINFO_VECTYPE (stmt_vinfo))
$1 = E_VNx4SImode
(gdb) p  debug (TYPE_SIZE (STMT_VINFO_VECTYPE (stmt_vinfo)))
POLY_INT_CST [128, 128]

Having said that I believe that the USABLE targethook implementation for 
aarch64 should also block other uses, like an Advanced SIMD mode being 
used as input for a SVE VLS SIMDCLONE. The reason being that for 
instance 'half' registers like VNx2SI are packed differently from V2SI.


We could teach the vectorizer to support these of course, but that 
requires more work and is not extremely useful just yet. I'll add the 
extra check that to the patch once we agree on how to pass down the 
information we need. Happy to use either mode, or stmt_vec_info and 
extract the mode from it like it does now.



Regards,
Andre





Re: [PATCH 1/3] vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE

2024-01-31 Thread Andre Vieira (lists)




On 31/01/2024 12:13, Richard Biener wrote:

On Wed, 31 Jan 2024, Richard Biener wrote:


On Tue, 30 Jan 2024, Andre Vieira wrote:



This patch adds stmt_vec_info to TARGET_SIMD_CLONE_USABLE to make sure the
target can reject a simd_clone based on the vector mode it is using.
This is needed because for VLS SVE vectorization the vectorizer accepts
Advanced SIMD simd clones when vectorizing using SVE types because the simdlens
might match.  This will cause type errors later on.

Other targets do not currently need to use this argument.


Can you instead pass down the mode?


Thinking about that again the cgraph_simd_clone info in the clone
should have sufficient information to disambiguate.  If it doesn't
then we should amend it.

Richard.


Hi Richard,

Thanks for the review, I don't think cgraph_simd_clone_info is the right 
place to pass down this information, since this is information about the 
caller rather than the simdclone itself. What we are trying to achieve 
here is making the vectorizer being able to accept or reject simdclones 
based on the ISA we are vectorizing for. To distinguish between SVE and 
Advanced SIMD ISAs we use modes, I am also not sure that's ideal but it 
is what we currently use. So to answer your earlier question, yes I can 
also pass down mode if that's preferable.


Regards,
Andre


[PATCH 3/3] aarch64: Add SVE support for simd clones [PR 96342]

2024-01-30 Thread Andre Vieira

This patch finalizes adding support for the generation of SVE simd clones when
no simdlen is provided, following the ABI rules where the widest data type
determines the minimum number of elements in a length agnostic vector.
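
As a worked example of that rule (my own, illustrative): for

#pragma omp declare simd notinbranch
float fn (double d);

the widest data type is double (64 bits), so the VLA clone processes
VL/64 lanes per call and is mangled with the 'x' simdlen, e.g.
__ZGVsMxv_fn.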

gcc/ChangeLog:

* config/aarch64/aarch64-protos.h (add_sve_type_attribute): Declare.
* config/aarch64/aarch64-sve-builtins.cc (add_sve_type_attribute): Make
visibility global and support use for non_acle types.
* config/aarch64/aarch64.cc
(aarch64_simd_clone_compute_vecsize_and_simdlen): Create VLA simd clone
when no simdlen is provided, according to ABI rules.
(simd_clone_adjust_sve_vector_type): New helper function.
(aarch64_simd_clone_adjust): Add '+sve' attribute to SVE simd clones
and modify types to use SVE types.
* omp-simd-clone.cc (simd_clone_mangle): Print 'x' for VLA simdlen.
(simd_clone_adjust): Adapt safelen check to be compatible with VLA
simdlen.

gcc/testsuite/ChangeLog:

* c-c++-common/gomp/declare-variant-14.c: Make i?86 and x86_64 target
only test.
* gfortran.dg/gomp/declare-variant-14.f90: Likewise.
* gcc.target/aarch64/declare-simd-2.c: Add SVE clone scan.
* gcc.target/aarch64/vect-simd-clone-1.c: New test.

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index a0b142e0b94..207396de0ff 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1031,6 +1031,8 @@ namespace aarch64_sve {
 #ifdef GCC_TARGET_H
   bool verify_type_context (location_t, type_context_kind, const_tree, bool);
 #endif
+ void add_sve_type_attribute (tree, unsigned int, unsigned int,
+			  const char *, const char *);
 }
 
 extern void aarch64_split_combinev16qi (rtx operands[3]);
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 11f5c5c500c..747131e684e 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -953,14 +953,16 @@ static bool reported_missing_registers_p;
 /* Record that TYPE is an ABI-defined SVE type that contains NUM_ZR SVE vectors
and NUM_PR SVE predicates.  MANGLED_NAME, if nonnull, is the ABI-defined
   mangling of the type.  ACLE_NAME is the <arm_sve.h> name of the type.  */
-static void
+void
 add_sve_type_attribute (tree type, unsigned int num_zr, unsigned int num_pr,
 			const char *mangled_name, const char *acle_name)
 {
   tree mangled_name_tree
 = (mangled_name ? get_identifier (mangled_name) : NULL_TREE);
+  tree acle_name_tree
+    = (acle_name ? get_identifier (acle_name) : NULL_TREE);
 
-  tree value = tree_cons (NULL_TREE, get_identifier (acle_name), NULL_TREE);
+  tree value = tree_cons (NULL_TREE, acle_name_tree, NULL_TREE);
   value = tree_cons (NULL_TREE, mangled_name_tree, value);
   value = tree_cons (NULL_TREE, size_int (num_pr), value);
   value = tree_cons (NULL_TREE, size_int (num_zr), value);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 31617510160..cba8879ab33 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -28527,7 +28527,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
 	int num, bool explicit_p)
 {
   tree t, ret_type;
-  unsigned int nds_elt_bits;
+  unsigned int nds_elt_bits, wds_elt_bits;
   unsigned HOST_WIDE_INT const_simdlen;
 
   if (!TARGET_SIMD)
@@ -28572,10 +28572,14 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
   if (TREE_CODE (ret_type) != VOID_TYPE)
 {
   nds_elt_bits = lane_size (SIMD_CLONE_ARG_TYPE_VECTOR, ret_type);
+  wds_elt_bits = nds_elt_bits;
   vec_elts.safe_push (std::make_pair (ret_type, nds_elt_bits));
 }
   else
-nds_elt_bits = POINTER_SIZE;
+{
+  nds_elt_bits = POINTER_SIZE;
+  wds_elt_bits = 0;
+}
 
   int i;
   tree type_arg_types = TYPE_ARG_TYPES (TREE_TYPE (node->decl));
@@ -28583,44 +28587,72 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
   for (t = (decl_arg_p ? DECL_ARGUMENTS (node->decl) : type_arg_types), i = 0;
t && t != void_list_node; t = TREE_CHAIN (t), i++)
 {
-  tree arg_type = decl_arg_p ? TREE_TYPE (t) : TREE_VALUE (t);
+  tree type = decl_arg_p ? TREE_TYPE (t) : TREE_VALUE (t);
   if (clonei->args[i].arg_type != SIMD_CLONE_ARG_TYPE_UNIFORM
-	  && !supported_simd_type (arg_type))
+	  && !supported_simd_type (type))
 	{
 	  if (!explicit_p)
 	;
-	  else if (COMPLEX_FLOAT_TYPE_P (ret_type))
+	  else if (COMPLEX_FLOAT_TYPE_P (type))
 	warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
 			"GCC does not currently support argument type %qT "
-			"for simd", arg_type);
+			"for simd", type);
 	  else
 	warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
 			"unsupported argument type %qT for simd",
-			arg_type);
+			type);
 	  return 0;
 	}
-  unsigned 

[PATCH 2/3] vect: disable multiple calls of poly simdclones

2024-01-30 Thread Andre Vieira

The current codegen code to support VFs that are multiples of a simdclone
simdlen relies on BIT_FIELD_REF to create multiple input vectors.  This does
not work for non-constant simdclones, so we should disable using such clones
when the VF is a multiple of the non-constant simdlen, until we change the
codegen to support those.
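
To make the BIT_FIELD_REF issue concrete (illustrative gimple, my own
sketch): with VF = 8 and a simdlen-4 clone of 32-bit elements, one 8-lane
operand is currently split as

  lo_1 = BIT_FIELD_REF <vect_a, 128, 0>;
  hi_2 = BIT_FIELD_REF <vect_a, 128, 128>;

and for a poly simdlen neither the width nor the offset of such a
reference is a compile-time constant, hence the rejection below.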

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_simd_clone_call): Reject simdclones
with non-constant simdlen when VF is not exactly the same.

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index da02082c034..9bfb898683d 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4068,7 +4068,10 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
 	if (!constant_multiple_p (vf * group_size, n->simdclone->simdlen,
+				  &num_calls)
 	|| (!n->simdclone->inbranch && (masked_call_offset > 0))
-	|| (nargs != simd_nargs))
+	|| (nargs != simd_nargs)
+	/* Currently we do not support multiple calls of non-constant
+	   simdlen as poly vectors can not be accessed by BIT_FIELD_REF.  */
+	|| (!n->simdclone->simdlen.is_constant () && num_calls != 1))
 	  continue;
 	if (num_calls != 1)
 	  this_badness += floor_log2 (num_calls) * 4096;


[PATCH 1/3] vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE

2024-01-30 Thread Andre Vieira

This patch adds stmt_vec_info to TARGET_SIMD_CLONE_USABLE to make sure the
target can reject a simd_clone based on the vector mode it is using.
This is needed because for VLS SVE vectorization the vectorizer accepts
Advanced SIMD simd clones when vectorizing using SVE types because the simdlens
might match.  This will cause type errors later on.

Other targets do not currently need to use this argument.

gcc/ChangeLog:

* target.def (TARGET_SIMD_CLONE_USABLE): Add argument.
* tree-vect-stmts.cc (vectorizable_simd_clone_call): Pass stmt_info to
call TARGET_SIMD_CLONE_USABLE.
* config/aarch64/aarch64.cc (aarch64_simd_clone_usable): Add argument
and use it to reject the use of SVE simd clones with Advanced SIMD
modes.
* config/gcn/gcn.cc (gcn_simd_clone_usable): Add unused argument.
* config/i386/i386.cc (ix86_simd_clone_usable): Likewise.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index a37d47b243e..31617510160 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -28694,13 +28694,16 @@ aarch64_simd_clone_adjust (struct cgraph_node *node)
 /* Implement TARGET_SIMD_CLONE_USABLE.  */
 
 static int
-aarch64_simd_clone_usable (struct cgraph_node *node)
+aarch64_simd_clone_usable (struct cgraph_node *node, stmt_vec_info stmt_vinfo)
 {
   switch (node->simdclone->vecsize_mangle)
 {
 case 'n':
   if (!TARGET_SIMD)
 	return -1;
+  if (STMT_VINFO_VECTYPE (stmt_vinfo)
+	  && aarch64_sve_mode_p (TYPE_MODE (STMT_VINFO_VECTYPE (stmt_vinfo))))
+	return -1;
   return 0;
 default:
   gcc_unreachable ();
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index e80de2ce056..c48b212d9e6 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -5658,7 +5658,8 @@ gcn_simd_clone_adjust (struct cgraph_node *ARG_UNUSED (node))
 /* Implement TARGET_SIMD_CLONE_USABLE.  */
 
 static int
-gcn_simd_clone_usable (struct cgraph_node *ARG_UNUSED (node))
+gcn_simd_clone_usable (struct cgraph_node *ARG_UNUSED (node),
+		   stmt_vec_info ARG_UNUSED (stmt_vinfo))
 {
   /* We don't need to do anything here because
  gcn_simd_clone_compute_vecsize_and_simdlen currently only returns one
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b3e7c74846e..63e6b9d2643 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25193,7 +25193,8 @@ ix86_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
slightly less desirable, etc.).  */
 
 static int
-ix86_simd_clone_usable (struct cgraph_node *node)
+ix86_simd_clone_usable (struct cgraph_node *node,
+			stmt_vec_info ARG_UNUSED (stmt_vinfo))
 {
   switch (node->simdclone->vecsize_mangle)
 {
diff --git a/gcc/target.def b/gcc/target.def
index fdad7bbc93e..4fade9c4eec 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1648,7 +1648,7 @@ DEFHOOK
 in vectorized loops in current function, or non-negative number if it is\n\
 usable.  In that case, the smaller the number is, the more desirable it is\n\
 to use it.",
-int, (struct cgraph_node *), NULL)
+int, (struct cgraph_node *, _stmt_vec_info *), NULL)
 
 HOOK_VECTOR_END (simd_clone)
 
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 1dbe1115da4..da02082c034 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4074,7 +4074,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
 	  this_badness += floor_log2 (num_calls) * 4096;
 	if (n->simdclone->inbranch)
 	  this_badness += 8192;
-	int target_badness = targetm.simd_clone.usable (n);
+	int target_badness = targetm.simd_clone.usable (n, stmt_info);
 	if (target_badness < 0)
 	  continue;
 	this_badness += target_badness * 512;


[PATCH 0/3] vect, aarch64: Add SVE support for simdclones

2024-01-30 Thread Andre Vieira
Hi,

This patch series is a set of patches that I have sent up for review before, 
and it enables initial support for SVE simd clones with some caveats.
Caveat 1: we do not support SVE simd clones with function bodies.
To enable support for this we need to change the way we 'simdify' a function 
body. For each argument that maps to a vector, an array of 'simdlen' elements 
is created. This however does not work for a VLA simdlen.  We will need to 
come up with a way to support this such that the generated code is performant; 
there's little reason to 'simdify' a function by generating really slow code. 
I have some ideas on how we might be able to do this, though I'm not convinced 
it's even worth trying, but I think that's a bigger discussion.  For now I've 
disabled generating SVE simdclones for functions with function bodies.  This 
still fits our libmvec use case, as the simd clones are handwritten using 
intrinsics in glibc.
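
To sketch the current scheme (simplified, hypothetical code): each vector
argument of a clone body is lowered to a fixed-size array of 'simdlen'
elements,

void fn0_clone_body_sketch (float in[4], float out[4])
{
  float buf[4];                 /* 'simdlen' elements, a compile-time 4.  */
  for (int i = 0; i < 4; i++)
    buf[i] = in[i] + 1.0f;      /* scalar body replayed per lane.  */
  for (int i = 0; i < 4; i++)
    out[i] = buf[i];
}

and there is no VLA equivalent of that fixed-size array today.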

Caveat 2: we cannot generate ncopies calls to an SVE simd clone.
When I first sent the second patch of this series upstream, Richi asked me to 
look at supporting ncopies calls to VLA simdlen simd clones.  I have 
vectorizer code to do this, however I found that we didn't yet have enough 
backend support to be able to index VLA vectors.  I think that's something 
that will need to wait until gcc 15, so for now I'd simply reject 
vectorization where that is required.

Caveat 3: we don't yet support SVE simdclones for VLS codegen.
We've disabled the use of SVE simdclones when the -msve-vector-bits option is 
used to request VLS codegen. We need this because the mangling is determined by 
the 'simdlen' of a simd clone which will not be VLA when -msve-vector-bits is 
passed. We would like to support using VLA simd clones when generating VLS, but 
for that to work right now we'd need to set the simdlen of the simd clone to 
the VLS value and that messes up the mangling.  In the future we will need to 
add a target hook to specify the mangling.
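
For example (my illustration, reusing the ABI naming from earlier in this
series): with -msve-vector-bits=256 a float clone would get simdlen 8 and
mangle as __ZGVsM8v_foo, whereas the ABI name of the VLA clone we would
want to reuse is __ZGVsMxv_foo -- hence the need for that hook.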

Given that the target agnostic changes are minimal, have been suggested 
before and have no impact on other targets, and that the target specific 
parts have been reviewed before, would this still be acceptable for Stage 4? 
I would really like to make use of the work that was done to support this 
and the SVE simdclones added to glibc.

Kind regards,
Andre

Andre Vieira (3):
vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE
vect: disable multiple calls of poly simdclones
aarch64: Add SVE support for simd clones [PR 96342]

-- 
2.17.1


[PATCH 2/2] aarch64: Add support for _BitInt

2024-01-25 Thread Andre Vieira

This patch adds support for C23's _BitInt for the AArch64 port when compiling
for little endian.  Big endian requires further target-agnostic
support and we therefore disable it for now.
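
As a user-level example of what this enables (my own snippet, little
endian only):

_BitInt(135)
add135 (_BitInt(135) a, _BitInt(135) b)
{
  return a + b;   /* Lowered to 64-bit (DImode) limb operations; the
                     value itself is passed and aligned like TImode.  */
}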

gcc/ChangeLog:

* config/aarch64/aarch64.cc (TARGET_C_BITINT_TYPE_INFO): Declare MACRO.
(aarch64_bitint_type_info): New function.
(aarch64_return_in_memory_1): Return large _BitInt's in memory.
(aarch64_function_arg_alignment): Adapt to correctly return the ABI
mandated alignment of _BitInt(N) where N > 128 as the alignment of
TImode.
(aarch64_composite_type_p): Return true for _BitInt(N), where N > 128.

libgcc/ChangeLog:

* config/aarch64/t-softfp: Add fixtfbitint, floatbitinttf and
floatbitinthf to the softfp_extras variable to ensure the
runtime support is available for _BitInt.
---
 gcc/config/aarch64/aarch64.cc  | 44 +-
 libgcc/config/aarch64/t-softfp |  3 ++-
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e6bd3fd0bb4..48bac51bc7c 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6534,7 +6534,7 @@ aarch64_return_in_memory_1 (const_tree type)
   machine_mode ag_mode;
   int count;
 
-  if (!AGGREGATE_TYPE_P (type)
+  if (!(AGGREGATE_TYPE_P (type) || TREE_CODE (type) == BITINT_TYPE)
   && TREE_CODE (type) != COMPLEX_TYPE
   && TREE_CODE (type) != VECTOR_TYPE)
 /* Simple scalar types always returned in registers.  */
@@ -6618,6 +6618,10 @@ aarch64_function_arg_alignment (machine_mode mode, const_tree type,
 
   gcc_assert (TYPE_MODE (type) == mode);
 
+  if (TREE_CODE (type) == BITINT_TYPE
+      && int_size_in_bytes (type) > 16)
+    return GET_MODE_ALIGNMENT (TImode);
+
   if (!AGGREGATE_TYPE_P (type))
 {
   /* The ABI alignment is the natural alignment of the type, without
@@ -21793,6 +21797,11 @@ aarch64_composite_type_p (const_tree type,
   if (type && (AGGREGATE_TYPE_P (type) || TREE_CODE (type) == COMPLEX_TYPE))
 return true;
 
+  if (type
+      && TREE_CODE (type) == BITINT_TYPE
+      && int_size_in_bytes (type) > 16)
+    return true;
+
   if (mode == BLKmode
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_INT)
@@ -28330,6 +28339,36 @@ aarch64_excess_precision (enum excess_precision_type type)
   return FLT_EVAL_METHOD_UNPREDICTABLE;
 }
 
+/* Implement TARGET_C_BITINT_TYPE_INFO.
+   Return true if _BitInt(N) is supported and fill its details into *INFO.  */
+bool
+aarch64_bitint_type_info (int n, struct bitint_info *info)
+{
+  if (TARGET_BIG_END)
+    return false;
+
+  if (n <= 8)
+    info->limb_mode = QImode;
+  else if (n <= 16)
+    info->limb_mode = HImode;
+  else if (n <= 32)
+    info->limb_mode = SImode;
+  else if (n <= 64)
+    info->limb_mode = DImode;
+  else if (n <= 128)
+    info->limb_mode = TImode;
+  else
+    info->limb_mode = DImode;
+
+  if (n > 128)
+    info->abi_limb_mode = TImode;
+  else
+    info->abi_limb_mode = info->limb_mode;
+  info->big_endian = TARGET_BIG_END;
+  info->extended = false;
+  return true;
+}
+
 /* Implement TARGET_SCHED_CAN_SPECULATE_INSN.  Return true if INSN can be
scheduled for speculative execution.  Reject the long-running division
and square-root instructions.  */
@@ -30439,6 +30478,9 @@ aarch64_run_selftests (void)
 #undef TARGET_C_EXCESS_PRECISION
 #define TARGET_C_EXCESS_PRECISION aarch64_excess_precision
 
+#undef TARGET_C_BITINT_TYPE_INFO
+#define TARGET_C_BITINT_TYPE_INFO aarch64_bitint_type_info
+
 #undef  TARGET_EXPAND_BUILTIN
 #define TARGET_EXPAND_BUILTIN aarch64_expand_builtin
 
diff --git a/libgcc/config/aarch64/t-softfp b/libgcc/config/aarch64/t-softfp
index 2e32366f891..a335a34c243 100644
--- a/libgcc/config/aarch64/t-softfp
+++ b/libgcc/config/aarch64/t-softfp
@@ -4,7 +4,8 @@ softfp_extensions := sftf dftf hftf bfsf
 softfp_truncations := tfsf tfdf tfhf tfbf dfbf sfbf hfbf
 softfp_exclude_libgcc2 := n
 softfp_extras += fixhfti fixunshfti floattihf floatuntihf \
-		 floatdibf floatundibf floattibf floatuntibf
+		 floatdibf floatundibf floattibf floatuntibf \
+		 fixtfbitint floatbitinttf floatbitinthf
 
 TARGET_LIBGCC2_CFLAGS += -Wno-missing-prototypes
 


[PATCH 1/2] bitint: Use TARGET_ARRAY_MODE for large bitints where target supports it

2024-01-25 Thread Andre Vieira

This patch ensures we use TARGET_ARRAY_MODE to determine the storage mode of
large bitints that are represented as arrays in memory.  This is required to
support such bitints for aarch64 and potentially other targets with similar
bitint specifications.  Existing tests like gcc.dg/torture/bitint-25.c are
affected by this for aarch64 targets.

gcc/ChangeLog:

	* stor-layout.cc (layout_type): Use TARGET_ARRAY_MODE for large bitints
for targets that implement it.
---
 gcc/stor-layout.cc | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/gcc/stor-layout.cc b/gcc/stor-layout.cc
index 4cf249133e9..31da2c123ab 100644
--- a/gcc/stor-layout.cc
+++ b/gcc/stor-layout.cc
@@ -2427,8 +2427,16 @@ layout_type (tree type)
 	  }
 	else
 	  {
-	SET_TYPE_MODE (type, BLKmode);
 	cnt = CEIL (TYPE_PRECISION (type), GET_MODE_PRECISION (limb_mode));
+	machine_mode mode;
+	/* Some targets use TARGET_ARRAY_MODE to select the mode they use
+	   for arrays with a specific element mode and a specific element
+	   count and we should use this mode for large bitints that are
+	   stored as such arrays.  */
+	if (!targetm.array_mode (limb_mode, cnt).exists (&mode)
+		|| !targetm.array_mode_supported_p (limb_mode, cnt))
+	  mode = BLKmode;
+	SET_TYPE_MODE (type, mode);
 	gcc_assert (info.abi_limb_mode == info.limb_mode
 			|| !info.big_endian == !WORDS_BIG_ENDIAN);
 	  }


[PATCH 0/2] aarch64, bitint: Add support for _BitInt for AArch64 Little Endian

2024-01-25 Thread Andre Vieira
Hi,

This patch series adds support for _BitInt for AArch64 when compiling for
Little Endian. The first patch in the series fixes an issue that arises with
support for AArch64, the second patch adds the backend support for it.

Andre Vieira (2):
bitint: Use TARGET_ARRAY_MODE for large bitints where target supports it
aarch64: Add support for _BitInt

Patch series boostrapped and regression tested on aarch64-unknown-linux-gnu and 
x86_64-pc-linux-gnu.

Ok for trunk?

-- 
2.17.1


[PATCH v3 1/2] arm: Add define_attr to to create a mapping between MVE predicated and unpredicated insns

2024-01-19 Thread Andre Vieira

Reposting for testing purposes, no changes from v2 (other than rebase).
diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 2a2207c0ba1..449e6935b32 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -2375,6 +2375,21 @@ extern int making_const_table;
   else if (TARGET_THUMB1)\
 thumb1_final_prescan_insn (INSN)
 
+/* These defines are useful to refer to the value of the mve_unpredicated_insn
+   insn attribute.  Note that, because these use the get_attr_* function, these
+   will change recog_data if (INSN) isn't current_insn.  */
+#define MVE_VPT_PREDICABLE_INSN_P(INSN)	\
+  (recog_memoized (INSN) >= 0		\
+   && get_attr_mve_unpredicated_insn (INSN) != CODE_FOR_nothing)
+
+#define MVE_VPT_PREDICATED_INSN_P(INSN)	\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) != get_attr_mve_unpredicated_insn (INSN))
+
+#define MVE_VPT_UNPREDICATED_INSN_P(INSN)\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) == get_attr_mve_unpredicated_insn (INSN))
+
 #define ARM_SIGN_EXTEND(x)  ((HOST_WIDE_INT)			\
   (HOST_BITS_PER_WIDE_INT <= 32 ? (unsigned HOST_WIDE_INT) (x)	\
   : ((((unsigned HOST_WIDE_INT)(x)) & (unsigned HOST_WIDE_INT) 0xffffffff) |\
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 4a98f2d7b62..3f2863adf44 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,12 @@ (define_attr "fpu" "none,vfp"
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+; An attribute that encodes the CODE_FOR_ of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_ as a way to
+; encode that it is a predicable instruction.
+(define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 547d87f3bc8..7600bf62531 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2311,6 +2311,7 @@ (define_int_attr simd32_op [(UNSPEC_QADD8 "qadd8") (UNSPEC_QSUB8 "qsub8")
 
 (define_int_attr mmla_sfx [(UNSPEC_MATMUL_S "s8") (UNSPEC_MATMUL_U "u8")
 			   (UNSPEC_MATMUL_US "s8")])
+
 ;;MVE int attribute.
 (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_U "u") (VMVNQ_N_S "s") (VMVNQ_N_U "u")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 0fabbaa931b..8aa0bded7f0 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -17,7 +17,7 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; <http://www.gnu.org/licenses/>.
 
-(define_insn "*mve_mov<mode>"
+(define_insn "mve_mov<mode>"
   [(set (match_operand:MVE_types 0 "nonimmediate_operand" "=w,w,r,w   , w,   r,Ux,w")
 	(match_operand:MVE_types 1 "general_operand"  " w,r,w,DnDm,UxUi,r,w, Ul"))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
@@ -81,18 +81,27 @@ (define_insn "*mve_mov"
   return "";
 }
 }
-  [(set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
+   [(set_attr_alternative "mve_unpredicated_insn" [(symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_nothing")])
+   (set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
(set_attr "length" "4,8,8,4,4,8,4,8")
(set_attr "thumb2_pool_range" "*,*,*,*,1018,*,*,*")
(set_attr "neg_pool_range" "*,*,*,*,996,*,*,*")])
 
-(define_insn "*mve_vdup<mode>"
+(define_insn "mve_vdup<mode>"
   [(set (match_operand:MVE_vecs 0 "s_register_operand" "=w")
	(vec_duplicate:MVE_vecs
	  (match_operand:<V_elem> 1 "s_register_operand" "r")))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
   "vdup.<V_sz_elem>\t%q0, %1"
-  [(set_attr "length" "4")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_vdup<mode>"))
+  (set_attr "length" "4")
(set_attr "type" "mve_move")])
 
 ;;
@@ -145,7 +154,8 @@ (define_insn "@mve_<mve_insn>q_f<mode>"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   "<mve_insn>.f%#<V_sz_elem>\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_<mve_insn>q_f<mode>"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -159,7 +169,8 @@ (define_insn "@mve_q_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -173,7 +184,8 @@ (define_insn "mve_vq_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   "v.f%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref 

[PATCH v3 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-01-19 Thread Andre Vieira

Respin after comments from Kyrill and a rebase.  I also removed an
if-then-else construct in arm_mve_check_reg_origin_is_num_elems, similar
to the ones Kyrill pointed out in the other functions.

After an earlier comment from Richard Sandiford I also added comments to the
two tail predication patterns added to explain the need for the unspecs.
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2cd560c9925..76c6ee95c16 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index e5a944486d7..75432f3f73a 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1135 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
  having as single predecessor and successor the body of the loop
  itself.  Only simple loops with a single basic block as body are
  supported for 'low over head loop' making sure that LE target is
  above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-&& single_pred_p (bb)
-&& single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-&& contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx_insn *insn)
+{
+  rtx insn_set = single_set (insn);
+  if (insn_set
+  && GET_CODE (SET_SRC (insn_set)) == UNSPEC
+  && (XINT (SET_SRC (insn_set), 1) == VCTP
+	  || XINT (SET_SRC (insn_set), 1) == VCTP_M))
+{
+  machine_mode mode = GET_MODE (SET_SRC (insn_set));
+  return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	 ? GET_MODE_NUNITS (mode) : 0;
+}
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+ and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+ this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+{
+  requires_vpr = true;
+  if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+  else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+  /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+  for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	  = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	 VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	requires_vpr = false;
+	}
+  /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated insruction.  Return the subrtx
+	 of the VPR reg operand.  */
+  if (requires_vpr)
+	return recog_data.operand[op];
+}
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with 

[PATCH v3 0/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-01-19 Thread Andre Vieira
Hi,

Reworked the patches according to Kyrill's comments, made some other
non-functional changes and rebased.

Reposting as v3 so patchworks picks them up and runs the necessary testing.

Andre Vieira (2):
arm: Add define_attr to to create a mapping between MVE predicated and
  unpredicated insns
arm: Add support for MVE Tail-Predicated Low Overhead Loops

-- 
2.17.1


[RFC] aarch64: Add support for __BitInt

2024-01-10 Thread Andre Vieira (lists)

Hi,

This patch is still work in progress, but I'm posting it to show the failure 
with the bitint-7 test, where handle_stmt called from lower_mergeable_stmt 
ICEs because the idx (3) is out of range for __BitInt(135) with a 
limb_prec of 64.


I hacked gcc locally to work around this issue and still have one 
outstanding failure, so will look to resolve that failure before posting 
a new version.


Kind Regards,
Andrediff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
a5a6b52730d6c5013346d128e89915883f1707ae..15fb0ece5256f25c2ca8bb5cb82fc61488d0393e
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6534,7 +6534,7 @@ aarch64_return_in_memory_1 (const_tree type)
   machine_mode ag_mode;
   int count;
 
-  if (!AGGREGATE_TYPE_P (type)
+  if (!(AGGREGATE_TYPE_P (type) || TREE_CODE (type) == BITINT_TYPE)
   && TREE_CODE (type) != COMPLEX_TYPE
   && TREE_CODE (type) != VECTOR_TYPE)
 /* Simple scalar types always returned in registers.  */
@@ -6618,6 +6618,10 @@ aarch64_function_arg_alignment (machine_mode mode, 
const_tree type,
 
   gcc_assert (TYPE_MODE (type) == mode);
 
+  if (TREE_CODE (type) == BITINT_TYPE
+      && int_size_in_bytes (type) > 16)
+    return GET_MODE_ALIGNMENT (TImode);
+
   if (!AGGREGATE_TYPE_P (type))
 {
   /* The ABI alignment is the natural alignment of the type, without
@@ -21773,6 +21777,11 @@ aarch64_composite_type_p (const_tree type,
   if (type && (AGGREGATE_TYPE_P (type) || TREE_CODE (type) == COMPLEX_TYPE))
 return true;
 
+  if (type
+      && TREE_CODE (type) == BITINT_TYPE
+      && int_size_in_bytes (type) > 16)
+    return true;
+
   if (mode == BLKmode
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_INT)
@@ -28265,6 +28274,29 @@ aarch64_excess_precision (enum excess_precision_type 
type)
   return FLT_EVAL_METHOD_UNPREDICTABLE;
 }
 
+/* Implement TARGET_C_BITINT_TYPE_INFO.
+   Return true if _BitInt(N) is supported and fill its details into *INFO.  */
+bool
+aarch64_bitint_type_info (int n, struct bitint_info *info)
+{
+  if (n <= 8)
+    info->limb_mode = QImode;
+  else if (n <= 16)
+    info->limb_mode = HImode;
+  else if (n <= 32)
+    info->limb_mode = SImode;
+  else
+    info->limb_mode = DImode;
+
+  if (n > 128)
+    info->abi_limb_mode = TImode;
+  else
+    info->abi_limb_mode = info->limb_mode;
+  info->big_endian = TARGET_BIG_END;
+  info->extended = false;
+  return true;
+}
+
 /* Implement TARGET_SCHED_CAN_SPECULATE_INSN.  Return true if INSN can be
scheduled for speculative execution.  Reject the long-running division
and square-root instructions.  */
@@ -30374,6 +30406,9 @@ aarch64_run_selftests (void)
 #undef TARGET_C_EXCESS_PRECISION
 #define TARGET_C_EXCESS_PRECISION aarch64_excess_precision
 
+#undef TARGET_C_BITINT_TYPE_INFO
+#define TARGET_C_BITINT_TYPE_INFO aarch64_bitint_type_info
+
 #undef  TARGET_EXPAND_BUILTIN
 #define TARGET_EXPAND_BUILTIN aarch64_expand_builtin
 
diff --git a/libgcc/config/aarch64/t-softfp b/libgcc/config/aarch64/t-softfp
index 
2e32366f891361e2056c680b2e36edb1871c7670..4302ad52eb881825d0fb65b9ebd21031781781f5
 100644
--- a/libgcc/config/aarch64/t-softfp
+++ b/libgcc/config/aarch64/t-softfp
@@ -4,7 +4,8 @@ softfp_extensions := sftf dftf hftf bfsf
 softfp_truncations := tfsf tfdf tfhf tfbf dfbf sfbf hfbf
 softfp_exclude_libgcc2 := n
 softfp_extras += fixhfti fixunshfti floattihf floatuntihf \
-floatdibf floatundibf floattibf floatuntibf
+floatdibf floatundibf floattibf floatuntibf \
+fixtfbitint floatbitinttf
 
 TARGET_LIBGCC2_CFLAGS += -Wno-missing-prototypes
 


[PATCH v2 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-01-05 Thread Andre Vieira
Respin after comments on first version.
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2f5ca79ed8d..4f164c54740 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 0c0cb14a8a4..1ee72bcb7ec 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1147 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
  having as single predecessor and successor the body of the loop
  itself.  Only simple loops with a single basic block as body are
  supported for 'low over head loop' making sure that LE target is
  above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-&& single_pred_p (bb)
-&& single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-&& contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx_insn *insn)
+{
+  rtx insn_set = single_set (insn);
+  if (insn_set
+  && GET_CODE (SET_SRC (insn_set)) == UNSPEC
+  && (XINT (SET_SRC (insn_set), 1) == VCTP
+	  || XINT (SET_SRC (insn_set), 1) == VCTP_M))
+{
+  machine_mode mode = GET_MODE (SET_SRC (insn_set));
+  return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	 ? GET_MODE_NUNITS (mode) : 0;
+}
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+ and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+ this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+{
+  requires_vpr = true;
+  if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+  else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+  /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+  for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	  = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	 VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	requires_vpr = false;
+	}
+  /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is VPR-generating instruction, like a vctp,
+	 vcmp, etc., or it is a VPT-predicated instruction.  Return the subrtx
+	 of the VPR reg operand.  */
+  if (requires_vpr)
+	return recog_data.operand[op];
+}
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+   

[PATCH v2 1/2] arm: Add define_attr to to create a mapping between MVE predicated and unpredicated insns

2024-01-05 Thread Andre Vieira
Respin of first version to address comments and make it buildable on its own.

diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index a9c2752c0ea..f0b01b7461f 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -2375,6 +2375,21 @@ extern int making_const_table;
   else if (TARGET_THUMB1)\
 thumb1_final_prescan_insn (INSN)
 
+/* These defines are useful to refer to the value of the mve_unpredicated_insn
+   insn attribute.  Note that, because these use the get_attr_* function, these
+   will change recog_data if (INSN) isn't current_insn.  */
+#define MVE_VPT_PREDICABLE_INSN_P(INSN)	\
+  (recog_memoized (INSN) >= 0		\
+   && get_attr_mve_unpredicated_insn (INSN) != CODE_FOR_nothing)
+
+#define MVE_VPT_PREDICATED_INSN_P(INSN)	\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) != get_attr_mve_unpredicated_insn (INSN))
+
+#define MVE_VPT_UNPREDICATED_INSN_P(INSN)\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) == get_attr_mve_unpredicated_insn (INSN))
+
 #define ARM_SIGN_EXTEND(x)  ((HOST_WIDE_INT)			\
   (HOST_BITS_PER_WIDE_INT <= 32 ? (unsigned HOST_WIDE_INT) (x)	\
   : ((((unsigned HOST_WIDE_INT)(x)) & (unsigned HOST_WIDE_INT) 0xffffffff) |\
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 07eaf06cdea..296212be33f 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,12 @@ (define_attr "fpu" "none,vfp"
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+; An attribute that encodes the CODE_FOR_ of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_ as a way to
+; encode that it is a predicable instruction.
+(define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index a9803538101..5ea2d9e8668 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2305,6 +2305,7 @@ (define_int_attr simd32_op [(UNSPEC_QADD8 "qadd8") (UNSPEC_QSUB8 "qsub8")
 
 (define_int_attr mmla_sfx [(UNSPEC_MATMUL_S "s8") (UNSPEC_MATMUL_U "u8")
 			   (UNSPEC_MATMUL_US "s8")])
+
 ;;MVE int attribute.
 (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_U "u") (VMVNQ_N_S "s") (VMVNQ_N_U "u")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index b0d3443da9c..b1862d7977e 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -17,7 +17,7 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; <http://www.gnu.org/licenses/>.
 
-(define_insn "*mve_mov<mode>"
+(define_insn "mve_mov<mode>"
   [(set (match_operand:MVE_types 0 "nonimmediate_operand" "=w,w,r,w   , w,   r,Ux,w")
 	(match_operand:MVE_types 1 "general_operand"  " w,r,w,DnDm,UxUi,r,w, Ul"))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
@@ -81,18 +81,27 @@ (define_insn "*mve_mov"
   return "";
 }
 }
-  [(set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
+   [(set_attr_alternative "mve_unpredicated_insn" [(symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_nothing")])
+   (set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
(set_attr "length" "4,8,8,4,4,8,4,8")
(set_attr "thumb2_pool_range" "*,*,*,*,1018,*,*,*")
(set_attr "neg_pool_range" "*,*,*,*,996,*,*,*")])
 
-(define_insn "*mve_vdup<mode>"
+(define_insn "mve_vdup<mode>"
   [(set (match_operand:MVE_vecs 0 "s_register_operand" "=w")
	(vec_duplicate:MVE_vecs
	  (match_operand:<V_elem> 1 "s_register_operand" "r")))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
   "vdup.<V_sz_elem>\t%q0, %1"
-  [(set_attr "length" "4")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_vdup<mode>"))
+  (set_attr "length" "4")
(set_attr "type" "mve_move")])
 
 ;;
@@ -145,7 +154,8 @@ (define_insn "@mve_<mve_insn>q_f<mode>"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   "<mve_insn>.f%#<V_sz_elem>\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_<mve_insn>q_f<mode>"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -159,7 +169,8 @@ (define_insn "@mve_q_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -173,7 +184,8 @@ (define_insn "mve_vq_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   "v.f%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref 

[PATCH v2 0/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-01-05 Thread Andre Vieira
Hi,

Resending series version 2 addression comments on first version, also moved
parts of the first patch to the second so it can be built without the second
patch.

Andre Vieira (2):
  arm: Add define_attr to to create a mapping between MVE predicated and
unpredicated insns
  arm: Add support for MVE Tail-Predicated Low Overhead Loops


-- 
2.17.1


Re: [PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-12-20 Thread Andre Vieira (lists)
Squashed the definition and changes to predicated_doloop_end_internal 
and dlstp*_insn into this patch to make sure the first patch builds 
independently.


On 18/12/2023 11:53, Andre Vieira wrote:


Reworked Stam's patch after comments in:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640362.html

The original gcc ChangeLog remains unchanged, but I did split up some tests so
here is the testsuite ChangeLog.


gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Update framework.
* gcc.target/arm/lob1.c: Likewise.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/mve/dlstp-compile-asm.c: New test.
* gcc.target/arm/mve/dlstp-int16x8.c: New test.
* gcc.target/arm/mve/dlstp-int16x8-run.c: New test.
* gcc.target/arm/mve/dlstp-int32x4.c: New test.
* gcc.target/arm/mve/dlstp-int32x4-run.c: New test.
* gcc.target/arm/mve/dlstp-int64x2.c: New test.
* gcc.target/arm/mve/dlstp-int64x2-run.c: New test.
* gcc.target/arm/mve/dlstp-int8x16.c: New test.
* gcc.target/arm/mve/dlstp-int8x16-run.c: New test.
* gcc.target/arm/mve/dlstp-invalid-asm.c: New test.
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 
2f5ca79ed8ddd647b212782a0454ee4fefc07257..4f164c547406c43219900c111401540c7ef9d7d1
 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 
0c0cb14a8a4f043357b8acd7042a9f9386af1eb1..1ee72bcb7ec4bea5feea8453ceef7702b0088a73
 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const 
arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1147 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
  having as single predecessor and successor the body of the loop
  itself.  Only simple loops with a single basic block as body are
  supported for 'low over head loop' making sure that LE target is
  above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-&& single_pred_p (bb)
-&& single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-&& contains_no_active_insn_p (bb);
+&& single_pred_p (bb)
+&& single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx_insn *insn)
+{
+  rtx insn_set = single_set (insn);
+  if (insn_set
+  && GET_CODE (SET_SRC (insn_set)) == UNSPEC
+  && (XINT (SET_SRC (insn_set), 1) == VCTP
+ || XINT (SET_SRC (insn_set), 1) == VCTP_M))
+{
+  machine_mode mode = GET_MODE (SET_SRC (insn_set));
+  return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+? GET_MODE_NUNITS (mode) : 0;
+}
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+ and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+ this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0;

Re: [PATCH 1/2] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns

2023-12-20 Thread Andre Vieira (lists)
Reworked patch after Richard's comments and moved 
predicated_doloop_end_internal and dlstp*_insn to the next patch in the 
series to make sure this one builds on its own.


On 18/12/2023 11:53, Andre Vieira wrote:


Re-sending Stam's first patch, same as:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635301.html

Hopefully patchworks can pick this up :)
diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 
a9c2752c0ea5ecd4597ded254e9426753ac0a098..f0b01b7461f883994a0be137cb6cbf079d54618b
 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -2375,6 +2375,21 @@ extern int making_const_table;
   else if (TARGET_THUMB1)  \
 thumb1_final_prescan_insn (INSN)
 
+/* These defines are useful to refer to the value of the mve_unpredicated_insn
+   insn attribute.  Note that, because these use the get_attr_* function, these
+   will change recog_data if (INSN) isn't current_insn.  */
+#define MVE_VPT_PREDICABLE_INSN_P(INSN)
\
+  (recog_memoized (INSN) >= 0  \
+   && get_attr_mve_unpredicated_insn (INSN) != CODE_FOR_nothing)
+
+#define MVE_VPT_PREDICATED_INSN_P(INSN)
\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)\
+   && recog_memoized (INSN) != get_attr_mve_unpredicated_insn (INSN))
+
+#define MVE_VPT_UNPREDICATED_INSN_P(INSN)  \
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)\
+   && recog_memoized (INSN) == get_attr_mve_unpredicated_insn (INSN))
+
 #define ARM_SIGN_EXTEND(x)  ((HOST_WIDE_INT)   \
   (HOST_BITS_PER_WIDE_INT <= 32 ? (unsigned HOST_WIDE_INT) (x) \
   : ((((unsigned HOST_WIDE_INT)(x)) & (unsigned HOST_WIDE_INT) 0xffffffff) |\
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 
07eaf06cdeace750fe1c7d399deb833ef5fc2b66..296212be33ffe6397b05491d8854d2a59f7c54df
 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,12 @@ (define_attr "fpu" "none,vfp"
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+; An attribute that encodes the CODE_FOR_ of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_ as a way to
+; encode that it is a predicable instruction.
+(define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 
a980353810166312d5bdfc8ad58b2825c910d0a0..5ea2d9e866891bdb3dc73fcf6cbd6cdd2f989951
 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2305,6 +2305,7 @@ (define_int_attr simd32_op [(UNSPEC_QADD8 "qadd8") 
(UNSPEC_QSUB8 "qsub8")
 
 (define_int_attr mmla_sfx [(UNSPEC_MATMUL_S "s8") (UNSPEC_MATMUL_U "u8")
   (UNSPEC_MATMUL_US "s8")])
+
 ;;MVE int attribute.
 (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
   (VREV16Q_U "u") (VMVNQ_N_S "s") (VMVNQ_N_U "u")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 
b0d3443da9cee991193d390200738290806a1e69..b1862d7977e91605cd971e634105bed3fa6e75cb
 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -17,7 +17,7 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; <http://www.gnu.org/licenses/>.
 
-(define_insn "*mve_mov"
+(define_insn "mve_mov"
   [(set (match_operand:MVE_types 0 "nonimmediate_operand" "=w,w,r,w   , w,   
r,Ux,w")
(match_operand:MVE_types 1 "general_operand"  " 
w,r,w,DnDm,UxUi,r,w, Ul"))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
@@ -81,18 +81,27 @@ (define_insn "*mve_mov"
   return "";
 }
 }
-  [(set_attr "type" 
"mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
+   [(set_attr_alternative "mve_unpredicated_insn" [(symbol_ref 
"CODE_FOR_mve_mov")
+  (symbol_ref 
"CODE_FOR_nothing")
+  (symbol_ref 
"CODE_FOR_nothing")
+  (symbol_ref 
"CODE_FOR_mve_mov")
+  (symbol_ref 
"CODE_FOR_mve_mov")
+  (symbol_ref 
"CODE_FOR_nothing")
+

omp: Fix simdclone arguments with veclen lower than simdlen [PR113040]

2023-12-20 Thread Andre Vieira (lists)

This patch fixes an issue introduced by:
commit ea4a3d08f11a59319df7b750a955ac613a3f438a
Author: Andre Vieira 
Date:   Wed Nov 1 17:02:41 2023 +0000

omp: Reorder call for TARGET_SIMD_CLONE_ADJUST

The problem was that after this patch we no longer added multiple 
arguments for vector arguments where the veclen was lower than the simdlen.
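
To illustrate (a hedged sketch; the function name and ISA details are
mine, not from the patch): for an x86 AVX2 ('d') clone of

#pragma omp declare simd simdlen(8) notinbranch
double foo (double x);

the simdlen is 8 doubles (512 bits) while the ISA's veclen for double is
4 (one 256-bit vector), so the clone's rebuilt parameter list must
contain 8/4 = 2 vector arguments, along the lines of

__m256d _ZGVdN8v_foo (__m256d x_lo, __m256d x_hi);

Before this fix only a single vector argument was recorded for such a
parameter.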


gcc/ChangeLog:

* omp-simd-clone.cc (simd_clone_adjust_argument_types): Add multiple
vector arguments where simdlen is larger than veclen.

Bootstrapped and regression tested on x86_64-pc-linux-gnu and 
aarch64-unknown-linux-gnu.


OK for trunk?

PS: I'm struggling to add a testcase for this: the dumps don't show the
simdclone prototype, and I can't easily create a run-test as it requires
glibc.  The only option is a very flaky assembly scan test checking that
the code writes to ymm4 (i.e. that enough parameters are being passed),
but I haven't added one because I don't think that's a good idea.
PPS: maybe we ought to print the simdclone prototype when passing
-fdump-ipa-simdclone?

diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc
index 
3fbe428125243bc02bd58f6e50ac773e8df8..5151fef3bcdaa76802184df43ba13b8709645fd4
 100644
--- a/gcc/omp-simd-clone.cc
+++ b/gcc/omp-simd-clone.cc
@@ -781,6 +781,7 @@ simd_clone_adjust_argument_types (struct cgraph_node *node)
   struct cgraph_simd_clone *sc = node->simdclone;
   unsigned i, k;
   poly_uint64 veclen;
+  auto_vec<tree> new_params;
 
   for (i = 0; i < sc->nargs; ++i)
 {
@@ -798,9 +799,11 @@ simd_clone_adjust_argument_types (struct cgraph_node *node)
   switch (sc->args[i].arg_type)
{
default:
+ new_params.safe_push (parm_type);
  break;
case SIMD_CLONE_ARG_TYPE_LINEAR_UVAL_CONSTANT_STEP:
case SIMD_CLONE_ARG_TYPE_LINEAR_UVAL_VARIABLE_STEP:
+ new_params.safe_push (parm_type);
  if (node->definition)
sc->args[i].simd_array
  = create_tmp_simd_array (IDENTIFIER_POINTER (DECL_NAME (parm)),
@@ -828,6 +831,9 @@ simd_clone_adjust_argument_types (struct cgraph_node *node)
  else
vtype = build_vector_type (parm_type, veclen);
  sc->args[i].vector_type = vtype;
+ k = vector_unroll_factor (sc->simdlen, veclen);
+ for (unsigned j = 0; j < k; j++)
+   new_params.safe_push (vtype);
 
  if (node->definition)
sc->args[i].simd_array
@@ -893,22 +899,8 @@ simd_clone_adjust_argument_types (struct cgraph_node *node)
last_parm_void = true;
 
   gcc_assert (TYPE_ARG_TYPES (TREE_TYPE (node->decl)));
-  for (i = 0; i < sc->nargs; i++)
-   {
- tree ptype;
- switch (sc->args[i].arg_type)
-   {
-   default:
- ptype = sc->args[i].orig_type;
- break;
-   case SIMD_CLONE_ARG_TYPE_LINEAR_VAL_CONSTANT_STEP:
-   case SIMD_CLONE_ARG_TYPE_LINEAR_VAL_VARIABLE_STEP:
-   case SIMD_CLONE_ARG_TYPE_VECTOR:
- ptype = sc->args[i].vector_type;
- break;
-   }
- new_arg_types = tree_cons (NULL_TREE, ptype, new_arg_types);
-   }
+  for (i = 0; i < new_params.length (); i++)
+   new_arg_types = tree_cons (NULL_TREE, new_params[i], new_arg_types);
   new_reversed = nreverse (new_arg_types);
   if (last_parm_void)
{


Re: veclower: improve selection of vector mode when lowering [PR 112787]

2023-12-20 Thread Andre Vieira (lists)

Thanks, fully agree with all comments.

gcc/ChangeLog:

PR target/112787
* tree-vect-generic (type_for_widest_vector_mode): Change function
	to use original vector type and check widest vector mode has at most
	the same number of elements.
(get_compute_type): Pass original vector type rather than the element
type to type_for_widest_vector_mode and remove now obsolete check
for the number of elements.

On 07/12/2023 07:45, Richard Biener wrote:

On Wed, 6 Dec 2023, Andre Vieira (lists) wrote:


Hi,

This patch addresses the issue reported in PR target/112787 by improving the
compute type selection.  We do this by not considering types with more
elements
than the type we are lowering since we'd reject such types anyway.

gcc/ChangeLog:

PR target/112787
* tree-vect-generic (type_for_widest_vector_mode): Add a parameter to
control maximum amount of elements in resulting vector mode.
(get_compute_type): Restrict vector_compute_type to a mode no wider
than the original compute type.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr112787.c: New test.

Bootstrapped and regression tested on aarch64-unknown-linux-gnu and
x86_64-pc-linux-gnu.

Is this OK for trunk?


@@ -1347,7 +1347,7 @@ optimize_vector_constructor (gimple_stmt_iterator
*gsi)
 TYPE, or NULL_TREE if none is found.  */

Can you improve the function comment?  It also doesn't mention OP ...

  static tree
-type_for_widest_vector_mode (tree type, optab op)
+type_for_widest_vector_mode (tree type, optab op, poly_int64 max_nunits =
0)
  {
machine_mode inner_mode = TYPE_MODE (type);
machine_mode best_mode = VOIDmode, mode;
@@ -1371,7 +1371,9 @@ type_for_widest_vector_mode (tree type, optab op)
FOR_EACH_MODE_FROM (mode, mode)
  if (GET_MODE_INNER (mode) == inner_mode
 && maybe_gt (GET_MODE_NUNITS (mode), best_nunits)
-   && optab_handler (op, mode) != CODE_FOR_nothing)
+   && optab_handler (op, mode) != CODE_FOR_nothing
+   && (known_eq (max_nunits, 0)
+   || known_lt (GET_MODE_NUNITS (mode), max_nunits)))

max_nunits suggests that known_le would be appropriate instead.

I see the only other caller with similar "problems":

 }
   /* Can't use get_compute_type here, as supportable_convert_operation
  doesn't necessarily use an optab and needs two arguments.  */
   tree vec_compute_type
 = type_for_widest_vector_mode (TREE_TYPE (arg_type), mov_optab);
   if (vec_compute_type
   && VECTOR_MODE_P (TYPE_MODE (vec_compute_type))
   && subparts_gt (arg_type, vec_compute_type))

so please do not default to 0 but adjust this one as well.  It also
seems you then can remove the subparts_gt guards on both
vec_compute_type uses.

I think the API would be cleaner if we'd pass the original vector type
we can then extract TYPE_VECTOR_SUBPARTS from, avoiding the extra arg.

No?

Thanks,
Richard.

diff --git a/gcc/testsuite/gcc.target/aarch64/pr112787.c b/gcc/testsuite/gcc.target/aarch64/pr112787.c
new file mode 100644
index 
..caca1bf7ef447e4489b2c134d7200a4afd16763f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr112787.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -march=armv8-a+sve -mcpu=neoverse-n2" } */
+
+typedef int __attribute__((__vector_size__ (64))) vec;
+
+vec fn (vec a, vec b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler-times {add\tv[0-9]+} 4 } } */
diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
index 
a7e6cb87a5e31e3dd2a893ea5652eeebf8d5d214..c906eb3521ea01fd2bdfc89c3476d02c555cf8cc
 100644
--- a/gcc/tree-vect-generic.cc
+++ b/gcc/tree-vect-generic.cc
@@ -1343,12 +1343,16 @@ optimize_vector_constructor (gimple_stmt_iterator *gsi)
   gsi_replace (gsi, g, false);
 }
 
-/* Return a type for the widest vector mode whose components are of type
-   TYPE, or NULL_TREE if none is found.  */
+/* Return a type for the widest vector mode with the same element type as
+   type ORIGINAL_VECTOR_TYPE, with at most the same number of elements as type
+   ORIGINAL_VECTOR_TYPE and that is supported by the target for an operation
+   with optab OP, or return NULL_TREE if none is found.  */
 
 static tree
-type_for_widest_vector_mode (tree type, optab op)
+type_for_widest_vector_mode (tree original_vector_type, optab op)
 {
+  gcc_assert (VECTOR_TYPE_P (original_vector_type));
+  tree type = TREE_TYPE (original_vector_type);
   machine_mode inner_mode = TYPE_MODE (type);
   machine_mode best_mode = VOIDmode, mode;
   poly_int64 best_nunits = 0;
@@ -1371,7 +1375,9 @@ type_for_widest_vector_mode (tree type, optab op)
   FOR_EACH_MODE_FROM (mode, mode)
 if (GET_MODE_INNER (mode) == inner_mode
&& maybe_gt (GET_MODE_NUNITS (mode), best_nunits)
-   && optab_handler (op

[PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-12-18 Thread Andre Vieira

Reworked Stam's patch after comments in:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640362.html

The original gcc ChangeLog remains unchanged, but I did split up some tests so
here is the testsuite ChangeLog.


gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Update framework.
* gcc.target/arm/lob1.c: Likewise.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/mve/dlstp-compile-asm.c: New test.
* gcc.target/arm/mve/dlstp-int16x8.c: New test.
* gcc.target/arm/mve/dlstp-int16x8-run.c: New test.
* gcc.target/arm/mve/dlstp-int32x4.c: New test.
* gcc.target/arm/mve/dlstp-int32x4-run.c: New test.
* gcc.target/arm/mve/dlstp-int64x2.c: New test.
* gcc.target/arm/mve/dlstp-int64x2-run.c: New test.
* gcc.target/arm/mve/dlstp-int8x16.c: New test.
* gcc.target/arm/mve/dlstp-int8x16-run.c: New test.
* gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2f5ca79ed8d..4f164c54740 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 0c0cb14a8a4..1ee72bcb7ec 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1147 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
  having as single predecessor and successor the body of the loop
  itself.  Only simple loops with a single basic block as body are
  supported for 'low over head loop' making sure that LE target is
  above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-&& single_pred_p (bb)
-&& single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-&& contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx_insn *insn)
+{
+  rtx insn_set = single_set (insn);
+  if (insn_set
+  && GET_CODE (SET_SRC (insn_set)) == UNSPEC
+  && (XINT (SET_SRC (insn_set), 1) == VCTP
+	  || XINT (SET_SRC (insn_set), 1) == VCTP_M))
+{
+  machine_mode mode = GET_MODE (SET_SRC (insn_set));
+  return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	 ? GET_MODE_NUNITS (mode) : 0;
+}
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+ and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+ this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+{
+  requires_vpr = true;
+  if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+  else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+  /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+  for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const 

[PATCH 1/2] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns

2023-12-18 Thread Andre Vieira

Re-sending Stam's first patch, same as:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635301.html

Hopefully patchworks can pick this up :)

diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index a9c2752c0ea..0b0e8620717 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -2375,6 +2375,21 @@ extern int making_const_table;
   else if (TARGET_THUMB1)\
 thumb1_final_prescan_insn (INSN)
 
+/* These defines are useful to refer to the value of the mve_unpredicated_insn
+   insn attribute.  Note that, because these use the get_attr_* function, these
+   will change recog_data if (INSN) isn't current_insn.  */
+#define MVE_VPT_PREDICABLE_INSN_P(INSN)	\
+  (recog_memoized (INSN) >= 0		\
+  && get_attr_mve_unpredicated_insn (INSN) != 0)			\
+
+#define MVE_VPT_PREDICATED_INSN_P(INSN)	\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) != get_attr_mve_unpredicated_insn (INSN))	\
+
+#define MVE_VPT_UNPREDICATED_INSN_P(INSN)\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) == get_attr_mve_unpredicated_insn (INSN))	\
+
 #define ARM_SIGN_EXTEND(x)  ((HOST_WIDE_INT)			\
   (HOST_BITS_PER_WIDE_INT <= 32 ? (unsigned HOST_WIDE_INT) (x)	\
   : ((((unsigned HOST_WIDE_INT)(x)) & (unsigned HOST_WIDE_INT) 0xffffffff) |\
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 07eaf06cdea..8efdebecc3c 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,8 @@ (define_attr "fpu" "none,vfp"
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+(define_attr "mve_unpredicated_insn" "" (const_int 0))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index a9803538101..5ea2d9e8668 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2305,6 +2305,7 @@ (define_int_attr simd32_op [(UNSPEC_QADD8 "qadd8") (UNSPEC_QSUB8 "qsub8")
 
 (define_int_attr mmla_sfx [(UNSPEC_MATMUL_S "s8") (UNSPEC_MATMUL_U "u8")
 			   (UNSPEC_MATMUL_US "s8")])
+
 ;;MVE int attribute.
 (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_U "u") (VMVNQ_N_S "s") (VMVNQ_N_U "u")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index b0d3443da9c..62df022ef19 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -17,7 +17,7 @@
 ;; along with GCC; see the file COPYING3.  If not see
;; <http://www.gnu.org/licenses/>.
 
-(define_insn "*mve_mov"
+(define_insn "mve_mov"
   [(set (match_operand:MVE_types 0 "nonimmediate_operand" "=w,w,r,w   , w,   r,Ux,w")
 	(match_operand:MVE_types 1 "general_operand"  " w,r,w,DnDm,UxUi,r,w, Ul"))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
@@ -81,18 +81,27 @@ (define_insn "*mve_mov"
   return "";
 }
 }
-  [(set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
+   [(set_attr_alternative "mve_unpredicated_insn" [(symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_nothing")])
+   (set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
(set_attr "length" "4,8,8,4,4,8,4,8")
(set_attr "thumb2_pool_range" "*,*,*,*,1018,*,*,*")
(set_attr "neg_pool_range" "*,*,*,*,996,*,*,*")])
 
-(define_insn "*mve_vdup"
+(define_insn "mve_vdup"
   [(set (match_operand:MVE_vecs 0 "s_register_operand" "=w")
 	(vec_duplicate:MVE_vecs
 	  (match_operand: 1 "s_register_operand" "r")))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
   "vdup.\t%q0, %1"
-  [(set_attr "length" "4")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_vdup"))
+  (set_attr "length" "4")
(set_attr "type" "mve_move")])
 
 ;;
@@ -145,7 +154,8 @@ (define_insn "@mve_q_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".f%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -159,7 +169,8 @@ (define_insn "@mve_q_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -173,7 +184,8 @@ (define_insn "mve_vq_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   "v.f%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_vq_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -187,7 +199,8 @@ (define_insn "@mve_q_n_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".%#\t%q0, %1"
-  [(set_attr "type" "mve_move")
+ [(set (attr 

[PATCH 0/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-12-18 Thread Andre Vieira
Resending series to make use of the Linaro pre-commit CI in patchworks.

Andre Vieira (2):
  arm: Add define_attr to create a mapping between MVE predicated and
unpredicated insns
  arm: Add support for MVE Tail-Predicated Low Overhead Loops

-- 
2.17.1


Re: [PATCH] Fix tests for gomp

2023-12-13 Thread Andre Vieira (lists)




On 13/12/2023 10:55, Jakub Jelinek wrote:

On Wed, Dec 13, 2023 at 10:43:16AM +0000, Andre Vieira (lists) wrote:

Hi,

Apologies for the delay and this mixup. I need to do something different

This is to fix testisms initially introduced by:
commit f5fc001a84a7dbb942a6252b3162dd38b4aae311
Author: Andre Vieira 
Date:   Mon Dec 11 14:24:41 2023 +0000

 aarch64: enable mixed-types for aarch64 simdclones

gcc/testsuite/ChangeLog:

* gcc.dg/gomp/pr87887-1.c: Fixed test.
* gcc.dg/gomp/pr89246-1.c: Likewise.
* gcc.dg/gomp/simd-clones-2.c: Likewise.

libgomp/ChangeLog:

* testsuite/libgomp.c/declare-variant-1.c: Fixed test.
* testsuite/libgomp.fortran/declare-simd-1.f90: Likewise.

OK for trunk? I was intending to commit as obvious, but Jakub had made a
comment about declare-simd-1.f90 so I thought it might be worth just sending
it up to the mailing list first.



--- a/libgomp/testsuite/libgomp.c/declare-variant-1.c
+++ b/libgomp/testsuite/libgomp.c/declare-variant-1.c
@@ -40,16 +40,17 @@ f04 (int a)
  int
  test1 (int x)
  {
-  /* At gimplification time, we can't decide yet which function to call.  */
-  /* { dg-final { scan-tree-dump-times "f04 \\\(x" 2 "gimple" } } */
+  /* At gimplification time, we can't decide yet which function to call for
+ x86_64 targets, given the f01 variant.  */
+  /* { dg-final { scan-tree-dump-times "f04 \\\(x" 2 "gimple" { target 
x86_64-*-* } } } */
/* After simd clones are created, the original non-clone test1 shall
   call f03 (score 6), the sse2/avx/avx2 clones too, but avx512f clones
   shall call f01 with score 8.  */
/* { dg-final { scan-ltrans-tree-dump-not "f04 \\\(x" "optimized" } } */
-  /* { dg-final { scan-ltrans-tree-dump-times "f03 \\\(x" 14 "optimized" { 
target { !aarch64*-*-* } } } } } */
-  /* { dg-final { scan-ltrans-tree-dump-times "f01 \\\(x" 4 "optimized" { 
target { !aarch64*-*-* } } } } } */
-  /* { dg-final { scan-ltrans-tree-dump-times "f03 \\\(x" 10 "optimized" { 
target { aarch64*-*-* } } } } } */
-  /* { dg-final { scan-ltrans-tree-dump-not "f01 \\\(x" "optimized" { target { 
aarch64*-*-* } } } } } */
+  /* { dg-final { scan-ltrans-tree-dump-times "f03 \\\(x" 14 "optimized" { 
target { !aarch64*-*-* } } } } */
+  /* { dg-final { scan-ltrans-tree-dump-times "f01 \\\(x" 4 "optimized" { 
target { !aarch64*-*-* } } } } */
+  /* { dg-final { scan-ltrans-tree-dump-times "f03 \\\(x" 10 "optimized" { 
target { aarch64*-*-* } } } } */
+  /* { dg-final { scan-ltrans-tree-dump-not "f01 \\\(x" "optimized" { target { 
aarch64*-*-* } } } } */


The changes in this test look all wrong.  The differences are
i?86-*-* x86_64-*-* (which can support avx512f isa) vs. other targets (which
can't).
So, there is nothing aarch64 specific in there and { target x86_64-*-* }
is also incorrect.  It should be simply
{ target i?86-*-* x86_64-*-* }
vs.
{ target { ! { i?86-*-* x86_64-*-* } } }
(never sure about the ! syntaxes).



Hmm, I think I understand what you are saying, but I'm not sure I
agree. Before I enabled simdclone testing for aarch64, this test had no
target selectors, so it checked the same thing for all 'simdclone test
targets', which seem to be x86 and amdgcn:


@@ -4321,7 +4321,8 @@ proc check_effective_target_vect_simd_clones { } {
 return [check_cached_effective_target_indexed vect_simd_clones {
   expr { (([istarget i?86-*-*] || [istarget x86_64-*-*])
  && [check_effective_target_avx512f])
|| [istarget amdgcn-*-*]
|| [istarget aarch64*-*-*] }}]
 }

I haven't checked what amdgcn does with this test, but I'd have to 
assume they were passing before? Though I'm not sure how amdgcn would 
pass the original:
 -  /* At gimplification time, we can't decide yet which function to 
call.  */

 -  /* { dg-final { scan-tree-dump-times "f04 \\\(x" 2 "gimple" } } */

I've added Andrew to the mail to see if he can comment on that. Either
way, I'd suggest we either add scans per target with the expected values
(see the sketch below) or stick with my original change of aarch64 vs
non-aarch64, as I think that better reflects the change of enabling this
for aarch64 where it wasn't run before.
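
For example, per-target scans might look something like this (a sketch,
untested; the non-x86 expectations would need verifying on amdgcn and
aarch64):

/* { dg-final { scan-tree-dump-times "f04 \\\(x" 2 "gimple" { target i?86-*-* x86_64-*-* } } } */
/* { dg-final { scan-tree-dump-not "f04 \\\(x" "gimple" { target { ! { i?86-*-* x86_64-*-* } } } } } */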


[PATCH] Fix tests for gomp

2023-12-13 Thread Andre Vieira (lists)

Hi,

Apologies for the delay and this mixup. I need to do something different

This is to fix testisms initially introduced by:
commit f5fc001a84a7dbb942a6252b3162dd38b4aae311
Author: Andre Vieira 
Date:   Mon Dec 11 14:24:41 2023 +0000

aarch64: enable mixed-types for aarch64 simdclones

gcc/testsuite/ChangeLog:

* gcc.dg/gomp/pr87887-1.c: Fixed test.
* gcc.dg/gomp/pr89246-1.c: Likewise.
* gcc.dg/gomp/simd-clones-2.c: Likewise.

libgomp/ChangeLog:

* testsuite/libgomp.c/declare-variant-1.c: Fixed test.
* testsuite/libgomp.fortran/declare-simd-1.f90: Likewise.

OK for trunk? I was intending to commit as obvious, but Jakub had made a 
comment about declare-simd-1.f90 so I thought it might be worth just 
sending it up to the mailing list first.


Kind regards,
Andre

diff --git a/gcc/testsuite/gcc.dg/gomp/pr87887-1.c b/gcc/testsuite/gcc.dg/gomp/pr87887-1.c
index 
281898300c7794d862e62c70a83a33d5aaa8f89e..8b04ffd0809be4e6f5ab97c2e32e800edffbee4f
 100644
--- a/gcc/testsuite/gcc.dg/gomp/pr87887-1.c
+++ b/gcc/testsuite/gcc.dg/gomp/pr87887-1.c
@@ -10,7 +10,6 @@ foo (int x)
 {
   return (struct S) { x };
 }
-/* { dg-warning "unsupported return type ‘struct S’ for ‘simd’ functions" "" { 
target aarch64*-*-* } .-4 } */
 
 #pragma omp declare simd
 int
@@ -18,7 +17,6 @@ bar (struct S x)
 {
   return x.n;
 }
-/* { dg-warning "unsupported argument type ‘struct S’ for ‘simd’ functions" "" 
{ target aarch64*-*-* } .-4 } */
 
 #pragma omp declare simd uniform (x)
 int
diff --git a/gcc/testsuite/gcc.dg/gomp/pr89246-1.c 
b/gcc/testsuite/gcc.dg/gomp/pr89246-1.c
index 
4a0fd74f0639b2832dcb9101e006d127568fbcbd..dfe629c1c6a51624cd94878c638606220cfe94eb
 100644
--- a/gcc/testsuite/gcc.dg/gomp/pr89246-1.c
+++ b/gcc/testsuite/gcc.dg/gomp/pr89246-1.c
@@ -8,7 +8,6 @@ int foo (__int128 x)
 {
   return x;
 }
-/* { dg-warning "unsupported argument type ‘__int128’ for ‘simd’ functions" "" 
{ target aarch64*-*-* } .-4 } */
 
 #pragma omp declare simd
 extern int bar (int x);
diff --git a/gcc/testsuite/gcc.dg/gomp/simd-clones-2.c 
b/gcc/testsuite/gcc.dg/gomp/simd-clones-2.c
index 
f12244054bd46fa10e51cc3a688c4cf683689994..354078acd9f3073b8400621a0e7149aee571594b
 100644
--- a/gcc/testsuite/gcc.dg/gomp/simd-clones-2.c
+++ b/gcc/testsuite/gcc.dg/gomp/simd-clones-2.c
@@ -19,7 +19,6 @@ float setArray(float *a, float x, int k)
 /* { dg-final { scan-tree-dump "_ZGVnN2ua32vl_setArray" "optimized" { target 
aarch64*-*-* } } } */
 /* { dg-final { scan-tree-dump "_ZGVnN4ua32vl_setArray" "optimized" { target 
aarch64*-*-* } } } */
 /* { dg-final { scan-tree-dump "_ZGVnN2vvva32_addit" "optimized" { target 
aarch64*-*-* } } } */
-/* { dg-final { scan-tree-dump "_ZGVnN4vvva32_addit" "optimized" { target 
aarch64*-*-* } } } */
 /* { dg-final { scan-tree-dump "_ZGVnM2vl66u_addit" "optimized" { target 
aarch64*-*-* } } } */
 /* { dg-final { scan-tree-dump "_ZGVnM4vl66u_addit" "optimized" { target 
aarch64*-*-* } } } */
 
diff --git a/libgomp/testsuite/libgomp.c/declare-variant-1.c 
b/libgomp/testsuite/libgomp.c/declare-variant-1.c
index 
6129f23a0f80585246957022d63608dc3a68f1ff..790e9374054fe3e0ae609796640ff295b61e8389
 100644
--- a/libgomp/testsuite/libgomp.c/declare-variant-1.c
+++ b/libgomp/testsuite/libgomp.c/declare-variant-1.c
@@ -40,16 +40,17 @@ f04 (int a)
 int
 test1 (int x)
 {
-  /* At gimplification time, we can't decide yet which function to call.  */
-  /* { dg-final { scan-tree-dump-times "f04 \\\(x" 2 "gimple" } } */
+  /* At gimplification time, we can't decide yet which function to call for
+ x86_64 targets, given the f01 variant.  */
+  /* { dg-final { scan-tree-dump-times "f04 \\\(x" 2 "gimple" { target 
x86_64-*-* } } } */
   /* After simd clones are created, the original non-clone test1 shall
  call f03 (score 6), the sse2/avx/avx2 clones too, but avx512f clones
  shall call f01 with score 8.  */
   /* { dg-final { scan-ltrans-tree-dump-not "f04 \\\(x" "optimized" } } */
-  /* { dg-final { scan-ltrans-tree-dump-times "f03 \\\(x" 14 "optimized" { 
target { !aarch64*-*-* } } } } } */
-  /* { dg-final { scan-ltrans-tree-dump-times "f01 \\\(x" 4 "optimized" { 
target { !aarch64*-*-* } } } } } */
-  /* { dg-final { scan-ltrans-tree-dump-times "f03 \\\(x" 10 "optimized" { 
target { aarch64*-*-* } } } } } */
-  /* { dg-final { scan-ltrans-tree-dump-not "f01 \\\(x" "optimized" { target { 
aarch64*-*-* } } } } } */
+  /* { dg-final { scan-ltrans-tree-dump-times "f03 \\\(x" 14 "optimized" { 
target { !aarch64*-*-* } } } } */
+  /* { dg-final { scan-ltrans-tree-dump-times "f01 \\\(x" 4 "optimized" { 
target { !aarch64*-*-* } } } } *

Re: [PATCH] aarch64: enable mixed-types for aarch64 simdclones

2023-12-12 Thread Andre Vieira (lists)




On 11/12/2023 21:42, Thomas Schwinge wrote:

Hi Andre!

On 2023-10-16T16:03:26+0100, "Andre Vieira (lists)" 
 wrote:

Just a minor update to the patch, I had missed the libgomp testsuite, so
had to make some adjustments there too.


Unfortunately, there appear to be a number of DejaGnu directive errors in
your test case changes -- do you not see those in your testing?


I hadn't seen those... I wonder whether they don't show up when
dg-cmp-results is run with just one -v. I have binned the build, but
I'll rerun it and double check; I may need to use '-v -v' instead.


Thanks for letting me know.
..., and the following change also doesn't look quite right:



--- a/libgomp/testsuite/libgomp.fortran/declare-simd-1.f90
+++ b/libgomp/testsuite/libgomp.fortran/declare-simd-1.f90
@@ -1,5 +1,5 @@
  ! { dg-do run { target vect_simd_clones } }
-! { dg-options "-fno-inline" }
+! { dg-options "-fno-inline -cpp -D__aarch64__" }




Yeah, that needs a target selector. Thanks!
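
Something along these lines, perhaps (a sketch, untested):

! { dg-do run { target vect_simd_clones } }
! { dg-options "-fno-inline" }
! { dg-additional-options "-cpp -D__aarch64__" { target aarch64*-*-* } }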


Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-12-07 Thread Andre Vieira (lists)
Thanks for addressing my comments. I have already reviewed this and the
other patch and they LGTM; however, I do not have approval rights, so
you will need an OK from a maintainer.


Thanks for doing this :)

Andre

On 30/11/2023 12:55, Stamatis Markianos-Wright wrote:

Hi Andre,

Thanks for the comments, see latest revision attached.

On 27/11/2023 12:47, Andre Vieira (lists) wrote:

Hi Stam,

Just some comments.

+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns

s/Recursively scan/Scan/ as you no longer recurse, thanks for that by the way :)

+   where thy were DEF-ed, etc., recursively) were affected by implicit VPT

remove recursively for the same reasons.

+  if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+    return NULL;
+  /* Look at the steps and swap around the rtx's if needed.  Error out if
+     one of them cannot be identified as constant.  */
+  if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+    return NULL;

Move the comment above the first if, as that is the one that does the
erroring out it talks about.

Done


+  emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
 space after 'insn_note)'

@@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 -  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
   || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !CONST_INT_P (XEXP (inc_src, 1)))

Do we ever check that inc_src is negative? We used to check if it was 
-1, now we only check it's a constant, but not a negative one, so I 
suspect this needs a:

|| INTVAL (XEXP (inc_src, 1)) >= 0

Good point. Done


@@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc 
*desc,

 case GE:
   /* Currently only GE tests against zero are supported.  */
   gcc_assert (XEXP (condition, 1) == const0_rtx);
-
+  /* FALLTHRU */
+    case GTU:
   noloop = constm1_rtx;

I spent a very long time staring at this trying to understand why 
noloop = constm1_rtx for GTU, where I thought it should've been (count 
& (n-1)). For the current use of doloop it doesn't matter because ARM 
is the only target using it and you set desc->noloop_assumptions to 
null_rtx in 'arm_attempt_dlstp_transform' so noloop is never used. 
However, if a different target accepts this GTU pattern then this 
target agnostic code will do the wrong thing.  I suggest we either:
 - set noloop to what we think might be the correct value, which if 
you ask me should be 'count & (XEXP (condition, 1))',
 - or add a gcc_assert (GET_CODE (condition) != GTU); under the if 
(desc->noloop_assumption); part and document why.  I have a slight 
preference for the assert given otherwise we are adding code that we 
can't test.


Yea, that's true tbh. I've done the latter, but also separated out the 
"case GTU:" and added a comment, so that it's more clear that the noloop 
things aren't used in the only implemented GTU case (Arm)


Thank you :)



LGTM otherwise (but I don't have the power to approve this ;)).

Kind regards,
Andre

From: Stamatis Markianos-Wright 
Sent: Thursday, November 16, 2023 11:36 AM
To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; 
Richard Sandiford; Kyrylo Tkachov
Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated 
Low Overhead Loops


Pinging back to the top of reviewers' inboxes due to worry about Stage 1
End in a few days :)


See the last email for the latest version of the 2/2 patch. The 1/2
patch is A-Ok from Kyrill's earlier target-backend review.


On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:


On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:


On 06/11/2023 11:24, Richard Sandiford wrote:

Stamatis Markianos-Wright  writes:
One of the main reasons for reading the arm bits was to try to 
answer

the question: if we switch to a downcounting loop with a GE
condition,
how do we make sure that the start value is not a large unsigned
number that is interpreted as negative by GE?  E.g. if the loop
originally counted up in steps of N and used an LTU condition,
it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
But the loop might never iterate if we start counting down from
most values in that range.

Does the patch handle that?

So AFAICT this is actually handled in the generic code in
`doloop_valid_p`:

These loops fail because they are "desc->infinite", so no
loop-doloop conversion is attempted at all (even for standa

veclower: improve selection of vector mode when lowering [PR 112787]

2023-12-06 Thread Andre Vieira (lists)

Hi,

This patch addresses the issue reported in PR target/112787 by improving the
compute type selection.  We do this by not considering types with more 
elements

than the type we are lowering since we'd reject such types anyway.

gcc/ChangeLog:

PR target/112787
* tree-vect-generic (type_for_widest_vector_mode): Add a parameter to
control maximum amount of elements in resulting vector mode.
(get_compute_type): Restrict vector_compute_type to a mode no wider
than the original compute type.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/pr112787.c: New test.

Bootstrapped and regression tested on aarch64-unknown-linux-gnu and 
x86_64-pc-linux-gnu.


Is this OK for trunk?

Kind regards,
Andre Vieira

diff --git a/gcc/testsuite/gcc.target/aarch64/pr112787.c b/gcc/testsuite/gcc.target/aarch64/pr112787.c
new file mode 100644
index 
..caca1bf7ef447e4489b2c134d7200a4afd16763f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr112787.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -march=armv8-a+sve -mcpu=neoverse-n2" } */
+
+typedef int __attribute__((__vector_size__ (64))) vec;
+
+vec fn (vec a, vec b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler-times {add\tv[0-9]+} 4 } } */
diff --git a/gcc/tree-vect-generic.cc b/gcc/tree-vect-generic.cc
index 
a7e6cb87a5e31e3dd2a893ea5652eeebf8d5d214..2dbf3c8f5f64f2623944110dbc371fe0944198f0
 100644
--- a/gcc/tree-vect-generic.cc
+++ b/gcc/tree-vect-generic.cc
@@ -1347,7 +1347,7 @@ optimize_vector_constructor (gimple_stmt_iterator *gsi)
TYPE, or NULL_TREE if none is found.  */
 
 static tree
-type_for_widest_vector_mode (tree type, optab op)
+type_for_widest_vector_mode (tree type, optab op, poly_int64 max_nunits = 0)
 {
   machine_mode inner_mode = TYPE_MODE (type);
   machine_mode best_mode = VOIDmode, mode;
@@ -1371,7 +1371,9 @@ type_for_widest_vector_mode (tree type, optab op)
   FOR_EACH_MODE_FROM (mode, mode)
 if (GET_MODE_INNER (mode) == inner_mode
&& maybe_gt (GET_MODE_NUNITS (mode), best_nunits)
-   && optab_handler (op, mode) != CODE_FOR_nothing)
+   && optab_handler (op, mode) != CODE_FOR_nothing
+   && (known_eq (max_nunits, 0)
+   || known_lt (GET_MODE_NUNITS (mode), max_nunits)))
   best_mode = mode, best_nunits = GET_MODE_NUNITS (mode);
 
   if (best_mode == VOIDmode)
@@ -1702,7 +1704,8 @@ get_compute_type (enum tree_code code, optab op, tree 
type)
  || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing))
 {
   tree vector_compute_type
-   = type_for_widest_vector_mode (TREE_TYPE (type), op);
+   = type_for_widest_vector_mode (TREE_TYPE (type), op,
+  TYPE_VECTOR_SUBPARTS (compute_type));
   if (vector_compute_type != NULL_TREE
  && subparts_gt (compute_type, vector_compute_type)
  && maybe_ne (TYPE_VECTOR_SUBPARTS (vector_compute_type), 1U)


Re: [PATCH 8/8] aarch64: Add SVE support for simd clones [PR 96342]

2023-12-01 Thread Andre Vieira (lists)




On 29/11/2023 17:01, Richard Sandiford wrote:

"Andre Vieira (lists)"  writes:

Rebased, no major changes, still needs review.

On 30/08/2023 10:19, Andre Vieira (lists) via Gcc-patches wrote:

This patch finalizes adding support for the generation of SVE simd
clones when no simdlen is provided, following the ABI rules where the
widest data type determines the minimum amount of elements in a length
agnostic vector.

gcc/ChangeLog:

      * config/aarch64/aarch64-protos.h (add_sve_type_attribute):
Declare.
  * config/aarch64/aarch64-sve-builtins.cc (add_sve_type_attribute):
Make
  visibility global.
  * config/aarch64/aarch64.cc (aarch64_fntype_abi): Ensure SVE ABI is
  chosen over SIMD ABI if a SVE type is used in return or arguments.
  (aarch64_simd_clone_compute_vecsize_and_simdlen): Create VLA simd
clone
  when no simdlen is provided, according to ABI rules.
  (aarch64_simd_clone_adjust): Add '+sve' attribute to SVE simd clones.
  (aarch64_simd_clone_adjust_ret_or_param): New.
  (TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM): Define.
  * omp-simd-clone.cc (simd_clone_mangle): Print 'x' for VLA simdlen.
  (simd_clone_adjust): Adapt safelen check to be compatible with VLA
  simdlen.

gcc/testsuite/ChangeLog:

  * c-c++-common/gomp/declare-variant-14.c: Adapt aarch64 scan.
  * gfortran.dg/gomp/declare-variant-14.f90: Likewise.
  * gcc.target/aarch64/declare-simd-1.c: Remove warning checks where no
  longer necessary.
  * gcc.target/aarch64/declare-simd-2.c: Add SVE clone scan.


diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
60a55f4bc1956786ea687fc7cad7ec9e4a84e1f0..769d637f63724a7f0044f48f3dd683e0fb46049c
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1005,6 +1005,8 @@ namespace aarch64_sve {
  #ifdef GCC_TARGET_H
bool verify_type_context (location_t, type_context_kind, const_tree, bool);
  #endif
+ void add_sve_type_attribute (tree, unsigned int, unsigned int,
+ const char *, const char *);
  }
  
  extern void aarch64_split_combinev16qi (rtx operands[3]);

diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 
161a14edde7c9fb1b13b146cf50463e2d78db264..6f99c438d10daa91b7e3b623c995489f1a8a0f4c
 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -569,14 +569,16 @@ static bool reported_missing_registers_p;
  /* Record that TYPE is an ABI-defined SVE type that contains NUM_ZR SVE 
vectors
 and NUM_PR SVE predicates.  MANGLED_NAME, if nonnull, is the ABI-defined
 mangling of the type.  ACLE_NAME is the <arm_sve.h> name of the type.  */
-static void
+void
  add_sve_type_attribute (tree type, unsigned int num_zr, unsigned int num_pr,
const char *mangled_name, const char *acle_name)
  {
tree mangled_name_tree
  = (mangled_name ? get_identifier (mangled_name) : NULL_TREE);
+  tree acle_name_tree
+= (acle_name ? get_identifier (acle_name) : NULL_TREE);
  
-  tree value = tree_cons (NULL_TREE, get_identifier (acle_name), NULL_TREE);

+  tree value = tree_cons (NULL_TREE, acle_name_tree, NULL_TREE);
value = tree_cons (NULL_TREE, mangled_name_tree, value);
value = tree_cons (NULL_TREE, size_int (num_pr), value);
value = tree_cons (NULL_TREE, size_int (num_zr), value);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
37507f091c2a6154fa944c3a9fad6a655ab5d5a1..cb0947b18c6a611d55579b5b08d93f6a4a9c3b2c
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -4080,13 +4080,13 @@ aarch64_takes_arguments_in_sve_regs_p (const_tree 
fntype)
  static const predefined_function_abi &
  aarch64_fntype_abi (const_tree fntype)
  {
-  if (lookup_attribute ("aarch64_vector_pcs", TYPE_ATTRIBUTES (fntype)))
-return aarch64_simd_abi ();
-
if (aarch64_returns_value_in_sve_regs_p (fntype)
|| aarch64_takes_arguments_in_sve_regs_p (fntype))
  return aarch64_sve_abi ();
  
+  if (lookup_attribute ("aarch64_vector_pcs", TYPE_ATTRIBUTES (fntype)))

+return aarch64_simd_abi ();
+
return default_function_abi;
  }
  


I think we discussed this off-list later, but the change above shouldn't
be necessary.  aarch64_vector_pcs must not be attached to SVE PCS functions,
so the two cases should be mutually exclusive.


Yeah I had made the changes locally, but not updated the patch yet.
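
For the record, the mutual exclusivity can be seen with a declaration
like this (a sketch using standard ACLE intrinsics, not taken from the
patch):

#include <arm_sve.h>

/* Returns an SVE vector, so aarch64_fntype_abi must pick the SVE PCS;
   well-formed code never also tags such a function aarch64_vector_pcs.  */
svint32_t
inc_all (svint32_t x)
{
  return svadd_n_s32_x (svptrue_b32 (), x, 1);
}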



@@ -27467,7 +27467,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct 
cgraph_node *node,
int num, bool explicit_p)
  {
tree t, ret_type;
-  unsigned int nds_elt_bits;
+  unsigned int nds_elt_bits, wds_elt_bits;
int count;
unsigned HOST_WIDE_INT const_simdlen;
  
@@ -27513,10 +27513,14 @@ aarch64_simd_clone_compute_vecsize_and

Re: [RFC] vect: disable multiple calls of poly simdclones

2023-11-27 Thread Andre Vieira (lists)




On 06/11/2023 07:52, Richard Biener wrote:

On Fri, 3 Nov 2023, Andre Vieira (lists) wrote:


Hi,

The current codegen code to support VFs that are multiples of a simdclone's
simdlen relies on BIT_FIELD_REF to create multiple input vectors.  This does
not work for non-constant simdclones, so we should disable using such clones
when the VF is a higher multiple of the non-constant simdlen until we change
the codegen to support those.

Enabling SVE simdclone support will cause ICEs if the vectorizer decides to
use an SVE simdclone with a VF that is larger than the simdlen. I'll be away
for the next two weeks, so I can't really discuss this further.
I initially tried to solve the problem, but the way
vectorizable_simd_clone_call is structured doesn't make it easy to replace
BIT_FIELD_REF with the poly-suitable alternative of using
unpack_{hi,lo}.


I think it should be straight-forward to use unpack_{even,odd} (it's
even/odd for VLA, right?  If lo/hi would be possible then doing
BIT_FIELD_REF would be, too?  Also you need to have multiple stages
of unpack/pack when the factor is more than 2).

There's plenty of time even during stage3 to address this.

At least your patch should have come with a testcase (or two).


Yeah I didn't add one as it didn't trigger on AArch64 without my two 
outstanding aarch64 simdclone patches.


Is there a bugreport tracking this issue?  It should affect GCN as well
I guess.


No, since I can't trigger them yet on trunk until the reviews on my 
target specific patches are done and they are committed.


I don't have a GCN backend lying around but I suspect GCN doesn't use 
poly simdlen simdclones yet either... I haven't checked. The issue 
triggers for aarch64 when trying to generate SVE simdclones for 
functions with mixed types.  I'll give the unpack thing a go locally.


Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-11-27 Thread Andre Vieira (lists)

Hi Stam,

Just some comments.

+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns

s/Recursively scan/Scan/ as you no longer recurse, thanks for that by the way :)

+   where thy were DEF-ed, etc., recursively) were affected by implicit VPT

remove recursively for the same reasons.

+  if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P (cond_temp_iv.step))
+    return NULL;
+  /* Look at the steps and swap around the rtx's if needed.  Error out if
+     one of them cannot be identified as constant.  */
+  if (INTVAL (cond_counter_iv.step) != 0 && INTVAL (cond_temp_iv.step) != 0)
+    return NULL;

Move the comment above the first if, as that is the one that does the
erroring out it talks about.


+  emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
 space after 'insn_note)'

@@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 -  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
   || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !CONST_INT_P (XEXP (inc_src, 1)))

Do we ever check that inc_src is negative? We used to check if it was 
-1, now we only check it's a constant, but not a negative one, so I 
suspect this needs a:

|| INTVAL (XEXP (inc_src, 1)) >= 0

@@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc *desc,
 case GE:
   /* Currently only GE tests against zero are supported.  */
   gcc_assert (XEXP (condition, 1) == const0_rtx);
-
+  /* FALLTHRU */
+case GTU:
   noloop = constm1_rtx;

I spent a very long time staring at this trying to understand why noloop 
= constm1_rtx for GTU, where I thought it should've been (count & 
(n-1)). For the current use of doloop it doesn't matter because ARM is 
the only target using it and you set desc->noloop_assumptions to 
null_rtx in 'arm_attempt_dlstp_transform' so noloop is never used. 
However, if a different target accepts this GTU pattern then this target 
agnostic code will do the wrong thing.  I suggest we either:
 - set noloop to what we think might be the correct value, which if you 
ask me should be 'count & (XEXP (condition, 1))',
 - or add a gcc_assert (GET_CODE (condition) != GTU); under the if 
(desc->noloop_assumption); part and document why.  I have a slight 
preference for the assert given otherwise we are adding code that we 
can't test.


LGTM otherwise (but I don't have the power to approve this ;)).

Kind regards,
Andre

From: Stamatis Markianos-Wright 
Sent: Thursday, November 16, 2023 11:36 AM
To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; Richard 
Sandiford; Kyrylo Tkachov
Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low 
Overhead Loops


Pinging back to the top of reviewers' inboxes due to worry about Stage 1
End in a few days :)


See the last email for the latest version of the 2/2 patch. The 1/2
patch is A-Ok from Kyrill's earlier target-backend review.


On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:


On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:


On 06/11/2023 11:24, Richard Sandiford wrote:

Stamatis Markianos-Wright  writes:

One of the main reasons for reading the arm bits was to try to answer
the question: if we switch to a downcounting loop with a GE
condition,
how do we make sure that the start value is not a large unsigned
number that is interpreted as negative by GE?  E.g. if the loop
originally counted up in steps of N and used an LTU condition,
it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
But the loop might never iterate if we start counting down from
most values in that range.

Does the patch handle that?

So AFAICT this is actually handled in the generic code in
`doloop_valid_p`:

These loops fail because they are "desc->infinite", so no
loop-doloop conversion is attempted at all (even for standard
dls/le loops)

Thanks to that check I haven't been able to trigger anything like the
behaviour you describe, do you think the doloop_valid_p checks are
robust enough?

The loops I was thinking of are provably not infinite though. E.g.:

   for (unsigned int i = 0; i < UINT_MAX - 100; ++i)
 ...

is known to terminate.  And doloop conversion is safe with the normal
count-down-by-1 approach, so I don't think current code would need
to reject it.  I.e. a conversion to:

   unsigned int i = UINT_MAX - 101;
   do
 ...
   while (--i != ~0U);

would be safe, but a conversion to:

   int i = UINT_MAX - 101;
   do
 ...
   while ((i -= step, i > 0));

wouldn't, 

[RFC] vect: disable multiple calls of poly simdclones

2023-11-03 Thread Andre Vieira (lists)

Hi,

The current codegen code to support VFs that are multiples of a
simdclone's simdlen relies on BIT_FIELD_REF to create multiple input
vectors.  This does not work for non-constant simdclones, so we should
disable using such clones when the VF is a higher multiple of the
non-constant simdlen until we change the codegen to support those.


Enabling SVE simdclone support will cause ICEs if the vectorizer decides
to use an SVE simdclone with a VF that is larger than the simdlen. I'll
be away for the next two weeks, so I can't really discuss this further.
I initially tried to solve the problem, but the way
vectorizable_simd_clone_call is structured doesn't make it easy to
replace BIT_FIELD_REF with the poly-suitable alternative of using
unpack_{hi,lo}. Unfortunately I only found this now, while adding
further tests for SVE :(


gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_simd_clone_call): Reject simdclones
with non-constant simdlen when VF is not exactly the same.

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 
5f262cae2aae784e3ef4fd07455b7aa742797b51..dc3e0716161838aef66cf37342499006673336d6
 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4165,7 +4165,10 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
if (!constant_multiple_p (vf * group_size, n->simdclone->simdlen,
  &num_calls)
|| (!n->simdclone->inbranch && (masked_call_offset > 0))
-   || (nargs != simd_nargs))
+   || (nargs != simd_nargs)
+   /* Currently we do not support multiple calls of non-constant
+  simdlen as poly vectors can not be accessed by BIT_FIELD_REF.  */
+   || (!n->simdclone->simdlen.is_constant () && num_calls != 1))
  continue;
if (num_calls != 1)
  this_badness += exact_log2 (num_calls) * 4096;


Re: [PATCH] vect: allow using inbranch simdclones for masked loops

2023-11-03 Thread Andre Vieira (lists)




On 03/11/2023 07:31, Richard Biener wrote:



OK.

I do wonder about the gfortran testsuite adjustments though.

!GCC$ builtin (sin) attributes simd (inbranch)

   ! this should not be using simd clone
   y4 = sin(x8)

previously we wouldn't vectorize this as no notinbranch simd function
is available but now we do since we use the inbranch function for the
notinbranch call.  If that's desired then a better modification of
the test would be to expect vectorization, no?



I was in two minds about this. I interpreted the test to be about the 
fact that sin is overloaded in fortran, given the name of the program 
'program test_overloaded_intrinsic', and thus I thought it was testing 
that it calls sinf when a real(4) is passed and sin for a real(8) and 
that simdclones aren't used for the wrong overload. That doesn't quite 
explain why the pragma for sin(double) was added in the first place, 
that wouldn't have been necessary, but then again neither are the cos 
and cosf.


Happy to put it back in and test that the 'masked' simdclone is used 
using some regexp too.
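
For instance (untested sketch; the exact mangled name depends on the ISA
and simdlen, so treat '_ZGVbM2v_sin' as hypothetical):

! { dg-final { scan-tree-dump "_ZGVbM2v_sin" "optimized" } }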


[PATCH] vect: allow using inbranch simdclones for masked loops

2023-11-02 Thread Andre Vieira (lists)

Hi,

In a previous patch I did most of the work for this, but forgot to 
change the check for number of arguments matching between call and 
simdclone.  This check should accept calls without a mask to be matched 
against simdclones with mask arguments.  I also added tests to verify 
this feature actually works.



For the simd-builtins tests I decided to remove the sin (double) 
simdclone which would now be used, because it was inbranch and we now 
enable their use for notinbranch calls.  Given the nature of the test, 
removing it made more sense, but that's not a strong opinion; happy to change.


Bootstrapped and regression tested on aarch64-unknown-linux-gnu and 
x86_64-pc-linux-gnu.


OK for trunk?

PS: I'll be away for two weeks from tomorrow, it would be really nice if 
this can go in for gcc-14, otherwise the previous work I did for this 
won't have any actual visible effect :(



gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_simd_clone_call): Allow unmasked
calls to use masked simdclones.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-simd-clone-20.c: New file.
* gfortran.dg/simd-builtins-1.h: Adapt.
* gfortran.dg/simd-builtins-6.f90: Adapt.

diff --git a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-20.c 
b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-20.c
new file mode 100644
index 
..9f51a68f3a0c8851af2cd26bd8235c771b851d7d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-20.c
@@ -0,0 +1,87 @@
+/* { dg-require-effective-target vect_simd_clones } */
+/* { dg-additional-options "-fopenmp-simd --param vect-epilogues-nomask=0" } */
+/* { dg-additional-options "-mavx" { target avx_runtime } } */
+
+/* Test that simd inbranch clones work correctly.  */
+
+#ifndef TYPE
+#define TYPE int
+#endif
+
+/* A simple function that will be cloned.  */
+#pragma omp declare simd inbranch
+TYPE __attribute__((noinline))
+foo (TYPE a)
+{
+  return a + 1;
+}
+
+/* Check that "inbranch" clones are called correctly.  */
+
+void __attribute__((noipa))
+masked (TYPE * __restrict a, TYPE * __restrict b, int size)
+{
+  #pragma omp simd
+  for (int i = 0; i < size; i++)
+b[i] = foo(a[i]);
+}
+
+/* Check that "inbranch" works when there might be unrolling.  */
+
+void __attribute__((noipa))
+masked_fixed (TYPE * __restrict a, TYPE * __restrict b)
+{
+  #pragma omp simd
+  for (int i = 0; i < 128; i++)
+b[i] = foo(a[i]);
+}
+
+/* Validate the outputs.  */
+
+void
+check_masked (TYPE *b, int size)
+{
+  for (int i = 0; i < size; i++)
+if (b[i] != (TYPE)(i + 1))
+  {
+   __builtin_printf ("error at %d\n", i);
+   __builtin_exit (1);
+  }
+}
+
+int
+main ()
+{
+  TYPE a[1024];
+  TYPE b[1024];
+
+  for (int i = 0; i < 1024; i++)
+a[i] = i;
+
+  masked_fixed (a, b);
+  check_masked (b, 128);
+
+  /* Test various sizes to cover machines with different vectorization
+ factors.  */
+  for (int size = 8; size <= 1024; size *= 2)
+{
+  masked (a, b, size);
+  check_masked (b, size);
+}
+
+  /* Test sizes that might exercise the partial vector code-path.  */
+  for (int size = 8; size <= 1024; size *= 2)
+{
+  masked (a, b, size-4);
+  check_masked (b, size-4);
+}
+
+  return 0;
+}
+
+/* Ensure the the in-branch simd clones are used on targets that support them. 
 */
+/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
{ target { aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 4 "vect" 
{ target { x86_64*-*-* } } } } */
+
+/* The LTO test produces two dump files and we scan the wrong one.  */
+/* { dg-skip-if "" { *-*-* } { "-flto" } { "" } } */
diff --git a/gcc/testsuite/gfortran.dg/simd-builtins-1.h 
b/gcc/testsuite/gfortran.dg/simd-builtins-1.h
index 
88d555cf41ad065ea525a63d7c05d15d3e5b54ed..08b73514a67d5791d35203530d039741946e9dcc
 100644
--- a/gcc/testsuite/gfortran.dg/simd-builtins-1.h
+++ b/gcc/testsuite/gfortran.dg/simd-builtins-1.h
@@ -1,4 +1,3 @@
-!GCC$ builtin (sin) attributes simd (inbranch)
 !GCC$ builtin (sinf) attributes simd (notinbranch)
 !GCC$ builtin (cosf) attributes simd
 !GCC$ builtin (cosf) attributes simd (notinbranch)
diff --git a/gcc/testsuite/gfortran.dg/simd-builtins-6.f90 
b/gcc/testsuite/gfortran.dg/simd-builtins-6.f90
index 
60bcac78f3e0cc492930f3eb73cf97065312dc1c..2c68f9f1818a35674a0aef15793aa312a48199a8
 100644
--- a/gcc/testsuite/gfortran.dg/simd-builtins-6.f90
+++ b/gcc/testsuite/gfortran.dg/simd-builtins-6.f90
@@ -2,7 +2,6 @@
 ! { dg-additional-options "-nostdinc -Ofast -fdump-tree-optimized" }
 ! { dg-additional-options "-msse2 -mno-avx" { target i?86-*-linux* 
x86_64-*-linux* } }
 
-!GCC$ builtin (sin) attributes simd (inbranch)
 !GCC$ builtin (sinf) attributes simd (notinbranch)
 !GCC$ builtin (cosf) attributes simd
 !GCC$ builtin (cosf) attributes simd (notinbranch)
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 

Re: [PATCH6/8] omp: Reorder call for TARGET_SIMD_CLONE_ADJUST (was Re: [PATCH7/8] vect: Add TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM)

2023-10-30 Thread Andre Vieira (lists)

Hi Richi,

Friendly ping on this. I'm going away for two weeks end of this week, so 
I won't be here for end of stage-1, but I'd still very much like to get 
this done for GCC 14.


I don't know if you had a chance to look at this yet when you reviewed 
the other patches or if you maybe just missed it? A quick tl;dr: this 
moves the TARGET_SIMD_CLONE_ADJUST call to after we've vectorized the 
types in simdclones, to avoid having to add extra target hooks to change 
the types.  This required some moving around of the code that constructs 
the adjustments and the code that constructs the array for the return 
value.
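
Schematically, the flow after the patch is roughly (simplified, using
the names from omp-simd-clone.cc; not the literal code):

  simd_clone_adjust_return_type (node);	    /* scalar return -> vector  */
  simd_clone_adjust_argument_types (node);  /* scalar params -> vectors  */
  targetm.simd_clone.adjust (node);	    /* target now sees the vector
					       types and may change them  */
  /* Only after the hook: build the ipa_param_body_adjustments and the
     return-value array, so both reflect any target-made changes.  */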


Kind regards,
Andre

On 18/10/2023 15:41, Andre Vieira (lists) wrote:
This patch moves the call to TARGET_SIMD_CLONE_ADJUST until after the 
arguments and return types have been transformed into vector types.  It 
also constructs the adjustments and retval modifications after this call, 
allowing targets to alter the types of the arguments and return of the 
clone prior to the modifications to the function definition.


Is this OK?

gcc/ChangeLog:

     * omp-simd-clone.cc (simd_clone_adjust_return_type): Hoist out
     code to create return array and don't return new type.
     (simd_clone_adjust_argument_types): Hoist out code that creates
     ipa_param_body_adjustments and don't return them.
     (simd_clone_adjust): Call TARGET_SIMD_CLONE_ADJUST after return
     and argument types have been vectorized, create adjustments and
     return array after the hook.
     (expand_simd_clones): Call TARGET_SIMD_CLONE_ADJUST after return
     and argument types have been vectorized.

On 04/10/2023 13:40, Andre Vieira (lists) wrote:



On 04/10/2023 11:41, Richard Biener wrote:

On Wed, 4 Oct 2023, Andre Vieira (lists) wrote:




On 30/08/2023 14:04, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:

This patch adds a new target hook to enable us to adapt the types 
of return
and parameters of simd clones.  We use this in two ways, the first 
one is

to
make sure we can create valid SVE types, including the SVE type 
attribute,
when creating a SVE simd clone, even when the target options do 
not support
SVE.  We are following the same behaviour seen with x86 that 
creates simd
clones according to the ABI rules when no simdlen is provided, 
even if that
simdlen is not supported by the current target options.  Note that 
this

doesn't mean the simd clone will be used in auto-vectorization.


You are not documenting the bool parameter of the new hook.

What's wrong with doing the adjustment in TARGET_SIMD_CLONE_ADJUST?


simd_clone_adjust_argument_types is called after that hook, so by 
the time we
call TARGET_SIMD_CLONE_ADJUST the types are still in scalar, not 
vector.  The

same is true for the return type one.

Also the changes to the types need to be taken into consideration in
'adjustments' I think.


Nothing in the three existing implementations of 
TARGET_SIMD_CLONE_ADJUST

relies on this ordering I think, how about moving the hook invocation
after simd_clone_adjust_argument_types?



But that wouldn't change the 'ipa_param_body_adjustments' for when we 
have a function definition and we need to redo the body.

Richard.

PS: I hope the subject line survived, my email client is having a 
bit of a

wobble this morning... it's what you get for updating software :(


Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-10-23 Thread Andre Vieira (lists)
Ping for Jeff or another global maintainer to review the target agnostic 
bits of this, that's:

loop-doloop.cc
df-core.{c,h}

I do have a nitpick myself that I missed last time around:
  /* We expect the condition to be of the form (reg != 0)  */
  cond = XEXP (SET_SRC (cmp), 0);
- if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
+ if ((GET_CODE (cond) != NE && GET_CODE (cond) != GE)
+ || XEXP (cond, 1) != const0_rtx)
return 0;
}
Could do with updating the comment to reflect allowing >= now. But happy 
for you to change this once approved by a maintainer.
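
Something like this, say:

  /* We expect the condition to be of the form (reg != 0)
     or (reg >= 0).  */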


Kind regards,
Andre

On 11/10/2023 12:34, Stamatis Markianos-Wright wrote:

Hi all,

On 28/09/2023 13:51, Andre Vieira (lists) wrote:

Hi,

On 14/09/2023 13:10, Kyrylo Tkachov via Gcc-patches wrote:

Hi Stam,





The arm parts look sensible but we'd need review for the df-core.h 
and df-core.cc changes.

Maybe Jeff can help or can recommend someone to take a look?


Just thought I'd do a follow-up "ping" on this :)



Thanks,
Kyrill



FWIW the changes LGTM; if we don't want these in df-core we can always 
implement the extra utility locally. It's really just a helper 
function to check whether df_bb_regno_first_def_find and 
df_bb_regno_last_def_find yield the same result, meaning we only have 
a single definition.


Kind regards,
Andre


Thanks,

Stam



Re: [PATCH] ifcvt: Don't lower bitfields with non-constant offsets [PR 111882]

2023-10-20 Thread Andre Vieira (lists)




On 20/10/2023 14:41, Richard Biener wrote:

On Fri, 20 Oct 2023, Andre Vieira (lists) wrote:


Hi,

This patch stops lowering of bitfields by ifcvt when they have non-constant
offsets as we are not likely to be able to do anything useful with those
during
vectorization.  That also fixes the issue reported in PR 111882, which was
being caused by an offset with a side-effect being lowered, but constants have
no side-effects so we will no longer run into that problem.

Bootstrapped and regression tested on aarch64-unknown-linux-gnu.

OK for trunk?


+  if (!TREE_CONSTANT (DECL_FIELD_OFFSET (rep_decl))
+  || !TREE_CONSTANT (DECL_FIELD_BIT_OFFSET (rep_decl))
+  || !TREE_CONSTANT (ref_offset)
+  || !TREE_CONSTANT (DECL_FIELD_BIT_OFFSET (field_decl)))
+return NULL_TREE;

DECL_FIELD_BIT_OFFSET is always constant.  Please test
TREE_CODE (..) == INTEGER_CST instead of TREE_CONSTANT.


OK with those changes.
After I sent it I realized it would've been nicer to add a diagnostic; 
are you OK with:

+  if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, "\t Bitfield NOT OK to lower,"
+   " offset is non-constant.\n");


Richard.



gcc/ChangeLog:

PR tree-optimization/111882
* tree-if-conv.cc (get_bitfield_rep): Return NULL_TREE for bitfields
with
non-constant offsets.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr111882.c: New test.





[PATCH] ifcvt: Don't lower bitfields with non-constant offsets [PR 111882]

2023-10-20 Thread Andre Vieira (lists)

Hi,

This patch stops lowering of bitfields by ifcvt when they have non-constant
offsets as we are not likely to be able to do anything useful with those 
during

vectorization.  That also fixes the issue reported in PR 111882, which was
being caused by an offset with a side-effect being lowered, but 
constants have

no side-effects so we will no longer run into that problem.

Bootstrapped and regression tested on aarch64-unknown-linux-gnu.

OK for trunk?

gcc/ChangeLog:

PR tree-optimization/111882
* tree-if-conv.cc (get_bitfield_rep): Return NULL_TREE for bitfields 
with
non-constant offsets.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr111882.c: New test.diff --git a/gcc/testsuite/gcc.dg/vect/pr111882.c 
b/gcc/testsuite/gcc.dg/vect/pr111882.c
new file mode 100644
index 
..024ad57b6930dd1f516e9c8127e2b440016360c6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr111882.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-additional-options { -fdump-tree-ifcvt-all } } */
+
+static void __attribute__((noipa)) f(int n) {
+  int i, j;
+  struct S { char d[n]; int a; int b : 17; int c : 12; };
+  struct S A[100][n];
+  for (i = 0; i < 100; i++) {
+    asm volatile("" : : "g"(&A[0][0]) : "memory");
+    for (j = 0; j < n; j++) A[i][j].b = 2;
+  }
+}
+void g(void) { f(1); }
+
+/* { dg-final { scan-tree-dump-not "Bitfield OK to lower" "ifcvt" } } */
diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
index 
dab7eeb7707ae8f1f342a571f8e5c99e0ef39309..66f3c882cb688049224b344b81637e7b1ae7e36c
 100644
--- a/gcc/tree-if-conv.cc
+++ b/gcc/tree-if-conv.cc
@@ -3495,6 +3495,7 @@ get_bitfield_rep (gassign *stmt, bool write, tree *bitpos,
: gimple_assign_rhs1 (stmt);
 
   tree field_decl = TREE_OPERAND (comp_ref, 1);
+  tree ref_offset = component_ref_field_offset (comp_ref);
   tree rep_decl = DECL_BIT_FIELD_REPRESENTATIVE (field_decl);
 
   /* Bail out if the representative is not a suitable type for a scalar
@@ -3509,6 +3510,12 @@ get_bitfield_rep (gassign *stmt, bool write, tree 
*bitpos,
   if (compare_tree_int (DECL_SIZE (field_decl), bf_prec) != 0)
 return NULL_TREE;
 
+  if (!TREE_CONSTANT (DECL_FIELD_OFFSET (rep_decl))
+  || !TREE_CONSTANT (DECL_FIELD_BIT_OFFSET (rep_decl))
+  || !TREE_CONSTANT (ref_offset)
+  || !TREE_CONSTANT (DECL_FIELD_BIT_OFFSET (field_decl)))
+return NULL_TREE;
+
   if (struct_expr)
 *struct_expr = TREE_OPERAND (comp_ref, 0);
 
@@ -3529,7 +3536,7 @@ get_bitfield_rep (gassign *stmt, bool write, tree *bitpos,
 the structure and the container from the number of bits from the start
 of the structure and the actual bitfield member. */
   tree bf_pos = fold_build2 (MULT_EXPR, bitsizetype,
-DECL_FIELD_OFFSET (field_decl),
+ref_offset,
 build_int_cst (bitsizetype, BITS_PER_UNIT));
   bf_pos = fold_build2 (PLUS_EXPR, bitsizetype, bf_pos,
DECL_FIELD_BIT_OFFSET (field_decl));


[PATCH6/8] omp: Reorder call for TARGET_SIMD_CLONE_ADJUST (was Re: [PATCH7/8] vect: Add TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM)

2023-10-18 Thread Andre Vieira (lists)
This patch moves the call to TARGET_SIMD_CLONE_ADJUST until after the 
arguments and return types have been transformed into vector types.  It 
also constructs the adjustments and retval modifications after this call, 
allowing targets to alter the types of the arguments and return of the 
clone prior to the modifications to the function definition.


Is this OK?

gcc/ChangeLog:

* omp-simd-clone.cc (simd_clone_adjust_return_type): Hoist out
code to create return array and don't return new type.
(simd_clone_adjust_argument_types): Hoist out code that creates
ipa_param_body_adjustments and don't return them.
(simd_clone_adjust): Call TARGET_SIMD_CLONE_ADJUST after return
and argument types have been vectorized, create adjustments and
return array after the hook.
(expand_simd_clones): Call TARGET_SIMD_CLONE_ADJUST after return
and argument types have been vectorized.

On 04/10/2023 13:40, Andre Vieira (lists) wrote:



On 04/10/2023 11:41, Richard Biener wrote:

On Wed, 4 Oct 2023, Andre Vieira (lists) wrote:




On 30/08/2023 14:04, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:

This patch adds a new target hook to enable us to adapt the types 
of return
and parameters of simd clones.  We use this in two ways, the first 
one is

to
make sure we can create valid SVE types, including the SVE type 
attribute,
when creating a SVE simd clone, even when the target options do not 
support
SVE.  We are following the same behaviour seen with x86 that 
creates simd
clones according to the ABI rules when no simdlen is provided, even 
if that
simdlen is not supported by the current target options.  Note that 
this

doesn't mean the simd clone will be used in auto-vectorization.


You are not documenting the bool parameter of the new hook.

What's wrong with doing the adjustment in TARGET_SIMD_CLONE_ADJUST?


simd_clone_adjust_argument_types is called after that hook, so by the 
time we
call TARGET_SIMD_CLONE_ADJUST the types are still in scalar, not 
vector.  The

same is true for the return type one.

Also the changes to the types need to be taken into consideration in
'adjustments' I think.


Nothing in the three existing implementations of TARGET_SIMD_CLONE_ADJUST
relies on this ordering I think, how about moving the hook invocation
after simd_clone_adjust_argument_types?



But that wouldn't change the 'ipa_param_body_adjustments' for when we 
have a function definition and we need to redo the body.

Richard.

PS: I hope the subject line survived, my email client is having a bit 
of a

wobble this morning... it's what you get for updating software :(

diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc
index 
ef0b9b48c7212900023bc0eaebca5e1f9389db77..fb80888190c88e29895ecfbbe1b17d390c9a9dfe
 100644
--- a/gcc/omp-simd-clone.cc
+++ b/gcc/omp-simd-clone.cc
@@ -701,10 +701,9 @@ simd_clone_create (struct cgraph_node *old_node, bool 
force_local)
 }
 
 /* Adjust the return type of the given function to its appropriate
-   vector counterpart.  Returns a simd array to be used throughout the
-   function as a return value.  */
+   vector counterpart.  */
 
-static tree
+static void
 simd_clone_adjust_return_type (struct cgraph_node *node)
 {
   tree fndecl = node->decl;
@@ -714,7 +713,7 @@ simd_clone_adjust_return_type (struct cgraph_node *node)
 
   /* Adjust the function return type.  */
   if (orig_rettype == void_type_node)
-return NULL_TREE;
+return;
   t = TREE_TYPE (TREE_TYPE (fndecl));
   if (INTEGRAL_TYPE_P (t) || POINTER_TYPE_P (t))
 veclen = node->simdclone->vecsize_int;
@@ -737,24 +736,6 @@ simd_clone_adjust_return_type (struct cgraph_node *node)
veclen));
 }
   TREE_TYPE (TREE_TYPE (fndecl)) = t;
-  if (!node->definition)
-return NULL_TREE;
-
-  t = DECL_RESULT (fndecl);
-  /* Adjust the DECL_RESULT.  */
-  gcc_assert (TREE_TYPE (t) != void_type_node);
-  TREE_TYPE (t) = TREE_TYPE (TREE_TYPE (fndecl));
-  relayout_decl (t);
-
-  tree atype = build_array_type_nelts (orig_rettype,
-  node->simdclone->simdlen);
-  if (maybe_ne (veclen, node->simdclone->simdlen))
-return build1 (VIEW_CONVERT_EXPR, atype, t);
-
-  /* Set up a SIMD array to use as the return value.  */
-  tree retval = create_tmp_var_raw (atype, "retval");
-  gimple_add_tmp_var (retval);
-  return retval;
 }
 
 /* Each vector argument has a corresponding array to be used locally
@@ -788,7 +769,7 @@ create_tmp_simd_array (const char *prefix, tree type, 
poly_uint64 simdlen)
declarations will be remapped.  New arguments which are not to be remapped
are marked with USER_FLAG.  */
 
-static ipa_param_body_adjustments *
+static void
 simd_clone_adjust_argument_types (struct cgraph_node *node)
 {
   auto_vec<tree> args;
@@ -798,15 +779,11 @@ simd_clone_adjust_argument_types (

Re: [PATCH 8/8] aarch64: Add SVE support for simd clones [PR 96342]

2023-10-18 Thread Andre Vieira (lists)

Rebased, no major changes, still needs review.
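
As a reminder of what the ABI rule below means in practice (my
illustration, on my reading of the VFABIA64; the mangled name is
indicative only):

#pragma omp declare simd
double f (double x, float y);
/* Widest type is double (8 bytes): a 128-bit SVE granule holds 2 such
   lanes, so the VLA clone gets poly simdlen 2x, mangled with an 'x'
   (something like _ZGVsMxvv_f).  */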

On 30/08/2023 10:19, Andre Vieira (lists) via Gcc-patches wrote:
This patch finalizes adding support for the generation of SVE simd 
clones when no simdlen is provided, following the ABI rules where the 
widest data type determines the minimum amount of elements in a length 
agnostic vector.


gcc/ChangeLog:

     * config/aarch64/aarch64-protos.h (add_sve_type_attribute): 
Declare.
 * config/aarch64/aarch64-sve-builtins.cc (add_sve_type_attribute): 
Make

 visibility global.
 * config/aarch64/aarch64.cc (aarch64_fntype_abi): Ensure SVE ABI is
 chosen over SIMD ABI if a SVE type is used in return or arguments.
 (aarch64_simd_clone_compute_vecsize_and_simdlen): Create VLA simd 
clone

 when no simdlen is provided, according to ABI rules.
 (aarch64_simd_clone_adjust): Add '+sve' attribute to SVE simd clones.
 (aarch64_simd_clone_adjust_ret_or_param): New.
 (TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM): Define.
 * omp-simd-clone.cc (simd_clone_mangle): Print 'x' for VLA simdlen.
 (simd_clone_adjust): Adapt safelen check to be compatible with VLA
 simdlen.

gcc/testsuite/ChangeLog:

 * c-c++-common/gomp/declare-variant-14.c: Adapt aarch64 scan.
 * gfortran.dg/gomp/declare-variant-14.f90: Likewise.
 * gcc.target/aarch64/declare-simd-1.c: Remove warning checks where no
 longer necessary.
 * gcc.target/aarch64/declare-simd-2.c: Add SVE clone scan.diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
60a55f4bc1956786ea687fc7cad7ec9e4a84e1f0..769d637f63724a7f0044f48f3dd683e0fb46049c
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1005,6 +1005,8 @@ namespace aarch64_sve {
 #ifdef GCC_TARGET_H
   bool verify_type_context (location_t, type_context_kind, const_tree, bool);
 #endif
+ void add_sve_type_attribute (tree, unsigned int, unsigned int,
+ const char *, const char *);
 }
 
 extern void aarch64_split_combinev16qi (rtx operands[3]);
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 
161a14edde7c9fb1b13b146cf50463e2d78db264..6f99c438d10daa91b7e3b623c995489f1a8a0f4c
 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -569,14 +569,16 @@ static bool reported_missing_registers_p;
 /* Record that TYPE is an ABI-defined SVE type that contains NUM_ZR SVE vectors
and NUM_PR SVE predicates.  MANGLED_NAME, if nonnull, is the ABI-defined
mangling of the type.  ACLE_NAME is the  name of the type.  */
-static void
+void
 add_sve_type_attribute (tree type, unsigned int num_zr, unsigned int num_pr,
const char *mangled_name, const char *acle_name)
 {
   tree mangled_name_tree
 = (mangled_name ? get_identifier (mangled_name) : NULL_TREE);
+  tree acle_name_tree
+= (acle_name ? get_identifier (acle_name) : NULL_TREE);
 
-  tree value = tree_cons (NULL_TREE, get_identifier (acle_name), NULL_TREE);
+  tree value = tree_cons (NULL_TREE, acle_name_tree, NULL_TREE);
   value = tree_cons (NULL_TREE, mangled_name_tree, value);
   value = tree_cons (NULL_TREE, size_int (num_pr), value);
   value = tree_cons (NULL_TREE, size_int (num_zr), value);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
37507f091c2a6154fa944c3a9fad6a655ab5d5a1..cb0947b18c6a611d55579b5b08d93f6a4a9c3b2c
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -4080,13 +4080,13 @@ aarch64_takes_arguments_in_sve_regs_p (const_tree 
fntype)
 static const predefined_function_abi &
 aarch64_fntype_abi (const_tree fntype)
 {
-  if (lookup_attribute ("aarch64_vector_pcs", TYPE_ATTRIBUTES (fntype)))
-return aarch64_simd_abi ();
-
   if (aarch64_returns_value_in_sve_regs_p (fntype)
   || aarch64_takes_arguments_in_sve_regs_p (fntype))
 return aarch64_sve_abi ();
 
+  if (lookup_attribute ("aarch64_vector_pcs", TYPE_ATTRIBUTES (fntype)))
+return aarch64_simd_abi ();
+
   return default_function_abi;
 }
 
@@ -27467,7 +27467,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct 
cgraph_node *node,
int num, bool explicit_p)
 {
   tree t, ret_type;
-  unsigned int nds_elt_bits;
+  unsigned int nds_elt_bits, wds_elt_bits;
   int count;
   unsigned HOST_WIDE_INT const_simdlen;
 
@@ -27513,10 +27513,14 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
(struct cgraph_node *node,
   if (TREE_CODE (ret_type) != VOID_TYPE)
 {
   nds_elt_bits = lane_size (SIMD_CLONE_ARG_TYPE_VECTOR, ret_type);
+  wds_elt_bits = nds_elt_bits;
   vec_elts.safe_push (std::make_pair (ret_type, nds_elt_bits));
 }
   else
-nds_elt_bits = POINTER_SIZE;
+{
+  nds_elt_bits = POINTER_SIZE;
+  wds_elt_bits = 0;
+}
 
   int i;
   tree type_arg_types = T

Re: [PATCH 4/8] vect: don't allow fully masked loops with non-masked simd clones [PR 110485]

2023-10-18 Thread Andre Vieira (lists)
Rebased on top of trunk; minor change to check for loop_vinfo, since we 
now do some SLP vectorization for simd_clones.


I assume the previous OK still holds.

On 30/08/2023 13:54, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:


When analyzing a loop and choosing a simdclone to use it is possible to choose
a simdclone that cannot be used 'inbranch' for a loop that can use partial
vectors.  This may lead to the vectorizer deciding to use partial vectors
which are not supported for notinbranch simd clones. This patch fixes that by
disabling the use of partial vectors once a notinbranch simd clone has been
selected.


OK.


gcc/ChangeLog:

PR tree-optimization/110485
* tree-vect-stmts.cc (vectorizable_simd_clone_call): Disable partial
vectors usage if a notinbranch simdclone has been selected.

gcc/testsuite/ChangeLog:

* gcc.dg/gomp/pr110485.c: New test.

diff --git a/gcc/testsuite/gcc.dg/gomp/pr110485.c 
b/gcc/testsuite/gcc.dg/gomp/pr110485.c
new file mode 100644
index 
..ba6817a127f40246071e32ccebf692cc4d121d15
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/pr110485.c
@@ -0,0 +1,19 @@
+/* PR 110485 */
+/* { dg-do compile } */
+/* { dg-additional-options "-Ofast -fdump-tree-vect-details" } */
+/* { dg-additional-options "-march=znver4 --param=vect-partial-vector-usage=1" 
{ target x86_64-*-* } } */
+#pragma omp declare simd notinbranch uniform(p)
+extern double __attribute__ ((const)) bar (double a, double p);
+
+double a[1024];
+double b[1024];
+
+void foo (int n)
+{
+  #pragma omp simd
+  for (int i = 0; i < n; ++i)
+a[i] = bar (b[i], 71.2);
+}
+
+/* { dg-final { scan-tree-dump-not "MASK_LOAD" "vect" } } */
+/* { dg-final { scan-tree-dump "can't use a fully-masked loop because a 
non-masked simd clone was selected." "vect" { target x86_64-*-* } } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 
a9156975d64c7a335ffd27614e87f9d11b23d1ba..731acc76350cae39c899a866584068cff247183a
 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4539,6 +4539,17 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
   ? boolean_true_node : boolean_false_node;
simd_clone_info.safe_push (sll);
  }
+
+  if (!bestn->simdclone->inbranch && loop_vinfo)
+   {
+ if (dump_enabled_p ()
+ && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+   dump_printf_loc (MSG_NOTE, vect_location,
+"can't use a fully-masked loop because a"
+" non-masked simd clone was selected.\n");
+ LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+   }
+
   STMT_VINFO_TYPE (stmt_info) = call_simd_clone_vec_info_type;
   DUMP_VECT_SCOPE ("vectorizable_simd_clone_call");
 /*  vect_model_simple_cost (vinfo, stmt_info, ncopies,


[PATCH 0/8] omp: Replace simd_clone_subparts with TYPE_VECTOR_SUBPARTS

2023-10-18 Thread Andre Vieira (lists)


Refactor simd clone handling code ahead of support for poly simdlen.

gcc/ChangeLog:

* omp-simd-clone.cc (simd_clone_subparts): Remove.
(simd_clone_init_simd_arrays): Replace simd_clone_subparts with
TYPE_VECTOR_SUBPARTS.
(ipa_simd_modify_function_body): Likewise.
* tree-vect-stmts.cc (vectorizable_simd_clone_call): Likewise.
(simd_clone_subparts): Remove.
diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc
index 
c1cb7cc8a5c770940bc2032f824e084b37e96dbe..a42643400ddcf10961633448b49d4caafb999f12
 100644
--- a/gcc/omp-simd-clone.cc
+++ b/gcc/omp-simd-clone.cc
@@ -255,16 +255,6 @@ ok_for_auto_simd_clone (struct cgraph_node *node)
   return true;
 }
 
-
-/* Return the number of elements in vector type VECTYPE, which is associated
-   with a SIMD clone.  At present these always have a constant length.  */
-
-static unsigned HOST_WIDE_INT
-simd_clone_subparts (tree vectype)
-{
-  return TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
-}
-
 /* Allocate a fresh `simd_clone' and return it.  NARGS is the number
of arguments to reserve space for.  */
 
@@ -1028,7 +1018,7 @@ simd_clone_init_simd_arrays (struct cgraph_node *node,
}
  continue;
}
-  if (known_eq (simd_clone_subparts (TREE_TYPE (arg)),
+  if (known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)),
node->simdclone->simdlen))
{
  tree ptype = build_pointer_type (TREE_TYPE (TREE_TYPE (array)));
@@ -1040,7 +1030,7 @@ simd_clone_init_simd_arrays (struct cgraph_node *node,
}
   else
{
- unsigned int simdlen = simd_clone_subparts (TREE_TYPE (arg));
+ poly_uint64 simdlen = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg));
  unsigned int times = vector_unroll_factor (node->simdclone->simdlen,
 simdlen);
  tree ptype = build_pointer_type (TREE_TYPE (TREE_TYPE (array)));
@@ -1226,9 +1216,9 @@ ipa_simd_modify_function_body (struct cgraph_node *node,
  iter, NULL_TREE, NULL_TREE);
   adjustments->register_replacement (&(*adjustments->m_adj_params)[j], r);
 
-  if (multiple_p (node->simdclone->simdlen, simd_clone_subparts (vectype)))
+  if (multiple_p (node->simdclone->simdlen, TYPE_VECTOR_SUBPARTS 
(vectype)))
j += vector_unroll_factor (node->simdclone->simdlen,
-  simd_clone_subparts (vectype)) - 1;
+  TYPE_VECTOR_SUBPARTS (vectype)) - 1;
 }
   adjustments->sort_replacements ();
 
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 
9bb43e98f56d18929c9c02227954fdf38eafefd8..a9156975d64c7a335ffd27614e87f9d11b23d1ba
 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4126,16 +4126,6 @@ vect_simd_lane_linear (tree op, class loop *loop,
 }
 }
 
-/* Return the number of elements in vector type VECTYPE, which is associated
-   with a SIMD clone.  At present these vectors always have a constant
-   length.  */
-
-static unsigned HOST_WIDE_INT
-simd_clone_subparts (tree vectype)
-{
-  return TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
-}
-
 /* Function vectorizable_simd_clone_call.
 
Check if STMT_INFO performs a function call that can be vectorized
@@ -4429,7 +4419,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
slp_node);
  if (arginfo[i].vectype == NULL
  || !constant_multiple_p (bestn->simdclone->simdlen,
-  simd_clone_subparts 
(arginfo[i].vectype)))
+  TYPE_VECTOR_SUBPARTS 
(arginfo[i].vectype)))
return false;
}
 
@@ -,10 +4434,11 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
 
   if (bestn->simdclone->args[i].arg_type == SIMD_CLONE_ARG_TYPE_MASK)
{
+ tree clone_arg_vectype = bestn->simdclone->args[i].vector_type;
  if (bestn->simdclone->mask_mode == VOIDmode)
{
- if (simd_clone_subparts (bestn->simdclone->args[i].vector_type)
- != simd_clone_subparts (arginfo[i].vectype))
+ if (maybe_ne (TYPE_VECTOR_SUBPARTS (clone_arg_vectype),
+   TYPE_VECTOR_SUBPARTS (arginfo[i].vectype)))
{
  /* FORNOW we only have partial support for vector-type masks
 that can't hold all of simdlen. */
@@ -4464,7 +4455,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
  if (!SCALAR_INT_MODE_P (TYPE_MODE (arginfo[i].vectype))
  || maybe_ne (exact_div (bestn->simdclone->simdlen,
  num_mask_args),
-  simd_clone_subparts (arginfo[i].vectype)))
+  TYPE_VECTOR_SUBPARTS (arginfo[i].vectype)))

Re: [PATCH 5/8] vect: Use inbranch simdclones in masked loops

2023-10-18 Thread Andre Vieira (lists)

Rebased, needs review.
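
A standalone example of the kind of loop this enables (my sketch, in the
same spirit as the vect-simd-clone-20.c test elsewhere in the series):
with a fully-masked loop the loop mask can simply be passed to the
inbranch clone's mask argument, even though the call is unconditional:

#pragma omp declare simd inbranch
int __attribute__ ((noinline))
add_one (int x)
{
  return x + 1;
}

void
apply (int *__restrict dst, int *__restrict src, int n)
{
  #pragma omp simd
  for (int i = 0; i < n; i++)
    dst[i] = add_one (src[i]);  /* no condition: mask is all-true,
				   further masked by the loop mask.  */
}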

On 30/08/2023 10:13, Andre Vieira (lists) via Gcc-patches wrote:
This patch enables the compiler to use inbranch simdclones when 
generating masked loops in autovectorization.


gcc/ChangeLog:

 * omp-simd-clone.cc (simd_clone_adjust_argument_types): Make function
 compatible with mask parameters in clone.
 * tree-vect-stmts.cc (vect_convert): New helper function.
 (vect_build_all_ones_mask): Allow vector boolean typed masks.
 (vectorizable_simd_clone_call): Enable the use of masked clones in
 fully masked loops.diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc
index 
a42643400ddcf10961633448b49d4caafb999f12..ef0b9b48c7212900023bc0eaebca5e1f9389db77
 100644
--- a/gcc/omp-simd-clone.cc
+++ b/gcc/omp-simd-clone.cc
@@ -807,8 +807,14 @@ simd_clone_adjust_argument_types (struct cgraph_node *node)
 {
   ipa_adjusted_param adj;
   memset (&adj, 0, sizeof (adj));
-  tree parm = args[i];
-  tree parm_type = node->definition ? TREE_TYPE (parm) : parm;
+  tree parm = NULL_TREE;
+  tree parm_type = NULL_TREE;
+  if(i < args.length())
+   {
+ parm = args[i];
+ parm_type = node->definition ? TREE_TYPE (parm) : parm;
+   }
+
   adj.base_index = i;
   adj.prev_clone_index = i;
 
@@ -1547,7 +1553,7 @@ simd_clone_adjust (struct cgraph_node *node)
  mask = gimple_assign_lhs (g);
  g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)),
   BIT_AND_EXPR, mask,
-  build_int_cst (TREE_TYPE (mask), 1));
+  build_one_cst (TREE_TYPE (mask)));
  gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
  mask = gimple_assign_lhs (g);
}
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 
731acc76350cae39c899a866584068cff247183a..6e2c70c1d3970af652c1e50e41b144162884bf24
 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1594,6 +1594,20 @@ check_load_store_for_partial_vectors (loop_vec_info 
loop_vinfo, tree vectype,
 }
 }
 
+/* Return SSA name of the result of the conversion of OPERAND into type TYPE.
+   The conversion statement is inserted at GSI.  */
+
+static tree
+vect_convert (vec_info *vinfo, stmt_vec_info stmt_info, tree type, tree 
operand,
+ gimple_stmt_iterator *gsi)
+{
+  operand = build1 (VIEW_CONVERT_EXPR, type, operand);
+  gassign *new_stmt = gimple_build_assign (make_ssa_name (type),
+  operand);
+  vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
+  return gimple_get_lhs (new_stmt);
+}
+
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
form of the scalar mask condition and LOOP_MASK, if nonnull, is the mask
that needs to be applied to all loads and stores in a vectorized loop.
@@ -2547,7 +2561,8 @@ vect_build_all_ones_mask (vec_info *vinfo,
 {
   if (TREE_CODE (masktype) == INTEGER_TYPE)
 return build_int_cst (masktype, -1);
-  else if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
+  else if (VECTOR_BOOLEAN_TYPE_P (masktype)
+  || TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
 {
   tree mask = build_int_cst (TREE_TYPE (masktype), -1);
   mask = build_vector_from_val (masktype, mask);
@@ -4156,7 +4171,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
   size_t i, nargs;
   tree lhs, rtype, ratype;
   vec<constructor_elt, va_gc> *ret_ctor_elts = NULL;
-  int arg_offset = 0;
+  int masked_call_offset = 0;
 
   /* Is STMT a vectorizable call?   */
   gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt);
@@ -4171,7 +4186,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
   gcc_checking_assert (TREE_CODE (fndecl) == ADDR_EXPR);
   fndecl = TREE_OPERAND (fndecl, 0);
   gcc_checking_assert (TREE_CODE (fndecl) == FUNCTION_DECL);
-  arg_offset = 1;
+  masked_call_offset = 1;
 }
   if (fndecl == NULL_TREE)
 return false;
@@ -4199,7 +4214,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
 return false;
 
   /* Process function arguments.  */
-  nargs = gimple_call_num_args (stmt) - arg_offset;
+  nargs = gimple_call_num_args (stmt) - masked_call_offset;
 
   /* Bail out if the function has zero arguments.  */
   if (nargs == 0)
@@ -4221,7 +4236,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
   thisarginfo.op = NULL_TREE;
   thisarginfo.simd_lane_linear = false;
 
-  int op_no = i + arg_offset;
+  int op_no = i + masked_call_offset;
   if (slp_node)
op_no = vect_slp_child_index_for_operand (stmt, op_no);
   if (!vect_is_simple_use (vinfo, stmt_info, slp_node,
@@ -4303,16 +4318,6 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
   arginfo.quick_push (thisarginfo);
 }
 
-  if (loop_vinfo
-  && !LOOP_VINFO_VEC

Re: [Patch 3/8] vect: Fix vect_get_smallest_scalar_type for simd clones

2023-10-18 Thread Andre Vieira (lists)

Made it a local function and changed prototype according to comments.

Is this OK?

 gcc/ChangeLog:
* tree-vect-data-refs.cc (vect_get_smallest_scalar_type): Special
case
simd clone calls and only use types that are mapped to vectors.
(simd_clone_call_p): New helper function.

On 30/08/2023 13:54, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:


The vect_get_smallest_scalar_type helper function was using any argument to a
simd clone call when trying to determine the smallest scalar type that would
be vectorized.  This included the function pointer type in a MASK_CALL for
instance, and would result in the wrong type being selected.  Instead this
patch special cases simd_clone_call's and uses only scalar types of the
original function that get transformed into vector types.


Looks sensible.

+bool
+simd_clone_call_p (gimple *stmt, cgraph_node **out_node)

you could return the cgraph_node * or NULL here.  Are you going to
use the function elsewhere?  Otherwise put it in the same TU as
the only use please and avoid exporting it.

Richard.


gcc/ChangeLog:

* tree-vect-data-refs.cc (vect_get_smallest_scalar_type): Special
case
simd clone calls and only use types that are mapped to vectors.
* tree-vect-stmts.cc (simd_clone_call_p): New helper function.
* tree-vectorizer.h (simd_clone_call_p): Declare new function.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-simd-clone-16f.c: Remove unnecessary differentation
between targets with different pointer sizes.
* gcc.dg/vect/vect-simd-clone-17f.c: Likewise.
* gcc.dg/vect/vect-simd-clone-18f.c: Likewise.

diff --git a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-16f.c 
b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-16f.c
index 
574698d3e133ecb8700e698fa42a6b05dd6b8a18..7cd29e894d0502a59fadfe67db2db383133022d3
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-16f.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-16f.c
@@ -7,9 +7,8 @@
 #include "vect-simd-clone-16.c"
 
 /* Ensure the the in-branch simd clones are used on targets that support them.
-   Some targets use pairs of vectors and do twice the calls.  */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
{ target { ! { { i?86-*-* x86_64-*-* } && { ! lp64 } } } } } } */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 4 "vect" 
{ target { { i?86*-*-* x86_64-*-* } && { ! lp64 } } } } } */
+ */
+/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
} } */
 
 /* The LTO test produces two dump files and we scan the wrong one.  */
 /* { dg-skip-if "" { *-*-* } { "-flto" } { "" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-17f.c 
b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-17f.c
index 
8bb6d19301a67a3eebce522daaf7d54d88f708d7..177521dc44531479fca1f1a1a0f2010f30fa3fb5
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-17f.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-17f.c
@@ -7,9 +7,8 @@
 #include "vect-simd-clone-17.c"
 
 /* Ensure the the in-branch simd clones are used on targets that support them.
-   Some targets use pairs of vectors and do twice the calls.  */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
{ target { ! { { i?86-*-* x86_64-*-* } && { ! lp64 } } } } } } */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 4 "vect" 
{ target { { i?86*-*-* x86_64-*-* } && { ! lp64 } } } } } */
+ */
+/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
} } */
 
 /* The LTO test produces two dump files and we scan the wrong one.  */
 /* { dg-skip-if "" { *-*-* } { "-flto" } { "" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-18f.c 
b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-18f.c
index 
d34f23f4db8e9c237558cc22fe66b7e02b9e6c20..4dd51381d73c0c7c8ec812f24e5054df038059c5
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-18f.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-18f.c
@@ -7,9 +7,8 @@
 #include "vect-simd-clone-18.c"
 
 /* Ensure the the in-branch simd clones are used on targets that support them.
-   Some targets use pairs of vectors and do twice the calls.  */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
{ target { ! { { i?86-*-* x86_64-*-* } && { ! lp64 } } } } } } */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 4 "vect" 
{ target { { i?86*-*-* x86_64-*-* } && { ! lp64 } } } } } */
+ */
+/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
} } */
 
 /* The LTO test produces two dump files and we scan the wrong one.  */
 /* { dg-skip-if "" { *-*-* } 

Re: [Patch 2/8] parloops: Allow poly nit and bound

2023-10-18 Thread Andre Vieira (lists)

Posting the changed patch for completion, already reviewed.

On 30/08/2023 13:32, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:


Teach parloops how to handle a poly nit and bound ahead of the changes to
enable non-constant simdlen.


Can you use poly_int_tree_p to combine INTEGER_CST || POLY_INT_CST please?

OK with that change.


gcc/ChangeLog:

* tree-parloops.cc (try_to_transform_to_exit_first_loop_alt): Accept
poly NIT and ALT_BOUND.

diff --git a/gcc/tree-parloops.cc b/gcc/tree-parloops.cc
index 
a35f3d5023b06e5ef96eb4222488fcb34dd7bd45..80f3dd6dce281e1eb1d76d38bd09e6638a875142
 100644
--- a/gcc/tree-parloops.cc
+++ b/gcc/tree-parloops.cc
@@ -2531,14 +2531,15 @@ try_transform_to_exit_first_loop_alt (class loop *loop,
   tree nit_type = TREE_TYPE (nit);
 
   /* Figure out whether nit + 1 overflows.  */
-  if (TREE_CODE (nit) == INTEGER_CST)
+  if (poly_int_tree_p (nit))
 {
   if (!tree_int_cst_equal (nit, TYPE_MAX_VALUE (nit_type)))
{
  alt_bound = fold_build2_loc (UNKNOWN_LOCATION, PLUS_EXPR, nit_type,
   nit, build_one_cst (nit_type));
 
- gcc_assert (TREE_CODE (alt_bound) == INTEGER_CST);
+ gcc_assert (TREE_CODE (alt_bound) == INTEGER_CST
+ || TREE_CODE (alt_bound) == POLY_INT_CST);
  transform_to_exit_first_loop_alt (loop, reduction_list, alt_bound);
  return true;
}


Re: [PATCH 1/8] parloops: Copy target and optimizations when creating a function clone

2023-10-18 Thread Andre Vieira (lists)

Just posting a rebase for completion.

On 30/08/2023 13:31, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:



SVE simd clones need to be compiled with an SVE target enabled, or the
argument types will not be created properly. To achieve this we need to copy
DECL_FUNCTION_SPECIFIC_TARGET from the original function declaration to the
clones.  I decided it was probably also a good idea to copy
DECL_FUNCTION_SPECIFIC_OPTIMIZATION in case the original function is meant to
be compiled with specific optimization options.


OK.


gcc/ChangeLog:

* tree-parloops.cc (create_loop_fn): Copy specific target and
optimization options to clone.

diff --git a/gcc/tree-parloops.cc b/gcc/tree-parloops.cc
index 
e495bbd65270bdf90bae2c4a2b52777522352a77..a35f3d5023b06e5ef96eb4222488fcb34dd7bd45
 100644
--- a/gcc/tree-parloops.cc
+++ b/gcc/tree-parloops.cc
@@ -2203,6 +2203,11 @@ create_loop_fn (location_t loc)
   DECL_CONTEXT (t) = decl;
   TREE_USED (t) = 1;
   DECL_ARGUMENTS (decl) = t;
+  DECL_FUNCTION_SPECIFIC_TARGET (decl)
+= DECL_FUNCTION_SPECIFIC_TARGET (act_cfun->decl);
+  DECL_FUNCTION_SPECIFIC_OPTIMIZATION (decl)
+= DECL_FUNCTION_SPECIFIC_OPTIMIZATION (act_cfun->decl);
+
 
   allocate_struct_function (decl, false);
 


Re: aarch64, vect, omp: Add SVE support for simd clones [PR 96342]

2023-10-18 Thread Andre Vieira (lists)

Hi,

I noticed I had missed one of the preparatory patches at the start of 
this series (the first one); it is added now. I also removed 'vect: Add 
vector_mode paramater to simd_clone_usable', since after review we no 
longer deemed it necessary, and replaced the old 'vect: Add 
TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM' with 'omp: Reorder call for 
TARGET_SIMD_CLONE_ADJUST' after comments.


Bootstrapped and regression tested the series on 
aarch64-unknown-linux-gnu and x86_64-pc-linux-gnu.



Andre Vieira (8):

omp: Replace simd_clone_subparts with TYPE_VECTOR_SUBPARTS [NEW]
parloops: Copy target and optimizations when creating a function clone 
[Reviewed]
parloops: Allow poly nit and bound [Cond Reviewed, made the requested 
changes]
vect: Fix vect_get_smallest_scalar_type for simd clones [First review 
done, made the requested changes, OK?]
vect: don't allow fully masked loops with non-masked simd clones [PR 
110485] [Reviewed]

vect: Use inbranch simdclones in masked loops [Needs review]
vect: omp: Reorder call for TARGET_SIMD_CLONE_ADJUST [NEW]
aarch64: Add SVE support for simd clones [PR 96342] [Needs review]

PS: apologies for the inconsistent numbering of the emails; things got a 
bit confusing with removing and adding patches to the series.


On 30/08/2023 09:49, Andre Vieira (lists) via Gcc-patches wrote:

Hi,

This patch series aims to implement support for SVE simd clones when not 
specifying a 'simdlen' clause for AArch64. This patch depends on my 
earlier patch: '[PATCH] aarch64: enable mixed-types for aarch64 
simdclones'.


Bootstrapped and regression tested the series on 
aarch64-unknown-linux-gnu and x86_64-pc-linux-gnu. I also tried building 
the patches separately, but that was before some further clean-up 
restructuring, so will do that again prior to pushing.


Andre Vieira (8):

parloops: Copy target and optimizations when creating a function clone
parloops: Allow poly nit and bound
vect: Fix vect_get_smallest_scalar_type for simd clones
vect: don't allow fully masked loops with non-masked simd clones [PR 
110485]

vect: Use inbranch simdclones in masked loops
vect: Add vector_mode paramater to simd_clone_usable
vect: Add TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM
aarch64: Add SVE support for simd clones [PR 96342]


Re: Check that passes do not forget to define profile

2023-10-17 Thread Andre Vieira (lists)

So OK to commit this?

This patch makes sure the profile_count information is initialized for 
the new bb created in move_sese_region_to_fn.

gcc/ChangeLog:

* tree-cfg.cc (move_sese_region_to_fn): Initialize profile_count for
new basic block.

Bootstrapped and regression tested on aarch64-unknown-linux-gnu and 
x86_64-pc-linux-gnu.


On 04/10/2023 12:02, Jan Hubicka wrote:

Hi Honza,

My current patch set for AArch64 VLA omp codegen started failing on
gcc.dg/gomp/pr87898.c after this. I traced it back to
'move_sese_region_to_fn' in tree-cfg.cc not setting count for the bb
created.

I was able to 'fix' it locally by setting the count of the new bb to the
accumulation of e->count () of all the entry_edges (if initialized). I'm
however not even close to certain that's the right approach, attached patch
for illustration.

Kind regards,
Andre
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc



index 
ffab7518b1568b58e610e26feb9e3cab18ddb3c2..32fc47ae683164bf8fac477fbe6e2c998382e754
 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -8160,11 +8160,15 @@ move_sese_region_to_fn (struct function *dest_cfun, 
basic_block entry_bb,
bb = create_empty_bb (entry_pred[0]);
if (current_loops)
  add_bb_to_loop (bb, loop);
+  profile_count count = profile_count::zero ();
for (i = 0; i < num_entry_edges; i++)
  {
e = make_edge (entry_pred[i], bb, entry_flag[i]);
e->probability = entry_prob[i];
+  if (e->count ().initialized_p ())
+   count += e->count ();
  }
+  bb->count = count;


This looks generally right - if you create a BB you need to set its
count, and unless it has a self-loop that is the sum of the counts of
incoming edges.

However, the initialized_p check should be unnecessary: if one of the entry
edges to BB is uninitialized, the + operation will make the bb count
uninitialized too, which is OK.

Honza
  
for (i = 0; i < num_exit_edges; i++)

  {
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 
ffab7518b1568b58e610e26feb9e3cab18ddb3c2..ffeb20b717aead756844c5f48c2cc23f5e9f14a6
 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -8160,11 +8160,14 @@ move_sese_region_to_fn (struct function *dest_cfun, 
basic_block entry_bb,
   bb = create_empty_bb (entry_pred[0]);
   if (current_loops)
 add_bb_to_loop (bb, loop);
+  profile_count count = profile_count::zero ();
   for (i = 0; i < num_entry_edges; i++)
 {
   e = make_edge (entry_pred[i], bb, entry_flag[i]);
   e->probability = entry_prob[i];
+  count += e->count ();
 }
+  bb->count = count;
 
   for (i = 0; i < num_exit_edges; i++)
 {


Re: [PATCH] aarch64: enable mixed-types for aarch64 simdclones

2023-10-16 Thread Andre Vieira (lists)

Hey,

Just a minor update to the patch: I had missed the libgomp testsuite, so 
had to make some adjustments there too.


gcc/ChangeLog:

* config/aarch64/aarch64.cc (lane_size): New function.
(aarch64_simd_clone_compute_vecsize_and_simdlen): Determine 
simdlen according to the NDS rule and reject combinations of 
simdlen and types that lead to vectors larger than 128 bits.
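
A standalone sketch of the two rules above (mine, not part of the patch):

#pragma omp declare simd
double f (float x);
/* No simdlen given: NDS is 4 bytes (float), so clones are created for
   64-bit and 128-bit vectors, i.e. simdlen 2 and 4.  */

#pragma omp declare simd simdlen(16)
double g (double x);
/* 16 lanes x 8 bytes = 1024-bit vectors, larger than 128 bits: the
   clone is rejected, with a warning since the simdlen is explicit.  */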


gcc/testsuite/ChangeLog:

* lib/target-supports.exp: Add aarch64 targets to vect_simd_clones.
* c-c++-common/gomp/declare-variant-14.c: Adapt test for aarch64.
* c-c++-common/gomp/pr60823-1.c: Likewise.
* c-c++-common/gomp/pr60823-2.c: Likewise.
* c-c++-common/gomp/pr60823-3.c: Likewise.
* g++.dg/gomp/attrs-10.C: Likewise.
* g++.dg/gomp/declare-simd-1.C: Likewise.
* g++.dg/gomp/declare-simd-3.C: Likewise.
* g++.dg/gomp/declare-simd-4.C: Likewise.
* g++.dg/gomp/declare-simd-7.C: Likewise.
* g++.dg/gomp/declare-simd-8.C: Likewise.
* g++.dg/gomp/pr88182.C: Likewise.
* gcc.dg/declare-simd.c: Likewise.
* gcc.dg/gomp/declare-simd-1.c: Likewise.
* gcc.dg/gomp/declare-simd-3.c: Likewise.
* gcc.dg/gomp/pr87887-1.c: Likewise.
* gcc.dg/gomp/pr87895-1.c: Likewise.
* gcc.dg/gomp/pr89246-1.c: Likewise.
* gcc.dg/gomp/pr99542.c: Likewise.
* gcc.dg/gomp/simd-clones-2.c: Likewise.
* gcc.dg/vect/vect-simd-clone-1.c: Likewise.
* gcc.dg/vect/vect-simd-clone-2.c: Likewise.
* gcc.dg/vect/vect-simd-clone-4.c: Likewise.
* gcc.dg/vect/vect-simd-clone-5.c: Likewise.
* gcc.dg/vect/vect-simd-clone-8.c: Likewise.
* gfortran.dg/gomp/declare-simd-2.f90: Likewise.
* gfortran.dg/gomp/declare-simd-coarray-lib.f90: Likewise.
* gfortran.dg/gomp/declare-variant-14.f90: Likewise.
* gfortran.dg/gomp/pr79154-1.f90: Likewise.
* gfortran.dg/gomp/pr83977.f90: Likewise.

libgomp/testsuite/ChangeLog:

* libgomp.c/declare-variant-1.c: Adapt test for aarch64.
* libgomp.fortran/declare-simd-1.f90: Likewise.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
9fbfc548a891f5d11940c6fd3c49a14bfbdec886..37507f091c2a6154fa944c3a9fad6a655ab5d5a1
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -27414,33 +27414,62 @@ supported_simd_type (tree t)
   return false;
 }
 
-/* Return true for types that currently are supported as SIMD return
-   or argument types.  */
+/* Determine the lane size for the clone argument/return type.  This follows
+   the LS(P) rule in the VFABIA64.  */
 
-static bool
-currently_supported_simd_type (tree t, tree b)
+static unsigned
+lane_size (cgraph_simd_clone_arg_type clone_arg_type, tree type)
 {
-  if (COMPLEX_FLOAT_TYPE_P (t))
-return false;
+  gcc_assert (clone_arg_type != SIMD_CLONE_ARG_TYPE_MASK);
 
-  if (TYPE_SIZE (t) != TYPE_SIZE (b))
-return false;
+  /* For non map-to-vector types that are pointers we use the element type it
+ points to.  */
+  if (POINTER_TYPE_P (type))
+switch (clone_arg_type)
+  {
+  default:
+   break;
+  case SIMD_CLONE_ARG_TYPE_UNIFORM:
+  case SIMD_CLONE_ARG_TYPE_LINEAR_CONSTANT_STEP:
+  case SIMD_CLONE_ARG_TYPE_LINEAR_VARIABLE_STEP:
+   type = TREE_TYPE (type);
+   break;
+  }
 
-  return supported_simd_type (t);
+  /* For types (or types pointers of non map-to-vector types point to) that are
+ integers or floating point, we use their size if they are 1, 2, 4 or 8.
+   */
+  if (INTEGRAL_TYPE_P (type)
+  || SCALAR_FLOAT_TYPE_P (type))
+  switch (TYPE_PRECISION (type) / BITS_PER_UNIT)
+   {
+   default:
+ break;
+   case 1:
+   case 2:
+   case 4:
+   case 8:
+ return TYPE_PRECISION (type);
+   }
+  /* For any other we use the size of uintptr_t.  For map-to-vector types that
+ are pointers, using the size of uintptr_t is the same as using the size of
+ their type, seeing all pointers are the same size as uintptr_t.  */
+  return POINTER_SIZE;
 }
 
+
 /* Implement TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN.  */
 
 static int
 aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
struct cgraph_simd_clone *clonei,
-   tree base_type, int num,
-   bool explicit_p)
+   tree base_type ATTRIBUTE_UNUSED,
+   int num, bool explicit_p)
 {
   tree t, ret_type;
-  unsigned int elt_bits, count;
+  unsigned int nds_elt_bits;
+  int count;
   unsigned HOST_WIDE_INT const_simdlen;
-  poly_uint64 vec_bits;
 
   if (!TARGET_SIMD)
 return 0;
@@ -27460,80 +27489,135 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
(struct cgraph_node *node,
 }
 
   

  1   2   3   4   5   6   7   8   >