Re: [PATCH] aarch64: Improve popcount for bytes [PR113042]

2024-06-10 Thread Kyrylo Tkachov
Hi Andrew

-----Original Message-----
From: Andrew Pinski <quic_apin...@quicinc.com>
Date: Monday, 10 June 2024 at 06:05
To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Cc: Andrew Pinski <quic_apin...@quicinc.com>
Subject: [PATCH] aarch64: Improve popcount for bytes [PR113042]


For a byte popcount we don't need the reduction addition
after the vector cnt instruction, as we are only counting
the bits of a single byte.
This implements a new define_expand to handle that.
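
To make the difference concrete, here is a rough sketch for loading a byte
and taking its popcount (illustrative only, not verified compiler output).
Before, with the redundant reduction:

	ldr	b0, [x0]
	cnt	v0.8b, v0.8b
	addv	b0, v0.8b
	smov	w0, v0.b[0]

After, as the new test below expects:

	ldr	b0, [x0]
	cnt	v0.8b, v0.8b
	smov	w0, v0.b[0]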


Bootstrapped and tested on aarch64-linux-gnu with no regressions.


PR target/113042


gcc/ChangeLog:


* config/aarch64/aarch64.md (popcountqi2): New pattern.


gcc/testsuite/ChangeLog:


* gcc.target/aarch64/popcnt5.c: New test.


Signed-off-by: Andrew Pinski <quic_apin...@quicinc.com>
---
 gcc/config/aarch64/aarch64.md              | 26 ++++++++++++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/popcnt5.c | 19 +++++++++++++++++++
 2 files changed, 45 insertions(+)
create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt5.c


diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 389a1906e23..ebaf7ec9970 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -5358,6 +5358,32 @@ (define_expand "popcount2"
}
})


+/* The popcount of a byte does not need the cross-lane reduction that
+   follows the vector cnt instruction.  Enable this for CSSC too, so the
+   scalar code path can be used.  */
+(define_expand "popcountqi2"
+  [(set (match_operand:QI 0 "register_operand" "=w")
+	(popcount:QI (match_operand:QI 1 "register_operand" "w")))]
+  "TARGET_CSSC || TARGET_SIMD"
+{
+  rtx in = operands[1];
+  rtx out = operands[0];
+  if (TARGET_CSSC)
+    {
+      rtx tmp = gen_reg_rtx (SImode);
+      rtx out1 = gen_reg_rtx (SImode);
+      emit_insn (gen_zero_extendqisi2 (tmp, in));
+      emit_insn (gen_popcountsi2 (out1, tmp));
+      emit_move_insn (out, gen_lowpart (QImode, out1));
+      DONE;
+    }
+  rtx v = gen_reg_rtx (V8QImode);
+  rtx v1 = gen_reg_rtx (V8QImode);
+  emit_move_insn (v, gen_lowpart (V8QImode, in));
+  emit_insn (gen_popcountv8qi2 (v1, v));
+  emit_move_insn (out, gen_lowpart (QImode, v1));
+  DONE;
+})

TBH I'd rather merge it with the GPI popcount pattern that looks almost 
identical. You could extend it with the ALLI iterator and handle HImode as well 
quite easily.
Thanks,
Kyrill


+
(define_insn "clrsb2"
[(set (match_operand:GPI 0 "register_operand" "=r")
(clrsb:GPI (match_operand:GPI 1 "register_operand" "r")))]
diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt5.c b/gcc/testsuite/gcc.target/aarch64/popcnt5.c
new file mode 100644
index 000..406369d9b29
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/popcnt5.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+/* PR target/113042 */
+
+#pragma GCC target "+nocssc"
+
+/*
+** h8:
+** ldr b[0-9]+, \[x0\]
+** cnt v[0-9]+.8b, v[0-9]+.8b
+** smov w0, v[0-9]+.b\[0\]
+** ret
+*/
+/* We should not need the addv here since we only need a byte popcount. */
+
+unsigned h8 (const unsigned char *a) {
+ return __builtin_popcountg (a[0]);
+}
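
For reference, with CSSC available (i.e. without the +nocssc pragma) the
expander goes through the scalar path instead, so one would expect something
roughly like this (an illustrative sketch, not output from this patch):

	ldrb	w0, [x0]
	cnt	w0, w0
	ret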
--
2.42.0







[MAINTAINERS] Update my email address and step down as arm port maintainer

2024-04-04 Thread Kyrylo Tkachov
Hi all,

I'm stepping down as arm maintainer. Realistically I won't have good access to 
arm hardware to test patches
for the port in the foreseeable future, or at least the more active M-profile 
parts of it.
I'm still happy to keep helping with AArch64 though.
I'm also adding myself to the DCO section in the meantime.
A big thank you to the GCC community for giving me the opportunity and special 
thanks to Richard and Ramana
for guiding me in starting with patch reviews.

Pushing to trunk.

Thanks,
Kyrill

* MAINTAINERS: Update my email details, remove myself as arm maintainer.
Add myself to DCO section.


Attachment: maintainers.patch


[PATCH][wwwdocs] changes.html changes for AArch64 for GCC 14.1

2024-04-02 Thread Kyrylo Tkachov
Hi all,

Here's a writeup of the AArch64 changes to highlight in GCC 14.1.
If there's something you'd like to highlight feel free to comment or add
a patch yourself. I don't expect the list to be exhaustive.

It's been a busy release for AArch64!
Thanks,
Kyrill


Attachment: gcc-14-aarch64-wwwdocs.patch


RE: [PATCH] aarch64: Align lrcpc3 FEAT_STRING with /proc/cpuinfo 'Features' entry

2024-03-25 Thread Kyrylo Tkachov
Hi Victor,

> -Original Message-
> From: Victor Do Nascimento 
> Sent: Monday, March 25, 2024 10:59 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov ; Richard Sandiford
> ; Richard Earnshaw
> ; Victor Do Nascimento
> 
> Subject: [PATCH] aarch64: Align lrcpc3 FEAT_STRING with /proc/cpuinfo
> 'Features' entry
> 
> Due to the Linux kernel exposing the lrcpc3 architectural feature as
> "lrcpc3", this patch corrects the relevant FEATURE_STRING entry in the
> "rcpc3" AARCH64_OPT_FMV_EXTENSION macro, such that the feature can be
> correctly detected when doing native compilation on rcpc3-enabled
> targets.
> 
> Regtested on aarch64-linux-gnu.

Ok but...

> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-option-extensions.def: Fix 'lrcpc3'
>   entry.

This would usually be written as:
* config/aarch64/aarch64-option-extensions.def (rcpc3):
Fix FEATURE_STRING field to "lrcpc3".

Thanks,
Kyrill

> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/aarch64/cpunative/info_24: New.
>   * gcc.target/aarch64/cpunative/native_cpu_24.c:
>   Likewise.
> ---
>  gcc/config/aarch64/aarch64-option-extensions.def  |  2 +-
>  gcc/testsuite/gcc.target/aarch64/cpunative/info_24|  8 
>  .../gcc.target/aarch64/cpunative/native_cpu_24.c  | 11 +++
>  3 files changed, 20 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/cpunative/info_24
>  create mode 100644
> gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_24.c
> 
> diff --git a/gcc/config/aarch64/aarch64-option-extensions.def
> b/gcc/config/aarch64/aarch64-option-extensions.def
> index 1a3b91c68cf..975e7b84cec 100644
> --- a/gcc/config/aarch64/aarch64-option-extensions.def
> +++ b/gcc/config/aarch64/aarch64-option-extensions.def
> @@ -174,7 +174,7 @@ AARCH64_OPT_FMV_EXTENSION("rcpc", RCPC, (), (), (),
> "lrcpc")
> 
>  AARCH64_FMV_FEATURE("rcpc2", RCPC2, (RCPC))
> 
> -AARCH64_OPT_FMV_EXTENSION("rcpc3", RCPC3, (), (), (), "rcpc3")
> +AARCH64_OPT_FMV_EXTENSION("rcpc3", RCPC3, (), (), (), "lrcpc3")
> 
>  AARCH64_FMV_FEATURE("frintts", FRINTTS, ())
> 
> diff --git a/gcc/testsuite/gcc.target/aarch64/cpunative/info_24
> b/gcc/testsuite/gcc.target/aarch64/cpunative/info_24
> new file mode 100644
> index 000..8d3c16a1091
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/cpunative/info_24
> @@ -0,0 +1,8 @@
> +processor: 0
> +BogoMIPS : 100.00
> +Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 asimddp lrcpc3
> +CPU implementer  : 0xfe
> +CPU architecture: 8
> +CPU variant  : 0x0
> +CPU part : 0xd08
> +CPU revision : 2
> \ No newline at end of file
> diff --git a/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_24.c
> b/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_24.c
> new file mode 100644
> index 000..05dc870885f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_24.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target { { aarch64*-*-linux* } && native } } } */
> +/* { dg-set-compiler-env-var GCC_CPUINFO "$srcdir/gcc.target/aarch64/cpunative/info_24" } */
> +/* { dg-additional-options "-mcpu=native --save-temps " } */
> +
> +int main()
> +{
> +  return 0;
> +}
> +
> +/* { dg-final { scan-assembler {\.arch armv8-a\+dotprod\+crc\+crypto\+rcpc3} } } */
> +/* Test one where rcpc3 is available and so should be emitted.  */
> --
> 2.34.1



RE: [libatomic PATCH] PR other/113336: Fix libatomic testsuite regressions on ARM.

2024-02-14 Thread Kyrylo Tkachov


> -Original Message-
> From: Victor Do Nascimento 
> Sent: Wednesday, February 14, 2024 5:06 PM
> To: Roger Sayle ; gcc-patches@gcc.gnu.org;
> Richard Earnshaw 
> Subject: Re: [libatomic PATCH] PR other/113336: Fix libatomic testsuite
> regressions on ARM.
> 
> Though I'm not in a position to approve the patch, I'm happy to confirm
> the proposed changes look good to me.
> 
> Thanks for the updated version,
> Victor
> 

This is ok from me too.
Thanks Victor for helping with the review.
Kyrill

> 
> On 1/28/24 16:24, Roger Sayle wrote:
> >
> > This patch is a revised version of the fix for PR other/113336.
> >
> > This patch has been tested on arm-linux-gnueabihf with --with-arch=armv6
> > with make bootstrap and make -k check where it fixes all of the FAILs in
> > libatomic.  Ok for mainline?
> >
> >
> > 2024-01-28  Roger Sayle  
> >  Victor Do Nascimento  
> >
> > libatomic/ChangeLog
> >  PR other/113336
> >  * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX
> >  * Makefile.in: Regenerate.
> >
> > Thanks in advance.
> > Roger
> > --
> >


RE: [PATCH] arm/aarch64: Add bti for all functions [PR106671]

2024-02-14 Thread Kyrylo Tkachov
Hi Feng,

> -Original Message-
> From: Gcc-patches <gcc-patches-bounces+kyrylo.tkachov=arm@gcc.gnu.org> On Behalf Of Feng Xue OS via Gcc-patches
> Sent: Wednesday, August 2, 2023 4:49 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH] arm/aarch64: Add bti for all functions [PR106671]
> 
> This patch extends the option -mbranch-protection=bti with an optional
> argument, bti[+all], to force the compiler to unconditionally insert a bti
> for all functions, because a direct function call might be rewritten at
> link time into an indirect call through a linker-generated thunk stub.
> One instance is when a direct callee is placed far from its caller: a
> direct BL {imm} instruction cannot represent the distance, so an indirect
> BLR {reg} must be used instead.  For this case, a bti is required at the
> beginning of the callee.
> 
>caller() {
>bl callee
>}
> 
> =>
> 
>caller() {
>adrp   reg, 
>addreg, reg, #constant
>blrreg
>}
> 
> Although the issue could be fixed with a fairly new version of ld, here we
> provide another means for users who have to rely on an old ld or another
> non-ld linker.  I also checked LLVM: by default, it implements bti just as
> the proposed -mbranch-protection=bti+all does.

Apologies for the delay; we had discussed this on and off internally over time.
I don't think adding extra complexity in the compiler going forward for the 
sake of older linkers is a good tradeoff.
So I'd like to avoid this.
Thanks,
Kyrill

> 
> Feng
> 
> ---
>  gcc/config/aarch64/aarch64.cc| 12 +++-
>  gcc/config/aarch64/aarch64.opt   |  2 +-
>  gcc/config/arm/aarch-bti-insert.cc   |  3 ++-
>  gcc/config/arm/aarch-common.cc   | 22 ++
>  gcc/config/arm/aarch-common.h| 18 ++
>  gcc/config/arm/arm.cc|  4 ++--
>  gcc/config/arm/arm.opt   |  2 +-
>  gcc/doc/invoke.texi  | 16 ++--
>  gcc/testsuite/gcc.target/aarch64/bti-5.c | 17 +
>  9 files changed, 76 insertions(+), 20 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/bti-5.c
> 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 71215ef9fee..a404447c8d0 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -8997,7 +8997,8 @@ void aarch_bti_arch_check (void)
>  bool
>  aarch_bti_enabled (void)
>  {
> -  return (aarch_enable_bti == 1);
> +  gcc_checking_assert (aarch_enable_bti != AARCH_BTI_FUNCTION_UNSET);
> +  return (aarch_enable_bti != AARCH_BTI_FUNCTION_NONE);
>  }
> 
>  /* Check if INSN is a BTI J insn.  */
> @@ -18454,12 +18455,12 @@ aarch64_override_options (void)
> 
>selected_tune = tune ? tune->ident : cpu->ident;
> 
> -  if (aarch_enable_bti == 2)
> +  if (aarch_enable_bti == AARCH_BTI_FUNCTION_UNSET)
>  {
>  #ifdef TARGET_ENABLE_BTI
> -  aarch_enable_bti = 1;
> +  aarch_enable_bti = AARCH_BTI_FUNCTION;
>  #else
> -  aarch_enable_bti = 0;
> +  aarch_enable_bti = AARCH_BTI_FUNCTION_NONE;
>  #endif
>  }
> 
> @@ -22881,7 +22882,8 @@ aarch64_print_patchable_function_entry (FILE
> *file,
>basic_block bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
> 
>if (!aarch_bti_enabled ()
> -  || cgraph_node::get (cfun->decl)->only_called_directly_p ())
> +  || (aarch_enable_bti != AARCH_BTI_FUNCTION_ALL
> +   && cgraph_node::get (cfun->decl)->only_called_directly_p ()))
>  {
>/* Emit the patchable_area at the beginning of the function.  */
>rtx_insn *insn = emit_insn_before (pa, BB_HEAD (bb));
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 025e52d40e5..5571f7e916d 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -37,7 +37,7 @@ TargetVariable
>  aarch64_feature_flags aarch64_isa_flags = 0
> 
>  TargetVariable
> -unsigned aarch_enable_bti = 2
> +enum aarch_bti_function_type aarch_enable_bti =
> AARCH_BTI_FUNCTION_UNSET
> 
>  TargetVariable
>  enum aarch_key_type aarch_ra_sign_key = AARCH_KEY_A
> diff --git a/gcc/config/arm/aarch-bti-insert.cc b/gcc/config/arm/aarch-bti-
> insert.cc
> index 71a77e29406..babd2490c9f 100644
> --- a/gcc/config/arm/aarch-bti-insert.cc
> +++ b/gcc/config/arm/aarch-bti-insert.cc
> @@ -164,7 +164,8 @@ rest_of_insert_bti (void)
>   functions that are already protected by Return Address Signing (PACIASP/
>   PACIBSP).  For all other cases insert a BTI C at the beginning of the
>   function.  */
> -  if (!cgraph_node::get (cfun->decl)->only_called_directly_p ())
> +  if (aarch_enable_bti == AARCH_BTI_FUNCTION_ALL
> +  || !cgraph_node::get (cfun->decl)->only_called_directly_p ())
>  {
>bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
>insn = BB_HEAD (bb);
> diff --git a/gcc/config/arm/aarch-common.cc b/gcc/config/arm/aarch-
> 

RE: [PATCH] AArch64: Add -mcpu=cobalt-100

2024-01-25 Thread Kyrylo Tkachov



> -Original Message-
> From: Wilco Dijkstra 
> Sent: Thursday, January 25, 2024 5:00 PM
> To: Kyrylo Tkachov ; GCC Patches  patc...@gcc.gnu.org>
> Cc: Richard Earnshaw ; Richard Sandiford
> 
> Subject: Re: [PATCH] AArch64: Add -mcpu=cobalt-100
> 
> Hi,
> 
> >> Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer
> >> ID).
> >>
> >> Passes regress, OK for commit?
> >
> > Ok.
> 
> Also OK to backport to GCC 13, 12 and 11?

On the 11 branch at least there is no support for the armv9-a flags, so the 
aarch64-cores.def entry would need to use what the branch-local neoverse-n2 
entry uses (armv8.5-a).  The trunk patch therefore won't apply as is.
Please ensure the appropriate flags are used in the aarch64-cores.def entry 
(with the usual testing).
But otherwise it's okay.
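Concretely, I mean an entry modeled on the branch-local neoverse-n2 line;
this is only a sketch, with the feature flags left as a placeholder to be
copied from that entry (the implementer ID 0x6d and part number 0xd49 come
from the trunk patch):

AARCH64_CORE("cobalt-100", cobalt100, cortexa57, 8_5A,
	     /* feature flags as in the branch's neoverse-n2 entry */,
	     neoversen2, 0x6d, 0xd49, -1)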
Thanks,
Kyrill

> 
> Cheers,
> Wilco


RE: [PATCH] aarch64: Re-enable ldp/stp fusion pass

2024-01-24 Thread Kyrylo Tkachov
Hi Alex,

> -Original Message-
> From: Alex Coplan 
> Sent: Wednesday, January 24, 2024 8:34 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Earnshaw ; Richard Sandiford
> ; Kyrylo Tkachov ;
> Jakub Jelinek 
> Subject: [PATCH] aarch64: Re-enable ldp/stp fusion pass
> 
> Hi,
> 
> Since, to the best of my knowledge, all reported regressions related to
> the ldp/stp fusion pass have now been fixed, and PGO+LTO bootstrap with
> --enable-languages=all is working again with the passes enabled, this
> patch turns the passes back on by default, as agreed with Jakub here:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642478.html
> 
> Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
> 

If we were super-pedantic about the GCC rules we could say that this is a 
revert of 8ed77a2356c3562f96c64f968e7529065c128c6a and therefore:
"Similarly, no outside approval is needed to revert a patch that you checked 
in." 
But that would go against the spirit of the rule.
Anyway, this is ok. Thanks for working through the regressions so diligently.
Kyrill

> Thanks,
> Alex
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64.opt (-mearly-ldp-fusion): Set default
>   to 1.
>   (-mlate-ldp-fusion): Likewise.


RE: [PATCH v2 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-01-17 Thread Kyrylo Tkachov
Hi Andre,

> -Original Message-
> From: Andre Vieira 
> Sent: Friday, January 5, 2024 5:52 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Earnshaw ; Stam Markianos-Wright
> 
> Subject: [PATCH v2 2/2] arm: Add support for MVE Tail-Predicated Low Overhead
> Loops
> 
> Respin after comments on first version.

I think I'm nitpicking some code style and implementation points rather than 
diving deep into the algorithms; I think those were okay when I last looked at 
this some time ago.

+/* Return true if INSN is a MVE instruction that is VPT-predicable, but in
+   its unpredicated form, or if it is predicated, but on a predicate other
+   than VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate (rtx_insn *insn,
+ rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_UNPREDICATED_INSN_P (insn)
+  || (MVE_VPT_PREDICATED_INSN_P (insn)
+ && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+ && !rtx_equal_p (vpr_reg, insn_vpr_reg_operand)))
+return true;
+  else
+return false;
+}
+
+/* Return true if INSN is a MVE instruction that is VPT-predicable and is
+   predicated on VPR_REG.  */
+
+static bool
+arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
+   rtx vpr_reg)
+{
+  rtx insn_vpr_reg_operand;
+  if (MVE_VPT_PREDICATED_INSN_P (insn)
+  && (insn_vpr_reg_operand = arm_get_required_vpr_reg_param (insn))
+  && rtx_equal_p (vpr_reg, insn_vpr_reg_operand))
+return true;
+  else
+return false;
+}

These two functions seem to have an "if (condition) return true; else return 
false;" structure that we try to avoid. How about:
rtx_insn vpr_reg_operand = MVE_VPT_PREDICATED_INSN_P (insn)  ? 
arm_get_required_vpr_reg_param (insn) : NULL_RTX;
return vpr_reg_operand && rtx_equal_p (vpr_reg, insn_vpr_reg_operand);
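
Applied to the second helper, that would give roughly (untested sketch):

static bool
arm_mve_vec_insn_is_predicated_with_this_predicate (rtx_insn *insn,
						    rtx vpr_reg)
{
  /* NULL_RTX when INSN is not VPT-predicated, so the comparison below
     fails for unpredicated insns.  */
  rtx insn_vpr_reg_operand
    = (MVE_VPT_PREDICATED_INSN_P (insn)
       ? arm_get_required_vpr_reg_param (insn) : NULL_RTX);
  return insn_vpr_reg_operand
	 && rtx_equal_p (vpr_reg, insn_vpr_reg_operand);
}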


+static bool
+arm_is_mve_across_vector_insn (rtx_insn* insn)
+{
+  df_ref insn_defs = NULL;
+  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
+return false;
+
+  bool is_across_vector = false;
+  FOR_EACH_INSN_DEF (insn_defs, insn)
+if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
+   && !arm_get_required_vpr_reg_ret_val (insn))
+  is_across_vector = true;
+

You can just return true here immediately, no need to set is_across_vector

+  return is_across_vector;

... and you can return false here, avoiding the need for is_across_vector 
entirely
+}
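
That is, something along these lines (untested sketch):

static bool
arm_is_mve_across_vector_insn (rtx_insn *insn)
{
  if (!MVE_VPT_PREDICABLE_INSN_P (insn))
    return false;

  df_ref insn_defs = NULL;
  FOR_EACH_INSN_DEF (insn_defs, insn)
    /* A def in a non-vector mode that is not the VPR return value means
       this insn moves data across the vector.  */
    if (!VALID_MVE_MODE (GET_MODE (DF_REF_REG (insn_defs)))
	&& !arm_get_required_vpr_reg_ret_val (insn))
      return true;

  return false;
}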

+static bool
+arm_mve_check_reg_origin_is_num_elems (basic_block body, rtx reg, rtx vctp_step)
+{
+  /* Ok, we now know the loop starts from zero and increments by one.
+ Now just show that the max value of the counter came from an
+ appropriate ASHIFRT expr of the correct amount.  */
+  basic_block pre_loop_bb = body->prev_bb;
+  while (pre_loop_bb && BB_END (pre_loop_bb)
+&& !df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg)))
+pre_loop_bb = pre_loop_bb->prev_bb;
+
+  df_ref counter_max_last_def = df_bb_regno_only_def_find (pre_loop_bb, REGNO (reg));
+  if (!counter_max_last_def)
+return false;
+  rtx counter_max_last_set = single_set (DF_REF_INSN (counter_max_last_def));
+  if (!counter_max_last_set)
+return false;
+
+  /* If we encounter a simple SET from a REG, follow it through.  */
+  if (REG_P (SET_SRC (counter_max_last_set)))
+    return arm_mve_check_reg_origin_is_num_elems
+	(pre_loop_bb->next_bb, SET_SRC (counter_max_last_set), vctp_step);
+
+  /* If we encounter a SET from an IF_THEN_ELSE where one of the operands is a
+ constant and the other is a REG, follow through to that REG.  */
+  if (GET_CODE (SET_SRC (counter_max_last_set)) == IF_THEN_ELSE
+  && REG_P (XEXP (SET_SRC (counter_max_last_set), 1))
+  && CONST_INT_P (XEXP (SET_SRC (counter_max_last_set), 2)))
+    return arm_mve_check_reg_origin_is_num_elems
+	(pre_loop_bb->next_bb, XEXP (SET_SRC (counter_max_last_set), 1), vctp_step);
+
+  if (GET_CODE (SET_SRC (counter_max_last_set)) == ASHIFTRT
+      && CONST_INT_P (XEXP (SET_SRC (counter_max_last_set), 1))
+      && ((1 << INTVAL (XEXP (SET_SRC (counter_max_last_set), 1)))
+	  == abs (INTVAL (vctp_step))))

I'm a bit concerned here with using abs() for HOST_WIDE_INT values that are 
compared to other HOST_WIDE_INT values.
abs () will implicitly cast the argument and return an int. We should use the 
abs_hwi function defined in hwint.h. It may not cause problems in practice 
given the ranges involved, but better safe than sorry at this stage.
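
I.e. something like this (a sketch; using HOST_WIDE_INT_1 as well would keep
the shift itself in HOST_WIDE_INT rather than int):

  && ((HOST_WIDE_INT_1 << INTVAL (XEXP (SET_SRC (counter_max_last_set), 1)))
      == abs_hwi (INTVAL (vctp_step)))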

Looks decent to me otherwise, and an impressive piece of work, thanks.
I'd give Richard an opportunity to comment next week when he's back before 
committing though.
Thanks,
Kyrill


RE: [PATCH] aarch64: Fix aarch64_ldp_reg_operand predicate not to allow all subreg [PR113221]

2024-01-17 Thread Kyrylo Tkachov



> -Original Message-
> From: Andrew Pinski 
> Sent: Wednesday, January 17, 2024 3:29 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Alex Coplan ; Andrew Pinski
> 
> Subject: [PATCH] aarch64: Fix aarch64_ldp_reg_operand predicate not to allow
> all subreg [PR113221]
> 
> So the problem here is that aarch64_ldp_reg_operand will allow any subreg,
> even a subreg of a lo_sum.  When LRA tries to fix that up, everything
> breaks.  So the fix is to change the check to only allow regs and subregs
> of regs.
> 
> Note the tendency here would be to use register_operand, but that checks
> the mode of the register, while we need to allow mismatched modes for this
> predicate for now.
> 
> Built and tested for aarch64-linux-gnu with no regressions
> (Also tested with the LD/ST pair pass back on).

Ok with the comments from Alex addressed.
Thanks,
Kyrill

> 
>   PR target/113221
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/predicates.md (aarch64_ldp_reg_operand): For subreg,
>   only allow REG operands instead of allowing all.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.c-torture/compile/pr113221-1.c: New test.
> 
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/predicates.md |  8 +++-
>  gcc/testsuite/gcc.c-torture/compile/pr113221-1.c | 12 
>  2 files changed, 19 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr113221-1.c
> 
> diff --git a/gcc/config/aarch64/predicates.md
> b/gcc/config/aarch64/predicates.md
> index 8a204e48bb5..256268517d8 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -313,7 +313,13 @@ (define_predicate "pmode_plus_operator"
> 
>  (define_special_predicate "aarch64_ldp_reg_operand"
>(and
> -(match_code "reg,subreg")
> +(ior
> +  (match_code "reg")
> +  (and
> +   (match_code "subreg")
> +   (match_test "GET_CODE (SUBREG_REG (op)) == REG")
> +  )
> +)
>  (match_test "aarch64_ldpstp_operand_mode_p (GET_MODE (op))")
>  (ior
>(match_test "mode == VOIDmode")
> diff --git a/gcc/testsuite/gcc.c-torture/compile/pr113221-1.c
> b/gcc/testsuite/gcc.c-torture/compile/pr113221-1.c
> new file mode 100644
> index 000..152a510786e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/compile/pr113221-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-options "-fno-move-loop-invariants -funroll-all-loops" } */
> +/* PR target/113221 */
> +/* This used to ICE after the `load/store pair fusion pass` was added
> +   due to the predicate aarch64_ldp_reg_operand allowing too much. */
> +
> +
> +void bar();
> +void foo(int* b) {
> +  for (;;)
> +*b++ = (long)bar;
> +}
> +
> --
> 2.39.3



RE: [PATCH] AArch64: Add -mcpu=cobalt-100

2024-01-16 Thread Kyrylo Tkachov



> -Original Message-
> From: Wilco Dijkstra 
> Sent: Tuesday, January 16, 2024 5:23 PM
> To: GCC Patches 
> Cc: Kyrylo Tkachov ; Richard Earnshaw
> ; Richard Sandiford
> 
> Subject: [PATCH] AArch64: Add -mcpu=cobalt-100
> 
> 
> Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer
> ID).
> 
> Passes regress, OK for commit?

Ok.
Thanks,
Kyrill

> 
> gcc/ChangeLog:
> * config/aarch64/aarch64-cores.def (AARCH64_CORE): Add 'cobalt-100'
> CPU.
> * config/aarch64/aarch64-tune.md: Regenerated.
> * doc/invoke.texi (-mcpu): Add cobalt-100 core.
> 
> ---
> 
> diff --git a/gcc/config/aarch64/aarch64-cores.def
> b/gcc/config/aarch64/aarch64-cores.def
> index
> 054862f37bc8738e7193348d01f485a46a9a36e3..7ebefcf543b6f84b3df22ab8367
> 28111b56fa76f 100644
> --- a/gcc/config/aarch64/aarch64-cores.def
> +++ b/gcc/config/aarch64/aarch64-cores.def
> @@ -186,6 +186,7 @@ AARCH64_CORE("cortex-x3",  cortexx3, cortexa57, V9A,
> (SVE2_BITPERM, MEMTAG, I8M
>  AARCH64_CORE("cortex-x4",  cortexx4, cortexa57, V9_2A,  (SVE2_BITPERM,
> MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1)
> 
>  AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16,
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1)
> +AARCH64_CORE("cobalt-100",   cobalt100, cortexa57, V9A, (I8MM, BF16,
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1)
> 
>  AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16,
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
>  AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16,
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
> diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-
> tune.md
> index
> 98e6882d4324d81268e28810b305b87c63bba22d..abd3c9e0822eeb1652f4856cd
> e591ac175ac0a4a 100644
> --- a/gcc/config/aarch64/aarch64-tune.md
> +++ b/gcc/config/aarch64/aarch64-tune.md
> @@ -1,5 +1,5 @@
>  ;; -*- buffer-read-only: t -*-
>  ;; Generated automatically by gentune.sh from aarch64-cores.def
>  (define_attr "tune"
> -
>   "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,
> thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thun
> derxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,ph
> ecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortex
> a76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortex
> x1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,
> octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunde
> rx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72c
> ortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76c
> ortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cor
> texx2,cortexx3,cortexx4,neoversen2,neoversev2,demeter,generic,generic_armv8
> _a,generic_armv9_a"
> +
>   "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,
> thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thun
> derxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,ph
> ecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortex
> a76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortex
> x1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,
> octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunde
> rx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72c
> ortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76c
> ortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cor
> texx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,demeter,generic,gene
> ric_armv8_a,generic_armv9_a"
>   (const (symbol_ref "((enum attr_tune) aarch64_tune)")))
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index
> 216e2f594d1cbc139c7e0125d9579c6924d23443..a25362b8c157f67d68b19f94cc
> 2d64bd09505bdc 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -21163,7 +21163,7 @@ performance of the code.  Permissible values for this
> option are:
>  @samp{cortex-r82}, @samp{cortex-x1}, @samp{cortex-x1c}, @samp{cortex-x2},
>  @samp{cortex-x3}, @samp{cortex-x4}, @samp{cortex-a510}, @samp{cortex-
> a520},
>  @samp{cortex-a710}, @samp{cortex-a715}, @samp{cortex-a720},
> @samp{ampere1},
> -@samp{ampere1a}, @samp{ampere1b}, and @samp{native}.
> +@samp{ampere1a}, @samp{ampere1b}, @samp{cobalt-100} and
> @samp{native}.
> 
>  The values @samp{cortex-a57.cortex-a53}, @samp{cortex-a72.cortex-a53},
>  @samp{cortex-a73.cortex-a35}, @samp{cortex-a73.cortex-a53},
> 



RE: [PATCH][wwwdoc] gcc-14: Add arm cortex-m52 cpu support

2024-01-10 Thread Kyrylo Tkachov


> -Original Message-
> From: Chung-Ju Wu 
> Sent: Wednesday, January 10, 2024 7:07 AM
> To: Gerald Pfeifer ; gcc-patches  patc...@gcc.gnu.org>
> Cc: Kyrylo Tkachov ; Richard Earnshaw
> ; Sudakshina Das ;
> jason...@anshingtek.com.tw
> Subject: [PATCH][wwwdoc] gcc-14: Add arm cortex-m52 cpu support
> 
> Hi Gerald,
> 
> The Arm Cortex-M52 CPU has been added to the upstream:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642230.html
> 
> I would like to document this on the gcc-14 changes.html page.
> Attached is the patch for gcc-wwwdocs repository.
> 
> Is it OK?

I can approve these as port maintainer. The entry is okay.
Thanks,
Kyrill

> 
> Regards,
> jasonwucj


RE: [PATCH]Arm: Update early-break tests to accept thumb output too.

2024-01-09 Thread Kyrylo Tkachov


> -Original Message-
> From: Tamar Christina 
> Sent: Tuesday, January 9, 2024 12:02 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd ; Richard Earnshaw ;
> ni...@redhat.com; Kyrylo Tkachov 
> Subject: [PATCH]Arm: Update early-break tests to accept thumb output too.
> 
> Hi All,
> 
> The tests I recently added for early break fail in thumb mode
> because in thumb mode `cbz/cbnz` exist and so the cmp+branch
> is fused.  This updates the testcases to accept either output.
> 
> Tested on arm-none-linux-gnueabihf with -mthumb/-marm.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/arm/vect-early-break-cbranch.c: Accept thumb output.
> 
> --- inline copy of patch --
> diff --git a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
> b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
> index
> f57bbd8be428d75dcf35aa194b5892fe04124cf6..d5c6d56ec869b8fa868acb78d4c
> 3f40b2a241953 100644
> --- a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
> +++ b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
> @@ -16,8 +16,12 @@ int b[N] = {0};
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vmovr[0-9]+, s[0-9]+@ int
> +** (
>  **   cmp r[0-9]+, #0
>  **   bne \.L[0-9]+
> +** |
> +**   cbnzr[0-9]+, \.L.+
> +** )

If we want to be a bit fancy, I think the scan syntax allows adding a target 
selector, so you should be able to do
** | { target_thumb }
**   cbnz...
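I.e. each alternative block would become something like the following
("target_thumb" is a placeholder; double-check the actual selector name and
whether check-function-bodies accepts a selector on an alternative):

** (
**	cmp	r[0-9]+, #0
**	bne	\.L[0-9]+
** | { target_thumb }
**	cbnz	r[0-9]+, \.L.+
** )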

Ok for trunk with or without that change.
Thanks,
Kyrill

>  **   ...
>  */
>  void f1 ()
> @@ -37,8 +41,12 @@ void f1 ()
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vmovr[0-9]+, s[0-9]+@ int
> +** (
>  **   cmp r[0-9]+, #0
>  **   bne \.L[0-9]+
> +** |
> +**   cbnzr[0-9]+, \.L.+
> +** )
>  **   ...
>  */
>  void f2 ()
> @@ -58,8 +66,12 @@ void f2 ()
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vmovr[0-9]+, s[0-9]+@ int
> +** (
>  **   cmp r[0-9]+, #0
>  **   bne \.L[0-9]+
> +** |
> +**   cbnzr[0-9]+, \.L.+
> +** )
>  **   ...
>  */
>  void f3 ()
> @@ -80,8 +92,12 @@ void f3 ()
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vmovr[0-9]+, s[0-9]+@ int
> +** (
>  **   cmp r[0-9]+, #0
>  **   bne \.L[0-9]+
> +** |
> +**   cbnzr[0-9]+, \.L.+
> +** )
>  **   ...
>  */
>  void f4 ()
> @@ -101,8 +117,12 @@ void f4 ()
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vmovr[0-9]+, s[0-9]+@ int
> +** (
>  **   cmp r[0-9]+, #0
>  **   bne \.L[0-9]+
> +** |
> +**   cbnzr[0-9]+, \.L.+
> +** )
>  **   ...
>  */
>  void f5 ()
> @@ -122,8 +142,12 @@ void f5 ()
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
>  **   vmovr[0-9]+, s[0-9]+@ int
> +** (
>  **   cmp r[0-9]+, #0
>  **   bne \.L[0-9]+
> +** |
> +**   cbnzr[0-9]+, \.L.+
> +** )
>  **   ...
>  */
>  void f6 ()
> 
> 
> 
> 
> --


RE: [PATCH 2/2] arm: Add cortex-m52 doc

2024-01-08 Thread Kyrylo Tkachov


> -Original Message-
> From: Chung-Ju Wu 
> Sent: Monday, January 8, 2024 6:17 AM
> To: gcc-patches ; Kyrylo Tkachov
> ; Richard Earnshaw 
> Cc: jason...@anshingtek.com.tw
> Subject: [PATCH 2/2] arm: Add cortex-m52 doc
> 
> Hi,
> 
> This is the patch to add cortex-m52 in the Arm-related options
> sections of the gcc invoke.texi documentation.
> 
> Is it OK for trunk?

In the ChangeLog entry:
gcc/ChangeLog:

* doc/invoke.texi: Update docs.

Let's be more specific and specify something like
* doc/invoke.texi (Arm Options): Document Cortex-m52 options.

Ok with a better ChangeLog entry.
Thanks,
Kyrill


> 
> Regards,
> jasonwucj


RE: [PATCH 1/2] arm: Add cortex-m52 core

2024-01-08 Thread Kyrylo Tkachov
Hi jasonwucj,

> -Original Message-
> From: Chung-Ju Wu 
> Sent: Monday, January 8, 2024 6:16 AM
> To: gcc-patches ; Kyrylo Tkachov
> ; Richard Earnshaw 
> Cc: jason...@anshingtek.com.tw
> Subject: [PATCH 1/2] arm: Add cortex-m52 core
> 
> Hi,
> 
> Recently, Arm announced the Cortex-M52, delivering increased performance
> in DSP and ML along with a range of other features and benefits.
> For the completeness of Arm ecosystem, we hope that cortex-m52 support
> could be available in gcc-14.
> 
> Attached is the patch to support cortex-m52 cpu with MVE and PACBTI enabled in
> GCC.
> Bootstrapped and tested on arm-none-eabi.
> 
> Is it OK for trunk?

The patch looks good to me. It should be safe to include it in GCC 14 as it 
doesn’t add any new logic beyond a new entry in arm-cpus.in.
Do you have commit rights to push it?
Thanks,
Kyrill

> 
> Regards,
> jasonwucj


RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation

2024-01-04 Thread Kyrylo Tkachov
Hi Tamar,

> -Original Message-
> From: Tamar Christina 
> Sent: Thursday, January 4, 2024 11:06 AM
> To: Tamar Christina ; gcc-patches@gcc.gnu.org
> Cc: nd ; Ramana Radhakrishnan
> ; Richard Earnshaw
> ; ni...@redhat.com; Kyrylo Tkachov
> 
> Subject: RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation
> 
> Ping,
> 
> ---
> 
> Hi All,
> 
> This adds an implementation for conditional branch optab for AArch32.
> The previous version only allowed a zero operand, but it looks like cbranch
> expansion does not check the operands with the target, and so we have to
> implement all of them.
> 
> I therefore did not commit it.  This is a larger version. I've also dropped 
> the MVE
> version because the mid-end can rewrite the comparison into comparing two
> predicates without checking with the backend.  Since MVE only has 1 predicate
> register this would need to go through memory and two MRS calls.  It's 
> unlikely
> to be beneficial and so that's for GCC 15 when I can fix the middle-end.
> 
> The cases where AArch32 is skipped in the testsuite are all 
> missed-optimizations
> due to AArch32 missing some optabs.

Does the testsuite have vect_* checks that can be used instead of target arm*?
If so let's use those.
Otherwise it's okay as is.
Thanks,
Kyrill

> 
> For e.g.
> 
> void f1 ()
> {
>   for (int i = 0; i < N; i++)
> {
>   b[i] += a[i];
>   if (a[i] > 0)
>   break;
> }
> }
> 
> For 128-bit vectors we generate:
> 
> vcgt.s32q8, q9, #0
> vpmax.u32   d7, d16, d17
> vpmax.u32   d7, d7, d7
> vmovr3, s14 @ int
> cmp r3, #0
> 
> and of 64-bit vector we can omit one vpmax as we still need to compress to
> 32-bits.
> 
> Bootstrapped Regtested on arm-none-linux-gnueabihf and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   * config/arm/neon.md (cbranch<mode>4): New.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/vect/vect-early-break_2.c: Skip Arm.
>   * gcc.dg/vect/vect-early-break_7.c: Likewise.
>   * gcc.dg/vect/vect-early-break_75.c: Likewise.
>   * gcc.dg/vect/vect-early-break_77.c: Likewise.
>   * gcc.dg/vect/vect-early-break_82.c: Likewise.
>   * gcc.dg/vect/vect-early-break_88.c: Likewise.
>   * lib/target-supports.exp (add_options_for_vect_early_break,
>   check_effective_target_vect_early_break_hw,
>   check_effective_target_vect_early_break): Support AArch32.
>   * gcc.target/arm/vect-early-break-cbranch.c: New test.
> 
> --- inline version of patch ---
> 
> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index
> d213369ffc38fb88ad0357d848cc7da5af73bab7..ed659ab736862da416d1ff6241d
> 0d3e6c6b96ff1 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -408,6 +408,55 @@ (define_insn "vec_extract"
>[(set_attr "type" "neon_store1_one_lane,neon_to_gp")]
>  )
> 
> +;; Patterns comparing two vectors and conditionally jump.
> +;; Advanced SIMD lacks a vector != comparison, but this is a quite common
> +;; operation.  To not pay the penalty for inverting == we can map our any
> +;; comparisons to all i.e. any(~x) => all(x).
> +;;
> +;; However unlike the AArch64 version, we can't optimize this further as the
> +;; chain is too long for combine due to these being unspecs so it doesn't fold
> +;; the operation to something simpler.
> +(define_expand "cbranch4"
> +  [(set (pc) (if_then_else
> +   (match_operator 0 "expandable_comparison_operator"
> +[(match_operand:VDQI 1 "register_operand")
> + (match_operand:VDQI 2 "reg_or_zero_operand")])
> +   (label_ref (match_operand 3 "" ""))
> +   (pc)))]
> +  "TARGET_NEON"
> +{
> +  rtx mask = operands[1];
> +
> +  /* If comparing against a non-zero vector we have to do a comparison first
> + so we can have a != 0 comparison with the result.  */
> +  if (operands[2] != CONST0_RTX (<MODE>mode))
> +{
> +  mask = gen_reg_rtx (<MODE>mode);
> +  emit_insn (gen_xor<mode>3 (mask, operands[1], operands[2]));
> +}
> +
> +  /* For 128-bit vectors we need an additional reduction.  */
> +  if (known_eq (128, GET_MODE_BITSIZE (<MODE>mode)))
> +{
> +  /* Always reduce using a V4SI.  */
> +  mask = gen_reg_rtx (V2SImode);
> +  rtx low = gen_reg_rtx (V2SImode);
> +  rtx high = gen_reg_rtx (V2SImode);
> +  rtx op1 = lowpart_subreg (V4SImode, operands[1], <MODE>mode);
> +  emit_insn (gen_neon_vget_lowv4si (low, op1));
> +  emit_insn (gen_neon_v

RE: [PATCH] aarch64: Fix parens in aarch64_stp_reg_operand [PR113061]

2023-12-19 Thread Kyrylo Tkachov


> -Original Message-
> From: Alex Coplan 
> Sent: Monday, December 18, 2023 10:29 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Earnshaw ; Richard Sandiford
> ; Kyrylo Tkachov 
> Subject: [PATCH] aarch64: Fix parens in aarch64_stp_reg_operand [PR113061]
> 
> In r14-6603-gfcdd2757c76bf925115b8e1ba4318d6366dd6f09 I messed up the
> parentheses in aarch64_stp_reg_operand, the indentation shows the
> intended nesting of the conditions.
> 
> This patch fixes that.
> 
> This fixes PR113061 which shows IRA substituting (const_int 1) into a
> writeback stp pattern as a result (and LRA failing to reload the
> constant).
> 
> Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?

Ok.
Thanks,
Kyrill

> 
> Thanks,
> Alex
> 
> gcc/ChangeLog:
> 
>   PR target/113061
>   * config/aarch64/predicates.md (aarch64_stp_reg_operand): Fix
>   parentheses to match intent.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR target/113061
>   * gfortran.dg/PR113061.f90: New test.


RE: [PATCH] aarch64: Add an early RA for strided registers

2023-12-05 Thread Kyrylo Tkachov
Hi Richard,

> -Original Message-
> From: Richard Sandiford 
> Sent: Monday, November 20, 2023 12:16 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH] aarch64: Add an early RA for strided registers
> 
> [Yeah, I just missed the stage1 deadline, sorry.  But this is gated
>  behind several other things, so it seems a bit academic whether it
>  was posted yesterday or today.]
> 
> This pass adds a simple register allocator for FP & SIMD registers.
> Its main purpose is to make use of SME2's strided LD1, ST1 and LUTI2/4
> instructions, which require a very specific grouping structure,
> and so would be difficult to exploit with general allocation.
> 
> The allocator is very simple.  It gives up on anything that would
> require spilling, or that it might not handle well for other reasons.
> 
> The allocator needs to track liveness at the level of individual FPRs.
> Doing that fixes a lot of the PRs relating to redundant moves caused by
> structure loads and stores.  That particular problem is now fixed more
> generally by Lehua's RA patches.
> 
> However, the patch runs before scheduling, so it has a chance to bag
> a spill-free allocation of vector code before the scheduler moves
> things around.  It could therefore still be useful for non-SME code
> (e.g. for hand-scheduled ACLE code).
> 
> The pass is controlled by a tristate switch:
> 
> - -mearly-ra=all: run on all functions
> - -mearly-ra=strided: run on functions that have access to strided registers
> - -mearly-ra=none: don't run on any function
> 
> The patch makes -mearly-ra=strided the default at -O2 and above.

Can you please add a statement about that default to the invoke.texi 
documentation.
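E.g. a sentence along these lines (a sketch based on the behaviour you
describe above):

@option{-mearly-ra=strided} is the default at @option{-O2} and higher;
at other optimization levels the default is @option{-mearly-ra=none}.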

> However, I've tested it with -mearly-ra=all too.
> 
> As said previously, the pass is very naive.  There's much more that we
> could do, such as handling invariants better.  The main focus is on not
> committing to a bad allocation, rather than on handling as much as
> possible.
> 

At first glance it seemed a bit unconventional to me to add an 
aarch64-specific pass for such an RA task.
But I tend to agree that the specific allocation requirements for these 
instructions warrant some target-specific logic.
And we're quite happy to have target-specific passes that adjust all manners of 
instruction selection and scheduling decisions, and I don't see the strided 
register allocation decisions as qualitatively different.

> Tested on aarch64-linux-gnu.

I'm okay with that patch, but it looks like it has a small hunk to 
ira-costs.cc, so I guess you'll want an authority on that to okay it.

One further small comment inline below.

Thanks,
Kyrill

> 
> Richard
> 
> 
> gcc/
>   * ira-costs.cc (scan_one_insn): Restrict handling of parameter
>   loads to pseudo registers.
>   * config.gcc: Add aarch64-early-ra.o for AArch64 targets.
>   * config/aarch64/t-aarch64 (aarch64-early-ra.o): New rule.
>   * config/aarch64/aarch64-opts.h (aarch64_early_ra_scope): New enum.
>   * config/aarch64/aarch64.opt (mearly_ra): New option.
>   * doc/invoke.texi: Document it.
>   * common/config/aarch64/aarch64-common.cc
>   (aarch_option_optimization_table): Use -mearly-ra=strided by
>   default for -O2 and above.
>   * config/aarch64/aarch64-passes.def (pass_aarch64_early_ra): New
> pass.
>   * config/aarch64/aarch64-protos.h (aarch64_strided_registers_p)
>   (make_pass_aarch64_early_ra): Declare.
>   * config/aarch64/aarch64-sme.md
> (@aarch64_sme_lut):
>   Add a stride_type attribute.
>   (@aarch64_sme_lut_strided2): New pattern.
>   (@aarch64_sme_lut_strided4): Likewise.
>   * config/aarch64/aarch64-sve-builtins-base.cc (svld1_impl::expand)
>   (svldnt1_impl::expand, svst1_impl::expand, svstn1_impl::expand):
> Handle
>   new way of defining multi-register loads and stores.
>   * config/aarch64/aarch64-sve.md
> (@aarch64_ld1)
>   (@aarch64_ldnt1,
> @aarch64_st1)
>   (@aarch64_stnt1): Delete.
>   * config/aarch64/aarch64-sve2.md
> (@aarch64_)
>   (@aarch64__strided2): New patterns.
>   (@aarch64__strided4): Likewise.
>   (@aarch64_): Likewise.
>   (@aarch64__strided2): Likewise.
>   (@aarch64__strided4): Likewise.
>   * config/aarch64/aarch64.cc (aarch64_strided_registers_p): New
>   function.
>   * config/aarch64/aarch64.md (UNSPEC_LD1_SVE_COUNT): Delete.
>   (UNSPEC_ST1_SVE_COUNT, UNSPEC_LDNT1_SVE_COUNT): Likewise.
>   (UNSPEC_STNT1_SVE_COUNT): Likewise.
>   (stride_type): New attribute.
>   * config/aarch64/constraints.md (Uwd, Uwt): New constraints.
>   * config/aarch64/iterators.md (UNSPEC_LD1_COUNT,
> UNSPEC_LDNT1_COUNT)
>   (UNSPEC_ST1_COUNT, UNSPEC_STNT1_COUNT): New unspecs.
>   (optab): Handle them.
>   (LD1_COUNT, ST1_COUNT): New iterators.
>   * config/aarch64/aarch64-early-ra.cc: New file.
> 
> gcc/testsuite/
>   * gcc.target/aarch64/ldp_stp_16.c (cons4_4_float): Tighten expected
>   

RE: [PATCH v2 3/5] aarch64: Sync `aarch64-sys-regs.def' with Binutils.

2023-11-28 Thread Kyrylo Tkachov
Hi Victor,

> -Original Message-
> From: Victor Do Nascimento 
> Sent: Tuesday, November 28, 2023 3:56 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov ; Richard Sandiford
> ; Richard Earnshaw
> ; Victor Do Nascimento
> 
> Subject: [PATCH v2 3/5] aarch64: Sync `aarch64-sys-regs.def' with Binutils.
> 
> This patch updates `aarch64-sys-regs.def', bringing it into sync with
> the Binutils source.
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-sys-regs.def (par_el1): New.
>   (rcwmask_el1): Likewise.
>   (rcwsmask_el1): Likewise.
>   (ttbr0_el1): Likewise.
>   (ttbr0_el12): Likewise.
>   (ttbr0_el2): Likewise.
>   (ttbr1_el1): Likewise.
>   (ttbr1_el12): Likewise.
>   (ttbr1_el2): Likewise.
>   (vttbr_el2): Likewise.
>   (gcspr_el0): Likewise.
>   (gcspr_el1): Likewise.
>   (gcspr_el12): Likewise.
>   (gcspr_el2): Likewise.
>   (gcspr_el3): Likewise.
>   (gcscre0_el1): Likewise.
>   (gcscr_el1): Likewise.
>   (gcscr_el12): Likewise.
>   (gcscr_el2): Likewise.
>   (gcscr_el3): Likewise.

In the case where we copy a file from elsewhere or regenerate it, we can just 
use the short entry:
* config/aarch64/aarch64-sys-regs.def: Copy from Binutils.

Or something equivalent.
Ok with and adjusted ChangeLog.
Thanks,
Kyrill

> ---
>  gcc/config/aarch64/aarch64-sys-regs.def | 30 +
>  1 file changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64-sys-regs.def
> b/gcc/config/aarch64/aarch64-sys-regs.def
> index d24a2455503..96bdadb0b0f 100644
> --- a/gcc/config/aarch64/aarch64-sys-regs.def
> +++ b/gcc/config/aarch64/aarch64-sys-regs.def
> @@ -419,6 +419,16 @@
>SYSREG ("fpcr",CPENC (3,3,4,4,0),  0,
>   AARCH64_NO_FEATURES)
>SYSREG ("fpexc32_el2", CPENC (3,4,5,3,0),  0,
>   AARCH64_NO_FEATURES)
>SYSREG ("fpsr",CPENC (3,3,4,4,1),  0,
>   AARCH64_NO_FEATURES)
> +  SYSREG ("gcspr_el0",   CPENC (3,3,2,5,1),  F_ARCHEXT,
>   AARCH64_FEATURE (GCS))
> +  SYSREG ("gcspr_el1",   CPENC (3,0,2,5,1),  F_ARCHEXT,
>   AARCH64_FEATURE (GCS))
> +  SYSREG ("gcspr_el2",   CPENC (3,4,2,5,1),  F_ARCHEXT,
>   AARCH64_FEATURE (GCS))
> +  SYSREG ("gcspr_el12",  CPENC (3,5,2,5,1),  F_ARCHEXT,
>   AARCH64_FEATURE (GCS))
> +  SYSREG ("gcspr_el3",   CPENC (3,6,2,5,1),  F_ARCHEXT,
>   AARCH64_FEATURE (GCS))
> +  SYSREG ("gcscre0_el1", CPENC (3,0,2,5,2),  F_ARCHEXT,
>   AARCH64_FEATURE (GCS))
> +  SYSREG ("gcscr_el1",   CPENC (3,0,2,5,0),  F_ARCHEXT,
>   AARCH64_FEATURE (GCS))
> +  SYSREG ("gcscr_el2",   CPENC (3,4,2,5,0),  F_ARCHEXT,
>   AARCH64_FEATURE (GCS))
> +  SYSREG ("gcscr_el12",  CPENC (3,5,2,5,0),  F_ARCHEXT,
>   AARCH64_FEATURE (GCS))
> +  SYSREG ("gcscr_el3",   CPENC (3,6,2,5,0),  F_ARCHEXT,
>   AARCH64_FEATURE (GCS))
>SYSREG ("gcr_el1", CPENC (3,0,1,0,6),  F_ARCHEXT,
>   AARCH64_FEATURE (MEMTAG))
>SYSREG ("gmid_el1",CPENC (3,1,0,0,4),
>   F_REG_READ|F_ARCHEXT,   AARCH64_FEATURE (MEMTAG))
>SYSREG ("gpccr_el3",   CPENC (3,6,2,1,6),  0,
>   AARCH64_NO_FEATURES)
> @@ -584,7 +594,7 @@
>SYSREG ("oslar_el1",   CPENC (2,0,1,0,4),  F_REG_WRITE,
>   AARCH64_NO_FEATURES)
>SYSREG ("oslsr_el1",   CPENC (2,0,1,1,4),  F_REG_READ,
>   AARCH64_NO_FEATURES)
>SYSREG ("pan", CPENC (3,0,4,2,3),  F_ARCHEXT,
>   AARCH64_FEATURE (PAN))
> -  SYSREG ("par_el1", CPENC (3,0,7,4,0),  0,
>   AARCH64_NO_FEATURES)
> +  SYSREG ("par_el1", CPENC (3,0,7,4,0),  F_REG_128,
>   AARCH64_NO_FEATURES)
>SYSREG ("pmbidr_el1",  CPENC (3,0,9,10,7),
>   F_REG_READ|F_ARCHEXT,   AARCH64_FEATURE (PROFILE))
>SYSREG ("pmblimitr_el1",   CPENC (3,0,9,10,0), F_ARCHEXT,
>   AARCH64_FEATURE (PROFILE))
>SYSREG ("pmbptr_el1",  CPENC (3,0,9,10,1), F_ARCHEXT,
>   AARCH64_FEATURE (PROFILE))
> @@ -746,6 +756,8 @@
>SYSREG ("prlar_el2",   CPENC (3,4,6,8,1),  F_ARCHEXT,
>   AARCH64_FEATURE (V8R))
>SYSREG ("prselr_el1",  CPENC (3,0,6,2,1),  F_ARCHEXT,
>

RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation

2023-11-27 Thread Kyrylo Tkachov
Hi Tamar,

> -Original Message-
> From: Tamar Christina 
> Sent: Monday, November 6, 2023 7:43 AM
> To: gcc-patches@gcc.gnu.org
> Cc: nd ; Ramana Radhakrishnan
> ; Richard Earnshaw
> ; ni...@redhat.com; Kyrylo Tkachov
> 
> Subject: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation
> 
> Hi All,
> 
> This adds an implementation for conditional branch optab for AArch32.
> 
> For e.g.
> 
> void f1 ()
> {
>   for (int i = 0; i < N; i++)
> {
>   b[i] += a[i];
>   if (a[i] > 0)
>   break;
> }
> }
> 
> For 128-bit vectors we generate:
> 
> vcgt.s32q8, q9, #0
> vpmax.u32   d7, d16, d17
> vpmax.u32   d7, d7, d7
> vmovr3, s14 @ int
> cmp r3, #0
> 
> and of 64-bit vector we can omit one vpmax as we still need to compress to
> 32-bits.
> 
> Bootstrapped Regtested on arm-none-linux-gnueabihf and no issues.
> 
> Ok for master?
> 

This is okay once the prerequisites go in.
Thanks,
Kyrill

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   * config/arm/neon.md (cbranch<mode>4): New.
> 
> gcc/testsuite/ChangeLog:
> 
>   * lib/target-supports.exp (vect_early_break): Add AArch32.
>   * gcc.target/arm/vect-early-break-cbranch.c: New test.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index
> d213369ffc38fb88ad0357d848cc7da5af73bab7..130efbc37cfe3128533599dfadc
> 344d2243dcb63 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -408,6 +408,45 @@ (define_insn "vec_extract"
>[(set_attr "type" "neon_store1_one_lane,neon_to_gp")]
>  )
> 
> +;; Patterns comparing two vectors and conditionally jump.
> +;; Advanced SIMD lacks a vector != comparison, but this is a quite common
> +;; operation.  To not pay the penalty for inverting == we can map our any
> +;; comparisons to all i.e. any(~x) => all(x).
> +;;
> +;; However unlike the AArch64 version, we can't optimize this further as the
> +;; chain is too long for combine due to these being unspecs so it doesn't fold
> +;; the operation to something simpler.
> +(define_expand "cbranch4"
> +  [(set (pc) (if_then_else
> +   (match_operator 0 "expandable_comparison_operator"
> +[(match_operand:VDQI 1 "register_operand")
> + (match_operand:VDQI 2 "zero_operand")])
> +   (label_ref (match_operand 3 "" ""))
> +   (pc)))]
> +  "TARGET_NEON"
> +{
> +  rtx mask = operands[1];
> +
> +  /* For 128-bit vectors we need an additional reduction.  */
> +  if (known_eq (128, GET_MODE_BITSIZE (<MODE>mode)))
> +{
> +  /* Always reduce using a V4SI.  */
> +  mask = gen_reg_rtx (V2SImode);
> +  rtx low = gen_reg_rtx (V2SImode);
> +  rtx high = gen_reg_rtx (V2SImode);
> +  emit_insn (gen_neon_vget_lowv4si (low, operands[1]));
> +  emit_insn (gen_neon_vget_highv4si (high, operands[1]));
> +  emit_insn (gen_neon_vpumaxv2si (mask, low, high));
> +}
> +
> +  emit_insn (gen_neon_vpumaxv2si (mask, mask, mask));
> +
> +  rtx val = gen_reg_rtx (SImode);
> +  emit_move_insn (val, gen_lowpart (SImode, mask));
> +  emit_jump_insn (gen_cbranch_cc (operands[0], val, const0_rtx, 
> operands[3]));
> +  DONE;
> +})
> +
>  ;; This pattern is renamed from "vec_extract" to
>  ;; "neon_vec_extract" and this pattern is called
>  ;; by define_expand in vec-common.md file.
> diff --git a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
> b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
> new file mode 100644
> index
> ..2c05aa10d26ed4ac9785672e
> 6e3b4355cef046dc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
> @@ -0,0 +1,136 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-require-effective-target arm32 } */
> +/* { dg-options "-O3 -march=armv8-a+simd -mfpu=auto -mfloat-abi=hard" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#define N 640
> +int a[N] = {0};
> +int b[N] = {0};
> +
> +/* f1:
> +**   ...
> +**   vcgt.s32q[0-9]+, q[0-9]+, #0
> +**   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> +**   vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> +**   vmovr[0-9]+, s[0-9]+@ int
> +**   cmp r[0-9]+, #0
> +**   bne \.L[0-9]+
> +**   ...
> +*/
> +void f1 ()
> +{
> +  for (int i = 0; i < N; i++)
> +{
> +  b[

RE: [PATCH 21/21]Arm: Add MVE cbranch implementation

2023-11-27 Thread Kyrylo Tkachov
Hi Tamar,

> -Original Message-
> From: Tamar Christina 
> Sent: Monday, November 6, 2023 7:43 AM
> To: gcc-patches@gcc.gnu.org
> Cc: nd ; Ramana Radhakrishnan
> ; Richard Earnshaw
> ; ni...@redhat.com; Kyrylo Tkachov
> 
> Subject: [PATCH 21/21]Arm: Add MVE cbranch implementation
> 
> Hi All,
> 
> This adds an implementation for conditional branch optab for MVE.
> 
> Unfortunately MVE has rather limited operations on VPT.P0; we are missing
> the ability to do P0 comparisons and logical OR on P0.
> 
> For that reason we can only support cbranch with 0: when comparing against
> a zero predicate we don't need to do an actual comparison, we only have to
> check that any bit is set within P0.
> 
> Because we can only do P0 comparisons with 0, the costing of the comparison
> was reduced so that the compiler doesn't try to push 0 into a register,
> thinking it's too expensive.  For the cbranch implementation to be safe we
> must see the constant 0 vector.
> 
> The lack of logical OR on P0 is something we can't really work around.
> This means MVE can't support cases where the sizes of the operands in the
> comparison don't match, i.e. when one operand has been unpacked.
> 
> For e.g.
> 
> void f1 ()
> {
>   for (int i = 0; i < N; i++)
> {
>   b[i] += a[i];
>   if (a[i] > 0)
>   break;
> }
> }
> 
> For 128-bit vectors we generate:
> 
> vcmp.s32gt, q3, q1
> vmrsr3, p0  @ movhi
> cbnzr3, .L2
> 
> MVE does not have 64-bit vector comparisons, so those are also not
> supported.
> 
> Bootstrapped arm-none-linux-gnueabihf and regtested with
> -march=armv8.1-m.main+mve -mfpu=auto and no issues.
> 
> Ok for master?
> 

This is okay once the rest goes in.
Thanks,
Kyrill

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   * config/arm/arm.cc (arm_rtx_costs_internal): Update costs for pred 0
>   compares.
>   * config/arm/mve.md (cbranch<mode>4): New.
> 
> gcc/testsuite/ChangeLog:
> 
>   * lib/target-supports.exp (vect_early_break): Add MVE.
>   * gcc.target/arm/mve/vect-early-break-cbranch.c: New test.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
> index
> 38f0839de1c75547c259ac3d655fcfc14e7208a2..15e65c15cb3cb6f70161787e84
> b255a24eb51e32 100644
> --- a/gcc/config/arm/arm.cc
> +++ b/gcc/config/arm/arm.cc
> @@ -11883,6 +11883,15 @@ arm_rtx_costs_internal (rtx x, enum rtx_code code,
> enum rtx_code outer_code,
>  || TARGET_HAVE_MVE)
> && simd_immediate_valid_for_move (x, mode, NULL, NULL))
>   *cost = COSTS_N_INSNS (1);
> +  else if (TARGET_HAVE_MVE
> +&& outer_code == COMPARE
> +&& VALID_MVE_PRED_MODE (mode))
> + /* MVE allows very limited instructions on VPT.P0,  however comparisons
> +to 0 do not require us to materialze this constant or require a
> +predicate comparison as we can go through SImode.  For that reason
> +allow P0 CMP 0 as a cheap operation such that the 0 isn't forced to
> +registers as we can't compare two predicates.  */
> + *cost = COSTS_N_INSNS (1);
>else
>   *cost = COSTS_N_INSNS (4);
>return true;
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index
> 74909ce47e132c22a94f7d9cd3a0921b38e33051..95d40770ecc25f9eb251eba38
> 306dd43cbebfb3f 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -6880,6 +6880,21 @@ (define_expand
> "vcond_mask_"
>DONE;
>  })
> 
> +(define_expand "cbranch4"
> +  [(set (pc) (if_then_else
> +   (match_operator 0 "expandable_comparison_operator"
> +[(match_operand:MVE_7 1 "register_operand")
> + (match_operand:MVE_7 2 "zero_operand")])
> +   (label_ref (match_operand 3 "" ""))
> +   (pc)))]
> +  "TARGET_HAVE_MVE"
> +{
> +  rtx val = gen_reg_rtx (SImode);
> +  emit_move_insn (val, gen_lowpart (SImode, operands[1]));
> +  emit_jump_insn (gen_cbranchsi4 (operands[0], val, const0_rtx, 
> operands[3]));
> +  DONE;
> +})
> +
>  ;; Reinterpret operand 1 in operand 0's mode, without changing its contents.
>  (define_expand "@arm_mve_reinterpret"
>[(set (match_operand:MVE_vecs 0 "register_operand")
> diff --git a/gcc/testsuite/gcc.target/arm/mve/vect-early-break-cbranch.c
> b/gcc/testsuite/gcc.target/arm/mve/vect-early-break-cbranch.c
> new file mode 100644
> index
> ..c3b8506dca0b2b044e6869a6
> c8259d663c1ff930

RE: [PATCH]AArch64: fix aarch64_usubw pattern

2023-11-22 Thread Kyrylo Tkachov
Hi Tamar,

> -Original Message-
> From: Tamar Christina 
> Sent: Wednesday, November 22, 2023 10:20 AM
> To: gcc-patches@gcc.gnu.org
> Cc: nd ; Richard Earnshaw ;
> Marcus Shawcroft ; Kyrylo Tkachov
> ; Richard Sandiford 
> Subject: [PATCH]AArch64: fix aarch64_usubw pattern
> 
> Hi All,
> 
> It looks like during my pre-commit test run I forgot to apply this patch
> to the patch stack.  It had a typo in the element size.
> 
> It also looks like, since the hi/lo operations take different element
> counts in the assembler syntax, I can't have a unified pattern.
> 
> This splits each pattern into two :(
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Sorry for the breakage,
> Ok for master?

If I was pedantic I'd argue for the testsuite changes going in independently as 
obvious, but the patch is okay as is anyway.
So ok for trunk.
Thanks,
Kyrill

> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-simd.md
>   (aarch64_uaddw__zip,
>aarch64_usubw__zip): Split into...
>   (aarch64_uaddw_lo_zip, aarch64_uaddw_hi_zip,
>   aarch64_usubw_lo_zip, aarch64_usubw_hi_zip): ...
> This.
> 
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/aarch64/uxtl-combine-4.c: Fix typo.
>   * gcc.target/aarch64/uxtl-combine-5.c: Likewise.
>   * gcc.target/aarch64/uxtl-combine-6.c: Likewise.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-
> simd.md
> index
> 75ee659871080ed28b9887990b7431682c283502..80e338bb8952140dd8be178c
> c8aed0c47b81c775 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4810,7 +4810,7 @@ (define_insn
> "aarch64_subw2_internal"
>[(set_attr "type" "neon_sub_widen")]
>  )
> 
> -(define_insn "aarch64_usubw__zip"
> +(define_insn "aarch64_usubw_lo_zip"
>[(set (match_operand: 0 "register_operand" "=w")
>   (minus:
> (match_operand: 1 "register_operand" "w")
> @@ -4818,23 +4818,51 @@ (define_insn
> "aarch64_usubw__zip"
>   (unspec: [
>   (match_operand:VQW 2 "register_operand" "w")
>   (match_operand:VQW 3 "aarch64_simd_imm_zero")
> -] PERM_EXTEND) 0)))]
> +] UNSPEC_ZIP1) 0)))]
>"TARGET_SIMD"
> -  "usubw\\t%0., %1.,
> %2."
> +  "usubw\\t%0., %1., %2."
>[(set_attr "type" "neon_sub_widen")]
>  )
> 
> -(define_insn "aarch64_uaddw__zip"
> +(define_insn "aarch64_uaddw_lo_zip"
>[(set (match_operand: 0 "register_operand" "=w")
>   (plus:
> (subreg:
>   (unspec: [
>   (match_operand:VQW 2 "register_operand" "w")
>   (match_operand:VQW 3 "aarch64_simd_imm_zero")
> -] PERM_EXTEND) 0)
> +] UNSPEC_ZIP1) 0)
> (match_operand: 1 "register_operand" "w")))]
>"TARGET_SIMD"
> -  "uaddw\\t%0., %1.,
> %2."
> +  "uaddw\\t%0., %1., %2."
> +  [(set_attr "type" "neon_add_widen")]
> +)
> +
> +(define_insn "aarch64_usubw_hi_zip"
> +  [(set (match_operand: 0 "register_operand" "=w")
> + (minus:
> +   (match_operand: 1 "register_operand" "w")
> +   (subreg:
> + (unspec: [
> + (match_operand:VQW 2 "register_operand" "w")
> + (match_operand:VQW 3 "aarch64_simd_imm_zero")
> +] UNSPEC_ZIP2) 0)))]
> +  "TARGET_SIMD"
> +  "usubw2\\t%0., %1., %2."
> +  [(set_attr "type" "neon_sub_widen")]
> +)
> +
> +(define_insn "aarch64_uaddw_hi_zip"
> +  [(set (match_operand: 0 "register_operand" "=w")
> + (plus:
> +   (subreg:
> + (unspec: [
> + (match_operand:VQW 2 "register_operand" "w")
> + (match_operand:VQW 3 "aarch64_simd_imm_zero")
> +] UNSPEC_ZIP2) 0)
> +   (match_operand: 1 "register_operand" "w")))]
> +  "TARGET_SIMD"
> +  "uaddw2\\t%0., %1., %2."
>[(set_attr "type" "neon_add_widen")]
>  )
> 
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index
> 2354315d7d249ccee46625d13b32678f1da1f087..a920de99ffca378

RE: [PATCH 6/6] arm: [MVE intrinsics] rework vld1q vst1q

2023-11-16 Thread Kyrylo Tkachov



> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, November 16, 2023 3:26 PM
> To: gcc-patches@gcc.gnu.org; Richard Sandiford
> ; Richard Earnshaw
> ; Kyrylo Tkachov 
> Cc: Christophe Lyon 
> Subject: [PATCH 6/6] arm: [MVE intrinsics] rework vld1q vst1q
> 
> Implement vld1q, vst1q using the new MVE builtins framework.

Ok. Nice to see more MVE intrinsics getting the good treatment.
Thanks,
Kyrill

> 
> 2023-11-16  Christophe Lyon  
> 
>   gcc/
>   * config/arm/arm-mve-builtins-base.cc (vld1_impl, vld1q)
>   (vst1_impl, vst1q): New.
>   * config/arm/arm-mve-builtins-base.def (vld1q, vst1q): New.
>   * config/arm/arm-mve-builtins-base.h (vld1q, vst1q): New.
>   * config/arm/arm_mve.h
>   (vld1q): Delete.
>   (vst1q): Delete.
>   (vld1q_s8): Delete.
>   (vld1q_s32): Delete.
>   (vld1q_s16): Delete.
>   (vld1q_u8): Delete.
>   (vld1q_u32): Delete.
>   (vld1q_u16): Delete.
>   (vld1q_f32): Delete.
>   (vld1q_f16): Delete.
>   (vst1q_f32): Delete.
>   (vst1q_f16): Delete.
>   (vst1q_s8): Delete.
>   (vst1q_s32): Delete.
>   (vst1q_s16): Delete.
>   (vst1q_u8): Delete.
>   (vst1q_u32): Delete.
>   (vst1q_u16): Delete.
>   (__arm_vld1q_s8): Delete.
>   (__arm_vld1q_s32): Delete.
>   (__arm_vld1q_s16): Delete.
>   (__arm_vld1q_u8): Delete.
>   (__arm_vld1q_u32): Delete.
>   (__arm_vld1q_u16): Delete.
>   (__arm_vst1q_s8): Delete.
>   (__arm_vst1q_s32): Delete.
>   (__arm_vst1q_s16): Delete.
>   (__arm_vst1q_u8): Delete.
>   (__arm_vst1q_u32): Delete.
>   (__arm_vst1q_u16): Delete.
>   (__arm_vld1q_f32): Delete.
>   (__arm_vld1q_f16): Delete.
>   (__arm_vst1q_f32): Delete.
>   (__arm_vst1q_f16): Delete.
>   (__arm_vld1q): Delete.
>   (__arm_vst1q): Delete.
>   * config/arm/mve.md (mve_vld1q_f): Rename into ...
>   (@mve_vld1q_f): ... this.
>   (mve_vld1q_): Rename into ...
>   (@mve_vld1q_) ... this.
>   (mve_vst1q_f): Rename into ...
>   (@mve_vst1q_f): ... this.
>   (mve_vst1q_): Rename into ...
>   (@mve_vst1q_) ... this.
> ---
>  gcc/config/arm/arm-mve-builtins-base.cc  |  58 +
>  gcc/config/arm/arm-mve-builtins-base.def |   4 +
>  gcc/config/arm/arm-mve-builtins-base.h   |   4 +-
>  gcc/config/arm/arm_mve.h | 282 ---
>  gcc/config/arm/mve.md|   8 +-
>  5 files changed, 69 insertions(+), 287 deletions(-)
> 
> diff --git a/gcc/config/arm/arm-mve-builtins-base.cc b/gcc/config/arm/arm-
> mve-builtins-base.cc
> index 5478cac8aeb..cfe1b954a29 100644
> --- a/gcc/config/arm/arm-mve-builtins-base.cc
> +++ b/gcc/config/arm/arm-mve-builtins-base.cc
> @@ -83,6 +83,62 @@ class vuninitializedq_impl : public
> quiet
>}
>  };
> 
> +class vld1_impl : public full_width_access
> +{
> +public:
> +  unsigned int
> +  call_properties (const function_instance &) const override
> +  {
> +return CP_READ_MEMORY;
> +  }
> +
> +  rtx
> +  expand (function_expander ) const override
> +  {
> +insn_code icode;
> +if (e.type_suffix (0).float_p)
> +  icode = code_for_mve_vld1q_f(e.vector_mode (0));
> +else
> +  {
> + if (e.type_suffix (0).unsigned_p)
> +   icode = code_for_mve_vld1q(VLD1Q_U,
> +  e.vector_mode (0));
> + else
> +   icode = code_for_mve_vld1q(VLD1Q_S,
> +  e.vector_mode (0));
> +  }
> +return e.use_contiguous_load_insn (icode);
> +  }
> +};
> +
> +class vst1_impl : public full_width_access
> +{
> +public:
> +  unsigned int
> +  call_properties (const function_instance &) const override
> +  {
> +return CP_WRITE_MEMORY;
> +  }
> +
> +  rtx
> +  expand (function_expander ) const override
> +  {
> +insn_code icode;
> +if (e.type_suffix (0).float_p)
> +  icode = code_for_mve_vst1q_f(e.vector_mode (0));
> +else
> +  {
> + if (e.type_suffix (0).unsigned_p)
> +   icode = code_for_mve_vst1q(VST1Q_U,
> +  e.vector_mode (0));
> + else
> +   icode = code_for_mve_vst1q(VST1Q_S,
> +  e.vector_mode (0));
> +  }
> +return e.use_contiguous_store_insn (icode);
> +  }
> +};
> +
>  } /* end anonymous namespace */
> 
>  namespace arm_mve {
> @@ -290,6 +346,7 @@ FUNCTION (vfmasq,
> unspec_mve_function_exact_insn, (-1, -1, -1, -1, -1, VFMASQ_N_
>  FUNCTION (vfmsq, unspec_mve_function_exact_insn, (-1, -1, V

RE: [PATCH 4/6] arm: [MVE intrinsics] add load and store shapes

2023-11-16 Thread Kyrylo Tkachov



> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, November 16, 2023 3:26 PM
> To: gcc-patches@gcc.gnu.org; Richard Sandiford
> ; Richard Earnshaw
> ; Kyrylo Tkachov 
> Cc: Christophe Lyon 
> Subject: [PATCH 4/6] arm: [MVE intrinsics] add load and store shapes
> 
> This patch adds the load and store shapes descriptions.
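
As a usage sketch (assuming an MVE-enabled compilation; the function names
follow the examples in the shape comments below), these two shapes are what
let the generic vld1q/vst1q overloads resolve purely from their arguments:

  #include <arm_mve.h>

  int8x16_t
  load_it (const int8_t *base)
  {
    return vld1q (base);    /* resolved to vld1q_s8 via the pointer type */
  }

  void
  store_it (int8_t *base, int8x16_t value)
  {
    vst1q (base, value);    /* resolved to vst1q_s8 via the vector type */
  }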

Ok.
Thanks,
Kyrill

> 
> 2023-11-16  Christophe Lyon  
> 
>   gcc/
>   * config/arm/arm-mve-builtins-shapes.cc (load, store): New.
>   * config/arm/arm-mve-builtins-shapes.h (load, store): New.
> ---
>  gcc/config/arm/arm-mve-builtins-shapes.cc | 67 +++
>  gcc/config/arm/arm-mve-builtins-shapes.h  |  2 +
>  2 files changed, 69 insertions(+)
> 
> diff --git a/gcc/config/arm/arm-mve-builtins-shapes.cc b/gcc/config/arm/arm-
> mve-builtins-shapes.cc
> index ce87ebcef30..fe983e7c736 100644
> --- a/gcc/config/arm/arm-mve-builtins-shapes.cc
> +++ b/gcc/config/arm/arm-mve-builtins-shapes.cc
> @@ -1428,6 +1428,38 @@ struct inherent_def : public nonoverloaded_base
>  };
>  SHAPE (inherent)
> 
> +/* sv_t svfoo[_t0](const _t *)
> +
> +   Example: vld1q.
> +   int8x16_t [__arm_]vld1q[_s8](int8_t const *base)
> +   int8x16_t [__arm_]vld1q_z[_s8](int8_t const *base, mve_pred16_t p)  */
> +struct load_def : public overloaded_base<0>
> +{
> +  void
> +  build (function_builder , const function_group_info ,
> +  bool preserve_user_namespace) const override
> +  {
> +b.add_overloaded_functions (group, MODE_none,
> preserve_user_namespace);
> +build_all (b, "t0,al", group, MODE_none, preserve_user_namespace);
> +  }
> +
> +  /* Resolve a call based purely on a pointer argument.  */
> +  tree
> +  resolve (function_resolver ) const override
> +  {
> +gcc_assert (r.mode_suffix_id == MODE_none);
> +
> +unsigned int i, nargs;
> +type_suffix_index type;
> +if (!r.check_gp_argument (1, i, nargs)
> + || (type = r.infer_pointer_type (i)) == NUM_TYPE_SUFFIXES)
> +  return error_mark_node;
> +
> +return r.resolve_to (r.mode_suffix_id, type);
> +  }
> +};
> +SHAPE (load)
> +
>  /* _t vfoo[_t0](_t)
> _t vfoo_n_t0(_t)
> 
> @@ -1477,6 +1509,41 @@ struct mvn_def : public overloaded_base<0>
>  };
>  SHAPE (mvn)
> 
> +/* void vfoo[_t0](_t *, v[xN]_t)
> +
> +   where  might be tied to  (for non-truncating stores) or might
> +   depend on the function base name (for truncating stores).
> +
> +   Example: vst1q.
> +   void [__arm_]vst1q[_s8](int8_t *base, int8x16_t value)
> +   void [__arm_]vst1q_p[_s8](int8_t *base, int8x16_t value, mve_pred16_t p)
> */
> +struct store_def : public overloaded_base<0>
> +{
> +  void
> +  build (function_builder , const function_group_info ,
> +  bool preserve_user_namespace) const override
> +  {
> +b.add_overloaded_functions (group, MODE_none,
> preserve_user_namespace);
> +build_all (b, "_,as,v0", group, MODE_none, preserve_user_namespace);
> +  }
> +
> +  tree
> +  resolve (function_resolver ) const override
> +  {
> +gcc_assert (r.mode_suffix_id == MODE_none);
> +
> +unsigned int i, nargs;
> +type_suffix_index type;
> +if (!r.check_gp_argument (2, i, nargs)
> + || !r.require_pointer_type (0)
> + || (type = r.infer_vector_type (1)) == NUM_TYPE_SUFFIXES)
> +  return error_mark_node;
> +
> +return r.resolve_to (r.mode_suffix_id, type);
> +  }
> +};
> +SHAPE (store)
> +
>  /* _t vfoo[_t0](_t, _t, _t)
> 
> i.e. the standard shape for ternary operations that operate on
> diff --git a/gcc/config/arm/arm-mve-builtins-shapes.h b/gcc/config/arm/arm-
> mve-builtins-shapes.h
> index a93245321c9..aa9309dec7e 100644
> --- a/gcc/config/arm/arm-mve-builtins-shapes.h
> +++ b/gcc/config/arm/arm-mve-builtins-shapes.h
> @@ -61,7 +61,9 @@ namespace arm_mve
>  extern const function_shape *const cmp;
>  extern const function_shape *const create;
>  extern const function_shape *const inherent;
> +extern const function_shape *const load;
>  extern const function_shape *const mvn;
> +extern const function_shape *const store;
>  extern const function_shape *const ternary;
>  extern const function_shape *const ternary_lshift;
>  extern const function_shape *const ternary_n;
> --
> 2.34.1



RE: [PATCH 3/6] arm: [MVE intrinsics] Add support for contiguous loads and stores

2023-11-16 Thread Kyrylo Tkachov



> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, November 16, 2023 3:26 PM
> To: gcc-patches@gcc.gnu.org; Richard Sandiford
> ; Richard Earnshaw
> ; Kyrylo Tkachov 
> Cc: Christophe Lyon 
> Subject: [PATCH 3/6] arm: [MVE intrinsics] Add support for contiguous loads
> and stores
> 
> This patch adds base support for load/store intrinsics to the
> framework, starting with loads and stores for contiguous memory
> elements, without extension nor truncation.
> 
> Compared to the aarch64/SVE implementation, there's no support for
> gather/scatter loads/stores yet.  This will be added later as needed.
> 

Ok.
Thanks,
Kyrill

> 2023-11-16  Christophe Lyon  
> 
>   gcc/
>   * config/arm/arm-mve-builtins-functions.h (multi_vector_function)
>   (full_width_access): New classes.
>   * config/arm/arm-mve-builtins.cc
>   (find_type_suffix_for_scalar_type, infer_pointer_type)
>   (require_pointer_type, get_contiguous_base, add_mem_operand)
>   (add_fixed_operand, use_contiguous_load_insn)
>   (use_contiguous_store_insn): New.
>   * config/arm/arm-mve-builtins.h (memory_vector_mode)
>   (infer_pointer_type, require_pointer_type, get_contiguous_base)
>   (add_mem_operand)
>   (add_fixed_operand, use_contiguous_load_insn)
>   (use_contiguous_store_insn): New.
> ---
>  gcc/config/arm/arm-mve-builtins-functions.h |  56 ++
>  gcc/config/arm/arm-mve-builtins.cc  | 116 
>  gcc/config/arm/arm-mve-builtins.h   |  28 -
>  3 files changed, 199 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/config/arm/arm-mve-builtins-functions.h
> b/gcc/config/arm/arm-mve-builtins-functions.h
> index eba1f071af0..6d234a2dd7c 100644
> --- a/gcc/config/arm/arm-mve-builtins-functions.h
> +++ b/gcc/config/arm/arm-mve-builtins-functions.h
> @@ -966,6 +966,62 @@ public:
>}
>  };
> 
> +/* A function_base that sometimes or always operates on tuples of
> +   vectors.  */
> +class multi_vector_function : public function_base
> +{
> +public:
> +  CONSTEXPR multi_vector_function (unsigned int vectors_per_tuple)
> +: m_vectors_per_tuple (vectors_per_tuple) {}
> +
> +  unsigned int
> +  vectors_per_tuple () const override
> +  {
> +return m_vectors_per_tuple;
> +  }
> +
> +  /* The number of vectors in a tuple, or 1 if the function only operates
> + on single vectors.  */
> +  unsigned int m_vectors_per_tuple;
> +};
> +
> +/* A function_base that loads or stores contiguous memory elements
> +   without extending or truncating them.  */
> +class full_width_access : public multi_vector_function
> +{
> +public:
> +  CONSTEXPR full_width_access (unsigned int vectors_per_tuple = 1)
> +: multi_vector_function (vectors_per_tuple) {}
> +
> +  tree
> +  memory_scalar_type (const function_instance ) const override
> +  {
> +return fi.scalar_type (0);
> +  }
> +
> +  machine_mode
> +  memory_vector_mode (const function_instance ) const override
> +  {
> +machine_mode mode = fi.vector_mode (0);
> +/* Vectors of floating-point are managed in memory as vectors of
> +   integers.  */
> +switch (mode)
> +  {
> +  case E_V4SFmode:
> + mode = E_V4SImode;
> + break;
> +  case E_V8HFmode:
> + mode = E_V8HImode;
> + break;
> +  }
> +
> +if (m_vectors_per_tuple != 1)
> +  mode = targetm.array_mode (mode, m_vectors_per_tuple).require ();
> +
> +return mode;
> +  }
> +};
> +
>  } /* end namespace arm_mve */
> 
>  /* Declare the global function base NAME, creating it from an instance
> diff --git a/gcc/config/arm/arm-mve-builtins.cc b/gcc/config/arm/arm-mve-
> builtins.cc
> index 02dc8fa9b73..a265cb05553 100644
> --- a/gcc/config/arm/arm-mve-builtins.cc
> +++ b/gcc/config/arm/arm-mve-builtins.cc
> @@ -36,6 +36,7 @@
>  #include "fold-const.h"
>  #include "gimple.h"
>  #include "gimple-iterator.h"
> +#include "explow.h"
>  #include "emit-rtl.h"
>  #include "langhooks.h"
>  #include "stringpool.h"
> @@ -529,6 +530,22 @@ matches_type_p (const_tree model_type, const_tree
> candidate)
> && TYPE_MAIN_VARIANT (model_type) == TYPE_MAIN_VARIANT
> (candidate));
>  }
> 
> +/* If TYPE is a valid MVE element type, return the corresponding type
> +   suffix, otherwise return NUM_TYPE_SUFFIXES.  */
> +static type_suffix_index
> +find_type_suffix_for_scalar_type (const_tree type)
> +{
> +  /* A linear search should be OK here, since the code isn't hot and
> + the number of types is only small.  */
> +  for (unsign

RE: [PATCH 2/6] arm: [MVE intrinsics] Add support for void and load/store pointers as argument types.

2023-11-16 Thread Kyrylo Tkachov



> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, November 16, 2023 3:26 PM
> To: gcc-patches@gcc.gnu.org; Richard Sandiford
> ; Richard Earnshaw
> ; Kyrylo Tkachov 
> Cc: Christophe Lyon 
> Subject: [PATCH 2/6] arm: [MVE intrinsics] Add support for void and
> load/store pointers as argument types.
> 
> This patch adds support for '_', 'al' and 'as' for void, load pointer
> and store pointer argument/return value types in intrinsic signatures.
> 
> It also adds a new memory_scalar_type() helper to function_instance,
> which is used by 'al' and 'as'.
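
To make the new signature characters concrete (a hypothetical rendering,
matching the build_all strings used by the load and store shapes in patch
4/6): for an s8 instance, the strings "t0,al" and "_,as,v0" would expand to

  int8x16_t vld1q_s8 (const int8_t *base);        /* t0 <- al */
  void vst1q_s8 (int8_t *base, int8x16_t value);  /* _  <- as, v0 */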

Ok.
Thanks,
Kyrill

> 
> 2023-11-16  Christophe Lyon  
> 
>   gcc/
>   * config/arm/arm-mve-builtins-shapes.cc (build_const_pointer):
>   New.
>   (parse_type): Add support for '_', 'al' and 'as'.
>   * config/arm/arm-mve-builtins.h (function_instance): Add
>   memory_scalar_type.
>   (function_base): Likewise.
> ---
>  gcc/config/arm/arm-mve-builtins-shapes.cc | 25 +++
>  gcc/config/arm/arm-mve-builtins.h | 17 +++
>  2 files changed, 42 insertions(+)
> 
> diff --git a/gcc/config/arm/arm-mve-builtins-shapes.cc b/gcc/config/arm/arm-
> mve-builtins-shapes.cc
> index 23eb9d0e69b..ce87ebcef30 100644
> --- a/gcc/config/arm/arm-mve-builtins-shapes.cc
> +++ b/gcc/config/arm/arm-mve-builtins-shapes.cc
> @@ -39,6 +39,13 @@
> 
>  namespace arm_mve {
> 
> +/* Return a representation of "const T *".  */
> +static tree
> +build_const_pointer (tree t)
> +{
> +  return build_pointer_type (build_qualified_type (t, TYPE_QUAL_CONST));
> +}
> +
>  /* If INSTANCE has a predicate, add it to the list of argument types
> in ARGUMENT_TYPES.  RETURN_TYPE is the type returned by the
> function.  */
> @@ -140,6 +147,9 @@ parse_element_type (const function_instance
> , const char *)
>  /* Read and return a type from FORMAT for function INSTANCE.  Advance
> FORMAT beyond the type string.  The format is:
> 
> +   _   - void
> +   al  - array pointer for loads
> +   as  - array pointer for stores
> p   - predicates with type mve_pred16_t
> s  - a scalar type with the given element suffix
> t  - a vector or tuple type with given element suffix [*1]
> @@ -156,6 +166,21 @@ parse_type (const function_instance ,
> const char *)
>  {
>int ch = *format++;
> 
> +
> +  if (ch == '_')
> +return void_type_node;
> +
> +  if (ch == 'a')
> +{
> +  ch = *format++;
> +  if (ch == 'l')
> + return build_const_pointer (instance.memory_scalar_type ());
> +  if (ch == 's') {
> + return build_pointer_type (instance.memory_scalar_type ());
> +  }
> +  gcc_unreachable ();
> +}
> +
>if (ch == 'p')
>  return get_mve_pred16_t ();
> 
> diff --git a/gcc/config/arm/arm-mve-builtins.h b/gcc/config/arm/arm-mve-
> builtins.h
> index 37b8223dfb2..4fd230fe4c7 100644
> --- a/gcc/config/arm/arm-mve-builtins.h
> +++ b/gcc/config/arm/arm-mve-builtins.h
> @@ -277,6 +277,7 @@ public:
>bool could_trap_p () const;
> 
>unsigned int vectors_per_tuple () const;
> +  tree memory_scalar_type () const;
> 
>const mode_suffix_info _suffix () const;
> 
> @@ -519,6 +520,14 @@ public:
>   of vectors in the tuples, otherwise return 1.  */
>virtual unsigned int vectors_per_tuple () const { return 1; }
> 
> +  /* If the function addresses memory, return the type of a single
> + scalar memory element.  */
> +  virtual tree
> +  memory_scalar_type (const function_instance &) const
> +  {
> +gcc_unreachable ();
> +  }
> +
>/* Try to fold the given gimple call.  Return the new gimple statement
>   on success, otherwise return null.  */
>virtual gimple *fold (gimple_folder &) const { return NULL; }
> @@ -644,6 +653,14 @@ function_instance::vectors_per_tuple () const
>return base->vectors_per_tuple ();
>  }
> 
> +/* If the function addresses memory, return the type of a single
> +   scalar memory element.  */
> +inline tree
> +function_instance::memory_scalar_type () const
> +{
> +  return base->memory_scalar_type (*this);
> +}
> +
>  /* Return information about the function's mode suffix.  */
>  inline const mode_suffix_info &
>  function_instance::mode_suffix () const
> --
> 2.34.1



RE: [PATCH 1/6] arm: Fix arm_simd_types and MVE scalar_types

2023-11-16 Thread Kyrylo Tkachov



> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, November 16, 2023 3:26 PM
> To: gcc-patches@gcc.gnu.org; Richard Sandiford
> ; Richard Earnshaw
> ; Kyrylo Tkachov 
> Cc: Christophe Lyon 
> Subject: [PATCH 1/6] arm: Fix arm_simd_types and MVE scalar_types
> 
> So far we define arm_simd_types and scalar_types using type
> definitions like intSI_type_node, etc...
> 
> This is causing problems with later patches which re-implement
> load/store MVE intrinsics, leading to error messages such as:
>   error: passing argument 1 of 'vst1q_s32' from incompatible pointer type
>   note: expected 'int *' but argument is of type 'int32_t *' {aka 'long int 
> *'}
> 
> This patch uses get_typenode_from_name (INT32_TYPE) instead, which
> defines the types as appropriate for the target/C library.
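
A minimal sketch of the user-visible breakage being fixed (assuming an MVE
target where the C library defines int32_t as 'long int'):

  #include <arm_mve.h>
  #include <stdint.h>

  void
  store (int32_t *base, int32x4_t value)
  {
    /* Before this fix, rejected with "expected 'int *' but argument is
       of type 'int32_t *'", since int32x4_t was built from
       intSI_type_node rather than the int32_t type the library uses.  */
    vst1q_s32 (base, value);
  }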

Ok.
Thanks,
Kyrill

> 
> 2023-11-16  Christophe Lyon  
> 
>   gcc/
>   * config/arm/arm-builtins.cc (arm_init_simd_builtin_types): Fix
>   initialization of arm_simd_types[].eltype.
>   * config/arm/arm-mve-builtins.def (DEF_MVE_TYPE): Fix scalar
>   types.
> ---
>  gcc/config/arm/arm-builtins.cc  | 28 ++--
>  gcc/config/arm/arm-mve-builtins.def | 16 
>  2 files changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/gcc/config/arm/arm-builtins.cc b/gcc/config/arm/arm-builtins.cc
> index fca7dcaf565..dd9c5815c45 100644
> --- a/gcc/config/arm/arm-builtins.cc
> +++ b/gcc/config/arm/arm-builtins.cc
> @@ -1580,20 +1580,20 @@ arm_init_simd_builtin_types (void)
>TYPE_STRING_FLAG (arm_simd_polyHI_type_node) = false;
>  }
>/* Init all the element types built by the front-end.  */
> -  arm_simd_types[Int8x8_t].eltype = intQI_type_node;
> -  arm_simd_types[Int8x16_t].eltype = intQI_type_node;
> -  arm_simd_types[Int16x4_t].eltype = intHI_type_node;
> -  arm_simd_types[Int16x8_t].eltype = intHI_type_node;
> -  arm_simd_types[Int32x2_t].eltype = intSI_type_node;
> -  arm_simd_types[Int32x4_t].eltype = intSI_type_node;
> -  arm_simd_types[Int64x2_t].eltype = intDI_type_node;
> -  arm_simd_types[Uint8x8_t].eltype = unsigned_intQI_type_node;
> -  arm_simd_types[Uint8x16_t].eltype = unsigned_intQI_type_node;
> -  arm_simd_types[Uint16x4_t].eltype = unsigned_intHI_type_node;
> -  arm_simd_types[Uint16x8_t].eltype = unsigned_intHI_type_node;
> -  arm_simd_types[Uint32x2_t].eltype = unsigned_intSI_type_node;
> -  arm_simd_types[Uint32x4_t].eltype = unsigned_intSI_type_node;
> -  arm_simd_types[Uint64x2_t].eltype = unsigned_intDI_type_node;
> +  arm_simd_types[Int8x8_t].eltype = get_typenode_from_name
> (INT8_TYPE);
> +  arm_simd_types[Int8x16_t].eltype = get_typenode_from_name
> (INT8_TYPE);
> +  arm_simd_types[Int16x4_t].eltype = get_typenode_from_name
> (INT16_TYPE);
> +  arm_simd_types[Int16x8_t].eltype = get_typenode_from_name
> (INT16_TYPE);
> +  arm_simd_types[Int32x2_t].eltype = get_typenode_from_name
> (INT32_TYPE);
> +  arm_simd_types[Int32x4_t].eltype = get_typenode_from_name
> (INT32_TYPE);
> +  arm_simd_types[Int64x2_t].eltype = get_typenode_from_name
> (INT64_TYPE);
> +  arm_simd_types[Uint8x8_t].eltype = get_typenode_from_name
> (UINT8_TYPE);
> +  arm_simd_types[Uint8x16_t].eltype = get_typenode_from_name
> (UINT8_TYPE);
> +  arm_simd_types[Uint16x4_t].eltype = get_typenode_from_name
> (UINT16_TYPE);
> +  arm_simd_types[Uint16x8_t].eltype = get_typenode_from_name
> (UINT16_TYPE);
> +  arm_simd_types[Uint32x2_t].eltype = get_typenode_from_name
> (UINT32_TYPE);
> +  arm_simd_types[Uint32x4_t].eltype = get_typenode_from_name
> (UINT32_TYPE);
> +  arm_simd_types[Uint64x2_t].eltype = get_typenode_from_name
> (UINT64_TYPE);
> 
>/* Note: poly64x2_t is defined in arm_neon.h, to ensure it gets default
>   mangling.  */
> diff --git a/gcc/config/arm/arm-mve-builtins.def b/gcc/config/arm/arm-mve-
> builtins.def
> index e2cf1baf370..a901d8231e9 100644
> --- a/gcc/config/arm/arm-mve-builtins.def
> +++ b/gcc/config/arm/arm-mve-builtins.def
> @@ -39,14 +39,14 @@ DEF_MVE_MODE (r, none, none, none)
> 
>  #define REQUIRES_FLOAT false
>  DEF_MVE_TYPE (mve_pred16_t, boolean_type_node)
> -DEF_MVE_TYPE (uint8x16_t, unsigned_intQI_type_node)
> -DEF_MVE_TYPE (uint16x8_t, unsigned_intHI_type_node)
> -DEF_MVE_TYPE (uint32x4_t, unsigned_intSI_type_node)
> -DEF_MVE_TYPE (uint64x2_t, unsigned_intDI_type_node)
> -DEF_MVE_TYPE (int8x16_t, intQI_type_node)
> -DEF_MVE_TYPE (int16x8_t, intHI_type_node)
> -DEF_MVE_TYPE (int32x4_t, intSI_type_node)
> -DEF_MVE_TYPE (int64x2_t, intDI_type_node)
> +DEF_MVE_TYPE (uint8x16_t, get_typenode_from_name (UINT8_TYPE))
> +DEF_MVE_TYPE (uint16x8_t, get_typenode_from_name (UINT16_TYPE))
> +DEF_

RE: [PATCH 5/6] arm: [MVE intrinsics] fix vst1 tests

2023-11-16 Thread Kyrylo Tkachov



> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, November 16, 2023 3:26 PM
> To: gcc-patches@gcc.gnu.org; Richard Sandiford
> ; Richard Earnshaw
> ; Kyrylo Tkachov 
> Cc: Christophe Lyon 
> Subject: [PATCH 5/6] arm: [MVE intrinsics] fix vst1 tests
> 
> vst1q intrinsics return void, so we should not do 'return vst1q_f16 (base,
> value);'
> 
> This was OK so far, but will trigger an error/warning with the new
> implementation of these intrinsics.
> 

Whoops!
Ok (could have gone in as obvious IMO).
Thanks,
Kyrill

> This patch just removes the 'return' keyword.
> 
> 2023-11-16  Christophe Lyon  
> 
>   gcc/testsuite/
>   * gcc.target/arm/mve/intrinsics/vst1q_f16.c: Remove 'return'.
>   * gcc.target/arm/mve/intrinsics/vst1q_f32.c: Likewise.
>   * gcc.target/arm/mve/intrinsics/vst1q_s16.c: Likewise.
>   * gcc.target/arm/mve/intrinsics/vst1q_s32.c: Likewise.
>   * gcc.target/arm/mve/intrinsics/vst1q_s8.c: Likewise.
>   * gcc.target/arm/mve/intrinsics/vst1q_u16.c: Likewise.
>   * gcc.target/arm/mve/intrinsics/vst1q_u32.c: Likewise.
>   * gcc.target/arm/mve/intrinsics/vst1q_u8.c: Likewise.
> ---
>  gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_f16.c | 4 ++--
>  gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_f32.c | 4 ++--
>  gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s16.c | 4 ++--
>  gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s32.c | 4 ++--
>  gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s8.c  | 4 ++--
>  gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_u16.c | 4 ++--
>  gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_u32.c | 4 ++--
>  gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_u8.c  | 4 ++--
>  8 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_f16.c
> b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_f16.c
> index 1fa02f00f53..e4b40604d54 100644
> --- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_f16.c
> +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_f16.c
> @@ -18,7 +18,7 @@ extern "C" {
>  void
>  foo (float16_t *base, float16x8_t value)
>  {
> -  return vst1q_f16 (base, value);
> +  vst1q_f16 (base, value);
>  }
> 
> 
> @@ -31,7 +31,7 @@ foo (float16_t *base, float16x8_t value)
>  void
>  foo1 (float16_t *base, float16x8_t value)
>  {
> -  return vst1q (base, value);
> +  vst1q (base, value);
>  }
> 
>  #ifdef __cplusplus
> diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_f32.c
> b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_f32.c
> index 67cc3ae3b47..8f42323c603 100644
> --- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_f32.c
> +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_f32.c
> @@ -18,7 +18,7 @@ extern "C" {
>  void
>  foo (float32_t *base, float32x4_t value)
>  {
> -  return vst1q_f32 (base, value);
> +  vst1q_f32 (base, value);
>  }
> 
> 
> @@ -31,7 +31,7 @@ foo (float32_t *base, float32x4_t value)
>  void
>  foo1 (float32_t *base, float32x4_t value)
>  {
> -  return vst1q (base, value);
> +  vst1q (base, value);
>  }
> 
>  #ifdef __cplusplus
> diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s16.c
> b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s16.c
> index 052959b2083..891ac4155d9 100644
> --- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s16.c
> +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s16.c
> @@ -18,7 +18,7 @@ extern "C" {
>  void
>  foo (int16_t *base, int16x8_t value)
>  {
> -  return vst1q_s16 (base, value);
> +  vst1q_s16 (base, value);
>  }
> 
> 
> @@ -31,7 +31,7 @@ foo (int16_t *base, int16x8_t value)
>  void
>  foo1 (int16_t *base, int16x8_t value)
>  {
> -  return vst1q (base, value);
> +  vst1q (base, value);
>  }
> 
>  #ifdef __cplusplus
> diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s32.c
> b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s32.c
> index 444ad07f4ef..a28d1eb98db 100644
> --- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s32.c
> +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s32.c
> @@ -18,7 +18,7 @@ extern "C" {
>  void
>  foo (int32_t *base, int32x4_t value)
>  {
> -  return vst1q_s32 (base, value);
> +  vst1q_s32 (base, value);
>  }
> 
> 
> @@ -31,7 +31,7 @@ foo (int32_t *base, int32x4_t value)
>  void
>  foo1 (int32_t *base, int32x4_t value)
>  {
> -  return vst1q (base, value);
> +  vst1q (base, value);
>  }
> 
>  #ifdef __cplusplus
> diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vst1q_s8.c
> b/gcc/testsuite/gcc.target/arm/mve/intrin

RE: [PATCH] aarch64: costs: update for TARGET_CSSC

2023-11-16 Thread Kyrylo Tkachov


> -Original Message-
> From: Richard Earnshaw 
> Sent: Thursday, November 16, 2023 8:53 AM
> To: Philipp Tomsich ; gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov 
> Subject: Re: [PATCH] aarch64: costs: update for TARGET_CSSC
> 
> 
> 
> On 16/11/2023 06:15, Philipp Tomsich wrote:
> > With the addition of CSSC (Common Short Sequence Compression)
> > instructions, a number of idioms match to single instructions (e.g.,
> > abs) that previously expanded to multi-instruction sequences.
> >
> > This recognizes (some of) those idioms that are now misclassified and
> > returns a cost of a single instruction.
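
For instance (a sketch of one such idiom), with +cssc an integer abs can be
a single ABS instruction, where the old cost of COSTS_N_INSNS (2) assumed a
two-instruction compare/negate expansion:

  int
  iabs (int x)
  {
    return __builtin_abs (x);   /* one ABS under TARGET_CSSC */
  }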
> >
> > gcc/ChangeLog:
> >
> > * config/aarch64/aarch64.cc (aarch64_rtx_costs): Support
> > idioms matching to CSSC instructions, if target CSSC is
> > present
> >
> > Signed-off-by: Philipp Tomsich 
> > ---
> >
> >   gcc/config/aarch64/aarch64.cc | 34 --
> >   1 file changed, 24 insertions(+), 10 deletions(-)
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 800a8b0e110..d89c94519e9 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -14431,10 +14431,17 @@ aarch64_rtx_costs (rtx x, machine_mode
> mode, int outer ATTRIBUTE_UNUSED,
> > return false;
> >
> >   case CTZ:
> > -  *cost = COSTS_N_INSNS (2);
> > +  if (!TARGET_CSSC)
> > +   {
> > + /* Will be split to a bit-reversal + clz */
> > + *cost = COSTS_N_INSNS (2);
> > +
> > + if (speed)
> > +   *cost += extra_cost->alu.clz + extra_cost->alu.rev;
> > +   }
> > +  else
> > +   *cost = COSTS_N_INSNS (1);
> 
> There should be some speed-related extra_cost to add here as well, so
> that target-specific costing can be taken into account.

And I'd rather have the conditions not be inverted, i.e.
if (TARGET_CSSC)
 ...
else
 ...

Thanks,
Kyrill
> 
> >
> > -  if (speed)
> > -   *cost += extra_cost->alu.clz + extra_cost->alu.rev;
> > return false;
> >
> >   case COMPARE:
> > @@ -15373,12 +15380,17 @@ cost_plus:
> > }
> > else
> > {
> > - /* Integer ABS will either be split to
> > -two arithmetic instructions, or will be an ABS
> > -(scalar), which we don't model.  */
> > - *cost = COSTS_N_INSNS (2);
> > - if (speed)
> > -   *cost += 2 * extra_cost->alu.arith;
> > + if (!TARGET_CSSC)
> > +   {
> > + /* Integer ABS will either be split to
> > +two arithmetic instructions, or will be an ABS
> > +(scalar), which we don't model.  */
> > + *cost = COSTS_N_INSNS (2);
> > + if (speed)
> > +   *cost += 2 * extra_cost->alu.arith;
> > +   }
> > + else
> > +   *cost = COSTS_N_INSNS (1);
> 
> same here.
> 
> > }
> > return false;
> >
> > @@ -15388,13 +15400,15 @@ cost_plus:
> > {
> >   if (VECTOR_MODE_P (mode))
> > *cost += extra_cost->vect.alu;
> > - else
> > + else if (GET_MODE_CLASS (mode) == MODE_FLOAT)
> > {
> >   /* FMAXNM/FMINNM/FMAX/FMIN.
> >  TODO: This may not be accurate for all implementations, but
> >  we do not model this in the cost tables.  */
> >   *cost += extra_cost->fp[mode == DFmode].addsub;
> > }
> > + else if (TARGET_CSSC)
> > +   *cost = COSTS_N_INSNS (1);
> 
> and here.
> 
> > }
> > return false;
> >
> 
> R.


RE: [PATCH] Add a REG_P check for inc and dec for Arm MVE

2023-11-14 Thread Kyrylo Tkachov
Hi Saurabh,

> -Original Message-
> From: Saurabh Jha 
> Sent: Thursday, November 9, 2023 10:12 AM
> To: gcc-patches@gcc.gnu.org; Richard Earnshaw
> ; Richard Sandiford
> 
> Subject: [PATCH] Add a REG_P check for inc and dec for Arm MVE
> 
> Hey,
> 
> This patch tightens mve_vector_mem_operand to reject non-register
> operands inside {PRE,POST}_{INC,DEC} addresses by introducing a REG_P
> check.
> 
> This patch fixes this ICE: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112337
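
The rough shape of such a check (a sketch only, with hypothetical local
names; the exact context inside mve_vector_mem_operand is in the patch
itself) is:

  rtx addr = XEXP (op, 0);
  enum rtx_code code = GET_CODE (addr);
  /* Only a plain register is a valid base for {PRE,POST}_{INC,DEC}.  */
  if ((code == POST_INC || code == PRE_INC
       || code == POST_DEC || code == PRE_DEC)
      && !REG_P (XEXP (addr, 0)))
    return false;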
> 
> Okay for trunk? I don't have trunk access so could someone please commit
> on my behalf?

Ok.

> 
> Regards,
> Saurabh
> 
> gcc/ChangeLog:
> 
>   PR target/112337
>   * config/arm/arm.cc (mve_vector_mem_operand): Add a REG_P
> check for INC
>   and DEC operations
> 
> gcc/testsuite/ChangeLog:
> 
>   PR target/112337
>   * gcc.target/arm/mve/pr112337.c: Test for REG_P check for INC and
> DEC
>   operations

ChangeLog entries should end with a full stop (the git commit hooks enforce it).
I've adjusted the ChangeLog and pushed this patch for you.
Thank you for the patch!
Kyrill



RE: [PATCH] AArch64: Cleanup memset expansion

2023-11-10 Thread Kyrylo Tkachov


> -Original Message-
> From: Richard Earnshaw 
> Sent: Friday, November 10, 2023 11:31 AM
> To: Wilco Dijkstra ; Kyrylo Tkachov
> ; GCC Patches 
> Cc: Richard Sandiford ; Richard Earnshaw
> 
> Subject: Re: [PATCH] AArch64: Cleanup memset expansion
> 
> 
> 
> On 10/11/2023 10:17, Wilco Dijkstra wrote:
> > Hi Kyrill,
> >
> >> +  /* Reduce the maximum size with -Os.  */
> >> +  if (optimize_function_for_size_p (cfun))
> >> +    max_set_size = 96;
> >> +
> >
> >>  This is a new "magic" number in this code. It looks sensible, but how
> did you arrive at it?
> >
> > We need 1 instruction to create the value to store (DUP or MOVI) and 1 STP
> > for every 32 bytes, so the 96 means 4 instructions for typical sizes
> > (sizes not
> > a multiple of 16 can add one extra instruction).
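
To make that arithmetic concrete, a sketch of the 96-byte case:

  void set96 (char *p) { __builtin_memset (p, 42, 96); }

  /* would expand to roughly:
       movi  v0.16b, 0x2a        // 1 insn to materialize the value
       stp   q0, q0, [x0]        // bytes 0-31
       stp   q0, q0, [x0, 32]    // bytes 32-63
       stp   q0, q0, [x0, 64]    // bytes 64-95
       ret  */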

It would be useful to have that reasoning in the comment.

> >
> > I checked codesize on SPECINT2017, and 96 had practically identical size.
> > Using 128 would also be a reasonable Os value with a very slight size
> > increase,
> > and 384 looks good for O2 - however I didn't want to tune these values
> > as this
> > is a cleanup patch.
> >
> > Cheers,
> > Wilco
> 
> Shouldn't this be a param then?  Also, manifest constants in the middle
> of code are a potential nightmare, please move it to a #define (even if
> that's then used as the default value for the param).

I agree on making this a #define but I wouldn't insist on a param.
Code size IMO has a much more consistent right or wrong answer as it's 
statically determinable.
If this were a speed-related param then I'd expect the flexibility for the power
user to override such heuristics would be more widely useful.
But for code size the compiler should always be able to get it right.

If Richard would still like the param then I'm fine with having the param, but 
I'd be okay with the comment above and making this a #define.
Thanks,
Kyrill


RE: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-11-10 Thread Kyrylo Tkachov



> -Original Message-
> From: Wilco Dijkstra 
> Sent: Friday, November 10, 2023 10:23 AM
> To: Kyrylo Tkachov ; GCC Patches  patc...@gcc.gnu.org>; Richard Sandiford 
> Subject: Re: [PATCH] libatomic: Improve ifunc selection on AArch64
> 
> Hi Kyrill,
> 
> > +  if (!(hwcap & HWCAP_CPUID))
> > +return false;
> > +
> > +  unsigned long midr;
> > +  asm volatile ("mrs %0, midr_el1" : "=r" (midr));
> 
> > From what I recall, the midr_el1 register is emulated by the kernel, and
> > so userspace software has to check that the kernel supports that
> > emulation through hwcaps before reading it.
> > According to https://www.kernel.org/doc/html/v5.8/arm64/cpu-feature-
> registers.html you
> > need to check (getauxval(AT_HWCAP) & HWCAP_CPUID) before doing that
> read.
> 
> That's why I do that immediately before reading midr_el1 - see above.

Errr, yes. Obviously I wasn't fully awake when I looked at it!
Sorry for the noise.
Ok for trunk then.
Kyrill

> 
> Cheers,
> Wilco


RE: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-11-10 Thread Kyrylo Tkachov
Hi Wilco,

> -Original Message-
> From: Wilco Dijkstra 
> Sent: Monday, November 6, 2023 12:13 PM
> To: GCC Patches ; Richard Sandiford
> 
> Cc: Kyrylo Tkachov 
> Subject: Re: [PATCH] libatomic: Improve ifunc selection on AArch64
> 
> 
> 
> ping
> 
> 
> From: Wilco Dijkstra
> Sent: 04 August 2023 16:05
> To: GCC Patches ; Richard Sandiford
> 
> Cc: Kyrylo Tkachov 
> Subject: [PATCH] libatomic: Improve ifunc selection on AArch64
> 
> 
> Add support for ifunc selection based on CPUID register.  Neoverse N1
> supports
> atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse
> cores.
> 
> Passes regress, OK for commit?
> 
> libatomic/
> config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc
> selection.
> 
> ---
> 
> diff --git a/libatomic/config/linux/aarch64/host-config.h
> b/libatomic/config/linux/aarch64/host-config.h
> index
> 851c78c01cd643318aaa52929ce4550266238b79..e5dc33c030a4bab927874fa6
> c69425db463fdc4b 100644
> --- a/libatomic/config/linux/aarch64/host-config.h
> +++ b/libatomic/config/linux/aarch64/host-config.h
> @@ -26,7 +26,7 @@
> 
>  #ifdef HWCAP_USCAT
>  # if N == 16
> -#  define IFUNC_COND_1 (hwcap & HWCAP_USCAT)
> +#  define IFUNC_COND_1 ifunc1 (hwcap)
>  # else
>  #  define IFUNC_COND_1  (hwcap & HWCAP_ATOMICS)
>  # endif
> @@ -50,4 +50,28 @@
>  #undef MAYBE_HAVE_ATOMIC_EXCHANGE_16
>  #define MAYBE_HAVE_ATOMIC_EXCHANGE_16   1
> 
> +#ifdef HWCAP_USCAT
> +
> +#define MIDR_IMPLEMENTOR(midr) (((midr) >> 24) & 255)
> +#define MIDR_PARTNUM(midr) (((midr) >> 4) & 0xfff)
> +
> +static inline bool
> +ifunc1 (unsigned long hwcap)
> +{
> +  if (hwcap & HWCAP_USCAT)
> +return true;
> +  if (!(hwcap & HWCAP_CPUID))
> +return false;
> +
> +  unsigned long midr;
> +  asm volatile ("mrs %0, midr_el1" : "=r" (midr));

From what I recall, the midr_el1 register is emulated by the kernel, and so
userspace software has to check that the kernel supports that emulation
through hwcaps before reading it.
According to 
https://www.kernel.org/doc/html/v5.8/arm64/cpu-feature-registers.html you need 
to check (getauxval(AT_HWCAP) & HWCAP_CPUID) before doing that read.
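
For reference, the kernel-sanctioned pattern looks roughly like this (a
sketch; HWCAP_CPUID comes from <asm/hwcap.h> on Linux/AArch64):

  #include <stdbool.h>
  #include <sys/auxv.h>   /* getauxval, AT_HWCAP */
  #include <asm/hwcap.h>  /* HWCAP_CPUID */

  static inline bool
  cpuid_emulation_available (void)
  {
    /* Only read emulated ID registers such as midr_el1 when the kernel
       advertises the emulation through this hwcap.  */
    return (getauxval (AT_HWCAP) & HWCAP_CPUID) != 0;
  }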

Thanks,
Kyrill

> +
> +  /* Neoverse N1 supports atomic 128-bit load/store.  */
> +  if (MIDR_IMPLEMENTOR (midr) == 'A' && MIDR_PARTNUM(midr) == 0xd0c)
> +return true;
> +
> +  return false;
> +}
> +#endif
> +
>  #include_next 


RE: [PATCH] AArch64: Cleanup memset expansion

2023-11-10 Thread Kyrylo Tkachov
Hi Wilco,

> -Original Message-
> From: Wilco Dijkstra 
> Sent: Monday, November 6, 2023 12:12 PM
> To: GCC Patches 
> Cc: Richard Sandiford ; Richard Earnshaw
> 
> Subject: Re: [PATCH] AArch64: Cleanup memset expansion
> 
> ping
> 
> Cleanup memset implementation.  Similar to memcpy/memmove, use an
> offset and
> bytes throughout.  Simplify the complex calculations when optimizing for size
> by using a fixed limit.
> 
> Passes regress/bootstrap, OK for commit?
> 

This looks like a good cleanup but I have a question...

> gcc/ChangeLog:
>     * config/aarch64/aarch64.cc (aarch64_progress_pointer): Remove
> function.
>     (aarch64_set_one_block_and_progress_pointer): Simplify and clean up.
>     (aarch64_expand_setmem): Clean up implementation, use byte offsets,
>     simplify size calculation.
> 
> ---
> 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index
> e19e2d1de2e5b30eca672df05d9dcc1bc106ecc8..578a253d6e0e133e1959255
> 3fc873b3e73f9f218 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -25229,15 +25229,6 @@ aarch64_move_pointer (rtx pointer, poly_int64
> amount)
>  next, amount);
>  }
> 
> -/* Return a new RTX holding the result of moving POINTER forward by the
> -   size of the mode it points to.  */
> -
> -static rtx
> -aarch64_progress_pointer (rtx pointer)
> -{
> -  return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE
> (pointer)));
> -}
> -
>  /* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
> 
>  static void
> @@ -25393,46 +25384,22 @@ aarch64_expand_cpymem (rtx *operands,
> bool is_memmove)
>    return true;
>  }
> 
> -/* Like aarch64_copy_one_block_and_progress_pointers, except for memset
> where
> -   SRC is a register we have created with the duplicated value to be set.  */
> +/* Set one block of size MODE at DST at offset OFFSET to value in SRC.  */
>  static void
> -aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
> -   machine_mode mode)
> -{
> -  /* If we are copying 128bits or 256bits, we can do that straight from
> - the SIMD register we prepared.  */
> -  if (known_eq (GET_MODE_BITSIZE (mode), 256))
> -    {
> -  mode = GET_MODE (src);
> -  /* "Cast" the *dst to the correct mode.  */
> -  *dst = adjust_address (*dst, mode, 0);
> -  /* Emit the memset.  */
> -  emit_insn (aarch64_gen_store_pair (mode, *dst, src,
> -    aarch64_progress_pointer (*dst), 
> src));
> -
> -  /* Move the pointers forward.  */
> -  *dst = aarch64_move_pointer (*dst, 32);
> -  return;
> -    }
> -  if (known_eq (GET_MODE_BITSIZE (mode), 128))
> +aarch64_set_one_block (rtx src, rtx dst, int offset, machine_mode mode)
> +{
> +  /* Emit explicit store pair instructions for 32-byte writes.  */
> +  if (known_eq (GET_MODE_SIZE (mode), 32))
>  {
> -  /* "Cast" the *dst to the correct mode.  */
> -  *dst = adjust_address (*dst, GET_MODE (src), 0);
> -  /* Emit the memset.  */
> -  emit_move_insn (*dst, src);
> -  /* Move the pointers forward.  */
> -  *dst = aarch64_move_pointer (*dst, 16);
> +  mode = V16QImode;
> +  rtx dst1 = adjust_address (dst, mode, offset);
> +  rtx dst2 = adjust_address (dst, mode, offset + 16);
> +  emit_insn (aarch64_gen_store_pair (mode, dst1, src, dst2, src));
>    return;
>  }
> -  /* For copying less, we have to extract the right amount from src.  */
> -  rtx reg = lowpart_subreg (mode, src, GET_MODE (src));
> -
> -  /* "Cast" the *dst to the correct mode.  */
> -  *dst = adjust_address (*dst, mode, 0);
> -  /* Emit the memset.  */
> -  emit_move_insn (*dst, reg);
> -  /* Move the pointer forward.  */
> -  *dst = aarch64_progress_pointer (*dst);
> +  if (known_lt (GET_MODE_SIZE (mode), 16))
> +    src = lowpart_subreg (mode, src, GET_MODE (src));
> +  emit_move_insn (adjust_address (dst, mode, offset), src);
>  }
> 
>  /* Expand a setmem using the MOPS instructions.  OPERANDS are the same
> @@ -25461,7 +25428,7 @@ aarch64_expand_setmem_mops (rtx *operands)
>  bool
>  aarch64_expand_setmem (rtx *operands)
>  {
> -  int n, mode_bits;
> +  int mode_bytes;
>    unsigned HOST_WIDE_INT len;
>    rtx dst = operands[0];
>    rtx val = operands[2], src;
> @@ -25474,104 +25441,70 @@ aarch64_expand_setmem (rtx *operands)
>    || (STRICT_ALIGNMENT && align < 16))
>  return aarch64_expand_setmem_mops (operands);
> 
> -  bool size_p = optimize_function_for_size_p (cfun);
> -
>    /* Default the maximum to 256-bytes when considering only libcall vs
>   SIMD broadcast sequence.  */
>    unsigned max_set_size = 256;
>    unsigned mops_threshold = aarch64_mops_memset_size_threshold;
> 
> +  /* Reduce the maximum size with -Os.  */
> +  if (optimize_function_for_size_p (cfun))
> +    max_set_size = 96;
> +

 This is a new "magic" number in this code. It looks sensible, but how did you arrive at it?

RE: [PATCH v4] aarch64: Fine-grained policies to control ldp-stp formation.

2023-09-27 Thread Kyrylo Tkachov
Hi Manos,

> -Original Message-
> From: Manos Anagnostakis 
> Sent: Tuesday, September 26, 2023 2:52 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov ; Tamar Christina
> ; Philipp Tomsich ;
> Manos Anagnostakis 
> Subject: [PATCH v4] aarch64: Fine-grained policies to control ldp-stp
> formation.
> 
> This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> to provide the requested behaviour for handling ldp and stp:
> 
>   /* Allow the tuning structure to disable LDP instruction formation
>  from combining instructions (e.g., in peephole2).
>  TODO: Implement fine-grained tuning control for LDP and STP:
>1. control policies for load and store separately;
>2. support the following policies:
>   - default (use what is in the tuning structure)
>   - always
>   - never
>   - aligned (only if the compiler can prove that the
> load will be aligned to 2 * element_size)  */
> 
> It provides two new and concrete target-specific command-line parameters
> -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> to give the ability to control load and store policies separately as
> stated in part 1 of the TODO.
> 
> The accepted values for both parameters are:
> - default: Use the policy of the tuning structure (default).
> - always: Emit ldp/stp regardless of alignment.
> - never: Do not emit ldp/stp.
> - aligned: In order to emit ldp/stp, first check if the load/store will
>   be aligned to 2 * element_size.
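
As a usage sketch (the options line here is hypothetical, but the parameter
spellings follow the description above), a test could pin both policies
explicitly:

  /* { dg-options "-O2 --param=aarch64-ldp-policy=aligned --param=aarch64-stp-policy=never" } */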
> 
> Bootstrapped and regtested aarch64-linux.
> 
> gcc/ChangeLog:
> * config/aarch64/aarch64-opts.h (enum aarch64_ldp_policy): New
>   enum type.
> (enum aarch64_stp_policy): New enum type.
> * config/aarch64/aarch64-protos.h (struct tune_params): Add
>   appropriate enums for the policies.
>   (aarch64_mem_ok_with_ldpstp_policy_model): New declaration.
> * config/aarch64/aarch64-tuning-flags.def
>   (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
>   options.
> * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
>   function to parse ldp-policy parameter.
> (aarch64_parse_stp_policy): New function to parse stp-policy 
> parameter.
> (aarch64_override_options_internal): Call parsing functions.
>   (aarch64_mem_ok_with_ldpstp_policy_model): New function.
> (aarch64_operands_ok_for_ldpstp): Add call to
>   aarch64_mem_ok_with_ldpstp_policy_model for parameter-value
>   check and alignment check and remove superseded ones.
> (aarch64_operands_adjust_ok_for_ldpstp): Add call to
> aarch64_mem_ok_with_ldpstp_policy_model for parameter-value
>   check and alignment check and remove superseded ones.
> * config/aarch64/aarch64.opt: Add parameters.
>   * doc/invoke.texi: Document the parameters accordingly.

The ChangeLog entry should name the new parameters. For example:
* config/aarch64/aarch64.opt (aarch64-ldp-policy): New param.

Ok with the fixed ChangeLog.
Thank you for the work!
Kyrill

> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/aarch64/ampere1-no_ldp_combine.c: Removed.
> * gcc.target/aarch64/ldp_aligned.c: New test.
> * gcc.target/aarch64/ldp_always.c: New test.
> * gcc.target/aarch64/ldp_never.c: New test.
> * gcc.target/aarch64/stp_aligned.c: New test.
> * gcc.target/aarch64/stp_always.c: New test.
> * gcc.target/aarch64/stp_never.c: New test.
> 
> Signed-off-by: Manos Anagnostakis 
> ---
> Changes in v4:
> - Changed the parameters to accept an enum instead of an
>   integer and updated documentation in doc/invoke.texi.
> - Packed all the new checks in aarch64_operands_ok_for_ldpstp/
>   aarch64_operands_adjust_ok_for_ldpstp in a new function
>   called aarch64_mem_ok_with_ldpstp_policy_model.
> 
>  gcc/config/aarch64/aarch64-opts.h |  16 ++
>  gcc/config/aarch64/aarch64-protos.h   |  25 +++
>  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
>  gcc/config/aarch64/aarch64.cc | 212 +-
>  gcc/config/aarch64/aarch64.opt|  38 
>  gcc/doc/invoke.texi   |  20 ++
>  .../aarch64/ampere1-no_ldp_combine.c  |  11 -
>  .../gcc.target/aarch64/ldp_aligned.c  |  66 ++
>  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  66 ++
>  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  66 ++
>  .../gcc.target/aarch64/stp_aligned.c  |  60 +
>  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +
>  gcc/testsuite/gcc.target/aarch64/stp_n

RE: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.

2023-09-26 Thread Kyrylo Tkachov


> -Original Message-
> From: Kyrylo Tkachov 
> Sent: Tuesday, September 26, 2023 9:36 AM
> To: Manos Anagnostakis ; gcc-
> patc...@gcc.gnu.org
> Cc: Philipp Tomsich ; Andrew Pinski
> 
> Subject: RE: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp
> formation.
> 
> Hi Manos,
> 
> Thank you for the quick turnaround, please post the patch that uses a --
> param with an enum. I think that's the direction we should be going with this
> patch.

Ah, and please address Tamar's feedback from 
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631343.html
Thanks,
Kyrill

> 
> From: Manos Anagnostakis 
> Sent: Tuesday, September 26, 2023 7:06 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Philipp Tomsich ; Kyrylo Tkachov
> ; Andrew Pinski 
> Subject: Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp
> formation.
> 
> Thank you Andrew for the input.
> 
> I've prepared a patch using --param with enum, which seems a more suitable
> approach to me as strings are more descriptive as well.
> 
> The current patch needed an adjustment on how to call the parsing functions
> to match the compiler coding style.
> 
> Both are bootstrapped and regstested.
> 
> I can send a V4 of whichever is preferred.
> 
> Thanks!
> 
> Manos.
> 
> On Mon, Sep 25, 2023 at 11:57 PM Andrew Pinski
> <mailto:pins...@gmail.com> wrote:
> On Mon, Sep 25, 2023 at 1:04 PM Andrew Pinski <mailto:pins...@gmail.com>
> wrote:
> >
> > On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
> > <mailto:philipp.toms...@vrull.eu> wrote:
> > >
> > > On Mon, 25 Sept 2023 at 21:54, Andrew Pinski
> <mailto:pins...@gmail.com> wrote:
> > > >
> > > > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > > > <mailto:manos.anagnosta...@vrull.eu> wrote:
> > > > >
> > > > > This patch implements the following TODO in
> gcc/config/aarch64/aarch64.cc
> > > > > to provide the requested behaviour for handling ldp and stp:
> > > > >
> > > > >   /* Allow the tuning structure to disable LDP instruction formation
> > > > >      from combining instructions (e.g., in peephole2).
> > > > >      TODO: Implement fine-grained tuning control for LDP and STP:
> > > > >            1. control policies for load and store separately;
> > > > >            2. support the following policies:
> > > > >               - default (use what is in the tuning structure)
> > > > >               - always
> > > > >               - never
> > > > >               - aligned (only if the compiler can prove that the
> > > > >                 load will be aligned to 2 * element_size)  */
> > > > >
> > > > > It provides two new and concrete target-specific command-line
> parameters
> > > > > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > > > > to give the ability to control load and store policies seperately as
> > > > > stated in part 1 of the TODO.
> > > > >
> > > > > The accepted values for both parameters are:
> > > > > - 0: Use the policy of the tuning structure (default).
> > > > > - 1: Emit ldp/stp regardless of alignment.
> > > > > - 2: Do not emit ldp/stp.
> > > > > - 3: In order to emit ldp/stp, first check if the load/store will
> > > > >   be aligned to 2 * element_size.
> > > >
> > > > Instead of a number, does it make sense to instead use an string
> > > > (ENUM) for this param.
> > > > Also I think using --param is a bad idea if it is going to be
> > > > documented in the user manual.
> > > > Maybe a -m option should be used instead.
> > >
> > > See https://gcc.gnu.org/pipermail/gcc-patches/2023-
> September/631283.html
> > > for the discussion triggering the change from -m... to --param and the
> > > change to using a number instead of a string.
> >
> > That is the opposite of the current GCC practice across all targets.
> > Things like this should be consistent and if one target decides to do
> > it different, then maybe it should NOT.
> > Anyways we should document the correct coding style for options so we
> > don't have these back and forths again.
> 
> Kyrylo:
> >  It will have to take a number rather than a string but that should be 
> >okay, as
> long as the right values are documented in invoke.texi.
> 
> No it does

RE: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.

2023-09-26 Thread Kyrylo Tkachov
Hi Manos,

Thank you for the quick turnaround, please post the patch that uses a --param 
with an enum. I think that's the direction we should be going with this patch.
Thanks,
Kyrill

From: Manos Anagnostakis  
Sent: Tuesday, September 26, 2023 7:06 AM
To: gcc-patches@gcc.gnu.org
Cc: Philipp Tomsich ; Kyrylo Tkachov 
; Andrew Pinski 
Subject: Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp 
formation.

Thank you Andrew for the input.

I've prepared a patch using --param with an enum, which seems a more suitable
approach to me, as strings are more descriptive as well.

The current patch needed an adjustment on how to call the parsing functions to 
match the compiler coding style.

Both are bootstrapped and regtested.

I can send a V4 of whichever is preferred.

Thanks!

Manos.

On Mon, Sep 25, 2023 at 11:57 PM Andrew Pinski <mailto:pins...@gmail.com> wrote:
On Mon, Sep 25, 2023 at 1:04 PM Andrew Pinski <mailto:pins...@gmail.com> wrote:
>
> On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
> <mailto:philipp.toms...@vrull.eu> wrote:
> >
> > On Mon, 25 Sept 2023 at 21:54, Andrew Pinski <mailto:pins...@gmail.com> 
> > wrote:
> > >
> > > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > > <mailto:manos.anagnosta...@vrull.eu> wrote:
> > > >
> > > > This patch implements the following TODO in 
> > > > gcc/config/aarch64/aarch64.cc
> > > > to provide the requested behaviour for handling ldp and stp:
> > > >
> > > >   /* Allow the tuning structure to disable LDP instruction formation
> > > >      from combining instructions (e.g., in peephole2).
> > > >      TODO: Implement fine-grained tuning control for LDP and STP:
> > > >            1. control policies for load and store separately;
> > > >            2. support the following policies:
> > > >               - default (use what is in the tuning structure)
> > > >               - always
> > > >               - never
> > > >               - aligned (only if the compiler can prove that the
> > > >                 load will be aligned to 2 * element_size)  */
> > > >
> > > > It provides two new and concrete target-specific command-line parameters
> > > > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > > > to give the ability to control load and store policies seperately as
> > > > stated in part 1 of the TODO.
> > > >
> > > > The accepted values for both parameters are:
> > > > - 0: Use the policy of the tuning structure (default).
> > > > - 1: Emit ldp/stp regardless of alignment.
> > > > - 2: Do not emit ldp/stp.
> > > > - 3: In order to emit ldp/stp, first check if the load/store will
> > > >   be aligned to 2 * element_size.
> > >
> > > Instead of a number, does it make sense to instead use an string
> > > (ENUM) for this param.
> > > Also I think using --param is a bad idea if it is going to be
> > > documented in the user manual.
> > > Maybe a -m option should be used instead.
> >
> > See https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
> > for the discussion triggering the change from -m... to --param and the
> > change to using a number instead of a string.
>
> That is the opposite of the current GCC practice across all targets.
> Things like this should be consistent and if one target decides to do
> it different, then maybe it should NOT.
> Anyways we should document the correct coding style for options so we
> don't have these back and forths again.

Kyrylo:
>  It will have to take a number rather than a string but that should be okay, 
>as long as the right values are documented in invoke.texi.

No it does not need to be a number. --param=ranger-debug= does not
take a number, it takes an enum.
One of the benefits of moving --param support over to .opt was to allow
more than just numbers.

Thanks,
Andrew


>
>
> Thanks,
> Andrew
>
> >
> > Thanks,
> > Philipp.
> >
> > >
> > > Thanks,
> > > Andrew
> > >
> > > >
> > > > gcc/ChangeLog:
> > > >         * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > > >         appropriate enums for the policies.
> > > >         * config/aarch64/aarch64-tuning-flags.def
> > > >         (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> > > >         options.
> > > >         * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> > > >         function to par

RE: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp formation.

2023-09-26 Thread Kyrylo Tkachov


> -Original Message-
> From: Andrew Pinski 
> Sent: Monday, September 25, 2023 9:05 PM
> To: Philipp Tomsich 
> Cc: Manos Anagnostakis ; gcc-
> patc...@gcc.gnu.org; Kyrylo Tkachov 
> Subject: Re: [PATCH v3] aarch64: Fine-grained policies to control ldp-stp
> formation.
> 
> On Mon, Sep 25, 2023 at 12:59 PM Philipp Tomsich
>  wrote:
> >
> > On Mon, 25 Sept 2023 at 21:54, Andrew Pinski  wrote:
> > >
> > > On Mon, Sep 25, 2023 at 12:50 PM Manos Anagnostakis
> > >  wrote:
> > > >
> > > > This patch implements the following TODO in
> gcc/config/aarch64/aarch64.cc
> > > > to provide the requested behaviour for handling ldp and stp:
> > > >
> > > >   /* Allow the tuning structure to disable LDP instruction formation
> > > >  from combining instructions (e.g., in peephole2).
> > > >  TODO: Implement fine-grained tuning control for LDP and STP:
> > > >1. control policies for load and store separately;
> > > >2. support the following policies:
> > > >   - default (use what is in the tuning structure)
> > > >   - always
> > > >   - never
> > > >   - aligned (only if the compiler can prove that the
> > > > load will be aligned to 2 * element_size)  */
> > > >
> > > > It provides two new and concrete target-specific command-line
> parameters
> > > > -param=aarch64-ldp-policy= and -param=aarch64-stp-policy=
> > > > to give the ability to control load and store policies separately as
> > > > stated in part 1 of the TODO.
> > > >
> > > > The accepted values for both parameters are:
> > > > - 0: Use the policy of the tuning structure (default).
> > > > - 1: Emit ldp/stp regardless of alignment.
> > > > - 2: Do not emit ldp/stp.
> > > > - 3: In order to emit ldp/stp, first check if the load/store will
> > > >   be aligned to 2 * element_size.
> > >
> > > Instead of a number, does it make sense to instead use an string
> > > (ENUM) for this param.
> > > Also I think using --param is a bad idea if it is going to be
> > > documented in the user manual.
> > > Maybe a -m option should be used instead.
> >
> > See https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631283.html
> > for the discussion triggering the change from -m... to --param and the
> > change to using a number instead of a string.
> 
> That is the opposite of the current GCC practice across all targets.
> Things like this should be consistent and if one target decides to do
> it different, then maybe it should NOT.
> Anyways we should document the correct coding style for options so we
> don't have these back and forths again.

My rationale for having this as a param rather than an -m* option is that
this is just an override for a codegen heuristic that the compiler should be
getting correct on its own when used by a normal user.
Having a way to force an explicit LDP/STP policy can be useful for testing
the compiler and for some power user experimentation, but I wouldn't
want to see it make its way into any user makefiles.

Good point on having it accept an enum, it is definitely more readable to have 
a string argument.
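
To make the "aligned" policy concrete, the check in
aarch64_operands_ok_for_ldpstp would boil down to something along these
lines (a rough sketch for discussion, not the actual patch; variable
names are schematic and it glosses over poly_int handling):

  /* Sketch: honour the requested pair-formation policy.  */
  if (policy == LDP_POLICY_NEVER)
    return false;
  if (policy == LDP_POLICY_ALIGNED
      /* Pair only if the first access is aligned to twice the element
         size, i.e. the width of the whole pair.  MEM_ALIGN and
         GET_MODE_BITSIZE are both in bits.  */
      && MEM_ALIGN (mem_1) < 2 * GET_MODE_BITSIZE (mode).to_constant ())
    return false;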
Thanks,
Kyrill

> 
> 
> Thanks,
> Andrew
> 
> >
> > Thanks,
> > Philipp.
> >
> > >
> > > Thanks,
> > > Andrew
> > >
> > > >
> > > > gcc/ChangeLog:
> > > > * config/aarch64/aarch64-protos.h (struct tune_params): Add
> > > > appropriate enums for the policies.
> > > > * config/aarch64/aarch64-tuning-flags.def
> > > > (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
> > > > options.
> > > > * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
> > > > function to parse ldp-policy parameter.
> > > > (aarch64_parse_stp_policy): New function to parse stp-policy
> parameter.
> > > > (aarch64_override_options_internal): Call parsing functions.
> > > > (aarch64_operands_ok_for_ldpstp): Add parameter-value check
> and
> > > > alignment check and remove superseded ones.
> > > > (aarch64_operands_adjust_ok_for_ldpstp): Add parameter-value
> check and
> > > > alignment check and remove superseded ones.
> > > > * config/aarch64/aarch64.opt: Add options.

RE: [PATCH] aarch64: Fine-grained ldp and stp policies with test-cases.

2023-09-25 Thread Kyrylo Tkachov
Hi Manos,

Apologies for the long delay.

> -Original Message-
> From: Manos Anagnostakis 
> Sent: Friday, August 18, 2023 8:50 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov ; Philipp Tomsich
> ; Manos Anagnostakis
> 
> Subject: [PATCH] aarch64: Fine-grained ldp and stp policies with test-cases.
> 
> This patch implements the following TODO in gcc/config/aarch64/aarch64.cc
> to provide the requested behaviour for handling ldp and stp:
> 
>   /* Allow the tuning structure to disable LDP instruction formation
>  from combining instructions (e.g., in peephole2).
>  TODO: Implement fine-grained tuning control for LDP and STP:
>1. control policies for load and store separately;
>2. support the following policies:
>   - default (use what is in the tuning structure)
>   - always
>   - never
>   - aligned (only if the compiler can prove that the
> load will be aligned to 2 * element_size)  */
> 
> It provides two new and concrete command-line options -mldp-policy and -
> mstp-policy
> to give the ability to control load and store policies separately as
> stated in part 1 of the TODO.
> 
> The accepted values for both options are:
> - default: Use the ldp/stp policy defined in the corresponding tuning
>   structure.
> - always: Emit ldp/stp regardless of alignment.
> - never: Do not emit ldp/stp.
> - aligned: In order to emit ldp/stp, first check if the load/store will
>   be aligned to 2 * element_size.
> 
> gcc/ChangeLog:
> * config/aarch64/aarch64-protos.h (struct tune_params): Add
>   appropriate enums for the policies.
> * config/aarch64/aarch64-tuning-flags.def
>   (AARCH64_EXTRA_TUNING_OPTION): Remove superseded tuning
>   options.
> * config/aarch64/aarch64.cc (aarch64_parse_ldp_policy): New
>   function to parse ldp-policy option.
> (aarch64_parse_stp_policy): New function to parse stp-policy option.
> (aarch64_override_options_internal): Call parsing functions.
> (aarch64_operands_ok_for_ldpstp): Add option-value check and
>   alignment check and remove superseded ones.
> (aarch64_operands_adjust_ok_for_ldpstp): Add option-value check and
>   alignment check and remove superseded ones.
> * config/aarch64/aarch64.opt: Add options.
> 
> gcc/testsuite/ChangeLog:
> * gcc.target/aarch64/ldp_aligned.c: New test.
> * gcc.target/aarch64/ldp_always.c: New test.
> * gcc.target/aarch64/ldp_never.c: New test.
> * gcc.target/aarch64/stp_aligned.c: New test.
> * gcc.target/aarch64/stp_always.c: New test.
> * gcc.target/aarch64/stp_never.c: New test.
> 
> Signed-off-by: Manos Anagnostakis 
> ---
> 
>  gcc/config/aarch64/aarch64-protos.h   |  24 ++
>  gcc/config/aarch64/aarch64-tuning-flags.def   |   8 -
>  gcc/config/aarch64/aarch64.cc | 229 ++
>  gcc/config/aarch64/aarch64.opt|   8 +
>  .../gcc.target/aarch64/ldp_aligned.c  |  64 +
>  gcc/testsuite/gcc.target/aarch64/ldp_always.c |  64 +
>  gcc/testsuite/gcc.target/aarch64/ldp_never.c  |  64 +
>  .../gcc.target/aarch64/stp_aligned.c  |  60 +
>  gcc/testsuite/gcc.target/aarch64/stp_always.c |  60 +
>  gcc/testsuite/gcc.target/aarch64/stp_never.c  |  60 +
>  10 files changed, 580 insertions(+), 61 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_aligned.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_always.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_never.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_aligned.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_always.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_never.c
> 
> diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> index 70303d6fd95..be1d73490ed 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -568,6 +568,30 @@ struct tune_params
>/* Place prefetch struct pointer at the end to enable type checking
>   errors when tune_params misses elements (e.g., from erroneous merges).
> */
>const struct cpu_prefetch_tune *prefetch;
> +/* An enum specifying how to handle load pairs using a fine-grained policy:
> +   - LDP_POLICY_ALIGNED: Emit ldp if the source pointer is aligned
> +   to at least double the alignment of the type.
> +   - LDP_POLICY_ALWAYS: Emit ldp regardless of alignment.
> +   - LDP_POLICY_NEVER: Do not emit ldp.  */
> +
> +  enum aarch64_ldp_policy_model
> +  {
> 

RE: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-09-14 Thread Kyrylo Tkachov via Gcc-patches
Hi Stam,

> -Original Message-
> From: Stam Markianos-Wright 
> Sent: Wednesday, September 6, 2023 6:19 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov ; Richard Earnshaw
> 
> Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low
> Overhead Loops
> 
> Hi all,
> 
> This is the 2/2 patch that contains the functional changes needed
> for MVE Tail Predicated Low Overhead Loops.  See my previous email
> for a general introduction of MVE LOLs.
> 
> This support is added through the already existing loop-doloop
> mechanisms that are used for non-MVE dls/le looping.
> 
> Mid-end changes are:
> 
> 1) Relax the loop-doloop mechanism in the mid-end to allow for
>     decrement numbers other that -1 and for `count` to be an
>     rtx containing a simple REG (which in this case will contain
>     the number of elements to be processed), rather
>     than an expression for calculating the number of iterations.
> 2) Added a new df utility function: `df_bb_regno_only_def_find` that
>     will return the DEF of a REG if it is DEF-ed only once within the
>     basic block.
> 
> And many things in the backend to implement the above optimisation:
> 
> 3)  Implement the `arm_predict_doloop_p` target hook to instruct the
>      mid-end about Low Overhead Loops (MVE or not), as well as
>      `arm_loop_unroll_adjust` which will prevent unrolling of any loops
>      that are valid for becoming MVE Tail_Predicated Low Overhead Loops
>      (unrolling can transform a loop in ways that invalidate the dlstp/
>      letp tranformation logic and the benefit of the dlstp/letp loop
>      would be considerably higher than that of unrolling)
> 4)  Appropriate changes to the define_expand of doloop_end, new
>      patterns for dlstp and letp, new iterators,  unspecs, etc.
> 5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
>     * `arm_mve_dlstp_check_dec_counter`
>     * `arm_mve_dlstp_check_inc_counter`
>     * `arm_mve_check_reg_origin_is_num_elems`
>     * `arm_mve_check_df_chain_back_for_implic_predic`
>     * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
>     All of these, in some way or another, run checks on the loop
>     structure in order to determine if the loop is valid for dlstp/letp
>     transformation.
> 6) `arm_attempt_dlstp_transform`: (called from the define_expand of
>      doloop_end) this function re-checks for the loop's suitability for
>      dlstp/letp transformation and then implements it, if possible.
> 7) Various utility functions:
>     *`arm_mve_get_vctp_lanes` to map
>     from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
>     to check an insn to see if it requires the VPR or not.
>     * `arm_mve_get_loop_vctp`
>     * `arm_mve_get_vctp_lanes`
>     * `arm_emit_mve_unpredicated_insn_to_seq`
>     * `arm_get_required_vpr_reg`
>     * `arm_get_required_vpr_reg_param`
>     * `arm_get_required_vpr_reg_ret_val`
>     * `arm_mve_is_across_vector_insn`
>     * `arm_is_mve_load_store_insn`
>     * `arm_mve_vec_insn_is_predicated_with_this_predicate`
>     * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`
> 
> No regressions on arm-none-eabi with various targets and on
> aarch64-none-elf. Thoughts on getting this into trunk?

The arm parts look sensible but we'd need review for the df-core.h and 
df-core.cc changes.
Maybe Jeff can help or can recommend someone to take a look?
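
For whoever picks it up: going by the description above, the new df
helper can presumably sit on top of the existing first/last queries,
along the lines of this sketch (not the actual patch):

  /* Return the lone DEF of REGNO in BB, or NULL if REGNO is defined
     zero times or more than once within BB.  */
  df_ref
  df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
  {
    df_ref def = df_bb_regno_first_def_find (bb, regno);
    if (def && def == df_bb_regno_last_def_find (bb, regno))
      return def;
    return NULL;
  }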
Thanks,
Kyrill

> 
> Thank you,
> Stam Markianos-Wright
> 
> gcc/ChangeLog:
> 
>      * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
>      (arm_target_bb_ok_for_lob): ...this
>      (arm_attempt_dlstp_transform): New.
>      * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
>      (TARGET_PREDICT_DOLOOP_P): New.
>      (arm_block_set_vect):
>      (arm_target_insn_ok_for_lob): Rename from arm_target_insn_ok_for_lob.
>      (arm_target_bb_ok_for_lob): New.
>      (arm_mve_get_vctp_lanes): New.
>      (arm_get_required_vpr_reg): New.
>      (arm_get_required_vpr_reg_param): New.
>      (arm_get_required_vpr_reg_ret_val): New.
>      (arm_mve_get_loop_vctp): New.
>      (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
>      (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
>      (arm_mve_check_df_chain_back_for_implic_predic): New.
>      (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
>      (arm_mve_check_reg_origin_is_num_elems): New.
>      (arm_mve_dlstp_check_inc_counter): New.
>      (arm_mve_dlstp_check_dec_counter): New.
>      (arm_mve_loop_valid_for_dlstp): New.
>      (arm_mve_is_across_vector_insn): New.
>      (arm_is_mve_load_store_insn): New.
>      (arm_predict_doloop_p): New.

RE: [PING][PATCH 1/2] arm: Add define_attr to to create a mapping between MVE predicated and unpredicated insns

2023-09-14 Thread Kyrylo Tkachov via Gcc-patches
Hi Stam,

> -Original Message-
> From: Stam Markianos-Wright 
> Sent: Wednesday, September 6, 2023 6:19 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov ; Richard Earnshaw
> 
> Subject: [PING][PATCH 1/2] arm: Add define_attr to to create a mapping
> between MVE predicated and unpredicated insns
> 
> 
> Hi all,
> 
> I'd like to submit two patches that add support for Arm's MVE
> Tail Predicated Low Overhead Loop feature.
> 
> --- Introduction ---
> 
> The M-class Arm-ARM:
> https://developer.arm.com/documentation/ddi0553/bu/?lang=en
> Section B5.5.1 "Loop tail predication" describes the feature
> we are adding support for with this patch (although
> we only add codegen for DLSTP/LETP instruction loops).
> 
> Previously with commit d2ed233cb94 we'd added support for
> non-MVE DLS/LE loops through the loop-doloop pass, which, given
> a standard MVE loop like:
> 
> ```
> void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t
> *c, int n)
> {
>    while (n > 0)
>      {
>    mve_pred16_t p = vctp16q (n);
>    int16x8_t va = vldrhq_z_s16 (a, p);
>    int16x8_t vb = vldrhq_z_s16 (b, p);
>    int16x8_t vc = vaddq_x_s16 (va, vb, p);
>    vstrhq_p_s16 (c, vc, p);
>    c+=8;
>    a+=8;
>    b+=8;
>    n-=8;
>      }
> }
> ```
> .. would output:
> 
> ```
>      
>      dls lr, lr
> .L3:
>      vctp.16 r3
>      vmrs    ip, P0  @ movhi
>      sxth    ip, ip
>      vmsr P0, ip @ movhi
>      mov r4, r0
>      vpst
>      vldrht.16   q2, [r4]
>      mov r4, r1
>      vmov    q3, q0
>      vpst
>      vldrht.16   q1, [r4]
>      mov r4, r2
>      vpst
>      vaddt.i16   q3, q2, q1
>      subs    r3, r3, #8
>      vpst
>      vstrht.16   q3, [r4]
>      adds    r0, r0, #16
>      adds    r1, r1, #16
>      adds    r2, r2, #16
>      le  lr, .L3
> ```
> 
> where the LE instruction will decrement LR by 1, compare and
> branch if needed.
> 
> (there are also other inefficiencies with the above code, like the
> pointless vmrs/sxth/vmsr on the VPR and the adds not being merged
> into the vldrht/vstrht as a #16 offsets and some random movs!
> But that's different problems...)
> 
> The MVE version is similar, except that:
> * Instead of DLS/LE the instructions are DLSTP/LETP.
> * Instead of pre-calculating the number of iterations of the
>    loop, we place the number of elements to be processed by the
>    loop into LR.
> * Instead of decrementing the LR by one, LETP will decrement it
>    by FPSCR.LTPSIZE, which is the number of elements being
>    processed in each iteration: 16 for 8-bit elements, 8 for 16-bit
>    elements, etc.
> * On the final iteration, automatic Loop Tail Predication is
>    performed, as if the instructions within the loop had been VPT
>    predicated with a VCTP generating the VPR predicate in every
>    loop iteration.
> 
> The dlstp/letp loop now looks like:
> 
> ```
>      
>      dlstp.16    lr, r3
> .L14:
>      mov r3, r0
>      vldrh.16    q3, [r3]
>      mov r3, r1
>      vldrh.16    q2, [r3]
>      mov r3, r2
>      vadd.i16  q3, q3, q2
>      adds    r0, r0, #16
>      vstrh.16    q3, [r3]
>      adds    r1, r1, #16
>      adds    r2, r2, #16
>      letp    lr, .L14
> 
> ```
> 
> Since the loop tail predication is automatic, we have eliminated
> the VCTP that had been specified by the user in the intrinsic
> and converted the VPT-predicated instructions into their
> unpredicated equivalents (which also saves us from VPST insns).
> 
> The LE instruction here decrements LR by 8 in each iteration.
> 
> --- This 1/2 patch ---
> 
> This first patch lays some groundwork by adding an attribute to
> md patterns, and then the second patch contains the functional
> changes.
> 
> One major difficulty in implementing MVE Tail-Predicated Low
> Overhead Loops was the need to transform VPT-predicated insns
> in the insn chain into their unpredicated equivalents, like:
> `mve_vldrbq_z_ -> mve_vldrbq_`.
> 
> This requires us to have a deterministic link between two
> different patterns in mve.md -- this _could_ be done by
> re-ordering the entirety of mve.md such that the patterns are
> at some constant icode proximity (e.g. having the _z immediately
> after the unpredicated version would mean that to map from the
> former to the latter you could use icode-1), but that is a very
> messy solution t

RE: [PATCH 1/9] arm: [MVE intrinsics] factorize vmullbq vmulltq

2023-08-22 Thread Kyrylo Tkachov via Gcc-patches
Hi Christophe,

> -Original Message-
> From: Christophe Lyon 
> Sent: Monday, August 14, 2023 7:34 PM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw ; Richard Sandiford
> 
> Cc: Christophe Lyon 
> Subject: [PATCH 1/9] arm: [MVE intrinsics] factorize vmullbq vmulltq
> 
> Factorize vmullbq, vmulltq so that they use the same parameterized
> names.
> 
> 2023-08-14  Christophe Lyon  
> 
>   gcc/
>   * config/arm/iterators.md (mve_insn): Add vmullb, vmullt.
>   (isu): Add VMULLBQ_INT_S, VMULLBQ_INT_U, VMULLTQ_INT_S,
>   VMULLTQ_INT_U.
>   (supf): Add VMULLBQ_POLY_P, VMULLTQ_POLY_P,
> VMULLBQ_POLY_M_P,
>   VMULLTQ_POLY_M_P.
>   (VMULLBQ_INT, VMULLTQ_INT, VMULLBQ_INT_M, VMULLTQ_INT_M):
> Delete.
>   (VMULLxQ_INT, VMULLxQ_POLY, VMULLxQ_INT_M,
> VMULLxQ_POLY_M): New.
>   * config/arm/mve.md (mve_vmullbq_int_)
>   (mve_vmulltq_int_): Merge into ...
>   (@mve_q_int_) ... this.
>   (mve_vmulltq_poly_p, mve_vmullbq_poly_p): Merge
> into ...
>   (@mve_q_poly_): ... this.
>   (mve_vmullbq_int_m_,
> mve_vmulltq_int_m_): Merge into ...
>   (@mve_q_int_m_): ... this.
>   (mve_vmullbq_poly_m_p, mve_vmulltq_poly_m_p):
> Merge into ...
>   (@mve_q_poly_m_): ... this.

The series is okay and similar in design to your previous series in this area.
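(For readers following along: the @ prefix on the merged patterns is what
makes the factorization pay off. It tells the insn generators to emit
overloaded code_for_*/gen_* helpers keyed on the iterator values, so the
intrinsics framework can select vmullb vs vmullt from the unspec code at
expand time instead of hard-coding one pattern name per instruction.)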
Thanks again for doing this rework.
Kyrill

> ---
>  gcc/config/arm/iterators.md |  23 +++--
>  gcc/config/arm/mve.md   | 100 
>  2 files changed, 38 insertions(+), 85 deletions(-)
> 
> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
> index b13ff53d36f..fb003bcd67b 100644
> --- a/gcc/config/arm/iterators.md
> +++ b/gcc/config/arm/iterators.md
> @@ -917,6 +917,7 @@
> 
>  (define_int_attr mve_insn [
>(UNSPEC_VCADD90 "vcadd") (UNSPEC_VCADD270 "vcadd")
> +  (UNSPEC_VCMLA "vcmla") (UNSPEC_VCMLA90 "vcmla")
> (UNSPEC_VCMLA180 "vcmla") (UNSPEC_VCMLA270 "vcmla")
>(UNSPEC_VCMUL "vcmul") (UNSPEC_VCMUL90 "vcmul")
> (UNSPEC_VCMUL180 "vcmul") (UNSPEC_VCMUL270 "vcmul")
>(VABAVQ_P_S "vabav") (VABAVQ_P_U "vabav")
>(VABAVQ_S "vabav") (VABAVQ_U "vabav")
> @@ -1044,6 +1045,13 @@
>(VMOVNTQ_S "vmovnt") (VMOVNTQ_U "vmovnt")
>(VMULHQ_M_S "vmulh") (VMULHQ_M_U "vmulh")
>(VMULHQ_S "vmulh") (VMULHQ_U "vmulh")
> +  (VMULLBQ_INT_M_S "vmullb") (VMULLBQ_INT_M_U
> "vmullb")
> +  (VMULLBQ_INT_S "vmullb") (VMULLBQ_INT_U "vmullb")
> +  (VMULLBQ_POLY_M_P "vmullb") (VMULLTQ_POLY_M_P
> "vmullt")
> +  (VMULLBQ_POLY_P "vmullb")
> +  (VMULLTQ_INT_M_S "vmullt") (VMULLTQ_INT_M_U
> "vmullt")
> +  (VMULLTQ_INT_S "vmullt") (VMULLTQ_INT_U "vmullt")
> +  (VMULLTQ_POLY_P "vmullt")
>(VMULQ_M_N_S "vmul") (VMULQ_M_N_U "vmul")
> (VMULQ_M_N_F "vmul")
>(VMULQ_M_S "vmul") (VMULQ_M_U "vmul") (VMULQ_M_F
> "vmul")
>(VMULQ_N_S "vmul") (VMULQ_N_U "vmul") (VMULQ_N_F
> "vmul")
> @@ -1209,7 +1217,6 @@
>(VSUBQ_M_N_S "vsub") (VSUBQ_M_N_U "vsub")
> (VSUBQ_M_N_F "vsub")
>(VSUBQ_M_S "vsub") (VSUBQ_M_U "vsub") (VSUBQ_M_F
> "vsub")
>(VSUBQ_N_S "vsub") (VSUBQ_N_U "vsub") (VSUBQ_N_F
> "vsub")
> -  (UNSPEC_VCMLA "vcmla") (UNSPEC_VCMLA90 "vcmla")
> (UNSPEC_VCMLA180 "vcmla") (UNSPEC_VCMLA270 "vcmla")
>])
> 
>  (define_int_attr isu[
> @@ -1246,6 +1253,8 @@
>(VMOVNBQ_S "i") (VMOVNBQ_U "i")
>(VMOVNTQ_M_S "i") (VMOVNTQ_M_U "i")
>(VMOVNTQ_S "i") (VMOVNTQ_U "i")
> +  (VMULLBQ_INT_S "s") (VMULLBQ_INT_U "u")
> +  (VMULLTQ_INT_S "s") (VMULLTQ_INT_U "u")
>(VNEGQ_M_S "s")
>(VQABSQ_M_S "s")
>(VQMOVNBQ_M_S "s") (VQMOVNBQ_M_U "u")
> @@ -2330,6 +2339,10 @@
>  (VMLADAVQ_U "u")

RE: [PATCH] arm: [MVE intrinsics] Remove dead check for float type in parse_element_type

2023-08-22 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Monday, August 14, 2023 7:10 PM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw ; Richard Sandiford
> 
> Cc: Christophe Lyon 
> Subject: [PATCH] arm: [MVE intrinsics] Remove dead check for float type in
> parse_element_type
> 
> Fix a likely copy/paste error, where we check if ch == 'f' after we
> checked it's either 's' or 'u'.
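(Right: inside the `ch == 's' || ch == 'u'` block the `ch == 'f'` test can
never be true, so the TYPE_float arm was dead code.)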

Ok.
Thanks,
Kyrill

> 
> 2023-08-14  Christophe Lyon  
> 
>   gcc/
>   * config/arm/arm-mve-builtins-shapes.cc (parse_element_type):
>   Remove dead check.
> ---
>  gcc/config/arm/arm-mve-builtins-shapes.cc | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/gcc/config/arm/arm-mve-builtins-shapes.cc b/gcc/config/arm/arm-
> mve-builtins-shapes.cc
> index 1633084608e..23eb9d0e69b 100644
> --- a/gcc/config/arm/arm-mve-builtins-shapes.cc
> +++ b/gcc/config/arm/arm-mve-builtins-shapes.cc
> @@ -80,8 +80,7 @@ parse_element_type (const function_instance
> , const char *)
> 
>if (ch == 's' || ch == 'u')
>  {
> -  type_class_index tclass = (ch == 'f' ? TYPE_float
> -  : ch == 's' ? TYPE_signed
> +  type_class_index tclass = (ch == 's' ? TYPE_signed
>: TYPE_unsigned);
>char *end;
>unsigned int bits = strtol (format, , 10);
> --
> 2.34.1



RE: [PATCH] arm: [MVE intrinsics] fix binary_acca_int32 and binary_acca_int64 shapes

2023-08-22 Thread Kyrylo Tkachov via Gcc-patches
Hi Christophe,

> -Original Message-
> From: Christophe Lyon 
> Sent: Monday, August 14, 2023 7:01 PM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw ; Richard Sandiford
> 
> Cc: Christophe Lyon 
> Subject: [PATCH] arm: [MVE intrinsics] fix binary_acca_int32 and
> binary_acca_int64 shapes
> 
> Fix these two shapes, where we were failing to check the last
> non-predicate parameter.
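(Indeed: with `last_arg = i` the loop stopped one argument short, so a
mismatched type in the final vector argument was silently accepted.)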

Ok.
Thanks,
Kyrill

> 
> 2023-08-14  Christophe Lyon  
> 
>   gcc/
>   * config/arm/arm-mve-builtins-shapes.cc (binary_acca_int32): Fix
> loop bound.
>   (binary_acca_int64): Likewise.
> ---
>  gcc/config/arm/arm-mve-builtins-shapes.cc | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/config/arm/arm-mve-builtins-shapes.cc b/gcc/config/arm/arm-
> mve-builtins-shapes.cc
> index 6d477a84330..1633084608e 100644
> --- a/gcc/config/arm/arm-mve-builtins-shapes.cc
> +++ b/gcc/config/arm/arm-mve-builtins-shapes.cc
> @@ -455,7 +455,7 @@ struct binary_acca_int32_def : public
> overloaded_base<0>
>   || (type = r.infer_vector_type (1)) == NUM_TYPE_SUFFIXES)
>return error_mark_node;
> 
> -unsigned int last_arg = i;
> +unsigned int last_arg = i + 1;
>  for (i = 1; i < last_arg; i++)
>if (!r.require_matching_vector_type (i, type))
>   return error_mark_node;
> @@ -492,7 +492,7 @@ struct binary_acca_int64_def : public
> overloaded_base<0>
>   || (type = r.infer_vector_type (1)) == NUM_TYPE_SUFFIXES)
>return error_mark_node;
> 
> -unsigned int last_arg = i;
> +unsigned int last_arg = i + 1;
>  for (i = 1; i < last_arg; i++)
>if (!r.require_matching_vector_type (i, type))
>   return error_mark_node;
> --
> 2.34.1



RE: [PING][PATCH] arm: Remove unsigned variant of vcaddq_m

2023-08-21 Thread Kyrylo Tkachov via Gcc-patches
Ok.
Thanks,
Kyrill

From: Stam Markianos-Wright  
Sent: Saturday, August 19, 2023 12:42 PM
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov ; Richard Earnshaw 

Subject: [PING][PATCH] arm: Remove unsigned variant of vcaddq_m



(Pinging since I realised that this is required for my later Low Overhead Loop 
patch series to work)

Ok for trunk with the updated changelog that Christophe mentioned?

Thanks,
Stamatis/Stam Markianos-Wright 


From: Stam Markianos-Wright
Sent: Tuesday, August 1, 2023 6:21 PM
To: gcc-patches@gcc.gnu.org
Cc: Richard Earnshaw <richard.earns...@arm.com>; Kyrylo Tkachov
<kyrylo.tkac...@arm.com>
Subject: arm: Remove unsigned variant of vcaddq_m 
 
Hi all,

The unsigned variants of the vcaddq_m operation are not needed within the
compiler, as the assembly output of the signed and unsigned versions of the
ops is identical: with a `.i` suffix (as opposed to separate `.s` and `.u`
suffixes).
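
For example (illustrative, not taken from the patch), both
vcaddq_rot90_m_s32 and vcaddq_rot90_m_u32 lower to the same predicated
instruction:

    vpst
    vcaddt.i32   q0, q1, q2, #90

so a single pattern per rotation is all the compiler needs internally.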

Tested with baremetal arm-none-eabi on Arm's fastmodels.

Ok for trunk?

Thanks,
Stamatis Markianos-Wright

gcc/ChangeLog:

     * config/arm/arm-mve-builtins-base.cc (vcaddq_rot90, vcaddq_rot270):
       Use common insn for signed and unsigned front-end definitions.
     * config/arm/arm_mve_builtins.def
       (vcaddq_rot90_m_u, vcaddq_rot270_m_u): Make common.
       (vcaddq_rot90_m_s, vcaddq_rot270_m_s): Remove.
     * config/arm/iterators.md (mve_insn): Merge signed and unsigned defs.
       (isu): Likewise.
       (rot): Likewise.
       (mve_rot): Likewise.
       (supf): Likewise.
       (VxCADDQ_M): Likewise.
     * config/arm/unspecs.md (unspec): Likewise.
---
  gcc/config/arm/arm-mve-builtins-base.cc |  4 ++--
  gcc/config/arm/arm_mve_builtins.def |  6 ++---
  gcc/config/arm/iterators.md | 30 +++--
  gcc/config/arm/mve.md   |  4 ++--
  gcc/config/arm/unspecs.md   |  6 ++---
  5 files changed, 21 insertions(+), 29 deletions(-)

diff --git a/gcc/config/arm/arm-mve-builtins-base.cc 
b/gcc/config/arm/arm-mve-builtins-base.cc
index e31095ae112..426a87e9852 100644
--- a/gcc/config/arm/arm-mve-builtins-base.cc
+++ b/gcc/config/arm/arm-mve-builtins-base.cc
@@ -260,8 +260,8 @@ FUNCTION_PRED_P_S_U (vaddvq, VADDVQ)
  FUNCTION_PRED_P_S_U (vaddvaq, VADDVAQ)
  FUNCTION_WITH_RTX_M (vandq, AND, VANDQ)
  FUNCTION_ONLY_N (vbrsrq, VBRSRQ)
-FUNCTION (vcaddq_rot90, unspec_mve_function_exact_insn_rot, 
(UNSPEC_VCADD90, UNSPEC_VCADD90, UNSPEC_VCADD90, VCADDQ_ROT90_M_S, 
VCADDQ_ROT90_M_U, VCADDQ_ROT90_M_F))
-FUNCTION (vcaddq_rot270, unspec_mve_function_exact_insn_rot, 
(UNSPEC_VCADD270, UNSPEC_VCADD270, UNSPEC_VCADD270, VCADDQ_ROT270_M_S, 
VCADDQ_ROT270_M_U, VCADDQ_ROT270_M_F))
+FUNCTION (vcaddq_rot90, unspec_mve_function_exact_insn_rot, 
(UNSPEC_VCADD90, UNSPEC_VCADD90, UNSPEC_VCADD90, VCADDQ_ROT90_M, 
VCADDQ_ROT90_M, VCADDQ_ROT90_M_F))
+FUNCTION (vcaddq_rot270, unspec_mve_function_exact_insn_rot, 
(UNSPEC_VCADD270, UNSPEC_VCADD270, UNSPEC_VCADD270, VCADDQ_ROT270_M, 
VCADDQ_ROT270_M, VCADDQ_ROT270_M_F))
  FUNCTION (vcmlaq, unspec_mve_function_exact_insn_rot, (-1, -1, 
UNSPEC_VCMLA, -1, -1, VCMLAQ_M_F))
  FUNCTION (vcmlaq_rot90, unspec_mve_function_exact_insn_rot, (-1, -1, 
UNSPEC_VCMLA90, -1, -1, VCMLAQ_ROT90_M_F))
  FUNCTION (vcmlaq_rot180, unspec_mve_function_exact_insn_rot, (-1, -1, 
UNSPEC_VCMLA180, -1, -1, VCMLAQ_ROT180_M_F))
diff --git a/gcc/config/arm/arm_mve_builtins.def 
b/gcc/config/arm/arm_mve_builtins.def
index 43dacc3dda1..6ac1812c697 100644
--- a/gcc/config/arm/arm_mve_builtins.def
+++ b/gcc/config/arm/arm_mve_builtins.def
@@ -523,8 +523,8 @@ VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, 
vhsubq_m_n_u, v16qi, v8hi, v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vhaddq_m_u, v16qi, v8hi, v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vhaddq_m_n_u, v16qi, v8hi, 
v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, veorq_m_u, v16qi, v8hi, v4si)
-VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot90_m_u, v16qi, 
v8hi, v4si)
-VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot270_m_u, v16qi, 
v8hi, v4si)
+VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot90_m_, v16qi, 
v8hi, v4si)
+VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot270_m_, v16qi, 
v8hi, v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vbicq_m_u, v16qi, v8hi, v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vandq_m_u, v16qi, v8hi, v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vaddq_m_u, v16qi, v8hi, v4si)
@@ -587,8 +587,6 @@ VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, 
vhcaddq_rot270_m_s, v16qi, v8hi, v4si)
  VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vhaddq_m_s, v16qi, v8hi, v4si)
  VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vhaddq_m_n_s, v16qi, v8hi, v4si)
  VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, veorq_m_s, v16qi, v8hi, v4si)
-VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vcaddq_rot90_m_s, v16qi, v8hi, v4si)
-VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vcaddq_rot270_m_s, v16qi, v8hi, v4si)

RE: [PATCH 1/6] arm: [MVE intrinsics] Factorize vcaddq vhcaddq

2023-07-14 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, July 13, 2023 11:22 AM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw ; Richard Sandiford
> 
> Cc: Christophe Lyon 
> Subject: [PATCH 1/6] arm: [MVE intrinsics] Factorize vcaddq vhcaddq
> 
> Factorize vcaddq, vhcaddq so that they use the same parameterized
> names.
> 
> To be able to use the same patterns, we add a suffix to vcaddq.
> 
> Note that vcadd uses UNSPEC_VCADDxx for builtins without predication,
> and VCADDQ_ROTxx_M_x (that is, not starting with "UNSPEC_").  The
> UNPEC_* names are also used by neon.md

Thanks for working on this.
The series is ok.
Kyrill

> 
> 2023-07-13  Christophe Lyon  
> 
>   gcc/
>   * config/arm/arm_mve_builtins.def (vcaddq_rot90_,
> vcaddq_rot270_)
>   (vcaddq_rot90_f, vcaddq_rot90_f): Add "_" or "_f" suffix.
>   * config/arm/iterators.md (mve_insn): Add vcadd, vhcadd.
>   (isu): Add UNSPEC_VCADD90, UNSPEC_VCADD270,
> VCADDQ_ROT270_M_U,
>   VCADDQ_ROT270_M_S, VCADDQ_ROT90_M_U,
> VCADDQ_ROT90_M_S,
>   VHCADDQ_ROT90_M_S, VHCADDQ_ROT270_M_S,
> VHCADDQ_ROT90_S,
>   VHCADDQ_ROT270_S.
>   (rot): Add VCADDQ_ROT90_M_F, VCADDQ_ROT90_M_S,
> VCADDQ_ROT90_M_U,
>   VCADDQ_ROT270_M_F, VCADDQ_ROT270_M_S,
> VCADDQ_ROT270_M_U,
>   VHCADDQ_ROT90_S, VHCADDQ_ROT270_S, VHCADDQ_ROT90_M_S,
>   VHCADDQ_ROT270_M_S.
>   (mve_rot): Add VCADDQ_ROT90_M_F, VCADDQ_ROT90_M_S,
>   VCADDQ_ROT90_M_U, VCADDQ_ROT270_M_F,
> VCADDQ_ROT270_M_S,
>   VCADDQ_ROT270_M_U, VHCADDQ_ROT90_S, VHCADDQ_ROT270_S,
>   VHCADDQ_ROT90_M_S, VHCADDQ_ROT270_M_S.
>   (supf): Add VHCADDQ_ROT90_M_S, VHCADDQ_ROT270_M_S,
>   VHCADDQ_ROT90_S, VHCADDQ_ROT270_S, UNSPEC_VCADD90,
>   UNSPEC_VCADD270.
>   (VCADDQ_ROT270_M): Delete.
>   (VCADDQ_M_F VxCADDQ VxCADDQ_M): New.
>   (VCADDQ_ROT90_M): Delete.
>   * config/arm/mve.md (mve_vcaddq)
>   (mve_vhcaddq_rot270_s, mve_vhcaddq_rot90_s):
> Merge
>   into ...
>   (@mve_q_): ... this.
>   (mve_vcaddq): Rename into ...
>   (@mve_q_f): ... this
>   (mve_vcaddq_rot270_m_)
>   (mve_vcaddq_rot90_m_,
> mve_vhcaddq_rot270_m_s)
>   (mve_vhcaddq_rot90_m_s): Merge into ...
>   (@mve_q_m_): ... this.
>   (mve_vcaddq_rot270_m_f, mve_vcaddq_rot90_m_f):
> Merge
>   into ...
>   (@mve_q_m_f): ... this.
> ---
>  gcc/config/arm/arm_mve_builtins.def |   6 +-
>  gcc/config/arm/iterators.md |  38 +++-
>  gcc/config/arm/mve.md   | 135 +---
>  3 files changed, 62 insertions(+), 117 deletions(-)
> 
> diff --git a/gcc/config/arm/arm_mve_builtins.def
> b/gcc/config/arm/arm_mve_builtins.def
> index 8de765de3b0..63ad1845593 100644
> --- a/gcc/config/arm/arm_mve_builtins.def
> +++ b/gcc/config/arm/arm_mve_builtins.def
> @@ -187,6 +187,10 @@ VAR3 (BINOP_NONE_NONE_NONE, vmaxvq_s, v16qi,
> v8hi, v4si)
>  VAR3 (BINOP_NONE_NONE_NONE, vmaxq_s, v16qi, v8hi, v4si)
>  VAR3 (BINOP_NONE_NONE_NONE, vhsubq_s, v16qi, v8hi, v4si)
>  VAR3 (BINOP_NONE_NONE_NONE, vhsubq_n_s, v16qi, v8hi, v4si)
> +VAR3 (BINOP_NONE_NONE_NONE, vcaddq_rot90_, v16qi, v8hi, v4si)
> +VAR3 (BINOP_NONE_NONE_NONE, vcaddq_rot270_, v16qi, v8hi, v4si)
> +VAR2 (BINOP_NONE_NONE_NONE, vcaddq_rot90_f, v8hf, v4sf)
> +VAR2 (BINOP_NONE_NONE_NONE, vcaddq_rot270_f, v8hf, v4sf)
>  VAR3 (BINOP_NONE_NONE_NONE, vhcaddq_rot90_s, v16qi, v8hi, v4si)
>  VAR3 (BINOP_NONE_NONE_NONE, vhcaddq_rot270_s, v16qi, v8hi, v4si)
>  VAR3 (BINOP_NONE_NONE_NONE, vhaddq_s, v16qi, v8hi, v4si)
> @@ -870,8 +874,6 @@ VAR3
> (QUADOP_UNONE_UNONE_UNONE_IMM_PRED, vshlcq_m_vec_u, v16qi,
> v8hi, v4si)
>  VAR3 (QUADOP_UNONE_UNONE_UNONE_IMM_PRED, vshlcq_m_carry_u,
> v16qi, v8hi, v4si)
> 
>  /* optabs without any suffixes.  */
> -VAR5 (BINOP_NONE_NONE_NONE, vcaddq_rot90, v16qi, v8hi, v4si, v8hf,
> v4sf)
> -VAR5 (BINOP_NONE_NONE_NONE, vcaddq_rot270, v16qi, v8hi, v4si, v8hf,
> v4sf)
>  VAR2 (BINOP_NONE_NONE_NONE, vcmulq_rot90, v8hf, v4sf)
>  VAR2 (BINOP_NONE_NONE_NONE, vcmulq_rot270, v8hf, v4sf)
>  VAR2 (BINOP_NONE_NONE_NONE, vcmulq_rot180, v8hf, v4sf)
> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
> index 9e77af55d60..da1ead34e58 100644
> --- a/gcc/config/arm/iterators.md
> +++ b/gcc/config/arm/iterators.md
> @@ -902,6 +902,7 @@
>])
> 
>  (define_int_attr mve_insn [
> +  (UNSPEC_VCADD90 "vcadd") (UNSPEC_VCADD270 "vcadd")
>(VABAVQ_P_S "vabav") (VABAVQ_P_U "vabav")
>(VABAVQ_S "vabav") (VABAVQ_U "vabav")

RE: [PATCH 2/2] [testsuite, arm]: Make mve_fp_fpu[12].c accept single or double precision FPU

2023-07-14 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, July 13, 2023 11:22 AM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw 
> Cc: Christophe Lyon 
> Subject: [PATCH 2/2] [testsuite,arm]: Make mve_fp_fpu[12].c accept single or
> double precision FPU
> 
> These tests currently expect a directive containing .fpu fpv5-sp-d16
> and thus may fail if the test is executed for instance with
> -march=armv8.1-m.main+mve.fp+fp.dp
> 
> This patch accepts either fpv5-sp-d16 or fpv5-d16 to avoid the failure.
> 
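(Indeed: with a double-precision FPU, e.g. +fp.dp as above, the compiler
emits `.fpu fpv5-d16`, so insisting on the -sp variant was too strict.)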

Ok.
Thanks,
Kyrill

> 2023-06-28  Christophe Lyon  
> 
>   gcc/testsuite/
>   * gcc.target/arm/mve/intrinsics/mve_fp_fpu1.c: Fix .fpu
>   scan-assembler.
>   * gcc.target/arm/mve/intrinsics/mve_fp_fpu2.c: Likewise.
> ---
>  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu1.c | 2 +-
>  gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu2.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu1.c
> b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu1.c
> index e375327fb97..8358a616bb5 100644
> --- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu1.c
> +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu1.c
> @@ -12,4 +12,4 @@ foo1 (int8x16_t value)
>return b;
>  }
> 
> -/* { dg-final { scan-assembler "\.fpu fpv5-sp-d16" }  } */
> +/* { dg-final { scan-assembler "\.fpu fpv5(-sp|)-d16" }  } */
> diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu2.c
> b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu2.c
> index 1fca1100cf0..5dd2feefc35 100644
> --- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu2.c
> +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_fp_fpu2.c
> @@ -12,4 +12,4 @@ foo1 (int8x16_t value)
>return b;
>  }
> 
> -/* { dg-final { scan-assembler "\.fpu fpv5-sp-d16" }  } */
> +/* { dg-final { scan-assembler "\.fpu fpv5(-sp|)-d16" }  } */
> --
> 2.34.1



RE: [PATCH 1/2] [testsuite,arm]: Make nomve_fp_1.c require arm_fp

2023-07-14 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, July 13, 2023 11:22 AM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw 
> Cc: Christophe Lyon 
> Subject: [PATCH 1/2] [testsuite,arm]: Make nomve_fp_1.c require arm_fp
> 
> If GCC is configured with the default (soft) -mfloat-abi, and we don't
> override the target_board test flags appropriately,
> gcc.target/arm/mve/general-c/nomve_fp_1.c fails for lack of
> -mfloat-abi=softfp or -mfloat-abi=hard, because it doesn't use
> dg-add-options arm_v8_1m_mve (on purpose, see comment in the test).
> 
> Require and use the options needed for arm_fp to fix this problem.

Ok.
Thanks,
Kyrill

> 
> 2023-06-28  Christophe Lyon  
> 
>   gcc/testsuite/
>   * gcc.target/arm/mve/general-c/nomve_fp_1.c: Require arm_fp.
> ---
>  gcc/testsuite/gcc.target/arm/mve/general-c/nomve_fp_1.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/gcc/testsuite/gcc.target/arm/mve/general-c/nomve_fp_1.c
> b/gcc/testsuite/gcc.target/arm/mve/general-c/nomve_fp_1.c
> index 21c2af16a61..c9d279ead68 100644
> --- a/gcc/testsuite/gcc.target/arm/mve/general-c/nomve_fp_1.c
> +++ b/gcc/testsuite/gcc.target/arm/mve/general-c/nomve_fp_1.c
> @@ -1,9 +1,11 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target arm_v8_1m_mve_ok } */
> +/* { dg-require-effective-target arm_fp_ok } */
>  /* Do not use dg-add-options arm_v8_1m_mve, because this might expand
> to "",
> which could imply mve+fp depending on the user settings. We want to
> make
> sure the '+fp' extension is not enabled.  */
>  /* { dg-options "-mfpu=auto -march=armv8.1-m.main+mve" } */
> +/* { dg-add-options arm_fp } */
> 
>  #include 
> 
> --
> 2.34.1



RE: [PATCH] testsuite: Add _link flavor for several arm_arch* and arm* effective-targets

2023-07-10 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Christophe Lyon 
> Sent: Monday, July 10, 2023 2:59 PM
> To: Kyrylo Tkachov 
> Cc: gcc-patches@gcc.gnu.org; Richard Earnshaw
> 
> Subject: Re: [PATCH] testsuite: Add _link flavor for several arm_arch* and
> arm* effective-targets
> 
> 
> 
> On Mon, 10 Jul 2023 at 15:46, Kyrylo Tkachov <kyrylo.tkac...@arm.com> wrote:
> 
> 
> 
> 
>   > -Original Message-
>   > From: Christophe Lyon <christophe.l...@linaro.org>
>   > Sent: Friday, July 7, 2023 8:52 AM
>   > To: gcc-patches@gcc.gnu.org;
> Kyrylo Tkachov <kyrylo.tkac...@arm.com>;
>   > Richard Earnshaw <richard.earns...@arm.com>
>   > Cc: Christophe Lyon <christophe.l...@linaro.org>
>   > Subject: [PATCH] testsuite: Add _link flavor for several arm_arch*
> and arm*
>   > effective-targets
>   >
>   > For arm targets, we generate many effective-targets with
>   > check_effective_target_FUNC_multilib and
>   > check_effective_target_arm_arch_FUNC_multilib which check if we
> can
>   > link and execute a simple program with a given set of
> flags/multilibs.
>   >
>   > In some cases however, it's possible to link but not to execute a
>   > program, so this patch adds similar _link effective-targets which only
>   > check if link succeeds.
>   >
>   > The patch does not uupdate the documentation as it already lacks
> the
>   > numerous existing related effective-targets.
> 
>   I think this looks ok but...
> 
>   >
>   > 2023-07-07  Christophe Lyon  <christophe.l...@linaro.org>
>   >
>   >   gcc/testsuite/
>   >   * lib/target-supports.exp (arm_*FUNC_link): New effective-
> targets.
>   > ---
>   >  gcc/testsuite/lib/target-supports.exp | 27
> +++
>   >  1 file changed, 27 insertions(+)
>   >
>   > diff --git a/gcc/testsuite/lib/target-supports.exp
> b/gcc/testsuite/lib/target-
>   > supports.exp
>   > index c04db2be7f9..d33bc077418 100644
>   > --- a/gcc/testsuite/lib/target-supports.exp
>   > +++ b/gcc/testsuite/lib/target-supports.exp
>   > @@ -5129,6 +5129,14 @@ foreach { armfunc armflag armdefs } {
>   >   return "$flags FLAG"
>   >   }
>   >
>   > +proc check_effective_target_arm_arch_FUNC_link { } {
>   > + return [check_no_compiler_messages arm_arch_FUNC_link
>   > executable {
>   > + #include 
>   > + int dummy;
>   > + int main (void) { return 0; }
>   > + } [add_options_for_arm_arch_FUNC ""]]
>   > + }
>   > +
>   >   proc check_effective_target_arm_arch_FUNC_multilib { } {
>   >   return [check_runtime arm_arch_FUNC_multilib {
>   >   int
>   > @@ -5906,6 +5914,7 @@ proc
> add_options_for_arm_v8_2a_bf16_neon {
>   > flags } {
>   >  #   arm_v8m_main_cde: Armv8-m CDE (Custom Datapath
> Extension).
>   >  #   arm_v8m_main_cde_fp: Armv8-m CDE with FP registers.
>   >  #   arm_v8_1m_main_cde_mve: Armv8.1-m CDE with MVE.
>   > +#   arm_v8_1m_main_cde_mve_fp: Armv8.1-m CDE with MVE with
> FP
>   > support.
>   >  # Usage:
>   >  #   /* { dg-require-effective-target arm_v8m_main_cde_ok } */
>   >  #   /* { dg-add-options arm_v8m_main_cde } */
>   > @@ -5965,6 +5974,24 @@ foreach { armfunc armflag armdef
> arminc } {
>   >   return "$flags $et_FUNC_flags"
>   >   }
>   >
>   > +proc check_effective_target_FUNC_link { } {
>   > + if { ! [check_effective_target_FUNC_ok] } {
>   > + return 0;
>   > + }
>   > + return [check_no_compiler_messages FUNC_link executable {
>   > + #if !(DEF)
>   > + #error "DEF failed"
>   > + #endif
>   > + #include <arm_cde.h>
> 
>   ... why is arm_cde.h included here?
> 
> 
> 
> It's the very same code as  check_effective_target_FUNC_multilib below.
> 
> I think it's needed in case the toolchain's default configuration is not
> able to support CDE. I believe these tests would fail if the toolchain 
> defaults
> 

RE: [PATCH v2] arm: Fix MVE intrinsics support with LTO (PR target/110268)

2023-07-10 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Monday, July 10, 2023 2:09 PM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw 
> Cc: Christophe Lyon 
> Subject: [PATCH v2] arm: Fix MVE intrinsics support with LTO (PR
> target/110268)
> 
> After the recent MVE intrinsics re-implementation, LTO stopped working
> because the intrinsics would no longer be defined.
> 
> The main part of the patch is simple and similar to what we do for
> AArch64:
> - call handle_arm_mve_h() from arm_init_mve_builtins to declare the
>   intrinsics when the compiler is in LTO mode
> - actually implement arm_builtin_decl for MVE.
> 
> It was just a bit tricky to handle __ARM_MVE_PRESERVE_USER_NAMESPACE:
> its value in the user code cannot be guessed at LTO time, so we always
> have to assume that it was not defined.  This led to a few fixes in the
> way we register MVE builtins as placeholders or not.  Without this
> patch, we would just omit some versions of the intrinsics when
> __ARM_MVE_PRESERVE_USER_NAMESPACE is true. In fact, like for the C/C++
> placeholders, we need to always keep entries for all of them to ensure
> that we have a consistent numbering scheme.
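
(This mirrors the aarch64 SVE arrangement: arm_builtin_decl now simply
dispatches to arm_mve::builtin_decl, which indexes registered_functions
by subcode, hence the need for a numbering that stays stable whether or
not the macro was defined in the user's translation unit.)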

Ok.
Thanks,
Kyrill

> 
> 2023-06-26  Christophe Lyon   
> 
>   PR target/110268
>   gcc/
>   * config/arm/arm-builtins.cc (arm_init_mve_builtins): Handle LTO.
>   (arm_builtin_decl): Handle MVE builtins.
>   * config/arm/arm-mve-builtins.cc (builtin_decl): New function.
>   (add_unique_function): Fix handling of
>   __ARM_MVE_PRESERVE_USER_NAMESPACE.
>   (add_overloaded_function): Likewise.
>   * config/arm/arm-protos.h (builtin_decl): New declaration.
> 
>   gcc/testsuite/
>   * gcc.target/arm/pr110268-1.c: New test.
>   * gcc.target/arm/pr110268-2.c: New test.
> ---
>  gcc/config/arm/arm-builtins.cc| 11 +++-
>  gcc/config/arm/arm-mve-builtins.cc| 61 ---
>  gcc/config/arm/arm-protos.h   |  1 +
>  gcc/testsuite/gcc.target/arm/pr110268-1.c | 12 +
>  gcc/testsuite/gcc.target/arm/pr110268-2.c | 23 +
>  5 files changed, 78 insertions(+), 30 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/arm/pr110268-1.c
>  create mode 100644 gcc/testsuite/gcc.target/arm/pr110268-2.c
> 
> diff --git a/gcc/config/arm/arm-builtins.cc b/gcc/config/arm/arm-builtins.cc
> index 36365e40a5b..fca7dcaf565 100644
> --- a/gcc/config/arm/arm-builtins.cc
> +++ b/gcc/config/arm/arm-builtins.cc
> @@ -1918,6 +1918,15 @@ arm_init_mve_builtins (void)
>arm_builtin_datum *d = _builtin_data[i];
>arm_init_builtin (fcode, d, "__builtin_mve");
>  }
> +
> +  if (in_lto_p)
> +{
> +  arm_mve::handle_arm_mve_types_h ();
> +  /* Under LTO, we cannot know whether
> +  __ARM_MVE_PRESERVE_USER_NAMESPACE was defined, so assume
> it
> +  was not.  */
> +  arm_mve::handle_arm_mve_h (false);
> +}
>  }
> 
>  /* Set up all the NEON builtins, even builtins for instructions that are not
> @@ -2723,7 +2732,7 @@ arm_builtin_decl (unsigned code, bool initialize_p
> ATTRIBUTE_UNUSED)
>  case ARM_BUILTIN_GENERAL:
>return arm_general_builtin_decl (subcode);
>  case ARM_BUILTIN_MVE:
> -  return error_mark_node;
> +  return arm_mve::builtin_decl (subcode);
>  default:
>gcc_unreachable ();
>  }
> diff --git a/gcc/config/arm/arm-mve-builtins.cc b/gcc/config/arm/arm-mve-
> builtins.cc
> index 7033e41a571..413d8100607 100644
> --- a/gcc/config/arm/arm-mve-builtins.cc
> +++ b/gcc/config/arm/arm-mve-builtins.cc
> @@ -493,6 +493,16 @@ handle_arm_mve_h (bool
> preserve_user_namespace)
>preserve_user_namespace);
>  }
> 
> +/* Return the function decl with MVE function subcode CODE, or
> error_mark_node
> +   if no such function exists.  */
> +tree
> +builtin_decl (unsigned int code)
> +{
> +  if (code >= vec_safe_length (registered_functions))
> +return error_mark_node;
> +  return (*registered_functions)[code]->decl;
> +}
> +
>  /* Return true if CANDIDATE is equivalent to MODEL_TYPE for overloading
> purposes.  */
>  static bool
> @@ -849,7 +859,6 @@ function_builder::add_function (const
> function_instance ,
>  ? integer_zero_node
>  : simulate_builtin_function_decl (input_location, name, fntype,
> code, NULL, attrs);
> -
>registered_function  = *ggc_alloc  ();
>rfn.instance = instance;
>rfn.decl = decl;
> @@ -889,15 +898,12 @@ function_builder::add_unique_function (const
> function_instance ,
>gcc_assert (!*rfn_s

RE: [PATCH] doc: Document arm_v8_1m_main_cde_mve_fp

2023-07-10 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Friday, July 7, 2023 8:52 AM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw 
> Cc: Christophe Lyon 
> Subject: [PATCH] doc: Document arm_v8_1m_main_cde_mve_fp
> 
> The arm_v8_1m_main_cde_mve_fp family of effective targets was not
> documented when it was introduced.
> 
> 2023-07-07  Christophe Lyon  
> 
>   gcc/
>   * doc/sourcebuild.texi (arm_v8_1m_main_cde_mve_fp): Document.
> ---
>  gcc/doc/sourcebuild.texi | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
> index 526020c7511..03fb2394705 100644
> --- a/gcc/doc/sourcebuild.texi
> +++ b/gcc/doc/sourcebuild.texi
> @@ -2190,6 +2190,12 @@ ARM target supports options to generate
> instructions from ARMv8.1-M with
>  the Custom Datapath Extension (CDE) and M-Profile Vector Extension (MVE).
>  Some multilibs may be incompatible with these options.
> 
> +@item arm_v8_1m_main_cde_mve_fp
> +ARM target supports options to generate instructions from ARMv8.1-M
> +with the Custom Datapath Extension (CDE) and M-Profile Vector
> +Extension (MVE) with floating-point support.  Some multilibs may be
> +incompatible with these options.

I know the GCC source is inconsistent on this but the proper branding these 
days is "ARM" -> "Arm" and "ARMv8.1-M" -> "Armv8.1-M".
Ok with those changes.
Thanks,
Kyrill

> +
>  @item arm_pacbti_hw
>  Test system supports executing Pointer Authentication and Branch Target
>  Identification instructions.
> --
> 2.34.1



RE: [PATCH] testsuite: Add _link flavor for several arm_arch* and arm* effective-targets

2023-07-10 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Friday, July 7, 2023 8:52 AM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw 
> Cc: Christophe Lyon 
> Subject: [PATCH] testsuite: Add _link flavor for several arm_arch* and arm*
> effective-targets
> 
> For arm targets, we generate many effective-targets with
> check_effective_target_FUNC_multilib and
> check_effective_target_arm_arch_FUNC_multilib which check if we can
> link and execute a simple program with a given set of flags/multilibs.
> 
> In some cases however, it's possible to link but not to execute a
> program, so this patch adds similar _link effective-targets which only
> check if link succeeds.
> 
> The patch does not update the documentation as it already lacks the
> numerous existing related effective-targets.

I think this looks ok but...

> 
> 2023-07-07  Christophe Lyon  
> 
>   gcc/testsuite/
>   * lib/target-supports.exp (arm_*FUNC_link): New effective-targets.
> ---
>  gcc/testsuite/lib/target-supports.exp | 27 +++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-
> supports.exp
> index c04db2be7f9..d33bc077418 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -5129,6 +5129,14 @@ foreach { armfunc armflag armdefs } {
>   return "$flags FLAG"
>   }
> 
> +proc check_effective_target_arm_arch_FUNC_link { } {
> + return [check_no_compiler_messages arm_arch_FUNC_link
> executable {
> + #include 
> + int dummy;
> + int main (void) { return 0; }
> + } [add_options_for_arm_arch_FUNC ""]]
> + }
> +
>   proc check_effective_target_arm_arch_FUNC_multilib { } {
>   return [check_runtime arm_arch_FUNC_multilib {
>   int
> @@ -5906,6 +5914,7 @@ proc add_options_for_arm_v8_2a_bf16_neon {
> flags } {
>  #   arm_v8m_main_cde: Armv8-m CDE (Custom Datapath Extension).
>  #   arm_v8m_main_cde_fp: Armv8-m CDE with FP registers.
>  #   arm_v8_1m_main_cde_mve: Armv8.1-m CDE with MVE.
> +#   arm_v8_1m_main_cde_mve_fp: Armv8.1-m CDE with MVE with FP
> support.
>  # Usage:
>  #   /* { dg-require-effective-target arm_v8m_main_cde_ok } */
>  #   /* { dg-add-options arm_v8m_main_cde } */
> @@ -5965,6 +5974,24 @@ foreach { armfunc armflag armdef arminc } {
>   return "$flags $et_FUNC_flags"
>   }
> 
> +proc check_effective_target_FUNC_link { } {
> + if { ! [check_effective_target_FUNC_ok] } {
> + return 0;
> + }
> + return [check_no_compiler_messages FUNC_link executable {
> + #if !(DEF)
> + #error "DEF failed"
> + #endif
> + #include <arm_cde.h>

... why is arm_cde.h included here?

> + INC
> + int
> + main (void)
> + {
> + return 0;
> + }
> + } [add_options_for_FUNC ""]]
> + }
> +
>   proc check_effective_target_FUNC_multilib { } {
>   if { ! [check_effective_target_FUNC_ok] } {
>   return 0;
> --
> 2.34.1



RE: [PATCH] arm: Fix MVE intrinsics support with LTO (PR target/110268)

2023-07-06 Thread Kyrylo Tkachov via Gcc-patches
Hi Christophe,

> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, July 6, 2023 4:21 PM
> To: Kyrylo Tkachov 
> Cc: gcc-patches@gcc.gnu.org; Richard Sandiford
> 
> Subject: Re: [PATCH] arm: Fix MVE intrinsics support with LTO (PR
> target/110268)
> 
> 
> 
> On Wed, 5 Jul 2023 at 19:07, Kyrylo Tkachov <kyrylo.tkac...@arm.com> wrote:
> 
> 
>   Hi Christophe,
> 
>   > -Original Message-
>   > From: Christophe Lyon <christophe.l...@linaro.org>
>   > Sent: Monday, June 26, 2023 4:03 PM
>   > To: gcc-patches@gcc.gnu.org;
> Kyrylo Tkachov <kyrylo.tkac...@arm.com>;
>   > Richard Sandiford <richard.sandif...@arm.com>
>   > Cc: Christophe Lyon <christophe.l...@linaro.org>
>   > Subject: [PATCH] arm: Fix MVE intrinsics support with LTO (PR
> target/110268)
>   >
>   > After the recent MVE intrinsics re-implementation, LTO stopped
> working
>   > because the intrinsics would no longer be defined.
>   >
>   > The main part of the patch is simple and similar to what we do for
>   > AArch64:
>   > - call handle_arm_mve_h() from arm_init_mve_builtins to declare
> the
>   >   intrinsics when the compiler is in LTO mode
>   > - actually implement arm_builtin_decl for MVE.
>   >
>   > It was just a bit tricky to handle
> __ARM_MVE_PRESERVE_USER_NAMESPACE:
>   > its value in the user code cannot be guessed at LTO time, so we
> always
>   > have to assume that it was not defined.  This led to a few fixes in the
>   > way we register MVE builtins as placeholders or not.  Without this
>   > patch, we would just omit some versions of the intrinsics when
>   > __ARM_MVE_PRESERVE_USER_NAMESPACE is true. In fact, like for
> the C/C++
>   > placeholders, we need to always keep entries for all of them to
> ensure
>   > that we have a consistent numbering scheme.
>   >
>   >   2023-06-26  Christophe Lyon  <christophe.l...@linaro.org>
>   >
>   >   PR target/110268
>   >   gcc/
>   >   * config/arm/arm-builtins.cc (arm_init_mve_builtins): Handle
> LTO.
>   >   (arm_builtin_decl): Hahndle MVE builtins.
>   >   * config/arm/arm-mve-builtins.cc (builtin_decl): New function.
>   >   (add_unique_function): Fix handling of
>   >   __ARM_MVE_PRESERVE_USER_NAMESPACE.
>   >   (add_overloaded_function): Likewise.
>   >   * config/arm/arm-protos.h (builtin_decl): New declaration.
>   >
>   >   gcc/testsuite/
>   >   * gcc.target/arm/pr110268-1.c: New test.
>   >   * gcc.target/arm/pr110268-2.c: New test.
>   > ---
>   >  gcc/config/arm/arm-builtins.cc| 11 +++-
>   >  gcc/config/arm/arm-mve-builtins.cc| 61 --
> -
>   >  gcc/config/arm/arm-protos.h   |  1 +
>   >  gcc/testsuite/gcc.target/arm/pr110268-1.c | 11 
>   >  gcc/testsuite/gcc.target/arm/pr110268-2.c | 22 
>   >  5 files changed, 76 insertions(+), 30 deletions(-)
>   >  create mode 100644 gcc/testsuite/gcc.target/arm/pr110268-1.c
>   >  create mode 100644 gcc/testsuite/gcc.target/arm/pr110268-2.c
>   >
>   > diff --git a/gcc/config/arm/arm-builtins.cc b/gcc/config/arm/arm-
> builtins.cc
>   > index 36365e40a5b..fca7dcaf565 100644
>   > --- a/gcc/config/arm/arm-builtins.cc
>   > +++ b/gcc/config/arm/arm-builtins.cc
>   > @@ -1918,6 +1918,15 @@ arm_init_mve_builtins (void)
>   >arm_builtin_datum *d = _builtin_data[i];
>   >arm_init_builtin (fcode, d, "__builtin_mve");
>   >  }
>   > +
>   > +  if (in_lto_p)
>   > +{
>   > +  arm_mve::handle_arm_mve_types_h ();
>   > +  /* Under LTO, we cannot know whether
>   > +  __ARM_MVE_PRESERVE_USER_NAMESPACE was defined, so
> assume
>   > it
>   > +  was not.  */
>   > +  arm_mve::handle_arm_mve_h (false);
>   > +}
>   >  }
>   >
>   >  /* Set up all the NEON builtins, even builtins for instructions that
> are not
>   > @@ -2723,7 +2732,7 @@ arm_builtin_decl (unsigned code, bool
> initialize_p
>   > ATTRIBUTE_UNUSED)
>   >  case ARM_BUILTIN_GENERAL:
>

RE: [PATCH] arm: Fix MVE intrinsics support with LTO (PR target/110268)

2023-07-05 Thread Kyrylo Tkachov via Gcc-patches
Hi Christophe,

> -Original Message-
> From: Christophe Lyon 
> Sent: Monday, June 26, 2023 4:03 PM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Sandiford 
> Cc: Christophe Lyon 
> Subject: [PATCH] arm: Fix MVE intrinsics support with LTO (PR target/110268)
> 
> After the recent MVE intrinsics re-implementation, LTO stopped working
> because the intrinsics would no longer be defined.
> 
> The main part of the patch is simple and similar to what we do for
> AArch64:
> - call handle_arm_mve_h() from arm_init_mve_builtins to declare the
>   intrinsics when the compiler is in LTO mode
> - actually implement arm_builtin_decl for MVE.
> 
> It was just a bit tricky to handle __ARM_MVE_PRESERVE_USER_NAMESPACE:
> its value in the user code cannot be guessed at LTO time, so we always
> have to assume that it was not defined.  This led to a few fixes in the
> way we register MVE builtins as placeholders or not.  Without this
> patch, we would just omit some versions of the intrinsics when
> __ARM_MVE_PRESERVE_USER_NAMESPACE is true. In fact, like for the C/C++
> placeholders, we need to always keep entries for all of them to ensure
> that we have a consistent numbering scheme.
> 
>   2023-06-26  Christophe Lyon   
> 
>   PR target/110268
>   gcc/
>   * config/arm/arm-builtins.cc (arm_init_mve_builtins): Handle LTO.
>   (arm_builtin_decl): Handle MVE builtins.
>   * config/arm/arm-mve-builtins.cc (builtin_decl): New function.
>   (add_unique_function): Fix handling of
>   __ARM_MVE_PRESERVE_USER_NAMESPACE.
>   (add_overloaded_function): Likewise.
>   * config/arm/arm-protos.h (builtin_decl): New declaration.
> 
>   gcc/testsuite/
>   * gcc.target/arm/pr110268-1.c: New test.
>   * gcc.target/arm/pr110268-2.c: New test.
> ---
>  gcc/config/arm/arm-builtins.cc| 11 +++-
>  gcc/config/arm/arm-mve-builtins.cc| 61 ---
>  gcc/config/arm/arm-protos.h   |  1 +
>  gcc/testsuite/gcc.target/arm/pr110268-1.c | 11 
>  gcc/testsuite/gcc.target/arm/pr110268-2.c | 22 
>  5 files changed, 76 insertions(+), 30 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/arm/pr110268-1.c
>  create mode 100644 gcc/testsuite/gcc.target/arm/pr110268-2.c
> 
> diff --git a/gcc/config/arm/arm-builtins.cc b/gcc/config/arm/arm-builtins.cc
> index 36365e40a5b..fca7dcaf565 100644
> --- a/gcc/config/arm/arm-builtins.cc
> +++ b/gcc/config/arm/arm-builtins.cc
> @@ -1918,6 +1918,15 @@ arm_init_mve_builtins (void)
>arm_builtin_datum *d = _builtin_data[i];
>arm_init_builtin (fcode, d, "__builtin_mve");
>  }
> +
> +  if (in_lto_p)
> +{
> +  arm_mve::handle_arm_mve_types_h ();
> +  /* Under LTO, we cannot know whether
> +  __ARM_MVE_PRESERVE_USER_NAMESPACE was defined, so assume
> it
> +  was not.  */
> +  arm_mve::handle_arm_mve_h (false);
> +}
>  }
> 
>  /* Set up all the NEON builtins, even builtins for instructions that are not
> @@ -2723,7 +2732,7 @@ arm_builtin_decl (unsigned code, bool initialize_p
> ATTRIBUTE_UNUSED)
>  case ARM_BUILTIN_GENERAL:
>return arm_general_builtin_decl (subcode);
>  case ARM_BUILTIN_MVE:
> -  return error_mark_node;
> +  return arm_mve::builtin_decl (subcode);
>  default:
>gcc_unreachable ();
>  }
> diff --git a/gcc/config/arm/arm-mve-builtins.cc b/gcc/config/arm/arm-mve-
> builtins.cc
> index 7033e41a571..e9a12f27411 100644
> --- a/gcc/config/arm/arm-mve-builtins.cc
> +++ b/gcc/config/arm/arm-mve-builtins.cc
> @@ -493,6 +493,16 @@ handle_arm_mve_h (bool
> preserve_user_namespace)
>preserve_user_namespace);
>  }
> 
> +/* Return the function decl with MVE function subcode CODE, or error_mark_node
> +   if no such function exists.  */
> +tree
> +builtin_decl (unsigned int code)
> +{
> +  if (code >= vec_safe_length (registered_functions))
> +return error_mark_node;
> +  return (*registered_functions)[code]->decl;
> +}
> +
>  /* Return true if CANDIDATE is equivalent to MODEL_TYPE for overloading
> purposes.  */
>  static bool
> @@ -849,7 +859,6 @@ function_builder::add_function (const
> function_instance ,
>  ? integer_zero_node
>  : simulate_builtin_function_decl (input_location, name, fntype,
> code, NULL, attrs);
> -
>registered_function  = *ggc_alloc  ();
>rfn.instance = instance;
>rfn.decl = decl;
> @@ -889,15 +898,12 @@ function_builder::add_unique_function (const
> function_instance ,
>gcc_assert (!*rfn_slot);
>*rfn_

[PATCH][committed] aarch64: Use <DWI> instead of <V2XWIDE> in scalar SQRSHRUN pattern

2023-06-26 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

In the scalar pattern for SQRSHRUN it's a bit clearer to use DWI instead of 
V2XWIDE
to make it clear that no vector modes are involved.
No behavioural change intended.

Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_sqrshrun_n_insn):
Use <DWI> instead of <V2XWIDE>.
(aarch64_sqrshrun_n): Likewise.


dwi.patch
Description: dwi.patch


[PATCH][committed] aarch64: Clean up some rounding immediate predicates

2023-06-26 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

aarch64_simd_rsra_rnd_imm_vec is now used for more than just RSRA
and accepts more than just vectors so rename it to make it more
truthful.
The aarch64_simd_rshrn_imm_vec is now unused and can be deleted.
No behavioural change intended.

Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-protos.h (aarch64_const_vec_rsra_rnd_imm_p):
Rename to...
(aarch64_rnd_imm_p): ... This.
* config/aarch64/predicates.md (aarch64_simd_rsra_rnd_imm_vec):
Rename to...
(aarch64_int_rnd_operand): ... This.
(aarch64_simd_rshrn_imm_vec): Delete.
* config/aarch64/aarch64-simd.md (aarch64_rsra_n_insn):
Adjust for the above.
(aarch64_rshr_n_insn): Likewise.
(*aarch64_rshrn_n_insn): Likewise.
(*aarch64_sqrshrun_n_insn): Likewise.
(aarch64_sqrshrun_n_insn): Likewise.
(aarch64_rshrn2_n_insn_le): Likewise.
(aarch64_rshrn2_n_insn_be): Likewise.
(aarch64_sqrshrun2_n_insn_le): Likewise.
(aarch64_sqrshrun2_n_insn_be): Likewise.
* config/aarch64/aarch64.cc (aarch64_const_vec_rsra_rnd_imm_p):
Rename to...
(aarch64_rnd_imm_p): ... This.


rnd-imm.patch
Description: rnd-imm.patch


[PATCH][committed] aarch64: Avoid same input and output Z register for gather loads

2023-06-21 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

The architecture recommends that load-gather instructions avoid using the same
Z register for the load address and the destination, and the Software 
Optimization
Guides for Arm cores recommend that as well.
This means that for code like:
#include 

svuint64_t
food (svbool_t p, uint64_t *in, svint64_t offsets, svuint64_t a)
{
  return svadd_u64_x (p, a, svld1_gather_offset(p, in, offsets));
}

we'll want to avoid generating the current:
food:
        ld1d    z0.d, p0/z, [x0, z0.d] // Z0 reused as input and output.
        add     z0.d, z1.d, z0.d
        ret

However, we still want to avoid generating extra moves where there were
none before, so the tight aarch64-sve-acle.exp tests for load gathers
should still pass as they are.

This patch implements that recommendation for the load gather patterns by:
* duplicating the alternatives
* marking the output operand as early clobber
* Tying the input Z register operand in the original alternatives to 0
* Penalising the original alternatives with '?'

This results in a large-ish patch in terms of diff lines but the new
compact syntax (thanks Tamar) makes it quite a readable and regular change.

The benchmark numbers on a Neoverse V1 on fprate look okay:
                diff
503.bwaves_r    0.00%
507.cactuBSSN_r 0.00%
508.namd_r      0.00%
510.parest_r    0.55%
511.povray_r    0.22%
519.lbm_r       0.00%
521.wrf_r       0.00%
526.blender_r   0.00%
527.cam4_r      0.56%
538.imagick_r   0.00%
544.nab_r       0.00%
549.fotonik3d_r 0.00%
554.roms_r      0.00%
fprate          0.10%

Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.
Thanks,
Kyrill

P.S. I had messed up my previous commit of 
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622456.html
by squashing the config/aarch64 changes with this patch.
I have reverted that one commit and reapplied it properly (as it should have 
been a no-op) and am pushing this commit on top of that.
Sorry for the churn in the repo.

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md 
(mask_gather_load):
Add alternatives to prefer to avoid same input and output Z register.
(mask_gather_load): Likewise.
(*mask_gather_load_xtw_unpacked): Likewise.
(*mask_gather_load_sxtw): Likewise.
(*mask_gather_load_uxtw): Likewise.
(@aarch64_gather_load_):
Likewise.

(@aarch64_gather_load_):
Likewise.
(*aarch64_gather_load_
_xtw_unpacked): Likewise.
(*aarch64_gather_load_
_sxtw): Likewise.
(*aarch64_gather_load_
_uxtw): Likewise.
(@aarch64_ldff1_gather): Likewise.
(@aarch64_ldff1_gather): Likewise.
(*aarch64_ldff1_gather_sxtw): Likewise.
(*aarch64_ldff1_gather_uxtw): Likewise.
(@aarch64_ldff1_gather_
): Likewise.
(@aarch64_ldff1_gather_
): Likewise.
(*aarch64_ldff1_gather_
_sxtw): Likewise.
(*aarch64_ldff1_gather_
_uxtw): Likewise.
* config/aarch64/aarch64-sve2.md (@aarch64_gather_ldnt): Likewise.
(@aarch64_gather_ldnt_
): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/gather_earlyclobber.c: New test.
* gcc.target/aarch64/sve2/gather_earlyclobber.c: New test.


gather-earlyclobber.patch
Description: gather-earlyclobber.patch


[PATCH][committed] aarch64: Convert SVE gather patterns to compact syntax

2023-06-21 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch converts the SVE load gather patterns to the new compact syntax
that Tamar introduced. This allows for a future patch I want to contribute
to add more alternatives that are better viewed in the more compact form.

The lines in some patterns are >80 long now, but I think that's unavoidable
and those patterns already had overly long constraint strings.

No functional change intended.
Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-sve.md 
(mask_gather_load):
Convert to compact alternatives syntax.
(mask_gather_load): Likewise.
(*mask_gather_load_xtw_unpacked): Likewise.
(*mask_gather_load_sxtw): Likewise.
(*mask_gather_load_uxtw): Likewise.
(@aarch64_gather_load_):
Likewise.

(@aarch64_gather_load_):
Likewise.
(*aarch64_gather_load_
_xtw_unpacked): Likewise.
(*aarch64_gather_load_
_sxtw): Likewise.
(*aarch64_gather_load_
_uxtw): Likewise.
(@aarch64_ldff1_gather): Likewise.
(@aarch64_ldff1_gather): Likewise.
(*aarch64_ldff1_gather_sxtw): Likewise.
(*aarch64_ldff1_gather_uxtw): Likewise.
(@aarch64_ldff1_gather_
): Likewise.
(@aarch64_ldff1_gather_
): Likewise.
(*aarch64_ldff1_gather_
_sxtw): Likewise.
(*aarch64_ldff1_gather_
_uxtw): Likewise.
* config/aarch64/aarch64-sve2.md (@aarch64_gather_ldnt): Likewise.
(@aarch64_gather_ldnt_
): Likewise.


gather-compact.patch
Description: gather-compact.patch


[PATCH][committed] aarch64: Optimise ADDP with same source operands

2023-06-20 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

We've been asked to optimise the testcase in this patch: a 64-bit ADDP of
the low and high halves of the same 128-bit vector. This can be done by a
single .4s ADDP followed by just reading the bottom 64 bits. A splitter for
this is quite straightforward now that all the vec_concat stuff is collapsed
by simplify-rtx.

With this patch we generate a single:
        addp    v0.4s, v0.4s, v0.4s
instead of:
        dup     d31, v0.d[1]
        addp    v0.2s, v0.2s, v31.2s
        ret
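
For reference, a source pattern that triggers the new splitter (a hypothetical
reconstruction; the committed addp-same-low_1.c test may differ in details):

#include <arm_neon.h>

int32x2_t
foo (int32x4_t a)
{
  /* Pairwise add of the low and high halves of the same vector.  */
  return vpadd_s32 (vget_low_s32 (a), vget_high_s32 (a));
}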

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (*aarch64_addp_same_reg):
New define_insn_and_split.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/addp-same-low_1.c: New test.


addp-q.patch
Description: addp-q.patch


[PATCH][4/5] aarch64: [US]Q(R)SHR(U)N2 refactoring

2023-06-16 Thread Kyrylo Tkachov via Gcc-patches
This patch is large in lines of code, but it is a fairly regular
extension of the first patch as it converts the high-half patterns
to standard RTL codes in the same fashion as the first patch did for the
low-half ones.
This now allows us to remove the unspec codes for these instructions as
there are no more uses of them left.

Bootstrapped and tested on aarch64-none-linux-gnu and
aarch64_be-none-elf.

gcc/ChangeLog:

* config/aarch64/aarch64-simd-builtins.def (shrn2): Rename builtins 
to...
(shrn2_n): ... This.
(rshrn2): Rename builtins to...
(rshrn2_n): ... This.
* config/aarch64/arm_neon.h (vrshrn_high_n_s16): Adjust for the above.
(vrshrn_high_n_s32): Likewise.
(vrshrn_high_n_s64): Likewise.
(vrshrn_high_n_u16): Likewise.
(vrshrn_high_n_u32): Likewise.
(vrshrn_high_n_u64): Likewise.
(vshrn_high_n_s16): Likewise.
(vshrn_high_n_s32): Likewise.
(vshrn_high_n_s64): Likewise.
(vshrn_high_n_u16): Likewise.
(vshrn_high_n_u32): Likewise.
(vshrn_high_n_u64): Likewise.
* config/aarch64/aarch64-simd.md (*aarch64_shrn2_vect_le):
Delete.
(*aarch64_shrn2_vect_be): Likewise.
(aarch64_shrn2_insn_le): Likewise.
(aarch64_shrn2_insn_be): Likewise.
(aarch64_shrn2): Likewise.
(aarch64_rshrn2_insn_le): Likewise.
(aarch64_rshrn2_insn_be): Likewise.
(aarch64_rshrn2): Likewise.
(aarch64_qshrn2_n_insn_le): Likewise.
(aarch64_shrn2_n_insn_le): New define_insn.
(aarch64_qshrn2_n_insn_be): Delete.
(aarch64_shrn2_n_insn_be): New define_insn.
(aarch64_qshrn2_n): Delete.
(aarch64_shrn2_n): New define_expand.
(aarch64_rshrn2_n_insn_le): New define_insn.
(aarch64_rshrn2_n_insn_be): New define_insn.
(aarch64_rshrn2_n): New define_expand.
(aarch64_sqshrun2_n_insn_le): New define_insn.
(aarch64_sqshrun2_n_insn_be): New define_insn.
(aarch64_sqshrun2_n): New define_expand.
(aarch64_sqrshrun2_n_insn_le): New define_insn.
(aarch64_sqrshrun2_n_insn_be): New define_insn.
(aarch64_sqrshrun2_n): New define_expand.
* config/aarch64/iterators.md (UNSPEC_SQSHRUN, UNSPEC_SQRSHRUN,
UNSPEC_SQSHRN, UNSPEC_UQSHRN, UNSPEC_SQRSHRN, UNSPEC_UQRSHRN):
Delete unspec values.
(VQSHRN_N): Delete int iterator.


s4.patch
Description: s4.patch


[PATCH][0/5][committed] aarch64: Reimplement [US]Q(R)SHR(U)N(2) patterns with standard RTL codes

2023-06-16 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch series reimplements the MD patterns for the instructions that
perform narrowing right shifts with optional rounding and saturation
using standard RTL codes rather than unspecs.  This includes the scalar
forms and the *2 forms that write to the high half of the result vector.
This allows us to get rid of a number of unspecs and should significantly
improve the simplification capabilities around these instructions.
I attempted to compress as many forms as possible with iterators and the
end result looks reasonably orthogonal with a few small exceptions described
in the individual patches.

The semantics are pretty well exercised by tests in advsimd-intrinsics.exp and
in many of those tests the intrinsics involved are now entirely evaluated at
compile-time and disappear from the output when optimisation is enabled. The
validation
against the reference numbers still passes (though I got many failures during
development as I was getting little things wrong, so the tests are working as
intended!).
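
As a concrete illustration (a hypothetical example, not one of the committed
tests), a call with constant inputs such as:

#include <arm_neon.h>

int8x8_t
fold_me (void)
{
  /* Each lane is 64 >> 2 == 16, in range for int8_t, so the whole
     call folds to a constant vector when optimising.  */
  return vqshrn_n_s16 (vdupq_n_s16 (64), 2);
}

now evaluates entirely at compile time under the new representation.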

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill


[PATCH][2/5] aarch64: [US]Q(R)SHR(U)N scalar forms refactoring

2023-06-16 Thread Kyrylo Tkachov via Gcc-patches
Some instructions from the previous patch have scalar forms:
SQSHRN,SQRSHRN,UQSHRN,UQRSHRN,SQSHRUN,SQRSHRUN.
This patch converts the patterns for these to use standard RTL codes.
Their MD patterns deviate slightly from the vector forms mostly due to
things like operands being scalar rather than vectors.
One nuance is in the SQSHRUN,SQRSHRUN patterns. These end in a truncate
to the scalar narrow mode e.g. SI -> QI.  This gets simplified by the
RTL passes to a subreg rather than keeping it as a truncate.
So we end up representing these without the truncate and in the expander
read the narrow subreg in order to comply with the expected width of the
intrinsic.
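
An example of a scalar form now expressed this way (illustrative only):

#include <arm_neon.h>

int8_t
scalar_qshrn (int16_t x)
{
  /* Scalar SQSHRN: shift right by 4 with signed saturating narrow;
     internally the truncate-of-shift RTL is read back as a subreg.  */
  return vqshrnh_n_s16 (x, 4);
}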

Bootstrapped and tested on aarch64-none-linux-gnu and
aarch64_be-none-elf.

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_qshrn_n):
Rename to...
(aarch64_shrn_n): ... This.  Reimplement with RTL codes.
(*aarch64_rshrn_n_insn): New define_insn.
(aarch64_sqrshrun_n_insn): Likewise.
(aarch64_sqshrun_n_insn): Likewise.
(aarch64_rshrn_n): New define_expand.
(aarch64_sqshrun_n): Likewise.
(aarch64_sqrshrun_n): Likewise.
* config/aarch64/iterators.md (V2XWIDE): Add HI and SI modes.


s2.patch
Description: s2.patch


[PATCH][5/5] aarch64: Handle ASHIFTRT in patterns for shrn2

2023-06-16 Thread Kyrylo Tkachov via Gcc-patches
Similar to the low-half patterns, we want to match both ashiftrt and
lshiftrt with the truncate for SHRN2.  We reuse the SHIFTRT iterator
and the AARCH64_VALID_SHRN_OP check to help, but because we expand the
high-half patterns by their gen_* names we need to disambiguate all the
different trunc+shift combinations in the pattern name, which leads to a
slight renaming of the builtins.  The AARCH64_VALID_SHRN_OP check on the
expander and the define_insns ensures that no invalid combination ends
up getting matched.

Bootstrapped and tested on aarch64-none-linux-gnu and
aarch64_be-none-elf.

gcc/ChangeLog:

* config/aarch64/aarch64-simd-builtins.def (shrn2_n): Rename builtins 
to...
(ushrn2_n): ... This.
(sqshrn2_n): Rename builtins to...
(ssqshrn2_n): ... This.
(uqshrn2_n): Rename builtins to...
(uqushrn2_n): ... This.
* config/aarch64/arm_neon.h (vqshrn_high_n_s16): Adjust for the above.
(vqshrn_high_n_s32): Likewise.
(vqshrn_high_n_s64): Likewise.
(vqshrn_high_n_u16): Likewise.
(vqshrn_high_n_u32): Likewise.
(vqshrn_high_n_u64): Likewise.
(vshrn_high_n_s16): Likewise.
(vshrn_high_n_s32): Likewise.
(vshrn_high_n_s64): Likewise.
(vshrn_high_n_u16): Likewise.
(vshrn_high_n_u32): Likewise.
(vshrn_high_n_u64): Likewise.
* config/aarch64/aarch64-simd.md 
(aarch64_shrn2_n_insn_le):
Rename to...
(aarch64_shrn2_n_insn_le): ... This.
Use SHIFTRT iterator and AARCH64_VALID_SHRN_OP check.
(aarch64_shrn2_n_insn_be): Rename to...
(aarch64_shrn2_n_insn_be): ... This.
Use SHIFTRT iterator and AARCH64_VALID_SHRN_OP check.
(aarch64_shrn2_n): Rename to...
(aarch64_shrn2_n): ... This.
Update expander for the above.


s5.patch
Description: s5.patch


[PATCH][3/5] aarch64: Add ASHIFTRT handling for shrn pattern

2023-06-16 Thread Kyrylo Tkachov via Gcc-patches
The first patch in the series has some fallout in the testsuite,
particularly gcc.target/aarch64/shrn-combine-2.c.
Our previous patterns for SHRN matched both
(truncate (ashiftrt (x) (N))) and (truncate (lshiftrt (x) (N)))
as these are equivalent for the shift amounts involved.
In our refactoring, however, we mapped shrn to truncate+lshiftrt.

The fix here is to iterate over ashiftrt,lshiftrt in the pattern for it.
However, we don't want to allow ashiftrt for us_truncate or lshiftrt for
ss_truncate from the ALL_TRUNC iterator.

This patch adds an AARCH64_VALID_SHRN_OP helper to gate the valid
combinations of truncations and shifts.
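
A rough C sketch of the gating logic (the actual macro in aarch64.h may be
written differently):

/* Valid combinations of truncation and right-shift RTX codes for the
   narrowing-shift patterns.  */
enum trunc_code { TRUNCATE, SS_TRUNCATE, US_TRUNCATE };
enum shift_code { ASHIFTRT, LSHIFTRT };

static int
valid_shrn_op (enum trunc_code t, enum shift_code s)
{
  return t == TRUNCATE                            /* SHRN: either shift.  */
         || (t == SS_TRUNCATE && s == ASHIFTRT)   /* SQSHRN.  */
         || (t == US_TRUNCATE && s == LSHIFTRT);  /* UQSHRN.  */
}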

Bootstrapped and tested on aarch64-none-linux-gnu and
aarch64_be-none-elf.

gcc/ChangeLog:

* config/aarch64/aarch64.h (AARCH64_VALID_SHRN_OP): Define.
* config/aarch64/aarch64-simd.md
(*aarch64_shrn_n_insn): Rename to...
(*aarch64_shrn_n_insn): ... This.
Use SHIFTRT iterator and add AARCH64_VALID_SHRN_OP to condition.
* config/aarch64/iterators.md (shrn_s): New code attribute.


s3.patch
Description: s3.patch


[PATCH][1/5] aarch64: Reimplement [US]Q(R)SHR(U)N patterns with RTL codes

2023-06-16 Thread Kyrylo Tkachov via Gcc-patches
This patch reimplements the MD patterns for the instructions that
perform narrowing right shifts with optional rounding and saturation
using standard RTL codes rather than unspecs.

There are four groups of patterns involved:

* Simple narrowing shifts with optional signed or unsigned truncation:
SHRN, SQSHRN, UQSHRN.  These are expressed as a truncation operation of
a right shift.  The matrix of valid combinations looks like this:

            |  ashiftrt  |  lshiftrt  |
---------------------------------------
ss_truncate |   SQSHRN   |     X      |
us_truncate |     X      |   UQSHRN   |
truncate    |     X      |    SHRN    |
---------------------------------------

* Narrowing shifts with rounding with optional signed or unsigned
truncation: RSHRN, SQRSHRN, UQRSHRN.  These follow the same
combinations of truncation and shift codes as above, but also perform
intermediate widening of the results in order to represent the addition
of the rounding constant.  This group also corrects an existing
inaccuracy for RSHRN where we don't currently model the intermediate
widening for rounding.

* The somewhat special "Signed saturating Shift Right Unsigned Narrow":
SQSHRUN.  Similar to the SQXTUN instructions, these perform a
saturating truncation that isn't represented by US_TRUNCATE or
SS_TRUNCATE but needs to use a clamping operation followed by a
TRUNCATE.

* The rounding version of the above: SQRSHRUN.  It needs the special
clamping truncate representation but with an intermediate widening and
rounding addition.
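
To make the SQSHRUN/SQRSHRUN semantics above concrete, here is a scalar C
model for the 16-bit to 8-bit case (illustrative only; the patterns operate
on vectors):

#include <stdint.h>

static uint8_t
sqshrun_h (int16_t x, int n)
{
  int16_t v = x >> n;                              /* arithmetic shift  */
  return v < 0 ? 0 : v > UINT8_MAX ? UINT8_MAX : (uint8_t) v;
}

static uint8_t
sqrshrun_h (int16_t x, int n)
{
  int32_t v = ((int32_t) x + (1 << (n - 1))) >> n; /* widen and round  */
  return v < 0 ? 0 : v > UINT8_MAX ? UINT8_MAX : (uint8_t) v;
}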

Besides using standard RTL codes for all of the above instructions, this
patch allows us to get rid of the explicit define_insns and
define_expands for SHRN and RSHRN.

Bootstrapped and tested on aarch64-none-linux-gnu and
aarch64_be-none-elf.  We've got pretty thorough execute tests in
advsimd-intrinsics.exp that exercise these and many instances of these
instructions get constant-folded away during optimisation and the
validation still passes (during development, while I was figuring out the
details of the semantics, they did catch failures), so I'm fairly
confident in the representation.

gcc/ChangeLog:

* config/aarch64/aarch64-simd-builtins.def (shrn): Rename builtins to...
(shrn_n): ... This.
(rshrn): Rename builtins to...
(rshrn_n): ... This.
* config/aarch64/arm_neon.h (vshrn_n_s16): Adjust for the above.
(vshrn_n_s32): Likewise.
(vshrn_n_s64): Likewise.
(vshrn_n_u16): Likewise.
(vshrn_n_u32): Likewise.
(vshrn_n_u64): Likewise.
(vrshrn_n_s16): Likewise.
(vrshrn_n_s32): Likewise.
(vrshrn_n_s64): Likewise.
(vrshrn_n_u16): Likewise.
(vrshrn_n_u32): Likewise.
(vrshrn_n_u64): Likewise.
* config/aarch64/aarch64-simd.md
(*aarch64_shrn): Delete.
(aarch64_shrn): Likewise.
(aarch64_rshrn_insn): Likewise.
(aarch64_rshrn): Likewise.
(aarch64_qshrn_n_insn): Likewise.
(aarch64_qshrn_n): Likewise.
(*aarch64_shrn_n_insn): New define_insn.
(*aarch64_rshrn_n_insn): Likewise.
(*aarch64_sqshrun_n_insn): Likewise.
(*aarch64_sqrshrun_n_insn): Likewise.
(aarch64_shrn_n): New define_expand.
(aarch64_rshrn_n): Likewise.
(aarch64_sqshrun_n): Likewise.
(aarch64_sqrshrun_n): Likewise.
* config/aarch64/iterators.md (ALL_TRUNC): New code iterator.
(TRUNCEXTEND): New code attribute.
(TRUNC_SHIFT): Likewise.
(shrn_op): Likewise.
* config/aarch64/predicates.md (aarch64_simd_umax_quarter_mode):
New predicate.


s1.patch
Description: s1.patch


[PATCH] simplify-rtx: Simplify VEC_CONCAT of SUBREG and VEC_CONCAT from same vector

2023-06-16 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

In the testcase for this patch we try to vec_concat the lowpart and highpart of 
a vector, but the lowpart is expressed as a subreg.
simplify-rtx.cc does not recognise this and combine ends up trying to match:
Trying 7 -> 8:
7: r93:V2SI=vec_select(r95:V4SI,parallel)
8: r97:V4SI=vec_concat(r95:V4SI#0,r93:V2SI)
  REG_DEAD r95:V4SI
  REG_DEAD r93:V2SI
Failed to match this instruction:
(set (reg:V4SI 97)
    (vec_concat:V4SI (subreg:V2SI (reg/v:V4SI 95 [ a ]) 0)
        (vec_select:V2SI (reg/v:V4SI 95 [ a ])
            (parallel:V4SI [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))))

This should be just (set (reg:V4SI 97) (reg:V4SI 95)). This patch adds such a 
simplification.
The testcase is a bit artificial, but I do have other aarch64-specific patterns 
that I want to optimise later
that rely on this simplification happening.

Without this patch for the testcase we generate:
foo:
dup d31, v0.d[1]
ins v0.d[1], v31.d[0]
ret

whereas we should just not generate anything as the operation is ultimately a 
no-op.
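
A hypothetical reconstruction of the testcase (the committed
low-high-combine_1.c may differ):

#include <arm_neon.h>

int32x4_t
foo (int32x4_t a)
{
  /* Recombining a vector from its own low and high halves is a no-op.  */
  return vcombine_s32 (vget_low_s32 (a), vget_high_s32 (a));
}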

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Ok for trunk?
Thanks,
Kyrill

gcc/ChangeLog:

* simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
Simplify vec_concat of lowpart subreg and high part vec_select.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/low-high-combine_1.c: New test.


concat-subreg.patch
Description: concat-subreg.patch


RE: [PATCH v2] [PR96339] Optimise svlast[ab]

2023-06-14 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Gcc-patches  bounces+kyrylo.tkachov=arm@gcc.gnu.org> On Behalf Of Prathamesh
> Kulkarni via Gcc-patches
> Sent: Wednesday, June 14, 2023 8:13 AM
> To: Tejas Belagod 
> Cc: Richard Sandiford ; gcc-
> patc...@gcc.gnu.org
> Subject: Re: [PATCH v2] [PR96339] Optimise svlast[ab]
> 
> On Tue, 13 Jun 2023 at 12:38, Tejas Belagod via Gcc-patches
>  wrote:
> >
> >
> >
> > From: Richard Sandiford 
> > Date: Monday, June 12, 2023 at 2:15 PM
> > To: Tejas Belagod 
> > Cc: gcc-patches@gcc.gnu.org , Tejas Belagod
> 
> > Subject: Re: [PATCH v2] [PR96339] Optimise svlast[ab]
> > Tejas Belagod  writes:
> > > From: Tejas Belagod 
> > >
> > >   This PR optimizes an SVE intrinsics sequence where
> > > svlasta (svptrue_pat_b8 (SV_VL1), x)
> > >   a scalar is selected based on a constant predicate and a variable 
> > > vector.
> > >   This sequence is optimized to return the corresponding element of a
> > > NEON
> > >   vector. For eg.
> > > svlasta (svptrue_pat_b8 (SV_VL1), x)
> > >   returns
> > > umov w0, v0.b[1]
> > >   Likewise,
> > > svlastb (svptrue_pat_b8 (SV_VL1), x)
> > >   returns
> > >  umov w0, v0.b[0]
> > >   This optimization only works provided the constant predicate maps to a
> range
> > >   that is within the bounds of a 128-bit NEON register.
> > >
> > > gcc/ChangeLog:
> > >
> > >PR target/96339
> > >* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): 
> > > Fold
> sve
> > >calls that have a constant input predicate vector.
> > >(svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
> > >(svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
> > >(svlast_impl::vect_all_same): Check if all vector elements are 
> > > equal.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >PR target/96339
> > >* gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
> > >* gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
> > >* gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
> > >* gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
> > >to expect optimized code for function body.
> > >* gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): 
> > > Likewise.
> > >* gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): 
> > > Likewise.
> > >* gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): 
> > > Likewise.
> > >* gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): 
> > > Likewise.
> > >* gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): 
> > > Likewise.
> > >* gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
> > >* gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): 
> > > Likewise.
> > >* gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): 
> > > Likewise.
> > >* gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): 
> > > Likewise.
> > >* gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): 
> > > Likewise.
> > >* gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): 
> > > Likewise.
> >
> > OK, thanks.
> >
> > Applied on master, thanks.
> Hi Tejas,
> This seems to break aarch64 bootstrap build with following error due
> to -Wsign-compare diagnostic:
> 00:18:19 /home/tcwg-
> buildslave/workspace/tcwg_gnu_6/abe/snapshots/gcc.git~master/gcc/config/
> aarch64/aarch64-sve-builtins-base.cc:1133:35:
> error: comparison of integer expressions of different signedness:
> ‘int’ and ‘long unsigned int’ [-Werror=sign-compare]
> 00:18:19  1133 | for (i = npats; i < enelts; i += step_1)
> 00:18:19  | ~~^~~~
> 00:30:46 abe-debug-build: cc1plus: all warnings being treated as errors
> 00:30:46 abe-debug-build: make[3]: ***
> [/home/tcwg-
> buildslave/workspace/tcwg_gnu_6/abe/snapshots/gcc.git~master/gcc/config/
> aarch64/t-aarch64:96:
> aarch64-sve-builtins-base.o] Error 1

Fixed thusly in trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold):
Fix signed comparison warning in loop from npats to enelts.

> 
> Thanks,
> Prathamesh
> >
> > Tejas.
> >
> >
> > Richard


boot.patch
Description: boot.patch


[PATCH][committed] arm: Extend -mtp= arguments

2023-06-13 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

After discussing the -mtp= option with Arm's LLVM developers we'd like to extend
the functionality of the option somewhat.
There are actually 3 system registers that can be accessed for the thread 
pointer
in aarch32: tpidrurw, tpidruro, tpidrprw.  They are all read through the CP15 
co-processor
mechanism. The current -mtp=cp15 option reads the tpidruro register.
This patch extends -mtp to allow for the above three explicit tpidr names and
keeps -mtp=cp15 as an alias of -mtp=tpidruro for backwards compatibility.
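
For example (illustrative; __builtin_thread_pointer is the generic GCC
builtin for reading the thread pointer):

void *
get_tp (void)
{
  /* With -mtp=tpidrprw this reads TPIDRPRW via CP15 rather than the
     default TPIDRURO.  */
  return __builtin_thread_pointer ();
}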

There is more relevant discussion of the options at 
https://reviews.llvm.org/D152433 if you're interested.

Bootstrapped and tested on arm-none-linux-gnueabihf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/arm/arm-opts.h (enum arm_tp_type): Remove TP_CP15.
Add TP_TPIDRURW, TP_TPIDRURO, TP_TPIDRPRW values.
* config/arm/arm-protos.h (arm_output_load_tpidr): Declare prototype.
* config/arm/arm.cc (arm_option_reconfigure_globals): Replace TP_CP15
with TP_TPIDRURO.
(arm_output_load_tpidr): Define.
* config/arm/arm.h (TARGET_HARD_TP): Define in terms of TARGET_SOFT_TP.
* config/arm/arm.md (load_tp_hard): Call arm_output_load_tpidr to output
assembly.
(reload_tp_hard): Likewise.
* config/arm/arm.opt (tpidrurw, tpidruro, tpidrprw): New values for
arm_tp_type.
* doc/invoke.texi (Arm Options, mtp): Document new values.

gcc/testsuite/ChangeLog:

* gcc.target/arm/mtp.c: New test.
* gcc.target/arm/mtp_1.c: New test.
* gcc.target/arm/mtp_2.c: New test.
* gcc.target/arm/mtp_3.c: New test.
* gcc.target/arm/mtp_4.c: New test.


mtp-arm.patch
Description: mtp-arm.patch


[PATCH][committed] aarch64: Extend -mtp= arguments

2023-06-13 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

After discussing the -mtp= option with Arm's LLVM developers we'd like to extend
the functionality of the option somewhat.
First of all, there is another TPIDR register that can be used to read the 
thread pointer:
TPIDRRO_EL0 (which can also be accessed by AArch32 under another name) so it 
makes sense
to add -mtp=tpidrro_el0. This makes the existing arguments el0, el1, el2, el3
somewhat
inconsistent in their naming so this patch introduces the more "full" names
tpidr_el0, tpidr_el1, tpidr_el2, tpidr_el3 and makes the above short names 
aliases of these new ones.
Long story short, we preserve backwards compatibility and add a new TPIDR 
register to access through
-mtp that wasn't available previously.
There is more relevant discussion of the options at 
https://reviews.llvm.org/D152433 if you're interested.

Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

PR target/108779
* config/aarch64/aarch64-opts.h (enum aarch64_tp_reg): Add
AARCH64_TPIDRRO_EL0 value.
* config/aarch64/aarch64.cc (aarch64_output_load_tp): Handle the new
TPIDRRO_EL0 value.
* config/aarch64/aarch64.opt (tpidr_el0, tpidr_el1, tpidr_el2,
tpidr_el3, tpidrro_el0): New accepted values to -mtp=.
* doc/invoke.texi (AArch64 Options): Document new -mtp= options.

gcc/testsuite/ChangeLog:

PR target/108779
* gcc.target/aarch64/mtp_5.c: New test.
* gcc.target/aarch64/mtp_6.c: New test.
* gcc.target/aarch64/mtp_7.c: New test.
* gcc.target/aarch64/mtp_8.c: New test.
* gcc.target/aarch64/mtp_9.c: New test.


mtp-a64.patch
Description: mtp-a64.patch


RE: [PATCH] simplify-rtx: Implement constant folding of SS_TRUNCATE, US_TRUNCATE

2023-06-12 Thread Kyrylo Tkachov via Gcc-patches
Hi Richard,

> -Original Message-
> From: Richard Sandiford 
> Sent: Friday, June 9, 2023 7:08 PM
> To: Kyrylo Tkachov via Gcc-patches 
> Cc: Kyrylo Tkachov 
> Subject: Re: [PATCH] simplify-rtx: Implement constant folding of
> SS_TRUNCATE, US_TRUNCATE
> 
> Kyrylo Tkachov via Gcc-patches  writes:
> > Hi all,
> >
> > This patch implements RTL constant-folding for the SS_TRUNCATE and
> US_TRUNCATE codes.
> > The semantics are a clamping operation on the argument with the min and
> max of the narrow mode,
> > followed by a truncation. The signedness of the clamp and the min/max
> extrema is derived from
> > the signedness of the saturating operation.
> >
> > We have a number of instructions in aarch64 that use SS_TRUNCATE and
> US_TRUNCATE to represent
> > their operations and we have pretty thorough runtime tests in
> gcc.target/aarch64/advsimd-intrinsics/vqmovn*.c.
> > With this patch the instructions are folded away at optimisation levels and
> the correctness checks still
> > pass.
> >
> > Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-
> elf.
> > Ok for trunk?
> >
> > Thanks,
> > Kyrill
> >
> > gcc/ChangeLog:
> >
> > * simplify-rtx.cc (simplify_const_unary_operation):
> > Handle US_TRUNCATE, SS_TRUNCATE.
> >
> > diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> > index
> 276be67aa67247dd46361ab9badc46ab089d6df0..5983a06e5a8ca89c717e864
> 8be410024147b16e6 100644
> > --- a/gcc/simplify-rtx.cc
> > +++ b/gcc/simplify-rtx.cc
> > @@ -2131,6 +2131,22 @@ simplify_const_unary_operation (enum
> rtx_code code, machine_mode mode,
> >   result = wide_int::from (op0, width, UNSIGNED);
> >   break;
> >
> > +   case US_TRUNCATE:
> > +   case SS_TRUNCATE:
> > + {
> > +   signop sgn = code == US_TRUNCATE ? UNSIGNED : SIGNED;
> > +   wide_int nmax
> > + = wide_int::from (wi::max_value (width, sgn),
> > +   GET_MODE_PRECISION (imode), sgn);
> > +   wide_int nmin
> > + = wide_int::from (wi::min_value (width, sgn),
> > +   GET_MODE_PRECISION (imode), sgn);
> > +   result
> > + = wide_int::from (op0, GET_MODE_PRECISION (imode), sgn);
> > +   result = wi::min (wi::max (result, nmin, sgn), nmax, sgn);
> 
> FWIW, it looks like this could be:
> 
>   result = wi::min (wi::max (op0, nmin, sgn), nmax, sgn);
> 
> without the first assignment to result.  That feels more natural IMO,
> since no conversion is being done on op0.

Thanks, that works indeed.
I'll push the attached patch to trunk once bootstrap and testing completes.
Kyrill

> 
> Thanks,
> Richard
> 
> > +   result = wide_int::from (result, width, sgn);
> > +   break;
> > + }
> > case SIGN_EXTEND:
> >   result = wide_int::from (op0, width, SIGNED);
> >   break;


sstrunc.patch
Description: sstrunc.patch


[PATCH] simplify-rtx: Implement constant folding of SS_TRUNCATE, US_TRUNCATE

2023-06-08 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch implements RTL constant-folding for the SS_TRUNCATE and US_TRUNCATE 
codes.
The semantics are a clamping operation on the argument with the min and max of 
the narrow mode,
followed by a truncation. The signedness of the clamp and the min/max extrema 
is derived from
the signedness of the saturating operation.
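
A scalar C model of the semantics, assuming an HImode-to-QImode signed
saturating truncation (illustrative only):

#include <stdint.h>

static int8_t
ss_truncate_hi_qi (int16_t x)
{
  /* Clamp to the signed range of the narrow mode, then truncate.  */
  if (x > INT8_MAX)
    return INT8_MAX;
  if (x < INT8_MIN)
    return INT8_MIN;
  return (int8_t) x;
}

US_TRUNCATE is the same with the unsigned extrema [0, UINT8_MAX].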

We have a number of instructions in aarch64 that use SS_TRUNCATE and 
US_TRUNCATE to represent
their operations and we have pretty thorough runtime tests in 
gcc.target/aarch64/advsimd-intrinsics/vqmovn*.c.
With this patch the instructions are folded away at optimisation levels and the 
correctness checks still
pass.

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Ok for trunk?

Thanks,
Kyrill

gcc/ChangeLog:

* simplify-rtx.cc (simplify_const_unary_operation):
Handle US_TRUNCATE, SS_TRUNCATE.


s_truncate.patch
Description: s_truncate.patch


[PATCH][committed] aarch64: Represent SQXTUN with RTL operations

2023-06-07 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch removes UNSPEC_SQXTUN and uses organic RTL codes to represent the 
operation.
SQXTUN is an odd one. It's described in the architecture as "Signed saturating 
extract Unsigned Narrow".
It's not a straightforward ss_truncate nor a us_truncate.
It is a sort of truncating signed clamp operation with limits derived from the 
unsigned extrema of the narrow mode:
(truncate:N
  (smin:M
    (smax:M (reg:M) (const_int 0))
    (const_int <half_mask>)))

This patch implements these semantics. I've checked that the vqmovun tests in 
advsimd-intrinsics.exp
now get constant-folded and still pass validation, so I'm pretty confident in 
the semantics.
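
For instance (a hypothetical example in the spirit of those tests):

#include <arm_neon.h>

uint8x8_t
fold_sqxtun (void)
{
  /* Each lane clamps -4 into [0, 255] and truncates, so this folds
     to an all-zeros vector.  */
  return vqmovun_s16 (vdupq_n_s16 (-4));
}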

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_sqmovun):
Rename to...
(*aarch64_sqmovun_insn): ... This.  Reimplement
with RTL codes.
(aarch64_sqmovun [SD_HSDI]): Reimplement with RTL codes.
(aarch64_sqxtun2_le): Likewise.
(aarch64_sqxtun2_be): Likewise.
(aarch64_sqxtun2): Adjust for the above.
(aarch64_sqmovun): New define_expand.
* config/aarch64/iterators.md (UNSPEC_SQXTUN): Delete.
(half_mask): New mode attribute.
* config/aarch64/predicates.md (aarch64_simd_umax_half_mode):
New predicate.


sqxtun.patch
Description: sqxtun.patch


[PATCH][committed] aarch64: Improve RTL representation of ADDP instructions

2023-06-07 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

Similar to the ADDLP instructions the non-widening ADDP ones can be
represented by adding the odd lanes with the even lanes of a vector.
These instructions take two vector inputs and the architecture spec
describes the operation as concatenating them together before going
through the combined vector with pairwise additions.
This patch chooses to represent ADDP on 64-bit and 128-bit input
vectors slightly differently, for reasons explained in the comments
in aarch64-simd.md.

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_addp):
Reimplement as...
(aarch64_addp_insn): ... This...
(aarch64_addp_insn): ... And this.
(aarch64_addp): New define_expand.


addp-r.patch
Description: addp-r.patch


[PATCH][committed] aarch64: Improve representation of vpaddd intrinsics

2023-06-06 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

The aarch64_addpdi pattern is redundant as the reduc_plus_scal_<mode> pattern
can already generate
the required form of the ADDP instruction, and is mostly folded to GIMPLE early 
on so can benefit from more optimisations.
Though it turns out that we were missing the folding for the unsigned variants.
This patch adds that and wires up the vpaddd_u64 and vpaddd_s64 intrinsics 
through the above pattern instead
so that we can remove a redundant pattern and get more optimisation earlier.

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-builtins.cc 
(aarch64_general_gimple_fold_builtin):
Handle unsigned reduc_plus_scal_ builtins.
* config/aarch64/aarch64-simd-builtins.def (addp): Delete DImode 
instances.
* config/aarch64/aarch64-simd.md (aarch64_addpdi): Delete.
* config/aarch64/arm_neon.h (vpaddd_s64): Reimplement with
__builtin_aarch64_reduc_plus_scal_v2di.
(vpaddd_u64): Reimplement with 
__builtin_aarch64_reduc_plus_scal_v2di_uu.


vpaddd.patch
Description: vpaddd.patch


[PATCH][committed] aarch64: Reimplement URSHR,SRSHR patterns with standard RTL codes

2023-06-06 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

Having converted the patterns for the URSRA,SRSRA instructions to standard RTL 
codes we can also
easily convert the non-accumulating forms URSHR,SRSHR.
This patch does that, reusing the various helpers and predicates from that 
patch in a straightforward way.
This allows GCC to perform the optimisations in the testcase, matching what 
Clang does.
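
One fold this enables (a hypothetical illustration; the committed vrshr_1.c
test may differ):

#include <arm_neon.h>

uint8x8_t
rshr_zero (void)
{
  /* (0 + 4) >> 3 == 0 in every lane, so the whole call folds away.  */
  return vrshr_n_u8 (vdup_n_u8 (0), 3);
}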

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_shr_n): Delete.
(aarch64_rshr_n_insn): New define_insn.
(aarch64_rshr_n): New define_expand.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/vrshr_1.c: New test.


rshr.patch
Description: rshr.patch


[PATCH][committed] aarch64: Simplify SHRN, RSHRN expanders and patterns

2023-06-06 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

Now that we've got the <vczle><vczbe> annotations we can get rid of explicit
!BYTES_BIG_ENDIAN and BYTES_BIG_ENDIAN patterns for the narrowing shift 
instructions.
This allows us to clean up the expanders as well.

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_shrn_insn_le): Delete.
(aarch64_shrn_insn_be): Delete.
(*aarch64_shrn_vect):  Rename to...
(*aarch64_shrn): ... This.
(aarch64_shrn): Remove reference to the above deleted patterns.
(aarch64_rshrn_insn_le): Delete.
(aarch64_rshrn_insn_be): Delete.
(aarch64_rshrn_insn): New define_insn.
(aarch64_rshrn): Remove references to the above deleted patterns.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/pr99195_5.c: Add testing for shrn_n, rshrn_n
intrinsics.


shrn-clean.patch
Description: shrn-clean.patch


[PATCH][committed] aarch64: Improve representation of ADDLV instructions

2023-06-06 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

We've received requests to optimise the attached intrinsics testcase.
We currently generate:
foo_1:
        uaddlp  v0.4s, v0.8h
        uaddlv  d31, v0.4s
        fmov    x0, d31
        ret
foo_2:
        uaddlp  v0.4s, v0.8h
        addv    s31, v0.4s
        fmov    w0, s31
        ret
foo_3:
        saddlp  v0.4s, v0.8h
        addv    s31, v0.4s
        fmov    w0, s31
        ret
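
The source for these is along the lines of (a hypothetical reconstruction of
the attached testcase):

#include <arm_neon.h>

uint64_t foo_1 (uint16x8_t a) { return vaddlvq_u32 (vpaddlq_u16 (a)); }
uint32_t foo_2 (uint16x8_t a) { return vaddvq_u32 (vpaddlq_u16 (a)); }
int32_t  foo_3 (int16x8_t a)  { return vaddvq_s32 (vpaddlq_s16 (a)); }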

The widening pair-wise addition addlp instructions can be omitted if we're just 
doing an ADDV afterwards.
Making this optimisation would be quite simple if we had a standard RTL PLUS 
vector reduction code.
As we don't, we can use UNSPEC_ADDV as a stand-in.
This patch expresses the SADDLV and UADDLV instructions as an UNSPEC_ADDV over 
a widened input, thus removing
the need for separate UNSPEC_SADDLV and UNSPEC_UADDLV codes.
To optimise the testcases involved we add two splitters that match a vector 
addition where all participating elements
are taken and widened from the same vector and then fed into an UNSPEC_ADDV. In 
that case we can remove the
vector PLUS and just emit the simple RTL for SADDLV/UADDLV.

Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.

Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-protos.h (aarch64_parallel_select_half_p):
Define prototype.
(aarch64_pars_overlap_p): Likewise.
* config/aarch64/aarch64-simd.md (aarch64_addlv):
Express in terms of UNSPEC_ADDV.
(*aarch64_addlv_ze): Likewise.
(*aarch64_addlv_reduction): Define.
(*aarch64_uaddlv_reduction_2): Likewise.
* config/aarch64/aarch64.cc (aarch64_parallel_select_half_p): 
Define.
(aarch64_pars_overlap_p): Likewise.
* config/aarch64/iterators.md (UNSPEC_SADDLV, UNSPEC_UADDLV): Delete.
(VQUADW): New mode attribute.
(VWIDE2X_S): Likewise.
(USADDLV): Delete.
(su): Delete handling of UNSPEC_SADDLV, UNSPEC_UADDLV.
* config/aarch64/predicates.md (vect_par_cnst_select_half): Define.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/addlv_1.c: New test.


addlv2.patch
Description: addlv2.patch


[PATCH][committed] aarch64: Add =r, m and =m, r alternatives to 64-bit vector move patterns

2023-06-01 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

We can use the X registers to load and store 64-bit vector modes, we just need 
to add the alternatives
to the mov patterns. This straightforward patch does that and for the pair 
variants too.
For the testcase in the patch we now generate the optimal assembly without any
superfluous
GP<->SIMD moves.
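
A sketch of code that benefits (hypothetical; the committed
xreg-vec-modes_1.c test may exercise this differently):

#include <arm_neon.h>

void
copy_v2si (int32x2_t *dst, const int32x2_t *src)
{
  /* With the new alternatives the register allocator is free to do
     this copy with ldr/str on an X register, avoiding fmov bounces.  */
  *dst = *src;
}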

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (*aarch64_simd_mov):
Add =r,m and =r,m alternatives.
(load_pair): Likewise.
(vec_store_pair): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/xreg-vec-modes_1.c: New test.


rm64.patch
Description: rm64.patch


[PATCH][committed] aarch64: PR target/99195 Annotate dot-product patterns for vec-concat-zero

2023-05-31 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This straightforward patch annotates the dotproduct instructions, including the 
i8mm ones.
Tests included.
Nothing unexpected here.
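
The idiom these annotations target looks like this (a hypothetical sketch in
the style of the pr99195 tests; requires the dot-product extension):

#include <arm_neon.h>

int32x4_t
dp_zero_top (int32x2_t acc, int8x8_t a, int8x8_t b)
{
  /* The explicit zeroing of the top half should fold into the 64-bit
     sdot itself, with no extra instructions.  */
  return vcombine_s32 (vdot_s32 (acc, a, b), vdup_n_s32 (0));
}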

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

PR target/99195
* config/aarch64/aarch64-simd.md (dot_prod): Rename to...
(dot_prod): ... This.
(usdot_prod): Rename to...
(usdot_prod): ... This.
(aarch64_dot_lane): Rename to...
(aarch64_dot_lane): ... This.
(aarch64_dot_laneq): Rename to...
(aarch64_dot_laneq): ... This.
(aarch64_dot_lane): Rename 
to...

(aarch64_dot_lane):
... This.

gcc/testsuite/ChangeLog:

PR target/99195
* gcc.target/aarch64/simd/pr99195_11.c: New test.


dotprod.patch
Description: dotprod.patch


[PATCH][committed] aarch64: PR target/99195 Annotate saturating mult patterns for vec-concat-zero

2023-05-31 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch goes through the various alphabet soup saturating multiplication 
patterns, including those in TARGET_RDMA
and annotates them with . Many other patterns are widening and 
always write the full 128-bit vectors
so this annotation doesn't apply to them. Nothing out of the ordinary in this 
patch.

Bootstrapped and tested on aarch64-none-linux and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

PR target/99195
* config/aarch64/aarch64-simd.md (aarch64_sqdmulh): Rename 
to...
(aarch64_sqdmulh): ... This.
(aarch64_sqdmulh_n): Rename to...
(aarch64_sqdmulh_n): ... This.
(aarch64_sqdmulh_lane): Rename to...
(aarch64_sqdmulh_lane): ... This.
(aarch64_sqdmulh_laneq): Rename to...
(aarch64_sqdmulh_laneq): ... This.
(aarch64_sqrdmlh): Rename to...
(aarch64_sqrdmlh): ... This.
(aarch64_sqrdmlh_lane): Rename to...
(aarch64_sqrdmlh_lane): ... 
This.
(aarch64_sqrdmlh_laneq): Rename to...
(aarch64_sqrdmlh_laneq): ... 
This.

gcc/testsuite/ChangeLog:

PR target/99195
* gcc.target/aarch64/simd/pr99195_1.c: Add tests for qdmulh, qrdmulh.
* gcc.target/aarch64/simd/pr99195_10.c: New test.


satmul.patch
Description: satmul.patch


[PATCH][NFC][committed] aarch64: Simplify output template emission code for a few patterns

2023-05-31 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

If the output code for a define_insn just does a switch (which_alternative) 
with no other computation we can almost always
replace it with more compact MD syntax for each alternative in a 
multi-alternative '@' block.
This patch cleans up some such patterns in the aarch64 backend, making them 
shorter and more concise.
No behavioural change intended.

Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (*aarch64_simd_mov): 
Rewrite
output template to avoid explicit switch on which_alternative.
(*aarch64_simd_mov): Likewise.
(and3): Likewise.
(ior3): Likewise.
* config/aarch64/aarch64.md (*mov_aarch64): Likewise.


outp.patch
Description: outp.patch


RE: [PATCH] [arm] testsuite: make mve_intrinsic_type_overloads-int.c libc-agnostic

2023-05-30 Thread Kyrylo Tkachov via Gcc-patches
Ok.
Thanks,
Kyrill

From: Christophe Lyon 
Sent: Tuesday, May 30, 2023 4:44 PM
To: Kyrylo Tkachov 
Cc: gcc-patches@gcc.gnu.org; Stam Markianos-Wright 

Subject: Re: [PATCH] [arm] testsuite: make mve_intrinsic_type_overloads-int.c 
libc-agnostic

Ping?


On Tue, 23 May 2023 at 16:59, Stamatis Markianos-Wright 
mailto:stam.markianos-wri...@arm.com>> wrote:

On 23/05/2023 15:41, Christophe Lyon wrote:
> Glibc defines int32_t as 'int' while newlib defines it as 'long int'.
>
> Although these correspond to the same size, g++ complains when using the
> 'wrong' version:
>invalid conversion from 'long int*' to 'int32_t*' {aka 'int*'} 
> [-fpermissive]
> or
>invalid conversion from 'int*' to 'int32_t*' {aka 'long int*'} 
> [-fpermissive]
>
> when calling vst1q(int32*, int32x4_t) with a first parameter of type
> 'long int *' (resp. 'int *')
>
> To make this test pass with any type of toolchain, this patch defines
> 'word_type' according to which libc is in use.

Thank you for spotting this! I think this fix is needed on all of
GCC12,13,trunk btw (it should apply cleanly)


>
> 2023-05-23  Christophe Lyon  
> mailto:christophe.l...@linaro.org>>
>
>   gcc/testsuite/
>   * gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-int.c:
>   Support both definitions of int32_t.
> ---
>   .../mve_intrinsic_type_overloads-int.c| 28 ++-
>   1 file changed, 15 insertions(+), 13 deletions(-)
>
> diff --git 
> a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-int.c
>  
> b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-int.c
> index 7947dc024bc..ab51cc8b323 100644
> --- 
> a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-int.c
> +++ 
> b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-int.c
> @@ -47,14 +47,22 @@ foo2 (short * addr, int16x8_t value)
> vst1q (addr, value);
>   }
>
> -void
> -foo3 (int * addr, int32x4_t value)
> -{
> -  vst1q (addr, value); /* { dg-warning "invalid conversion" "" { target c++ 
> } } */
> -}
> +/* Glibc defines int32_t as 'int' while newlib defines it as 'long int'.
> +
> +   Although these correspond to the same size, g++ complains when using the
> +   'wrong' version:
> +  invalid conversion from 'long int*' to 'int32_t*' {aka 'int*'} 
> [-fpermissive]
> +
> +  The trick below is to make this test pass whether using glibc-based or
> +  newlib-based toolchains.  */
>
> +#if defined(__GLIBC__)
> +#define word_type int
> +#else
> +#define word_type long int
> +#endif
>   void
> -foo4 (long * addr, int32x4_t value)
> +foo3 (word_type * addr, int32x4_t value)
>   {
> vst1q (addr, value);
>   }
> @@ -78,13 +86,7 @@ foo7 (unsigned short * addr, uint16x8_t value)
>   }
>
>   void
> -foo8 (unsigned int * addr, uint32x4_t value)
> -{
> -  vst1q (addr, value); /* { dg-warning "invalid conversion" "" { target c++ 
> } } */
> -}
> -
> -void
> -foo9 (unsigned long * addr, uint32x4_t value)
> +foo8 (unsigned word_type * addr, uint32x4_t value)
>   {
> vst1q (addr, value);
>   }


RE: [PATCH] [arm][testsuite]: Fix ACLE data-intrinsics testcases

2023-05-30 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Christophe Lyon 
> Sent: Tuesday, May 30, 2023 3:00 PM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Chris Sidebottom 
> Cc: Christophe Lyon 
> Subject: [PATCH] [arm][testsuite]: Fix ACLE data-intrinsics testcases
> 
> data-intrinsics-assembly.c forces -march=armv6 using dg-add-options
> arm_arch_v6, which implicitly adds -mfloat-abi=softfp.
> 
> However, for a toolchain configured for arm-linux-gnueabihf and
> --with-arch=armv7-a, the testcase will fail when including arm_acle.h
> (which includes stdint.h, which will fail to include the non-existing
> gnu/stubs-soft.h).
> 
> Other effective-targets related to arm_acle.h would also pass because
> they first try without -mfloat-abi=softfp, so it seems the
> simplest/safest is to add { dg-require-effective-target arm_softfp_ok }
> to make sure arm_arch_v6_ok's assumption is valid.
> 
> The patch also fixes what seems to be an oversight in
> data-intrinsics-armv6.c: it requires arm_arch_v6_ok, but uses
> arm_arch_v6t2: the patch makes it require arm_arch_v6t2_ok.

Ok, thanks for sorting this out. The arm effective target checks always catch 
me off guard if I don't deal with them for a few months ☹
Kyrill

> 
> 2023-05-30  Christophe Lyon  
> 
>   gcc/testsuite/
>   * gcc.target/arm/acle/data-intrinsics-armv6.c: Fix typo.
>   * gcc.target/arm/acle/data-intrinsics-assembly.c Require
>   arm_softfp_ok.
> ---
>  gcc/testsuite/gcc.target/arm/acle/data-intrinsics-armv6.c| 2 +-
>  gcc/testsuite/gcc.target/arm/acle/data-intrinsics-assembly.c | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/testsuite/gcc.target/arm/acle/data-intrinsics-armv6.c
> b/gcc/testsuite/gcc.target/arm/acle/data-intrinsics-armv6.c
> index aafdff35cee..988ecac3787 100644
> --- a/gcc/testsuite/gcc.target/arm/acle/data-intrinsics-armv6.c
> +++ b/gcc/testsuite/gcc.target/arm/acle/data-intrinsics-armv6.c
> @@ -1,5 +1,5 @@
>  /* { dg-do run } */
> -/* { dg-require-effective-target arm_arch_v6_ok } */
> +/* { dg-require-effective-target arm_arch_v6t2_ok } */
>  /* { dg-add-options arm_arch_v6t2 } */
> 
>  #include "arm_acle.h"
> diff --git a/gcc/testsuite/gcc.target/arm/acle/data-intrinsics-assembly.c
> b/gcc/testsuite/gcc.target/arm/acle/data-intrinsics-assembly.c
> index 3e066877a70..478cbde1600 100644
> --- a/gcc/testsuite/gcc.target/arm/acle/data-intrinsics-assembly.c
> +++ b/gcc/testsuite/gcc.target/arm/acle/data-intrinsics-assembly.c
> @@ -1,5 +1,6 @@
>  /* Test the ACLE data intrinsics get expanded to the correct instructions on 
> a
> specific architecture  */
>  /* { dg-do assemble } */
> +/* { dg-require-effective-target arm_softfp_ok } */
>  /* { dg-require-effective-target arm_arch_v6_ok } */
>  /* { dg-additional-options "--save-temps -O1" } */
>  /* { dg-add-options arm_arch_v6 } */
> --
> 2.34.1



[PATCH][committed] aarch64: Convert ADDLP and ADALP patterns to standard RTL codes

2023-05-30 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch converts the patterns for the integer widen and pairwise-add 
instructions
to standard RTL operations. The pairwise addition within a vector can be
represented
as an addition of two vec_selects, one selecting the even elements, and one 
selecting odd.
Thus for the intrinsic vpaddlq_s8 we can generate:
(set (reg:V8HI 92)
    (plus:V8HI
        (vec_select:V8HI (sign_extend:V16HI (reg/v:V16QI 93 [ a ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 2 [0x2])
                    (const_int 4 [0x4])
                    (const_int 6 [0x6])
                    (const_int 8 [0x8])
                    (const_int 10 [0xa])
                    (const_int 12 [0xc])
                    (const_int 14 [0xe])
                ]))
        (vec_select:V8HI (sign_extend:V16HI (reg/v:V16QI 93 [ a ]))
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 3 [0x3])
                    (const_int 5 [0x5])
                    (const_int 7 [0x7])
                    (const_int 9 [0x9])
                    (const_int 11 [0xb])
                    (const_int 13 [0xd])
                    (const_int 15 [0xf])
                ]))))

Similarly for the accumulating forms where there's an extra outer PLUS for the 
accumulation.
We already have the handy helper functions aarch64_stepped_int_parallel_p and
aarch64_gen_stepped_int_parallel defined in aarch64.cc that we can make use of 
to define
the right predicate for the VEC_SELECT PARALLEL.
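
In scalar terms, each output lane of vpaddlq_s8 is (an illustrative model):

#include <stdint.h>

static void
paddl_model (const int8_t in[16], int16_t out[8])
{
  /* Even lane plus odd lane, each widened first.  */
  for (int i = 0; i < 8; i++)
    out[i] = (int16_t) in[2 * i] + (int16_t) in[2 * i + 1];
}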

This patch allows us to remove some code iterators and the UNSPEC definitions 
for SADDLP and UADDLP.
UNSPEC_UADALP and UNSPEC_SADALP are retained because they are used by SVE2 
patterns still.

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_adalp): Delete.
(aarch64_adalp): New define_expand.
(*aarch64_adalp_insn): New define_insn.
(aarch64_addlp): Convert to define_expand.
(*aarch64_addlp_insn): New define_insn.
* config/aarch64/iterators.md (UNSPEC_SADDLP, UNSPEC_UADDLP): Delete.
(ADALP): Likewise.
(USADDLP): Likewise.
* config/aarch64/predicates.md (vect_par_cnst_even_or_odd_half): Define.


adalp.patch
Description: adalp.patch


[PATCH][committed] aarch64: Reimplement v(r)hadd and vhsub intrinsics with RTL codes

2023-05-30 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch reimplements the MD patterns for the 
UHADD,SHADD,UHSUB,SHSUB,URHADD,SRHADD instructions using
standard RTL operations rather than unspecs. The correct RTL representation
involves widening
the inputs before adding them and halving, followed by a truncation back to the 
original mode.
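
A scalar sketch of these semantics for the signed byte case (illustrative
only):

#include <stdint.h>

static int8_t
shadd (int8_t a, int8_t b)
{
  /* Widen, add, halve (floor), truncate back.  */
  return (int8_t) (((int16_t) a + (int16_t) b) >> 1);
}

static int8_t
srhadd (int8_t a, int8_t b)
{
  /* Rounding form: add 1 before halving (ceil).  */
  return (int8_t) (((int16_t) a + (int16_t) b + 1) >> 1);
}
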
An unfortunate wart in the patch is that we end up having very similar 
expanders for the intrinsics
through the aarch64_<su>h<ADDSUB><mode> and aarch64_<su>rhadd<mode> names
and the standard names
for the vector averaging optabs avg<mode>3_floor and avg<mode>3_ceil.
I'd like to reuse avg<mode>3_ceil for the intrinsics builtin as well but
our scheme
in aarch64-simd-builtins.def and aarch64-builtins.cc makes it awkward by only
allowing mappings
of entries in aarch64-simd-builtins.def to:
   0 - CODE_FOR_aarch64_<name><mode>
   1-9 - CODE_FOR_<name><mode><1-9>
   10 - CODE_FOR_<name><mode>

whereas here we want a string after the <mode> i.e. CODE_FOR_uavg<mode>3_ceil.
This patch adds a bit of remapping logic in aarch64-builtins.cc before the 
construction of the
builtin info that remaps the CODE_FOR_* definitions in 
aarch64-simd-builtins.def to the
optab-derived ones. CODE_FOR_aarch64_srhaddv4si gets remapped to 
CODE_FOR_avgv4si3_ceil, for example.
It's a bit specific to this case, but this solution requires the least invasive 
changes while avoiding
having duplicate expanders just for the sake of a different pattern name.

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-builtins.cc (VAR1): Move to after inclusion of
aarch64-builtin-iterators.h.  Add definition to remap shadd, uhadd,
srhadd, urhadd builtin codes for standard optab ones.
* config/aarch64/aarch64-simd.md (avg3_floor): Rename to...
(avg3_floor): ... This.  Expand to RTL codes rather than
unspec.
(avg3_ceil): Rename to...
(avg3_ceil): ... This.  Expand to RTL codes rather than
unspec.
(aarch64_hsub): New define_expand.
(aarch64_h): Split into...
(*aarch64_h_insn): ... This...
(*aarch64_rhadd_insn): ... And this.


vrhadd.patch
Description: vrhadd.patch


RE: [PATCH 1/1] arm: merge MVE_5 and MVE_6 iterators

2023-05-25 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Thursday, May 25, 2023 1:25 PM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov 
> Cc: Christophe Lyon 
> Subject: [PATCH 1/1] arm: merge MVE_5 and MVE_6 iterators
> 
> MVE_5 and MVE_6 iterators are the same: this patch replaces MVE_6 with
> MVE_5 everywhere in mve.md and removes MVE_6 from iterators.md.
> 

Ok from me. I'd consider these kinds of cleanups obvious changes.
Thanks,
Kyrill

> 2023-05-25  Christophe Lyon 
> 
>   gcc/
>   * config/arm/iterators.md (MVE_6): Remove.
>   * config/arm/mve.md: Replace MVE_6 with MVE_5.
> ---
>  gcc/config/arm/iterators.md |  1 -
>  gcc/config/arm/mve.md   | 68 ++---
>  2 files changed, 34 insertions(+), 35 deletions(-)
> 
> diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
> index 597c1dae640..9e77af55d60 100644
> --- a/gcc/config/arm/iterators.md
> +++ b/gcc/config/arm/iterators.md
> @@ -272,7 +272,6 @@
>  (define_mode_iterator MVE_3 [V16QI V8HI])
>  (define_mode_iterator MVE_2 [V16QI V8HI V4SI])
>  (define_mode_iterator MVE_5 [V8HI V4SI])
> -(define_mode_iterator MVE_6 [V8HI V4SI])
>  (define_mode_iterator MVE_7 [V16BI V8BI V4BI V2QI])
>  (define_mode_iterator MVE_7_HI [HI V16BI V8BI V4BI V2QI])
>  (define_mode_iterator MVE_V8HF [V8HF])
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 9e3570c5264..74909ce47e1 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -3732,9 +3732,9 @@
>  ;; [vldrhq_gather_offset_s vldrhq_gather_offset_u]
>  ;;
>  (define_insn "mve_vldrhq_gather_offset_"
> -  [(set (match_operand:MVE_6 0 "s_register_operand" "=")
> - (unspec:MVE_6 [(match_operand: 1
> "memory_operand" "Us")
> -(match_operand:MVE_6 2 "s_register_operand" "w")]
> +  [(set (match_operand:MVE_5 0 "s_register_operand" "=")
> + (unspec:MVE_5 [(match_operand: 1
> "memory_operand" "Us")
> +(match_operand:MVE_5 2 "s_register_operand" "w")]
>   VLDRHGOQ))
>]
>"TARGET_HAVE_MVE"
> @@ -3755,9 +3755,9 @@
>  ;; [vldrhq_gather_offset_z_s vldrhq_gather_offset_z_u]
>  ;;
>  (define_insn "mve_vldrhq_gather_offset_z_"
> -  [(set (match_operand:MVE_6 0 "s_register_operand" "=")
> - (unspec:MVE_6 [(match_operand: 1
> "memory_operand" "Us")
> -(match_operand:MVE_6 2 "s_register_operand" "w")
> +  [(set (match_operand:MVE_5 0 "s_register_operand" "=")
> + (unspec:MVE_5 [(match_operand: 1
> "memory_operand" "Us")
> +(match_operand:MVE_5 2 "s_register_operand" "w")
>  (match_operand: 3
> "vpr_register_operand" "Up")
>   ]VLDRHGOQ))
>]
> @@ -3780,9 +3780,9 @@
>  ;; [vldrhq_gather_shifted_offset_s vldrhq_gather_shifted_offset_u]
>  ;;
>  (define_insn "mve_vldrhq_gather_shifted_offset_"
> -  [(set (match_operand:MVE_6 0 "s_register_operand" "=")
> - (unspec:MVE_6 [(match_operand: 1
> "memory_operand" "Us")
> -(match_operand:MVE_6 2 "s_register_operand" "w")]
> +  [(set (match_operand:MVE_5 0 "s_register_operand" "=")
> + (unspec:MVE_5 [(match_operand: 1
> "memory_operand" "Us")
> +(match_operand:MVE_5 2 "s_register_operand" "w")]
>   VLDRHGSOQ))
>]
>"TARGET_HAVE_MVE"
> @@ -3803,9 +3803,9 @@
>  ;; [vldrhq_gather_shifted_offset_z_s vldrhq_gather_shited_offset_z_u]
>  ;;
>  (define_insn "mve_vldrhq_gather_shifted_offset_z_"
> -  [(set (match_operand:MVE_6 0 "s_register_operand" "=")
> - (unspec:MVE_6 [(match_operand: 1
> "memory_operand" "Us")
> -(match_operand:MVE_6 2 "s_register_operand" "w")
> +  [(set (match_operand:MVE_5 0 "s_register_operand" "=")
> + (unspec:MVE_5 [(match_operand: 1
> "memory_operand" "Us")
> +(match_operand:MVE_5 2 "s_register_operand" "w")
>  (match_operand: 3
> "vpr_register_operand" "Up")
>   ]VLDRHGSOQ))
>]
> @@ -3828,8 +3828,8 @@
>  ;; [vldrhq_s, vldrhq_u]
>  ;;
>  (define_insn "mve_vldrhq_"
> -  [(set (match_operand:MVE_6 0 

[PATCH][committed] aarch64: PR target/99195 Annotate complex FP patterns for vec-concat-zero

2023-05-25 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch annotates the complex add and mla patterns for vec-concat-zero.
Testing showed an interesting bug in our MD patterns where they were defined to 
match:
(plus:VHSDF (match_operand:VHSDF 1 "register_operand" "0")
(unspec:VHSDF [(match_operand:VHSDF 2 "register_operand" 
"w")
   (match_operand:VHSDF 3 "register_operand" 
"w")
   (match_operand:SI 4 "const_int_operand" "n")]
   FCMLA))

but the canonicalisation rules for PLUS require the more "complex" operand to
come first, so during combine, when the new substituted patterns were formed,
combine/recog would try to match:
(plus:V2SF (unspec:V2SF [
(reg:V2SF 100)
(reg:V2SF 101)
(const_int 0 [0])
] UNSPEC_FCMLA270)
(reg:V2SF 99))
instead. This patch fixes the operands of the PLUS RTX in these patterns.
Similar patterns for the dot-product instructions already used the right order.
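
For illustration, the corrected define_insn body simply swaps the PLUS
operands so that the unspec comes first (operand numbering as in the
fragment quoted above):

(plus:VHSDF
  (unspec:VHSDF [(match_operand:VHSDF 2 "register_operand" "w")
                 (match_operand:VHSDF 3 "register_operand" "w")
                 (match_operand:SI 4 "const_int_operand" "n")]
                FCMLA)
  (match_operand:VHSDF 1 "register_operand" "0"))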

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

PR target/99195
* config/aarch64/aarch64-simd.md (aarch64_fcadd): Rename 
to...
(aarch64_fcadd): ... This.
Fix canonicalization of PLUS operands.
(aarch64_fcmla): Rename to...
(aarch64_fcmla): ... This.
Fix canonicalization of PLUS operands.
(aarch64_fcmla_lane): Rename to...
(aarch64_fcmla_lane): ... This.
Fix canonicalization of PLUS operands.
(aarch64_fcmla_laneqv4hf): Rename to...
(aarch64_fcmla_laneqv4hf): ... This.
Fix canonicalization of PLUS operands.
(aarch64_fcmlaq_lane): Fix canonicalization of PLUS operands.

gcc/testsuite/ChangeLog:

PR target/99195
* gcc.target/aarch64/simd/pr99195_9.c: New test.


cmplx-vcz.patch
Description: cmplx-vcz.patch


[PATCH][committed] arm: Implement ACLE Data Intrinsics

2023-05-25 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch implements a number of scalar data processing intrinsics from ACLE
that were requested by some users. Some of these have fast single-instruction
sequences for Armv6 and later, but even for earlier versions they can still emit
an inline sequence or a call to libgcc (and ACLE recommends that they be
unconditionally
available).
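
As a rough usage sketch of a few of the new intrinsics (these particular
calls are my illustration rather than the patch's own tests):

#include <arm_acle.h>
#include <stdint.h>

uint32_t swap_halfwords (uint32_t x) { return __rev16 (x); } /* rev16 on Armv6+ */
uint32_t reverse_bits (uint32_t x)   { return __rbit (x); }  /* rbit, or a fallback */
uint32_t rotate_right8 (uint32_t x)  { return __ror (x, 8); }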

Chris Sidebottom wrote most of the patch; I just cleaned it up, wired up some
builtins,
and adjusted the tests.

Bootstrapped and tested on arm-none-linux-gnueabihf.
Pushing to trunk.
Thanks,
Kyrill

Co-authored-by: Chris Sidebottom 

gcc/ChangeLog:

2023-05-24  Chris Sidebottom  
Kyrylo Tkachov  

* config/arm/arm.md (rbitsi2): Rename to...
(arm_rbit): ... This.
(ctzsi2): Adjust for the above.
(arm_rev16si2): Convert to define_expand.
(arm_rev16si2_alt1): New pattern.
(arm_rev16si2_alt): Rename to...
(*arm_rev16si2_alt2): ... This.
* config/arm/arm_acle.h (__ror, __rorl, __rorll, __clz, __clzl, __clzll,
__cls, __clsl, __clsll, __revsh, __rev, __revl, __revll, __rev16,
__rev16l, __rev16ll, __rbit, __rbitl, __rbitll): Define intrinsics.
* config/arm/arm_acle_builtins.def (rbit, rev16si2): Define builtins.

gcc/testsuite/ChangeLog:

* gcc.target/arm/acle/data-intrinsics-armv6.c: New test.
* gcc.target/arm/acle/data-intrinsics-assembly.c: New test.
* gcc.target/arm/acle/data-intrinsics-rbit.c: New test.
* gcc.target/arm/acle/data-intrinsics.c: New test.


arm-acle.patch
Description: arm-acle.patch


RE: [PATCH] arm: Fix ICE due to infinite splitting [PR109800]

2023-05-25 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Gcc-patches  bounces+kyrylo.tkachov=arm@gcc.gnu.org> On Behalf Of Kyrylo
> Tkachov via Gcc-patches
> Sent: Thursday, May 25, 2023 11:48 AM
> To: Alex Coplan 
> Cc: gcc-patches@gcc.gnu.org; ni...@redhat.com; Richard Earnshaw
> ; Ramana Radhakrishnan
> 
> Subject: RE: [PATCH] arm: Fix ICE due to infinite splitting [PR109800]
> 
> 
> 
> > -Original Message-
> > From: Alex Coplan 
> > Sent: Thursday, May 25, 2023 11:26 AM
> > To: Kyrylo Tkachov 
> > Cc: gcc-patches@gcc.gnu.org; ni...@redhat.com; Richard Earnshaw
> > ; Ramana Radhakrishnan
> > 
> > Subject: Re: [PATCH] arm: Fix ICE due to infinite splitting [PR109800]
> >
> > Hi Kyrill,
> >
> > On 23/05/2023 11:14, Kyrylo Tkachov wrote:
> > > Hi Alex,
> > > diff --git a/gcc/testsuite/gcc.target/arm/pr109800.c
> > b/gcc/testsuite/gcc.target/arm/pr109800.c
> > > new file mode 100644
> > > index 000..71d1ede13dd
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/arm/pr109800.c
> > > @@ -0,0 +1,3 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=armv7-m -mfloat-abi=hard -mfpu=fpv4-sp-
> > d16 -mbig-endian -mpure-code" } */
> > > +double f() { return 5.0; }
> > >
> > > ... The arm testsuite options are kinda hard to get right with all the
> effective
> > targets and multilibs and such hardcoded abi and march options tend to
> > break in some target.
> > > I suggest you put this testcase in gcc.target/arm/pure-code and add a dg-
> > skip-if to skip the test if the multilib options specify a different 
> > float-abi.
> >
> > How about this instead:
> >
> > diff --git a/gcc/testsuite/gcc.target/arm/pure-code/pr109800.c
> > b/gcc/testsuite/gcc.target/arm/pure-code/pr109800.c
> > new file mode 100644
> > index 000..d797b790232
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/pure-code/pr109800.c
> > @@ -0,0 +1,4 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_hard_ok } */
> > +/* { dg-options "-O2 -march=armv7-m -mfloat-abi=hard -mfpu=fpv4-sp-
> d16 -
> > mbig-endian -mpure-code" } */
> > +double f() { return 5.0; }
> >
> > Full v2 patch attached.
> 
> Thanks, looks better but I think you'll still want to have a dg-skip-if to 
> avoid
> explicit -mfloat-abi=soft and -mfloat-abi=softfp in the multilib options. You
> can grep in that test directory for examples

Actually, as discussed offline this patch is okay as it has the arm_hard_ok 
check.
Thanks,
Kyrill

> Kyrill
> 
> >
> > Thanks,
> > Alex


RE: [PATCH] arm: Fix ICE due to infinite splitting [PR109800]

2023-05-25 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Alex Coplan 
> Sent: Thursday, May 25, 2023 11:26 AM
> To: Kyrylo Tkachov 
> Cc: gcc-patches@gcc.gnu.org; ni...@redhat.com; Richard Earnshaw
> ; Ramana Radhakrishnan
> 
> Subject: Re: [PATCH] arm: Fix ICE due to infinite splitting [PR109800]
> 
> Hi Kyrill,
> 
> On 23/05/2023 11:14, Kyrylo Tkachov wrote:
> > Hi Alex,
> > diff --git a/gcc/testsuite/gcc.target/arm/pr109800.c
> b/gcc/testsuite/gcc.target/arm/pr109800.c
> > new file mode 100644
> > index 000..71d1ede13dd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/pr109800.c
> > @@ -0,0 +1,3 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=armv7-m -mfloat-abi=hard -mfpu=fpv4-sp-
> d16 -mbig-endian -mpure-code" } */
> > +double f() { return 5.0; }
> >
> > ... The arm testsuite options are kinda hard to get right with all the 
> > effective
> targets and multilibs and such hardcoded abi and march options tend to
> break in some target.
> > I suggest you put this testcase in gcc.target/arm/pure-code and add a dg-
> skip-if to skip the test if the multilib options specify a different 
> float-abi.
> 
> How about this instead:
> 
> diff --git a/gcc/testsuite/gcc.target/arm/pure-code/pr109800.c
> b/gcc/testsuite/gcc.target/arm/pure-code/pr109800.c
> new file mode 100644
> index 000..d797b790232
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/pure-code/pr109800.c
> @@ -0,0 +1,4 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_hard_ok } */
> +/* { dg-options "-O2 -march=armv7-m -mfloat-abi=hard -mfpu=fpv4-sp-d16 -
> mbig-endian -mpure-code" } */
> +double f() { return 5.0; }
> 
> Full v2 patch attached.

Thanks, looks better but I think you'll still want to have a dg-skip-if to 
avoid explicit -mfloat-abi=soft and -mfloat-abi=softfp in the multilib options. 
You can grep in that test directory for examples
Kyrill

> 
> Thanks,
> Alex


[ping] RE: [PATCH] stor-layout, aarch64: Express SRA intrinsics with RTL codes

2023-05-25 Thread Kyrylo Tkachov via Gcc-patches
Ping.
Thanks,
Kyrill

> -Original Message-
> From: Gcc-patches  bounces+kyrylo.tkachov=arm@gcc.gnu.org> On Behalf Of Kyrylo
> Tkachov via Gcc-patches
> Sent: Thursday, May 18, 2023 4:19 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH] stor-layout, aarch64: Express SRA intrinsics with RTL codes
> 
> Hi all,
> 
> This patch expresses the intrinsics for the SRA and RSRA instructions with
> standard RTL codes rather than relying on UNSPECs.
> These instructions perform a vector shift right plus accumulate with an
> optional rounding constant addition for the RSRA variant.
> There are a number of interesting points:
> 
> * The scalar-in-SIMD-registers variant for DImode SRA e.g. ssra d0, d1, #N
> is left using the UNSPECs. Expressing it as a DImode plus+shift led to all
> kinds of trouble as it started matching the existing define_insns for
> "add x0, x0, asr #N" instructions and adding the SRA form as an extra
> alternative required a significant amount of deduplication of iterators and
> things still didn't work out well. I decided not to tackle that case in
> this patch. It can be attempted later.
> 
> * For the RSRA variants that add a rounding constant (1 << (shift-1)) the
> addition is notionally performed in a wider mode than the input types so that
> overflow is handled properly. In RTL this can be represented with an
> appropriate
> extend operation followed by a truncate back to the original modes.
> However for 128-bit input modes such as V4SI we don't have appropriate
> modes
> defined for this widening i.e. we'd need a V4DI mode to represent the
> intermediate widened result.  This patch defines such modes for
> V16HI,V8SI,V4DI,V2TI. These will come in handy in the future too, as we have
> more Advanced SIMD instructions that have similar intermediate widening
> semantics.
> 
> * The above new modes led to a problem with stor-layout.cc. The new modes
> only
> exist for the sake of the RTL optimisers understanding the semantics of the
> instruction but are not intended to be moved to and from registers or
> memory,
> assigned to types, used as TYPE_MODE or participate in auto-vectorisation.
> This is expressed in aarch64 by aarch64_classify_vector_mode returning zero
> for these new modes. However, the code in stor-
> layout.cc:
> explicitly doesn't check this when picking a TYPE_MODE due to modes being
> made
> potentially available later through target switching (PR38240).
> This led to these modes being picked as TYPE_MODE for declarations such as:
> typedef int16_t vnx8hi __attribute__((vector_size (32))) when 256-bit
> fixed-length SVE modes are available and vector_type_mode later struggling
> to rectify this.
> This issue is addressed with the new target hook
> TARGET_VECTOR_MODE_SUPPORTED_ANY_TARGET_P that is intended to
> check if a
> vector mode can be used in any legal target attribute configuration of the
> port, as opposed to the existing TARGET_VECTOR_MODE_SUPPORTED_P that
> checks
> only the initial target configuration. This allows a simple adjustment in
> stor-layout.cc that still disqualifies these limited modes early on while
> allowing consideration of modes that can be turned on in the future with
> target attributes.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for the non-aarch64 parts?
> 
> Thanks,
> Kyrill
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-modes.def (V16HI, V8SI, V4DI, V2TI): New
> modes.
>   * config/aarch64/aarch64-protos.h (aarch64_const_vec_rnd_cst_p):
>   Declare prototype.
>   (aarch64_const_vec_rsra_rnd_imm_p): Likewise.
>   * config/aarch64/aarch64-simd.md (*aarch64_simd_sra):
> Rename to...
>   (aarch64_sra_n_insn): ... This.
>   (aarch64_rsra_n_insn): New define_insn.
>   (aarch64_sra_n): New define_expand.
>   (aarch64_rsra_n): Likewise.
>   (aarch64_sra_n): Rename to...
>   (aarch64_sra_ndi): ... This.
>   * config/aarch64/aarch64.cc (aarch64_classify_vector_mode): Add
>   any_target_p argument.
>   (aarch64_extract_vec_duplicate_wide_int): Define.
>   (aarch64_const_vec_rsra_rnd_imm_p): Likewise.
>   (aarch64_const_vec_rnd_cst_p): Likewise.
>   (aarch64_vector_mode_supported_any_target_p): Likewise.
>   (TARGET_VECTOR_MODE_SUPPORTED_ANY_TARGET_P): Likewise.
>   * config/aarch64/iterators.md (UNSPEC_SRSRA, UNSPEC_URSRA):
> Delete.
>   (VSRA): Adjust for the above.
>   (sur): Likewise.
>   (V2XWIDE): New mode_attr.
>   (vec_or_offset): Likewise.
>   (SHIFTEXTEND): Likewise.
>   * config/aarch64/predicates.md (aarch64_simd_rsra_rnd_imm_vec):
> New
>   predicate.
>   * doc/tm.texi (TARGET_VECTOR_

RE: [PATCH] aarch64: Implement vector FP absolute compare intrinsics with builtins

2023-05-25 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Kyrylo Tkachov
> Sent: Thursday, May 18, 2023 12:14 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Sandiford 
> Subject: [PATCH] aarch64: Implement vector FP absolute compare intrinsics
> with builtins
> 
> Hi all,
> 
> While optimising some vector math library code with intrinsics we stumbled
> upon the issue in the testcase.
> The compiler should be generating a FACGT instruction but instead we
> generate:
> foo(__Float32x4_t, __Float32x4_t, __Float32x4_t):
> fabsv0.4s, v0.4s
> adrpx0, .LC0
> ldr q31, [x0, #:lo12:.LC0]
> fcmgt   v0.4s, v0.4s, v31.4s
> ret
> 
> This is because the vcagtq_f32 intrinsic is open-coded in arm_neon.h as
> return vabsq_f32 (__a) > vabsq_f32 (__b)
> thus relying on the optimisers to merge it back together. But since one of the
> arms of the comparison
> is a vector constant the combine pass optimises the abs into it and tries
> matching:
> (set (reg:V4SI 101)
> (neg:V4SI (gt:V4SI (reg:V4SF 100)
> (const_vector:V4SF [
> (const_double:SF 1.0e+2 [0x0.c8p+7]) repeated x4
> ]
> and
> (set (reg:V4SI 101)
> (neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 104))
> (reg:V4SF 103
> 
> instead of what we want:
> (insn 13 9 14 2 (set (reg/i:V4SI 32 v0)
> (neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 98))
> (abs:V4SF (reg:V4SF 96)
> 
> I don't really see a good way around that with our current implementation of
> these intrinsics.
> Therefore this patch reimplements these intrinsics with aarch64 builtins that
> generate the RTL for these
> instructions directly. Apparently we already had them defined in aarch64-
> simd-builtins.def and have been
> using them for the fp16 case already.
> I realise that this approach is against the general principle of expressing
> intrinsics in the higher-level constructs,
> so I'm willing to listen to counter-arguments.
> That said, the FACGT/FACGE instructions are as fast as the non-ABS
> comparison instructions on all microarchitectures that I know of
> so it should always be a win to have them in the merged form rather than
> split the fabs step separately or try to hoist it.
> And the testcase does come from real library code that we're trying to
> optimise.
> With this patch for the testcase we generate:
> foo:
> adrpx0, .LC0
> ldr q31, [x0, #:lo12:.LC0]
> facgt   v0.4s, v0.4s, v31.4s
> ret
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> I'll hold off on committing this to give folks a few days to comment, but will
> push by the end of next week if there are no objections.

Pushed to trunk.
Thanks,
Kyrill

> 
> Thanks,
> Kyrill
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/arm_neon.h (vcage_f64): Reimplement with
> builtins.
>   (vcage_f32): Likewise.
>   (vcages_f32): Likewise.
>   (vcageq_f32): Likewise.
>   (vcaged_f64): Likewise.
>   (vcageq_f64): Likewise.
>   (vcagts_f32): Likewise.
>   (vcagt_f32): Likewise.
>   (vcagt_f64): Likewise.
>   (vcagtq_f32): Likewise.
>   (vcagtd_f64): Likewise.
>   (vcagtq_f64): Likewise.
>   (vcale_f32): Likewise.
>   (vcale_f64): Likewise.
>   (vcaled_f64): Likewise.
>   (vcales_f32): Likewise.
>   (vcaleq_f32): Likewise.
>   (vcaleq_f64): Likewise.
>   (vcalt_f32): Likewise.
>   (vcalt_f64): Likewise.
>   (vcaltd_f64): Likewise.
>   (vcaltq_f32): Likewise.
>   (vcaltq_f64): Likewise.
>   (vcalts_f32): Likewise.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/aarch64/simd/facgt_constpool_1.c: New test.


[PATCH][committed] aarch64: PR target/99195 Annotate vector shift patterns for vec-concat-zero

2023-05-24 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

Continuing the series of straightforward annotations, this one handles the 
normal (not widening or narrowing) vector shifts.
Tests included.

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

PR target/99195
* config/aarch64/aarch64-simd.md (aarch64_simd_lshr): Rename to...
(aarch64_simd_lshr): ... This.
(aarch64_simd_ashr): Rename to...
(aarch64_simd_ashr): ... This.
(aarch64_simd_imm_shl): Rename to...
(aarch64_simd_imm_shl): ... This.
(aarch64_simd_reg_sshl): Rename to...
(aarch64_simd_reg_sshl): ... This.
(aarch64_simd_reg_shl_unsigned): Rename to...
(aarch64_simd_reg_shl_unsigned): ... This.
(aarch64_simd_reg_shl_signed): Rename to...
(aarch64_simd_reg_shl_signed): ... This.
(vec_shr_): Rename to...
(vec_shr_): ... This.
(aarch64_shl): Rename to...
(aarch64_shl): ... This.
(aarch64_qshl): Rename to...
(aarch64_qshl): ... This.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/pr99195_1.c: Add testing for shifts.
* gcc.target/aarch64/simd/pr99195_6.c: Likewise.
* gcc.target/aarch64/simd/pr99195_8.c: New test.


shift.patch
Description: shift.patch


[PATCH][committed] arm: PR target/109939 Correct signedness of return type of __ssat intrinsics

2023-05-24 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

As the PR says, we shouldn't be using qualifier_unsigned for the return type of
the __ssat intrinsics.
UNSIGNED_SAT_BINOP_UNSIGNED_IMM_QUALIFIERS already exists for that.
This was just a thinko.
This patch fixes this and the warning with -Wconversion goes away.
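
A small example of the visible effect (my illustration; the new test in the
patch may differ):

#include <arm_acle.h>
#include <stdint.h>

int32_t clamp16 (int32_t x)
{
  /* __ssat saturates x to the signed 16-bit range.  With the old,
     mistakenly unsigned return type this tripped -Wconversion; with
     the fix the result is signed as ACLE specifies.  */
  return __ssat (x, 16);
}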

Bootstrapped and tested on arm-none-linux-gnueabihf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

PR target/109939
* config/arm/arm-builtins.cc (SAT_BINOP_UNSIGNED_IMM_QUALIFIERS): Use
qualifier_none for the return operand.

gcc/testsuite/ChangeLog:

PR target/109939
* gcc.target/arm/pr109939.c: New test.


satsign.patch
Description: satsign.patch


RE: [PATCH] arm: Fix ICE due to infinite splitting [PR109800]

2023-05-23 Thread Kyrylo Tkachov via Gcc-patches
Hi Alex,

> -Original Message-
> From: Alex Coplan 
> Sent: Thursday, May 11, 2023 12:15 PM
> To: gcc-patches@gcc.gnu.org
> Cc: ni...@redhat.com; Richard Earnshaw ;
> Ramana Radhakrishnan ; Kyrylo Tkachov
> 
> Subject: [PATCH] arm: Fix ICE due to infinite splitting [PR109800]
> 
> Hi,
> 
> In r11-966-g9a182ef9ee011935d827ab5c6c9a7cd8e22257d8 we introduce a
> simplification to emit_move_insn that attempts to simplify moves of the
> form:
> 
> (set (subreg:M1 (reg:M2 ...)) (constant C))
> 
> where M1 and M2 are of equal mode size. That is problematic for the splitter
> vfp.md:no_literal_pool_df_immediate in the arm backend, which tries to pun
> an
> lvalue DFmode pseudo into DImode and assign a constant to it with
> emit_move_insn, as the new transformation simply undoes this, and we end
> up
> splitting indefinitely.
> 
> This patch changes things around in the arm backend so that we use a
> DImode temporary (instead of DFmode) and first load the DImode constant
> into the pseudo, and then pun the pseudo into DFmode as an rvalue in a
> reg -> reg move. I believe this should be semantically equivalent but
> avoids the pathological behaviour seen in the PR.
> 
> Bootstrapped/regtested on arm-linux-gnueabihf, regtested on
> arm-none-eabi and armeb-none-eabi.
> 
> OK for trunk and backports?

Ok but the testcase...

> 
> Thanks,
> Alex
> 
> gcc/ChangeLog:
> 
>   PR target/109800
>   * config/arm/arm.md (movdf): Generate temporary pseudo in
> DImode
>   instead of DFmode.
>   * config/arm/vfp.md (no_literal_pool_df_immediate): Rather than
> punning an
>   lvalue DFmode pseudo into DImode, use a DImode pseudo and pun it
> into
>   DFmode as an rvalue.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR target/109800
>   * gcc.target/arm/pr109800.c: New test.

diff --git a/gcc/testsuite/gcc.target/arm/pr109800.c 
b/gcc/testsuite/gcc.target/arm/pr109800.c
new file mode 100644
index 000..71d1ede13dd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pr109800.c
@@ -0,0 +1,3 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=armv7-m -mfloat-abi=hard -mfpu=fpv4-sp-d16 
-mbig-endian -mpure-code" } */
+double f() { return 5.0; }

... The arm testsuite options are kinda hard to get right with all the
effective targets and multilibs, and such hardcoded ABI and -march options
tend to break on some targets.
I suggest you put this testcase in gcc.target/arm/pure-code and add a 
dg-skip-if to skip the test if the multilib options specify a different 
float-abi.

Thanks,
Kyrill


[PATCH][committed] aarch64: PR target/109855 Add predicate and constraints to define_subst in aarch64-simd.md

2023-05-23 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

In this PR we ICE because the substituted pattern for mla "lost" its predicate 
and constraint for operand 0
because the define_subst template:
  [(set (match_operand: 0)
(vec_concat:
 (match_dup 1)
 (match_operand:VDZ 2 "aarch64_simd_or_scalar_imm_zero")))])

uses match_operand instead of match_dup for operand 0. We can't use match_dup 0
for it because we need to specify the widened mode.
The problem is fixed by adding a "register_operand" predicate and "=w" 
constraint to the match_operand.
This makes sense conceptually too as the transformation we're targeting only 
applies to instructions that write a "w" register.
With this change the mddump pattern that ICEs goes from:
(define_insn ("aarch64_mlav4hi_vec_concatz_le")
 [
(set (match_operand:V8HI 0 ("") ("")) <<-- Missing constraint!
(vec_concat:V8HI (plus:V4HI (mult:V4HI (match_operand:V4HI 2 
("register_operand") ("w"))
(match_operand:V4HI 3 ("register_operand") ("w")))
(match_operand:V4HI 1 ("register_operand") ("0")))
(match_operand:V4HI 4 ("aarch64_simd_or_scalar_imm_zero") 
(""
] ("(!BYTES_BIG_ENDIAN) && (TARGET_SIMD)") ("mla\t%0.4h, %2.4h, %3.4h")

to the proper:
(define_insn ("aarch64_mlav4hi_vec_concatz_le")
 [
(set (match_operand:V8HI 0 ("register_operand") ("=w")) << 
Constraint in the right place
(vec_concat:V8HI (plus:V4HI (mult:V4HI (match_operand:V4HI 2 
("register_operand") ("w"))
(match_operand:V4HI 3 ("register_operand") ("w")))
(match_operand:V4HI 1 ("register_operand") ("0")))
(match_operand:V4HI 4 ("aarch64_simd_or_scalar_imm_zero") 
(""
] ("(!BYTES_BIG_ENDIAN) && (TARGET_SIMD)") ("mla\t%0.4h, %2.4h, %3.4h")

This seems to do the right thing for multi-alternative patterns as well; the
annotated pattern for aarch64_cmltv8qi is:
(define_insn ("aarch64_cmltv8qi")
 [
(set (match_operand:V8QI 0 ("register_operand") ("=w,w"))
(neg:V8QI (lt:V8QI (match_operand:V8QI 1 ("register_operand") 
("w,w"))
(match_operand:V8QI 2 ("aarch64_simd_reg_or_zero") 
("w,ZDz")
]

whereas the substituted version now looks like:
(define_insn ("aarch64_cmltv8qi_vec_concatz_le")
 [
(set (match_operand:V16QI 0 ("register_operand") ("=w,w"))
(vec_concat:V16QI (neg:V8QI (lt:V8QI (match_operand:V8QI 1 
("register_operand") ("w,w"))
(match_operand:V8QI 2 ("aarch64_simd_reg_or_zero") 
("w,ZDz"
(match_operand:V8QI 3 ("aarch64_simd_or_scalar_imm_zero") 
(""
]
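
For reference, a sketch of what the fixed substitution template looks like
(the iterator and mode-attribute names here are placeholders, since the
archive has stripped the originals):

(define_subst "add_vec_concat_subst_le"
  [(set (match_operand:VDZ 0)
        (match_operand:VDZ 1))]
  "!BYTES_BIG_ENDIAN"
  [(set (match_operand:<VDBL> 0 "register_operand" "=w")
        (vec_concat:<VDBL>
          (match_dup 1)
          (match_operand:VDZ 2 "aarch64_simd_or_scalar_imm_zero")))])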

Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

PR target/109855
* config/aarch64/aarch64-simd.md (add_vec_concat_subst_le): Add 
predicate
and constraint for operand 0.
(add_vec_concat_subst_be): Likewise.

gcc/testsuite/ChangeLog:

PR target/109855
* gcc.target/aarch64/pr109855.c: New test.


subst-pred.patch
Description: subst-pred.patch


[PATCH] stor-layout, aarch64: Express SRA intrinsics with RTL codes

2023-05-18 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch expresses the intrinsics for the SRA and RSRA instructions with
standard RTL codes rather than relying on UNSPECs.
These instructions perform a vector shift right plus accumulate with an
optional rounding constant addition for the RSRA variant.
There are a number of interesting points:

* The scalar-in-SIMD-registers variant for DImode SRA e.g. ssra d0, d1, #N
is left using the UNSPECs. Expressing it as a DImode plus+shift led to all
kinds of trouble as it started matching the existing define_insns for
"add x0, x0, asr #N" instructions and adding the SRA form as an extra
alternative required a significant amount of deduplication of iterators and
things still didn't work out well. I decided not to tackle that case in
this patch. It can be attempted later.

* For the RSRA variants that add a rounding constant (1 << (shift-1)) the
addition is notionally performed in a wider mode than the input types so that
overflow is handled properly. In RTL this can be represented with an appropriate
extend operation followed by a truncate back to the original modes.
However for 128-bit input modes such as V4SI we don't have appropriate modes
defined for this widening i.e. we'd need a V4DI mode to represent the
intermediate widened result.  This patch defines such modes for
V16HI,V8SI,V4DI,V2TI. These will come handy in the future too as we have
more Advanced SIMD instruction that have similar intermediate widening
semantics.

* The above new modes led to a problem with stor-layout.cc. The new modes only
exist for the sake of the RTL optimisers understanding the semantics of the
instruction but are not intended to be moved to and from registers or memory,
assigned to types, used as TYPE_MODE, or to participate in auto-vectorisation.
This is expressed in aarch64 by aarch64_classify_vector_mode returning zero
for these new modes. However, the code in stor-layout.cc:
explicitly doesn't check this when picking a TYPE_MODE due to modes being made
potentially available later through target switching (PR38240).
This led to these modes being picked as TYPE_MODE for declarations such as:
typedef int16_t vnx8hi __attribute__((vector_size (32))) when 256-bit
fixed-length SVE modes are available, with vector_type_mode later struggling
to rectify this.
This issue is addressed with the new target hook
TARGET_VECTOR_MODE_SUPPORTED_ANY_TARGET_P that is intended to check if a
vector mode can be used in any legal target attribute configuration of the
port, as opposed to the existing TARGET_VECTOR_MODE_SUPPORTED_P that checks
only the initial target configuration. This allows a simple adjustment in
stor-layout.cc that still disqualifies these limited modes early on while
allowing consideration of modes that can be turned on in the future with
target attributes.
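
A minimal sketch of the widened RSRA representation described in the second
bullet above (modes, operand numbers and the rounding-constant plumbing are
illustrative, not taken verbatim from the patch):

(set (match_operand:V4SI 0 "register_operand")
     (plus:V4SI
       (truncate:V4SI
         (lshiftrt:V4DI
           (plus:V4DI
             (zero_extend:V4DI (match_operand:V4SI 2 "register_operand"))
             (match_dup 4))   ;; vector of (1 << (shift - 1)) rounding constants
           (match_dup 3)))    ;; the shift amount
       (match_operand:V4SI 1 "register_operand")))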

Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for the non-aarch64 parts?

Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64-modes.def (V16HI, V8SI, V4DI, V2TI): New modes.
* config/aarch64/aarch64-protos.h (aarch64_const_vec_rnd_cst_p):
Declare prototype.
(aarch64_const_vec_rsra_rnd_imm_p): Likewise.
* config/aarch64/aarch64-simd.md (*aarch64_simd_sra): Rename to...
(aarch64_sra_n_insn): ... This.
(aarch64_rsra_n_insn): New define_insn.
(aarch64_sra_n): New define_expand.
(aarch64_rsra_n): Likewise.
(aarch64_sra_n): Rename to...
(aarch64_sra_ndi): ... This.
* config/aarch64/aarch64.cc (aarch64_classify_vector_mode): Add
any_target_p argument.
(aarch64_extract_vec_duplicate_wide_int): Define.
(aarch64_const_vec_rsra_rnd_imm_p): Likewise.
(aarch64_const_vec_rnd_cst_p): Likewise.
(aarch64_vector_mode_supported_any_target_p): Likewise.
(TARGET_VECTOR_MODE_SUPPORTED_ANY_TARGET_P): Likewise.
* config/aarch64/iterators.md (UNSPEC_SRSRA, UNSPEC_URSRA): Delete.
(VSRA): Adjust for the above.
(sur): Likewise.
(V2XWIDE): New mode_attr.
(vec_or_offset): Likewise.
(SHIFTEXTEND): Likewise.
* config/aarch64/predicates.md (aarch64_simd_rsra_rnd_imm_vec): New
predicate.
* doc/tm.texi (TARGET_VECTOR_MODE_SUPPORTED_P): Adjust description to
clarify that it applies to current target options.
(TARGET_VECTOR_MODE_SUPPORTED_ANY_TARGET_P): Document.
* doc/tm.texi.in: Regenerate.
* stor-layout.cc (mode_for_vector): Check
vector_mode_supported_any_target_p when iterating through vector modes.
* target.def (TARGET_VECTOR_MODE_SUPPORTED_P): Adjust description to
clarify that it applies to current target options.
(TARGET_VECTOR_MODE_SUPPORTED_ANY_TARGET_P): Define.


sra.patch
Description: sra.patch


[PATCH] aarch64: Implement vector FP absolute compare intrinsics with builtins

2023-05-18 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

While optimising some vector math library code with intrinsics we stumbled upon 
the issue in the testcase.
The compiler should be generating a FACGT instruction but instead we generate:
foo(__Float32x4_t, __Float32x4_t, __Float32x4_t):
fabsv0.4s, v0.4s
adrpx0, .LC0
ldr q31, [x0, #:lo12:.LC0]
fcmgt   v0.4s, v0.4s, v31.4s
ret

This is because the vcagtq_f32 intrinsic is open-coded in arm_neon.h as
return vabsq_f32 (__a) > vabsq_f32 (__b)
thus relying on the optimisers to merge it back together. But since one of the
arms of the comparison
is a vector constant, the combine pass optimises the abs into it and tries
matching:
(set (reg:V4SI 101)
(neg:V4SI (gt:V4SI (reg:V4SF 100)
(const_vector:V4SF [
(const_double:SF 1.0e+2 [0x0.c8p+7]) repeated x4
]
and
(set (reg:V4SI 101)
(neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 104))
(reg:V4SF 103

instead of what we want:
(insn 13 9 14 2 (set (reg/i:V4SI 32 v0)
(neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 98))
(abs:V4SF (reg:V4SF 96)

I don't really see a good way around that with our current implementation of 
these intrinsics.
Therefore this patch reimplements these intrinsics with aarch64 builtins that 
generate the RTL for these
instructions directly. Apparently we already had them defined in 
aarch64-simd-builtins.def and have been
using them for the fp16 case already.
I realise that this approach is against the general principle of expressing
intrinsics in higher-level constructs,
so I'm willing to listen to counter-arguments.
That said, the FACGT/FACGE instructions are as fast as the non-ABS comparison
instructions on all microarchitectures that I know of,
so it should always be a win to have them in the merged form rather than
splitting the fabs step out separately or trying to hoist it.
And the testcase does come from real library code that we're trying to optimise.
With this patch for the testcase we generate:
foo:
adrpx0, .LC0
ldr q31, [x0, #:lo12:.LC0]
facgt   v0.4s, v0.4s, v31.4s
ret
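
The testcase itself isn't reproduced in this message; a simplified guess at
its shape, with the 100.0 constant read off the RTL dump above:

#include <arm_neon.h>

uint32x4_t
foo (float32x4_t a)
{
  /* |a| > 100.0 elementwise: should emit a single FACGT against the
     constant-pool vector rather than FABS followed by FCMGT.  */
  return vcagtq_f32 (a, vdupq_n_f32 (100.0f));
}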

Bootstrapped and tested on aarch64-none-linux-gnu.
I'll hold off on committing this to give folks a few days to comment, but will 
push by the end of next week if there are no objections.

Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/arm_neon.h (vcage_f64): Reimplement with builtins.
(vcage_f32): Likewise.
(vcages_f32): Likewise.
(vcageq_f32): Likewise.
(vcaged_f64): Likewise.
(vcageq_f64): Likewise.
(vcagts_f32): Likewise.
(vcagt_f32): Likewise.
(vcagt_f64): Likewise.
(vcagtq_f32): Likewise.
(vcagtd_f64): Likewise.
(vcagtq_f64): Likewise.
(vcale_f32): Likewise.
(vcale_f64): Likewise.
(vcaled_f64): Likewise.
(vcales_f32): Likewise.
(vcaleq_f32): Likewise.
(vcaleq_f64): Likewise.
(vcalt_f32): Likewise.
(vcalt_f64): Likewise.
(vcaltd_f64): Likewise.
(vcaltq_f32): Likewise.
(vcaltq_f64): Likewise.
(vcalts_f32): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/facgt_constpool_1.c: New test.


facgt.patch
Description: facgt.patch


RE: [GCC12 backport] arm: MVE testsuite and backend bugfixes

2023-05-17 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Stam Markianos-Wright 
> Sent: Wednesday, May 17, 2023 2:41 PM
> To: Kyrylo Tkachov ; gcc-patches@gcc.gnu.org
> Cc: Richard Earnshaw ; Andrea Corallo
> 
> Subject: [GCC12 backport] arm: MVE testsuite and backend bugfixes
> 
> 
> On 17/05/2023 10:26, Kyrylo Tkachov wrote:
> > Hi Stam,
> >
> >> -Original Message-
> >> From: Stam Markianos-Wright 
> >> Sent: Tuesday, May 16, 2023 2:32 PM
> >> To: gcc-patches@gcc.gnu.org
> >> Cc: Kyrylo Tkachov ; Richard Earnshaw
> >> ; Andrea Corallo
> 
> >> Subject: [GCC12 backport] arm: MVE testsuite and backend bugfixes
> >>
> >> Hi all,
> >>
> >> We've recently sent up a lot of patches overhauling the testsuite of the
> >> Arm MVE backend.
> >> With these changes, we've also identified and fixed a number of bugs
> >> (some backend bugs and many to do with the polymorphism of intrinsics
> >> in
> >> the MVE header file).
> >> These would all be relevant to backport to GCC12.
> >> The list is as follows (in the order they all apply on top of each other):
> >>
> >> * This patch series:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606552.html
> >> (commits 9a79b522e0663a202a288db56ebcbdcdb48bdaca to
> >> f2b54e5b796b00f0072b61f9cd6a964c66ead29b)
> >> * ecc363971aeac52481d92de8b37521f6cc2d38e6 arm: Fix MVE testsuite
> >> fallouts
> >> * 06aa66af7d0dacc1b247d9e38175e789ef159191 arm: Add missing early
> >> clobber to MVE vrev64q_m patterns
> >> * c09663eabfb84ac56ddd8d44abcab3f4902c83bd testsuite: [arm] Relax
> >> expected register names in MVE tests
> >> * 330d665ce6dcc63ed0bd78d807e69bbfc55255b6 arm: [MVE] Add missing
> >> length=8 attribute
> >> * 8d4f007398bc3f8fea812fb8cff4d7d0556d12f1 arm: fix mve intrinsics scan
> >> body tests for C++
> >> * This patch series
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-January/610312.html
> >> (commits dd4424ef898608321b60610c4f3c98737ace3680 to
> >> 267f01a493ab8a0bec9325ce3386b946c46f2e98)
> >> * 8a1360e72d6c6056606aa5edd8c906c50f26de59 arm: Split up MVE
> _Generic
> >> associations to prevent type clashes [PR107515]
> >> * 3f0ca7a3e4431534bff3b8eb73709cc822e489b0 arm: Fix vcreate
> definition
> >> * c1093923733a1072a237f112e3239b5ebd88eadd arm: Make MVE
> masked
> >> stores
> >> read memory operand [PR 108177]
> >> * f54e31ddefe3ea7146624eabcb75b1c90dc59f1a arm: fix __arm_vld1q_z*
> >> and
> >> __arm_vst1q_p* intrinsics [PR108442]
> >> * 1d509f190393627cdf0afffc427b25dd21c2 arm: remove unused
> variables
> >> from test
> >>
> > Ok to backport.
> >
> >> -- up to this point everything applied cleanly. The final two need minor
> >> rebasing changes --
> >>
> >> * This patch series:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-April/617008.html (Not
> >> pushed to trunk yet, but has been approved. For trunk we do now need to
> >> resolve some merge conflicts, since Christophe has started merging the
> >> MVE Intrinsic Restructuring, but these are trivial. I will also backport
> >> to GCC13 where this patch series applies cleanly)
> >> * cfa118fc089e38a94ec60ccf5b667aea015e5f60 [arm] complete vmsr/vmrs
> >> blank and case adjustments.
> >>
> >> The final one is a commit from Alexandre Oliva that is needed to ensure
> >> that we don't accidentally regress the test due to the tabs vs spaces
> >> and capitalisation on the vmrs/vmsr instructions :)
> >>
> >> After all that, no regressions on baremetal arm-none-eabi in a bunch
> >> configurations (-marm, thumb1, thumb2, MVE, MVE.FP, softfp and hardfp):
> >>
> > Will you be sending these to the list after adjusting?
> 
> Yep, I believe we have to!
> 
> I'm thinking we should do one batch of [committed] emails for GCC12 and
> one for trunk.

Sounds good.

> 
> For GCC13 the previously sent version of the series at
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/617373.html applies
> cleanly. Let me know if there's anything further we need to do!
> 

WFM, please go ahead.
Thanks,
Kyrill

> Thanks,
> Stamatis
> 
> 
> > Thanks,
> > Kyrill
> >
> >> Thanks,
> >> Stam


RE: [GCC12 backport] arm: MVE testsuite and backend bugfixes

2023-05-17 Thread Kyrylo Tkachov via Gcc-patches
Hi Stam,

> -Original Message-
> From: Stam Markianos-Wright 
> Sent: Tuesday, May 16, 2023 2:32 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov ; Richard Earnshaw
> ; Andrea Corallo 
> Subject: [GCC12 backport] arm: MVE testsuite and backend bugfixes
> 
> Hi all,
> 
> We've recently sent up a lot of patches overhauling the testsuite of the
> Arm MVE backend.
> With these changes, we've also identified and fixed a number of bugs
> (some backend bugs and many to do with the polymorphism of intrinsics in
> the MVE header file).
> These would all be relevant to backport to GCC12.
> The list is as follows (in the order they all apply on top of each other):
> 
> * This patch series:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606552.html
> (commits 9a79b522e0663a202a288db56ebcbdcdb48bdaca to
> f2b54e5b796b00f0072b61f9cd6a964c66ead29b)
> * ecc363971aeac52481d92de8b37521f6cc2d38e6 arm: Fix MVE testsuite
> fallouts
> * 06aa66af7d0dacc1b247d9e38175e789ef159191 arm: Add missing early
> clobber to MVE vrev64q_m patterns
> * c09663eabfb84ac56ddd8d44abcab3f4902c83bd testsuite: [arm] Relax
> expected register names in MVE tests
> * 330d665ce6dcc63ed0bd78d807e69bbfc55255b6 arm: [MVE] Add missing
> length=8 attribute
> * 8d4f007398bc3f8fea812fb8cff4d7d0556d12f1 arm: fix mve intrinsics scan
> body tests for C++
> * This patch series
> https://gcc.gnu.org/pipermail/gcc-patches/2023-January/610312.html
> (commits dd4424ef898608321b60610c4f3c98737ace3680 to
> 267f01a493ab8a0bec9325ce3386b946c46f2e98)
> * 8a1360e72d6c6056606aa5edd8c906c50f26de59 arm: Split up MVE _Generic
> associations to prevent type clashes [PR107515]
> * 3f0ca7a3e4431534bff3b8eb73709cc822e489b0 arm: Fix vcreate definition
> * c1093923733a1072a237f112e3239b5ebd88eadd arm: Make MVE masked
> stores
> read memory operand [PR 108177]
> * f54e31ddefe3ea7146624eabcb75b1c90dc59f1a arm: fix __arm_vld1q_z*
> and
> __arm_vst1q_p* intrinsics [PR108442]
> * 1d509f190393627cdf0afffc427b25dd21c2 arm: remove unused variables
> from test
> 

Ok to backport.

> -- up to this point everything applied cleanly. The final two need minor
> rebasing changes --
> 
> * This patch series:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-April/617008.html (Not
> pushed to trunk yet, but has been approved. For trunk we do now need to
> resolve some merge conflicts, since Christophe has started merging the
> MVE Intrinsic Restructuring, but these are trivial. I will also backport
> to GCC13 where this patch series applies cleanly)
> * cfa118fc089e38a94ec60ccf5b667aea015e5f60 [arm] complete vmsr/vmrs
> blank and case adjustments.
> 
> The final one is a commit from Alexandre Oliva that is needed to ensure
> that we don't accidentally regress the test due to the tabs vs spaces
> and capitalisation on the vmrs/vmsr instructions :)
> 
> After all that, no regressions on baremetal arm-none-eabi in a bunch
> configurations (-marm, thumb1, thumb2, MVE, MVE.FP, softfp and hardfp):
> 

Will you be sending these to the list after adjusting?
Thanks,
Kyrill

> Thanks,
> Stam


