[PATCH] vect: Fix access size alignment assumption [PR115192]

2024-05-24 Thread Richard Sandiford
create_intersect_range_checks checks whether two access ranges
a and b are alias-free using something equivalent to:

  end_a <= start_b || end_b <= start_a

It has two ways of doing this: a "vanilla" way that calculates
the exact exclusive end pointers, and another way that uses the
last inclusive aligned pointers (and changes the comparisons
accordingly).  The comment for the latter is:

  /* Calculate the minimum alignment shared by all four pointers,
 then arrange for this alignment to be subtracted from the
 exclusive maximum values to get inclusive maximum values.
 This "- min_align" is cumulative with a "+ access_size"
 in the calculation of the maximum values.  In the best
 (and common) case, the two cancel each other out, leaving
 us with an inclusive bound based only on seg_len.  In the
 worst case we're simply adding a smaller number than before.

The problem is that the associated code implicitly assumed that the
access size was a multiple of the pointer alignment, and so the
alignment could be carried over to the exclusive end pointer.

The testcase started failing after g:9fa5b473b5b8e289b6542
because that commit improved the alignment information for
the accesses.

Tested on aarch64-linux-gnu & x86_64-linux-gnu.  OK for trunk
and backports?

Richard


gcc/
PR tree-optimization/115192
* tree-data-ref.cc (create_intersect_range_checks): Take the
alignment of the access sizes into account.

gcc/testsuite/
PR tree-optimization/115192
* gcc.dg/vect/pr115192.c: New test.
---
 gcc/testsuite/gcc.dg/vect/pr115192.c | 28 
 gcc/tree-data-ref.cc |  5 -
 2 files changed, 32 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr115192.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr115192.c 
b/gcc/testsuite/gcc.dg/vect/pr115192.c
new file mode 100644
index 000..923d377c1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115192.c
@@ -0,0 +1,28 @@
+#include "tree-vect.h"
+
+int data[4 * 16 * 16] __attribute__((aligned(16)));
+
+__attribute__((noipa)) void
+foo (__SIZE_TYPE__ n)
+{
+  for (__SIZE_TYPE__ i = 1; i < n; ++i)
+{
+  data[i * n * 4] = data[(i - 1) * n * 4] + 1;
+  data[i * n * 4 + 1] = data[(i - 1) * n * 4 + 1] + 2;
+}
+}
+
+int
+main ()
+{
+  check_vect ();
+
+  data[0] = 10;
+  data[1] = 20;
+
+  foo (3);
+
+  if (data[24] != 12 || data[25] != 24)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index db15ddb43de..7c4049faf34 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -73,6 +73,7 @@ along with GCC; see the file COPYING3.  If not see
 
 */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -2640,7 +2641,9 @@ create_intersect_range_checks (class loop *loop, tree 
*cond_expr,
 Because the maximum values are inclusive, there is an alias
 if the maximum value of one segment is equal to the minimum
 value of the other.  */
-  min_align = MIN (dr_a.align, dr_b.align);
+  min_align = std::min (dr_a.align, dr_b.align);
+  min_align = std::min (min_align, known_alignment (dr_a.access_size));
+  min_align = std::min (min_align, known_alignment (dr_b.access_size));
   cmp_code = LT_EXPR;
 }
 
-- 
2.25.1



Re: [PATCH] aarch64: Fold vget_high_* intrinsics to BIT_FIELD_REF [PR102171]

2024-05-22 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch is a follow-up of r15-697-ga2e4fe5a53cf75 to also fold vget_high_*
> intrinsics to BIT_FILED_REF and remove the vget_high_* definitions from
> arm_neon.h to use the new intrinsics framework.
>
>   PR target/102171
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (AARCH64_SIMD_VGET_HIGH_BUILTINS):
>   New macro to create definitions for all vget_high intrinsics.
>   (VGET_HIGH_BUILTIN): Likewise.
>   (enum aarch64_builtins): Add vget_high function codes.
>   (AARCH64_SIMD_VGET_LOW_BUILTINS): Delete duplicate macro.
>   (aarch64_general_fold_builtin): Fold vget_high calls.
>   * config/aarch64/aarch64-simd-builtins.def: Delete vget_high builtins.
>   * config/aarch64/aarch64-simd.md (aarch64_get_high): Delete.
>   (aarch64_vget_hi_halfv8bf): Likewise.
>   * config/aarch64/arm_neon.h (__attribute__): Delete.
>   (vget_high_f16): Likewise.
>   (vget_high_f32): Likewise.
>   (vget_high_f64): Likewise.
>   (vget_high_p8): Likewise.
>   (vget_high_p16): Likewise.
>   (vget_high_p64): Likewise.
>   (vget_high_s8): Likewise.
>   (vget_high_s16): Likewise.
>   (vget_high_s32): Likewise.
>   (vget_high_s64): Likewise.
>   (vget_high_u8): Likewise.
>   (vget_high_u16): Likewise.
>   (vget_high_u32): Likewise.
>   (vget_high_u64): Likewise.
>   (vget_high_bf16): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/vget_high_2.c: New test.
>   * gcc.target/aarch64/vget_high_2_be.c: New test.

OK, thanks.

Richard

> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-builtins.cc|  59 +++---
>  gcc/config/aarch64/aarch64-simd-builtins.def  |   6 -
>  gcc/config/aarch64/aarch64-simd.md|  22 
>  gcc/config/aarch64/arm_neon.h | 105 --
>  .../gcc.target/aarch64/vget_high_2.c  |  30 +
>  .../gcc.target/aarch64/vget_high_2_be.c   |  31 ++
>  6 files changed, 104 insertions(+), 149 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vget_high_2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vget_high_2_be.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 11b888016ed..f8eeccb554d 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -675,6 +675,23 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>VGET_LOW_BUILTIN(u64) \
>VGET_LOW_BUILTIN(bf16)
>  
> +#define AARCH64_SIMD_VGET_HIGH_BUILTINS \
> +  VGET_HIGH_BUILTIN(f16) \
> +  VGET_HIGH_BUILTIN(f32) \
> +  VGET_HIGH_BUILTIN(f64) \
> +  VGET_HIGH_BUILTIN(p8) \
> +  VGET_HIGH_BUILTIN(p16) \
> +  VGET_HIGH_BUILTIN(p64) \
> +  VGET_HIGH_BUILTIN(s8) \
> +  VGET_HIGH_BUILTIN(s16) \
> +  VGET_HIGH_BUILTIN(s32) \
> +  VGET_HIGH_BUILTIN(s64) \
> +  VGET_HIGH_BUILTIN(u8) \
> +  VGET_HIGH_BUILTIN(u16) \
> +  VGET_HIGH_BUILTIN(u32) \
> +  VGET_HIGH_BUILTIN(u64) \
> +  VGET_HIGH_BUILTIN(bf16)
> +
>  typedef struct
>  {
>const char *name;
> @@ -717,6 +734,9 @@ typedef struct
>  #define VGET_LOW_BUILTIN(A) \
>AARCH64_SIMD_BUILTIN_VGET_LOW_##A,
>  
> +#define VGET_HIGH_BUILTIN(A) \
> +  AARCH64_SIMD_BUILTIN_VGET_HIGH_##A,
> +
>  #undef VAR1
>  #define VAR1(T, N, MAP, FLAG, A) \
>AARCH64_SIMD_BUILTIN_##T##_##N##A,
> @@ -753,6 +773,7 @@ enum aarch64_builtins
>/* SIMD intrinsic builtins.  */
>AARCH64_SIMD_VREINTERPRET_BUILTINS
>AARCH64_SIMD_VGET_LOW_BUILTINS
> +  AARCH64_SIMD_VGET_HIGH_BUILTINS
>/* ARMv8.3-A Pointer Authentication Builtins.  */
>AARCH64_PAUTH_BUILTIN_AUTIA1716,
>AARCH64_PAUTH_BUILTIN_PACIA1716,
> @@ -855,26 +876,21 @@ static aarch64_fcmla_laneq_builtin_datum 
> aarch64_fcmla_lane_builtin_data[] = {
> false \
>},
>  
> -#define AARCH64_SIMD_VGET_LOW_BUILTINS \
> -  VGET_LOW_BUILTIN(f16) \
> -  VGET_LOW_BUILTIN(f32) \
> -  VGET_LOW_BUILTIN(f64) \
> -  VGET_LOW_BUILTIN(p8) \
> -  VGET_LOW_BUILTIN(p16) \
> -  VGET_LOW_BUILTIN(p64) \
> -  VGET_LOW_BUILTIN(s8) \
> -  VGET_LOW_BUILTIN(s16) \
> -  VGET_LOW_BUILTIN(s32) \
> -  VGET_LOW_BUILTIN(s64) \
> -  VGET_LOW_BUILTIN(u8) \
> -  VGET_LOW_BUILTIN(u16) \
> -  VGET_LOW_BUILTIN(u32) \
> -  VGET_LOW_BUILTIN(u64) \
> -  VGET_LOW_BUILTIN(bf16)
> +#undef VGET_HIGH_BUILTIN
> +#define VGET_HIGH_BUILTIN(A) \
> +  {"vget_high_" #A, \
> +   AARCH64_SIMD_BUILTIN_VGET_HIGH_##A, \
> +   2, \
> +   { SIMD_INTR_MODE(A, d), SIMD_INTR_MODE(A, q) }, \
> +   { SIMD_INTR_QUAL(A), SIMD_INTR_QUAL(A) }, \
> +   FLAG_AUTO_FP, \
> +   false \
> +  },
>  
>  static const aarch64_simd_intrinsic_datum aarch64_simd_intrinsic_data[] = {
>AARCH64_SIMD_VREINTERPRET_BUILTINS
>AARCH64_SIMD_VGET_LOW_BUILTINS
> +  AARCH64_SIMD_VGET_HIGH_BUILTINS
>  };
>  
>  
> @@ -3270,6 +3286,10 @@ aarch64_fold_builtin_lane_check (tree arg0, tree arg1, 
> tree arg2)
>  #define 

Re: [PATCH v1 5/6] Adjust DLL import/export implementation for AArch64

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> The DLL import/export mingw implementation, originally from ix86, requires
> minor adjustments to be compatible with AArch64.
>
> gcc/ChangeLog:
>
>   * config/mingw/mingw32.h (defined): Use the correct DllMainCRTStartup
>   entry function.
>   * config/mingw/winnt-dll.cc (defined): Exclude ix86-related code.
> ---
>  gcc/config/mingw/mingw32.h| 2 +-
>  gcc/config/mingw/winnt-dll.cc | 4 
>  2 files changed, 5 insertions(+), 1 deletion(-)

Could we provide some abstractions here, rather than testing
CPU-specific macros directly?  E.g.:

>
> diff --git a/gcc/config/mingw/mingw32.h b/gcc/config/mingw/mingw32.h
> index 08f1b5f0696..efe777051b4 100644
> --- a/gcc/config/mingw/mingw32.h
> +++ b/gcc/config/mingw/mingw32.h
> @@ -79,7 +79,7 @@ along with GCC; see the file COPYING3.  If not see
>  #endif
>  
>  #undef SUB_LINK_ENTRY
> -#if TARGET_64BIT_DEFAULT
> +#if TARGET_64BIT_DEFAULT || defined (TARGET_AARCH64_MS_ABI)

it looks like this is equivalent to something like "HAVE_64BIT_POINTERS"
or something, which aarch64 could define to 1 and x86 could define
to TARGET_64BIT_DEFAULT.

The name is just a suggestion, based on not really knowing what the
macro selects.  Please use whatever makes most sense :)

>  #define SUB_LINK_ENTRY SUB_LINK_ENTRY64
>  #else
>  #define SUB_LINK_ENTRY SUB_LINK_ENTRY32
> diff --git a/gcc/config/mingw/winnt-dll.cc b/gcc/config/mingw/winnt-dll.cc
> index 349ade6f5c0..294361fab4c 100644
> --- a/gcc/config/mingw/winnt-dll.cc
> +++ b/gcc/config/mingw/winnt-dll.cc
> @@ -206,9 +206,13 @@ legitimize_pe_coff_symbol (rtx addr, bool inreg)
>   }
>  }
>  
> +#if !defined (TARGET_AARCH64_MS_ABI)
> +
>if (ix86_cmodel != CM_LARGE_PIC && ix86_cmodel != CM_MEDIUM_PIC)
>  return NULL_RTX;
>  
> +#endif
> +

Similarly here, it feels like there is a concept underlying this check.
Could we just use:

  if (!NEW_MACRO)
return NULL_RTX;

with NEW_MACRO describing the underlying property that is common to
medium x86 PIC, large x86 PIC, and aarch64.

Thanks,
Richard

>if (GET_CODE (addr) == SYMBOL_REF
>&& !is_imported_p (addr)
>&& SYMBOL_REF_EXTERNAL_P (addr)


Re: [PATCH v1 4/6] aarch64: Add selectany attribute handling

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> This patch extends the aarch64 attributes list with the selectany
> attribute for the aarch64-w64-mingw32 target and reuses the mingw
> implementation to handle it.
>
>   * config/aarch64/aarch64.cc:
>   Extend the aarch64 attributes list.
>   * config/aarch64/cygming.h (SUBTARGET_ATTRIBUTE_TABLE):
>   Define the selectany attribute.

Now that TARGET_ATTRIBUTE_TABLE is an array, it should in principle
be possible to define the attribute in winnt.cc and so avoid
duplicating the table entry.  That'd be a separate clean-up though.
I agree that for this series we should stick with the current approach.

So the patch is ok, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 5 -
>  gcc/config/aarch64/cygming.h  | 3 +++
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index c763a8a6298..19205927430 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -855,7 +855,10 @@ static const attribute_spec aarch64_gnu_attributes[] =
> NULL },
>{ "Advanced SIMD type", 1, 1, false, true,  false, true,  NULL, NULL },
>{ "SVE type",3, 3, false, true,  false, true,  NULL, NULL 
> },
> -  { "SVE sizeless type",  0, 0, false, true,  false, true,  NULL, NULL }
> +  { "SVE sizeless type",  0, 0, false, true,  false, true,  NULL, NULL },
> +#ifdef SUBTARGET_ATTRIBUTE_TABLE
> +  SUBTARGET_ATTRIBUTE_TABLE
> +#endif
>  };
>  
>  static const scoped_attribute_specs aarch64_gnu_attribute_table =
> diff --git a/gcc/config/aarch64/cygming.h b/gcc/config/aarch64/cygming.h
> index 0d048879311..76623153080 100644
> --- a/gcc/config/aarch64/cygming.h
> +++ b/gcc/config/aarch64/cygming.h
> @@ -154,6 +154,9 @@ still needed for compilation.  */
>  flag_stack_check = STATIC_BUILTIN_STACK_CHECK;   \
>} while (0)
>  
> +#define SUBTARGET_ATTRIBUTE_TABLE \
> +  { "selectany", 0, 0, true, false, false, false, \
> +mingw_handle_selectany_attribute, NULL }
>  
>  #define SUPPORTS_ONE_ONLY 1


Re: [PATCH v1 3/6] Rename functions for reuse in AArch64

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> This patch renames functions related to dllimport/dllexport
> and selectany functionality. These functions will be reused
> in the aarch64-w64-mingw32 target.
>
> gcc/ChangeLog:
>
>   * config/i386/cygming.h (mingw_pe_record_stub):
>   Rename functions in mingw folder which will be reused for
>   aarch64.
>   (TARGET_ASM_FILE_END): Update to new target-independent name.
>   (SUBTARGET_ATTRIBUTE_TABLE): Likewise.
>   (TARGET_VALID_DLLIMPORT_ATTRIBUTE_P): Likewise.
>   (SUB_TARGET_RECORD_STUB): Likewise.
>   * config/i386/i386-protos.h (ix86_handle_selectany_attribute): Likewise.
>   (mingw_handle_selectany_attribute): Likewise.
>   (i386_pe_valid_dllimport_attribute_p): Likewise.
>   (mingw_pe_valid_dllimport_attribute_p): Likewise.
>   (i386_pe_file_end): Likewise.
>   (mingw_pe_file_end): Likewise.
>   (i386_pe_record_stub): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   * config/mingw/winnt.cc (ix86_handle_selectany_attribute): Likewise.
>   (mingw_handle_selectany_attribute): Likewise.
>   (i386_pe_valid_dllimport_attribute_p): Likewise.
>   (mingw_pe_valid_dllimport_attribute_p): Likewise.
>   (i386_pe_record_stub): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   (i386_pe_file_end): Likewise.
>   (mingw_pe_file_end): Likewise.
>   * config/mingw/winnt.h (mingw_handle_selectany_attribute):
>   Declate functionality that will be reused by multiple targets.
>   (mingw_pe_file_end): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   (mingw_pe_valid_dllimport_attribute_p): Likewise.

Ok, but...

> [...]
> diff --git a/gcc/config/mingw/winnt.cc b/gcc/config/mingw/winnt.cc
> index 9901576ade0..a0b5950be2e 100644
> --- a/gcc/config/mingw/winnt.cc
> +++ b/gcc/config/mingw/winnt.cc
> @@ -71,7 +71,7 @@ ix86_handle_shared_attribute (tree *node, tree name, tree, 
> int,
>  /* Handle a "selectany" attribute;
> arguments as in struct attribute_spec.handler.  */
>  tree
> -ix86_handle_selectany_attribute (tree *node, tree name, tree, int,
> +mingw_handle_selectany_attribute (tree *node, tree name, tree, int,
>bool *no_add_attrs)

please reindent the parameters for the new name length.

Thanks,
Richard


Re: [PATCH 4/4] Testsuite updates

2024-05-22 Thread Richard Sandiford
Richard Biener  writes:
> On Tue, 21 May 2024, Richard Biener wrote:
>
>> The gcc.dg/vect/slp-12a.c case is interesting as we currently split
>> the 8 store group into lanes 0-5 which we SLP with an unroll factor
>> of two (on x86-64 with SSE) and the remaining two lanes are using
>> interleaving vectorization with a final unroll factor of four.  Thus
>> we're using hybrid SLP within a single store group.  After the change
>> we discover the same 0-5 lane SLP part as well as two single-lane
>> parts feeding the full store group.  But that results in a load
>> permutation that isn't supported (I have WIP patchs to rectify that).
>> So we end up cancelling SLP and vectorizing the whole loop with
>> interleaving which is IMO good and results in better code.
>> 
>> This is similar for gcc.target/i386/pr52252-atom.c where interleaving
>> generates much better code than hybrid SLP.  I'm unsure how to update
>> the testcase though.
>> 
>> gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
>> when analyzing SLP operations we discard an instance we currently
>> force the full loop to have no SLP because hybrid detection is
>> broken.  It's probably not worth fixing this at this moment.
>> 
>> For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
>> into two but merge the two 8 lane loads into one before doing the
>> store and thus have only a single SLP instance.  A similar situation
>> happens in gcc.dg/vect/slp-11c.c but the branches feeding the
>> single SLP store only have a single lane.  Likewise for
>> gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.
>> 
>> gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
>> with a SLP store group of size two but two single-lane branches.
>> 
>> gcc.target/i386/pr98928.c ICEs in SLP permute optimization
>> because we don't expect a constant and internal branch to be
>> merged with a permute node in
>> vect_optimize_slp_pass::change_vec_perm_layout:4859 (the only
>> permutes merging two SLP nodes are two-operator nodes right now).
>> This still requires fixing.
>> 
>> The whole series has been bootstrapped and tested on 
>> x86_64-unknown-linux-gnu with the gcc.target/i386/pr98928.c FAIL
>> unfixed.
>> 
>> Comments welcome (and hello ARM CI), RISC-V and other arch
>> testing appreciated.  Unless there are comments to the contrary
>> I plan to push patch 1 and 2 tomorrow.
>
> RISC-V CI didn't trigger (not sure what magic is required).  Both
> ARM and AARCH64 show that the "Vectorizing stmts using SLP" are a bit
> fragile because we sometimes cancel SLP becuase we want to use
> load/store-lanes.
>
> I have locally scrapped the SLP scanning for gcc.dg/vect/slp-21.c where
> it doesn't really matter (and if we are finished with all-SLP it will
> matter nowhere).  I've conditionalized the outcome based on
> vect_load_lanes for gcc.dg/vect/slp-11c.c and
> gcc.dg/vect/slp-cond-1.c
>
> On AARCH64 additionally gcc.target/aarch64/sve/mask_struct_store_4.c
> ICEs, I have a fix for that.
>
> gcc.target/aarch64/pr99873_2.c FAILs because with a single
> SLP store group merged from two two-lane load groups we cancel
> the SLP and want to use load/store-lanes.  I'll leave this
> FAILing or shall I XFAIL it?

Yeah, agree it's probably worth leaving it FAILing for now, since it
is something we should try to fix for GCC 15.

Thanks,
Richard

>
> Thanks,
> Richard.
>
>> Thanks,
>> Richard.
>> 
>>  * gcc.dg/vect/pr97428.c: Expect a single store SLP group.
>>  * gcc.dg/vect/slp-11c.c: Likewise.
>>  * gcc.dg/vect/vect-complex-5.c: Likewise.
>>  * gcc.dg/vect/slp-12a.c: Do not expect SLP.
>>  * gcc.dg/vect/slp-21.c: Likewise.
>>  * gcc.dg/vect/slp-cond-1.c: Expect one more SLP.
>>  * gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
>>  * gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
>> ---
>>  gcc/testsuite/gcc.dg/vect/pr97428.c  |  2 +-
>>  gcc/testsuite/gcc.dg/vect/slp-11c.c  |  5 +++--
>>  gcc/testsuite/gcc.dg/vect/slp-12a.c  |  6 +-
>>  gcc/testsuite/gcc.dg/vect/slp-21.c   | 19 +--
>>  gcc/testsuite/gcc.dg/vect/slp-cond-1.c   |  2 +-
>>  gcc/testsuite/gcc.dg/vect/vect-complex-5.c   |  2 +-
>>  gcc/testsuite/gcc.dg/vect/vect-gather-2.c|  1 -
>>  gcc/testsuite/gcc.target/i386/pr52252-atom.c |  3 ++-
>>  8 files changed, 18 insertions(+), 22 deletions(-)
>> 
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr97428.c 
>> b/gcc/testsuite/gcc.dg/vect/pr97428.c
>> index 60dd984cfd3..3cc9976c00c 100644
>> --- a/gcc/testsuite/gcc.dg/vect/pr97428.c
>> +++ b/gcc/testsuite/gcc.dg/vect/pr97428.c
>> @@ -44,5 +44,5 @@ void foo_i2(dcmlx4_t dst[], const dcmlx_t src[], int n)
>>  /* { dg-final { scan-tree-dump "Detected interleaving store of size 16" 
>> "vect" } } */
>>  /* We're not able to peel & apply re-aligning to make accesses well-aligned 
>> for !vect_hw_misalign,
>> but we could by peeling the stores for alignment and applying 
>> re-aligning loads.  

Re: [PATCH 3/4]AArch64: add new alternative with early clobber to patterns

2024-05-22 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Wednesday, May 22, 2024 10:48 AM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; ktkac...@gcc.gnu.org
>> Subject: Re: [PATCH 3/4]AArch64: add new alternative with early clobber to
>> patterns
>> 
>> Tamar Christina  writes:
>> > Hi All,
>> >
>> > This patch adds new alternatives to the patterns which are affected.  The 
>> > new
>> > alternatives with the conditional early clobbers are added before the 
>> > normal
>> > ones in order for LRA to prefer them in the event that we have enough free
>> > registers to accommodate them.
>> >
>> > In case register pressure is too high the normal alternatives will be 
>> > preferred
>> > before a reload is considered as we rather have the tie than a spill.
>> >
>> > Tests are in the next patch.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> >* config/aarch64/aarch64-sve.md (and3,
>> >@aarch64_pred__z, *3_cc,
>> >*3_ptest, aarch64_pred__z,
>> >*3_cc, *3_ptest,
>> >aarch64_pred__z, *3_cc,
>> >*3_ptest, @aarch64_pred_cmp,
>> >*cmp_cc, *cmp_ptest,
>> >@aarch64_pred_cmp_wide,
>> >*aarch64_pred_cmp_wide_cc,
>> >*aarch64_pred_cmp_wide_ptest,
>> @aarch64_brk,
>> >*aarch64_brk_cc, *aarch64_brk_ptest,
>> >@aarch64_brk, *aarch64_brkn_cc, *aarch64_brkn_ptest,
>> >*aarch64_brk_cc, *aarch64_brk_ptest,
>> >aarch64_rdffr_z, *aarch64_rdffr_z_ptest, *aarch64_rdffr_ptest,
>> >*aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Add new early clobber
>> >alternative.
>> >* config/aarch64/aarch64-sve2.md
>> >(@aarch64_pred_): Likewise.
>> >
>> > ---
>> > diff --git a/gcc/config/aarch64/aarch64-sve.md 
>> > b/gcc/config/aarch64/aarch64-
>> sve.md
>> > index
>> e3085c0c636f1317409bbf3b5fbaf5342a2df1f6..8fdc1bc3cd43acfcd675a18350c
>> 297428c85fe46 100644
>> > --- a/gcc/config/aarch64/aarch64-sve.md
>> > +++ b/gcc/config/aarch64/aarch64-sve.md
>> > @@ -1161,8 +1161,10 @@ (define_insn "aarch64_rdffr_z"
>> >  (reg:VNx16BI FFRT_REGNUM)
>> >  (match_operand:VNx16BI 1 "register_operand")))]
>> >"TARGET_SVE && TARGET_NON_STREAMING"
>> > -  {@ [ cons: =0, 1   ]
>> > - [ Upa , Upa ] rdffr\t%0.b, %1/z
>> > +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
>> > + [ , Upa; yes ] rdffr\t%0.b, %1/z
>> > + [ ?Upa, Upa; yes ] ^
>> > + [ Upa , Upa; *   ] ^
>> >}
>> >  )
>> 
>> Sorry for not explaining it very well, but in the previous review I 
>> suggested:
>> 
>> > The gather-like approach would be something like:
>> >
>> >  [  , Upl , w , ; yes ]
>> cmp\t%0., %1/z, %3., #%4
>> >  [ ?Upl , 0   , w , ; yes ] ^
>> >  [ Upa  , Upl , w , ; no  ] ^
>> >  [  , Upl , w , w; yes ] 
>> > cmp\t%0., %1/z,
>> %3., %4.
>> >  [ ?Upl , 0   , w , w; yes ] ^
>> >  [ Upa  , Upl , w , w; no  ] ^
>> >
>> > with:
>> >
>> >   (define_attr "pred_clobber" "any,no,yes" (const_string "any"))
>> 
>> (with emphasis on the last line).  What I didn't say explicitly is
>> that "no" should require !TARGET_SVE_PRED_CLOBBER.
>> 
>> The premise of that review was that we shouldn't enable things like:
>> 
>>  [ Upa  , Upl , w , w; no  ] ^
>> 
>> for TARGET_SVE_PRED_CLOBBER since it contradicts the earlyclobber
>> alternative.  So we should enable either the pred_clobber=yes
>> alternatives or the pred_clobber=no alternatives, but not both.
>> 
>> The default "any" is then for other non-predicate instructions that
>> don't care about TARGET_SVE_PRED_CLOBBER either way.
>> 
>> In contrast, this patch makes pred_clobber=yes enable the alternatives
>

Re: [PATCH v1 2/6] Extract ix86 dllimport implementation to mingw

2024-05-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> This patch extracts the ix86 implementation for expanding a SYMBOL
> into its corresponding dllimport, far-address, or refptr symbol.
> It will be reused in the aarch64-w64-mingw32 target.
> The implementation is copied as is from i386/i386.cc with
> minor changes to follow to the code style.
>
> Also this patch replaces the original DLL import/export
> implementation in ix86 with mingw.
>
> gcc/ChangeLog:
>
>   * config.gcc: Add winnt-dll.o, which contains the DLL
>   import/export implementation.
>   * config/i386/cygming.h (SUB_TARGET_RECORD_STUB): Remove the
>   old implementation. Rename the required function to MinGW.
>   Rename it to a conditional function that will reuse the
>   MinGW implementation for COFF and nothing otherwise.
>   * config/i386/i386-expand.cc (ix86_expand_move): Likewise.
>   * config/i386/i386-expand.h (is_imported_p): Likewise.
>   (mingw_GOT_alias_set): Likewise.
>   (ix86_legitimize_pe_coff_symbol): Likewise.
>   * config/i386/i386-protos.h: Likewise.
>   * config/i386/i386.cc (is_imported_p): Likewise.
>   (ix86_legitimize_pe_coff_symbol): Likewise.
>   (ix86_GOT_alias_set): Likewise.
>   (legitimize_pic_address): Likewise.
>   (struct dllimport_hasher):
>   (GTY): Likewise.
>   (get_dllimport_decl): Likewise.
>   (legitimize_pe_coff_extern_decl): Likewise.
>   (legitimize_dllimport_symbol): Likewise.
>   (legitimize_pe_coff_symbol): Likewise.
>   (ix86_legitimize_address): Likewise.
>   * config/mingw/winnt.h (mingw_pe_record_stub): Likewise.
>   * config/mingw/winnt.cc (i386_pe_record_stub): Likewise.
>   (mingw_pe_record_stub): Likewise.
>   * config/mingw/t-cygming: Add the winnt-dll.o compilation.
>   * config/mingw/winnt-dll.cc: New file.

This looks good to me apart from a couple of very minor comments below,
but please get approval from the x86 maintainers as well.  In particular,
they might prefer to handle ix86_legitimize_pe_coff_symbol in some other way.

> [...]
> diff --git a/gcc/config/mingw/winnt-dll.cc b/gcc/config/mingw/winnt-dll.cc
> new file mode 100644
> index 000..349ade6f5c0
> --- /dev/null
> +++ b/gcc/config/mingw/winnt-dll.cc
> @@ -0,0 +1,229 @@
> +/* Expand a SYMBOL into its corresponding dllimport, far-address,
> +or refptr symbol.
> +Copyright (C) 2024 Free Software Foundation, Inc.

I suppose this should retain the range from the i386 file that the
code is moving from:

   Copyright (C) 1988-2024 Free Software Foundation, Inc.

> [...]
> diff --git a/gcc/config/mingw/winnt-dll.h b/gcc/config/mingw/winnt-dll.h
> new file mode 100644
> index 000..19c16e747a2
> --- /dev/null
> +++ b/gcc/config/mingw/winnt-dll.h
> @@ -0,0 +1,26 @@
> +/* Expand a SYMBOL into its corresponding dllimport, far-address,
> +or refptr symbol.
> +Copyright (C) 2024 Free Software Foundation, Inc.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +http://www.gnu.org/licenses/.  */
> +
> +#ifndef GCC_MINGW_WINNT_DLL_H
> +#define GCC_MINGW_WINNT_DLL_H
> +
> +extern bool is_imported_p (rtx x);
> +extern alias_set_type mingw_GOT_alias_set (void);
> +extern rtx legitimize_pe_coff_symbol (rtx addr, bool inreg);
> +
> +#endif
> \ No newline at end of file

Would be good to add the newlihe.

Thanks,
Richard


Re: [PATCH 3/4]AArch64: add new alternative with early clobber to patterns

2024-05-22 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This patch adds new alternatives to the patterns which are affected.  The new
> alternatives with the conditional early clobbers are added before the normal
> ones in order for LRA to prefer them in the event that we have enough free
> registers to accommodate them.
>
> In case register pressure is too high the normal alternatives will be 
> preferred
> before a reload is considered as we rather have the tie than a spill.
>
> Tests are in the next patch.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-sve.md (and3,
>   @aarch64_pred__z, *3_cc,
>   *3_ptest, aarch64_pred__z,
>   *3_cc, *3_ptest,
>   aarch64_pred__z, *3_cc,
>   *3_ptest, @aarch64_pred_cmp,
>   *cmp_cc, *cmp_ptest,
>   @aarch64_pred_cmp_wide,
>   *aarch64_pred_cmp_wide_cc,
>   *aarch64_pred_cmp_wide_ptest, @aarch64_brk,
>   *aarch64_brk_cc, *aarch64_brk_ptest,
>   @aarch64_brk, *aarch64_brkn_cc, *aarch64_brkn_ptest,
>   *aarch64_brk_cc, *aarch64_brk_ptest,
>   aarch64_rdffr_z, *aarch64_rdffr_z_ptest, *aarch64_rdffr_ptest,
>   *aarch64_rdffr_z_cc, *aarch64_rdffr_cc): Add new early clobber
>   alternative.
>   * config/aarch64/aarch64-sve2.md
>   (@aarch64_pred_): Likewise.
>
> ---
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index 
> e3085c0c636f1317409bbf3b5fbaf5342a2df1f6..8fdc1bc3cd43acfcd675a18350c297428c85fe46
>  100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -1161,8 +1161,10 @@ (define_insn "aarch64_rdffr_z"
> (reg:VNx16BI FFRT_REGNUM)
> (match_operand:VNx16BI 1 "register_operand")))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  {@ [ cons: =0, 1   ]
> - [ Upa , Upa ] rdffr\t%0.b, %1/z
> +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
> + [ , Upa; yes ] rdffr\t%0.b, %1/z
> + [ ?Upa, Upa; yes ] ^
> + [ Upa , Upa; *   ] ^
>}
>  )

Sorry for not explaining it very well, but in the previous review I suggested:

> The gather-like approach would be something like:
>
>  [  , Upl , w , ; yes ] 
> cmp\t%0., %1/z, %3., #%4
>  [ ?Upl , 0   , w , ; yes ] ^
>  [ Upa  , Upl , w , ; no  ] ^
>  [  , Upl , w , w; yes ] 
> cmp\t%0., %1/z, %3., %4.
>  [ ?Upl , 0   , w , w; yes ] ^
>  [ Upa  , Upl , w , w; no  ] ^
>
> with:
>
>   (define_attr "pred_clobber" "any,no,yes" (const_string "any"))

(with emphasis on the last line).  What I didn't say explicitly is
that "no" should require !TARGET_SVE_PRED_CLOBBER.

The premise of that review was that we shouldn't enable things like:

 [ Upa  , Upl , w , w; no  ] ^

for TARGET_SVE_PRED_CLOBBER since it contradicts the earlyclobber
alternative.  So we should enable either the pred_clobber=yes
alternatives or the pred_clobber=no alternatives, but not both.

The default "any" is then for other non-predicate instructions that
don't care about TARGET_SVE_PRED_CLOBBER either way.

In contrast, this patch makes pred_clobber=yes enable the alternatives
that correctly describe the restriction (good!) but then also enables
the normal alternatives too, which IMO makes the semantics unclear.

Thanks,
Richard

>  
> @@ -1179,8 +1181,10 @@ (define_insn "*aarch64_rdffr_z_ptest"
> UNSPEC_PTEST))
> (clobber (match_scratch:VNx16BI 0))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  {@ [ cons: =0, 1   ]
> - [ Upa , Upa ] rdffrs\t%0.b, %1/z
> +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
> + [ , Upa; yes ] rdffrs\t%0.b, %1/z
> + [ ?Upa, Upa; yes ] ^
> + [ Upa , Upa; *   ] ^
>}
>  )
>  
> @@ -1195,8 +1199,10 @@ (define_insn "*aarch64_rdffr_ptest"
> UNSPEC_PTEST))
> (clobber (match_scratch:VNx16BI 0))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  {@ [ cons: =0, 1   ]
> - [ Upa , Upa ] rdffrs\t%0.b, %1/z
> +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
> + [ , Upa; yes ] rdffrs\t%0.b, %1/z
> + [ ?Upa, Upa; yes ] ^
> + [ Upa , Upa; *   ] ^
>}
>  )
>  
> @@ -1216,8 +1222,10 @@ (define_insn "*aarch64_rdffr_z_cc"
> (reg:VNx16BI FFRT_REGNUM)
> (match_dup 1)))]
>"TARGET_SVE && TARGET_NON_STREAMING"
> -  {@ [ cons: =0, 1   ]
> - [ Upa , Upa ] rdffrs\t%0.b, %1/z
> +  {@ [ cons: =0, 1  ; attrs: pred_clobber ]
> + [ , Upa; yes ] rdffrs\t%0.b, %1/z
> + [ ?Upa, Upa; yes ] ^
> + [ Upa , Upa; *   ] ^
>}
>  )
>  
> @@ -1233,8 +1241,10 @@ 

Re: [PATCH] Fix mixed input kind permute optimization

2024-05-22 Thread Richard Sandiford
Richard Sandiford  writes:
> Richard Biener  writes:
>> When change_vec_perm_layout runs into a permute combining two
>> nodes where one is invariant and one internal the partition of
>> one input can be -1 but the other might not be.  The following
>> supports this case by simply ignoring inputs with input partiton -1.
>>
>> I'm not sure this is correct but it avoids ICEing when accessing
>> that partitions layout for gcc.target/i386/pr98928.c with the
>> change to avoid splitting store dataref groups during SLP discovery.
>>
>> Bootstrap and regtest running on x86_64-unknown-linux-gnu (ontop of
>> the SLP series).  The change can't break anything that's already
>> broken but I'm not sure this does the right thing - the testcase
>> has an uniform constant.  I'll try to come up with a better runtime
>> testcase tomorrow.  Hints as to where to correctly fix such case
>> appreciated.
>
> Famous last words, but yeah, it looks correct to me.  I think the
> routine in principle should have a free choice of which layout to
> choose for invariants (as long as it's consistent for all queries
> about the same node).  So it should just be a question of whether
> keeping the original layout is more likely to give a valid
> permutation, or whether going with out_layout_i would be better.
> I don't have a strong intuition either way.

BTW, I should have said that using a different layout from 0
would require compensating code in the materialize function.
So this is definitely the simplest and most direct fix.

Thanks,
Richard


Re: [PATCH 3/4] Avoid splitting store dataref groups during SLP discovery

2024-05-21 Thread Richard Sandiford
Richard Biener  writes:
> The following avoids splitting store dataref groups during SLP
> discovery but instead forces (eventually single-lane) consecutive
> lane SLP discovery for all lanes of the group, creating VEC_PERM
> SLP nodes merging them so the store will always cover the whole group.
>
> With this for example
>
> int x[1024], y[1024], z[1024], w[1024];
> void foo (void)
> {
>   for (int i = 0; i < 256; i++)
> {
>   x[4*i+0] = y[2*i+0];
>   x[4*i+1] = y[2*i+1];
>   x[4*i+2] = z[i];
>   x[4*i+3] = w[i];
> }
> }
>
> which was previously using hybrid SLP can now be fully SLPed and

Nice!

> SSE code generated looks better (but of course you never know,
> I didn't actually benchmark).  We of course need a VF of four here.
>
> .L2:
> movdqa  z(%rax), %xmm0
> movdqa  w(%rax), %xmm4
> movdqa  y(%rax,%rax), %xmm2
> movdqa  y+16(%rax,%rax), %xmm1
> movdqa  %xmm0, %xmm3
> punpckhdq   %xmm4, %xmm0
> punpckldq   %xmm4, %xmm3
> movdqa  %xmm2, %xmm4
> shufps  $238, %xmm3, %xmm2
> movaps  %xmm2, x+16(,%rax,4)
> movdqa  %xmm1, %xmm2
> shufps  $68, %xmm3, %xmm4
> shufps  $68, %xmm0, %xmm2
> movaps  %xmm4, x(,%rax,4)
> shufps  $238, %xmm0, %xmm1
> movaps  %xmm2, x+32(,%rax,4)
> movaps  %xmm1, x+48(,%rax,4)
> addq$16, %rax
> cmpq$1024, %rax
> jne .L2
>
> The extra permute nodes merging distinct branches of the SLP
> tree might be unexpected for some code, esp. since
> SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we
> cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS
> consistently as we can have a mix of both.
>
> The patch keeps the sub-trees form consecutive lanes but that's
> in principle not necessary if we for example have an even/odd
> split which now would result in N single-lane sub-trees.  That's
> left for future improvements.
>
> The interesting part is how VLA vector ISAs handle merging of
> two vectors that's not trivial even/odd merging.  The strathegy
> of how to build the permute tree might need adjustments for that
> (in the end splitting each branch to single lanes and then doing
> even/odd merging would be the brute-force fallback).  Not sure
> how much we can or should rely on the SLP optimize pass to handle
> this.

Yeah, I think we'll have to play it by ear.  It might involve tweaking
the order in which we "reduce" the VEC_PERM_EXPRs.  E.g. in the above
example, my guess is that it would be better to reduce the z/w part
first and then permute that with y, whereas it looks like the patch
always goes left-to-right.

The patch LGTM FWIW.

I suppose this does further hard-code the assumption that the vector
type is uniquely determined by the element type (and so we can safely
assume that everything has the same vector type as the first split node).
But that's pretty much pervasive, and not easy to solve until we're
serious about putting some infrastructre in place for it.  It just
caught me out when reading vector code for the first time in a while :)

(E.g. in the above example, the y vector could eventually be double the
z & w vectors.)

Thanks,
Richard

>   * tree-vect-slp.cc (vect_build_slp_instance): Do not split
>   store dataref groups on loop SLP discovery failure but create
>   a single SLP instance for the stores but branch to SLP sub-trees
>   and merge with a series of VEC_PERM nodes.
> ---
>  gcc/tree-vect-slp.cc | 240 ++-
>  1 file changed, 214 insertions(+), 26 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 43f2c153bf0..873748b0a72 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -3468,12 +3468,7 @@ vect_build_slp_instance (vec_info *vinfo,
> return true;
>   }
>  }
> -  else
> -{
> -  /* Failed to SLP.  */
> -  /* Free the allocated memory.  */
> -  scalar_stmts.release ();
> -}
> +  /* Failed to SLP.  */
>  
>stmt_vec_info stmt_info = stmt_info_;
>/* Try to break the group up into pieces.  */
> @@ -3491,6 +3486,9 @@ vect_build_slp_instance (vec_info *vinfo,
>if (is_a  (vinfo)
> && (i > 1 && i < group_size))
>   {
> +   /* Free the allocated memory.  */
> +   scalar_stmts.release ();
> +
> tree scalar_type
>   = TREE_TYPE (DR_REF (STMT_VINFO_DATA_REF (stmt_info)));
> tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type,
> @@ -3535,38 +3533,228 @@ vect_build_slp_instance (vec_info *vinfo,
>   }
>   }
>  
> -  /* For loop vectorization split into arbitrary pieces of size > 1.  */
> -  if (is_a  (vinfo)
> -   && (i > 1 && i < group_size)
> -   && !vect_slp_prefer_store_lanes_p (vinfo, stmt_info, group_size, i))
> +  /* For loop vectorization split the RHS into arbitrary pieces of
> +  size >= 1.  */
> +  else 

Re: [PATCH] Fix mixed input kind permute optimization

2024-05-21 Thread Richard Sandiford
Richard Biener  writes:
> When change_vec_perm_layout runs into a permute combining two
> nodes where one is invariant and one internal the partition of
> one input can be -1 but the other might not be.  The following
> supports this case by simply ignoring inputs with input partiton -1.
>
> I'm not sure this is correct but it avoids ICEing when accessing
> that partitions layout for gcc.target/i386/pr98928.c with the
> change to avoid splitting store dataref groups during SLP discovery.
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu (ontop of
> the SLP series).  The change can't break anything that's already
> broken but I'm not sure this does the right thing - the testcase
> has an uniform constant.  I'll try to come up with a better runtime
> testcase tomorrow.  Hints as to where to correctly fix such case
> appreciated.

Famous last words, but yeah, it looks correct to me.  I think the
routine in principle should have a free choice of which layout to
choose for invariants (as long as it's consistent for all queries
about the same node).  So it should just be a question of whether
keeping the original layout is more likely to give a valid
permutation, or whether going with out_layout_i would be better.
I don't have a strong intuition either way.

Thanks,
Richard

>
>   * tree-vect-slp.cc (change_vec_perm_layout): Ignore an
>   input partition of -1.
> ---
>  gcc/tree-vect-slp.cc | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 873748b0a72..f6ec1a81c96 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -4828,6 +4828,8 @@ change_vec_perm_layout (slp_tree node, 
> lane_permutation_t ,
>   {
> slp_tree in_node = SLP_TREE_CHILDREN (node)[entry.first];
> unsigned int in_partition_i = m_vertices[in_node->vertex].partition;
> +   if (in_partition_i == -1u)
> + continue;
> this_in_layout_i = m_partitions[in_partition_i].layout;
>   }
>if (this_in_layout_i > 0)


Re: [PATCH v3] aarch64: Fix normal returns inside functions which use eh_returns [PR114843]

2024-05-21 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Hi Andrew,
>
> A few comments on the implementation, I think it can be simplified a lot:

FWIW, I agree with Wilco's comments, except:

>> +++ b/gcc/config/aarch64/aarch64.h
>> @@ -700,8 +700,9 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = 
>> AARCH64_FL_SM_OFF;
>> #define DWARF2_UNWIND_INFO 1
>>  
>>  /* Use R0 through R3 to pass exception handling information.  */
>> +#define EH_RETURN_DATA_REGISTERS_N 4
>>  #define EH_RETURN_DATA_REGNO(N) \
>> -  ((N) < 4 ? ((unsigned int) R0_REGNUM + (N)) : INVALID_REGNUM)
>> +  ((N) < EH_RETURN_DATA_REGISTERS_N ? ((unsigned int) R0_REGNUM + (N)) : 
>> INVALID_REGNUM)
>  
> It would be useful to add a macro IS_EH_RETURN_REGNUM(regnum) that just checks
> the range R0_REGNUM to R0_REGNUM + EH_RETURN_DATA_REGISTERS_N.

I've just pushed a patch that adds a global eh_return_data_regs set,
so I think we can test that instead.

>> @@ -929,6 +928,7 @@ struct GTY (()) aarch64_frame
>>  outgoing arguments) of each register save slot, or -2 if no save is
>>  needed.  */
>>   poly_int64 reg_offset[LAST_SAVED_REGNUM + 1];
>> +  bool eh_return_allocated[EH_RETURN_DATA_REGISTERS_N];
>
> This doesn't make much sense - besides X0-X3, we also need X5 and X6 for 
> eh_return.
> If these or any of the other temporaries used by epilog are callee-saved 
> somehow,
> things are going horribly wrong already... So what do we gain by doing this?
>
>
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -7792,6 +7792,7 @@ aarch64_layout_frame (void)
>> 
>>  #define SLOT_NOT_REQUIRED (-2)
>>  #define SLOT_REQUIRED (-1)
>> +#define SLOT_EH_RETURN_REQUIRED (-3)
>  
> I don't see a need for this.
>
>
>> @@ -7949,6 +7950,18 @@ aarch64_layout_frame (void)
>> stopping it from being individually shrink-wrapped.  */
>>  allocate_gpr_slot (R30_REGNUM);
>>  
>> +  /* Allocate the eh_return first. */
>> +  if (crtl->calls_eh_return)
>> +for (regno = 0; EH_RETURN_DATA_REGNO (regno) != INVALID_REGNUM; regno++)
>> +  {
>> +int realregno = EH_RETURN_DATA_REGNO (regno);
>> +if (known_eq (frame.reg_offset[realregno], SLOT_EH_RETURN_REQUIRED))
>> +  {
>> +frame.eh_return_allocated[regno] = true;
>> +allocate_gpr_slot (realregno);
>> +  }
>> +  }
>
> This change is unnecessary if we just mark the slots with SLOT_REQUIRED.

Also, is it necessary to allocate EH data registers first?

>> @@ -8035,6 +8048,23 @@ aarch64_layout_frame (void)
>>   frame.wb_pop_candidate1 = frame.wb_push_candidate1;
>>   frame.wb_pop_candidate2 = frame.wb_push_candidate2;
>>  
>> +  /* EH data registers are not pop canidates. */
>> +  if (crtl->calls_eh_return)
>> +for (regno = 0; EH_RETURN_DATA_REGNO (regno) != INVALID_REGNUM; 
>> regno++)> 
>> +  {
>> +if (frame.eh_return_allocated[regno]
>> +&& frame.wb_pop_candidate1 == EH_RETURN_DATA_REGNO (regno))
>> +{
>> +  frame.wb_pop_candidate1 = frame.wb_pop_candidate2;
>> +  frame.wb_pop_candidate2 = INVALID_REGNUM;
>> +}
>> +if (frame.eh_return_allocated[regno]
>> +&& frame.wb_pop_candidate2 == EH_RETURN_DATA_REGNO (regno))
>> +{
>> +  frame.wb_pop_candidate2 = INVALID_REGNUM;
>> +}
>> +  }
>
> This is unnecessary since we can just avoid making them push candidates
> if there is no frame chain, eg:
>
> if ((!crtl->calls_eh_return || frame.emit_frame_chain) && !push_regs.empty ()
>   && known_eq (frame.reg_offset[push_regs[0]], frame.bytes_below_hard_fp))

I agree we should do the check here (and similarly for the second register),
rather than fixing it up later.  But IMO we should test the register directly:

  if (!push_regs.empty ()
  && known_eq (frame.reg_offset[push_regs[0]], frame.bytes_below_hard_fp)
  && (!crtl->calls_eh_return
  || !TEST_HARD_REG_BIT (eh_return_data_regs, push_regs[0])))

In some ways it seems unfortunate that we're generating two different
copies of the epilogue in order to skip two LDPs that (with a bit of
work) could easily be done before entering a combined epilogue.
But we already have a branch on EH_RETURN_TAKEN_RTX as well,
so maybe this is the tipping point at which duplication is worthwhile.

Thanks,
Richard

> @@ -8681,6 +8712,20 @@ aarch64_restore_callee_saves (poly_int64 
> bytes_below_sp,
>if (frame.is_scs_enabled && regno == LR_REGNUM)
>   return true;
>  
> +  /* Skip the eh return data registers if we are
> +  returning normally rather than via eh_return. */
> +  if (!was_eh_return && crtl->calls_eh_return)
> + {
> +   for (unsigned ehregno = 0;
> +EH_RETURN_DATA_REGNO (ehregno) != INVALID_REGNUM;
> +ehregno++)
> + {
> +   if (EH_RETURN_DATA_REGNO (ehregno) == regno
> +   && frame.eh_return_allocated[ehregno])
> + return true;
> + }
> + }
> +
>
> So this could be something like:
>
>   if (!was_eh_return && crtl->calls_eh_return && IS_EH_RETURN_REGNUM 
> (regno))
>   

[PATCH] Cache the set of EH_RETURN_DATA_REGNOs

2024-05-21 Thread Richard Sandiford
While reviewing Andrew's fix for PR114843, it seemed like it would
be convenient to have a HARD_REG_SET of EH_RETURN_DATA_REGNOs.
This patch adds one and uses it to simplify a couple of use sites.

Tested on aarch64-linux-gnu & x86_64-linux-gnu.  OK to install?

Richard


gcc/
* hard-reg-set.h (target_hard_regs::x_eh_return_data_regs): New field.
(eh_return_data_regs): New macro.
* reginfo.cc (init_reg_sets_1): Initialize x_eh_return_data_regs.
* df-scan.cc (df_get_exit_block_use_set): Use it.
* ira-lives.cc (process_out_of_region_eh_regs): Likewise.
---
 gcc/df-scan.cc |  8 +---
 gcc/hard-reg-set.h |  5 +
 gcc/ira-lives.cc   | 10 ++
 gcc/reginfo.cc | 10 ++
 4 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/gcc/df-scan.cc b/gcc/df-scan.cc
index 1bade2cd71e..c8ab3c09cee 100644
--- a/gcc/df-scan.cc
+++ b/gcc/df-scan.cc
@@ -3702,13 +3702,7 @@ df_get_exit_block_use_set (bitmap exit_block_uses)
 
   /* Mark the registers that will contain data for the handler.  */
   if (reload_completed && crtl->calls_eh_return)
-for (i = 0; ; ++i)
-  {
-   unsigned regno = EH_RETURN_DATA_REGNO (i);
-   if (regno == INVALID_REGNUM)
- break;
-   bitmap_set_bit (exit_block_uses, regno);
-  }
+IOR_REG_SET_HRS (exit_block_uses, eh_return_data_regs);
 
 #ifdef EH_RETURN_STACKADJ_RTX
   if ((!targetm.have_epilogue () || ! epilogue_completed)
diff --git a/gcc/hard-reg-set.h b/gcc/hard-reg-set.h
index 8c1d1512ca2..340eb425c10 100644
--- a/gcc/hard-reg-set.h
+++ b/gcc/hard-reg-set.h
@@ -421,6 +421,9 @@ struct target_hard_regs {
  with the local stack frame are safe, but scant others.  */
   HARD_REG_SET x_regs_invalidated_by_call;
 
+  /* The set of registers that are used by EH_RETURN_DATA_REGNO.  */
+  HARD_REG_SET x_eh_return_data_regs;
+
   /* Table of register numbers in the order in which to try to use them.  */
   int x_reg_alloc_order[FIRST_PSEUDO_REGISTER];
 
@@ -485,6 +488,8 @@ extern struct target_hard_regs *this_target_hard_regs;
 #define call_used_or_fixed_regs \
   (regs_invalidated_by_call | fixed_reg_set)
 #endif
+#define eh_return_data_regs \
+  (this_target_hard_regs->x_eh_return_data_regs)
 #define reg_alloc_order \
   (this_target_hard_regs->x_reg_alloc_order)
 #define inv_reg_alloc_order \
diff --git a/gcc/ira-lives.cc b/gcc/ira-lives.cc
index e07d3dc3e89..958eabb9708 100644
--- a/gcc/ira-lives.cc
+++ b/gcc/ira-lives.cc
@@ -1260,14 +1260,8 @@ process_out_of_region_eh_regs (basic_block bb)
   for (int n = ALLOCNO_NUM_OBJECTS (a) - 1; n >= 0; n--)
{
  ira_object_t obj = ALLOCNO_OBJECT (a, n);
- for (int k = 0; ; k++)
-   {
- unsigned int regno = EH_RETURN_DATA_REGNO (k);
- if (regno == INVALID_REGNUM)
-   break;
- SET_HARD_REG_BIT (OBJECT_CONFLICT_HARD_REGS (obj), regno);
- SET_HARD_REG_BIT (OBJECT_TOTAL_CONFLICT_HARD_REGS (obj), regno);
-   }
+ OBJECT_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
+ OBJECT_TOTAL_CONFLICT_HARD_REGS (obj) |= eh_return_data_regs;
}
 }
 }
diff --git a/gcc/reginfo.cc b/gcc/reginfo.cc
index a0baeb90e12..73121365c47 100644
--- a/gcc/reginfo.cc
+++ b/gcc/reginfo.cc
@@ -420,6 +420,16 @@ init_reg_sets_1 (void)
}
 }
 
+  /* Recalculate eh_return_data_regs.  */
+  CLEAR_HARD_REG_SET (eh_return_data_regs);
+  for (i = 0; ; ++i)
+{
+  unsigned int regno = EH_RETURN_DATA_REGNO (i);
+  if (regno == INVALID_REGNUM)
+   break;
+  SET_HARD_REG_BIT (eh_return_data_regs, regno);
+}
+
   memset (have_regs_of_mode, 0, sizeof (have_regs_of_mode));
   memset (contains_reg_of_mode, 0, sizeof (contains_reg_of_mode));
   for (m = 0; m < (unsigned int) MAX_MACHINE_MODE; m++)
-- 
2.25.1



Re: [PATCH] middle-end: Expand {u|s}dot product support in autovectorizer

2024-05-20 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, May 17, 2024 at 11:56 AM Tamar Christina
>  wrote:
>>
>> > -Original Message-
>> > From: Richard Biener 
>> > Sent: Friday, May 17, 2024 10:46 AM
>> > To: Tamar Christina 
>> > Cc: Victor Do Nascimento ; gcc-
>> > patc...@gcc.gnu.org; Richard Sandiford ; Richard
>> > Earnshaw ; Victor Do Nascimento
>> > 
>> > Subject: Re: [PATCH] middle-end: Expand {u|s}dot product support in
>> > autovectorizer
>> >
>> > On Fri, May 17, 2024 at 11:05 AM Tamar Christina
>> >  wrote:
>> > >
>> > > > -Original Message-
>> > > > From: Richard Biener 
>> > > > Sent: Friday, May 17, 2024 6:51 AM
>> > > > To: Victor Do Nascimento 
>> > > > Cc: gcc-patches@gcc.gnu.org; Richard Sandiford
>> > ;
>> > > > Richard Earnshaw ; Victor Do Nascimento
>> > > > 
>> > > > Subject: Re: [PATCH] middle-end: Expand {u|s}dot product support in
>> > > > autovectorizer
>> > > >
>> > > > On Thu, May 16, 2024 at 4:40 PM Victor Do Nascimento
>> > > >  wrote:
>> > > > >
>> > > > > From: Victor Do Nascimento 
>> > > > >
>> > > > > At present, the compiler offers the `{u|s|us}dot_prod_optab' direct
>> > > > > optabs for dealing with vectorizable dot product code sequences.  The
>> > > > > consequence of using a direct optab for this is that backend-pattern
>> > > > > selection is only ever able to match against one datatype - Either
>> > > > > that of the operands or of the accumulated value, never both.
>> > > > >
>> > > > > With the introduction of the 2-way (un)signed dot-product insn [1][2]
>> > > > > in AArch64 SVE2, the existing direct opcode approach is no longer
>> > > > > sufficient for full specification of all the possible dot product
>> > > > > machine instructions to be matched to the code sequence; a dot 
>> > > > > product
>> > > > > resulting in VNx4SI may result from either dot products on VNx16QI or
>> > > > > VNx8HI values for the 4- and 2-way dot product operations, 
>> > > > > respectively.
>> > > > >
>> > > > > This means that the following example fails autovectorization:
>> > > > >
>> > > > > uint32_t foo(int n, uint16_t* data) {
>> > > > >   uint32_t sum = 0;
>> > > > >   for (int i=0; i> > > > > sum += data[i] * data[i];
>> > > > >   }
>> > > > >   return sum;
>> > > > > }
>> > > > >
>> > > > > To remedy the issue a new optab is added, tentatively named
>> > > > > `udot_prod_twoway_optab', whose selection is dependent upon checking
>> > > > > of both input and output types involved in the operation.
>> > > >
>> > > > I don't like this too much.  I'll note we document dot_prod as
>> > > >
>> > > > @cindex @code{sdot_prod@var{m}} instruction pattern
>> > > > @item @samp{sdot_prod@var{m}}
>> > > >
>> > > > Compute the sum of the products of two signed elements.
>> > > > Operand 1 and operand 2 are of the same mode. Their
>> > > > product, which is of a wider mode, is computed and added to operand 3.
>> > > > Operand 3 is of a mode equal or wider than the mode of the product. The
>> > > > result is placed in operand 0, which is of the same mode as operand 3.
>> > > > @var{m} is the mode of operand 1 and operand 2.
>> > > >
>> > > > with no restriction on the wider mode but we don't specify it which is
>> > > > bad design.  This should have been a convert optab with two modes
>> > > > from the start - adding a _twoway variant is just a hack.
>> > >
>> > > We did discuss this at the time we started implementing it.  There was 
>> > > two
>> > > options, one was indeed to change it to a convert dot_prod optab, but 
>> > > doing
>> > > this means we have to update every target that uses it.
>> > >
>> > > Now that means 3 ISAs for AArch64, Arm, Arc, c6x, 2 for x86, loongson and
>> > altivec.
>> > >
>> > > Which sure could be possible, but there's also every use in the backends 
>> > > that
>> > need
>> > > to be updated, and tested, which for some targets we don't even know how 
>> > > to
>> > begin.
>> > >
>> > > So it seems very hard to correct dotprod to a convert optab now.
>> >
>> > It's still the correct way to go.  At _least_ your new pattern should
>> > have been this,
>> > otherwise what do you do when you have two-way, four-way and eight-way
>> > variants?
>> > Add yet another optab?
>>
>> I guess that's fair, but having the new optab only be convert resulted in 
>> messy
>> code as everywhere you must check for both variants.
>>
>> Additionally that optab would then overlap with the existing optabs as, as 
>> you
>> Say, the documentation only says it's of a wider type and doesn't indicate
>> precision.
>>
>> So to avoid issues down the line then If the new optab isn't acceptable then
>> we'll have to do a wholesale conversion then..
>
> Yep.  It shouldn't be difficult though.

Still catching up, but FWIW, I agree this is the way to go.  (Convert all
existing dot_prods to convert optabs first, and then add the new AArch64
ones.)  Having two mechanisms feels like storing up trouble for later. :)

Richard


Re: [PATCH 00/12] aarch64: Extend aarch64_feature_flags to 128 bits

2024-05-20 Thread Richard Sandiford
Andrew Carlotti  writes:
> On Fri, May 17, 2024 at 04:45:05PM +0100, Richard Sandiford wrote:
>> Andrew Carlotti  writes:
>> > The end goal of the series is to change the definition of 
>> > aarch64_feature_flags
>> > from a uint64_t typedef to a class with 128 bits of storage.  This class 
>> > uses
>> > operator overloading to mimic the existing integer interface as much as
>> > possible, but with added restrictions to facilate type checking and
>> > extensibility.
>> >
>> > Patches 01-10 are preliminary enablement work, and have passed regression
>> > testing.  Are these ok for master?
>> >
>> > Patch 11 is an RFC, and the only patch that touches the middle end.  I am
>> > seeking clarity on which part(s) of the compiler should be expected to 
>> > handle
>> > or prevent non-bool types in instruction pattern conditions.  The actual 
>> > patch
>> > does not compile by itself (though it does in combination with 12/12), but 
>> > that
>> > is not important to the questions I'm asking.
>> >
>> > Patch 12 is then a small patch that actually replaces the uint64_t typedef 
>> > with
>> > a class.  I think this patch is fine in it's current form, but it depends 
>> > on a
>> > resolution to the issues in patch 11/12 first.
>> 
>> Thanks for doing this.
>> 
>> Rather than disallowing flags == 0, etc., I think we should allow
>> aarch64_feature_flags to be constructed from a single uint64_t.
>> It's a lossless conversion.  The important thing is that we don't
>> allow conversions the other way (and the patch doesn't allow them).
>
> I agree that allowing conversion from a single int should be safe (albeit it
> was probably helpful to disallow it during the development of this series).
> It does feel a little bit strange to have a separate mechanism for
> setting the first 64 bits (and zeroing the rest).

With a templated class, I think it makes sense.  The constructor would
take a variable number of arguments and any unspecified elements would
implicitly be zero.  In that sense, a single uint64_t isn't a special
case.  It's just an instance of a generic rule.

> Do you consider the existing code in some places to be clearer than the new
> versions in this patch series?  If so, it would be helpful to know which
> patches (or parts of patches) I should drop.

Probably patches 3, 4, and (for unrelated reasons) 9.  (9 feels like
a microoptimisation, given that the underlying issue has been fixed.)

>> Also, I think we should make the new class in 12/12 be a templated
>>  type that provides an N-bit bitmask.  It should arguably
>> also be target-independent code.  aarch64_feature_flags would then be
>> an alias with the appropriate number of bits.
>
> I think the difficult part is to do this for generic N while still satisfying
> C++11 constexpr function requirements (we can't use a loop, for example).
> However, while writing this response, I've realised that I can do this using
> recursion, with an N-bit bitmask being implemented as a class containing an
> M-bit integer and (recursively) and (N-M)-bit bitmask.

I think it'd be better to keep a flat object, not least for debugging.

Things like operator| could be handled using code like:


template
struct operators
{
  template
  static constexpr Result binary(Operator op, const Arg , const Arg ,
 Rest ...rest)
  {
return operators::template binary
  (op, x, y, op (x[N - 1], y[N - 1]), rest...);
  }
};

template<>
struct operators<0>
{
  template
  static constexpr Result binary(Operator op, const Arg , const Arg ,
 Rest ...rest)
  {
return Result { rest... };
  }
};

using T = std::array;

template
constexpr T f(T x, T y) { return x | y; }
constexpr T x = { 1, 2 };
constexpr T y = { 0x100, 0x400 };
constexpr T z = operators<2>::binary (f, x, y);


(Unfortunately, constexpr lambdas are also not supported in C++11.)

>> For the RFC in 11/12, how about, as another prepatch before 12/12,
>> removing all the mechanical:
>> 
>> #define AARCH64_ISA_LS64(aarch64_isa_flags & AARCH64_FL_LS64)
>> 
>> style macros and replacing uses with something like:
>> 
>>   AARCH64_HAVE_ISA (LS64)
>
> This sounds like a good approach, and is roughly what I was already planning 
> to
> do (although I hadn't worked out the details yet).  I think that can entirely
> replace 11/12 in the context of this series, but the questions about
> instruction pattern condition type checking still ought to be addressed
> separately.

Yeah, stronger typing would be good.  I think in practice the generators
should add the "bool (...)" wrapper.

Thanks,
Richard


Re: [Patch, aarch64] Further renaming of generic code

2024-05-20 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Alex/Richard:
>
> Renaming of generic code is done to make target independent
> and target dependent code to support multiple targets.
>
> Target independent code is the Generic code with pure virtual function
> to interface betwwen target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> Bootstrapped and regtested on aarch64-linux-gnu.
>
> Thanks & Regards
> Ajit
>
> aarch64: Further renaming of generic code
>
> Renaming of generic code is done to make target independent
> and target dependent code to support multiple targets.
>
> Target independent code is the Generic code with pure virtual function
> to interface betwwen target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> 2024-05-20  Ajit Kumar Agarwal  
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-ldp-fusion.cc: Renaming of generic code

* config/aarch64/aarch64-ldp-fusion.cc: Rename generic parts of code
to avoid "ldp" and "stp".

> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 55 
>  1 file changed, 28 insertions(+), 27 deletions(-)
>
> [...]
> @@ -1126,7 +1127,7 @@ struct stp_change_builder
>  
>bool done () const { return m_state == state::DONE; }
>  
> -  stp_change_builder (insn_info *insns[2],
> +  store_change_builder (insn_info *insns[2],
> insn_info *repurpose,
> insn_info *dest)

Please reindent the parameters for the new longer name.

>  : m_state (state::FIRST), m_insns { insns[0], insns[1] },
> [...]
> @@ -1916,7 +1917,7 @@ fixup_debug_uses (obstack_watermark ,
>  // BASE gives the chosen base candidate for the pair and MOVE_RANGE is
>  // a singleton range which says where to place the pair.
>  bool
> -ldp_bb_info::fuse_pair (bool load_p,
> +pair_fusion_bb_info::fuse_pair (bool load_p,
>   unsigned access_size,
>   int writeback,
>   insn_info *i1, insn_info *i2,

Same here.

> @@ -2687,7 +2688,7 @@ pair_fusion::get_viable_bases (insn_info *insns[2],
>  // ACCESS_SIZE gives the (common) size of a single access, LOAD_P is true
>  // if the accesses are both loads, otherwise they are both stores.
>  bool
> -ldp_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
> +pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>   insn_info *i1, insn_info *i2)
>  {
>if (dump_file)

And here.

OK with those changes, thanks.

Richard


Re: [PATCH] AArch64: Improve costing of ctz

2024-05-20 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Improve costing of ctz - both TARGET_CSSC and vector cases were not handled 
> yet.
>
> Passes regress & bootstrap - OK for commit?
>
> gcc:
> * config/aarch64/aarch64.cc (aarch64_rtx_costs): Improve CTZ costing.

Ok, thanks.

Richard

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> fe13c9a0d4863041eb9101882ea57c2094240d16..2a6f76f4008839bf0aa158504430af9b971c
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -14309,10 +14309,24 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int 
> outer ATTRIBUTE_UNUSED,
>return false;
>  
>  case CTZ:
> -  *cost = COSTS_N_INSNS (2);
> -
> -  if (speed)
> - *cost += extra_cost->alu.clz + extra_cost->alu.rev;
> +  if (VECTOR_MODE_P (mode))
> + {
> +   *cost = COSTS_N_INSNS (3);
> +   if (speed)
> + *cost += extra_cost->vect.alu * 3;
> + }
> +  else if (TARGET_CSSC)
> + {
> +   *cost = COSTS_N_INSNS (1);
> +   if (speed)
> + *cost += extra_cost->alu.clz;
> + }
> +  else
> + {
> +   *cost = COSTS_N_INSNS (2);
> +   if (speed)
> + *cost += extra_cost->alu.clz + extra_cost->alu.rev;
> + }
>return false;
>  
>  case COMPARE:


Re: [PATCH] AArch64: Fix printing of 2-instruction alternatives

2024-05-20 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Add missing '\' in 2-instruction movsi/di alternatives so that they are
> printed on separate lines.
>
> Passes bootstrap and regress, OK for commit once stage 1 reopens?
>
> gcc:
> * config/aarch64/aarch64.md (movsi_aarch64): Use '\;' to force
> newline in 2-instruction pattern.
> (movdi_aarch64): Likewise.

Oops, good catch.  Ok for trunk, thanks.

Richard

>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> 1a2e01284249223565cd12cf1bfd5db5475e56fb..5416c2e3b2002d0e53baf23e7c0048ddf683
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1447,7 +1447,7 @@ (define_insn_and_split "*movsi_aarch64"
>   [w  , m  ; load_4   , fp  , 4] ldr\t%s0, %1
>   [m  , r Z; store_4  , *   , 4] str\t%w1, %0
>   [m  , w  ; store_4  , fp  , 4] str\t%s1, %0
> - [r  , Usw; load_4   , *   , 8] adrp\t%x0, %A1;ldr\t%w0, [%x0, %L1]
> + [r  , Usw; load_4   , *   , 8] adrp\t%x0, %A1\;ldr\t%w0, [%x0, %L1]
>   [r  , Usa; adr  , *   , 4] adr\t%x0, %c1
>   [r  , Ush; adr  , *   , 4] adrp\t%x0, %A1
>   [w  , r Z; f_mcr, fp  , 4] fmov\t%s0, %w1
> @@ -1484,7 +1484,7 @@ (define_insn_and_split "*movdi_aarch64"
>   [w, m  ; load_8   , fp  , 4] ldr\t%d0, %1
>   [m, r Z; store_8  , *   , 4] str\t%x1, %0
>   [m, w  ; store_8  , fp  , 4] str\t%d1, %0
> - [r, Usw; load_8   , *   , 8] << TARGET_ILP32 ? "adrp\t%0, %A1;ldr\t%w0, 
> [%0, %L1]" : "adrp\t%0, %A1;ldr\t%0, [%0, %L1]";
> + [r, Usw; load_8   , *   , 8] << TARGET_ILP32 ? "adrp\t%0, 
> %A1\;ldr\t%w0, [%0, %L1]" : "adrp\t%0, %A1\;ldr\t%0, [%0, %L1]";
>   [r, Usa; adr  , *   , 4] adr\t%x0, %c1
>   [r, Ush; adr  , *   , 4] adrp\t%x0, %A1
>   [w, r Z; f_mcr, fp  , 4] fmov\t%d0, %x1


Re: [PATCH] aarch64: Fold vget_low_* intrinsics to BIT_FIELD_REF [PR102171]

2024-05-20 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch folds vget_low_* intrinsics to BIT_FILED_REF to open up more
> optimization opportunities for gimple optimizers.
>
> While we are here, we also remove the vget_low_* definitions from arm_neon.h 
> and
> use the new intrinsics framework.
>
> PR target/102171
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (AARCH64_SIMD_VGET_LOW_BUILTINS):
>   New macro to create definitions for all vget_low intrinsics.
>   (VGET_LOW_BUILTIN): Likewise.
>   (enum aarch64_builtins): Add vget_low function codes.
>   (aarch64_general_fold_builtin): Fold vget_low calls.
>   * config/aarch64/aarch64-simd-builtins.def: Delete vget_low builtins.
>   * config/aarch64/aarch64-simd.md (aarch64_get_low): Delete.
>   (aarch64_vget_lo_halfv8bf): Likewise.
>   * config/aarch64/arm_neon.h (__attribute__): Delete.
>   (vget_low_f16): Likewise.
>   (vget_low_f32): Likewise.
>   (vget_low_f64): Likewise.
>   (vget_low_p8): Likewise.
>   (vget_low_p16): Likewise.
>   (vget_low_p64): Likewise.
>   (vget_low_s8): Likewise.
>   (vget_low_s16): Likewise.
>   (vget_low_s32): Likewise.
>   (vget_low_s64): Likewise.
>   (vget_low_u8): Likewise.
>   (vget_low_u16): Likewise.
>   (vget_low_u32): Likewise.
>   (vget_low_u64): Likewise.
>   (vget_low_bf16): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/pr113573.c: Replace __builtin_aarch64_get_lowv8hi
>   with vget_low_s16.
>   * gcc.target/aarch64/vget_low_2.c: New test.
>   * gcc.target/aarch64/vget_low_2_be.c: New test.

Ok, thanks.  I suppose the patch has the side effect of allowing
vget_low_bf16 to be called without +bf16.  IMO that's the correct
behaviour though, and is consistent with how we handle reinterprets.

Richard

> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-builtins.cc|  60 ++
>  gcc/config/aarch64/aarch64-simd-builtins.def  |   5 +-
>  gcc/config/aarch64/aarch64-simd.md|  23 +---
>  gcc/config/aarch64/arm_neon.h | 105 --
>  gcc/testsuite/gcc.target/aarch64/pr113573.c   |   2 +-
>  gcc/testsuite/gcc.target/aarch64/vget_low_2.c |  30 +
>  .../gcc.target/aarch64/vget_low_2_be.c|  31 ++
>  7 files changed, 124 insertions(+), 132 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vget_low_2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vget_low_2_be.c
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 75d21de1401..4afe7c86ae3 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -658,6 +658,23 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>VREINTERPRET_BUILTINS \
>VREINTERPRETQ_BUILTINS
>  
> +#define AARCH64_SIMD_VGET_LOW_BUILTINS \
> +  VGET_LOW_BUILTIN(f16) \
> +  VGET_LOW_BUILTIN(f32) \
> +  VGET_LOW_BUILTIN(f64) \
> +  VGET_LOW_BUILTIN(p8) \
> +  VGET_LOW_BUILTIN(p16) \
> +  VGET_LOW_BUILTIN(p64) \
> +  VGET_LOW_BUILTIN(s8) \
> +  VGET_LOW_BUILTIN(s16) \
> +  VGET_LOW_BUILTIN(s32) \
> +  VGET_LOW_BUILTIN(s64) \
> +  VGET_LOW_BUILTIN(u8) \
> +  VGET_LOW_BUILTIN(u16) \
> +  VGET_LOW_BUILTIN(u32) \
> +  VGET_LOW_BUILTIN(u64) \
> +  VGET_LOW_BUILTIN(bf16)
> +
>  typedef struct
>  {
>const char *name;
> @@ -697,6 +714,9 @@ typedef struct
>  #define VREINTERPRET_BUILTIN(A, B, L) \
>AARCH64_SIMD_BUILTIN_VREINTERPRET##L##_##A##_##B,
>  
> +#define VGET_LOW_BUILTIN(A) \
> +  AARCH64_SIMD_BUILTIN_VGET_LOW_##A,
> +
>  #undef VAR1
>  #define VAR1(T, N, MAP, FLAG, A) \
>AARCH64_SIMD_BUILTIN_##T##_##N##A,
> @@ -732,6 +752,7 @@ enum aarch64_builtins
>AARCH64_CRC32_BUILTIN_MAX,
>/* SIMD intrinsic builtins.  */
>AARCH64_SIMD_VREINTERPRET_BUILTINS
> +  AARCH64_SIMD_VGET_LOW_BUILTINS
>/* ARMv8.3-A Pointer Authentication Builtins.  */
>AARCH64_PAUTH_BUILTIN_AUTIA1716,
>AARCH64_PAUTH_BUILTIN_PACIA1716,
> @@ -823,8 +844,37 @@ static aarch64_fcmla_laneq_builtin_datum 
> aarch64_fcmla_lane_builtin_data[] = {
>   && SIMD_INTR_QUAL(A) == SIMD_INTR_QUAL(B) \
>},
>  
> +#undef VGET_LOW_BUILTIN
> +#define VGET_LOW_BUILTIN(A) \
> +  {"vget_low_" #A, \
> +   AARCH64_SIMD_BUILTIN_VGET_LOW_##A, \
> +   2, \
> +   { SIMD_INTR_MODE(A, d), SIMD_INTR_MODE(A, q) }, \
> +   { SIMD_INTR_QUAL(A), SIMD_INTR_QUAL(A) }, \
> +   FLAG_AUTO_FP, \
> +   false \
> +  },
> +
> +#define AARCH64_SIMD_VGET_LOW_BUILTINS \
> +  VGET_LOW_BUILTIN(f16) \
> +  VGET_LOW_BUILTIN(f32) \
> +  VGET_LOW_BUILTIN(f64) \
> +  VGET_LOW_BUILTIN(p8) \
> +  VGET_LOW_BUILTIN(p16) \
> +  VGET_LOW_BUILTIN(p64) \
> +  VGET_LOW_BUILTIN(s8) \
> +  VGET_LOW_BUILTIN(s16) \
> +  VGET_LOW_BUILTIN(s32) \
> +  VGET_LOW_BUILTIN(s64) \
> +  VGET_LOW_BUILTIN(u8) \
> +  VGET_LOW_BUILTIN(u16) \
> +  VGET_LOW_BUILTIN(u32) \
> +  VGET_LOW_BUILTIN(u64) \
> +  

Re: [Patch, aarch64] v7: Preparatory patch to place target independent and dependent changed code in one file

2024-05-20 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Alex/Richard:
>
> All comments are addressed.
>
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
>
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> Bootstrapped and regtested on aarch64-linux-gnu.
>
> Thanks & Regards
> Ajit
>
>
> aarch64: Preparatory patch to place target independent and
> dependent changed code in one file
>
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
>
> Target independent code is the Generic code with pure virtual function
> to interface betwwen target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> 2024-05-18  Ajit Kumar Agarwal  
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-ldp-fusion.cc: Factor out a
>   target-independent interface and move it to the head of the file
> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 555 +++
>  1 file changed, 373 insertions(+), 182 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 1d9caeab05d..e4e55b84f8b 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -138,6 +138,235 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int ) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const = 0;
> +  virtual void advance () = 0;
> +};
> +
> +// When querying should_handle_writeback, this enum is used to
> +// qualify which opportunities we are asking about.
> +enum class writeback {
> +  // Only those writeback opportunities that arise from existing
> +  // auto-increment accesses.
> +  EXISTING,
> +
> +  // All writeback opportunities including those that involve folding
> +  // base register updates into a non-writeback pair.

This misses:

> There should be a comma after "opportunities"

from the previous review.  I.e.:

  // All writeback opportunities, including those that involve folding
  // base register updates into a non-writeback pair.

OK with that change, thanks.

Richard


Re: [Patch, aarch64] v6: Preparatory patch to place target independent and, dependent changed code in one file

2024-05-17 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Alex/Richard:
>
> All review comments are addressed.
>
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
>
> Target independent code is the Generic code with pure virtual function
> to interface between target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> Bootstrapped and regtested on aarch64-linux-gnu.
>
> Thanks & Regards
> Ajit

Thanks for the patch and thanks to Alex for the reviews.  The patch
looks good to me apart from the minor nits below and the comments that
Alex had.  Please post the updated patch for a final ok though.

> aarch64: Preparatory patch to place target independent and
> dependent changed code in one file
>
> Common infrastructure of load store pair fusion is divided into target
> independent and target dependent changed code.
>
> Target independent code is the Generic code with pure virtual function
> to interface betwwen target independent and dependent code.
>
> Target dependent code is the implementation of pure virtual function for
> aarch64 target and the call to target independent code.
>
> 2024-05-15  Ajit Kumar Agarwal  
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-ldp-fusion.cc: Place target
>   independent and dependent changed code.

Not sure this is a complete sentence.  Maybe:

* config/aarch64/aarch64-ldp-fusion.cc: Factor out a
target-independent interface and move it to the head of the file.

That technically isn't detailed enough for a changelog entry,
but IMO we should use it anyway.  It's pointless to write the usual
amount of detail when the code is going to move soon.

> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 533 +++
>  1 file changed, 357 insertions(+), 176 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 1d9caeab05d..429e532ea3b 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -138,6 +138,225 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int ) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const = 0;
> +  virtual void advance () = 0;
> +};
> +
> +// When querying handle_writeback_opportunities, this enum is used to
> +// qualify which opportunities we are asking about.
> +enum class writeback {
> +  // Only those writeback opportunities that arise from existing
> +  // auto-increment accesses.
> +  EXISTING,
> +  // All writeback opportunities including those that involve folding

There should be a comma after "opportunities"

> +  // base register updates into a non-writeback pair.
> +  ALL
> +};
> +
> +struct pair_fusion {
> +  pair_fusion ()
> +  {
> +calculate_dominance_info (CDI_DOMINATORS);
> +df_analyze ();
> +crtl->ssa = new rtl_ssa::function_info (cfun);
> +  };

Unnecessary trailing ";".  I think it'd be better to define this and
the destructor out-of-line though.  For one thing, it'll reduce the number
of header file dependencies, once the code is moved to its own header file.

> +
> +  // Given:
> +  // - an rtx REG_OP, the non-memory operand in a load/store insn,
> +  // - a machine_mode MEM_MODE, the mode of the MEM in that insn, and
> +  // - a boolean LOAD_P (true iff the insn is a load), then:
> +  // return true if the access should be considered an FP/SIMD access.
> +  // Such accesses are segregated from GPR accesses, since we only want
> +  // to form pairs for accesses that use the same register file.
> +  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
> +  {
> +return false;
> +  }
> +
> +  // Return true if we should consider forming pairs from memory
> +  // accesses with operand mode MODE at this stage in compilation.
> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
> +
> +  // Return true iff REG_OP is a suitable register operand for a paired
> +  // memory access, where LOAD_P is true if we're asking about loads and
> +  // false for stores.  MODE gives the mode of the operand.
> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
> +   machine_mode mode) = 0;
> +
> +  // Return alias check limit.
> +  // This is needed to avoid unbounded quadratic behaviour when
> +  // performing alias analysis.
> +  virtual int pair_mem_alias_check_limit () = 0;

I think the end result should be to make this a target-independent
--param, but this is ok/good as an intermediate step.

> +
> +  // Returns true if we should try to handle writeback opportunities.

s/Returns/Return/

> +  // WHICH determines the kinds of writeback opportunities the caller
> +  // is asking about.
> +  

Re: [PATCH 00/12] aarch64: Extend aarch64_feature_flags to 128 bits

2024-05-17 Thread Richard Sandiford
Andrew Carlotti  writes:
> The end goal of the series is to change the definition of 
> aarch64_feature_flags
> from a uint64_t typedef to a class with 128 bits of storage.  This class uses
> operator overloading to mimic the existing integer interface as much as
> possible, but with added restrictions to facilate type checking and
> extensibility.
>
> Patches 01-10 are preliminary enablement work, and have passed regression
> testing.  Are these ok for master?
>
> Patch 11 is an RFC, and the only patch that touches the middle end.  I am
> seeking clarity on which part(s) of the compiler should be expected to handle
> or prevent non-bool types in instruction pattern conditions.  The actual patch
> does not compile by itself (though it does in combination with 12/12), but 
> that
> is not important to the questions I'm asking.
>
> Patch 12 is then a small patch that actually replaces the uint64_t typedef 
> with
> a class.  I think this patch is fine in it's current form, but it depends on a
> resolution to the issues in patch 11/12 first.

Thanks for doing this.

Rather than disallowing flags == 0, etc., I think we should allow
aarch64_feature_flags to be constructed from a single uint64_t.
It's a lossless conversion.  The important thing is that we don't
allow conversions the other way (and the patch doesn't allow them).

Also, I think we should make the new class in 12/12 be a templated
 type that provides an N-bit bitmask.  It should arguably
also be target-independent code.  aarch64_feature_flags would then be
an alias with the appropriate number of bits.

For the RFC in 11/12, how about, as another prepatch before 12/12,
removing all the mechanical:

#define AARCH64_ISA_LS64   (aarch64_isa_flags & AARCH64_FL_LS64)

style macros and replacing uses with something like:

  AARCH64_HAVE_ISA (LS64)

Uses outside aarch64.h should arguably be changed to TARGET_* instead,
since the convention seems to be that TARGET_* checks the underlying
ISA flag and also any other relevant conditions (where applicable).

Thanks,
Richard


Re: [PATCH] AArch64: Use LDP/STP for large struct types

2024-05-16 Thread Richard Sandiford
Richard Sandiford  writes:
> Wilco Dijkstra  writes:
>> Use LDP/STP for large struct types as they have useful immediate offsets and 
>> are typically faster.
>> This removes differences between little and big endian and allows use of 
>> LDP/STP without UNSPEC.
>>
>> Passes regress and bootstrap, OK for commit?
>>
>> gcc:
>> * config/aarch64/aarch64.cc (aarch64_classify_address): Treat SIMD 
>> structs identically
>> in little and bigendian.
>> * config/aarch64/aarch64.md (aarch64_mov): Remove VSTRUCT 
>> instructions.
>> (aarch64_be_mov): Allow little-endian, rename to 
>> aarch64_mov.
>> (aarch64_be_movoi): Allow little-endian, rename to aarch64_movoi.
>> (aarch64_be_movci): Allow little-endian, rename to aarch64_movci.
>> (aarch64_be_movxi): Allow little-endian, rename to aarch64_movxi.
>> Remove big-endian special case in define_split variants.
>>
>> gcc/testsuite:
>> * gcc.target/aarch64/torture/simd-abi-8.c: Update to check for 
>> LDP/STP.
>
> [...]
> So another alternative would be to go with the patch as-is,
> but add a new mechanism for gimple to query the valid addresses
> for IFN_(MASK_)LOAD_LANES and IFN_(MASK_)STORE_LANES, rather than
> relying purely on the legitimate address mechanism,.  Ideally, the new
> interface would be generic enough that we could use it for target (md)
> builtins as well, to better optimise ACLE code.

Gah, just realised after sending that there's another potential problem.
Currently inline asms can assume that "m" will only include the LD1/ST1
range for little-endian.  We might need to consider using
TARGET_MEM_CONSTRAINT, so that we continue to present the same
interface to asms, but can use the wider range internally.

Thanks,
Richard


Re: [PATCH] AArch64: Use LDP/STP for large struct types

2024-05-16 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Use LDP/STP for large struct types as they have useful immediate offsets and 
> are typically faster.
> This removes differences between little and big endian and allows use of 
> LDP/STP without UNSPEC.
>
> Passes regress and bootstrap, OK for commit?
>
> gcc:
> * config/aarch64/aarch64.cc (aarch64_classify_address): Treat SIMD 
> structs identically
> in little and bigendian.
> * config/aarch64/aarch64.md (aarch64_mov): Remove VSTRUCT 
> instructions.
> (aarch64_be_mov): Allow little-endian, rename to 
> aarch64_mov.
> (aarch64_be_movoi): Allow little-endian, rename to aarch64_movoi.
> (aarch64_be_movci): Allow little-endian, rename to aarch64_movci.
> (aarch64_be_movxi): Allow little-endian, rename to aarch64_movxi.
> Remove big-endian special case in define_split variants.
>
> gcc/testsuite:
> * gcc.target/aarch64/torture/simd-abi-8.c: Update to check for 
> LDP/STP.

I'm nervous about approving the removal of something that was deliberately
added by the initial commits. :)  But, even ignoring the extra offset range,
using LDP/STP makes strong intuitive sense for 2-register modes.  And for
3- and 4-registers modes, it's not surprising if the split that the
patch performs is (at worst) equivalent to what the hardware would do
itself or (at best) something that the hardware handles slightly better.

It's also a significant clean-up.

My only concern is that the main uses of these modes are for LD[234] and
ST[234].  By imposing the LD1/ST1 restrictions, the current little-endian
definition of "m" also corresponds to what LD[234] and ST[234] expect.
This in turn means that ivopts will optimise induction variable selection
to account for the fact that LD[234] and ST[234] do not support offsets.

I think the effect of the patch will be to make ivopts optimise LD[234]
and ST[234] on the assumption that they have the same range as LDP/STP.
We could avoid that if we

(1) Keep:

> @@ -10482,14 +10481,6 @@ aarch64_classify_address (struct 
> aarch64_address_info *info,
>&& (code != REG && code != PLUS))
>  return false;
>  
> -  /* On LE, for AdvSIMD, don't support anything other than POST_INC or
> - REG addressing.  */
> -  if (advsimd_struct_p
> -  && TARGET_SIMD
> -  && !BYTES_BIG_ENDIAN
> -  && (code != POST_INC && code != REG))
> -return false;
> -
>gcc_checking_assert (GET_MODE (x) == VOIDmode
>  || SCALAR_INT_MODE_P (GET_MODE (x)));
>  

but drop the !BYTES_BIG_ENDIAN condition.

(2) Make Ump a defined_relaxed_memory_constraint (so that it accepts
more than "m" does).

(3) Use Ump instead of "o" in the move patterns.

Of course, this might make pure gimple-level data-shuffling worse.
I suppose it could also make RTL passes handle your memcpy use case
more pessimistically, although I'm not sure whether that would be for
legitimate reasons.

So another alternative would be to go with the patch as-is,
but add a new mechanism for gimple to query the valid addresses
for IFN_(MASK_)LOAD_LANES and IFN_(MASK_)STORE_LANES, rather than
relying purely on the legitimate address mechanism,.  Ideally, the new
interface would be generic enough that we could use it for target (md)
builtins as well, to better optimise ACLE code.

So the patch is OK as-is from my POV, but I think it's relatively
important that we try to fix the ivopts handling before GCC 15.

Thanks,
Richard

> ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 16b7445d9f72f77a98ab262e21fd24e6cc97eba0..bb8b6963fd5117be82afe6ccd7154ae5302c3691
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -7917,32 +7917,6 @@
>[(set_attr "type" "neon_store1_4reg")]
>  )
>  
> -(define_insn "*aarch64_mov"
> -  [(set (match_operand:VSTRUCT_QD 0 "aarch64_simd_nonimmediate_operand")
> - (match_operand:VSTRUCT_QD 1 "aarch64_simd_general_operand"))]
> -  "TARGET_SIMD && !BYTES_BIG_ENDIAN
> -   && (register_operand (operands[0], mode)
> -   || register_operand (operands[1], mode))"
> -  {@ [ cons: =0 , 1   ; attrs: type, length]
> - [ w, w   ; multiple   ,   ] #
> - [ Utv  , w   ; neon_store_reg_q , 4 ] 
> st1\t{%S1. - %1.}, %0
> - [ w, Utv ; neon_load_reg_q  , 4 ] 
> ld1\t{%S0. - %0.}, %1
> -  }
> -)
> -
> -(define_insn "*aarch64_mov"
> -  [(set (match_operand:VSTRUCT 0 "aarch64_simd_nonimmediate_operand")
> - (match_operand:VSTRUCT 1 "aarch64_simd_general_operand"))]
> -  "TARGET_SIMD && !BYTES_BIG_ENDIAN
> -   && (register_operand (operands[0], mode)
> -   || register_operand (operands[1], mode))"
> -  {@ [ cons: =0 , 1   ; attrs: type, length]
> - [ w, w   ; multiple   ,   ] #
> - [ Utv  , w   ; neon_store_reg_q , 4 ] 

Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Richard Sandiford
Tamar Christina  writes:
>> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
>> >>  wrote:
>> >> >
>> >> > Hi All,
>> >> >
>> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause that 
>> >> > state
>> >> > that for predicated operations that also produce a predicate it is 
>> >> > preferred
>> >> > that the codegen should use a different register for the destination 
>> >> > than that
>> >> > of the input predicate in order to avoid a performance overhead.
>> >> >
>> >> > This of course has the problem that it increases register pressure and 
>> >> > so
>> should
>> >> > be done with care.  Additionally not all micro-architectures have this
>> >> > consideration and so it shouldn't be done as a default thing.
>> >> >
>> >> > The patch series adds support for doing conditional early clobbers 
>> >> > through a
>> >> > combination of new alternatives and attributes to control their 
>> >> > availability.
>> >>
>> >> You could have two alternatives, one with early clobber and one with
>> >> a matching constraint where you'd disparage the matching constraint one?
>> >>
>> >
>> > Yeah, that's what I do, though there's no need to disparage the non-early 
>> > clobber
>> > alternative as the early clobber alternative will naturally get a penalty 
>> > if it needs a
>> > reload.
>> 
>> But I think Richard's suggestion was to disparage the one with a matching
>> constraint (not the earlyclobber), to reflect the increased cost of
>> reusing the register.
>> 
>> We did take that approach for gathers, e.g.:
>> 
>>  [, Z,   w, Ui1, Ui1, Upl] ld1\t%0.s, %5/z, [%2.s]
>>  [?w, Z,   0, Ui1, Ui1, Upl] ^
>> 
>> The (supposed) advantage is that, if register pressure is so tight
>> that using matching registers is the only alternative, we still
>> have the opportunity to do that, as a last resort.
>> 
>> Providing only an earlyclobber version means that using the same
>> register is prohibited outright.  If no other register is free, the RA
>> would need to spill something else to free up a temporary register.
>> And it might then do the equivalent of (pseudo-code):
>> 
>>   not p1.b, ..., p0.b
>>   mov p0.d, p1.d
>> 
>> after spilling what would otherwise have occupied p1.  In that
>> situation it would be better use:
>> 
>>   not p0.b, ..., p0.b
>> 
>> and not introduce the spill of p1.
>
> I think I understood what Richi meant, but I thought it was already working 
> that way.

The suggestion was to use matching constraints (like "0") though,
whereas the patch doesn't.  I think your argument is that you don't
need to use matching constraints.  But that's different from the
suggestion (and from how we handle gathers).

I was going to say in response to patch 3 (but got distracted, sorry):
I don't think we should have:

   , Upa, ...
   Upa, Upa, ...

(taken from the pure logic ops) enabled at the same time.  Even though
it works for the testcases, I don't think it has well-defined semantics.

The problem is that, taken on its own, the second alternative says that
matching operands are free.  And fundamentally, I don't think the costs
*must* take the earlyclobber alternative over the non-earlyclobber one
(when costing during IRA, for instance).  In principle, the cheapest
is best.

The aim of the gather approach is to make each alternative correct in
isolation.  In:

  [, Z,   w, Ui1, Ui1, Upl] ld1\t%0.s, %5/z, [%2.s]
  [?w, Z,   0, Ui1, Ui1, Upl] ^

the second alternative says that it is possible to have operands 0
and 2 be the same vector register, but using that version has the
cost of an extra reload.  In that sense the alternatives are
(essentially) consistent about the restriction.

> i.e. as one of the testcases I had:
>
>> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 
>> -ffixed-p[1-15]
>
> foo:
> mov z31.h, w0
> ptrue   p0.b, all
> cmplo   p0.h, p0/z, z0.h, z31.h
> b   use
>
> and reload did not force a spill.
>
> My understanding of how this works, and how it seems to be working is that 
> since reload costs
> Alternative from front to back the cheapest one wins and it stops evaluating 
> the rest.
>
> The early clobber case is first and preferred, however when it's not 
> possible, i.e. requires a non-pseudo
> reload, the reload cost is added to the alternative.
>
> However you're right that in the following testcase:
>
> -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 
> -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 
> -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p14 -fdump-rtl-reload
>
> i.e. giving it an extra free register inexplicably causes a spill:
>
> foo:
> addvl   sp, sp, #-1
> mov z31.h, w0
> ptrue   p0.b, all
> str p15, [sp]
> cmplo   p15.h, p0/z, z0.h, z31.h
> mov p0.b, p15.b
> ldr p15, [sp]
> addvl   sp, sp, #1
> b   use
>
> so that's 

Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

2024-05-15 Thread Richard Sandiford
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Biener 
>> Sent: Wednesday, May 15, 2024 12:20 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; ktkac...@gcc.gnu.org; Richard Sandiford
>> 
>> Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on 
>> certain
>> operations.
>> 
>> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
>>  wrote:
>> >
>> > Hi All,
>> >
>> > Some Neoverse Software Optimization Guides (SWoG) have a clause that state
>> > that for predicated operations that also produce a predicate it is 
>> > preferred
>> > that the codegen should use a different register for the destination than 
>> > that
>> > of the input predicate in order to avoid a performance overhead.
>> >
>> > This of course has the problem that it increases register pressure and so 
>> > should
>> > be done with care.  Additionally not all micro-architectures have this
>> > consideration and so it shouldn't be done as a default thing.
>> >
>> > The patch series adds support for doing conditional early clobbers through 
>> > a
>> > combination of new alternatives and attributes to control their 
>> > availability.
>> 
>> You could have two alternatives, one with early clobber and one with
>> a matching constraint where you'd disparage the matching constraint one?
>> 
>
> Yeah, that's what I do, though there's no need to disparage the non-early 
> clobber
> alternative as the early clobber alternative will naturally get a penalty if 
> it needs a
> reload.

But I think Richard's suggestion was to disparage the one with a matching
constraint (not the earlyclobber), to reflect the increased cost of
reusing the register.

We did take that approach for gathers, e.g.:

 [, Z,   w, Ui1, Ui1, Upl] ld1\t%0.s, %5/z, [%2.s]
 [?w, Z,   0, Ui1, Ui1, Upl] ^

The (supposed) advantage is that, if register pressure is so tight
that using matching registers is the only alternative, we still
have the opportunity to do that, as a last resort.

Providing only an earlyclobber version means that using the same
register is prohibited outright.  If no other register is free, the RA
would need to spill something else to free up a temporary register.
And it might then do the equivalent of (pseudo-code):

  not p1.b, ..., p0.b
  mov p0.d, p1.d

after spilling what would otherwise have occupied p1.  In that
situation it would be better use:

  not p0.b, ..., p0.b

and not introduce the spill of p1.

Another case where using matching registers is natural is for
loop-carried dependencies.  Do we want to keep them in:

   loop:
  ...no other sets of p0
  not p0.b, ..., p0.b
  ...no other sets of p0
  bne loop

or should we split it to:

   loop:
  ...no other sets of p0
  not p1.b, ..., p0.b
  mov p0.d, p1.d
  ...no other sets of p0
  bne loop

?

Thanks,
Richard

>
> Cheers,
> Tamar
>
>> > On high register pressure we also use LRA's costing to prefer not to use 
>> > the
>> > alternative and instead just use the tie as this is preferable to a reload.
>> >
>> > Concretely this patch series does:
>> >
>> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2
>> >
>> > foo:
>> > mov z31.h, w0
>> > ptrue   p3.b, all
>> > cmplo   p0.h, p3/z, z0.h, z31.h
>> > b   use
>> >
>> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n1+sve
>> >
>> > foo:
>> > mov z31.h, w0
>> > ptrue   p0.b, all
>> > cmplo   p0.h, p0/z, z0.h, z31.h
>> > b   use
>> >
>> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -
>> ffixed-p[1-15]
>> >
>> > foo:
>> > mov z31.h, w0
>> > ptrue   p0.b, all
>> > cmplo   p0.h, p0/z, z0.h, z31.h
>> > b   use
>> >
>> > Testcases for the changes are in the last patch of the series.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Thanks,
>> > Tamar
>> >
>> > ---
>> >
>> > --


Re: [PATCH 1/4]AArch64: convert several predicate patterns to new compact syntax

2024-05-15 Thread Richard Sandiford
Thanks for doing this a pre-patch.  Minor request below:

Tamar Christina  writes:
>  ;; Perform a logical operation on operands 2 and 3, using operand 1 as
> @@ -6676,38 +6690,42 @@ (define_insn "@aarch64_pred__z"
>  (define_insn "*3_cc"
>[(set (reg:CC_NZC CC_REGNUM)
>   (unspec:CC_NZC
> -   [(match_operand:VNx16BI 1 "register_operand" "Upa")
> +   [(match_operand:VNx16BI 1 "register_operand")
>  (match_operand 4)
>  (match_operand:SI 5 "aarch64_sve_ptrue_flag")
>  (and:PRED_ALL
>(LOGICAL:PRED_ALL
> -(match_operand:PRED_ALL 2 "register_operand" "Upa")
> -(match_operand:PRED_ALL 3 "register_operand" "Upa"))
> +(match_operand:PRED_ALL 2 "register_operand")
> +(match_operand:PRED_ALL 3 "register_operand"))
>(match_dup 4))]
> UNSPEC_PTEST))
> -   (set (match_operand:PRED_ALL 0 "register_operand" "=Upa")
> +   (set (match_operand:PRED_ALL 0 "register_operand")
>   (and:PRED_ALL (LOGICAL:PRED_ALL (match_dup 2) (match_dup 3))
> (match_dup 4)))]
>"TARGET_SVE"
> -  "s\t%0.b, %1/z, %2.b, %3.b"
> +  {@ [ cons: =0, 1  , 2  , 3  , 4, 5 ]
> + [ Upa , Upa, Upa, Upa,  ,   ] s\t%0.b, %1/z, %2.b, %3.b
> +  }
>  )

Could we leave out these empty trailing constraints?  They're quite
common in SVE & SME patterns and are specifically not meant to influence
instruction selection.  E.g. we've done the same thing for *cnot
(to pick a random example).

Agree with Kyrill's ok otherwise.

Richard


Re: [PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber

2024-05-15 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This adds a new tuning parameter EARLY_CLOBBER_SVE_PRED_DEST for AArch64 to
> allow us to conditionally enable the early clobber alternatives based on the
> tuning models.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-tuning-flags.def
>   (EARLY_CLOBBER_SVE_PRED_DEST): New.
>   * config/aarch64/aarch64.h (TARGET_SVE_PRED_CLOBBER): New.
>   * config/aarch64/aarch64.md (pred_clobber): New.
>   (arch_enabled): Use it.
>
> ---
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index 
> d5bcaebce770f0b217aac783063d39135f754c77..49fbad3ff28bc82b25c61ac501ccf533ec4b4c3f
>  100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -48,4 +48,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
> AVOID_CROSS_LOOP_FMA)
>  
>  AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA)
>  
> +/* Enable is the target prefers to use a fresh register for predicate outputs
> +   rather than re-use an input predicate register.  */
> +AARCH64_EXTRA_TUNING_OPTION ("early_clobber_sve_pred_dest", 
> EARLY_CLOBBER_SVE_PRED_DEST)

Sorry for the bike-shedding, but how about something like "avoid_pred_rmw"?
(I'm open to other suggestions.)  Just looking for something that describes
either the architecture or the end result that we want to achieve.
And preferable something fairly short :)

avoid_* would be consistent with the existing "avoid_cross_loop_fma".

> +
>  #undef AARCH64_EXTRA_TUNING_OPTION
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 
> bbf11faaf4b4340956094a983f8b0dc2649b2d27..76a18dd511f40ebb58ed12d56b46c74084ba7c3c
>  100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -495,6 +495,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = 
> AARCH64_FL_SM_OFF;
>  enabled through +gcs.  */
>  #define TARGET_GCS (AARCH64_ISA_GCS)
>  
> +/*  Prefer different predicate registers for the output of a predicated 
> operation over
> +re-using an existing input predicate.  */
> +#define TARGET_SVE_PRED_CLOBBER (TARGET_SVE \
> +  && (aarch64_tune_params.extra_tuning_flags \
> +  & 
> AARCH64_EXTRA_TUNE_EARLY_CLOBBER_SVE_PRED_DEST))
>  
>  /* Standard register usage.  */
>  
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> dbde066f7478bec51a8703b017ea553aa98be309..1ecd1a2812969504bd5114a53473b478c5ddba82
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -445,6 +445,10 @@ (define_enum_attr "arch" "arches" (const_string "any"))
>  ;; target-independent code.
>  (define_attr "is_call" "no,yes" (const_string "no"))
>  
> +;; Indicates whether we want to enable the pattern with an optional early
> +;; clobber for SVE predicates.
> +(define_attr "pred_clobber" "no,yes" (const_string "no"))
> +
>  ;; [For compatibility with Arm in pipeline models]
>  ;; Attribute that specifies whether or not the instruction touches fp
>  ;; registers.
> @@ -461,7 +465,8 @@ (define_attr "fp" "no,yes"
>  (define_attr "arch_enabled" "no,yes"
>(if_then_else
>  (ior
> - (eq_attr "arch" "any")
> + (and (eq_attr "arch" "any")
> +  (eq_attr "pred_clobber" "no"))
>  
>   (and (eq_attr "arch" "rcpc8_4")
>(match_test "AARCH64_ISA_RCPC8_4"))
> @@ -488,7 +493,10 @@ (define_attr "arch_enabled" "no,yes"
>(match_test "TARGET_SVE"))
>  
>   (and (eq_attr "arch" "sme")
> -  (match_test "TARGET_SME")))
> +  (match_test "TARGET_SME"))
> +
> + (and (eq_attr "pred_clobber" "yes")
> +  (match_test "TARGET_SVE_PRED_CLOBBER")))

IMO it'd be bettero handle pred_clobber separately from arch, as a new
top-level AND:

  (and
(ior
  (eq_attr "pred_clobber" "no")
  (match_test "!TARGET_..."))
(ior
  ...existing arch tests...))

Thanks,
Richard


Re: [PATCH] AArch64: Use UZP1 instead of INS

2024-05-15 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Use UZP1 instead of INS when combining low and high halves of vectors.
> UZP1 has 3 operands which improves register allocation, and is faster on
> some microarchitectures.
>
> Passes regress & bootstrap, OK for commit?

OK, thanks.  We can add core-specific tuning later if a supported core
strongly prefers INS for some reason, but I agree that the three-address
nature of UZP1 makes it the better default choice.

Richard

>
> gcc:
> * config/aarch64/aarch64-simd.md (aarch64_combine_internal):
> Use UZP1 instead of INS.
> (aarch64_combine_internal_be): Likewise.
>
> gcc/testsuite:
> * gcc.target/aarch64/ldp_stp_16.c: Update to check for UZP1.  
> * gcc.target/aarch64/pr109072_1.c: Likewise.
> * gcc.target/aarch64/vec-init-14.c: Likewise.
> * gcc.target/aarch64/vec-init-9.c: Likewise.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> f8bb973a278c7964f3e3a4f7154a0ab62214b7cf..16b7445d9f72f77a98ab262e21fd24e6cc97eba0
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4388,7 +4388,7 @@
> && (register_operand (operands[0], mode)
> || register_operand (operands[2], mode))"
>{@ [ cons: =0 , 1  , 2   ; attrs: type   , arch  ]
> - [ w, 0  , w   ; neon_ins, simd  ] 
> ins\t%0.[1], %2.[0]
> + [ w, w  , w   ; neon_permute, simd  ] 
> uzp1\t%0.2, %1.2, %2.2
>   [ w, 0  , ?r  ; neon_from_gp, simd  ] 
> ins\t%0.[1], %2
>   [ w, 0  , ?r  ; f_mcr , * ] 
> fmov\t%0.d[1], %2
>   [ w, 0  , Utv ; neon_load1_one_lane , simd  ] 
> ld1\t{%0.}[1], %2
> @@ -4407,7 +4407,7 @@
> && (register_operand (operands[0], mode)
> || register_operand (operands[2], mode))"
>{@ [ cons: =0 , 1  , 2   ; attrs: type   , arch  ]
> - [ w, 0  , w   ; neon_ins, simd  ] 
> ins\t%0.[1], %2.[0]
> + [ w, w  , w   ; neon_permute, simd  ] 
> uzp1\t%0.2, %1.2, %2.2
>   [ w, 0  , ?r  ; neon_from_gp, simd  ] 
> ins\t%0.[1], %2
>   [ w, 0  , ?r  ; f_mcr , * ] 
> fmov\t%0.d[1], %2
>   [ w, 0  , Utv ; neon_load1_one_lane , simd  ] 
> ld1\t{%0.}[1], %2
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c 
> b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> index 
> f1f46e051a86d160a7f7f14872108da87b444ca1..95835aa2eb41c289e7b74f19bb56cf6fa23a3045
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> @@ -80,16 +80,16 @@ CONS2_FN (2, float);
>  
>  /*
>  ** cons2_4_float:{ target aarch64_little_endian }
> -**   ins v0.s\[1\], v1.s\[0\]
> -**   stp d0, d0, \[x0\]
> -**   stp d0, d0, \[x0, #?16\]
> +**   uzp1v([0-9])\.2s, v0\.2s, v1\.2s
> +**   stp d\1, d\1, \[x0\]
> +**   stp d\1, d\1, \[x0, #?16\]
>  **   ret
>  */
>  /*
>  ** cons2_4_float:{ target aarch64_big_endian }
> -**   ins v1.s\[1\], v0.s\[0\]
> -**   stp d1, d1, \[x0\]
> -**   stp d1, d1, \[x0, #?16\]
> +**   uzp1v([0-9])\.2s, v1\.2s, v0\.2s
> +**   stp d\1, d\1, \[x0\]
> +**   stp d\1, d\1, \[x0, #?16\]
>  **   ret
>  */
>  CONS2_FN (4, float);
> @@ -125,8 +125,8 @@ CONS4_FN (2, float);
>  
>  /*
>  ** cons4_4_float:
> -**   ins v[0-9]+\.s[^\n]+
> -**   ins v[0-9]+\.s[^\n]+
> +**   uzp1v[0-9]+\.2s[^\n]+
> +**   uzp1v[0-9]+\.2s[^\n]+
>  **   zip1v([0-9]+).4s, [^\n]+
>  **   stp q\1, q\1, \[x0\]
>  **   stp q\1, q\1, \[x0, #?32\]
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c 
> b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
> index 
> 6c1d2b0bdccfb74b80d938a0d94413f0f9dda5ab..0fc195a598f3b82ff188b3151e77e1272254b78c
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/pr109072_1.c
> @@ -54,7 +54,7 @@ f32x2_1 (float32_t x)
>  
>  /*
>  ** f32x2_2:
> -**   ins v0\.s\[1\], v1.s\[0\]
> +**   uzp1v0\.2s, v0\.2s, v1\.2s
>  **   ret
>  */
>  float32x2_t
> @@ -165,7 +165,7 @@ f64x2_1 (float64_t x)
>  
>  /*
>  ** f64x2_2:
> -**   ins v0\.d\[1\], v1.d\[0\]
> +**   uzp1v0\.2d, v0\.2d, v1\.2d
>  **   ret
>  */
>  float64x2_t
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-14.c 
> b/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
> index 
> 02875088cd98833882cdf15b14dcb426951e428f..1a2cc9fbf473ad0de2d8ef97d7efdbe40d959866
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-14.c
> @@ -67,7 +67,7 @@ int32x2_t s32_6(int32_t a0, int32_t a1) {
>  
>  /*
>  ** f32_1:
> -**   ins v0\.s\[1\], v1\.s\[0\]
> +**   uzp1v0\.2s, v0\.2s, v1\.2s
>  **   ret
>  */
>  float32x2_t f32_1(float32_t a0, float32_t a1) {
> @@ -90,7 +90,7 @@ float32x2_t 

[pushed] aarch64: Avoid using mismatched ZERO ZA sizes

2024-04-12 Thread Richard Sandiford
The svzero_mask_za intrinsic tried to use the shortest combination
of .b, .h, .s and .d tiles, allowing mixtures of sizes where necessary.
However, Iain S pointed out that LLVM instead requires the tiles to
have the same suffix.  GAS supports both versions, so this patch
generates the LLVM-friendly form.

Tested on aarch64-linux-gnu & pushed.

Please revert the patch if it causes any problems.

Richard


gcc/
* config/aarch64/aarch64.cc (aarch64_output_sme_zero_za): Require
all tiles to have the same suffix.

gcc/testsuite/
* gcc.target/aarch64/sme/acle-asm/zero_mask_za.c (zero_mask_za_ab)
(zero_mask_za_d7, zero_mask_za_bf): Expect a list of .d tiles instead
of a mixture.
---
 gcc/config/aarch64/aarch64.cc | 20 +++
 .../aarch64/sme/acle-asm/zero_mask_za.c   |  6 +++---
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index a2e3d208d76..1beec94629d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -13210,29 +13210,33 @@ aarch64_output_sme_zero_za (rtx mask)
   /* The last entry in the list has the form "za7.d }", but that's the
  same length as "za7.d, ".  */
   static char buffer[sizeof("zero\t{ ") + sizeof ("za7.d, ") * 8 + 1];
-  unsigned int i = 0;
-  i += snprintf (buffer + i, sizeof (buffer) - i, "zero\t");
-  const char *prefix = "{ ";
   for (auto  : tiles)
 {
   unsigned int tile_mask = tile.mask;
   unsigned int tile_index = 0;
+  unsigned int i = snprintf (buffer, sizeof (buffer), "zero\t");
+  const char *prefix = "{ ";
+  auto remaining_mask = mask_val;
   while (tile_mask < 0x100)
{
- if ((mask_val & tile_mask) == tile_mask)
+ if ((remaining_mask & tile_mask) == tile_mask)
{
  i += snprintf (buffer + i, sizeof (buffer) - i, "%sza%d.%c",
 prefix, tile_index, tile.letter);
  prefix = ", ";
- mask_val &= ~tile_mask;
+ remaining_mask &= ~tile_mask;
}
  tile_mask <<= 1;
  tile_index += 1;
}
+  if (remaining_mask == 0)
+   {
+ gcc_assert (i + 3 <= sizeof (buffer));
+ snprintf (buffer + i, sizeof (buffer) - i, " }");
+ return buffer;
+   }
 }
-  gcc_assert (mask_val == 0 && i + 3 <= sizeof (buffer));
-  snprintf (buffer + i, sizeof (buffer) - i, " }");
-  return buffer;
+  gcc_unreachable ();
 }
 
 /* Return size in bits of an arithmetic operand which is shifted/scaled and
diff --git a/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/zero_mask_za.c 
b/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/zero_mask_za.c
index 9ce7331ebdd..2ba8f8cc332 100644
--- a/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/zero_mask_za.c
+++ b/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/zero_mask_za.c
@@ -103,21 +103,21 @@ PROTO (zero_mask_za_aa, void, ()) { svzero_mask_za 
(0xaa); }
 
 /*
 ** zero_mask_za_ab:
-** zero{ za1\.h, za0\.d }
+** zero{ za0\.d, za1\.d, za3\.d, za5\.d, za7\.d }
 ** ret
 */
 PROTO (zero_mask_za_ab, void, ()) { svzero_mask_za (0xab); }
 
 /*
 ** zero_mask_za_d7:
-** zero{ za0\.h, za1\.d, za7\.d }
+** zero{ za0\.d, za1\.d, za2\.d, za4\.d, za6\.d, za7\.d }
 ** ret
 */
 PROTO (zero_mask_za_d7, void, ()) { svzero_mask_za (0xd7); }
 
 /*
 ** zero_mask_za_bf:
-** zero{ za1\.h, za0\.s, za2\.d }
+** zero{ za0\.d, za1\.d, za2\.d, za3\.d, za4\.d, za5\.d, za7\.d }
 ** ret
 */
 PROTO (zero_mask_za_bf, void, ()) { svzero_mask_za (0xbf); }
-- 
2.25.1



Re: [PATCH] docs: Update function multiversioning documentation

2024-04-12 Thread Richard Sandiford
Hi Andrew,

Thanks for doing this.  I think it improves the organisation of the
FMV documentation and adds some details that were previously missing.

I've made some suggestions below, but documentation is subjective
and I realise that not everyone will agree with them.

I've also added Sandra to cc: in case she has time to help with this.
[original patch: 
https://gcc.gnu.org/pipermail/gcc-patches/2024-April/649071.html]

Andrew Carlotti  writes:
> Add target_version attribute to Common Function Attributes and update
> target and target_clones documentation.  Move shared detail and examples
> to the Function Multiversioning page.  Add target-specific details to
> target-specific pages.
>
> ---
>
> I've built and checked the info and dvi outputs.  Ok for master?
>
> gcc/ChangeLog:
>
>   * doc/extend.texi (Common Function Attributes): Update target
>   and target_clones documentation, and add target_version.
>   (AArch64 Function Attributes): Add ACLE reference and list
>   supported features.
>   (PowerPC Function Attributes): List supported features.
>   (x86 Function Attributes): Mention function multiversioning.
>   (Function Multiversioning): Update, and move shared detail here.
>
>
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 
> 7b54a241a7bfde03ce86571be9486b30bcea6200..78cc7ad2903b61a06b618b82ba7ad52ed42d944a
>  100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -4178,18 +4178,27 @@ and @option{-Wanalyzer-tainted-size}.
>  Multiple target back ends implement the @code{target} attribute
>  to specify that a function is to
>  be compiled with different target options than specified on the
> -command line.  The original target command-line options are ignored.
> -One or more strings can be provided as arguments.
> -Each string consists of one or more comma-separated suffixes to
> -the @code{-m} prefix jointly forming the name of a machine-dependent
> -option.  @xref{Submodel Options,,Machine-Dependent Options}.
> -
> +command line.  One or more strings can be provided as arguments.
> +The attribute may override the original target command-line options, or it 
> may
> +be combined with them in a target-specific manner.

It's hard to tell from this what the conditions for "may" are,
e.g. whether it depends on the arguments, on the back end, or both.
Could you add a bit more text to clarify (even if it's just a forward
reference)?

With that extra text, and perhaps without, I think it's clearer to
say this after...

>  The @code{target} attribute can be used for instance to have a function
>  compiled with a different ISA (instruction set architecture) than the
> -default.  @samp{#pragma GCC target} can be used to specify target-specific
> +default.

...this.  I.e.:

  Multiple target back ends implement [...] command-line.  
  The @code{target} attribute can be used [...] the default.

  

> +
> +@samp{#pragma GCC target} can be used to specify target-specific
>  options for more than one function.  @xref{Function Specific Option Pragmas},
>  for details about the pragma.
>  
> +On x86, the @code{target} attribute can also be used to create multiple
> +versions of a function, compiled with different target-specific options.
> +@xref{Function Multiversioning} for more details.

It might be clearer to put this at the end, since the rest of the section
goes back to talking about the non-FMV usage.  Perhaps the same goes for
the pragma part.

Also, how about saying that, on AArch64, the equivalent functionality
is provided by the target_version attribute?

> +
> +The options supported by the @code{target} attribute are specific to each
> +target; refer to @ref{x86 Function Attributes}, @ref{PowerPC Function
> +Attributes}, @ref{ARM Function Attributes}, @ref{AArch64 Function 
> Attributes},
> +@ref{Nios II Function Attributes}, and @ref{S/390 Function Attributes}
> +for details.
> +
>  For instance, on an x86, you could declare one function with the
>  @code{target("sse4.1,arch=core2")} attribute and another with
>  @code{target("sse4a,arch=amdfam10")}.  This is equivalent to
> @@ -4211,39 +4220,18 @@ multiple options is equivalent to separating the 
> option suffixes with
>  a comma (@samp{,}) within a single string.  Spaces are not permitted
>  within the strings.
>  
> -The options supported are specific to each target; refer to @ref{x86
> -Function Attributes}, @ref{PowerPC Function Attributes},
> -@ref{ARM Function Attributes}, @ref{AArch64 Function Attributes},
> -@ref{Nios II Function Attributes}, and @ref{S/390 Function Attributes}
> -for details.
> -
>  @cindex @code{target_clones} function attribute
>  @item target_clones (@var{options})
>  The @code{target_clones} attribute is used to specify that a function
> -be cloned into multiple versions compiled with different target options
> -than specified on the command line.  The supported options and restrictions
> -are the same as for @code{target} attribute.
> -
> -For instance, on an 

Re: [PATCH] aarch64: Add rcpc3 dependency on rcpc2 and rcpc

2024-04-12 Thread Richard Sandiford
Andrew Carlotti  writes:
> We don't yet have a separate feature flag for FEAT_LRCPC2 (and adding
> one will require extending the feature bitmask).  Instead, make the
> FEAT_LRCPC patterns available when either armv8.4-a or +rcpc3 is
> specified.  On the other hand, we already have a +rcpc flag, so this
> dependency can be specified directly.
>
> The cpunative test needed updating because it used an invalid Features
> list, since lrcpc3 requires both ilrcpc and lrcpc to be present.
> Without this change, host_detect_local_cpu would return the architecture
> string 'armv8-a+dotprod+crc+crypto+rcpc3+norcpc'.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-option-extensions.def: Add RCPC to
>   RCPC3 dependencies.
>   * config/aarch64/aarch64.h (AARCH64_ISA_RCPC8_4): Add test for
>   RCPC3 bit
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/cpunative/info_24: Include lrcpc and ilrcpc.
>
> ---
>
> Bootstrapped and regression tested on aarch64.  I also verified that the
> atomic-store.c and ldapr-sext.c tests would pass when replacing 'armv8.4-a'
> with 'armv8-a+rcpc3'.
>
> Ok for master?
>
>
> diff --git a/gcc/config/aarch64/aarch64-option-extensions.def 
> b/gcc/config/aarch64/aarch64-option-extensions.def
> index 
> 3155eccd39c8e6825b7fc2bb0d0514c2e7e559bf..42ec0eec31e2ddb0cc6f83fdbaf0fd4eac5ca7f4
>  100644
> --- a/gcc/config/aarch64/aarch64-option-extensions.def
> +++ b/gcc/config/aarch64/aarch64-option-extensions.def
> @@ -153,7 +153,7 @@ AARCH64_FMV_FEATURE("fp16fml", FP16FML, (F16FML))
>  
>  AARCH64_OPT_FMV_EXTENSION("rcpc", RCPC, (), (), (), "lrcpc")
>  
> -AARCH64_OPT_FMV_EXTENSION("rcpc3", RCPC3, (), (), (), "lrcpc3")
> +AARCH64_OPT_FMV_EXTENSION("rcpc3", RCPC3, (RCPC), (), (), "lrcpc3")
>  
>  AARCH64_OPT_FMV_EXTENSION("i8mm", I8MM, (SIMD), (), (), "i8mm")
>  
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 
> 45e901cda644dbe4eaae709e685954f1a6f7dbcf..5870e3f812f6cb0674488b8e17ab7278003d2d54
>  100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -242,7 +242,8 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = 
> AARCH64_FL_SM_OFF;
>  #define AARCH64_ISA_SHA3(aarch64_isa_flags & AARCH64_FL_SHA3)
>  #define AARCH64_ISA_F16FML  (aarch64_isa_flags & AARCH64_FL_F16FML)
>  #define AARCH64_ISA_RCPC(aarch64_isa_flags & AARCH64_FL_RCPC)
> -#define AARCH64_ISA_RCPC8_4 (aarch64_isa_flags & AARCH64_FL_V8_4A)
> +#define AARCH64_ISA_RCPC8_4 (aarch64_isa_flags \
> + & (AARCH64_FL_V8_4A | AARCH64_FL_RCPC3))

It looks like the effect of these two changes is that:

* armv9-a+rcpc3+norcpc leaves TARGET_RCPC2 true and TARGET_RCPC and
  TARGET_RCPC3 false.

* armv8-a+rcpc3+norcpc correctly leaves all three false.

If we add the RCPC3->RCPC dependency then I think we should also
require FL_RCPC alongside FL_V8_4A.  I.e.:

#define AARCH64_ISA_RCPC8_4 (AARCH64_ISA_RCPC \
 && (aarch64_isa_flags \
 & (AARCH64_FL_V8_4A | AARCH64_FL_RCPC3)))

OK with that change, thanks.

Richard


>  #define AARCH64_ISA_RNG (aarch64_isa_flags & AARCH64_FL_RNG)
>  #define AARCH64_ISA_V8_5A   (aarch64_isa_flags & AARCH64_FL_V8_5A)
>  #define AARCH64_ISA_TME (aarch64_isa_flags & AARCH64_FL_TME)
> diff --git a/gcc/testsuite/gcc.target/aarch64/cpunative/info_24 
> b/gcc/testsuite/gcc.target/aarch64/cpunative/info_24
> index 
> 8d3c16a10910af977c560782f9d659c0e51286fd..3c64e00ca3a416ef565bc0b4a5b3e5bd9cfc41bc
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/cpunative/info_24
> +++ b/gcc/testsuite/gcc.target/aarch64/cpunative/info_24
> @@ -1,8 +1,8 @@
>  processor: 0
>  BogoMIPS : 100.00
> -Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 asimddp lrcpc3
> +Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 asimddp lrcpc 
> ilrcpc lrcpc3
>  CPU implementer  : 0xfe
>  CPU architecture: 8
>  CPU variant  : 0x0
>  CPU part : 0xd08
> -CPU revision : 2
> \ No newline at end of file
> +CPU revision : 2


Re: [PATCH] aarch64: Enable +cssc for armv8.9-a

2024-04-12 Thread Richard Sandiford
Andrew Carlotti  writes:
> FEAT_CSSC is mandatory in the architecture from Armv8.9.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-arches.def: Add CSSC to V8_9A
>   dependencies.

OK, thanks.

Richard

>
> ---
>
> Bootstrapped and regression tested on aarch64.  Ok for master?
>
>
> diff --git a/gcc/config/aarch64/aarch64-arches.def 
> b/gcc/config/aarch64/aarch64-arches.def
> index 
> 9bec30e9203bac01155281ef3474846c402bb29e..4634b272e28006b5c6c2d6705a2f1010cbd9ab9b
>  100644
> --- a/gcc/config/aarch64/aarch64-arches.def
> +++ b/gcc/config/aarch64/aarch64-arches.def
> @@ -39,7 +39,7 @@ AARCH64_ARCH("armv8.5-a", generic_armv8_a,   V8_5A, 
> 8,  (V8_4A, SB, SSBS
>  AARCH64_ARCH("armv8.6-a", generic_armv8_a,   V8_6A, 8,  (V8_5A, 
> I8MM, BF16))
>  AARCH64_ARCH("armv8.7-a", generic_armv8_a,   V8_7A, 8,  (V8_6A))
>  AARCH64_ARCH("armv8.8-a", generic_armv8_a,   V8_8A, 8,  (V8_7A, 
> MOPS))
> -AARCH64_ARCH("armv8.9-a", generic_armv8_a,   V8_9A, 8,  (V8_8A))
> +AARCH64_ARCH("armv8.9-a", generic_armv8_a,   V8_9A, 8,  (V8_8A, 
> CSSC))
>  AARCH64_ARCH("armv8-r",   generic_armv8_a,   V8R  , 8,  (V8_4A))
>  AARCH64_ARCH("armv9-a",   generic_armv9_a,   V9A  , 9,  (V8_5A, 
> SVE2))
>  AARCH64_ARCH("armv9.1-a", generic_armv9_a,   V9_1A, 9,  (V8_6A, V9A))


Re: [PATCH]middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403].

2024-04-12 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, 12 Apr 2024, Tamar Christina wrote:
>
>> Hi All,
>> 
>> This is a story all about how the peeling for gaps introduces a bug in the 
>> upper
>> bounds.
>> 
>> Before I go further, I'll first explain how I understand this to work for 
>> loops
>> with a single exit.
>> 
>> When peeling for gaps we peel N < VF iterations to scalar.
>> This happens by removing N iterations from the calculation of niters such 
>> that
>> vect_iters * VF == niters is always false.
>> 
>> In other words, when we exit the vector loop we always fall to the scalar 
>> loop.
>> The loop bounds adjustment guarantees this. Because of this we potentially
>> execute a vector loop iteration less.  That is, if you're at the boundary
>> condition where niters % VF by peeling one or more scalar iterations the 
>> vector
>> loop executes one less.
>> 
>> This is accounted for by the adjustments in vect_transform_loops.  This
>> adjustment happens differently based on whether the the vector loop can be
>> partial or not:
>> 
>> Peeling for gaps sets the bias to 0 and then:
>> 
>> when not partial:  we take the floor of (scalar_upper_bound / VF) - 1 to get 
>> the
>> vector latch iteration count.
>> 
>> when loop is partial:  For a single exit this means the loop is masked, we 
>> take
>>the ceil to account for the fact that the loop can 
>> handle
>> the final partial iteration using masking.
>> 
>> Note that there's no difference between ceil an floor on the boundary 
>> condition.
>> There is a difference however when you're slightly above it. i.e. if scalar
>> iterates 14 times and VF = 4 and we peel 1 iteration for gaps.
>> 
>> The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations. and in 
>> effect
>> the partial iteration is ignored and it's done as scalar.
>> 
>> This is fine because the niters modification has capped the vector iteration 
>> at
>> 2.  So that when we reduce the induction values you end up entering the 
>> scalar
>> code with ind_var.2 = ind_var.1 + 2 * VF.
>> 
>> Now lets look at early breaks.  To make it esier I'll focus on the specific
>> testcase:
>> 
>> char buffer[64];
>> 
>> __attribute__ ((noipa))
>> buff_t *copy (buff_t *first, buff_t *last)
>> {
>>   char *buffer_ptr = buffer;
>>   char *const buffer_end = [SZ-1];
>>   int store_size = sizeof(first->Val);
>>   while (first != last && (buffer_ptr + store_size) <= buffer_end)
>> {
>>   const char *value_data = (const char *)(>Val);
>>   __builtin_memcpy(buffer_ptr, value_data, store_size);
>>   buffer_ptr += store_size;
>>   ++first;
>> }
>> 
>>   if (first == last)
>> return 0;
>> 
>>   return first;
>> }
>> 
>> Here the first, early exit is on the condition:
>> 
>>   (buffer_ptr + store_size) <= buffer_end
>> 
>> and the main exit is on condition:
>> 
>>   first != last
>> 
>> This is important, as this bug only manifests itself when the first exit has 
>> a
>> known constant iteration count that's lower than the latch exit count.
>> 
>> because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing 
>> 16
>> bytes per iteration.  So the exit has a known bounds of 8 + 1.
>> 
>> The vectorizer correctly analizes this:
>> 
>> Statement (exit)if (ivtmp_21 != 0)
>>  is executed at most 8 (bounded by 8) + 1 times in loop 1.
>> 
>> and as a consequence the IV is bound by 9:
>> 
>>   # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)>
>>   ...
>>   vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, 
>> 18446744073709551615, 18446744073709551615, 18446744073709551615 };
>>   mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 };
>>   if (mask_patt_22.17_126 == { -1, -1, -1, -1 })
>> goto ; [88.89%]
>>   else
>> goto ; [11.11%]
>> 
>> The imporant bits are this:
>> 
>> In this example the value of last - first = 416.
>> 
>> the calculated vector iteration count, is:
>> 
>> x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27
>> 
>> the bounds generated, adjusting for gaps:
>> 
>>x == (((x - 1) >> 2) << 2)
>> 
>> which means we'll always fall through to the scalar code. as intended.
>> 
>> Here are two key things to note:
>> 
>> 1. In this loop, the early exit will always be the one taken.  When it's 
>> taken
>>we enter the scalar loop with the correct induction value to apply the gap
>>peeling.
>> 
>> 2. If the main exit is taken, the induction values assumes you've finished 
>> all
>>vector iterations.  i.e. it assumes you have completed 24 iterations, as 
>> we
>>treat the main exit the same for normal loop vect and early break when not
>>PEELED.
>>This means the induction value is adjusted to ind_var.2 = ind_var.1 + 24 
>> * VF;
>> 
>> So what's going wrong.  The vectorizer's codegen is correct and efficient,
>> however when we adjust the upper bounds, that code knows that the loops upper
>> bound is based on the early exit. i.e. 8 latch iterations. or in other 

Re: [PATCH v2] aarch64: Preserve mem info on change of base for ldp/stp [PR114674]

2024-04-12 Thread Richard Sandiford
Alex Coplan  writes:
> This is a v2 because I accidentally sent a WIP version of the patch last
> time round which used replace_equiv_address instead of
> replace_equiv_address_nv; that caused some ICEs (pointed out by the
> Linaro CI) since pair addressing modes aren't a subset of the addresses
> that are accepted by memory_operand for a given mode.
>
> This patch should otherwise be identical to v1.  Bootstrapped/regtested
> on aarch64-linux-gnu (indeed this is the patch I actually tested last
> time), is this version also OK for GCC 15?

OK, thanks.  Sorry for missing this in the first review.

Richard

> Thanks,
> Alex
>
> --- >8 ---
>
> The ldp/stp fusion pass can change the base of an access so that the two
> accesses end up using a common base register.  So far we have been using
> adjust_address_nv to do this, but this means that we don't preserve
> other properties of the mem we're replacing.  It seems better to use
> replace_equiv_address_nv, as this will preserve e.g. the MEM_ALIGN of the
> mem whose address we're changing.
>
> The PR shows that by adjusting the other mem we lose alignment
> information about the original access and therefore end up rejecting an
> otherwise viable pair when --param=aarch64-stp-policy=aligned is passed.
> This patch fixes that by using replace_equiv_address_nv instead.
>
> Notably this is the same approach as taken by
> aarch64_check_consecutive_mems when a change of base is required, so
> this at least makes things more consistent between the ldp fusion pass
> and the peepholes.
>
> gcc/ChangeLog:
>
>   PR target/114674
>   * config/aarch64/aarch64-ldp-fusion.cc (ldp_bb_info::fuse_pair):
>   Use replace_equiv_address_nv on a change of base instead of
>   adjust_address_nv on the other access.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/114674
>   * gcc.target/aarch64/pr114674.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 365dcf48b22..d07d79df06c 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -1730,11 +1730,11 @@ ldp_bb_info::fuse_pair (bool load_p,
>   adjust_amt *= -1;
>  
>rtx change_reg = XEXP (change_pat, !load_p);
> -  machine_mode mode_for_mem = GET_MODE (change_mem);
>rtx effective_base = drop_writeback (base_mem);
> -  rtx new_mem = adjust_address_nv (effective_base,
> -mode_for_mem,
> -adjust_amt);
> +  rtx adjusted_addr = plus_constant (Pmode,
> +  XEXP (effective_base, 0),
> +  adjust_amt);
> +  rtx new_mem = replace_equiv_address_nv (change_mem, adjusted_addr);
>rtx new_set = load_p
>   ? gen_rtx_SET (change_reg, new_mem)
>   : gen_rtx_SET (new_mem, change_reg);
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr114674.c 
> b/gcc/testsuite/gcc.target/aarch64/pr114674.c
> new file mode 100644
> index 000..944784fd008
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr114674.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 --param=aarch64-stp-policy=aligned" } */
> +typedef struct {
> + unsigned int f1;
> + unsigned int f2;
> +} test_struct;
> +
> +static test_struct ts = {
> + 123, 456
> +};
> +
> +void foo(void)
> +{
> + ts.f2 = 36969 * (ts.f2 & 65535) + (ts.f1 >> 16);
> + ts.f1 = 18000 * (ts.f2 & 65535) + (ts.f2 >> 16);
> +}
> +/* { dg-final { scan-assembler-times "stp" 1 } } */


Re: [PATCH] aarch64: Fix _BitInt testcases

2024-04-11 Thread Richard Sandiford
"Andre Vieira (lists)"  writes:
> This patch fixes some testisms introduced by:
>
> commit 5aa3fec38cc6f52285168b161bab1a869d864b44
> Author: Andre Vieira 
> Date:   Wed Apr 10 16:29:46 2024 +0100
>
>  aarch64: Add support for _BitInt
>
> The testcases were relying on an unnecessary sign-extend that is no longer
> generated.
>
> The tested version was just slightly behind top of trunk when the patch 
> was committed, and the codegen had changed, for the better, by then.
>
> OK for trunk? (I am away tomorrow, so if you want this in before the 
> weekend feel free to commit it on my behalf, if approved ofc...)
>
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/bitfield-bitint-abi-align16.c (g1, g8, g16, g1p, 
> g8p,
>   g16p): Remove unnecessary sbfx.
>   * gcc.target/aarch64/bitfield-bitint-abi-align8.c (g1, g8, g16, g1p, 
> g8p,
>   g16p): Likewise.

LGTM, thanks.  Pushed to trunk.

Richard

>
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c 
> b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> index 
> 3f292a45f955d35b802a0bd789cd39d5fa7b5860..4a228b0a1ce696dc80e32305162d58f01d44051d
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> @@ -55,9 +55,8 @@
>  ** g1:
>  **   mov (x[0-9]+), x0
>  **   mov w0, w1
> -**   sbfx(x[0-9]+), \1, 0, 63
> -**   and x4, \2, 9223372036854775807
> -**   and x2, \2, 1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
>  **   mov x3, 0
>  **   b   f1
>  */
> @@ -66,9 +65,8 @@
>  ** g8:
>  **   mov (x[0-9]+), x0
>  **   mov w0, w1
> -**   sbfx(x[0-9]+), \1, 0, 63
> -**   and x4, \2, 9223372036854775807
> -**   and x2, \2, 1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
>  **   mov x3, 0
>  **   b   f8
>  */
> @@ -76,9 +74,8 @@
>  ** g16:
>  **   mov (x[0-9]+), x0
>  **   mov w0, w1
> -**   sbfx(x[0-9]+), \1, 0, 63
> -**   and x4, \2, 9223372036854775807
> -**   and x2, \2, 1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
>  **   mov x3, 0
>  **   b   f16
>  */
> @@ -107,9 +104,8 @@
>  /*
>  ** g1p:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
>  **   mov x2, 0
>  **   mov w0, \1
>  **   b   f1p
> @@ -117,9 +113,8 @@
>  /*
>  ** g8p:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
>  **   mov x2, 0
>  **   mov w0, \1
>  **   b   f8p
> @@ -128,9 +123,8 @@
>  ** g16p:
>  **   mov (x[0-9]+), x0
>  **   mov w0, w1
> -**   sbfx(x[0-9]+), \1, 0, 63
> -**   and x4, \2, 9223372036854775807
> -**   and x2, \2, 1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
>  **   mov x3, 0
>  **   b   f16p
>  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c 
> b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> index 
> da3c23550bae6734f69e2baf0e8db741fb65cfda..e7f773640f04f56646e5e1a5fb91280ea7e4db98
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> @@ -54,9 +54,8 @@
>  /*
>  ** g1:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
>  **   mov x2, 0
>  **   mov w0, \1
>  **   b   f1
> @@ -65,9 +64,8 @@
>  /*
>  ** g8:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
>  **   mov x2, 0
>  **   mov w0, \1
>  **   b   f8
> @@ -76,9 +74,8 @@
>  ** g16:
>  **   mov (x[0-9]+), x0
>  **   mov w0, w1
> -**   sbfx(x[0-9]+), \1, 0, 63
> -**   and x4, \2, 9223372036854775807
> -**   and x2, \2, 1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
>  **   mov x3, 0
>  **   b   f16
>  */
> @@ -107,9 +104,8 @@
>  /*
>  ** g1p:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
>  **   mov x2, 0
>  **   mov w0, \1
>  **   b   f1p
> @@ -117,9 +113,8 @@
>  /*
>  ** g8p:
>  **   mov (w[0-9]+), w1
> -**   sbfx(x[0-9]+), x0, 0, 63
> -**   and x3, \2, 9223372036854775807
> -**   and x1, \2, 1
> +**   and

Re: [PATCH v2 00/13] Add aarch64-w64-mingw32 target

2024-04-11 Thread Richard Sandiford
Evgeny Karpov  writes:
> Wednesday, April 10, 2024 8:40 PM
> Richard Sandiford  wrote:
>
>> Thanks for the updates and sorry again for the slow review.
>> I've replied to some of the patches in the series but otherwise it looks 
>> good to
>> me.
>> 
>> If you agree with the suggested changes then the series is OK for stage 1,
>> assuming no objections from those with an interest in the
>> x86 cygwin/mingw port.
>> 
>> Richard
>
> Thank you, Richard, for the review!
> v3 addresses all comments on v2.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-April/thread.html#649261

Thanks, the new series looks good to me.

Richard


Re: [PATCH] aarch64: Preserve mem info on change of base for ldp/stp [PR114674]

2024-04-11 Thread Richard Sandiford
Alex Coplan  writes:
> Hi,
>
> The ldp/stp fusion pass can change the base of an access so that the two
> accesses end up using a common base register.  So far we have been using
> adjust_address_nv to do this, but this means that we don't preserve
> other properties of the mem we're replacing.  It seems better to use
> replace_equiv_address_nv, as this will preserve e.g. the MEM_ALIGN of the
> mem whose address we're changing.
>
> The PR shows that by adjusting the other mem we lose alignment
> information about the original access and therefore end up rejecting an
> otherwise viable pair when --param=aarch64-stp-policy=aligned is passed.
> This patch fixes that by using replace_equiv_address_nv instead.
>
> Notably this is the same approach as taken by
> aarch64_check_consecutive_mems when a change of base is required, so
> this at least makes things more consistent between the ldp fusion pass
> and the peepholes.
>
> Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk when stage 1
> opens for GCC 15?

Yes, thanks.

Richard

>
> Thanks,
> Alex
>
>
> gcc/ChangeLog:
>
>   PR target/114674
>   * config/aarch64/aarch64-ldp-fusion.cc (ldp_bb_info::fuse_pair):
>   Use replace_equiv_address_nv on a change of base instead of
>   adjust_address_nv on the other access.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/114674
>   * gcc.target/aarch64/pr114674.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 365dcf48b22..4258a560c48 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -1730,11 +1730,11 @@ ldp_bb_info::fuse_pair (bool load_p,
>   adjust_amt *= -1;
>  
>rtx change_reg = XEXP (change_pat, !load_p);
> -  machine_mode mode_for_mem = GET_MODE (change_mem);
>rtx effective_base = drop_writeback (base_mem);
> -  rtx new_mem = adjust_address_nv (effective_base,
> -mode_for_mem,
> -adjust_amt);
> +  rtx adjusted_addr = plus_constant (Pmode,
> +  XEXP (effective_base, 0),
> +  adjust_amt);
> +  rtx new_mem = replace_equiv_address (change_mem, adjusted_addr);
>rtx new_set = load_p
>   ? gen_rtx_SET (change_reg, new_mem)
>   : gen_rtx_SET (new_mem, change_reg);
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr114674.c 
> b/gcc/testsuite/gcc.target/aarch64/pr114674.c
> new file mode 100644
> index 000..944784fd008
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr114674.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 --param=aarch64-stp-policy=aligned" } */
> +typedef struct {
> + unsigned int f1;
> + unsigned int f2;
> +} test_struct;
> +
> +static test_struct ts = {
> + 123, 456
> +};
> +
> +void foo(void)
> +{
> + ts.f2 = 36969 * (ts.f2 & 65535) + (ts.f1 >> 16);
> + ts.f1 = 18000 * (ts.f2 & 65535) + (ts.f2 >> 16);
> +}
> +/* { dg-final { scan-assembler-times "stp" 1 } } */


Re: [PATCH 0/5] aarch64: FMV feature list fixes

2024-04-10 Thread Richard Sandiford
Andrew Carlotti  writes:
> On Wed, Apr 10, 2024 at 05:42:05PM +0100, Richard Sandiford wrote:
>> Andrew Carlotti  writes:
>> > On Tue, Apr 09, 2024 at 04:43:16PM +0100, Richard Sandiford wrote:
>> >> Andrew Carlotti  writes:
>> >> > The first three patches are trivial changes to the feature list to 
>> >> > reflect
>> >> > recent changes in the ACLE.  Patch 4 removes most of the FMV 
>> >> > multiversioning
>> >> > features that don't work at the moment, and should be entirely 
>> >> > uncontroversial.
>> >> >
>> >> > Patch 5 handles the remaining cases, where there's an inconsistency in 
>> >> > how
>> >> > features are named in the current FMV specification compared to the 
>> >> > existing
>> >> > command line options.  It might be better to instead preserve the 
>> >> > "memtag2",
>> >> > "ssbs2" and "ls64_accdata" names for now; I'd be happy to commit either
>> >> > version.
>> >> 
>> >> Yeah, I suppose patch 5 leaves things in a somewhat awkward state,
>> >> since e.g.:
>> >> 
>> >> -AARCH64_OPT_FMV_EXTENSION("memtag", MEMTAG, (), (), (), "")
>> >> +AARCH64_OPT_EXTENSION("memtag", MEMTAG, (), (), (), "")
>> >>  
>> >> -AARCH64_FMV_FEATURE("memtag2", MEMTAG2, (MEMTAG))
>> >> +AARCH64_FMV_FEATURE("memtag", MEMTAG2, (MEMTAG))
>> >> 
>> >> seems to drop "memtag2" and FEAT_MEMTAG, but keep "memtag" and
>> >> FEAT_MEMTAG2.  Is that right?
>> >
>> > That's deliberate. The FEAT_MEMTAG bit in __aarch64_cpu_features is 
>> > defined to
>> > match the definition of FEAT_MTE in the architecture, and likewise for
>> > FEAT_MEMTAG2/FEAT_MTE2.  However, in Binutils the "+memtag" extension 
>> > enables
>> > both FEAT_MTE and FEAT_MTE2 instructions (although none of the FEAT_MTE2
>> > instructions can be generated from GCC without inline assembly).  The FMV
>> > specification in the ACLE currently uses names "memtag" and "memtag2" that
>> > match the architecture names, but arguably don't match the command line
>> > extension names.  I'm advocating for that to change to match the extension
>> > names in command line options.
>> 
>> Hmm, ok.  I agree it makes sense for the user-visible FMV namnes to match
>> the command line.  But shouldn't __aarch64_cpu_features either (a) use 
>> exactly
>> the same names as the architecture or (b) use exactly the same names as the
>> command-line (mangled where necessary)?  It seems that we're instead
>> using a third convention that doesn't exactly match the other two.
>
> I agree that the name isn't one I would choose now, but I don't think it 
> matters much that it's inconsistent.

I kind-of think it does though.  Given...

>> That is, I can see the rationale for "memtag" => FEAT_MTE2 and
>> "memtag" => FEAT_MEMTAG.  It just seems odd to have "memtag" => FEAT_MEMTAG2
>> (where MEMTAG2 is an alias of MTE2).
>> 
>> How much leeway do we have to change the __aarch64_cpu_features names?
>> Is it supposed to be a public API (as opposed to ABI)?
>
> I think we're designing it to be capable of being a public API, but we haven't
> yet made it one.  That's partly why I've kept the enum value names the same as
> in LLVM so far.

...this, I don't want to sleep-walk into a situation where we have
one naming convention for the architecture, one for the attributes,
and a third one for the API.  If we're not in a position to commit
to a consistent naming scheme for the API by GCC 14 then it might be
better to remove the FMV features in 5/5 for GCC 14 and revisit in GCC 15.

A patch to do that is pre-approved if you agree (but please say
if you don't).

Thanks,
Richard


Re: [PATCH v2 00/13] Add aarch64-w64-mingw32 target

2024-04-10 Thread Richard Sandiford
Evgeny Karpov  writes:
> Hello,
>
> v2 is ready for the review!
> Based on the v1 review: 
> https://gcc.gnu.org/pipermail/gcc-patches/2024-February/thread.html#646203
>
> Testing for the x86_64-w64-mingw32 target is in progress to avoid
> regression due to refactoring.

Thanks for the updates and sorry again for the slow review.
I've replied to some of the patches in the series but otherwise
it looks good to me.

If you agree with the suggested changes then the series is OK for
stage 1, assuming no objections from those with an interest in the
x86 cygwin/mingw port.

Richard

> Regards,
> Evgeny
>
>
> Changes from v1 to v2:
> Adjust the target name to aarch64-*-mingw* to exclude the big-endian
> target from support.
> Exclude 64-bit ISA.
> Rename enum calling_abi to aarch64_calling_abi.
> Move AArch64 MS ABI definitions FIXED_REGISTERS,
> CALL_REALLY_USED_REGISTERS, and STATIC_CHAIN_REGNUM from aarch64.h 
> to aarch64-abi-ms.h.
> Rename TARGET_ARM64_MS_ABI to TARGET_AARCH64_MS_ABI.
> Exclude TARGET_64BIT from the aarch64 target.
> Exclude HAVE_GAS_WEAK.
> Set HAVE_GAS_ALIGNED_COMM to 1 by default.
> Use a reference from "x86 Windows Options" to 
> "Cygwin and MinGW Options".
> Update commit descriptions to follow standard style.
> Rebase from 4th March 2024.


Re: [PATCH v2 10/13] Rename "x86 Windows Options" to "Cygwin and MinGW Options"

2024-04-10 Thread Richard Sandiford
Evgeny Karpov  writes:
> From: Zac Walker 
> Date: Fri, 1 Mar 2024 02:17:39 +0100
> Subject: [PATCH v2 10/13] Rename "x86 Windows Options" to "Cygwin and MinGW
>  Options"
>
> Rename "x86 Windows Options" to "Cygwin and MinGW Options".
> It will be used also for AArch64.
>
> gcc/ChangeLog:
>
>   * config/i386/mingw-w64.opt.urls: Rename options' name and
>   regenerate option URLs.
>   * config/lynx.opt.urls: Likewise.
>   * config/mingw/cygming.opt.urls: Likewise.
>   * config/mingw/mingw.opt.urls: Likewise.
>   * doc/invoke.texi: Likewise.
> ---
>  gcc/config/i386/mingw-w64.opt.urls |  2 +-
>  gcc/config/lynx.opt.urls   |  2 +-
>  gcc/config/mingw/cygming.opt.urls  | 18 +-
>  gcc/config/mingw/mingw.opt.urls|  2 +-
>  gcc/doc/invoke.texi| 12 ++--
>  5 files changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/gcc/config/i386/mingw-w64.opt.urls 
> b/gcc/config/i386/mingw-w64.opt.urls
> index 6bb53ef29b2..5cceba1d1a1 100644
> --- a/gcc/config/i386/mingw-w64.opt.urls
> +++ b/gcc/config/i386/mingw-w64.opt.urls
> @@ -1,5 +1,5 @@
>  ; Autogenerated by regenerate-opt-urls.py from gcc/config/i386/mingw-w64.opt 
> and generated HTML
>  
>  municode
> -UrlSuffix(gcc/x86-Windows-Options.html#index-municode)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-municode)
>  
> diff --git a/gcc/config/lynx.opt.urls b/gcc/config/lynx.opt.urls
> index 63e7b9c4b33..b547138f7ff 100644
> --- a/gcc/config/lynx.opt.urls
> +++ b/gcc/config/lynx.opt.urls
> @@ -1,5 +1,5 @@
>  ; Autogenerated by regenerate-opt-urls.py from gcc/config/lynx.opt and 
> generated HTML
>  
>  mthreads
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mthreads-1)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mthreads-1)
>  
> diff --git a/gcc/config/mingw/cygming.opt.urls 
> b/gcc/config/mingw/cygming.opt.urls
> index 87799befe3c..c624e22e442 100644
> --- a/gcc/config/mingw/cygming.opt.urls
> +++ b/gcc/config/mingw/cygming.opt.urls
> @@ -1,30 +1,30 @@
>  ; Autogenerated by regenerate-opt-urls.py from gcc/config/i386/cygming.opt 
> and generated HTML
>  
>  mconsole
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mconsole)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mconsole)
>  
>  mdll
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mdll)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mdll)
>  
>  mnop-fun-dllimport
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mnop-fun-dllimport)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mnop-fun-dllimport)
>  
>  ; skipping UrlSuffix for 'mthreads' due to multiple URLs:
> +;   duplicate: 'gcc/Cygwin-and-MinGW-Options.html#index-mthreads-1'
>  ;   duplicate: 'gcc/x86-Options.html#index-mthreads'
> -;   duplicate: 'gcc/x86-Windows-Options.html#index-mthreads-1'
>  
>  mwin32
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mwin32)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mwin32)
>  
>  mwindows
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mwindows)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mwindows)
>  
>  mpe-aligned-commons
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mpe-aligned-commons)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mpe-aligned-commons)
>  
>  fset-stack-executable
> -UrlSuffix(gcc/x86-Windows-Options.html#index-fno-set-stack-executable)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-fno-set-stack-executable)
>  
>  fwritable-relocated-rdata
> -UrlSuffix(gcc/x86-Windows-Options.html#index-fno-writable-relocated-rdata)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-fno-writable-relocated-rdata)
>  
> diff --git a/gcc/config/mingw/mingw.opt.urls b/gcc/config/mingw/mingw.opt.urls
> index 2cbbaadf310..f8ee5be6a53 100644
> --- a/gcc/config/mingw/mingw.opt.urls
> +++ b/gcc/config/mingw/mingw.opt.urls
> @@ -1,7 +1,7 @@
>  ; Autogenerated by regenerate-opt-urls.py from gcc/config/i386/mingw.opt and 
> generated HTML
>  
>  mcrtdll=
> -UrlSuffix(gcc/x86-Windows-Options.html#index-mcrtdll)
> +UrlSuffix(gcc/Cygwin-and-MinGW-Options.html#index-mcrtdll)
>  
>  ; skipping UrlSuffix for 'pthread' due to multiple URLs:
>  ;   duplicate: 'gcc/Link-Options.html#index-pthread-1'
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index bdf05be387d..e2e473e095f 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -1493,6 +1493,8 @@ See RS/6000 and PowerPC Options.
>  -munroll-only-small-loops -mlam=@var{choice}}
>  
>  @emph{x86 Windows Options}
> +
> +@emph{Cygwin and MinGW Options}
>  @gccoptlist{-mconsole  -mcrtdll=@var{library}  -mdll
>  -mnop-fun-dllimport  -mthread
>  -municode  -mwin32  -mwindows  -fno-set-stack-executable}
> @@ -20976,6 +20978,7 @@ platform.
>  * C6X Options::
>  * CRIS Options::
>  * C-SKY Options::
> +* Cygwin and MinGW Options::
>  * Darwin Options::
>  * DEC Alpha Options::
>  * eBPF Options::
> @@ -36112,8 +36115,13 @@ positions 62:57 can be used for metadata.
>  
>  @node x86 Windows Options
>  @subsection 

Re: [PATCH v2 08/13] aarch64: Add Cygwin and MinGW environments for AArch64

2024-04-10 Thread Richard Sandiford
Evgeny Karpov  writes:
> From: Zac Walker 
> Date: Fri, 1 Mar 2024 10:49:28 +0100
> Subject: [PATCH v2 08/13] aarch64: Add Cygwin and MinGW environments for
>  AArch64
>
> Define Cygwin and MinGW environment such as types, SEH definitions,
> shared libraries, etc.
>
> gcc/ChangeLog:
>
>   * config.gcc: Add Cygwin and MinGW difinitions.
>   * config/aarch64/aarch64-protos.h
>   (mingw_pe_maybe_record_exported_symbol): Declare functions
>   which are used in Cygwin and MinGW environment.
>   (mingw_pe_section_type_flags): Likewise.
>   (mingw_pe_unique_section): Likewise.
>   (mingw_pe_encode_section_info): Likewise.
>   * config/aarch64/cygming.h: New file.
> ---
>  gcc/config.gcc  |   4 +
>  gcc/config/aarch64/aarch64-protos.h |   5 +
>  gcc/config/aarch64/cygming.h| 175 
>  3 files changed, 184 insertions(+)
>  create mode 100644 gcc/config/aarch64/cygming.h
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 3aca257c322..4471599454b 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1267,7 +1267,11 @@ aarch64*-*-linux*)
>  aarch64-*-mingw*)
>   tm_file="${tm_file} aarch64/aarch64-abi-ms.h"
>   tm_file="${tm_file} aarch64/aarch64-coff.h"
> + tm_file="${tm_file} aarch64/cygming.h"
> + tm_file="${tm_file} mingw/mingw32.h"
> + tm_file="${tm_file} mingw/mingw-stdint.h"
>   tmake_file="${tmake_file} aarch64/t-aarch64"
> + target_gtfiles="$target_gtfiles \$(srcdir)/config/mingw/winnt.cc"
>   case ${enable_threads} in
> "" | yes | win32)
>   thread_file='win32'
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index bd719b992a5..759e1a0f9da 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1110,6 +1110,11 @@ extern void aarch64_output_patchable_area (unsigned 
> int, bool);
>  
>  extern void aarch64_adjust_reg_alloc_order ();
>  
> +extern void mingw_pe_maybe_record_exported_symbol (tree, const char *, int);
> +extern unsigned int mingw_pe_section_type_flags (tree, const char *, int);
> +extern void mingw_pe_unique_section (tree, int);
> +extern void mingw_pe_encode_section_info (tree, rtx, int);
> +
>  bool aarch64_optimize_mode_switching (aarch64_mode_entity);
>  void aarch64_restore_za (rtx);
>  
> diff --git a/gcc/config/aarch64/cygming.h b/gcc/config/aarch64/cygming.h
> new file mode 100644
> index 000..2f239c42a89
> --- /dev/null
> +++ b/gcc/config/aarch64/cygming.h
> @@ -0,0 +1,175 @@
> +/* Operating system specific defines to be used when targeting GCC for
> +   hosting on Windows32, using a Unix style C library and tools.
> +   Copyright (C) 1995-2024 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +.  */
> +
> +#ifndef GCC_AARCH64_CYGMING_H
> +#define GCC_AARCH64_CYGMING_H
> +
> +#undef PREFERRED_DEBUGGING_TYPE
> +#define PREFERRED_DEBUGGING_TYPE DINFO_TYPE_NONE
> +
> +#define FASTCALL_PREFIX '@'
> +
> +#define print_reg(rtx, code, file)

How about:

#define print_reg(rtx, code, file) (gcc_unreachable ())

so that attempts to use this are a noisy runtime failure?

> +#define SYMBOL_FLAG_DLLIMPORT 0
> +#define SYMBOL_FLAG_DLLEXPORT 0
> +
> +#define SYMBOL_REF_DLLEXPORT_P(X) \
> + ((SYMBOL_REF_FLAGS (X) & SYMBOL_FLAG_DLLEXPORT) != 0)
> +
> +/* Disable SEH and declare the required SEH-related macros that are
> +still needed for compilation.  */
> +#undef TARGET_SEH
> +#define TARGET_SEH 0
> +
> +#define SSE_REGNO_P(N) 0
> +#define GENERAL_REGNO_P(N) 0
> +#define SEH_MAX_FRAME_SIZE 0

Similarly here, how about:

#define SSE_REGNO_P(N) (gcc_unreachable (), 0)
#define GENERAL_REGNO_P(N) (gcc_unreachable (), 0)
#define SEH_MAX_FRAME_SIZE (gcc_unreachable (), 0)

Thanks,
Richard


Re: [PATCH v2 04/13] aarch64: Add aarch64-w64-mingw32 COFF

2024-04-10 Thread Richard Sandiford
Evgeny Karpov  writes:
> From: Zac Walker 
> Date: Fri, 1 Mar 2024 01:55:47 +0100
> Subject: [PATCH v2 04/13] aarch64: Add aarch64-w64-mingw32 COFF
>
> Define ASM specific for COFF format on AArch64.
>
> gcc/ChangeLog:
>
>   * config.gcc: Add COFF format support definitions.
>   * config/aarch64/aarch64-coff.h: New file.
> ---
>  gcc/config.gcc|  1 +
>  gcc/config/aarch64/aarch64-coff.h | 91 +++
>  2 files changed, 92 insertions(+)
>  create mode 100644 gcc/config/aarch64/aarch64-coff.h
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index b762393b64c..cb6661f44ef 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1266,6 +1266,7 @@ aarch64*-*-linux*)
>   ;;
>  aarch64-*-mingw*)
>   tm_file="${tm_file} aarch64/aarch64-abi-ms.h"
> + tm_file="${tm_file} aarch64/aarch64-coff.h"
>   tmake_file="${tmake_file} aarch64/t-aarch64"
>   case ${enable_threads} in
> "" | yes | win32)
> diff --git a/gcc/config/aarch64/aarch64-coff.h 
> b/gcc/config/aarch64/aarch64-coff.h
> new file mode 100644
> index 000..79c5a43b970
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-coff.h
> @@ -0,0 +1,91 @@
> +/* Machine description for AArch64 architecture.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   .  */
> +
> +#ifndef GCC_AARCH64_COFF_H
> +#define GCC_AARCH64_COFF_H
> +
> +#include "aarch64.h"

Is this needed?  It looks like aarch64-coff.h comes after aarch64.h
in the include list, so I'd have expected the #include to be a no-op.

If you want to emphasise that this file must be included after aarch64.h
then perhaps:

#if !defined(GCC_AARCH64_H)
#error This file must be included after aarch64.h
#endif

would work.  But it should also be ok just to drop the include without
replacing it with anything.

> +
> +#ifndef LOCAL_LABEL_PREFIX
> +# define LOCAL_LABEL_PREFIX  ""
> +#endif
> +
> +/* Using long long breaks -ansi and -std=c90, so these will need to be
> +   made conditional for an LLP64 ABI.  */
> +#undef SIZE_TYPE
> +#define SIZE_TYPE"long long unsigned int"
> +
> +#undef PTRDIFF_TYPE
> +#define PTRDIFF_TYPE "long long int"
> +
> +#undef LONG_TYPE_SIZE
> +#define LONG_TYPE_SIZE 32
> +
> +#ifndef ASM_GENERATE_INTERNAL_LABEL
> +# define ASM_GENERATE_INTERNAL_LABEL(STRING, PREFIX, NUM)  \
> +  sprintf (STRING, "*%s%s%u", LOCAL_LABEL_PREFIX, PREFIX, (unsigned 
> int)(NUM))
> +#endif
> +
> +#define ASM_OUTPUT_ALIGN(STREAM, POWER)  \
> +  fprintf (STREAM, "\t.align\t%d\n", (int)POWER)
> +
> +/* Output a common block.  */
> +#ifndef ASM_OUTPUT_COMMON
> +# define ASM_OUTPUT_COMMON(STREAM, NAME, SIZE, ROUNDED)  \
> +{\
> +  fprintf (STREAM, "\t.comm\t"); \
> +  assemble_name (STREAM, NAME);  \
> +  asm_fprintf (STREAM, ", %d, %d\n", \
> +  (int)(ROUNDED), (int)(SIZE));  \
> +}
> +#endif
> +
> +/* Output a local common block.  /bin/as can't do this, so hack a
> +   `.space' into the bss segment.  Note that this is *bad* practice,
> +   which is guaranteed NOT to work since it doesn't define STATIC
> +   COMMON space but merely STATIC BSS space.  */
> +#ifndef ASM_OUTPUT_ALIGNED_LOCAL
> +# define ASM_OUTPUT_ALIGNED_LOCAL(STREAM, NAME, SIZE, ALIGN) \
> +{
> \
> +  switch_to_section (bss_section);   
> \
> +  ASM_OUTPUT_ALIGN (STREAM, floor_log2 (ALIGN / BITS_PER_UNIT)); \
> +  ASM_OUTPUT_LABEL (STREAM, NAME);   
> \
> +  fprintf (STREAM, "\t.space\t%d\n", (int)(SIZE));   
> \
> +}
> +#endif
> +
> +#define ASM_OUTPUT_SKIP(STREAM, NBYTES)  \
> +  fprintf (STREAM, "\t.space\t%d  // skip\n", (int) (NBYTES))
> +
> +#define ASM_OUTPUT_TYPE_DIRECTIVE(STREAM, NAME, TYPE)
> +#define ASM_DECLARE_FUNCTION_SIZE(FILE, FNAME, DECL)

Just curious: are these empty definitions the final intended
definitions, or are they just temporary?  Might be worth a comment
either way.

Thanks,
Richard

> +
> +#define TEXT_SECTION_ASM_OP  "\t.text"
> +#define DATA_SECTION_ASM_OP  "\t.data"
> +#define BSS_SECTION_ASM_OP   

Re: [PATCH v2 03/13] aarch64: Mark x18 register as a fixed register for MS ABI

2024-04-10 Thread Richard Sandiford
Evgeny Karpov  writes:
> From: Zac Walker 
> Date: Fri, 1 Mar 2024 09:56:59 +0100
> Subject: [PATCH v2 03/13] aarch64: Mark x18 register as a fixed register for
>  MS ABI
>
> Define the MS ABI for aarch64-w64-mingw32.
> Adjust FIXED_REGISTERS, CALL_REALLY_USED_REGISTERS and
> STATIC_CHAIN_REGNUM for AArch64 MS ABI.
> The X18 register is reserved on Windows for the TEB.
>
> gcc/ChangeLog:
>
>   * config.gcc: Define TARGET_AARCH64_MS_ABI when
>   AArch64 MS ABI is used.
>   * config/aarch64/aarch64-abi-ms.h: New file. Adjust
>   FIXED_REGISTERS, CALL_REALLY_USED_REGISTERS and
>   STATIC_CHAIN_REGNUM for AArch64 MS ABI.
> ---
>  gcc/config.gcc  |  1 +
>  gcc/config/aarch64/aarch64-abi-ms.h | 64 +
>  2 files changed, 65 insertions(+)
>  create mode 100644 gcc/config/aarch64/aarch64-abi-ms.h
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 2756377e50b..b762393b64c 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1265,6 +1265,7 @@ aarch64*-*-linux*)
>   TM_MULTILIB_CONFIG=`echo $TM_MULTILIB_CONFIG | sed 's/^,//'`
>   ;;
>  aarch64-*-mingw*)
> + tm_file="${tm_file} aarch64/aarch64-abi-ms.h"
>   tmake_file="${tmake_file} aarch64/t-aarch64"
>   case ${enable_threads} in
> "" | yes | win32)
> diff --git a/gcc/config/aarch64/aarch64-abi-ms.h 
> b/gcc/config/aarch64/aarch64-abi-ms.h
> new file mode 100644
> index 000..90b0dcc5edf
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-abi-ms.h
> @@ -0,0 +1,64 @@
> +/* Machine description for AArch64 MS ABI.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +.  */
> +
> +#ifndef GCC_AARCH64_ABI_MS_H
> +#define GCC_AARCH64_ABI_MS_H
> +
> +/* X18 reserved for the TEB on Windows.  */
> +
> +#undef FIXED_REGISTERS
> +#define FIXED_REGISTERS  \
> +  {  \
> +0, 0, 0, 0,   0, 0, 0, 0,/* R0 - R7.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,/* R8 - R15.  */\
> +0, 0, 1, 0,   0, 0, 0, 0,/* R16 - R23.  */   \
> +0, 0, 0, 0,   0, 1, 0, 1,/* R24 - R30, SP.  */   \
> +0, 0, 0, 0,   0, 0, 0, 0,/* V0 - V7.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,   /* V8 - V15.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,   /* V16 - V23.  */\
> +0, 0, 0, 0,   0, 0, 0, 0,   /* V24 - V31.  */\
> +1, 1, 1, 1,  /* SFP, AP, CC, VG.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,/* P0 - P7.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,   /* P8 - P15.  */ \
> +1, 1,/* FFR and FFRT.  */\
> +1, 1, 1, 1, 1, 1, 1, 1   /* Fake registers.  */  \
> +  }
> +
> +#undef CALL_REALLY_USED_REGISTERS
> +#define CALL_REALLY_USED_REGISTERS   \
> +  {  \
> +1, 1, 1, 1,   1, 1, 1, 1,/* R0 - R7.  */ \
> +1, 1, 1, 1,   1, 1, 1, 1,/* R8 - R15.  */\
> +1, 1, 0, 0,   0, 0, 0, 0,   /* R16 - R23.  */\
> +0, 0, 0, 0,   0, 1, 1, 1,/* R24 - R30, SP.  */   \
> +1, 1, 1, 1,   1, 1, 1, 1,/* V0 - V7.  */ \
> +0, 0, 0, 0,   0, 0, 0, 0,/* V8 - V15.  */\
> +1, 1, 1, 1,   1, 1, 1, 1,   /* V16 - V23.  */\
> +1, 1, 1, 1,   1, 1, 1, 1,   /* V24 - V31.  */\
> +1, 1, 1, 0,  /* SFP, AP, CC, VG.  */ \
> +1, 1, 1, 1,   1, 1, 1, 1,/* P0 - P7.  */ \
> +1, 1, 1, 1,   1, 1, 1, 1,/* P8 - P15.  */\
> +1, 1,/* FFR and FFRT.  */\
> +0, 0, 0, 0, 0, 0, 0, 0   /* Fake registers.  */  \
> +  }
> +
> +#undef  STATIC_CHAIN_REGNUM
> +#define STATIC_CHAIN_REGNUM R17_REGNUM
> +
> +#endif /* GCC_AARCH64_ABI_MS_H.  */

Gah, I think there was a miscommunication, sorry.  The way I'd interpreted
Richard's comment:

> +/* X18 reserved for the TEB on Windows.  */
> +#ifdef TARGET_ARM64_MS_ABI
> +# define FIXED_X18 1
> +# define CALL_USED_X18 0
> +#else
> +# define FIXED_X18 0
> +# define CALL_USED_X18 1
> +#endif
>
> I'm not overly keen on ifdefs like this (and the one below), it can
> get quite confusing if we have to support more than a couple of ABIs.
> Perhaps we could create a couple of new headers, 

Re: [PATCH v2 02/13] aarch64: The aarch64-w64-mingw32 target implements

2024-04-10 Thread Richard Sandiford
Sorry for the slow reply.

Evgeny Karpov  writes:
> From: Zac Walker 
> Date: Fri, 1 Mar 2024 01:45:13 +0100
> Subject: [PATCH v2 02/13] aarch64: The aarch64-w64-mingw32 target implements
>  the MS ABI
>
> Two ABIs for aarch64 have been defined for different platforms.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-opts.h (enum aarch64_calling_abi):
>   Define two ABIs.
> ---
>  gcc/config/aarch64/aarch64-opts.h | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64-opts.h 
> b/gcc/config/aarch64/aarch64-opts.h
> index a05c0d3ded1..52c9e4596d6 100644
> --- a/gcc/config/aarch64/aarch64-opts.h
> +++ b/gcc/config/aarch64/aarch64-opts.h
> @@ -131,4 +131,11 @@ enum aarch64_early_ra_scope {
>AARCH64_EARLY_RA_NONE
>  };
>  
> +/* Available call ABIs.  */
> +enum aarch64_calling_abi
> +{
> +  AARCH64_CALLING_ABI_EABI,
> +  AARCH64_CALLING_ABI_MS
> +};
> +
>  #endif

Thanks for removing the MS_ABI uses.  However, I'm still a bit uneasy
about this.  We already have a way of categorising ABIs (arm_pcs)
and it's not clear how this new enum would interact with it.  We also
have infrastructure for recording the arm_pcs for each call, including
in RTL insns.  Would we need to do the same for this classification?

It seems like the enum is currently unused.  At least, I was able
to build successfully with:

diff --git a/gcc/config/aarch64/aarch64-opts.h 
b/gcc/config/aarch64/aarch64-opts.h
index 52c9e4596d6..a05c0d3ded1 100644
--- a/gcc/config/aarch64/aarch64-opts.h
+++ b/gcc/config/aarch64/aarch64-opts.h
@@ -131,11 +131,4 @@ enum aarch64_early_ra_scope {
   AARCH64_EARLY_RA_NONE
 };
 
-/* Available call ABIs.  */
-enum aarch64_calling_abi
-{
-  AARCH64_CALLING_ABI_EABI,
-  AARCH64_CALLING_ABI_MS
-};
-
 #endif
diff --git a/gcc/config/aarch64/cygming.h b/gcc/config/aarch64/cygming.h
index 2f239c42a89..902539763bd 100644
--- a/gcc/config/aarch64/cygming.h
+++ b/gcc/config/aarch64/cygming.h
@@ -43,9 +43,6 @@ still needed for compilation.  */
 #define GENERAL_REGNO_P(N) 0
 #define SEH_MAX_FRAME_SIZE 0
 
-#undef DEFAULT_ABI
-#define DEFAULT_ABI AARCH64_CALLING_ABI_MS
-
 #undef TARGET_PECOFF
 #define TARGET_PECOFF 1
 
diff --git a/gcc/config/mingw/mingw32.h b/gcc/config/mingw/mingw32.h
index 040c3e1e521..08f1b5f0696 100644
--- a/gcc/config/mingw/mingw32.h
+++ b/gcc/config/mingw/mingw32.h
@@ -19,9 +19,7 @@ along with GCC; see the file COPYING3.  If not see
 .  */
 
 #undef DEFAULT_ABI
-#if defined (TARGET_AARCH64_MS_ABI)
-# define DEFAULT_ABI AARCH64_CALLING_ABI_MS
-#else
+#if !defined (TARGET_AARCH64_MS_ABI)
 # define DEFAULT_ABI MS_ABI
 #endif
 
Would you be happy with that for now?  We can then revisit this
later when the information is needed.

Thanks,
Richard


Re: [PATCH 0/5] aarch64: FMV feature list fixes

2024-04-10 Thread Richard Sandiford
Andrew Carlotti  writes:
> On Tue, Apr 09, 2024 at 04:43:16PM +0100, Richard Sandiford wrote:
>> Andrew Carlotti  writes:
>> > The first three patches are trivial changes to the feature list to reflect
>> > recent changes in the ACLE.  Patch 4 removes most of the FMV 
>> > multiversioning
>> > features that don't work at the moment, and should be entirely 
>> > uncontroversial.
>> >
>> > Patch 5 handles the remaining cases, where there's an inconsistency in how
>> > features are named in the current FMV specification compared to the 
>> > existing
>> > command line options.  It might be better to instead preserve the 
>> > "memtag2",
>> > "ssbs2" and "ls64_accdata" names for now; I'd be happy to commit either
>> > version.
>> 
>> Yeah, I suppose patch 5 leaves things in a somewhat awkward state,
>> since e.g.:
>> 
>> -AARCH64_OPT_FMV_EXTENSION("memtag", MEMTAG, (), (), (), "")
>> +AARCH64_OPT_EXTENSION("memtag", MEMTAG, (), (), (), "")
>>  
>> -AARCH64_FMV_FEATURE("memtag2", MEMTAG2, (MEMTAG))
>> +AARCH64_FMV_FEATURE("memtag", MEMTAG2, (MEMTAG))
>> 
>> seems to drop "memtag2" and FEAT_MEMTAG, but keep "memtag" and
>> FEAT_MEMTAG2.  Is that right?
>
> That's deliberate. The FEAT_MEMTAG bit in __aarch64_cpu_features is defined to
> match the definition of FEAT_MTE in the architecture, and likewise for
> FEAT_MEMTAG2/FEAT_MTE2.  However, in Binutils the "+memtag" extension enables
> both FEAT_MTE and FEAT_MTE2 instructions (although none of the FEAT_MTE2
> instructions can be generated from GCC without inline assembly).  The FMV
> specification in the ACLE currently uses names "memtag" and "memtag2" that
> match the architecture names, but arguably don't match the command line
> extension names.  I'm advocating for that to change to match the extension
> names in command line options.

Hmm, ok.  I agree it makes sense for the user-visible FMV namnes to match
the command line.  But shouldn't __aarch64_cpu_features either (a) use exactly
the same names as the architecture or (b) use exactly the same names as the
command-line (mangled where necessary)?  It seems that we're instead
using a third convention that doesn't exactly match the other two.

That is, I can see the rationale for "memtag" => FEAT_MTE2 and
"memtag" => FEAT_MEMTAG.  It just seems odd to have "memtag" => FEAT_MEMTAG2
(where MEMTAG2 is an alias of MTE2).

How much leeway do we have to change the __aarch64_cpu_features names?
Is it supposed to be a public API (as opposed to ABI)?

> The LS64 example is definitely an inconsistency, since GCC uses "+ls64" to
> enable intrinsics for all of the FEAT_LS64/FEAT_LS64_V/FEAT_LS64_ACCDATA
> intrinsics.

Ok, thanks.  If we go for option (a) above then I agree that the ls64
change is correct.  If we go for option (b) then I suppose it should
stay as LS64.

> There were similar issues with "sha1", "pmull" and "sve2-pmull128", but in
> these cases their presence architecturally is implied by the presence of the
> features checked for "sha2", "aes" and "sve2-aes" so it's fine to just delete
> the ones without command line flags.
>
>> Apart from that and the comment on patch 2, the series looks good to me.
>> 
>> While rechecking aarch64-option-extensions.def against the ACLE list:
>> it seems that the .def doesn't treat mops as an FMV feature.  Is that
>> deliberate?
>
> "mops" was added to the ACLE list later, and libgcc doesn't yet support
> detecting it.  I didn't think it was sensible to add new FMV feature support 
> at
> this stage.

Ah, ok, makes sense.

Richard


Re: [PATCHv3 2/2] aarch64: Add support for _BitInt

2024-04-10 Thread Richard Sandiford
"Andre Vieira (lists)"  writes:
> Added the target check, also had to change some of the assembly checking 
> due to changes upstream, the assembly is still valid, but we do extend 
> where not necessary, I do believe that's a general issue though.
>
> The _BitInt(N > 64) codegen for non-powers of 2 did get worse, we see 
> similar codegen with _int128 bitfields on aarch64.
> I suspect we need to improve the way we 'extend' TImode in the aarch64 
> backend to be able to operate only on the affected DImode parts of it 
> when relevant. Though I also think we may need to change how _BitInt is 
> currently expanded in such situations, right now it does the extension 
> as two shifts. Anyway I did not have too much time to look deeper into this.
>
> Bootstrapped on aarch64-unknown-linux-gnu.
>
> OK for trunk?

OK, thanks.  In truth I've not gone through the tests very thorougly
this time around, and just gone by the internal diff between this
version and the previous one.  But we can adjust them as necessary
based on any reports that come in.

Richard

>
> On 28/03/2024 15:21, Richard Sandiford wrote:
>> Jakub Jelinek  writes:
>>> On Thu, Mar 28, 2024 at 03:00:46PM +, Richard Sandiford wrote:
>>>>>   * gcc.target/aarch64/bitint-alignments.c: New test.
>>>>>   * gcc.target/aarch64/bitint-args.c: New test.
>>>>>   * gcc.target/aarch64/bitint-sizes.c: New test.
>>>>>   * gcc.target/aarch64/bitfield-bitint-abi.h: New header.
>>>>>   * gcc.target/aarch64/bitfield-bitint-abi-align16.c: New test.
>>>>>   * gcc.target/aarch64/bitfield-bitint-abi-align8.c: New test.
>>>>
>>>> Since we don't support big-endian yet, I assume the tests should be
>>>> conditional on aarch64_little_endian.
>>>
>>> Perhaps better on bitint effective target, then they'll become available
>>> automatically as soon as big endian aarch64 _BitInt support is turned on.
>> 
>> Ah, yeah, good point.
>> 
>> Richard
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 81400cc666472ffeff40df14e98ae00ebc774d31..c0af4ef151a8c46f78c0c3a43c2ab1318a3f610a
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -6583,6 +6583,7 @@ aarch64_return_in_memory_1 (const_tree type)
>int count;
>  
>if (!AGGREGATE_TYPE_P (type)
> +  && TREE_CODE (type) != BITINT_TYPE
>&& TREE_CODE (type) != COMPLEX_TYPE
>&& TREE_CODE (type) != VECTOR_TYPE)
>  /* Simple scalar types always returned in registers.  */
> @@ -21996,6 +21997,11 @@ aarch64_composite_type_p (const_tree type,
>if (type && (AGGREGATE_TYPE_P (type) || TREE_CODE (type) == COMPLEX_TYPE))
>  return true;
>  
> +  if (type
> +  && TREE_CODE (type) == BITINT_TYPE
> +  && int_size_in_bytes (type) > 16)
> +return true;
> +
>if (mode == BLKmode
>|| GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
>|| GET_MODE_CLASS (mode) == MODE_COMPLEX_INT)
> @@ -28477,6 +28483,42 @@ aarch64_excess_precision (enum excess_precision_type 
> type)
>return FLT_EVAL_METHOD_UNPREDICTABLE;
>  }
>  
> +/* Implement TARGET_C_BITINT_TYPE_INFO.
> +   Return true if _BitInt(N) is supported and fill its details into *INFO.  
> */
> +bool
> +aarch64_bitint_type_info (int n, struct bitint_info *info)
> +{
> +  if (TARGET_BIG_END)
> +return false;
> +
> +  if (n <= 8)
> +info->limb_mode = QImode;
> +  else if (n <= 16)
> +info->limb_mode = HImode;
> +  else if (n <= 32)
> +info->limb_mode = SImode;
> +  else if (n <= 64)
> +info->limb_mode = DImode;
> +  else if (n <= 128)
> +info->limb_mode = TImode;
> +  else
> +/* The AAPCS for AArch64 defines _BitInt(N > 128) as an array with
> +   type {signed,unsigned} __int128[M] where M*128 >= N.  However, to be
> +   able to use libgcc's implementation to support large _BitInt's we need
> +   to use a LIMB_MODE that is no larger than 'long long'.  This is why we
> +   use DImode for our internal LIMB_MODE and we define the ABI_LIMB_MODE 
> to
> +   be TImode to ensure we are ABI compliant.  */
> +info->limb_mode = DImode;
> +
> +  if (n > 128)
> +info->abi_limb_mode = TImode;
> +  else
> +info->abi_limb_mode = info->limb_mode;
> +  info->big_endian = TARGET_BIG_END;
> +  info->extended = false;
> +  return true;
> +}
> +
>  /* Implement TARGET_SCHED_CAN_SPECULATE_INSN.  Return true if INSN can be
> scheduled for speculative ex

Re: [PATCHv2 1/2] aarch64: Do not give ABI change diagnostics for _BitInt(N)

2024-04-10 Thread Richard Sandiford
"Andre Vieira (lists)"  writes:
> @@ -6907,6 +6938,11 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
> function_arg_info )
> && (!alignment || abi_break_gcc_9 < alignment)
> && (!abi_break_gcc_13 || alignment < abi_break_gcc_13));
>  
> +  /* _BitInt(N) was only added in GCC 14.  */
> +  bool warn_pcs_change_le_gcc14
> += warn_pcs_change && !bitint_or_aggr_of_bitint_p (type);
> +
> +

Excess blank line.

OK with that removed, thanks (no need to retest).

Richard


Re: [PATCH]AArch64: Do not allow SIMD clones with simdlen 1 [PR113552][GCC 13/12/11 backport]

2024-04-09 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> This is a backport of g:306713c953d509720dc394c43c0890548bb0ae07.
>
> The AArch64 vector PCS does not allow simd calls with simdlen 1,
> however due to a bug we currently do allow it for num == 0.
>
> This causes us to emit a symbol that doesn't exist and we fail to link.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> OK for GCC 13,12,11 branches?.
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   PR tree-optimization/113552
>   * config/aarch64/aarch64.cc
>   (aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1.
>
> gcc/testsuite/ChangeLog:
>
>   PR tree-optimization/113552
>   * gcc.target/aarch64/pr113552.c: New test.
>   * gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check.

OK, thanks.

Richard

> ---
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> f546c48ae2d2bad2e34c6b72e5e3e30aba3c3bd6..cec614596c5532b6341ab814f700c4ffc737340f
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -27027,7 +27027,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
> (struct cgraph_node *node,
>   bool explicit_p)
>  {
>tree t, ret_type;
> -  unsigned int elt_bits, count;
> +  unsigned int elt_bits, count = 0;
>unsigned HOST_WIDE_INT const_simdlen;
>poly_uint64 vec_bits;
>  
> @@ -27100,8 +27100,17 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
> (struct cgraph_node *node,
>elt_bits = GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type));
>if (known_eq (clonei->simdlen, 0U))
>  {
> -  count = 2;
> -  vec_bits = (num == 0 ? 64 : 128);
> +  /* We don't support simdlen == 1.  */
> +  if (known_eq (elt_bits, 64))
> + {
> +   count = 1;
> +   vec_bits = 128;
> + }
> +  else
> + {
> +   count = 2;
> +   vec_bits = (num == 0 ? 64 : 128);
> + }
>clonei->simdlen = exact_div (vec_bits, elt_bits);
>  }
>else
> @@ -27121,6 +27130,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
> (struct cgraph_node *node,
> return 0;
>   }
>  }
> +
>clonei->vecsize_int = vec_bits;
>clonei->vecsize_float = vec_bits;
>return count;
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c 
> b/gcc/testsuite/gcc.target/aarch64/pr113552.c
> new file mode 100644
> index 
> ..9c96b061ed2b4fcc57e58925277f74d14f79c51f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -march=armv8-a" } */
> +
> +__attribute__ ((__simd__ ("notinbranch"), const))
> +double cos (double);
> +
> +void foo (float *a, double *b)
> +{
> +for (int i = 0; i < 12; i+=3)
> +  {
> +b[i] = cos (5.0 * a[i]);
> +b[i+1] = cos (5.0 * a[i+1]);
> +b[i+2] = cos (5.0 * a[i+2]);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c 
> b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
> index 
> 95f6a6803e889c02177ef10972962ed62d2095eb..c6dac6b104c94c9de89ed88dc5a73e185d2be125
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
> @@ -18,7 +18,7 @@ double foo(double x)
>  }
>  
>  /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */
> -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */
> +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnM1v_foo} } } */
>  /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */
> -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */
> +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnN1v_foo} } } */
>  /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */


Re: [PATCH 0/5] aarch64: FMV feature list fixes

2024-04-09 Thread Richard Sandiford
Andrew Carlotti  writes:
> The first three patches are trivial changes to the feature list to reflect
> recent changes in the ACLE.  Patch 4 removes most of the FMV multiversioning
> features that don't work at the moment, and should be entirely 
> uncontroversial.
>
> Patch 5 handles the remaining cases, where there's an inconsistency in how
> features are named in the current FMV specification compared to the existing
> command line options.  It might be better to instead preserve the "memtag2",
> "ssbs2" and "ls64_accdata" names for now; I'd be happy to commit either
> version.

Yeah, I suppose patch 5 leaves things in a somewhat awkward state,
since e.g.:

-AARCH64_OPT_FMV_EXTENSION("memtag", MEMTAG, (), (), (), "")
+AARCH64_OPT_EXTENSION("memtag", MEMTAG, (), (), (), "")
 
-AARCH64_FMV_FEATURE("memtag2", MEMTAG2, (MEMTAG))
+AARCH64_FMV_FEATURE("memtag", MEMTAG2, (MEMTAG))

seems to drop "memtag2" and FEAT_MEMTAG, but keep "memtag" and
FEAT_MEMTAG2.  Is that right?

Apart from that and the comment on patch 2, the series looks good to me.

While rechecking aarch64-option-extensions.def against the ACLE list:
it seems that the .def doesn't treat mops as an FMV feature.  Is that
deliberate?

Thanks,
Richard


Re: [PATCH 2/5] aarch64: Don't use FEAT_MAX as array length

2024-04-09 Thread Richard Sandiford
Andrew Carlotti  writes:
> There was an assumption in some places that the aarch64_fmv_feature_data
> array contained FEAT_MAX elements.  While this assumption held up till
> now, it is safer and more flexible to use the array size directly.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (compare_feature_masks):
>   Use ARRAY_SIZE to determine iteration bounds.
>   (aarch64_mangle_decl_assembler_name): Ditto.
>
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 1ea84c8bd7386e399f6ffa3a5e36408cf8831fc6..5de842fcc212c78beba1fa99639e79562d718579
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -19899,7 +19899,8 @@ compare_feature_masks (aarch64_fmv_feature_mask mask1,
>auto diff_mask = mask1 ^ mask2;
>if (diff_mask == 0ULL)
>  return 0;
> -  for (int i = FEAT_MAX - 1; i > 0; i--)
> +  static const int num_features = ARRAY_SIZE (aarch64_fmv_feature_data);

There doesn't seem any need for this to be static (or const).  Same for
the second hunk.

> +  for (int i = num_features - 1; i > 0; i--)

Pre-existing, but is > 0 rather than >= 0 deliberate?  Shouldn't we look
at index 0 as well?

LGTM otherwise.

Thanks,
Richard

>  {
>auto bit_mask = aarch64_fmv_feature_data[i].feature_mask;
>if (diff_mask & bit_mask)
> @@ -19982,7 +19983,8 @@ aarch64_mangle_decl_assembler_name (tree decl, tree 
> id)
>  
>name += "._";
>  
> -  for (int i = 0; i < FEAT_MAX; i++)
> +  static const int num_features = ARRAY_SIZE (aarch64_fmv_feature_data);
> +  for (int i = 0; i < num_features; i++)
>   {
> if (feature_mask & aarch64_fmv_feature_data[i].feature_mask)
>   {


Re: [PATCH v2] aarch64: Fix ACLE SME streaming mode error in neon-sve-bridge

2024-04-09 Thread Richard Sandiford
Richard Ball  writes:
> When using LTO, handling the pragma for sme before the pragma
> for the neon-sve-bridge caused the following error on svset_neonq,
> in the neon-sve-bridge.c test.
>
> error: ACLE function '0' can only be called when SME streaming mode is 
> enabled.
>
> This has been resolved by changing the pragma handlers to accept two modes.
> One where they add functions normally and a second in which 
> registered_functions
> is filled with a placeholder value.
>
> By using this, the ordering of the functions can be maintained.
>
> No regressions on aarch64-none-elf.

OK, thanks.

Richard

> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-c.cc (aarch64_pragma_aarch64):
>   Add functions_nulls parameter to pragma_handlers.
> * config/aarch64/aarch64-protos.h: Likewise.
> * config/aarch64/aarch64-sve-builtins.h
>   (enum handle_pragma_index): Add enum to count
>   number of pragmas to be handled.
> * config/aarch64/aarch64-sve-builtins.cc
>   (GTY): Add global variable for initial indexes
>   and change overload_names to an array.
>   (function_builder::function_builder):
>   Add pragma handler information.
> (function_builder::add_function):
>   Add code for overwriting previous
>   registered_functions entries.
> (add_unique_function):
>   Use an array to register overload_names
>   for both pragma handler modes.
> (add_overloaded_function): Likewise.
> (init_builtins):
>   Add functions_nulls parameter to pragma_handlers.
> (handle_arm_sve_h):
>   Initialize pragma handler information.
> (handle_arm_neon_sve_bridge_h): Likewise.
> (handle_arm_sme_h): Likewise.
>
> diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
> index 
> b5a6917d06db41a845681ed6f13f2800c70ede10..fe1a20e4e546a68e5f7eddff3bbb0d3e831fbd9b
>  100644
> --- a/gcc/config/aarch64/aarch64-c.cc
> +++ b/gcc/config/aarch64/aarch64-c.cc
> @@ -344,15 +344,15 @@ aarch64_pragma_aarch64 (cpp_reader *)
>  
>const char *name = TREE_STRING_POINTER (x);
>if (strcmp (name, "arm_sve.h") == 0)
> -aarch64_sve::handle_arm_sve_h ();
> +aarch64_sve::handle_arm_sve_h (false);
>else if (strcmp (name, "arm_sme.h") == 0)
> -aarch64_sve::handle_arm_sme_h ();
> +aarch64_sve::handle_arm_sme_h (false);
>else if (strcmp (name, "arm_neon.h") == 0)
>  handle_arm_neon_h ();
>else if (strcmp (name, "arm_acle.h") == 0)
>  handle_arm_acle_h ();
>else if (strcmp (name, "arm_neon_sve_bridge.h") == 0)
> -aarch64_sve::handle_arm_neon_sve_bridge_h ();
> +aarch64_sve::handle_arm_neon_sve_bridge_h (false);
>else
>  error ("unknown %<#pragma GCC aarch64%> option %qs", name);
>  }
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 
> bd719b992a545630f9ad6a73753fad55c6ac5d7e..42639e9efcf1e0f9362f759ae63a31b8eeb0d581
>  100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1014,9 +1014,9 @@ bool aarch64_general_check_builtin_call (location_t, 
> vec,
>  
>  namespace aarch64_sve {
>void init_builtins ();
> -  void handle_arm_sve_h ();
> -  void handle_arm_sme_h ();
> -  void handle_arm_neon_sve_bridge_h ();
> +  void handle_arm_sve_h (bool);
> +  void handle_arm_sme_h (bool);
> +  void handle_arm_neon_sve_bridge_h (bool);
>tree builtin_decl (unsigned, bool);
>bool builtin_type_p (const_tree);
>bool builtin_type_p (const_tree, unsigned int *, unsigned int *);
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h 
> b/gcc/config/aarch64/aarch64-sve-builtins.h
> index 
> e66729ed63532811b3b16ab57ae11cb10518caca..3e0d2b8560c538d201c97fba3591ddf9893d664a
>  100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.h
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.h
> @@ -123,6 +123,15 @@ enum units_index
>UNITS_vectors
>  };
>  
> +/* Enumerates the pragma handlers.  */
> +enum handle_pragma_index
> +{
> +  arm_sve_handle,
> +  arm_sme_handle,
> +  arm_neon_sve_handle,
> +  NUM_PRAGMA_HANDLERS
> +};
> +
>  /* Describes the various uses of a governing predicate.  */
>  enum predication_index
>  {
> @@ -419,7 +428,7 @@ class registered_function;
>  class function_builder
>  {
>  public:
> -  function_builder ();
> +  function_builder (handle_pragma_index, bool);
>~function_builder ();
>  
>void add_unique_function (const function_instance &, tree,
> @@ -453,6 +462,12 @@ private:
>  
>/* Used for building up function names.  */
>obstack m_string_obstack;
> +
> +  /* Used to store the index for the current function.  */
> +  unsigned int m_function_index;
> +
> +  /* Stores the mode of the current pragma handler.  */
> +  bool m_function_nulls;
>  };
>  
>  /* A base class for handling calls to built-in functions.  */
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index 

[pushed] aarch64: Fix expansion of svsudot [PR114607]

2024-04-08 Thread Richard Sandiford
Not sure how this happend, but: svsudot is supposed to be expanded
as USDOT with the operands swapped.  However, a thinko in the
expansion of svsudot meant that the arguments weren't in fact
swapped; the attempted swap was just a no-op.  And the testcases
blithely accepted that.

Tested on aarch64-linux-gnu and pushed to trunk.  I'll backport
in a few weeks if there's no fallout.

Richard


gcc/
PR target/114607
* config/aarch64/aarch64-sve-builtins-base.cc
(svusdot_impl::expand): Fix botched attempt to swap the operands
for svsudot.

gcc/testsuite/
PR target/114607
* gcc.target/aarch64/sve/acle/asm/sudot_s32.c: New test.
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc   | 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 5be2315a3c6..0d2edf3f19e 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -2809,7 +2809,7 @@ public:
version) is through the USDOT instruction but with the second and third
inputs swapped.  */
 if (m_su)
-  e.rotate_inputs_left (1, 2);
+  e.rotate_inputs_left (1, 3);
 /* The ACLE function has the same order requirements as for svdot.
While there's no requirement for the RTL pattern to have the same sort
of order as that for dot_prod, it's easier to read.
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c 
b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
index 4b452619eee..e06b69affab 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/sudot_s32.c
@@ -6,7 +6,7 @@
 
 /*
 ** sudot_s32_tied1:
-** usdot   z0\.s, z2\.b, z4\.b
+** usdot   z0\.s, z4\.b, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, svuint8_t,
@@ -17,7 +17,7 @@ TEST_TRIPLE_Z (sudot_s32_tied1, svint32_t, svint8_t, 
svuint8_t,
 ** sudot_s32_tied2:
 ** mov (z[0-9]+)\.d, z0\.d
 ** movprfx z0, z4
-** usdot   z0\.s, z2\.b, \1\.b
+** usdot   z0\.s, \1\.b, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, svuint8_t,
@@ -27,7 +27,7 @@ TEST_TRIPLE_Z_REV (sudot_s32_tied2, svint32_t, svint8_t, 
svuint8_t,
 /*
 ** sudot_w0_s32_tied:
 ** mov (z[0-9]+\.b), w0
-** usdot   z0\.s, z2\.b, \1
+** usdot   z0\.s, \1, z2\.b
 ** ret
 */
 TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, uint8_t,
@@ -37,7 +37,7 @@ TEST_TRIPLE_ZX (sudot_w0_s32_tied, svint32_t, svint8_t, 
uint8_t,
 /*
 ** sudot_9_s32_tied:
 ** mov (z[0-9]+\.b), #9
-** usdot   z0\.s, z2\.b, \1
+** usdot   z0\.s, \1, z2\.b
 ** ret
 */
 TEST_TRIPLE_Z (sudot_9_s32_tied, svint32_t, svint8_t, uint8_t,
-- 
2.25.1



Re: [PATCH][wwwdocs] Add NEON-SVE bridge intrinsics to changes.html

2024-04-08 Thread Richard Sandiford
Richard Ball  writes:
> Hi all,
>
> Adding the NEON-SVE bridge intrinsics that were missed
> in the last patch.
>
> Thanks,
> Richard

OK, thanks.

Richard

> diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
> index 
> 9fd224c1df3f05eadcedaaa41c0859e712b93b78..df63af48298564de9c35bab1dd35891c2581e3d6
>  100644
> --- a/htdocs/gcc-14/changes.html
> +++ b/htdocs/gcc-14/changes.html
> @@ -420,6 +420,12 @@ a work-in-progress.
>-march=armv8.2-a or higher to be specified.  Likewise, the
>intrinsics enabled by +memtag no longer require
>-march=armv8.5-a.
> +  Support for the
> +   href="https://github.com/ARM-software/acle/blob/main/main/acle.md#neon-sve-bridge;>
> +  NEON-SVE Bridge intrinsics.
> +  These are intrinsics that allow conversions between NEON and SVE 
> vectors,
> +  enabled through the inclusion of the 
> arm_neon_sve_bridge.h header.
> +  
>  
>The option -mtp= is now supported for changing the TPIDR
> register used for TLS accesses.  For more details please refer to the


Re: [PATCH] aarch64: Fix vld1/st1_x4 intrinsic test

2024-04-08 Thread Richard Sandiford
"Swinney, Jonathan"  writes:
> The test for this intrinsic was failing silently and so it failed to
> report the bug reported in 114521. This patch modifes the test to
> report the result.
>
> Bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114521
>
> Signed-off-by: Jonathan Swinney 
> ---
>  .../gcc.target/aarch64/advsimd-intrinsics/vld1x4.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)

Thanks, pushed to trunk.

Richard

>
> diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c 
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
> index 89b289bb21d..17db262a31a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld1x4.c
> @@ -3,6 +3,7 @@
>  /* { dg-skip-if "unimplemented" { arm*-*-* } } */
>  /* { dg-options "-O3" } */
>  
> +#include 
>  #include 
>  #include "arm-neon-ref.h"
>  
> @@ -71,13 +72,16 @@ VARIANT (float64, 2, q_f64)
>  VARIANTS (TESTMETH)
>  
>  #define CHECKS(BASE, ELTS, SUFFIX)   \
> -  if (test_vld1##SUFFIX##_x4 () != 0)\
> -fprintf (stderr, "test_vld1##SUFFIX##_x4");
> +  if (test_vld1##SUFFIX##_x4 () != 0) {  \
> +fprintf (stderr, "test_vld1" #SUFFIX "_x4 failed\n"); \
> +failed = true; \
> +  }
>  
>  int
>  main (int argc, char **argv)
>  {
> +  bool failed = false;
>VARIANTS (CHECKS)
>  
> -  return 0;
> +  return (failed) ? 1 : 0;
>  }


Re: [PATCH] rtl-optimization/101523 - avoid re-combine after noop 2->2 combination

2024-04-08 Thread Richard Sandiford
Segher Boessenkool  writes:
> Hi!
>
> On Wed, Apr 03, 2024 at 01:07:41PM +0200, Richard Biener wrote:
>> The following avoids re-walking and re-combining the instructions
>> between i2 and i3 when the pattern of i2 doesn't change.
>> 
>> Bootstrap and regtest running ontop of a reversal of 
>> r14-9692-g839bc42772ba7a.
>
> Please include that in the patch (or series, preferably).
>
>> It brings down memory use frmo 9GB to 400MB and compile-time from
>> 80s to 3.5s.  r14-9692-g839bc42772ba7a does better in both metrics
>> but has shown code generation regressions across acrchitectures.
>> 
>> OK to revert r14-9692-g839bc42772ba7a?
>
> No.
>
> The patch solved a very real problem.  How does your replacement handle
> that?  You don't say.  It looks like it only battles symptoms a bit,
> instead :-(
>
> We had this before: 3->2 combinations that leave an instruction
> identical to what was there before.  This was just a combination with
> context as well.  The only reason this wasn't a huge problem then
> already was because this is a 3->2 combination, even if it really is a
> 2->1 one it still is beneficial in all the same cases.  But in the new
> case it can iterate indefinitely -- well not quite, but some polynomial
> number of times, for a polynomial at least of degree three, possibly
> more :-(
>
> With this patch you need to show combine still is linear.  I don't think
> it is, but some deeper analysis might show it still is.
>
>   ~ - ~ - ~
>
> What should *really* be done is something that has been on the wish list
> for decades: an uncse pass.
>
> The things that combine no longer works on after my patch are actually
> 1->1 combinations (which we never do currently, although we probably
> should); or alternatively, an un-CSE followed by a 2->1 combination.
>
> We can do the latter of course, but we need to do an actual uncse first!
> Somewhere before combine, and then redo a CSE after it.  An actual CSE,
> not doing ten gazillion other things.

Can you give a specific example of a 2->2 combination that we still
want to apply after r14-9692-g839bc42772ba7a?

2->2 combinations as I understand them were added by
c4c5ad1d6d1e1e1fe7a1c2b3bb097cc269dc7306:

Author: Segher Boessenkool 
Date:   Mon Jul 30 15:18:17 2018 +0200

combine: Allow combining two insns to two insns

This patch allows combine to combine two insns into two.  This helps
in many cases, by reducing instruction path length, and also allowing
further combinations to happen.  PR85160 is a typical example of code
that it can improve.

This patch does not allow such combinations if either of the original
instructions was a simple move instruction.  In those cases combining
the two instructions increases register pressure without improving the
code.  With this move test register pressure does no longer increase
noticably as far as I can tell.

(At first I also didn't allow either of the resulting insns to be a
move instruction.  But that is actually a very good thing to have, as
should have been obvious).

PR rtl-optimization/85160
* combine.c (is_just_move): New function.
(try_combine): Allow combining two instructions into two if neither 
of
the original instructions was a move.

That patch didn't have a testcase, but one was added later in
81bdfc1e2940fc93bcd0bba4416daff47f04f3b3:

testcase for 2-2 combine

gcc/testsuite/
PR rtl-optimization/85160
* gcc.target/powerpc/combine-2-2.c: New testcase.

But this is the powerpc test that regresses with the recent patch (PR114518).

The patches reference aarch64 bug PR85160.  If I check out and build
c4c5ad1d6d1e above, I can see that it does indeed remove two mvns from
the PR85160 testcase.  The diff from c4c5ad1d6d1e~ is:

@@ -10,12 +10,10 @@
.cfi_startproc
ldr w3, [x2, w3, sxtw 2]
ldr w2, [x2, w4, sxtw 2]
-   mvn w3, w3
-   mvn w2, w2
-   and w4, w3, w1
-   and w1, w2, w1
-   and w3, w3, w0
-   and w2, w2, w0
+   bic w4, w1, w3
+   bic w3, w0, w3
+   bic w1, w1, w2
+   bic w2, w0, w2
asr w4, w4, 9
asr w1, w1, 7
orr w3, w4, w3, asr 7

(which is great).  But if I apply 839bc42772ba on top of c4c5ad1d6d1e
then the optimisation is undone.

Is that the intention?  I.e. are we effectively removing the kind of
2->2 combinations added in c4c5ad1d6d1e1e1fe?  If so, why not simply
revert c4c5ad1d6d1e1e1fe itself?

Or is there a specific testcase that is still optimised with the
combination of c4c5ad1d6d1e1e1fe and 839bc42772ba7a that would not
be optimised without c4c5ad1d6d1e1e1fe?  If so, can you say what it is?

Thanks,
Richard


Re: [pushed] aarch64: Fix bogus cnot optimisation [PR114603]

2024-04-08 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, Apr 5, 2024 at 3:52 PM Richard Sandiford
>> This isn't a regression on a known testcase.  However, it's a nasty
>> wrong code bug that could conceivably trigger for autovec code (although
>> I've not been able to construct a reproducer so far).  That fix is also
>> quite localised to the buggy operation.  I'd therefore prefer to push
>> the fix now rather than wait for GCC 15.
>
> wrong-code bugs (and also rejects-valid or ice-on-valid) are always exempt
> from the regression-only fixing.  In practice every such bug will be a
> regression,
> in this case to when the combining pattern was introduced (unless that was
> with the version with the initial introduction of the port of course).

Ah, thanks, hadn't realised that.  Makes sense though.

It's good news of a sort since unfortunately I've another SVE wrong-code
fix in the works...

Richard


[pushed] aarch64: Fix bogus cnot optimisation [PR114603]

2024-04-05 Thread Richard Sandiford
aarch64-sve.md had a pattern that combined:

cmpeq   pb.T, pa/z, zc.T, #0
mov zd.T, pb/z, #1

into:

cnotzd.T, pa/m, zc.T

But this is only valid if pa.T is a ptrue.  In other cases, the
original would set inactive elements of zd.T to 0, whereas the
combined form would copy elements from zc.T.

This isn't a regression on a known testcase.  However, it's a nasty
wrong code bug that could conceivably trigger for autovec code (although
I've not been able to construct a reproducer so far).  That fix is also
quite localised to the buggy operation.  I'd therefore prefer to push
the fix now rather than wait for GCC 15.

Tested on aarch64-linux-gnu & pushed.  I'll backport to branches if
there is no fallout.

Richard

gcc/
PR target/114603
* config/aarch64/aarch64-sve.md (@aarch64_pred_cnot): Replace
with...
(@aarch64_ptrue_cnot): ...this, requiring operand 1 to be
a ptrue.
(*cnot): Require operand 1 to be a ptrue.
* config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand):
Use aarch64_ptrue_cnot for _x operations that are predicated
with a ptrue.  Represent other _x operations as fully-defined _m
operations.

gcc/testsuite/
PR target/114603
* gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.
---
 .../aarch64/aarch64-sve-builtins-base.cc  | 25 ---
 gcc/config/aarch64/aarch64-sve.md | 22 
 .../aarch64/sve/acle/general/cnot_1.c | 23 +
 3 files changed, 50 insertions(+), 20 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/cnot_1.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index 257ca5bf6ad..5be2315a3c6 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -517,15 +517,22 @@ public:
   expand (function_expander ) const override
   {
 machine_mode mode = e.vector_mode (0);
-if (e.pred == PRED_x)
-  {
-   /* The pattern for CNOT includes an UNSPEC_PRED_Z, so needs
-  a ptrue hint.  */
-   e.add_ptrue_hint (0, e.gp_mode (0));
-   return e.use_pred_x_insn (code_for_aarch64_pred_cnot (mode));
-  }
-
-return e.use_cond_insn (code_for_cond_cnot (mode), 0);
+machine_mode pred_mode = e.gp_mode (0);
+/* The underlying _x pattern is effectively:
+
+dst = src == 0 ? 1 : 0
+
+   rather than an UNSPEC_PRED_X.  Using this form allows autovec
+   constructs to be matched by combine, but it means that the
+   predicate on the src == 0 comparison must be all-true.
+
+   For simplicity, represent other _x operations as fully-defined _m
+   operations rather than using a separate bespoke pattern.  */
+if (e.pred == PRED_x
+   && gen_lowpart (pred_mode, e.args[0]) == CONSTM1_RTX (pred_mode))
+  return e.use_pred_x_insn (code_for_aarch64_ptrue_cnot (mode));
+return e.use_cond_insn (code_for_cond_cnot (mode),
+   e.pred == PRED_x ? 1 : 0);
   }
 };
 
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index eca8623e587..0434358122d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -3363,24 +3363,24 @@ (define_insn_and_split 
"trunc2"
 ;; - CNOT
 ;; -
 
-;; Predicated logical inverse.
-(define_expand "@aarch64_pred_cnot"
+;; Logical inverse, predicated with a ptrue.
+(define_expand "@aarch64_ptrue_cnot"
   [(set (match_operand:SVE_FULL_I 0 "register_operand")
(unspec:SVE_FULL_I
  [(unspec:
 [(match_operand: 1 "register_operand")
- (match_operand:SI 2 "aarch64_sve_ptrue_flag")
+ (const_int SVE_KNOWN_PTRUE)
  (eq:
-   (match_operand:SVE_FULL_I 3 "register_operand")
-   (match_dup 4))]
+   (match_operand:SVE_FULL_I 2 "register_operand")
+   (match_dup 3))]
 UNSPEC_PRED_Z)
-  (match_dup 5)
-  (match_dup 4)]
+  (match_dup 4)
+  (match_dup 3)]
  UNSPEC_SEL))]
   "TARGET_SVE"
   {
-operands[4] = CONST0_RTX (mode);
-operands[5] = CONST1_RTX (mode);
+operands[3] = CONST0_RTX (mode);
+operands[4] = CONST1_RTX (mode);
   }
 )
 
@@ -3389,7 +3389,7 @@ (define_insn "*cnot"
(unspec:SVE_I
  [(unspec:
 [(match_operand: 1 "register_operand")
- (match_operand:SI 5 "aarch64_sve_ptrue_flag")
+ (const_int SVE_KNOWN_PTRUE)
  (eq:
(match_operand:SVE_I 2 "register_operand")
(match_operand:SVE_I 3 "aarch64_simd_imm_zero"))]
@@ -11001,4 +11001,4 @@ (define_insn "@aarch64_sve_set_neonq_"
   GET_MODE (operands[2]));
 return 

[pushed] aarch64: Recognise svundef idiom [PR114577]

2024-04-04 Thread Richard Sandiford
GCC 14 adds the header file arm_neon_sve_bridge.h to help interface
SVE and Advanced SIMD code.  One of the defined idioms is:

  svset_neonq (svundef_TYPE (), advsimd_vector)

which simply reinterprets advsimd_vector as an SVE vector without
regard for what's in the upper bits.

GCC was failing to recognise this idiom, which was likely to
significantly hamper adoption.

There is (AFAIK) no good way of representing an extension with
undefined bits in gimple.  We could add an internal-only builtin
to represent it, but the current framework makes that somewhat
awkward.  It also doesn't seem very forward-looking.

This patch instead goes for the simpler approach of recognising
undefined arguments at expansion time.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
PR target/114577
* config/aarch64/aarch64-sve-builtins.h (aarch64_sve::lookup_fndecl):
Declare.
* config/aarch64/aarch64-sve-builtins.cc (aarch64_sve::lookup_fndecl):
New function.
* config/aarch64/aarch64-sve-builtins-base.cc (is_undef): Likewise.
(svset_neonq_impl::expand): Optimise expansions whose first argument
is undefined.

gcc/testsuite/
PR target/114577
* gcc.target/aarch64/sve/acle/general/pr114577_1.c: New test.
* gcc.target/aarch64/sve/acle/general/pr114577_2.c: Likewise.
---
 .../aarch64/aarch64-sve-builtins-base.cc  | 27 ++
 gcc/config/aarch64/aarch64-sve-builtins.cc| 16 
 gcc/config/aarch64/aarch64-sve-builtins.h |  1 +
 .../aarch64/sve/acle/general/pr114577_1.c | 94 +++
 .../aarch64/sve/acle/general/pr114577_2.c | 46 +
 5 files changed, 184 insertions(+)
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr114577_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr114577_2.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index a8c3f84a70b..257ca5bf6ad 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -47,11 +47,31 @@
 #include "aarch64-builtins.h"
 #include "ssa.h"
 #include "gimple-fold.h"
+#include "tree-ssa.h"
 
 using namespace aarch64_sve;
 
 namespace {
 
+/* Return true if VAL is an undefined value.  */
+static bool
+is_undef (tree val)
+{
+  if (TREE_CODE (val) == SSA_NAME)
+{
+  if (ssa_undefined_value_p (val, false))
+   return true;
+
+  gimple *def = SSA_NAME_DEF_STMT (val);
+  if (gcall *call = dyn_cast (def))
+   if (tree fndecl = gimple_call_fndecl (call))
+ if (const function_instance *instance = lookup_fndecl (fndecl))
+   if (instance->base == functions::svundef)
+ return true;
+}
+  return false;
+}
+
 /* Return the UNSPEC_CMLA* unspec for rotation amount ROT.  */
 static int
 unspec_cmla (int rot)
@@ -1142,6 +1162,13 @@ public:
   expand (function_expander ) const override
   {
 machine_mode mode = e.vector_mode (0);
+
+/* If the SVE argument is undefined, we just need to reinterpret the
+   Advanced SIMD argument as an SVE vector.  */
+if (!BYTES_BIG_ENDIAN
+   && is_undef (CALL_EXPR_ARG (e.call_expr, 0)))
+  return simplify_gen_subreg (mode, e.args[1], GET_MODE (e.args[1]), 0);
+
 rtx_vector_builder builder (VNx16BImode, 16, 2);
 for (unsigned int i = 0; i < 16; i++)
   builder.quick_push (CONST1_RTX (BImode));
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 11f5c5c500c..e124d1f90a5 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -1055,6 +1055,22 @@ get_vector_type (sve_type type)
   return acle_vector_types[type.num_vectors - 1][vector_type];
 }
 
+/* If FNDECL is an SVE builtin, return its function instance, otherwise
+   return null.  */
+const function_instance *
+lookup_fndecl (tree fndecl)
+{
+  if (!fndecl_built_in_p (fndecl, BUILT_IN_MD))
+return nullptr;
+
+  unsigned int code = DECL_MD_FUNCTION_CODE (fndecl);
+  if ((code & AARCH64_BUILTIN_CLASS) != AARCH64_BUILTIN_SVE)
+return nullptr;
+
+  unsigned int subcode = code >> AARCH64_BUILTIN_SHIFT;
+  return &(*registered_functions)[subcode]->instance;
+}
+
 /* Report an error against LOCATION that the user has tried to use
function FNDECL when extension EXTENSION is disabled.  */
 static void
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h 
b/gcc/config/aarch64/aarch64-sve-builtins.h
index e66729ed635..053006776a9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins.h
@@ -810,6 +810,7 @@ extern tree acle_svprfop;
 
 bool vector_cst_all_same (tree, unsigned int);
 bool is_ptrue (tree, unsigned int);
+const function_instance *lookup_fndecl (tree);
 
 /* Try to find a mode with the given mode_suffix_info fields.  Return the
mode on success or MODE_none on 

Re: [PATCH] libatomic: Cleanup macros in atomic_16.S

2024-04-04 Thread Richard Sandiford
Wilco Dijkstra  writes:
> As mentioned in 
> https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648397.html ,
> do some additional cleanup of the macros and aliases:
>
> Cleanup the macros to add the libat_ prefixes in atomic_16.S.  Emit the
> alias to __atomic_ when ifuncs are not enabled in the ENTRY macro.
>
> Passes regress and bootstrap, OK for commit?
>
> libatomic:
> * config/linux/aarch64/atomic_16.S: Add __libat_ prefix in the
> LSE2/LSE128/CORE macros, remove elsewhere.  Add ATOMIC macro.

Thanks for doing this.  LGTM, but one minor suggestion below:

> diff --git a/libatomic/config/linux/aarch64/atomic_16.S 
> b/libatomic/config/linux/aarch64/atomic_16.S
> index 
> 4e3fa870b0338da4cfcdb0879ab8bed8d041a0a3..d0343507120c06a483ffdae1a793b6b5263cfe98
>  100644
> --- a/libatomic/config/linux/aarch64/atomic_16.S
> +++ b/libatomic/config/linux/aarch64/atomic_16.S
> @@ -45,7 +45,7 @@
>  # define HAVE_FEAT_LSE128 0
>  #endif
>
> -#define HAVE_FEAT_LSE2  HAVE_IFUNC
> +#define HAVE_FEAT_LSE2 HAVE_IFUNC
>
>  #if HAVE_FEAT_LSE128
> .arch   armv9-a+lse128
> @@ -53,31 +53,37 @@
> .arch   armv8-a+lse
>  #endif
>
> -#define LSE128(NAME)   NAME##_i1
> -#define LSE2(NAME) NAME##_i2
> -#define CORE(NAME) NAME
> +#define LSE128(NAME)   libat_##NAME##_i1
> +#define LSE2(NAME) libat_##NAME##_i2
> +#define CORE(NAME) libat_##NAME
> +#define ATOMIC(NAME)   __atomic_##NAME
>
> -#define ENTRY_FEAT(NAME, FEAT)  \
> -   ENTRY (FEAT (NAME))
> +#if HAVE_IFUNC
> +# define ENTRY(NAME)   ENTRY2 (CORE (NAME), )
> +#else
> +/* Emit __atomic_* entrypoints if no ifuncs.  */
> +# define ENTRY(NAME)   ENTRY2 (CORE (NAME), ALIAS (NAME, ATOMIC, CORE))
> +#endif
> +#define ENTRY_FEAT(NAME, FEAT) ENTRY2 (FEAT (NAME), )

Perhaps we should define this only in the HAVE_IFUNC arm, so that it's
a noisy failure if we forget to protect an ENTRY_FEAT with an appropriate
#ifdef.

OK for GCC 15 with or without that change.

Richard

> +
> +#define END(NAME)  END2 (CORE (NAME))
> +#define END_FEAT(NAME, FEAT)   END2 (FEAT (NAME))
>
> -#define ENTRY(NAME)\
> +#define ENTRY2(NAME, ALIASES)  \
> .global NAME;   \
> .hidden NAME;   \
> .type NAME,%function;   \
> .p2align 4; \
> +   ALIASES;\
>  NAME:  \
> -   .cfi_startproc; \
> -   hint34  // bti c
> -
> -#define END_FEAT(NAME, FEAT)   \
> -   END (FEAT (NAME))
> +   .cfi_startproc; \
> +   hint34; // bti c
>
> -#define END(NAME)  \
> +#define END2(NAME) \
> .cfi_endproc;   \
> .size NAME, .-NAME;
>
> -#define ALIAS(NAME, FROM, TO)  ALIAS1 (FROM (NAME),TO (NAME))
> -#define ALIAS2(NAME)   ALIAS1 (__atomic_##NAME, libat_##NAME)
> +#define ALIAS(NAME, FROM, TO)  ALIAS1 (FROM (NAME), TO (NAME))
>
>  #define ALIAS1(ALIAS, NAME)\
> .global ALIAS;  \
> @@ -116,7 +122,7 @@ NAME:   \
>  #define SEQ_CST 5
>
>
> -ENTRY (libat_load_16)
> +ENTRY (load_16)
> mov x5, x0
> cbnzw1, 2f
>
> @@ -131,11 +137,11 @@ ENTRY (libat_load_16)
> stxpw4, res0, res1, [x5]
> cbnzw4, 2b
> ret
> -END (libat_load_16)
> +END (load_16)
>
>
>  #if HAVE_FEAT_LSE2
> -ENTRY_FEAT (libat_load_16, LSE2)
> +ENTRY_FEAT (load_16, LSE2)
> cbnzw1, 1f
>
> /* RELAXED.  */
> @@ -155,11 +161,11 @@ ENTRY_FEAT (libat_load_16, LSE2)
> ldp res0, res1, [x0]
> dmb ishld
> ret
> -END_FEAT (libat_load_16, LSE2)
> +END_FEAT (load_16, LSE2)
>  #endif
>
>
> -ENTRY (libat_store_16)
> +ENTRY (store_16)
> cbnzw4, 2f
>
> /* RELAXED.  */
> @@ -173,11 +179,11 @@ ENTRY (libat_store_16)
> stlxp   w4, in0, in1, [x0]
> cbnzw4, 2b
> ret
> -END (libat_store_16)
> +END (store_16)
>
>
>  #if HAVE_FEAT_LSE2
> -ENTRY_FEAT (libat_store_16, LSE2)
> +ENTRY_FEAT (store_16, LSE2)
> cbnzw4, 1f
>
> /* RELAXED.  */
> @@ -189,11 +195,11 @@ ENTRY_FEAT (libat_store_16, LSE2)
> stlxp   w4, in0, in1, [x0]
> cbnzw4, 1b
> ret
> -END_FEAT (libat_store_16, LSE2)
> +END_FEAT (store_16, LSE2)
>  #endif
>
>
> -ENTRY (libat_exchange_16)
> +ENTRY (exchange_16)
> mov x5, x0
> cbnzw4, 2f
>
> @@ -217,11 +223,11 @@ ENTRY (libat_exchange_16)
> stlxp   w4, in0, in1, [x5]
> cbnzw4, 4b
> ret
> -END (libat_exchange_16)
> +END (exchange_16)
>
>
>  #if HAVE_FEAT_LSE128
> -ENTRY_FEAT (libat_exchange_16, LSE128)
> +ENTRY_FEAT (exchange_16, LSE128)
> mov tmp0, x0
> mov res0, in0
> mov res1, in1
> @@ -241,11 +247,11 @@ ENTRY_FEAT (libat_exchange_16, LSE128)
> /* RELEASE/ACQ_REL/SEQ_CST.  */
>  2: swppal  res0, res1, [tmp0]
> ret
> -END_FEAT (libat_exchange_16, LSE128)
> +END_FEAT 

Re: [PATCH] libatomic: Fix build for --disable-gnu-indirect-function [PR113986]

2024-04-04 Thread Richard Sandiford
Wilco Dijkstra  writes:
> v2: 
>
> Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
> Always build atomic_16.S, add aliases to the __atomic_ functions if 
> !HAVE_IFUNC. 
> Include auto-config.h in atomic_16.S to avoid having to pass defines via 
> makefiles.
> Fix build if HWCAP_ATOMICS/CPUID are not defined.
>
> Passes regress and bootstrap, OK for commit?
>
> libatomic:
> PR target/113986
> * Makefile.in: Regenerated.
> * Makefile.am: Make atomic_16.S not depend on HAVE_IFUNC.
> Remove predefine of HAVE_FEAT_LSE128.
> * acinclude.m4: Remove ARCH_AARCH64_HAVE_LSE128.
> * configure: Regenerated.
> * config/linux/aarch64/atomic_16.S: Add __atomic_ aliases if 
> !HAVE_IFUNC. 
> * config/linux/aarch64/host-config.h: Correctly handle !HAVE_IFUNC.  
> Add
> defines for HWCAP_ATOMICS and HWCAP_CPUID.

OK, thanks, but...

> @@ -35,7 +48,6 @@ typedef struct __ifunc_arg_t {
>  # define _IFUNC_ARG_HWCAP (1ULL << 62)
>  #endif
>  
> -#ifdef HWCAP_USCAT
>  # if N == 16
>  #  define IFUNC_COND_1   (has_lse128 (hwcap, features))
>  #  define IFUNC_COND_2   (has_lse2 (hwcap, features))
> @@ -44,19 +56,6 @@ typedef struct __ifunc_arg_t {
>  #  define IFUNC_COND_1   (hwcap & HWCAP_ATOMICS)
>  #  define IFUNC_NCOND(N) 1
>  # endif

...please reindent this block so that there is no space after the
"#" in the outermost directives.

Richard


Re: [PATCH V3 0/2] aarch64: Place target independent and dependent changed code in one file.

2024-04-03 Thread Richard Sandiford
Alex Coplan  writes:
> On 23/02/2024 16:41, Ajit Agarwal wrote:
>> Hello Richard/Alex/Segher:
>
> Hi Ajit,
>
> Sorry for the delay and thanks for working on this.
>
> Generally this looks like the right sort of approach (IMO) but I've left
> some comments below.
>
> I'll start with a meta comment: in the subject line you have marked this
> as 0/2, but usually 0/n is reserved for the cover letter of a patch
> series and wouldn't contain an actual patch.  I think this might have
> confused the Linaro CI suitably such that it didn't run regression tests
> on the patch.

Alex, thanks for the thorough and in-depth review.  I agree with all the
comments FWIW.  Just to add a couple of things:

> > @@ -138,8 +138,18 @@ struct alt_base
> >poly_int64 offset;
> >  };
> >  
> > +// Virtual base class for load/store walkers used in alias analysis.
> > +struct alias_walker
> > +{
> > +  virtual bool conflict_p (int ) const = 0;
> > +  virtual insn_info *insn () const = 0;
> > +  virtual bool valid () const  = 0;
> > +  virtual void advance () = 0;
> > +};
> > +
> > +
> >  // State used by the pass for a given basic block.
> > -struct ldp_bb_info
> > +struct pair_fusion
>
> As a comment on the high-level design, I think we want a generic class
> for the overall pass, not just for the BB-specific structure.
>
> That is because naturally we want the ldp_fusion_bb function itself to
> be a member of such a class, so that it can access virtual functions to
> query the target e.g. about the load/store pair policy, and whether to
> try and promote writeback pairs.
>
> If we keep all of the virtual functions in such an outer class, then we
> can keep the ldp_fusion_bb class generic (not needing an override for
> each target) and that inner class can perhaps be given a pointer or
> reference to the outer class when it is instantiated in ldp_fusion_bb.

I agree that in general, the new virtual methods should belong to a pass
class rather than the per-bb class.

In principle, if we need to virtualise existing members of ldp_bb_info
(or code contained within existing members of ldp_bb_info), and if that
code accesses members of the bb info, then it might make sense to have
target-specific derivatives of the bb info structure too, with a virtual
function to create the bb info structure for a given bb.

However, it looks like all but one of the virtual functions in the patch
are self-contained (in the sense of depending only on their arguments
and on globals).  The one exception is transform_for_base, but Alex
asked whether that needs to be virtualised.  If it doesn't, then like
Alex says, it seems that all virtuals could belong to the pass class
rather than to the bb info.

>> [...]
>> +  }
>>  };
>>  
>> +bool
>> +store_modifies_mem_p (rtx mem, insn_info *store_insn, int );
>> +bool load_modified_by_store_p (insn_info *load,
>> +  insn_info *store,
>> +  int );
>> +extern insn_info *
>> +try_repurpose_store (insn_info *first,
>> + insn_info *second,
>> + const insn_range_info _range);
>> +
>> +void reset_debug_use (use_info *use);
>> +
>> +extern void
>> +fixup_debug_uses (obstack_watermark ,
>> +  insn_info *insns[2],
>> +  rtx orig_rtl[2],
>> +  insn_info *pair_dst,
>> +  insn_info *trailing_add,
>> +  bool load_p,
>> +  int writeback,
>> +  rtx writeback_effect,
>> +  unsigned base_regno);
>> +
>> +void
>> +fixup_debug_uses_trailing_add (obstack_watermark ,
>> +   insn_info *pair_dst,
>> +   insn_info *trailing_add,
>> +   rtx writeback_effect);
>> +
>> +
>> +extern void
>> +fixup_debug_use (obstack_watermark ,
>> + use_info *use,
>> + def_info *def,
>> + rtx base,
>> + poly_int64 wb_offset);
>> +
>> +extern insn_info *
>> +find_trailing_add (insn_info *insns[2],
>> +   const insn_range_info _range,
>> +   int initial_writeback,
>> +   rtx *writeback_effect,
>> +   def_info **add_def,
>> +   def_info *base_def,
>> +   poly_int64 initial_offset,
>> +   unsigned access_size);
>> +
>> +rtx drop_writeback (rtx mem);
>> +rtx pair_mem_strip_offset (rtx mem, poly_int64 *offset);
>> +bool any_pre_modify_p (rtx x);
>> +bool any_post_modify_p (rtx x);
>> +int encode_lfs (lfs_fields fields);
>> +extern insn_info * latest_hazard_before (insn_info *insn, rtx *ignore,
>> +  insn_info *ignore_insn = nullptr);
>> +insn_info * first_hazard_after (insn_info *insn, rtx *ignore);
>> +bool ranges_overlap_p (const insn_range_info , const insn_range_info 
>> );
>> +insn_range_info get_def_range (def_info *def);
>> +insn_range_info def_downwards_move_range (def_info *def);
>> +insn_range_info def_upwards_move_range (def_info *def);
>> +rtx gen_tombstone (void);
>> 

Re: [PATCH v2 2/3] aarch64: Add support for aarch64-gnu (GNU/Hurd on AArch64)

2024-04-02 Thread Richard Sandiford
Sergey Bugaev  writes:
> Coupled with a corresponding binutils patch, this produces a toolchain that 
> can
> sucessfully build working binaries targeting aarch64-gnu.
>
> gcc/Changelog:
>
>   * config.gcc: Recognize aarch64*-*-gnu* targets.
>   * config/aarch64/aarch64-gnu.h: New file.
>
> Signed-off-by: Sergey Bugaev 
> ---
>  gcc/config.gcc   |  6 +++
>  gcc/config/aarch64/aarch64-gnu.h | 68 
>  2 files changed, 74 insertions(+)
>  create mode 100644 gcc/config/aarch64/aarch64-gnu.h

I don't know if you're waiting on me, but just in case: this and patch 3
still LGTM if Thomas is OK with them.

Thanks,
Richard

> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 87a5c92b6..9d935164c 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1264,6 +1264,12 @@ aarch64*-*-linux*)
>   done
>   TM_MULTILIB_CONFIG=`echo $TM_MULTILIB_CONFIG | sed 's/^,//'`
>   ;;
> +aarch64*-*-gnu*)
> +tm_file="${tm_file} elfos.h gnu-user.h gnu.h glibc-stdint.h"
> +tm_file="${tm_file} aarch64/aarch64-elf.h aarch64/aarch64-errata.h 
> aarch64/aarch64-gnu.h"
> +tmake_file="${tmake_file} aarch64/t-aarch64"
> +tm_defines="${tm_defines}  TARGET_DEFAULT_ASYNC_UNWIND_TABLES=1"
> + ;;
>  aarch64*-wrs-vxworks*)
>  tm_file="${tm_file} elfos.h aarch64/aarch64-elf.h"
>  tm_file="${tm_file} vx-common.h vxworks.h aarch64/aarch64-vxworks.h"
> diff --git a/gcc/config/aarch64/aarch64-gnu.h 
> b/gcc/config/aarch64/aarch64-gnu.h
> new file mode 100644
> index 0..ee5494034
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-gnu.h
> @@ -0,0 +1,68 @@
> +/* Definitions for AArch64 running GNU/Hurd.
> +   Copyright (C) 2009-2024 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   .  */
> +
> +#ifndef GCC_AARCH64_GNU_H
> +#define GCC_AARCH64_GNU_H
> +
> +#define GNU_USER_DYNAMIC_LINKER 
> "/lib/ld-aarch64%{mbig-endian:_be}%{mabi=ilp32:_ilp32}.so.1"
> +
> +#define CPP_SPEC "%{pthread:-D_REENTRANT}"
> +
> +#define GNU_TARGET_LINK_SPEC  "%{h*} \
> +   %{static:-Bstatic}\
> +   %{shared:-shared} \
> +   %{symbolic:-Bsymbolic}\
> +   %{!static:%{!static-pie:  \
> + %{rdynamic:-export-dynamic} \
> + %{!shared:-dynamic-linker " GNU_USER_DYNAMIC_LINKER "}}} \
> +   %{static-pie:-Bstatic -pie --no-dynamic-linker -z text} \
> +   -X\
> +   %{mbig-endian:-EB} %{mlittle-endian:-EL} \
> +   -maarch64gnu%{mabi=ilp32:32}%{mbig-endian:b}"
> +
> +
> +#define LINK_SPEC GNU_TARGET_LINK_SPEC AARCH64_ERRATA_LINK_SPEC
> +
> +#define GNU_USER_TARGET_MATHFILE_SPEC \
> +  "%{Ofast|ffast-math|funsafe-math-optimizations:%{!shared:crtfastmath.o%s}}"
> +
> +#undef ENDFILE_SPEC
> +#define ENDFILE_SPEC   \
> +  GNU_USER_TARGET_MATHFILE_SPEC " " \
> +  GNU_USER_TARGET_ENDFILE_SPEC
> +
> +#define TARGET_OS_CPP_BUILTINS() \
> +  do \
> +{\
> + GNU_USER_TARGET_OS_CPP_BUILTINS();  \
> +}\
> +  while (0)
> +
> +#define TARGET_ASM_FILE_END aarch64_file_end_indicate_exec_stack
> +
> +/* Uninitialized common symbols in non-PIE executables, even with
> +   strong definitions in dependent shared libraries, will resolve
> +   to COPY relocated symbol in the executable.  See PR65780.  */
> +#undef TARGET_BINDS_LOCAL_P
> +#define TARGET_BINDS_LOCAL_P default_binds_local_p_2
> +
> +/* Define this to be nonzero if static stack checking is supported.  */
> +#define STACK_CHECK_STATIC_BUILTIN 1
> +
> +#endif  /* GCC_AARCH64_GNU_H */


Re: [PATCH] aarch64: Fix typo in comment about FEATURE_STRING

2024-04-02 Thread Richard Sandiford
Christophe Lyon  writes:
> Fix the comment to document FEATURE_STRING instead of FEAT_STRING.
>
> 2024-03-29  Christophe Lyon  
>
>   gcc/
>   * config/aarch64/aarch64-option-extensions.def: Fix comment.

OK, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64-option-extensions.def | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-option-extensions.def 
> b/gcc/config/aarch64/aarch64-option-extensions.def
> index 061a145e9e7..aa3cd99f791 100644
> --- a/gcc/config/aarch64/aarch64-option-extensions.def
> +++ b/gcc/config/aarch64/aarch64-option-extensions.def
> @@ -54,14 +54,14 @@
>   If a feature A appears in this list then the list implicitly includes
>   any features that are transitively dependent on A (according to 
> REQUIRES).
>  
> -   - FEAT_STRING is a string containing the entries in the 'Features' field 
> of
> - /proc/cpuinfo on a GNU/Linux system that correspond to this architecture
> - extension being available.  Sometimes multiple entries are needed to 
> enable
> - the extension (for example, the 'crypto' extension depends on four
> - entries: aes, pmull, sha1, sha2 being present).  In that case this field
> - should contain a space (" ") separated list of the strings in 'Features'
> - that are required.  Their order is not important.  An empty string means
> - do not detect this feature during auto detection.
> +   - FEATURE_STRING is a string containing the entries in the 'Features' 
> field
> + of /proc/cpuinfo on a GNU/Linux system that correspond to this
> + architecture extension being available.  Sometimes multiple entries are
> + needed to enable the extension (for example, the 'crypto' extension
> + depends on four entries: aes, pmull, sha1, sha2 being present).  In that
> + case this field should contain a space (" ") separated list of the 
> strings
> + in 'Features' that are required.  Their order is not important.  An 
> empty
> + string means do not detect this feature during auto detection.
>  
> - OPT_FLAGS is a list of feature IDENTS that should be enabled (along with
>   their transitive dependencies) when the specified FMV feature is 
> present.


Re: [PATCH] libgcc: Add missing HWCAP entries to aarch64/cpuinfo.c

2024-04-02 Thread Richard Sandiford
Wilco Dijkstra  writes:
> A few HWCAP entries are missing from aarch64/cpuinfo.c.  This results in 
> build errors
> on older machines.
>
> This counts a trivial build fix, but since it's late in stage 4 I'll let 
> maintainers chip in.
> OK for commit?
>
> libgcc/
> * config/aarch64/cpuinfo.c: Add HWCAP_EVTSTRM, HWCAP_CRC32, 
> HWCAP_CPUID,
> HWCAP_PACA and HWCAP_PACG.

OK, thanks.

Richard

> ---
>
> diff --git a/libgcc/config/aarch64/cpuinfo.c b/libgcc/config/aarch64/cpuinfo.c
> index 
> 3c6fb8a575b423c2aff71a1a9f40812b154ee284..4b94fca869507145ec690c825f637abbc82a3493
>  100644
> --- a/libgcc/config/aarch64/cpuinfo.c
> +++ b/libgcc/config/aarch64/cpuinfo.c
> @@ -52,15 +52,15 @@ struct {
>  #ifndef AT_HWCAP
>  #define AT_HWCAP 16
>  #endif
> -#ifndef HWCAP_CPUID
> -#define HWCAP_CPUID (1 << 11)
> -#endif
>  #ifndef HWCAP_FP
>  #define HWCAP_FP (1 << 0)
>  #endif
>  #ifndef HWCAP_ASIMD
>  #define HWCAP_ASIMD (1 << 1)
>  #endif
> +#ifndef HWCAP_EVTSTRM
> +#define HWCAP_EVTSTRM (1 << 2)
> +#endif
>  #ifndef HWCAP_AES
>  #define HWCAP_AES (1 << 3)
>  #endif
> @@ -73,6 +73,9 @@ struct {
>  #ifndef HWCAP_SHA2
>  #define HWCAP_SHA2 (1 << 6)
>  #endif
> +#ifndef HWCAP_CRC32
> +#define HWCAP_CRC32 (1 << 7)
> +#endif
>  #ifndef HWCAP_ATOMICS
>  #define HWCAP_ATOMICS (1 << 8)
>  #endif
> @@ -82,6 +85,9 @@ struct {
>  #ifndef HWCAP_ASIMDHP
>  #define HWCAP_ASIMDHP (1 << 10)
>  #endif
> +#ifndef HWCAP_CPUID
> +#define HWCAP_CPUID (1 << 11)
> +#endif
>  #ifndef HWCAP_ASIMDRDM
>  #define HWCAP_ASIMDRDM (1 << 12)
>  #endif
> @@ -133,6 +139,12 @@ struct {
>  #ifndef HWCAP_SB
>  #define HWCAP_SB (1 << 29)
>  #endif
> +#ifndef HWCAP_PACA
> +#define HWCAP_PACA (1 << 30)
> +#endif
> +#ifndef HWCAP_PACG
> +#define HWCAP_PACG (1UL << 31)
> +#endif
>
>  #ifndef HWCAP2_DCPODP
>  #define HWCAP2_DCPODP (1 << 0)


[oops pushed] aarch64: Fix vld1/st1_x4 intrinsic definitions

2024-03-28 Thread Richard Sandiford
Gah.  As mentioned on irc, I'd written this patch to fix PR114521.
The bug was fixed properly by Jonathan's struct rework in GCC 12,
but that's much too invasive to backport.  The attached patch therefore
deals with the bug directly.

Since it's new work, and since there's only one GCC 11 release to go,
I was originally planning to attach the patch to the PR for any distros
that wanted to take it.  But due to bad use of git, I accidentally
committed the patch while backporting the fix for PR97696.

Andrew suggested that we leave the patch in, so I'll do that unless
anyone objects.  (Please let me know if you do object though!)

Bootstrapped & regression-tested on aarch64-linux-gnu.  The PR contains
a patch to the tests that shows up the problem.

Sorry for the mistake.

Richard

---

The vld1_x4 and vst1_x4 patterns use XI registers for both 64-bit and
128-bit vectors.  This has the nice property that each individual
vector is within a separate 16-byte subreg of the XI, which should
reduce the number of memory spills needed.  However, it means that the
64-bit vector forms must convert between the native 4x64-bit structure
layout and the padded 4x128-bit XI layout.

The vld4 and vst4 functions did this correctly.  But the vld1x4 and
vst1x4 functions used a union between the native and padded layouts,
even though the layouts are different sizes.

This patch makes vld1x4 and vst1x4 use the same approach as vld4
and vst4.  It also fixes some uses of variables in the user namespace.

gcc/
* config/aarch64/arm_neon.h (vld1_s8_x4, vld1_s16_x4, vld1_s32_x4):
(vld1_u8_x4, vld1_u16_x4, vld1_u32_x4, vld1_f16_x4, vld1_f32_x4):
(vld1_p8_x4, vld1_p16_x4, vld1_s64_x4, vld1_u64_x4, vld1_p64_x4):
(vld1_f64_x4): Avoid using a union of a 256-bit structure and 512-bit
XImode integer.  Instead use the same approach as the vld4 intrinsics.
(vst1_s8_x4, vst1_s16_x4, vst1_s32_x4, vst1_u8_x4, vst1_u16_x4):
(vst1_u32_x4, vst1_f16_x4, vst1_f32_x4, vst1_p8_x4, vst1_p16_x4):
(vst1_s64_x4, vst1_u64_x4, vst1_p64_x4, vst1_f64_x4, vld1_bf16_x4):
(vst1_bf16_x4): Likewise for stores.
(vst1q_s8_x4, vst1q_s16_x4, vst1q_s32_x4, vst1q_u8_x4, vst1q_u16_x4):
(vst1q_u32_x4, vst1q_f16_x4, vst1q_f32_x4, vst1q_p8_x4, vst1q_p16_x4):
(vst1q_s64_x4, vst1q_u64_x4, vst1q_p64_x4, vst1q_f64_x4)
(vst1q_bf16_x4): Rename val parameter to __val.
---
 gcc/config/aarch64/arm_neon.h | 469 --
 1 file changed, 334 insertions(+), 135 deletions(-)

diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index baa30bd5a9d..8f53f4e1559 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -16498,10 +16498,14 @@ __extension__ extern __inline int8x8x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_s8_x4 (const int8_t *__a)
 {
-  union { int8x8x4_t __i; __builtin_aarch64_simd_xi __o; } __au;
-  __au.__o
-= __builtin_aarch64_ld1x4v8qi ((const __builtin_aarch64_simd_qi *) __a);
-  return __au.__i;
+  int8x8x4_t ret;
+  __builtin_aarch64_simd_xi __o;
+  __o = __builtin_aarch64_ld1x4v8qi ((const __builtin_aarch64_simd_qi *) __a);
+  ret.val[0] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 0);
+  ret.val[1] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 1);
+  ret.val[2] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 2);
+  ret.val[3] = (int8x8_t) __builtin_aarch64_get_dregxiv8qi (__o, 3);
+  return ret;
 }
 
 __extension__ extern __inline int8x16x4_t
@@ -16518,10 +16522,14 @@ __extension__ extern __inline int16x4x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_s16_x4 (const int16_t *__a)
 {
-  union { int16x4x4_t __i; __builtin_aarch64_simd_xi __o; } __au;
-  __au.__o
-= __builtin_aarch64_ld1x4v4hi ((const __builtin_aarch64_simd_hi *) __a);
-  return __au.__i;
+  int16x4x4_t ret;
+  __builtin_aarch64_simd_xi __o;
+  __o = __builtin_aarch64_ld1x4v4hi ((const __builtin_aarch64_simd_hi *) __a);
+  ret.val[0] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 0);
+  ret.val[1] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 1);
+  ret.val[2] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 2);
+  ret.val[3] = (int16x4_t) __builtin_aarch64_get_dregxiv4hi (__o, 3);
+  return ret;
 }
 
 __extension__ extern __inline int16x8x4_t
@@ -16538,10 +16546,14 @@ __extension__ extern __inline int32x2x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vld1_s32_x4 (const int32_t *__a)
 {
-  union { int32x2x4_t __i; __builtin_aarch64_simd_xi __o; } __au;
-  __au.__o
-  = __builtin_aarch64_ld1x4v2si ((const __builtin_aarch64_simd_si *) __a);
-  return __au.__i;
+  int32x2x4_t ret;
+  __builtin_aarch64_simd_xi __o;
+  __o = __builtin_aarch64_ld1x4v2si ((const __builtin_aarch64_simd_si *) __a);
+  ret.val[0] = (int32x2_t) __builtin_aarch64_get_dregxiv2si (__o, 0);
+  ret.val[1] = (int32x2_t) 

Re: [PATCHv2 2/2] aarch64: Add support for _BitInt

2024-03-28 Thread Richard Sandiford
Jakub Jelinek  writes:
> On Thu, Mar 28, 2024 at 03:00:46PM +0000, Richard Sandiford wrote:
>> >* gcc.target/aarch64/bitint-alignments.c: New test.
>> >* gcc.target/aarch64/bitint-args.c: New test.
>> >* gcc.target/aarch64/bitint-sizes.c: New test.
>> >* gcc.target/aarch64/bitfield-bitint-abi.h: New header.
>> >* gcc.target/aarch64/bitfield-bitint-abi-align16.c: New test.
>> >* gcc.target/aarch64/bitfield-bitint-abi-align8.c: New test.
>> 
>> Since we don't support big-endian yet, I assume the tests should be
>> conditional on aarch64_little_endian.
>
> Perhaps better on bitint effective target, then they'll become available
> automatically as soon as big endian aarch64 _BitInt support is turned on.

Ah, yeah, good point.

Richard


Re: [PATCHv2 2/2] aarch64: Add support for _BitInt

2024-03-28 Thread Richard Sandiford
"Andre Vieira (lists)"  writes:
> This patch adds support for C23's _BitInt for the AArch64 port when 
> compiling for little endianness.  Big Endianness requires further 
> target-agnostic support and we therefor disable it for now.
>
> The tests expose some suboptimal codegen for which I'll create PR's for 
> optimizations after this goes in.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (TARGET_C_BITINT_TYPE_INFO): Declare MACRO.
>   (aarch64_bitint_type_info): New function.
>   (aarch64_return_in_memory_1): Return large _BitInt's in memory.
>   (aarch64_function_arg_alignment): Adapt to correctly return the ABI
>   mandated alignment of _BitInt(N) where N > 128 as the alignment of
>   TImode.
>   (aarch64_composite_type_p): Return true for _BitInt(N), where N > 128.
>
> libgcc/ChangeLog:
>
>   * config/aarch64/t-softfp (softfp_extras): Add floatbitinthf,
>   floatbitintbf, floatbitinttf and fixtfbitint.
>   * config/aarch64/libgcc-softfp.ver (GCC_14.0.0): Add __floatbitinthf,
>   __floatbitintbf, __floatbitinttf and __fixtfbitint.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/bitint-alignments.c: New test.
>   * gcc.target/aarch64/bitint-args.c: New test.
>   * gcc.target/aarch64/bitint-sizes.c: New test.
>   * gcc.target/aarch64/bitfield-bitint-abi.h: New header.
>   * gcc.target/aarch64/bitfield-bitint-abi-align16.c: New test.
>   * gcc.target/aarch64/bitfield-bitint-abi-align8.c: New test.

Since we don't support big-endian yet, I assume the tests should be
conditional on aarch64_little_endian.

> [...]
> diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c 
> b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> new file mode 100644
> index 
> ..048d04e4c1bf90215892aa0173f6246a097d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> @@ -0,0 +1,378 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fno-stack-protector -save-temps -fno-schedule-insns 
> -fno-schedule-insns2" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#define ALIGN 16
> +#include "bitfield-bitint-abi.h"
> +
> +// f1-f16 are all the same
> +
> +/*
> +** f1:
> +**   and x0, x2, 1
> +**   ret
> +*/
> +/*
> +** f8:
> +**   and x0, x2, 1
> +**   ret
> +*/
> +/*
> +** f16:
> +**   and x0, x2, 1
> +**   ret
> +*/
> +
> +/* fp seems to be unable to optimize away stack-usage, TODO: to fix.  */
> +
> +/*
> +** fp:
> +**...
> +**   and x0, x1, 1
> +**...
> +**   ret
> +*/
> +
> +// all other f1p-f8p generate the same code, for f16p the value comes from x2
> +/*
> +** f1p:
> +**   and x0, x1, 1
> +**   ret
> +*/
> +/*
> +** f8p:
> +**   and x0, x1, 1
> +**   ret
> +*/
> +/*
> +** f16p:
> +**   and x0, x2, 1
> +**   ret
> +*/
> +
> +// g1-g16 are all the same
> +/*
> +** g1:
> +**   mov (x[0-9]+), x0
> +**   mov w0, w1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
> +**   mov x3, 0
> +**   b   f1
> +*/
> +
> +/*
> +** g8:
> +**   mov (x[0-9]+), x0
> +**   mov w0, w1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
> +**   mov x3, 0
> +**   b   f8
> +*/
> +/*
> +** g16:
> +**   mov (x[0-9]+), x0
> +**   mov w0, w1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
> +**   mov x3, 0
> +**   b   f16
> +*/
> +
> +// again gp different from the rest
> +
> +/*
> +** gp:
> +**   sub sp, sp, #16
> +**   mov (x[0-9]+), x0
> +**   mov w0, w1
> +**   sbfxx([0-9]+), \1, 0, 63
> +**   mov (w[0-9]+), 0
> +**   bfi \3, w\2, 0, 1
> +**   and x3, x\2, 9223372036854775807
> +**   mov x2, 0
> +**   str xzr, \[sp\]
> +**   strb\3, \[sp\]
> +**   ldr x1, \[sp\]
> +**   add sp, sp, 16
> +**   b   fp
> +*/
> +
> +// g1p-g8p are all the same, g16p uses x2 to pass parameter to f16p
> +
> +/*
> +** g1p:
> +**   mov (w[0-9]+), w1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
> +**   mov x2, 0
> +**   mov w0, \1
> +**   b   f1p
> +*/
> +/*
> +** g8p:
> +**   mov (w[0-9]+), w1
> +**   and x3, x0, 9223372036854775807
> +**   and x1, x0, 1
> +**   mov x2, 0
> +**   mov w0, \1
> +**   b   f8p
> +*/
> +/*
> +** g16p:
> +**   mov (x[0-9]+), x0
> +**   mov w0, w1
> +**   and x4, \1, 9223372036854775807
> +**   and x2, \1, 1
> +**   mov x3, 0
> +**   b   f16p
> +*/
> +
> +// f*_stack are all the same
> +/*
> +** f1_stack:
> +**   ldr (x[0-9]+), \[sp, 16\]
> +**   and x0, \1, 1
> +**   ret
> +*/
> +/*
> +** f8_stack:
> +**   ldr (x[0-9]+), \[sp, 16\]
> +**   and x0, \1, 1
> +**   ret
> +*/
> +/*
> +** f16_stack:
> +**   ldr (x[0-9]+), \[sp, 16\]
> +**   and x0, \1, 1
> +**   ret
> +*/
> +
> +// fp{,1,8}_stack are all the same but fp16_stack loads from 

Re: [PATCHv2 1/2] aarch64: Do not give ABI change diagnostics for _BitInt(N)

2024-03-28 Thread Richard Sandiford
"Andre Vieira (lists)"  writes:
> This patch makes sure we do not give ABI change diagnostics for the ABI 
> breaks of GCC 9, 13 and 14 for any type involving _BitInt(N), since that 
> type did not exist before this GCC version.
>
> ChangeLog:
>
>   * config/aarch64/aarch64.cc (bitint_or_aggr_of_bitint_p): New function.
>   (aarch64_layout_arg): Don't emit diagnostics for types involving
>   _BitInt(N).
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 1ea84c8bd7386e399f6ffa3a5e36408cf8831fc6..b68cf3e7cb9a6fa89b4e5826a39ffa11f64ca20a
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -6744,6 +6744,33 @@ aarch64_function_arg_alignment (machine_mode mode, 
> const_tree type,
>return alignment;
>  }
>  
> +/* Return true if TYPE describes a _BitInt(N) or an angreggate that uses the
> +   _BitInt(N) type.  These include ARRAY_TYPE's with an element that is a
> +   _BitInt(N) or an aggregate that uses it, and a RECORD_TYPE or a UNION_TYPE
> +   with a field member that is a _BitInt(N) or an aggregate that uses it.
> +   Return false otherwise.  */
> +
> +static bool
> +bitint_or_aggr_of_bitint_p (tree type)
> +{
> +  if (!type)
> +return false;
> +
> +  if (TREE_CODE (type) == BITINT_TYPE)
> +return true;
> +
> +  /* If ARRAY_TYPE, check it's element type.  */
> +  if (TREE_CODE (type) == ARRAY_TYPE)
> +return bitint_or_aggr_of_bitint_p (TREE_TYPE (type));
> +
> +  /* If RECORD_TYPE or UNION_TYPE, check the fields' types.  */
> +  if (RECORD_OR_UNION_TYPE_P (type))
> +for (tree field = TYPE_FIELDS (type); field; field = TREE_CHAIN (field))
> +  if (bitint_or_aggr_of_bitint_p (TREE_TYPE (field)))
> + return true;
> +  return false;
> +}
> +
>  /* Layout a function argument according to the AAPCS64 rules.  The rule
> numbers refer to the rule numbers in the AAPCS64.  ORIG_MODE is the
> mode that was originally given to us by the target hook, whereas the
> @@ -6767,12 +6794,6 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
> function_arg_info )
>if (pcum->aapcs_arg_processed)
>  return;
>  
> -  bool warn_pcs_change
> -= (warn_psabi
> -   && !pcum->silent_p
> -   && (currently_expanding_function_start
> -|| currently_expanding_gimple_stmt));
> -
>/* HFAs and HVAs can have an alignment greater than 16 bytes.  For example:
>  
> typedef struct foo {
> @@ -6907,6 +6928,18 @@ aarch64_layout_arg (cumulative_args_t pcum_v, const 
> function_arg_info )
> && (!alignment || abi_break_gcc_9 < alignment)
> && (!abi_break_gcc_13 || alignment < abi_break_gcc_13));
>  
> +
> +  bool warn_pcs_change
> += (warn_psabi
> +   && !pcum->silent_p
> +   && (currently_expanding_function_start
> +|| currently_expanding_gimple_stmt)
> +  /* warn_pcs_change is currently used to gate diagnostics in case of
> +  abi_break_gcc_{9,13,14}.  These however, do not apply to _BitInt(N)
> +  types as they were only introduced in GCC 14.  */
> +   && (!type || !bitint_or_aggr_of_bitint_p (type)));

How about making this a new variable such as:

  /* _BitInt(N) was only added in GCC 14.  */
  bool warn_pcs_change_le_gcc14
= (warn_psabi && !bitint_or_aggr_of_bitint_p (type);

(and keeping warn_pcs_change where it is).  In principle, warn_pcs_change
is meaningful for any future ABI breaks, and we might forget that it
excludes bitints.  The name is just a suggestion.

OK with that change, thanks.

Richard

> +
> +
>/* allocate_ncrn may be false-positive, but allocate_nvrn is quite 
> reliable.
>   The following code thus handles passing by SIMD/FP registers first.  */
>  
> @@ -21266,19 +21299,25 @@ aarch64_gimplify_va_arg_expr (tree valist, tree 
> type, gimple_seq *pre_p,
>rsize = ROUND_UP (size, UNITS_PER_WORD);
>nregs = rsize / UNITS_PER_WORD;
>  
> -  if (align <= 8 && abi_break_gcc_13 && warn_psabi)
> +  if (align <= 8
> +   && abi_break_gcc_13
> +   && warn_psabi
> +   && !bitint_or_aggr_of_bitint_p (type))
>   inform (input_location, "parameter passing for argument of type "
>   "%qT changed in GCC 13.1", type);
>  
>if (warn_psabi
> && abi_break_gcc_14
> -   && (abi_break_gcc_14 > 8 * BITS_PER_UNIT) != (align > 8))
> +   && (abi_break_gcc_14 > 8 * BITS_PER_UNIT) != (align > 8)
> +   && !bitint_or_aggr_of_bitint_p (type))
>   inform (input_location, "parameter passing for argument of type "
>   "%qT changed in GCC 14.1", type);
>  
>if (align > 8)
>   {
> -   if (abi_break_gcc_9 && warn_psabi)
> +   if (abi_break_gcc_9
> +   && warn_psabi
> +   && !bitint_or_aggr_of_bitint_p (type))
>   inform (input_location, "parameter passing for argument of type "
>   "%qT changed in GCC 9.1", type);
> dw_align = true;


Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-03-27 Thread Richard Sandiford
Matthias Kretz  writes:
> Hi Richard,
>
> sorry for not answering sooner. I took action on your mail but failed to also 
> give feedback. Now in light of your veto of Srinivas patch I wanted to use 
> the 
> opportunity to pick this up again.
>
> On Dienstag, 23. Januar 2024 21:57:23 CET Richard Sandiford wrote:
>> However, we also support different vector lengths for streaming SVE
>> (running in "streaming" mode on SME) and non-streaming SVE (running
>> in "non-streaming" mode on the core).  Having two different lengths is
>> expected to be the common case, rather than a theoretical curiosity.
>
> I read up on this after you mentioned this for the first time. As a WG21 
> member I find the approach troublesome - but that's a bit off-topic for this 
> thread.
>
> The big issue here is that, IIUC, a user (and the simd library) cannot do the 
> right thing at the moment. There simply isn't enough context information 
> available when parsing the  header. I.e. on definition of 
> the class template there's no facility to take target_clones or SME 
> "streaming" mode into account. Consequently, if we want the library to be fit 
> for SME, then we need more language extension(s) to make it work.

Yeah.  I think the same applies to plain SVE.  It seems reasonable to
have functions whose implementation is specialised for a specific SVE
length, with that function being selected at runtime where appropriate.
Those functions needn't (in principle) be in separate TUs.  The “best”
definition of native then becomes a per-function property rather
than a per-TU property.

As you note later, I think the same thing would apply to x86_64.

> I guess I'm looking for a way to declare types that are different depending 
> on 
> whether they are used in streaming mode or non-streaming mode (making them 
> ill-formed to use in functions marked arm_streaming_compatible).
>
> From reading through https://arm-software.github.io/acle/main/
> acle.html#controlling-the-use-of-streaming-mode I don't see any discussion of 
> member functions or ctor/dtor, static and non-static data members, etc.
>
> The big issue I see here is that currently all of std::* is declared without 
> a 
> arm_streaming or arm_streaming_compatible. Thus, IIUC, you can't use anything 
> from the standard library in streaming mode. Since that also applies to 
> std::experimental::simd, we're not creating a new footgun, only missing out 
> on 
> potential users?

Kind-of.  However, we can inline a non-streaming function into a streaming
function if that doesn't change defined behaviour.  And that's important
in practice for C++, since most trivial inline functions will not be
marked streaming-compatible despite being so in practice.

It's UB to pass and return SVE vectors across streaming/non-streaming
boundaries unless the two VLs are equal.  It's therefore valid to inline
such functions into streaming functions *unless* the callee uses
non-streaming-only instructions such as gather loads.

Because of that, someone trying to use std::experimenal::simd in SME
functions is likely to succeed, at least in simple cases.

> Some more thoughts on target_clones/streaming SVE language extension 
> evolution:
>
>   void nonstreaming_fn(void) {
> constexpr int width = __arm_sve_bits(); // e.g. 512
> constexpr int width2 = __builtin_vector_size(); // e.g. 64 (the
>   // vector_size attribute works with bytes, not bits)
>   }
>
>   __attribute__((arm_locally_streaming))
>   void streaming_fn(void) {
> constexpr int width = __arm_sve_bits(); // e.g. 128
> constexpr int width2 = __builtin_vector_size(); // e.g. 16
>   }
>
>   __attribute__((target_clones("sse4.2,avx2")))
>   void streaming_fn(void) {
> constexpr int width = __builtin_vector_size(); // 16 in the sse4.2 clone
>   // and 32 in the avx2 clone
>   }
>
> ... as a starting point for exploration. Given this, I'd still have to resort 
> to a macro to define a "native" simd type:
>
> #define NATIVE_SIMD(T) std::experimental::simd CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>
>
> Getting rid of the macro seems to be even harder.

Yeah.  The constexprs in the AArch64 functions would only be compile-time
constants in to-be-defined circumstances, using some mechanism to specify
the streaming and non-streaming vector lengths at compile time.  But that's
a premise of the whole discussion, just noting it for the record in case
anyone reading this later jumps in at this point.

> A declaration of an alias like
>
> template 
> using SveSimd = std::experimental::simd CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>;
>
> would have to delay "invoking" __arm_sve_bits() until it knows its context:
>
>   void nonst

Re: [PATCH v2] libstdc++: add ARM SVE support to std::experimental::simd

2024-03-27 Thread Richard Sandiford
Matthias Kretz  writes:
> On Wednesday, 27 March 2024 11:07:14 CET Richard Sandiford wrote:
>> I'm still worried about:
>> 
>>   #if _GLIBCXX_SIMD_HAVE_SVE
>>   constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS
>> / 8; #else
>>   constexpr inline int __sve_vectorized_size_bytes = 0;
>>   #endif
>> 
>> and the direct use __ARM_FEATURE_SVE_BITS elsewhere, for the reasons
>> discussed here (including possible ODR problems):
>> 
>>   https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640037.html
>>   https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643734.html
>> 
>> Logically the vector length should be a template parameter rather than
>> an invariant.  Has this been resolved?  If not, it feels like a blocker
>> to me (sorry).
>
> The vector length is always a template parameter to all user-facing API. Some 
> examples
>
> 1. on aarch64 the following is independent of SVE flags (and status quo):
>
>   simd is an alias for
>   simd
>
>   fixed_size_simd is supposed to be ABI-stable anyway (passed via
>   the stack, alignof == sizeof).
>
> 2. with -msve-vector-bits=512:
>
>   native_simd is an alias for
>   simd>
>
>   simd> is an alias for
>   simd>
>
> 3. with -msve-vector-bits=256: 
>
>   native_simd is an alias for
>   simd>
>
>   simd> is an alias for
>   simd>
>
> Implementation functions are either [[gnu::always_inline]] or tagged with the 
> ABI tag type and the __odr_helper template argument (to ensure not-inlined 
> inline functions have unique names).

Ah, thanks for the explanation.  I think the global native_float alias
is problematic for reasons that you touched on in your later message.
I'll reply more about that there.  But in other respects this looks good.

> Does that make __ARM_FEATURE_SVE_BITS usage indirect enough?

In principle, the only use of __ARM_FEATURE_SVE_BITS should be to determine
the definition of native_simd (with the caveats above).  But current
GCC restrictions might make that impractical.

> Also for context, please consider that this is std::*experimental*::simd. The 
> underlying ISO document will likely get retracted at some point and the whole 
> API and implementation (hopefully) superseded by C++26. The main purpose of 
> the spec and implementation is to gather experience.

Ah, ok.  If this is a deliberate experiment for evidence-gathering
purposes, rather than a long-term commitment, then I agree the barrier
should be lower.

So yeah, I'll withdraw my objection.  I've no problem with this going
into GCC 14 on the basis above.  Thanks again to you and Srinivas for
working on this.

Richard


Re: [PATCH v2] libstdc++: add ARM SVE support to std::experimental::simd

2024-03-27 Thread Richard Sandiford
Jonathan Wakely  writes:
> On Fri, 8 Mar 2024 at 09:58, Matthias Kretz wrote:
>>
>> Hi,
>>
>> I applied and did extended testing on x86_64 (no regressions) and aarch64
>> using qemu testing SVE 256, 512, and 1024. Looks good!
>>
>> While going through the applied patch I noticed a few style issues that I
>> simply turned into a patch (attached).
>>
> [...]
>>
>> From my side, with the noted changes the patch is ready for merging.
>> @Jonathan, any chance for a green light before GCC 14.1?
>
> As discussed on IRC, please push the revised patch with your
> suggestions incorporated (and post to the lists for posterity).
>
> Thanks, everybody, for the patches and the thorough review.

I'm still worried about:

  #if _GLIBCXX_SIMD_HAVE_SVE
  constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS / 8;
  #else
  constexpr inline int __sve_vectorized_size_bytes = 0;
  #endif

and the direct use __ARM_FEATURE_SVE_BITS elsewhere, for the reasons
discussed here (including possible ODR problems):

  https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640037.html
  https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643734.html

Logically the vector length should be a template parameter rather than
an invariant.  Has this been resolved?  If not, it feels like a blocker
to me (sorry).

Thanks,
Richard


Re: [pushed] aarch64: Define out-of-class static constants

2024-03-26 Thread Richard Sandiford
Vaseeharan Vinayagamoorthy  writes:
> Hi Richard,
>
> I think this patch is breaking the build of aarch64-none-elf and 
> aarch64-none-linux-gnu targets, when building with GCC 4.8.
> This is not an issue when building with GCC 7.5.
>
> Kind regards,
> Vasee

Thanks.  I pushed the attached patch to fix it.

Richard

---

GCC 4.8 complained about the use of const rather than constexpr
for out-of-line static constexprs.

gcc/
* config/aarch64/aarch64-feature-deps.h: Use constexpr for
out-of-line statics.
---
 gcc/config/aarch64/aarch64-feature-deps.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-feature-deps.h 
b/gcc/config/aarch64/aarch64-feature-deps.h
index 3641badb82f..79126db8825 100644
--- a/gcc/config/aarch64/aarch64-feature-deps.h
+++ b/gcc/config/aarch64/aarch64-feature-deps.h
@@ -71,9 +71,9 @@ template struct info;
 static constexpr auto enable = flag | get_enable REQUIRES; \
 static constexpr auto explicit_on = enable | get_enable EXPLICIT_ON; \
   };   \
-  const aarch64_feature_flags info::flag;  \
-  const aarch64_feature_flags info::enable;\
-  const aarch64_feature_flags info::explicit_on; \
+  constexpr aarch64_feature_flags info::flag;  \
+  constexpr aarch64_feature_flags info::enable;
\
+  constexpr aarch64_feature_flags info::explicit_on; \
   constexpr info IDENT ()  \
   {\
 return info ();\
-- 
2.25.1



Re: [PATCH] aarch64: Add +lse128 architectural extension command-line flag

2024-03-26 Thread Richard Sandiford
Victor Do Nascimento  writes:
> Given how, at present, the choice of using LSE128 atomic instructions
> by the toolchain is delegated to run-time selection in the form of
> Libatomic ifuncs, responsible for querying target support, the
> `+lse128' target architecture compile-time flag is absent from GCC.
>
> This, however, contrasts with the Binutils implementation, which gates
> LSE128 instructions behind the `+lse128' flag.  This can lead to
> problems in GCC for certain use-cases.  One such example is in the use
> of inline assembly, whereby the inability of enabling the feature in
> the command-line prevents the compiler from automatically issuing the
> necessary LSE128 `.arch' directive.
>
> This patch therefore brings GCC into alignment with LLVM and Binutils
> in adding support for the `+lse128' architectural extension flag.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-option-extensions.def: Add LSE128
>   AARCH64_OPT_EXTENSION, adding it as a dependency for the D128
>   feature.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/lse128-flag.c: New.
>   * gcc.target/aarch64/cpunative/info_23: Likewise.
>   * gcc.target/aarch64/cpunative/native_cpu_23.c: Likewise.

The new extension should be documented in doc/invoke.texi.

> ---
>  gcc/config/aarch64/aarch64-option-extensions.def  |  4 +++-
>  gcc/testsuite/gcc.target/aarch64/cpunative/info_23|  8 
>  .../gcc.target/aarch64/cpunative/native_cpu_23.c  | 11 +++
>  gcc/testsuite/gcc.target/aarch64/lse128-flag.c| 10 ++
>  4 files changed, 32 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/cpunative/info_23
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_23.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/lse128-flag.c
>
> diff --git a/gcc/config/aarch64/aarch64-option-extensions.def 
> b/gcc/config/aarch64/aarch64-option-extensions.def
> index 1a3b91c68cf..ac54b899a06 100644
> --- a/gcc/config/aarch64/aarch64-option-extensions.def
> +++ b/gcc/config/aarch64/aarch64-option-extensions.def
> @@ -275,7 +275,9 @@ AARCH64_OPT_EXTENSION("mops", MOPS, (), (), (), "")
>  
>  AARCH64_OPT_EXTENSION("cssc", CSSC, (), (), (), "cssc")
>  
> -AARCH64_OPT_EXTENSION("d128", D128, (), (), (), "d128")
> +AARCH64_OPT_EXTENSION("lse128", LSE128, (LSE), (), (), "lse128")
> +
> +AARCH64_OPT_EXTENSION("d128", D128, (LSE128), (), (), "d128")
>  
>  AARCH64_OPT_EXTENSION("the", THE, (), (), (), "the")
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/cpunative/info_23 
> b/gcc/testsuite/gcc.target/aarch64/cpunative/info_23
> new file mode 100644
> index 000..d77c25d2f61
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/cpunative/info_23
> @@ -0,0 +1,8 @@
> +processor: 0
> +BogoMIPS : 100.00
> +Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 asimddp atomics 
> lse128
> +CPU implementer  : 0xfe
> +CPU architecture: 8
> +CPU variant  : 0x0
> +CPU part : 0xd08
> +CPU revision : 2
> \ No newline at end of file
> diff --git a/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_23.c 
> b/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_23.c
> new file mode 100644
> index 000..8a1e235d8ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/cpunative/native_cpu_23.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target { { aarch64*-*-linux*} && native } } } */
> +/* { dg-set-compiler-env-var GCC_CPUINFO 
> "$srcdir/gcc.target/aarch64/cpunative/info_23" } */
> +/* { dg-additional-options "-mcpu=native" } */
> +
> +int main()
> +{
> +  return 0;
> +}
> +
> +/* { dg-final { scan-assembler {\.arch 
> armv8-a\+dotprod\+crc\+crypto\+lse128} } } */
> +/* Test one where lse128 is available and so should be emitted.  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/lse128-flag.c 
> b/gcc/testsuite/gcc.target/aarch64/lse128-flag.c
> new file mode 100644
> index 000..71339c3af6d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/lse128-flag.c
> @@ -0,0 +1,10 @@
> +/* { dg-do compile { target { aarch64*-*-*} } } */
> +/* { dg-additional-options "-march=armv9.4-a+lse128" } */
> +
> +int main()
> +{
> +  return 0;
> +}
> +
> +/* { dg-final { scan-assembler {\.arch armv9\.4-a\+crc\+lse128} } } */
> +/* Test a normal looking procinfo.  */

Not sure I understand the comment.  Is procinfo part of this test?

Looks good otherwise.

Thanks,
Richard


Re: [PATCH] gomp: testsuite: improve compatibility of bad-array-section-3.c [PR113428]

2024-03-08 Thread Richard Sandiford
Richard Earnshaw  writes:
> This test generates different warnings on ilp32 targets because the size
> of an integer matches the size of a pointer.  Avoid this by using
> signed char.
>
> gcc/testsuite:
>
>   PR testsuite/113428
>   * gcc.dg/gomp/bad-array-section-c-3.c: Use signed char instead
>   of int.
> ---
>
> I think this fixes the issues seen on ilp32 machines, without substantially
> changing what the test does, but a second set of eyes wouldn't hurt.
>
>  gcc/testsuite/gcc.dg/gomp/bad-array-section-c-3.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

It looks good to me FWIW, but I'd feel more comfortable with an OK
from Jakub.

OK midday Monday UK time if no-one objects by then.

Richard

>
> diff --git a/gcc/testsuite/gcc.dg/gomp/bad-array-section-c-3.c 
> b/gcc/testsuite/gcc.dg/gomp/bad-array-section-c-3.c
> index 8be15ced8c0..431af71c422 100644
> --- a/gcc/testsuite/gcc.dg/gomp/bad-array-section-c-3.c
> +++ b/gcc/testsuite/gcc.dg/gomp/bad-array-section-c-3.c
> @@ -1,15 +1,15 @@
>  /* { dg-do compile } */
>  
>  struct S {
> -  int *ptr;
> +  signed char *ptr;
>  };
>  
>  int main()
>  {
> -  int arr[20];
> +  signed char arr[20];
>  
>/* Reject array section in compound initialiser.  */
> -#pragma omp target map( (struct S) { .ptr = (int *) arr[5:5] } )
> +#pragma omp target map( (struct S) { .ptr = (signed char *) arr[5:5] } )
>  /* { dg-error {expected '\]' before ':' token} "" { target *-*-* } .-1 } */
>  /* { dg-warning {cast to pointer from integer of different size} "" { target 
> *-*-* } .-2 } */
>  /* { dg-message {sorry, unimplemented: unsupported map expression} "" { 
> target *-*-* } .-3 } */


Re: [r14-9173 Regression] FAIL: gcc.dg/tree-ssa/andnot-2.c scan-tree-dump-not forwprop3 "_expr" on Linux/x86_64

2024-03-07 Thread Richard Sandiford
Sorry, still catching up on email, but: 

Richard Biener  writes:
> We have optimize_vectors_before_lowering_p but we shouldn't even there
> turn supported into not supported ops and as said, what's supported or
> not cannot be finally decided (if it's only vcond and not vcond_mask
> that is supported).  Also optimize_vectors_before_lowering_p is set
> for a short time between vectorization and vector lowering and we
> definitely do not want to turn supported vectorizer emitted stmts
> into ones that we need to lower.  For GCC 15 we should see to move
> vector lowering before vectorization (before loop optimization I'd
> say) to close this particula hole (and also reliably ICE when the
> vectorizer creates unsupported IL).  We also definitely want to
> retire vcond expanders (no target I know of supports single-instruction
> compare-and-select).

...definitely agree with this FWIW.  Sounds like a much cleaner approach.

One of the main tricks that vcond*s tend to do is invert "difficult"
comparisons and swap the data operands to match.  But I think we should
move to a situation where targets don't provide comparison patterns
that require an inversion, and instead move inversions to generic code.

Richard


Re: [PATCH] libatomic: Fix build for --disable-gnu-indirect-function [PR113986]

2024-03-07 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
> Always build atomic_16.S and add aliases to the __atomic_* functions if
> !HAVE_IFUNC.

This description is too brief for me.  Could you say in detail how the
new scheme works?  E.g. the description doesn't explain:

> -if ARCH_AARCH64_HAVE_LSE128
> -AM_CPPFLAGS   = -DHAVE_FEAT_LSE128
> -endif

And what's the purpose of ARCH_AARCH64_HAVE_LSE128 after this change?

Is the indirection via ALIAS2 necessary?  Couldn't ENTRY just define
the __atomic_* symbols directly, as non-hidden, if we remove the
libat_ prefix?  That would make it easier to ensure that the lists
are kept up-to-date.

Shouldn't we skip the ENTRY_FEAT functions and existing aliases
if !HAVE_IFUNC?

I think it'd be worth (as a prepatch) splitting the file into two
#included subfiles, one that contains the base AArch64 routines and one
that contains the optimised versions.  The former would then be #included
for all builds while the latter would be specific to HAVE_IFUNC.

Thanks,
Richard

> Passes regress and bootstrap, OK for commit?
>
> libatomic:
> PR target/113986
> * Makefile.in: Regenerated.
> * Makefile.am: Make atomic_16.S not depend on HAVE_IFUNC.
> Remove predefine of HAVE_FEAT_LSE128.
> * config/linux/aarch64/atomic_16.S: Add __atomic_ aliases if 
> !HAVE_IFUNC. 
> * config/linux/aarch64/host-config.h: Correctly handle !HAVE_IFUNC.
>
> ---
>
> diff --git a/libatomic/Makefile.am b/libatomic/Makefile.am
> index 
> d49c44c7d5fbe83061fddd1f8ef4813a39eb1b8b..980677f353345c050f6cef2d57090360216c56cf
>  100644
> --- a/libatomic/Makefile.am
> +++ b/libatomic/Makefile.am
> @@ -130,12 +130,8 @@ libatomic_la_LIBADD = $(foreach s,$(SIZES),$(addsuffix 
> _$(s)_.lo,$(SIZEOBJS)))
>  ## On a target-specific basis, include alternates to be selected by IFUNC.
>  if HAVE_IFUNC
>  if ARCH_AARCH64_LINUX
> -if ARCH_AARCH64_HAVE_LSE128
> -AM_CPPFLAGS   = -DHAVE_FEAT_LSE128
> -endif
>  IFUNC_OPTIONS = -march=armv8-a+lse
>  libatomic_la_LIBADD += $(foreach s,$(SIZES),$(addsuffix 
> _$(s)_1_.lo,$(SIZEOBJS)))
> -libatomic_la_SOURCES += atomic_16.S
>  
>  endif
>  if ARCH_ARM_LINUX
> @@ -155,6 +151,10 @@ libatomic_la_LIBADD += $(addsuffix 
> _16_1_.lo,$(SIZEOBJS)) \
>  endif
>  endif
>  
> +if ARCH_AARCH64_LINUX
> +libatomic_la_SOURCES += atomic_16.S
> +endif
> +
>  libatomic_convenience_la_SOURCES = $(libatomic_la_SOURCES)
>  libatomic_convenience_la_LIBADD = $(libatomic_la_LIBADD)
>  
> diff --git a/libatomic/Makefile.in b/libatomic/Makefile.in
> index 
> 11c8ec7ba15ba7da5ef55e90bd836317bc270061..d9d529bc502d4ce7b9997640d5f40f5d5cc1232c
>  100644
> --- a/libatomic/Makefile.in
> +++ b/libatomic/Makefile.in
> @@ -90,17 +90,17 @@ build_triplet = @build@
>  host_triplet = @host@
>  target_triplet = @target@
>  @ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_1 = $(foreach 
> s,$(SIZES),$(addsuffix _$(s)_1_.lo,$(SIZEOBJS)))
> -@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_2 = atomic_16.S
> -@ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_3 = $(foreach \
> +@ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_2 = $(foreach \
>  @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@   s,$(SIZES),$(addsuffix \
>  @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@   _$(s)_1_.lo,$(SIZEOBJS))) \
>  @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@   $(addsuffix \
>  @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@   _8_2_.lo,$(SIZEOBJS)) \
>  @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@   tas_1_2_.lo
> -@ARCH_I386_TRUE@@HAVE_IFUNC_TRUE@am__append_4 = $(addsuffix 
> _8_1_.lo,$(SIZEOBJS))
> -@ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@am__append_5 = $(addsuffix 
> _16_1_.lo,$(SIZEOBJS)) \
> +@ARCH_I386_TRUE@@HAVE_IFUNC_TRUE@am__append_3 = $(addsuffix 
> _8_1_.lo,$(SIZEOBJS))
> +@ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@am__append_4 = $(addsuffix 
> _16_1_.lo,$(SIZEOBJS)) \
>  @ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@ $(addsuffix 
> _16_2_.lo,$(SIZEOBJS))
>  
> +@ARCH_AARCH64_LINUX_TRUE@am__append_5 = atomic_16.S
>  subdir = .
>  ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
>  am__aclocal_m4_deps = $(top_srcdir)/../config/acx.m4 \
> @@ -156,8 +156,7 @@ am__uninstall_files_from_dir = { \
>}
>  am__installdirs = "$(DESTDIR)$(toolexeclibdir)"
>  LTLIBRARIES = $(noinst_LTLIBRARIES) $(toolexeclib_LTLIBRARIES)
> -@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__objects_1 =  \
> -@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@   atomic_16.lo
> +@ARCH_AARCH64_LINUX_TRUE@am__objects_1 = atomic_16.lo
>  am_libatomic_la_OBJECTS = gload.lo gstore.lo gcas.lo gexch.lo \
>   glfree.lo lock.lo init.lo fenv.lo fence.lo flag.lo \
>   $(am__objects_1)
> @@ -425,7 +424,7 @@ libatomic_la_LDFLAGS = $(libatomic_version_info) 
> $(libatomic_version_script) \
>   $(lt_host_flags) $(libatomic_darwin_rpath)
>  
>  libatomic_la_SOURCES = gload.c gstore.c gcas.c gexch.c glfree.c lock.c \
> - init.c fenv.c fence.c flag.c $(am__append_2)
> + init.c 

Re: [PATCH 2/2] aarch64: Add support for _BitInt

2024-03-07 Thread Richard Sandiford
"Andre Vieira (lists)"  writes:
> Hey,
>
> Dropped the first patch and dealt with the comments above, hopefully I 
> didn't miss any this time.
>
> --
>
> This patch adds support for C23's _BitInt for the AArch64 port when 
> compiling
> for little endianness.  Big Endianness requires further target-agnostic
> support and we therefor disable it for now.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (TARGET_C_BITINT_TYPE_INFO): Declare MACRO.
>   (aarch64_bitint_type_info): New function.
>   (aarch64_return_in_memory_1): Return large _BitInt's in memory.
>   (aarch64_function_arg_alignment): Adapt to correctly return the ABI
>   mandated alignment of _BitInt(N) where N > 128 as the alignment of
>   TImode.
>   (aarch64_composite_type_p): Return true for _BitInt(N), where N > 128.
>
> libgcc/ChangeLog:
>
>   * config/aarch64/t-softfp (softfp_extras): Add floatbitinthf,
>   floatbitintbf, floatbitinttf and fixtfbitint.
>   * config/aarch64/libgcc-softfp.ver (GCC_14.0.0): Add __floatbitinthf,
>   __floatbitintbf, __floatbitinttf and __fixtfbitint.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/bitint-alignments.c: New test.
>   * gcc.target/aarch64/bitint-args.c: New test.
>   * gcc.target/aarch64/bitint-sizes.c: New test.
>
>
> On 02/02/2024 14:46, Jakub Jelinek wrote:
>> On Thu, Jan 25, 2024 at 05:45:01PM +, Andre Vieira wrote:
>>> This patch adds support for C23's _BitInt for the AArch64 port when 
>>> compiling
>>> for little endianness.  Big Endianness requires further target-agnostic
>>> support and we therefor disable it for now.
>>>
>>> gcc/ChangeLog:
>>>
>>> * config/aarch64/aarch64.cc (TARGET_C_BITINT_TYPE_INFO): Declare MACRO.
>>> (aarch64_bitint_type_info): New function.
>>> (aarch64_return_in_memory_1): Return large _BitInt's in memory.
>>> (aarch64_function_arg_alignment): Adapt to correctly return the ABI
>>> mandated alignment of _BitInt(N) where N > 128 as the alignment of
>>> TImode.
>>> (aarch64_composite_type_p): Return true for _BitInt(N), where N > 128.
>>>
>>> libgcc/ChangeLog:
>>>
>>> * config/aarch64/t-softfp: Add fixtfbitint, floatbitinttf and
>>> floatbitinthf to the softfp_extras variable to ensure the
>>> runtime support is available for _BitInt.
>> 
>> I think this lacks some config/aarch64/t-whatever.ver
>> additions.
>> See PR113700 for some more details.
>> We want the support routines for binary floating point <-> _BitInt
>> conversions in both libgcc.a and libgcc_s.so.1 and exported from the latter
>> too at GCC_14.0.0 symver, while decimal floating point <-> _BitInt solely in
>> libgcc.a (as with all the huge dfp/bid stuff).
>> 
>>  Jakub
>> 
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 16318bf925883ecedf9345e53fc0824a553b2747..9bd8d22f6edd9f6c77907ec383f9e8bf055cfb8b
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -6583,6 +6583,7 @@ aarch64_return_in_memory_1 (const_tree type)
>int count;
>  
>if (!AGGREGATE_TYPE_P (type)
> +  && TREE_CODE (type) != BITINT_TYPE
>&& TREE_CODE (type) != COMPLEX_TYPE
>&& TREE_CODE (type) != VECTOR_TYPE)
>  /* Simple scalar types always returned in registers.  */
> @@ -21895,6 +21896,11 @@ aarch64_composite_type_p (const_tree type,
>if (type && (AGGREGATE_TYPE_P (type) || TREE_CODE (type) == COMPLEX_TYPE))
>  return true;
>  
> +  if (type
> +  && TREE_CODE (type) == BITINT_TYPE
> +  && int_size_in_bytes (type) > 16)
> +return true;
> +

Think I probably said this before, but for the record: I don't think
the above code has any practical effect, but I agree it's probably better
to include it for completeness.

>if (mode == BLKmode
>|| GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
>|| GET_MODE_CLASS (mode) == MODE_COMPLEX_INT)
> @@ -28400,6 +28406,42 @@ aarch64_excess_precision (enum excess_precision_type 
> type)
>return FLT_EVAL_METHOD_UNPREDICTABLE;
>  }
>  
> +/* Implement TARGET_C_BITINT_TYPE_INFO.
> +   Return true if _BitInt(N) is supported and fill its details into *INFO.  
> */
> +bool
> +aarch64_bitint_type_info (int n, struct bitint_info *info)
> +{
> +  if (TARGET_BIG_END)
> +return false;
> +
> +  if (n <= 8)
> +info->limb_mode = QImode;
> +  else if (n <= 16)
> +info->limb_mode = HImode;
> +  else if (n <= 32)
> +info->limb_mode = SImode;
> +  else if (n <= 64)
> +info->limb_mode = DImode;
> +  else if (n <= 128)
> +info->limb_mode = TImode;
> +  else
> +/* The AAPCS for AArch64 defines _BitInt(N > 128) as an array with
> +   type {signed,unsigned} __int128[M] where M*128 >= N.  However, to be
> +   able to use libgcc's implementation to support large _BitInt's we need
> +   to use a LIMB_MODE that is no larger than 'long long'.  This is why we
> +   use DImode for our internal 

Re: [PATCH] AArch64: memcpy/memset expansions should not emit LDP/STP [PR113618]

2024-03-07 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Hi Richard,
>
>> It looks like this is really doing two things at once: disabling the
>> direct emission of LDP/STP Qs, and switching the GPR handling from using
>> pairs of DImode moves to single TImode moves.  At least, that seems to be
>> the effect of...
>
> No it still uses TImode for the !TARGET_SIMD case.
>
>> +   if (GET_MODE_SIZE (mode_iter.require ()) <= MIN (size, 16))
>> + mode = mode_iter.require ();
>
>> ...hard-coding 16 here and...
>
> This only affects the Q register case.
>
>> -  if (size > 0 && size < copy_max / 2 && !STRICT_ALIGNMENT)
>> +  if (size > 0 && size < 16 && !STRICT_ALIGNMENT)
>
>> ...changing this limit from 8 to 16 for non-SIMD copies.
>>
>> Is that deliberate?  If so, please mention that kind of thing in the
>> covering note.  It sounded like this was intended to change the handling
>> of vector moves only.
>
> Yes it's deliberate. It now basically treats everything as blocks of 16 bytes
> which has a nice simplifying effect. I've added a note.
>
>> This means that, for GPRs, we are now effectively using the double-word
>> move patterns to get an LDP/STP indirectly, rather than directly as before.
>
> No, there is no difference here.
>
>> That seems OK, and I suppose might be slightly preferable to the current
>> code for things like:
>>
>>  char a[31], b[31];
>>  void f() { __builtin_memcpy(a, b, 31); }
>
> Yes an unaligned tail improves slightly by using blocks of 16 bytes.
> It's a very rare case, both -mgeneral-regs-only is rarely used, and most
> fixed-size copies are a nice multiple of 8.
>
>> But that raises the question: should we do the same thing for Q registers
>> and V2x16QImode?
>
> I don't believe it makes sense to use those complex types. And it likely
> blocks optimizations in a similar way as UNSPEC does.

A V2x16QImode move isn't particularly special as far as target-
independent code is concerned.  It's just an ordinary move of an
ordinary vector mode.  And the vector modes that we're picking here
generally have nothing to do with the source data.

But I'd forgotten about:

  /* On LE, for AdvSIMD, don't support anything other than POST_INC or
 REG addressing.  */
  if (advsimd_struct_p
  && TARGET_SIMD
  && !BYTES_BIG_ENDIAN
  && (code != POST_INC && code != REG))
return false;

> v2: Rebase to trunk
>
> The new RTL introduced for LDP/STP results in regressions due to use of 
> UNSPEC.
> Given the new LDP fusion pass is good at finding LDP opportunities, change the
> memcpy, memmove and memset expansions to emit single vector loads/stores.
> This fixes the regression and enables more RTL optimization on the standard
> memory accesses.  Handling of unaligned tail of memcpy/memmove is improved
> with -mgeneral-regs-only.  SPEC2017 performance improves slightly.  Codesize
> is a bit worse due to missed LDP opportunities as discussed in the PR.
>
> Passes regress, OK for commit?
>
> gcc/ChangeLog:
> PR target/113618
> * config/aarch64/aarch64.cc (aarch64_copy_one_block): Remove. 
> (aarch64_expand_cpymem): Emit single load/store only.
> (aarch64_set_one_block): Emit single stores only.
>
> gcc/testsuite/ChangeLog:
> PR target/113618
> * gcc.target/aarch64/pr113618.c: New test.

OK, thanks.

Richard

> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 16318bf925883ecedf9345e53fc0824a553b2747..0a28e033088a00818c6ed9fa8c15ecdee5a86c35
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -26465,33 +26465,6 @@ aarch64_progress_pointer (rtx pointer)
>return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
>  }
>  
> -typedef auto_vec, 12> copy_ops;
> -
> -/* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
> -static void
> -aarch64_copy_one_block (copy_ops , rtx src, rtx dst,
> - int offset, machine_mode mode)
> -{
> -  /* Emit explict load/store pair instructions for 32-byte copies.  */
> -  if (known_eq (GET_MODE_SIZE (mode), 32))
> -{
> -  mode = V4SImode;
> -  rtx src1 = adjust_address (src, mode, offset);
> -  rtx dst1 = adjust_address (dst, mode, offset);
> -  rtx reg1 = gen_reg_rtx (mode);
> -  rtx reg2 = gen_reg_rtx (mode);
> -  rtx load = aarch64_gen_load_pair (reg1, reg2, src1);
> -  rtx store = aarch64_gen_store_pair (dst1, reg1, reg2);
> -  ops.safe_push ({ load, store });
> -  return;
> -}
> -
> -  rtx reg = gen_reg_rtx (mode);
> -  rtx load = gen_move_insn (reg, adjust_address (src, mode, offset));
> -  rtx store = gen_move_insn (adjust_address (dst, mode, offset), reg);
> -  ops.safe_push ({ load, store });
> -}
> -
>  /* Expand a cpymem/movmem using the MOPS extension.  OPERANDS are taken
> from the cpymem/movmem pattern.  IS_MEMMOVE is true if this is a memmove
> rather than memcpy.  Return true iff we succeeded.  */
> @@ -26527,7 +26500,7 @@ 

Re: [PATCH] aarch64: Fix costing of manual bfi instructions

2024-03-07 Thread Richard Sandiford
Andrew Pinski  writes:
> This fixes the cost model for BFI instructions which don't
> use directly zero_extract on the LHS.
> aarch64_bfi_rtx_p does the heavy lifting by matching of
> the patterns.
>
> Note this alone does not fix PR 107270, it is a step in the right
> direction. There we get z zero_extend for the non-shifted part
> which we don't currently match.
>
> Built and tested on aarch64-linux-gnu with no regressions.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_bfi_rtx_p): New function.
>   (aarch64_rtx_costs): For IOR, try calling aarch64_bfi_rtx_p.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.cc | 94 +++
>  1 file changed, 94 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 3d8341c17fe..dc5c5c23cb3 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -13776,6 +13776,90 @@ aarch64_extr_rtx_p (rtx x, rtx *res_op0, rtx 
> *res_op1)
>return false;
>  }
>  
> +/* Return true iff X is an rtx that will match an bfi instruction
> +   i.e. as described in the *aarch64_bfi5 family of patterns.
> +   OP0 and OP1 will be set to the operands of the insert involved
> +   on success and will be NULL_RTX otherwise.  */
> +
> +static bool
> +aarch64_bfi_rtx_p (rtx x, rtx *res_op0, rtx *res_op1)

I think it'd be slightly neater to pass an XEXP index in as well,
and use it...

> +{
> +  rtx op0, op1;
> +  scalar_int_mode mode;
> +
> +  *res_op0 = NULL_RTX;
> +  *res_op1 = NULL_RTX;
> +  if (!is_a  (GET_MODE (x), ))
> +return false;
> +
> +  if (GET_CODE (x) != IOR)
> +return false;
> +
> +  unsigned HOST_WIDE_INT mask1;
> +  unsigned HOST_WIDE_INT shft_amnt;
> +  unsigned HOST_WIDE_INT mask2;
> +  rtx shiftop;
> +
> +  rtx iop0 = XEXP (x, 0);
> +  rtx iop1 = XEXP (x, 1);

...here as opno and 1 - opno.  That way we don't need to...

> +
> +  if (GET_CODE (iop0) == AND
> +  && CONST_INT_P (XEXP (iop0, 1))
> +  && GET_CODE (XEXP (iop0, 0)) != ASHIFT)
> +{
> +  op0 = XEXP (iop0, 0);
> +  mask1 = UINTVAL (XEXP (iop0, 1));
> +  shiftop = iop1;
> +}
> +  else if (GET_CODE (iop1) == AND
> +  && CONST_INT_P (XEXP (iop1, 1))
> +  && GET_CODE (XEXP (iop1, 0)) != ASHIFT)
> +{
> +  op0 = XEXP (iop1, 0);
> +  mask1 = UINTVAL (XEXP (iop1, 1));
> +  shiftop = iop0;
> +}
> +  else
> +return false;

...handle this both ways, and don't need to exclude ASHIFT.

Maybe some variation on "insert_op" would be better than "shiftop",
since the operand might not include a shift.

Looks generally good to me otherwise FWIW, but obviously GCC 15 material.

Thanks,
Richard

> +
> +  /* Shifted with no mask. */
> +  if (GET_CODE (shiftop) == ASHIFT
> +  && CONST_INT_P (XEXP (shiftop, 1)))
> +{
> +  shft_amnt = UINTVAL (XEXP (shiftop, 1));
> +  mask2 = HOST_WIDE_INT_M1U << shft_amnt;
> +  op1 = XEXP (shiftop, 0);
> +}
> +   else if (GET_CODE (shiftop) == AND
> + && CONST_INT_P (XEXP (shiftop, 1)))
> +{
> +  mask2 = UINTVAL (XEXP (shiftop, 1));
> +  if (GET_CODE (XEXP (shiftop, 0)) == ASHIFT
> +   && CONST_INT_P (XEXP (XEXP (shiftop, 0), 1)))
> + {
> +   op1 = XEXP (XEXP (shiftop, 0), 0);
> +   shft_amnt = UINTVAL (XEXP (XEXP (shiftop, 0), 1));
> + }
> +  else
> + {
> +   op1 = XEXP (shiftop, 0);
> +   shft_amnt = 0;
> + }
> +}
> +  else
> +return false;
> +
> +  if (shft_amnt >= GET_MODE_BITSIZE (mode))
> +return false;
> +
> +  if (!aarch64_masks_and_shift_for_bfi_p (mode, mask1, shft_amnt, mask2))
> +return false;
> +
> +  *res_op0 = op0;
> +  *res_op1 = op1;
> +  return true;
> +}
> +
>  /* Calculate the cost of calculating (if_then_else (OP0) (OP1) (OP2)),
> storing it in *COST.  Result is true if the total cost of the operation
> has now been calculated.  */
> @@ -14662,6 +14746,16 @@ cost_plus:
> return true;
>  }
>  
> +  if (aarch64_bfi_rtx_p (x, , ))
> + {
> +   *cost += rtx_cost (op0, mode, IOR, 0, speed);
> +   *cost += rtx_cost (op0, mode, IOR, 1, speed);
> +   if (speed)
> + *cost += extra_cost->alu.bfi;
> +
> +   return true;
> + }
> +
>if (aarch64_extr_rtx_p (x, , ))
>  {
> *cost += rtx_cost (op0, mode, IOR, 0, speed);


Re: [PATCH 2/2] aarch64: Support `{1.0f, 1.0f, 0.0, 0.0}` CST forming with fmov with a smaller vector type.

2024-03-07 Thread Richard Sandiford
Andrew Pinski  writes:
> This enables construction of V4SF CST like `{1.0f, 1.0f, 0.0f, 0.0f}`
> (and other fp enabled CSTs) by using `fmov v0.2s, 1.0` as the instruction
> is designed to zero out the other bits.
> This is a small extension on top of the code that creates fmov for the case
> where the all but the first element is non-zero.

Similarly to the second reply to 1/2, I think we should handle this
by detecting when only the low 64 bits are nonzero, and then try to
construct a simd_immediate_info for the low 64 bits.  The technique
is more general than just floats.

The same thing would work for SVE too (if TARGET_SIMD).

Thanks,
Richard

> Built and tested for aarch64-linux-gnu with no regressions.
>
>   PR target/113856
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (simd_immediate_info): Add bool to the
>   float mode constructor. Document modifier field for FMOV_SDH.
>   (aarch64_simd_valid_immediate): Recognize where the first half
>   of the const float vect is the same.
>   (aarch64_output_simd_mov_immediate): Handle the case where insn is
>   FMOV_SDH and modifier is MSL.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/fmov-zero-cst-3.c: New test.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.cc | 34 ---
>  .../gcc.target/aarch64/fmov-zero-cst-3.c  | 28 +++
>  2 files changed, 57 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov-zero-cst-3.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index c4386591a9b..89bd0c5e5a6 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -130,7 +130,7 @@ struct simd_immediate_info
>enum modifier_type { LSL, MSL };
>  
>simd_immediate_info () {}
> -  simd_immediate_info (scalar_float_mode, rtx, insn_type = MOV);
> +  simd_immediate_info (scalar_float_mode, rtx, insn_type = MOV, bool = 
> false);
>simd_immediate_info (scalar_int_mode, unsigned HOST_WIDE_INT,
>  insn_type = MOV, modifier_type = LSL,
>  unsigned int = 0);
> @@ -153,6 +153,8 @@ struct simd_immediate_info
>  
>/* The kind of shift modifier to use, and the number of bits to shift.
>This is (LSL, 0) if no shift is needed.  */
> +  /* For FMOV_SDH, LSL says it is a single while MSL
> +  says if it is either .4h/.2s fmov. */
>modifier_type modifier;
>unsigned int shift;
>  } mov;
> @@ -173,12 +175,12 @@ struct simd_immediate_info
>  /* Construct a floating-point immediate in which each element has mode
> ELT_MODE_IN and value VALUE_IN.  */
>  inline simd_immediate_info
> -::simd_immediate_info (scalar_float_mode elt_mode_in, rtx value_in, 
> insn_type insn_in)
> +::simd_immediate_info (scalar_float_mode elt_mode_in, rtx value_in, 
> insn_type insn_in, bool firsthalfsame)
>: elt_mode (elt_mode_in), insn (insn_in)
>  {
>gcc_assert (insn_in == MOV || insn_in == FMOV_SDH);
>u.mov.value = value_in;
> -  u.mov.modifier = LSL;
> +  u.mov.modifier = firsthalfsame ? MSL : LSL;
>u.mov.shift = 0;
>  }
>  
> @@ -22944,10 +22946,23 @@ aarch64_simd_valid_immediate (rtx op, 
> simd_immediate_info *info,
> || aarch64_float_const_representable_p (elt))
>   {
> bool valid = true;
> +   bool firsthalfsame = false;
> for (unsigned int i = 1; i < n_elts; i++)
>   {
> rtx elt1 = CONST_VECTOR_ENCODED_ELT (op, i);
> if (!aarch64_float_const_zero_rtx_p (elt1))
> + {
> +   if (i == 1)
> + firsthalfsame = true;
> +   if (!firsthalfsame
> +   || i >= n_elts/2
> +   || !rtx_equal_p (elt, elt1))
> + {
> +   valid = false;
> +   break;
> + }
> + }
> +   else if (firsthalfsame && i < n_elts/2)
>   {
> valid = false;
> break;
> @@ -22957,7 +22972,8 @@ aarch64_simd_valid_immediate (rtx op, 
> simd_immediate_info *info,
>   {
> if (info)
>   *info = simd_immediate_info (elt_float_mode, elt,
> -  simd_immediate_info::FMOV_SDH);
> +  simd_immediate_info::FMOV_SDH,
> +  firsthalfsame);
> return true;
>   }
>   }
> @@ -25165,8 +25181,16 @@ aarch64_output_simd_mov_immediate (rtx const_vector, 
> unsigned width,
> real_to_decimal_for_mode (float_buf,
>   CONST_DOUBLE_REAL_VALUE (info.u.mov.value),
>   buf_size, buf_size, 1, info.elt_mode);
> -   if (info.insn == simd_immediate_info::FMOV_SDH)
> +   if (info.insn == simd_immediate_info::FMOV_SDH
> +   && info.u.mov.modifier 

Re: [PATCH 1/2] aarch64: Use fmov s/d/hN, FP_CST for some vector CST [PR113856]

2024-03-07 Thread Richard Sandiford
Richard Sandiford  writes:
> Andrew Pinski  writes:
>> Aarch64 has a way to form some floating point CSTs via the fmov instructions,
>> these instructions also zero out the upper parts of the registers so they can
>> be used for vector CSTs that have have one non-zero constant that would be 
>> able
>> to formed via the fmov in the first element.
>>
>> This implements this "small" optimization so these vector cst don't need to 
>> do
>> loads from memory.
>>
>> Built and tested on aarch64-linux-gnu with no regressions.
>>
>>  PR target/113856
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64.cc (struct simd_immediate_info):
>>  Add FMOV_SDH to insn_type. For scalar_float_mode constructor
>>  add insn_in.
>>  (aarch64_simd_valid_immediate): Catch `{fp, 0...}` vector_cst
>>  and return a simd_immediate_info which uses FMOV_SDH.
>>  (aarch64_output_simd_mov_immediate): Support outputting
>>  fmov for FMOV_SDH.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/aarch64/fmov-zero-cst-1.c: New test.
>>  * gcc.target/aarch64/fmov-zero-cst-2.c: New test.
>>
>> Signed-off-by: Andrew Pinski 
>> ---
>>  gcc/config/aarch64/aarch64.cc | 48 ++---
>>  .../gcc.target/aarch64/fmov-zero-cst-1.c  | 52 +++
>>  .../gcc.target/aarch64/fmov-zero-cst-2.c  | 19 +++
>>  3 files changed, 111 insertions(+), 8 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov-zero-cst-1.c
>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov-zero-cst-2.c
>>
>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> index 5dd0814f198..c4386591a9b 100644
>> --- a/gcc/config/aarch64/aarch64.cc
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -126,11 +126,11 @@ constexpr auto AARCH64_STATE_OUT = 1U << 2;
>>  /* Information about a legitimate vector immediate operand.  */
>>  struct simd_immediate_info
>>  {
>> -  enum insn_type { MOV, MVN, INDEX, PTRUE };
>> +  enum insn_type { MOV, FMOV_SDH, MVN, INDEX, PTRUE };
>>enum modifier_type { LSL, MSL };
>>  
>>simd_immediate_info () {}
>> -  simd_immediate_info (scalar_float_mode, rtx);
>> +  simd_immediate_info (scalar_float_mode, rtx, insn_type = MOV);
>>simd_immediate_info (scalar_int_mode, unsigned HOST_WIDE_INT,
>> insn_type = MOV, modifier_type = LSL,
>> unsigned int = 0);
>> @@ -145,7 +145,7 @@ struct simd_immediate_info
>>  
>>union
>>{
>> -/* For MOV and MVN.  */
>> +/* For MOV, FMOV_SDH and MVN.  */
>>  struct
>>  {
>>/* The value of each element.  */
>> @@ -173,9 +173,10 @@ struct simd_immediate_info
>>  /* Construct a floating-point immediate in which each element has mode
>> ELT_MODE_IN and value VALUE_IN.  */
>>  inline simd_immediate_info
>> -::simd_immediate_info (scalar_float_mode elt_mode_in, rtx value_in)
>> -  : elt_mode (elt_mode_in), insn (MOV)
>> +::simd_immediate_info (scalar_float_mode elt_mode_in, rtx value_in, 
>> insn_type insn_in)
>
> Nit: long line.
>
>> +  : elt_mode (elt_mode_in), insn (insn_in)
>>  {
>> +  gcc_assert (insn_in == MOV || insn_in == FMOV_SDH);
>>u.mov.value = value_in;
>>u.mov.modifier = LSL;
>>u.mov.shift = 0;
>> @@ -22932,6 +22933,35 @@ aarch64_simd_valid_immediate (rtx op, 
>> simd_immediate_info *info,
>>return true;
>>  }
>>  }
>> +  /* See if we can use fmov d0/s0/h0 ... for the constant. */
>> +  if (n_elts >= 1
>
> This condition seems unnecessary.  n_elts can't be zero.
>
>> +  && (vec_flags & VEC_ADVSIMD)
>> +  && is_a  (elt_mode, _float_mode)
>> +  && !CONST_VECTOR_DUPLICATE_P (op))
>
> I think we should also drop this.  I guess it's to undo:
>
>   if (CONST_VECTOR_P (op)
>   && CONST_VECTOR_DUPLICATE_P (op))
> n_elts = CONST_VECTOR_NPATTERNS (op);
>
> but we can use GET_MODE_NUNITS (mode) directly instead.
>
>> +{
>> +  rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
>> +  if (aarch64_float_const_zero_rtx_p (elt)
>> +  || aarch64_float_const_representable_p (elt))
>
> What's the effect of including aarch64_float_const_zero_rtx_p for the
> first element?  Does it change the code we generate for any cases
> involving +0.0?  Or is it more for -0.0?
>
>> +{
>> +  bool valid = true

Re: [PATCH 1/2] aarch64: Use fmov s/d/hN, FP_CST for some vector CST [PR113856]

2024-03-07 Thread Richard Sandiford
Andrew Pinski  writes:
> Aarch64 has a way to form some floating point CSTs via the fmov instructions,
> these instructions also zero out the upper parts of the registers so they can
> be used for vector CSTs that have have one non-zero constant that would be 
> able
> to formed via the fmov in the first element.
>
> This implements this "small" optimization so these vector cst don't need to do
> loads from memory.
>
> Built and tested on aarch64-linux-gnu with no regressions.
>
>   PR target/113856
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (struct simd_immediate_info):
>   Add FMOV_SDH to insn_type. For scalar_float_mode constructor
>   add insn_in.
>   (aarch64_simd_valid_immediate): Catch `{fp, 0...}` vector_cst
>   and return a simd_immediate_info which uses FMOV_SDH.
>   (aarch64_output_simd_mov_immediate): Support outputting
>   fmov for FMOV_SDH.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/fmov-zero-cst-1.c: New test.
>   * gcc.target/aarch64/fmov-zero-cst-2.c: New test.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.cc | 48 ++---
>  .../gcc.target/aarch64/fmov-zero-cst-1.c  | 52 +++
>  .../gcc.target/aarch64/fmov-zero-cst-2.c  | 19 +++
>  3 files changed, 111 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov-zero-cst-1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov-zero-cst-2.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 5dd0814f198..c4386591a9b 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -126,11 +126,11 @@ constexpr auto AARCH64_STATE_OUT = 1U << 2;
>  /* Information about a legitimate vector immediate operand.  */
>  struct simd_immediate_info
>  {
> -  enum insn_type { MOV, MVN, INDEX, PTRUE };
> +  enum insn_type { MOV, FMOV_SDH, MVN, INDEX, PTRUE };
>enum modifier_type { LSL, MSL };
>  
>simd_immediate_info () {}
> -  simd_immediate_info (scalar_float_mode, rtx);
> +  simd_immediate_info (scalar_float_mode, rtx, insn_type = MOV);
>simd_immediate_info (scalar_int_mode, unsigned HOST_WIDE_INT,
>  insn_type = MOV, modifier_type = LSL,
>  unsigned int = 0);
> @@ -145,7 +145,7 @@ struct simd_immediate_info
>  
>union
>{
> -/* For MOV and MVN.  */
> +/* For MOV, FMOV_SDH and MVN.  */
>  struct
>  {
>/* The value of each element.  */
> @@ -173,9 +173,10 @@ struct simd_immediate_info
>  /* Construct a floating-point immediate in which each element has mode
> ELT_MODE_IN and value VALUE_IN.  */
>  inline simd_immediate_info
> -::simd_immediate_info (scalar_float_mode elt_mode_in, rtx value_in)
> -  : elt_mode (elt_mode_in), insn (MOV)
> +::simd_immediate_info (scalar_float_mode elt_mode_in, rtx value_in, 
> insn_type insn_in)

Nit: long line.

> +  : elt_mode (elt_mode_in), insn (insn_in)
>  {
> +  gcc_assert (insn_in == MOV || insn_in == FMOV_SDH);
>u.mov.value = value_in;
>u.mov.modifier = LSL;
>u.mov.shift = 0;
> @@ -22932,6 +22933,35 @@ aarch64_simd_valid_immediate (rtx op, 
> simd_immediate_info *info,
> return true;
>   }
>  }
> +  /* See if we can use fmov d0/s0/h0 ... for the constant. */
> +  if (n_elts >= 1

This condition seems unnecessary.  n_elts can't be zero.

> +  && (vec_flags & VEC_ADVSIMD)
> +  && is_a  (elt_mode, _float_mode)
> +  && !CONST_VECTOR_DUPLICATE_P (op))

I think we should also drop this.  I guess it's to undo:

  if (CONST_VECTOR_P (op)
  && CONST_VECTOR_DUPLICATE_P (op))
n_elts = CONST_VECTOR_NPATTERNS (op);

but we can use GET_MODE_NUNITS (mode) directly instead.

> +{
> +  rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
> +  if (aarch64_float_const_zero_rtx_p (elt)
> +   || aarch64_float_const_representable_p (elt))

What's the effect of including aarch64_float_const_zero_rtx_p for the
first element?  Does it change the code we generate for any cases
involving +0.0?  Or is it more for -0.0?

> + {
> +   bool valid = true;
> +   for (unsigned int i = 1; i < n_elts; i++)
> + {
> +   rtx elt1 = CONST_VECTOR_ENCODED_ELT (op, i);
> +   if (!aarch64_float_const_zero_rtx_p (elt1))
> + {
> +   valid = false;
> +   break;
> + }
> + }
> +   if (valid)
> + {
> +   if (info)
> + *info = simd_immediate_info (elt_float_mode, elt,
> +  simd_immediate_info::FMOV_SDH);
> +   return true;
> + }
> + }
> +}
>  
>/* If all elements in an SVE vector have the same value, we have a free
>   choice between using the element mode and using the container mode.
> @@ -25121,7 +25151,8 @@ aarch64_output_simd_mov_immediate (rtx const_vector, 
> unsigned width,
>  
>if 

[pushed] aarch64: Define out-of-class static constants

2024-03-06 Thread Richard Sandiford
While reworking the aarch64 feature descriptions, I forgot
to add out-of-class definitions of some static constants.
This could lead to a build failure with some compilers.

This was seen with some WIP to increase the number of extensions
beyond 64.  It's latent on trunk though, and a regression from
before the rework.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
* config/aarch64/aarch64-feature-deps.h (feature_deps::info): Add
out-of-class definitions of static constants.
---
 gcc/config/aarch64/aarch64-feature-deps.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-feature-deps.h 
b/gcc/config/aarch64/aarch64-feature-deps.h
index a1b81f9070b..3641badb82f 100644
--- a/gcc/config/aarch64/aarch64-feature-deps.h
+++ b/gcc/config/aarch64/aarch64-feature-deps.h
@@ -71,6 +71,9 @@ template struct info;
 static constexpr auto enable = flag | get_enable REQUIRES; \
 static constexpr auto explicit_on = enable | get_enable EXPLICIT_ON; \
   };   \
+  const aarch64_feature_flags info::flag;  \
+  const aarch64_feature_flags info::enable;\
+  const aarch64_feature_flags info::explicit_on; \
   constexpr info IDENT ()  \
   {\
 return info ();\
-- 
2.25.1



Re: [PATCHv2] fwprop: Avoid volatile defines to be propagated

2024-03-05 Thread Richard Sandiford
HAO CHEN GUI  writes:
> Hi,
>   This patch tries to fix a potential problem which is raised by the patch
> for PR111267. The volatile asm operand tries to be propagated to a single
> set insn with the patch for PR111267. The volatile asm operand might be
> executed for multiple times if the define insn isn't eliminated after
> propagation. Now set_src_cost comparison might reject such propagation.
> But it has the chance to be taken after replacing set_src_cost with insn
> cost. Actually I found the problem in testing my patch which replacing
> set_src_cost with insn_cost in fwprop pass.
>
>   Compared to the last version, the check volatile_insn_p is replaced with
> volatile_refs_p in order to check volatile memory reference also.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646482.html
>
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is it OK for the trunk?

OK, thanks.  I'm not sure this fixes a known regression, but IMO the
barrier should be lower for things that fix loss of volatility, since
it's usually so hard to observe the effect in a determinstic way.

Richard

>
> Thanks
> Gui Haochen
>
> ChangeLog
> fwprop: Avoid volatile defines to be propagated
>
> The patch for PR111267 (commit id 86de9b66480b710202a2898cf513db105d8c432f)
> which introduces an exception for propagation on single set insn.  The
> propagation which might not be profitable (checked by profitable_p) is still
> allowed to be propagated to single set insn.  It has a potential problem
> that a volatile operand might be propagated to a single set insn.  If the
> define insn is not eliminated after propagation, the volatile operand will
> be executed for multiple times.  This patch fixes the problem by skipping
> volatile set source rtx in propagation.
>
> gcc/
>   * fwprop.cc (forward_propagate_into): Return false for volatile set
>   source rtx.
>
> gcc/testsuite/
>   * gcc.target/powerpc/fwprop-1.c: New.
>
> patch.diff
> diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
> index 7872609b336..cb6fd6700ca 100644
> --- a/gcc/fwprop.cc
> +++ b/gcc/fwprop.cc
> @@ -854,6 +854,8 @@ forward_propagate_into (use_info *use, bool reg_prop_only 
> = false)
>
>rtx dest = SET_DEST (def_set);
>rtx src = SET_SRC (def_set);
> +  if (volatile_refs_p (src))
> +return false;
>
>/* Allow propagations into a loop only for reg-to-reg copies, since
>   replacing one register by another shouldn't increase the cost.
> diff --git a/gcc/testsuite/gcc.target/powerpc/fwprop-1.c 
> b/gcc/testsuite/gcc.target/powerpc/fwprop-1.c
> new file mode 100644
> index 000..07b207f980c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/fwprop-1.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -fdump-rtl-fwprop1-details" } */
> +/* { dg-final { scan-rtl-dump-not "propagating insn" "fwprop1" } } */
> +
> +/* Verify that volatile asm operands doesn't be propagated.  */
> +long long foo ()
> +{
> +  long long res;
> +  __asm__ __volatile__(
> +""
> +  : "=r" (res)
> +  :
> +  : "memory");
> +  return res;
> +}


Re: [PATCH] asan: Handle poly-int sizes in ASAN_MARK [PR97696]

2024-03-05 Thread Richard Sandiford
Jakub Jelinek  writes:
> On Tue, Mar 05, 2024 at 06:30:40PM +0000, Richard Sandiford wrote:
>> (1) Keep the test where it is, taking advantage of the current SVE
>> handling in aarch64-sve.exp, and add:
>> 
>>   /* { dg-skip-if "" { no_fsanitize_address } } */
>
> I'd go with this.  asan/ directory for test would be needed for dg-do run
> tests obviously, because then we need the test driver to add appropriate
> options to find the library etc.

Thanks, now pushed with that change.

What do you think about backports, after a baking-in period?

Richard


Re: [PATCH] asan: Handle poly-int sizes in ASAN_MARK [PR97696]

2024-03-05 Thread Richard Sandiford
Jakub Jelinek  writes:
> On Tue, Mar 05, 2024 at 06:03:41PM +0000, Richard Sandiford wrote:
>> This patch makes the expansion of IFN_ASAN_MARK let through
>> poly-int-sized objects.  The expansion itself was already generic
>> enough, but the tests for the fast path were too strict.
>> 
>> Bootstrapped & regression tested on aarch64-linux-gnu.  Is this OK
>> for trunk now, or should it wait for GCC 15?  I'm not sure that it's
>> technically a regression, in the sense that we previously accepted the
>> testcase, but rejecting with an ICE is arguably worse than "sorry, can't
>> do that".  And as noted in the PR, this bug is breaking numpy builds.
>> 
>> Richard
>> 
>> 
>> gcc/
>>  PR sanitizer/97696
>>  * asan.cc (asan_expand_mark_ifn): Allow the length to be a poly_int.
>> 
>> gcc/testsuite/
>>  PR sanitizer/97696
>>  * gcc.target/aarch64/sve/pr97696.c: New test.
>
> Ok for trunk now.

Thanks.  (And thanks for the quick review.)

>
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
>> @@ -0,0 +1,28 @@
>> +/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */
>
> Though I'd say this test should require sanitize_address affective target.
> E.g. libsanitizer (sure, not actually used by the test) is only supported
> on aarch64*-*-linux*, not e.g. on darwin nor freebsd nor fuchsia etc.

Yeah, I'd wondered about that.  But fsanitize_address is only available
in the asan.exp framework (or something else that includes asan-dg.exp).
And like you say, the test doesn't specifically need the library to
be available.

I guess the options are:

(1) Keep the test where it is, taking advantage of the current SVE
handling in aarch64-sve.exp, and add:

  /* { dg-skip-if "" { no_fsanitize_address } } */

(2) Move the test to gcc.dg/asan/, make it conditional on aarch64*-*-*,
and add:

  #pragma GCC target "+sve"

Any preference?

Actually running the test would require both libsanitizer support and
aarch64_sve_hw.  Assembling it would need an assembler that understands SVE.

Richard



[PATCH] asan: Handle poly-int sizes in ASAN_MARK [PR97696]

2024-03-05 Thread Richard Sandiford
This patch makes the expansion of IFN_ASAN_MARK let through
poly-int-sized objects.  The expansion itself was already generic
enough, but the tests for the fast path were too strict.

Bootstrapped & regression tested on aarch64-linux-gnu.  Is this OK
for trunk now, or should it wait for GCC 15?  I'm not sure that it's
technically a regression, in the sense that we previously accepted the
testcase, but rejecting with an ICE is arguably worse than "sorry, can't
do that".  And as noted in the PR, this bug is breaking numpy builds.

Richard


gcc/
PR sanitizer/97696
* asan.cc (asan_expand_mark_ifn): Allow the length to be a poly_int.

gcc/testsuite/
PR sanitizer/97696
* gcc.target/aarch64/sve/pr97696.c: New test.
---
 gcc/asan.cc   |  9 +++---
 .../gcc.target/aarch64/sve/pr97696.c  | 28 +++
 2 files changed, 32 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/pr97696.c

diff --git a/gcc/asan.cc b/gcc/asan.cc
index 0fd7dd1f3ed..d621ec9c323 100644
--- a/gcc/asan.cc
+++ b/gcc/asan.cc
@@ -3795,9 +3795,7 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
 }
   tree len = gimple_call_arg (g, 2);
 
-  gcc_assert (tree_fits_shwi_p (len));
-  unsigned HOST_WIDE_INT size_in_bytes = tree_to_shwi (len);
-  gcc_assert (size_in_bytes);
+  gcc_assert (poly_int_tree_p (len));
 
   g = gimple_build_assign (make_ssa_name (pointer_sized_int_node),
   NOP_EXPR, base);
@@ -3806,9 +3804,10 @@ asan_expand_mark_ifn (gimple_stmt_iterator *iter)
   tree base_addr = gimple_assign_lhs (g);
 
   /* Generate direct emission if size_in_bytes is small.  */
-  if (size_in_bytes
-  <= (unsigned)param_use_after_scope_direct_emission_threshold)
+  unsigned threshold = param_use_after_scope_direct_emission_threshold;
+  if (tree_fits_uhwi_p (len) && tree_to_uhwi (len) <= threshold)
 {
+  unsigned HOST_WIDE_INT size_in_bytes = tree_to_uhwi (len);
   const unsigned HOST_WIDE_INT shadow_size
= shadow_mem_size (size_in_bytes);
   const unsigned int shadow_align
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
new file mode 100644
index 000..f533d9efc02
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr97696.c
@@ -0,0 +1,28 @@
+/* { dg-options "-fsanitize=address -fsanitize-address-use-after-scope" } */
+
+#include 
+
+__attribute__((noinline, noclone)) int
+foo (char *a)
+{
+  int i, j = 0;
+  asm volatile ("" : "+r" (a) : : "memory");
+  for (i = 0; i < 12; i++)
+j += a[i];
+  return j;
+}
+
+int
+main ()
+{
+  int i, j = 0;
+  for (i = 0; i < 4; i++)
+{
+  char a[12];
+  __SVInt8_t freq;
+  __builtin_bcmp (, a, 10);
+  __builtin_memset (a, 0, sizeof (a));
+  j += foo (a);
+}
+  return j;
+}
-- 
2.25.1



[pushed] aarch64: Remove SME2.1 forms of LUTI2/4

2024-03-05 Thread Richard Sandiford
I was over-eager when adding support for strided SME2 instructions
and accidentally included forms of LUTI2 and LUTI4 that are only
available with SME2.1, not SME2.  This patch removes them for now.
We're planning to add proper support for SME2.1 in the GCC 15
timeframe.

Sorry for the blunder :(

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
* config/aarch64/aarch64.md (stride_type): Remove luti_consecutive
and luti_strided.
* config/aarch64/aarch64-sme.md
(@aarch64_sme_lut): Remove stride_type attribute.
(@aarch64_sme_lut_strided2): Delete.
(@aarch64_sme_lut_strided4): Likewise.
* config/aarch64/aarch64-early-ra.cc (is_stride_candidate)
(early_ra::maybe_convert_to_strided_access): Remove support for
strided LUTI2 and LUTI4.

gcc/testsuite/
* gcc.target/aarch64/sme/strided_1.c (test5): Remove.
---
 gcc/config/aarch64/aarch64-early-ra.cc| 20 +-
 gcc/config/aarch64/aarch64-sme.md | 70 ---
 gcc/config/aarch64/aarch64.md |  3 +-
 .../gcc.target/aarch64/sme/strided_1.c| 55 ---
 4 files changed, 3 insertions(+), 145 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-early-ra.cc 
b/gcc/config/aarch64/aarch64-early-ra.cc
index 8530b0ae41e..1e2c823cb2e 100644
--- a/gcc/config/aarch64/aarch64-early-ra.cc
+++ b/gcc/config/aarch64/aarch64-early-ra.cc
@@ -1060,8 +1060,7 @@ is_stride_candidate (rtx_insn *insn)
 return false;
 
   auto stride_type = get_attr_stride_type (insn);
-  return (stride_type == STRIDE_TYPE_LUTI_CONSECUTIVE
- || stride_type == STRIDE_TYPE_LD1_CONSECUTIVE
+  return (stride_type == STRIDE_TYPE_LD1_CONSECUTIVE
  || stride_type == STRIDE_TYPE_ST1_CONSECUTIVE);
 }
 
@@ -3212,8 +3211,7 @@ early_ra::maybe_convert_to_strided_access (rtx_insn *insn)
   auto stride_type = get_attr_stride_type (insn);
   rtx pat = PATTERN (insn);
   rtx op;
-  if (stride_type == STRIDE_TYPE_LUTI_CONSECUTIVE
-  || stride_type == STRIDE_TYPE_LD1_CONSECUTIVE)
+  if (stride_type == STRIDE_TYPE_LD1_CONSECUTIVE)
 op = SET_DEST (pat);
   else if (stride_type == STRIDE_TYPE_ST1_CONSECUTIVE)
 op = XVECEXP (SET_SRC (pat), 0, 1);
@@ -3263,20 +3261,6 @@ early_ra::maybe_convert_to_strided_access (rtx_insn 
*insn)
   XVECEXP (SET_SRC (pat), 0, XVECLEN (SET_SRC (pat), 0) - 1)
= *recog_data.dup_loc[0];
 }
-  else if (stride_type == STRIDE_TYPE_LUTI_CONSECUTIVE)
-{
-  auto bits = INTVAL (XVECEXP (SET_SRC (pat), 0, 4));
-  if (range.count == 2)
-   pat = gen_aarch64_sme_lut_strided2 (bits, single_mode,
-   regs[0], regs[1],
-   recog_data.operand[1],
-   recog_data.operand[2]);
-  else
-   pat = gen_aarch64_sme_lut_strided4 (bits, single_mode,
-   regs[0], regs[1], regs[2], regs[3],
-   recog_data.operand[1],
-   recog_data.operand[2]);
-}
   else
 gcc_unreachable ();
   PATTERN (insn) = pat;
diff --git a/gcc/config/aarch64/aarch64-sme.md 
b/gcc/config/aarch64/aarch64-sme.md
index c95d4aa696c..78ad2fc699f 100644
--- a/gcc/config/aarch64/aarch64-sme.md
+++ b/gcc/config/aarch64/aarch64-sme.md
@@ -1939,74 +1939,4 @@ (define_insn "@aarch64_sme_lut"
   "TARGET_STREAMING_SME2
&& !( == 4 &&  == 4 &&  == 8)"
   "luti\t%0, zt0, %1[%2]"
-  [(set_attr "stride_type" "luti_consecutive")]
-)
-
-(define_insn "@aarch64_sme_lut_strided2"
-  [(set (match_operand:SVE_FULL_BHS 0 "aarch64_simd_register" "=Uwd")
-   (unspec:SVE_FULL_BHS
- [(reg:V8DI ZT0_REGNUM)
-  (reg:DI SME_STATE_REGNUM)
-  (match_operand:VNx16QI 2 "register_operand" "w")
-  (match_operand:DI 3 "const_int_operand")
-  (const_int LUTI_BITS)
-  (const_int 0)]
- UNSPEC_SME_LUTI))
-   (set (match_operand:SVE_FULL_BHS 1 "aarch64_simd_register" "=w")
-   (unspec:SVE_FULL_BHS
- [(reg:V8DI ZT0_REGNUM)
-  (reg:DI SME_STATE_REGNUM)
-  (match_dup 2)
-  (match_dup 3)
-  (const_int LUTI_BITS)
-  (const_int 1)]
- UNSPEC_SME_LUTI))]
-  "TARGET_STREAMING_SME2
-   && aarch64_strided_registers_p (operands, 2, 8)"
-  "luti\t{%0., %1.}, zt0, %2[%3]"
-  [(set_attr "stride_type" "luti_strided")]
-)
-
-(define_insn "@aarch64_sme_lut_strided4"
-  [(set (match_operand:SVE_FULL_BHS 0 "aarch64_simd_register" "=Uwt")
-   (unspec:SVE_FULL_BHS
- [(reg:V8DI ZT0_REGNUM)
-  (reg:DI SME_STATE_REGNUM)
-  (match_operand:VNx16QI 4 "register_operand" "w")
-  (match_operand:DI 5 "const_int_operand")
-  (const_int LUTI_BITS)
-  (const_int 0)]
- UNSPEC_SME_LUTI))
-   (set (match_operand:SVE_FULL_BHS 1 "aarch64_simd_register" "=w")
-   (unspec:SVE_FULL_BHS
- 

[pushed] Restrict gcc.dg/rtl/aarch64/pr113295-1.c to aarch64

2024-02-24 Thread Richard Sandiford
I keep forgetting that gcc.dg/rtl is the one testsuite where
tests in target-specific subdirectories aren't automatically
restricted to that target.

Pushed as obvious after testing on aarch64-linux-gnu & x86_64-linux-gnu.

Richard


gcc/testsuite/
* gcc.dg/rtl/aarch64/pr113295-1.c: Restrict to aarc64*-*-*.
---
 gcc/testsuite/gcc.dg/rtl/aarch64/pr113295-1.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/rtl/aarch64/pr113295-1.c 
b/gcc/testsuite/gcc.dg/rtl/aarch64/pr113295-1.c
index 481fb813f61..bf6c5d1f256 100644
--- a/gcc/testsuite/gcc.dg/rtl/aarch64/pr113295-1.c
+++ b/gcc/testsuite/gcc.dg/rtl/aarch64/pr113295-1.c
@@ -1,5 +1,5 @@
+// { dg-do run { target aarch64*-*-* } }
 // { dg-options "-O2" }
-// { dg-do run }
 
 struct data {
   double x;
-- 
2.25.1



[PATCH] vect: Tighten check for impossible SLP layouts [PR113205]

2024-02-24 Thread Richard Sandiford
During its forward pass, the SLP layout code tries to calculate
the cost of a layout change on an incoming edge.  This is taken
as the minimum of two costs: one in which the source partition
keeps its current layout (chosen earlier during the pass) and
one in which the source partition switches to the new layout.
The latter can sometimes be arranged by the backward pass.

If only one of the costs is valid, the other cost was ignored.
But the PR shows that this is not safe.  If the source partition
has layout 0 (the normal layout), we have to be prepared to handle
the case in which that ends up being the only valid layout.

Other code already accounts for this restriction, e.g. see
the code starting with:

/* Reject the layout if it would make layout 0 impossible
   for later partitions.  This amounts to testing that the
   target supports reversing the layout change on edges
   to later partitions.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

Richard


gcc/
PR tree-optimization/113205
* tree-vect-slp.cc (vect_optimize_slp_pass::forward_cost): Reject
the proposed layout if it does not allow a source partition with
layout 2 to keep that layout.

gcc/testsuite/
PR tree-optimization/113205
* gcc.dg/torture/pr113205.c: New test.
---
 gcc/testsuite/gcc.dg/torture/pr113205.c | 19 +++
 gcc/tree-vect-slp.cc|  4 
 2 files changed, 23 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr113205.c

diff --git a/gcc/testsuite/gcc.dg/torture/pr113205.c 
b/gcc/testsuite/gcc.dg/torture/pr113205.c
new file mode 100644
index 000..edfba7fcd0e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr113205.c
@@ -0,0 +1,19 @@
+char a;
+char *b, *c;
+int d, e, f, g, h;
+int *i;
+
+void
+foo (void)
+{
+  unsigned p;
+  d = i[0];
+  e = i[1];
+  f = i[2];
+  g = i[3];
+  p = d * b[0];
+  p += f * c[h];
+  p += e * b[h];
+  p += g * c[h];
+  a = (p + 8000) >> (__SIZEOF_INT__ * __CHAR_BIT__ / 2);
+}
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 7cf9504398c..895f4f7fb6b 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -5034,6 +5034,10 @@ vect_optimize_slp_pass::forward_cost (graph_edge *ud, 
unsigned int from_node_i,
   cost.split (from_partition.out_degree);
   cost.add_serial_cost (edge_cost);
 }
+  else if (from_partition.layout == 0)
+/* We must allow the source partition to have layout 0 as a fallback,
+   in case all other options turn out to be impossible.  */
+return cost;
 
   /* Take the minimum of that cost and the cost that applies if
  FROM_PARTITION instead switches to TO_LAYOUT_I.  */
-- 
2.25.1



Re: [PATCH v1 00/13] Add aarch64-w64-mingw32 target

2024-02-23 Thread Richard Sandiford
"Richard Earnshaw (lists)"  writes:
> On 21/02/2024 17:47, Evgeny Karpov wrote:
>> Hello,
>> 
>> We would like to take your attention to the review of changes for the
>> new GCC target, aarch64-w64-mingw32. The new target will be
>> supported, tested, added to CI, and maintained by Linaro. This marks
>> the first of three planned patch series contributing to the GCC C
>> compiler's support for Windows Arm64.
>> 
>> 1. Minimal aarch64-w64-mingw32 C implementation to cross-compile
>> hello-world with libgcc for Windows Arm64 using MinGW.
>> 2. Extension of the aarch64-w64-mingw32 C implementation to
>> cross-compile OpenSSL, OpenBLAS, FFmpeg, and libjpeg-turbo. All
>> packages successfully pass tests.
>> 3. Addition of call stack support for debugging, resolution of
>> optimization issues in the C compiler, and DLL export/import for the
>> aarch64-w64-mingw32 target.
>> 
>> This patch series introduces the 1st point, which involves building
>> hello-world for the aarch64-w64-mingw32 target. The patch depends on
>> the binutils changes for the aarch64-w64-mingw32 target that have
>> already been merged.
>> 
>> The binutils should include recent relocation fixes.
>> f87eaf8ff3995a5888c6dc4996a20c770e6bcd36
>> aarch64: Add new relocations and limit COFF AArch64 relocation offsets
>> 
>> The series is structured in a way to trivially show that it should not
>> affect any other targets.
>> 
>> In this patch, several changes have been made to support the
>> aarch64-w64-mingw32 target for GCC. The modifications include the
>> definition of the MS ABI for aarch64, adjustments to FIXED_REGISTERS
>> and STATIC_CHAIN_REGNUM for different ABIs, and specific definitions
>> for COFF format on AArch64. Additionally, the patch reuses MinGW
>>  types and definitions from i386, relocating them to a new
>> mingw folder for shared usage between both targets.
>> 
>> MinGW-specific options have been introduced for AArch64, along with
>> override options for aarch64-w64-mingw32. Builtin stack probing for
>> override options for aarch64-w64-mingw32. Builtin stack probing for
>> AArch64 has been enabled as an alternative for chkstk. Symbol name
>> encoding and section information handling for aarch64-w64-mingw32 have
>> been incorporated, and the MinGW environment has been added, which
>> will also be utilized for defining the Cygwin environment in the
>> future.
>> 
>> The patch includes renaming "x86 Windows Options" to "Cygwin and MinGW
>> Options," which now encompasses AArch64 as well. AArch64-specific
>> Cygwin and MinGW Options have been introduced for the unique
>> requirements of the AArch64 architecture.
>> 
>> Function type declaration and named sections support have been added.
>> The necessary objects for Cygwin and MinGW have been built for the
>> aarch64-w64-mingw32 target, and relevant files such as msformat-c.cc
>> and winnt-d.cc have been moved to the mingw folder for reuse in
>> AArch64.
>> 
>> Furthermore, the aarch64-w64-mingw32 target has been included in both
>> libatomic and libgcc, ensuring support for the AArch64 architecture
>> within these libraries. These changes collectively enhance the
>> capabilities of GCC for the specified target.
>> 
>> Coauthors: Zac Walker ,
>> Mark Harmstone   and
>> Ron Riddle 
>> 
>> Refactored, prepared, and validated by 
>> Radek Barton  and 
>> Evgeny Karpov 
>> 
>> Special thanks to the Linaro GNU toolchain team for internal review
>> and assistance in preparing the patch series!
>> 
>> Regards,
>> Evgeny
>
> Thanks for posting this.

Seconded. :)  Thanks also for the very clear organisation of the series,
and for commonising code rather than cut-&-pasting it.

FWIW, I agree with all the comments posted so far, and just sent
some other comments too.  I think my main high-level comments are:

- Could you double-check that all the code in the common files are
  used on both aarch64 and x86?  I think it's OK to move code outside
  of x86 even if aarch64 doesn't want to use it, provided that it makes
  conceptual target-independent sense.  But it's not clear whether
  unused code is deliberate or not (e.g. the EXTRA_OS_CPP_BUILTINS
  thing I mentioned in the part 2 review).

- Could you test with all languages enabled, and say what languages
  are supported?  Some languages require an existing compiler for
  the same language and so are more difficult to bootstrap for
  a new port.  I suppose you'd need a cross-host build first,
  then use the cross-compiled compilers to bootstrap.

Thanks,
Richard


Re: [PATCH v1 02/13] aarch64: The aarch64-w64-mingw32 target implements

2024-02-23 Thread Richard Sandiford
Evgeny Karpov  writes:
> The calling ABI enum definition has been done following a similar convention 
> in 
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/i386-opts.h;h=ef2825803b32001b9632769bdff196d1e43d27ba;hb=refs/heads/master#l41
>
> MS_ABI is used in gcc/config/i386/mingw32.h and gcc/config/i386/winnt-d.cc
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/mingw32.h;h=58304fc55f629648e47490fd3c0f3db3858e4fd8;hb=refs/heads/master#l22
>
> These files are moved to the mingw folder in the patch series.
> https://gcc.gnu.org/pipermail/gcc-patches/attachments/20240221/5e75c464/attachment.txt
>
> What do you think about this change for v2?
>
> +/* Available call ABIs.  */
> +enum aarch64_calling_abi
> +{
> +  AARCH64_CALLING_ABI_EABI,
> +  AARCH64_CALLING_ABI_MS,
> +  MS_ABI = AARCH64_CALLING_ABI_MS
> +};
> +

How is MS_ABI used in practice?  When I apply locally, it looks like
the two non-x86 uses are in:

gcc/config/mingw/mingw32.h:  if (TARGET_64BIT && ix86_abi == MS_ABI)
\
gcc/config/mingw/winnt-d.cc:  if (TARGET_64BIT && ix86_abi == MS_ABI)

But these should fail to build if used, because AFAICT there's no
definition of ix86_abi on aarch64.

The first match is in EXTRA_OS_CPP_BUILTINS, but I couldn't see any uses
of that in aarch64 code, which would explain why everything builds OK.
The winnt-d.cc occurence looks like it would break the build with the
D frontend enabled though.

Are there two distinct ABIs for aarch64-*-mingw*?  Or are these
distinctions ignored on aarch64 and just retained for compatibility?

If there are two distinct ABIs then we should probably add them to
aarch64_arm_pcs.  But if there is only a single ABI, we should probably
avoid adding calling_abi altogether and instead provide a macro like
TARGET_IS_MS_ABI that aarch64 and x86 can define differently.

(To be clear, I don't think the different handling of x18 matters
for the PCS classification.  That's an orthogonal platform property
that applies to all PCS variants equally.  No-one had suggested
otherwise, just wanted to say in case. :-) )

Thanks,
Richard

>
> Regards,
> Evgeny
>
>
> Thursday, February 22, 2024 12:40 PM
> Richard Earnshaw (lists) wrote:
>
>>
> +/* Available call ABIs.  */
> +enum calling_abi
> +{
> +  AARCH64_EABI = 0,
> +  MS_ABI = 1
> +};
> +
>
> The convention in this file seems to be that all enum types to start with 
> aarch64.  Also, the enumeration values should start with the name of the 
> enumeration type in upper case, so:
>
> enum aarch64_calling_abi
> {
>   AARCH64_CALLING_ABI_EABI,
>   AARCH64_CALLING_ABI_MS
> };
>
> or something very much like that.
>
> R.


Re: [PATCH v1 08/13] aarch64: Add Cygwin and MinGW environments for AArch64

2024-02-23 Thread Richard Sandiford
Evgeny Karpov  writes:
> From 1ea6efa6f88d131884ecef21c4b5d2ecbab14ea7 Mon Sep 17 00:00:00 2001
> From: Zac Walker 
> Date: Tue, 20 Feb 2024 18:06:36 +0100
> Subject: [PATCH v1 08/13] aarch64: Add Cygwin and MinGW environments for
>  AArch64
>
> Define Cygwin and MinGW environment such as types, SEH definitions,
> shared libraries, etc.
>
> gcc/ChangeLog:
>
>   * config.gcc: Add Cygwin and MinGW difinitions.
>   * config/aarch64/aarch64-protos.h
>   (mingw_pe_maybe_record_exported_symbol): Declare functions
>   which are used in Cygwin and MinGW environment.
>   (mingw_pe_section_type_flags): Likewise.
>   (mingw_pe_unique_section): Likewise.
>   (mingw_pe_encode_section_info): Likewise.
>   * config/aarch64/cygming.h: New file.
> ---
>  gcc/config.gcc  |   1 +
>  gcc/config/aarch64/aarch64-protos.h |   5 +
>  gcc/config/aarch64/cygming.h| 178 
>  3 files changed, 184 insertions(+)
>  create mode 100644 gcc/config/aarch64/cygming.h
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 498ee702607..96e3508eb30 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1265,6 +1265,7 @@ aarch64*-*-linux*)
>   ;;
>  aarch64*-*-mingw*)
>   tm_file="${tm_file} aarch64/aarch64-coff.h"
> + tm_file="${tm_file} aarch64/cygming.h"
>   tm_file="${tm_file} mingw/mingw32.h"
>   tm_file="${tm_file} mingw/mingw-stdint.h"
>   tmake_file="${tmake_file} aarch64/t-aarch64"
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index a0b142e0b94..6d85452b0f6 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1109,6 +1109,11 @@ extern void aarch64_output_patchable_area (unsigned 
> int, bool);
>  
>  extern void aarch64_adjust_reg_alloc_order ();
>  
> +extern void mingw_pe_maybe_record_exported_symbol (tree, const char *, int);
> +extern unsigned int mingw_pe_section_type_flags (tree, const char *, int);
> +extern void mingw_pe_unique_section (tree, int);
> +extern void mingw_pe_encode_section_info (tree, rtx, int);
> +
>  bool aarch64_optimize_mode_switching (aarch64_mode_entity);
>  void aarch64_restore_za (rtx);
>  
> diff --git a/gcc/config/aarch64/cygming.h b/gcc/config/aarch64/cygming.h
> new file mode 100644
> index 000..cf47184eb66
> --- /dev/null
> +++ b/gcc/config/aarch64/cygming.h
> @@ -0,0 +1,178 @@
> +/* Operating system specific defines to be used when targeting GCC for
> +   hosting on Windows32, using a Unix style C library and tools.
> +   Copyright (C) 1995-2024 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +.  */
> +
> +#ifndef GCC_AARCH64_CYGMING_H
> +#define GCC_AARCH64_CYGMING_H
> +
> +#undef PREFERRED_DEBUGGING_TYPE
> +#define PREFERRED_DEBUGGING_TYPE DINFO_TYPE_NONE
> +
> +#define FASTCALL_PREFIX '@'
> +
> +#define print_reg(rtx, code, file)
> +
> +#define SYMBOL_FLAG_DLLIMPORT 0
> +#define SYMBOL_FLAG_DLLEXPORT 0
> +
> +#define SYMBOL_REF_DLLEXPORT_P(X) \
> + ((SYMBOL_REF_FLAGS (X) & SYMBOL_FLAG_DLLEXPORT) != 0)
> +
> +#undef TARGET_SEH
> +#define TARGET_SEH 0
> +
> +#define SSE_REGNO_P(N) 0
> +#define GENERAL_REGNO_P(N) 0

Could you add a comment to explain how these two macros are consumed?
What is the effect of saying that everything is neither a general
register nor an SSE register?

> +#define SEH_MAX_FRAME_SIZE 0
> +
> +#undef DEFAULT_ABI
> +#define DEFAULT_ABI MS_ABI
> +
> +#undef TARGET_PECOFF
> +#define TARGET_PECOFF 1
> +
> +#include 
> +#ifdef __MINGW32__
> +#include 
> +#endif
> +
> +extern void mingw_pe_asm_named_section (const char *, unsigned int, tree);
> +extern void mingw_pe_declare_function_type (FILE *file, const char *name,
> + int pub);
> +
> +#define TARGET_ASM_NAMED_SECTION  mingw_pe_asm_named_section
> +
> +/* Select attributes for named sections.  */
> +#define TARGET_SECTION_TYPE_FLAGS  mingw_pe_section_type_flags
> +
> +#define TARGET_ASM_UNIQUE_SECTION mingw_pe_unique_section
> +#define TARGET_ENCODE_SECTION_INFO  mingw_pe_encode_section_info
> +
> +/* Declare the type properly for any external libcall.  */
> +#define ASM_OUTPUT_EXTERNAL_LIBCALL(FILE, FUN) \
> +  mingw_pe_declare_function_type (FILE, XSTR (FUN, 0), 1)
> +
> +#define TARGET_OS_CPP_BUILTINS() \
> +  do  

Re: [PATCH v1 04/13] aarch64: Add aarch64-w64-mingw32 COFF

2024-02-23 Thread Richard Sandiford
Evgeny Karpov  writes:
> From 55fd2a63afa9abb3543d714b6f5925efd2682e08 Mon Sep 17 00:00:00 2001
> From: Zac Walker 
> Date: Wed, 21 Feb 2024 12:20:46 +0100
> Subject: [PATCH v1 04/13] aarch64: Add aarch64-w64-mingw32 COFF
>
> Define ASM specific for COFF format on AArch64.
>
> gcc/ChangeLog:
>
>   * config.gcc: Add COFF format support definitions.
>   * config/aarch64/aarch64-coff.h: New file.

The only surprising thing here to me was:

> ---
>  gcc/config.gcc|  1 +
>  gcc/config/aarch64/aarch64-coff.h | 92 +++
>  2 files changed, 93 insertions(+)
>  create mode 100644 gcc/config/aarch64/aarch64-coff.h
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 2a9e4c44f50..34c7be72fb6 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -1264,6 +1264,7 @@ aarch64*-*-linux*)
>   TM_MULTILIB_CONFIG=`echo $TM_MULTILIB_CONFIG | sed 's/^,//'`
>   ;;
>  aarch64*-*-mingw*)
> + tm_file="${tm_file} aarch64/aarch64-coff.h"
>   tmake_file="${tmake_file} aarch64/t-aarch64"
>   case ${enable_threads} in
> "" | yes | win32)
> diff --git a/gcc/config/aarch64/aarch64-coff.h 
> b/gcc/config/aarch64/aarch64-coff.h
> new file mode 100644
> index 000..d91bc36b67b
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-coff.h
> @@ -0,0 +1,92 @@
> +/* Machine description for AArch64 architecture.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   .  */
> +
> +#ifndef GCC_AARCH64_COFF_H
> +#define GCC_AARCH64_COFF_H
> +
> +#include "aarch64.h"
> +
> +#ifndef LOCAL_LABEL_PREFIX
> +# define LOCAL_LABEL_PREFIX  ""
> +#endif
> +
> +/* Using long long breaks -ansi and -std=c90, so these will need to be
> +   made conditional for an LLP64 ABI.  */
> +#undef SIZE_TYPE
> +#define SIZE_TYPE"long long unsigned int"
> +
> +#undef PTRDIFF_TYPE
> +#define PTRDIFF_TYPE "long long int"
> +
> +#define TARGET_64BIT 1

...this.  Does some code that is shared between x86 and aarch64 rely
on this definition?  It might be worth identifying the code in a comment
if so.

Thanks,
Richard

> +#undef LONG_TYPE_SIZE
> +#define LONG_TYPE_SIZE 32
> +
> +#ifndef ASM_GENERATE_INTERNAL_LABEL
> +# define ASM_GENERATE_INTERNAL_LABEL(STRING, PREFIX, NUM)  \
> +  sprintf (STRING, "*%s%s%u", LOCAL_LABEL_PREFIX, PREFIX, (unsigned 
> int)(NUM))
> +#endif
> +
> +#define ASM_OUTPUT_ALIGN(STREAM, POWER)  \
> +  fprintf (STREAM, "\t.align\t%d\n", (int)POWER)
> +
> +/* Output a common block.  */
> +#ifndef ASM_OUTPUT_COMMON
> +# define ASM_OUTPUT_COMMON(STREAM, NAME, SIZE, ROUNDED)  \
> +{\
> +  fprintf (STREAM, "\t.comm\t"); \
> +  assemble_name (STREAM, NAME);  \
> +  asm_fprintf (STREAM, ", %d, %d\n", \
> +  (int)(ROUNDED), (int)(SIZE));  \
> +}
> +#endif
> +
> +/* Output a local common block.  /bin/as can't do this, so hack a
> +   `.space' into the bss segment.  Note that this is *bad* practice,
> +   which is guaranteed NOT to work since it doesn't define STATIC
> +   COMMON space but merely STATIC BSS space.  */
> +#ifndef ASM_OUTPUT_ALIGNED_LOCAL
> +# define ASM_OUTPUT_ALIGNED_LOCAL(STREAM, NAME, SIZE, ALIGN) \
> +{
> \
> +  switch_to_section (bss_section);   
> \
> +  ASM_OUTPUT_ALIGN (STREAM, floor_log2 (ALIGN / BITS_PER_UNIT)); \
> +  ASM_OUTPUT_LABEL (STREAM, NAME);   
> \
> +  fprintf (STREAM, "\t.space\t%d\n", (int)(SIZE));   
> \
> +}
> +#endif
> +
> +#define ASM_OUTPUT_SKIP(STREAM, NBYTES)  \
> +  fprintf (STREAM, "\t.space\t%d  // skip\n", (int) (NBYTES))
> +
> +#define ASM_OUTPUT_TYPE_DIRECTIVE(STREAM, NAME, TYPE)
> +#define ASM_DECLARE_FUNCTION_SIZE(FILE, FNAME, DECL)
> +
> +#define TEXT_SECTION_ASM_OP  "\t.text"
> +#define DATA_SECTION_ASM_OP  "\t.data"
> +#define BSS_SECTION_ASM_OP   "\t.bss"
> +
> +#define CTORS_SECTION_ASM_OP "\t.section\t.ctors, \"aw\""
> +#define DTORS_SECTION_ASM_OP "\t.section\t.dtors, \"aw\""
> +
> +#define GLOBAL_ASM_OP "\t.global\t"
> +
> +#undef SUPPORTS_INIT_PRIORITY
> +#define SUPPORTS_INIT_PRIORITY 0
> 

Re: [PATCH v1 03/13] aarch64: Mark x18 register as a fixed register for MS ABI

2024-02-23 Thread Richard Sandiford
"Richard Earnshaw (lists)"  writes:
> On 21/02/2024 18:30, Evgeny Karpov wrote:
>> 
> +/* X18 reserved for the TEB on Windows.  */
> +#ifdef TARGET_ARM64_MS_ABI
> +# define FIXED_X18 1
> +# define CALL_USED_X18 0
> +#else
> +# define FIXED_X18 0
> +# define CALL_USED_X18 1
> +#endif
>
> I'm not overly keen on ifdefs like this (and the one below), it can get quite 
> confusing if we have to support more than a couple of ABIs.  Perhaps we could 
> create a couple of new headers, one for the EABI (which all existing targets 
> would then need to include) and one for the MS ABI.  Then the mingw port 
> would use that instead of the EABI header.
>
> An alternative is to make all this dynamic, based on the setting of the 
> aarch64_calling_abi enum and to make the adjustments in 
> aarch64_conditional_register_usage.

Agreed FWIW.

> +# define CALL_USED_X18 0
>
> Is that really correct?  If the register is really reserved, but some code 
> modifies it anyway, this will cause the compiler to restore the old value at 
> the end of a function; generally, for a reserved register, code that knows 
> what it's doing would want to make permanent changes to this value.

I don't think it would do that for fixed registers.  For those this
is more whether calls are allowed to change the value of x18 or whether
x18 is supposed to remain fixed (e.g. set at the start of the thread and
not changed thereafter).

How does the MS ABI use this register?  Same question for Darwin I suppose.

Thanks,
Richard

>
> +#ifdef TARGET_ARM64_MS_ABI
> +# define STATIC_CHAIN_REGNUM R17_REGNUM
> +#else
> +# define STATIC_CHAIN_REGNUM R18_REGNUM
> +#endif
>
> If we went the enum way, we'd want something like
>
> #define STATIC_CHAIN_REGNUM (calling_abi == AARCH64_CALLING_ABI_MS ? 
> R17_REGNUM : R18_REGNUM)
>
> R.


[pushed] aarch64: Tighten early-ra chain test for wide registers [PR113295]

2024-02-23 Thread Richard Sandiford
Most code in early-ra used is_chain_candidate to check whether we
should chain two allocnos.  This included both tests that matter
for correctness and tests for certain heuristics.

Once that test passes for one pair of allocnos, we test whether
it's safe to chain the containing groups (which might contain
multiple allocnos for x2, x3 and x4 modes).  This test used an
inline test for correctness only, deliberately skipping the
heuristics.  However, this instance of the test was missing
some handling of equivalent allocnos.

This patch fixes things by making is_chain_candidate take a
strictness parameter: correctness only, or correctness + heuristics.
It then makes the group-chaining test use the correctness version
rather than trying to replicate it inline.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
PR target/113295
* config/aarch64/aarch64-early-ra.cc
(early_ra::test_strictness): New enum.
(early_ra::is_chain_candidate): Add a strictness parameter to
control whether only correctness matters, or whether both correctness
and heuristics should be used.  Handle multiple levels of equivalence.
(early_ra::find_related_start): Update call accordingly.
(early_ra::strided_polarity_pref): Likewise.
(early_ra::form_chains): Likewise.
(early_ra::try_to_chain_allocnos): Use is_chain_candidate in
correctness mode rather than trying to inline the test.

gcc/testsuite/
PR target/113295
* gcc.target/aarch64/pr113295-2.c: New test.
---
 gcc/config/aarch64/aarch64-early-ra.cc| 48 
 gcc/testsuite/gcc.target/aarch64/pr113295-2.c | 57 +++
 2 files changed, 82 insertions(+), 23 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr113295-2.c

diff --git a/gcc/config/aarch64/aarch64-early-ra.cc 
b/gcc/config/aarch64/aarch64-early-ra.cc
index 58ae5a49913..9ac9ec1bb0d 100644
--- a/gcc/config/aarch64/aarch64-early-ra.cc
+++ b/gcc/config/aarch64/aarch64-early-ra.cc
@@ -95,6 +95,10 @@ public:
   void execute ();
 
 private:
+  // Whether to test only things that are required for correctness,
+  // or whether to take optimization heuristics into account as well.
+  enum test_strictness { CORRECTNESS_ONLY, ALL_REASONS };
+
   static_assert (MAX_RECOG_OPERANDS <= 32, "Operand mask is 32 bits");
   using operand_mask = uint32_t;
 
@@ -452,7 +456,7 @@ private:
 
   template
   static int cmp_increasing (const void *, const void *);
-  bool is_chain_candidate (allocno_info *, allocno_info *);
+  bool is_chain_candidate (allocno_info *, allocno_info *, test_strictness);
   int rate_chain (allocno_info *, allocno_info *);
   static int cmp_chain_candidates (const void *, const void *);
   void chain_allocnos (unsigned int &, unsigned int &);
@@ -1588,7 +1592,7 @@ early_ra::find_related_start (allocno_info *dest_allocno,
return res;
 
   auto *next_allocno = m_allocnos[dest_allocno->copy_dest];
-  if (!is_chain_candidate (dest_allocno, next_allocno))
+  if (!is_chain_candidate (dest_allocno, next_allocno, ALL_REASONS))
return res;
 
   dest_allocno = next_allocno;
@@ -2011,7 +2015,7 @@ early_ra::strided_polarity_pref (allocno_info *allocno1,
   if (allocno1->offset + 1 < allocno1->group_size
   && allocno2->offset + 1 < allocno2->group_size)
 {
-  if (is_chain_candidate (allocno1 + 1, allocno2 + 1))
+  if (is_chain_candidate (allocno1 + 1, allocno2 + 1, ALL_REASONS))
return 1;
   else
return -1;
@@ -2019,7 +2023,7 @@ early_ra::strided_polarity_pref (allocno_info *allocno1,
 
   if (allocno1->offset > 0 && allocno2->offset > 0)
 {
-  if (is_chain_candidate (allocno1 - 1, allocno2 - 1))
+  if (is_chain_candidate (allocno1 - 1, allocno2 - 1, ALL_REASONS))
return 1;
   else
return -1;
@@ -2215,38 +2219,37 @@ early_ra::cmp_increasing (const void *allocno1_ptr, 
const void *allocno2_ptr)
 }
 
 // Return true if we should consider chaining ALLOCNO1 onto the head
-// of ALLOCNO2.  This is just a local test of the two allocnos; it doesn't
-// guarantee that chaining them would give a self-consistent system.
+// of ALLOCNO2.  STRICTNESS says whether we should take copy-elision
+// heuristics into account, or whether we should just consider things
+// that matter for correctness.
+//
+// This is just a local test of the two allocnos; it doesn't guarantee
+// that chaining them would give a self-consistent system.
 bool
-early_ra::is_chain_candidate (allocno_info *allocno1, allocno_info *allocno2)
+early_ra::is_chain_candidate (allocno_info *allocno1, allocno_info *allocno2,
+ test_strictness strictness)
 {
   if (allocno2->is_shared ())
 return false;
 
-  if (allocno1->is_equiv)
+  while (allocno1->is_equiv)
 allocno1 = m_allocnos[allocno1->related_allocno];
 
   if (allocno2->start_point >= allocno1->end_point
   && !allocno2->is_equiv_to 

[pushed] aarch64: Spread out FPR usage between RA regions [PR113613]

2024-02-23 Thread Richard Sandiford
early-ra already had code to do regrename-style "broadening"
of the allocation, to promote scheduling freedom.  However,
the pass divides the function into allocation regions
and this broadening only worked within a single region.
This meant that if a basic block contained one subblock
of FPR use, followed by a point at which no FPRs were live,
followed by another subblock of FPR use, the two subblocks
would tend to reuse the same registers.  This in turn meant
that it wasn't possible to form LDP/STP pairs between them.

The failure to form LDPs and STPs in the testcase was a
regression from GCC 13.

The patch adds a simple heuristic to prefer less recently
used registers in the event of a tie.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
PR target/113613
* config/aarch64/aarch64-early-ra.cc
(early_ra::m_current_region): New member variable.
(early_ra::m_fpr_recency): Likewise.
(early_ra::start_new_region): Bump m_current_region.
(early_ra::allocate_colors): Prefer less recently used registers
in the event of a tie.  Add a comment to explain why we prefer(ed)
higher-numbered registers.
(early_ra::find_oldest_color): Prefer less recently used registers
here too.
(early_ra::finalize_allocation): Update recency information for
allocated registers.
(early_ra::process_blocks): Initialize m_current_region and
m_fpr_recency.

gcc/testsuite/
PR target/113613
* gcc.target/aarch64/pr113613.c: New test.
---
 gcc/config/aarch64/aarch64-early-ra.cc  | 55 +
 gcc/testsuite/gcc.target/aarch64/pr113613.c | 13 +
 2 files changed, 59 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr113613.c

diff --git a/gcc/config/aarch64/aarch64-early-ra.cc 
b/gcc/config/aarch64/aarch64-early-ra.cc
index 9ac9ec1bb0d..8530b0ae41e 100644
--- a/gcc/config/aarch64/aarch64-early-ra.cc
+++ b/gcc/config/aarch64/aarch64-early-ra.cc
@@ -532,6 +532,12 @@ private:
   // The set of FPRs that are currently live.
   unsigned int m_live_fprs;
 
+  // A unique one-based identifier for the current region.
+  unsigned int m_current_region;
+
+  // The region in which each FPR was last used, or 0 if none.
+  unsigned int m_fpr_recency[32];
+
   // --
 
   // A mask of the FPRs that have already been allocated.
@@ -1305,6 +1311,7 @@ early_ra::start_new_region ()
   m_allocated_fprs = 0;
   m_call_preserved_fprs = 0;
   m_allocation_successful = true;
+  m_current_region += 1;
 }
 
 // Create and return an allocno group of size SIZE for register REGNO.
@@ -2819,19 +2826,30 @@ early_ra::allocate_colors ()
candidates &= ~(m_allocated_fprs >> i);
   unsigned int best = INVALID_REGNUM;
   int best_weight = 0;
+  unsigned int best_recency = 0;
   for (unsigned int fpr = 0; fpr <= 32U - color->group->size; ++fpr)
{
  if ((candidates & (1U << fpr)) == 0)
continue;
  int weight = color->fpr_preferences[fpr];
+ unsigned int recency = 0;
  // Account for registers that the current function must preserve.
  for (unsigned int i = 0; i < color->group->size; ++i)
-   if (m_call_preserved_fprs & (1U << (fpr + i)))
- weight -= 1;
- if (best == INVALID_REGNUM || best_weight <= weight)
+   {
+ if (m_call_preserved_fprs & (1U << (fpr + i)))
+   weight -= 1;
+ recency = MAX (recency, m_fpr_recency[fpr + i]);
+   }
+ // Prefer higher-numbered registers in the event of a tie.
+ // This should tend to keep lower-numbered registers free
+ // for allocnos that require V0-V7 or V0-V15.
+ if (best == INVALID_REGNUM
+ || best_weight < weight
+ || (best_weight == weight && recency <= best_recency))
{
  best = fpr;
  best_weight = weight;
+ best_recency = recency;
}
}
 
@@ -2888,19 +2906,27 @@ early_ra::find_oldest_color (unsigned int first_color,
 {
   color_info *best = nullptr;
   unsigned int best_start_point = ~0U;
+  unsigned int best_recency = 0;
   for (unsigned int ci = first_color; ci < m_colors.length (); ++ci)
 {
   auto *color = m_colors[ci];
-  if (fpr_conflicts & (1U << (color->hard_regno - V0_REGNUM)))
+  unsigned int fpr = color->hard_regno - V0_REGNUM;
+  if (fpr_conflicts & (1U << fpr))
continue;
-  if (!color->group)
-   return color;
-  auto chain_head = color->group->chain_heads ()[0];
-  auto start_point = m_allocnos[chain_head]->start_point;
-  if (!best || best_start_point > start_point)
+  unsigned int start_point = 0;
+  if (color->group)
+   {
+ auto chain_head = color->group->chain_heads ()[0];
+ start_point = 

Re: [PATCH 0/2 V2] aarch64: Place target independent and dependent code in one file.

2024-02-22 Thread Richard Sandiford
Ajit Agarwal  writes:
> Hello Alex/Richard:
>
> I have placed target indpendent and target dependent code in
> aarch64-ldp-fusion for load store fusion.
>
> Common infrastructure of load store pair fusion is divided into
> target independent and target dependent code.
>
> Target independent code is the Generic code with pure virtual
> function to interface betwwen target independent and dependent
> code.
>
> Target dependent code is the implementation of pure virtual
> function for aarch64 target and the call to target independent
> code.

Thanks for the update.  This is still quite hard to review though.
Sorry to ask for another round, but could you split it up further?
The ideal thing would be if patches that move code do nothing other
than move code, and if patches that change code do those changes
in-place.

Richard

>
> Bootstrapped in aarch64-linux-gnu.
>
> Thanks & Regards
> Ajit
>
>
> aarch64: Place target independent and dependent code in one file.
>
> Common infrastructure of load store pair fusion is divided into
> target independent and target dependent code.
>
> Target independent code is the Generic code with pure virtual
> function to interface betwwen target independent and dependent
> code.
>
> Target dependent code is the implementation of pure virtual
> function for aarch64 target and the call to target independent
> code.
>
> 2024-02-15  Ajit Kumar Agarwal  
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-ldp-fusion.cc: Place target
>   independent and dependent code.
> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 3513 --
>  1 file changed, 1842 insertions(+), 1671 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 22ed95eb743..0ab842e2bbb 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -17,6 +17,7 @@
>  // along with GCC; see the file COPYING3.  If not see
>  // .
>  
> +
>  #define INCLUDE_ALGORITHM
>  #define INCLUDE_FUNCTIONAL
>  #define INCLUDE_LIST
> @@ -37,13 +38,12 @@
>  #include "tree-hash-traits.h"
>  #include "print-tree.h"
>  #include "insn-attr.h"
> -
>  using namespace rtl_ssa;
>  
> -static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
> -static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
> -static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
> -static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_BITS = 7;
> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_SIGN_BIT = (1 << 
> (PAIR_MEM_IMM_BITS - 1));
> +static constexpr HOST_WIDE_INT PAIR_MEM_MAX_IMM = PAIR_MEM_IMM_SIGN_BIT - 1;
> +static constexpr HOST_WIDE_INT PAIR_MEM_MIN_IMM = -PAIR_MEM_MAX_IMM - 1;
>  
>  // We pack these fields (load_p, fpsimd_p, and size) into an integer
>  // (LFS) which we use as part of the key into the main hash tables.
> @@ -138,8 +138,144 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Class that implements a state machine for building the changes needed to 
> form
> +// a store pair instruction.  This allows us to easily build the changes in
> +// program order, as required by rtl-ssa.
> +struct stp_change_builder
> +{
> +  enum class state
> +  {
> +FIRST,
> +INSERT,
> +FIXUP_USE,
> +LAST,
> +DONE
> +  };
> +
> +  enum class action
> +  {
> +TOMBSTONE,
> +CHANGE,
> +INSERT,
> +FIXUP_USE
> +  };
> +
> +  struct change
> +  {
> +action type;
> +insn_info *insn;
> +  };
> +
> +  bool done () const { return m_state == state::DONE; }
> +
> +  stp_change_builder (insn_info *insns[2],
> +   insn_info *repurpose,
> +   insn_info *dest)
> +: m_state (state::FIRST), m_insns { insns[0], insns[1] },
> +  m_repurpose (repurpose), m_dest (dest), m_use (nullptr) {}
> +
> +  change get_change () const
> +  {
> +switch (m_state)
> +  {
> +  case state::FIRST:
> + return {
> +   m_insns[0] == m_repurpose ? action::CHANGE : action::TOMBSTONE,
> +   m_insns[0]
> + };
> +  case state::LAST:
> + return {
> +   m_insns[1] == m_repurpose ? action::CHANGE : action::TOMBSTONE,
> +   m_insns[1]
> + };
> +  case state::INSERT:
> + return { action::INSERT, m_dest };
> +  case state::FIXUP_USE:
> + return { action::FIXUP_USE, m_use->insn () };
> +  case state::DONE:
> + break;
> +  }
> +
> +gcc_unreachable ();
> +  }
> +
> +  // Transition to the next state.
> +  void advance ()
> +  {
> +switch (m_state)
> +  {
> +  case state::FIRST:
> + if (m_repurpose)
> +   m_state = state::LAST;
> + else
> +   m_state = state::INSERT;
> + break;
> +  case state::INSERT:
> +  {
> + def_info *def = memory_access (m_insns[0]->defs ());
> + while (*def->next_def ()->insn () <= *m_dest)
> +   def = def->next_def ();
> +
> +  

  1   2   3   4   5   6   7   8   9   10   >