Re: [PATCH v2] ARM: Block predication on atomics [PR111235]

2023-10-02 Thread Wilco Dijkstra
Hi Ramana,

>> I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf
>> --build=arm-none-linux-gnueabihf --with-float=hard. However it seems that the
>> default armhf settings are incorrect. I shouldn't need the --with-float=hard
>> since that is obviously implied by armhf, and they should also imply armv7-a
>> with vfpv3 according to documentation. It seems to get confused and skip some
>> tests. I tried using --with-fpu=auto, but that doesn't work at all, so in the
>> end I forced it like: --with-arch=armv8-a --with-fpu=neon-fp-armv8. With this
>> it runs a few more tests.
> 
> Yeah that's a wart that I don't like.
> 
> armhf just implies the hard float ABI and came into being to help
> distinguish from the Base PCS for some of the distros at the time
> (2010s). However we didn't want to set a baseline arch at that time
> given the imminent arrival of v8-a and thus the specification of
> --with-arch , --with-fpu and --with-float became second nature to many
> of us working on it at that time.

Looking at it, the default is indeed incorrect, you get:
'-mcpu=arm10e' '-mfloat-abi=hard' '-marm' '-march=armv5te+fp'

That's like 25 years out of date!

However all the armhf distros have Armv7-a as the baseline and use Thumb-2:
'-mfloat-abi=hard' '-mthumb' '-march=armv7-a+fp'

So the issue is that dg-require-effective-target arm_arch_v7a_ok doesn't work on
armhf. It seems that if you specify an architecture, even with hard-float
configured, it turns off FP and then complains because hard-float implies you
must have FP...

So in most configurations (including the one used by distro compilers) we
basically skip lots of tests for no apparent reason...
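
For reference, a sketch of the kind of test header involved (the exact
directives in the affected tests may differ):

/* { dg-do compile } */
/* { dg-require-effective-target arm_arch_v7a_ok } */
/* { dg-add-options arm_arch_v7a } */

It is the arm_arch_v7a_ok check that comes back false on an armhf
configuration, so the whole test gets skipped.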

> Ok, thanks for promising to do so - I trust you to get it done. Please
> try out various combinations of -march v7ve, v7-a , v8-a with the tool
> as each of them have slightly different rules. For instance v7ve
> allows LDREXD and STREXD to be single copy atomic for 64 bit loads
> whereas v7-a did not .

You mean LDRD may be generated on CPUs with LPAE. We use LDREXD by
default since that is always atomic on v7-a.

> Ok if no regressions but as you might get nagged by the post commit CI ...

Thanks, I've committed it. Those links don't show anything concrete, however
I do note the CI didn't pick up v2.

Btw you're happy with backports if there are no issues reported for a few days?

Cheers,
Wilco

[PATCH v2] AArch64: Add inline memmove expansion

2023-10-16 Thread Wilco Dijkstra
v2: further cleanups, improved comments

Add support for inline memmove expansions.  The generated code is identical
to that for memcpy, except that all loads are emitted before stores rather
than being interleaved.  The maximum size is 256 bytes, which requires at most
16 registers.
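
As an illustration (my own example, not part of the patch), a small fixed-size
memmove such as the one below is now expanded inline instead of calling the
library:

/* Hypothetical example: a 64-byte memmove over overlapping buffers.  With
   this patch the copy is expanded inline, and all 64 bytes are loaded into
   registers before any store is emitted, so the overlap is handled
   correctly.  */
void
shift_down (char *buf)
{
  __builtin_memmove (buf, buf + 8, 64);
}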

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog/
* config/aarch64/aarch64.opt (aarch64_mops_memmove_size_threshold):
Change default.
* config/aarch64/aarch64.md (cpymemdi): Add a parameter.
(movmemdi): Call aarch64_expand_cpymem.
* config/aarch64/aarch64.cc (aarch64_copy_one_block): Rename function,
simplify, support storing generated loads/stores. 
(aarch64_expand_cpymem): Support expansion of memmove.
* config/aarch64/aarch64-protos.h (aarch64_expand_cpymem): Add bool arg.

gcc/testsuite/ChangeLog/
* gcc.target/aarch64/memmove.c: Add new test.

---

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 60a55f4bc1956786ea687fc7cad7ec9e4a84e1f0..0d39622bd2826a3fde54d67b5c5da9ee9286cbbd 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -769,7 +769,7 @@ bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 tree aarch64_vector_load_decl (tree);
 void aarch64_expand_call (rtx, rtx, rtx, bool);
 bool aarch64_expand_cpymem_mops (rtx *, bool);
-bool aarch64_expand_cpymem (rtx *);
+bool aarch64_expand_cpymem (rtx *, bool);
 bool aarch64_expand_setmem (rtx *);
 bool aarch64_float_const_zero_rtx_p (rtx);
 bool aarch64_float_const_rtx_p (rtx);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 2fa5d09de85d385c1165e399bcc97681ef170916..e19e2d1de2e5b30eca672df05d9dcc1bc106ecc8 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25238,52 +25238,37 @@ aarch64_progress_pointer (rtx pointer)
   return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
 }
 
-/* Copy one MODE sized block from SRC to DST, then progress SRC and DST by
-   MODE bytes.  */
+/* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
 
 static void
-aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
- machine_mode mode)
+aarch64_copy_one_block (rtx *load, rtx *store, rtx src, rtx dst,
+   int offset, machine_mode mode)
 {
-  /* Handle 256-bit memcpy separately.  We do this by making 2 adjacent memory
- address copies using V4SImode so that we can use Q registers.  */
-  if (known_eq (GET_MODE_BITSIZE (mode), 256))
+  /* Emit explict load/store pair instructions for 32-byte copies.  */
+  if (known_eq (GET_MODE_SIZE (mode), 32))
 {
   mode = V4SImode;
+  rtx src1 = adjust_address (src, mode, offset);
+  rtx src2 = adjust_address (src, mode, offset + 16);
+  rtx dst1 = adjust_address (dst, mode, offset);
+  rtx dst2 = adjust_address (dst, mode, offset + 16);
   rtx reg1 = gen_reg_rtx (mode);
   rtx reg2 = gen_reg_rtx (mode);
-  /* "Cast" the pointers to the correct mode.  */
-  *src = adjust_address (*src, mode, 0);
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memcpy.  */
-  emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
-   aarch64_progress_pointer (*src)));
-  emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
-   aarch64_progress_pointer (*dst), reg2));
-  /* Move the pointers forward.  */
-  *src = aarch64_move_pointer (*src, 32);
-  *dst = aarch64_move_pointer (*dst, 32);
+  *load = aarch64_gen_load_pair (mode, reg1, src1, reg2, src2);
+  *store = aarch64_gen_store_pair (mode, dst1, reg1, dst2, reg2);
   return;
 }
 
   rtx reg = gen_reg_rtx (mode);
-
-  /* "Cast" the pointers to the correct mode.  */
-  *src = adjust_address (*src, mode, 0);
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memcpy.  */
-  emit_move_insn (reg, *src);
-  emit_move_insn (*dst, reg);
-  /* Move the pointers forward.  */
-  *src = aarch64_progress_pointer (*src);
-  *dst = aarch64_progress_pointer (*dst);
+  *load = gen_move_insn (reg, adjust_address (src, mode, offset));
+  *store = gen_move_insn (adjust_address (dst, mode, offset), reg);
 }
 
 /* Expand a cpymem/movmem using the MOPS extension.  OPERANDS are taken
from the cpymem/movmem pattern.  IS_MEMMOVE is true if this is a memmove
rather than memcpy.  Return true iff we succeeded.  */
 bool
-aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove = false)
+aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove)
 {
   if (!TARGET_MOPS)
 return false;
@@ -25302,51 +25287,48 @@ aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove = false)
   return true;
 }
 
-/* Expand cpymem, as if from a __builtin_memcpy.  Return true if
-   we succeed, otherwise return false, indicating that a libcall to
-

Re: [PATCH v2] AArch64: Fix strict-align cpymem/setmem [PR103100]

2023-10-16 Thread Wilco Dijkstra
ping
 
v2: Use UINTVAL, rename max_mops_size.

The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT.
Clean up the condition for when to use MOPS.
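
For illustration (my own example, not part of the patch), consider a copy
where the pointer alignment is unknown and -mstrict-align is in effect:

/* Hypothetical example: with -mstrict-align the known alignment of DST/SRC
   is 1, i.e. below 16, so with this patch the expansion is routed to MOPS
   when available (or a library call) instead of emitting unaligned SIMD
   accesses.  */
void
copy64 (char *dst, const char *src)
{
  __builtin_memcpy (dst, src, 64);
}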
    
Passes regress/bootstrap, OK for commit?
    
gcc/ChangeLog/
    PR target/103100
    * config/aarch64/aarch64.md (cpymemdi): Remove pattern condition.
    (setmemdi): Likewise.
    * config/aarch64/aarch64.cc (aarch64_expand_cpymem): Support
    strict-align.  Cleanup condition for using MOPS.
    (aarch64_expand_setmem): Likewise.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index dd6874d13a75f20d10a244578afc355b25c73da2..8a12894d6b80de1031d6e7d02dca680c57bce136 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25261,27 +25261,23 @@ aarch64_expand_cpymem (rtx *operands)
   int mode_bits;
   rtx dst = operands[0];
   rtx src = operands[1];
+  unsigned align = UINTVAL (operands[3]);
   rtx base;
   machine_mode cur_mode = BLKmode;
+  bool size_p = optimize_function_for_size_p (cfun);
 
-  /* Variable-sized memcpy can go through the MOPS expansion if available.  */
-  if (!CONST_INT_P (operands[2]))
+  /* Variable-sized or strict-align copies may use the MOPS expansion.  */
+  if (!CONST_INT_P (operands[2]) || (STRICT_ALIGNMENT && align < 16))
 return aarch64_expand_cpymem_mops (operands);
 
-  unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
-
-  /* Try to inline up to 256 bytes or use the MOPS threshold if available.  */
-  unsigned HOST_WIDE_INT max_copy_size
-    = TARGET_MOPS ? aarch64_mops_memcpy_size_threshold : 256;
+  unsigned HOST_WIDE_INT size = UINTVAL (operands[2]);
 
-  bool size_p = optimize_function_for_size_p (cfun);
+  /* Try to inline up to 256 bytes.  */
+  unsigned max_copy_size = 256;
+  unsigned mops_threshold = aarch64_mops_memcpy_size_threshold;
 
-  /* Large constant-sized cpymem should go through MOPS when possible.
- It should be a win even for size optimization in the general case.
- For speed optimization the choice between MOPS and the SIMD sequence
- depends on the size of the copy, rather than number of instructions,
- alignment etc.  */
-  if (size > max_copy_size)
+  /* Large copies use MOPS when available or a library call.  */
+  if (size > max_copy_size || (TARGET_MOPS && size > mops_threshold))
 return aarch64_expand_cpymem_mops (operands);
 
   int copy_bits = 256;
@@ -25445,12 +25441,13 @@ aarch64_expand_setmem (rtx *operands)
   unsigned HOST_WIDE_INT len;
   rtx dst = operands[0];
   rtx val = operands[2], src;
+  unsigned align = UINTVAL (operands[3]);
   rtx base;
   machine_mode cur_mode = BLKmode, next_mode;
 
-  /* If we don't have SIMD registers or the size is variable use the MOPS
- inlined sequence if possible.  */
-  if (!CONST_INT_P (operands[1]) || !TARGET_SIMD)
+  /* Variable-sized or strict-align memset may use the MOPS expansion.  */
+  if (!CONST_INT_P (operands[1]) || !TARGET_SIMD
+  || (STRICT_ALIGNMENT && align < 16))
 return aarch64_expand_setmem_mops (operands);
 
   bool size_p = optimize_function_for_size_p (cfun);
@@ -25458,10 +25455,13 @@ aarch64_expand_setmem (rtx *operands)
   /* Default the maximum to 256-bytes when considering only libcall vs
  SIMD broadcast sequence.  */
   unsigned max_set_size = 256;
+  unsigned mops_threshold = aarch64_mops_memset_size_threshold;
 
-  len = INTVAL (operands[1]);
-  if (len > max_set_size && !TARGET_MOPS)
-    return false;
+  len = UINTVAL (operands[1]);
+
+  /* Large memset uses MOPS when available or a library call.  */
+  if (len > max_set_size || (TARGET_MOPS && len > mops_threshold))
+    return aarch64_expand_setmem_mops (operands);
 
   int cst_val = !!(CONST_INT_P (val) && (INTVAL (val) != 0));
   /* The MOPS sequence takes:
@@ -25474,12 +25474,6 @@ aarch64_expand_setmem (rtx *operands)
  the arguments + 1 for the call.  */
   unsigned libcall_cost = 4;
 
-  /* Upper bound check.  For large constant-sized setmem use the MOPS sequence
- when available.  */
-  if (TARGET_MOPS
-  && len >= (unsigned HOST_WIDE_INT) aarch64_mops_memset_size_threshold)
-    return aarch64_expand_setmem_mops (operands);
-
   /* Attempt a sequence with a vector broadcast followed by stores.
  Count the number of operations involved to see if it's worth it
  against the alternatives.  A simple counter simd_ops on the
@@ -25521,10 +25515,8 @@ aarch64_expand_setmem (rtx *operands)
   simd_ops++;
   n -= mode_bits;
 
-  /* Do certain trailing copies as overlapping if it's going to be
-    cheaper.  i.e. less instructions to do so.  For instance doing a 15
-    byte copy it's more efficient to do two overlapping 8 byte copies than
-    8 + 4 + 2 + 1.  Only do this when -mstrict-align is not supplied.  */
+  /* Emit trailing writes using overlapping unaligned accesses
+   (when !STRICT_ALIGNMEN

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-10-16 Thread Wilco Dijkstra
 

ping

From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 
[PR110061] 
 

Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with
existing binaries, gives better performance than locking atomics and is what
most users expect.

Note 128-bit atomic loads use a load/store exclusive loop if LSE2 is not
supported.  This results in an implicit store which is invisible to software
as long as the given address is writeable (which will be true when using
atomics in actual code).

A simple test on an old Cortex-A72 showed 2.7x speedup of 128-bit atomics.
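
For reference (my own example, not from the patch), plain C code such as the
following now ends up in the new libat_*_16 entry points and is lock-free even
on baseline Armv8.0:

/* Hypothetical example: 16-byte atomics via libatomic.  With this patch
   __atomic_is_lock_free reports true, and the load uses an LDXP/STXP loop
   on baseline Armv8.0 (the LSE2 ifunc variant is used when available)
   rather than a lock.  */
#include <stdbool.h>

__int128 shared;

__int128
read_shared (void)
{
  return __atomic_load_n (&shared, __ATOMIC_ACQUIRE);
}

bool
shared_is_lock_free (void)
{
  return __atomic_is_lock_free (sizeof shared, &shared);
}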

Passes regress, OK for commit?

libatomic/
    PR target/110061
    config/linux/aarch64/atomic_16.S: Implement lock-free ARMv8.0 atomics.
    config/linux/aarch64/host-config.h: Use atomic_16.S for baseline v8.0.
    State we have lock-free atomics.

---

diff --git a/libatomic/config/linux/aarch64/atomic_16.S b/libatomic/config/linux/aarch64/atomic_16.S
index 05439ce394b9653c9bcb582761ff7aaa7c8f9643..0485c284117edf54f41959d2fab9341a9567b1cf 100644
--- a/libatomic/config/linux/aarch64/atomic_16.S
+++ b/libatomic/config/linux/aarch64/atomic_16.S
@@ -22,6 +22,21 @@
    <http://www.gnu.org/licenses/>.  */
 
 
+/* AArch64 128-bit lock-free atomic implementation.
+
+   128-bit atomics are now lock-free for all AArch64 architecture versions.
+   This is backwards compatible with existing binaries and gives better
+   performance than locking atomics.
+
+   128-bit atomic loads use a exclusive loop if LSE2 is not supported.
+   This results in an implicit store which is invisible to software as long
+   as the given address is writeable.  Since all other atomics have explicit
+   writes, this will be true when using atomics in actual code.
+
+   The libat__16 entry points are ARMv8.0.
+   The libat__16_i1 entry points are used when LSE2 is available.  */
+
+
 .arch   armv8-a+lse
 
 #define ENTRY(name) \
@@ -37,6 +52,10 @@ name:    \
 .cfi_endproc;   \
 .size name, .-name;
 
+#define ALIAS(alias,name)  \
+   .global alias;  \
+   .set alias, name;
+
 #define res0 x0
 #define res1 x1
 #define in0  x2
@@ -70,6 +89,24 @@ name:    \
 #define SEQ_CST 5
 
 
+ENTRY (libat_load_16)
+   mov x5, x0
+   cbnz    w1, 2f
+
+   /* RELAXED.  */
+1: ldxp    res0, res1, [x5]
+   stxp    w4, res0, res1, [x5]
+   cbnz    w4, 1b
+   ret
+
+   /* ACQUIRE/CONSUME/SEQ_CST.  */
+2: ldaxp   res0, res1, [x5]
+   stxp    w4, res0, res1, [x5]
+   cbnz    w4, 2b
+   ret
+END (libat_load_16)
+
+
 ENTRY (libat_load_16_i1)
 cbnz    w1, 1f
 
@@ -93,6 +130,23 @@ ENTRY (libat_load_16_i1)
 END (libat_load_16_i1)
 
 
+ENTRY (libat_store_16)
+   cbnz    w4, 2f
+
+   /* RELAXED.  */
+1: ldxp    xzr, tmp0, [x0]
+   stxp    w4, in0, in1, [x0]
+   cbnz    w4, 1b
+   ret
+
+   /* RELEASE/SEQ_CST.  */
+2: ldxp    xzr, tmp0, [x0]
+   stlxp   w4, in0, in1, [x0]
+   cbnz    w4, 2b
+   ret
+END (libat_store_16)
+
+
 ENTRY (libat_store_16_i1)
 cbnz    w4, 1f
 
@@ -101,14 +155,14 @@ ENTRY (libat_store_16_i1)
 ret
 
 /* RELEASE/SEQ_CST.  */
-1: ldaxp   xzr, tmp0, [x0]
+1: ldxp    xzr, tmp0, [x0]
 stlxp   w4, in0, in1, [x0]
 cbnz    w4, 1b
 ret
 END (libat_store_16_i1)
 
 
-ENTRY (libat_exchange_16_i1)
+ENTRY (libat_exchange_16)
 mov x5, x0
 cbnz    w4, 2f
 
@@ -126,22 +180,55 @@ ENTRY (libat_exchange_16_i1)
 stxp    w4, in0, in1, [x5]
 cbnz    w4, 3b
 ret
-4:
-   cmp w4, RELEASE
-   b.ne    6f
 
-   /* RELEASE.  */
-5: ldxp    res0, res1, [x5]
+   /* RELEASE/ACQ_REL/SEQ_CST.  */
+4: ldaxp   res0, res1, [x5]
 stlxp   w4, in0, in1, [x5]
-   cbnz    w4, 5b
+   cbnz    w4, 4b
 ret
+END (libat_exchange_16)
 
-   /* ACQ_REL/SEQ_CST.  */
-6: ldaxp   res0, res1, [x5]
-   stlxp   w4, in0, in1, [x5]
-   cbnz    w4, 6b
+
+ENTRY (libat_compare_exchange_16)
+   ldp exp0, exp1, [x1]
+   cbz w4, 3f
+   cmp w4, RELEASE
+   b.hs    4f
+
+   /* ACQUIRE/CONSUME.  */
+1: ldaxp   tmp0, tmp1, [x0]
+   cmp tmp0, exp0
+   ccmp    tmp1, exp1, 0, eq
+   bne 2f
+   stxp    w4, in0, in1, [x0]
+   cbnz    w4, 1b
+   mov x0, 1
 ret
-END (libat_exchange_16_i1)
+
+2: stp tmp0, tmp1, [x1]
+   mov x0, 0
+   ret
+
+   /* RELAXED.  */
+3: ldxp    tmp0, tmp1, [x0]
+   cmp tmp0, exp0
+   ccmp    tmp1, exp1, 0, eq
+   bne 2b
+   stxp    w4, in0, in1, [x0]
+   cbnz    w4, 3b
+   mov x0, 1
+   ret
+
+   /* RELEASE/ACQ_REL/SEQ_CST.  */
+4: ldaxp   tmp0

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-10-16 Thread Wilco Dijkstra
 

ping


From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches ; Richard Sandiford 

Cc: Kyrylo Tkachov 
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64 
 

Add support for ifunc selection based on CPUID register.  Neoverse N1 supports
atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse
cores.

Passes regress, OK for commit?

libatomic/
    config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc
    selection.

---

diff --git a/libatomic/config/linux/aarch64/host-config.h b/libatomic/config/linux/aarch64/host-config.h
index 851c78c01cd643318aaa52929ce4550266238b79..e5dc33c030a4bab927874fa6c69425db463fdc4b 100644
--- a/libatomic/config/linux/aarch64/host-config.h
+++ b/libatomic/config/linux/aarch64/host-config.h
@@ -26,7 +26,7 @@
 
 #ifdef HWCAP_USCAT
 # if N == 16
-#  define IFUNC_COND_1 (hwcap & HWCAP_USCAT)
+#  define IFUNC_COND_1 ifunc1 (hwcap)
 # else
 #  define IFUNC_COND_1  (hwcap & HWCAP_ATOMICS)
 # endif
@@ -50,4 +50,28 @@
 #undef MAYBE_HAVE_ATOMIC_EXCHANGE_16
 #define MAYBE_HAVE_ATOMIC_EXCHANGE_16   1
 
+#ifdef HWCAP_USCAT
+
+#define MIDR_IMPLEMENTOR(midr) (((midr) >> 24) & 255)
+#define MIDR_PARTNUM(midr) (((midr) >> 4) & 0xfff)
+
+static inline bool
+ifunc1 (unsigned long hwcap)
+{
+  if (hwcap & HWCAP_USCAT)
+    return true;
+  if (!(hwcap & HWCAP_CPUID))
+    return false;
+
+  unsigned long midr;
+  asm volatile ("mrs %0, midr_el1" : "=r" (midr));
+
+  /* Neoverse N1 supports atomic 128-bit load/store.  */
+  if (MIDR_IMPLEMENTOR (midr) == 'A' && MIDR_PARTNUM(midr) == 0xd0c)
+    return true;
+
+  return false;
+}
+#endif
+
 #include_next 

Re: [PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-10-16 Thread Wilco Dijkstra
ping
 

__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop.  On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails.  In this case the value
returned is not read atomically.
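
For illustration (my own example, not from the patch), the affected pattern is
a 16-byte compare-and-swap where the comparison fails:

/* Hypothetical example: when *P does not equal EXPECTED, the old contents of
   *P are returned.  Without this fix the LDXP result is returned without a
   successful matching STXP, so the returned value is not guaranteed to be a
   single-copy-atomic read of *P.  */
__int128
cas16 (__int128 *p, __int128 expected, __int128 desired)
{
  return __sync_val_compare_and_swap (p, expected, desired);
}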

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog/
    PR target/111404
    * config/aarch64/aarch64.cc (aarch64_split_compare_and_swap):
    For 128-bit store the loaded value and loop if needed.

libgcc/ChangeLog/
    PR target/111404
    * config/aarch64/lse.S (__aarch64_cas16_acq_rel): Execute STLXP using
    either new value or loaded value.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 5e8d0a0c91bc7719de2a8c5627b354cf905a4db0..c44c0b979d0cc3755c61dcf566cfddedccebf1ea 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23413,11 +23413,11 @@ aarch64_split_compare_and_swap (rtx operands[])
   mem = operands[1];
   oldval = operands[2];
   newval = operands[3];
-  is_weak = (operands[4] != const0_rtx);
   model_rtx = operands[5];
   scratch = operands[7];
   mode = GET_MODE (mem);
   model = memmodel_from_int (INTVAL (model_rtx));
+  is_weak = operands[4] != const0_rtx && mode != TImode;
 
   /* When OLDVAL is zero and we want the strong version we can emit a tighter
 loop:
@@ -23478,6 +23478,33 @@ aarch64_split_compare_and_swap (rtx operands[])
   else
 aarch64_gen_compare_reg (NE, scratch, const0_rtx);
 
+  /* 128-bit LDAXP is not atomic unless STLXP succeeds.  So for a mismatch,
+ store the returned value and loop if the STLXP fails.  */
+  if (mode == TImode)
+    {
+  rtx_code_label *label3 = gen_label_rtx ();
+  emit_jump_insn (gen_rtx_SET (pc_rtx, gen_rtx_LABEL_REF (Pmode, label3)));
+  emit_barrier ();
+
+  emit_label (label2);
+  aarch64_emit_store_exclusive (mode, scratch, mem, rval, model_rtx);
+
+  if (aarch64_track_speculation)
+   {
+ /* Emit an explicit compare instruction, so that we can correctly
+    track the condition codes.  */
+ rtx cc_reg = aarch64_gen_compare_reg (NE, scratch, const0_rtx);
+ x = gen_rtx_NE (GET_MODE (cc_reg), cc_reg, const0_rtx);
+   }
+  else
+   x = gen_rtx_NE (VOIDmode, scratch, const0_rtx);
+  x = gen_rtx_IF_THEN_ELSE (VOIDmode, x,
+   gen_rtx_LABEL_REF (Pmode, label1), pc_rtx);
+  aarch64_emit_unlikely_jump (gen_rtx_SET (pc_rtx, x));
+
+  label2 = label3;
+    }
+
   emit_label (label2);
 
   /* If we used a CBNZ in the exchange loop emit an explicit compare with RVAL
diff --git a/libgcc/config/aarch64/lse.S b/libgcc/config/aarch64/lse.S
index dde3a28e07b13669533dfc5e8fac0a9a6ac33dbd..ba05047ff02b6fc5752235bffa924fc4a2f48c04 100644
--- a/libgcc/config/aarch64/lse.S
+++ b/libgcc/config/aarch64/lse.S
@@ -160,6 +160,8 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 #define tmp0    16
 #define tmp1    17
 #define tmp2    15
+#define tmp3   14
+#define tmp4   13
 
 #define BTI_C   hint    34
 
@@ -233,10 +235,11 @@ STARTFN   NAME(cas)
 0:  LDXP    x0, x1, [x4]
 cmp x0, x(tmp0)
 ccmp    x1, x(tmp1), #0, eq
-   bne 1f
-   STXP    w(tmp2), x2, x3, [x4]
-   cbnz    w(tmp2), 0b
-1: BARRIER
+   csel    x(tmp2), x2, x0, eq
+   csel    x(tmp3), x3, x1, eq
+   STXP    w(tmp4), x(tmp2), x(tmp3), [x4]
+   cbnz    w(tmp4), 0b
+   BARRIER
 ret
 
 #endif


Re: [PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-10-16 Thread Wilco Dijkstra
Hi Ramana,

> I remember this to be the previous discussions and common understanding.
>
> https://gcc.gnu.org/legacy-ml/gcc/2016-06/msg00017.html
>
> and here
> 
> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-02/msg00168.html
>
> Can you point any discussion recently that shows this has changed and
> point me at that discussion if any anywhere ? I can't find it in my
> searches . Perhaps you've had the discussion some place to show it has
> changed.

Here are some more recent discussions about atomics, eg. this has good
arguments from developers wanting lock-free atomics:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

We also had some discussion about how we could handle the read-only corner
case, by either giving a warning/error on const pointers to atomics or
ensuring _Atomic variables are writeable:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108659
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109553

My conclusion from that is that nobody cared enough to fix this for x86
in all these years, so it's not seen as an important issue.

We've had several internal design discussions to figure out how to fix the ABI
issues. The conclusion was that this is the only possible solution that makes
GCC and LLVM compatible without breaking backwards compatibility. It also
allows use of newer atomic instructions (which people want inlined).

Cheers,
Wilco

[PATCH] AArch64: Improve immediate generation

2023-10-19 Thread Wilco Dijkstra
Further improve immediate generation by adding support for 2-instruction
MOV/EOR bitmask immediates.  This reduces the number of 3/4-instruction
immediates in SPECCPU2017 by ~2%.
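
As a worked example, using a constant of the form tested by the new
moveor_imm.c test (the full 64-bit value here is reconstructed from the
expected mov/eor operands in the v2 posting of this patch, so treat it as an
assumption):

/* Hypothetical example: 0x2aaaaaaaaaaaaaab is neither a bitmask immediate nor
   a MOVZ/MOVN immediate, but it is the XOR of two bitmask immediates, so it
   can now be synthesized in two instructions:
       mov  x0, 0xaaaaaaaaaaaaaaaa
       eor  x0, x0, 0x8000000000000001
   instead of a 3/4-instruction MOV/MOVK sequence.  */
long
f1 (void)
{
  return 0x2aaaaaaaaaaaaaabL;
}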

Passes regress, OK for commit?

gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
Add support for immediates using MOV/EOR bitmask.

gcc/testsuite:
* gcc.target/aarch64/imm_choice_comparison.c: Fix test.
* gcc.target/aarch64/moveor_imm.c: Add new test.
* gcc.target/aarch64/pr106583.c: Fix test.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 578a253d6e0e133e19592553fc873b3e73f9f218..ed5be2b64c9a767d74e9d78415da964c669001aa 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -5748,6 +5748,26 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
generate,
}
  return 2;
}
+
+  /* Try 2 bitmask immediates which are xor'd together. */
+  for (i = 0; i < 64; i += 16)
+   {
+ val2 = (val >> i) & mask;
+ val2 |= val2 << 16;
+ val2 |= val2 << 32;
+ if (aarch64_bitmask_imm (val2) && aarch64_bitmask_imm (val ^ val2))
+   break;
+   }
+
+  if (i != 64)
+   {
+ if (generate)
+   {
+ emit_insn (gen_rtx_SET (dest, GEN_INT (val2)));
+ emit_insn (gen_xordi3 (dest, dest, GEN_INT (val ^ val2)));
+   }
+ return 2;
+   }
 }
 
   /* Try a bitmask plus 2 movk to generate the immediate in 3 instructions.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison.c b/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison.c
index ebc44d6dbc7287d907603d77d7b54496de177c4b..2434ca380ca2cad3e1e4181deeaad680f518b866 100644
--- a/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison.c
+++ b/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison.c
@@ -6,7 +6,7 @@
 int
 foo (long long x)
 {
-  return x <= 0x1998;
+  return x <= 0x9998;
 }
 
 int
diff --git a/gcc/testsuite/gcc.target/aarch64/moveor_imm.c 
b/gcc/testsuite/gcc.target/aarch64/moveor_imm.c
new file mode 100644
index 
..5f4997b50398fdda5924610959e0c54967ad0735
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/moveor_imm.c
@@ -0,0 +1,31 @@
+/* { dg-do assemble } */
+/* { dg-options "-O2 --save-temps" } */
+
+long f1 (void)
+{
+  return 0x2aab;
+}
+
+long f2 (void)
+{
+  return 0x10f0f0f0f0f0f0f1;
+}
+
+long f3 (void)
+{
+  return 0xccd;
+}
+
+long f4 (void)
+{
+  return 0x1998;
+}
+
+long f5 (void)
+{
+  return 0x3f333f33;
+}
+
+/* { dg-final { scan-assembler-not {\tmovk\t} } } */
+/* { dg-final { scan-assembler-times {\tmov\t} 5 } } */
+/* { dg-final { scan-assembler-times {\teor\t} 5 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106583.c b/gcc/testsuite/gcc.target/aarch64/pr106583.c
index 0f931580817d78dc1cc58f03b251bd21bec71f59..79ada5160ce059d66eeaee407ca02488b2a1f114 100644
--- a/gcc/testsuite/gcc.target/aarch64/pr106583.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr106583.c
@@ -3,7 +3,7 @@
 
 long f1 (void)
 {
-  return 0x7efefefefefefeff;
+  return 0x75fefefefefefeff;
 }
 
 long f2 (void)



[PATCH] AArch64: Cleanup memset expansion

2023-10-19 Thread Wilco Dijkstra
Cleanup memset implementation.  Similar to memcpy/memmove, use an offset and
bytes throughout.  Simplify the complex calculations when optimizing for size
by using a fixed limit.
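
To illustrate the fixed limit (my own example; the 96-byte value is the
max_set_size used for -Os in the patch below):

/* Hypothetical example, compiled with -Os: constant memsets up to 96 bytes
   may still be considered for the inline SIMD sequence, while larger ones go
   to MOPS when available or a memset library call.  */
void
clear_small (char *p)
{
  __builtin_memset (p, 0, 96);
}

void
clear_large (char *p)
{
  __builtin_memset (p, 0, 256);
}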

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_progress_pointer): Remove function.
(aarch64_set_one_block_and_progress_pointer): Simplify and clean up.
(aarch64_expand_setmem): Clean up implementation, use byte offsets,
simplify size calculation.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e19e2d1de2e5b30eca672df05d9dcc1bc106ecc8..578a253d6e0e133e19592553fc873b3e73f9f218 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25229,15 +25229,6 @@ aarch64_move_pointer (rtx pointer, poly_int64 amount)
next, amount);
 }
 
-/* Return a new RTX holding the result of moving POINTER forward by the
-   size of the mode it points to.  */
-
-static rtx
-aarch64_progress_pointer (rtx pointer)
-{
-  return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
-}
-
 /* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
 
 static void
@@ -25393,46 +25384,22 @@ aarch64_expand_cpymem (rtx *operands, bool is_memmove)
   return true;
 }
 
-/* Like aarch64_copy_one_block_and_progress_pointers, except for memset where
-   SRC is a register we have created with the duplicated value to be set.  */
+/* Set one block of size MODE at DST at offset OFFSET to value in SRC.  */
 static void
-aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
-   machine_mode mode)
-{
-  /* If we are copying 128bits or 256bits, we can do that straight from
- the SIMD register we prepared.  */
-  if (known_eq (GET_MODE_BITSIZE (mode), 256))
-{
-  mode = GET_MODE (src);
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memset.  */
-  emit_insn (aarch64_gen_store_pair (mode, *dst, src,
-aarch64_progress_pointer (*dst), src));
-
-  /* Move the pointers forward.  */
-  *dst = aarch64_move_pointer (*dst, 32);
-  return;
-}
-  if (known_eq (GET_MODE_BITSIZE (mode), 128))
+aarch64_set_one_block (rtx src, rtx dst, int offset, machine_mode mode)
+{
+  /* Emit explict store pair instructions for 32-byte writes.  */
+  if (known_eq (GET_MODE_SIZE (mode), 32))
 {
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, GET_MODE (src), 0);
-  /* Emit the memset.  */
-  emit_move_insn (*dst, src);
-  /* Move the pointers forward.  */
-  *dst = aarch64_move_pointer (*dst, 16);
+  mode = V16QImode;
+  rtx dst1 = adjust_address (dst, mode, offset);
+  rtx dst2 = adjust_address (dst, mode, offset + 16);
+  emit_insn (aarch64_gen_store_pair (mode, dst1, src, dst2, src));
   return;
 }
-  /* For copying less, we have to extract the right amount from src.  */
-  rtx reg = lowpart_subreg (mode, src, GET_MODE (src));
-
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memset.  */
-  emit_move_insn (*dst, reg);
-  /* Move the pointer forward.  */
-  *dst = aarch64_progress_pointer (*dst);
+  if (known_lt (GET_MODE_SIZE (mode), 16))
+src = lowpart_subreg (mode, src, GET_MODE (src));
+  emit_move_insn (adjust_address (dst, mode, offset), src);
 }
 
 /* Expand a setmem using the MOPS instructions.  OPERANDS are the same
@@ -25461,7 +25428,7 @@ aarch64_expand_setmem_mops (rtx *operands)
 bool
 aarch64_expand_setmem (rtx *operands)
 {
-  int n, mode_bits;
+  int mode_bytes;
   unsigned HOST_WIDE_INT len;
   rtx dst = operands[0];
   rtx val = operands[2], src;
@@ -25474,104 +25441,70 @@ aarch64_expand_setmem (rtx *operands)
   || (STRICT_ALIGNMENT && align < 16))
 return aarch64_expand_setmem_mops (operands);
 
-  bool size_p = optimize_function_for_size_p (cfun);
-
   /* Default the maximum to 256-bytes when considering only libcall vs
  SIMD broadcast sequence.  */
   unsigned max_set_size = 256;
   unsigned mops_threshold = aarch64_mops_memset_size_threshold;
 
+  /* Reduce the maximum size with -Os.  */
+  if (optimize_function_for_size_p (cfun))
+max_set_size = 96;
+
   len = UINTVAL (operands[1]);
 
   /* Large memset uses MOPS when available or a library call.  */
   if (len > max_set_size || (TARGET_MOPS && len > mops_threshold))
 return aarch64_expand_setmem_mops (operands);
 
-  int cst_val = !!(CONST_INT_P (val) && (INTVAL (val) != 0));
-  /* The MOPS sequence takes:
- 3 instructions for the memory storing
- + 1 to move the constant size into a reg
- + 1 if VAL is a non-zero constant to move into a reg
-(zero constants can use XZR directly).  */
-  unsigned mops_cost = 3 + 1 + cst_val;
-  /* A libcall to memset in the worst ca

Re: RFC: Patch to implement Aarch64 SIMD ABI

2018-07-19 Thread Wilco Dijkstra
Hi Steve,

> This patch checks for SIMD functions and saves the extra registers when
> needed.  It does not change the caller behavour, so with just this patch
> there may be values saved by both the caller and callee.  This is not
> efficient, but it is correct code.

I tried a few simple test cases. It seems calls to non-vector functions don't
mark the callee-saves as needing to be saved/restored:

void g(void);

void __attribute__ ((aarch64_vector_pcs))
f1 (void)
{ 
  g();
  g();
}

f1:
str x30, [sp, -16]!
bl  g
ldr x30, [sp], 16
b   g

Here I would expect q8-q23 to be preserved and no tailcall to g() since it is
not a vector function. This is important for correctness since f1 must
preserve q8-q23.


// compile with -O2 -ffixed-d1 -ffixed-d2 -ffixed-d3 -ffixed-d4 -ffixed-d5 
-ffixed-d6 -ffixed-d7
float __attribute__ ((aarch64_vector_pcs))
f2 (float *p)
{
  float t0 = p[1];
  float t1 = p[3];
  float t2 = p[5]; 
  return t0 - t1 * (t1 + t0) + (t2 * t0);
}

f2:
stp d16, d17, [sp, -48]!
ldr s17, [x0, 4]
ldr s18, [x0, 12]
ldr s0, [x0, 20]
fadds16, s17, s18
fmsub   s16, s16, s18, s17
fmadd   s0, s17, s0, s16
ldp d16, d17, [sp], 48
ret

This uses s16-s18 when it should prefer to use s24-s31 first. Also it needs to
save q16-q18, not only d16 and d17.

Btw the -ffixed-d* is useful to block the register allocator from using
certain registers.

Wilco


Re: RFC: Patch to implement Aarch64 SIMD ABI

2018-07-20 Thread Wilco Dijkstra
Steve Ellcey wrote:

> Yes, I see where I missed this in aarch64_push_regs
> and aarch64_pop_regs.  I think that is why the second of
> Wilco's two examples (f2) is wrong.  I am unclear about
> exactly what is meant by writeback and why we have it and
> how that and callee_adjust are used.  Any chance someone
> could help me understand this part of the prologue/epilogue
> code better?  The comments in aarch64.c/aarch64.h aren't
> really helping me understand what the code is doing or
> why it is doing it.

Writeback is the same as a base update in a load or store. When
creating the frame there are 3 stack adjustments to be made:
creating stack for locals, pushing callee-saves and reserving space
for outgoing arguments. We merge these stack adjustments as much as
possible and use load/store with writeback for codesize and performance.
See the last part in layout_frame for the different cases.

In many common cases the frame is small and there are no outgoing
arguments, so we emit an STP with writeback to store the first 2 callee-saves
and create the full frame in a single instruction. In this case callee_adjust
will be the frame size and initial_adjust will be zero.

push_regs and pop_regs need to be passed a mode since layout_frame
will use STP with writeback of floating-point callee-saves if there are no
integer callee-saves. Note if there is only one or an odd number of
callee-saves it may use LDR/STR with writeback, so we need to support TFmode
for these too.
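
A concrete illustration (my own sketch, not taken from the patch):

/* Hypothetical example: for a small frame with no outgoing argument area the
   stack allocation is merged into the callee-save store, giving a prologue
   along the lines of
       stp  x29, x30, [sp, -32]!   // writeback: push callee-saves, create frame
       mov  x29, sp
   (or just "str x30, [sp, -16]!" when the frame pointer is omitted, as in the
   f1 example earlier in this thread).  Here callee_adjust is the whole frame
   size and initial_adjust is zero.  */
void g (int *);

void
f (void)
{
  int local = 0;
  g (&local);
}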

Wilco


Re: [Patch-86512]: Subnormal float support in armv7(with -msoft-float) for intrinsics

2018-07-20 Thread Wilco Dijkstra
Hi Umesh,

Looking at your patch, this would break all results which need to be normalized.


Index: libgcc/config/arm/ieee754-df.S
===
--- libgcc/config/arm/ieee754-df.S  (revision 262850)
+++ libgcc/config/arm/ieee754-df.S  (working copy)
@@ -203,8 +203,11 @@
 #endif
 
@ Determine how to normalize the result.
+   @ if result is denormal i.e (exp)=0,then don't normalise the result,
 LSYM(Lad_p):
cmp xh, #0x0010
+   blt LSYM(Lad_e)
+   cmp xh, #0x0010
bcc LSYM(Lad_a)
cmp xh, #0x0020
bcc LSYM(Lad_e)

It seems Lad_a doesn't correctly handle the case where the result is a
denormal. For this case the result is correct so nothing else needs to be
done. This requires an explicit test that the exponent is zero - other cases
still need to be renormalized as usual. This code looks overly complex so any
change will require extensive testing of all the corner cases.
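
One minimal corner case worth including in such testing (my own example) is an
addition of two normal doubles whose exact result is subnormal:

/* Hypothetical test: both inputs are normal, the exact result 0.5*DBL_MIN is
   subnormal, so the Lad_a path must not renormalize it.  Build with
   -msoft-float (or for a soft-float multilib) and compare against a known
   good libgcc.  */
#include <float.h>
#include <stdio.h>

int
main (void)
{
  volatile double a = 1.5 * DBL_MIN;
  volatile double b = -DBL_MIN;
  double r = a + b;   /* exact result: 0.5 * DBL_MIN (subnormal) */
  printf ("%a\n", r);
  return r != 0.5 * DBL_MIN;
}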

Wilco


Re: [Patch-86512]: Subnormal float support in armv7(with -msoft-float) for intrinsics

2018-07-20 Thread Wilco Dijkstra
Umesh Kalappa wrote:

> We tried some of the normalisation numbers and the fix works and please
> could you help us with the input ,where  if you see that fix breaks down.

Well try any set of inputs which require normalisation. You'll find these no
longer get normalised and so will get incorrect results. Try basic cases like
1.0 - 0.75 which I think will return 0.625...

A basic test would be to run old vs new on a large set of inputs to verify
there aren't any obvious differences.
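
A rough sketch of such a harness (my own code; softadd_old/softadd_new stand
in for the two __adddf3 builds under test and are assumed names, not real
symbols):

/* Hypothetical harness: feed random 64-bit patterns through both soft-float
   adds and compare the result bit patterns, treating all NaNs as equal.  */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern double softadd_old (double, double);   /* old __adddf3 build */
extern double softadd_new (double, double);   /* patched __adddf3 build */

static uint64_t
dbl_bits (double d)
{
  uint64_t u;
  memcpy (&u, &d, sizeof u);
  return u;
}

static int
is_nan_bits (uint64_t u)
{
  return (u & 0x7fffffffffffffffULL) > 0x7ff0000000000000ULL;
}

int
main (void)
{
  for (long i = 0; i < 10000000; i++)
    {
      /* Crude 64-bit random patterns (the shifts also exercise the sign bit).  */
      uint64_t ua = ((uint64_t) rand () << 40) ^ ((uint64_t) rand () << 20) ^ rand ();
      uint64_t ub = ((uint64_t) rand () << 40) ^ ((uint64_t) rand () << 20) ^ rand ();
      double a, b;
      memcpy (&a, &ua, sizeof a);
      memcpy (&b, &ub, sizeof b);
      uint64_t r1 = dbl_bits (softadd_old (a, b));
      uint64_t r2 = dbl_bits (softadd_new (a, b));
      if (r1 != r2 && !(is_nan_bits (r1) && is_nan_bits (r2)))
        printf ("mismatch: %016llx + %016llx\n",
                (unsigned long long) ua, (unsigned long long) ub);
    }
  return 0;
}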

Wilco
   

Re: [Patch-86512]: Subnormal float support in armv7(with -msoft-float) for intrinsics

2018-07-23 Thread Wilco Dijkstra
Umesh Kalappa wrote:

> We tested on the SP and yes the problem persist on the SP too and
> attached patch will fix the both SP and DP issues for the  denormal
> resultant.

The patch now looks correct to me (but I can't approve).

> We bootstrapped the compiler ,look ok to us with minimal testing ,
>
> Any floating point test-suite to test for the attached patch ? any
> recommendations or inputs  ?

Running the GCC regression tests would be required since a bootstrap isn't 
useful for this kind of change. Assuming you use Linux, building and running
GLIBC with the changed GCC would give additional test coverage as it tests
all the math library functions.

I don't know of any IEEE conformance testsuites in the GNU world, which is
why I'm suggesting running some targeted and randomized tests. You could
use the generic soft-float code in libgcc/soft-fp/adddf3.c to compare the 
outputs.


>>> Index: libgcc/config/arm/ieee754-df.S
>>> ===
>>> --- libgcc/config/arm/ieee754-df.S   (revision 262850)
>>> +++ libgcc/config/arm/ieee754-df.S   (working copy)
>>> @@ -203,6 +203,7 @@
>>>  #endif
>>>
>>>  @ Determine how to normalize the result.
>>> +    @ if result is denormal i.e (exp)=0,then don't normalise the result,

Use a standard sentence here, eg. like:

If exp is zero and the mantissa unnormalized, return a denormal.

Wilco


Re: RFC: Patch to implement Aarch64 SIMD ABI

2018-07-23 Thread Wilco Dijkstra
Steve Ellcey wrote:

> OK, I think I understand this a bit better now.  I think my main
> problem is with the  term 'writeback' which I am not used to seeing.
> But if I understand things correctly we are saving one or two registers
> and (possibly) updating the stack pointer using auto-increment/auto-
> decrement in one instruction and that the updating of SP is what you
> mean by 'writeback'.  Correct?

Correct. The term has been in use since the very first Arm CPUs, where
load/stores have a writeback bit to control whether the base register is 
updated.
Note that we don't limit the instructions to simple push/pops: SP is updated
by the frame size rather than by the transfer size.

Wilco


Re: [Patch-86512]: Subnormal float support in armv7(with -msoft-float) for intrinsics

2018-07-27 Thread Wilco Dijkstra
Hi Nicolas,

I think your patch doesn't quite work as expected:

@@ -238,9 +238,10 @@ LSYM(Lad_a):
movsip, ip, lsl #1
adcsxl, xl, xl
adc xh, xh, xh
-   tst xh, #0x0010
-   sub r4, r4, #1
-   bne LSYM(Lad_e)
+   subsr4, r4, #1
+   do_it   hs
+   tsths   xh, #0x0010
+   bhi LSYM(Lad_e)

If the exponent in r4 is zero, the carry bit will be clear, so we don't
execute the tsths and fall through (the denormal will be normalized and then
denormalized again, but that's so rare it doesn't matter really).

However if r4 is non-zero, the carry will be set, and the tsths will be
executed. This clears the carry and sets the Z flag based on bit 20. We will
now also always fall through rather than take the branch if bit 20 is
non-zero. This may still give the correct answer, however it would add
considerable extra overhead... I think using a cmp rather than tst would work.

Wilco

Re: [Patch-86512]: Subnormal float support in armv7(with -msoft-float) for intrinsics

2018-07-27 Thread Wilco Dijkstra
Nicolas Pitre wrote:

>> However if r4 is non-zero, the carry will be set, and the tsths will be 
>> executed. This
>> clears the carry and sets the Z flag based on bit 20.
>
> No, not at all. The carry is not affected. And that's the point of the 
> tst instruction here rather than a cmp: it sets the N and Z flags but 
> leaves C alone as there is no shifter involved.

No, the carry is always set by logical operations with a shifted immediate.
It is only unchanged if the immediate uses a zero rotate. So any shifted
immediate > 255 will set the carry. This is detailed in the Arm Architecture
Reference Manual, eg. see the pseudo code for A32ExpandImm_C in LSL
(immediate).

Wilco

[PATCH v2] AArch64: Improve immediate generation

2023-10-24 Thread Wilco Dijkstra
v2: Use check-function-bodies in tests

Further improve immediate generation by adding support for 2-instruction
MOV/EOR bitmask immediates.  This reduces the number of 3/4-instruction
immediates in SPECCPU2017 by ~2%.

Passes regress, OK for commit?

gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
Add support for immediates using MOV/EOR bitmask.

gcc/testsuite:
* gcc.target/aarch64/imm_choice_comparison.c: Change tests.
* gcc.target/aarch64/moveor_imm.c: Add new test.
* gcc.target/aarch64/pr106583.c: Change tests.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
578a253d6e0e133e19592553fc873b3e73f9f218..ed5be2b64c9a767d74e9d78415da964c669001aa
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -5748,6 +5748,26 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
generate,
}
  return 2;
}
+
+  /* Try 2 bitmask immediates which are xor'd together. */
+  for (i = 0; i < 64; i += 16)
+   {
+ val2 = (val >> i) & mask;
+ val2 |= val2 << 16;
+ val2 |= val2 << 32;
+ if (aarch64_bitmask_imm (val2) && aarch64_bitmask_imm (val ^ val2))
+   break;
+   }
+
+  if (i != 64)
+   {
+ if (generate)
+   {
+ emit_insn (gen_rtx_SET (dest, GEN_INT (val2)));
+ emit_insn (gen_xordi3 (dest, dest, GEN_INT (val ^ val2)));
+   }
+ return 2;
+   }
 }
 
   /* Try a bitmask plus 2 movk to generate the immediate in 3 instructions.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison.c 
b/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison.c
index 
ebc44d6dbc7287d907603d77d7b54496de177c4b..a1fc90ad73411ae8ed848fa321586afcb8d710aa
 100644
--- a/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison.c
+++ b/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison.c
@@ -1,32 +1,64 @@
 /* { dg-do compile } */
 /* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
 
 /* Go from four moves to two.  */
 
+/*
+** foo:
+** mov w[0-9]+, 2576980377
+** movkx[0-9]+, 0x, lsl 32
+** ...
+*/
+
 int
 foo (long long x)
 {
-  return x <= 0x1998;
+  return x <= 0x9998;
 }
 
+/*
+** GT:
+** mov w[0-9]+, -16777217
+** ...
+*/
+
 int
 GT (unsigned int x)
 {
   return x > 0xfefe;
 }
 
+/*
+** LE:
+** mov w[0-9]+, -16777217
+** ...
+*/
+
 int
 LE (unsigned int x)
 {
   return x <= 0xfefe;
 }
 
+/*
+** GE:
+** mov w[0-9]+, 4278190079
+** ...
+*/
+
 int
 GE (long long x)
 {
   return x >= 0xff00;
 }
 
+/*
+** LT:
+** mov w[0-9]+, -16777217
+** ...
+*/
+
 int
 LT (int x)
 {
@@ -35,6 +67,13 @@ LT (int x)
 
 /* Optimize the immediate in conditionals.  */
 
+/*
+** check:
+** ...
+** mov w[0-9]+, -16777217
+** ...
+*/
+
 int
 check (int x, int y)
 {
@@ -44,11 +83,15 @@ check (int x, int y)
   return x;
 }
 
+/*
+** tern:
+** ...
+** mov w[0-9]+, -16777217
+** ...
+*/
+
 int
 tern (int x)
 {
   return x >= 0xff00 ? 5 : -3;
 }
-
-/* baz produces one movk instruction.  */
-/* { dg-final { scan-assembler-times "movk" 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/moveor_imm.c 
b/gcc/testsuite/gcc.target/aarch64/moveor_imm.c
new file mode 100644
index 
..1c0c3f3bf8c588f9661112a8b3f9a72c5ddff95c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/moveor_imm.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+/*
+** f1:
+**  movx0, -6148914691236517206
+** eor x0, x0, -9223372036854775807
+** ret
+*/
+
+long f1 (void)
+{
+  return 0x2aab;
+}
+
+/*
+** f2:
+** mov x0, -1085102592571150096
+** eor x0, x0, -2305843009213693951
+** ret
+*/
+
+long f2 (void)
+{
+  return 0x10f0f0f0f0f0f0f1;
+}
+
+/*
+** f3:
+** mov x0, -3689348814741910324
+** eor x0, x0, -4611686018427387903
+** ret
+*/
+
+long f3 (void)
+{
+  return 0xccd;
+}
+
+/*
+** f4:
+** mov x0, -7378697629483820647
+** eor x0, x0, -9223372036854775807
+** ret
+*/
+
+long f4 (void)
+{
+  return 0x1998;
+}
+
+/*
+** f5:
+** mov x0, 3689348814741910323
+** eor x0, x0, 864691128656461824
+** ret
+*/
+
+long f5 (void)
+{
+  return 0x3f333f33;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106583.c 
b/gcc/testsuite/gcc.target/aarch64/pr106583.c
index 
0f931580817d78dc1cc58f03b251bd21bec71f59..63df7395edf9491720e3601848e15aa773c51e6d
 100644
--- a/gcc/testsuite/gcc.target/aarch64/pr106583.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr106583.c
@@ -1,41 +1,94 @@
-/* { dg-do assemble } */
-/* { dg-options "-O2 --save-temps" } */
+/* { dg-do compile } */
+/* { dg-options "-O2" } *

Re: [PATCH v2] AArch64: Fix strict-align cpymem/setmem [PR103100]

2023-11-06 Thread Wilco Dijkstra

ping
 
v2: Use UINTVAL, rename max_mops_size.

The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT.
Clean up the condition when to use MOPS.
    
Passes regress/bootstrap, OK for commit?
    
gcc/ChangeLog/
    PR target/103100
    * config/aarch64/aarch64.md (cpymemdi): Remove pattern condition.
    (setmemdi): Likewise.
    * config/aarch64/aarch64.cc (aarch64_expand_cpymem): Support
    strict-align.  Cleanup condition for using MOPS.
    (aarch64_expand_setmem): Likewise.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
dd6874d13a75f20d10a244578afc355b25c73da2..8a12894d6b80de1031d6e7d02dca680c57bce136
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25261,27 +25261,23 @@ aarch64_expand_cpymem (rtx *operands)
   int mode_bits;
   rtx dst = operands[0];
   rtx src = operands[1];
+  unsigned align = UINTVAL (operands[3]);
   rtx base;
   machine_mode cur_mode = BLKmode;
+  bool size_p = optimize_function_for_size_p (cfun);
 
-  /* Variable-sized memcpy can go through the MOPS expansion if available.  */
-  if (!CONST_INT_P (operands[2]))
+  /* Variable-sized or strict-align copies may use the MOPS expansion.  */
+  if (!CONST_INT_P (operands[2]) || (STRICT_ALIGNMENT && align < 16))
 return aarch64_expand_cpymem_mops (operands);
 
-  unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
-
-  /* Try to inline up to 256 bytes or use the MOPS threshold if available.  */
-  unsigned HOST_WIDE_INT max_copy_size
-    = TARGET_MOPS ? aarch64_mops_memcpy_size_threshold : 256;
+  unsigned HOST_WIDE_INT size = UINTVAL (operands[2]);
 
-  bool size_p = optimize_function_for_size_p (cfun);
+  /* Try to inline up to 256 bytes.  */
+  unsigned max_copy_size = 256;
+  unsigned mops_threshold = aarch64_mops_memcpy_size_threshold;
 
-  /* Large constant-sized cpymem should go through MOPS when possible.
- It should be a win even for size optimization in the general case.
- For speed optimization the choice between MOPS and the SIMD sequence
- depends on the size of the copy, rather than number of instructions,
- alignment etc.  */
-  if (size > max_copy_size)
+  /* Large copies use MOPS when available or a library call.  */
+  if (size > max_copy_size || (TARGET_MOPS && size > mops_threshold))
 return aarch64_expand_cpymem_mops (operands);
 
   int copy_bits = 256;
@@ -25445,12 +25441,13 @@ aarch64_expand_setmem (rtx *operands)
   unsigned HOST_WIDE_INT len;
   rtx dst = operands[0];
   rtx val = operands[2], src;
+  unsigned align = UINTVAL (operands[3]);
   rtx base;
   machine_mode cur_mode = BLKmode, next_mode;
 
-  /* If we don't have SIMD registers or the size is variable use the MOPS
- inlined sequence if possible.  */
-  if (!CONST_INT_P (operands[1]) || !TARGET_SIMD)
+  /* Variable-sized or strict-align memset may use the MOPS expansion.  */
+  if (!CONST_INT_P (operands[1]) || !TARGET_SIMD
+  || (STRICT_ALIGNMENT && align < 16))
 return aarch64_expand_setmem_mops (operands);
 
   bool size_p = optimize_function_for_size_p (cfun);
@@ -25458,10 +25455,13 @@ aarch64_expand_setmem (rtx *operands)
   /* Default the maximum to 256-bytes when considering only libcall vs
  SIMD broadcast sequence.  */
   unsigned max_set_size = 256;
+  unsigned mops_threshold = aarch64_mops_memset_size_threshold;
 
-  len = INTVAL (operands[1]);
-  if (len > max_set_size && !TARGET_MOPS)
-    return false;
+  len = UINTVAL (operands[1]);
+
+  /* Large memset uses MOPS when available or a library call.  */
+  if (len > max_set_size || (TARGET_MOPS && len > mops_threshold))
+    return aarch64_expand_setmem_mops (operands);
 
   int cst_val = !!(CONST_INT_P (val) && (INTVAL (val) != 0));
   /* The MOPS sequence takes:
@@ -25474,12 +25474,6 @@ aarch64_expand_setmem (rtx *operands)
  the arguments + 1 for the call.  */
   unsigned libcall_cost = 4;
 
-  /* Upper bound check.  For large constant-sized setmem use the MOPS sequence
- when available.  */
-  if (TARGET_MOPS
-  && len >= (unsigned HOST_WIDE_INT) aarch64_mops_memset_size_threshold)
-    return aarch64_expand_setmem_mops (operands);
-
   /* Attempt a sequence with a vector broadcast followed by stores.
  Count the number of operations involved to see if it's worth it
  against the alternatives.  A simple counter simd_ops on the
@@ -25521,10 +25515,8 @@ aarch64_expand_setmem (rtx *operands)
   simd_ops++;
   n -= mode_bits;
 
-  /* Do certain trailing copies as overlapping if it's going to be
-    cheaper.  i.e. less instructions to do so.  For instance doing a 15
-    byte copy it's more efficient to do two overlapping 8 byte copies than
-    8 + 4 + 2 + 1.  Only do this when -mstrict-align is not supplied.  */
+  /* Emit trailing writes using overlapping unaligned accesses
+   (when !STRICT_ALIGNME

Re: [PATCH v2] AArch64: Add inline memmove expansion

2023-11-06 Thread Wilco Dijkstra
ping
 
v2: further cleanups, improved comments

Add support for inline memmove expansions.  The generated code is identical
as for memcpy, except that all loads are emitted before stores rather than
being interleaved.  The maximum size is 256 bytes which requires at most 16
registers.

Passes regress/bootstrap, OK for commit?
    
gcc/ChangeLog/
    * config/aarch64/aarch64.opt (aarch64_mops_memmove_size_threshold):
    Change default.
    * config/aarch64/aarch64.md (cpymemdi): Add a parameter.
    (movmemdi): Call aarch64_expand_cpymem.
    * config/aarch64/aarch64.cc (aarch64_copy_one_block): Rename function,
    simplify, support storing generated loads/stores. 
    (aarch64_expand_cpymem): Support expansion of memmove.
    * config/aarch64/aarch64-protos.h (aarch64_expand_cpymem): Add bool arg.

gcc/testsuite/ChangeLog/
    * gcc.target/aarch64/memmove.c: Add new test.

---

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
60a55f4bc1956786ea687fc7cad7ec9e4a84e1f0..0d39622bd2826a3fde54d67b5c5da9ee9286cbbd
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -769,7 +769,7 @@ bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 tree aarch64_vector_load_decl (tree);
 void aarch64_expand_call (rtx, rtx, rtx, bool);
 bool aarch64_expand_cpymem_mops (rtx *, bool);
-bool aarch64_expand_cpymem (rtx *);
+bool aarch64_expand_cpymem (rtx *, bool);
 bool aarch64_expand_setmem (rtx *);
 bool aarch64_float_const_zero_rtx_p (rtx);
 bool aarch64_float_const_rtx_p (rtx);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
2fa5d09de85d385c1165e399bcc97681ef170916..e19e2d1de2e5b30eca672df05d9dcc1bc106ecc8
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25238,52 +25238,37 @@ aarch64_progress_pointer (rtx pointer)
   return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
 }
 
-/* Copy one MODE sized block from SRC to DST, then progress SRC and DST by
-   MODE bytes.  */
+/* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
 
 static void
-aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
- machine_mode mode)
+aarch64_copy_one_block (rtx *load, rtx *store, rtx src, rtx dst,
+   int offset, machine_mode mode)
 {
-  /* Handle 256-bit memcpy separately.  We do this by making 2 adjacent memory
- address copies using V4SImode so that we can use Q registers.  */
-  if (known_eq (GET_MODE_BITSIZE (mode), 256))
+  /* Emit explict load/store pair instructions for 32-byte copies.  */
+  if (known_eq (GET_MODE_SIZE (mode), 32))
 {
   mode = V4SImode;
+  rtx src1 = adjust_address (src, mode, offset);
+  rtx src2 = adjust_address (src, mode, offset + 16);
+  rtx dst1 = adjust_address (dst, mode, offset);
+  rtx dst2 = adjust_address (dst, mode, offset + 16);
   rtx reg1 = gen_reg_rtx (mode);
   rtx reg2 = gen_reg_rtx (mode);
-  /* "Cast" the pointers to the correct mode.  */
-  *src = adjust_address (*src, mode, 0);
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memcpy.  */
-  emit_insn (aarch64_gen_load_pair (mode, reg1, *src, reg2,
-   aarch64_progress_pointer (*src)));
-  emit_insn (aarch64_gen_store_pair (mode, *dst, reg1,
-    aarch64_progress_pointer (*dst), 
reg2));
-  /* Move the pointers forward.  */
-  *src = aarch64_move_pointer (*src, 32);
-  *dst = aarch64_move_pointer (*dst, 32);
+  *load = aarch64_gen_load_pair (mode, reg1, src1, reg2, src2);
+  *store = aarch64_gen_store_pair (mode, dst1, reg1, dst2, reg2);
   return;
 }
 
   rtx reg = gen_reg_rtx (mode);
-
-  /* "Cast" the pointers to the correct mode.  */
-  *src = adjust_address (*src, mode, 0);
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memcpy.  */
-  emit_move_insn (reg, *src);
-  emit_move_insn (*dst, reg);
-  /* Move the pointers forward.  */
-  *src = aarch64_progress_pointer (*src);
-  *dst = aarch64_progress_pointer (*dst);
+  *load = gen_move_insn (reg, adjust_address (src, mode, offset));
+  *store = gen_move_insn (adjust_address (dst, mode, offset), reg);
 }
 
 /* Expand a cpymem/movmem using the MOPS extension.  OPERANDS are taken
    from the cpymem/movmem pattern.  IS_MEMMOVE is true if this is a memmove
    rather than memcpy.  Return true iff we succeeded.  */
 bool
-aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove = false)
+aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove)
 {
   if (!TARGET_MOPS)
 return false;
@@ -25302,51 +25287,48 @@ aarch64_expand_cpymem_mops (rtx *operands, bool 
is_memmove = false)
   return true;
 }
 
-/* Expand cpymem, as if from a __builtin_memcpy.  Return true if
-   we succeed, otherwise return false, indicating that a libca

Re: [PATCH] AArch64: Cleanup memset expansion

2023-11-06 Thread Wilco Dijkstra
ping
 
Cleanup memset implementation.  Similar to memcpy/memmove, use an offset and
bytes throughout.  Simplify the complex calculations when optimizing for size
by using a fixed limit.

Passes regress/bootstrap, OK for commit?
    
gcc/ChangeLog:
    * config/aarch64/aarch64.cc (aarch64_progress_pointer): Remove function.
    (aarch64_set_one_block_and_progress_pointer): Simplify and clean up.
    (aarch64_expand_setmem): Clean up implementation, use byte offsets,
    simplify size calculation.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
e19e2d1de2e5b30eca672df05d9dcc1bc106ecc8..578a253d6e0e133e19592553fc873b3e73f9f218
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25229,15 +25229,6 @@ aarch64_move_pointer (rtx pointer, poly_int64 amount)
 next, amount);
 }
 
-/* Return a new RTX holding the result of moving POINTER forward by the
-   size of the mode it points to.  */
-
-static rtx
-aarch64_progress_pointer (rtx pointer)
-{
-  return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
-}
-
 /* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
 
 static void
@@ -25393,46 +25384,22 @@ aarch64_expand_cpymem (rtx *operands, bool is_memmove)
   return true;
 }
 
-/* Like aarch64_copy_one_block_and_progress_pointers, except for memset where
-   SRC is a register we have created with the duplicated value to be set.  */
+/* Set one block of size MODE at DST at offset OFFSET to value in SRC.  */
 static void
-aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
-   machine_mode mode)
-{
-  /* If we are copying 128bits or 256bits, we can do that straight from
- the SIMD register we prepared.  */
-  if (known_eq (GET_MODE_BITSIZE (mode), 256))
-    {
-  mode = GET_MODE (src);
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memset.  */
-  emit_insn (aarch64_gen_store_pair (mode, *dst, src,
-    aarch64_progress_pointer (*dst), src));
-
-  /* Move the pointers forward.  */
-  *dst = aarch64_move_pointer (*dst, 32);
-  return;
-    }
-  if (known_eq (GET_MODE_BITSIZE (mode), 128))
+aarch64_set_one_block (rtx src, rtx dst, int offset, machine_mode mode)
+{
+  /* Emit explict store pair instructions for 32-byte writes.  */
+  if (known_eq (GET_MODE_SIZE (mode), 32))
 {
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, GET_MODE (src), 0);
-  /* Emit the memset.  */
-  emit_move_insn (*dst, src);
-  /* Move the pointers forward.  */
-  *dst = aarch64_move_pointer (*dst, 16);
+  mode = V16QImode;
+  rtx dst1 = adjust_address (dst, mode, offset);
+  rtx dst2 = adjust_address (dst, mode, offset + 16);
+  emit_insn (aarch64_gen_store_pair (mode, dst1, src, dst2, src));
   return;
 }
-  /* For copying less, we have to extract the right amount from src.  */
-  rtx reg = lowpart_subreg (mode, src, GET_MODE (src));
-
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memset.  */
-  emit_move_insn (*dst, reg);
-  /* Move the pointer forward.  */
-  *dst = aarch64_progress_pointer (*dst);
+  if (known_lt (GET_MODE_SIZE (mode), 16))
+    src = lowpart_subreg (mode, src, GET_MODE (src));
+  emit_move_insn (adjust_address (dst, mode, offset), src);
 }
 
 /* Expand a setmem using the MOPS instructions.  OPERANDS are the same
@@ -25461,7 +25428,7 @@ aarch64_expand_setmem_mops (rtx *operands)
 bool
 aarch64_expand_setmem (rtx *operands)
 {
-  int n, mode_bits;
+  int mode_bytes;
   unsigned HOST_WIDE_INT len;
   rtx dst = operands[0];
   rtx val = operands[2], src;
@@ -25474,104 +25441,70 @@ aarch64_expand_setmem (rtx *operands)
   || (STRICT_ALIGNMENT && align < 16))
 return aarch64_expand_setmem_mops (operands);
 
-  bool size_p = optimize_function_for_size_p (cfun);
-
   /* Default the maximum to 256-bytes when considering only libcall vs
  SIMD broadcast sequence.  */
   unsigned max_set_size = 256;
   unsigned mops_threshold = aarch64_mops_memset_size_threshold;
 
+  /* Reduce the maximum size with -Os.  */
+  if (optimize_function_for_size_p (cfun))
+    max_set_size = 96;
+
   len = UINTVAL (operands[1]);
 
   /* Large memset uses MOPS when available or a library call.  */
   if (len > max_set_size || (TARGET_MOPS && len > mops_threshold))
 return aarch64_expand_setmem_mops (operands);
 
-  int cst_val = !!(CONST_INT_P (val) && (INTVAL (val) != 0));
-  /* The MOPS sequence takes:
- 3 instructions for the memory storing
- + 1 to move the constant size into a reg
- + 1 if VAL is a non-zero constant to move into a reg
-    (zero constants can use XZR directly).  */
-  unsigned mops_cost = 3 + 1 + cst_val;
-  /* A libcall to memset in the 

Re: [PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-11-06 Thread Wilco Dijkstra

 
ping
 

__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop.  On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails.  In this case the value
returned is not read atomically.
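
As a minimal sketch of the affected usage (hypothetical code, not from the
patch):

/* 16-byte compare-and-swap; requires a suitably aligned object.  */
__int128 shared;

__int128 read_old_value (__int128 expected, __int128 desired)
{
  /* Even when the comparison fails, the old value returned here must be a
     single atomic 128-bit read, which is what the store-and-retry added on
     the failure path ensures.  */
  return __sync_val_compare_and_swap (&shared, expected, desired);
}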

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog/
    PR target/111404
    * config/aarch64/aarch64.cc (aarch64_split_compare_and_swap):
    For 128-bit store the loaded value and loop if needed.

libgcc/ChangeLog/
    PR target/111404
    * config/aarch64/lse.S (__aarch64_cas16_acq_rel): Execute STLXP using
    either new value or loaded value.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
5e8d0a0c91bc7719de2a8c5627b354cf905a4db0..c44c0b979d0cc3755c61dcf566cfddedccebf1ea
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23413,11 +23413,11 @@ aarch64_split_compare_and_swap (rtx operands[])
   mem = operands[1];
   oldval = operands[2];
   newval = operands[3];
-  is_weak = (operands[4] != const0_rtx);
   model_rtx = operands[5];
   scratch = operands[7];
   mode = GET_MODE (mem);
   model = memmodel_from_int (INTVAL (model_rtx));
+  is_weak = operands[4] != const0_rtx && mode != TImode;
 
   /* When OLDVAL is zero and we want the strong version we can emit a tighter
 loop:
@@ -23478,6 +23478,33 @@ aarch64_split_compare_and_swap (rtx operands[])
   else
 aarch64_gen_compare_reg (NE, scratch, const0_rtx);
 
+  /* 128-bit LDAXP is not atomic unless STLXP succeeds.  So for a mismatch,
+ store the returned value and loop if the STLXP fails.  */
+  if (mode == TImode)
+    {
+  rtx_code_label *label3 = gen_label_rtx ();
+  emit_jump_insn (gen_rtx_SET (pc_rtx, gen_rtx_LABEL_REF (Pmode, label3)));
+  emit_barrier ();
+
+  emit_label (label2);
+  aarch64_emit_store_exclusive (mode, scratch, mem, rval, model_rtx);
+
+  if (aarch64_track_speculation)
+   {
+ /* Emit an explicit compare instruction, so that we can correctly
+    track the condition codes.  */
+ rtx cc_reg = aarch64_gen_compare_reg (NE, scratch, const0_rtx);
+ x = gen_rtx_NE (GET_MODE (cc_reg), cc_reg, const0_rtx);
+   }
+  else
+   x = gen_rtx_NE (VOIDmode, scratch, const0_rtx);
+  x = gen_rtx_IF_THEN_ELSE (VOIDmode, x,
+   gen_rtx_LABEL_REF (Pmode, label1), pc_rtx);
+  aarch64_emit_unlikely_jump (gen_rtx_SET (pc_rtx, x));
+
+  label2 = label3;
+    }
+
   emit_label (label2);
 
   /* If we used a CBNZ in the exchange loop emit an explicit compare with RVAL
diff --git a/libgcc/config/aarch64/lse.S b/libgcc/config/aarch64/lse.S
index 
dde3a28e07b13669533dfc5e8fac0a9a6ac33dbd..ba05047ff02b6fc5752235bffa924fc4a2f48c04
 100644
--- a/libgcc/config/aarch64/lse.S
+++ b/libgcc/config/aarch64/lse.S
@@ -160,6 +160,8 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
If not, see
 #define tmp0    16
 #define tmp1    17
 #define tmp2    15
+#define tmp3   14
+#define tmp4   13
 
 #define BTI_C   hint    34
 
@@ -233,10 +235,11 @@ STARTFN   NAME(cas)
 0:  LDXP    x0, x1, [x4]
 cmp x0, x(tmp0)
 ccmp    x1, x(tmp1), #0, eq
-   bne 1f
-   STXP    w(tmp2), x2, x3, [x4]
-   cbnz    w(tmp2), 0b
-1: BARRIER
+   csel    x(tmp2), x2, x0, eq
+   csel    x(tmp3), x3, x1, eq
+   STXP    w(tmp4), x(tmp2), x(tmp3), [x4]
+   cbnz    w(tmp4), 0b
+   BARRIER
 ret
 
 #endif

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-11-06 Thread Wilco Dijkstra
 

ping


From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches ; Richard Sandiford 

Cc: Kyrylo Tkachov 
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64 
 

Add support for ifunc selection based on CPUID register.  Neoverse N1 supports
atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse
cores.

Passes regress, OK for commit?

libatomic/
    config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc
    selection.

---

diff --git a/libatomic/config/linux/aarch64/host-config.h 
b/libatomic/config/linux/aarch64/host-config.h
index 
851c78c01cd643318aaa52929ce4550266238b79..e5dc33c030a4bab927874fa6c69425db463fdc4b
 100644
--- a/libatomic/config/linux/aarch64/host-config.h
+++ b/libatomic/config/linux/aarch64/host-config.h
@@ -26,7 +26,7 @@
 
 #ifdef HWCAP_USCAT
 # if N == 16
-#  define IFUNC_COND_1 (hwcap & HWCAP_USCAT)
+#  define IFUNC_COND_1 ifunc1 (hwcap)
 # else
 #  define IFUNC_COND_1  (hwcap & HWCAP_ATOMICS)
 # endif
@@ -50,4 +50,28 @@
 #undef MAYBE_HAVE_ATOMIC_EXCHANGE_16
 #define MAYBE_HAVE_ATOMIC_EXCHANGE_16   1
 
+#ifdef HWCAP_USCAT
+
+#define MIDR_IMPLEMENTOR(midr) (((midr) >> 24) & 255)
+#define MIDR_PARTNUM(midr) (((midr) >> 4) & 0xfff)
+
+static inline bool
+ifunc1 (unsigned long hwcap)
+{
+  if (hwcap & HWCAP_USCAT)
+    return true;
+  if (!(hwcap & HWCAP_CPUID))
+    return false;
+
+  unsigned long midr;
+  asm volatile ("mrs %0, midr_el1" : "=r" (midr));
+
+  /* Neoverse N1 supports atomic 128-bit load/store.  */
+  if (MIDR_IMPLEMENTOR (midr) == 'A' && MIDR_PARTNUM(midr) == 0xd0c)
+    return true;
+
+  return false;
+}
+#endif
+
 #include_next 

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-11-06 Thread Wilco Dijkstra


ping

From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 
[PR110061] 
 

Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with
existing binaries, gives better performance than locking atomics and is what
most users expect.

Note 128-bit atomic loads use a load/store exclusive loop if LSE2 is not
supported.  This results in an implicit store which is invisible to software
as long as the given address is writeable (which will be true when using
atomics in actual code).

A simple test on an old Cortex-A72 showed 2.7x speedup of 128-bit atomics.
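
For reference, a hypothetical user-level example (not part of the patch) of
the operations this makes lock-free:

#include <stdatomic.h>
#include <stdbool.h>

_Atomic __int128 counter;

__int128 read_counter (void)
{
  /* 16-byte accesses go through libatomic; with this patch they use the
     LDXP/STXP (or LSE2) entry points added below instead of a lock.  */
  return atomic_load (&counter);
}

bool try_update (__int128 expected, __int128 desired)
{
  return atomic_compare_exchange_strong (&counter, &expected, desired);
}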

Passes regress, OK for commit?

libatomic/
    PR target/110061
    config/linux/aarch64/atomic_16.S: Implement lock-free ARMv8.0 atomics.
    config/linux/aarch64/host-config.h: Use atomic_16.S for baseline v8.0.
    State we have lock-free atomics.

---

diff --git a/libatomic/config/linux/aarch64/atomic_16.S 
b/libatomic/config/linux/aarch64/atomic_16.S
index 
05439ce394b9653c9bcb582761ff7aaa7c8f9643..0485c284117edf54f41959d2fab9341a9567b1cf
 100644
--- a/libatomic/config/linux/aarch64/atomic_16.S
+++ b/libatomic/config/linux/aarch64/atomic_16.S
@@ -22,6 +22,21 @@
    <http://www.gnu.org/licenses/>.  */
 
 
+/* AArch64 128-bit lock-free atomic implementation.
+
+   128-bit atomics are now lock-free for all AArch64 architecture versions.
+   This is backwards compatible with existing binaries and gives better
+   performance than locking atomics.
+
+   128-bit atomic loads use a exclusive loop if LSE2 is not supported.
+   This results in an implicit store which is invisible to software as long
+   as the given address is writeable.  Since all other atomics have explicit
+   writes, this will be true when using atomics in actual code.
+
+   The libat__16 entry points are ARMv8.0.
+   The libat__16_i1 entry points are used when LSE2 is available.  */
+
+
 .arch   armv8-a+lse
 
 #define ENTRY(name) \
@@ -37,6 +52,10 @@ name:    \
 .cfi_endproc;   \
 .size name, .-name;
 
+#define ALIAS(alias,name)  \
+   .global alias;  \
+   .set alias, name;
+
 #define res0 x0
 #define res1 x1
 #define in0  x2
@@ -70,6 +89,24 @@ name:    \
 #define SEQ_CST 5
 
 
+ENTRY (libat_load_16)
+   mov x5, x0
+   cbnz    w1, 2f
+
+   /* RELAXED.  */
+1: ldxp    res0, res1, [x5]
+   stxp    w4, res0, res1, [x5]
+   cbnz    w4, 1b
+   ret
+
+   /* ACQUIRE/CONSUME/SEQ_CST.  */
+2: ldaxp   res0, res1, [x5]
+   stxp    w4, res0, res1, [x5]
+   cbnz    w4, 2b
+   ret
+END (libat_load_16)
+
+
 ENTRY (libat_load_16_i1)
 cbnz    w1, 1f
 
@@ -93,6 +130,23 @@ ENTRY (libat_load_16_i1)
 END (libat_load_16_i1)
 
 
+ENTRY (libat_store_16)
+   cbnz    w4, 2f
+
+   /* RELAXED.  */
+1: ldxp    xzr, tmp0, [x0]
+   stxp    w4, in0, in1, [x0]
+   cbnz    w4, 1b
+   ret
+
+   /* RELEASE/SEQ_CST.  */
+2: ldxp    xzr, tmp0, [x0]
+   stlxp   w4, in0, in1, [x0]
+   cbnz    w4, 2b
+   ret
+END (libat_store_16)
+
+
 ENTRY (libat_store_16_i1)
 cbnz    w4, 1f
 
@@ -101,14 +155,14 @@ ENTRY (libat_store_16_i1)
 ret
 
 /* RELEASE/SEQ_CST.  */
-1: ldaxp   xzr, tmp0, [x0]
+1: ldxp    xzr, tmp0, [x0]
 stlxp   w4, in0, in1, [x0]
 cbnz    w4, 1b
 ret
 END (libat_store_16_i1)
 
 
-ENTRY (libat_exchange_16_i1)
+ENTRY (libat_exchange_16)
 mov x5, x0
 cbnz    w4, 2f
 
@@ -126,22 +180,55 @@ ENTRY (libat_exchange_16_i1)
 stxp    w4, in0, in1, [x5]
 cbnz    w4, 3b
 ret
-4:
-   cmp w4, RELEASE
-   b.ne    6f
 
-   /* RELEASE.  */
-5: ldxp    res0, res1, [x5]
+   /* RELEASE/ACQ_REL/SEQ_CST.  */
+4: ldaxp   res0, res1, [x5]
 stlxp   w4, in0, in1, [x5]
-   cbnz    w4, 5b
+   cbnz    w4, 4b
 ret
+END (libat_exchange_16)
 
-   /* ACQ_REL/SEQ_CST.  */
-6: ldaxp   res0, res1, [x5]
-   stlxp   w4, in0, in1, [x5]
-   cbnz    w4, 6b
+
+ENTRY (libat_compare_exchange_16)
+   ldp exp0, exp1, [x1]
+   cbz w4, 3f
+   cmp w4, RELEASE
+   b.hs    4f
+
+   /* ACQUIRE/CONSUME.  */
+1: ldaxp   tmp0, tmp1, [x0]
+   cmp tmp0, exp0
+   ccmp    tmp1, exp1, 0, eq
+   bne 2f
+   stxp    w4, in0, in1, [x0]
+   cbnz    w4, 1b
+   mov x0, 1
 ret
-END (libat_exchange_16_i1)
+
+2: stp tmp0, tmp1, [x1]
+   mov x0, 0
+   ret
+
+   /* RELAXED.  */
+3: ldxp    tmp0, tmp1, [x0]
+   cmp tmp0, exp0
+   ccmp    tmp1, exp1, 0, eq
+   bne 2b
+   stxp    w4, in0, in1, [x0]
+   cbnz    w4, 3b
+   mov x0, 1
+   ret
+
+   /* RELEASE/ACQ_REL/SEQ_CST.  */
+4: ldaxp   tmp0

[PATCH] AArch64: Enable fast shifts on Neoverse N1

2020-09-14 Thread Wilco Dijkstra
Enable the fast shift feature in Neoverse N1 tuning - this means additions with
a shift left by 1-4 are as fast as addition. This improves multiply by constant
expansions, eg. x * 25 is now emitted using shifts rather than a multiply:

add w0, w0, w0, lsl 2
add w0, w0, w0, lsl 2
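
For reference, the kind of source that produces the sequence above would be a
hypothetical test case like:

int mul25 (int x)
{
  /* x * 25 = (x * 5) * 5; each multiply-by-5 becomes one add with lsl #2.  */
  return x * 25;
}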

Bootstrap OK, regress pass, OK for commit?

ChangeLog:
2020-09-11  Wilco Dijkstra  

* config/aarch64/aarch64.c (neoversen1_tunings):
Enable AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
803562df25751f2eb6dbe18b67cae46ea7c478dd..cbbdefa436bf11e9631c90631fb621e90e60754a
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1332,7 +1332,7 @@ static const struct tune_params neoversen1_tunings =
   2,   /* min_div_recip_mul_df.  */
   0,   /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
   &generic_prefetch_tune
 };
 


[PATCH 1/2] AArch64: Cleanup CPU option processing code

2020-09-14 Thread Wilco Dijkstra
The --with-cpu/--with-arch configure option processing not only checks valid
arguments but also sets TARGET_CPU_DEFAULT with a CPU and extension bitmask.
This isn't used however since a --with-cpu is translated into a -mcpu option
which is processed as if written on the command-line (so TARGET_CPU_DEFAULT
is never accessed).

So remove all the complex processing and bitmask, and just validate the option.
Fix a bug that always reports valid architecture extensions as invalid.  As a
result the CPU processing in aarch64.c can be simplified.

Bootstrap OK, regress pass, OK for commit?

ChangeLog:
2020-09-03  Wilco Dijkstra  

* config.gcc (aarch64*-*-*): Simplify --with-cpu and --with-arch
processing.  Add support for architectural extensions.
* config/aarch64/aarch64.h (TARGET_CPU_DEFAULT): Remove
AARCH64_CPU_DEFAULT_FLAGS.
* config/aarch64/aarch64.c (AARCH64_CPU_DEFAULT_FLAGS): Remove define.
(get_tune_cpu): Assert CPU is always valid.
(get_arch): Assert architecture is always valid.
(aarch64_override_options): Cleanup CPU selection code and simplify 
logic.

---

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 
88e428fd2ad0fb605a53f5dfa58d9f14603b4302..918320573ade712ddc252045e0b70fb8b65e0c66
 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -4067,8 +4067,6 @@ case "${target}" in
  pattern=AARCH64_CORE
fi
 
-   ext_mask=AARCH64_CPU_DEFAULT_FLAGS
-
# Find the base CPU or ARCH id in aarch64-cores.def or
# aarch64-arches.def
if [ x"$base_val" = x ] \
@@ -4076,23 +4074,6 @@ case "${target}" in
${srcdir}/config/aarch64/$def \
> /dev/null; then
 
- if [ $which = arch ]; then
-   base_id=`grep "^$pattern(\"$base_val\"," \
- ${srcdir}/config/aarch64/$def | \
- sed -e 's/^[^,]*,[]*//' | \
- sed -e 's/,.*$//'`
-   # Extract the architecture flags from 
aarch64-arches.def
-   ext_mask=`grep "^$pattern(\"$base_val\"," \
-  ${srcdir}/config/aarch64/$def | \
-  sed -e 's/)$//' | \
-  sed -e 's/^.*,//'`
- else
-   base_id=`grep "^$pattern(\"$base_val\"," \
- ${srcdir}/config/aarch64/$def | \
- sed -e 's/^[^,]*,[]*//' | \
- sed -e 's/,.*$//'`
- fi
-
  # Use the pre-processor to strip flatten the options.
  # This makes the format less rigid than if we use
  # grep and sed directly here.
@@ -4117,25 +4098,7 @@ case "${target}" in
grep "^\"$base_ext\""`
 
if [ x"$base_ext" = x ] \
-   || [[ -n $opt_line ]]; then
-
- # These regexp extract the elements based on
- # their group match index in the regexp.
- ext_canon=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\2/"`
- ext_on=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\3/"`
- ext_off=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\4/"`
-
- if [ $ext = $base_ext ]; then
-   # Adding extension
-   ext_mask="("$ext_mask") | ("$ext_on" | 
"$ext_canon")"
- else
-   # Removing extension
-   ext_mask="("$ext_mask") & ~("$ext_off" 
| "$ext_canon")"
- fi
-
+   || [ x"$opt_line" != x ]; then
  true
else
  echo "Unknown extension used in 
--with-$which=$val" 1>&2
@@ -4144,10 +4107,6 @@ case "${target}" in
   

[PATCH 2/2] AArch64: Add support for --with-tune

2020-09-14 Thread Wilco Dijkstra
Add support for --with-tune. Like --with-cpu and --with-arch, the argument is
validated and transformed into a -mtune option to be processed like any other
command-line option.  --with-tune has no effect if a -mcpu or -mtune option
is used. The validating code didn't allow --with-cpu=native, so explicitly
allow that.

Co-authored-by:  Delia Burduv  

Bootstrap OK, regress pass, OK to commit?

ChangeLog
2020-09-03  Wilco Dijkstra  

* config.gcc
(aarch64*-*-*): Add --with-tune. Support --with-cpu=native.
* config/aarch64/aarch64.h (OPTION_DEFAULT_SPECS): Add --with-tune.

2020-09-03  Wilco Dijkstra  

* gcc/testsuite/lib/target-supports.exp:
(check_effective_target_tune_cortex_a76): New effective target test.
* gcc.target/aarch64/with-tune-config.c: New test.
* gcc.target/aarch64/with-tune-march.c: Likewise.
* gcc.target/aarch64/with-tune-mcpu.c: Likewise.
* gcc.target/aarch64/with-tune-mtune.c: Likewise.

---

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 
918320573ade712ddc252045e0b70fb8b65e0c66..947cebb0960d403a9ce40b65bef948fbbe8916a9
 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -4052,8 +4052,8 @@ fi
 supported_defaults=
 case "${target}" in
aarch64*-*-*)
-   supported_defaults="abi cpu arch"
-   for which in cpu arch; do
+   supported_defaults="abi cpu arch tune"
+   for which in cpu arch tune; do
 
eval "val=\$with_$which"
base_val=`echo $val | sed -e 's/\+.*//'`
@@ -4074,6 +4074,12 @@ case "${target}" in
${srcdir}/config/aarch64/$def \
> /dev/null; then
 
+ # Disallow extensions in --with-tune=cortex-a53+crc.
+ if [ $which = tune ] && [ x"$ext_val" != x ]; then
+   echo "Architecture extensions not supported in 
--with-$which=$val" 1>&2
+   exit 1
+ fi
+
  # Use the pre-processor to strip flatten the options.
  # This makes the format less rigid than if we use
  # grep and sed directly here.
@@ -4109,8 +4115,13 @@ case "${target}" in
 
  true
else
- echo "Unknown $which used in --with-$which=$val" 1>&2
- exit 1
+ # Allow --with-$which=native.
+ if [ "$val" = native ]; then
+   true
+ else
+   echo "Unknown $which used in --with-$which=$val" 
1>&2
+   exit 1
+ fi
fi
done
;;
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 
30b3a28a6d2893cc29bec4aa5b7cfbe1fd51e0b7..09d1f7f8a84e726129f8ec7f59dacae7fe132eaa
 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -1200,12 +1200,14 @@ extern enum aarch64_code_model aarch64_cmodel;
 #define ENDIAN_LANE_N(NUNITS, N) \
   (BYTES_BIG_ENDIAN ? NUNITS - 1 - N : N)
 
-/* Support for a configure-time default CPU, etc.  We currently support
-   --with-arch and --with-cpu.  Both are ignored if either is specified
-   explicitly on the command line at run time.  */
+/* Support for configure-time --with-arch, --with-cpu and --with-tune.
+   --with-arch and --with-cpu are ignored if either -mcpu or -march is used.
+   --with-tune is ignored if either -mtune or -mcpu is used (but is not
+   affected by -march).  */
 #define OPTION_DEFAULT_SPECS   \
   {"arch", "%{!march=*:%{!mcpu=*:-march=%(VALUE)}}" }, \
-  {"cpu",  "%{!march=*:%{!mcpu=*:-mcpu=%(VALUE)}}" },
+  {"cpu",  "%{!march=*:%{!mcpu=*:-mcpu=%(VALUE)}}" },   \
+  {"tune", "%{!mcpu=*:%{!mtune=*:-mtune=%(VALUE)}}"},
 
 #define MCPU_TO_MARCH_SPEC \
" %{mcpu=*:-march=%:rewrite_mcpu(%{mcpu=*:%*})}"
diff --git a/gcc/testsuite/gcc.target/aarch64/with-tune-config.c 
b/gcc/testsuite/gcc.target/aarch64/with-tune-config.c
new file mode 100644
index 
..0940e9eea892770fada0b9bc0e05e22bebef1167
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/with-tune-config.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target { tune_cortex_a76 } } } */
+/* { dg-additional-options " -dA " } */
+
+void foo ()
+{}
+
+/* { dg-final { scan-assembler "//.tune cortex-a76" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/with-tune-march.c 
b/gcc/testsuite/gcc.target/aarch64/with-tune-march.c
new file mode 

Re: [PATCH 1/2] AArch64: Cleanup CPU option processing code

2020-09-14 Thread Wilco Dijkstra
Hi Richard,

>On 14/09/2020 15:19, Wilco Dijkstra wrote:
>> The --with-cpu/--with-arch configure option processing not only checks valid 
>> arguments
>> but also sets TARGET_CPU_DEFAULT with a CPU and extension bitmask.  This 
>> isn't used
>> however since a --with-cpu is translated into a -mcpu option which is 
>> processed as if
>> written on the command-line (so TARGET_CPU_DEFAULT is never accessed).
>> 
>> So remove all the complex processing and bitmask, and just validate the 
>> option.
>> Fix a bug that always reports valid architecture extensions as invalid.  As 
>> a result
>> the CPU processing in aarch64.c can be simplified.
>
> Doesn't this change the default behaviour if cc1 is run directly?  I'm
> not saying this is the wrong thing to do (I think we rely on this in the
> arm port), but I just want to understand by what you mean when you say
> 'never used'.

Yes it does change default behaviour of cc1, but I don't think it does matter.
I bootstrapped and passed regress with an assert to verify TARGET_CPU_DEFAULT
is never accessed if there is a --with-cpu configure option. So using cc1
directly is not standard practice (and I believe most other configuration
options are not baked into cc1 either).

How do we rely on it in the Arm port? That doesn't sound right...

Cheers,
Wilco


[PATCH] PR85678: Change default to -fno-common

2019-10-25 Thread Wilco Dijkstra
GCC currently defaults to -fcommon.  As discussed in the PR, this is an ancient
C feature which is not conforming with the latest C standards.  On many targets
this means global variable accesses have a codesize and performance penalty.
This applies to C code only, C++ code is not affected by -fcommon.  It is about
time to change the default.
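
For illustration (hypothetical example, not part of the patch), the behaviour
that changes is linking two translation units that both define a variable
without an initializer:

/* file1.c */
int counter;   /* Tentative definition.  */

/* file2.c */
int counter;   /* Merged silently with -fcommon; with -fno-common this now
                  gives a multiple-definition link error.  */
int use (void) { return counter; }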

OK for commit?

ChangeLog
2019-10-25  Wilco Dijkstra  

PR85678
* common.opt (fcommon): Change init to 1.

doc/
* invoke.texi (-fcommon): Update documentation.
---

diff --git a/gcc/common.opt b/gcc/common.opt
index 
0195b0cb85a06dd043fd0412b42dfffddfa2495b..b0840f41a5e480f4428bd62724b0dc3d54c68c0b
 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1131,7 +1131,7 @@ Common Report Var(flag_combine_stack_adjustments) 
Optimization
 Looks for opportunities to reduce stack adjustments and stack references.
 
 fcommon
-Common Report Var(flag_no_common,0)
+Common Report Var(flag_no_common,0) Init(1)
 Put uninitialized globals in the common section.
 
 fcompare-debug
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 
857d9692729e503657d0d0f44f1f6252ec90d49a..5b4ff66015f5f94a5bd89e4dc3d2d53553cc091e
 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -568,7 +568,7 @@ Objective-C and Objective-C++ Dialects}.
 -fnon-call-exceptions  -fdelete-dead-exceptions  -funwind-tables @gol
 -fasynchronous-unwind-tables @gol
 -fno-gnu-unique @gol
--finhibit-size-directive  -fno-common  -fno-ident @gol
+-finhibit-size-directive  -fcommon  -fno-ident @gol
 -fpcc-struct-return  -fpic  -fPIC  -fpie  -fPIE  -fno-plt @gol
 -fno-jump-tables @gol
 -frecord-gcc-switches @gol
@@ -14050,35 +14050,27 @@ useful for building programs to run under WINE@.
 code that is not binary compatible with code generated without that switch.
 Use it to conform to a non-default application binary interface.
 
-@item -fno-common
-@opindex fno-common
+@item -fcommon
 @opindex fcommon
+@opindex fno-common
 @cindex tentative definitions
-In C code, this option controls the placement of global variables 
-defined without an initializer, known as @dfn{tentative definitions} 
-in the C standard.  Tentative definitions are distinct from declarations 
+In C code, this option controls the placement of global variables
+defined without an initializer, known as @dfn{tentative definitions}
+in the C standard.  Tentative definitions are distinct from declarations
 of a variable with the @code{extern} keyword, which do not allocate storage.
 
-Unix C compilers have traditionally allocated storage for
-uninitialized global variables in a common block.  This allows the
-linker to resolve all tentative definitions of the same variable
+The default is @option{-fno-common}, which specifies that the compiler places
+uninitialized global variables in the BSS section of the object file.
+This inhibits the merging of tentative definitions by the linker so you get a
+multiple-definition error if the same variable is accidentally defined in more
+than one compilation unit.
+
+The @option{-fcommon} places uninitialized global variables in a common block.
+This allows the linker to resolve all tentative definitions of the same 
variable
 in different compilation units to the same object, or to a non-tentative
-definition.  
-This is the behavior specified by @option{-fcommon}, and is the default for 
-GCC on most targets.  
-On the other hand, this behavior is not required by ISO
-C, and on some targets may carry a speed or code size penalty on
-variable references.
-
-The @option{-fno-common} option specifies that the compiler should instead
-place uninitialized global variables in the BSS section of the object file.
-This inhibits the merging of tentative definitions by the linker so
-you get a multiple-definition error if the same 
-variable is defined in more than one compilation unit.
-Compiling with @option{-fno-common} is useful on targets for which
-it provides better performance, or if you wish to verify that the
-program will work on other systems that always treat uninitialized
-variable definitions this way.
+definition.  This behavior does not conform to ISO C, is inconsistent with C++,
+and on many targets implies a speed and code size penalty on global variable
+references.  It is mainly useful to enable legacy code to link without errors.
 
 @item -fno-ident
 @opindex fno-ident


Re: [PATCH] PR85678: Change default to -fno-common

2019-10-28 Thread Wilco Dijkstra
Hi Jeff,

> Has this been bootstrapped and regression tested?

Yes, it bootstraps OK of course. I ran regression over the weekend, there
are a few minor regressions in lto due to relying on tentative definitions
and a few latent bugs. I'd expect there will be a few similar failures on
other targets but nothing major since few testcases rely on -fcommon.
The big question is how it affects the distros.

Wilco





Re: [PATCH] PR85678: Change default to -fno-common

2019-10-28 Thread Wilco Dijkstra
Hi,

>> I suppose targets can override this decision. 
> I think they probably could via the override_options mechanism.

Yes, it's trivial to add this to target_option_override():

  if (!global_options_set.x_flag_no_common)
flag_no_common = 0;

Cheers,
Wilco








Re: [PATCH] PR85678: Change default to -fno-common

2019-10-29 Thread Wilco Dijkstra
Hi Iain,

> for the record,  Darwin bootstraps OK with the change (which is to be 
> expected,
> since the preferred setting for it is -fno-common).

That's good to hear.

> Testsuite fails are order “a few hundred” mostly seem to be related to 
> tree-prof
> and vector tests (plus the anticipated scan-asm stuff, where code-gen will 
> have
> changed).  I don’t have cycles to analyse the causes right now - but that 
> gives
> an idea.

Are those tests specific to Power? I got 14 failures in total across the full
C and C++ test suites. Note it's easy to update the default options for a
specific test directory if needed.

Wilco









[PATCH v2] PR85678: Change default to -fno-common

2019-10-29 Thread Wilco Dijkstra
v2: Tweak testsuite options to avoid failures

GCC currently defaults to -fcommon.  As discussed in the PR, this is an ancient
C feature which is not conforming with the latest C standards.  On many targets
this means global variable accesses have a codesize and performance penalty.
This applies to C code only, C++ code is not affected by -fcommon.  It is about
time to change the default.

Bootstrap OK, passes testsuite on AArch64. OK for commit?

ChangeLog
2019-10-29  Wilco Dijkstra  

PR85678
* common.opt (fcommon): Change init to 1.

doc/
* invoke.texi (-fcommon): Update documentation.

testsuite/

* gcc.dg/alias-15.c: Add -fcommon.
* gcc.dg/fdata-sections-1.c: Likewise.  
* gcc.dg/ipa/pr77653.c: Likewise.
* gcc.dg/lto/20090729_0.c: Likewise.
* gcc.dg/lto/20111207-1_0.c: Likewise.
* gcc.dg/lto/c-compatible-types-1_0.c: Likewise.
* gcc.dg/lto/pr55525_0.c: Likewise.
* gcc.target/aarch64/sve/peel_ind_1.c: Allow ANCHOR0.
* gcc.target/aarch64/sve/peel_ind_2.c: Likewise
* gcc.target/aarch64/sve/peel_ind_3.c: Likewise
* lib/lto.exp (lto_init): Add -fcommon.
---

diff --git a/gcc/common.opt b/gcc/common.opt
index 
f74b10aafc223e4961915b009c092f4876eddba4..798b6aeff3536e21c95752b5dd085f8ffef04643
 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1131,7 +1131,7 @@ Common Report Var(flag_combine_stack_adjustments) 
Optimization
 Looks for opportunities to reduce stack adjustments and stack references.
 
 fcommon
-Common Report Var(flag_no_common,0)
+Common Report Var(flag_no_common,0) Init(1)
 Put uninitialized globals in the common section.
 
 fcompare-debug
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 
92fb316a368a4a36218fac6de2744c7ab6446ef5..18cfd07d4bbb4b866808db0701faf88bddbd9a94
 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -568,7 +568,7 @@ Objective-C and Objective-C++ Dialects}.
 -fnon-call-exceptions  -fdelete-dead-exceptions  -funwind-tables @gol
 -fasynchronous-unwind-tables @gol
 -fno-gnu-unique @gol
--finhibit-size-directive  -fno-common  -fno-ident @gol
+-finhibit-size-directive  -fcommon  -fno-ident @gol
 -fpcc-struct-return  -fpic  -fPIC  -fpie  -fPIE  -fno-plt @gol
 -fno-jump-tables @gol
 -frecord-gcc-switches @gol
@@ -14049,35 +14049,27 @@ useful for building programs to run under WINE@.
 code that is not binary compatible with code generated without that switch.
 Use it to conform to a non-default application binary interface.
 
-@item -fno-common
-@opindex fno-common
+@item -fcommon
 @opindex fcommon
+@opindex fno-common
 @cindex tentative definitions
-In C code, this option controls the placement of global variables 
-defined without an initializer, known as @dfn{tentative definitions} 
-in the C standard.  Tentative definitions are distinct from declarations 
+In C code, this option controls the placement of global variables
+defined without an initializer, known as @dfn{tentative definitions}
+in the C standard.  Tentative definitions are distinct from declarations
 of a variable with the @code{extern} keyword, which do not allocate storage.
 
-Unix C compilers have traditionally allocated storage for
-uninitialized global variables in a common block.  This allows the
-linker to resolve all tentative definitions of the same variable
+The default is @option{-fno-common}, which specifies that the compiler places
+uninitialized global variables in the BSS section of the object file.
+This inhibits the merging of tentative definitions by the linker so you get a
+multiple-definition error if the same variable is accidentally defined in more
+than one compilation unit.
+
+The @option{-fcommon} places uninitialized global variables in a common block.
+This allows the linker to resolve all tentative definitions of the same 
variable
 in different compilation units to the same object, or to a non-tentative
-definition.  
-This is the behavior specified by @option{-fcommon}, and is the default for 
-GCC on most targets.  
-On the other hand, this behavior is not required by ISO
-C, and on some targets may carry a speed or code size penalty on
-variable references.
-
-The @option{-fno-common} option specifies that the compiler should instead
-place uninitialized global variables in the BSS section of the object file.
-This inhibits the merging of tentative definitions by the linker so
-you get a multiple-definition error if the same 
-variable is defined in more than one compilation unit.
-Compiling with @option{-fno-common} is useful on targets for which
-it provides better performance, or if you wish to verify that the
-program will work on other systems that always treat uninitialized
-variable definitions this way.
+definition.  This behavior does not conform to ISO C, is inconsistent with C++,
+and on many targets implies a speed and code size penalty on global variable
+references.  It is mainly useful to enable legacy code to link without errors

Re: [PATCH v2] PR85678: Change default to -fno-common

2019-10-30 Thread Wilco Dijkstra
Hi Richard,

> Please don't add -fcommon in lto.exp.

So what is the best way to add an extra option to lto.exp?
Note dg-lto-options completely overrides the options from lto.exp, so I can't
use that except in tests which already use it.

Cheers,
Wilco

Re: [PATCH v2] PR85678: Change default to -fno-common

2019-11-04 Thread Wilco Dijkstra
Hi Richard,

>> > Please don't add -fcommon in lto.exp.
>>
>> So what is the best way to add an extra option to lto.exp?
>> Note dg-lto-options completely overrides the options from lto.exp, so I can't
>> use that except in tests which already use it.
>
> On what testcases do you need it at all?

These need it in order to run over the original set of LTO options. A
possibility would be to select one of the set of options and just run that
using dg-lto-options (assuming it's safe to use -flto-partition and/or
-flinker-plugin on all targets).

PASS->FAIL: g++.dg/lto/odr-6 2 (test for LTO warnings, odr-6_0.C line 3)
PASS->FAIL: g++.dg/lto/odr-6 2 (test for LTO warnings, odr-6_0.C line 3)
PASS->FAIL: g++.dg/lto/odr-6 2 (test for LTO warnings, odr-6_1.c line 1)
PASS->FAIL: g++.dg/lto/odr-6 2 (test for LTO warnings, odr-6_1.c line 1)
PASS->FAIL: g++.dg/lto/odr-6 cp_lto_odr-6_0.o-cp_lto_odr-6_1.o link, -O0 -flto 
-flto-partition=1to1 -fno-use-linker-plugin 
PASS->FAIL: g++.dg/lto/odr-6 cp_lto_odr-6_0.o-cp_lto_odr-6_1.o link, -O0 -flto 
-flto-partition=none -fuse-linker-plugin
PASS->FAIL: g++.dg/lto/odr-6 cp_lto_odr-6_0.o-cp_lto_odr-6_1.o link, -O0 -flto 
-fuse-linker-plugin -fno-fat-lto-objects 
PASS->FAIL: g++.dg/lto/odr-6 cp_lto_odr-6_0.o-cp_lto_odr-6_1.o link, -O2 -flto 
-flto-partition=1to1 -fno-use-linker-plugin 
PASS->FAIL: g++.dg/lto/odr-6 cp_lto_odr-6_0.o-cp_lto_odr-6_1.o link, -O2 -flto 
-flto-partition=none -fuse-linker-plugin -fno-fat-lto-objects 
PASS->FAIL: g++.dg/lto/odr-6 cp_lto_odr-6_0.o-cp_lto_odr-6_1.o link, -O2 -flto 
-fuse-linker-plugin


PASS->FAIL: gcc.dg/lto/pr88077 c_lto_pr88077_0.o-c_lto_pr88077_1.o link, -O0 
-flto -flto-partition=1to1 -fno-use-linker-plugin 
PASS->FAIL: gcc.dg/lto/pr88077 c_lto_pr88077_0.o-c_lto_pr88077_1.o link, -O0 
-flto -flto-partition=none -fuse-linker-plugin
PASS->FAIL: gcc.dg/lto/pr88077 c_lto_pr88077_0.o-c_lto_pr88077_1.o link, -O0 
-flto -fuse-linker-plugin -fno-fat-lto-objects 
PASS->FAIL: gcc.dg/lto/pr88077 c_lto_pr88077_0.o-c_lto_pr88077_1.o link, -O2 
-flto -flto-partition=1to1 -fno-use-linker-plugin 
PASS->FAIL: gcc.dg/lto/pr88077 c_lto_pr88077_0.o-c_lto_pr88077_1.o link, -O2 
-flto -flto-partition=none -fuse-linker-plugin -fno-fat-lto-objects 
PASS->FAIL: gcc.dg/lto/pr88077 c_lto_pr88077_0.o-c_lto_pr88077_1.o link, -O2 
-flto -fuse-linker-plugin


Wilco


Re: [PATCH v3] PR85678: Change default to -fno-common

2019-11-05 Thread Wilco Dijkstra
Hi Richard,

> Please investigate those - C++ has -fno-common already so it might be a mix
> of C/C++ required here.  Note that secondary files can use dg-options
> with the same behavior as dg-additional-options (they append to 
> dg-lto-options),
> so here in _1.c add { dg-options "-fcommon" }

The odr-6 test mixes C and C++ using .C/.c extensions. But like you suggest,
dg-options works on the 2nd file, and with that odr-6 and pr88077 tests pass
without needing changes in lto.exp. I needed to change one of the types since
default object alignment is different between -fcommon and -fno-common and
that causes linker failures when linking objects built with different
-fcommon settings. I also checked regress on x64, there was one minor failure
because of the alignment change, which is easily fixed.

So here is v3:

[PATCH v3] PR85678: Change default to -fno-common

GCC currently defaults to -fcommon.  As discussed in the PR, this is an ancient
C feature which is not conforming with the latest C standards.  On many targets
this means global variable accesses have a codesize and performance penalty.
This applies to C code only, C++ code is not affected by -fcommon.  It is about
time to change the default.

Passes bootstrap and regress on AArch64 and x64. OK for commit?

ChangeLog
2019-11-05  Wilco Dijkstra  

PR85678
* common.opt (fcommon): Change init to 1.

doc/
* invoke.texi (-fcommon): Update documentation.

testsuite/
* g++.dg/lto/odr-6_1.c: Add -fcommon.
* gcc.dg/alias-15.c: Likewise.
* gcc.dg/fdata-sections-1.c: Likewise.  
* gcc.dg/ipa/pr77653.c: Likewise.
* gcc.dg/lto/20090729_0.c: Likewise.
* gcc.dg/lto/20111207-1_0.c: Likewise.
* gcc.dg/lto/c-compatible-types-1_0.c: Likewise.
* gcc.dg/lto/pr55525_0.c: Likewise.
* gcc.dg/lto/pr88077_0.c: Use long to avoid alignment warning.
* gcc.dg/lto/pr88077_1.c: Add -fcommon.
* gcc.target/aarch64/sve/peel_ind_1.c: Allow ANCHOR0.
* gcc.target/aarch64/sve/peel_ind_2.c: Likewise.
* gcc.target/aarch64/sve/peel_ind_3.c: Likewise.
* gcc.target/i386/volatile-bitfields-2.c: Allow movl or movq.

diff --git a/gcc/common.opt b/gcc/common.opt
index 
f74b10aafc223e4961915b009c092f4876eddba4..798b6aeff3536e21c95752b5dd085f8ffef04643
 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1131,7 +1131,7 @@ Common Report Var(flag_combine_stack_adjustments) 
Optimization
 Looks for opportunities to reduce stack adjustments and stack references.
 
 fcommon
-Common Report Var(flag_no_common,0)
+Common Report Var(flag_no_common,0) Init(1)
 Put uninitialized globals in the common section.
 
 fcompare-debug
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 
92fb316a368a4a36218fac6de2744c7ab6446ef5..18cfd07d4bbb4b866808db0701faf88bddbd9a94
 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -568,7 +568,7 @@ Objective-C and Objective-C++ Dialects}.
 -fnon-call-exceptions  -fdelete-dead-exceptions  -funwind-tables @gol
 -fasynchronous-unwind-tables @gol
 -fno-gnu-unique @gol
--finhibit-size-directive  -fno-common  -fno-ident @gol
+-finhibit-size-directive  -fcommon  -fno-ident @gol
 -fpcc-struct-return  -fpic  -fPIC  -fpie  -fPIE  -fno-plt @gol
 -fno-jump-tables @gol
 -frecord-gcc-switches @gol
@@ -14049,35 +14049,27 @@ useful for building programs to run under WINE@.
 code that is not binary compatible with code generated without that switch.
 Use it to conform to a non-default application binary interface.
 
-@item -fno-common
-@opindex fno-common
+@item -fcommon
 @opindex fcommon
+@opindex fno-common
 @cindex tentative definitions
-In C code, this option controls the placement of global variables 
-defined without an initializer, known as @dfn{tentative definitions} 
-in the C standard.  Tentative definitions are distinct from declarations 
+In C code, this option controls the placement of global variables
+defined without an initializer, known as @dfn{tentative definitions}
+in the C standard.  Tentative definitions are distinct from declarations
 of a variable with the @code{extern} keyword, which do not allocate storage.
 
-Unix C compilers have traditionally allocated storage for
-uninitialized global variables in a common block.  This allows the
-linker to resolve all tentative definitions of the same variable
+The default is @option{-fno-common}, which specifies that the compiler places
+uninitialized global variables in the BSS section of the object file.
+This inhibits the merging of tentative definitions by the linker so you get a
+multiple-definition error if the same variable is accidentally defined in more
+than one compilation unit.
+
+The @option{-fcommon} places uninitialized global variables in a common block.
+This allows the linker to resolve all tentative definitions of the same 
variable
 in different compilation units to the same object, or to a non-tentative
-definition

[PATCH][Arm] Only enable fsched-pressure with Ofast

2019-11-06 Thread Wilco Dijkstra
The current pressure scheduler doesn't appear to correctly track register
pressure and avoid creating unnecessary spills when register pressure is high.
As a result disabling the early scheduler improves integer performance
considerably and reduces codesize as a bonus. Since scheduling floating point
code is generally beneficial (more registers and higher latencies), only enable
the pressure scheduler with -Ofast.

On Cortex-A57 this gives a 0.7% performance gain on SPECINT2006 as well
as a 0.2% codesize reduction.

Bootstrapped on armhf. OK for commit?

ChangeLog:

2019-11-06  Wilco Dijkstra  

* gcc/common/config/arm/arm-common.c (arm_option_optimization_table):
Enable fsched_pressure with Ofast only.

--
diff --git a/gcc/common/config/arm/arm-common.c 
b/gcc/common/config/arm/arm-common.c
index 
41a920f6dc96833e778faa8dbcc19beac483734c..b761d3abd670a144a593c4b410b1e7fbdcb52f56
 100644
--- a/gcc/common/config/arm/arm-common.c
+++ b/gcc/common/config/arm/arm-common.c
@@ -38,7 +38,7 @@ static const struct default_options 
arm_option_optimization_table[] =
   {
 /* Enable section anchors by default at -O1 or higher.  */
 { OPT_LEVELS_1_PLUS, OPT_fsection_anchors, NULL, 1 },
-{ OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
+{ OPT_LEVELS_FAST, OPT_fsched_pressure, NULL, 1 },
 { OPT_LEVELS_NONE, 0, NULL, 0 }
   };
 


[PATCH][ARM] Improve max_cond_insns setting for Cortex cores

2019-11-06 Thread Wilco Dijkstra
Various CPUs have max_cond_insns set to 5 due to historical reasons.
Benchmarking shows that max_cond_insns=2 is fastest on modern Cortex-A
cores, so change it to 2 for all Cortex-A cores.  Set max_cond_insns
to 4 on Thumb-2 architectures given it's already limited to that by
MAX_INSN_PER_IT_BLOCK.  Also use the CPU tuning setting when a CPU/tune
is selected if -mrestrict-it is not explicitly set.

On Cortex-A57 this gives 1.1% performance gain on SPECINT2006 as well
as a 0.4% codesize reduction.
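
For context, max_cond_insns bounds how many instructions if-conversion may
predicate rather than branch around, so with the lower limit only short
bodies such as this hypothetical example remain candidates:

int clamp (int x, int limit)
{
  if (x > limit)   /* Short body: still a candidate with max_cond_insns == 2.  */
    x = limit;
  return x;
}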

Bootstrapped on armhf. OK for commit?

ChangeLog:

2019-08-19  Wilco Dijkstra  

* gcc/config/arm/arm.c (arm_option_override_internal):
Use max_cond_insns from CPU tuning unless -mrestrict-it is used.
(arm_v6t2_tune): set max_cond_insns to 4.
(arm_cortex_tune): set max_cond_insns to 2.
(arm_cortex_a8_tune): Likewise.
(arm_cortex_a7_tune): Likewise.
(arm_cortex_a35_tune): Likewise.
(arm_cortex_a53_tune): Likewise.
(arm_cortex_a5_tune): Likewise.
(arm_cortex_a9_tune): Likewise.
(arm_v6m_tune): set max_cond_insns to 4.
---

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
628cf02f23fb29392a63d87f561c3ee2fb73a515..38ac16ad1def91ca78ccfa98fd1679b2b5114851
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -1943,7 +1943,7 @@ const struct tune_params arm_v6t2_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  4,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   1,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -1968,7 +1968,7 @@ const struct tune_params arm_cortex_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  2,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   2,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -1991,7 +1991,7 @@ const struct tune_params arm_cortex_a8_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  2,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   2,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -2014,7 +2014,7 @@ const struct tune_params arm_cortex_a7_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  2,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   2,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -2060,7 +2060,7 @@ const struct tune_params arm_cortex_a35_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  2,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   1,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -2083,7 +2083,7 @@ const struct tune_params arm_cortex_a53_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  2,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   2,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -2167,9 +2167,6 @@ const struct tune_params arm_xgene1_tune =
   tune_params::SCHED_AUTOPREF_OFF
 };
 
-/* Branches can be dual-issued on Cortex-A5, so conditional execution is
-   less appealing.  Set max_insns_skipped to a low value.  */
-
 const struct tune_params arm_cortex_a5_tune =
 {
   &cortexa5_extra_costs,
@@ -2178,7 +2175,7 @@ const struct tune_params arm_cortex_a5_tune =
   arm_cortex_a5_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  1, 

[PATCH] PR90838: Support ctz idioms

2019-11-12 Thread Wilco Dijkstra
Hi,

Support common idioms for count trailing zeroes using an array lookup.
The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
constant which when multiplied by a power of 2 contains a unique value
in the top 5 or 6 bits.  This is then indexed into a table which maps it
to the number of trailing zeroes.  When the table is valid, we emit a
sequence using the target defined value for ctz (0):

int ctz1 (unsigned x)
{
  static const char table[32] =
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}

Is optimized to:

rbit w0, w0
clz w0, w0
and w0, w0, 31
ret
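
Put differently (a sketch, not part of the patch), for this table the emitted
sequence behaves like:

int ctz1_equiv (unsigned x)
{
  /* The final 'and w0, w0, 31' maps clz (rbit (0)) == 32 back to 0, which
     matches table[0] above.  */
  return x ? __builtin_ctz (x) : 0;
}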

Bootstrapped on AArch64. OK for commit?

ChangeLog:

2019-11-12  Wilco Dijkstra  

PR tree-optimization/90838
* generic-match-head.c (optimize_count_trailing_zeroes):
Add stub function.
* gimple-match-head.c (gimple_simplify): Add support for ARRAY_REF.
(optimize_count_trailing_zeroes): Add new function.
* match.pd: Add matching for ctz idioms.
* testsuite/gcc.target/aarch64/pr90838.c: New test.

--

diff --git a/gcc/generic-match-head.c b/gcc/generic-match-head.c
index 
fdc603977fc5b03a843944f75ce262f5d2256308..5a38bd233585225d60f0159c9042a16d9fdc9d80
 100644
--- a/gcc/generic-match-head.c
+++ b/gcc/generic-match-head.c
@@ -88,3 +88,10 @@ optimize_successive_divisions_p (tree, tree)
 {
   return false;
 }
+
+static bool
+optimize_count_trailing_zeroes (tree type, tree array_ref, tree input,
+   tree mulc, tree shift, tree &zero_val)
+{
+  return false;
+}
diff --git a/gcc/gimple-match-head.c b/gcc/gimple-match-head.c
index 
53278168a59f5ac10ce6760f04fd42589a0792e7..2d3b305f8ea54e4ca31c64994af30b34bb7eff09
 100644
--- a/gcc/gimple-match-head.c
+++ b/gcc/gimple-match-head.c
@@ -909,6 +909,24 @@ gimple_simplify (gimple *stmt, gimple_match_op *res_op, 
gimple_seq *seq,
res_op->set_op (TREE_CODE (op0), type, valueized);
return true;
  }
+   else if (code == ARRAY_REF)
+ {
+   tree rhs1 = gimple_assign_rhs1 (stmt);
+   tree op1 = TREE_OPERAND (rhs1, 1);
+   tree op2 = TREE_OPERAND (rhs1, 2);
+   tree op3 = TREE_OPERAND (rhs1, 3);
+   tree op0 = TREE_OPERAND (rhs1, 0);
+   bool valueized = false;
+
+   op0 = do_valueize (op0, top_valueize, valueized);
+   op1 = do_valueize (op1, top_valueize, valueized);
+
+   if (op2 && op3)
+ res_op->set_op (code, type, op0, op1, op2, op3);
+   else
+ res_op->set_op (code, type, op0, op1);
+   return gimple_resimplify4 (seq, res_op, valueize) || valueized;
+ }
break;
  case GIMPLE_UNARY_RHS:
{
@@ -1222,3 +1240,57 @@ optimize_successive_divisions_p (tree divisor, tree 
inner_div)
 }
   return true;
 }
+
+/* Recognize count trailing zeroes idiom.
+   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
+   constant which when multiplied by a power of 2 contains a unique value
+   in the top 5 or 6 bits.  This is then indexed into a table which maps it
+   to the number of trailing zeroes.  Array[0] is returned so the caller can
+   emit an appropriate sequence depending on whether ctz (0) is defined on
+   the target.  */
+static bool
+optimize_count_trailing_zeroes (tree type, tree array, tree x, tree mulc,
+   tree tshift, tree &zero_val)
+{
+  gcc_assert (TREE_CODE (mulc) == INTEGER_CST);
+  gcc_assert (TREE_CODE (tshift) == INTEGER_CST);
+
+  tree input_type = TREE_TYPE (x);
+
+  if (!direct_internal_fn_supported_p (IFN_CTZ, input_type, OPTIMIZE_FOR_BOTH))
+return false;
+
+  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
+  unsigned shiftval = tree_to_uhwi (tshift);
+  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));
+
+  /* Check the array is not wider than integer type and the input is a 32-bit
+ or 64-bit type.  The shift should extract the top 5..7 bits.  */
+  if (TYPE_PRECISION (type) > 32)
+return false;
+  if (input_bits != 32 && input_bits != 64)
+return false;
+  if (shiftval < input_bits - 7 || shiftval > input_bits - 5)
+return false;
+
+  tree t = build4 (ARRAY_REF, type, array, size_int (0), NULL_TREE, NULL_TREE);
+  t = fold_const_aggregate_ref (t);
+  if (t == NULL)
+return false;
+
+  zero_val = build_int_cst (integer_type_node, tree_to_shwi (t));
+
+  for (unsigned i = 0; i < input_bits; i++, val <<= 1)
+{
+  if (input_bits == 32)
+   val &= 0xffffffff;
+  t = build4 (ARRAY_REF, type, array, size_int ((int)(val >> shiftval)),
+  

Re: [PATCH] PR90838: Support ctz idioms

2019-11-13 Thread Wilco Dijkstra
Hi Segher,

> Out of interest, what uses this?  I have never seen it before.

It's used in sjeng in SPEC and gives a 2% speedup on Cortex-A57.
Tricks like this used to be very common 20 years ago since a loop or binary 
search
is way too slow and few CPUs supported fast clz/ctz instructions. It's one of 
those
instructions you rarely need, but when you do, performance is absolutely 
critical...

As Jakub mentioned in the PR, 
https://doc.lagout.org/security/Hackers%20Delight.pdf
is a good resource for these bit tricks.

Cheers,
Wilco




Re: [PATCH] Further bootstrap unbreak (was Re: [PATCH] PR90838: Support ctz idioms)

2020-01-13 Thread Wilco Dijkstra
Hi Jakub,

On Sat, Jan 11, 2020 at 05:30:52PM +0100, Jakub Jelinek wrote:
> On Sat, Jan 11, 2020 at 05:24:19PM +0100, Andreas Schwab wrote:
> > ../../gcc/tree-ssa-forwprop.c: In function 'bool 
> > simplify_count_trailing_zeroes(gimple_stmt_iterator*)':
> > ../../gcc/tree-ssa-forwprop.c:1925:23: error: variable 'mode' set but not 
> > used [-Werror=unused-but-set-variable]
> >  1925 |   scalar_int_mode mode = SCALAR_INT_TYPE_MODE (type);
> >   |   ^~~~
> 
> Oops, then I think we need following, but can't commit it until Monday.

Thanks for sorting this out so quickly, Jakub! It looks like we need to
convert these macros to targetm hooks given it's too difficult to get them to
compile without warnings/errors...
That would also allow us to fix the odd interface of CTZ_DEFINED_VALUE_AT_ZERO
and remove the 1 vs 2 distinction.

Cheers,
Wilco




[PATCH] Fix ctz issues (PR93231)

2020-01-13 Thread Wilco Dijkstra
Further improve the ctz recognition: Avoid ICEing on negative shift
counts or multiply constants.  Check the type is 8 bits for the string
constant case to avoid accidentally matching a wide STRING_CST.
Add a tree_expr_nonzero_p check to allow the optimization even if
CTZ_DEFINED_VALUE_AT_ZERO returns 0 or 1.  Add extra test cases.

(note the diff uses the old tree and includes Jakub's bootstrap fixes)

Bootstrap OK on AArch64 and x64.

ChangeLog:
2020-01-13  Wilco Dijkstra  

PR tree-optimization/93231
* tree-ssa-forwprop.c
(optimize_count_trailing_zeroes): Use tree_to_shwi for shift
and TREE_INT_CST_LOW for multiply constants.  Check CST_STRING
element size is 8 bits.
(simplify_count_trailing_zeroes): Add test to handle known non-zero
inputs more efficiently.

testsuite/
* gcc.dg/pr90838.c: New test.
* gcc.dg/pr93231.c: New test.
---

diff --git a/gcc/testsuite/gcc.dg/pr90838.c b/gcc/testsuite/gcc.dg/pr90838.c
new file mode 100644
index 
..8070d439f6404dc6884a11e58f1db41c435e61fb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr90838.c
@@ -0,0 +1,59 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop2-details" } */
+
+int ctz1 (unsigned x)
+{
+  static const char table[32] = "\x00\x01\x1c\x02\x1d\x0e\x18\x03\x1e\x16\x14"
+"\x0f\x19\x11\x04\b\x1f\x1b\r\x17\x15\x13\x10\x07\x1a\f\x12\x06\v\x05\n\t";
+
+  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
+}
+
+int ctz2 (unsigned x)
+{
+  const int u = 0;
+  static short table[64] =
+{
+  32, 0, 1,12, 2, 6, u,13, 3, u, 7, u, u, u, u,14,
+  10, 4, u, u, 8, u, u,25, u, u, u, u, u,21,27,15,
+  31,11, 5, u, u, u, u, u, 9, u, u,24, u, u,20,26,
+  30, u, u, u, u,23, u,19,29, u,22,18,28,17,16, u
+};
+
+  x = (x & -x) * 0x0450FBAF;
+  return table[x >> 26];
+}
+
+int ctz3 (unsigned x)
+{
+  static int table[32] =
+{
+  0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26,
+  31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27
+};
+
+  if (x == 0) return 32;
+  x = (x & -x) * 0x04D7651F;
+  return table[x >> 27];
+}
+
+static const unsigned long long magic = 0x03f08c5392f756cdULL;
+
+static const char table[64] = {
+ 0,  1, 12,  2, 13, 22, 17,  3,
+14, 33, 23, 36, 18, 58, 28,  4,
+62, 15, 34, 26, 24, 48, 50, 37,
+19, 55, 59, 52, 29, 44, 39,  5,
+63, 11, 21, 16, 32, 35, 57, 27,
+61, 25, 47, 49, 54, 51, 43, 38,
+10, 20, 31, 56, 60, 46, 53, 42,
+ 9, 30, 45, 41,  8, 40,  7,  6,
+};
+
+int ctz4 (unsigned long x)
+{
+  unsigned long lsb = x & -x;
+  return table[(lsb * magic) >> 58];
+}
+
+/* { dg-final { scan-tree-dump-times {= \.CTZ} 4 "forwprop2" { target 
aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/pr93231.c b/gcc/testsuite/gcc.dg/pr93231.c
new file mode 100644
index 
..80853bad23b28abbe51bb6e2b9f8beeb06618e2f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr93231.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop2-details -Wno-shift-count-negative" 
} */
+
+int ctz_ice1 (int x)
+{
+  static const char table[32] =
+{
+  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
+  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
+};
+
+  return table[((int)((x & -x) * -0x077CB531)) >> 27];
+}
+
+int ctz_ice2 (unsigned x)
+{
+  static const char table[32] =
+{
+  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
+  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
+};
+
+  return table[((unsigned)((x & -x) * 0x077CB531U)) >> -27];
+}
+
+// This should never match
+int ctz_fail (int x)
+{
+  static const unsigned short int table[32] =
+
u"\x0100\x021c\x0e1d\x0318\x161e\x0f14\x1119\x0804\x1b1f\x170d\x1315\x0710\x0c1a\x0612\x050b\x090a";
+
+  return table[((int)((x & -x) * 0x077CB531)) >> 27];
+}
+
+/* { dg-final { scan-tree-dump-not {= \.CTZ} "forwprop2" } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index 
aced6eb2d9139cae277593022415bc5efdf45175..afc52e5dfbaea29cb07707fee6cbbd3e0c969eda
 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -1864,8 +1864,8 @@ optimize_count_trailing_zeroes (tree array_ref, tree x, 
tree mulc,
   tree input_type = TREE_TYPE (x);
   unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));
 
-  /* Check the array is not wider than integer type and the input is a 32-bit
- or 64-bit type.  */
+  /* Check the array element type is not wider than 32 bits and the input is
+ a 32-bit or 64-bit type.  */
   if (TYPE_PRECISION (type) > 32)
 return false;
   if (input_bits != 32 && input_bits != 64)
@@ -1879,7 +1879,7 @@ optimize_count_trailing_zeroes (tree array_ref, tree x, 
tree mulc,
   if (!low

Re: [PATCH] Fix ctz issues (PR93231)

2020-01-15 Thread Wilco Dijkstra
Hi Jakub,

>> (note the diff uses the old tree and includes Jakub's bootstrap fixes)
> You should rebase it because you'll be committing it against trunk
> which already has those changes.

Sure, it was just the small matter of updating the many GCC checkouts I have...

>> -  unsigned shiftval = tree_to_uhwi (tshift);
>> +  unsigned shiftval = tree_to_shwi (tshift);
>
> This relies on the FEs to narrow the type of say:
> x >> (((__uint128_t) 0x12 << 64) | 0x1234567812345678ULL)

Gimple canonicalizes all shifts to integer type, eg. 
x >> 0xabcdef0001 is converted to x >> 1 with a warning.

>> -  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
>> +  /* Extract the binary representation of the multiply constant.  */
>> +  unsigned HOST_WIDE_INT val = TREE_INT_CST_LOW (mulc);
>
> Will it work properly with the signed types though?
> The difference is whether the C code we are matching will use logical or
> arithmetic right shift.  And if the power of two times mulc ever can have
> sign bit set, it will then use negative indexes into the array.

The multiply can be signed or unsigned, and the immediate can be positive or
negative, but the shift must be unsigned indeed. I thought the match.pd
pattern only allows unsigned shifts, but the shift operator allows signed shifts
too, so I've added an unsigned check on the input_type.
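
The added guard is roughly this shape (a sketch, not the exact committed
hunk):

  /* Only logical (unsigned) right shifts of the multiply result are
     accepted.  */
  if (!TYPE_UNSIGNED (input_type))
    return false;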

>> -  if (TREE_CODE (ctor) == STRING_CST)
>> +  if (TREE_CODE (ctor) == STRING_CST && TYPE_PRECISION (type) == 8)
>
> Isn't another precondition that BITS_PER_UNIT is 8 (because STRING_CSTs are
> really bytes)?

I've used CHAR_TYPE_SIZE instead of 8, but I don't think GCC supports anything 
other
than BITS_PER_UNIT == 8 and CHAR_TYPE_SIZE == 8. GCC uses memcmp/strlen
on STRING_CST (as well as direct accesses as 'char') which won't work if the 
host and
target chars are not the same size.

Here is the updated version:

Further improve the ctz recognition: Avoid ICEing on negative shift
counts or multiply constants.  Check the type is a char type for the
string constant case to avoid accidentally matching a wide STRING_CST.
Add a tree_expr_nonzero_p check to allow the optimization even if
CTZ_DEFINED_VALUE_AT_ZERO returns 0 or 1.  Add extra test cases.

Bootstrap OK on AArch64 and x64.

ChangeLog:

2020-01-15  Wilco Dijkstra  

PR tree-optimization/93231
* tree-ssa-forwprop.c
(optimize_count_trailing_zeroes): Check input_type is unsigned.
Use tree_to_shwi for shift constant.  Check CST_STRING element
size is CHAR_TYPE_SIZE bits.
(simplify_count_trailing_zeroes): Add test to handle known non-zero
inputs more efficiently.

testsuite/
* gcc.dg/pr90838.c: New test.
* gcc.dg/pr93231.c: New test.
--
diff --git a/gcc/testsuite/gcc.dg/pr90838.c b/gcc/testsuite/gcc.dg/pr90838.c
new file mode 100644
index 
..8070d439f6404dc6884a11e58f1db41c435e61fb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr90838.c
@@ -0,0 +1,59 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop2-details" } */
+
+int ctz1 (unsigned x)
+{
+  static const char table[32] = "\x00\x01\x1c\x02\x1d\x0e\x18\x03\x1e\x16\x14"
+"\x0f\x19\x11\x04\b\x1f\x1b\r\x17\x15\x13\x10\x07\x1a\f\x12\x06\v\x05\n\t";
+
+  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
+}
+
+int ctz2 (unsigned x)
+{
+  const int u = 0;
+  static short table[64] =
+{
+  32, 0, 1,12, 2, 6, u,13, 3, u, 7, u, u, u, u,14,
+  10, 4, u, u, 8, u, u,25, u, u, u, u, u,21,27,15,
+  31,11, 5, u, u, u, u, u, 9, u, u,24, u, u,20,26,
+  30, u, u, u, u,23, u,19,29, u,22,18,28,17,16, u
+};
+
+  x = (x & -x) * 0x0450FBAF;
+  return table[x >> 26];
+}
+
+int ctz3 (unsigned x)
+{
+  static int table[32] =
+{
+  0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26,
+  31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27
+};
+
+  if (x == 0) return 32;
+  x = (x & -x) * 0x04D7651F;
+  return table[x >> 27];
+}
+
+static const unsigned long long magic = 0x03f08c5392f756cdULL;
+
+static const char table[64] = {
+ 0,  1, 12,  2, 13, 22, 17,  3,
+14, 33, 23, 36, 18, 58, 28,  4,
+62, 15, 34, 26, 24, 48, 50, 37,
+19, 55, 59, 52, 29, 44, 39,  5,
+63, 11, 21, 16, 32, 35, 57, 27,
+61, 25, 47, 49, 54, 51, 43, 38,
+10, 20, 31, 56, 60, 46, 53, 42,
+ 9, 30, 45, 41,  8, 40,  7,  6,
+};
+
+int ctz4 (unsigned long x)
+{
+  unsigned long lsb = x & -x;
+  return table[(lsb * magic) >> 58];
+}
+
+/* { dg-final { scan-tree-dump-times {= \.CTZ} 4 "forwprop2" { target 
aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/pr93231.c b/gcc/testsuite/gcc.dg/pr93231.c
new file mode 100644
index 
..cd0b3f320f78ffdd3d82cf487a63

[PATCH][AArch64] Fix shrinkwrapping interactions with atomics (PR92692)

2020-01-16 Thread Wilco Dijkstra
The separate shrinkwrapping pass may insert stores in the middle
of atomics loops which can cause issues on some implementations.
Avoid this by delaying splitting of atomic patterns until after
prolog/epilog generation.

Bootstrap completed, no test regressions on AArch64. 

Andrew, can you verify this fixes the failure you were getting?

ChangeLog:
2020-01-16  Wilco Dijkstra  

PR target/92692
* config/aarch64/aarch64.c (aarch64_split_compare_and_swap)
Add assert to ensure prolog has been emitted.
(aarch64_split_atomic_op): Likewise.
* config/aarch64/atomics.md (aarch64_compare_and_swap)
Use epilogue_completed rather than reload_completed.
(aarch64_atomic_exchange): Likewise.
(aarch64_atomic_): Likewise.
(atomic_nand): Likewise.
(aarch64_atomic_fetch_): Likewise.
(atomic_fetch_nand): Likewise.
(aarch64_atomic__fetch): Likewise.
(atomic_nand_fetch): Likewise.
---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
ac89cc1f9c938455d33d8850d9ebfc0473cb73dc..cd9d813f2ac4990971f6435fdb28b0f94ae10309
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -18375,6 +18375,9 @@ aarch64_emit_post_barrier (enum memmodel model)
 void
 aarch64_split_compare_and_swap (rtx operands[])
 {
+  /* Split after prolog/epilog to avoid interactions with shrinkwrapping.  */
+  gcc_assert (epilogue_completed);
+
   rtx rval, mem, oldval, newval, scratch, x, model_rtx;
   machine_mode mode;
   bool is_weak;
@@ -18469,6 +18472,9 @@ void
 aarch64_split_atomic_op (enum rtx_code code, rtx old_out, rtx new_out, rtx mem,
 rtx value, rtx model_rtx, rtx cond)
 {
+  /* Split after prolog/epilog to avoid interactions with shrinkwrapping.  */
+  gcc_assert (epilogue_completed);
+
   machine_mode mode = GET_MODE (mem);
   machine_mode wmode = (mode == DImode ? DImode : SImode);
   const enum memmodel model = memmodel_from_int (INTVAL (model_rtx));
diff --git a/gcc/config/aarch64/atomics.md b/gcc/config/aarch64/atomics.md
index 
c2bcabd0c3c2627b7222dcbc1af9c2e6b7ce6a76..996947799b5ef8445e9786b94e1ce62fd16e5b5c
 100644
--- a/gcc/config/aarch64/atomics.md
+++ b/gcc/config/aarch64/atomics.md
@@ -56,7 +56,7 @@ (define_insn_and_split "@aarch64_compare_and_swap"
(clobber (match_scratch:SI 7 "=&r"))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
 aarch64_split_compare_and_swap (operands);
@@ -80,7 +80,7 @@ (define_insn_and_split "@aarch64_compare_and_swap"
(clobber (match_scratch:SI 7 "=&r"))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
 aarch64_split_compare_and_swap (operands);
@@ -104,7 +104,7 @@ (define_insn_and_split "@aarch64_compare_and_swap"
(clobber (match_scratch:SI 7 "=&r"))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
 aarch64_split_compare_and_swap (operands);
@@ -223,7 +223,7 @@ (define_insn_and_split "aarch64_atomic_exchange"
(clobber (match_scratch:SI 4 "=&r"))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
 aarch64_split_atomic_op (SET, operands[0], NULL, operands[1],
@@ -344,7 +344,7 @@ (define_insn_and_split "aarch64_atomic_"
   (clobber (match_scratch:SI 4 "=&r"))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
 aarch64_split_atomic_op (, NULL, operands[3], operands[0],
@@ -400,7 +400,7 @@ (define_insn_and_split "atomic_nand"
(clobber (match_scratch:SI 4 "=&r"))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
  aarch64_split_atomic_op (NOT, NULL, operands[3], operands[0],
@@ -504,7 +504,7 @@ (define_insn_and_split 
"aarch64_atomic_fetch_"
(clobber (match_scratch:SI 5 "=&r"))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
 aarch64_split_atomic_op (, operands[0], operands[4], operands[1],
@@ -551,7 +551,7 @@ (define_insn_and_split "atomic_fetch_nand"
(clobber (match_scratch:SI 5 "=&r"))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
 aarch64_split_atomic_op (NOT, operands[0], operands[4], operands[1],
@@ -604,7 +604,7 @@ (define_insn_and_split 
"aarch64_atomic__fetch"
(clobber (match_scratch:SI 4 "=&r"))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
 aarch64_split_atomic_op (, NULL, operands[0], operands[1],
@@ -628,7 +628,7 @@ (define_insn_and_split "atomic_nand_fetch"
(clobber (match_scratch:SI 4 "=&r"))]
   ""
   "#"
-  "&& reload_completed"
+  "&& epilogue_completed"
   [(const_int 0)]
   {
 aarch64_split_atomic_op (NOT, NULL, operands[0], operands[1],


Re: [PATCH][AARCH64] Set jump-align=4 for neoversen1

2020-01-16 Thread Wilco Dijkstra
ping


Testing shows the setting of 32:16 for jump alignment has a significant codesize
cost; however, it doesn't make a difference in performance. So set jump-align
to 4 to get a 1.6% codesize improvement.

OK for commit?

ChangeLog
2019-12-24  Wilco Dijkstra  

* config/aarch64/aarch64.c (neoversen1_tunings): Set jump_align to 4.

--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
1646ed1d9a3de8ee2f0abff385a1ea145e234475..209ed8ebbe81104d9d8cff0df31946ab7704fb33
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1132,7 +1132,7 @@ static const struct tune_params neoversen1_tunings =
   3, /* issue_rate  */
   (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
   "32:16", /* function_align.  */
-  "32:16", /* jump_align.  */
+  "4", /* jump_align.  */
   "32:16", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */


Re: [PATCH][AARCH64] Enable compare branch fusion

2020-01-16 Thread Wilco Dijkstra
ping


Enable the most basic form of compare-branch fusion since various CPUs
support it. This has no measurable effect on cores which don't support
branch fusion, but increases fusion opportunities on cores which do.

Bootstrapped on AArch64, OK for commit?

ChangeLog:
2019-12-24  Wilco Dijkstra  

* config/aarch64/aarch64.c (generic_tunings): Add branch fusion.
(neoversen1_tunings): Likewise.

--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
a3b18b381e1748f8fe5e522bdec4f7c850821fe8..1c32a3543bec4031cc9b641973101829c77296b5
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -726,7 +726,7 @@ static const struct tune_params generic_tunings =
   SVE_NOT_IMPLEMENTED, /* sve_width  */
   4, /* memmov_cost  */
   2, /* issue_rate  */
-  (AARCH64_FUSE_AES_AESMC), /* fusible_ops  */
+  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
   "16:12", /* function_align.  */
   "4", /* jump_align.  */
   "8", /* loop_align.  */
@@ -1130,7 +1130,7 @@ static const struct tune_params neoversen1_tunings =
   SVE_NOT_IMPLEMENTED, /* sve_width  */
   4, /* memmov_cost  */
   3, /* issue_rate  */
-  AARCH64_FUSE_AES_AESMC, /* fusible_ops  */
+  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
   "32:16", /* function_align.  */
   "32:16", /* jump_align.  */
   "32:16", /* loop_align.  */




Re: [PATCH][Arm] Only enable fsched-pressure with Ofast

2020-01-16 Thread Wilco Dijkstra


ping

The current pressure scheduler doesn't appear to correctly track register
pressure and avoid creating unnecessary spills when register pressure is high.
As a result disabling the early scheduler improves integer performance
considerably and reduces codesize as a bonus. Since scheduling floating point
code is generally beneficial (more registers and higher latencies), only enable
the pressure scheduler with -Ofast.

On Cortex-A57 this gives a 0.7% performance gain on SPECINT2006 as well
as a 0.2% codesize reduction.

Bootstrapped on armhf. OK for commit?

ChangeLog:

2019-11-06  Wilco Dijkstra  

* gcc/common/config/arm-common.c (arm_option_optimization_table):
Enable fsched_pressure with Ofast only.

--
diff --git a/gcc/common/config/arm/arm-common.c 
b/gcc/common/config/arm/arm-common.c
index 
41a920f6dc96833e778faa8dbcc19beac483734c..b761d3abd670a144a593c4b410b1e7fbdcb52f56
 100644
--- a/gcc/common/config/arm/arm-common.c
+++ b/gcc/common/config/arm/arm-common.c
@@ -38,7 +38,7 @@ static const struct default_options 
arm_option_optimization_table[] =
   {
 /* Enable section anchors by default at -O1 or higher.  */
 { OPT_LEVELS_1_PLUS, OPT_fsection_anchors, NULL, 1 },
-{ OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
+{ OPT_LEVELS_FAST, OPT_fsched_pressure, NULL, 1 },
 { OPT_LEVELS_NONE, 0, NULL, 0 }
   };


Re: [PATCH][AARCH64] Enable compare branch fusion

2020-01-17 Thread Wilco Dijkstra
Hi Richard,

> If you're able to say for the record which cores you tested, then that'd
> be good.

I've mostly checked it on Cortex-A57 - if there is any effect, it would be on
older cores.

> OK, thanks.  I agree there doesn't seem to be an obvious reason why this
> would pessimise any cores significantly.  And it looked from a quick
> check like all AArch64 cores give these compares the lowest in-use
> latency (as expected).

Indeed.

> We can revisit this if anyone finds any counterexamples.

Yes - it's unlikely there are any though!

Cheers,
Wilco

>
> --
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 
> a3b18b381e1748f8fe5e522bdec4f7c850821fe8..1c32a3543bec4031cc9b641973101829c77296b5
>  100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -726,7 +726,7 @@ static const struct tune_params generic_tunings =
>    SVE_NOT_IMPLEMENTED, /* sve_width  */
>    4, /* memmov_cost  */
>    2, /* issue_rate  */
> -  (AARCH64_FUSE_AES_AESMC), /* fusible_ops  */
> +  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
>    "16:12",  /* function_align.  */
>    "4",  /* jump_align.  */
>    "8",  /* loop_align.  */
> @@ -1130,7 +1130,7 @@ static const struct tune_params neoversen1_tunings =
>    SVE_NOT_IMPLEMENTED, /* sve_width  */
>    4, /* memmov_cost  */
>    3, /* issue_rate  */
> -  AARCH64_FUSE_AES_AESMC, /* fusible_ops  */
> +  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
>    "32:16",  /* function_align.  */
>    "32:16",  /* jump_align.  */
>    "32:16",  /* loop_align.  */


Re: [PATCH][AARCH64] Set jump-align=4 for neoversen1

2020-01-17 Thread Wilco Dijkstra
Hi Kyrill & Richard,

> I was leaving this to others in case it was obvious to them.  On the
> basis that silence suggests it wasn't, :-) could you go into more details?
> Is it expected on first principles that jump alignment doesn't matter
> for Neoverse N1, or is this purely based on experimentation?  If it's

Jump alignment is set to 4 on almost all cores because higher values have
a major codesize cost and yet give no performance gains.

I suspect any core that set it higher has done so by accident rather than
having benchmarked the cost/benefit.

> expected, are we sure that the other "32:16" entries are still worthwhile?
> When you say it doesn't make a difference in performance, does that mean
> that no individual test's performance changed significantly, or just that
> the aggregate score didn't?  Did you experiment with anything inbetween
> the current 32:16 and 4, such as 32:8 or even 32:4?

I mean there is no difference above the noise floor for any test you throw at 
it.
I tried other alignments including 32:16, 32:12, 32:8 but all have a significant
cost and zero benefit.

Cheers,
Wilco

Re: [PATCH 3/4 GCC11] IVOPTs Consider cost_step on different forms during unrolling

2020-01-20 Thread Wilco Dijkstra
Hi Kewen,

Would it not make more sense to use the TARGET_ADDRESS_COST hook
to return different costs for immediate offset and register offset addressing,
and ensure IVOpts correctly takes this into account?

On AArch64 we've defined different costs for immediate offset, register offset,
register offset with extend, pre-increment and post-increment.

I don't see why this has been defined to always return 0 on rs6000...
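
As a rough sketch of the idea (hypothetical function and cost values, not the
actual rs6000 or AArch64 hook):

/* A TARGET_ADDRESS_COST implementation that distinguishes addressing forms
   so IVOpts can prefer the cheaper one.  */
static int
example_address_cost (rtx x, machine_mode mode ATTRIBUTE_UNUSED,
		      addr_space_t as ATTRIBUTE_UNUSED, bool speed ATTRIBUTE_UNUSED)
{
  if (GET_CODE (x) == PLUS && CONST_INT_P (XEXP (x, 1)))
    return 0;	/* reg + immediate offset: cheapest form.  */
  if (GET_CODE (x) == PLUS)
    return 1;	/* reg + reg, possibly with extend.  */
  if (GET_CODE (x) == PRE_MODIFY || GET_CODE (x) == POST_MODIFY)
    return 2;	/* writeback (pre/post-modify) forms.  */
  return 0;
}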

Wilco


Re: [PATCH][ARM] Correctly set SLOW_BYTE_ACCESS

2020-01-21 Thread Wilco Dijkstra
ping (updated comment to use the same wording as the AArch64 version on trunk)

Contrary to all documentation, SLOW_BYTE_ACCESS simply means accessing
bitfields by their declared type, which results in better code generation
on practically any target.  So set it correctly to 1 on Arm.

As a result we generate much better code for bitfields:

typedef struct
{
  int x : 2, y : 8, z : 2;
} X;

int bitfield (X *p)
{
  return p->x + p->y + p->z;
}


Before:
	ldrb	r3, [r0]	@ zero_extendqisi2
	ldrh	r2, [r0]
	ldrb	r0, [r0, #1]	@ zero_extendqisi2
	sbfx	r3, r3, #0, #2
	sbfx	r2, r2, #2, #8
	sbfx	r0, r0, #2, #2
	sxtab	r3, r2, r3
	sxtab	r0, r3, r0
	bx	lr

After:
	ldr	r0, [r0]
	sbfx	r3, r0, #0, #2
	sbfx	r2, r0, #2, #8
	sbfx	r0, r0, #10, #2
	sxtab	r3, r2, r3
	add	r0, r0, r3
	bx	lr

Bootstrap OK, OK for commit?

ChangeLog:
2019-09-11  Wilco Dijkstra  

* config/arm/arm.h (SLOW_BYTE_ACCESS): Set to 1.

--
diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 
e07cf03538c5bb23e3285859b9e44a627b6e9ced..998139ce759d5829b7f868367d4263df9d0e12d9
 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -1956,8 +1956,8 @@ enum arm_auto_incmodes
((arm_arch4 || (MODE) == QImode) ? ZERO_EXTEND  \
 : ((BYTES_BIG_ENDIAN && (MODE) == HImode) ? SIGN_EXTEND : UNKNOWN)))
 
-/* Nonzero if access to memory by bytes is slow and undesirable.  */
-#define SLOW_BYTE_ACCESS 0
+/* Enable wide bitfield accesses for more efficient bitfield code.  */
+#define SLOW_BYTE_ACCESS 1
 
 /* Immediate shift counts are truncated by the output routines (or was it
the assembler?).  Shift counts in a register are truncated by ARM.  Note



Re: [PATCH v2][ARM] Disable code hoisting with -O3 (PR80155)

2020-01-21 Thread Wilco Dijkstra
ping

Hi,

While code hoisting generally improves codesize, it can affect performance
negatively. Benchmarking shows it doesn't help SPEC and negatively affects
embedded benchmarks. Since the impact is relatively small with -O2 and mainly
affects -O3, the simplest option is to disable code hoisting for -O3 and higher.

OK for commit?

ChangeLog:
2019-11-26  Wilco Dijkstra  

PR tree-optimization/80155
* common/config/arm/arm-common.c (arm_option_optimization_table):
Disable -fcode-hoisting with -O3.
--

diff --git a/gcc/common/config/arm/arm-common.c 
b/gcc/common/config/arm/arm-common.c
index 
b761d3abd670a144a593c4b410b1e7fbdcb52f56..3e11f21b7dd76cc071b645c32a6fdb4a92511279
 100644
--- a/gcc/common/config/arm/arm-common.c
+++ b/gcc/common/config/arm/arm-common.c
@@ -39,6 +39,8 @@ static const struct default_options 
arm_option_optimization_table[] =
 /* Enable section anchors by default at -O1 or higher.  */
 { OPT_LEVELS_1_PLUS, OPT_fsection_anchors, NULL, 1 },
 { OPT_LEVELS_FAST, OPT_fsched_pressure, NULL, 1 },
+/* Disable code hoisting with -O3 or higher.  */
+{ OPT_LEVELS_3_PLUS, OPT_fcode_hoisting, NULL, 0 },
 { OPT_LEVELS_NONE, 0, NULL, 0 }
   };


Re: [PATCH][AArch64] Fix shrinkwrapping interactions with atomics (PR92692)

2020-01-27 Thread Wilco Dijkstra
Hi Segher,

> On Thu, Jan 16, 2020 at 12:50:14PM +0000, Wilco Dijkstra wrote:
>> The separate shrinkwrapping pass may insert stores in the middle
>> of atomics loops which can cause issues on some implementations.
>> Avoid this by delaying splitting of atomic patterns until after
>> prolog/epilog generation.
>
> Note that this isn't specific to sws at all: there isn't anything
> stopping later passes from doing this either.  Is there anything that
> protects us from sched2 doing similar here, for example?

The expansions create extra basic blocks and insert various barriers
that would stop any reasonable scheduler from doing it. And the
current scheduler is basic block based.

Wilco

[PATCH][AArch64] Improve popcount expansion

2020-02-03 Thread Wilco Dijkstra
The popcount expansion uses umov to extend the result and move it back
to the integer register file.  If we model ADDV as a zero-extending
operation, fmov can be used to move back to the integer side. This
results in a ~0.5% speedup on deepsjeng on Cortex-A57.

A typical __builtin_popcount expansion is now:

	fmov	s0, w0
	cnt	v0.8b, v0.8b
	addv	b0, v0.8b
	fmov	w0, s0

Bootstrap OK, passes regress.

ChangeLog
2020-02-02  Wilco Dijkstra  

gcc/
* config/aarch64/aarch64.md (popcount2): Improve expansion.
* config/aarch64/aarch64-simd.md
(aarch64_zero_extend_reduc_plus_): New pattern.
* config/aarch64/iterators.md (VDQV_E): New iterator.
testsuite/
* gcc.target/aarch64/popcnt2.c: New test.

--
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
97f46f96968a6bc2f93bbc812931537b819b3b19..34765ff43c1a090a31e2aed64ce95510317ab8c3
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -2460,6 +2460,17 @@ (define_insn "aarch64_reduc_plus_internal"
   [(set_attr "type" "neon_reduc_add")]
 )
 
+;; ADDV with result zero-extended to SI/DImode (for popcount).
+(define_insn "aarch64_zero_extend_reduc_plus_"
+ [(set (match_operand:GPI 0 "register_operand" "=w")
+   (zero_extend:GPI
+   (unspec: [(match_operand:VDQV_E 1 "register_operand" "w")]
+UNSPEC_ADDV)))]
+ "TARGET_SIMD"
+ "add\\t%0, %1."
+  [(set_attr "type" "neon_reduc_add")]
+)
+
 (define_insn "aarch64_reduc_plus_internalv2si"
  [(set (match_operand:V2SI 0 "register_operand" "=w")
(unspec:V2SI [(match_operand:V2SI 1 "register_operand" "w")]
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
86c2cdfc7973f4b964ba233cfbbe369b24e0ac10..5edc76ee14b55b2b4323530e10bd22b3ffca483e
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4829,7 +4829,6 @@ (define_expand "popcount2"
 {
   rtx v = gen_reg_rtx (V8QImode);
   rtx v1 = gen_reg_rtx (V8QImode);
-  rtx r = gen_reg_rtx (QImode);
   rtx in = operands[1];
   rtx out = operands[0];
   if(mode == SImode)
@@ -4843,8 +4842,7 @@ (define_expand "popcount2"
 }
   emit_move_insn (v, gen_lowpart (V8QImode, in));
   emit_insn (gen_popcountv8qi2 (v1, v));
-  emit_insn (gen_reduc_plus_scal_v8qi (r, v1));
-  emit_insn (gen_zero_extendqi2 (out, r));
+  emit_insn (gen_aarch64_zero_extend_reduc_plus_v8qi (out, v1));
   DONE;
 })
 
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 
fc973086cb91ae0dc54eeeb0b832d522539d7982..926779bf2442fa60d184ef17308f91996d6e8d1b
 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -208,6 +208,9 @@ (define_mode_iterator VDQV [V8QI V16QI V4HI V8HI V4SI V2DI])
 ;; Advanced SIMD modes (except V2DI) for Integer reduction across lanes.
 (define_mode_iterator VDQV_S [V8QI V16QI V4HI V8HI V4SI])
 
+;; Advanced SIMD modes for Integer reduction across lanes (zero/sign extended).
+(define_mode_iterator VDQV_E [V8QI V16QI V4HI V8HI])
+
 ;; All double integer narrow-able modes.
 (define_mode_iterator VDN [V4HI V2SI DI])
 
diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt2.c 
b/gcc/testsuite/gcc.target/aarch64/popcnt2.c
new file mode 100644
index 
..e321858afa4d6ecb6fc7348f39f6e5c6c0c46147
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/popcnt2.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+unsigned
+foo (int x)
+{
+  return __builtin_popcount (x);
+}
+
+unsigned long
+foo1 (int x)
+{
+  return __builtin_popcount (x);
+}
+
+/* { dg-final { scan-assembler-not {popcount} } } */
+/* { dg-final { scan-assembler-times {cnt\t} 2 } } */
+/* { dg-final { scan-assembler-times {fmov} 4 } } */
+/* { dg-final { scan-assembler-not {umov} } } */
+/* { dg-final { scan-assembler-not {uxtw} } } */
+/* { dg-final { scan-assembler-not {sxtw} } } */



[PATCH][AArch64] Improve clz patterns

2020-02-04 Thread Wilco Dijkstra
Although GCC should understand the limited range of clz/ctz/cls results,
Combine sometimes behaves oddly and duplicates ctz to remove a
sign extension.  Avoid this by adding an explicit AND with 127 in the
patterns. Deepsjeng performance improves by ~0.6%.

Bootstrap OK.

ChangeLog:
2020-02-03  Wilco Dijkstra  

* config/aarch64/aarch64.md (clz2): Mask the clz result.
(clrsb2): Likewise.
(ctz2): Likewise.
--

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
5edc76ee14b55b2b4323530e10bd22b3ffca483e..7ff0536aac42957dbb7a15be766d35cc6725ac40
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4794,7 +4794,8 @@ (define_insn 
"*and_one_cmpl_3_compare0_no_reuse"
 
 (define_insn "clz2"
   [(set (match_operand:GPI 0 "register_operand" "=r")
-   (clz:GPI (match_operand:GPI 1 "register_operand" "r")))]
+   (and:GPI (clz:GPI (match_operand:GPI 1 "register_operand" "r"))
+(const_int 127)))]
   ""
   "clz\\t%0, %1"
   [(set_attr "type" "clz")]
@@ -4848,7 +4849,8 @@ (define_expand "popcount2"
 
 (define_insn "clrsb2"
   [(set (match_operand:GPI 0 "register_operand" "=r")
-(clrsb:GPI (match_operand:GPI 1 "register_operand" "r")))]
+   (and:GPI (clrsb:GPI (match_operand:GPI 1 "register_operand" "r"))
+(const_int 127)))]
   ""
   "cls\\t%0, %1"
   [(set_attr "type" "clz")]
@@ -4869,7 +4871,8 @@ (define_insn "rbit2"
 
 (define_insn_and_split "ctz2"
  [(set (match_operand:GPI   0 "register_operand" "=r")
-   (ctz:GPI (match_operand:GPI  1 "register_operand" "r")))]
+   (and:GPI (ctz:GPI (match_operand:GPI  1 "register_operand" "r"))
+   (const_int 127)))]
   ""
   "#"
   "reload_completed"




Re: [PATCH][AArch64] Improve popcount expansion

2020-02-04 Thread Wilco Dijkstra
Hi Andrew,

> You might want to add a testcase that the autovectorizers too.
>
> Currently we get also:
>
>    ldr q0, [x0]
>    addv    b0, v0.16b
>    umov    w0, v0.b[0]
>    ret

My patch doesn't change this case on purpose - there are also many intrinsics 
which generate redundant umovs. That's for a separate patch.

Wilco


Re: [PATCH][ARM] Correctly set SLOW_BYTE_ACCESS

2020-02-04 Thread Wilco Dijkstra
ping (updated comment to use the same wording as the AArch64 version on trunk)

Contrary to all documentation, SLOW_BYTE_ACCESS simply means accessing
bitfields by their declared type, which results in better code generation
on practically any target.  So set it correctly to 1 on Arm.

As a result we generate much better code for bitfields:

typedef struct
{
  int x : 2, y : 8, z : 2;
} X;

int bitfield (X *p)
{
  return p->x + p->y + p->z;
}


Before:
	ldrb	r3, [r0]	@ zero_extendqisi2
	ldrh	r2, [r0]
	ldrb	r0, [r0, #1]	@ zero_extendqisi2
	sbfx	r3, r3, #0, #2
	sbfx	r2, r2, #2, #8
	sbfx	r0, r0, #2, #2
	sxtab	r3, r2, r3
	sxtab	r0, r3, r0
	bx	lr

After:
	ldr	r0, [r0]
	sbfx	r3, r0, #0, #2
	sbfx	r2, r0, #2, #8
	sbfx	r0, r0, #10, #2
	sxtab	r3, r2, r3
	add	r0, r0, r3
	bx	lr

Bootstrap OK, OK for commit?

ChangeLog:
2019-09-11  Wilco Dijkstra  

* config/arm/arm.h (SLOW_BYTE_ACCESS): Set to 1.

--
diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 
e07cf03538c5bb23e3285859b9e44a627b6e9ced..998139ce759d5829b7f868367d4263df9d0e12d9
 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -1956,8 +1956,8 @@ enum arm_auto_incmodes
((arm_arch4 || (MODE) == QImode) ? ZERO_EXTEND  \
 : ((BYTES_BIG_ENDIAN && (MODE) == HImode) ? SIGN_EXTEND : UNKNOWN)))
 
-/* Nonzero if access to memory by bytes is slow and undesirable.  */
-#define SLOW_BYTE_ACCESS 0
+/* Enable wide bitfield accesses for more efficient bitfield code.  */
+#define SLOW_BYTE_ACCESS 1
 
 /* Immediate shift counts are truncated by the output routines (or was it
the assembler?).  Shift counts in a register are truncated by ARM.  Note


Re: [PATCH][Arm] Only enable fsched-pressure with Ofast

2020-02-04 Thread Wilco Dijkstra
ping

The current pressure scheduler doesn't appear to correctly track register
pressure and avoid creating unnecessary spills when register pressure is high.
As a result disabling the early scheduler improves integer performance
considerably and reduces codesize as a bonus. Since scheduling floating point
code is generally beneficial (more registers and higher latencies), only enable
the pressure scheduler with -Ofast.

On Cortex-A57 this gives a 0.7% performance gain on SPECINT2006 as well
as a 0.2% codesize reduction.

Bootstrapped on armhf. OK for commit?

ChangeLog:

2019-11-06  Wilco Dijkstra  

* gcc/common/config/arm-common.c (arm_option_optimization_table):
Enable fsched_pressure with Ofast only.

--
diff --git a/gcc/common/config/arm/arm-common.c 
b/gcc/common/config/arm/arm-common.c
index 
41a920f6dc96833e778faa8dbcc19beac483734c..b761d3abd670a144a593c4b410b1e7fbdcb52f56
 100644
--- a/gcc/common/config/arm/arm-common.c
+++ b/gcc/common/config/arm/arm-common.c
@@ -38,7 +38,7 @@ static const struct default_options 
arm_option_optimization_table[] =
   {
 /* Enable section anchors by default at -O1 or higher.  */
 { OPT_LEVELS_1_PLUS, OPT_fsection_anchors, NULL, 1 },
-{ OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
+{ OPT_LEVELS_FAST, OPT_fsched_pressure, NULL, 1 },
 { OPT_LEVELS_NONE, 0, NULL, 0 }
   };


Re: [PATCH][ARM] Improve max_cond_insns setting for Cortex cores

2020-02-04 Thread Wilco Dijkstra
Hi Kyrill,

> Hmm, I'm not too confident on that. I'd support such a change for the 
> generic arm_cortex_tune, definitely, and the Armv8-a based ones, but I 
> don't think the argument is as strong for Cortex-A7, Cortex-A8, Cortex-A9.
>
> So let's make the change for the Armv8-A-based cores now. If you get 
> benchmarking data for the older ones (such systems may or may not be 
> easy to get a hold of) we can update those separately.

I ran some experiments on Cortex-A53 and this shows the difference between
2, 3 and 4 is less than for out-of-order cores (which clearly prefer 2).
So it seems alright to set it to 4 for the older in-order cores - see updated 
patch
below.

>>   Set max_cond_insns
>> to 4 on Thumb-2 architectures given it's already limited to that by
>> MAX_INSN_PER_IT_BLOCK.  Also use the CPU tuning setting when a CPU/tune
>> is selected if -mrestrict-it is not explicitly set.
>
> This can go in as a separate patch from the rest, thanks.

Sure, I'll split that part off into a separate patch.

Cheers,
Wilco

[PATCH v2][ARM] Improve max_cond_insns setting for Cortex cores

Various CPUs have max_cond_insns set to 5 due to historical reasons.
Benchmarking shows that max_cond_insns=2 is fastest on modern Cortex-A
cores, so change it to 2. Set it to 4 on older in-order cores as that is
the MAX_INSN_PER_IT_BLOCK limit for Thumb-2.
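
As a hedged illustration of the kind of code this limit gates (not taken from
the benchmarks above): whether both arms below get if-converted into
predicated instructions (an IT block on Thumb-2, conditional execution on
Arm) depends on each arm staying within max_cond_insns.

int cond_arms (int c, int a, int b)
{
  if (c)
    a = a * 3 + b;
  else
    a = (a >> 2) - b;
  return a;
}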

Bootstrapped on armhf. OK for commit?

ChangeLog:

2019-12-03  Wilco Dijkstra  

* config/arm/arm.c (arm_v6t2_tune): Set max_cond_insns to 4.
(arm_cortex_tune): Set max_cond_insns to 2.
(arm_cortex_a8_tune): Set max_cond_insns to 4.
(arm_cortex_a7_tune): Likewise.
(arm_cortex_a35_tune): Set max_cond_insns to 2.
(arm_cortex_a53_tune): Likewise.
(arm_cortex_a5_tune): Set max_cond_insns to 4.
(arm_cortex_a9_tune): Likewise.
(arm_v6m_tune): Likewise.
--

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
a6b401b7f2e3738ff68316bd83d6e5a2bcf0e7d7..daebe76352d62ad94556762b4e3bc3d0532ad411
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -1947,7 +1947,7 @@ const struct tune_params arm_v6t2_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  4,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   1,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -1971,7 +1971,7 @@ const struct tune_params arm_cortex_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  2,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   2,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -1993,7 +1993,7 @@ const struct tune_params arm_cortex_a8_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  4,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   2,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -2015,7 +2015,7 @@ const struct tune_params arm_cortex_a7_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  4,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   2,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -2059,7 +2059,7 @@ const struct tune_params arm_cortex_a35_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  2,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   1,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -2081,7 +2081,7 @@ const struct tune_params arm_cortex_a53_tune =
   arm_default_branch_cost,
   &arm_default_vec_cost,
   1,   /* Constant limit.  */
-  5,

Re: [PATCH][ARM] Remove support for MULS

2020-02-04 Thread Wilco Dijkstra
Any further comments? Note GCC doesn't support S/UMULLS either since it is 
equally
useless. It's no surprise that Thumb-2 removed support for flag-setting 64-bit 
multiplies,
while AArch64 didn't add flag-setting multiplies. So there is no argument that 
these
instructions are in any way useful to compilers.

Wilco

Hi Richard, Kyrill,

>> I disagree. If they still trigger and generate better code than without 
>> we should keep them.
> 
>> What kind of code is *common* varies greatly from user to user.

Not really - doing a multiply and checking whether the result is zero is
exceedingly rare. I found only 3 cases out of 7300 mul/mla in all of
SPEC2006... Overall codesize effect with -Os: 28 bytes or 0.00045%.

So we really should not even consider wasting any more time on
maintaining such useless patterns.

> Also, the main reason for restricting their use was that in the 'olden 
> days', when we had multi-cycle implementations of the multiply 
> instructions with short-circuit fast termination when the result was 
> completed, the flag setting variants would never short-circuit.

That only applied to conditional multiplies IIRC, some implementations
would not early-terminate if the condition failed. Today there are serious
penalties for conditional multiplies - but that's something to address in a
different patch.

> These days we have fixed cycle counts for multiply instructions, so this 
> is no-longer a penalty.  

No, there is a large overhead on modern cores when you set the flags,
and there are other penalties due to the extra micro-ops.

> In the thumb2 case in particular we can often 
> reduce mul-cmp (6 bytes) to muls (2 bytes), that's a 66% saving on this 
> sequence and definitely worth exploiting when we can, even if it's not 
> all that common.

Using muls+cbz is equally small. With my patch we generate this with -Os:

void g(void);
int f(int x)
{
  if (x * x != 0)
g();
}

f:
	muls	r0, r0, r0
	push	{r3, lr}
	cbz	r0, .L9
	bl	g
.L9:
	pop	{r3, pc}

Wilco

Re: [PATCH][AArch64] Improve clz patterns

2020-02-04 Thread Wilco Dijkstra
Hi Richard,

> Could you go into more detail about what the before and after code
> looks like, and what combine is doing?  Like you say, this sounds
> like a target-independent thing on face value.

It is indeed, but it seems specific to instructions where we have range
information which allows it to remove a redundant sign-extend.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 for the full details.

> Either way, something like this needs a testcase.

Sure I've added the testcase from pr93565, see below.

Cheers,
Wilco


Although GCC should understand the limited range of clz/ctz/cls results,
Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary
sign extension.  Avoid this by adding an explicit AND with 127 in the
patterns. Deepsjeng performance improves by ~0.6%.

Bootstrap OK.

ChangeLog:
2020-02-04  Wilco Dijkstra  

PR rtl-optimization/93565
* config/aarch64/aarch64.md (clz2): Mask the clz result.
(clrsb2): Likewise.
(ctz2): Likewise.

* gcc.target/aarch64/pr93565.c: New test.
--
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
5edc76ee14b55b2b4323530e10bd22b3ffca483e..7ff0536aac42957dbb7a15be766d35cc6725ac40
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4794,7 +4794,8 @@ (define_insn 
"*and_one_cmpl_3_compare0_no_reuse"
 
 (define_insn "clz2"
   [(set (match_operand:GPI 0 "register_operand" "=r")
-   (clz:GPI (match_operand:GPI 1 "register_operand" "r")))]
+   (and:GPI (clz:GPI (match_operand:GPI 1 "register_operand" "r"))
+(const_int 127)))]
   ""
   "clz\\t%0, %1"
   [(set_attr "type" "clz")]
@@ -4848,7 +4849,8 @@ (define_expand "popcount2"
 
 (define_insn "clrsb2"
   [(set (match_operand:GPI 0 "register_operand" "=r")
-(clrsb:GPI (match_operand:GPI 1 "register_operand" "r")))]
+   (and:GPI (clrsb:GPI (match_operand:GPI 1 "register_operand" "r"))
+(const_int 127)))]
   ""
   "cls\\t%0, %1"
   [(set_attr "type" "clz")]
@@ -4869,7 +4871,8 @@ (define_insn "rbit2"
 
 (define_insn_and_split "ctz2"
  [(set (match_operand:GPI   0 "register_operand" "=r")
-   (ctz:GPI (match_operand:GPI  1 "register_operand" "r")))]
+   (and:GPI (ctz:GPI (match_operand:GPI  1 "register_operand" "r"))
+   (const_int 127)))]
   ""
   "#"
   "reload_completed"
diff --git a/gcc/testsuite/gcc.target/aarch64/pr93565.c 
b/gcc/testsuite/gcc.target/aarch64/pr93565.c
new file mode 100644
index 
..7200f80d1bb161f6a058cc6591f61b6b75cf1749
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr93565.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+static const unsigned long long magic = 0x03f08c5392f756cdULL;
+
+static const char table[64] = {
+ 0,  1, 12,  2, 13, 22, 17,  3,
+14, 33, 23, 36, 18, 58, 28,  4,
+62, 15, 34, 26, 24, 48, 50, 37,
+19, 55, 59, 52, 29, 44, 39,  5,
+63, 11, 21, 16, 32, 35, 57, 27,
+61, 25, 47, 49, 54, 51, 43, 38,
+10, 20, 31, 56, 60, 46, 53, 42,
+ 9, 30, 45, 41,  8, 40,  7,  6,
+};
+
+static inline int ctz1 (unsigned long  b)
+{
+  unsigned long lsb = b & -b;
+  return table[(lsb * magic) >> 58];
+}
+
+void f (unsigned long x, int *p)
+{
+  if (x != 0)
+{
+  int a = ctz1 (x);
+  *p = a | p[a];
+}
+}
+
+/* { dg-final { scan-assembler-times "rbit\t" 1 } } */
+/* { dg-final { scan-assembler-times "clz\t" 1 } } */
+



Re: [PATCH][AARCH64] Fix for PR86901

2020-02-05 Thread Wilco Dijkstra
Hi Modi,

Thanks for your patch! 

> Adding support for extv and extzv on aarch64 as described in 
> PR86901. I also changed
> extract_bit_field_using_extv to use gen_lowpart_if_possible instead of 
> gen_lowpart directly. Using
> gen_lowpart directly will fail with an ICE in building libgcc when the 
> compiler fails to successfully do so 
> whereas gen_lowpart_if_possible will bail out of matching this pattern 
> gracefully.

I did a quick bootstrap, this shows several failures like:

gcc/builtins.c:9427:1: error: unrecognizable insn:
 9427 | }
  | ^
(insn 212 211 213 24 (set (reg:SI 207)
(zero_extract:SI (reg:SI 206)
(const_int 26 [0x1a])
(const_int 6 [0x6]))) "/gcc/builtins.c":9413:44 -1
 (nil))

The issue here is that 26+6 = 32 and that's not a valid ubfx encoding. Currently
cases like this are split into a right shift in aarch64.md around line 5569:

;; When the bit position and width add up to 32 we can use a W-reg LSR
;; instruction taking advantage of the implicit zero-extension of the X-reg.
(define_split
  [(set (match_operand:DI 0 "register_operand")
(zero_extract:DI (match_operand:DI 1 "register_operand")
 (match_operand 2
   "aarch64_simd_shift_imm_offset_di")
 (match_operand 3
   "aarch64_simd_shift_imm_di")))]
  "IN_RANGE (INTVAL (operands[2]) + INTVAL (operands[3]), 1,
 GET_MODE_BITSIZE (DImode) - 1)
   && (INTVAL (operands[2]) + INTVAL (operands[3]))
   == GET_MODE_BITSIZE (SImode)"
  [(set (match_dup 0)
(zero_extend:DI (lshiftrt:SI (match_dup 4) (match_dup 3]
  {
operands[4] = gen_lowpart (SImode, operands[1]);
  }

However that only supports DImode, not SImode, so it needs to be changed to
be more general using GPI.

Your new extv patterns should replace the magic patterns above it:

;; ---
;; Bitfields
;; ---

(define_expand ""

These are the current extv/extzv patterns, but without a mode. They are no 
longer
used when we start using the new ones.

Note you can write  to combine the extzv and extv patterns.
But please add a comment mentioning the pattern names so they are easy to find!

Besides a bootstrap it is always useful to compile a large body of code with 
your change
(say SPEC2006/2017) and check for differences in at least codesize. If there is 
an increase
in instruction count then there may be more issues that need to be resolved.

> I'm looking through past mails and https://gcc.gnu.org/contribute.html which 
> details testing bootstrap.
> I'm building a cross-compiler (x64_aarch64) and the instructions don't 
> address that scenario. The GCC 
> cross-build is green and there's no regressions on the C/C++ tests (The 
> go/fortran etc. look like they 
> need additional infrastructure built on my side to work). Is there a workflow 
> for cross-builds or should I 
> aim to get an aarch64 native machine for full validation?

I find it easiest to develop on a many-core AArch64 server so you get much 
faster builds,
bootstraps and regression tests. Cross compilers are mostly useful if you want 
to test big-endian
or new architecture features which are not yet supported in hardware. You don't 
normally need
to test Go/Fortran/ADA etc unless your patch does something that would directly 
affect them.

Finally do you have a copyright assignment with the FSF?

Cheers,
Wilco

Re: [PATCH][AARCH64] Fix for PR86901

2020-02-07 Thread Wilco Dijkstra
Hi,

Richard wrote:
> However, inside the compiler we really want to represent this as a 
>shift.
...
> Ideally this would be handled inside the mid-end expansion of an 
> extract, but in the absence of that I think this is best done inside the 
> extv expansion so that we never end up with a real extract in that case.

Yes the mid-end could be improved - it turns out it is due to expansion of
bitfields: all variations of (x & mask) >> N are optimized into shifts early on.

However it turns out Combine can already transform these zero/sign_extends
into shifts, so we do end up with good code. With the latest patch I get:

typedef struct { int x : 6, y : 6, z : 20; } X;

int f (int x, X *p) { return x + p->z; }

ldr w1, [x1]
add w0, w0, w1, asr 12
ret

So this case looks alright.

> Sounds good. I'll get those setup and running and will report back on 
> findings. What's
> the preferred way to measure codesize? I'm assuming by default the code pages 
> are 
> aligned so smaller differences would need to trip over the boundary to 
> actually show up.

You can use the size command on the binaries:

>size /bin/ls
   text    data     bss     dec     hex filename
 107271    2024    3472  112767   1b87f /bin/ls

As you can see it shows the text size in bytes. It is not rounded up to a page, 
so it is an 
accurate measure of the codesize. Generally -O2 size is most useful to check 
(since that
is what most applications build with), but -Ofast -flto can be useful as well 
(the global 
inlining means you get instruction combinations which appear less often with 
-O2).

Cheers,
Wilco

Re: [PATCH][AArch64] Improve clz patterns

2020-02-12 Thread Wilco Dijkstra
Hi Richard,

Right, so this is an alternative approach using costs - Combine won't try to
duplicate instructions if it increases costs, so increasing the ctz cost to 2
instructions (which is the correct cost for ctz anyway) ensures we still get
efficient code for this example:

[AArch64] Set ctz rtx_cost (PR93565)

Although GCC should understand the limited range of clz/ctz/cls results,
Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary
sign extension.  Avoid this by setting the cost for ctz to be higher than
that of a simple ALU instruction.  Deepsjeng performance improves by ~0.6%.

Bootstrap OK.

ChangeLog:
2020-02-12  Wilco Dijkstra  

PR rtl-optimization/93565
* config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs.

* gcc.target/aarch64/pr93565.c: New test.

--

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
e40750380cce202473da3cf572ebdbc28a4ecc06..7426629d6c973c06640f75d3de53a2815ff40f1b
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11459,6 +11459,13 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer 
ATTRIBUTE_UNUSED,
 
   return false;
 
+case CTZ:
+  *cost = COSTS_N_INSNS (2);
+
+  if (speed)
+   *cost += extra_cost->alu.clz + extra_cost->alu.rev;
+  return false;
+
 case COMPARE:
   op0 = XEXP (x, 0);
   op1 = XEXP (x, 1);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr93565.c 
b/gcc/testsuite/gcc.target/aarch64/pr93565.c
new file mode 100644
index 
..7200f80d1bb161f6a058cc6591f61b6b75cf1749
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr93565.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+static const unsigned long long magic = 0x03f08c5392f756cdULL;
+
+static const char table[64] = {
+ 0,  1, 12,  2, 13, 22, 17,  3,
+14, 33, 23, 36, 18, 58, 28,  4,
+62, 15, 34, 26, 24, 48, 50, 37,
+19, 55, 59, 52, 29, 44, 39,  5,
+63, 11, 21, 16, 32, 35, 57, 27,
+61, 25, 47, 49, 54, 51, 43, 38,
+10, 20, 31, 56, 60, 46, 53, 42,
+ 9, 30, 45, 41,  8, 40,  7,  6,
+};
+
+static inline int ctz1 (unsigned long  b)
+{
+  unsigned long lsb = b & -b;
+  return table[(lsb * magic) >> 58];
+}
+
+void f (unsigned long x, int *p)
+{
+  if (x != 0)
+{
+  int a = ctz1 (x);
+  *p = a | p[a];
+}
+}
+
+/* { dg-final { scan-assembler-times "rbit\t" 1 } } */
+/* { dg-final { scan-assembler-times "clz\t" 1 } } */
+




Re: [PATCH][AArch64] Improve clz patterns

2020-02-12 Thread Wilco Dijkstra
Hi Richard,

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565#c8 - the problem is
more generic like I suspected and it's easy to create similar examples. So while
this turned out to be an easy workaround for ctz, the general case is harder
to avoid since you still want to allow beneficial multi-use cases (such as
merging a shift into 2 ALU instructions).
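
For example (a hedged illustration, not taken from the PR): duplicating the
shift below is a win because each copy merges into an add as a shifted
operand, giving two add-with-lsl instructions instead of a separate shift
plus two adds.

int two_uses (int a, int b, int x)
{
  int t = x << 2;
  return (a + t) | (b + t);
}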

Cheers,
Wilco



Re: [PATCH][AArch64] Improve clz patterns

2020-02-12 Thread Wilco Dijkstra
Hi Andrew,

> Yes I agree a better cost model for CTZ/CLZ is the right solution but
> I disagree with 2 ALU instruction as the cost.  It should either be
> the same cost as a multiply or have its own cost entry.
> For an example on OcteonTX (and ThunderX1), the cost of CLS/CLZ is 4
> cycles, the same as the cost as a multiple; on OcteonTX2 it is 5
> cycles (again the same cost as a multiple).

+  if (speed)
+  *cost += extra_cost->alu.clz + extra_cost->alu.rev;
+  return false;

So if the costs of clz and ctz are similar enough, this will use the defined
per-CPU costs.

Cheers,
Wilco

Re: [PATCH] PR85678: Change default to -fno-common

2019-12-04 Thread Wilco Dijkstra
Hi Jeff,

>> I've noticed quite significant package failures caused by the revision.
>> Would you please consider documenting this change in porting_to.html
>> (and in changes.html) for GCC 10 release?
>
> I'm not in the office right now, but figured I'd chime in.  I'd estimate
> 400-500 packages are failing in Fedora because of this change.  I'll
> have a hard number Monday.
>
> It's significant enough that I'm not sure how we're going to get them
> all fixed.

So what normally happens with the numerous new warnings/errors in GCC
releases? I suppose that could cause package failures too. Would it be feasible
to override the options for any failing packages?

Cheers,
Wilco



[wwwdocs] Document -fcommon default change

2019-12-05 Thread Wilco Dijkstra
Hi,

Add entries for the default change in changes.html and porting_to.html.
Passes the W3 validator.

Cheers,
Wilco

---

diff --git a/htdocs/gcc-10/changes.html b/htdocs/gcc-10/changes.html
index 
e02966460450b7aad884b2d45190b9ecd8c7a5d8..304e1e8ccd38795104156e86b92062696fa5aa8b
 100644
--- a/htdocs/gcc-10/changes.html
+++ b/htdocs/gcc-10/changes.html
@@ -102,6 +102,11 @@ a work-in-progress.
 In C2X mode, -fno-fp-int-builtin-inexact is
 enabled by default.
   
+
+  GCC now defaults to -fno-common.  In C, global variables 
with
+  multiple tentative definitions will result in linker errors.
+  Global variable accesses are also more efficient on various targets.
+  
 
 
 C++
diff --git a/htdocs/gcc-10/porting_to.html b/htdocs/gcc-10/porting_to.html
index 
3256e8a35d00ce1352c169a1c6df6d8f120889ee..e2c7e226a83b7720fe6ed40061cdddbc27659664
 100644
--- a/htdocs/gcc-10/porting_to.html
+++ b/htdocs/gcc-10/porting_to.html
@@ -29,9 +29,25 @@ and provide solutions. Let us know if you have suggestions 
for improvements!
 Preprocessor issues
 -->
 
-
+
+Default to -fno-common
+
+
+  A common mistake in C is omitting extern when declaring a global
+  variable in a header file.  If the header is included by several files it
+  results in multiple definitions of the same variable.  In previous GCC
+  versions this error is ignored.  GCC 10 defaults to -fno-common,
+  which means a linker error will now be reported.
+  To fix this, use extern in header files when declaring global
+  variables, and ensure each global is defined in exactly one C file.
+  As a workaround, legacy C code can be compiled with -fcommon.
+
+  
+  int x;  // tentative definition - avoid in header files
+
+  extern int y;  // correct declaration in a header file
+  
 
 Fortran language issues


Re: [PATCH] PR85678: Change default to -fno-common

2019-12-05 Thread Wilco Dijkstra
Hi,

I have updated the documentation patch here and added relevant maintainers
so hopefully this can go in soon: 
https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00311.html

I moved the paragraph in changes.html to the C section like you suggested. Would
it make sense to link to the porting_to entry?

Cheers,
Wilco





Re: [PATCH v2 2/2][ARM] Improve max_cond_insns setting for Cortex cores

2019-12-06 Thread Wilco Dijkstra
Hi Christophe,

> This patch (r278968) is causing regressions when building GCC
> --target arm-none-linux-gnueabihf
> --with-mode thumb
> --with-cpu cortex-a57
> --with-fpu crypto-neon-fp-armv8
> because the assembler (gas version 2.33.1) complains:
> /ccc7z5eW.s:4267: IT blocks containing more than one conditional
> instruction are performance deprecated in ARMv8-A and ARMv8-R
>
> I guess that's related to what you say about -mrestrict-it ?

Yes it looks like that unnecessary warning hasn't been silenced in latest 
binutils,
but it should be easy to turn off.

Cheers,
Wilco


Re: [PATCH v2 2/2][ARM] Improve max_cond_insns setting for Cortex cores

2019-12-06 Thread Wilco Dijkstra
Hi Christophe,

I've added an option to allow the warning to be enabled/disabled:
https://sourceware.org/ml/binutils/2019-12/msg00093.html

Cheers,
Wilco

Re: [PATCH v2 2/2][ARM] Improve max_cond_insns setting for Cortex cores

2019-12-06 Thread Wilco Dijkstra
Hi Christophe,

> In practice, how do you activate it when running the GCC testsuite? Do
> you plan to send a GCC patch to enable this assembler flag, or do you
> locally enable that option by default in your binutils?

The warning is off by default so there is no need to do anything in the 
testsuite,
you just need a fixed binutils.

> FWIW, I've also noticed that the whole libstdc++ testsuite is somehow
> "deactivated" (I have 0 pass, 0 fail etc...)  after your GCC patch
> when configuring GCC
> --target arm-none-linux-gnueabihf
> --with-mode thumb
> --with-cpu cortex-a57
> --with-fpu crypto-neon-fp-armv8

Well it's possible a configure check failed somehow.

Cheers,
Wilco


Re: [PATCH v2 2/2][ARM] Improve max_cond_insns setting for Cortex cores

2019-12-09 Thread Wilco Dijkstra
Hi Christophe,

>> The warning is off by default so there is no need to do anything in the 
>> testsuite,
>> you just need a fixed binutils.
>>
>
> Don't we want to fix GCC to stop generating the offending sequence?

Why? All ARMv8 implementations have to support it, and despite the warning
the code actually runs significantly faster.

>> Well it's possible a configure check failed somehow.
>>
> Yes, it fails when compiling testsuite_abi.cc, resulting in tcl errors.

It's odd it's that sensitive to extra warnings, but anyway...

Cheers,
Wilco

Re: [PATCH] PR90838: Support ctz idioms

2019-12-11 Thread Wilco Dijkstra
Hi Richard,

>> +(match (ctz_table_index @1 @2 @3)
>> +  (rshift (mult (bit_and (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3))
>
> You need a :c on the bit_and

Fixed.

> +  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
> +  unsigned shiftval = tree_to_uhwi (tshift);
> +  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));

> In the even that a __int128_t IFN_CTZ is supported the above might ICE with
> too large constants so please wither use wide-int ops or above verify
> tree_fits_{u,s}hwi () before doing the conversions (the conversion from
> TYPE_SIZE should always succeed though).

I've moved the initialization of val much later so we have done all the checks 
and
know for sure the mulc will fit in a HWint.

> Hmm.  So this verifies that for a subset of all possible inputs the table
> computes the correct value.
>
> a) how do we know this verification is exhaustive?
> b) we do this for every array access matching the pattern

It checks all the values that matter, which is the number of bits plus the 
special
handling of ctz(0). An array may contain entries which can never be referenced
(see ctz2() in the testcase), so we don't care what the value is in those cases.
Very few accesses can match the pattern given it is very specific and there are
many checks before it tries to check the contents of the array.

> I suggest you do
>  tree ctor = ctor_for_folding (array);
>  if (!ctor || TREE_CODE (ctor) != CONSTRUCTOR)
>    return false;
>
> and then perform the verification on the constructor elements directly.
> That's a lot cheaper.  Watch out for array_ref_low_bound which you
> don't get passed in here - thus pass in the ARRAY_REF, not the array.
>
> I believe it's also wrong in your code above (probably miscompiles
> a fortran equivalent of your testcase or fails verification/transform).
>
> When you do the verification on the ctor_for_folding then you
> can in theory lookup the varpool node for 'array' and cache
> the verification result there.

I've updated it to use the ctor, but it meant adding another code path to
handle string literals. It's not clear how the array_ref_low_bound affects the
initializer, but I now reject it if it is non-zero.
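
The array check now has roughly this shape (a sketch, not the committed
check_ctz_array/check_ctz_string code; check_element stands in for the
per-entry validation):

  tree init = ctor_for_folding (array);
  if (!init)
    return false;
  if (TREE_CODE (init) == CONSTRUCTOR)
    {
      unsigned HOST_WIDE_INT ix;
      tree idx, value;
      FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (init), ix, idx, value)
	if (!check_element (idx, value))
	  return false;
    }
  else if (TREE_CODE (init) == STRING_CST)
    {
      /* Separate path: the table is a narrow string literal, checked
	 byte by byte.  */
    }
  else
    return false;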

>> +  tree lhs = gimple_assign_lhs (stmt);
>> +  bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), val);
>
> since we're using the optab entry shouldn't you check for == 2 here?

Yes, that looks more correct (it's not clear what 1 means exactly).

> Please check this before building the call.

I've reordered the checks so it returns before it builds any gimple if it 
cannot do
the transformation.

> For all of the above using gimple_build () style stmt building and
> a final gsi_replace_with_seq would be more straight-forward.

I've changed that, but it meant always inserting the nop convert, otherwise
it does not make the code easier to follow.

Cheers,
Wilco


[PATCH v3] PR90838: Support ctz idioms

v3: Directly walk over the array initializer and other tweaks based on review.
v2: Use fwprop pass rather than match.pd

Support common idioms for count trailing zeroes using an array lookup.
The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
constant which when multiplied by a power of 2 contains a unique value
in the top 5 or 6 bits.  This is then indexed into a table which maps it
to the number of trailing zeroes.  When the table is valid, we emit a
sequence using the target defined value for ctz (0):

int ctz1 (unsigned x)
{
  static const char table[32] =
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}

Is optimized to:

	rbit	w0, w0
	clz	w0, w0
	and	w0, w0, 31
	ret

Bootstrapped on AArch64. OK for commit?

ChangeLog:

2019-12-11  Wilco Dijkstra  

PR tree-optimization/90838
* tree-ssa-forwprop.c (check_ctz_array): Add new function.
(check_ctz_string): Likewise.
(optimize_count_trailing_zeroes): Likewise.
(simplify_count_trailing_zeroes): Likewise.
(pass_forwprop::execute): Try ctz simplification.
* match.pd: Add matching for ctz idioms.
* testsuite/gcc.target/aarch64/pr90838.c: New test.

--
diff --git a/gcc/match.pd b/gcc/match.pd
index 
3b7a5ce4e9a4de4f983ccdc696ad406a7932c08c..410cd6eaae0cdc9de7e01d5496de0595b7ea15ba
 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -6116,3 +6116,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (simplify
  (vec_perm vec_same_elem_p@0 @0 @1)
  @0)
+
+/* Match count trailing zeroes for simplify_count_trailing_zeroes in fwprop.
+   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a m

[PATCH][AArch64] Fixup core tunings

2019-12-13 Thread Wilco Dijkstra
Several tuning settings in cores.def are not consistent.
Set the tuning for Cortex-A76AE and Cortex-A77 to neoversen1 so
it is the same as for Cortex-A76 and Neoverse N1.
Set the tuning for Neoverse E1 to cortexa73 so it's the same as for
Cortex-A65. Set the scheduler for Cortex-A65 and Cortex-A65AE to
cortexa53.

Bootstrap OK, OK for commit?

ChangeLog:
2019-12-11  Wilco Dijkstra  

* config/aarch64/aarch64-cores.def: Update settings for
cortex-a76ae, cortex-a77, cortex-a65, cortex-a65ae, neoverse-e1,
cortex-a76.cortex-a55.
--

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 
053c6390e747cb9c818fe29a9b22990143b260ad..d170253c6eddca87f8b9f4f7fcc4692695ef83fb
 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -101,13 +101,13 @@ AARCH64_CORE("thunderx2t99",  thunderx2t99,  
thunderx2t99, 8_1A,  AARCH64_FL_FOR
 AARCH64_CORE("cortex-a55",  cortexa55, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, 
cortexa53, 0x41, 0xd05, -1)
 AARCH64_CORE("cortex-a75",  cortexa75, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, 
cortexa73, 0x41, 0xd0a, -1)
 AARCH64_CORE("cortex-a76",  cortexa76, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, 
neoversen1, 0x41, 0xd0b, -1)
-AARCH64_CORE("cortex-a76ae",  cortexa76ae, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa72, 0x41, 0xd0e, -1)
-AARCH64_CORE("cortex-a77",  cortexa77, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa72, 0x41, 0xd0d, -1)
-AARCH64_CORE("cortex-a65",  cortexa65, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa73, 0x41, 0xd06, -1)
-AARCH64_CORE("cortex-a65ae",  cortexa65ae, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa73, 0x41, 0xd43, -1)
+AARCH64_CORE("cortex-a76ae",  cortexa76ae, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, neoversen1, 0x41, 0xd0e, -1)
+AARCH64_CORE("cortex-a77",  cortexa77, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, neoversen1, 0x41, 0xd0d, -1)
+AARCH64_CORE("cortex-a65",  cortexa65, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa73, 0x41, 0xd06, -1)
+AARCH64_CORE("cortex-a65ae",  cortexa65ae, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa73, 0x41, 0xd43, -1)
 AARCH64_CORE("ares",  ares, cortexa57, 8_2A,  AARCH64_FL_FOR_ARCH8_2 | 
AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD | AARCH64_FL_PROFILE, 
neoversen1, 0x41, 0xd0c, -1)
 AARCH64_CORE("neoverse-n1",  neoversen1, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_PROFILE, neoversen1, 0x41, 0xd0c, -1)
-AARCH64_CORE("neoverse-e1",  neoversee1, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa53, 0x41, 0xd4a, -1)
+AARCH64_CORE("neoverse-e1",  neoversee1, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa73, 0x41, 0xd4a, -1)
 
 /* HiSilicon ('H') cores. */
 AARCH64_CORE("tsv110",  tsv110, tsv110, 8_2A,  AARCH64_FL_FOR_ARCH8_2 | 
AARCH64_FL_CRYPTO | AARCH64_FL_F16 | AARCH64_FL_AES | AARCH64_FL_SHA2, tsv110,  
 0x48, 0xd01, -1)
@@ -127,6 +127,6 @@ AARCH64_CORE("cortex-a73.cortex-a53",  cortexa73cortexa53, 
cortexa53, 8A,  AARCH
 /* ARM DynamIQ big.LITTLE configurations.  */
 
 AARCH64_CORE("cortex-a75.cortex-a55",  cortexa75cortexa55, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, 
cortexa73, 0x41, AARCH64_BIG_LITTLE (0xd0a, 0xd05), -1)
-AARCH64_CORE("cortex-a76.cortex-a55",  cortexa76cortexa55, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, 
cortexa72, 0x41, AARCH64_BIG_LITTLE (0xd0b, 0xd05), -1)
+AARCH64_CORE("cortex-a76.cortex-a55",  cortexa76cortexa55, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, 
neoversen1, 0x41, AARCH64_BIG_LITTLE (0xd0b, 0xd05), -1)
 
 #undef AARCH64_CORE


Re: [PATCH][AArch64] Fixup core tunings

2019-12-17 Thread Wilco Dijkstra
Hi Richard,

> This changelog entry is inadequate.  It's also not in the correct style.
>
> It should say what has changed, not just that it has changed.

Sure, but there is often no useful space for that. We should auto-generate
changelogs if they are deemed useful. I find the commit message a lot more
useful in general. Here is the updated version:


Several tuning settings in cores.def are not consistent.
Set the tuning for Cortex-A76AE and Cortex-A77 to neoversen1 so
it is the same as for Cortex-A76 and Neoverse N1.
Set the tuning for Neoverse E1 to cortexa73 so it's the same as for
Cortex-A65. Set the scheduler for Cortex-A65 and Cortex-A65AE to
cortexa53.

Bootstrap OK, OK for commit?

ChangeLog:
2019-12-17  Wilco Dijkstra  

* config/aarch64/aarch64-cores.def: 
("cortex-a76ae"): Use neoversen1 tuning.
("cortex-a77"): Likewise.
("cortex-a65"): Use cortexa53 scheduler.
("cortex-a65ae"): Likewise.
("neoverse-e1"): Use cortexa73 tuning.
--

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 
053c6390e747cb9c818fe29a9b22990143b260ad..d170253c6eddca87f8b9f4f7fcc4692695ef83fb
 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -101,13 +101,13 @@ AARCH64_CORE("thunderx2t99",  thunderx2t99,  
thunderx2t99, 8_1A,  AARCH64_FL_FOR
 AARCH64_CORE("cortex-a55",  cortexa55, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, 
cortexa53, 0x41, 0xd05, -1)
 AARCH64_CORE("cortex-a75",  cortexa75, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, 
cortexa73, 0x41, 0xd0a, -1)
 AARCH64_CORE("cortex-a76",  cortexa76, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, 
neoversen1, 0x41, 0xd0b, -1)
-AARCH64_CORE("cortex-a76ae",  cortexa76ae, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa72, 0x41, 0xd0e, -1)
-AARCH64_CORE("cortex-a77",  cortexa77, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa72, 0x41, 0xd0d, -1)
-AARCH64_CORE("cortex-a65",  cortexa65, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa73, 0x41, 0xd06, -1)
-AARCH64_CORE("cortex-a65ae",  cortexa65ae, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa73, 0x41, 0xd43, -1)
+AARCH64_CORE("cortex-a76ae",  cortexa76ae, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, neoversen1, 0x41, 0xd0e, -1)
+AARCH64_CORE("cortex-a77",  cortexa77, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, neoversen1, 0x41, 0xd0d, -1)
+AARCH64_CORE("cortex-a65",  cortexa65, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa73, 0x41, 0xd06, -1)
+AARCH64_CORE("cortex-a65ae",  cortexa65ae, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa73, 0x41, 0xd43, -1)
 AARCH64_CORE("ares",  ares, cortexa57, 8_2A,  AARCH64_FL_FOR_ARCH8_2 | 
AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD | AARCH64_FL_PROFILE, 
neoversen1, 0x41, 0xd0c, -1)
 AARCH64_CORE("neoverse-n1",  neoversen1, cortexa57, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_PROFILE, neoversen1, 0x41, 0xd0c, -1)
-AARCH64_CORE("neoverse-e1",  neoversee1, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa53, 0x41, 0xd4a, -1)
+AARCH64_CORE("neoverse-e1",  neoversee1, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD 
| AARCH64_FL_SSBS, cortexa73, 0x41, 0xd4a, -1)
 
 /* HiSilicon ('H') cores. */
 AARCH64_CORE("tsv110",  tsv110, tsv110, 8_2A,  AARCH64_FL_FOR_ARCH8_2 | 
AARCH64_FL_CRYPTO | AARCH64_FL_F16 | AARCH64_FL_AES | AARCH64_FL_SHA2, tsv110,  
 0x48, 0xd01, -1)
@@ -127,6 +127,6 @@ AARCH64_CORE("cortex-a73.cortex-a53",  cortexa73cortexa53, 
cortexa53, 8A,  AARCH
 /* ARM DynamIQ big.LITTLE configurations.  */
 
 AARCH64_CORE("cortex-a75.cortex-a55",  cortexa75cortexa55, cortexa53, 8_2A,  
AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, 
cortexa73, 0x41, AARCH64_BIG_LITTLE (0xd0a, 0xd05), -1)
-AARC

Re: [PATCH][ARM] Switch to default sched pressure algorithm

2019-12-19 Thread Wilco Dijkstra
Hi,

>> I've noticed that your patch caused a regression:
>> FAIL: gcc.dg/tree-prof/pr77698.c scan-rtl-dump-times alignments
>> "internal loop alignment added" 1

I've created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93007

Cheers,
Wilco



[PATCH][AARCH64] Enable compare branch fusion

2019-12-24 Thread Wilco Dijkstra
Enable the most basic form of compare-branch fusion since various CPUs
support it. This has no measurable effect on cores which don't support
branch fusion, but increases fusion opportunities on cores which do.
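
For illustration only (not from the patch), the kind of adjacent pair that
AARCH64_FUSE_CMP_BRANCH keeps together is a compare immediately followed by
a conditional branch:

void callee (void);

void f (int x)
{
  if (x == 42)    /* roughly:  cmp w0, 42 ; b.eq/b.ne ...  */
    callee ();
}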

Bootstrapped on AArch64, OK for commit?

ChangeLog:
2019-12-24  Wilco Dijkstra  

* config/aarch64/aarch64.c (generic_tunings): Add branch fusion.
(neoversen1_tunings): Likewise.

--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
a3b18b381e1748f8fe5e522bdec4f7c850821fe8..1c32a3543bec4031cc9b641973101829c77296b5
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -726,7 +726,7 @@ static const struct tune_params generic_tunings =
   SVE_NOT_IMPLEMENTED, /* sve_width  */
   4, /* memmov_cost  */
   2, /* issue_rate  */
-  (AARCH64_FUSE_AES_AESMC), /* fusible_ops  */
+  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
   "16:12", /* function_align.  */
   "4", /* jump_align.  */
   "8", /* loop_align.  */
@@ -1130,7 +1130,7 @@ static const struct tune_params neoversen1_tunings =
   SVE_NOT_IMPLEMENTED, /* sve_width  */
   4, /* memmov_cost  */
   3, /* issue_rate  */
-  AARCH64_FUSE_AES_AESMC, /* fusible_ops  */
+  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
   "32:16", /* function_align.  */
   "32:16", /* jump_align.  */
   "32:16", /* loop_align.  */


[PATCH][AARCH64] Set jump-align=4 for neoversen1

2019-12-24 Thread Wilco Dijkstra
Testing shows the setting of 32:16 for jump alignment has a significant codesize
cost; however, it doesn't make a difference in performance. So set jump-align
to 4 to get a 1.6% codesize improvement.

OK for commit?

ChangeLog
2019-12-24  Wilco Dijkstra  

* config/aarch64/aarch64.c (neoversen1_tunings): Set jump_align to 4.

--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
1646ed1d9a3de8ee2f0abff385a1ea145e234475..209ed8ebbe81104d9d8cff0df31946ab7704fb33
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1132,7 +1132,7 @@ static const struct tune_params neoversen1_tunings =
   3, /* issue_rate  */
   (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
   "32:16", /* function_align.  */
-  "32:16", /* jump_align.  */
+  "4", /* jump_align.  */
   "32:16", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */


Re: [wwwdocs] Document -fcommon default change

2020-01-07 Thread Wilco Dijkstra
Hi,

>On 1/6/20 7:10 AM, Jonathan Wakely wrote:
>> GCC now defaults to -fno-common.  As a result, global
>> variable accesses are more efficient on various targets.  In C, global
>> variables with multiple tentative definitions will result in linker
>> errors.
>
> This is better.  I'd also s/will/now/, since we're talking about the 
> present behavior of GCC 10, not some future behavior.

Thanks for the suggestions, I've reworded it as:

GCC now defaults to -fno-common.  As a result, global
variable accesses are more efficient on various targets.  In C, global
variables with multiple tentative definitions now result in linker errors.
With -fcommon such definitions are silently merged during
linking.
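
As a concrete illustration (my own example, not part of the wwwdocs text):

/* file1.c and file2.c both contain this tentative definition.  */
int counter;

/* gcc file1.c file2.c           -> multiple definition link error with GCC 10
   gcc -fcommon file1.c file2.c  -> definitions silently merged as before  */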

Also changed the anchor name (I think it was a copy/paste from another entry).
Here is the updated version:


diff --git a/htdocs/gcc-10/changes.html b/htdocs/gcc-10/changes.html
index 
d6108269e977df2af29bd5c9149cc2136654ce05..45af7fa333cfff2155ff0346fe36855aa6ff940a
 100644
--- a/htdocs/gcc-10/changes.html
+++ b/htdocs/gcc-10/changes.html
@@ -140,6 +140,13 @@ a work-in-progress.
 In C2X mode, -fno-fp-int-builtin-inexact is
 enabled by default.
   
+
+  GCC now defaults to -fno-common.  As a result, global
+  variable accesses are more efficient on various targets.  In C, global
+  variables with multiple tentative definitions now result in linker 
errors.
+  With -fcommon such definitions are silently merged during
+  linking.
+  
 
 
 C++
diff --git a/htdocs/gcc-10/porting_to.html b/htdocs/gcc-10/porting_to.html
index 
3256e8a35d00ce1352c169a1c6df6d8f120889ee..7d45a962d014fecece9bd52a13ca1799153672fe
 100644
--- a/htdocs/gcc-10/porting_to.html
+++ b/htdocs/gcc-10/porting_to.html
@@ -29,9 +29,25 @@ and provide solutions. Let us know if you have suggestions 
for improvements!
 Preprocessor issues
 -->
 
-
+
+Default to -fno-common
+
+
+  A common mistake in C is omitting extern when declaring a global
+  variable in a header file.  If the header is included by several files it
+  results in multiple definitions of the same variable.  In previous GCC
+  versions this error is ignored.  GCC 10 defaults to -fno-common,
+  which means a linker error will now be reported.
+  To fix this, use extern in header files when declaring global
+  variables, and ensure each global is defined in exactly one C file.
+  As a workaround, legacy C code can be compiled with -fcommon.
+
+  
+  int x;  // tentative definition - avoid in header files
+
+  extern int y;  // correct declaration in a header file
+  
 
 Fortran language issues
 


[COMMITTED] ARM: Fix builtin-bswap-1.c test [PR113915]

2024-03-08 Thread Wilco Dijkstra
On Thumb-2 the use of CBZ blocks conditional execution, so change the
test to compare with a non-zero value.
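
For illustration only (not part of the patch), the difference on Thumb-2 is
roughly the following; register use is hypothetical:

unsigned short
swapu16_sketch (unsigned short x, int y)
{
  unsigned short z = x;
  /* With "if (y)" the compiler prefers  cbz r1, .L1  which cannot be
     predicated.  With "if (y != 2)" it emits  cmp r1, #2  followed by an
     IT block and a predicated rev16/revsh, as the test expects.  */
  if (y != 2)
    z = __builtin_bswap16 (x);
  return z;
}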

gcc/testsuite/ChangeLog:
PR target/113915
* gcc.target/arm/builtin-bswap.x: Fix test to avoid emitting CBZ.

---

diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap.x 
b/gcc/testsuite/gcc.target/arm/builtin-bswap.x
index 
c96dbe6329c4dc648fd0bcc972ad494c7d6dc6e5..dc8f910e0007a67ae5cb5100c98101c7b199b5ca
 100644
--- a/gcc/testsuite/gcc.target/arm/builtin-bswap.x
+++ b/gcc/testsuite/gcc.target/arm/builtin-bswap.x
@@ -10,7 +10,7 @@ extern short foos16 (short);
 short swaps16_cond (short x, int y)
 {
   short z = x;
-  if (y)
+  if (y != 2)
 z = __builtin_bswap16 (x);
   return foos16 (z);
 }
@@ -27,7 +27,7 @@ extern unsigned short foou16 (unsigned short);
 unsigned short swapu16_cond (unsigned short x, int y)
 {
   unsigned short z = x;
-  if (y)
+  if (y != 2)
 z = __builtin_bswap16 (x);
   return foou16 (z);
 }
@@ -43,7 +43,7 @@ extern int foos32 (int);
 int swaps32_cond (int x, int y)
 {
   int z = x;
-  if (y)
+  if (y != 2)
 z = __builtin_bswap32 (x);
   return foos32 (z);
 }
@@ -60,7 +60,7 @@ extern unsigned int foou32 (unsigned int);
 unsigned int swapsu2 (unsigned int x, int y)
 {
   int z = x;
-  if (y)
+  if (y != 2)
 z = __builtin_bswap32 (x);
   return foou32 (z);
 }



Re: [PATCH] libatomic: Fix build for --disable-gnu-indirect-function [PR113986]

2024-03-26 Thread Wilco Dijkstra
Hi Richard,

> This description is too brief for me.  Could you say in detail how the
> new scheme works?  E.g. the description doesn't explain:
>
> -if ARCH_AARCH64_HAVE_LSE128
> -AM_CPPFLAGS   = -DHAVE_FEAT_LSE128
> -endif

That is not needed because we can include auto-config.h in atomic_16.S. I needed
this for HAVE_IFUNC, but then we redefine HAVE_FEAT_LSE128...

> And what's the purpose of ARCH_AARCH64_HAVE_LSE128 after this change?

None. I've removed the makefile leftovers in v2.

> Is the indirection via ALIAS2 necessary?  Couldn't ENTRY just define
> the __atomic_* symbols directly, as non-hidden, if we remove the
> libat_ prefix?  That would make it easier to ensure that the lists
> are kept up-to-date.

Yes, we need both the libat_ symbol as well as the __atomic_ variant in this
case. One is for internal calls, the other for external. I have a separate
cleanup patch which hides the extra alias in ENTRY and removes all the libat
prefixes. However, while trivial, that feels more like a stage 1 patch.

> Shouldn't we skip the ENTRY_FEAT functions and existing aliases
> if !HAVE_IFUNC?

Yes, that's relatively easy, I've added HAVE_FEAT_LSE2 for that. Also we skip
the aliases at the end.

> I think it'd be worth (as a prepatch) splitting the file into two
> #included subfiles, one that contains the base AArch64 routines and one
> that contains the optimised versions.  The former would then be #included
> for all builds while the latter would be specific to HAVE_IFUNC.

That sounds like a complete rewrite. We might as well emit our own ifuncs at
that point and avoid all of the workarounds needed to fit in the framework of
libatomic.

So for v2 I have kept things simple and just focus on fixing the bug.

Cheers,
Wilco


v2: 

Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
Always build atomic_16.S, add aliases to the __atomic_ functions if !HAVE_IFUNC.
Include auto-config.h in atomic_16.S to avoid having to pass defines via
makefiles.  Fix build if HWCAP_ATOMICS/CPUID are not defined.

Passes regress and bootstrap, OK for commit?

libatomic:
PR target/113986
* Makefile.in: Regenerated.
* Makefile.am: Make atomic_16.S not depend on HAVE_IFUNC.
Remove predefine of HAVE_FEAT_LSE128.
* acinclude.m4: Remove ARCH_AARCH64_HAVE_LSE128.
* configure: Regenerated.
* config/linux/aarch64/atomic_16.S: Add __atomic_ aliases if 
!HAVE_IFUNC.   
* config/linux/aarch64/host-config.h: Correctly handle !HAVE_IFUNC.  Add
defines for HWCAP_ATOMICS and HWCAP_CPUID.

---

diff --git a/libatomic/Makefile.am b/libatomic/Makefile.am
index 
d49c44c7d5fbe83061fddd1f8ef4813a39eb1b8b..980677f353345c050f6cef2d57090360216c56cf
 100644
--- a/libatomic/Makefile.am
+++ b/libatomic/Makefile.am
@@ -130,12 +130,8 @@ libatomic_la_LIBADD = $(foreach s,$(SIZES),$(addsuffix 
_$(s)_.lo,$(SIZEOBJS)))
 ## On a target-specific basis, include alternates to be selected by IFUNC.
 if HAVE_IFUNC
 if ARCH_AARCH64_LINUX
-if ARCH_AARCH64_HAVE_LSE128
-AM_CPPFLAGS = -DHAVE_FEAT_LSE128
-endif
 IFUNC_OPTIONS   = -march=armv8-a+lse
 libatomic_la_LIBADD += $(foreach s,$(SIZES),$(addsuffix 
_$(s)_1_.lo,$(SIZEOBJS)))
-libatomic_la_SOURCES += atomic_16.S
 
 endif
 if ARCH_ARM_LINUX
@@ -155,6 +151,10 @@ libatomic_la_LIBADD += $(addsuffix _16_1_.lo,$(SIZEOBJS)) \
 endif
 endif
 
+if ARCH_AARCH64_LINUX
+libatomic_la_SOURCES += atomic_16.S
+endif
+
 libatomic_convenience_la_SOURCES = $(libatomic_la_SOURCES)
 libatomic_convenience_la_LIBADD = $(libatomic_la_LIBADD)
 
diff --git a/libatomic/Makefile.in b/libatomic/Makefile.in
index 
11c8ec7ba15ba7da5ef55e90bd836317bc270061..d9d529bc502d4ce7b9997640d5f40f5d5cc1232c
 100644
--- a/libatomic/Makefile.in
+++ b/libatomic/Makefile.in
@@ -90,17 +90,17 @@ build_triplet = @build@
 host_triplet = @host@
 target_triplet = @target@
 @ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_1 = $(foreach 
s,$(SIZES),$(addsuffix _$(s)_1_.lo,$(SIZEOBJS)))
-@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_2 = atomic_16.S
-@ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_3 = $(foreach \
+@ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_2 = $(foreach \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ s,$(SIZES),$(addsuffix \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ _$(s)_1_.lo,$(SIZEOBJS))) \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ $(addsuffix \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ _8_2_.lo,$(SIZEOBJS)) \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ tas_1_2_.lo
-@ARCH_I386_TRUE@@HAVE_IFUNC_TRUE@am__append_4 = $(addsuffix 
_8_1_.lo,$(SIZEOBJS))
-@ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@am__append_5 = $(addsuffix 
_16_1_.lo,$(SIZEOBJS)) \
+@ARCH_I386_TRUE@@HAVE_IFUNC_TRUE@am__append_3 = $(addsuffix 
_8_1_.lo,$(SIZEOBJS))
+@ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@am__append_4 = $(addsuffix 
_16_1_.lo,$(SIZEOBJS)) \
 @ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@   $(addsuffix 
_16_2_.lo,$(SIZEOBJS))
 
+@ARCH_AARCH64_LINUX_TRUE@am__append_

[PATCH] libatomic: Cleanup macros in atomic_16.S

2024-03-26 Thread Wilco Dijkstra

As mentioned in https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648397.html,
do some additional cleanup of the macros and aliases:

Cleanup the macros to add the libat_ prefixes in atomic_16.S.  Emit the
alias to __atomic_ when ifuncs are not enabled in the ENTRY macro.

Passes regress and bootstrap, OK for commit?

libatomic:
* config/linux/aarch64/atomic_16.S: Add __libat_ prefix in the
LSE2/LSE128/CORE macros, remove elsewhere.  Add ATOMIC macro.

---

diff --git a/libatomic/config/linux/aarch64/atomic_16.S 
b/libatomic/config/linux/aarch64/atomic_16.S
index 
4e3fa870b0338da4cfcdb0879ab8bed8d041a0a3..d0343507120c06a483ffdae1a793b6b5263cfe98
 100644
--- a/libatomic/config/linux/aarch64/atomic_16.S
+++ b/libatomic/config/linux/aarch64/atomic_16.S
@@ -45,7 +45,7 @@
 # define HAVE_FEAT_LSE128 0
 #endif
 
-#define HAVE_FEAT_LSE2  HAVE_IFUNC
+#define HAVE_FEAT_LSE2 HAVE_IFUNC
 
 #if HAVE_FEAT_LSE128
.arch   armv9-a+lse128
@@ -53,31 +53,37 @@
.arch   armv8-a+lse
 #endif
 
-#define LSE128(NAME)   NAME##_i1
-#define LSE2(NAME) NAME##_i2
-#define CORE(NAME) NAME
+#define LSE128(NAME)   libat_##NAME##_i1
+#define LSE2(NAME) libat_##NAME##_i2
+#define CORE(NAME) libat_##NAME
+#define ATOMIC(NAME)   __atomic_##NAME
 
-#define ENTRY_FEAT(NAME, FEAT)  \
-   ENTRY (FEAT (NAME))
+#if HAVE_IFUNC
+# define ENTRY(NAME)   ENTRY2 (CORE (NAME), )
+#else
+/* Emit __atomic_* entrypoints if no ifuncs.  */
+# define ENTRY(NAME)   ENTRY2 (CORE (NAME), ALIAS (NAME, ATOMIC, CORE))
+#endif
+#define ENTRY_FEAT(NAME, FEAT) ENTRY2 (FEAT (NAME), )
+
+#define END(NAME)  END2 (CORE (NAME))
+#define END_FEAT(NAME, FEAT)   END2 (FEAT (NAME))
 
-#define ENTRY(NAME)\
+#define ENTRY2(NAME, ALIASES)  \
.global NAME;   \
.hidden NAME;   \
.type NAME,%function;   \
.p2align 4; \
+   ALIASES;\
 NAME:  \
-   .cfi_startproc; \
-   hint34  // bti c
-
-#define END_FEAT(NAME, FEAT)   \
-   END (FEAT (NAME))
+   .cfi_startproc; \
+   hint34; // bti c
 
-#define END(NAME)  \
+#define END2(NAME) \
.cfi_endproc;   \
.size NAME, .-NAME;
 
-#define ALIAS(NAME, FROM, TO)  ALIAS1 (FROM (NAME),TO (NAME))
-#define ALIAS2(NAME)   ALIAS1 (__atomic_##NAME, libat_##NAME)
+#define ALIAS(NAME, FROM, TO)  ALIAS1 (FROM (NAME), TO (NAME))
 
 #define ALIAS1(ALIAS, NAME)\
.global ALIAS;  \
@@ -116,7 +122,7 @@ NAME:   \
 #define SEQ_CST 5
 
 
-ENTRY (libat_load_16)
+ENTRY (load_16)
mov x5, x0
cbnzw1, 2f
 
@@ -131,11 +137,11 @@ ENTRY (libat_load_16)
stxpw4, res0, res1, [x5]
cbnzw4, 2b
ret
-END (libat_load_16)
+END (load_16)
 
 
 #if HAVE_FEAT_LSE2
-ENTRY_FEAT (libat_load_16, LSE2)
+ENTRY_FEAT (load_16, LSE2)
cbnzw1, 1f
 
/* RELAXED.  */
@@ -155,11 +161,11 @@ ENTRY_FEAT (libat_load_16, LSE2)
ldp res0, res1, [x0]
dmb ishld
ret
-END_FEAT (libat_load_16, LSE2)
+END_FEAT (load_16, LSE2)
 #endif
 
 
-ENTRY (libat_store_16)
+ENTRY (store_16)
cbnzw4, 2f
 
/* RELAXED.  */
@@ -173,11 +179,11 @@ ENTRY (libat_store_16)
stlxp   w4, in0, in1, [x0]
cbnzw4, 2b
ret
-END (libat_store_16)
+END (store_16)
 
 
 #if HAVE_FEAT_LSE2
-ENTRY_FEAT (libat_store_16, LSE2)
+ENTRY_FEAT (store_16, LSE2)
cbnzw4, 1f
 
/* RELAXED.  */
@@ -189,11 +195,11 @@ ENTRY_FEAT (libat_store_16, LSE2)
stlxp   w4, in0, in1, [x0]
cbnzw4, 1b
ret
-END_FEAT (libat_store_16, LSE2)
+END_FEAT (store_16, LSE2)
 #endif
 
 
-ENTRY (libat_exchange_16)
+ENTRY (exchange_16)
mov x5, x0
cbnzw4, 2f
 
@@ -217,11 +223,11 @@ ENTRY (libat_exchange_16)
stlxp   w4, in0, in1, [x5]
cbnzw4, 4b
ret
-END (libat_exchange_16)
+END (exchange_16)
 
 
 #if HAVE_FEAT_LSE128
-ENTRY_FEAT (libat_exchange_16, LSE128)
+ENTRY_FEAT (exchange_16, LSE128)
mov tmp0, x0
mov res0, in0
mov res1, in1
@@ -241,11 +247,11 @@ ENTRY_FEAT (libat_exchange_16, LSE128)
/* RELEASE/ACQ_REL/SEQ_CST.  */
 2: swppal  res0, res1, [tmp0]
ret
-END_FEAT (libat_exchange_16, LSE128)
+END_FEAT (exchange_16, LSE128)
 #endif
 
 
-ENTRY (libat_compare_exchange_16)
+ENTRY (compare_exchange_16)
ldp exp0, exp1, [x1]
cbz w4, 3f
cmp w4, RELEASE
@@ -289,11 +295,11 @@ ENTRY (libat_compare_exchange_16)
stp tmp0, tmp1, [x1]
 6: csetx0, eq
ret
-END (libat_compare_exchange_16)
+END (compare_exchange_16)
 
 
 #if HAVE_FEAT_LSE2
-ENTRY_FEAT (libat_compare_exchange_16, LSE2)
+ENTRY_FEAT (compare_exchange_16, LSE2)
ldp exp0, exp1, [x1]
mov tmp0, exp0
mov tmp1, exp1
@@ -326,

[PATCH] libgcc: Add missing HWCAP entries to aarch64/cpuinfo.c

2024-04-02 Thread Wilco Dijkstra

A few HWCAP entries are missing from aarch64/cpuinfo.c.  This results in build
errors on older machines.

This counts as a trivial build fix, but since it's late in stage 4 I'll let
maintainers chip in.
OK for commit?

libgcc/
* config/aarch64/cpuinfo.c: Add HWCAP_EVTSTRM, HWCAP_CRC32, 
HWCAP_CPUID, 
HWCAP_PACA and HWCAP_PACG.

---

diff --git a/libgcc/config/aarch64/cpuinfo.c b/libgcc/config/aarch64/cpuinfo.c
index 
3c6fb8a575b423c2aff71a1a9f40812b154ee284..4b94fca869507145ec690c825f637abbc82a3493
 100644
--- a/libgcc/config/aarch64/cpuinfo.c
+++ b/libgcc/config/aarch64/cpuinfo.c
@@ -52,15 +52,15 @@ struct {
 #ifndef AT_HWCAP
 #define AT_HWCAP 16
 #endif
-#ifndef HWCAP_CPUID
-#define HWCAP_CPUID (1 << 11)
-#endif
 #ifndef HWCAP_FP
 #define HWCAP_FP (1 << 0)
 #endif
 #ifndef HWCAP_ASIMD
 #define HWCAP_ASIMD (1 << 1)
 #endif
+#ifndef HWCAP_EVTSTRM
+#define HWCAP_EVTSTRM (1 << 2)
+#endif
 #ifndef HWCAP_AES
 #define HWCAP_AES (1 << 3)
 #endif
@@ -73,6 +73,9 @@ struct {
 #ifndef HWCAP_SHA2
 #define HWCAP_SHA2 (1 << 6)
 #endif
+#ifndef HWCAP_CRC32
+#define HWCAP_CRC32 (1 << 7)
+#endif
 #ifndef HWCAP_ATOMICS
 #define HWCAP_ATOMICS (1 << 8)
 #endif
@@ -82,6 +85,9 @@ struct {
 #ifndef HWCAP_ASIMDHP
 #define HWCAP_ASIMDHP (1 << 10)
 #endif
+#ifndef HWCAP_CPUID
+#define HWCAP_CPUID (1 << 11)
+#endif
 #ifndef HWCAP_ASIMDRDM
 #define HWCAP_ASIMDRDM (1 << 12)
 #endif
@@ -133,6 +139,12 @@ struct {
 #ifndef HWCAP_SB
 #define HWCAP_SB (1 << 29)
 #endif
+#ifndef HWCAP_PACA
+#define HWCAP_PACA (1 << 30)
+#endif
+#ifndef HWCAP_PACG
+#define HWCAP_PACG (1UL << 31)
+#endif
 
 #ifndef HWCAP2_DCPODP
 #define HWCAP2_DCPODP (1 << 0)



[PATCH] AArch64: memcpy/memset expansions should not emit LDP/STP [PR113618]

2024-02-01 Thread Wilco Dijkstra

The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC.
Given the new LDP fusion pass is good at finding LDP opportunities, change the
memcpy, memmove and memset expansions to emit single vector loads/stores.
This fixes the regression and enables more RTL optimization on the standard
memory accesses.  SPEC2017 performance improves slightly.  Codesize is a bit
worse due to missed LDP opportunities as discussed in the PR.
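
As a minimal illustration (my example, not from the patch or testsuite), a
fixed 32-byte copy like:

void copy32 (void *dst, const void *src)
{
  __builtin_memcpy (dst, src, 32);
}

now expands into two independent Q-register loads followed by two stores
rather than an explicit LDP/STP pair; the ldp_fusion pass is still free to
combine the adjacent accesses afterwards.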

Passes regress, OK for commit?

gcc/ChangeLog:
PR target/113618
* config/aarch64/aarch64.cc (aarch64_copy_one_block): Remove. 
(aarch64_expand_cpymem): Emit single load/store only.
(aarch64_set_one_block): Remove.
(aarch64_expand_setmem): Emit single stores only.

gcc/testsuite/ChangeLog:
PR target/113618
* gcc.target/aarch64/pr113618.c: New test.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
d17198b4a5f73f8be8aeca3258b81809ffb48eac..2194441b949a53f181fe373e07bc18341c014918
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26376,33 +26376,6 @@ aarch64_move_pointer (rtx pointer, poly_int64 amount)
next, amount);
 }
 
-typedef auto_vec, 12> copy_ops;
-
-/* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
-static void
-aarch64_copy_one_block (copy_ops &ops, rtx src, rtx dst,
-   int offset, machine_mode mode)
-{
-  /* Emit explict load/store pair instructions for 32-byte copies.  */
-  if (known_eq (GET_MODE_SIZE (mode), 32))
-{
-  mode = V4SImode;
-  rtx src1 = adjust_address (src, mode, offset);
-  rtx dst1 = adjust_address (dst, mode, offset);
-  rtx reg1 = gen_reg_rtx (mode);
-  rtx reg2 = gen_reg_rtx (mode);
-  rtx load = aarch64_gen_load_pair (reg1, reg2, src1);
-  rtx store = aarch64_gen_store_pair (dst1, reg1, reg2);
-  ops.safe_push ({ load, store });
-  return;
-}
-
-  rtx reg = gen_reg_rtx (mode);
-  rtx load = gen_move_insn (reg, adjust_address (src, mode, offset));
-  rtx store = gen_move_insn (adjust_address (dst, mode, offset), reg);
-  ops.safe_push ({ load, store });
-}
-
 /* Expand a cpymem/movmem using the MOPS extension.  OPERANDS are taken
from the cpymem/movmem pattern.  IS_MEMMOVE is true if this is a memmove
rather than memcpy.  Return true iff we succeeded.  */
@@ -26438,7 +26411,7 @@ aarch64_expand_cpymem (rtx *operands, bool is_memmove)
   rtx src = operands[1];
   unsigned align = UINTVAL (operands[3]);
   rtx base;
-  machine_mode cur_mode = BLKmode, next_mode;
+  machine_mode mode = BLKmode, next_mode;
 
   /* Variable-sized or strict-align copies may use the MOPS expansion.  */
   if (!CONST_INT_P (operands[2]) || (STRICT_ALIGNMENT && align < 16))
@@ -26465,7 +26438,7 @@ aarch64_expand_cpymem (rtx *operands, bool is_memmove)
  ??? Although it would be possible to use LDP/STP Qn in streaming mode
  (so using TARGET_BASE_SIMD instead of TARGET_SIMD), it isn't clear
  whether that would improve performance.  */
-  unsigned copy_max = (size <= 24 || !TARGET_SIMD) ? 16 : 32;
+  bool use_qregs = size > 24 && TARGET_SIMD;
 
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -26473,7 +26446,7 @@ aarch64_expand_cpymem (rtx *operands, bool is_memmove)
   base = copy_to_mode_reg (Pmode, XEXP (src, 0));
   src = adjust_automodify_address (src, VOIDmode, base, 0);
 
-  copy_ops ops;
+  auto_vec, 16> ops;
   int offset = 0;
 
   while (size > 0)
@@ -26482,23 +26455,27 @@ aarch64_expand_cpymem (rtx *operands, bool is_memmove)
 or writing.  */
   opt_scalar_int_mode mode_iter;
   FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
-   if (GET_MODE_SIZE (mode_iter.require ()) <= MIN (size, copy_max))
- cur_mode = mode_iter.require ();
+   if (GET_MODE_SIZE (mode_iter.require ()) <= MIN (size, 16))
+ mode = mode_iter.require ();
+
+  gcc_assert (mode != BLKmode);
 
-  gcc_assert (cur_mode != BLKmode);
+  mode_bytes = GET_MODE_SIZE (mode).to_constant ();
 
-  mode_bytes = GET_MODE_SIZE (cur_mode).to_constant ();
+  /* Prefer Q-register accesses.  */
+  if (mode_bytes == 16 && use_qregs)
+   mode = V4SImode;
 
-  /* Prefer Q-register accesses for the last bytes.  */
-  if (mode_bytes == 16 && copy_max == 32)
-   cur_mode = V4SImode;
-  aarch64_copy_one_block (ops, src, dst, offset, cur_mode);
+  rtx reg = gen_reg_rtx (mode);
+  rtx load = gen_move_insn (reg, adjust_address (src, mode, offset));
+  rtx store = gen_move_insn (adjust_address (dst, mode, offset), reg);
+  ops.safe_push ({ load, store });
   size -= mode_bytes;
   offset += mode_bytes;
 
   /* Emit trailing copies using overlapping unaligned accesses
 (when !STRICT_ALIGNMENT) - this is smaller and faster.  */
-  if (size > 0 && size < copy_max / 2 && !ST

[PATCH] ARM: Fix conditional execution [PR113915]

2024-02-21 Thread Wilco Dijkstra

By default most patterns can be conditionalized on Arm targets.  However
Thumb-2 predication requires the "predicable" attribute be explicitly
set to "yes".  Most patterns are shared between Arm and Thumb(-2) and are
marked with "predicable".  Given this sharing, it does not make sense to
use a different default for Arm.  So only consider conditional execution
of instructions that have the predicable attribute set to yes.  This ensures
that patterns not explicitly marked as such are never accidentally 
conditionally executed like in the PR.

GLIBC codesize was ~0.014% worse due to atomic operations now being
unconditional and a small number of patterns not setting "predicable".

Passes regress and bootstrap, OK for commit?

gcc/ChangeLog:
PR target/113915
* config/arm/arm.md (NOCOND): Improve comment.
* config/arm/arm.cc (arm_final_prescan_insn): Add check for
PREDICABLE_YES.

gcc/testsuite/ChangeLog:
PR target/113915
* gcc.target/arm/builtin-bswap-1.c: Fix test.

---

diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 
c44047c377a802d0c1dc1406df1b88a6b079607b..29771d284831a995adcf9adbb525396fbabb1ea2
 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -25610,11 +25610,12 @@ arm_final_prescan_insn (rtx_insn *insn)
  break;
 
case INSN:
- /* Instructions using or affecting the condition codes make it
-fail.  */
+ /* Check the instruction is explicitly marked as predicable.
+Instructions using or affecting the condition codes are not.  
*/
  scanbody = PATTERN (this_insn);
  if (!(GET_CODE (scanbody) == SET
|| GET_CODE (scanbody) == PARALLEL)
+ || get_attr_predicable (this_insn) != PREDICABLE_YES
  || get_attr_conds (this_insn) != CONDS_NOCOND)
fail = TRUE;
  break;
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 
5816409f86f1106b410c5e21d77e599b485f85f2..671f093862259c2c0df93a986fc22fa56a8ea6c7
 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -307,6 +307,8 @@
 ;
 ; NOCOND means that the instruction does not use or alter the condition
 ;   codes but can be converted into a conditionally exectuted instruction.
+;   Given that NOCOND is the default for most instructions if omitted,
+;   the attribute predicable must be set to yes as well.
 
 (define_attr "conds" "use,set,clob,unconditional,nocond"
(if_then_else
diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap-1.c 
b/gcc/testsuite/gcc.target/arm/builtin-bswap-1.c
index 
c1e7740d14d3ca4e93a71e38b12f82c19791a204..3de7cea81c1128c2fe5a9e1216e6b027d26bcab9
 100644
--- a/gcc/testsuite/gcc.target/arm/builtin-bswap-1.c
+++ b/gcc/testsuite/gcc.target/arm/builtin-bswap-1.c
@@ -5,14 +5,8 @@
of the instructions.  Add an -mtune option known to facilitate that.  */
 /* { dg-additional-options "-O2 -mtune=cortex-a53" } */
 /* { dg-final { scan-assembler-not "orr\[ \t\]" } } */
-/* { dg-final { scan-assembler-times "revsh\\t" 1 { target { arm_nothumb } } } 
}  */
-/* { dg-final { scan-assembler-times "revshne\\t" 1 { target { arm_nothumb } } 
} }  */
-/* { dg-final { scan-assembler-times "revsh\\t" 2 { target { ! arm_nothumb } } 
} }  */
-/* { dg-final { scan-assembler-times "rev16\\t" 1 { target { arm_nothumb } } } 
}  */
-/* { dg-final { scan-assembler-times "rev16ne\\t" 1 { target { arm_nothumb } } 
} }  */
-/* { dg-final { scan-assembler-times "rev16\\t" 2 { target { ! arm_nothumb } } 
} }  */
-/* { dg-final { scan-assembler-times "rev\\t" 2 { target { arm_nothumb } } } } 
 */
-/* { dg-final { scan-assembler-times "revne\\t" 2 { target { arm_nothumb } } } 
}  */
-/* { dg-final { scan-assembler-times "rev\\t" 4 { target { ! arm_nothumb } } } 
}  */
+/* { dg-final { scan-assembler-times "revsh\\t" 2 } }  */
+/* { dg-final { scan-assembler-times "rev16\\t" 2 } }  */
+/* { dg-final { scan-assembler-times "rev\\t" 4 } }  */
 
 #include "builtin-bswap.x"



Re: [PATCH] AArch64: memcpy/memset expansions should not emit LDP/STP [PR113618]

2024-02-22 Thread Wilco Dijkstra
Hi Richard,

> It looks like this is really doing two things at once: disabling the
> direct emission of LDP/STP Qs, and switching the GPR handling from using
> pairs of DImode moves to single TImode moves.  At least, that seems to be
> the effect of...

No it still uses TImode for the !TARGET_SIMD case.

> +   if (GET_MODE_SIZE (mode_iter.require ()) <= MIN (size, 16))
> + mode = mode_iter.require ();

> ...hard-coding 16 here and...

This only affects the Q register case.

> -  if (size > 0 && size < copy_max / 2 && !STRICT_ALIGNMENT)
> +  if (size > 0 && size < 16 && !STRICT_ALIGNMENT)

> ...changing this limit from 8 to 16 for non-SIMD copies.
>
> Is that deliberate?  If so, please mention that kind of thing in the
> covering note.  It sounded like this was intended to change the handling
> of vector moves only.

Yes it's deliberate. It now basically treats everything as blocks of 16 bytes
which has a nice simplifying effect. I've added a note.

> This means that, for GPRs, we are now effectively using the double-word
> move patterns to get an LDP/STP indirectly, rather than directly as before.

No, there is no difference here.

> That seems OK, and I suppose might be slightly preferable to the current
> code for things like:
>
>  char a[31], b[31];
>  void f() { __builtin_memcpy(a, b, 31); }

Yes, an unaligned tail improves slightly by using blocks of 16 bytes.
It's a very rare case: -mgeneral-regs-only is rarely used, and most
fixed-size copies are a nice multiple of 8.

> But that raises the question: should we do the same thing for Q registers
> and V2x16QImode?

I don't believe it makes sense to use those complex types. And it likely
blocks optimizations in a similar way to how UNSPEC does.

> If emitting individual vector loads and stores is better than using
> V2x16QI (and I can see that it might be), then why isn't the same
> true for GPRs and DImode vs TImode?

It might be feasible to do the same for scalar copies. But given that
using TImode works fine, there is no regression here, and use of
-mgeneral-regs-only is rare, what would the benefit be of doing that?

> I think the final version of this patch should go in ahead of the
> clean-up patch.  As I mentioned in the other review, I think the
> clean-up should wait for GCC 15.

I've rebased it to the trunk.

Cheers,
Wilco


v2: Rebase to trunk

The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC.
Given the new LDP fusion pass is good at finding LDP opportunities, change the
memcpy, memmove and memset expansions to emit single vector loads/stores.
This fixes the regression and enables more RTL optimization on the standard
memory accesses.  Handling of unaligned tail of memcpy/memmove is improved
with -mgeneral-regs-only.  SPEC2017 performance improves slightly.  Codesize
is a bit worse due to missed LDP opportunities as discussed in the PR.

Passes regress, OK for commit?

gcc/ChangeLog:
PR target/113618
* config/aarch64/aarch64.cc (aarch64_copy_one_block): Remove. 
(aarch64_expand_cpymem): Emit single load/store only.
(aarch64_set_one_block): Emit single stores only.

gcc/testsuite/ChangeLog:
PR target/113618
* gcc.target/aarch64/pr113618.c: New test.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
16318bf925883ecedf9345e53fc0824a553b2747..0a28e033088a00818c6ed9fa8c15ecdee5a86c35
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26465,33 +26465,6 @@ aarch64_progress_pointer (rtx pointer)
   return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
 }
 
-typedef auto_vec, 12> copy_ops;
-
-/* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
-static void
-aarch64_copy_one_block (copy_ops &ops, rtx src, rtx dst,
-   int offset, machine_mode mode)
-{
-  /* Emit explict load/store pair instructions for 32-byte copies.  */
-  if (known_eq (GET_MODE_SIZE (mode), 32))
-{
-  mode = V4SImode;
-  rtx src1 = adjust_address (src, mode, offset);
-  rtx dst1 = adjust_address (dst, mode, offset);
-  rtx reg1 = gen_reg_rtx (mode);
-  rtx reg2 = gen_reg_rtx (mode);
-  rtx load = aarch64_gen_load_pair (reg1, reg2, src1);
-  rtx store = aarch64_gen_store_pair (dst1, reg1, reg2);
-  ops.safe_push ({ load, store });
-  return;
-}
-
-  rtx reg = gen_reg_rtx (mode);
-  rtx load = gen_move_insn (reg, adjust_address (src, mode, offset));
-  rtx store = gen_move_insn (adjust_address (dst, mode, offset), reg);
-  ops.safe_push ({ load, store });
-}
-
 /* Expand a cpymem/movmem using the MOPS extension.  OPERANDS are taken
from the cpymem/movmem pattern.  IS_MEMMOVE is true if this is a memmove
rather than memcpy.  Return true iff we succeeded.  */
@@ -26527,7 +26500,7 @@ aarch64_expand_cpymem (rtx *operands, bool is_memmove)
   rtx src = operands[1];
   unsigned align = UINTVAL (operands[3]);
   rtx

Re: [PATCH] ARM: Fix conditional execution [PR113915]

2024-02-23 Thread Wilco Dijkstra
Hi Richard,

> This bit isn't.  The correct fix here is to fix the pattern(s) concerned to 
> add the missing predicate.
>
> Note that builtin-bswap.x explicitly mentions predicated mnemonics in the 
> comments.

I fixed the patterns in v2. There are likely some more, plus we could likely
merge many t1 and t2 patterns where the only difference is predication. But
those cleanups are for another time...

Cheers,
Wilco

v2: Add predicable to the rev patterns.

By default most patterns can be conditionalized on Arm targets.  However
Thumb-2 predication requires the "predicable" attribute be explicitly
set to "yes".  Most patterns are shared between Arm and Thumb(-2) and are
marked with "predicable".  Given this sharing, it does not make sense to
use a different default for Arm.  So only consider conditional execution
of instructions that have the predicable attribute set to yes.  This ensures
that patterns not explicitly marked as such are never conditionally executed.

Passes regress and bootstrap, OK for commit?

gcc/ChangeLog:
PR target/113915
* config/arm/arm.md (NOCOND): Improve comment.
(arm_rev*) Add predicable.
* config/arm/arm.cc (arm_final_prescan_insn): Add check for
PREDICABLE_YES.

gcc/testsuite/ChangeLog:
PR target/113915
* gcc.target/arm/builtin-bswap-1.c: Fix test.

---

diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 
1cd69268ee986a0953cc85ab259355d2191250ac..6a35fe44138135998877a9fb74c2a82a7f99dcd5
 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -25613,11 +25613,12 @@ arm_final_prescan_insn (rtx_insn *insn)
  break;
 
case INSN:
- /* Instructions using or affecting the condition codes make it
-fail.  */
+ /* Check the instruction is explicitly marked as predicable.
+Instructions using or affecting the condition codes are not.  
*/
  scanbody = PATTERN (this_insn);
  if (!(GET_CODE (scanbody) == SET
|| GET_CODE (scanbody) == PARALLEL)
+ || get_attr_predicable (this_insn) != PREDICABLE_YES
  || get_attr_conds (this_insn) != CONDS_NOCOND)
fail = TRUE;
  break;
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 
5816409f86f1106b410c5e21d77e599b485f85f2..81237a61d4a2ebcfb77e47c2bd29137aba28a521
 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -307,6 +307,8 @@
 ;
 ; NOCOND means that the instruction does not use or alter the condition
 ;   codes but can be converted into a conditionally exectuted instruction.
+;   Given that NOCOND is the default for most instructions if omitted,
+;   the attribute predicable must be set to yes as well.
 
 (define_attr "conds" "use,set,clob,unconditional,nocond"
(if_then_else
@@ -12547,6 +12549,7 @@
   revsh%?\t%0, %1"
   [(set_attr "arch" "t1,t2,32")
(set_attr "length" "2,2,4")
+   (set_attr "predicable" "no,yes,yes")
(set_attr "type" "rev")]
 )
 
@@ -12560,6 +12563,7 @@
rev16%?\t%0, %1"
   [(set_attr "arch" "t1,t2,32")
(set_attr "length" "2,2,4")
+   (set_attr "predicable" "no,yes,yes")
(set_attr "type" "rev")]
 )
 
@@ -12584,6 +12588,7 @@
rev16%?\t%0, %1"
   [(set_attr "arch" "t1,t2,32")
(set_attr "length" "2,2,4")
+   (set_attr "predicable" "no,yes,yes")
(set_attr "type" "rev")]
 )
 
@@ -12619,6 +12624,7 @@
rev16%?\t%0, %1"
   [(set_attr "arch" "t1,t2,32")
(set_attr "length" "2,2,4")
+   (set_attr "predicable" "no,yes,yes")
(set_attr "type" "rev")]
 )
 
diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap-1.c 
b/gcc/testsuite/gcc.target/arm/builtin-bswap-1.c
index 
c1e7740d14d3ca4e93a71e38b12f82c19791a204..1a311a6a5af647d40abd553e5d0ba1273c76d288
 100644
--- a/gcc/testsuite/gcc.target/arm/builtin-bswap-1.c
+++ b/gcc/testsuite/gcc.target/arm/builtin-bswap-1.c
@@ -5,14 +5,11 @@
of the instructions.  Add an -mtune option known to facilitate that.  */
 /* { dg-additional-options "-O2 -mtune=cortex-a53" } */
 /* { dg-final { scan-assembler-not "orr\[ \t\]" } } */
-/* { dg-final { scan-assembler-times "revsh\\t" 1 { target { arm_nothumb } } } 
}  */
-/* { dg-final { scan-assembler-times "revshne\\t" 1 { target { arm_nothumb } } 
} }  */
-/* { dg-final { scan-assembler-times "revsh\\t" 2 { target { ! arm_nothumb } } 
} }  */
-/* { dg-final { scan-assembler-times "rev16\\t" 1 { target { arm_nothumb } } } 
}  */
-/* { dg-final { scan-assembler-times "rev16ne\\t" 1 { target { arm_nothumb } } 
} }  */
-/* { dg-final { scan-assembler-times "rev16\\t" 2 { target { ! arm_nothumb } } 
} }  */
-/* { dg-final { scan-assembler-times "rev\\t" 2 { target { arm_nothumb } } } } 
 */
-/* { dg-final { scan-assembler-times "revne\\t" 2 { target { arm_nothumb } } } 
}  */
-/* { dg-final { scan-assembler-times "rev\\t" 4 { target { ! arm_nothumb } } } 
}  */
+/* { dg-final { scan-assembler-times "revsh\\t" 1 } }  */
+/* { dg-

[PATCH] libatomic: Fix build for --disable-gnu-indirect-function [PR113986]

2024-02-23 Thread Wilco Dijkstra

Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
Always build atomic_16.S and add aliases to the __atomic_* functions if
!HAVE_IFUNC.

Passes regress and bootstrap, OK for commit?

libatomic:
PR target/113986
* Makefile.in: Regenerated.
* Makefile.am: Make atomic_16.S not depend on HAVE_IFUNC.
Remove predefine of HAVE_FEAT_LSE128.
* config/linux/aarch64/atomic_16.S: Add __atomic_ aliases if 
!HAVE_IFUNC.   
* config/linux/aarch64/host-config.h: Correctly handle !HAVE_IFUNC.

---

diff --git a/libatomic/Makefile.am b/libatomic/Makefile.am
index 
d49c44c7d5fbe83061fddd1f8ef4813a39eb1b8b..980677f353345c050f6cef2d57090360216c56cf
 100644
--- a/libatomic/Makefile.am
+++ b/libatomic/Makefile.am
@@ -130,12 +130,8 @@ libatomic_la_LIBADD = $(foreach s,$(SIZES),$(addsuffix 
_$(s)_.lo,$(SIZEOBJS)))
 ## On a target-specific basis, include alternates to be selected by IFUNC.
 if HAVE_IFUNC
 if ARCH_AARCH64_LINUX
-if ARCH_AARCH64_HAVE_LSE128
-AM_CPPFLAGS = -DHAVE_FEAT_LSE128
-endif
 IFUNC_OPTIONS   = -march=armv8-a+lse
 libatomic_la_LIBADD += $(foreach s,$(SIZES),$(addsuffix 
_$(s)_1_.lo,$(SIZEOBJS)))
-libatomic_la_SOURCES += atomic_16.S
 
 endif
 if ARCH_ARM_LINUX
@@ -155,6 +151,10 @@ libatomic_la_LIBADD += $(addsuffix _16_1_.lo,$(SIZEOBJS)) \
 endif
 endif
 
+if ARCH_AARCH64_LINUX
+libatomic_la_SOURCES += atomic_16.S
+endif
+
 libatomic_convenience_la_SOURCES = $(libatomic_la_SOURCES)
 libatomic_convenience_la_LIBADD = $(libatomic_la_LIBADD)
 
diff --git a/libatomic/Makefile.in b/libatomic/Makefile.in
index 
11c8ec7ba15ba7da5ef55e90bd836317bc270061..d9d529bc502d4ce7b9997640d5f40f5d5cc1232c
 100644
--- a/libatomic/Makefile.in
+++ b/libatomic/Makefile.in
@@ -90,17 +90,17 @@ build_triplet = @build@
 host_triplet = @host@
 target_triplet = @target@
 @ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_1 = $(foreach 
s,$(SIZES),$(addsuffix _$(s)_1_.lo,$(SIZEOBJS)))
-@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_2 = atomic_16.S
-@ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_3 = $(foreach \
+@ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_2 = $(foreach \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ s,$(SIZES),$(addsuffix \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ _$(s)_1_.lo,$(SIZEOBJS))) \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ $(addsuffix \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ _8_2_.lo,$(SIZEOBJS)) \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ tas_1_2_.lo
-@ARCH_I386_TRUE@@HAVE_IFUNC_TRUE@am__append_4 = $(addsuffix 
_8_1_.lo,$(SIZEOBJS))
-@ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@am__append_5 = $(addsuffix 
_16_1_.lo,$(SIZEOBJS)) \
+@ARCH_I386_TRUE@@HAVE_IFUNC_TRUE@am__append_3 = $(addsuffix 
_8_1_.lo,$(SIZEOBJS))
+@ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@am__append_4 = $(addsuffix 
_16_1_.lo,$(SIZEOBJS)) \
 @ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@   $(addsuffix 
_16_2_.lo,$(SIZEOBJS))
 
+@ARCH_AARCH64_LINUX_TRUE@am__append_5 = atomic_16.S
 subdir = .
 ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
 am__aclocal_m4_deps = $(top_srcdir)/../config/acx.m4 \
@@ -156,8 +156,7 @@ am__uninstall_files_from_dir = { \
   }
 am__installdirs = "$(DESTDIR)$(toolexeclibdir)"
 LTLIBRARIES = $(noinst_LTLIBRARIES) $(toolexeclib_LTLIBRARIES)
-@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__objects_1 =  \
-@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@ atomic_16.lo
+@ARCH_AARCH64_LINUX_TRUE@am__objects_1 = atomic_16.lo
 am_libatomic_la_OBJECTS = gload.lo gstore.lo gcas.lo gexch.lo \
glfree.lo lock.lo init.lo fenv.lo fence.lo flag.lo \
$(am__objects_1)
@@ -425,7 +424,7 @@ libatomic_la_LDFLAGS = $(libatomic_version_info) 
$(libatomic_version_script) \
$(lt_host_flags) $(libatomic_darwin_rpath)
 
 libatomic_la_SOURCES = gload.c gstore.c gcas.c gexch.c glfree.c lock.c \
-   init.c fenv.c fence.c flag.c $(am__append_2)
+   init.c fenv.c fence.c flag.c $(am__append_5)
 SIZEOBJS = load store cas exch fadd fsub fand fior fxor fnand tas
 EXTRA_libatomic_la_SOURCES = $(addsuffix _n.c,$(SIZEOBJS))
 libatomic_la_DEPENDENCIES = $(libatomic_la_LIBADD) $(libatomic_version_dep)
@@ -451,9 +450,8 @@ all_c_files := $(foreach dir,$(search_path),$(wildcard 
$(dir)/*.c))
 # Then sort through them to find the one we want, and select the first.
 M_SRC = $(firstword $(filter %/$(M_FILE), $(all_c_files)))
 libatomic_la_LIBADD = $(foreach s,$(SIZES),$(addsuffix \
-   _$(s)_.lo,$(SIZEOBJS))) $(am__append_1) $(am__append_3) \
-   $(am__append_4) $(am__append_5)
-@ARCH_AARCH64_HAVE_LSE128_TRUE@@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@AM_CPPFLAGS
 = -DHAVE_FEAT_LSE128
+   _$(s)_.lo,$(SIZEOBJS))) $(am__append_1) $(am__append_2) \
+   $(am__append_3) $(am__append_4)
 @ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@IFUNC_OPTIONS = -march=armv8-a+lse
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@IFUNC_OPTIONS = -march=armv7-a+fp 
-DHAVE_KERNEL64
 @ARCH_I386_TRUE@@HAVE_IFUNC_TRUE@IFUNC_OPTIONS = -march=i586
diff --git a/libatomic/config/linux/aarch64/atomic_16.

Re: [PATCH] ARM: Fix conditional execution [PR113915]

2024-02-26 Thread Wilco Dijkstra
Hi Richard,

> Did you test this on a thumb1 target?  It seems to me that the target parts
> that you've removed were likely related to that.  In fact, I don't see why
> this test would need to be changed at all.

The testcase explicitly forces a Thumb-2 target (arm_arch_v6t2). The patterns
were indeed wrong for Thumb-2, and the testcase was explicitly testing for this.
There is a separate builtin-bswap-2.c for the Thumb-1 target (arm_arch_v6m).

Cheers,
Wilco


[PATCH] AArch64: Reassociate CONST in address expressions [PR112573]

2024-01-10 Thread Wilco Dijkstra
GCC tends to optimistically create CONST of globals with an immediate offset.
However, it is almost always better to CSE addresses of globals and add immediate
offsets separately (the offset could be merged later in single-use cases).
Splitting CONST expressions with an index in aarch64_legitimize_address fixes
part of PR112573.

Passes regress & bootstrap, OK for commit?

gcc/ChangeLog:
PR target/112573
* config/aarch64/aarch64.cc (aarch64_legitimize_address): Reassociate 
badly
formed CONST expressions.

gcc/testsuite/ChangeLog:
PR target/112573
* gcc.target/aarch64/pr112573.c: Add new test.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
0909b319d16b9a1587314bcfda0a8112b42a663f..9fbc8b62455f48baec533d3dd5e2d9ea995d5a8f
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -12608,6 +12608,20 @@ aarch64_legitimize_address (rtx x, rtx /* orig_x  */, 
machine_mode mode)
  not to split a CONST for some forms of address expression, otherwise
  it will generate sub-optimal code.  */
 
+  /* First split X + CONST (base, offset) into (base + X) + offset.  */
+  if (GET_CODE (x) == PLUS && GET_CODE (XEXP (x, 1)) == CONST)
+{
+  poly_int64 offset;
+  rtx base = strip_offset_and_salt (XEXP (x, 1), &offset);
+
+  if (offset.is_constant ())
+  {
+ base = expand_binop (Pmode, add_optab, base, XEXP (x, 0),
+  NULL_RTX, true, OPTAB_DIRECT);
+ x = plus_constant (Pmode, base, offset);
+  }
+}
+
   if (GET_CODE (x) == PLUS && CONST_INT_P (XEXP (x, 1)))
 {
   rtx base = XEXP (x, 0);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr112573.c 
b/gcc/testsuite/gcc.target/aarch64/pr112573.c
new file mode 100644
index 
..be04c0ca86ad9f33975a85f497549955d6d1236d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr112573.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-section-anchors" } */
+
+char a[100];
+
+void f1 (int x, int y)
+{
+  *((a + y) + 3) = x;
+  *((a + y) + 2) = x;
+  *((a + y) + 1) = x;
+  *((a + y) + 0) = x;
+}
+
+/* { dg-final { scan-assembler-times "strb" 4 } } */
+/* { dg-final { scan-assembler-times "adrp" 1 } } */



Re: [PATCH] AArch64: Reassociate CONST in address expressions [PR112573]

2024-01-16 Thread Wilco Dijkstra
Hi Richard,

>> +  rtx base = strip_offset_and_salt (XEXP (x, 1), &offset);
>
> This should be just strip_offset, so that we don't lose the salt
> during optimisation.

Fixed.

> +
> +  if (offset.is_constant ())

> I'm not sure this is really required.  Logically the same thing
> would apply to SVE, although admittedly:

It's not needed indeed, I've committed it with the if removed.

However I believe CONST only allows immediate offsets here so that it can
be used in const data. Building SPEC with gcc_assert (!offset.is_constant ()) 
doesn't ever trigger it.

Cheers,
Wilco

[PATCH] AArch64: Add -mcpu=cobalt-100

2024-01-16 Thread Wilco Dijkstra

Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer ID).

Passes regress, OK for commit?

gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (AARCH64_CORE): Add 'cobalt-100' CPU.
* config/aarch64/aarch64-tune.md: Regenerated.
* doc/invoke.texi (-mcpu): Add cobalt-100 core.

---

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 
054862f37bc8738e7193348d01f485a46a9a36e3..7ebefcf543b6f84b3df22ab836728111b56fa76f
 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -186,6 +186,7 @@ AARCH64_CORE("cortex-x3",  cortexx3, cortexa57, V9A,  
(SVE2_BITPERM, MEMTAG, I8M
 AARCH64_CORE("cortex-x4",  cortexx4, cortexa57, V9_2A,  (SVE2_BITPERM, MEMTAG, 
PROFILE), neoversen2, 0x41, 0xd81, -1)
 
 AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1)
+AARCH64_CORE("cobalt-100",   cobalt100, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1)
 
 AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
 AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
diff --git a/gcc/config/aarch64/aarch64-tune.md 
b/gcc/config/aarch64/aarch64-tune.md
index 
98e6882d4324d81268e28810b305b87c63bba22d..abd3c9e0822eeb1652f4856cde591ac175ac0a4a
 100644
--- a/gcc/config/aarch64/aarch64-tune.md
+++ b/gcc/config/aarch64/aarch64-tune.md
@@ -1,5 +1,5 @@
 ;; -*- buffer-read-only: t -*-
 ;; Generated automatically by gentune.sh from aarch64-cores.def
 (define_attr "tune"
-   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,neoversev2,demeter,generic,generic_armv8_a,generic_armv9_a"
+   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,demeter,generic,generic_armv8_a,generic_armv9_a"
(const (symbol_ref "((enum attr_tune) aarch64_tune)")))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 
216e2f594d1cbc139c7e0125d9579c6924d23443..a25362b8c157f67d68b19f94cc2d64bd09505bdc
 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -21163,7 +21163,7 @@ performance of the code.  Permissible values for this 
option are:
 @samp{cortex-r82}, @samp{cortex-x1}, @samp{cortex-x1c}, @samp{cortex-x2},
 @samp{cortex-x3}, @samp{cortex-x4}, @samp{cortex-a510}, @samp{cortex-a520},
 @samp{cortex-a710}, @samp{cortex-a715}, @samp{cortex-a720}, @samp{ampere1},
-@samp{ampere1a}, @samp{ampere1b}, and @samp{native}.
+@samp{ampere1a}, @samp{ampere1b}, @samp{cobalt-100} and @samp{native}.
 
 The values @samp{cortex-a57.cortex-a53}, @samp{cortex-a72.cortex-a53},
 @samp{cortex-a73.cortex-a35}, @samp{cortex-a73.cortex-a53},



Re: [PATCH] AArch64: Add -mcpu=cobalt-100

2024-01-25 Thread Wilco Dijkstra
Hi,

>> Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer
>> ID).
>> 
>> Passes regress, OK for commit?
>
> Ok.

Also OK to backport to GCC 13, 12 and 11?

Cheers,
Wilco

[PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS

2024-01-30 Thread Wilco Dijkstra

(follow-on based on review comments on
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641913.html)


Remove the tune AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS since it is only
used by an old core and doesn't properly support -Os.  SPECINT_2017
shows that removing it has no performance difference, while codesize
is reduced by 0.07%.
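
For reference, the kind of code the removed flag used to affect (my own
example, not from the patch): a fixed 32-byte copy that is normally free to
use a single Q-register LDP/STP pair, which the flag forced into unpaired
accesses.

#include <string.h>

/* With the tuning flag gone, a copy like this can use one LDP/STP of two
   Q registers (illustrative expectation, not verified output).  */
void
copy32 (char *dst, const char *src)
{
  memcpy (dst, src, 32);
}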

Passes regress, OK for commit?

gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_mode_valid_for_sched_fusion_p):
Remove check for AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS.
(aarch64_advsimd_ldp_stp_p): Likewise.
(aarch64_stp_sequence_cost): Likewise.
(aarch64_expand_cpymem): Likewise.
(aarch64_expand_setmem): Likewise.
* config/aarch64/aarch64-ldp-fusion.cc (ldp_operand_mode_ok_p): 
Likewise.   
* config/aarch64/aarch64-ldpstp.md: Likewise.
* config/aarch64/aarch64-tuning-flags.def: Remove NO_LDP_STP_QREGS.
* config/aarch64/tuning_models/emag.h: Likewise.
* config/aarch64/tuning_models/xgene1.h: Likewise.

gcc/testsuite/ChangeLog:
* gcc.target/aarch64/ldp_stp_q_disable.c: Remove test.

---

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 
22ed95eb743c9ee44e745560b207d389c8fca03b..de6685f75a2650d9a7d39fe6781ec57214092eb1
 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -315,17 +315,9 @@ any_post_modify_p (rtx x)
 static bool
 ldp_operand_mode_ok_p (machine_mode mode)
 {
-  const bool allow_qregs
-= !(aarch64_tune_params.extra_tuning_flags
-   & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS);
-
   if (!aarch64_ldpstp_operand_mode_p (mode))
 return false;
 
-  const auto size = GET_MODE_SIZE (mode).to_constant ();
-  if (size == 16 && !allow_qregs)
-return false;
-
   // We don't pair up TImode accesses before RA because TImode is
   // special in that it can be allocated to a pair of GPRs or a single
   // FPR, and the RA is best placed to make that decision.
diff --git a/gcc/config/aarch64/aarch64-ldpstp.md 
b/gcc/config/aarch64/aarch64-ldpstp.md
index 
b7c0bf05cd18c971955d667bae91d7c3dc3f512e..7890a8cc32b24f8e1bc29cb722b10e511e7881ab
 100644
--- a/gcc/config/aarch64/aarch64-ldpstp.md
+++ b/gcc/config/aarch64/aarch64-ldpstp.md
@@ -96,9 +96,7 @@ (define_peephole2
(set (match_operand:VQ2 2 "register_operand" "")
(match_operand:VQ2 3 "memory_operand" ""))]
   "TARGET_FLOAT
-   && aarch64_operands_ok_for_ldpstp (operands, true)
-   && (aarch64_tune_params.extra_tuning_flags
-   & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
+   && aarch64_operands_ok_for_ldpstp (operands, true)"
   [(const_int 0)]
 {
   aarch64_finish_ldpstp_peephole (operands, true);
@@ -111,9 +109,7 @@ (define_peephole2
(set (match_operand:VQ2 2 "memory_operand" "")
(match_operand:VQ2 3 "register_operand" ""))]
   "TARGET_FLOAT
-   && aarch64_operands_ok_for_ldpstp (operands, false)
-   && (aarch64_tune_params.extra_tuning_flags
-   & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0"
+   && aarch64_operands_ok_for_ldpstp (operands, false)"
   [(const_int 0)]
 {
   aarch64_finish_ldpstp_peephole (operands, false);
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
b/gcc/config/aarch64/aarch64-tuning-flags.def
index 
d917da720b22ed6aaf360dc4ebbe8efc4a3185f2..d5bcaebce770f0b217aac783063d39135f754c77
 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -36,9 +36,6 @@ AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", 
RENAME_FMA_REGS)
are not considered cheap.  */
 AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
 
-/* Disallow load/store pair instructions on Q-registers.  */
-AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)
-
 AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
 
 AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
433c160cba22374f6b7a3445c0202789927abd25..d7e8379b2eb90eccb8608a15cc8d11cc2187a9e7
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -10335,9 +10335,7 @@ aarch64_mode_valid_for_sched_fusion_p (machine_mode 
mode)
 || mode == SDmode || mode == DDmode
 || (aarch64_vector_mode_supported_p (mode)
 && (known_eq (GET_MODE_SIZE (mode), 8)
-|| (known_eq (GET_MODE_SIZE (mode), 16)
-   && (aarch64_tune_params.extra_tuning_flags
-   & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) == 0)));
+|| known_eq (GET_MODE_SIZE (mode), 16)));
 }
 
 /* Return true if REGNO is a virtual pointer register, or an eliminable
@@ -16448,10 +16446,6 @@ aarch64_advsimd_ldp_stp_p (enum vect_cost_for_stmt 
kind,
   return false;
 }
 
-  if (aarch64_tune_params.extra_tuning_flags
-  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
-return false;

Re: [PATCH v4] AArch64: Cleanup memset expansion

2024-01-30 Thread Wilco Dijkstra
Hi Richard,

>> That tune is only used by an obsolete core. I ran the memcpy and memset
>> benchmarks from Optimized Routines on xgene-1 with and without LDP/STP.
>> There is no measurable penalty for using LDP/STP. I'm not sure why it was
>> ever added given it does not do anything useful. I'll post a separate patch
>> to remove it to reduce the maintenance overhead.

Patch: https://gcc.gnu.org/pipermail/gcc-patches/2024-January/62.html

> Is that enough to justify removing it though?  It sounds from:
>
>  https://gcc.gnu.org/pipermail/gcc-patches/2018-June/500017.html
>
> like the problem was in more balanced code, rather than memory-limited
> things like memset/memcpy.
>
> But yeah, I'm not sure if the intuition was supported by numbers
> in the end.  If SPEC also shows no change then we can probably drop it
> (unless someone objects).

SPECINT didn't show any difference either, so LDP doesn't have a measurable
penalty. It doesn't look like the original commit was ever backed up by 
benchmarks...

> Let's leave this patch until that's resolved though, since I think as it
> stands the patch does leave -Os -mtune=xgene1 worse off (bigger code).
> Handling the tune in the meantime would also be OK.

Note it was handling -Os incorrectly: it should still form LDP in that case
and take advantage of a longer and faster inlined memcpy/memset instead of
calling a library function.
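
As a concrete illustration (mine, not from the patch): with a 96-byte -Os
limit, a call like the one below is expected to expand inline to roughly one
MOVI/DUP plus three Q-register STPs, which is no larger than materializing
the arguments and calling memset.

#include <string.h>

/* Illustrative -Os limit case: about the same code size as a libcall,
   but without the call overhead (expected expansion, not verified).  */
void
clear96 (char *p)
{
  memset (p, 0, 96);
}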

>    /* Default the maximum to 256-bytes when considering only libcall vs
>   SIMD broadcast sequence.  */

> ...this comment should be deleted along with the code it's describing.
> Don't respin just for that though :)

I've fixed that locally.

Cheers,
Wilco

Re: [PATCH v3] AArch64: Cleanup memset expansion

2023-12-22 Thread Wilco Dijkstra
v3: rebased to latest trunk

Cleanup memset implementation.  Similar to memcpy/memmove, use an offset and
bytes throughout.  Simplify the complex calculations when optimizing for size
by using a fixed limit.

Passes regress & bootstrap.

gcc/ChangeLog:
* config/aarch64/aarch64.h (MAX_SET_SIZE): New define.
* config/aarch64/aarch64.cc (aarch64_progress_pointer): Remove function.
(aarch64_set_one_block_and_progress_pointer): Simplify and clean up.
(aarch64_expand_setmem): Clean up implementation, use byte offsets,
simplify size calculation.

---

diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 
3ae42be770400da96ea3d9d25d6e1b2d393d034d..dd3b7988d585277181c478cd022fd7b6285929d0
 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -1178,6 +1178,10 @@ typedef struct
mode that should actually be used.  We allow pairs of registers.  */
 #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TImode)
 
+/* Maximum bytes set for an inline memset expansion.  With -Os use 3 STP
+   and 1 MOVI/DUP (same size as a call).  */
+#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
+
 /* Maximum bytes moved by a single instruction (load/store pair).  */
 #define MOVE_MAX (UNITS_PER_WORD * 2)
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
f9850320f61c5ddccf47e6583d304e5f405a484f..0909b319d16b9a1587314bcfda0a8112b42a663f
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26294,15 +26294,6 @@ aarch64_move_pointer (rtx pointer, poly_int64 amount)
next, amount);
 }
 
-/* Return a new RTX holding the result of moving POINTER forward by the
-   size of the mode it points to.  */
-
-static rtx
-aarch64_progress_pointer (rtx pointer)
-{
-  return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
-}
-
 typedef auto_vec<std::pair<rtx, rtx>, 12> copy_ops;
 
 /* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
@@ -26457,45 +26448,21 @@ aarch64_expand_cpymem (rtx *operands, bool is_memmove)
   return true;
 }
 
-/* Like aarch64_copy_one_block_and_progress_pointers, except for memset where
-   SRC is a register we have created with the duplicated value to be set.  */
+/* Set one block of size MODE at DST at offset OFFSET to value in SRC.  */
 static void
-aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
-   machine_mode mode)
+aarch64_set_one_block (rtx src, rtx dst, int offset, machine_mode mode)
 {
-  /* If we are copying 128bits or 256bits, we can do that straight from
- the SIMD register we prepared.  */
-  if (known_eq (GET_MODE_BITSIZE (mode), 256))
-{
-  mode = GET_MODE (src);
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memset.  */
-  emit_insn (aarch64_gen_store_pair (*dst, src, src));
-
-  /* Move the pointers forward.  */
-  *dst = aarch64_move_pointer (*dst, 32);
-  return;
-}
-  if (known_eq (GET_MODE_BITSIZE (mode), 128))
+  /* Emit explicit store pair instructions for 32-byte writes.  */
+  if (known_eq (GET_MODE_SIZE (mode), 32))
 {
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, GET_MODE (src), 0);
-  /* Emit the memset.  */
-  emit_move_insn (*dst, src);
-  /* Move the pointers forward.  */
-  *dst = aarch64_move_pointer (*dst, 16);
+  mode = V16QImode;
+  rtx dst1 = adjust_address (dst, mode, offset);
+  emit_insn (aarch64_gen_store_pair (dst1, src, src));
   return;
 }
-  /* For copying less, we have to extract the right amount from src.  */
-  rtx reg = lowpart_subreg (mode, src, GET_MODE (src));
-
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memset.  */
-  emit_move_insn (*dst, reg);
-  /* Move the pointer forward.  */
-  *dst = aarch64_progress_pointer (*dst);
+  if (known_lt (GET_MODE_SIZE (mode), 16))
+src = lowpart_subreg (mode, src, GET_MODE (src));
+  emit_move_insn (adjust_address (dst, mode, offset), src);
 }
 
 /* Expand a setmem using the MOPS instructions.  OPERANDS are the same
@@ -26524,7 +26491,7 @@ aarch64_expand_setmem_mops (rtx *operands)
 bool
 aarch64_expand_setmem (rtx *operands)
 {
-  int n, mode_bits;
+  int mode_bytes;
   unsigned HOST_WIDE_INT len;
   rtx dst = operands[0];
   rtx val = operands[2], src;
@@ -26537,11 +26504,9 @@ aarch64_expand_setmem (rtx *operands)
   || (STRICT_ALIGNMENT && align < 16))
 return aarch64_expand_setmem_mops (operands);
 
-  bool size_p = optimize_function_for_size_p (cfun);
-
   /* Default the maximum to 256-bytes when considering only libcall vs
  SIMD broadcast sequence.  */
-  unsigned max_set_size = 256;
+  unsigned max_set_size = MAX_SET_SIZE (optimize_function_for_speed_p (cfun));
   unsigned mops_threshold = aarch64_mops_memset_size_threshold;
 

Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a

2024-01-08 Thread Wilco Dijkstra
Hi,

>> Is there no benefit to using SWPPL for RELEASE here?  Similarly for the
>> others.
>
> We started off implementing all possible memory orderings available. 
> Wilco saw value in merging less restricted orderings into more 
> restricted ones - mainly to reduce codesize in less frequently used atomics.
> 
> This saw us combine RELEASE and ACQ_REL/SEQ_CST cases to make functions 
> a little smaller.

Benchmarking showed that LSE and LSE2 RMW atomics have similar performance
once the atomic is acquire, release or both.  Given there is already a
significant overhead due to the function call, PLT indirection and argument
setup, it doesn't make sense to add extra taken branches that may mispredict
or cause extra fetch cycles...

The goal for next GCC is to inline these instructions directly to avoid these
overheads.
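
To make that concrete, a minimal user-level sketch (my own, not from the
libatomic patch): both of the 16-byte exchanges below use release-or-stronger
orderings, so with the merging described above they share a single code path
instead of branching per memory model.

#include <stdatomic.h>

/* Illustration only: both functions end up in the same libatomic routine
   and, with the merged orderings, on the same instruction variant.  */
__int128
swap_release (_Atomic __int128 *p, __int128 v)
{
  return atomic_exchange_explicit (p, v, memory_order_release);
}

__int128
swap_seq_cst (_Atomic __int128 *p, __int128 v)
{
  return atomic_exchange_explicit (p, v, memory_order_seq_cst);
}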

Cheers,
Wilco

Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a

2024-01-08 Thread Wilco Dijkstra
Hi Richard,

>> Benchmarking showed that LSE and LSE2 RMW atomics have similar performance 
>> once
>> the atomic is acquire, release or both. Given there is already a significant 
>> overhead due
>> to the function call, PLT indirection and argument setup, it doesn't make 
>> sense to add
>> extra taken branches that may mispredict or cause extra fetch cycles...
>
> Thanks for the extra context, especially wrt the LSE/LSE2 benchmarking.
> If there isn't any difference for acquire vs. the rest, is there a
> justification we can use for keeping the acquire branch, rather than
> using SWPAL for everything except relaxed?

The results showed that acquire is typically slightly faster than release
(5-10%), so for the most frequently used atomics (CAS and SWP) it makes sense
to add support for acquire.  In most cases once you have release semantics,
adding acquire didn't make things slower, so combining release/acq_rel/seq_cst
avoids unnecessary extra branches and keeps the code small.
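
A rough sketch of the resulting shape (my own summary, not the actual
libatomic source): relaxed and acquire keep their own variants, while
release, acq_rel and seq_cst share the strongest form.

/* Hypothetical helper naming the LSE128 swap variants; the mapping of
   consume to acquire follows the usual GCC convention.  */
const char *
swp128_variant (int model)
{
  if (model == __ATOMIC_RELAXED)
    return "swpp";
  if (model == __ATOMIC_ACQUIRE || model == __ATOMIC_CONSUME)
    return "swppa";
  return "swppal";   /* release, acq_rel, seq_cst  */
}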

> If so, then Victor, could you include that in the explanation above and
> add it as a source comment?  Although maybe tone down "doesn't make
> sense to add" to something like "doesn't seem worth adding". :)

Yes it's worth adding a comment to this effect.

Cheers,
Wilco

Re: [PATCH v4] AArch64: Cleanup memset expansion

2024-01-09 Thread Wilco Dijkstra
Hi Richard,

>> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
>
> Since this isn't (AFAIK) a standard macro, there doesn't seem to be
> any need to put it in the header file.  It could just go at the head
> of aarch64.cc instead.

Sure, I've moved it in v4.

>> +  if (len <= 24 || (aarch64_tune_params.extra_tuning_flags
>> +   & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
>> +    set_max = 16;
>
> I think we should take the tuning parameter into account when applying
> the MAX_SET_SIZE limit for -Os.  Shouldn't it be 48 rather than 96 in
> that case?  (Alternatively, I suppose it would make sense to ignore
> the param for -Os, although we don't seem to do that elsewhere.)

That tune is only used by an obsolete core. I ran the memcpy and memset
benchmarks from Optimized Routines on xgene-1 with and without LDP/STP.
There is no measurable penalty for using LDP/STP. I'm not sure why it was
ever added given it does not do anything useful. I'll post a separate patch
to remove it to reduce the maintenance overhead.

Cheers,
Wilco


Here is v4 (move MAX_SET_SIZE definition to aarch64.cc):

Cleanup memset implementation.  Similar to memcpy/memmove, use an offset and
bytes throughout.  Simplify the complex calculations when optimizing for size
by using a fixed limit.

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog:
* config/aarch64/aarch64.cc (MAX_SET_SIZE): New define.
(aarch64_progress_pointer): Remove function.
(aarch64_set_one_block_and_progress_pointer): Simplify and clean up.
(aarch64_expand_setmem): Clean up implementation, use byte offsets,
simplify size calculation.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
a5a6b52730d6c5013346d128e89915883f1707ae..62f4eee429c1c5195d54604f1d341a8a5a499d89
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -101,6 +101,10 @@
 /* Defined for convenience.  */
 #define POINTER_BYTES (POINTER_SIZE / BITS_PER_UNIT)
 
+/* Maximum bytes set for an inline memset expansion.  With -Os use 3 STP
+   and 1 MOVI/DUP (same size as a call).  */
+#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
+
 /* Flags that describe how a function shares certain architectural state
with its callers.
 
@@ -26321,15 +26325,6 @@ aarch64_move_pointer (rtx pointer, poly_int64 amount)
next, amount);
 }
 
-/* Return a new RTX holding the result of moving POINTER forward by the
-   size of the mode it points to.  */
-
-static rtx
-aarch64_progress_pointer (rtx pointer)
-{
-  return aarch64_move_pointer (pointer, GET_MODE_SIZE (GET_MODE (pointer)));
-}
-
 typedef auto_vec<std::pair<rtx, rtx>, 12> copy_ops;
 
 /* Copy one block of size MODE from SRC to DST at offset OFFSET.  */
@@ -26484,45 +26479,21 @@ aarch64_expand_cpymem (rtx *operands, bool is_memmove)
   return true;
 }
 
-/* Like aarch64_copy_one_block_and_progress_pointers, except for memset where
-   SRC is a register we have created with the duplicated value to be set.  */
+/* Set one block of size MODE at DST at offset OFFSET to value in SRC.  */
 static void
-aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
-   machine_mode mode)
+aarch64_set_one_block (rtx src, rtx dst, int offset, machine_mode mode)
 {
-  /* If we are copying 128bits or 256bits, we can do that straight from
- the SIMD register we prepared.  */
-  if (known_eq (GET_MODE_BITSIZE (mode), 256))
-{
-  mode = GET_MODE (src);
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memset.  */
-  emit_insn (aarch64_gen_store_pair (*dst, src, src));
-
-  /* Move the pointers forward.  */
-  *dst = aarch64_move_pointer (*dst, 32);
-  return;
-}
-  if (known_eq (GET_MODE_BITSIZE (mode), 128))
+  /* Emit explicit store pair instructions for 32-byte writes.  */
+  if (known_eq (GET_MODE_SIZE (mode), 32))
 {
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, GET_MODE (src), 0);
-  /* Emit the memset.  */
-  emit_move_insn (*dst, src);
-  /* Move the pointers forward.  */
-  *dst = aarch64_move_pointer (*dst, 16);
+  mode = V16QImode;
+  rtx dst1 = adjust_address (dst, mode, offset);
+  emit_insn (aarch64_gen_store_pair (dst1, src, src));
   return;
 }
-  /* For copying less, we have to extract the right amount from src.  */
-  rtx reg = lowpart_subreg (mode, src, GET_MODE (src));
-
-  /* "Cast" the *dst to the correct mode.  */
-  *dst = adjust_address (*dst, mode, 0);
-  /* Emit the memset.  */
-  emit_move_insn (*dst, reg);
-  /* Move the pointer forward.  */
-  *dst = aarch64_progress_pointer (*dst);
+  if (known_lt (GET_MODE_SIZE (mode), 16))
+src = lowpart_subreg (mode, src, GET_MODE (src));
+  emit_move_insn (adjust_address (dst, mode, offset), src);
 }
 
 /* Expand a setmem using the MOPS instructions.  OPERAND
