Re: [PATCH] AArch64: Improve immediate expansion [PR105928]

2023-09-18 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> I was worried that reusing "dest" for intermediate results would
> prevent CSE for cases like:
>
> void g (long long, long long);
> void
> f (long long *ptr)
> {
>   g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL);
> }

Note that aarch64_internal_mov_immediate may be called after reload,
so it would end up even more complex. This should be done as a
dedicated mid-end optimization similar to TARGET_CONST_ANCHOR.
However the number of 3/4-instruction immediates is so small that
sharable cases would be very rare, so I don't believe it is worth it.

Cheers,
Wilco


[PATCH] AArch64: Improve immediate expansion [PR105928]

2023-09-14 Thread Wilco Dijkstra via Gcc-patches

Support expansion of immediates which can be created from 2 MOVKs
and a shifted ORR or BIC instruction.  Change aarch64_split_dimode_const_store
to apply if we save one instruction.

This reduces the number of 4-instruction immediates in SPECINT/FP by 5%.
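
As an illustration only (not part of the patch): a hypothetical function returning the
repeated-pattern constant from the new test, with the roughly expected 3-instruction
sequence noted in comments.

/* Illustrative sketch; exact instruction selection is up to the compiler.  */
long
repeated_pattern (void)
{
  /* With this patch the expansion is roughly:
       mov  x0, 0x5678               // low 16 bits
       movk x0, 0x1234, lsl 16       // low 32 bits now 0x12345678
       orr  x0, x0, x0, lsl 32       // copy the low half into the top half
     instead of a 4-instruction mov/movk/movk/movk sequence.  */
  return 0x1234567812345678LL;
}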

Passes regress, OK for commit?

gcc/ChangeLog:
PR target/105928
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate):
Add support for immediates using shifted ORR/BIC.
(aarch64_split_dimode_const_store): Apply if we save one instruction.
* config/aarch64/aarch64.md (_3): 
Make pattern global.

gcc/testsuite:
PR target/105928
* gcc.target/aarch64/pr105928.c: Add new test.
* gcc.target/aarch64/vect-cse-codegen.c: Fix test.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
c44c0b979d0cc3755c61dcf566cfddedccebf1ea..832f8197ac8d1a04986791e6f3e51861e41944b2
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -5639,7 +5639,7 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
generate,
machine_mode mode)
 {
   int i;
-  unsigned HOST_WIDE_INT val, val2, mask;
+  unsigned HOST_WIDE_INT val, val2, val3, mask;
   int one_match, zero_match;
   int num_insns;
 
@@ -5721,6 +5721,35 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
generate,
}
  return 3;
}
+
+  /* Try shifting and inserting the bottom 32-bits into the top bits.  */
+  val2 = val & 0xffffffff;
+  val3 = 0xffffffff;
+  val3 = val2 | (val3 << 32);
+  for (i = 17; i < 48; i++)
+   if ((val2 | (val2 << i)) == val)
+ {
+   if (generate)
+ {
+   emit_insn (gen_rtx_SET (dest, GEN_INT (val2 & 0xffff)));
+   emit_insn (gen_insv_immdi (dest, GEN_INT (16),
+  GEN_INT (val2 >> 16)));
+   emit_insn (gen_ior_ashldi3 (dest, dest, GEN_INT (i), dest));
+ }
+   return 3;
+ }
+   else if ((val3 & ~(val3 << i)) == val)
+ {
+   if (generate)
+ {
+   emit_insn (gen_rtx_SET (dest, GEN_INT (val3 | 0xffff0000)));
+   emit_insn (gen_insv_immdi (dest, GEN_INT (16),
+  GEN_INT (val2 >> 16)));
+   emit_insn (gen_and_one_cmpl_ashldi3 (dest, dest, GEN_INT (i),
+ dest));
+ }
+   return 3;
+ }
 }
 
   /* Generate 2-4 instructions, skipping 16 bits of all zeroes or ones which
@@ -25506,8 +25535,6 @@ aarch64_split_dimode_const_store (rtx dst, rtx src)
   rtx lo = gen_lowpart (SImode, src);
   rtx hi = gen_highpart_mode (SImode, DImode, src);
 
-  bool size_p = optimize_function_for_size_p (cfun);
-
   if (!rtx_equal_p (lo, hi))
 return false;
 
@@ -25526,14 +25553,8 @@ aarch64_split_dimode_const_store (rtx dst, rtx src)
  MOV   w1, 49370
  MOVK  w1, 0x140, lsl 16
  STP   w1, w1, [x0]
-   So we want to perform this only when we save two instructions
-   or more.  When optimizing for size, however, accept any code size
-   savings we can.  */
-  if (size_p && orig_cost <= lo_cost)
-return false;
-
-  if (!size_p
-  && (orig_cost <= lo_cost + 1))
+   So we want to perform this when we save at least one instruction.  */
+  if (orig_cost <= lo_cost)
 return false;
 
   rtx mem_lo = adjust_address (dst, SImode, 0);
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
97f70d39cc0ddeb330e044bae0544d85a695567d..932d4d47a5db1a74e0d0565b565afbd769090853
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4618,7 +4618,7 @@ (define_insn "*and_si3_compare0_uxtw"
   [(set_attr "type" "logics_shift_imm")]
 )
 
-(define_insn "*_3"
+(define_insn "_3"
   [(set (match_operand:GPI 0 "register_operand" "=r")
(LOGICAL:GPI (SHIFT:GPI
  (match_operand:GPI 1 "register_operand" "r")
diff --git a/gcc/testsuite/gcc.target/aarch64/pr105928.c 
b/gcc/testsuite/gcc.target/aarch64/pr105928.c
new file mode 100644
index 
..ab52247df66020d0b8fe70bc81f572e8b64c2bb5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr105928.c
@@ -0,0 +1,43 @@
+/* { dg-do assemble } */
+/* { dg-options "-O2 --save-temps" } */
+
+long f1 (void)
+{
+  return 0x80402010080400;
+}
+
+long f2 (void)
+{
+  return 0x1234567812345678;
+}
+
+long f3 (void)
+{
+  return 0x4567800012345678;
+}
+
+long f4 (void)
+{
+  return 0x3ecd3ecd;
+}
+
+long f5 (void)
+{
+  return 0x38e38e38e38e38e;
+}
+
+long f6 (void)
+{
+  return 0x1745d1745d1745d;
+}
+
+void f7 (long *p)
+{
+  *p = 0x1234567812345678;
+}
+
+/* { dg-final { scan-assembler-times {\tmovk\t} 7 } } */
+/* { dg-final { scan-assembler-times {\tmov\t} 7 } } */
+/* { dg-final { 

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-09-13 Thread Wilco Dijkstra via Gcc-patches

ping

From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 
[PR110061] 
 

Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with
existing binaries, gives better performance than locking atomics and is what
most users expect.

Note 128-bit atomic loads use a load/store exclusive loop if LSE2 is not supported.
This results in an implicit store which is invisible to software as long as the given
address is writeable (which will be true when using atomics in actual code).

A simple test on an old Cortex-A72 showed 2.7x speedup of 128-bit atomics.
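
For context (illustrative only, not part of the patch), the kind of user code that ends
up in these libatomic entry points; the function names here are made up.

/* Illustrative sketch; build with -latomic so 16-byte atomics resolve to libatomic.  */
#include <stdbool.h>

__int128
load128 (__int128 *p)
{
  /* Without LSE2 this now uses the LDXP/STXP loop in atomic_16.S.  */
  return __atomic_load_n (p, __ATOMIC_SEQ_CST);
}

bool
cas128 (__int128 *p, __int128 *expected, __int128 desired)
{
  return __atomic_compare_exchange_n (p, expected, desired, false,
                                      __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}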

Passes regress, OK for commit?

libatomic/
    PR target/110061
    config/linux/aarch64/atomic_16.S: Implement lock-free ARMv8.0 atomics.
    config/linux/aarch64/host-config.h: Use atomic_16.S for baseline v8.0.
    State we have lock-free atomics.

---

diff --git a/libatomic/config/linux/aarch64/atomic_16.S 
b/libatomic/config/linux/aarch64/atomic_16.S
index 
05439ce394b9653c9bcb582761ff7aaa7c8f9643..0485c284117edf54f41959d2fab9341a9567b1cf
 100644
--- a/libatomic/config/linux/aarch64/atomic_16.S
+++ b/libatomic/config/linux/aarch64/atomic_16.S
@@ -22,6 +22,21 @@
    .  */
 
 
+/* AArch64 128-bit lock-free atomic implementation.
+
+   128-bit atomics are now lock-free for all AArch64 architecture versions.
+   This is backwards compatible with existing binaries and gives better
+   performance than locking atomics.
+
+   128-bit atomic loads use a exclusive loop if LSE2 is not supported.
+   This results in an implicit store which is invisible to software as long
+   as the given address is writeable.  Since all other atomics have explicit
+   writes, this will be true when using atomics in actual code.
+
+   The libat__16 entry points are ARMv8.0.
+   The libat__16_i1 entry points are used when LSE2 is available.  */
+
+
 .arch   armv8-a+lse
 
 #define ENTRY(name) \
@@ -37,6 +52,10 @@ name:    \
 .cfi_endproc;   \
 .size name, .-name;
 
+#define ALIAS(alias,name)  \
+   .global alias;  \
+   .set alias, name;
+
 #define res0 x0
 #define res1 x1
 #define in0  x2
@@ -70,6 +89,24 @@ name:    \
 #define SEQ_CST 5
 
 
+ENTRY (libat_load_16)
+   mov x5, x0
+   cbnz    w1, 2f
+
+   /* RELAXED.  */
+1: ldxp    res0, res1, [x5]
+   stxp    w4, res0, res1, [x5]
+   cbnz    w4, 1b
+   ret
+
+   /* ACQUIRE/CONSUME/SEQ_CST.  */
+2: ldaxp   res0, res1, [x5]
+   stxp    w4, res0, res1, [x5]
+   cbnz    w4, 2b
+   ret
+END (libat_load_16)
+
+
 ENTRY (libat_load_16_i1)
 cbnz    w1, 1f
 
@@ -93,6 +130,23 @@ ENTRY (libat_load_16_i1)
 END (libat_load_16_i1)
 
 
+ENTRY (libat_store_16)
+   cbnz    w4, 2f
+
+   /* RELAXED.  */
+1: ldxp    xzr, tmp0, [x0]
+   stxp    w4, in0, in1, [x0]
+   cbnz    w4, 1b
+   ret
+
+   /* RELEASE/SEQ_CST.  */
+2: ldxp    xzr, tmp0, [x0]
+   stlxp   w4, in0, in1, [x0]
+   cbnz    w4, 2b
+   ret
+END (libat_store_16)
+
+
 ENTRY (libat_store_16_i1)
 cbnz    w4, 1f
 
@@ -101,14 +155,14 @@ ENTRY (libat_store_16_i1)
 ret
 
 /* RELEASE/SEQ_CST.  */
-1: ldaxp   xzr, tmp0, [x0]
+1: ldxp    xzr, tmp0, [x0]
 stlxp   w4, in0, in1, [x0]
 cbnz    w4, 1b
 ret
 END (libat_store_16_i1)
 
 
-ENTRY (libat_exchange_16_i1)
+ENTRY (libat_exchange_16)
 mov x5, x0
 cbnz    w4, 2f
 
@@ -126,22 +180,55 @@ ENTRY (libat_exchange_16_i1)
 stxp    w4, in0, in1, [x5]
 cbnz    w4, 3b
 ret
-4:
-   cmp w4, RELEASE
-   b.ne    6f
 
-   /* RELEASE.  */
-5: ldxp    res0, res1, [x5]
+   /* RELEASE/ACQ_REL/SEQ_CST.  */
+4: ldaxp   res0, res1, [x5]
 stlxp   w4, in0, in1, [x5]
-   cbnz    w4, 5b
+   cbnz    w4, 4b
 ret
+END (libat_exchange_16)
 
-   /* ACQ_REL/SEQ_CST.  */
-6: ldaxp   res0, res1, [x5]
-   stlxp   w4, in0, in1, [x5]
-   cbnz    w4, 6b
+
+ENTRY (libat_compare_exchange_16)
+   ldp exp0, exp1, [x1]
+   cbz w4, 3f
+   cmp w4, RELEASE
+   b.hs    4f
+
+   /* ACQUIRE/CONSUME.  */
+1: ldaxp   tmp0, tmp1, [x0]
+   cmp tmp0, exp0
+   ccmp    tmp1, exp1, 0, eq
+   bne 2f
+   stxp    w4, in0, in1, [x0]
+   cbnz    w4, 1b
+   mov x0, 1
 ret
-END (libat_exchange_16_i1)
+
+2: stp tmp0, tmp1, [x1]
+   mov x0, 0
+   ret
+
+   /* RELAXED.  */
+3: ldxp    tmp0, tmp1, [x0]
+   cmp tmp0, exp0
+   ccmp    tmp1, exp1, 0, eq
+   bne 2b
+   stxp    w4, in0, in1, [x0]
+   cbnz    w4, 3b
+   mov x0, 1
+   ret
+
+   /* RELEASE/ACQ_REL/SEQ_CST.  */
+4: ldaxp   tmp0, tmp1, 

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-09-13 Thread Wilco Dijkstra via Gcc-patches

ping


From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches ; Richard Sandiford 

Cc: Kyrylo Tkachov 
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64 
 

Add support for ifunc selection based on CPUID register.  Neoverse N1 supports
atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse
cores.

Passes regress, OK for commit?

libatomic/
    config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc
    selection.

---

diff --git a/libatomic/config/linux/aarch64/host-config.h 
b/libatomic/config/linux/aarch64/host-config.h
index 
851c78c01cd643318aaa52929ce4550266238b79..e5dc33c030a4bab927874fa6c69425db463fdc4b
 100644
--- a/libatomic/config/linux/aarch64/host-config.h
+++ b/libatomic/config/linux/aarch64/host-config.h
@@ -26,7 +26,7 @@
 
 #ifdef HWCAP_USCAT
 # if N == 16
-#  define IFUNC_COND_1 (hwcap & HWCAP_USCAT)
+#  define IFUNC_COND_1 ifunc1 (hwcap)
 # else
 #  define IFUNC_COND_1  (hwcap & HWCAP_ATOMICS)
 # endif
@@ -50,4 +50,28 @@
 #undef MAYBE_HAVE_ATOMIC_EXCHANGE_16
 #define MAYBE_HAVE_ATOMIC_EXCHANGE_16   1
 
+#ifdef HWCAP_USCAT
+
+#define MIDR_IMPLEMENTOR(midr) (((midr) >> 24) & 255)
+#define MIDR_PARTNUM(midr) (((midr) >> 4) & 0xfff)
+
+static inline bool
+ifunc1 (unsigned long hwcap)
+{
+  if (hwcap & HWCAP_USCAT)
+    return true;
+  if (!(hwcap & HWCAP_CPUID))
+    return false;
+
+  unsigned long midr;
+  asm volatile ("mrs %0, midr_el1" : "=r" (midr));
+
+  /* Neoverse N1 supports atomic 128-bit load/store.  */
+  if (MIDR_IMPLEMENTOR (midr) == 'A' && MIDR_PARTNUM(midr) == 0xd0c)
+    return true;
+
+  return false;
+}
+#endif
+
 #include_next 


[PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-09-13 Thread Wilco Dijkstra via Gcc-patches

__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop.  On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails.  In this case the value returned
is not read atomically.
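
For illustration (not part of the patch), the affected builtin on a 16-byte type; the
function name is made up.  The bug is that when the comparison fails, the value returned
below was not guaranteed to come from a single atomic read of *p.

/* Illustrative sketch only.  */
__int128
cas_val (__int128 *p, __int128 expected, __int128 desired)
{
  return __sync_val_compare_and_swap (p, expected, desired);
}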

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog/
PR target/111404
* config/aarch64/aarch64.cc (aarch64_split_compare_and_swap):
For 128-bit store the loaded value and loop if needed.

libgcc/ChangeLog/
PR target/111404
* config/aarch64/lse.S (__aarch64_cas16_acq_rel): Execute STLXP using
either new value or loaded value.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
5e8d0a0c91bc7719de2a8c5627b354cf905a4db0..c44c0b979d0cc3755c61dcf566cfddedccebf1ea
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23413,11 +23413,11 @@ aarch64_split_compare_and_swap (rtx operands[])
   mem = operands[1];
   oldval = operands[2];
   newval = operands[3];
-  is_weak = (operands[4] != const0_rtx);
   model_rtx = operands[5];
   scratch = operands[7];
   mode = GET_MODE (mem);
   model = memmodel_from_int (INTVAL (model_rtx));
+  is_weak = operands[4] != const0_rtx && mode != TImode;
 
   /* When OLDVAL is zero and we want the strong version we can emit a tighter
 loop:
@@ -23478,6 +23478,33 @@ aarch64_split_compare_and_swap (rtx operands[])
   else
 aarch64_gen_compare_reg (NE, scratch, const0_rtx);
 
+  /* 128-bit LDAXP is not atomic unless STLXP succeeds.  So for a mismatch,
+ store the returned value and loop if the STLXP fails.  */
+  if (mode == TImode)
+{
+  rtx_code_label *label3 = gen_label_rtx ();
+  emit_jump_insn (gen_rtx_SET (pc_rtx, gen_rtx_LABEL_REF (Pmode, label3)));
+  emit_barrier ();
+
+  emit_label (label2);
+  aarch64_emit_store_exclusive (mode, scratch, mem, rval, model_rtx);
+
+  if (aarch64_track_speculation)
+   {
+ /* Emit an explicit compare instruction, so that we can correctly
+track the condition codes.  */
+ rtx cc_reg = aarch64_gen_compare_reg (NE, scratch, const0_rtx);
+ x = gen_rtx_NE (GET_MODE (cc_reg), cc_reg, const0_rtx);
+   }
+  else
+   x = gen_rtx_NE (VOIDmode, scratch, const0_rtx);
+  x = gen_rtx_IF_THEN_ELSE (VOIDmode, x,
+   gen_rtx_LABEL_REF (Pmode, label1), pc_rtx);
+  aarch64_emit_unlikely_jump (gen_rtx_SET (pc_rtx, x));
+
+  label2 = label3;
+}
+
   emit_label (label2);
 
   /* If we used a CBNZ in the exchange loop emit an explicit compare with RVAL
diff --git a/libgcc/config/aarch64/lse.S b/libgcc/config/aarch64/lse.S
index 
dde3a28e07b13669533dfc5e8fac0a9a6ac33dbd..ba05047ff02b6fc5752235bffa924fc4a2f48c04
 100644
--- a/libgcc/config/aarch64/lse.S
+++ b/libgcc/config/aarch64/lse.S
@@ -160,6 +160,8 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
If not, see
 #define tmp0   16
 #define tmp1   17
 #define tmp2   15
+#define tmp3   14
+#define tmp4   13
 
 #define BTI_C  hint 34
 
@@ -233,10 +235,11 @@ STARTFN   NAME(cas)
 0:     LDXP    x0, x1, [x4]
        cmp     x0, x(tmp0)
        ccmp    x1, x(tmp1), #0, eq
-       bne     1f
-       STXP    w(tmp2), x2, x3, [x4]
-       cbnz    w(tmp2), 0b
-1:     BARRIER
+       csel    x(tmp2), x2, x0, eq
+       csel    x(tmp3), x3, x1, eq
+       STXP    w(tmp4), x(tmp2), x(tmp3), [x4]
+       cbnz    w(tmp4), 0b
+       BARRIER
        ret
 
 #endif



[PATCH] AArch64: List official cores before codenames

2023-09-13 Thread Wilco Dijkstra via Gcc-patches
List official cores first so that -mcpu=native does not show a codename with -v
or in errors/warnings.

Passes regress, OK for commit?

gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares.
(neoverse-v1): Place before zeus.
(neoverse-v2): Place before demeter.
* config/aarch64/aarch64-tune.md: Regenerate.

---

diff --git a/gcc/config/aarch64/aarch64-cores.def 
b/gcc/config/aarch64/aarch64-cores.def
index 
dbac497ef3aab410eb81db185b2e9532186888bb..3894f2afc27e71523e5a413fa45c144222082934
 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -115,8 +115,8 @@ AARCH64_CORE("cortex-a65",  cortexa65, cortexa53, V8_2A,  
(F16, RCPC, DOTPROD, S
 AARCH64_CORE("cortex-a65ae",  cortexa65ae, cortexa53, V8_2A,  (F16, RCPC, 
DOTPROD, SSBS), cortexa73, 0x41, 0xd43, -1)
 AARCH64_CORE("cortex-x1",  cortexx1, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
SSBS, PROFILE), neoversen1, 0x41, 0xd44, -1)
 AARCH64_CORE("cortex-x1c",  cortexx1c, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, 
SSBS, PROFILE, PAUTH), neoversen1, 0x41, 0xd4c, -1)
-AARCH64_CORE("ares",  ares, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, PROFILE), 
neoversen1, 0x41, 0xd0c, -1)
 AARCH64_CORE("neoverse-n1",  neoversen1, cortexa57, V8_2A,  (F16, RCPC, 
DOTPROD, PROFILE), neoversen1, 0x41, 0xd0c, -1)
+AARCH64_CORE("ares",  ares, cortexa57, V8_2A,  (F16, RCPC, DOTPROD, PROFILE), 
neoversen1, 0x41, 0xd0c, -1)
 AARCH64_CORE("neoverse-e1",  neoversee1, cortexa53, V8_2A,  (F16, RCPC, 
DOTPROD, SSBS), cortexa73, 0x41, 0xd4a, -1)
 
 /* Cavium ('C') cores. */
@@ -143,8 +143,8 @@ AARCH64_CORE("thunderx3t110",  thunderx3t110,  
thunderx3t110, V8_3A,  (CRYPTO, S
 /* ARMv8.4-A Architecture Processors.  */
 
 /* Arm ('A') cores.  */
-AARCH64_CORE("zeus", zeus, cortexa57, V8_4A,  (SVE, I8MM, BF16, PROFILE, SSBS, 
RNG), neoversev1, 0x41, 0xd40, -1)
 AARCH64_CORE("neoverse-v1", neoversev1, cortexa57, V8_4A,  (SVE, I8MM, BF16, 
PROFILE, SSBS, RNG), neoversev1, 0x41, 0xd40, -1)
+AARCH64_CORE("zeus", zeus, cortexa57, V8_4A,  (SVE, I8MM, BF16, PROFILE, SSBS, 
RNG), neoversev1, 0x41, 0xd40, -1)
 AARCH64_CORE("neoverse-512tvb", neoverse512tvb, cortexa57, V8_4A,  (SVE, I8MM, 
BF16, PROFILE, SSBS, RNG), neoverse512tvb, INVALID_IMP, INVALID_CORE, -1)
 
 /* Qualcomm ('Q') cores. */
@@ -182,7 +182,7 @@ AARCH64_CORE("cortex-x3",  cortexx3, cortexa57, V9A,  
(SVE2_BITPERM, MEMTAG, I8M
 
 AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1)
 
-AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
 AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, 
SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
+AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
 
 #undef AARCH64_CORE
diff --git a/gcc/config/aarch64/aarch64-tune.md 
b/gcc/config/aarch64/aarch64-tune.md
index 
2170980dddb0d5d410a49631ad26ff2e346b39dd..69e5357fa814e4733b05f7164bfa11e4aa04
 100644
--- a/gcc/config/aarch64/aarch64-tune.md
+++ b/gcc/config/aarch64/aarch64-tune.md
@@ -1,5 +1,5 @@
 ;; -*- buffer-read-only: t -*-
 ;; Generated automatically by gentune.sh from aarch64-cores.def
 (define_attr "tune"
-   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,demeter,neoversev2"
+   
"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,neoversev2,demeter"
(const (symbol_ref "((enum 

[PATCH] ARM: Block predication on atomics [PR111235]

2023-09-07 Thread Wilco Dijkstra via Gcc-patches
The v7 memory ordering model allows reordering of conditional atomic 
instructions.
To avoid this, make all atomic patterns unconditional.  Expand atomic loads and
stores for all architectures so the memory access can be wrapped into an UNSPEC.
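
As a hypothetical illustration (not from the patch or testsuite): code of this shape is
where if-conversion could previously turn the atomic store into a predicated instruction
inside an IT block; the v7 memory ordering model allows such conditional atomics to be
reordered.

/* Illustrative sketch only.  */
void
set_flag (int *flag, int cond)
{
  if (cond)
    __atomic_store_n (flag, 1, __ATOMIC_RELEASE);
}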

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog/
PR target/111235
* config/arm/constraints.md: Remove Pf constraint.
* config/arm/sync.md (arm_atomic_load): Add new pattern.
(arm_atomic_load_acquire): Likewise.
(arm_atomic_store): Likewise.
(arm_atomic_store_release): Likewise.
(atomic_load): Always expand atomic loads explicitly.
(atomic_store): Always expand atomic stores explicitly.
(arm_atomic_loaddi2_ldrd): Remove predication.
(arm_load_exclusive): Likewise.
(arm_load_acquire_exclusive): Likewise.
(arm_load_exclusivesi): Likewise.
(arm_load_acquire_exclusivesi): Likewise.
(arm_load_exclusivedi): Likewise.
(arm_load_acquire_exclusivedi): Likewise.
(arm_store_exclusive): Likewise.
(arm_store_release_exclusivedi): Likewise.
(arm_store_release_exclusive): Likewise.
* config/arm/unspecs.md: Add VUNSPEC_LDR and VUNSPEC_STR.

gcc/testsuite/ChangeLog/
PR target/111235
* gcc.target/arm/pr111235.c: Add new test.

---

diff --git a/gcc/config/arm/constraints.md b/gcc/config/arm/constraints.md
index 
05a4ebbdd67601d7b92aa44a619d17634cc69f17..d7c4a1b0cd785f276862048005e6cfa57cdcb20d
 100644
--- a/gcc/config/arm/constraints.md
+++ b/gcc/config/arm/constraints.md
@@ -36,7 +36,7 @@
 ;; in Thumb-1 state: Pa, Pb, Pc, Pd, Pe
 ;; in Thumb-2 state: Ha, Pj, PJ, Ps, Pt, Pu, Pv, Pw, Px, Py, Pz, Rd, Rf, Rb, 
Ra,
 ;;  Rg, Ri
-;; in all states: Pf, Pg
+;; in all states: Pg
 
 ;; The following memory constraints have been used:
 ;; in ARM/Thumb-2 state: Uh, Ut, Uv, Uy, Un, Um, Us, Up, Uf, Ux, Ul
@@ -239,13 +239,6 @@ (define_constraint "Pe"
   (and (match_code "const_int")
(match_test "TARGET_THUMB1 && ival >= 256 && ival <= 510")))
 
-(define_constraint "Pf"
-  "Memory models except relaxed, consume or release ones."
-  (and (match_code "const_int")
-   (match_test "!is_mm_relaxed (memmodel_from_int (ival))
-   && !is_mm_consume (memmodel_from_int (ival))
-   && !is_mm_release (memmodel_from_int (ival))")))
-
 (define_constraint "Pg"
   "@internal In Thumb-2 state a constant in range 1 to 32"
   (and (match_code "const_int")
diff --git a/gcc/config/arm/sync.md b/gcc/config/arm/sync.md
index 
7626bf3c443285dc63b4c4367b11a879a99c93c6..2210810f67f37ce043b8fdc73b4f21b54c5b1912
 100644
--- a/gcc/config/arm/sync.md
+++ b/gcc/config/arm/sync.md
@@ -62,68 +62,110 @@ (define_insn "*memory_barrier"
(set_attr "conds" "unconditional")
(set_attr "predicable" "no")])
 
-(define_insn "atomic_load"
-  [(set (match_operand:QHSI 0 "register_operand" "=r,r,l")
+(define_insn "arm_atomic_load"
+  [(set (match_operand:QHSI 0 "register_operand" "=r,l")
 (unspec_volatile:QHSI
-  [(match_operand:QHSI 1 "arm_sync_memory_operand" "Q,Q,Q")
-   (match_operand:SI 2 "const_int_operand" "n,Pf,n")]  ;; model
+  [(match_operand:QHSI 1 "memory_operand" "m,m")]
+  VUNSPEC_LDR))]
+  ""
+  "ldr\t%0, %1"
+  [(set_attr "arch" "32,any")])
+
+(define_insn "arm_atomic_load_acquire"
+  [(set (match_operand:QHSI 0 "register_operand" "=r")
+(unspec_volatile:QHSI
+  [(match_operand:QHSI 1 "arm_sync_memory_operand" "Q")]
   VUNSPEC_LDA))]
   "TARGET_HAVE_LDACQ"
-  {
-if (aarch_mm_needs_acquire (operands[2]))
-  {
-   if (TARGET_THUMB1)
- return "lda\t%0, %1";
-   else
- return "lda%?\t%0, %1";
-  }
-else
-  {
-   if (TARGET_THUMB1)
- return "ldr\t%0, %1";
-   else
- return "ldr%?\t%0, %1";
-  }
-  }
-  [(set_attr "arch" "32,v8mb,any")
-   (set_attr "predicable" "yes")])
+  "lda\t%0, %C1"
+)
 
-(define_insn "atomic_store"
-  [(set (match_operand:QHSI 0 "memory_operand" "=Q,Q,Q")
+(define_insn "arm_atomic_store"
+  [(set (match_operand:QHSI 0 "memory_operand" "=m,m")
+(unspec_volatile:QHSI
+  [(match_operand:QHSI 1 "register_operand" "r,l")]
+  VUNSPEC_STR))]
+  ""
+  "str\t%1, %0";
+  [(set_attr "arch" "32,any")])
+
+(define_insn "arm_atomic_store_release"
+  [(set (match_operand:QHSI 0 "arm_sync_memory_operand" "=Q")
 (unspec_volatile:QHSI
-  [(match_operand:QHSI 1 "general_operand" "r,r,l")
-   (match_operand:SI 2 "const_int_operand" "n,Pf,n")]  ;; model
+  [(match_operand:QHSI 1 "register_operand" "r")]
   VUNSPEC_STL))]
   "TARGET_HAVE_LDACQ"
-  {
-if (aarch_mm_needs_release (operands[2]))
-  {
-   if (TARGET_THUMB1)
- return "stl\t%1, %0";
-   else
- return "stl%?\t%1, %0";
-  }
-else
-  {
-   if (TARGET_THUMB1)
- return "str\t%1, %0";
-   else
- return "str%?\t%1, %0";
-  }
- 

Re: [PATCH] AArch64: Fix MOPS memmove operand corruption [PR111121]

2023-08-23 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

(that's quick!)

> +  if (size > max_copy_size || size > max_mops_size)
> +return aarch64_expand_cpymem_mops (operands, is_memmove);
>
> Could you explain this a bit more?  If I've followed the logic correctly,
> max_copy_size will always be 0 for movmem, so this "if" condition will
> always be true for movmem (given that the caller can be relied on to
> optimise away zero-length copies).  So doesn't this function reduce to:

In this patch it is zero yes, but there is no real reason for that.  The goal is to
share as much code as possible.  I have a patch that inlines memmove like memcpy.

> when is_memmove is true?  If so, I think it would be clearer to do that
> directly, rather than go through aarch64_expand_cpymem.  max_copy_size
> is really an optimisation threshold, whereas the above seems to be
> leaning on it for correctness.

In principle we could for the time being add an assert (!is_memmove) if that
makes it clearer that memmove isn't handled yet.

> ...I think we might as well keep this pattern conditional on TARGET_MOPS.

But then we have inconsistencies in the conditions of the expanders, which
is what led to all these bugs in the first place (I lost count, there are 4 or 5
different bugs I fixed). Ensuring everything is 100% identical between
memcpy and memmove makes the code much easier to follow.

> I think we can then also split:
>
>   /* All three registers are changed by the instruction, so each one
>  must be a fresh pseudo.  */
>   rtx dst_addr = copy_to_mode_reg (Pmode, XEXP (operands[0], 0));
>   rtx src_addr = copy_to_mode_reg (Pmode, XEXP (operands[1], 0));
>   rtx dst_mem = replace_equiv_address (operands[0], dst_addr);
>   rtx src_mem = replace_equiv_address (operands[1], src_addr);
>   rtx sz_reg = copy_to_mode_reg (DImode, operands[2]);
>
> out of aarch64_expand_cpymem_mops into a new function (say
> aarch64_prepare_mops_operands) and call it from the movmemdi
> expander.  There should then be no need for the extra staging
> expander (aarch64_movmemdi).

So you're saying we could remove aarch64_cpymemdi/movmemdi if
aarch64_expand_cpymem_mops did massage the operands in the
right way so that we can immediately match the underlying instruction?

Hmm, does that actually work, as in we don't lose the extra alias info that
gets lost in the current memmove expander? (another bug/inconsistency)

And the MOPS code would be separated from aarch64_expand_cpymem
so we'd do all the MOPS size tests inside aarch64_expand_cpymem_mops
and the expander tries using MOPS first and if it fails try inline expansion?

So something like:

(define_expand "movmemdi"

  if (aarch64_try_mops_expansion (operands, is_memmove))
DONE;
  if (aarch64_try_inline_copy_expansion (operands, is_memmove))
DONE;
  FAIL;
)

> IMO the STRICT_ALIGNMENT stuff should be a separate patch,
> with its own testcases.

We will need backports to fix all these bugs, so the question is whether it
is worth doing a lot of cleanups now?

Cheers,
Wilco


[PATCH] AArch64: Fix MOPS memmove operand corruption [PR111121]

2023-08-23 Thread Wilco Dijkstra via Gcc-patches

A MOPS memmove may corrupt registers since there is no copy of the input operands
to temporary registers.  Fix this by calling aarch64_expand_cpymem which does this.
Also fix an issue with STRICT_ALIGNMENT being ignored if TARGET_MOPS is true, and
avoid crashing or generating a huge expansion if aarch64_mops_memcpy_size_threshold
is large.
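
For illustration (not part of the patch): a variable-sized memmove like the one below is
what goes through the MOPS expansion when MOPS is enabled, and is where the operand
registers could previously be corrupted.

/* Illustrative sketch only.  */
void
move_bytes (char *dst, char *src, unsigned long n)
{
  __builtin_memmove (dst, src, n);
}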

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog/
PR target/111121
* config/aarch64/aarch64.md (cpymemdi): Remove STRICT_ALIGNMENT, add 
param for memmove.
(aarch64_movmemdi): Add new expander similar to aarch64_cpymemdi.
(movmemdi): Like cpymemdi call aarch64_expand_cpymem for correct 
expansion.
* config/aarch64/aarch64.cc (aarch64_expand_cpymem_mops): Add support 
for memmove.
(aarch64_expand_cpymem): Add support for memmove. Handle 
STRICT_ALIGNMENT correctly.
Handle TARGET_MOPS size selection correctly.
* config/aarch64/aarch64-protos.h (aarch64_expand_cpymem): Update 
prototype. 

gcc/testsuite/ChangeLog/
PR target/111121
* gcc.target/aarch64/mops_4.c: Add memmove testcases.

---
diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
70303d6fd953e0c397b9138ede8858c2db2e53db..97375e81cbda078847af83bf5dd4e0d7673d6af4
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -765,7 +765,7 @@ bool aarch64_emit_approx_div (rtx, rtx, rtx);
 bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 tree aarch64_vector_load_decl (tree);
 void aarch64_expand_call (rtx, rtx, rtx, bool);
-bool aarch64_expand_cpymem (rtx *);
+bool aarch64_expand_cpymem (rtx *, bool);
 bool aarch64_expand_setmem (rtx *);
 bool aarch64_float_const_zero_rtx_p (rtx);
 bool aarch64_float_const_rtx_p (rtx);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
eba5d4a7e04b7af82437453a691d5607d98133c9..5e8d0a0c91bc7719de2a8c5627b354cf905a4db0
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25135,10 +25135,11 @@ aarch64_copy_one_block_and_progress_pointers (rtx 
*src, rtx *dst,
   *dst = aarch64_progress_pointer (*dst);
 }
 
-/* Expand a cpymem using the MOPS extension.  OPERANDS are taken
-   from the cpymem pattern.  Return true iff we succeeded.  */
+/* Expand a cpymem/movmem using the MOPS extension.  OPERANDS are taken
+   from the cpymem/movmem pattern.  IS_MEMMOVE is true if this is a memmove
+   rather than memcpy.  Return true iff we succeeded.  */
 static bool
-aarch64_expand_cpymem_mops (rtx *operands)
+aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove)
 {
   if (!TARGET_MOPS)
 return false;
@@ -25150,17 +25151,19 @@ aarch64_expand_cpymem_mops (rtx *operands)
   rtx dst_mem = replace_equiv_address (operands[0], dst_addr);
   rtx src_mem = replace_equiv_address (operands[1], src_addr);
   rtx sz_reg = copy_to_mode_reg (DImode, operands[2]);
-  emit_insn (gen_aarch64_cpymemdi (dst_mem, src_mem, sz_reg));
-
+  if (is_memmove)
+emit_insn (gen_aarch64_movmemdi (dst_mem, src_mem, sz_reg));
+  else
+emit_insn (gen_aarch64_cpymemdi (dst_mem, src_mem, sz_reg));
   return true;
 }
 
-/* Expand cpymem, as if from a __builtin_memcpy.  Return true if
-   we succeed, otherwise return false, indicating that a libcall to
-   memcpy should be emitted.  */
-
+/* Expand cpymem/movmem, as if from a __builtin_memcpy/memmove.
+   OPERANDS are taken from the cpymem/movmem pattern.  IS_MEMMOVE is true
+   if this is a memmove rather than memcpy.  Return true if we succeed,
+   otherwise return false, indicating that a libcall should be emitted.  */
 bool
-aarch64_expand_cpymem (rtx *operands)
+aarch64_expand_cpymem (rtx *operands, bool is_memmove)
 {
   int mode_bits;
   rtx dst = operands[0];
@@ -25168,25 +25171,23 @@ aarch64_expand_cpymem (rtx *operands)
   rtx base;
   machine_mode cur_mode = BLKmode;
 
-  /* Variable-sized memcpy can go through the MOPS expansion if available.  */
-  if (!CONST_INT_P (operands[2]))
-return aarch64_expand_cpymem_mops (operands);
+  /* Variable-sized or strict align copies may use the MOPS expansion.  */
+  if (!CONST_INT_P (operands[2]) || STRICT_ALIGNMENT)
+return aarch64_expand_cpymem_mops (operands, is_memmove);
 
   unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
 
-  /* Try to inline up to 256 bytes or use the MOPS threshold if available.  */
-  unsigned HOST_WIDE_INT max_copy_size
-= TARGET_MOPS ? aarch64_mops_memcpy_size_threshold : 256;
+  /* Set inline limits for memmove/memcpy.  MOPS has a separate threshold.  */
+  unsigned HOST_WIDE_INT max_copy_size = is_memmove ? 0 : 256;
+  unsigned HOST_WIDE_INT max_mops_size = max_copy_size;
 
-  bool size_p = optimize_function_for_size_p (cfun);
+  if (TARGET_MOPS)
+max_mops_size = is_memmove ? aarch64_mops_memmove_size_threshold
+  : aarch64_mops_memcpy_size_threshold;
 
-  /* Large constant-sized cpymem should go through MOPS when 

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-08-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

>>> Answering my own question, N1 does not officially have FEAT_LSE2.
>> 
>> It doesn't indeed. However most cores support atomic 128-bit load/store
>> (part of LSE2), so we can still use the LSE2 ifunc for those cores. Since 
>> there
>> isn't a feature bit for this in the CPU or HWCAP, I check the CPUID register.
>
> That would be a really nice bit to add to HWCAP, then, to consolidate this 
> knowledge in 
> one place.  Certainly I would use it in QEMU as well.

Yes this was suggested by a colleague as well.  I'll ask and see whether the kernel
guys like the idea.  It would take some time to get added, so we still need this
for the time being.

Cheers,
Wilco

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-08-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

>> Why would HWCAP_USCAT not be set by the kernel?
>> 
>> Failing that, I would think you would check ID_AA64MMFR2_EL1.AT.
>>
> Answering my own question, N1 does not officially have FEAT_LSE2.

It doesn't indeed. However most cores support atomic 128-bit load/store
(part of LSE2), so we can still use the LSE2 ifunc for those cores. Since there
isn't a feature bit for this in the CPU or HWCAP, I check the CPUID register.

Cheers,
Wilco

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-08-04 Thread Wilco Dijkstra via Gcc-patches
ping

From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 
[PR110061] 
 


[PATCH] libatomic: Improve ifunc selection on AArch64

2023-08-04 Thread Wilco Dijkstra via Gcc-patches

Add support for ifunc selection based on CPUID register.  Neoverse N1 supports
atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse
cores.

Passes regress, OK for commit?

libatomic/
config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc
selection.

---

diff --git a/libatomic/config/linux/aarch64/host-config.h 
b/libatomic/config/linux/aarch64/host-config.h
index 
851c78c01cd643318aaa52929ce4550266238b79..e5dc33c030a4bab927874fa6c69425db463fdc4b
 100644
--- a/libatomic/config/linux/aarch64/host-config.h
+++ b/libatomic/config/linux/aarch64/host-config.h
@@ -26,7 +26,7 @@
 
 #ifdef HWCAP_USCAT
 # if N == 16
-#  define IFUNC_COND_1 (hwcap & HWCAP_USCAT)
+#  define IFUNC_COND_1 ifunc1 (hwcap)
 # else
 #  define IFUNC_COND_1 (hwcap & HWCAP_ATOMICS)
 # endif
@@ -50,4 +50,28 @@
 #undef MAYBE_HAVE_ATOMIC_EXCHANGE_16
 #define MAYBE_HAVE_ATOMIC_EXCHANGE_16  1
 
+#ifdef HWCAP_USCAT
+
+#define MIDR_IMPLEMENTOR(midr) (((midr) >> 24) & 255)
+#define MIDR_PARTNUM(midr) (((midr) >> 4) & 0xfff)
+
+static inline bool
+ifunc1 (unsigned long hwcap)
+{
+  if (hwcap & HWCAP_USCAT)
+return true;
+  if (!(hwcap & HWCAP_CPUID))
+return false;
+
+  unsigned long midr;
+  asm volatile ("mrs %0, midr_el1" : "=r" (midr));
+
+  /* Neoverse N1 supports atomic 128-bit load/store.  */
+  if (MIDR_IMPLEMENTOR (midr) == 'A' && MIDR_PARTNUM(midr) == 0xd0c)
+return true;
+
+  return false;
+}
+#endif
+
 #include_next 



Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-07-05 Thread Wilco Dijkstra via Gcc-patches

ping

From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 
[PR110061] 
 


Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-06-16 Thread Wilco Dijkstra via Gcc-patches

ping

From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 
[PR110061] 
 


[PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-06-02 Thread Wilco Dijkstra via Gcc-patches

Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with
existing binaries, gives better performance than locking atomics and is what
most users expect.

Note 128-bit atomic loads use a load/store exclusive loop if LSE2 is not supported.
This results in an implicit store which is invisible to software as long as the given
address is writeable (which will be true when using atomics in actual code).

A simple test on an old Cortex-A72 showed 2.7x speedup of 128-bit atomics.

Passes regress, OK for commit?

libatomic/
PR target/110061
config/linux/aarch64/atomic_16.S: Implement lock-free ARMv8.0 atomics.
config/linux/aarch64/host-config.h: Use atomic_16.S for baseline v8.0.
State we have lock-free atomics.

---

diff --git a/libatomic/config/linux/aarch64/atomic_16.S 
b/libatomic/config/linux/aarch64/atomic_16.S
index 
05439ce394b9653c9bcb582761ff7aaa7c8f9643..0485c284117edf54f41959d2fab9341a9567b1cf
 100644
--- a/libatomic/config/linux/aarch64/atomic_16.S
+++ b/libatomic/config/linux/aarch64/atomic_16.S
@@ -22,6 +22,21 @@
.  */
 
 
+/* AArch64 128-bit lock-free atomic implementation.
+
+   128-bit atomics are now lock-free for all AArch64 architecture versions.
+   This is backwards compatible with existing binaries and gives better
+   performance than locking atomics.
+
+   128-bit atomic loads use a exclusive loop if LSE2 is not supported.
+   This results in an implicit store which is invisible to software as long
+   as the given address is writeable.  Since all other atomics have explicit
+   writes, this will be true when using atomics in actual code.
+
+   The libat__16 entry points are ARMv8.0.
+   The libat__16_i1 entry points are used when LSE2 is available.  */
+
+
.arch   armv8-a+lse
 
 #define ENTRY(name)\
@@ -37,6 +52,10 @@ name:\
.cfi_endproc;   \
.size name, .-name;
 
+#define ALIAS(alias,name)  \
+   .global alias;  \
+   .set alias, name;
+
 #define res0 x0
 #define res1 x1
 #define in0  x2
@@ -70,6 +89,24 @@ name:\
 #define SEQ_CST 5
 
 
+ENTRY (libat_load_16)
+   mov x5, x0
+   cbnzw1, 2f
+
+   /* RELAXED.  */
+1: ldxpres0, res1, [x5]
+   stxpw4, res0, res1, [x5]
+   cbnzw4, 1b
+   ret
+
+   /* ACQUIRE/CONSUME/SEQ_CST.  */
+2: ldaxp   res0, res1, [x5]
+   stxpw4, res0, res1, [x5]
+   cbnzw4, 2b
+   ret
+END (libat_load_16)
+
+
 ENTRY (libat_load_16_i1)
cbnzw1, 1f
 
@@ -93,6 +130,23 @@ ENTRY (libat_load_16_i1)
 END (libat_load_16_i1)
 
 
+ENTRY (libat_store_16)
+   cbnzw4, 2f
+
+   /* RELAXED.  */
+1: ldxpxzr, tmp0, [x0]
+   stxpw4, in0, in1, [x0]
+   cbnzw4, 1b
+   ret
+
+   /* RELEASE/SEQ_CST.  */
+2: ldxpxzr, tmp0, [x0]
+   stlxp   w4, in0, in1, [x0]
+   cbnzw4, 2b
+   ret
+END (libat_store_16)
+
+
 ENTRY (libat_store_16_i1)
cbnzw4, 1f
 
@@ -101,14 +155,14 @@ ENTRY (libat_store_16_i1)
ret
 
/* RELEASE/SEQ_CST.  */
-1: ldaxp   xzr, tmp0, [x0]
+1: ldxpxzr, tmp0, [x0]
stlxp   w4, in0, in1, [x0]
cbnzw4, 1b
ret
 END (libat_store_16_i1)
 
 
-ENTRY (libat_exchange_16_i1)
+ENTRY (libat_exchange_16)
mov x5, x0
cbnzw4, 2f
 
@@ -126,22 +180,55 @@ ENTRY (libat_exchange_16_i1)
stxpw4, in0, in1, [x5]
cbnzw4, 3b
ret
-4:
-   cmp w4, RELEASE
-   b.ne6f
 
-   /* RELEASE.  */
-5: ldxpres0, res1, [x5]
+   /* RELEASE/ACQ_REL/SEQ_CST.  */
+4: ldaxp   res0, res1, [x5]
stlxp   w4, in0, in1, [x5]
-   cbnzw4, 5b
+   cbnzw4, 4b
ret
+END (libat_exchange_16)
 
-   /* ACQ_REL/SEQ_CST.  */
-6: ldaxp   res0, res1, [x5]
-   stlxp   w4, in0, in1, [x5]
-   cbnzw4, 6b
+
+ENTRY (libat_compare_exchange_16)
+   ldp exp0, exp1, [x1]
+   cbz w4, 3f
+   cmp w4, RELEASE
+   b.hs4f
+
+   /* ACQUIRE/CONSUME.  */
+1: ldaxp   tmp0, tmp1, [x0]
+   cmp tmp0, exp0
+   ccmptmp1, exp1, 0, eq
+   bne 2f
+   stxpw4, in0, in1, [x0]
+   cbnzw4, 1b
+   mov x0, 1
ret
-END (libat_exchange_16_i1)
+
+2: stp tmp0, tmp1, [x1]
+   mov x0, 0
+   ret
+
+   /* RELAXED.  */
+3: ldxptmp0, tmp1, [x0]
+   cmp tmp0, exp0
+   ccmptmp1, exp1, 0, eq
+   bne 2b
+   stxpw4, in0, in1, [x0]
+   cbnzw4, 3b
+   mov x0, 1
+   ret
+
+   /* RELEASE/ACQ_REL/SEQ_CST.  */
+4: ldaxp   tmp0, tmp1, [x0]
+   cmp tmp0, exp0
+   ccmptmp1, exp1, 0, eq
+   bne 2b
+   stlxp   w4, in0, in1, [x0]
+   cbnzw4, 4b
+   mov x0, 1
+   ret
+END (libat_compare_exchange_16)
 
 
 

Re: [PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]

2023-03-16 Thread Wilco Dijkstra via Gcc-patches
ping


From: Wilco Dijkstra
Sent: 23 February 2023 15:11
To: GCC Patches 
Cc: Richard Sandiford ; Kyrylo Tkachov 

Subject: [PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891] 
 

The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP -
without it, it effectively has Load-AcquirePC semantics similar to LDAPR,
which is less restrictive than what __ATOMIC_SEQ_CST requires.  This patch
fixes this and adds comments to make it easier to see which sequence is
used for each case.  Use a load/store exclusive loop for stores to simplify
testing that the memory ordering is correct (it is slightly faster too).

Passes regress, OK for commit?

libatomic/
    PR libgcc/108891
    config/linux/aarch64/atomic_16.S: Fix libat_load_16_i1.
    Add comments describing the memory order.

---

diff --git a/libatomic/config/linux/aarch64/atomic_16.S 
b/libatomic/config/linux/aarch64/atomic_16.S
index 
732c3534a06678664a252bdbc53652eeab0af506..05439ce394b9653c9bcb582761ff7aaa7c8f9643
 100644
--- a/libatomic/config/linux/aarch64/atomic_16.S
+++ b/libatomic/config/linux/aarch64/atomic_16.S
@@ -72,33 +72,38 @@ name:   \
 
 ENTRY (libat_load_16_i1)
 cbnz    w1, 1f
+
+   /* RELAXED.  */
 ldp res0, res1, [x0]
 ret
 1:
-   cmp w1, ACQUIRE
-   b.hi    2f
+   cmp w1, SEQ_CST
+   b.eq    2f
+
+   /* ACQUIRE/CONSUME (Load-AcquirePC semantics).  */
 ldp res0, res1, [x0]
 dmb ishld
 ret
-2:
+
+   /* SEQ_CST.  */
+2: ldar    tmp0, [x0]  /* Block reordering with Store-Release instr.  
*/
 ldp res0, res1, [x0]
-   dmb ish
+   dmb ishld
 ret
 END (libat_load_16_i1)
 
 
 ENTRY (libat_store_16_i1)
 cbnz    w4, 1f
+
+   /* RELAXED.  */
 stp in0, in1, [x0]
 ret
-1:
-   dmb ish
-   stp in0, in1, [x0]
-   cmp w4, SEQ_CST
-   beq 2f
-   ret
-2:
-   dmb ish
+
+   /* RELEASE/SEQ_CST.  */
+1: ldaxp   xzr, tmp0, [x0]
+   stlxp   w4, in0, in1, [x0]
+   cbnz    w4, 1b
 ret
 END (libat_store_16_i1)
 
@@ -106,29 +111,33 @@ END (libat_store_16_i1)
 ENTRY (libat_exchange_16_i1)
 mov x5, x0
 cbnz    w4, 2f
-1:
-   ldxp    res0, res1, [x5]
+
+   /* RELAXED.  */
+1: ldxp    res0, res1, [x5]
 stxp    w4, in0, in1, [x5]
 cbnz    w4, 1b
 ret
 2:
 cmp w4, ACQUIRE
 b.hi    4f
-3:
-   ldaxp   res0, res1, [x5]
+
+   /* ACQUIRE/CONSUME.  */
+3: ldaxp   res0, res1, [x5]
 stxp    w4, in0, in1, [x5]
 cbnz    w4, 3b
 ret
 4:
 cmp w4, RELEASE
 b.ne    6f
-5:
-   ldxp    res0, res1, [x5]
+
+   /* RELEASE.  */
+5: ldxp    res0, res1, [x5]
 stlxp   w4, in0, in1, [x5]
 cbnz    w4, 5b
 ret
-6:
-   ldaxp   res0, res1, [x5]
+
+   /* ACQ_REL/SEQ_CST.  */
+6: ldaxp   res0, res1, [x5]
 stlxp   w4, in0, in1, [x5]
 cbnz    w4, 6b
 ret
@@ -142,6 +151,8 @@ ENTRY (libat_compare_exchange_16_i1)
 cbz w4, 2f
 cmp w4, RELEASE
 b.hs    3f
+
+   /* ACQUIRE/CONSUME.  */
 caspa   exp0, exp1, in0, in1, [x0]
 0:
 cmp exp0, tmp0
@@ -153,15 +164,18 @@ ENTRY (libat_compare_exchange_16_i1)
 stp exp0, exp1, [x1]
 mov x0, 0
 ret
-2:
-   casp    exp0, exp1, in0, in1, [x0]
+
+   /* RELAXED.  */
+2: casp    exp0, exp1, in0, in1, [x0]
 b   0b
-3:
-   b.hi    4f
+
+   /* RELEASE.  */
+3: b.hi    4f
 caspl   exp0, exp1, in0, in1, [x0]
 b   0b
-4:
-   caspal  exp0, exp1, in0, in1, [x0]
+
+   /* ACQ_REL/SEQ_CST.  */
+4: caspal  exp0, exp1, in0, in1, [x0]
 b   0b
 END (libat_compare_exchange_16_i1)
 
@@ -169,15 +183,17 @@ END (libat_compare_exchange_16_i1)
 ENTRY (libat_fetch_add_16_i1)
 mov x5, x0
 cbnz    w4, 2f
-1:
-   ldxp    res0, res1, [x5]
+
+   /* RELAXED.  */
+1: ldxp    res0, res1, [x5]
 adds    tmplo, reslo, inlo
 adc tmphi, reshi, inhi
 stxp    w4, tmp0, tmp1, [x5]
 cbnz    w4, 1b
 ret
-2:
-   ldaxp   res0, res1, [x5]
+
+   /* ACQUIRE/CONSUME/RELEASE/ACQ_REL/SEQ_CST.  */
+2: ldaxp   res0, res1, [x5]
 adds    tmplo, reslo, inlo
 adc tmphi, reshi, inhi
 stlxp   w4, tmp0, tmp1, [x5]
@@ -189,15 +205,17 @@ END (libat_fetch_add_16_i1)
 ENTRY (libat_add_fetch_16_i1)
 mov x5, x0
 cbnz    w4, 2f
-1:
-   ldxp    res0, res1, [x5]
+
+   /* RELAXED.  */
+1: ldxp    res0, res1, [x5]
 adds    reslo, reslo, inlo
 adc reshi, reshi, inhi
 stxp    w4, res0, res1, [x5]
 cbnz    w4, 1b
 ret
-2:
-   ldaxp   res0, res1, [x5]
+
+   /* ACQUIRE/CONSUME/RELEASE/ACQ_REL/SEQ_CST. 

[PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]

2023-02-23 Thread Wilco Dijkstra via Gcc-patches

The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP -
without it, it effectively has Load-AcquirePC semantics similar to LDAPR,
which is less restrictive than what __ATOMIC_SEQ_CST requires.  This patch
fixes this and adds comments to make it easier to see which sequence is
used for each case.  Use a load/store-exclusive loop for stores to make it
easier to test that the memory ordering is correct (it is also slightly faster).

Passes regress, OK for commit?

libatomic/
PR libgcc/108891
* config/linux/aarch64/atomic_16.S: Fix libat_load_16_i1.
Add comments describing the memory order.

---

diff --git a/libatomic/config/linux/aarch64/atomic_16.S 
b/libatomic/config/linux/aarch64/atomic_16.S
index 
732c3534a06678664a252bdbc53652eeab0af506..05439ce394b9653c9bcb582761ff7aaa7c8f9643
 100644
--- a/libatomic/config/linux/aarch64/atomic_16.S
+++ b/libatomic/config/linux/aarch64/atomic_16.S
@@ -72,33 +72,38 @@ name:   \
 
 ENTRY (libat_load_16_i1)
 	cbnz	w1, 1f
+
+	/* RELAXED.  */
 	ldp	res0, res1, [x0]
 	ret
 1:
-	cmp	w1, ACQUIRE
-	b.hi	2f
+	cmp	w1, SEQ_CST
+	b.eq	2f
+
+	/* ACQUIRE/CONSUME (Load-AcquirePC semantics).  */
 	ldp	res0, res1, [x0]
 	dmb	ishld
 	ret
-2:
+
+	/* SEQ_CST.  */
+2:	ldar	tmp0, [x0]	/* Block reordering with Store-Release instr.  */
 	ldp	res0, res1, [x0]
-	dmb	ish
+	dmb	ishld
 	ret
 END (libat_load_16_i1)
 
 
 ENTRY (libat_store_16_i1)
 	cbnz	w4, 1f
+
+	/* RELAXED.  */
 	stp	in0, in1, [x0]
 	ret
-1:
-	dmb	ish
-	stp	in0, in1, [x0]
-	cmp	w4, SEQ_CST
-	beq	2f
-	ret
-2:
-	dmb	ish
+
+	/* RELEASE/SEQ_CST.  */
+1:	ldaxp	xzr, tmp0, [x0]
+	stlxp	w4, in0, in1, [x0]
+	cbnz	w4, 1b
 	ret
 END (libat_store_16_i1)
 
@@ -106,29 +111,33 @@ END (libat_store_16_i1)
 ENTRY (libat_exchange_16_i1)
 	mov	x5, x0
 	cbnz	w4, 2f
-1:
-	ldxp	res0, res1, [x5]
+
+	/* RELAXED.  */
+1:	ldxp	res0, res1, [x5]
 	stxp	w4, in0, in1, [x5]
 	cbnz	w4, 1b
 	ret
 2:
 	cmp	w4, ACQUIRE
 	b.hi	4f
-3:
-	ldaxp	res0, res1, [x5]
+
+	/* ACQUIRE/CONSUME.  */
+3:	ldaxp	res0, res1, [x5]
 	stxp	w4, in0, in1, [x5]
 	cbnz	w4, 3b
 	ret
 4:
 	cmp	w4, RELEASE
 	b.ne	6f
-5:
-	ldxp	res0, res1, [x5]
+
+	/* RELEASE.  */
+5:	ldxp	res0, res1, [x5]
 	stlxp	w4, in0, in1, [x5]
 	cbnz	w4, 5b
 	ret
-6:
-	ldaxp	res0, res1, [x5]
+
+	/* ACQ_REL/SEQ_CST.  */
+6:	ldaxp	res0, res1, [x5]
 	stlxp	w4, in0, in1, [x5]
 	cbnz	w4, 6b
 	ret
@@ -142,6 +151,8 @@ ENTRY (libat_compare_exchange_16_i1)
 	cbz	w4, 2f
 	cmp	w4, RELEASE
 	b.hs	3f
+
+	/* ACQUIRE/CONSUME.  */
 	caspa	exp0, exp1, in0, in1, [x0]
 0:
 	cmp	exp0, tmp0
@@ -153,15 +164,18 @@ ENTRY (libat_compare_exchange_16_i1)
 	stp	exp0, exp1, [x1]
 	mov	x0, 0
 	ret
-2:
-	casp	exp0, exp1, in0, in1, [x0]
+
+	/* RELAXED.  */
+2:	casp	exp0, exp1, in0, in1, [x0]
 	b	0b
-3:
-	b.hi	4f
+
+	/* RELEASE.  */
+3:	b.hi	4f
 	caspl	exp0, exp1, in0, in1, [x0]
 	b	0b
-4:
-	caspal	exp0, exp1, in0, in1, [x0]
+
+	/* ACQ_REL/SEQ_CST.  */
+4:	caspal	exp0, exp1, in0, in1, [x0]
 	b	0b
 END (libat_compare_exchange_16_i1)
 
@@ -169,15 +183,17 @@ END (libat_compare_exchange_16_i1)
 ENTRY (libat_fetch_add_16_i1)
 	mov	x5, x0
 	cbnz	w4, 2f
-1:
-	ldxp	res0, res1, [x5]
+
+	/* RELAXED.  */
+1:	ldxp	res0, res1, [x5]
 	adds	tmplo, reslo, inlo
 	adc	tmphi, reshi, inhi
 	stxp	w4, tmp0, tmp1, [x5]
 	cbnz	w4, 1b
 	ret
-2:
-	ldaxp	res0, res1, [x5]
+
+	/* ACQUIRE/CONSUME/RELEASE/ACQ_REL/SEQ_CST.  */
+2:	ldaxp	res0, res1, [x5]
 	adds	tmplo, reslo, inlo
 	adc	tmphi, reshi, inhi
 	stlxp	w4, tmp0, tmp1, [x5]
@@ -189,15 +205,17 @@ END (libat_fetch_add_16_i1)
 ENTRY (libat_add_fetch_16_i1)
 	mov	x5, x0
 	cbnz	w4, 2f
-1:
-	ldxp	res0, res1, [x5]
+
+	/* RELAXED.  */
+1:	ldxp	res0, res1, [x5]
 	adds	reslo, reslo, inlo
 	adc	reshi, reshi, inhi
 	stxp	w4, res0, res1, [x5]
 	cbnz	w4, 1b
 	ret
-2:
-	ldaxp	res0, res1, [x5]
+
+	/* ACQUIRE/CONSUME/RELEASE/ACQ_REL/SEQ_CST.  */
+2:	ldaxp	res0, res1, [x5]
 	adds	reslo, reslo, inlo
 	adc	reshi, reshi, inhi
 	stlxp	w4, res0, res1, [x5]
@@ -209,15 +227,17 @@ END (libat_add_fetch_16_i1)
 ENTRY (libat_fetch_sub_16_i1)
 	mov	x5, x0

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-18 Thread Wilco Dijkstra via Gcc-patches
Hi,

>> +  /* Return-address signing state is toggled by DW_CFA_GNU_window_save 
>> (where
>> + REG_UNDEFINED means enabled), or set by a DW_CFA_expression.  */
>
> Needs updating to REG_UNSAVED_ARCHEXT.
> 
> OK with that changes, thanks, and sorry for the delays & runaround.

Thanks, I've improved the comment and it has been committed to trunk now.

Cheers,
Wilco

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-17 Thread Wilco Dijkstra via Gcc-patches
Hi,

> @Wilco, can you please send the rebased patch for patch review? We would
> need in out openSUSE package soon.

Here is an updated and rebased version:

Cheers,
Wilco

v4: rebase and add REG_UNSAVED_ARCHEXT.

A recent change only initializes the regs.how[] during Dwarf unwinding
which resulted in an uninitialized offset used in return address signing
and random failures during unwinding.  The fix is to encode the return
address signing state in REG_UNSAVED and a new state REG_UNSAVED_ARCHEXT.
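
For context, a hedged sketch (not from the patch) of where this state comes
from: a function built with -mbranch-protection=pac-ret signs its return
address and emits the .cfi_negate_ra_state directive, i.e. the
DW_CFA_GNU_window_save opcode that the unwinder toggles in the diff below:

    /* foo.c, compiled with -mbranch-protection=pac-ret.  */
    extern void may_throw (void);   /* e.g. calls C++ code that throws  */

    void foo (void)
    {
      may_throw ();   /* unwinding through foo consults the RA-state reg  */
    }

    /* Roughly the assembly GCC emits (illustrative only):
           foo:
               paciasp                  // sign LR with SP as modifier
               .cfi_negate_ra_state     // toggle RA signing state for unwinder
               ...
               retaa                    // authenticate LR and return
       The bug was that the unwinder read an uninitialized loc.offset for
       this pseudo register when no such opcode had been seen yet.  */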

Passes bootstrap & regress, OK for commit?

libgcc/
PR target/107678
* unwind-dw2.h (REG_UNSAVED_ARCHEXT): Add new enum.
* unwind-dw2.c (uw_update_context_1): Add REG_UNSAVED_ARCHEXT case.
* unwind-dw2-execute_cfa.h: Use REG_UNSAVED_ARCHEXT/REG_UNSAVED to 
encode the return address signing state.
* config/aarch64/aarch64-unwind.h (aarch64_demangle_return_addr)
Check current return address signing state.
(aarch64_frob_update_context): Remove.

---
diff --git a/libgcc/config/aarch64/aarch64-unwind.h 
b/libgcc/config/aarch64/aarch64-unwind.h
index 
874cf6c3e77fb72d999f51b636d74cb0b5728bbd..727c27ba5da983958b3134715d9d4d7c0af5c1e2
 100644
--- a/libgcc/config/aarch64/aarch64-unwind.h
+++ b/libgcc/config/aarch64/aarch64-unwind.h
@@ -29,8 +29,6 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If 
not, see
 
 #define MD_DEMANGLE_RETURN_ADDR(context, fs, addr) \
   aarch64_demangle_return_addr (context, fs, addr)
-#define MD_FROB_UPDATE_CONTEXT(context, fs) \
-  aarch64_frob_update_context (context, fs)
 
 static inline int
 aarch64_cie_signed_with_b_key (struct _Unwind_Context *context)
@@ -55,42 +53,27 @@ aarch64_cie_signed_with_b_key (struct _Unwind_Context 
*context)
 
 static inline void *
 aarch64_demangle_return_addr (struct _Unwind_Context *context,
- _Unwind_FrameState *fs ATTRIBUTE_UNUSED,
+ _Unwind_FrameState *fs,
  _Unwind_Word addr_word)
 {
   void *addr = (void *)addr_word;
-  if (context->flags & RA_SIGNED_BIT)
+  const int reg = DWARF_REGNUM_AARCH64_RA_STATE;
+
+  if (fs->regs.how[reg] == REG_UNSAVED)
+return addr;
+
+  /* Return-address signing state is toggled by DW_CFA_GNU_window_save (where
+ REG_UNDEFINED means enabled), or set by a DW_CFA_expression.  */
+  if (fs->regs.how[reg] == REG_UNSAVED_ARCHEXT
+  || (_Unwind_GetGR (context, reg) & 0x1) != 0)
 {
   _Unwind_Word salt = (_Unwind_Word) context->cfa;
   if (aarch64_cie_signed_with_b_key (context) != 0)
return __builtin_aarch64_autib1716 (addr, salt);
   return __builtin_aarch64_autia1716 (addr, salt);
 }
-  else
-return addr;
-}
-
-/* Do AArch64 private initialization on CONTEXT based on frame info FS.  Mark
-   CONTEXT as return address signed if bit 0 of DWARF_REGNUM_AARCH64_RA_STATE 
is
-   set.  */
-
-static inline void
-aarch64_frob_update_context (struct _Unwind_Context *context,
-_Unwind_FrameState *fs)
-{
-  const int reg = DWARF_REGNUM_AARCH64_RA_STATE;
-  int ra_signed;
-  if (fs->regs.how[reg] == REG_UNSAVED)
-ra_signed = fs->regs.reg[reg].loc.offset & 0x1;
-  else
-ra_signed = _Unwind_GetGR (context, reg) & 0x1;
-  if (ra_signed)
-/* The flag is used for re-authenticating EH handler's address.  */
-context->flags |= RA_SIGNED_BIT;
-  else
-context->flags &= ~RA_SIGNED_BIT;
 
-  return;
+  return addr;
 }
 
 #endif /* defined AARCH64_UNWIND_H && defined __ILP32__ */
diff --git a/libgcc/unwind-dw2-execute_cfa.h b/libgcc/unwind-dw2-execute_cfa.h
index 
264c11c03ec4a09cac2c19a241c5b110b1b6b602..aef377092ceede6bdda8532679f9b081c98fadce
 100644
--- a/libgcc/unwind-dw2-execute_cfa.h
+++ b/libgcc/unwind-dw2-execute_cfa.h
@@ -278,10 +278,15 @@
case DW_CFA_GNU_window_save:
 #if defined (__aarch64__) && !defined (__ILP32__)
  /* This CFA is multiplexed with Sparc.  On AArch64 it's used to toggle
-return address signing status.  */
+return address signing status.  REG_UNSAVED/REG_UNSAVED_ARCHEXT
+mean RA signing is disabled/enabled.  */
  reg = DWARF_REGNUM_AARCH64_RA_STATE;
- gcc_assert (fs->regs.how[reg] == REG_UNSAVED);
- fs->regs.reg[reg].loc.offset ^= 1;
+ gcc_assert (fs->regs.how[reg] == REG_UNSAVED
+ || fs->regs.how[reg] == REG_UNSAVED_ARCHEXT);
+ if (fs->regs.how[reg] == REG_UNSAVED)
+   fs->regs.how[reg] = REG_UNSAVED_ARCHEXT;
+ else
+   fs->regs.how[reg] = REG_UNSAVED;
 #else
  /* ??? Hardcoded for SPARC register window configuration.  */
  if (__LIBGCC_DWARF_FRAME_REGISTERS__ >= 32)
diff --git a/libgcc/unwind-dw2.h b/libgcc/unwind-dw2.h
index 
e2f81983e1dcf3df6aebde2454630b7bee87d597..53e1b183c7d60112a14411d3356c49cb39cd0de7
 100644
--- a/libgcc/unwind-dw2.h
+++ b/libgcc/unwind-dw2.h
@@ -29,6 +29,7 @@ enum {
   REG_SAVED_EXP,
   

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-11 Thread Wilco Dijkstra via Gcc-patches
Hi,

> On 1/10/23 19:12, Jakub Jelinek via Gcc-patches wrote:
>> Anyway, the sooner this makes it into gcc trunk, the better, it breaks quite
>> a lot of stuff.
>
> Yep, please, we're also waiting for this patch for pushing to our gcc13 
> package.

Well I'm waiting for an OK from a maintainer... I believe Jakub can approve it 
as well.

Cheers,
Wilco

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-10 Thread Wilco Dijkstra via Gcc-patches
Hi Szabolcs,

> i would keep the assert: how[reg] must be either UNSAVED or UNDEFINED
> here, other how[reg] means the toggle cfi instruction is mixed with
> incompatible instructions for the pseudo reg.
>
> and i would add a comment about this e.g. saying that UNSAVED/UNDEFINED
> how[reg] is used for tracking the return address signing status and
> other how[reg] is not allowed here.

I've added the assert back and updated the comment.

Cheers,
Wilco

v3: Improve comments, add assert.

A recent change only initializes the regs.how[] during Dwarf unwinding
which resulted in an uninitialized offset used in return address signing
and random failures during unwinding.  The fix is to encode the return
address signing state in REG_UNSAVED and REG_UNDEFINED.

Passes bootstrap & regress, OK for commit?

libgcc/
PR target/107678
* unwind-dw2.c (execute_cfa_program): Use REG_UNSAVED/UNDEFINED
to encode return address signing state.
* config/aarch64/aarch64-unwind.h (aarch64_demangle_return_addr)
Check current return address signing state.
(aarch64_frob_update_context): Remove.

---

diff --git a/libgcc/config/aarch64/aarch64-unwind.h 
b/libgcc/config/aarch64/aarch64-unwind.h
index 
26db9cbd9e5c526e0c410a4fc6be2bedb7d261cf..1afc3f9d308b95bc787398263e629bab226ff1ba
 100644
--- a/libgcc/config/aarch64/aarch64-unwind.h
+++ b/libgcc/config/aarch64/aarch64-unwind.h
@@ -29,8 +29,6 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If 
not, see
 
 #define MD_DEMANGLE_RETURN_ADDR(context, fs, addr) \
   aarch64_demangle_return_addr (context, fs, addr)
-#define MD_FROB_UPDATE_CONTEXT(context, fs) \
-  aarch64_frob_update_context (context, fs)
 
 static inline int
 aarch64_cie_signed_with_b_key (struct _Unwind_Context *context)
@@ -55,42 +53,27 @@ aarch64_cie_signed_with_b_key (struct _Unwind_Context 
*context)
 
 static inline void *
 aarch64_demangle_return_addr (struct _Unwind_Context *context,
- _Unwind_FrameState *fs ATTRIBUTE_UNUSED,
+ _Unwind_FrameState *fs,
  _Unwind_Word addr_word)
 {
   void *addr = (void *)addr_word;
-  if (context->flags & RA_SIGNED_BIT)
+  const int reg = DWARF_REGNUM_AARCH64_RA_STATE;
+
+  if (fs->regs.how[reg] == REG_UNSAVED)
+return addr;
+
+  /* Return-address signing state is toggled by DW_CFA_GNU_window_save (where
+ REG_UNDEFINED means enabled), or set by a DW_CFA_expression.  */
+  if (fs->regs.how[reg] == REG_UNDEFINED
+  || (_Unwind_GetGR (context, reg) & 0x1) != 0)
 {
   _Unwind_Word salt = (_Unwind_Word) context->cfa;
   if (aarch64_cie_signed_with_b_key (context) != 0)
return __builtin_aarch64_autib1716 (addr, salt);
   return __builtin_aarch64_autia1716 (addr, salt);
 }
-  else
-return addr;
-}
-
-/* Do AArch64 private initialization on CONTEXT based on frame info FS.  Mark
-   CONTEXT as return address signed if bit 0 of DWARF_REGNUM_AARCH64_RA_STATE 
is
-   set.  */
-
-static inline void
-aarch64_frob_update_context (struct _Unwind_Context *context,
-_Unwind_FrameState *fs)
-{
-  const int reg = DWARF_REGNUM_AARCH64_RA_STATE;
-  int ra_signed;
-  if (fs->regs.how[reg] == REG_UNSAVED)
-ra_signed = fs->regs.reg[reg].loc.offset & 0x1;
-  else
-ra_signed = _Unwind_GetGR (context, reg) & 0x1;
-  if (ra_signed)
-/* The flag is used for re-authenticating EH handler's address.  */
-context->flags |= RA_SIGNED_BIT;
-  else
-context->flags &= ~RA_SIGNED_BIT;
 
-  return;
+  return addr;
 }
 
 #endif /* defined AARCH64_UNWIND_H && defined __ILP32__ */
diff --git a/libgcc/unwind-dw2.c b/libgcc/unwind-dw2.c
index 
eaceace20298b9b13344aff9d1fe9ee5f9c7bd73..55fe35520106e848c5d4aea4e7104bf4a0c14891
 100644
--- a/libgcc/unwind-dw2.c
+++ b/libgcc/unwind-dw2.c
@@ -139,7 +139,6 @@ struct _Unwind_Context
 #define EXTENDED_CONTEXT_BIT ((~(_Unwind_Word) 0 >> 2) + 1)
   /* Bit reserved on AArch64, return address has been signed with A or B
  key.  */
-#define RA_SIGNED_BIT ((~(_Unwind_Word) 0 >> 3) + 1)
   _Unwind_Word flags;
   /* 0 for now, can be increased when further fields are added to
  struct _Unwind_Context.  */
@@ -1204,10 +1203,15 @@ execute_cfa_program (const unsigned char *insn_ptr,
case DW_CFA_GNU_window_save:
 #if defined (__aarch64__) && !defined (__ILP32__)
  /* This CFA is multiplexed with Sparc.  On AArch64 it's used to toggle
-return address signing status.  */
+return address signing status.  The REG_UNDEFINED/UNSAVED states
+mean RA signing is enabled/disabled.  */
  reg = DWARF_REGNUM_AARCH64_RA_STATE;
- gcc_assert (fs->regs.how[reg] == REG_UNSAVED);
- fs->regs.reg[reg].loc.offset ^= 1;
+ gcc_assert (fs->regs.how[reg] == REG_UNSAVED
+ || fs->regs.how[reg] == REG_UNDEFINED);
+ if (fs->regs.how[reg] == REG_UNSAVED)
+   

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-03 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Hmm, but the point of the original patch was to support code generators
> that emit DW_CFA_val_expression instead of DW_CFA_AARCH64_negate_ra_state.
> Doesn't this patch undo that?

Well it wasn't clear from the code or comments that was supported. I've
added that back in v2.

> Also, if I understood correctly, the reason we use REG_UNSAVED is to
> ensure that state from one frame isn't carried across to a parent frame,
> in cases where the parent frame lacks any signing.  That is, each frame
> should start out with a zero bit even if a child frame is unwound while
> it has a set bit.

This works fine since all registers are initialized to REG_UNSAVED every frame.

In v2 I've removed some clutter and encode the signing state in REG_UNSAVED/
REG_UNDEFINED.

Cheers,
Wilco

v2: Further cleanup, support DW_CFA_expression.

A recent change only initializes the regs.how[] during Dwarf unwinding
which resulted in an uninitialized offset used in return address signing
and random failures during unwinding.  The fix is to encode the return
address signing state in REG_UNSAVED and REG_UNDEFINED.

Passes bootstrap & regress, OK for commit?

libgcc/
PR target/107678
* unwind-dw2.c (execute_cfa_program): Use REG_UNSAVED/UNDEFINED
to encode return address signing state.
* config/aarch64/aarch64-unwind.h (aarch64_demangle_return_addr)
Check current return address signing state.
(aarch64_frob_update_context): Remove.

---
diff --git a/libgcc/config/aarch64/aarch64-unwind.h 
b/libgcc/config/aarch64/aarch64-unwind.h
index 
26db9cbd9e5c526e0c410a4fc6be2bedb7d261cf..1afc3f9d308b95bc787398263e629bab226ff1ba
 100644
--- a/libgcc/config/aarch64/aarch64-unwind.h
+++ b/libgcc/config/aarch64/aarch64-unwind.h
@@ -29,8 +29,6 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If 
not, see
 
 #define MD_DEMANGLE_RETURN_ADDR(context, fs, addr) \
   aarch64_demangle_return_addr (context, fs, addr)
-#define MD_FROB_UPDATE_CONTEXT(context, fs) \
-  aarch64_frob_update_context (context, fs)
 
 static inline int
 aarch64_cie_signed_with_b_key (struct _Unwind_Context *context)
@@ -55,42 +53,27 @@ aarch64_cie_signed_with_b_key (struct _Unwind_Context 
*context)
 
 static inline void *
 aarch64_demangle_return_addr (struct _Unwind_Context *context,
- _Unwind_FrameState *fs ATTRIBUTE_UNUSED,
+ _Unwind_FrameState *fs,
  _Unwind_Word addr_word)
 {
   void *addr = (void *)addr_word;
-  if (context->flags & RA_SIGNED_BIT)
+  const int reg = DWARF_REGNUM_AARCH64_RA_STATE;
+
+  if (fs->regs.how[reg] == REG_UNSAVED)
+return addr;
+
+  /* Return-address signing state is toggled by DW_CFA_GNU_window_save (where
+ REG_UNDEFINED means enabled), or set by a DW_CFA_expression.  */
+  if (fs->regs.how[reg] == REG_UNDEFINED
+  || (_Unwind_GetGR (context, reg) & 0x1) != 0)
 {
   _Unwind_Word salt = (_Unwind_Word) context->cfa;
   if (aarch64_cie_signed_with_b_key (context) != 0)
return __builtin_aarch64_autib1716 (addr, salt);
   return __builtin_aarch64_autia1716 (addr, salt);
 }
-  else
-return addr;
-}
-
-/* Do AArch64 private initialization on CONTEXT based on frame info FS.  Mark
-   CONTEXT as return address signed if bit 0 of DWARF_REGNUM_AARCH64_RA_STATE 
is
-   set.  */
-
-static inline void
-aarch64_frob_update_context (struct _Unwind_Context *context,
-_Unwind_FrameState *fs)
-{
-  const int reg = DWARF_REGNUM_AARCH64_RA_STATE;
-  int ra_signed;
-  if (fs->regs.how[reg] == REG_UNSAVED)
-ra_signed = fs->regs.reg[reg].loc.offset & 0x1;
-  else
-ra_signed = _Unwind_GetGR (context, reg) & 0x1;
-  if (ra_signed)
-/* The flag is used for re-authenticating EH handler's address.  */
-context->flags |= RA_SIGNED_BIT;
-  else
-context->flags &= ~RA_SIGNED_BIT;
 
-  return;
+  return addr;
 }
 
 #endif /* defined AARCH64_UNWIND_H && defined __ILP32__ */
diff --git a/libgcc/unwind-dw2.c b/libgcc/unwind-dw2.c
index 
eaceace20298b9b13344aff9d1fe9ee5f9c7bd73..7c200cb6e730c5d63cf200ebe8a903f858e79d07
 100644
--- a/libgcc/unwind-dw2.c
+++ b/libgcc/unwind-dw2.c
@@ -139,7 +139,6 @@ struct _Unwind_Context
 #define EXTENDED_CONTEXT_BIT ((~(_Unwind_Word) 0 >> 2) + 1)
   /* Bit reserved on AArch64, return address has been signed with A or B
  key.  */
-#define RA_SIGNED_BIT ((~(_Unwind_Word) 0 >> 3) + 1)
   _Unwind_Word flags;
   /* 0 for now, can be increased when further fields are added to
  struct _Unwind_Context.  */
@@ -1206,8 +1205,10 @@ execute_cfa_program (const unsigned char *insn_ptr,
  /* This CFA is multiplexed with Sparc.  On AArch64 it's used to toggle
 return address signing status.  */
  reg = DWARF_REGNUM_AARCH64_RA_STATE;
- gcc_assert (fs->regs.how[reg] == REG_UNSAVED);
- fs->regs.reg[reg].loc.offset ^= 1;
+ if (fs->regs.how[reg] == REG_UNSAVED)
+  

[PATCH] AArch64: Enable TARGET_CONST_ANCHOR

2022-12-09 Thread Wilco Dijkstra via Gcc-patches
Enable TARGET_CONST_ANCHOR to allow complex constants to be created via an
immediate add.  Use a 24-bit range, as that enables a 3- or 4-instruction
immediate to be replaced by 2 additions.  Fix the costing of immediate add to
support 24-bit immediates and 12-bit shifted immediates.  The generated code
for the testcase is now the same as or better than LLVM.  It also results in
a small codesize reduction on SPEC.
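
As a hedged illustration (mirroring the f32_3 case in the new test below),
const anchoring lets the second constant be derived from the first with an
add instead of a fresh MOV/MOVK pair; the exact schedule is up to the
compiler:

    void f32_3 (long *p)
    {
      p[0] = 0x12345678;
      p[2] = 0x12345678 + 0x999000;
    }

    /* Expected shape of the code (illustrative):
           mov  x1, 0x5678
           movk x1, 0x1234, lsl 16
           str  x1, [x0]
           add  x1, x1, 0x999, lsl 12   // anchor + 12-bit shifted immediate
           str  x1, [x0, 16]
       A full 24-bit offset would instead need two adds.  */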

Passes bootstrap and regress, OK for commit?

gcc/
* config/aarch64/aarch64.cc (aarch64_rtx_costs): Add correct costs for
24-bit immediate add and 12-bit high immediate add.
(TARGET_CONST_ANCHOR): Define.

gcc/testsuite/
* gcc.target/aarch64/movk_3.c: New test.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
e97f3b32f7c7f43564d6a4207eae5a34b9e9bfe7..a73741800c963ee6605fd2cfa918f4399da4bfdf
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -14257,6 +14257,16 @@ cost_plus:
return true;
  }
 
+   if (aarch64_pluslong_immediate (op1, mode))
+ {
+   /* 24-bit add in 2 instructions or 12-bit shifted add.  */
+   if ((INTVAL (op1) & 0xfff) != 0)
+ *cost += COSTS_N_INSNS (1);
+
+   *cost += rtx_cost (op0, mode, PLUS, 0, speed);
+   return true;
+ }
+
*cost += rtx_cost (op1, mode, PLUS, 1, speed);
 
/* Look for ADD (extended register).  */
@@ -28051,6 +28061,9 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_HAVE_SHADOW_CALL_STACK
 #define TARGET_HAVE_SHADOW_CALL_STACK true
 
+#undef TARGET_CONST_ANCHOR
+#define TARGET_CONST_ANCHOR 0x100
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-aarch64.h"
diff --git a/gcc/testsuite/gcc.target/aarch64/movk_3.c 
b/gcc/testsuite/gcc.target/aarch64/movk_3.c
new file mode 100644
index 
..9e8c0c42671bef3f63028b4e51d0bd78c9903994
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/movk_3.c
@@ -0,0 +1,56 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --save-temps" } */
+
+
+/* 2 MOV */
+void f16 (long *p)
+{
+  p[0] = 0x1234;
+  p[2] = 0x1235;
+}
+
+/* MOV, MOVK and ADD */
+void f32_1 (long *p)
+{
+  p[0] = 0x12345678;
+  p[2] = 0x12345678 + 0xfff;
+}
+
+/* 2 MOV, 2 MOVK */
+void f32_2 (long *p)
+{
+  p[0] = 0x12345678;
+  p[2] = 0x12345678 + 0x55;
+}
+
+/* MOV, MOVK and ADD */
+void f32_3 (long *p)
+{
+  p[0] = 0x12345678;
+  p[2] = 0x12345678 + 0x999000;
+}
+
+/* MOV, 2 MOVK and ADD */
+void f48_1 (long *p)
+{
+  p[0] = 0x123456789abc;
+  p[2] = 0x123456789abc + 0xfff;
+}
+
+/* MOV, 2 MOVK and 2 ADD */
+void f48_2 (long *p)
+{
+  p[0] = 0x123456789abc;
+  p[2] = 0x123456789abc + 0x66;
+}
+
+/* 2 MOV, 4 MOVK */
+void f48_3 (long *p)
+{
+  p[0] = 0x123456789abc;
+  p[2] = 0x123456789abc + 0x166;
+}
+
+/* { dg-final { scan-assembler-times "mov\tx\[0-9\]+, \[0-9\]+" 10 } } */
+/* { dg-final { scan-assembler-times "movk\tx\[0-9\]+, 0x\[0-9a-f\]+" 12 } } */
+/* { dg-final { scan-assembler-times "add\tx\[0-9\]+, x\[0-9\]+, \[0-9\]+" 5 } 
} */



Re: [PATCH][AArch64] Cleanup move immediate code

2022-12-07 Thread Wilco Dijkstra via Gcc-patches
Hi Andreas,

Thanks for the report, I've committed the fix: 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006

Cheers,
Wilco


[COMMITTED] AArch64: Fix assert in aarch64_move_imm [PR108006]

2022-12-07 Thread Wilco Dijkstra via Gcc-patches
Ensure we only pass SI/DImode which fixes the assert.

Committed as obvious.

gcc/
        PR target/108006
* config/aarch64/aarch64.cc (aarch64_expand_sve_const_vector):
        Fix call to aarch64_move_imm to use SI/DI.
---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
89bf0dff904b6b52b71841aec299541f01884f3d..27a814d862101ce244c52d4863c6158cf549f066
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6513,7 +6513,8 @@ aarch64_expand_sve_const_vector (rtx target, rtx src)
  /* If the integer can be moved into a general register by a
 single instruction, do that and duplicate the result.  */
  if (CONST_INT_P (elt_value)
- && aarch64_move_imm (INTVAL (elt_value), elt_mode))
+ && aarch64_move_imm (INTVAL (elt_value),
+  encoded_bits <= 32 ? SImode : DImode))
{
  elt_value = force_reg (elt_mode, elt_value);
  return expand_vector_broadcast (mode, elt_value);



Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2022-12-06 Thread Wilco Dijkstra via Gcc-patches
Hi,

> i don't think how[*RA_STATE] can ever be set to REG_SAVED_OFFSET,
> this pseudo reg is not spilled to the stack, it is reset to 0 in
> each frame and then toggled within a frame.

It's just a state; we can use any state we want since it is a pseudo reg.
These registers are global and shared across all functions in an unwind,
so their state or value isn't reset for each frame. So if we want to reset
it in each frame then using a virtual register to hold per-function data
seems like a bad design. I'm surprised nobody has ever tested it...

Cheers,
Wilco

Re: [PATCH][AArch64] Cleanup move immediate code

2022-12-05 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> -  scalar_int_mode imode = (mode == HFmode
> -    ? SImode
> -    : int_mode_for_mode (mode).require ());
> +  machine_mode imode = (mode == DFmode) ? DImode : SImode;

> It looks like this might mishandle DDmode, if not now then in the future.
> Seems safer to use known_eq (GET_MODE_SIZE (mode), 8)

I've changed that, but it does not matter for the narrow modes as the result
will be identical - only DDmode might get costed incorrectly.

> Sorry for not noticing last time, but: rather than have
> aarch64_zeroextended_move_imm (which is quite a complicated test),
> could we just add an extra (default off) parameter to aarch64_move_imm
> that suppresses the (val >> 32) == 0 test?

That makes things more complicated again - ultimately I'd like to get rid of the
mode parameter since most callers use a fixed mode, and ones that don't are
now creating and passing fake modes... I've changed it like aarch64_move_imm
and call aarch64_is_movz twice to check it is not a 64-bit MOVZ/MOVN.

Cheers,
Wilco

v3: Use aarch64_is_movz, use known_eq

Simplify, refactor and improve various move immediate functions.
Allow 32-bit MOVI/N as a valid 64-bit immediate which removes special
cases in aarch64_internal_mov_immediate.  Add new constraint so the movdi
pattern only needs a single alternative for move immediate.

Passes bootstrap and regress, OK for commit?

gcc/ChangeLog:

* config/aarch64/aarch64.cc (aarch64_bitmask_imm): Use unsigned type.
(aarch64_zeroextended_move_imm): New function.
(aarch64_move_imm): Refactor, assert mode is SImode or DImode.
(aarch64_internal_mov_immediate): Assert mode is SImode or DImode.
Simplify special cases.
(aarch64_uimm12_shift): Simplify code.
(aarch64_clamp_to_uimm12_shift): Likewise.
(aarch64_movw_imm): Rename to aarch64_is_movz.
(aarch64_float_const_rtx_p): Pass either SImode or DImode to
aarch64_internal_mov_immediate.
(aarch64_rtx_costs): Likewise.
* config/aarch64/aarch64.md (movdi_aarch64): Merge 'N' and 'M'
constraints into single 'O'.
(mov_aarch64): Likewise.
* config/aarch64/aarch64-protos.h (aarch64_move_imm): Use unsigned.
(aarch64_bitmask_imm): Likewise.
(aarch64_uimm12_shift): Likewise.
(aarch64_zeroextended_move_imm): New prototype.
* config/aarch64/constraints.md: Add 'O' for 32/64-bit immediates,
limit 'N' to 64-bit only moves.

---

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
4be93c93c26e091f878bc8e4cf06e90888405fb2..8bce6ec7599edcc2e6a1d8006450f35c0ce7f61f
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -756,7 +756,7 @@ void aarch64_post_cfi_startproc (void);
 poly_int64 aarch64_initial_elimination_offset (unsigned, unsigned);
 int aarch64_get_condition_code (rtx);
 bool aarch64_address_valid_for_prefetch_p (rtx, bool);
-bool aarch64_bitmask_imm (HOST_WIDE_INT val, machine_mode);
+bool aarch64_bitmask_imm (unsigned HOST_WIDE_INT val, machine_mode);
 unsigned HOST_WIDE_INT aarch64_and_split_imm1 (HOST_WIDE_INT val_in);
 unsigned HOST_WIDE_INT aarch64_and_split_imm2 (HOST_WIDE_INT val_in);
 bool aarch64_and_bitmask_imm (unsigned HOST_WIDE_INT val_in, machine_mode 
mode);
@@ -793,7 +793,7 @@ bool aarch64_masks_and_shift_for_bfi_p (scalar_int_mode, 
unsigned HOST_WIDE_INT,
unsigned HOST_WIDE_INT,
unsigned HOST_WIDE_INT);
 bool aarch64_zero_extend_const_eq (machine_mode, rtx, machine_mode, rtx);
-bool aarch64_move_imm (HOST_WIDE_INT, machine_mode);
+bool aarch64_move_imm (unsigned HOST_WIDE_INT, machine_mode);
 machine_mode aarch64_sve_int_mode (machine_mode);
 opt_machine_mode aarch64_sve_pred_mode (unsigned int);
 machine_mode aarch64_sve_pred_mode (machine_mode);
@@ -843,8 +843,9 @@ bool aarch64_sve_float_arith_immediate_p (rtx, bool);
 bool aarch64_sve_float_mul_immediate_p (rtx);
 bool aarch64_split_dimode_const_store (rtx, rtx);
 bool aarch64_symbolic_address_p (rtx);
-bool aarch64_uimm12_shift (HOST_WIDE_INT);
+bool aarch64_uimm12_shift (unsigned HOST_WIDE_INT);
 int aarch64_movk_shift (const wide_int_ref &, const wide_int_ref &);
+bool aarch64_zeroextended_move_imm (unsigned HOST_WIDE_INT);
 bool aarch64_use_return_insn_p (void);
 const char *aarch64_output_casesi (rtx *);
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
a73741800c963ee6605fd2cfa918f4399da4bfdf..00269632eeb52c29ba2011c4c82274968b850d71
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -5625,12 +5625,10 @@ aarch64_bitmask_imm (unsigned HOST_WIDE_INT val)
 
 /* Return true if VAL is a valid bitmask immediate for MODE.  */
 bool
-aarch64_bitmask_imm (HOST_WIDE_INT val_in, machine_mode mode)
+aarch64_bitmask_imm (unsigned HOST_WIDE_INT 

[PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2022-12-01 Thread Wilco Dijkstra via Gcc-patches
A recent change only initializes the regs.how[] during Dwarf unwinding
which resulted in an uninitialized offset used in return address signing
and random failures during unwinding.  The fix is to use REG_SAVED_OFFSET
as the state where the return address signing bit is valid, and if the
state is REG_UNSAVED, initialize it to 0.

Passes bootstrap & regress, OK for commit?

libgcc/
PR target/107678
* unwind-dw2.c (execute_cfa_program): Initialize offset of
DWARF_REGNUM_AARCH64_RA_STATE if in REG_UNSAVED state.
* config/aarch64/aarch64-unwind.h (aarch64_frob_update_context):
Check state is REG_SAVED_OFFSET before using offset for RA state.

---

diff --git a/libgcc/config/aarch64/aarch64-unwind.h 
b/libgcc/config/aarch64/aarch64-unwind.h
index 
26db9cbd9e5c526e0c410a4fc6be2bedb7d261cf..597133b3d708a50a366c8bfeff57475f5522b3f6
 100644
--- a/libgcc/config/aarch64/aarch64-unwind.h
+++ b/libgcc/config/aarch64/aarch64-unwind.h
@@ -71,21 +71,15 @@ aarch64_demangle_return_addr (struct _Unwind_Context 
*context,
 }
 
 /* Do AArch64 private initialization on CONTEXT based on frame info FS.  Mark
-   CONTEXT as return address signed if bit 0 of DWARF_REGNUM_AARCH64_RA_STATE 
is
-   set.  */
+   CONTEXT as having a signed return address if DWARF_REGNUM_AARCH64_RA_STATE
+   is initialized (REG_SAVED_OFFSET state) and the offset has bit 0 set.  */
 
 static inline void
 aarch64_frob_update_context (struct _Unwind_Context *context,
 _Unwind_FrameState *fs)
 {
-  const int reg = DWARF_REGNUM_AARCH64_RA_STATE;
-  int ra_signed;
-  if (fs->regs.how[reg] == REG_UNSAVED)
-ra_signed = fs->regs.reg[reg].loc.offset & 0x1;
-  else
-ra_signed = _Unwind_GetGR (context, reg) & 0x1;
-  if (ra_signed)
-/* The flag is used for re-authenticating EH handler's address.  */
+  if (fs->regs.how[DWARF_REGNUM_AARCH64_RA_STATE] == REG_SAVED_OFFSET
+  && (fs->regs.reg[DWARF_REGNUM_AARCH64_RA_STATE].loc.offset & 1) != 0)
 context->flags |= RA_SIGNED_BIT;
   else
 context->flags &= ~RA_SIGNED_BIT;
diff --git a/libgcc/unwind-dw2.c b/libgcc/unwind-dw2.c
index 
eaceace20298b9b13344aff9d1fe9ee5f9c7bd73..87f2ae065b67982ce48f74e45523d9c754a7661c
 100644
--- a/libgcc/unwind-dw2.c
+++ b/libgcc/unwind-dw2.c
@@ -1203,11 +1203,16 @@ execute_cfa_program (const unsigned char *insn_ptr,
 
case DW_CFA_GNU_window_save:
 #if defined (__aarch64__) && !defined (__ILP32__)
- /* This CFA is multiplexed with Sparc.  On AArch64 it's used to toggle
-return address signing status.  */
- reg = DWARF_REGNUM_AARCH64_RA_STATE;
- gcc_assert (fs->regs.how[reg] == REG_UNSAVED);
- fs->regs.reg[reg].loc.offset ^= 1;
+/* This CFA is multiplexed with Sparc.  On AArch64 it's used to toggle
+   the return address signing status.  It is initialized at the first
+   use and the state is stored in bit 0 of the offset.  */
+reg = DWARF_REGNUM_AARCH64_RA_STATE;
+if (fs->regs.how[reg] == REG_UNSAVED)
+  {
+fs->regs.how[reg] = REG_SAVED_OFFSET;
+fs->regs.reg[reg].loc.offset = 0;
+  }
+fs->regs.reg[reg].loc.offset ^= 1;
 #else
  /* ??? Hardcoded for SPARC register window configuration.  */
  if (__LIBGCC_DWARF_FRAME_REGISTERS__ >= 32)



Re: [PATCH][AArch64] Cleanup move immediate code

2022-11-29 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Just to make sure I understand: isn't it really just MOVN?  I would have
> expected a 32-bit MOVZ to be equivalent to (and add no capabilities over)
> a 64-bit MOVZ.

The 32-bit MOVZ immediates are equivalent, MOVN never overlaps, and
MOVI has some overlaps. Since we allow all 3 variants, the 2 alternatives
in the movdi pattern are overlapping for MOVZ and MOVI immediates.

> I agree the ctz trick is more elegant than (and an improvement over)
> the current approach to testing for movz.  But I think the overall logic
> is harder to follow than it was in the original version.  Initially
> canonicalising val2 based on the sign bit seems unintuitive since we
> still need to handle all four combinations of (top bit set, top bit clear)
> x (low 48 bits set, low 48 bits clear).  I preferred the original
> approach of testing once with the original value (for MOVZ) and once
> with the inverted value (for MOVN).

Yes, the canonicalization on the sign ends up requiring 2 special cases.
Handling the MOVZ case first and then MOVN does avoid that, and makes
things simpler overall, so I've used that approach in v2.
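
A rough standalone sketch (not the committed code) of the underlying test:
a value is a MOVZ immediate if it is a single 16-bit chunk at a 16-bit
aligned position, and a MOVN immediate if its inverse is, so checking MOVZ
first and then the inverted value covers both cases:

    #include <stdbool.h>
    #include <stdint.h>

    static bool
    is_movz (uint64_t val)
    {
      /* A single 16-bit chunk at bit 0, 16, 32 or 48.  */
      for (int shift = 0; shift < 64; shift += 16)
        if ((val & ~(0xffffULL << shift)) == 0)
          return true;
      return false;
    }

    /* MOVZ first, then MOVN on the inverted value; bitmask (ORR)
       immediates would be checked separately.  */
    static bool
    is_simple_mov_imm (uint64_t val)
    {
      return is_movz (val) || is_movz (~val);
    }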

> Don't the new cases boil down to: if mode is DImode and the upper 32 bits
> are clear, we can test based on SImode instead?  In other words, couldn't
> the "(val >> 32) == 0" part of the final test be done first, with the
> effect of changing the mode to SImode?  Something like:

Yes that works. I used masking of the top bits to avoid repeatedly testing the
same condition. The new version removes most special cases and ends up
both smaller and simpler:


v2: Simplify the special cases in aarch64_move_imm, use aarch64_is_movz.

Simplify, refactor and improve various move immediate functions.
Allow 32-bit MOVZ/I/N as a valid 64-bit immediate which removes special
cases in aarch64_internal_mov_immediate.  Add new constraint so the movdi
pattern only needs a single alternative for move immediate.

Passes bootstrap and regress, OK for commit?

gcc/ChangeLog:

* config/aarch64/aarch64.cc (aarch64_bitmask_imm): Use unsigned type.
(aarch64_zeroextended_move_imm): New function.
(aarch64_move_imm): Refactor, assert mode is SImode or DImode.
(aarch64_internal_mov_immediate): Assert mode is SImode or DImode.
Simplify special cases.
(aarch64_uimm12_shift): Simplify code.
(aarch64_clamp_to_uimm12_shift): Likewise.
(aarch64_movw_imm): Rename to aarch64_is_movz.
(aarch64_float_const_rtx_p): Pass either SImode or DImode to
aarch64_internal_mov_immediate.
(aarch64_rtx_costs): Likewise.
* config/aarch64/aarch64.md (movdi_aarch64): Merge 'N' and 'M'
constraints into single 'O'.
(mov_aarch64): Likewise.
* config/aarch64/aarch64-protos.h (aarch64_move_imm): Use unsigned.
(aarch64_bitmask_imm): Likewise.
(aarch64_uimm12_shift): Likewise.
(aarch64_zeroextended_move_imm): New prototype.
* config/aarch64/constraints.md: Add 'O' for 32/64-bit immediates,
limit 'N' to 64-bit only moves.

---

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
4be93c93c26e091f878bc8e4cf06e90888405fb2..8bce6ec7599edcc2e6a1d8006450f35c0ce7f61f
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -756,7 +756,7 @@ void aarch64_post_cfi_startproc (void);
 poly_int64 aarch64_initial_elimination_offset (unsigned, unsigned);
 int aarch64_get_condition_code (rtx);
 bool aarch64_address_valid_for_prefetch_p (rtx, bool);
-bool aarch64_bitmask_imm (HOST_WIDE_INT val, machine_mode);
+bool aarch64_bitmask_imm (unsigned HOST_WIDE_INT val, machine_mode);
 unsigned HOST_WIDE_INT aarch64_and_split_imm1 (HOST_WIDE_INT val_in);
 unsigned HOST_WIDE_INT aarch64_and_split_imm2 (HOST_WIDE_INT val_in);
 bool aarch64_and_bitmask_imm (unsigned HOST_WIDE_INT val_in, machine_mode 
mode);
@@ -793,7 +793,7 @@ bool aarch64_masks_and_shift_for_bfi_p (scalar_int_mode, 
unsigned HOST_WIDE_INT,
unsigned HOST_WIDE_INT,
unsigned HOST_WIDE_INT);
 bool aarch64_zero_extend_const_eq (machine_mode, rtx, machine_mode, rtx);
-bool aarch64_move_imm (HOST_WIDE_INT, machine_mode);
+bool aarch64_move_imm (unsigned HOST_WIDE_INT, machine_mode);
 machine_mode aarch64_sve_int_mode (machine_mode);
 opt_machine_mode aarch64_sve_pred_mode (unsigned int);
 machine_mode aarch64_sve_pred_mode (machine_mode);
@@ -843,8 +843,9 @@ bool aarch64_sve_float_arith_immediate_p (rtx, bool);
 bool aarch64_sve_float_mul_immediate_p (rtx);
 bool aarch64_split_dimode_const_store (rtx, rtx);
 bool aarch64_symbolic_address_p (rtx);
-bool aarch64_uimm12_shift (HOST_WIDE_INT);
+bool aarch64_uimm12_shift (unsigned HOST_WIDE_INT);
 int aarch64_movk_shift (const wide_int_ref &, const wide_int_ref &);
+bool aarch64_zeroextended_move_imm (unsigned HOST_WIDE_INT);
 bool 

Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-23 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

>> A smart reassociation pass could form more FMAs while also increasing
>> parallelism, but the way it currently works always results in fewer FMAs.
>
> Yeah, as Richard said, that seems the right long-term fix.
> It would also avoid the hack of treating PLUS_EXPR as a signal
> of an FMA, which has the drawback of assuming (for 2-FMA cores)
> that plain addition never benefits from reassociation in its own right.

True but it's hard to separate them. You will have a mix of FADD and FMAs
to reassociate (since FMA still counts as an add), and the ratio between
them as well as the number of operations may affect the best reassociation
width.
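
As an illustration (not from the patch) of the trade-off, consider a plain
dot product: with a reassociation width of 1 it stays a single serial FMA
chain, while a width of 4 splits it into four independent accumulators,
trading some FMA formation for more parallelism on cores with four FMA
pipes:

    double
    dot (const double *a, const double *b, int n)
    {
      double s = 0.0;
      for (int i = 0; i < n; i++)
        s += a[i] * b[i];    /* each step can fuse into an FMA */
      return s;
    }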

> Still, I guess the hackiness is pre-existing and the patch is removing
> the hackiness for some cores, so from that point of view it's a strict
> improvement over the status quo.  And it's too late in the GCC 13
> cycle to do FMA reassociation properly.  So I'm OK with the patch
> in principle, but could you post an update with more commentary?

Sure, here is an update with longer comment in aarch64_reassociation_width:


Add a reassocation width for FMAs in per-CPU tuning structures. Keep the
existing setting for cores with 2 FMA pipes, and use 4 for cores with 4
FMA pipes.  This improves SPECFP2017 on Neoverse V1 by ~1.5%.

Passes regress/bootstrap, OK for commit?

gcc/ChangeLog/
PR 107413
* config/aarch64/aarch64.cc (struct tune_params): Add
fma_reassoc_width to all CPU tuning structures.
(aarch64_reassociation_width): Use fma_reassoc_width.
* config/aarch64/aarch64-protos.h (struct tune_params): Add
fma_reassoc_width.

---
diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
238820581c5ee7617f8eed1df2cf5418b1127e19..4be93c93c26e091f878bc8e4cf06e90888405fb2
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -540,6 +540,7 @@ struct tune_params
   const char *loop_align;
   int int_reassoc_width;
   int fp_reassoc_width;
+  int fma_reassoc_width;
   int vec_reassoc_width;
   int min_div_recip_mul_sf;
   int min_div_recip_mul_df;
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
c91df6f5006c257690aafb75398933d628a970e1..15d478c77ceb2d6c52a70b6ffd8fdadcfa8deba0
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -1346,6 +1346,7 @@ static const struct tune_params generic_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1382,6 +1383,7 @@ static const struct tune_params cortexa35_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1415,6 +1417,7 @@ static const struct tune_params cortexa53_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1448,6 +1451,7 @@ static const struct tune_params cortexa57_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1481,6 +1485,7 @@ static const struct tune_params cortexa72_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1514,6 +1519,7 @@ static const struct tune_params cortexa73_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1548,6 +1554,7 @@ static const struct tune_params exynosm1_tunings =
   "4", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1580,6 +1587,7 @@ static const struct tune_params thunderxt88_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1612,6 +1620,7 @@ static const struct tune_params thunderx_tunings =
   "8", /* loop_align.  */
   2,   /* 

Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-22 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> I guess an obvious question is: if 1 (rather than 2) was the right value
> for cores with 2 FMA pipes, why is 4 the right value for cores with 4 FMA
> pipes?  It would be good to clarify how, conceptually, the core property
> should map to the fma_reassoc_width value.

1 turns off reassociation so that FMAs get properly formed.  After
reassociation far fewer FMAs get formed, so we end up with more floating-point
operations, which means slower execution.  It's a significant slowdown on cores
that are not wide, have only 1 or 2 FP pipes, and may have high FP latencies.
So we turn it off by default on all older cores.

> It sounds from the existing comment like the main motivation for returning 1
> was to encourage more FMAs to be formed, rather than to prevent FMAs from
> being reassociated.  Is that no longer an issue?  Or is the point that,
> with more FMA pipes, lower FMA formation is a price worth paying for
> the better parallelism we get when FMAs can be formed?

Exactly. A wide CPU can deal with the extra instructions, so the loss from fewer
FMAs ends up lower than the speedup from the extra parallelism. Having more FMAs
will be even faster of course.

> Does this code ever see opc == FMA?

No, that's the problem, reassociation ignores the fact that we actually want 
FMAs. A smart
reassociation pass could form more FMAs while also increasing parallelism, but 
the way it
currently works always results in fewer FMAs.

Cheers,
Wilco

Re: [PATCH] AArch64: Add support for -mdirect-extern-access

2022-11-17 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Can you go into more detail about:
>
>    Use :option:`-mdirect-extern-access` either in shared libraries or in
>    executables, but not in both.  Protected symbols used both in a shared
>    library and executable may cause linker errors or fail to work correctly
>
> If this is LLVM's default for PIC (and by assumption shared libraries),
> is it then invalid to use -mdirect-extern-access for any PIEs that
> are linked against those shared libraries and use protected symbols
> from those libraries?  How would a user know that one of the shared
> libraries they're linking against was built in this way?

Yes, the usage model is that you'd either use it for static PIE or only on
data that is not shared. If you get it wrong then you'll get the copy
relocation error. In the future we need to decide what the ABI is and
ensure GCC and LLVM are compatible. An import feature to mark symbols
that may be overridden by a shared library would be useful too.
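
A hypothetical example (not from the patch) of the conflict being described:

    /* libfoo.c, built with -fPIC and direct extern access (LLVM default):  */
    __attribute__ ((visibility ("protected"))) int counter;

    /* app.c, built as a regular PIE without -mdirect-extern-access:  */
    extern int counter;
    int bump (void) { return ++counter; }

If the executable ends up needing a copy relocation for 'counter', the
protected definition in the shared library cannot satisfy it, so the link
fails (or, with older toolchains, the two copies silently diverge).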

> It looks like the main difference between this implementation and
> the x86 one is that x86 allows direct accesses to common symbols.
> What's the reason for not doing that for AArch64?  Does it not work,
> is it a false optimisation (i.e. pessimisation), or did it not seem
> important now that -fno-common is the default?

I don't see any difference in the way common symbols are accessed on x86,
so it's not clear which cases the common_local_p param actually affects (e.g. with
-fPIC there is always a GOT indirection for common symbols).

Cheers,
Wilco

[PATCH] AArch64: Add support for -mdirect-extern-access

2022-11-11 Thread Wilco Dijkstra via Gcc-patches
Add a new option -mdirect-extern-access similar to other targets.  This removes
GOT indirections on external symbols with -fPIE, resulting in significantly
better code quality.  With -fPIC it only affects protected symbols, allowing
for more efficient shared libraries which can be linked with standard PIE
binaries (this is what LLVM does by default, so this improves interoperability
with LLVM). This patch doesn't affect ABI, but in the future GCC and LLVM
should converge to using the same ABI.
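
A hedged sketch of the codegen difference for a PIE (the assembly is
illustrative, not copied from a compiler run):

    extern int x;

    int get_x (void) { return x; }

    /* Default -fPIE (GOT indirection):      With -mdirect-extern-access:
           adrp x0, :got:x                       adrp x0, x
           ldr  x0, [x0, :got_lo12:x]            ldr  w0, [x0, :lo12:x]
           ldr  w0, [x0]
    */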

Regress and bootstrap pass, OK for commit?

gcc/
* config/aarch64/aarch64.cc (aarch64_binds_local_p): New function.
(aarch64_symbol_binds_local_p): Refactor, support direct extern access.
* config/aarch64/aarch64-linux.h (TARGET_BINDS_LOCAL_P):
Use aarch64_binds_local_p.
* config/aarch64/aarch64-freebsd.h (TARGET_BINDS_LOCAL_P): Likewise.
* config/aarch64/aarch64-protos.h: Add aarch64_binds_local_p.
* doc/gcc/gcc-command-options/option-summary.rst: Add
-mdirect-extern-access.
* 
doc/gcc/gcc-command-options/machine-dependent-options/aarch64-options.rst:
Add description of -mdirect-extern-access.

gcc/testsuite/
* gcc.target/aarch64/pr66912-2.c: New test.

---

diff --git a/gcc/config/aarch64/aarch64-freebsd.h 
b/gcc/config/aarch64/aarch64-freebsd.h
index 
13beb3781b61afd82d767884f3c16ff8eead09cc..20bc0f48e484686cd3754613bf20bb3521079d48
 100644
--- a/gcc/config/aarch64/aarch64-freebsd.h
+++ b/gcc/config/aarch64/aarch64-freebsd.h
@@ -71,7 +71,7 @@
strong definitions in dependent shared libraries, will resolve
to COPY relocated symbol in the executable.  See PR65780.  */
 #undef TARGET_BINDS_LOCAL_P
-#define TARGET_BINDS_LOCAL_P default_binds_local_p_2
+#define TARGET_BINDS_LOCAL_P aarch64_binds_local_p
 
 /* Use the AAPCS type for wchar_t, override the one from
config/freebsd.h.  */
diff --git a/gcc/config/aarch64/aarch64-linux.h 
b/gcc/config/aarch64/aarch64-linux.h
index 
5e4553d79f5053f2da0eb381e0805f47aec964ae..6c962402155d60b82610d4f65af5182d6faa47ad
 100644
--- a/gcc/config/aarch64/aarch64-linux.h
+++ b/gcc/config/aarch64/aarch64-linux.h
@@ -70,7 +70,7 @@
strong definitions in dependent shared libraries, will resolve
to COPY relocated symbol in the executable.  See PR65780.  */
 #undef TARGET_BINDS_LOCAL_P
-#define TARGET_BINDS_LOCAL_P default_binds_local_p_2
+#define TARGET_BINDS_LOCAL_P aarch64_binds_local_p
 
 /* Define this to be nonzero if static stack checking is supported.  */
 #define STACK_CHECK_STATIC_BUILTIN 1
diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
238820581c5ee7617f8eed1df2cf5418b1127e19..fac754f78c1d7606ba90e1034820a62466b96b63
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1072,5 +1072,6 @@ const char *aarch64_sls_barrier (int);
 const char *aarch64_indirect_call_asm (rtx);
 extern bool aarch64_harden_sls_retbr_p (void);
 extern bool aarch64_harden_sls_blr_p (void);
+extern bool aarch64_binds_local_p (const_tree);
 
 #endif /* GCC_AARCH64_PROTOS_H */
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
d1f979ebcf80333d957f8ad8631deef47dc693a5..ab4c42c34da5b15f6739c9b0a7ebaafda9488f2d
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -19185,9 +19185,29 @@ aarch64_tlsdesc_abi_id ()
 static bool
 aarch64_symbol_binds_local_p (const_rtx x)
 {
-  return (SYMBOL_REF_DECL (x)
- ? targetm.binds_local_p (SYMBOL_REF_DECL (x))
- : SYMBOL_REF_LOCAL_P (x));
+  if (!SYMBOL_REF_DECL (x))
+return SYMBOL_REF_LOCAL_P (x);
+
+  if (targetm.binds_local_p (SYMBOL_REF_DECL (x)))
+return true;
+
+  /* In PIE binaries avoid a GOT indirection on non-weak data symbols if
+ aarch64_direct_extern_access is true.  */
+  if (flag_pie && aarch64_direct_extern_access && !SYMBOL_REF_WEAK (x)
+  && !SYMBOL_REF_FUNCTION_P (x))
+return true;
+
+  return false;
+}
+
+/* Implement TARGET_BINDS_LOCAL_P hook.  */
+
+bool
+aarch64_binds_local_p (const_tree exp)
+{
+  /* Protected symbols are local if aarch64_direct_extern_access is true.  */
+  return default_binds_local_p_3 (exp, flag_shlib != 0, true,
+ !aarch64_direct_extern_access, !flag_pic);
 }
 
 /* Return true if SYMBOL_REF X is thread local */
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 
b89b20450710592101b93f4f3b5dc33d152d1eb6..6251a36b544a03955361b445c9f5dfad3740eea8
 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -299,3 +299,7 @@ Constant memset size in bytes from which to start using 
MOPS sequence.
 -param=aarch64-vect-unroll-limit=
 Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
 Limit how much the autovectorizer may unroll a loop.
+
+mdirect-extern-access
+Target Var(aarch64_direct_extern_access) Init(0)
+Do not indirect accesses to 

[PATCH] libatomic: Add support for LSE and LSE2

2022-11-11 Thread Wilco Dijkstra via Gcc-patches
Add support for AArch64 LSE and LSE2 to libatomic.  Disable outline atomics,
and use LSE ifuncs for 1-8 byte atomics and LSE2 ifuncs for 16-byte atomics.
On Neoverse V1, 16-byte atomics are ~4x faster due to avoiding locks.

Note this is safe since we swap all 16-byte atomics using the same ifunc,
so they either use locks or LSE2 atomics, but never a mix. This also improves
ABI compatibility with LLVM: its inlined 16-byte atomics are compatible with
the new libatomic if LSE2 is supported.
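
For illustration (not part of the patch), 16-byte atomics like the CAS below
are what get routed through these ifuncs; with HWCAP_USCAT (LSE2) present the
lock-free variant is selected at runtime, otherwise the locked fallback is
used:

    #include <stdatomic.h>
    #include <stdbool.h>

    _Atomic __int128 slot;

    bool publish (__int128 expected, __int128 desired)
    {
      /* Lowered to libatomic's 16-byte compare-exchange.  */
      return atomic_compare_exchange_strong (&slot, &expected, desired);
    }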

Passes regress, OK for commit?

libatomic/
* Makefile.in: Regenerated with automake 1.15.1.
* Makefile.am: Add atomic_16.S for AArch64.
* configure.tgt: Disable outline atomics in AArch64 build.
* config/linux/aarch64/atomic_16.S: New file - implementation of
ifuncs for 128-bit atomics.
* config/linux/aarch64/host-config.h: Enable ifuncs, use LSE (HWCAP_ATOMICS)
for 1-8-byte atomics and LSE2 (HWCAP_USCAT) for 16-byte atomics.

---
diff --git a/libatomic/Makefile.am b/libatomic/Makefile.am
index 
d88515e4a03bd812334ae0b7bf4c0bba119455dc..41e5da28512150780a2018386e22b4e70afcfa3f
 100644
--- a/libatomic/Makefile.am
+++ b/libatomic/Makefile.am
@@ -127,6 +127,8 @@ if HAVE_IFUNC
 if ARCH_AARCH64_LINUX
 IFUNC_OPTIONS   = -march=armv8-a+lse
 libatomic_la_LIBADD += $(foreach s,$(SIZES),$(addsuffix 
_$(s)_1_.lo,$(SIZEOBJS)))
+libatomic_la_SOURCES += atomic_16.S
+
 endif
 if ARCH_ARM_LINUX
 IFUNC_OPTIONS   = -march=armv7-a+fp -DHAVE_KERNEL64
diff --git a/libatomic/Makefile.in b/libatomic/Makefile.in
index 
80d25653dc75cca995c8b0b2107a55f1234a6d52..89e29fc60a7fb74341b2f0f805e461847073082c
 100644
--- a/libatomic/Makefile.in
+++ b/libatomic/Makefile.in
@@ -90,13 +90,14 @@ build_triplet = @build@
 host_triplet = @host@
 target_triplet = @target@
 @ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_1 = $(foreach 
s,$(SIZES),$(addsuffix _$(s)_1_.lo,$(SIZEOBJS)))
-@ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_2 = $(foreach \
+@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_2 = atomic_16.S
+@ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__append_3 = $(foreach \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ s,$(SIZES),$(addsuffix \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ _$(s)_1_.lo,$(SIZEOBJS))) \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ $(addsuffix \
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@ _8_2_.lo,$(SIZEOBJS))
-@ARCH_I386_TRUE@@HAVE_IFUNC_TRUE@am__append_3 = $(addsuffix 
_8_1_.lo,$(SIZEOBJS))
-@ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@am__append_4 = $(addsuffix 
_16_1_.lo,$(SIZEOBJS)) \
+@ARCH_I386_TRUE@@HAVE_IFUNC_TRUE@am__append_4 = $(addsuffix 
_8_1_.lo,$(SIZEOBJS))
+@ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@am__append_5 = $(addsuffix 
_16_1_.lo,$(SIZEOBJS)) \
 @ARCH_X86_64_TRUE@@HAVE_IFUNC_TRUE@   $(addsuffix 
_16_2_.lo,$(SIZEOBJS))
 
 subdir = .
@@ -154,8 +155,11 @@ am__uninstall_files_from_dir = { \
   }
 am__installdirs = "$(DESTDIR)$(toolexeclibdir)"
 LTLIBRARIES = $(noinst_LTLIBRARIES) $(toolexeclib_LTLIBRARIES)
+@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@am__objects_1 =  \
+@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@ atomic_16.lo
 am_libatomic_la_OBJECTS = gload.lo gstore.lo gcas.lo gexch.lo \
-   glfree.lo lock.lo init.lo fenv.lo fence.lo flag.lo
+   glfree.lo lock.lo init.lo fenv.lo fence.lo flag.lo \
+   $(am__objects_1)
 libatomic_la_OBJECTS = $(am_libatomic_la_OBJECTS)
 AM_V_lt = $(am__v_lt_@AM_V@)
 am__v_lt_ = $(am__v_lt_@AM_DEFAULT_V@)
@@ -165,9 +169,9 @@ libatomic_la_LINK = $(LIBTOOL) $(AM_V_lt) --tag=CC 
$(AM_LIBTOOLFLAGS) \
$(LIBTOOLFLAGS) --mode=link $(CCLD) $(AM_CFLAGS) $(CFLAGS) \
$(libatomic_la_LDFLAGS) $(LDFLAGS) -o $@
 libatomic_convenience_la_DEPENDENCIES = $(libatomic_la_LIBADD)
-am__objects_1 = gload.lo gstore.lo gcas.lo gexch.lo glfree.lo lock.lo \
-   init.lo fenv.lo fence.lo flag.lo
-am_libatomic_convenience_la_OBJECTS = $(am__objects_1)
+am__objects_2 = gload.lo gstore.lo gcas.lo gexch.lo glfree.lo lock.lo \
+   init.lo fenv.lo fence.lo flag.lo $(am__objects_1)
+am_libatomic_convenience_la_OBJECTS = $(am__objects_2)
 libatomic_convenience_la_OBJECTS =  \
$(am_libatomic_convenience_la_OBJECTS)
 AM_V_P = $(am__v_P_@AM_V@)
@@ -185,6 +189,16 @@ am__v_at_1 =
 depcomp = $(SHELL) $(top_srcdir)/../depcomp
 am__depfiles_maybe = depfiles
 am__mv = mv -f
+CPPASCOMPILE = $(CCAS) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) \
+   $(AM_CPPFLAGS) $(CPPFLAGS) $(AM_CCASFLAGS) $(CCASFLAGS)
+LTCPPASCOMPILE = $(LIBTOOL) $(AM_V_lt) $(AM_LIBTOOLFLAGS) \
+   $(LIBTOOLFLAGS) --mode=compile $(CCAS) $(DEFS) \
+   $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) \
+   $(AM_CCASFLAGS) $(CCASFLAGS)
+AM_V_CPPAS = $(am__v_CPPAS_@AM_V@)
+am__v_CPPAS_ = $(am__v_CPPAS_@AM_DEFAULT_V@)
+am__v_CPPAS_0 = @echo "  CPPAS   " $@;
+am__v_CPPAS_1 = 
 COMPILE = $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) \
$(CPPFLAGS) $(AM_CFLAGS) $(CFLAGS)
 LTCOMPILE = $(LIBTOOL) $(AM_V_lt) --tag=CC $(AM_LIBTOOLFLAGS) \

[PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-09 Thread Wilco Dijkstra via Gcc-patches
Add a reassocation width for FMAs in per-CPU tuning structures. Keep the
existing setting for cores with 2 FMA pipes, and use 4 for cores with 4
FMA pipes.  This improves SPECFP2017 on Neoverse V1 by ~1.5%.

Passes regress/bootstrap, OK for commit?

gcc/
PR 107413
* config/aarch64/aarch64.cc (struct tune_params): Add
fma_reassoc_width to all CPU tuning structures.
* config/aarch64/aarch64-protos.h (struct tune_params): Add
fma_reassoc_width.

---

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
a73bfa20acb9b92ae0475794c3f11c67d22feb97..71365a446007c26b906b61ca8b2a68ee06c83037
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -540,6 +540,7 @@ struct tune_params
   const char *loop_align;
   int int_reassoc_width;
   int fp_reassoc_width;
+  int fma_reassoc_width;
   int vec_reassoc_width;
   int min_div_recip_mul_sf;
   int min_div_recip_mul_df;
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
798363bcc449c414de5bbb4f26b8e1c64a0cf71a..643162cdecd6a8fe5587164cb2d0d62b709a491d
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -1346,6 +1346,7 @@ static const struct tune_params generic_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1382,6 +1383,7 @@ static const struct tune_params cortexa35_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1415,6 +1417,7 @@ static const struct tune_params cortexa53_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1448,6 +1451,7 @@ static const struct tune_params cortexa57_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1481,6 +1485,7 @@ static const struct tune_params cortexa72_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1514,6 +1519,7 @@ static const struct tune_params cortexa73_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1548,6 +1554,7 @@ static const struct tune_params exynosm1_tunings =
   "4", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1580,6 +1587,7 @@ static const struct tune_params thunderxt88_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1612,6 +1620,7 @@ static const struct tune_params thunderx_tunings =
   "8", /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1646,6 +1655,7 @@ static const struct tune_params tsv110_tunings =
   "8",  /* loop_align.  */
   2,/* int_reassoc_width.  */
   4,/* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,/* vec_reassoc_width.  */
   2,/* min_div_recip_mul_sf.  */
   2,/* min_div_recip_mul_df.  */
@@ -1678,6 +1688,7 @@ static const struct tune_params xgene1_tunings =
   "16",/* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1710,6 +1721,7 @@ static const struct tune_params emag_tunings =
   "16",/* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
+  1,   /* fma_reassoc_width.  */
   1,   /* vec_reassoc_width.  */
   2,   /* min_div_recip_mul_sf.  */
   2,   /* min_div_recip_mul_df.  */
@@ -1743,6 +1755,7 @@ static const 

[committed] AArch64: Fix testcase

2022-11-04 Thread Wilco Dijkstra via Gcc-patches
Committed as trivial fix.

gcc/testsuite/
* gcc.target/aarch64/mgeneral-regs_3.c: Fix testcase.

---

diff --git a/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c 
b/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c
index 
fa6299960e77141ce747552cfbe3256a7a96d229..4e634f6ebe2b769b85417bca056457b9176645c4
 100644
--- a/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c
@@ -3,9 +3,9 @@
 extern void abort (void);
 
 int
-test (int i, ...)
+test (int i, ...) /* { dg-error "'-mgeneral-regs-only' is incompatible with 
the use of floating-point types" } */
 {
-  float f = (float) i; /* { dg-error "'-mgeneral-regs-only' is incompatible 
with the use of floating-point types" } */
-  if (f != f) abort ();
+  float f = (float) i;
+  if (f != 0) abort ();
   return 2;
 }


[PATCH][AArch64] Cleanup move immediate code

2022-11-01 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

Here is the immediate cleanup splitoff from the previous patch:

Simplify, refactor and improve various move immediate functions.
Allow 32-bit MOVZ/N as a valid 64-bit immediate, which removes special
cases in aarch64_internal_mov_immediate.  Add a new constraint so the movdi
pattern only needs a single alternative for move immediate.

Passes bootstrap and regress, OK for commit?

gcc/ChangeLog:

* config/aarch64/aarch64.cc (aarch64_bitmask_imm): Use unsigned type.
(aarch64_zeroextended_move_imm): New function.
(aarch64_move_imm): Refactor, assert mode is SImode or DImode.
(aarch64_internal_mov_immediate): Assert mode is SImode or DImode.
Simplify special cases.
(aarch64_uimm12_shift): Simplify code.
(aarch64_clamp_to_uimm12_shift): Likewise.
(aarch64_movw_imm): Remove.
(aarch64_float_const_rtx_p): Pass either SImode or DImode to
aarch64_internal_mov_immediate.
(aarch64_rtx_costs): Likewise.
* config/aarch64/aarch64.md (movdi_aarch64): Merge 'N' and 'M'
constraints into single 'O'.
(mov<mode>_aarch64): Likewise.
* config/aarch64/aarch64-protos.h (aarch64_move_imm): Use unsigned.
(aarch64_bitmask_imm): Likewise.
(aarch64_uimm12_shift): Likewise.
(aarch64_zeroextended_move_imm): New prototype.
* config/aarch64/constraints.md: Add 'O' for 32/64-bit immediates,
limit 'N' to 64-bit only moves.

---

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
1a71f02284137c64e7115b26e6aa00447596f105..a73bfa20acb9b92ae0475794c3f11c67d22feb97
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -755,7 +755,7 @@ void aarch64_post_cfi_startproc (void);
 poly_int64 aarch64_initial_elimination_offset (unsigned, unsigned);
 int aarch64_get_condition_code (rtx);
 bool aarch64_address_valid_for_prefetch_p (rtx, bool);
-bool aarch64_bitmask_imm (HOST_WIDE_INT val, machine_mode);
+bool aarch64_bitmask_imm (unsigned HOST_WIDE_INT val, machine_mode);
 unsigned HOST_WIDE_INT aarch64_and_split_imm1 (HOST_WIDE_INT val_in);
 unsigned HOST_WIDE_INT aarch64_and_split_imm2 (HOST_WIDE_INT val_in);
 bool aarch64_and_bitmask_imm (unsigned HOST_WIDE_INT val_in, machine_mode 
mode);
@@ -792,7 +792,7 @@ bool aarch64_masks_and_shift_for_bfi_p (scalar_int_mode, 
unsigned HOST_WIDE_INT,
unsigned HOST_WIDE_INT,
unsigned HOST_WIDE_INT);
 bool aarch64_zero_extend_const_eq (machine_mode, rtx, machine_mode, rtx);
-bool aarch64_move_imm (HOST_WIDE_INT, machine_mode);
+bool aarch64_move_imm (unsigned HOST_WIDE_INT, machine_mode);
 machine_mode aarch64_sve_int_mode (machine_mode);
 opt_machine_mode aarch64_sve_pred_mode (unsigned int);
 machine_mode aarch64_sve_pred_mode (machine_mode);
@@ -842,8 +842,9 @@ bool aarch64_sve_float_arith_immediate_p (rtx, bool);
 bool aarch64_sve_float_mul_immediate_p (rtx);
 bool aarch64_split_dimode_const_store (rtx, rtx);
 bool aarch64_symbolic_address_p (rtx);
-bool aarch64_uimm12_shift (HOST_WIDE_INT);
+bool aarch64_uimm12_shift (unsigned HOST_WIDE_INT);
 int aarch64_movk_shift (const wide_int_ref &, const wide_int_ref &);
+bool aarch64_zeroextended_move_imm (unsigned HOST_WIDE_INT);
 bool aarch64_use_return_insn_p (void);
 const char *aarch64_output_casesi (rtx *);
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
5d1ab5aa42b2cda0a655d2bc69c4df19da457ab3..798363bcc449c414de5bbb4f26b8e1c64a0cf71a
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -5558,12 +5558,10 @@ aarch64_bitmask_imm (unsigned HOST_WIDE_INT val)
 
 /* Return true if VAL is a valid bitmask immediate for MODE.  */
 bool
-aarch64_bitmask_imm (HOST_WIDE_INT val_in, machine_mode mode)
+aarch64_bitmask_imm (unsigned HOST_WIDE_INT val, machine_mode mode)
 {
   if (mode == DImode)
-return aarch64_bitmask_imm (val_in);
-
-  unsigned HOST_WIDE_INT val = val_in;
+return aarch64_bitmask_imm (val);
 
   if (mode == SImode)
 return aarch64_bitmask_imm ((val & 0x) | (val << 32));
@@ -5602,51 +5600,60 @@ aarch64_check_bitmask (unsigned HOST_WIDE_INT val,
 }
 
 
-/* Return true if val is an immediate that can be loaded into a
-   register by a MOVZ instruction.  */
-static bool
-aarch64_movw_imm (HOST_WIDE_INT val, scalar_int_mode mode)
+/* Return true if immediate VAL can only be created by using a 32-bit
+   zero-extended move immediate, not by a 64-bit move.  */
+bool
+aarch64_zeroextended_move_imm (unsigned HOST_WIDE_INT val)
 {
-  if (GET_MODE_SIZE (mode) > 4)
-{
-  if ((val & (((HOST_WIDE_INT) 0x) << 32)) == val
-  || (val & (((HOST_WIDE_INT) 0x) << 48)) == val)
-   return 1;
-}
-  else
-{
-  /* Ignore sign extension.  */
-  val &= (HOST_WIDE_INT) 0x;
-}
-  return ((val & (((HOST_WIDE_INT) 0x) 

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-20 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Can you do the aarch64_mov_imm changes as a separate patch?  It's difficult
> to review the two changes folded together like this.

Sure, I'll send a separate patch. So here is version 2 again:

[PATCH v2][AArch64] Improve immediate expansion [PR106583]

Improve immediate expansion of immediates which can be created from a
bitmask immediate and 2 MOVKs.  Simplify, refactor and improve
efficiency of bitmask checks.  This reduces the number of 4-instruction
immediates in SPECINT/FP by 10-15%.

Passes regress, OK for commit?

gcc/ChangeLog:

PR target/106583
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
Add support for a bitmask immediate with 2 MOVKs.
(aarch64_check_bitmask): New function after refactorization.
(aarch64_replicate_bitmask_imm): Remove function, merge into...
(aarch64_bitmask_imm): Simplify replication of small modes.
Split function into 64-bit only version for efficiency.

gcc/testsuite:
PR target/106583
* gcc.target/aarch64/pr106583.c: Add new test.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
926e81f028c82aac9a5fecc18f921f84399c24ae..b2d9c7380975028131d0fe731a97b3909874b87b
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -306,6 +306,7 @@ static machine_mode aarch64_simd_container_mode 
(scalar_mode, poly_int64);
 static bool aarch64_print_address_internal (FILE*, machine_mode, rtx,
 aarch64_addr_query_type);
 static HOST_WIDE_INT aarch64_clamp_to_uimm12_shift (HOST_WIDE_INT val);
+static bool aarch64_bitmask_imm (unsigned HOST_WIDE_INT);
 
 /* The processor for which instructions should be scheduled.  */
 enum aarch64_processor aarch64_tune = cortexa53;
@@ -5502,6 +5503,30 @@ aarch64_output_sve_vector_inc_dec (const char *operands, 
rtx x)
  factor, nelts_per_vq);
 }
 
+/* Return true if the immediate VAL can be a bitfield immediate
+   by changing the given MASK bits in VAL to zeroes, ones or bits
+   from the other half of VAL.  Return the new immediate in VAL2.  */
+static inline bool
+aarch64_check_bitmask (unsigned HOST_WIDE_INT val,
+  unsigned HOST_WIDE_INT &val2,
+  unsigned HOST_WIDE_INT mask)
+{
+  val2 = val & ~mask;
+  if (val2 != val && aarch64_bitmask_imm (val2))
+return true;
+  val2 = val | mask;
+  if (val2 != val && aarch64_bitmask_imm (val2))
+return true;
+  val = val & ~mask;
+  val2 = val | (((val >> 32) | (val << 32)) & mask);
+  if (val2 != val && aarch64_bitmask_imm (val2))
+return true;
+  val2 = val | (((val >> 16) | (val << 48)) & mask);
+  if (val2 != val && aarch64_bitmask_imm (val2))
+return true;
+  return false;
+}
+
 static int
 aarch64_internal_mov_immediate (rtx dest, rtx imm, bool generate,
 scalar_int_mode mode)
@@ -5568,36 +5593,43 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
generate,
   one_match = ((~val & mask) == 0) + ((~val & (mask << 16)) == 0) +
 ((~val & (mask << 32)) == 0) + ((~val & (mask << 48)) == 0);
 
-  if (zero_match != 2 && one_match != 2)
+  if (zero_match < 2 && one_match < 2)
 {
   /* Try emitting a bitmask immediate with a movk replacing 16 bits.
  For a 64-bit bitmask try whether changing 16 bits to all ones or
  zeroes creates a valid bitmask.  To check any repeated bitmask,
  try using 16 bits from the other 32-bit half of val.  */
 
-  for (i = 0; i < 64; i += 16, mask <<= 16)
-   {
- val2 = val & ~mask;
- if (val2 != val && aarch64_bitmask_imm (val2, mode))
-   break;
- val2 = val | mask;
- if (val2 != val && aarch64_bitmask_imm (val2, mode))
-   break;
- val2 = val2 & ~mask;
- val2 = val2 | (((val2 >> 32) | (val2 << 32)) & mask);
- if (val2 != val && aarch64_bitmask_imm (val2, mode))
-   break;
-   }
-  if (i != 64)
-   {
- if (generate)
+  for (i = 0; i < 64; i += 16)
+   if (aarch64_check_bitmask (val, val2, mask << i))
+ {
+   if (generate)
+ {
+   emit_insn (gen_rtx_SET (dest, GEN_INT (val2)));
+   emit_insn (gen_insv_immdi (dest, GEN_INT (i),
+  GEN_INT ((val >> i) & 0x)));
+ }
+   return 2;
+ }
+}
+
+  /* Try a bitmask plus 2 movk to generate the immediate in 3 instructions.  */
+  if (zero_match + one_match == 0)
+{
+  for (i = 0; i < 48; i += 16)
+   for (int j = i + 16; j < 64; j += 16)
+ if (aarch64_check_bitmask (val, val2, (mask << i) | (mask << j)))
 {
- emit_insn (gen_rtx_SET (dest, GEN_INT (val2)));
- emit_insn (gen_insv_immdi (dest, GEN_INT (i),
-GEN_INT ((val >> i) & 0x)));
+ 

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-19 Thread Wilco Dijkstra via Gcc-patches
ping



Hi Richard,

>>> Sounds good, but could you put it before the mode version,
>>> to avoid the forward declaration?
>>
>> I can swap them around but the forward declaration is still required as
>> aarch64_check_bitmask is 5000 lines before aarch64_bitmask_imm.
>
> OK, how about moving them both above aarch64_check_bitmask?

Sure, I've moved them as well as all related helper functions - it makes the diff
quite large, but they are all together now, which makes sense. I also refactored
aarch64_mov_imm to handle the case of a 64-bit immediate being generated
by a 32-bit MOVZ/MOVN - this simplifies aarch64_internal_mov_immediate
and the movdi patterns even further.

Cheers,
Wilco

v3: move immediate code together and avoid forward declarations,
further cleanups and simplifications.

Improve immediate expansion of immediates which can be created from a
bitmask immediate and 2 MOVKs.  Simplify, refactor and improve 
efficiency of bitmask checks and move immediate. Move various immediate
handling functions together to avoid forward declarations.
Include 32-bit MOVZ/N as valid 64-bit immediates. Add new constraint so
the movdi pattern only needs a single alternative for move immediate.

This reduces the number of 4-instruction immediates in SPECINT/FP by 10-15%.

Passes bootstrap & regress, OK for commit?

gcc/ChangeLog:

    PR target/106583
    * config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
    Add support for a bitmask immediate with 2 MOVKs.
    (aarch64_check_bitmask): New function after refactorization.
    (aarch64_replicate_bitmask_imm): Remove function, merge into...
    (aarch64_bitmask_imm): Simplify replication of small modes.
    Split function into 64-bit only version for efficiency.
    (aarch64_zeroextended_move_imm): New function.
    (aarch64_move_imm): Refactor code.
    (aarch64_uimm12_shift): Move near other immediate functions.
    (aarch64_clamp_to_uimm12_shift): Likewise.
    (aarch64_movk_shift): Likewise.
    (aarch64_replicate_bitmask_imm): Likewise.
    (aarch64_and_split_imm1): Likewise.
    (aarch64_and_split_imm2): Likewise.
    (aarch64_and_bitmask_imm): Likewise.
    (aarch64_movw_imm): Remove.
    * config/aarch64/aarch64.md (movdi_aarch64): Merge 'N' and 'M'
    constraints into single 'O'.
    (mov<mode>_aarch64): Likewise.
    * config/aarch64/aarch64-protos.h (aarch64_move_imm): Use unsigned.
    (aarch64_bitmask_imm): Likewise.
    (aarch64_uimm12_shift): Likewise.
    (aarch64_zeroextended_move_imm): New prototype.
    * config/aarch64/constraints.md: Add 'O' for 32/64-bit immediates,
    limit 'N' to 64-bit only moves.

gcc/testsuite:
    PR target/106583
    * gcc.target/aarch64/pr106583.c: Add new test.

---

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
3e4005c9f4ff1f999f1811c6fb0b2252878dc4ae..b82f9ba7c2bb4cffa16abbf45f87061f72015083
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -755,7 +755,7 @@ void aarch64_post_cfi_startproc (void);
 poly_int64 aarch64_initial_elimination_offset (unsigned, unsigned);
 int aarch64_get_condition_code (rtx);
 bool aarch64_address_valid_for_prefetch_p (rtx, bool);
-bool aarch64_bitmask_imm (HOST_WIDE_INT val, machine_mode);
+bool aarch64_bitmask_imm (unsigned HOST_WIDE_INT val, machine_mode);
 unsigned HOST_WIDE_INT aarch64_and_split_imm1 (HOST_WIDE_INT val_in);
 unsigned HOST_WIDE_INT aarch64_and_split_imm2 (HOST_WIDE_INT val_in);
 bool aarch64_and_bitmask_imm (unsigned HOST_WIDE_INT val_in, machine_mode 
mode);
@@ -792,7 +792,7 @@ bool aarch64_masks_and_shift_for_bfi_p (scalar_int_mode, 
unsigned HOST_WIDE_INT,
 unsigned HOST_WIDE_INT,
 unsigned HOST_WIDE_INT);
 bool aarch64_zero_extend_const_eq (machine_mode, rtx, machine_mode, rtx);
-bool aarch64_move_imm (HOST_WIDE_INT, machine_mode);
+bool aarch64_move_imm (unsigned HOST_WIDE_INT, machine_mode);
 machine_mode aarch64_sve_int_mode (machine_mode);
 opt_machine_mode aarch64_sve_pred_mode (unsigned int);
 machine_mode aarch64_sve_pred_mode (machine_mode);
@@ -842,8 +842,9 @@ bool aarch64_sve_float_arith_immediate_p (rtx, bool);
 bool aarch64_sve_float_mul_immediate_p (rtx);
 bool aarch64_split_dimode_const_store (rtx, rtx);
 bool aarch64_symbolic_address_p (rtx);
-bool aarch64_uimm12_shift (HOST_WIDE_INT);
+bool aarch64_uimm12_shift (unsigned HOST_WIDE_INT);
 int aarch64_movk_shift (const wide_int_ref &, const wide_int_ref &);
+bool aarch64_zeroextended_move_imm (unsigned HOST_WIDE_INT);
 bool aarch64_use_return_insn_p (void);
 const char *aarch64_output_casesi (rtx *);
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
4de55beb067ea8f0be0a90060a785c94bdee708b..785ec07692981d423582051ac0897e5dbc3a001f
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ 

Re: [PATCH][AArch64] Improve bit tests [PR105773]

2022-10-13 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Maybe pre-existing, but are ordered comparisons safe for the
> ZERO_EXTRACT case?  If we extract the top 8 bits (say), zero extend,
> and compare with zero, the result should be >= 0, whereas TST would
> set N to the top bit.

Yes, in principle a zero_extract should always be positive, assuming we never
extract all bits (which would be a no-op). GCC never generates a zero_extract of
the top bits (it becomes a shift), so I don't think that case can be generated.

However I'll change it to use CC_Z in the commit since signed comparisons
of zero extend seem to be folded to equality (or true/false), so there is no
point in supporting anything but equality comparisons anyway.

Cheers,
Wilco

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-12 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

>>> Sounds good, but could you put it before the mode version,
>>> to avoid the forward declaration?
>>
>> I can swap them around but the forward declaration is still required as
>> aarch64_check_bitmask is 5000 lines before aarch64_bitmask_imm.
>
> OK, how about moving them both above aarch64_check_bitmask?

Sure, I've moved them as well as all related helper functions - it makes the diff
quite large, but they are all together now, which makes sense. I also refactored
aarch64_mov_imm to handle the case of a 64-bit immediate being generated
by a 32-bit MOVZ/MOVN - this simplifies aarch64_internal_mov_immediate
and the movdi patterns even further.

Cheers,
Wilco

v3: move immediate code together and avoid forward declarations,
further cleanups and simplifications.

Improve immediate expansion of immediates which can be created from a
bitmask immediate and 2 MOVKs.  Simplify, refactor and improve 
efficiency of bitmask checks and move immediate. Move various immediate
handling functions together to avoid forward declarations.
Include 32-bit MOVZ/N as valid 64-bit immediates. Add new constraint so
the movdi pattern only needs a single alternative for move immediate.

This reduces the number of 4-instruction immediates in SPECINT/FP by 10-15%.

Passes bootstrap & regress, OK for commit?

gcc/ChangeLog:

PR target/106583
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
Add support for a bitmask immediate with 2 MOVKs.
(aarch64_check_bitmask): New function after refactorization.
(aarch64_replicate_bitmask_imm): Remove function, merge into...
(aarch64_bitmask_imm): Simplify replication of small modes.
Split function into 64-bit only version for efficiency.
(aarch64_zeroextended_move_imm): New function.
(aarch64_move_imm): Refactor code.
(aarch64_uimm12_shift): Move near other immediate functions.
(aarch64_clamp_to_uimm12_shift): Likewise.
(aarch64_movk_shift): Likewise.
(aarch64_replicate_bitmask_imm): Likewise.
(aarch64_and_split_imm1): Likewise.
(aarch64_and_split_imm2): Likewise.
(aarch64_and_bitmask_imm): Likewise.
(aarch64_movw_imm): Remove.
* config/aarch64/aarch64.md (movdi_aarch64): Merge 'N' and 'M'
constraints into single 'O'.
(mov<mode>_aarch64): Likewise.
* config/aarch64/aarch64-protos.h (aarch64_move_imm): Use unsigned.
(aarch64_bitmask_imm): Likewise.
(aarch64_uimm12_shift): Likewise.
(aarch64_zeroextended_move_imm): New prototype.
* config/aarch64/constraints.md: Add 'O' for 32/64-bit immediates,
limit 'N' to 64-bit only moves.

gcc/testsuite:
PR target/106583
* gcc.target/aarch64/pr106583.c: Add new test.

---

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
3e4005c9f4ff1f999f1811c6fb0b2252878dc4ae..b82f9ba7c2bb4cffa16abbf45f87061f72015083
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -755,7 +755,7 @@ void aarch64_post_cfi_startproc (void);
 poly_int64 aarch64_initial_elimination_offset (unsigned, unsigned);
 int aarch64_get_condition_code (rtx);
 bool aarch64_address_valid_for_prefetch_p (rtx, bool);
-bool aarch64_bitmask_imm (HOST_WIDE_INT val, machine_mode);
+bool aarch64_bitmask_imm (unsigned HOST_WIDE_INT val, machine_mode);
 unsigned HOST_WIDE_INT aarch64_and_split_imm1 (HOST_WIDE_INT val_in);
 unsigned HOST_WIDE_INT aarch64_and_split_imm2 (HOST_WIDE_INT val_in);
 bool aarch64_and_bitmask_imm (unsigned HOST_WIDE_INT val_in, machine_mode 
mode);
@@ -792,7 +792,7 @@ bool aarch64_masks_and_shift_for_bfi_p (scalar_int_mode, 
unsigned HOST_WIDE_INT,
unsigned HOST_WIDE_INT,
unsigned HOST_WIDE_INT);
 bool aarch64_zero_extend_const_eq (machine_mode, rtx, machine_mode, rtx);
-bool aarch64_move_imm (HOST_WIDE_INT, machine_mode);
+bool aarch64_move_imm (unsigned HOST_WIDE_INT, machine_mode);
 machine_mode aarch64_sve_int_mode (machine_mode);
 opt_machine_mode aarch64_sve_pred_mode (unsigned int);
 machine_mode aarch64_sve_pred_mode (machine_mode);
@@ -842,8 +842,9 @@ bool aarch64_sve_float_arith_immediate_p (rtx, bool);
 bool aarch64_sve_float_mul_immediate_p (rtx);
 bool aarch64_split_dimode_const_store (rtx, rtx);
 bool aarch64_symbolic_address_p (rtx);
-bool aarch64_uimm12_shift (HOST_WIDE_INT);
+bool aarch64_uimm12_shift (unsigned HOST_WIDE_INT);
 int aarch64_movk_shift (const wide_int_ref &, const wide_int_ref &);
+bool aarch64_zeroextended_move_imm (unsigned HOST_WIDE_INT);
 bool aarch64_use_return_insn_p (void);
 const char *aarch64_output_casesi (rtx *);
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
4de55beb067ea8f0be0a90060a785c94bdee708b..785ec07692981d423582051ac0897e5dbc3a001f
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ 

Re: [PATCH][AArch64] Improve bit tests [PR105773]

2022-10-12 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Realise this is awkward, but: CC_NZmode is for operations that set only
> the N and Z flags to useful values.  If we want to take advantage of V
> being zero then I think we need a different mode.
>
> We can't go all the way to CCmode because the carry flag has the opposite
> value compared to subtraction.  But we could have either:
> 
> * CC_INVC (inverted carry) that handles all comparisons, including the
>   redundant unsigned comparisons
>
> * CC_NZV
>
> Guess I've got a slight preference for CC_INVC, but either would be OK IMO.

I've added CC_NZV since that's easier to understand and unsigned comparisons
with zero are always changed into equality comparisons. There were a few cases
where CC_NZ mode was used rather than CC_Z, so I changed those too.

Cheers,
Wilco

v2: Add new CC_NZV mode for cmp+and.

Since AArch64 sets all flags on logical operations, comparisons with zero
can be combined into an AND even if the condition is LE or GT. Add a new
CC_NZV mode used by ANDS/BICS/TST instructions.

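As a hypothetical example of what the new mode enables, a signed comparison of
an AND result against zero can now be merged into the flag-setting form:

/* Illustrative only; the expected code is roughly an
     ands w0, w0, w1
   followed by a 'gt' conditional, instead of separate and + cmp.  */
int
f (int a, int b)
{
  return (a & b) > 0 ? 1 : 2;
}
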
Passes regress, OK for commit?

gcc/ChangeLog:

PR target/105773
* config/aarch64/aarch64.cc (aarch64_select_cc_mode): Allow
GT/LE for merging compare with zero into AND.
(aarch64_get_condition_code_1): Add CC_NZVmode support.
* config/aarch64/aarch64-modes.def: Add CC_NZV.
* config/aarch64/aarch64.md: Use CC_NZV in cmp+and patterns.

gcc/testsuite:
PR target/105773
* gcc.target/aarch64/ands_2.c: Test for ANDS.
* gcc.target/aarch64/bics_2.c: Test for BICS.
* gcc.target/aarch64/tst_2.c: Test for TST.
* gcc.target/aarch64/tst_imm_split_1.c: Fix test.

---

diff --git a/gcc/config/aarch64/aarch64-modes.def 
b/gcc/config/aarch64/aarch64-modes.def
index 
d3c9b74434cd2c0d0cb1a2fd26af8c0bf38a4cfa..0fd4c32ad0bd09f8651d1b8a77378fa4504ff488
 100644
--- a/gcc/config/aarch64/aarch64-modes.def
+++ b/gcc/config/aarch64/aarch64-modes.def
@@ -35,6 +35,7 @@ CC_MODE (CCFPE);
 CC_MODE (CC_SWP);
 CC_MODE (CC_NZC);   /* Only N, Z and C bits of condition flags are valid.
   (Used with SVE predicate tests.)  */
+CC_MODE (CC_NZV);   /* Only N, Z and V bits of condition flags are valid.  */
 CC_MODE (CC_NZ);/* Only N and Z bits of condition flags are valid.  */
 CC_MODE (CC_Z); /* Only Z bit of condition flags is valid.  */
 CC_MODE (CC_C); /* C represents unsigned overflow of a simple addition.  */
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
3a4b1f6987487e959648a343bb25180ea419f397..600e0f41d51242a6f100b3643ce8421ea116ec5c
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -11284,7 +11284,7 @@ aarch64_select_cc_mode (RTX_CODE code, rtx x, rtx y)
   if (y == const0_rtx && (REG_P (x) || SUBREG_P (x))
   && (code == EQ || code == NE)
   && (mode_x == HImode || mode_x == QImode))
-return CC_NZmode;
+return CC_Zmode;
 
   /* Similarly, comparisons of zero_extends from shorter modes can
  be performed using an ANDS with an immediate mask.  */
@@ -11292,15 +11292,22 @@ aarch64_select_cc_mode (RTX_CODE code, rtx x, rtx y)
   && (mode_x == SImode || mode_x == DImode)
   && (GET_MODE (XEXP (x, 0)) == HImode || GET_MODE (XEXP (x, 0)) == QImode)
   && (code == EQ || code == NE))
-return CC_NZmode;
+return CC_Zmode;
+
+  /* ANDS/BICS/TST support equality and all signed comparisons.  */
+  if ((mode_x == SImode || mode_x == DImode)
+  && y == const0_rtx
+  && (code_x == AND || (code_x == ZERO_EXTRACT && CONST_INT_P (XEXP (x, 1))
+   && CONST_INT_P (XEXP (x, 2
+  && (code == EQ || code == NE || code == LT || code == GE
+ || code == GT || code == LE))
+return CC_NZVmode;
 
+  /* ADDS/SUBS correctly set N and Z flags.  */
   if ((mode_x == SImode || mode_x == DImode)
   && y == const0_rtx
   && (code == EQ || code == NE || code == LT || code == GE)
-  && (code_x == PLUS || code_x == MINUS || code_x == AND
- || code_x == NEG
- || (code_x == ZERO_EXTRACT && CONST_INT_P (XEXP (x, 1))
- && CONST_INT_P (XEXP (x, 2)
+  && (code_x == PLUS || code_x == MINUS || code_x == NEG))
 return CC_NZmode;
 
   /* A compare with a shifted operand.  Because of canonicalization,
@@ -11437,6 +11444,19 @@ aarch64_get_condition_code_1 (machine_mode mode, enum 
rtx_code comp_code)
}
   break;
 
+case E_CC_NZVmode:
+  switch (comp_code)
+   {
+   case NE: return AARCH64_NE;
+   case EQ: return AARCH64_EQ;
+   case GE: return AARCH64_PL;
+   case LT: return AARCH64_MI;
+   case GT: return AARCH64_GT;
+   case LE: return AARCH64_LE;
+   default: return -1;
+   }
+  break;
+
 case E_CC_NZmode:
   switch (comp_code)
{
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-07 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

>> Yes, with a more general search loop we can get that case too -
>> it doesn't trigger much though. The code that checks for this is
>> now refactored into a new function. Given there are now many
>> more calls to aarch64_bitmask_imm, I added a streamlined internal
>> entry point that assumes the input is 64-bit.
>
> Sounds good, but could you put it before the mode version,
> to avoid the forward declaration?

I can swap them around but the forward declaration is still required as
aarch64_check_bitmask is 5000 lines before aarch64_bitmask_imm.

Cheers,
Wilco

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-06 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Did you consider handling the case where the movks aren't for
> consecutive bitranges?  E.g. the patch handles:

> but it looks like it would be fairly easy to extend it to:
>
>  0x12345678

Yes, with a more general search loop we can get that case too -
it doesn't trigger much though. The code that checks for this is
now refactored into a new function. Given there are now many
more calls to aarch64_bitmask_imm, I added a streamlined internal
entry point that assumes the input is 64-bit.
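
For instance (an illustrative constant, not from the patch or testsuite), the
more general loop also catches the case where the two MOVKs patch non-adjacent
16-bit chunks:

/* Illustrative only; roughly:
     mov  x0, 0xaaaaaaaaaaaaaaaa   // bitmask immediate
     movk x0, 0x5678, lsl 0
     movk x0, 0x1234, lsl 32  */
long long f (void) { return 0xaaaa1234aaaa5678; }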

Cheers,
Wilco

[PATCH v2][AArch64] Improve immediate expansion [PR106583]

Improve immediate expansion of immediates which can be created from a
bitmask immediate and 2 MOVKs.  Simplify, refactor and improve 
efficiency of bitmask checks.  This reduces the number of 4-instruction
immediates in SPECINT/FP by 10-15%.

Passes regress, OK for commit?

gcc/ChangeLog:

PR target/106583
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
Add support for a bitmask immediate with 2 MOVKs.
(aarch64_check_bitmask): New function after refactorization.
(aarch64_replicate_bitmask_imm): Remove function, merge into...
(aarch64_bitmask_imm): Simplify replication of small modes.
Split function into 64-bit only version for efficiency. 

gcc/testsuite:
PR target/106583
* gcc.target/aarch64/pr106583.c: Add new test.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
926e81f028c82aac9a5fecc18f921f84399c24ae..b2d9c7380975028131d0fe731a97b3909874b87b
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -306,6 +306,7 @@ static machine_mode aarch64_simd_container_mode 
(scalar_mode, poly_int64);
 static bool aarch64_print_address_internal (FILE*, machine_mode, rtx,
aarch64_addr_query_type);
 static HOST_WIDE_INT aarch64_clamp_to_uimm12_shift (HOST_WIDE_INT val);
+static bool aarch64_bitmask_imm (unsigned HOST_WIDE_INT);
 
 /* The processor for which instructions should be scheduled.  */
 enum aarch64_processor aarch64_tune = cortexa53;
@@ -5502,6 +5503,30 @@ aarch64_output_sve_vector_inc_dec (const char *operands, 
rtx x)
 factor, nelts_per_vq);
 }
 
+/* Return true if the immediate VAL can be a bitfield immediate
+   by changing the given MASK bits in VAL to zeroes, ones or bits
+   from the other half of VAL.  Return the new immediate in VAL2.  */
+static inline bool
+aarch64_check_bitmask (unsigned HOST_WIDE_INT val,
+  unsigned HOST_WIDE_INT &val2,
+  unsigned HOST_WIDE_INT mask)
+{
+  val2 = val & ~mask;
+  if (val2 != val && aarch64_bitmask_imm (val2))
+return true;
+  val2 = val | mask;
+  if (val2 != val && aarch64_bitmask_imm (val2))
+return true;
+  val = val & ~mask;
+  val2 = val | (((val >> 32) | (val << 32)) & mask);
+  if (val2 != val && aarch64_bitmask_imm (val2))
+return true;
+  val2 = val | (((val >> 16) | (val << 48)) & mask);
+  if (val2 != val && aarch64_bitmask_imm (val2))
+return true;
+  return false;
+}
+
 static int
 aarch64_internal_mov_immediate (rtx dest, rtx imm, bool generate,
scalar_int_mode mode)
@@ -5568,36 +5593,43 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
generate,
   one_match = ((~val & mask) == 0) + ((~val & (mask << 16)) == 0) +
 ((~val & (mask << 32)) == 0) + ((~val & (mask << 48)) == 0);
 
-  if (zero_match != 2 && one_match != 2)
+  if (zero_match < 2 && one_match < 2)
 {
   /* Try emitting a bitmask immediate with a movk replacing 16 bits.
 For a 64-bit bitmask try whether changing 16 bits to all ones or
 zeroes creates a valid bitmask.  To check any repeated bitmask,
 try using 16 bits from the other 32-bit half of val.  */
 
-  for (i = 0; i < 64; i += 16, mask <<= 16)
-   {
- val2 = val & ~mask;
- if (val2 != val && aarch64_bitmask_imm (val2, mode))
-   break;
- val2 = val | mask;
- if (val2 != val && aarch64_bitmask_imm (val2, mode))
-   break;
- val2 = val2 & ~mask;
- val2 = val2 | (((val2 >> 32) | (val2 << 32)) & mask);
- if (val2 != val && aarch64_bitmask_imm (val2, mode))
-   break;
-   }
-  if (i != 64)
-   {
- if (generate)
+  for (i = 0; i < 64; i += 16)
+   if (aarch64_check_bitmask (val, val2, mask << i))
+ {
+   if (generate)
+ {
+   emit_insn (gen_rtx_SET (dest, GEN_INT (val2)));
+   emit_insn (gen_insv_immdi (dest, GEN_INT (i),
+  GEN_INT ((val >> i) & 0x)));
+ }
+   return 2;
+ }
+}
+
+  /* Try a bitmask plus 2 movk to generate the immediate in 3 instructions.  */
+  if (zero_match + one_match == 0)
+{
+  for (i = 0; i < 48; i += 16)
+   for (int 

[PATCH][AArch64] Improve bit tests [PR105773]

2022-10-05 Thread Wilco Dijkstra via Gcc-patches

Since AArch64 sets all flags on logical operations, comparisons with zero
can be combined into an AND even if the condition is LE or GT.

Passes regress, OK for commit?

gcc:
PR target/105773
* config/aarch64/aarch64.cc (aarch64_select_cc_mode): Allow
GT/LE for merging compare with zero into AND.
(aarch64_get_condition_code_1): Support GT and LE in CC_NZmode.

gcc/testsuite:
PR target/105773
* gcc.target/aarch64/ands_2.c: Test for ANDS.
* gcc.target/aarch64/bics_2.c: Test for BICS.
* gcc.target/aarch64/tst_2.c: Test for TST.
* gcc.target/aarch64/tst_imm_split_1.c: Fix test.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
1601d11710cb6132c80a77bb4fe2f8429519aa5a..00876b08d8fbb1329a37a0ea73d3abf09d28b829
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -11323,7 +11323,8 @@ aarch64_select_cc_mode (RTX_CODE code, rtx x, rtx y)
 
   if ((mode_x == SImode || mode_x == DImode)
   && y == const0_rtx
-  && (code == EQ || code == NE || code == LT || code == GE)
+  && (code == EQ || code == NE || code == LT || code == GE
+ || (code_x == AND && (code == GT || code == LE)))
   && (code_x == PLUS || code_x == MINUS || code_x == AND
  || code_x == NEG
  || (code_x == ZERO_EXTRACT && CONST_INT_P (XEXP (x, 1))
@@ -11471,6 +11472,8 @@ aarch64_get_condition_code_1 (machine_mode mode, enum 
rtx_code comp_code)
case EQ: return AARCH64_EQ;
case GE: return AARCH64_PL;
case LT: return AARCH64_MI;
+   case GT: return AARCH64_GT;
+   case LE: return AARCH64_LE;
default: return -1;
}
   break;
diff --git a/gcc/testsuite/gcc.target/aarch64/ands_2.c 
b/gcc/testsuite/gcc.target/aarch64/ands_2.c
index 
b061b1dfc59c1847cb799a1e49f8e5fc53bf2f14..c8763f234c5f7d19ef9c222756ab5e8a6eaae6fe
 100644
--- a/gcc/testsuite/gcc.target/aarch64/ands_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/ands_2.c
@@ -8,8 +8,7 @@ ands_si_test1 (int a, int b, int c)
 {
   int d = a & b;
 
-  /* { dg-final { scan-assembler-not "ands\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } 
} */
-  /* { dg-final { scan-assembler-times "and\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" 
2 } } */
+  /* { dg-final { scan-assembler "ands\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
   if (d <= 0)
 return a + c;
   else
@@ -21,12 +20,11 @@ ands_si_test2 (int a, int b, int c)
 {
   int d = a & 0x;
 
-  /* { dg-final { scan-assembler-not "ands\tw\[0-9\]+, w\[0-9\]+, -1717986919" 
} } */
-  /* { dg-final { scan-assembler "and\tw\[0-9\]+, w\[0-9\]+, -1717986919" } } 
*/
-  if (d <= 0)
-return a + c;
-  else
+  /* { dg-final { scan-assembler "ands\tw\[0-9\]+, w\[0-9\]+, -1717986919" } } 
*/
+  if (d > 0)
 return b + d + c;
+  else
+return a + c;
 }
 
 int
@@ -34,8 +32,7 @@ ands_si_test3 (int a, int b, int c)
 {
   int d = a & (b << 3);
 
-  /* { dg-final { scan-assembler-not "ands\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+, 
lsl 3" } } */
-  /* { dg-final { scan-assembler "and\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+, lsl 3" 
} } */
+  /* { dg-final { scan-assembler "ands\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+, lsl 
3" } } */
   if (d <= 0)
 return a + c;
   else
@@ -49,8 +46,7 @@ ands_di_test1 (s64 a, s64 b, s64 c)
 {
   s64 d = a & b;
 
-  /* { dg-final { scan-assembler-not "ands\tx\[0-9\]+, x\[0-9\]+, x\[0-9\]+" } 
} */
-  /* { dg-final { scan-assembler-times "and\tx\[0-9\]+, x\[0-9\]+, x\[0-9\]+" 
2 } } */
+  /* { dg-final { scan-assembler "ands\tx\[0-9\]+, x\[0-9\]+, x\[0-9\]+" } } */
   if (d <= 0)
 return a + c;
   else
@@ -62,12 +58,11 @@ ands_di_test2 (s64 a, s64 b, s64 c)
 {
   s64 d = a & 0xll;
 
-  /* { dg-final { scan-assembler-not "ands\tx\[0-9\]+, x\[0-9\]+, 
-6148914691236517206" } } */
-  /* { dg-final { scan-assembler "and\tx\[0-9\]+, x\[0-9\]+, 
-6148914691236517206" } } */
-  if (d <= 0)
-return a + c;
-  else
+  /* { dg-final { scan-assembler "ands\tx\[0-9\]+, x\[0-9\]+, 
-6148914691236517206" } } */
+  if (d > 0)
 return b + d + c;
+  else
+return a + c;
 }
 
 s64
@@ -75,8 +70,7 @@ ands_di_test3 (s64 a, s64 b, s64 c)
 {
   s64 d = a & (b << 3);
 
-  /* { dg-final { scan-assembler-not "ands\tx\[0-9\]+, x\[0-9\]+, x\[0-9\]+, 
lsl 3" } } */
-  /* { dg-final { scan-assembler "and\tx\[0-9\]+, x\[0-9\]+, x\[0-9\]+, lsl 3" 
} } */
+  /* { dg-final { scan-assembler "ands\tx\[0-9\]+, x\[0-9\]+, x\[0-9\]+, lsl 
3" } } */
   if (d <= 0)
 return a + c;
   else
diff --git a/gcc/testsuite/gcc.target/aarch64/bics_2.c 
b/gcc/testsuite/gcc.target/aarch64/bics_2.c
index 
9ccae368c1276618bad691c78fb621a9c82794b7..c1f7e87a6121f7b067b7b37b149da64aa63ebd1a
 100644
--- a/gcc/testsuite/gcc.target/aarch64/bics_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/bics_2.c
@@ -8,8 +8,7 @@ bics_si_test1 (int a, int b, int c)
 {
   int d = a & ~b;
 
-  /* { dg-final { scan-assembler-not "bics\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } 
} */
-  /* { dg-final { 

[PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-04 Thread Wilco Dijkstra via Gcc-patches
Improve immediate expansion of immediates which can be created from a
bitmask immediate and 2 MOVKs.  This reduces the number of 4-instruction
immediates in SPECINT/FP by 10-15%.

Passes regress, OK for commit?

gcc/ChangeLog:

PR target/106583
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
Add support for a bitmask immediate with 2 MOVKs.

gcc/testsuite:
PR target/106583
* gcc.target/aarch64/pr106583.c: Add new test.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
926e81f028c82aac9a5fecc18f921f84399c24ae..1601d11710cb6132c80a77bb4fe2f8429519aa5a
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -5568,7 +5568,7 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
generate,
   one_match = ((~val & mask) == 0) + ((~val & (mask << 16)) == 0) +
 ((~val & (mask << 32)) == 0) + ((~val & (mask << 48)) == 0);
 
-  if (zero_match != 2 && one_match != 2)
+  if (zero_match < 2 && one_match < 2)
 {
   /* Try emitting a bitmask immediate with a movk replacing 16 bits.
 For a 64-bit bitmask try whether changing 16 bits to all ones or
@@ -5600,6 +5600,43 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm, bool 
generate,
}
 }
 
+  /* Try a bitmask plus 2 movk to generate the immediate in 3 instructions.  */
+  if (zero_match + one_match == 0)
+{
+  mask = 0x;
+
+  for (i = 0; i < 64; i += 16)
+   {
+ val2 = val & ~mask;
+ if (aarch64_bitmask_imm (val2, mode))
+   break;
+ val2 = val | mask;
+ if (aarch64_bitmask_imm (val2, mode))
+   break;
+ val2 = val2 & ~mask;
+ val2 = val2 | (((val2 >> 32) | (val2 << 32)) & mask);
+ if (aarch64_bitmask_imm (val2, mode))
+   break;
+
+ mask = (mask << 16) | (mask >> 48);
+   }
+
+  if (i != 64)
+   {
+ if (generate)
+   {
+ emit_insn (gen_rtx_SET (dest, GEN_INT (val2)));
+ emit_insn (gen_insv_immdi (dest, GEN_INT (i),
+GEN_INT ((val >> i) & 0x)));
+ i = (i + 16) & 63;
+ emit_insn (gen_insv_immdi (dest, GEN_INT (i),
+GEN_INT ((val >> i) & 0x)));
+   }
+
+ return 3;
+   }
+}
+
   /* Generate 2-4 instructions, skipping 16 bits of all zeroes or ones which
  are emitted by the initial mov.  If one_match > zero_match, skip set bits,
  otherwise skip zero bits.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106583.c 
b/gcc/testsuite/gcc.target/aarch64/pr106583.c
new file mode 100644
index 
..f0a027a0950e506d4ddaacce5e151f57070948dc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr106583.c
@@ -0,0 +1,30 @@
+/* { dg-do assemble } */
+/* { dg-options "-O2 --save-temps" } */
+
+long f1 (void)
+{
+  return 0x7efefefefefefeff;
+}
+
+long f2 (void)
+{
+  return 0x12345678;
+}
+
+long f3 (void)
+{
+  return 0x12345678;
+}
+
+long f4 (void)
+{
+  return 0x12345678;
+}
+
+long f5 (void)
+{
+  return 0x12345678;
+}
+
+/* { dg-final { scan-assembler-times {\tmovk\t} 10 } } */
+/* { dg-final { scan-assembler-times {\tmov\t} 5 } } */



[PATCH] AArch64: Cleanup option processing code

2022-05-25 Thread Wilco Dijkstra via Gcc-patches

Further cleanup option processing. Remove the duplication of global
variables for CPU and tune settings so that CPU option processing is
simplified even further. Move global variables that need save and 
restore due to target option processing into aarch64.opt. This removes
the need for explicit saving/restoring and unnecessary reparsing of
options.

Bootstrap OK, regress pass, OK for commit?

gcc/
* config/aarch64/aarch64.opt (explicit_tune_core): Rename to
selected_tune.
(explicit_arch): Rename to selected_arch.
(x_aarch64_override_tune_string): Remove.
(aarch64_ra_sign_key): Add as TargetVariable so it gets saved/restored.
(aarch64_override_tune_string): Add Save so it gets saved/restored.
* config/aarch64/aarch64.h (aarch64_architecture_version): Remove.
* config/aarch64/aarch64.cc (aarch64_architecture_version): Remove.
(processor): Remove architecture_version field.
(selected_arch): Remove global.
(selected_cpu): Remove global.
(selected_tune): Remove global.
(aarch64_ra_sign_key): Move global to aarch64.opt so it is saved.
(aarch64_override_options_internal): Use aarch64_get_tune_cpu.
(aarch64_override_options): Further simplify code to only set
selected_arch and selected_tune globals.
(aarch64_option_save): Remove now that target options are saved.
(aarch64_option_restore): Remove redundant target option restores.
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins): Use
AARCH64_ISA_V9.
* config/aarch64/aarch64-opts.h (aarch64_key_type): Add, moved from...
* config/aarch64/aarch64-protos.h (aarch64_key_type): Remove.
(aarch64_ra_sign_key): Remove.

---

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index 
767ee0c763c56a022089a647c7425afb00644644..3d2fb5ec2ef33e66aaa59d216c53a29737262794
 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -82,7 +82,7 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
 {
   aarch64_def_or_undef (flag_unsafe_math_optimizations, "__ARM_FP_FAST", 
pfile);
 
-  builtin_define_with_int_value ("__ARM_ARCH", aarch64_architecture_version);
+  builtin_define_with_int_value ("__ARM_ARCH", AARCH64_ISA_V9 ? 9 : 8);
 
   builtin_define_with_int_value ("__ARM_SIZEOF_MINIMAL_ENUM",
 flag_short_enums ? 1 : 4);
diff --git a/gcc/config/aarch64/aarch64-opts.h 
b/gcc/config/aarch64/aarch64-opts.h
index 
93572fe8330218568435f485a366101eb2c58da9..421648a156a02cc1f43660eddc5de38c2f366a72
 100644
--- a/gcc/config/aarch64/aarch64-opts.h
+++ b/gcc/config/aarch64/aarch64-opts.h
@@ -98,4 +98,10 @@ enum stack_protector_guard {
   SSP_GLOBAL   /* global canary */
 };
 
+/* The key type that -msign-return-address should use.  */
+enum aarch64_key_type {
+  AARCH64_KEY_A,
+  AARCH64_KEY_B
+};
+
 #endif
diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
df311812e8d4b87c0ad8692adacedcd79a8e0f64..dabd047d7ba2c532238720d59ecd59f0f5ba822f
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -672,14 +672,6 @@ enum simd_immediate_check {
   AARCH64_CHECK_MOV  = AARCH64_CHECK_ORR | AARCH64_CHECK_BIC
 };
 
-/* The key type that -msign-return-address should use.  */
-enum aarch64_key_type {
-  AARCH64_KEY_A,
-  AARCH64_KEY_B
-};
-
-extern enum aarch64_key_type aarch64_ra_sign_key;
-
 extern struct tune_params aarch64_tune_params;
 
 /* The available SVE predicate patterns, known in the ACLE as "svpattern".  */
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 
f835da33b72f36bbf25a0e1328135411bd8ab4f6..80cfe4b740798072ed2a3d08089ff889943f916a
 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -148,9 +148,6 @@
 
 #define PCC_BITFIELD_TYPE_MATTERS  1
 
-/* Major revision number of the ARM Architecture implemented by the target.  */
-extern unsigned aarch64_architecture_version;
-
 /* Instruction tuning/selection flags.  */
 
 /* Bit values used to identify processor capabilities.  */
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
060c67a93b95b717563ed063509e3e31c978d891..6bdfd55bcd6308c71fb6cfccb5922b71830c7c4f
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -306,9 +306,6 @@ static bool aarch64_print_address_internal (FILE*, 
machine_mode, rtx,
aarch64_addr_query_type);
 static HOST_WIDE_INT aarch64_clamp_to_uimm12_shift (HOST_WIDE_INT val);
 
-/* Major revision number of the ARM Architecture implemented by the target.  */
-unsigned aarch64_architecture_version;
-
 /* The processor for which instructions should be scheduled.  */
 enum aarch64_processor aarch64_tune = cortexa53;
 
@@ -2677,7 +2674,6 @@ struct processor
   enum aarch64_processor ident;
 

Re: [PATCH] AArch64: Prioritise init_have_lse_atomics constructor [PR 105708]

2022-05-25 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

I've added a comment - as usual it's just a number. A quick grep in gcc and
glibc showed that priorities 98-101 are used, so I just went a bit below so it
has a higher priority than typical initializations.

Cheers,
Wilco

Here is v2:

Increase the priority of the init_have_lse_atomics constructor so it runs
before other constructors. This improves chances that rr works when LSE
atomics are supported.

Regress and bootstrap pass, OK for commit?

2022-05-24  Wilco Dijkstra  

libgcc/
PR libgcc/105708
* config/aarch64/lse-init.c: Increase constructor priority.

---

diff --git a/libgcc/config/aarch64/lse-init.c b/libgcc/config/aarch64/lse-init.c
index 
fc875b7fe80e947623e570eac130e7a14b516551..33b97c8d766895cf0101a851e1dc4ed6a1a053d9
 100644
--- a/libgcc/config/aarch64/lse-init.c
+++ b/libgcc/config/aarch64/lse-init.c
@@ -38,7 +38,9 @@ _Bool __aarch64_have_lse_atomics
 
 unsigned long int __getauxval (unsigned long int);
 
-static void __attribute__((constructor))
+/* Use a higher priority to ensure it runs before user constructors
+   and library constructors with priority 100. */
+static void __attribute__((constructor (90)))
 init_have_lse_atomics (void)
 {
   unsigned long hwcap = __getauxval (AT_HWCAP);


[PATCH] AArch64: Prioritise init_have_lse_atomics constructor [PR 105708]

2022-05-24 Thread Wilco Dijkstra via Gcc-patches

Increase the priority of the init_have_lse_atomics constructor so it runs
before other constructors. This improves chances that rr works when LSE
atomics are supported.

Regress and bootstrap pass, OK for commit?

2022-05-24  Wilco Dijkstra  

libgcc/
PR libgcc/105708
* config/aarch64/lse-init.c: Increase constructor priority.

---
diff --git a/libgcc/config/aarch64/lse-init.c b/libgcc/config/aarch64/lse-init.c
index 
fc875b7fe80e947623e570eac130e7a14b516551..92d91dfeed77f299aa610d72091499271490
 100644
--- a/libgcc/config/aarch64/lse-init.c
+++ b/libgcc/config/aarch64/lse-init.c
@@ -38,7 +38,7 @@ _Bool __aarch64_have_lse_atomics
 
 unsigned long int __getauxval (unsigned long int);
 
-static void __attribute__((constructor))
+static void __attribute__((constructor (90)))
 init_have_lse_atomics (void)
 {
   unsigned long hwcap = __getauxval (AT_HWCAP);


Re: [AArch64] PR105162: emit barrier for __sync and __atomic builtins on CPUs without LSE

2022-05-13 Thread Wilco Dijkstra via Gcc-patches
Hi Sebastian,

>> Note the patch still needs an appropriate commit message.
>
> Added the following ChangeLog entry to the commit message.
>
> * config/aarch64/aarch64-protos.h (atomic_ool_names): Increase 
>dimension
> of str array.
> * config/aarch64/aarch64.cc (aarch64_atomic_ool_func): Call
> memmodel_from_int and handle MEMMODEL_SYNC_*.
> (DEF0): Add __aarch64_*_sync functions.

When you commit this, please also add a PR target/105162 before the changelog 
so the
commit gets automatically added to the PR. We will need backports as well.

Cheers,
Wilco

Re: [PATCH] AArch64: Improve address rematerialization costs

2022-05-12 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> But even if the costs are too high, the patch seems to be overcompensating.
> It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR.

An LDR is not a replacement for ADRP+LDR: you need a store in addition to the
original ADRP+LDR. Basically a simple spill would be comparing these 2
sequences:

ADRP x0, ...
LDR x0, [x0, ...]
STR x0, [SP, ...]
...
LDR x0, [SP, ...]


ADRP x0, ...
LDR x0, [x0, ...]
...
ADRP x0, ...
LDR x0, [x0, ...]

Obviously it's far cheaper to do the latter than the former.

> Giving X zero cost means that a sequence like:
>
>   (set (reg x0) X)
>   (set (reg x1) X)
>
> should stay as-is rather than be changed to:
>
>   (set (reg x0) X)
>   (set (reg x1) (reg x0))
>
> I don't think we want that for multi-instruction constants when
> optimising for size.

I don't believe this is a real problem. The cost queries for address constants
come from register allocation; I don't see them affecting other optimizations.

> Yeah, I wasn't suggesting that we increase the spill costs.  I'm saying

I'm saying that because we've set the spill costs low on purpose to work around
register allocation bugs. There have been some fixes since, so increasing the 
spill
costs may now be feasible (but not trivial).

> that we should look at whether the target-independent RA heuristics need
> to change, whether new target hooks are needed, etc.  We shouldn't go
> into this with the assumption that the target-independent code is
> invariant and that any fix must be in existing aarch64 hooks (rtx costs
> or spill costs).

But what bug do you think exists in target-independent code? It behaves
correctly once we supply more accurate costs. If there were no rematerialization
irrespective of the cost settings, then you could claim there was a bug.

> Maybe it would help to turn the question around for a minute.  Can we
> describe the cases in which it's *better* for the RA to spill a constant
> address to the stack and reload it, rather than rematerialise on demand?

Rematerialization is almost always better than spilling and reloading from the
stack. If the constant requires multiple instructions and there are more than 2
references, it would be better for code size to spill, but for performance it is
better to rematerialize unless there are many references.

You also want to prefer rematerialization over spilling a different live range
when other aspects are comparable.
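
A hypothetical example of the common case (names made up):

/* Illustrative only: the address of 'table' needs an ADRP-based sequence.
   Under register pressure it is cheaper to recompute (rematerialize) that
   address before the second use than to spill it to the stack after the
   first use and reload it later.  */
extern int table[];
void use (int);

void
f (int i, int j)
{
  use (table[i]);
  use (table[j]);  /* the address of 'table' can simply be recomputed here */
}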

Cheers,
Wilco

Re: [PATCH] AArch64: Improve address rematerialization costs

2022-05-12 Thread Wilco Dijkstra via Gcc-patches
Hi,

>> It's also said that chosen alternatives might be the reason that
>> rematerialization
>> is not choosen and alternatives are chosen based on reload heuristics, not 
>> based
>> on actual costs.
>
> Thanks for the pointer.  Yeah, it'd be interesting to know if this
> is the same issue, although I fear the only way of knowing for sure
> is to fix it first and see whether both targets benefit. ;-)

I don't believe this is the same issue - there are lots of register allocation
problems indeed, and many are caused by the complex design. All the alternatives
and register classes create a huge cross product, making it almost impossible to
make good allocation decisions even if they were accurately costed.

I've found that the correct way to deal with this is to reduce all this choice 
as much
as possible. That means splitting instructions into simpler ones with fewer 
alternatives
and register classes. You also need to block it from treating all register 
classes as
equivalent - on AArch64 we had to force floating point values to be allocated 
to floating
point registers (which is obviously how any register allocator should work by 
default),
but maybe x86 doesn't do that yet.

Cheers,
Wilco

Re: [PATCH] AArch64: Cleanup CPU option processing code

2022-05-12 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Looks like you might have attached the old patch.  The aarch64_option_restore
> change is mentioned in the changelog but doesn't appear in the patch itself.

Indeed, not sure how that happened. Here is the correct v2 anyway.

Wilco


The --with-cpu/--with-arch configure option processing not only checks valid 
arguments
but also sets TARGET_CPU_DEFAULT with a CPU and extension bitmask.  This isn't
used, however, since a --with-cpu is translated into a -mcpu option which is
processed as if written on the command line (so TARGET_CPU_DEFAULT is never
accessed).

So remove all the complex processing and bitmask, and just validate the option.
Fix a bug that always reports valid architecture extensions as invalid.  As a
result the CPU processing in aarch64.cc can be simplified.

Bootstrap OK, regress pass, OK for commit?

ChangeLog:
2022-04-19  Wilco Dijkstra  

* config.gcc (aarch64*-*-*): Simplify --with-cpu and --with-arch
processing.  Add support for architectural extensions.
* config/aarch64/aarch64.h (TARGET_CPU_DEFAULT): Remove
AARCH64_CPU_DEFAULT_FLAGS.
(TARGET_CPU_NBITS): Remove.
(TARGET_CPU_MASK): Remove.
* config/aarch64/aarch64.cc (AARCH64_CPU_DEFAULT_FLAGS): Remove define.
(get_tune_cpu): Assert CPU is always valid.
(get_arch): Assert architecture is always valid.
(aarch64_override_options): Cleanup CPU selection code and simplify 
logic.
(aarch64_option_restore): Remove unnecessary checks on tune.
---

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 
c5064dd376660c192d5573997b4fc86b6b3e3838..b48d5451e8027c93fb1f614812589183d0a88c4b
 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -4178,8 +4178,6 @@ case "${target}" in
  pattern=AARCH64_CORE
fi
 
-   ext_mask=AARCH64_CPU_DEFAULT_FLAGS
-
# Find the base CPU or ARCH id in aarch64-cores.def or
# aarch64-arches.def
if [ x"$base_val" = x ] \
@@ -4187,23 +4185,6 @@ case "${target}" in
${srcdir}/config/aarch64/$def \
> /dev/null; then
 
- if [ $which = arch ]; then
-   base_id=`grep "^$pattern(\"$base_val\"," \
- ${srcdir}/config/aarch64/$def | \
- sed -e 's/^[^,]*,[]*//' | \
- sed -e 's/,.*$//'`
-   # Extract the architecture flags from 
aarch64-arches.def
-   ext_mask=`grep "^$pattern(\"$base_val\"," \
-  ${srcdir}/config/aarch64/$def | \
-  sed -e 's/)$//' | \
-  sed -e 's/^.*,//'`
- else
-   base_id=`grep "^$pattern(\"$base_val\"," \
- ${srcdir}/config/aarch64/$def | \
- sed -e 's/^[^,]*,[]*//' | \
- sed -e 's/,.*$//'`
- fi
-
  # Disallow extensions in --with-tune=cortex-a53+crc.
  if [ $which = tune ] && [ x"$ext_val" != x ]; then
echo "Architecture extensions not supported in 
--with-$which=$val" 1>&2
@@ -4234,25 +4215,7 @@ case "${target}" in
grep "^\"$base_ext\""`
 
if [ x"$base_ext" = x ] \
-   || [[ -n $opt_line ]]; then
-
- # These regexp extract the elements based on
- # their group match index in the regexp.
- ext_canon=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\2/"`
- ext_on=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\3/"`
- ext_off=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\4/"`
-
- if [ $ext = $base_ext ]; then
-   # Adding extension
-   ext_mask="("$ext_mask") | ("$ext_on" | 
"$ext_canon")"
- else
-   # Removing extension
-   ext_mask="("$ext_mask") & ~("$ext_off" 
| "$ext_canon")"
- fi
-
+   || [ x"$opt_line" != x ]; then
  true
else
  echo "Unknown extension used in 
--with-$which=$val" 1>&2
@@ -4261,10 +4224,6 @@ case "${target}" 

Re: [PATCH] AArch64: Cleanup CPU option processing code

2022-05-11 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Although invoking ./cc1 directly only half-works with --with-arch,
> it half-works well-enough that I'd still like to keep it working.
> But I agree we should apply your change first, then I can follow up
> with a patch to make --with-* work with ./cc1 later.  (I have a version
> locally and the net result is much simpler than the status quo, as well
> as hopefully actually working properly.)

That sounds good indeed. Is that changing TARGET_CPU_DEFAULT into
a string that could just be parsed like a -mcpu option?

> Do we still need both selected_arch and explicit_arch?  explicit_arch
> seems a misnomer now, since it includes implicit as well as explicit
> choices.  Same for selected_tune and explicit_tune_core.

At the moment we do, since these are settings that must be saved/restored
because they can be overridden. However it may be possible to do further
cleanups to remove some of this. I also wonder whether we can remove the
internal override feature (override_tune_string) since that further
complicates the tunings, and I'm not convinced that it is either useful or
being used at all.

> aarch64_option_restore has:
>
>  if (opts->x_explicit_tune_core == aarch64_none
>  && opts->x_explicit_arch != aarch64_no_arch)
>    selected_tune = _cores[selected_arch->ident];
>  else
>    selected_tune = aarch64_get_tune_cpu (ptr->x_explicit_tune_core);
>
> Is the “if” condition ever true, or can we now restore the tune
> info unconditionally?

Yes that was added a year or so after I created this patch, so this
is now redundant. I've removed it in v2:


The --with-cpu/--with-arch configure option processing not only checks valid 
arguments
but also sets TARGET_CPU_DEFAULT with a CPU and extension bitmask.  This isn't 
used
however since a --with-cpu is translated into a -mcpu option which is processed 
as if
written on the command-line (so TARGET_CPU_DEFAULT is never accessed).

So remove all the complex processing and bitmask, and just validate the option.
Fix a bug that always reports valid architecture extensions as invalid.  As a 
result
the CPU processing in aarch64.c can be simplified.

Bootstrap OK, regress pass, OK for commit?

ChangeLog:
2022-04-19  Wilco Dijkstra  

* config.gcc (aarch64*-*-*): Simplify --with-cpu and --with-arch
processing.  Add support for architectural extensions.
* config/aarch64/aarch64.h (TARGET_CPU_DEFAULT): Remove
AARCH64_CPU_DEFAULT_FLAGS.
(TARGET_CPU_NBITS): Remove.
(TARGET_CPU_MASK): Remove.
* config/aarch64/aarch64.cc (AARCH64_CPU_DEFAULT_FLAGS): Remove define.
(get_tune_cpu): Assert CPU is always valid.
(get_arch): Assert architecture is always valid.
(aarch64_override_options): Cleanup CPU selection code and simplify 
logic.
(aarch64_option_restore): Remove unnecessary checks on tune.

---

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 
c5064dd376660c192d5573997b4fc86b6b3e3838..b48d5451e8027c93fb1f614812589183d0a88c4b
 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -4178,8 +4178,6 @@ case "${target}" in
  pattern=AARCH64_CORE
fi
 
-   ext_mask=AARCH64_CPU_DEFAULT_FLAGS
-
# Find the base CPU or ARCH id in aarch64-cores.def or
# aarch64-arches.def
if [ x"$base_val" = x ] \
@@ -4187,23 +4185,6 @@ case "${target}" in
${srcdir}/config/aarch64/$def \
> /dev/null; then
 
- if [ $which = arch ]; then
-   base_id=`grep "^$pattern(\"$base_val\"," \
- ${srcdir}/config/aarch64/$def | \
- sed -e 's/^[^,]*,[]*//' | \
- sed -e 's/,.*$//'`
-   # Extract the architecture flags from 
aarch64-arches.def
-   ext_mask=`grep "^$pattern(\"$base_val\"," \
-  ${srcdir}/config/aarch64/$def | \
-  sed -e 's/)$//' | \
-  sed -e 's/^.*,//'`
- else
-   base_id=`grep "^$pattern(\"$base_val\"," \
- ${srcdir}/config/aarch64/$def | \
- sed -e 's/^[^,]*,[]*//' | \
- sed -e 's/,.*$//'`
- fi
-
  # Disallow extensions in --with-tune=cortex-a53+crc.
  if [ $which = tune ] && [ x"$ext_val" != x ]; then
echo "Architecture extensions not supported in 
--with-$which=$val" 1>&2
@@ -4234,25 +4215,7 @@ case "${target}" in
grep "^\"$base_ext\""`
 
if [ x"$base_ext" = x ] \
-

Re: [PATCH] AArch64: Improve address rematerialization costs

2022-05-11 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Yeah, I'm not disagreeing with any of that.  It's just a question of
> whether the problem should be fixed by artificially lowering the general
> rtx costs with one particular user (RA spill costs) in mind, or whether
> it should be fixed by making the RA spill code take the factors above
> into account.

The RA spill code already works fine on immediates but not on address
constants. And the reason is that the current rtx costs for addresses are
set artificially high without justification (I checked the patch that increased
the costs but there was nothing explaining why it was beneficial).

It's certainly possible to experiment with increasing the spill costs, but that
won't improve the issue with address constants unless they are at least doubled.
And it has the effect of halving all rtx costs in the register allocator which 
is
likely to cause regressions. So we'd need to adjust many rtx costs to keep the
allocator working, plus fix any further regressions this causes.

Cheers,
Wilco


Re: [PATCH] AArch64: Improve address rematerialization costs

2022-05-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

>> There isn't really a better way of doing this within the existing costing 
>> code.
>
> Yeah, I was wondering whether we could change something there.
> ADRP+LDR is logically more expensive than a single LDR, especially
> when optimising for size, so I think it's reasonable for the rtx_costs
> to say so.  But that doesn't/shouldn't mean that spilling is better
> (for either size or speed).
>
> So it feels like there's something missing in the way the costs are
> being applied.

Calculating accurate spill costs is hard. Spill optimization is done later, so
until then you can't know the actual cost of a spill decision already made.
Spills are also more expensive than you might think due to store latency, more
dirty cachelines etc. There is little benefit in lifting an ADRP to the start
of a function while keeping the ADD/LDR close to their references. Basically
ADRP/MOV are very cheap, so it's a waste to allocate these to long-lived
registers.

Given that there are significant codesize and performance improvements,
it is clear that doing more rematerialization is better even in cases where it
takes 2 instructions to recompute the address. Binaries show a significant
reduction in stack-based loads and stores.
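As a hedged illustration of the kind of rematerialization involved (the exact
code GCC emits will of course vary):

  extern int counter;
  void use (int);

  void
  f (void)
  {
    use (counter);      /* adrp xN, counter; ldr w0, [xN, :lo12:counter] */

    /* ... enough live values here to create register pressure ... */

    use (counter + 1);  /* Recomputing the adrp here is usually cheaper
                           than spilling xN and reloading it.  */
  }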

Cheers,
Wilco



Re: [PATCH] AArch64: Improve address rematerialization costs

2022-05-09 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> I'm not questioning the results, but I think we need to look in more
> detail why rematerialisation requires such low costs.  The point of
> comparison should be against a spill and reload, so any constant
> that is as cheap as a load should be rematerialised.  If that isn't
> happening then it sounds like changes are needed elsewhere.

The simple answer is that rematerializable expressions must have a lower cost
than the spill cost (potentially of something else), otherwise rematerialization
will never happen. The previous costs were set way too high (e.g. 12 for
ADRP+LDR vs 4 for a reload).
This patch basically ensures that is indeed the case. In principle a zero cost
works fine for anything that can be rematerialized. However it may use more
instructions than a spill (of something else), so a small non-zero cost avoids
bloating codesize.

There isn't really a better way of doing this within the existing costing code.
We could try doubling or quadrupling the spill costs but that would create a
lot of fallout since it affects everything.

Cheers,
Wilco


[PATCH] AArch64: Cleanup CPU option processing code

2022-05-09 Thread Wilco Dijkstra via Gcc-patches

The --with-cpu/--with-arch configure option processing not only checks valid 
arguments
but also sets TARGET_CPU_DEFAULT with a CPU and extension bitmask.  This isn't 
used
however since a --with-cpu is translated into a -mcpu option which is processed 
as if
written on the command-line (so TARGET_CPU_DEFAULT is never accessed).

So remove all the complex processing and bitmask, and just validate the option.
Fix a bug that always reports valid architecture extensions as invalid.  As a 
result
the CPU processing in aarch64.cc can be simplified.

Bootstrap OK, regress pass, OK for commit?

ChangeLog:
2022-04-19  Wilco Dijkstra  

* config.gcc (aarch64*-*-*): Simplify --with-cpu and --with-arch
processing.  Add support for architectural extensions.
* config/aarch64/aarch64.h (TARGET_CPU_DEFAULT): Remove
AARCH64_CPU_DEFAULT_FLAGS.
(TARGET_CPU_NBITS): Remove.
(TARGET_CPU_MASK): Remove.
* config/aarch64/aarch64.cc (AARCH64_CPU_DEFAULT_FLAGS): Remove define.
(get_tune_cpu): Assert CPU is always valid.
(get_arch): Assert architecture is always valid.
(aarch64_override_options): Cleanup CPU selection code and simplify 
logic.

---

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 
c5064dd376660c192d5573997b4fc86b6b3e3838..b48d5451e8027c93fb1f614812589183d0a88c4b
 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -4178,8 +4178,6 @@ case "${target}" in
  pattern=AARCH64_CORE
fi
 
-   ext_mask=AARCH64_CPU_DEFAULT_FLAGS
-
# Find the base CPU or ARCH id in aarch64-cores.def or
# aarch64-arches.def
if [ x"$base_val" = x ] \
@@ -4187,23 +4185,6 @@ case "${target}" in
${srcdir}/config/aarch64/$def \
> /dev/null; then
 
- if [ $which = arch ]; then
-   base_id=`grep "^$pattern(\"$base_val\"," \
- ${srcdir}/config/aarch64/$def | \
- sed -e 's/^[^,]*,[]*//' | \
- sed -e 's/,.*$//'`
-   # Extract the architecture flags from 
aarch64-arches.def
-   ext_mask=`grep "^$pattern(\"$base_val\"," \
-  ${srcdir}/config/aarch64/$def | \
-  sed -e 's/)$//' | \
-  sed -e 's/^.*,//'`
- else
-   base_id=`grep "^$pattern(\"$base_val\"," \
- ${srcdir}/config/aarch64/$def | \
- sed -e 's/^[^,]*,[]*//' | \
- sed -e 's/,.*$//'`
- fi
-
  # Disallow extensions in --with-tune=cortex-a53+crc.
  if [ $which = tune ] && [ x"$ext_val" != x ]; then
echo "Architecture extensions not supported in 
--with-$which=$val" 1>&2
@@ -4234,25 +4215,7 @@ case "${target}" in
grep "^\"$base_ext\""`
 
if [ x"$base_ext" = x ] \
-   || [[ -n $opt_line ]]; then
-
- # These regexp extract the elements based on
- # their group match index in the regexp.
- ext_canon=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\2/"`
- ext_on=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\3/"`
- ext_off=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\4/"`
-
- if [ $ext = $base_ext ]; then
-   # Adding extension
-   ext_mask="("$ext_mask") | ("$ext_on" | 
"$ext_canon")"
- else
-   # Removing extension
-   ext_mask="("$ext_mask") & ~("$ext_off" 
| "$ext_canon")"
- fi
-
+   || [ x"$opt_line" != x ]; then
  true
else
  echo "Unknown extension used in 
--with-$which=$val" 1>&2
@@ -4261,10 +4224,6 @@ case "${target}" in
ext_val=`echo $ext_val | sed -e 
's/[a-z0-9]\+//'`
  done
 
- ext_mask="(("$ext_mask") << TARGET_CPU_NBITS)"
- if [ x"$base_id" != x ]; then
-   

[PATCH] AArch64: Improve address rematerialization costs

2022-05-09 Thread Wilco Dijkstra via Gcc-patches
Improve rematerialization costs of addresses.  The current costs are set too 
high
which results in extra register pressure and spilling.  Using lower costs means
addresses will be rematerialized more often rather than being spilled or causing
spills.  This results in significant codesize reductions and performance gains.
SPECINT2017 improves by 0.27% with LTO and 0.16% without LTO.  Codesize is 0.12%
smaller.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-06-01  Wilco Dijkstra  

* config/aarch64/aarch64.cc (aarch64_rtx_costs): Use better 
rematerialization
costs for HIGH, LO_SUM and SYMREF.

---

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
43d87d1b9c4ef1a85094e51f81745f98f1ef27fb..7341849121ffd6b3b0b77c9730e74e751742e852
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -14529,45 +14529,28 @@ cost_plus:
  return false;  /* All arguments need to be in registers.  */
}
 
+/* The following costs are used for rematerialization of addresses.
+   Set a low cost for all global accesses - this ensures they are
+   preferred for rematerialization, blocks them from being spilled
+   and reduces register pressure.  The result is significant codesize
+   reductions and performance gains. */
+
 case SYMBOL_REF:
+  *cost = 0;
 
-  if (aarch64_cmodel == AARCH64_CMODEL_LARGE
- || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC)
-   {
- /* LDR.  */
- if (speed)
-   *cost += extra_cost->ldst.load;
-   }
-  else if (aarch64_cmodel == AARCH64_CMODEL_SMALL
-  || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC)
-   {
- /* ADRP, followed by ADD.  */
- *cost += COSTS_N_INSNS (1);
- if (speed)
-   *cost += 2 * extra_cost->alu.arith;
-   }
-  else if (aarch64_cmodel == AARCH64_CMODEL_TINY
-  || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)
-   {
- /* ADR.  */
- if (speed)
-   *cost += extra_cost->alu.arith;
-   }
+  /* Use a separate remateralization cost for GOT accesses.  */
+  if (aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC
+ && aarch64_classify_symbol (x, 0) == SYMBOL_SMALL_GOT_4G)
+   *cost = COSTS_N_INSNS (1) / 2;
 
-  if (flag_pic)
-   {
- /* One extra load instruction, after accessing the GOT.  */
- *cost += COSTS_N_INSNS (1);
- if (speed)
-   *cost += extra_cost->ldst.load;
-   }
   return true;
 
 case HIGH:
+  *cost = 0;
+  return true;
+
 case LO_SUM:
-  /* ADRP/ADD (immediate).  */
-  if (speed)
-   *cost += extra_cost->alu.arith;
+  *cost = COSTS_N_INSNS (3) / 4;
   return true;
 
 case ZERO_EXTRACT:



Re: [AArch64] PR105162: emit barrier for __sync and __atomic builtins on CPUs without LSE

2022-05-03 Thread Wilco Dijkstra via Gcc-patches
Hi Sebastian,

> Please find attached the patch amended following your recommendations.
> The number of new functions for _sync is reduced by 3x.
> I tested the patch on Graviton2 aarch64-linux.
> I also checked by hand that the outline functions in libgcc look similar to 
> what GCC produces for the inline version.

Yes this looks good to me (still needs maintainer approval). One minor nitpick,
a few of the tests check for __aarch64_cas2 - this should be 
__aarch64_cas2_sync.
Note the patch still needs an appropriate commit message.

Cheers,
Wilco

Re: [AArch64] PR105162: emit barrier for __sync and __atomic builtins on CPUs without LSE

2022-04-19 Thread Wilco Dijkstra via Gcc-patches
Hi Sebastian,

> Wilco pointed out in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162#c7​ 
> that
> "Only __sync needs the extra full barrier, but __atomic does not."
> The attached patch does that by adding out-of-line functions for 
> MEMMODEL_SYNC_*.
> Those new functions contain a barrier on the path without LSE instructions.

Yes, adding _sync versions of the outline functions is the correct approach. 
However
there is no need to have separate _acq/_rel/_seq variants for every function 
since all
but one are _seq. Also we should ensure we generate the same sequence as the 
inlined
versions so that they are consistent. This means ensuring the LDXR macro 
ignores the
'A' for the _sync variants and the swp function switches to acquire semantics.
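For reference, a minimal example of the two builtin families being discussed
(illustrative only - which outline helper each ends up calling is exactly what
this patch adjusts, so the names are not spelled out here):

  int
  cas_sync (int *p, int expected, int desired)
  {
    /* __sync implies a full barrier, which is what needs the extra DMB on
       the non-LSE path of the outline helpers.  */
    return __sync_val_compare_and_swap (p, expected, desired);
  }

  int
  cas_atomic (int *p, int expected, int desired)
  {
    /* __atomic with seq_cst does not need the extra full barrier.  */
    __atomic_compare_exchange_n (p, &expected, desired, 0,
                                 __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
  }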

Cheers,
Wilco

[PATCH] AArch64: Improve rotate patterns

2021-12-08 Thread Wilco Dijkstra via Gcc-patches
Improve and generalize rotate patterns. Rotates by more than half the
bitwidth of a register are canonicalized to rotate left. Many existing
shift patterns don't handle this case correctly, so add rotate left to
the shift iterator and convert rotate left into ror during assembly
output. Add missing zero_extend patterns for shifted BIC, ORN and EON.
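As a small illustrative example of the canonicalization being handled (the
exact instruction chosen may differ):

  unsigned int
  eor_rotated (unsigned int x, unsigned int y)
  {
    /* May reach the backend canonicalized as a rotate-left by 10; the
       updated patterns print it as something like
         eor w0, w1, w0, ror 22
       by converting the left-rotate amount N into 32 - N at output time.  */
    return ((x << 10) | (x >> 22)) ^ y;
  }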

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-12-07  Wilco Dijkstra  

* config/aarch64/aarch64.md
(and_3_compare0): Support rotate left.
(and_si3_compare0_uxtw): Likewise.
(_3): Likewise.
(_si3_uxtw): Likewise.
(one_cmpl_2): Likewise.
(_one_cmpl_3): Likewise.
(_one_cmpl_sidi_uxtw): New pattern.
(eor_one_cmpl_3_alt): Support rotate left.
(eor_one_cmpl_sidi3_alt_ze): Likewise.
(and_one_cmpl_3_compare0): Likewise.
(and_one_cmpl_si3_compare0_uxtw): Likewise.
(and_one_cmpl_3_compare0_no_reuse): Likewise.
(and_3nr_compare0): Likewise.
(*si3_insn_uxtw): Use SHIFT_no_rotate.
(rolsi3_insn_uxtw): New pattern.
* config/aarch64/iterators.md (SHIFT): Add rotate left.
(SHIFT_no_rotate): Add new iterator.
(SHIFT:shift): Print rotate left as ror.
(is_rotl): Add test for left rotate.

* gcc.target/aarch64/ror_2.c: New test.
* gcc.target/aarch64/ror_3.c: New test.

---

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
5297b2d3f95744ac72e36814c6676cc97478d48b..f80679bcb34ea07918b1a0304b32be436568095d
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4419,7 +4419,11 @@ (define_insn "*and_3_compare0"
(set (match_operand:GPI 0 "register_operand" "=r")
(and:GPI (SHIFT:GPI (match_dup 1) (match_dup 2)) (match_dup 3)))]
   ""
-  "ands\\t%0, %3, %1,  %2"
+{
+  if ()
+operands[2] = GEN_INT ( - UINTVAL (operands[2]));
+  return "ands\\t%0, %3, %1,  %2";
+}
   [(set_attr "type" "logics_shift_imm")]
 )
 
@@ -4436,7 +4440,11 @@ (define_insn "*and_si3_compare0_uxtw"
(zero_extend:DI (and:SI (SHIFT:SI (match_dup 1) (match_dup 2))
(match_dup 3]
   ""
-  "ands\\t%w0, %w3, %w1,  %2"
+{
+  if ()
+operands[2] = GEN_INT (32 - UINTVAL (operands[2]));
+  return "ands\\t%w0, %w3, %w1,  %2";
+}
   [(set_attr "type" "logics_shift_imm")]
 )
 
@@ -4447,7 +4455,11 @@ (define_insn "*_3"
  (match_operand:QI 2 "aarch64_shift_imm_" "n"))
 (match_operand:GPI 3 "register_operand" "r")))]
   ""
-  "\\t%0, %3, %1,  %2"
+{
+  if ()
+operands[2] = GEN_INT ( - UINTVAL (operands[2]));
+  return "\\t%0, %3, %1,  %2";
+}
   [(set_attr "type" "logic_shift_imm")]
 )
 
@@ -4504,17 +4516,6 @@ (define_split
   "operands[3] = gen_reg_rtx (mode);"
 )
 
-(define_insn "*_rol3"
-  [(set (match_operand:GPI 0 "register_operand" "=r")
-   (LOGICAL:GPI (rotate:GPI
- (match_operand:GPI 1 "register_operand" "r")
- (match_operand:QI 2 "aarch64_shift_imm_" "n"))
-(match_operand:GPI 3 "register_operand" "r")))]
-  ""
-  "\\t%0, %3, %1, ror #( - %2)"
-  [(set_attr "type" "logic_shift_imm")]
-)
-
 ;; zero_extend versions of above
 (define_insn "*_si3_uxtw"
   [(set (match_operand:DI 0 "register_operand" "=r")
@@ -4524,19 +4525,11 @@ (define_insn "*_si3_uxtw"
  (match_operand:QI 2 "aarch64_shift_imm_si" "n"))
 (match_operand:SI 3 "register_operand" "r"]
   ""
-  "\\t%w0, %w3, %w1,  %2"
-  [(set_attr "type" "logic_shift_imm")]
-)
-
-(define_insn "*_rolsi3_uxtw"
-  [(set (match_operand:DI 0 "register_operand" "=r")
-   (zero_extend:DI
-(LOGICAL:SI (rotate:SI
- (match_operand:SI 1 "register_operand" "r")
- (match_operand:QI 2 "aarch64_shift_imm_si" "n"))
-(match_operand:SI 3 "register_operand" "r"]
-  ""
-  "\\t%w0, %w3, %w1, ror #(32 - %2)"
+{
+  if ()
+operands[2] = GEN_INT (32 - UINTVAL (operands[2]));
+  return "\\t%w0, %w3, %w1,  %2";
+}
   [(set_attr "type" "logic_shift_imm")]
 )
 
@@ -4565,7 +4558,11 @@ (define_insn "*one_cmpl_2"
(not:GPI (SHIFT:GPI (match_operand:GPI 1 "register_operand" "r")
(match_operand:QI 2 "aarch64_shift_imm_" 
"n"]
   ""
-  "mvn\\t%0, %1,  %2"
+{
+  if ()
+operands[2] = GEN_INT ( - UINTVAL (operands[2]));
+  return "mvn\\t%0, %1,  %2";
+}
   [(set_attr "type" "logic_shift_imm")]
 )
 
@@ -4672,7 +4669,28 @@ (define_insn 
"_one_cmpl_3"
   (match_operand:QI 2 "aarch64_shift_imm_" "n")))
 (match_operand:GPI 3 "register_operand" "r")))]
   ""
-  "\\t%0, %3, %1,  %2"
+{
+  if ()
+operands[2] = GEN_INT ( - UINTVAL (operands[2]));
+  return "\\t%0, %3, %1,  %2";
+}
+  [(set_attr "type" "logic_shift_imm")]
+)
+
+;; Zero-extend version of the above.
+(define_insn "_one_cmpl_sidi_uxtw"
+  [(set 

Re: [PATCH] AArch64: Improve address rematerialization costs

2021-11-24 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Can you fold in the rtx costs part of the original GOT relaxation patch?

Sure, see below for the updated version.

> I don't think there's enough information here for me to be able to review
> the patch though.  I'll need to find testcases, look in detail at what
> the rtl passes are doing, and try to work out whether (and why) this is
> a good way of fixing things.

Well today GCC does everything with costs rather than backend callbacks.
I'd be interested in hearing about alternatives that have the same effect 
without a callback that allows a backend to decide between spilling and
rematerialization.

Cheers,
Wilco


v2: fold in GOT remat cost

Improve rematerialization costs of addresses.  The current costs are set too 
high
which results in extra register pressure and spilling.  Using lower costs means
addresses will be rematerialized more often rather than being spilled or causing
spills.  This results in significant codesize reductions and performance gains.
SPECINT2017 improves by 0.27% with LTO and 0.16% without LTO.  Codesize is 0.12%
smaller.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-06-01  Wilco Dijkstra  

* config/aarch64/aarch64.c (aarch64_rtx_costs): Use better 
rematerialization
costs for HIGH, LO_SUM and SYMREF.
---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
39de231d8ac6d10362cdd2b48eb9bd9de60c6703..a7f99ece55383168fb0f77e5c11c501d0bb2f013
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13610,45 +13610,28 @@ cost_plus:
  return false;  /* All arguments need to be in registers.  */
}
 
+/* The following costs are used for rematerialization of addresses.
+   Set a low cost for all global accesses - this ensures they are
+   preferred for rematerialization, blocks them from being spilled
+   and reduces register pressure.  The result is significant codesize
+   reductions and performance gains. */
+
 case SYMBOL_REF:
 
-  if (aarch64_cmodel == AARCH64_CMODEL_LARGE
- || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC)
-   {
- /* LDR.  */
- if (speed)
-   *cost += extra_cost->ldst.load;
-   }
-  else if (aarch64_cmodel == AARCH64_CMODEL_SMALL
-  || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC)
-   {
- /* ADRP, followed by ADD.  */
- *cost += COSTS_N_INSNS (1);
- if (speed)
-   *cost += 2 * extra_cost->alu.arith;
-   }
-  else if (aarch64_cmodel == AARCH64_CMODEL_TINY
-  || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)
-   {
- /* ADR.  */
- if (speed)
-   *cost += extra_cost->alu.arith;
-   }
+  /* Use a separate remateralization cost for GOT accesses.  */
+  if (aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC
+ && aarch64_classify_symbol (x, 0) == SYMBOL_SMALL_GOT_4G)
+   *cost = COSTS_N_INSNS (1) / 2;
 
-  if (flag_pic)
-   {
- /* One extra load instruction, after accessing the GOT.  */
- *cost += COSTS_N_INSNS (1);
- if (speed)
-   *cost += extra_cost->ldst.load;
-   }
+  *cost = 0;
   return true;
 
 case HIGH:
+  *cost = 0;
+  return true;
+
 case LO_SUM:
-  /* ADRP/ADD (immediate).  */
-  if (speed)
-   *cost += extra_cost->alu.arith;
+  *cost = COSTS_N_INSNS (3) / 4;
   return true;
 
 case ZERO_EXTRACT:


[PATCH] AArch64: Fix PR103085

2021-11-05 Thread Wilco Dijkstra via Gcc-patches
The stack protector implementation hides symbols in a const unspec, which means
movdi/movsi patterns must always support const on symbol operands and explicitly
strip away the unspec. Do this for the recently added GOT alternatives. Add a
test to ensure stack-protector tests GOT accesses as well.

Passes bootstrap and regress. OK for commit?

2021-11-05  Wilco Dijkstra  

PR target/103085
* config/aarch64/aarch64.c (aarch64_mov_operand_p): Strip the salt 
first.
* config/aarch64/constraints.md: Support const in Usw.

* gcc.target/aarch64/pr103085.c: New test
---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
430ea036a7be4da842fd08998923a3462457dbfd..39de231d8ac6d10362cdd2b48eb9bd9de60c6703
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -20155,12 +20155,14 @@ aarch64_mov_operand_p (rtx x, machine_mode mode)
   return aarch64_simd_valid_immediate (x, NULL);
 }
 
+  /* Remove UNSPEC_SALT_ADDR before checking symbol reference.  */
+  x = strip_salt (x);
+
   /* GOT accesses are valid moves.  */
   if (SYMBOL_REF_P (x)
   && aarch64_classify_symbolic_expression (x) == SYMBOL_SMALL_GOT_4G)
 return true;
 
-  x = strip_salt (x);
   if (SYMBOL_REF_P (x) && mode == DImode && CONSTANT_ADDRESS_P (x))
 return true;
 
diff --git a/gcc/config/aarch64/constraints.md 
b/gcc/config/aarch64/constraints.md
index 
70ca66070217d06f974b18f7690326bc329fed89..da5438e7ce254272eb10e9ac60d5367c46635e84
 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -152,10 +152,11 @@ (define_constraint "Usa"
(match_test "aarch64_symbolic_address_p (op)")
(match_test "aarch64_mov_operand_p (op, GET_MODE (op))")))
 
+;; const is needed here to support UNSPEC_SALT_ADDR.
 (define_constraint "Usw"
   "@internal
A constraint that matches a small GOT access."
-  (and (match_code "symbol_ref")
+  (and (match_code "const,symbol_ref")
(match_test "aarch64_classify_symbolic_expression (op)
 == SYMBOL_SMALL_GOT_4G")))
 
diff --git a/gcc/testsuite/gcc.target/aarch64/pr103085.c 
b/gcc/testsuite/gcc.target/aarch64/pr103085.c
new file mode 100644
index 
..dbc9c15b71f224b3c7dec0cca5655a31adc207f6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr103085.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-protector-strong -fPIC" } */
+
+void g(int*);
+void
+f (int x)
+{
+  int arr[10];
+  g (arr);
+}
+


Re: [PATCH] AArch64: Improve address rematerialization costs

2021-11-04 Thread Wilco Dijkstra via Gcc-patches

ping


From: Wilco Dijkstra
Sent: 02 June 2021 11:21
To: GCC Patches 
Cc: Kyrylo Tkachov ; Richard Sandiford 

Subject: [PATCH] AArch64: Improve address rematerialization costs 
 
Hi,

Given the large improvements from better register allocation of GOT accesses,
I decided to generalize it to get large gains for normal addressing too:

Improve rematerialization costs of addresses.  The current costs are set too 
high
which results in extra register pressure and spilling.  Using lower costs means
addresses will be rematerialized more often rather than being spilled or causing
spills.  This results in significant codesize reductions and performance gains.
SPECINT2017 improves by 0.27% with LTO and 0.16% without LTO.  Codesize is 0.12%
smaller.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-06-01  Wilco Dijkstra  

    * config/aarch64/aarch64.c (aarch64_rtx_costs): Use better 
rematerialization
    costs for HIGH, LO_SUM and SYMBOL_REF.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
641c83b479e76cbcc75b299eb7ae5f634d9db7cd..08245827daa3f8199b29031e754244c078f0f500
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13444,45 +13444,22 @@ cost_plus:
   return false;  /* All arguments need to be in registers.  */
 }
 
-    case SYMBOL_REF:
+    /* The following costs are used for rematerialization of addresses.
+   Set a low cost for all global accesses - this ensures they are
+   preferred for rematerialization, blocks them from being spilled
+   and reduces register pressure.  The result is significant codesize
+   reductions and performance gains. */
 
-  if (aarch64_cmodel == AARCH64_CMODEL_LARGE
- || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC)
-   {
- /* LDR.  */
- if (speed)
-   *cost += extra_cost->ldst.load;
-   }
-  else if (aarch64_cmodel == AARCH64_CMODEL_SMALL
-  || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC)
-   {
- /* ADRP, followed by ADD.  */
- *cost += COSTS_N_INSNS (1);
- if (speed)
-   *cost += 2 * extra_cost->alu.arith;
-   }
-  else if (aarch64_cmodel == AARCH64_CMODEL_TINY
-  || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)
-   {
- /* ADR.  */
- if (speed)
-   *cost += extra_cost->alu.arith;
-   }
-
-  if (flag_pic)
-   {
- /* One extra load instruction, after accessing the GOT.  */
- *cost += COSTS_N_INSNS (1);
- if (speed)
-   *cost += extra_cost->ldst.load;
-   }
+    case SYMBOL_REF:
+  *cost = 0;
   return true;
 
 case HIGH:
+  *cost = 0;
+  return true;
+
 case LO_SUM:
-  /* ADRP/ADD (immediate).  */
-  if (speed)
-   *cost += extra_cost->alu.arith;
+  *cost = COSTS_N_INSNS (3) / 4;
   return true;
 
 case ZERO_EXTRACT:

[PATCH v2] AArch64: Cleanup CPU option processing code

2021-11-04 Thread Wilco Dijkstra via Gcc-patches
v2: rebased

The --with-cpu/--with-arch configure option processing not only checks valid 
arguments
but also sets TARGET_CPU_DEFAULT with a CPU and extension bitmask.  This isn't 
used
however since a --with-cpu is translated into a -mcpu option which is processed 
as if
written on the command-line (so TARGET_CPU_DEFAULT is never accessed).

So remove all the complex processing and bitmask, and just validate the option.
Fix a bug that always reports valid architecture extensions as invalid.  As a 
result
the CPU processing in aarch64.c can be simplified.

Bootstrap OK, regress pass, OK for commit?

ChangeLog:
2020-09-03  Wilco Dijkstra  

* config.gcc (aarch64*-*-*): Simplify --with-cpu and --with-arch
processing.  Add support for architectural extensions.
* config/aarch64/aarch64.h (TARGET_CPU_DEFAULT): Remove
AARCH64_CPU_DEFAULT_FLAGS.
* config/aarch64/aarch64.c (AARCH64_CPU_DEFAULT_FLAGS): Remove define.
(get_tune_cpu): Assert CPU is always valid.
(get_arch): Assert architecture is always valid.
(aarch64_override_options): Cleanup CPU selection code and simplify 
logic.

---

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 
aa5bd5d14590e38dcee979f236e60c2505a789f9..25b0fa30a2b7cae4bf1e617db9dd70d9b321eeda
 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -4146,8 +4146,6 @@ case "${target}" in
  pattern=AARCH64_CORE
fi
 
-   ext_mask=AARCH64_CPU_DEFAULT_FLAGS
-
# Find the base CPU or ARCH id in aarch64-cores.def or
# aarch64-arches.def
if [ x"$base_val" = x ] \
@@ -4155,23 +4153,6 @@ case "${target}" in
${srcdir}/config/aarch64/$def \
> /dev/null; then
 
- if [ $which = arch ]; then
-   base_id=`grep "^$pattern(\"$base_val\"," \
- ${srcdir}/config/aarch64/$def | \
- sed -e 's/^[^,]*,[]*//' | \
- sed -e 's/,.*$//'`
-   # Extract the architecture flags from 
aarch64-arches.def
-   ext_mask=`grep "^$pattern(\"$base_val\"," \
-  ${srcdir}/config/aarch64/$def | \
-  sed -e 's/)$//' | \
-  sed -e 's/^.*,//'`
- else
-   base_id=`grep "^$pattern(\"$base_val\"," \
- ${srcdir}/config/aarch64/$def | \
- sed -e 's/^[^,]*,[]*//' | \
- sed -e 's/,.*$//'`
- fi
-
  # Disallow extensions in --with-tune=cortex-a53+crc.
  if [ $which = tune ] && [ x"$ext_val" != x ]; then
echo "Architecture extensions not supported in 
--with-$which=$val" 1>&2
@@ -4202,25 +4183,7 @@ case "${target}" in
grep "^\"$base_ext\""`
 
if [ x"$base_ext" = x ] \
-   || [[ -n $opt_line ]]; then
-
- # These regexp extract the elements based on
- # their group match index in the regexp.
- ext_canon=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\2/"`
- ext_on=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\3/"`
- ext_off=`echo -e "$opt_line" | \
-   sed -e "s/$sed_patt/\4/"`
-
- if [ $ext = $base_ext ]; then
-   # Adding extension
-   ext_mask="("$ext_mask") | ("$ext_on" | 
"$ext_canon")"
- else
-   # Removing extension
-   ext_mask="("$ext_mask") & ~("$ext_off" 
| "$ext_canon")"
- fi
-
+   || [ x"$opt_line" != x ]; then
  true
else
  echo "Unknown extension used in 
--with-$which=$val" 1>&2
@@ -4229,10 +4192,6 @@ case "${target}" in
ext_val=`echo $ext_val | sed -e 
's/[a-z0-9]\+//'`
  done
 
- ext_mask="(("$ext_mask") << 6)"
- if [ x"$base_id" != x ]; then
-   target_cpu_cname="TARGET_CPU_$base_id | 
$ext_mask"
- fi
  

Re: [PATCH v3] AArch64: Improve GOT addressing

2021-11-02 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> - Why do we rewrite the constant moves after reload into ldr_got_small_sidi
>   and ldr_got_small_?  Couldn't we just get the move patterns to
>   output the sequence directly?

That's possible too; however, it makes the movsi/di patterns more complex.
See version v4 below.

> - I think we should leave out the rtx_costs part and deal with that
>   separately.  This patch should just be about whether we emit two
>   separate define_insns for the move or whether we keep a single one
>   (to support relaxation).

As the title and description explain, code quality improves significantly by
keeping the instructions together before we even consider linker relaxation.
The cost improvements can be done separately but they are important to
get the measured code quality gains.

Here is v4:

Improve GOT addressing by treating the instructions as a pair.  This reduces
register pressure and improves code quality significantly.  SPECINT2017 improves
by 0.6% with -fPIC and codesize is 0.73% smaller.  Perlbench has 0.9% smaller
codesize, 1.5% fewer executed instructions and is 1.8% faster on Neoverse N1.
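For reference, a small example of the access sequence being kept together
(register numbers and scheduling are illustrative only):

  extern int counter;

  int
  get_counter (void)   /* compiled with -O2 -fPIC */
  {
    /* Expected sequence under the small PIC code model:
         adrp  x0, :got:counter
         ldr   x0, [x0, :got_lo12:counter]   <- ADRP/LDR treated as a pair
         ldr   w0, [x0]  */
    return counter;
  }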

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-11-02  Wilco Dijkstra  

* config/aarch64/aarch64.md (movsi): Add alternative for GOT accesses.
(movdi): Likewise.
(ldr_got_small_): Remove pattern.
(ldr_got_small_sidi): Likewise.
* config/aarch64/aarch64.c (aarch64_load_symref_appropriately): Keep
GOT accesses as moves.
(aarch64_print_operand): Correctly print got_lo12 in L specifier.
(aarch64_mov_operand_p): Make GOT accesses valid move operands.
* config/aarch64/constraints.md: Add new constraint Usw for GOT access.

---
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
6d26218a61bd0f9e369bb32388b7f8643b632172..ab2954d11333b3f5a419c55d0a74f13d1df70680
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3753,47 +3753,8 @@ aarch64_load_symref_appropriately (rtx dest, rtx imm,
   }
 
 case SYMBOL_SMALL_GOT_4G:
-  {
-   /* In ILP32, the mode of dest can be either SImode or DImode,
-  while the got entry is always of SImode size.  The mode of
-  dest depends on how dest is used: if dest is assigned to a
-  pointer (e.g. in the memory), it has SImode; it may have
-  DImode if dest is dereferenced to access the memeory.
-  This is why we have to handle three different ldr_got_small
-  patterns here (two patterns for ILP32).  */
-
-   rtx insn;
-   rtx mem;
-   rtx tmp_reg = dest;
-   machine_mode mode = GET_MODE (dest);
-
-   if (can_create_pseudo_p ())
- tmp_reg = gen_reg_rtx (mode);
-
-   emit_move_insn (tmp_reg, gen_rtx_HIGH (mode, imm));
-   if (mode == ptr_mode)
- {
-   if (mode == DImode)
- insn = gen_ldr_got_small_di (dest, tmp_reg, imm);
-   else
- insn = gen_ldr_got_small_si (dest, tmp_reg, imm);
-
-   mem = XVECEXP (SET_SRC (insn), 0, 0);
- }
-   else
- {
-   gcc_assert (mode == Pmode);
-
-   insn = gen_ldr_got_small_sidi (dest, tmp_reg, imm);
-   mem = XVECEXP (XEXP (SET_SRC (insn), 0), 0, 0);
- }
-
-   gcc_assert (MEM_P (mem));
-   MEM_READONLY_P (mem) = 1;
-   MEM_NOTRAP_P (mem) = 1;
-   emit_insn (insn);
-   return;
-  }
+  emit_insn (gen_rtx_SET (dest, imm));
+  return;
 
 case SYMBOL_SMALL_TLSGD:
   {
@@ -11159,7 +11120,7 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   switch (aarch64_classify_symbolic_expression (x))
{
case SYMBOL_SMALL_GOT_4G:
- asm_fprintf (asm_out_file, ":lo12:");
+ asm_fprintf (asm_out_file, ":got_lo12:");
  break;
 
case SYMBOL_SMALL_TLSGD:
@@ -20241,6 +20202,11 @@ aarch64_mov_operand_p (rtx x, machine_mode mode)
   return aarch64_simd_valid_immediate (x, NULL);
 }
 
+  /* GOT accesses are valid moves.  */
+  if (SYMBOL_REF_P (x)
+  && aarch64_classify_symbolic_expression (x) == SYMBOL_SMALL_GOT_4G)
+return true;
+
   x = strip_salt (x);
   if (SYMBOL_REF_P (x) && mode == DImode && CONSTANT_ADDRESS_P (x))
 return true;
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
7085cd4a51dc4c22def9b95f2221b3847603a9e5..9d38269e9429597b825838faf9f241216cc6ab47
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1263,8 +1263,8 @@ (define_expand "mov"
 )
 
 (define_insn_and_split "*movsi_aarch64"
-  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,k,r,r,r,r, r,w, m, m,  
r,  r, w,r,w, w")
-   (match_operand:SI 1 "aarch64_mov_operand"  " 
r,r,k,M,n,Usv,m,m,rZ,w,Usa,Ush,rZ,w,w,Ds"))]
+  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,k,r,r,r,r, r,w, m, m,  
r,  r,  r, w,r,w, w")
+   (match_operand:SI 1 "aarch64_mov_operand"  " 

Re: [PATCH v3] AArch64: Improve GOT addressing

2021-10-20 Thread Wilco Dijkstra via Gcc-patches
ping


From: Wilco Dijkstra
Sent: 04 June 2021 14:44
To: Richard Sandiford 
Cc: Kyrylo Tkachov ; GCC Patches 

Subject: [PATCH v3] AArch64: Improve GOT addressing 
 
Hi Richard,

This merges the v1 and v2 patches and removes the spurious MEM from
ldr_got_small_si/di. This has been rebased after [1], and the performance
gain has now doubled.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571708.html

Improve GOT addressing by treating the instructions as a pair.  This reduces
register pressure and improves code quality significantly.  SPECINT2017 improves
by 0.6% with -fPIC and codesize is 0.73% smaller.  Perlbench has 0.9% smaller
codesize, 1.5% fewer executed instructions and is 1.8% faster on Neoverse N1.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-06-04  Wilco Dijkstra  

    * config/aarch64/aarch64.md (movsi): Split GOT accesses after reload.
    (movdi): Likewise.
    (ldr_got_small_): Remove MEM and LO_SUM, emit ADRP+LDR GOT 
sequence.
    (ldr_got_small_sidi): Likewise.
    * config/aarch64/aarch64.c (aarch64_load_symref_appropriately): Delay
    splitting of GOT accesses until after reload. Remove tmp_reg and MEM.
    (aarch64_print_operand): Correctly print got_lo12 in L specifier.
    (aarch64_rtx_costs): Set rematerialization cost for GOT accesses.
    (aarch64_mov_operand_p): Make GOT accesses valid move operands.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
08245827daa3f8199b29031e754244c078f0f500..11ea33c70fb06194fadfe94322fdfa098e5320fc
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3615,6 +3615,14 @@ aarch64_load_symref_appropriately (rtx dest, rtx imm,
 
 case SYMBOL_SMALL_GOT_4G:
   {
+   /* Use movdi for GOT accesses until after reload - this improves
+  CSE and rematerialization.  */
+   if (!reload_completed)
+ {
+   emit_insn (gen_rtx_SET (dest, imm));
+   return;
+ }
+
 /* In ILP32, the mode of dest can be either SImode or DImode,
    while the got entry is always of SImode size.  The mode of
    dest depends on how dest is used: if dest is assigned to a
@@ -3624,34 +3632,21 @@ aarch64_load_symref_appropriately (rtx dest, rtx imm,
    patterns here (two patterns for ILP32).  */
 
 rtx insn;
-   rtx mem;
-   rtx tmp_reg = dest;
 machine_mode mode = GET_MODE (dest);
 
-   if (can_create_pseudo_p ())
- tmp_reg = gen_reg_rtx (mode);
-
-   emit_move_insn (tmp_reg, gen_rtx_HIGH (mode, imm));
 if (mode == ptr_mode)
   {
 if (mode == DImode)
- insn = gen_ldr_got_small_di (dest, tmp_reg, imm);
+ insn = gen_ldr_got_small_di (dest, imm);
 else
- insn = gen_ldr_got_small_si (dest, tmp_reg, imm);
-
-   mem = XVECEXP (SET_SRC (insn), 0, 0);
+ insn = gen_ldr_got_small_si (dest, imm);
   }
 else
   {
 gcc_assert (mode == Pmode);
-
-   insn = gen_ldr_got_small_sidi (dest, tmp_reg, imm);
-   mem = XVECEXP (XEXP (SET_SRC (insn), 0), 0, 0);
+   insn = gen_ldr_got_small_sidi (dest, imm);
   }
 
-   gcc_assert (MEM_P (mem));
-   MEM_READONLY_P (mem) = 1;
-   MEM_NOTRAP_P (mem) = 1;
 emit_insn (insn);
 return;
   }
@@ -11019,7 +11014,7 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   switch (aarch64_classify_symbolic_expression (x))
 {
 case SYMBOL_SMALL_GOT_4G:
- asm_fprintf (asm_out_file, ":lo12:");
+ asm_fprintf (asm_out_file, ":got_lo12:");
   break;
 
 case SYMBOL_SMALL_TLSGD:
@@ -13452,6 +13447,12 @@ cost_plus:
 
 case SYMBOL_REF:
   *cost = 0;
+
+  /* Use a separate remateralization cost for GOT accesses.  */
+  if (aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC
+ && aarch64_classify_symbol (x, 0) == SYMBOL_SMALL_GOT_4G)
+   *cost = COSTS_N_INSNS (1) / 2;
+
   return true;
 
 case HIGH:
@@ -19907,6 +19908,11 @@ aarch64_mov_operand_p (rtx x, machine_mode mode)
   return aarch64_simd_valid_immediate (x, NULL);
 }
 
+  /* GOT accesses are valid moves until after regalloc.  */
+  if (SYMBOL_REF_P (x)
+  && aarch64_classify_symbolic_expression (x) == SYMBOL_SMALL_GOT_4G)
+    return true;
+
   x = strip_salt (x);
   if (SYMBOL_REF_P (x) && mode == DImode && CONSTANT_ADDRESS_P (x))
 return true;
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
abfd84526745d029ad4953eabad6dd17b159a218..30effca6f3562f6870a6cc8097750e63bb0d424d
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1283,8 +1283,11 @@ (define_insn_and_split "*movsi_aarch64"
    fmov\\t%w0, %s1
    fmov\\t%s0, %s1
    * return aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);"
-  "CONST_INT_P (operands[1]) 

Re: [PATCH] AArch64: Improve address rematerialization costs

2021-10-20 Thread Wilco Dijkstra via Gcc-patches
ping


From: Wilco Dijkstra
Sent: 02 June 2021 11:21
To: GCC Patches 
Cc: Kyrylo Tkachov ; Richard Sandiford 

Subject: [PATCH] AArch64: Improve address rematerialization costs 
 
Hi,

Given the large improvements from better register allocation of GOT accesses,
I decided to generalize it to get large gains for normal addressing too:

Improve rematerialization costs of addresses.  The current costs are set too 
high
which results in extra register pressure and spilling.  Using lower costs means
addresses will be rematerialized more often rather than being spilled or causing
spills.  This results in significant codesize reductions and performance gains.
SPECINT2017 improves by 0.27% with LTO and 0.16% without LTO.  Codesize is 0.12%
smaller.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-06-01  Wilco Dijkstra  

    * config/aarch64/aarch64.c (aarch64_rtx_costs): Use better 
rematerialization
    costs for HIGH, LO_SUM and SYMBOL_REF.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
641c83b479e76cbcc75b299eb7ae5f634d9db7cd..08245827daa3f8199b29031e754244c078f0f500
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13444,45 +13444,22 @@ cost_plus:
   return false;  /* All arguments need to be in registers.  */
 }
 
-    case SYMBOL_REF:
+    /* The following costs are used for rematerialization of addresses.
+   Set a low cost for all global accesses - this ensures they are
+   preferred for rematerialization, blocks them from being spilled
+   and reduces register pressure.  The result is significant codesize
+   reductions and performance gains. */
 
-  if (aarch64_cmodel == AARCH64_CMODEL_LARGE
- || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC)
-   {
- /* LDR.  */
- if (speed)
-   *cost += extra_cost->ldst.load;
-   }
-  else if (aarch64_cmodel == AARCH64_CMODEL_SMALL
-  || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC)
-   {
- /* ADRP, followed by ADD.  */
- *cost += COSTS_N_INSNS (1);
- if (speed)
-   *cost += 2 * extra_cost->alu.arith;
-   }
-  else if (aarch64_cmodel == AARCH64_CMODEL_TINY
-  || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)
-   {
- /* ADR.  */
- if (speed)
-   *cost += extra_cost->alu.arith;
-   }
-
-  if (flag_pic)
-   {
- /* One extra load instruction, after accessing the GOT.  */
- *cost += COSTS_N_INSNS (1);
- if (speed)
-   *cost += extra_cost->ldst.load;
-   }
+    case SYMBOL_REF:
+  *cost = 0;
   return true;
 
 case HIGH:
+  *cost = 0;
+  return true;
+
 case LO_SUM:
-  /* ADRP/ADD (immediate).  */
-  if (speed)
-   *cost += extra_cost->alu.arith;
+  *cost = COSTS_N_INSNS (3) / 4;
   return true;
 
 case ZERO_EXTRACT:


Re: [PATCH] AArch64: Tune case-values-threshold

2021-10-19 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> The problem is that you're effectively asking for these values to be
> taken on faith without providing any analysis and without describing
> how you arrived at the new numbers.  Did you try other values too?
> If so, how did they compare with the numbers that you finally chose?
> At least that would give an indication of where the boundaries are.

Yes, I obviously tried other values, pretty much all in the range 1-20. There
is generally a range of 4-5 values that are very similar in size, and then you
choose one in the middle that also looks good for performance.

> For example, it's easier to believe that 8 is the right value for -Os if
> you say that you tried 9 and 7 as well, and they were worse than 8 by X%
> and Y%.  This would also help anyone who wants to tweak the numbers
> again in future.

For -Os, the size range for values 6-10 is within 0.01% so they are virtually
identical and I picked the median. Whether this will remain best in the future
is unclear since it depends on so many things, so at some point it needs
to be looked at again, just like most other tunings.

> BTW, which CPU did you use to do the experiments?  Are the tuning
> parameters for that CPU already consistent with the new generic values?

This was done on Neoverse N1. Almost no CPUs use per-CPU tuning for this.

Cheers,
Wilco

Re: [PATCH] AArch64: Tune case-values-threshold

2021-10-19 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> I'm just concerned that here we're using the same explanation but with
> different numbers.  Why are the new numbers more right than the old ones
> (especially when it comes to code size, where the trade-off hasn't
> really changed)?

Like all tuning/costing parameters, these values are never fixed but change
over time due to optimizations, microarchitectures and workloads.
The previous values were out of date, so I retuned them by benchmarking
different values and choosing the best combinations.

> It would be good to have more discussion of why certain numbers are
> too small or too high, and why 8 is the right pivot point for -Os.

You mean add more discussion in the comment? That comment is already overly
large and too specific - it would be better to reduce it. The -Os value was 
never
tuned, and 8 turns out to be faster and smaller than GCC's default.

Cheers,
Wilco

[PATCH] AArch64: Enable fast shifts on Neoverse V1/N2

2021-10-18 Thread Wilco Dijkstra via Gcc-patches
Enable the fast shift feature in Neoverse V1 and N2 tunings as well.
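For illustration, the kind of code this tuning flag affects - with cheap
shift/extend the folded operand is costed as (nearly) free, so combined
instructions like the one below are preferred (exact codegen may vary):

  long
  index_add (long x, int i)
  {
    /* Expected to stay a single instruction such as
         add x0, x0, w1, sxtw 3
       rather than a separate extend/shift followed by an add.  */
    return x + ((long) i << 3);
  }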

ChangeLog:
2021-10-18  Wilco Dijkstra  

* config/aarch64/aarch64.c (neoversev1_tunings):
Enable AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND.
(neoversen2_tunings): Likewise.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
c7b76a7cdeea0539cd73f7987d1a4354d0a40624..e65afe39047359a3279ae6b2047a3e9a04e6c2a9
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1832,7 +1832,8 @@ static const struct tune_params neoversev1_tunings =
   tune_params::AUTOPREFETCHER_WEAK,/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
| AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
-   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),/* tune_flags.  */
+   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
+   | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND),   /* tune_flags.  */
   _prefetch_tune
 };
 
@@ -1858,7 +1859,7 @@ static const struct tune_params neoversen2_tunings =
   2,   /* min_div_recip_mul_df.  */
   0,   /* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE),   /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND), /* tune_flags.  */
   _prefetch_tune
 };
 

[PATCH] AArch64: Tune case-values-threshold

2021-10-18 Thread Wilco Dijkstra via Gcc-patches

Tune the case-values-threshold setting for modern cores.  A value of 11 improves
SPECINT2017 by 0.2% and reduces codesize by 0.04%.  With -Os use value 8 which
reduces codesize by 0.07%.
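For reference, a sketch of what the parameter controls (illustrative only -
other switch optimizations may apply to very regular cases):

  extern void handle (int);

  void
  dispatch (int x)
  {
    /* With at least as many cases as the threshold (11 here, 8 at -Os)
       a jump table is used; with fewer, a compare/branch sequence.  */
    switch (x)
      {
      case 0:  handle (12); break;
      case 1:  handle (5);  break;
      case 2:  handle (71); break;
      case 3:  handle (9);  break;
      case 4:  handle (33); break;
      case 5:  handle (2);  break;
      case 6:  handle (58); break;
      case 7:  handle (40); break;
      case 8:  handle (17); break;
      case 9:  handle (88); break;
      case 10: handle (26); break;
      }
  }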

Passes regress, OK for commit?

ChangeLog:

2021-10-18  Wilco Dijkstra  

* config/aarch64/aarch64.c (aarch64_case_values_threshold):
Change to 8 with -Os, 11 otherwise.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
f5b25a7f7041645921e6ad85714efda73b993492..adc5256c5ccc1182710d87cc6a1091083d888663
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -9360,8 +9360,8 @@ aarch64_cannot_force_const_mem (machine_mode mode 
ATTRIBUTE_UNUSED, rtx x)
The expansion for a table switch is quite expensive due to the number
of instructions, the table lookup and hard to predict indirect jump.
When optimizing for speed, and -O3 enabled, use the per-core tuning if 
-   set, otherwise use tables for > 16 cases as a tradeoff between size and
-   performance.  When optimizing for size, use the default setting.  */
+   set, otherwise use tables for >= 11 cases as a tradeoff between size and
+   performance.  When optimizing for size, use 8 for smallest codesize.  */
 
 static unsigned int
 aarch64_case_values_threshold (void)
@@ -9372,7 +9372,7 @@ aarch64_case_values_threshold (void)
   && selected_cpu->tune->max_case_values != 0)
 return selected_cpu->tune->max_case_values;
   else
-return optimize_size ? default_case_values_threshold () : 17;
+return optimize_size ? 8 : 11;
 }
 
 /* Return true if register REGNO is a valid index register.


Re: [PATCH] AArch64: Add support for __builtin_roundeven[f] [PR100966]

2021-06-22 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> So rather than have two patterns that generate frintn, I think
> it would be better to change the existing frint_pattern entry to
> "roundeven" instead, and fix whatever the fallout is.  Hopefully it
> shouldn't be too bad, since we already use the optab names for the
> other UNSPEC_FRINT* codes.

Well, it requires various changes to the arm_neon headers since they use the
existing intrinsics. If that is not considered a risky ABI change, then here
is v2:

Enable __builtin_roundeven[f] by changing existing frintn to roundeven.
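For completeness, a tiny example of what the builtin is expected to map to:

  double
  round_even (double x)
  {
    /* Round to nearest with ties to even; expected to emit a single
       frintn d0, d0 instruction.  */
    return __builtin_roundeven (x);
  }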

Bootstrap OK and passes regress.

ChangeLog:
2021-06-18  Wilco Dijkstra  

PR target/100966
* config/aarch64/aarch64.md (frint_pattern): Update comment.
* config/aarch64/aarch64-simd-builtins.def: Change frintn to roundeven.
* config/aarch64/arm_fp16.h: Change frintn to roundeven.
* config/aarch64/arm_neon.h: Likewise.
* config/aarch64/iterators.md (frint_pattern): Use roundeven for FRINTN.

gcc/testsuite
PR target/100966
* gcc.target/aarch64/frint.x: Add roundeven tests.
* gcc.target/aarch64/frint_double.c: Likewise.
* gcc.target/aarch64/frint_float.c: Likewise.

---

diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
b/gcc/config/aarch64/aarch64-simd-builtins.def
index 
b885bd5b38bf7ad83eb9d801284bf9b34db17210..534cc8ccb538c4fb0c208a7035020e131656260d
 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -485,7 +485,7 @@
   BUILTIN_VHSDF (UNOP, nearbyint, 2, FP)
   BUILTIN_VHSDF (UNOP, rint, 2, FP)
   BUILTIN_VHSDF (UNOP, round, 2, FP)
-  BUILTIN_VHSDF_HSDF (UNOP, frintn, 2, FP)
+  BUILTIN_VHSDF_HSDF (UNOP, roundeven, 2, FP)

   VAR1 (UNOP, btrunc, 2, FP, hf)
   VAR1 (UNOP, ceil, 2, FP, hf)
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
30effca6f3562f6870a6cc8097750e63bb0d424d..8977330b142d2cde1d2faa4a01282b01a68e25c5
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -5922,7 +5922,7 @@ (define_insn "*bswapsi2_uxtw"
 ;; ---

 ;; frint floating-point round to integral standard patterns.
-;; Expands to btrunc, ceil, floor, nearbyint, rint, round, frintn.
+;; Expands to btrunc, ceil, floor, nearbyint, rint, round, roundeven.

 (define_insn "2"
   [(set (match_operand:GPF_F16 0 "register_operand" "=w")
diff --git a/gcc/config/aarch64/arm_fp16.h b/gcc/config/aarch64/arm_fp16.h
index 
2afbd1203361b54d6e1315ffaa1bec21834c060e..3efa7e1f19817df1409bf781266f4e238c128f0b
 100644
--- a/gcc/config/aarch64/arm_fp16.h
+++ b/gcc/config/aarch64/arm_fp16.h
@@ -333,7 +333,7 @@ vrndmh_f16 (float16_t __a)
 __extension__ static __inline float16_t __attribute__ ((__always_inline__))
 vrndnh_f16 (float16_t __a)
 {
-  return __builtin_aarch64_frintnhf (__a);
+  return __builtin_aarch64_roundevenhf (__a);
 }

 __extension__ static __inline float16_t __attribute__ ((__always_inline__))
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 
baa30bd5a9d96c1bf04a37fb105091ea56a6444a..c88a8a627d3082d3577b0fe222381e93c35d7251
 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -24657,35 +24657,35 @@ __extension__ extern __inline float32_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vrndns_f32 (float32_t __a)
 {
-  return __builtin_aarch64_frintnsf (__a);
+  return __builtin_aarch64_roundevensf (__a);
 }

 __extension__ extern __inline float32x2_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vrndn_f32 (float32x2_t __a)
 {
-  return __builtin_aarch64_frintnv2sf (__a);
+  return __builtin_aarch64_roundevenv2sf (__a);
 }

 __extension__ extern __inline float64x1_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vrndn_f64 (float64x1_t __a)
 {
-  return (float64x1_t) {__builtin_aarch64_frintndf (__a[0])};
+  return (float64x1_t) {__builtin_aarch64_roundevendf (__a[0])};
 }

 __extension__ extern __inline float32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vrndnq_f32 (float32x4_t __a)
 {
-  return __builtin_aarch64_frintnv4sf (__a);
+  return __builtin_aarch64_roundevenv4sf (__a);
 }

 __extension__ extern __inline float64x2_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vrndnq_f64 (float64x2_t __a)
 {
-  return __builtin_aarch64_frintnv2df (__a);
+  return __builtin_aarch64_roundevenv2df (__a);
 }

 /* vrndp  */
@@ -31287,14 +31287,14 @@ __extension__ extern __inline float16x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vrndn_f16 (float16x4_t __a)
 {
-  return __builtin_aarch64_frintnv4hf (__a);
+  return __builtin_aarch64_roundevenv4hf (__a);
 }

 __extension__ extern __inline float16x8_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vrndnq_f16 (float16x8_t __a)
 {
-  return 

[PATCH] AArch64: Add support for __builtin_roundeven[f] [PR100966]

2021-06-18 Thread Wilco Dijkstra via Gcc-patches
Enable __builtin_roundeven[f] by adding roundeven as an alias to the
existing frintn support.

Bootstrap OK and passes regress.

ChangeLog:
2021-06-18  Wilco Dijkstra  

PR target/100966
* config/aarch64/aarch64.md (UNSPEC_FRINTR): Add.
* config/aarch64/aarch64.c (aarch64_frint_unspec_p): Add UNSPEC_FRINTR.
(aarch64_rtx_cost): Likewise.
* config/aarch64/iterators.md (FRINT): Add UNSPEC_FRINTR.
(frint_pattern): Likewise.
(frint_suffix): Likewise.

gcc/testsuite
PR target/100966
* gcc.target/aarch64/frint.x: Add roundeven tests.  
* gcc.target/aarch64/frint_double.c: Likewise.
* gcc.target/aarch64/frint_float.c: Likewise.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
f4d264bbc817153b330bac9ab423bf561689ebf2..63bd49d475433c04faa89bb380a9d17e4ad4bc6c
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -12253,6 +12253,7 @@ aarch64_frint_unspec_p (unsigned int u)
   case UNSPEC_FRINTM:
   case UNSPEC_FRINTA:
   case UNSPEC_FRINTN:
+  case UNSPEC_FRINTR:
   case UNSPEC_FRINTX:
   case UNSPEC_FRINTI:
 return true;
@@ -13648,6 +13649,7 @@ cost_plus:
   || uns_code == UNSPEC_FRINTM
   || uns_code == UNSPEC_FRINTN
   || uns_code == UNSPEC_FRINTP
+ || uns_code == UNSPEC_FRINTR
   || uns_code == UNSPEC_FRINTZ)
 x = XVECEXP (x, 0, 0);
 }
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
30effca6f3562f6870a6cc8097750e63bb0d424d..70e46f28259640a273d444a2b963246ed9a91109
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -167,6 +167,7 @@ (define_c_enum "unspec" [
 UNSPEC_FRINTM
 UNSPEC_FRINTN
 UNSPEC_FRINTP
+UNSPEC_FRINTR
 UNSPEC_FRINTX
 UNSPEC_FRINTZ
 UNSPEC_GOTSMALLPIC
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 
cac33ae812b382cd55611b0da8a6e9eac3a513c4..1542661c6d101ed53c6e1c8683cbca32baefebbd
 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -2291,7 +2291,7 @@ (define_int_iterator REVERSE [UNSPEC_REV64 UNSPEC_REV32 
UNSPEC_REV16])
 
 (define_int_iterator FRINT [UNSPEC_FRINTZ UNSPEC_FRINTP UNSPEC_FRINTM
 UNSPEC_FRINTN UNSPEC_FRINTI UNSPEC_FRINTX
-UNSPEC_FRINTA])
+UNSPEC_FRINTA UNSPEC_FRINTR ])
 
 (define_int_iterator FCVT [UNSPEC_FRINTZ UNSPEC_FRINTP UNSPEC_FRINTM
UNSPEC_FRINTA UNSPEC_FRINTN])
@@ -3079,13 +3079,14 @@ (define_int_attr frint_pattern [(UNSPEC_FRINTZ "btrunc")
(UNSPEC_FRINTI "nearbyint")
(UNSPEC_FRINTX "rint")
(UNSPEC_FRINTA "round")
-   (UNSPEC_FRINTN "frintn")])
+   (UNSPEC_FRINTN "frintn")
+   (UNSPEC_FRINTR "roundeven")])
 
 ;; frint suffix for floating-point rounding instructions.
 (define_int_attr frint_suffix [(UNSPEC_FRINTZ "z") (UNSPEC_FRINTP "p")
   (UNSPEC_FRINTM "m") (UNSPEC_FRINTI "i")
   (UNSPEC_FRINTX "x") (UNSPEC_FRINTA "a")
-  (UNSPEC_FRINTN "n")])
+  (UNSPEC_FRINTN "n") (UNSPEC_FRINTR "n")])
 
 (define_int_attr fcvt_pattern [(UNSPEC_FRINTZ "btrunc") (UNSPEC_FRINTA "round")
   (UNSPEC_FRINTP "ceil") (UNSPEC_FRINTM "floor")
diff --git a/gcc/testsuite/gcc.target/aarch64/frint.x 
b/gcc/testsuite/gcc.target/aarch64/frint.x
index 
1403740686ea3927d2c39eb2466ef8cc67e223b2..d598a25ff21b8feeca6a96de79848b2fbde7f31e
 100644
--- a/gcc/testsuite/gcc.target/aarch64/frint.x
+++ b/gcc/testsuite/gcc.target/aarch64/frint.x
@@ -4,6 +4,7 @@ extern GPF SUFFIX(floor) (GPF);
 extern GPF SUFFIX(nearbyint) (GPF);
 extern GPF SUFFIX(rint) (GPF);
 extern GPF SUFFIX(round) (GPF);
+extern GPF SUFFIX(roundeven) (GPF);
 
 GPF test1a (GPF x)
 {
@@ -64,3 +65,14 @@ GPF test6b (GPF x)
 {
   return SUFFIX(round)(x);
 }
+
+GPF test7a (GPF x)
+{
+  return SUFFIX(__builtin_roundeven)(x);
+}
+
+GPF test7b (GPF x)
+{
+  return SUFFIX(roundeven)(x);
+}
+
diff --git a/gcc/testsuite/gcc.target/aarch64/frint_double.c 
b/gcc/testsuite/gcc.target/aarch64/frint_double.c
index 
96139496ca454cebd41d31e8b018dab7ffa33a3f..1d28eb09e11842802ab632a7696de4407263f8ce
 100644
--- a/gcc/testsuite/gcc.target/aarch64/frint_double.c
+++ b/gcc/testsuite/gcc.target/aarch64/frint_double.c
@@ -12,3 +12,4 @@
 /* { dg-final { scan-assembler-times "frinti\td\[0-9\]" 2 } } */
 /* { dg-final { scan-assembler-times "frintx\td\[0-9\]" 2 } } */
 /* { dg-final { scan-assembler-times "frinta\td\[0-9\]" 2 } } */
+/* { dg-final { scan-assembler-times "frintn\td\[0-9\]" 2 } } */
diff --git 

[PATCH v3] AArch64: Improve GOT addressing

2021-06-04 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

This merges the v1 and v2 patches and removes the spurious MEM from
ldr_got_small_si/di. This has been rebased after [1], and the performance
gain has now doubled.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571708.html

Improve GOT addressing by treating the instructions as a pair.  This reduces
register pressure and improves code quality significantly.  SPECINT2017 improves
by 0.6% with -fPIC and codesize is 0.73% smaller.  Perlbench has 0.9% smaller
codesize, 1.5% fewer executed instructions and is 1.8% faster on Neoverse N1.
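
As a minimal sketch (not part of the patch; symbol and register names are
illustrative), the kind of access this affects:

extern int counter;

int
load_counter (void)
{
  /* With -fPIC the global is reached through the GOT, conceptually:
         adrp x0, :got:counter
         ldr  x0, [x0, :got_lo12:counter]
         ldr  w0, [x0]
     The patch keeps the ADRP/LDR GOT pair together as one pattern until
     after register allocation.  */
  return counter;
}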

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-06-04  Wilco Dijkstra  

* config/aarch64/aarch64.md (movsi): Split GOT accesses after reload.
(movdi): Likewise.
(ldr_got_small_<mode>): Remove MEM and LO_SUM, emit ADRP+LDR GOT
sequence.
(ldr_got_small_sidi): Likewise.
* config/aarch64/aarch64.c (aarch64_load_symref_appropriately): Delay
splitting of GOT accesses until after reload. Remove tmp_reg and MEM.
(aarch64_print_operand): Correctly print got_lo12 in L specifier.
(aarch64_rtx_costs): Set rematerialization cost for GOT accesses.
(aarch64_mov_operand_p): Make GOT accesses valid move operands.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
08245827daa3f8199b29031e754244c078f0f500..11ea33c70fb06194fadfe94322fdfa098e5320fc
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3615,6 +3615,14 @@ aarch64_load_symref_appropriately (rtx dest, rtx imm,
 
 case SYMBOL_SMALL_GOT_4G:
   {
+   /* Use movdi for GOT accesses until after reload - this improves
+  CSE and rematerialization.  */
+   if (!reload_completed)
+ {
+   emit_insn (gen_rtx_SET (dest, imm));
+   return;
+ }
+
/* In ILP32, the mode of dest can be either SImode or DImode,
   while the got entry is always of SImode size.  The mode of
   dest depends on how dest is used: if dest is assigned to a
@@ -3624,34 +3632,21 @@ aarch64_load_symref_appropriately (rtx dest, rtx imm,
   patterns here (two patterns for ILP32).  */
 
rtx insn;
-   rtx mem;
-   rtx tmp_reg = dest;
machine_mode mode = GET_MODE (dest);
 
-   if (can_create_pseudo_p ())
- tmp_reg = gen_reg_rtx (mode);
-
-   emit_move_insn (tmp_reg, gen_rtx_HIGH (mode, imm));
if (mode == ptr_mode)
  {
if (mode == DImode)
- insn = gen_ldr_got_small_di (dest, tmp_reg, imm);
+ insn = gen_ldr_got_small_di (dest, imm);
else
- insn = gen_ldr_got_small_si (dest, tmp_reg, imm);
-
-   mem = XVECEXP (SET_SRC (insn), 0, 0);
+ insn = gen_ldr_got_small_si (dest, imm);
  }
else
  {
gcc_assert (mode == Pmode);
-
-   insn = gen_ldr_got_small_sidi (dest, tmp_reg, imm);
-   mem = XVECEXP (XEXP (SET_SRC (insn), 0), 0, 0);
+   insn = gen_ldr_got_small_sidi (dest, imm);
  }
 
-   gcc_assert (MEM_P (mem));
-   MEM_READONLY_P (mem) = 1;
-   MEM_NOTRAP_P (mem) = 1;
emit_insn (insn);
return;
   }
@@ -11019,7 +11014,7 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   switch (aarch64_classify_symbolic_expression (x))
{
case SYMBOL_SMALL_GOT_4G:
- asm_fprintf (asm_out_file, ":lo12:");
+ asm_fprintf (asm_out_file, ":got_lo12:");
  break;
 
case SYMBOL_SMALL_TLSGD:
@@ -13452,6 +13447,12 @@ cost_plus:
 
 case SYMBOL_REF:
   *cost = 0;
+
+  /* Use a separate remateralization cost for GOT accesses.  */
+  if (aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC
+ && aarch64_classify_symbol (x, 0) == SYMBOL_SMALL_GOT_4G)
+   *cost = COSTS_N_INSNS (1) / 2;
+
   return true;
 
 case HIGH:
@@ -19907,6 +19908,11 @@ aarch64_mov_operand_p (rtx x, machine_mode mode)
   return aarch64_simd_valid_immediate (x, NULL);
 }
 
+  /* GOT accesses are valid moves until after regalloc.  */
+  if (SYMBOL_REF_P (x)
+  && aarch64_classify_symbolic_expression (x) == SYMBOL_SMALL_GOT_4G)
+return true;
+
   x = strip_salt (x);
   if (SYMBOL_REF_P (x) && mode == DImode && CONSTANT_ADDRESS_P (x))
 return true;
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
abfd84526745d029ad4953eabad6dd17b159a218..30effca6f3562f6870a6cc8097750e63bb0d424d
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1283,8 +1283,11 @@ (define_insn_and_split "*movsi_aarch64"
fmov\\t%w0, %s1
fmov\\t%s0, %s1
* return aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);"
-  "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
SImode)
-&& REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
+  "(CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL 

Re: [PATCH] AArch64: Improve address rematerialization costs

2021-06-02 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> No.  It's never correct to completely wipe out the existing cost - you 
> don't know the context where this is being used.
> 
> The most you can do is not add any additional cost.

Remember that aarch64_rtx_costs starts like this:

  /* By default, assume that everything has equivalent cost to the
 cheapest instruction.  Any additional costs are applied as a delta
 above this default.  */
  *cost = COSTS_N_INSNS (1);

This is literally the last statement executed before the big switch...
Given the cost is always initialized, there is no existing cost besides this
default value, and thus changing it to something else is not an issue.
We could of course do something like:

*cost -= COSTS_N_INSNS (1);

But that is less clear and problematic if the default value ever changes.

Cheers,
Wilco

[PATCH] AArch64: Improve address rematerialization costs

2021-06-02 Thread Wilco Dijkstra via Gcc-patches
Hi,

Given the large improvements from better register allocation of GOT accesses,
I decided to generalize it to get large gains for normal addressing too:

Improve rematerialization costs of addresses.  The current costs are set too 
high
which results in extra register pressure and spilling.  Using lower costs means
addresses will be rematerialized more often rather than being spilled or causing
spills.  This results in significant codesize reductions and performance gains.
SPECINT2017 improves by 0.27% with LTO and 0.16% without LTO.  Codesize is 0.12%
smaller.
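
An illustrative sketch (not part of the patch; names are invented) of the
situation where cheaper address costs help:

extern int table[1024];
void consume (int *);

void
two_uses (void)
{
  /* The address of "table" is a cheap ADRP+ADD pair, so with lower
     rematerialization costs it can simply be recomputed at the second
     use instead of being kept live across the call or spilled.  */
  consume (&table[0]);
  consume (&table[100]);
}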

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-06-01  Wilco Dijkstra  

* config/aarch64/aarch64.c (aarch64_rtx_costs): Use better 
rematerialization
costs for HIGH, LO_SUM and SYMBOL_REF.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
641c83b479e76cbcc75b299eb7ae5f634d9db7cd..08245827daa3f8199b29031e754244c078f0f500
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13444,45 +13444,22 @@ cost_plus:
  return false;  /* All arguments need to be in registers.  */
}
 
-case SYMBOL_REF:
+/* The following costs are used for rematerialization of addresses.
+   Set a low cost for all global accesses - this ensures they are
+   preferred for rematerialization, blocks them from being spilled
+   and reduces register pressure.  The result is significant codesize
+   reductions and performance gains. */
 
-  if (aarch64_cmodel == AARCH64_CMODEL_LARGE
- || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC)
-   {
- /* LDR.  */
- if (speed)
-   *cost += extra_cost->ldst.load;
-   }
-  else if (aarch64_cmodel == AARCH64_CMODEL_SMALL
-  || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC)
-   {
- /* ADRP, followed by ADD.  */
- *cost += COSTS_N_INSNS (1);
- if (speed)
-   *cost += 2 * extra_cost->alu.arith;
-   }
-  else if (aarch64_cmodel == AARCH64_CMODEL_TINY
-  || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)
-   {
- /* ADR.  */
- if (speed)
-   *cost += extra_cost->alu.arith;
-   }
-
-  if (flag_pic)
-   {
- /* One extra load instruction, after accessing the GOT.  */
- *cost += COSTS_N_INSNS (1);
- if (speed)
-   *cost += extra_cost->ldst.load;
-   }
+case SYMBOL_REF:
+  *cost = 0;
   return true;
 
 case HIGH:
+  *cost = 0;
+  return true;
+
 case LO_SUM:
-  /* ADRP/ADD (immediate).  */
-  if (speed)
-   *cost += extra_cost->alu.arith;
+  *cost = COSTS_N_INSNS (3) / 4;
   return true;
 
 case ZERO_EXTRACT:



Re: [PATCH v2] AArch64: Improve GOT addressing

2021-05-26 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Are we actually planning to do any linker relaxations here, or is this
> purely theoretical?  If doing relaxations is a realistic possiblity then
> I agree that would be a good/legitimate reason to use a single define_insn
> for both instructions.  In that case though, there should be a comment
> above the define_insn explaining that linker relaxation is the reason
> for keeping the instructions together.

Yes, enabling linker relaxations is a key goal of this patch - it's a chicken
and egg problem since compilers split the instructions and schedule them
apart for no good reason, making such relaxations impossible. It turns out
that splitting early is very bad for code quality, so we can achieve smaller
and faster code as well.
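
A hedged sketch of the relaxation this is meant to enable (not from the
patch; the exact linker behaviour is an assumption):

/* Emitted by the compiler, kept adjacent:
       adrp x0, :got:sym
       ldr  x0, [x0, :got_lo12:sym]
   A linker that can prove "sym" binds locally could rewrite the pair in
   place to a direct address computation:
       adrp x0, sym
       add  x0, x0, :lo12:sym
   avoiding the GOT load without introducing new relocations.  */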

I'll merge in the .md changes of the previous version so we don't need
scheduling fusion.

Cheers,
Wilco

[PATCH v2] AArch64: Improve GOT addressing

2021-05-24 Thread Wilco Dijkstra via Gcc-patches
Version v2 uses movsi/di for GOT accesses until after reload as suggested. This
caused worse spilling, however improving the costs of GOT accesses resulted in
better codesize and performance gains:

Improve GOT addressing by treating the instructions as a pair.  This reduces
register pressure and improves code quality significantly.  SPECINT2017 improves
by 0.30% with -fPIC and codesize is 0.7% smaller.  Perlbench has 0.9% smaller
codesize, 1.5% fewer executed instructions and is 1.8% faster on Neoverse N1.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-05-21  Wilco Dijkstra  

* config/aarch64/aarch64.md (movsi): Split GOT accesses after reload.
(movdi): Likewise.
* config/aarch64/aarch64.c (aarch64_load_symref_appropriately): Delay
splitting of GOT accesses until after reload.
(aarch64_rtx_costs): Set rematerialization cost for GOT accesses.
(aarch64_macro_fusion_pair_p): Fuse GOT accesses.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
641c83b479e76cbcc75b299eb7ae5f634d9db7cd..75b3caa94dd8a52342bbddbfcb73ab06a7418907
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3615,6 +3615,14 @@ aarch64_load_symref_appropriately (rtx dest, rtx imm,
 
 case SYMBOL_SMALL_GOT_4G:
   {
+   /* Don't split into ADRP/LDR until after reload - this improves
+  CSE and rematerialization of GOT accesses.  */
+   if (!reload_completed)
+ {
+   emit_insn (gen_rtx_SET (dest, imm));
+   return;
+ }
+
/* In ILP32, the mode of dest can be either SImode or DImode,
   while the got entry is always of SImode size.  The mode of
   dest depends on how dest is used: if dest is assigned to a
@@ -13460,6 +13468,14 @@ cost_plus:
  *cost += COSTS_N_INSNS (1);
  if (speed)
*cost += 2 * extra_cost->alu.arith;
+
+ /* Set a low remateralization cost for GOT accesses - this blocks
+them from being spilled and reduces register pressure.  */
+ if (aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC
+ && aarch64_classify_symbol (x, 0) == SYMBOL_SMALL_GOT_4G)
+   *cost = COSTS_N_INSNS (1) / 2;
+
+ return true;
}
   else if (aarch64_cmodel == AARCH64_CMODEL_TINY
   || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)
@@ -19930,6 +19946,11 @@ aarch64_mov_operand_p (rtx x, machine_mode mode)
   return aarch64_simd_valid_immediate (x, NULL);
 }
 
+  /* GOT accesses are split after regalloc.  */
+  if (SYMBOL_REF_P (x)
+  && aarch64_classify_symbolic_expression (x) == SYMBOL_SMALL_GOT_4G)
+return true;
+
   x = strip_salt (x);
   if (SYMBOL_REF_P (x) && mode == DImode && CONSTANT_ADDRESS_P (x))
 return true;
@@ -23746,6 +23767,24 @@ aarch_macro_fusion_pair_p (rtx_insn *prev, rtx_insn 
*curr)
 }
 }
 
+  /* Always treat GOT accesses as a pair to ensure they can be easily
+ identified and optimized in linkers.  */
+  if (simple_sets_p)
+{
+  /*  We're trying to match:
+ prev (adrp) == (set (reg r1) (high (symbol_ref ("SYM"
+ curr (add) == (set (reg r0)
+   (unspec [(mem (lo_sum (reg r1) (symbol_ref ("SYM"]
+UNSPEC_GOTSMALLPIC))  */
+
+  if (satisfies_constraint_Ush (SET_SRC (prev_set))
+ && REG_P (SET_DEST (prev_set))
+ && REG_P (SET_DEST (curr_set))
+ && GET_CODE (SET_SRC (curr_set)) == UNSPEC
+ && XINT (SET_SRC (curr_set), 1) == UNSPEC_GOTSMALLPIC)
+   return true;
+}
+
   if (simple_sets_p && aarch64_fusion_enabled_p (AARCH64_FUSE_MOVK_MOVK))
 {
 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
abfd84526745d029ad4953eabad6dd17b159a218..2527c96576a78f2071da20721143a27adeb1551b
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1283,8 +1283,11 @@ (define_insn_and_split "*movsi_aarch64"
fmov\\t%w0, %s1
fmov\\t%s0, %s1
* return aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);"
-  "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
SImode)
-&& REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
+  "(CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
SImode)
+&& REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0])))
+|| (reload_completed
+   && (aarch64_classify_symbolic_expression (operands[1])
+   == SYMBOL_SMALL_GOT_4G))"
[(const_int 0)]
"{
aarch64_expand_mov_immediate (operands[0], operands[1]);
@@ -1319,8 +1322,11 @@ (define_insn_and_split "*movdi_aarch64"
fmov\\t%x0, %d1
fmov\\t%d0, %d1
* return aarch64_output_scalar_simd_mov_immediate (operands[1], DImode);"
-   "(CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
DImode))
-&& REG_P (operands[0]) && GP_REGNUM_P (REGNO 

Re: [PATCH] AArch64: Improve GOT addressing

2021-05-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Normally we should only put two instructions in the same define_insn
> if there's a specific ABI or architectural reason for not separating
> them.  Doing it purely for optimisation reasons is going against the
> general direction of travel.  So I think the first question is: why
> don't we simply delay the split until after reload instead, since
> that's the more normal way of handling this kind of thing?

Well there are no optimizations that benefit from them being split, and there
is no gain from scheduling them independently. Keeping them together
means the linker could perform relaxations on the pair without adding new
relocations. So if we split after reload we'd still want to keep them together.

> This should just be represented as:
>
>  (set (reg:PTR R) (symbol_ref:PTR S))
>
> and go through the normal move patterns.

Yes that should be feasible. Note the UNSPEC is unnecessary even in the
original pattern - given it uses the standard HIGH operator for GOT accesses,
we can also use LO_SUM for the 2nd part since the indirection is hidden.

Cheers,
Wilco

[PATCH] AArch64: Improve GOT addressing

2021-05-05 Thread Wilco Dijkstra via Gcc-patches

Improve GOT addressing by emitting the instructions as a pair.  This reduces
register pressure and improves code quality. With -fPIC codesize improves by
0.65% and SPECINT2017 improves by 0.25%.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-05-05  Wilco Dijkstra  

* config/aarch64/aarch64.md (ldr_got_small_<mode>): Emit ADRP+LDR GOT
sequence.
(ldr_got_small_sidi): Likewise.
* config/aarch64/aarch64.c (aarch64_load_symref_appropriately): Remove 
tmp_reg.
(aarch64_print_operand): Correctly print got_lo12 in L specifier.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
641c83b479e76cbcc75b299eb7ae5f634d9db7cd..32c5c76d3c001a79d2a69b7f8243f1f1f605f901
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3625,27 +3625,21 @@ aarch64_load_symref_appropriately (rtx dest, rtx imm,
 
rtx insn;
rtx mem;
-   rtx tmp_reg = dest;
machine_mode mode = GET_MODE (dest);
 
-   if (can_create_pseudo_p ())
- tmp_reg = gen_reg_rtx (mode);
-
-   emit_move_insn (tmp_reg, gen_rtx_HIGH (mode, imm));
if (mode == ptr_mode)
  {
if (mode == DImode)
- insn = gen_ldr_got_small_di (dest, tmp_reg, imm);
+ insn = gen_ldr_got_small_di (dest, imm);
else
- insn = gen_ldr_got_small_si (dest, tmp_reg, imm);
+ insn = gen_ldr_got_small_si (dest, imm);
 
mem = XVECEXP (SET_SRC (insn), 0, 0);
  }
else
  {
gcc_assert (mode == Pmode);
-
-   insn = gen_ldr_got_small_sidi (dest, tmp_reg, imm);
+   insn = gen_ldr_got_small_sidi (dest, imm);
mem = XVECEXP (XEXP (SET_SRC (insn), 0), 0, 0);
  }
 
@@ -11019,7 +11013,7 @@ aarch64_print_operand (FILE *f, rtx x, int code)
   switch (aarch64_classify_symbolic_expression (x))
{
case SYMBOL_SMALL_GOT_4G:
- asm_fprintf (asm_out_file, ":lo12:");
+ asm_fprintf (asm_out_file, ":got_lo12:");
  break;
 
case SYMBOL_SMALL_TLSGD:
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
abfd84526745d029ad4953eabad6dd17b159a218..36c5c054f86e9cdd1f0945cdbc1beb47aa7ad80a
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -6705,25 +6705,23 @@ (define_insn "add_losym_<mode>"
 
 (define_insn "ldr_got_small_<mode>"
   [(set (match_operand:PTR 0 "register_operand" "=r")
-   (unspec:PTR [(mem:PTR (lo_sum:PTR
- (match_operand:PTR 1 "register_operand" "r")
- (match_operand:PTR 2 "aarch64_valid_symref" 
"S")))]
+   (unspec:PTR [(mem:PTR (match_operand:PTR 1 "aarch64_valid_symref" "S"))]
UNSPEC_GOTSMALLPIC))]
   ""
-  "ldr\\t%0, [%1, #:got_lo12:%c2]"
-  [(set_attr "type" "load_")]
+  "adrp\\t%0, %A1\;ldr\\t%0, [%0, %L1]"
+  [(set_attr "type" "load_")
+   (set_attr "length" "8")]
 )
 
 (define_insn "ldr_got_small_sidi"
   [(set (match_operand:DI 0 "register_operand" "=r")
(zero_extend:DI
-(unspec:SI [(mem:SI (lo_sum:DI
-(match_operand:DI 1 "register_operand" "r")
-(match_operand:DI 2 "aarch64_valid_symref" "S")))]
+(unspec:SI [(mem:SI (match_operand:DI 1 "aarch64_valid_symref" "S"))]
UNSPEC_GOTSMALLPIC)))]
   "TARGET_ILP32"
-  "ldr\\t%w0, [%1, #:got_lo12:%c2]"
-  [(set_attr "type" "load_4")]
+  "adrp\\t%0, %A1\;ldr\\t%w0, [%0, %L1]"
+  [(set_attr "type" "load_4")
+   (set_attr "length" "8")]
 )
 
 (define_insn "ldr_got_small_28k_<mode>"


Re: [PATCH] AArch64: Cleanup aarch64_classify_symbol

2021-04-30 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Hmm, OK.  I guess it makes things more consistent in that sense
> (PIC vs. non-PIC).  But on the other side it's making things less
> internally consistent for non-PIC, since we don't use the GOT for
> anything else there.  I guess in principle there's a danger that a
> custom *-elf linker script might not bother to set up the .got properly,
> on the assumption that it shouldn't be needed.

The GOT is always used in executables (even an empty main function has
a non-zero .got and .got.plt section). When you need an indirection,
using an existing mechanism is better than doing something special.
For example if the weak reference is resolved to a local symbol, the
linker could remove the indirection.

Cheers,
Wilco

Re: [PATCH] AArch64: Cleanup aarch64_classify_symbol

2021-04-28 Thread Wilco Dijkstra via Gcc-patches
Hi Andrew,

> I thought that was changed not to use the GOT on purpose.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63874
>
> That is if the symbol is not declared in the TU, then using the GOT is
> correct thing to do.
> Is the testcase gcc.target/aarch64/pr63874.c still working or is not
> testing the correct thing?

That PR was about weak definitions, not references. The exact same
sequence is emitted whether we use the GOT or literal (ADRP+LDR),
hence there is no point in treating the non-PIC case differently.

Cheers,
Wilco

Re: [PATCH] AArch64: Cleanup aarch64_classify_symbol

2021-04-28 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> Just to check: I guess this part is an optimisation, because it
> means that we can share the GOT entry with other TUs.  Is that right?
> I think it would be worth having a comment either way, whatever the
> rationale.  A couple of other very minor things:

It's just to make the code simpler and more orthogonal - the size of the
sequence and sharing of GOT/literal is identical in both cases.

Here is v2 with comments/spaces fixed up:


Use a GOT indirection for extern weak symbols instead of a literal - this is 
the same as
PIC/PIE and mirrors LLVM behaviour.  Ensure PIC/PIE use the same offset limits 
for symbols
that don't use the GOT.
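
A minimal example (illustrative, not from the patch) of the extern weak
case this changes:

extern int optional_feature __attribute__ ((weak));

int
feature_present (void)
{
  /* The address of an undefined weak symbol may resolve to 0, so it is
     loaded indirectly; with this patch that load goes through the GOT
     (as with PIC/PIE) instead of a literal pool entry.  */
  return &optional_feature != 0;
}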

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-04-27  Wilco Dijkstra  

* config/aarch64/aarch64.c (aarch64_classify_symbol): Use GOT for 
extern weak symbols.
Limit symbol offsets for non-GOT symbols with PIC/PIE.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
82957dddbe42a7f907b2960294ac7f8abf7be2ff..641c83b479e76cbcc75b299eb7ae5f634d9db7cd
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -17854,7 +17854,14 @@ aarch64_classify_symbol (rtx x, HOST_WIDE_INT offset)
 
   switch (aarch64_cmodel)
{
+   case AARCH64_CMODEL_TINY_PIC:
case AARCH64_CMODEL_TINY:
+ /* With -fPIC non-local symbols use the GOT.  For orthogonality
+always use the GOT for extern weak symbols.  */
+ if ((flag_pic || SYMBOL_REF_WEAK (x))
+ && !aarch64_symbol_binds_local_p (x))
+   return SYMBOL_TINY_GOT;
+
  /* When we retrieve symbol + offset address, we have to make sure
 the offset does not cause overflow of the final address.  But
 we have no way of knowing the address of symbol at compile time
@@ -17862,42 +17869,30 @@ aarch64_classify_symbol (rtx x, HOST_WIDE_INT offset)
 symbol + offset is outside the addressible range of +/-1MB in the
 TINY code model.  So we limit the maximum offset to +/-64KB and
 assume the offset to the symbol is not larger than +/-(1MB - 64KB).
-If offset_within_block_p is true we allow larger offsets.
-Furthermore force to memory if the symbol is a weak reference to
-something that doesn't resolve to a symbol in this module.  */
-
- if (SYMBOL_REF_WEAK (x) && !aarch64_symbol_binds_local_p (x))
-   return SYMBOL_FORCE_TO_MEM;
+If offset_within_block_p is true we allow larger offsets.  */
  if (!(IN_RANGE (offset, -0x10000, 0x10000)
|| offset_within_block_p (x, offset)))
return SYMBOL_FORCE_TO_MEM;
 
  return SYMBOL_TINY_ABSOLUTE;
 
+
+   case AARCH64_CMODEL_SMALL_SPIC:
+   case AARCH64_CMODEL_SMALL_PIC:
case AARCH64_CMODEL_SMALL:
+ if ((flag_pic || SYMBOL_REF_WEAK (x))
+ && !aarch64_symbol_binds_local_p (x))
+   return aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC
+   ? SYMBOL_SMALL_GOT_28K : SYMBOL_SMALL_GOT_4G;
+
  /* Same reasoning as the tiny code model, but the offset cap here is
 1MB, allowing +/-3.9GB for the offset to the symbol.  */
-
- if (SYMBOL_REF_WEAK (x) && !aarch64_symbol_binds_local_p (x))
-   return SYMBOL_FORCE_TO_MEM;
  if (!(IN_RANGE (offset, -0x100000, 0x100000)
|| offset_within_block_p (x, offset)))
return SYMBOL_FORCE_TO_MEM;
 
  return SYMBOL_SMALL_ABSOLUTE;
 
-   case AARCH64_CMODEL_TINY_PIC:
- if (!aarch64_symbol_binds_local_p (x))
-   return SYMBOL_TINY_GOT;
- return SYMBOL_TINY_ABSOLUTE;
-
-   case AARCH64_CMODEL_SMALL_SPIC:
-   case AARCH64_CMODEL_SMALL_PIC:
- if (!aarch64_symbol_binds_local_p (x))
-   return (aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC
-   ?  SYMBOL_SMALL_GOT_28K : SYMBOL_SMALL_GOT_4G);
- return SYMBOL_SMALL_ABSOLUTE;
-
case AARCH64_CMODEL_LARGE:
  /* This is alright even in PIC code as the constant
 pool reference is always PC relative and within


[PATCH] AArch64: Cleanup aarch64_classify_symbol

2021-04-28 Thread Wilco Dijkstra via Gcc-patches

Use a GOT indirection for extern weak symbols instead of a literal - this is 
the same as
PIC/PIE and mirrors LLVM behaviour.  Ensure PIC/PIE use the same offset limits 
for symbols
that don't use the GOT.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-04-27  Wilco Dijkstra  

* config/aarch64/aarch64.c (aarch64_classify_symbol): Use GOT for 
extern weak symbols.
Limit symbol offsets for non-GOT symbols with PIC/PIE.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
82957dddbe42a7f907b2960294ac7f8abf7be2ff..08aaea70d952fd62147b7f06ba781eb55e697d3a
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -17854,7 +17854,12 @@ aarch64_classify_symbol (rtx x, HOST_WIDE_INT offset)
 
   switch (aarch64_cmodel)
{
+   case AARCH64_CMODEL_TINY_PIC:
case AARCH64_CMODEL_TINY:
+ if ((flag_pic || SYMBOL_REF_WEAK (x))
+ && !aarch64_symbol_binds_local_p (x))
+   return SYMBOL_TINY_GOT;
+
  /* When we retrieve symbol + offset address, we have to make sure
 the offset does not cause overflow of the final address.  But
 we have no way of knowing the address of symbol at compile time
@@ -17865,39 +17870,29 @@ aarch64_classify_symbol (rtx x, HOST_WIDE_INT offset)
 If offset_within_block_p is true we allow larger offsets.
 Furthermore force to memory if the symbol is a weak reference to
 something that doesn't resolve to a symbol in this module.  */
-
- if (SYMBOL_REF_WEAK (x) && !aarch64_symbol_binds_local_p (x))
-   return SYMBOL_FORCE_TO_MEM;
  if (!(IN_RANGE (offset, -0x10000, 0x10000)
|| offset_within_block_p (x, offset)))
return SYMBOL_FORCE_TO_MEM;
 
  return SYMBOL_TINY_ABSOLUTE;
 
+
+   case AARCH64_CMODEL_SMALL_SPIC:
+   case AARCH64_CMODEL_SMALL_PIC:
case AARCH64_CMODEL_SMALL:
+ if ((flag_pic || SYMBOL_REF_WEAK (x))
+ && !aarch64_symbol_binds_local_p (x))
+   return aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC
+   ?  SYMBOL_SMALL_GOT_28K : SYMBOL_SMALL_GOT_4G;
+
  /* Same reasoning as the tiny code model, but the offset cap here is
 1MB, allowing +/-3.9GB for the offset to the symbol.  */
-
- if (SYMBOL_REF_WEAK (x) && !aarch64_symbol_binds_local_p (x))
-   return SYMBOL_FORCE_TO_MEM;
  if (!(IN_RANGE (offset, -0x100000, 0x100000)
|| offset_within_block_p (x, offset)))
return SYMBOL_FORCE_TO_MEM;
 
  return SYMBOL_SMALL_ABSOLUTE;
 
-   case AARCH64_CMODEL_TINY_PIC:
- if (!aarch64_symbol_binds_local_p (x))
-   return SYMBOL_TINY_GOT;
- return SYMBOL_TINY_ABSOLUTE;
-
-   case AARCH64_CMODEL_SMALL_SPIC:
-   case AARCH64_CMODEL_SMALL_PIC:
- if (!aarch64_symbol_binds_local_p (x))
-   return (aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC
-   ?  SYMBOL_SMALL_GOT_28K : SYMBOL_SMALL_GOT_4G);
- return SYMBOL_SMALL_ABSOLUTE;
-
case AARCH64_CMODEL_LARGE:
  /* This is alright even in PIC code as the constant
 pool reference is always PC relative and within


[GCC8 backport] AArch64: Fix symbol offset limit (PR 98618)

2021-01-21 Thread Wilco Dijkstra via Gcc-patches
In aarch64_classify_symbol symbols are allowed large offsets on relocations.
This means the offset can use all of the +/-4GB offset, leaving no offset
available for the symbol itself.  This results in relocation overflow and
link-time errors for simple expressions like _array + 0xff00.

To avoid this, unless the offset_within_block_p is true, limit the offset
to +/-1MB so that the symbol needs to be within a 3.9GB offset from its
references.  For the tiny code model use a 64KB offset, allowing most of
the 1MB range for code/data between the symbol and its references.
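
An illustrative example (the symbol and constant are invented) of the kind
of offset that is now kept out of the relocation:

extern char huge_array[];

char *
far_into_array (void)
{
  /* An offset beyond +/-1MB (64KB for tiny) is no longer folded into the
     relocation against "huge_array", so the relocation can no longer
     overflow at link time.  */
  return huge_array + 0x2000000;
}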

gcc/
PR target/98618
* config/aarch64/aarch64.c (aarch64_classify_symbol):
Apply reasonable limit to symbol offsets.

gcc/testsuite/
PR target/98618
* gcc.target/aarch64/symbol-range.c: Improve testcase.
* gcc.target/aarch64/symbol-range-tiny.c: Likewise.

(cherry picked from commit 7d3b27ff12610fde9d6c4b56abc70c6ee9b6b3db)

---
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
e8e73b8ea92b0dd3b9de661652c30c26c07bec86..7c4cf75b5a5e2394dc0b3f69b68d93df8f88111f
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -12011,26 +12011,31 @@ aarch64_classify_symbol (rtx x, HOST_WIDE_INT offset)
 the offset does not cause overflow of the final address.  But
 we have no way of knowing the address of symbol at compile time
 so we can't accurately say if the distance between the PC and
-symbol + offset is outside the addressible range of +/-1M in the
-TINY code model.  So we rely on images not being greater than
-1M and cap the offset at 1M and anything beyond 1M will have to
-be loaded using an alternative mechanism.  Furthermore if the
-symbol is a weak reference to something that isn't known to
-resolve to a symbol in this module, then force to memory.  */
- if ((SYMBOL_REF_WEAK (x)
-  && !aarch64_symbol_binds_local_p (x))
- || !IN_RANGE (offset, -1048575, 1048575))
+symbol + offset is outside the addressible range of +/-1MB in the
+TINY code model.  So we limit the maximum offset to +/-64KB and
+assume the offset to the symbol is not larger than +/-(1MB - 64KB).
+If offset_within_block_p is true we allow larger offsets.
+Furthermore force to memory if the symbol is a weak reference to
+something that doesn't resolve to a symbol in this module.  */
+
+ if (SYMBOL_REF_WEAK (x) && !aarch64_symbol_binds_local_p (x))
return SYMBOL_FORCE_TO_MEM;
+ if (!(IN_RANGE (offset, -0x10000, 0x10000)
+   || offset_within_block_p (x, offset)))
+   return SYMBOL_FORCE_TO_MEM;
+
  return SYMBOL_TINY_ABSOLUTE;
 
case AARCH64_CMODEL_SMALL:
  /* Same reasoning as the tiny code model, but the offset cap here is
-4G.  */
- if ((SYMBOL_REF_WEAK (x)
-  && !aarch64_symbol_binds_local_p (x))
- || !IN_RANGE (offset, HOST_WIDE_INT_C (-4294967263),
-   HOST_WIDE_INT_C (4294967264)))
+1MB, allowing +/-3.9GB for the offset to the symbol.  */
+
+ if (SYMBOL_REF_WEAK (x) && !aarch64_symbol_binds_local_p (x))
return SYMBOL_FORCE_TO_MEM;
+ if (!(IN_RANGE (offset, -0x100000, 0x100000)
+   || offset_within_block_p (x, offset)))
+   return SYMBOL_FORCE_TO_MEM;
+
  return SYMBOL_SMALL_ABSOLUTE;
 
case AARCH64_CMODEL_TINY_PIC:
diff --git a/gcc/testsuite/gcc.target/aarch64/symbol-range-tiny.c 
b/gcc/testsuite/gcc.target/aarch64/symbol-range-tiny.c
index 
d7e46b059e41f2672b3a1da5506fa8944e752e01..fc6a4f3ec780d9fa86de1c8e1a42a55992ee8b2d
 100644
--- a/gcc/testsuite/gcc.target/aarch64/symbol-range-tiny.c
+++ b/gcc/testsuite/gcc.target/aarch64/symbol-range-tiny.c
@@ -1,12 +1,12 @@
-/* { dg-do compile } */
+/* { dg-do link } */
 /* { dg-options "-O3 -save-temps -mcmodel=tiny" } */
 
-int fixed_regs[0x00200000];
+char fixed_regs[0x00080000];
 
 int
-foo()
+main ()
 {
-  return fixed_regs[0x00080000];
+  return fixed_regs[0x000ff000];
 }
 
 /* { dg-final { scan-assembler-not "adr\tx\[0-9\]+, fixed_regs\\\+" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/symbol-range.c 
b/gcc/testsuite/gcc.target/aarch64/symbol-range.c
index 
6574cf4310430b847e77ea56bf8f20ef312d53e4..d8e82fa1b2829fd300b6ccf7f80241e5573e7e17
 100644
--- a/gcc/testsuite/gcc.target/aarch64/symbol-range.c
+++ b/gcc/testsuite/gcc.target/aarch64/symbol-range.c
@@ -1,12 +1,12 @@
-/* { dg-do compile } */
+/* { dg-do link } */
 /* { dg-options "-O3 -save-temps -mcmodel=small" } */
 
-int fixed_regs[0x200000000ULL];
+char fixed_regs[0x80000000];
 
 int
-foo()
+main ()
 {
-  return fixed_regs[0x100000000ULL];
+  return fixed_regs[0xf0000000];
 }
 
 /* { dg-final { scan-assembler-not 

[GCC9 backport] AArch64: Fix symbol offset limit (PR 98618)

2021-01-19 Thread Wilco Dijkstra via Gcc-patches
In aarch64_classify_symbol symbols are allowed large offsets on relocations.
This means the offset can use all of the +/-4GB offset, leaving no offset
available for the symbol itself.  This results in relocation overflow and
link-time errors for simple expressions like _array + 0xff00.

To avoid this, unless the offset_within_block_p is true, limit the offset
to +/-1MB so that the symbol needs to be within a 3.9GB offset from its
references.  For the tiny code model use a 64KB offset, allowing most of
the 1MB range for code/data between the symbol and its references.

gcc/
PR target/98618
* config/aarch64/aarch64.c (aarch64_classify_symbol):
Apply reasonable limit to symbol offsets.

testsuite/
PR target/98618
* gcc.target/aarch64/symbol-range.c: Improve testcase.
* gcc.target/aarch64/symbol-range-tiny.c: Likewise.

From-SVN: r277068
(cherry picked from commit 7d3b27ff12610fde9d6c4b56abc70c6ee9b6b3db)

---

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 
5a8acf8607a735d241c8849a449c000996d96931..6a42c06f04730d0bc55294f0eca59b9711952a2e
 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,12 @@
+2021-01-12  Wilco Dijkstra  
+
+   Backport from mainline:
+   2019-10-16  Wilco Dijkstra  
+
+   PR target/98618
+   * config/aarch64/aarch64.c (aarch64_classify_symbol):
+   Apply reasonable limit to symbol offsets.
+
 2021-01-06  2019-07-10  Marc Glisse  
 
PR testsuite/90806
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
9189777da7e120b9a0912d8501f2da8db144d334..bce50aea01e58f72ab59b30ce43969b48f5ca1b1
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13297,26 +13297,31 @@ aarch64_classify_symbol (rtx x, HOST_WIDE_INT offset)
 the offset does not cause overflow of the final address.  But
 we have no way of knowing the address of symbol at compile time
 so we can't accurately say if the distance between the PC and
-symbol + offset is outside the addressible range of +/-1M in the
-TINY code model.  So we rely on images not being greater than
-1M and cap the offset at 1M and anything beyond 1M will have to
-be loaded using an alternative mechanism.  Furthermore if the
-symbol is a weak reference to something that isn't known to
-resolve to a symbol in this module, then force to memory.  */
- if ((SYMBOL_REF_WEAK (x)
-  && !aarch64_symbol_binds_local_p (x))
- || !IN_RANGE (offset, -1048575, 1048575))
+symbol + offset is outside the addressible range of +/-1MB in the
+TINY code model.  So we limit the maximum offset to +/-64KB and
+assume the offset to the symbol is not larger than +/-(1MB - 64KB).
+If offset_within_block_p is true we allow larger offsets.
+Furthermore force to memory if the symbol is a weak reference to
+something that doesn't resolve to a symbol in this module.  */
+
+ if (SYMBOL_REF_WEAK (x) && !aarch64_symbol_binds_local_p (x))
return SYMBOL_FORCE_TO_MEM;
+ if (!(IN_RANGE (offset, -0x10000, 0x10000)
+   || offset_within_block_p (x, offset)))
+   return SYMBOL_FORCE_TO_MEM;
+
  return SYMBOL_TINY_ABSOLUTE;
 
case AARCH64_CMODEL_SMALL:
  /* Same reasoning as the tiny code model, but the offset cap here is
-4G.  */
- if ((SYMBOL_REF_WEAK (x)
-  && !aarch64_symbol_binds_local_p (x))
- || !IN_RANGE (offset, HOST_WIDE_INT_C (-4294967263),
-   HOST_WIDE_INT_C (4294967264)))
+1MB, allowing +/-3.9GB for the offset to the symbol.  */
+
+ if (SYMBOL_REF_WEAK (x) && !aarch64_symbol_binds_local_p (x))
return SYMBOL_FORCE_TO_MEM;
+ if (!(IN_RANGE (offset, -0x100000, 0x100000)
+   || offset_within_block_p (x, offset)))
+   return SYMBOL_FORCE_TO_MEM;
+
  return SYMBOL_SMALL_ABSOLUTE;
 
case AARCH64_CMODEL_TINY_PIC:
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 
835bf14107ffdef3b700dc00399a2d03e298f062..e25e5d85063e9e5bffa932dccfa19d9ec2550606
 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,12 @@
+2021-01-12  Wilco Dijkstra  
+
+   Backported from mainline:
+   2019-10-16  Wilco Dijkstra  
+
+   PR target/98618
+   * gcc.target/aarch64/symbol-range.c: Improve testcase.
+   * gcc.target/aarch64/symbol-range-tiny.c: Likewise.
+
 2021-01-07  Paul Thomas  
 
Backported from master:
diff --git a/gcc/testsuite/gcc.target/aarch64/symbol-range-tiny.c 
b/gcc/testsuite/gcc.target/aarch64/symbol-range-tiny.c
index 
d7e46b059e41f2672b3a1da5506fa8944e752e01..fc6a4f3ec780d9fa86de1c8e1a42a55992ee8b2d
 100644
--- a/gcc/testsuite/gcc.target/aarch64/symbol-range-tiny.c

Re: [AArch64] Add --with-tune configure flag

2020-12-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> I specifically want to test generic SVE rather than SVE tuned for a
> specific core, so --with-arch=armv8.2-a+sve is the thing I want to test.

Btw that's not actually what you get if you use cc1 - you always get armv8.0,
so --with-arch doesn't work at all. The only case that appears to work in cc1
is --with-cpu as long as it is not --with-cpu=native.

Cheers,
Wilco

Re: [AArch64] Add --with-tune configure flag

2020-12-07 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

>>> I share Richard E's concern about the effect of this on people who run
>>> ./cc1 directly.  (And I'm being selfish here, because I regularly run
>>> ./cc1 directly on toolchains configured with --with-arch=armv8.2-a+sve.)
>>> So TBH my preference would be to keep the TARGET_CPU_DEFAULT_FLAGS stuff.
>>
>> And what exactly should it do? It didn't work correctly before, and even now 
>> it's not
>> clear what TARGET_CPU_DEFAULT should be set to for all possible combinations 
>> of
>> --with-cpu=X --with-arch=Y --with-tune=Z. Should it have exactly the same 
>> behaviour
>> with and without the driver? If so, that's difficult to achieve in 
>> config.gcc, and it would
>> require a completely different mechanism to ensure the default ends up the 
>> same
>> without the driver (eg. store the original strings in TARGET_CPU_DEFAULT and 
>> then
>> postprocess in the backend as if they were actual --cpu/--arch/--tune 
>> commands if
>> the command-line didn't already set them, ie. replicating what the driver 
>> already does).
>> And it would need to be done for --with-abi as well.
>
> The specific case I mentioned works well enough though (given that I also
> run ./cc1 without -march, -mcpu or -mtune flags).  I agree it might not
> work for all combinations of --with-arch, --with-cpu and --with-tune,
> but I'm not sure that's a strong reason to remove the cases that do work
> (or at least, work well enough).  This cc1 behaviour is a feature for
> developers rather than users after all.

Well how do you suggest we keep the cases that work but block the cases that
don't work, but only in cc1 without knowing when someone uses cc1 without the
driver?

> Perhaps I've missed the point, but why do you need to remove this code
> in order to do what you need to do?

Basically it's buggy by design. We have 3 related options which are compressed 
into
a single CPU value plus a set of architecture extension flags. That obviously 
means
we can't describe the --with- flags, let alone any combinations, right?

For example the architecture to compile for is taken from that single CPU, so 
the
architecture is often wrong (depending on the order of processing in 
config.gcc).
Try bootstrapping trunk using --with-tune=neoverse-n2 to see what I mean.
It doesn't just set the tune, it forces an incorrect architecture!

My patch works precisely because I ignore the incorrect TARGET_CPU_DEFAULT
and allow the driver to do its job properly in prioritizing the options. Then 
it's
a simple matter of reading out the --cpu/arch/tune flags.

Like I said, it may be possible to fix this if we pass the --with strings to 
the backend
and process them there. But that means duplicating functionality of the driver 
for
a few users who use cc1. Is that really worth the effort?

>> So is changing your preferred config to --with-cpu=neoverse-v1 really a 
>> problem?
>
> I specifically want to test generic SVE rather than SVE tuned for a
> specific core, so --with-arch=armv8.2-a+sve is the thing I want to test.
> I think the question is instead whether it's really a problem to change
> my workflow so that I don't run cc1 directly any more, or that I remember
> to respecify the --with-arch value as an -march flag when running cc1.
> And yeah, maybe the right answer is that I should learn to do that.
> It's just that I don't personally understand why this change is necessary.

How about creating a macro that adds the --arch for you? Eg. cc1 <args>
becomes ./cc1 -march=armv8.2-a+sve <args>

Cheers,
Wilco

Re: [AArch64] Add --with-tune configure flag

2020-12-07 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

> I share Richard E's concern about the effect of this on people who run
> ./cc1 directly.  (And I'm being selfish here, because I regularly run
> ./cc1 directly on toolchains configured with --with-arch=armv8.2-a+sve.)
> So TBH my preference would be to keep the TARGET_CPU_DEFAULT_FLAGS stuff.

And what exactly should it do? It didn't work correctly before, and even now 
it's not
clear what TARGET_CPU_DEFAULT should be set to for all possible combinations of
--with-cpu=X --with-arch=Y --with-tune=Z. Should it have exactly the same 
behaviour
with and without the driver? If so, that's difficult to achieve in config.gcc, 
and it would
require a completely different mechanism to ensure the default ends up the same
without the driver (eg. store the original strings in TARGET_CPU_DEFAULT and 
then
postprocess in the backend as if they were actual --cpu/--arch/--tune commands 
if
the command-line didn't already set them, ie. replicating what the driver 
already does).
And it would need to be done for --with-abi as well.

So is changing your preferred config to --with-cpu=neoverse-v1 really a problem?

Cheers,
Wilco

Re: [AArch64] Add --with-tune configure flag

2020-11-19 Thread Wilco Dijkstra via Gcc-patches
Hi,

>>>    As for your second patch, --with-cpu-64 could be a simple alias indeed,
>>>    but what is the exact definition/expected behaviour of a --with-cpu-32
>>>    on a target that only supports 64-bit code? The AArch64 target cannot
>>>    generate AArch32 code, so we shouldn't silently accept it.
>> 
>> IMO allowing users to specify all the flags available on x86 is important.
>> 
>
> This isn't about general users though; it's about how you configure the
> compiler and that's not all the same.  I don't mind the --with-cpu-64 as
> a strict alias for --with-cpu, but --with-cpu-32 is both redundant and
> misleading as it might give the impression that it does something useful.

We could make it do something useful, for example emit a warning, an error
or default to -mabi=ilp32 (since that is similar to what other targets do).
Anything is better than being the only target that doesn't support it...

Cheers,
Wilco


Re: [AArch64] Add --with-tune configure flag

2020-11-18 Thread Wilco Dijkstra via Gcc-patches
Hi Sebastian,

I presume you're trying to unify the --with- options across most targets?
That would be very useful! However there are significant differences between
targets in how they interpret options like --with-arch=native (or -march). So
those differences also need to be looked at and fixed to avoid unexpected 
results.

As for the first patch, I think support for --with-tune requires more changes.
Without proper processing of a --with-tune, you get an incorrect architecture
version (if say the CPU you tune for is newer than the --with-cpu/arch
or default).

I posted patches to add --with-tune and fix various issues a while back:

https://gcc.gnu.org/pipermail/gcc-patches/2020-September/553865.html
https://gcc.gnu.org/pipermail/gcc-patches/2020-September/553866.html

So I think these should go in first. They are simple and useful enough to 
backport
(since they fix bugs) but that decision is up to the AArch64 maintainers.

As for your second patch, --with-cpu-64 could be a simple alias indeed,
but what is the exact definition/expected behaviour of a --with-cpu-32
on a target that only supports 64-bit code? The AArch64 target cannot
generate AArch32 code, so we shouldn't silently accept it.

Cheers,
Wilco



[PATCH] AArch64: Add cost table for Cortex-A76

2020-11-18 Thread Wilco Dijkstra via Gcc-patches
Add an initial cost table for Cortex-A76 - this is copied from
cortexa57_extra_costs but updated based on the Optimization Guide.
Use the new cost table on all Neoverse tunings and ensure the tunings
are consistent for all.  As a result more compact code is generated
with more combined shift+alu operations. Eg. -mcpu=cortex-a76 will now
merge the shifts in:

int f(int x, int y) { return (x & y << 3) * (x | y << 3); }

and  w2, w0, w1, lsl 3
orr  w0, w0, w1, lsl 3
mul  w0, w2, w0
ret

SPEC2017 codesize improves by 0.02% and SPECINT2017 shows 0.24% gain.

Bootstrap OK, regress passes, OK for commit?

ChangeLog:
2020-11-18  Wilco Dijkstra  

* config/aarch64/aarch64.c (neoversen1_tunings): Use new
cortexa76_extra_costs.
(neoversev1_tunings): Likewise.
(neoversen2_tunings): Likewise.
* config/arm/aarch-cost-tables.h (cortexa76_extra_costs):
add new costs.

---
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
6bf2f9aa344f9150dec72db660d951e50521285c..65ff49d2b4125013466f90a54ff698ae810580f0
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1312,7 +1312,7 @@ static const struct tune_params thunderx3t110_tunings =
 
 static const struct tune_params neoversen1_tunings =
 {
-  &cortexa57_extra_costs,
+  &cortexa76_extra_costs,
   _addrcost_table,
   _regmove_cost,
   _vector_cost,
@@ -1338,7 +1338,7 @@ static const struct tune_params neoversen1_tunings =
 
 static const struct tune_params neoversev1_tunings =
 {
-  &cortexa57_extra_costs,
+  &cortexa76_extra_costs,
   _addrcost_table,
   _regmove_cost,
   _vector_cost,
@@ -1364,7 +1364,7 @@ static const struct tune_params neoversev1_tunings =
 
 static const struct tune_params neoversen2_tunings =
 {
-  &cortexa57_extra_costs,
+  &cortexa76_extra_costs,
   _addrcost_table,
   _regmove_cost,
   _vector_cost,
diff --git a/gcc/config/arm/aarch-cost-tables.h 
b/gcc/config/arm/aarch-cost-tables.h
index 
cf8186599018cc5e51cf44e4f2080a502d895e1d..1b9d53d07b54bddf1767121236b06d2b4581631c
 100644
--- a/gcc/config/arm/aarch-cost-tables.h
+++ b/gcc/config/arm/aarch-cost-tables.h
@@ -331,6 +331,109 @@ const struct cpu_cost_table cortexa57_extra_costs =
   }
 };
 
+const struct cpu_cost_table cortexa76_extra_costs =
+{
+  /* ALU */
+  {
+0, /* arith.  */
+0, /* logical.  */
+0, /* shift.  */
+0,  /* shift_reg.  */
+COSTS_N_INSNS (1), /* arith_shift.  */
+COSTS_N_INSNS (1), /* arith_shift_reg.  */
+0,/* log_shift.  */
+COSTS_N_INSNS (1), /* log_shift_reg.  */
+0, /* extend.  */
+COSTS_N_INSNS (1), /* extend_arith.  */
+COSTS_N_INSNS (1), /* bfi.  */
+0, /* bfx.  */
+0, /* clz.  */
+0,  /* rev.  */
+0, /* non_exec.  */
+true   /* non_exec_costs_exec.  */
+  },
+  {
+/* MULT SImode */
+{
+  COSTS_N_INSNS (1),   /* simple.  */
+  COSTS_N_INSNS (2),   /* flag_setting.  */
+  COSTS_N_INSNS (1),   /* extend.  */
+  COSTS_N_INSNS (1),   /* add.  */
+  COSTS_N_INSNS (1),   /* extend_add.  */
+  COSTS_N_INSNS (6)   /* idiv.  */
+},
+/* MULT DImode */
+{
+  COSTS_N_INSNS (3),   /* simple.  */
+  0,   /* flag_setting (N/A).  */
+  COSTS_N_INSNS (1),   /* extend.  */
+  COSTS_N_INSNS (3),   /* add.  */
+  COSTS_N_INSNS (1),   /* extend_add.  */
+  COSTS_N_INSNS (10)   /* idiv.  */
+}
+  },
+  /* LD/ST */
+  {
+COSTS_N_INSNS (3), /* load.  */
+COSTS_N_INSNS (3), /* load_sign_extend.  */
+COSTS_N_INSNS (3), /* ldrd.  */
+COSTS_N_INSNS (2), /* ldm_1st.  */
+1, /* ldm_regs_per_insn_1st.  */
+2, /* ldm_regs_per_insn_subsequent.  */
+COSTS_N_INSNS (4), /* loadf.  */
+COSTS_N_INSNS (4), /* loadd.  */
+COSTS_N_INSNS (5), /* load_unaligned.  */
+0, /* store.  */
+0, /* strd.  */
+0, /* stm_1st.  */
+1, /* stm_regs_per_insn_1st.  */
+2, /* stm_regs_per_insn_subsequent.  */
+0, /* storef.  */
+0, /* stored.  */
+COSTS_N_INSNS (1), /* store_unaligned.  */
+COSTS_N_INSNS (1), /* loadv.  */
+COSTS_N_INSNS (1)  /* storev.  */
+  },
+  {
+/* FP SFmode */
+{
+  COSTS_N_INSNS (10),  /* div.  */
+  COSTS_N_INSNS (2),   /* mult.  */
+  COSTS_N_INSNS (3),   /* mult_addsub.  */
+  COSTS_N_INSNS (3),   /* fma.  */
+  COSTS_N_INSNS (1),   /* addsub.  */
+  0,   /* fpconst.  */
+  0,   /* neg.  */
+  0,   

Re: [PATCH] AArch64: Improve inline memcpy expansion

2020-11-16 Thread Wilco Dijkstra via Gcc-patches
Hi Richard,

>> +  if (size <= 24 || !TARGET_SIMD
>
> Nit: one condition per line when the condition spans multiple lines.

Fixed.

>> +  || (size <= (max_copy_size / 2)
>> +  && (aarch64_tune_params.extra_tuning_flags
>> +  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)))
>> +    copy_bits = GET_MODE_BITSIZE (TImode);
>
> (Looks like the mailer has eaten some tabs here.)

The email contains the correct tabs at the time I send it.

> As discussed in Sudi's setmem patch, I think we should make the
> AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS conditional on optimising
> for speed.  For size, using LDP Q and STP Q is a win regardless
> of what the CPU wants.

I've changed the logic slightly based on benchmarking. It's actually better
to fall back to calling memcpy for larger sizes in this case rather than emit an
inlined Q-register memcpy.

>>    int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
>
> As I mentioned in the reply I've just sent to Sudi's patch,
> I think it might help to add:
>
>  gcc_assert (n_bits <= mode_bits);
>
> to show why this is guaranteed not to overrun the original copy.
> (I agree that the code does guarantee no overrun.)

I've added the same assert so the code remains similar.

Here is the updated version:


Improve the inline memcpy expansion.  Use integer load/store for copies <= 24 
bytes
instead of SIMD.  Set the maximum copy to expand to 256 by default, except that 
-Os or
no Neon expands up to 128 bytes.  When using LDP/STP of Q-registers, also use 
Q-register
accesses for the unaligned tail, saving 2 instructions (eg. all sizes up to 48 
bytes emit
exactly 4 instructions).  Cleanup code and comments.
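
For illustration (not part of the patch; names are invented), a fixed-size
copy of the kind the expander now handles inline:

#include <string.h>

void
copy48 (void *dst, const void *src)
{
  /* 48 bytes is within the inline limit, so this is expanded as a few
     load/store pairs (Q registers where profitable) rather than a call
     to memcpy.  */
  memcpy (dst, src, 48);
}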

The codesize gain vs the GCC10 expansion is 0.05% on SPECINT2017.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2020-11-16  Wilco Dijkstra  

* config/aarch64/aarch64.c (aarch64_expand_cpymem): Cleanup code and
comments, tweak expansion decisions and improve tail expansion.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
41e2a699108146e0fa7464743607bd34e91ea9eb..4b2d5fa7d452dc53ff42308dd2781096ff8c95d2
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -21255,35 +21255,39 @@ aarch64_copy_one_block_and_progress_pointers (rtx 
*src, rtx *dst,
 bool
 aarch64_expand_cpymem (rtx *operands)
 {
-  /* These need to be signed as we need to perform arithmetic on n as
- signed operations.  */
-  int n, mode_bits;
+  int mode_bits;
   rtx dst = operands[0];
   rtx src = operands[1];
   rtx base;
-  machine_mode cur_mode = BLKmode, next_mode;
-  bool speed_p = !optimize_function_for_size_p (cfun);
-
-  /* When optimizing for size, give a better estimate of the length of a
- memcpy call, but use the default otherwise.  Moves larger than 8 bytes
- will always require an even number of instructions to do now.  And each
- operation requires both a load+store, so divide the max number by 2.  */
-  unsigned int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
+  machine_mode cur_mode = BLKmode;
 
-  /* We can't do anything smart if the amount to copy is not constant.  */
+  /* Only expand fixed-size copies.  */
   if (!CONST_INT_P (operands[2]))
 return false;
 
-  unsigned HOST_WIDE_INT tmp = INTVAL (operands[2]);
+  unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
 
-  /* Try to keep the number of instructions low.  For all cases we will do at
- most two moves for the residual amount, since we'll always overlap the
- remainder.  */
-  if (((tmp / 16) + (tmp % 16 ? 2 : 0)) > max_num_moves)
-return false;
+  /* Inline up to 256 bytes when optimizing for speed.  */
+  unsigned HOST_WIDE_INT max_copy_size = 256;
 
-  /* At this point tmp is known to have to fit inside an int.  */
-  n = tmp;
+  if (optimize_function_for_size_p (cfun))
+max_copy_size = 128;
+
+  int copy_bits = 256;
+
+  /* Default to 256-bit LDP/STP on large copies, however small copies, no SIMD
+ support or slow 256-bit LDP/STP fall back to 128-bit chunks.  */
+  if (size <= 24
+  || !TARGET_SIMD
+  || (aarch64_tune_params.extra_tuning_flags
+ & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
+{
+  copy_bits = 128;
+  max_copy_size = max_copy_size / 2;
+}
+
+  if (size > max_copy_size)
+return false;
 
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -21291,15 +21295,8 @@ aarch64_expand_cpymem (rtx *operands)
   base = copy_to_mode_reg (Pmode, XEXP (src, 0));
   src = adjust_automodify_address (src, VOIDmode, base, 0);
 
-  /* Convert n to bits to make the rest of the code simpler.  */
-  n = n * BITS_PER_UNIT;
-
-  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
- AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter and TARGET_SIMD.  */
-  const int copy_limit = ((aarch64_tune_params.extra_tuning_flags
-  & 

[PATCH] AArch64: Improve inline memcpy expansion

2020-11-05 Thread Wilco Dijkstra via Gcc-patches
Improve the inline memcpy expansion.  Use integer load/store for copies <= 24 
bytes
instead of SIMD.  Set the maximum copy to expand to 256 by default, except that 
-Os or
no Neon expands up to 128 bytes.  When using LDP/STP of Q-registers, also use 
Q-register
accesses for the unaligned tail, saving 2 instructions (eg. all sizes up to 48 
bytes emit
exactly 4 instructions).  Cleanup code and comments.

The codesize gain vs the GCC10 expansion is 0.05% on SPECINT2017.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2020-11-03  Wilco Dijkstra  

* config/aarch64/aarch64.c (aarch64_expand_cpymem): Cleanup code and
comments, tweak expansion decisions and improve tail expansion.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
41e2a699108146e0fa7464743607bd34e91ea9eb..9487c1cb07b0d851c0f085262179470d0d596116
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -21255,35 +21255,36 @@ aarch64_copy_one_block_and_progress_pointers (rtx 
*src, rtx *dst,
 bool
 aarch64_expand_cpymem (rtx *operands)
 {
-  /* These need to be signed as we need to perform arithmetic on n as
- signed operations.  */
-  int n, mode_bits;
+  int mode_bits;
   rtx dst = operands[0];
   rtx src = operands[1];
   rtx base;
-  machine_mode cur_mode = BLKmode, next_mode;
-  bool speed_p = !optimize_function_for_size_p (cfun);
+  machine_mode cur_mode = BLKmode;
 
-  /* When optimizing for size, give a better estimate of the length of a
- memcpy call, but use the default otherwise.  Moves larger than 8 bytes
- will always require an even number of instructions to do now.  And each
- operation requires both a load+store, so divide the max number by 2.  */
-  unsigned int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
-
-  /* We can't do anything smart if the amount to copy is not constant.  */
+  /* Only expand fixed-size copies.  */
   if (!CONST_INT_P (operands[2]))
 return false;
 
-  unsigned HOST_WIDE_INT tmp = INTVAL (operands[2]);
+  unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
 
-  /* Try to keep the number of instructions low.  For all cases we will do at
- most two moves for the residual amount, since we'll always overlap the
- remainder.  */
-  if (((tmp / 16) + (tmp % 16 ? 2 : 0)) > max_num_moves)
+  /* Inline up to 256 bytes when optimizing for speed.  */
+  unsigned HOST_WIDE_INT max_copy_size = 256;
+
+  if (optimize_function_for_size_p (cfun) || !TARGET_SIMD)
+max_copy_size = 128;
+
+  if (size > max_copy_size)
 return false;
 
-  /* At this point tmp is known to have to fit inside an int.  */
-  n = tmp;
+  int copy_bits = 256;
+
+  /* Default to 256-bit LDP/STP on large copies, however small copies, no SIMD
+ support or slow 256-bit LDP/STP fall back to 128-bit chunks.  */
+  if (size <= 24 || !TARGET_SIMD
+  || (size <= (max_copy_size / 2)
+ && (aarch64_tune_params.extra_tuning_flags
+ & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)))
+copy_bits = GET_MODE_BITSIZE (TImode);
 
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -21291,15 +21292,8 @@ aarch64_expand_cpymem (rtx *operands)
   base = copy_to_mode_reg (Pmode, XEXP (src, 0));
   src = adjust_automodify_address (src, VOIDmode, base, 0);
 
-  /* Convert n to bits to make the rest of the code simpler.  */
-  n = n * BITS_PER_UNIT;
-
-  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
- AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter and TARGET_SIMD.  */
-  const int copy_limit = ((aarch64_tune_params.extra_tuning_flags
-  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
- || !TARGET_SIMD)
-? GET_MODE_BITSIZE (TImode) : 256;
+  /* Convert size to bits to make the rest of the code simpler.  */
+  int n = size * BITS_PER_UNIT;
 
   while (n > 0)
 {
@@ -21307,23 +21301,26 @@ aarch64_expand_cpymem (rtx *operands)
 or writing.  */
   opt_scalar_int_mode mode_iter;
   FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
-   if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_limit))
+   if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_bits))
  cur_mode = mode_iter.require ();
 
   gcc_assert (cur_mode != BLKmode);
 
   mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
+
+  /* Prefer Q-register accesses for the last bytes.  */
+  if (mode_bits == 128 && copy_bits == 256)
+   cur_mode = V4SImode;
+
   aarch64_copy_one_block_and_progress_pointers (, , cur_mode);
 
   n -= mode_bits;
 
-  /* Do certain trailing copies as overlapping if it's going to be
-cheaper.  i.e. less instructions to do so.  For instance doing a 15
-byte copy it's more efficient to do two overlapping 8 byte copies than
-8 + 6 + 1.  */
-  if (n > 0 && n <= 8 * BITS_PER_UNIT)
+ 

Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c

2020-10-08 Thread Wilco Dijkstra via Gcc-patches
Hi Jakub,

> On Thu, Oct 08, 2020 at 11:37:24AM +0000, Wilco Dijkstra via Gcc-patches wrote:
>> Which optimizations does it enable that aren't possible if the value is defined?
>
> See bugzilla.  Note other compilers heavily optimize on those builtins
> undefined at value zero.

You mean PR94801, PR94793 and PR95863, which you mentioned before? The first
doesn't seem to be a useful optimization (would anyone ever write that?), and
the other two would benefit from clz(0) being well defined. In particular, x86
without BMI would benefit greatly from setting CTZ_DEFINED_VALUE_AT_ZERO to 2.
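
For reference, "setting it to 2" means a target definition along these lines
(a sketch modeled on what existing backends do; the exact form in i386.h would
differ):

/* Sketch only, not a proposed patch: the GCC target macro discussed above.
   Returning 2 (rather than 1) tells the middle-end that the zero-input
   result is defined for the built-in as well as for the RTL pattern, so
   the explicit zero check in "x ? __builtin_ctz (x) : 32" can be removed.  */
#define CTZ_DEFINED_VALUE_AT_ZERO(MODE, VALUE) \
  ((VALUE) = GET_MODE_BITSIZE (MODE), 2)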

So I fail to see any "heavy" optimizations here that show a benefit of keeping
the value undefined at zero.

>> > We just should make sure that we optimize code like x ? __builtin_c[lt]z (x) : 32;
>> > etc. properly (and I believe we do).
>> 
>> I think we do, but both the external and internal documentation are not clear
>> enough that most targets actually do define a value and will optimize for it.
>> Otherwise we wouldn't have this bug now...
>
> The documentation is very clear that the builtins are undefined at zero,
> that is all that matters for users.

If we don't change the undefinedness, at least we should try to explain the
above idiom as a way to get a well-defined range that still results in a
single instruction on most targets.
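
Something like the following (my illustration, not a proposed documentation
patch) is the pattern in question:

/* Illustration only: a zero-safe wrapper around __builtin_clz.  On targets
   whose clz instruction already returns 32 for a zero input (and that say
   so via CLZ_DEFINED_VALUE_AT_ZERO), the conditional folds away and this
   compiles to a single instruction.  */
static inline int
clz32 (unsigned int x)
{
  return x ? __builtin_clz (x) : 32;
}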

Cheers,
Wilco


Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c

2020-10-08 Thread Wilco Dijkstra via Gcc-patches
Hi Jakub, 

> Having it undefined allows optimizations, and has been that way for years.

Which optimizations does it enable that aren't possible if the value is defined?

> We just should make sure that we optimize code like x ? __builtin_c[lt]z (x) : 32;
> etc. properly (and I believe we do).

I think we do, but both the external and internal documentation are not clear
enough that most targets actually do define a value and will optimize for it.
Otherwise we wouldn't have this bug now...

Cheers,
Wilco

Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c

2020-10-08 Thread Wilco Dijkstra via Gcc-patches

Btw, for PowerPC the result is 0..32:

https://www.ibm.com/support/knowledgecenter/ssw_aix_72/assembler/idalangref_cntlzw_instrs.html

Wilco

