[PATCH v3] aarch64: Fix normal returns inside functions which use eh_returns [PR114843]

2024-05-20 Thread Wilco Dijkstra
Hi Andrew, A few comments on the implementation - I think it can be simplified a lot: > +++ b/gcc/config/aarch64/aarch64.h > @@ -700,8 +700,9 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE = > AARCH64_FL_SM_OFF; > #define DWARF2_UNWIND_INFO 1 > > /* Use R0 through R3 to pass exception handling

Re: [PATCH] AArch64: Improve costing of ctz

2024-05-15 Thread Wilco Dijkstra
Hi Andrew, > I should note popcount has a similar issue which I hope to fix next week. > Popcount cost is used during expand so it is very useful to be slightly more > correct. It's useful to set the cost so that all of the special cases still apply - even if popcount is relatively fast, it's

[PATCH] AArch64: Improve costing of ctz

2024-05-15 Thread Wilco Dijkstra
Improve costing of ctz - both TARGET_CSSC and vector cases were not handled yet. Passes regress & bootstrap - OK for commit? gcc: * config/aarch64/aarch64.cc (aarch64_rtx_costs): Improve CTZ costing. --- diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index
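
For illustration, a minimal example (mine, not from the patch) of the operation being costed; with TARGET_CSSC this is a single CTZ instruction, while without it the scalar expansion is an RBIT plus CLZ sequence:

int count_trailing (unsigned long x)
{
  return __builtin_ctzl (x);   /* single CTZ with CSSC, else RBIT + CLZ */
}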

[PATCH] AArch64: Fix printing of 2-instruction alternatives

2024-05-15 Thread Wilco Dijkstra
Add missing '\' in 2-instruction movsi/di alternatives so that they are printed on separate lines. Passes bootstrap and regress, OK for commit once stage 1 reopens? gcc: * config/aarch64/aarch64.md (movsi_aarch64): Use '\;' to force newline in 2-instruction pattern.

[PATCH] AArch64: Use LDP/STP for large struct types

2024-05-15 Thread Wilco Dijkstra
Use LDP/STP for large struct types as they have useful immediate offsets and are typically faster. This removes differences between little and big endian and allows use of LDP/STP without UNSPEC. Passes regress and bootstrap, OK for commit? gcc: * config/aarch64/aarch64.cc
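
For illustration, a minimal example (mine, not from the patch) of the kind of copy affected; a 32-byte struct can be copied with one Q-register LDP plus one STP:

struct S { long long x[4]; };   /* 32 bytes */

void copy_s (struct S *d, const struct S *s)
{
  *d = *s;   /* expected: ldp q0, q1, [x1]; stp q0, q1, [x0] */
}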

[PATCH] AArch64: Use UZP1 instead of INS

2024-05-15 Thread Wilco Dijkstra
Use UZP1 instead of INS when combining low and high halves of vectors. UZP1 has 3 operands which improves register allocation, and is faster on some microarchitectures. Passes regress & bootstrap, OK for commit? gcc: * config/aarch64/aarch64-simd.md (aarch64_combine_internal):
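
For illustration, a hedged sketch (mine) of the operation concerned - combining two 64-bit vector halves into a 128-bit vector, where a three-operand UZP1 gives the register allocator more freedom than INS, which ties its destination to an input:

#include <arm_neon.h>

int32x4_t combine_halves (int32x2_t lo, int32x2_t hi)
{
  return vcombine_s32 (lo, hi);
}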

[PATCH] regalloc: Ignore '^' in early costing [PR114766]

2024-04-29 Thread Wilco Dijkstra
According to documentation, '^' should only have an effect during reload. However ira-costs.cc treats it in the same way as '?' during early costing. As a result using '^' can accidentally disable valid alternatives and cause significant regressions (see PR114741). Avoid this by ignoring '^'

[PATCH] libgcc: Add missing HWCAP entries to aarch64/cpuinfo.c

2024-04-02 Thread Wilco Dijkstra
A few HWCAP entries are missing from aarch64/cpuinfo.c. This results in build errors on older machines. This counts as a trivial build fix, but since it's late in stage 4 I'll let maintainers chip in. OK for commit? libgcc/ * config/aarch64/cpuinfo.c: Add HWCAP_EVTSTRM, HWCAP_CRC32,

[PATCH] libatomic: Cleanup macros in atomic_16.S

2024-03-26 Thread Wilco Dijkstra
As mentioned in https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648397.html , do some additional cleanup of the macros and aliases: Cleanup the macros to add the libat_ prefixes in atomic_16.S. Emit the alias to __atomic_ when ifuncs are not enabled in the ENTRY macro. Passes regress and

Re: [PATCH] libatomic: Fix build for --disable-gnu-indirect-function [PR113986]

2024-03-26 Thread Wilco Dijkstra
Hi Richard, > This description is too brief for me.  Could you say in detail how the > new scheme works?  E.g. the description doesn't explain: > > -if ARCH_AARCH64_HAVE_LSE128 > -AM_CPPFLAGS   = -DHAVE_FEAT_LSE128 > -endif That is not needed because we can include auto-config.h in

[COMMITTED] ARM: Fix builtin-bswap-1.c test [PR113915]

2024-03-08 Thread Wilco Dijkstra
On Thumb-2 the use of CBZ blocks conditional execution, so change the test to compare with a non-zero value. gcc/testsuite/ChangeLog: PR target/113915 * gcc.target/arm/builtin-bswap.x: Fix test to avoid emitting CBZ. --- diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap.x
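
For illustration, a hedged sketch (mine, not the actual builtin-bswap.x code) of the test idea: comparing against zero invites CBZ, which is not permitted inside an IT block, while a non-zero comparison keeps a CMP whose dependent REV can still be conditionally executed:

unsigned f (int x, unsigned y)
{
  return x != 5 ? __builtin_bswap32 (y) : y;
}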

Re: [PATCH] ARM: Fix conditional execution [PR113915]

2024-02-26 Thread Wilco Dijkstra
Hi Richard, > Did you test this on a thumb1 target?  It seems to me that the target parts > that you've > removed were likely related to that.  In fact, I don't see why this test > would need to be changed at all. The testcase explicitly forces a Thumb-2 target (arm_arch_v6t2). The patterns

[PATCH] libatomic: Fix build for --disable-gnu-indirect-function [PR113986]

2024-02-23 Thread Wilco Dijkstra
Fix libatomic build to support --disable-gnu-indirect-function on AArch64. Always build atomic_16.S and add aliases to the __atomic_* functions if !HAVE_IFUNC. Passes regress and bootstrap, OK for commit? libatomic: PR target/113986 * Makefile.in: Regenerated. *

Re: [PATCH] ARM: Fix conditional execution [PR113915]

2024-02-23 Thread Wilco Dijkstra
Hi Richard, > This bit isn't.  The correct fix here is to fix the pattern(s) concerned to > add the missing predicate. > > Note that builtin-bswap.x explicitly mentions predicated mnemonics in the > comments. I fixed the patterns in v2. There are likely some more, plus we could likely merge

Re: [PATCH] AArch64: memcpy/memset expansions should not emit LDP/STP [PR113618]

2024-02-22 Thread Wilco Dijkstra
Hi Richard, > It looks like this is really doing two things at once: disabling the > direct emission of LDP/STP Qs, and switching the GPR handling from using > pairs of DImode moves to single TImode moves.  At least, that seems to be > the effect of... No, it still uses TImode for the

[PATCH] ARM: Fix conditional execution [PR113915]

2024-02-21 Thread Wilco Dijkstra
By default most patterns can be conditionalized on Arm targets. However Thumb-2 predication requires the "predicable" attribute be explicitly set to "yes". Most patterns are shared between Arm and Thumb(-2) and are marked with "predicable". Given this sharing, it does not make sense to use a

[PATCH] AArch64: memcpy/memset expansions should not emit LDP/STP [PR113618]

2024-02-01 Thread Wilco Dijkstra
The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC. Given the new LDP fusion pass is good at finding LDP opportunities, change the memcpy, memmove and memset expansions to emit single vector loads/stores. This fixes the regression and enables more RTL optimization on

Re: [PATCH v4] AArch64: Cleanup memset expansion

2024-01-30 Thread Wilco Dijkstra
Hi Richard, >> That tune is only used by an obsolete core. I ran the memcpy and memset >> benchmarks from Optimized Routines on xgene-1 with and without LDP/STP. >> There is no measurable penalty for using LDP/STP. I'm not sure why it was >> ever added given it does not do anything useful. I'll

[PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS

2024-01-30 Thread Wilco Dijkstra
(follow-on based on review comments on https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641913.html) Remove the tune AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS since it is only used by an old core and doesn't properly support -Os. SPECINT_2017 shows that removing it has no performance

Re: [PATCH] AArch64: Add -mcpu=cobalt-100

2024-01-25 Thread Wilco Dijkstra
Hi, >> Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer >> ID). >> >> Passes regress, OK for commit? > > Ok. Also OK to backport to GCC 13, 12 and 11? Cheers, Wilco

[PATCH] AArch64: Add -mcpu=cobalt-100

2024-01-16 Thread Wilco Dijkstra
Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer ID). Passes regress, OK for commit? gcc/ChangeLog: * config/aarch64/aarch64-cores.def (AARCH64_CORE): Add 'cobalt-100' CPU. * config/aarch64/aarch64-tune.md: Regenerated. * doc/invoke.texi

Re: [PATCH] AArch64: Reassociate CONST in address expressions [PR112573]

2024-01-16 Thread Wilco Dijkstra
Hi Richard, >> +  rtx base = strip_offset_and_salt (XEXP (x, 1), &offset); > > This should be just strip_offset, so that we don't lose the salt > during optimisation. Fixed. > + > +  if (offset.is_constant ()) > I'm not sure this is really required.  Logically the same thing > would apply to

[PATCH] AArch64: Reassociate CONST in address expressions [PR112573]

2024-01-10 Thread Wilco Dijkstra
GCC tends to optimistically create CONST of globals with an immediate offset. However it is almost always better to CSE addresses of globals and add immediate offsets separately (the offset could be merged later in single-use cases). Splitting CONST expressions with an index in
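
For illustration, a minimal example (mine) of the kind of code that benefits: after splitting the CONST, both loads can share a single ADRP/ADD of 'table', with the offsets 800 and 1600 folded into the load addressing modes instead of being materialized as separate anchors:

extern long long table[1024];

long long sum_two (void)
{
  return table[100] + table[200];
}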

Re: [PATCH v4] AArch64: Cleanup memset expansion

2024-01-09 Thread Wilco Dijkstra
Hi Richard, >> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96) > > Since this isn't (AFAIK) a standard macro, there doesn't seem to be > any need to put it in the header file.  It could just go at the head > of aarch64.cc instead. Sure, I've moved it in v4. >> +  if (len <= 24 ||

Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a

2024-01-08 Thread Wilco Dijkstra
Hi Richard, >> Benchmarking showed that LSE and LSE2 RMW atomics have similar performance >> once >> the atomic is acquire, release or both. Given there is already a significant >> overhead due >> to the function call, PLT indirection and argument setup, it doesn't make >> sense to add >>

Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a

2024-01-08 Thread Wilco Dijkstra
Hi, >> Is there no benefit to using SWPPL for RELEASE here?  Similarly for the >> others. > > We started off implementing all possible memory orderings available. > Wilco saw value in merging less restricted orderings into more > restricted ones - mainly to reduce codesize in less frequently

Re: [PATCH v3] AArch64: Cleanup memset expansion

2023-12-22 Thread Wilco Dijkstra
v3: rebased to latest trunk Cleanup memset implementation. Similar to memcpy/memmove, use an offset and bytes throughout. Simplify the complex calculations when optimizing for size by using a fixed limit. Passes regress & bootstrap. gcc/ChangeLog: * config/aarch64/aarch64.h

Re: [PATCH v2] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-12-04 Thread Wilco Dijkstra
Hi Richard, >> Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible >> with >> existing binaries, gives better performance than locking atomics and is what >> most users expect. > > Please add a justification for why it's backwards compatible, rather > than just stating

Re: [PATCH v3] AArch64: Add inline memmove expansion

2023-12-01 Thread Wilco Dijkstra
Hi Richard, > +  rtx load[max_ops], store[max_ops]; > > Please either add a comment explaining why 40 is guaranteed to be > enough, or (my preference) use: > >  auto_vec, ...> ops; I've changed to using auto_vec since that should help reduce conflicts with Alex' LDP changes. I double-checked

Re: [PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-11-30 Thread Wilco Dijkstra
Hi Richard, Thanks for the review, now committed. > The new aarch64_split_compare_and_swap code looks a bit twisty. > The approach in lse.S seems more obvious.  But I'm guessing you > didn't want to spend any time restructuring the pre-LSE > -mno-outline-atomics code, and I agree the patch in

Re: [PATCH v2] AArch64: Cleanup memset expansion

2023-11-14 Thread Wilco Dijkstra
Hi Richard, > +/* Maximum bytes set for an inline memset expansion.  With -Os use 3 STP > +   and 1 MOVI/DUP (same size as a call).  */ > +#define MAX_SET_SIZE(speed) (speed ? 256 : 96) > So it looks like this assumes we have AdvSIMD.  What about > -mgeneral-regs-only? After my strictalign

Re: [PATCH v2] AArch64: Cleanup memset expansion

2023-11-14 Thread Wilco Dijkstra
Hi, >>> I checked codesize on SPECINT2017, and 96 had practically identical size. >>> Using 128 would also be a reasonable Os value with a very slight size >>> increase, >>> and 384 looks good for O2 - however I didn't want to tune these values >>> as this >>> is a cleanup patch. >>> >>> Cheers,

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-11-10 Thread Wilco Dijkstra
Hi Kyrill, > +  if (!(hwcap & HWCAP_CPUID)) > +    return false; > + > +  unsigned long midr; > +  asm volatile ("mrs %0, midr_el1" : "=r" (midr)); > From what I recall that midr_el1 register is emulated by the kernel and so > userspace software > has to check that the kernel supports that

Re: [PATCH] AArch64: Cleanup memset expansion

2023-11-10 Thread Wilco Dijkstra
Hi Kyrill, > +  /* Reduce the maximum size with -Os.  */ > +  if (optimize_function_for_size_p (cfun)) > +    max_set_size = 96; > + > This is a new "magic" number in this code. It looks sensible, but how > did you arrive at it? We need 1 instruction to create the value to store (DUP or
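
To make the instruction count concrete, a sketch (mine, assuming Q-register stores): one MOVI/DUP materializes the value and three STP Q instructions store 3 x 32 = 96 bytes, about the same size as setting up and calling memset, hence the 96-byte -Os limit:

#include <string.h>

void clear96 (char *p)
{
  memset (p, 0, 96);   /* expected at -Os: 1 MOVI + 3 STP q-pairs */
}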

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-11-06 Thread Wilco Dijkstra
ping From: Wilco Dijkstra Sent: 02 June 2023 18:28 To: GCC Patches Cc: Richard Sandiford ; Kyrylo Tkachov Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]   Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with existing binaries

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-11-06 Thread Wilco Dijkstra
  ping From: Wilco Dijkstra Sent: 04 August 2023 16:05 To: GCC Patches ; Richard Sandiford Cc: Kyrylo Tkachov Subject: [PATCH] libatomic: Improve ifunc selection on AArch64   Add support for ifunc selection based on CPUID register.  Neoverse N1 supports atomic 128-bit load/store, so use

Re: [PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-11-06 Thread Wilco Dijkstra
  ping   __sync_val_compare_and_swap may be used on 128-bit types and either calls the outline atomic code or uses an inline loop.  On AArch64 LDXP is only atomic if the value is stored successfully using STXP, but the current implementations do not perform the store if the comparison fails.  In

Re: [PATCH] AArch64: Cleanup memset expansion

2023-11-06 Thread Wilco Dijkstra
ping   Cleanup memset implementation.  Similar to memcpy/memmove, use an offset and bytes throughout.  Simplify the complex calculations when optimizing for size by using a fixed limit. Passes regress/bootstrap, OK for commit?     gcc/ChangeLog:     * config/aarch64/aarch64.cc

Re: [PATCH v2] AArch64: Add inline memmove expansion

2023-11-06 Thread Wilco Dijkstra
ping   v2: further cleanups, improved comments Add support for inline memmove expansions.  The generated code is identical to that for memcpy, except that all loads are emitted before stores rather than being interleaved.  The maximum size is 256 bytes, which requires at most 16 registers. Passes

Re: [PATCH v2] AArch64: Fix strict-align cpymem/setmem [PR103100]

2023-11-06 Thread Wilco Dijkstra
ping   v2: Use UINTVAL, rename max_mops_size. The cpymemdi/setmemdi implementation doesn't fully support strict alignment. Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT. Clean up the condition when to use MOPS.     Passes regress/bootstrap, OK for commit?    

[PATCH v2] AArch64: Improve immediate generation

2023-10-24 Thread Wilco Dijkstra
v2: Use check-function-bodies in tests Further improve immediate generation by adding support for 2-instruction MOV/EOR bitmask immediates. This reduces the number of 3/4-instruction immediates in SPECCPU2017 by ~2%. Passes regress, OK for commit? gcc/ChangeLog: *

[PATCH] AArch64: Cleanup memset expansion

2023-10-19 Thread Wilco Dijkstra
Cleanup memset implementation. Similar to memcpy/memmove, use an offset and bytes throughout. Simplify the complex calculations when optimizing for size by using a fixed limit. Passes regress/bootstrap, OK for commit? gcc/ChangeLog: * config/aarch64/aarch64.cc

[PATCH] AArch64: Improve immediate generation

2023-10-19 Thread Wilco Dijkstra
Further improve immediate generation by adding support for 2-instruction MOV/EOR bitmask immediates. This reduces the number of 3/4-instruction immediates in SPECCPU2017 by ~2%. Passes regress, OK for commit? gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
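
For illustration, a worked example (my own value, not from the patch): 0x5555aaaa5555aaaa is not itself a bitmask immediate, but it is the EOR of two bitmask immediates, so it can now be built in 2 instructions instead of a 4-instruction MOV/MOVK chain:

/* Plausible expansion:
     mov x0, 0x5555555555555555
     eor x0, x0, 0x0000ffff0000ffff  */
unsigned long long two_insn_imm (void)
{
  return 0x5555aaaa5555aaaaULL;
}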

Re: [PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-10-16 Thread Wilco Dijkstra
Hi Ramana, > I remember this to be the previous discussions and common understanding. > > https://gcc.gnu.org/legacy-ml/gcc/2016-06/msg00017.html > > and here > > https://gcc.gnu.org/legacy-ml/gcc-patches/2017-02/msg00168.html > > Can you point any discussion recently that shows this has changed

Re: [PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-10-16 Thread Wilco Dijkstra
ping   __sync_val_compare_and_swap may be used on 128-bit types and either calls the outline atomic code or uses an inline loop.  On AArch64 LDXP is only atomic if the value is stored successfully using STXP, but the current implementations do not perform the store if the comparison fails.  In

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-10-16 Thread Wilco Dijkstra
  ping From: Wilco Dijkstra Sent: 04 August 2023 16:05 To: GCC Patches ; Richard Sandiford Cc: Kyrylo Tkachov Subject: [PATCH] libatomic: Improve ifunc selection on AArch64   Add support for ifunc selection based on CPUID register.  Neoverse N1 supports atomic 128-bit load/store, so use

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-10-16 Thread Wilco Dijkstra
  ping From: Wilco Dijkstra Sent: 02 June 2023 18:28 To: GCC Patches Cc: Richard Sandiford ; Kyrylo Tkachov Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]   Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with existing binaries

Re: [PATCH v2] AArch64: Fix strict-align cpymem/setmem [PR103100]

2023-10-16 Thread Wilco Dijkstra
ping   v2: Use UINTVAL, rename max_mops_size. The cpymemdi/setmemdi implementation doesn't fully support strict alignment. Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT. Clean up the condition when to use MOPS.     Passes regress/bootstrap, OK for commit?    

[PATCH v2] AArch64: Add inline memmove expansion

2023-10-16 Thread Wilco Dijkstra
v2: further cleanups, improved comments Add support for inline memmove expansions. The generated code is identical to that for memcpy, except that all loads are emitted before stores rather than being interleaved. The maximum size is 256 bytes, which requires at most 16 registers. Passes

Re: [PATCH v2] ARM: Block predication on atomics [PR111235]

2023-10-02 Thread Wilco Dijkstra
Hi Ramana, >> I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf >> --build=arm-none-linux-gnueabihf --with-float=hard. However it seems that the >> default armhf settings are incorrect. I shouldn't need the --with-float=hard >> since >> that is obviously implied by armhf,

Re: [PATCH v2] ARM: Block predication on atomics [PR111235]

2023-09-27 Thread Wilco Dijkstra
Hi Ramana, > Hope this helps. Yes definitely! >> Passes regress/bootstrap, OK for commit? > > Target ? armhf ? --with-arch , -with-fpu , -with-float parameters ? > Please be specific. I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf --build=arm-none-linux-gnueabihf

[PATCH] AArch64: Remove BTI from outline atomics

2023-09-26 Thread Wilco Dijkstra
The outline atomic functions have hidden visibility and can only be called directly.  Therefore we can remove the BTI at function entry.  This improves security by reducing the number of indirect entry points in a binary. The BTI markings on the objects are still emitted. Passes regress, OK for

Re: [PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-09-25 Thread Wilco Dijkstra
Hi Ramana, >> __sync_val_compare_and_swap may be used on 128-bit types and either calls the >> outline atomic code or uses an inline loop.  On AArch64 LDXP is only atomic >> if >> the value is stored successfully using STXP, but the current implementations >> do not perform the store if the

[PATCH] AArch64: Add inline memmove expansion

2023-09-21 Thread Wilco Dijkstra
Add support for inline memmove expansions. The generated code is identical to that for memcpy, except that all loads are emitted before stores rather than being interleaved. The maximum size is 256 bytes, which requires at most 16 registers. Passes regress/bootstrap, OK for commit?
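
For illustration, a minimal example (mine) at the 256-byte limit; because all 16 Q-register loads are emitted before the 16 stores, an overlapping copy needs no runtime direction check:

#include <string.h>

void move256 (char *d, const char *s)
{
  memmove (d, s, 256);
}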

[PATCH v2] AArch64: Fix strict-align cpymem/setmem [PR103100]

2023-09-21 Thread Wilco Dijkstra
v2: Use UINTVAL, rename max_mops_size. The cpymemdi/setmemdi implementation doesn't fully support strict alignment. Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT. Clean up the condition when to use MOPS. Passes regress/bootstrap, OK for commit?

Re: [PATCH] AArch64: Fix strict-align cpymem/setmem [PR103100]

2023-09-20 Thread Wilco Dijkstra
Hi Richard, > * config/aarch64/aarch64.md (cpymemdi): Remove pattern condition. > Shouldn't this be a separate patch?  It's not immediately obvious that this > is a necessary part of this change. You mean this? @@ -1627,7 +1627,7 @@ (define_expand "cpymemdi" (match_operand:BLK 1

[PATCH v2] AArch64: Fix memmove operand corruption [PR111121]

2023-09-20 Thread Wilco Dijkstra
A MOPS memmove may corrupt registers since there is no copy of the input operands to temporary registers. Fix this by calling aarch64_expand_cpymem_mops. Passes regress/bootstrap, OK for commit? gcc/ChangeLog: PR target/111121 * config/aarch64/aarch64.md

[PATCH] AArch64: Fix strict-align cpymem/setmem [PR103100]

2023-09-20 Thread Wilco Dijkstra
The cpymemdi/setmemdi implementation doesn't fully support strict alignment. Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT. Clean up the condition when to use MOPS. Passes regress/bootstrap, OK for commit? gcc/ChangeLog: PR target/103100 *

Re: [PATCH] AArch64: Improve immediate expansion [PR105928]

2023-09-19 Thread Wilco Dijkstra
Hi Richard, >> Note that aarch64_internal_mov_immediate may be called after reload, >> so it would end up even more complex. > > The sequence I quoted was supposed to work before and after reload.  The: > >    rtx tmp = aarch64_target_reg (dest, DImode); > > would create a fresh

Re: [PATCH] AArch64: Improve immediate expansion [PR105928]

2023-09-18 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > I was worried that reusing "dest" for intermediate results would > prevent CSE for cases like: > > void g (long long, long long); > void > f (long long *ptr) > { >   g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL); > } Note that aarch64_internal_mov_immediate may be called after

[PATCH] AArch64: Improve immediate expansion [PR105928]

2023-09-14 Thread Wilco Dijkstra via Gcc-patches
Support immediate expansion of immediates which can be created from 2 MOVKs and a shifted ORR or BIC instruction. Change aarch64_split_dimode_const_store to apply if we save one instruction. This reduces the number of 4-instruction immediates in SPECINT/FP by 5%. Passes regress, OK for commit?
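
For illustration, reusing the constant quoted in the review exchange above (the expansion shown is my sketch of the intent, not the patch's exact output): the repeated 32-bit halves allow MOV/MOVK of the low half followed by a shifted ORR to replicate it:

/* Plausible 3-instruction expansion:
     mov  x0, 0xee22
     movk x0, 0xee11, lsl 16
     orr  x0, x0, x0, lsl 32  */
long long imm_example (void)
{
  return 0xee11ee22ee11ee22LL;
}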

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-09-13 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 02 June 2023 18:28 To: GCC Patches Cc: Richard Sandiford ; Kyrylo Tkachov Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]   Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with existing binaries

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-09-13 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 04 August 2023 16:05 To: GCC Patches ; Richard Sandiford Cc: Kyrylo Tkachov Subject: [PATCH] libatomic: Improve ifunc selection on AArch64   Add support for ifunc selection based on CPUID register.  Neoverse N1 supports atomic 128-bit load/store, so use

[PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-09-13 Thread Wilco Dijkstra via Gcc-patches
__sync_val_compare_and_swap may be used on 128-bit types and either calls the outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if the value is stored successfully using STXP, but the current implementations do not perform the store if the comparison fails. In this case
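
For illustration, a hedged sketch (mine; relaxed ordering, not the actual lse.S or inline-loop code) of a 128-bit compare-and-swap that performs the exclusive store even when the comparison fails, so the LDXP result is known to have been read atomically:

#include <stdint.h>

static __uint128_t cas128_sketch (__uint128_t *ptr, __uint128_t expected,
                                  __uint128_t desired)
{
  uint64_t lo, hi;
  uint32_t fail;
  __uint128_t old, store;
  do
    {
      __asm__ volatile ("ldxp %0, %1, %2"
                        : "=&r" (lo), "=&r" (hi) : "Q" (*ptr));
      old = ((__uint128_t) hi << 64) | lo;
      /* On mismatch, store back the value just loaded rather than
         skipping the STXP - the fix described above.  */
      store = old == expected ? desired : old;
      __asm__ volatile ("stxp %w0, %2, %3, %1"
                        : "=&r" (fail), "=Q" (*ptr)
                        : "r" ((uint64_t) store), "r" ((uint64_t) (store >> 64))
                        : "memory");
    }
  while (fail);
  return old;
}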

[PATCH] AArch64: List official cores before codenames

2023-09-13 Thread Wilco Dijkstra via Gcc-patches
List official cores first so that -mcpu=native does not show a codename with -v or in errors/warnings. Passes regress, OK for commit? gcc/ChangeLog: * config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares. (neoverse-v1): Place before zeus. (neoverse-v2): Place

[PATCH] ARM: Block predication on atomics [PR111235]

2023-09-07 Thread Wilco Dijkstra via Gcc-patches
The v7 memory ordering model allows reordering of conditional atomic instructions. To avoid this, make all atomic patterns unconditional. Expand atomic loads and stores for all architectures so the memory access can be wrapped into an UNSPEC. Passes regress/bootstrap, OK for commit?

Re: [PATCH] AArch64: Fix MOPS memmove operand corruption [PR111121]

2023-08-23 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, (that's quick!) > + if (size > max_copy_size || size > max_mops_size) > +return aarch64_expand_cpymem_mops (operands, is_memmove); > > Could you explain this a bit more? If I've followed the logic correctly, > max_copy_size will always be 0 for movmem, so this "if" condition

[PATCH] AArch64: Fix MOPS memmove operand corruption [PR111121]

2023-08-23 Thread Wilco Dijkstra via Gcc-patches
A MOPS memmove may corrupt registers since there is no copy of the input operands to temporary registers. Fix this by calling aarch64_expand_cpymem which does this. Also fix an issue with STRICT_ALIGNMENT being ignored if TARGET_MOPS is true, and avoid crashing or generating a huge expansion

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-08-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >>> Answering my own question, N1 does not officially have FEAT_LSE2. >> >> It doesn't indeed. However most cores support atomic 128-bit load/store >> (part of LSE2), so we can still use the LSE2 ifunc for those cores. Since >> there >> isn't a feature bit for this in the CPU or

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-08-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >> Why would HWCAP_USCAT not be set by the kernel? >> >> Failing that, I would think you would check ID_AA64MMFR2_EL1.AT. >> > Answering my own question, N1 does not officially have FEAT_LSE2. It doesn't indeed. However most cores support atomic 128-bit load/store (part of LSE2), so

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-08-04 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 02 June 2023 18:28 To: GCC Patches Cc: Richard Sandiford ; Kyrylo Tkachov Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]   Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with existing binaries

[PATCH] libatomic: Improve ifunc selection on AArch64

2023-08-04 Thread Wilco Dijkstra via Gcc-patches
Add support for ifunc selection based on CPUID register. Neoverse N1 supports atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse cores. Passes regress, OK for commit? libatomic/ config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc selection.
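
For illustration, a hedged sketch (mine, based on the code quoted in the replies above) of the detection: MIDR_EL1 reads from user space are trapped and emulated by the Linux kernel, so they must be gated on HWCAP_CPUID; implementer 0x41 (Arm) with part number 0xd0c identifies Neoverse N1:

#include <sys/auxv.h>
#include <asm/hwcap.h>

static int is_neoverse_n1 (void)
{
  if (!(getauxval (AT_HWCAP) & HWCAP_CPUID))
    return 0;
  unsigned long midr;
  __asm__ volatile ("mrs %0, midr_el1" : "=r" (midr));
  return ((midr >> 24) & 0xff) == 0x41      /* implementer: Arm */
         && ((midr >> 4) & 0xfff) == 0xd0c; /* part: Neoverse N1 */
}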

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-07-05 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 02 June 2023 18:28 To: GCC Patches Cc: Richard Sandiford ; Kyrylo Tkachov Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]   Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with existing binaries

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-06-16 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 02 June 2023 18:28 To: GCC Patches Cc: Richard Sandiford ; Kyrylo Tkachov Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]   Enable lock-free 128-bit atomics on AArch64.  This is backwards compatible with existing binaries

[PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-06-02 Thread Wilco Dijkstra via Gcc-patches
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with existing binaries, gives better performance than locking atomics and is what most users expect. Note 128-bit atomic loads use a load/store exclusive loop if LSE2 is not supported. This results in an implicit store
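
For illustration, the kind of user-level operation affected (my example; link with -latomic): without LSE2 this 16-byte atomic load is implemented with a load/store-exclusive loop, which is why even a pure load performs an implicit store:

#include <stdatomic.h>

__int128 load16 (_Atomic __int128 *p)
{
  return atomic_load (p);
}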

Re: [PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]

2023-03-16 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 23 February 2023 15:11 To: GCC Patches Cc: Richard Sandiford ; Kyrylo Tkachov Subject: [PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]   The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP - without it, it effectively has

[PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]

2023-02-23 Thread Wilco Dijkstra via Gcc-patches
The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP - without it, it effectively has Load-AcquirePC semantics similar to LDAPR, which is less restrictive than what __ATOMIC_SEQ_CST requires. This patch fixes this and adds comments to make it easier to see which sequence is

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-18 Thread Wilco Dijkstra via Gcc-patches
Hi, >> +  /* Return-address signing state is toggled by DW_CFA_GNU_window_save >> (where >> + REG_UNDEFINED means enabled), or set by a DW_CFA_expression.  */ > > Needs updating to REG_UNSAVED_ARCHEXT. > > OK with that changes, thanks, and sorry for the delays & runaround. Thanks, I've

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-17 Thread Wilco Dijkstra via Gcc-patches
Hi, > @Wilco, can you please send the rebased patch for patch review? We would > need it in our openSUSE package soon. Here is an updated and rebased version: Cheers, Wilco v4: rebase and add REG_UNSAVED_ARCHEXT. A recent change only initializes the regs.how[] during Dwarf unwinding, which

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-11 Thread Wilco Dijkstra via Gcc-patches
Hi, > On 1/10/23 19:12, Jakub Jelinek via Gcc-patches wrote: >> Anyway, the sooner this makes it into gcc trunk, the better, it breaks quite >> a lot of stuff. > > Yep, please, we're also waiting for this patch for pushing to our gcc13 > package. Well I'm waiting for an OK from a maintainer...

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-10 Thread Wilco Dijkstra via Gcc-patches
Hi Szabolcs, > i would keep the assert: how[reg] must be either UNSAVED or UNDEFINED > here, other how[reg] means the toggle cfi instruction is mixed with > incompatible instructions for the pseudo reg. > > and i would add a comment about this e.g. saying that UNSAVED/UNDEFINED > how[reg] is used

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-03 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Hmm, but the point of the original patch was to support code generators > that emit DW_CFA_val_expression instead of DW_CFA_AARCH64_negate_ra_state. > Doesn't this patch undo that? Well it wasn't clear from the code or comments that was supported. I've added that back in v2. >

[PATCH] AArch64: Enable TARGET_CONST_ANCHOR

2022-12-09 Thread Wilco Dijkstra via Gcc-patches
Enable TARGET_CONST_ANCHOR to allow complex constants to be created via immediate add. Use a 24-bit range as that enables a 3- or 4-instruction immediate to be replaced by 2 additions. Fix the costing of immediate add to support 24-bit immediate and 12-bit shifted immediates. The generated
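
For illustration, a hedged example (hypothetical values, mine): once the first constant is in a register, the second differs only by 0x1000, which fits the 24-bit add range, so it can be derived with a single ADD instead of another multi-instruction sequence:

void g (long long, long long);

void f (void)
{
  g (0x123456789abcLL, 0x123456789abcLL + 0x1000);
}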

Re: [PATCH][AArch64] Cleanup move immediate code

2022-12-07 Thread Wilco Dijkstra via Gcc-patches
Hi Andreas, Thanks for the report, I've committed the fix: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006 Cheers, Wilco

[COMMITTED] AArch64: Fix assert in aarch64_move_imm [PR108006]

2022-12-07 Thread Wilco Dijkstra via Gcc-patches
Ensure we only pass SI/DImode, which fixes the assert. Committed as obvious. gcc/ PR target/108006 * config/aarch64/aarch64.cc (aarch64_expand_sve_const_vector): Fix call to aarch64_move_imm to use SI/DI. --- diff --git a/gcc/config/aarch64/aarch64.cc

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2022-12-06 Thread Wilco Dijkstra via Gcc-patches
Hi, > i don't think how[*RA_STATE] can ever be set to REG_SAVED_OFFSET, > this pseudo reg is not spilled to the stack, it is reset to 0 in > each frame and then toggled within a frame. It is just a state; we can use any state we want since it is a pseudo reg. These registers are global and

Re: [PATCH][AArch64] Cleanup move immediate code

2022-12-05 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > -  scalar_int_mode imode = (mode == HFmode > -    ? SImode > -    : int_mode_for_mode (mode).require ()); > +  machine_mode imode = (mode == DFmode) ? DImode : SImode; > It looks like this might mishandle DDmode, if not now

[PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2022-12-01 Thread Wilco Dijkstra via Gcc-patches
A recent change only initializes the regs.how[] during Dwarf unwinding, which resulted in an uninitialized offset used in return address signing and random failures during unwinding. The fix is to use REG_SAVED_OFFSET as the state where the return address signing bit is valid, and if the state is

Re: [PATCH][AArch64] Cleanup move immediate code

2022-11-29 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Just to make sure I understand: isn't it really just MOVN?  I would have > expected a 32-bit MOVZ to be equivalent to (and add no capabilities over) > a 64-bit MOVZ. The 32-bit MOVZ immediates are equivalent, MOVN never overlaps, and MOVI has some overlaps. Since we allow all 3

Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-23 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >> A smart reassociation pass could form more FMAs while also increasing >> parallelism, but the way it currently works always results in fewer FMAs. > > Yeah, as Richard said, that seems the right long-term fix. > It would also avoid the hack of treating PLUS_EXPR as a signal > of an

Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-22 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > I guess an obvious question is: if 1 (rather than 2) was the right value > for cores with 2 FMA pipes, why is 4 the right value for cores with 4 FMA > pipes?  It would be good to clarify how, conceptually, the core property > should map to the fma_reassoc_width value. 1 turns off

Re: [PATCH] AArch64: Add support for -mdirect-extern-access

2022-11-17 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Can you go into more detail about: > >    Use :option:`-mdirect-extern-access` either in shared libraries or in >    executables, but not in both.  Protected symbols used both in a shared >    library and executable may cause linker errors or fail to work correctly > > If this is

[PATCH] AArch64: Add support for -mdirect-extern-access

2022-11-11 Thread Wilco Dijkstra via Gcc-patches
Add a new option -mdirect-extern-access similar to other targets. This removes GOT indirections on external symbols with -fPIE, resulting in significantly better code quality. With -fPIC it only affects protected symbols, allowing for more efficient shared libraries which can be linked with
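
For illustration, a minimal example (mine): with -fPIE this load of an external symbol normally goes through the GOT; with -mdirect-extern-access it can be addressed directly via ADRP plus a lo12 offset, assuming the symbol ends up defined in the executable:

extern int counter;

int get_counter (void)
{
  return counter;
}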

[PATCH] libatomic: Add support for LSE and LSE2

2022-11-11 Thread Wilco Dijkstra via Gcc-patches
Add support for AArch64 LSE and LSE2 to libatomic. Disable outline atomics, and use LSE ifuncs for 1-8 byte atomics and LSE2 ifuncs for 16-byte atomics. On Neoverse V1, 16-byte atomics are ~4x faster due to avoiding locks. Note this is safe since we swap all 16-byte atomics using the same ifunc,

[PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-09 Thread Wilco Dijkstra via Gcc-patches
Add a reassociation width for FMAs in per-CPU tuning structures. Keep the existing setting for cores with 2 FMA pipes, and use 4 for cores with 4 FMA pipes. This improves SPECFP2017 on Neoverse V1 by ~1.5%. Passes regress/bootstrap, OK for commit? gcc/ PR 107413 *
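
For illustration, a reduction (mine; needs -ffast-math so GCC may reassociate) where the FMA reassociation width matters: with width 4 the accumulator can be split into 4 independent FMA chains to hide FMA latency on cores with 4 FMA pipes, while width 1 keeps one serial chain:

double dot (const double *a, const double *b, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += a[i] * b[i];   /* candidate FMA per iteration */
  return s;
}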

[committed] AArch64: Fix testcase

2022-11-04 Thread Wilco Dijkstra via Gcc-patches
Committed as trivial fix. gcc/testsuite/ * gcc.target/aarch64/mgeneral-regs_3.c: Fix testcase. --- diff --git a/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c b/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c index

[PATCH][AArch64] Cleanup move immediate code

2022-11-01 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, Here is the immediate cleanup split off from the previous patch: Simplify, refactor and improve various move immediate functions. Allow 32-bit MOVZ/N as a valid 64-bit immediate which removes special cases in aarch64_internal_mov_immediate. Add a new constraint so the movdi pattern

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-20 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Can you do the aarch64_mov_imm changes as a separate patch?  It's difficult > to review the two changes folded together like this. Sure, I'll send a separate patch. So here is version 2 again: [PATCH v2][AArch64] Improve immediate expansion [PR106583] Improve immediate expansion

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-19 Thread Wilco Dijkstra via Gcc-patches
ping Hi Richard, >>> Sounds good, but could you put it before the mode version, >>> to avoid the forward declaration? >> >> I can swap them around but the forward declaration is still required as >> aarch64_check_bitmask is 5000 lines before aarch64_bitmask_imm. > > OK, how about moving them
