Hi Andrew,
A few comments on the implementation; I think it can be simplified a lot:
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -700,8 +700,9 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE =
> AARCH64_FL_SM_OFF;
> #define DWARF2_UNWIND_INFO 1
>
> /* Use R0 through R3 to pass exception handling
Hi Andrew,
> I should note popcount has a similar issue which I hope to fix next week.
> Popcount cost is used during expand so it is very useful to be slightly more
> correct.
It's useful to set the cost so that all of the special cases still apply - even
if popcount is
relatively fast, it's
Improve costing of ctz - neither the TARGET_CSSC nor the vector case was handled yet.
Passes regress & bootstrap - OK for commit?
gcc:
* config/aarch64/aarch64.cc (aarch64_rtx_costs): Improve CTZ costing.
---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index
Add missing '\' in 2-instruction movsi/di alternatives so that they are
printed on separate lines.
Passes bootstrap and regress, OK for commit once stage 1 reopens?
gcc:
* config/aarch64/aarch64.md (movsi_aarch64): Use '\;' to force
newline in 2-instruction pattern.
Use LDP/STP for large struct types as they have useful immediate offsets and
are typically faster.
This removes differences between little and big endian and allows use of
LDP/STP without UNSPEC.
Passes regress and bootstrap, OK for commit?
gcc:
* config/aarch64/aarch64.cc
Use UZP1 instead of INS when combining low and high halves of vectors.
UZP1 has 3 operands which improves register allocation, and is faster on
some microarchitectures.
Passes regress & bootstrap, OK for commit?
gcc:
* config/aarch64/aarch64-simd.md (aarch64_combine_internal):
According to documentation, '^' should only have an effect during reload.
However ira-costs.cc treats it in the same way as '?' during early costing.
As a result using '^' can accidentally disable valid alternatives and cause
significant regressions (see PR114741). Avoid this by ignoring '^'
A few HWCAP entries are missing from aarch64/cpuinfo.c. This results in build
errors on older machines.
This counts as a trivial build fix, but since it's late in stage 4 I'll let
maintainers chip in.
OK for commit?
libgcc/
* config/aarch64/cpuinfo.c: Add HWCAP_EVTSTRM, HWCAP_CRC32,
As mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648397.html,
do some additional cleanup of the macros and aliases:
Cleanup the macros to add the libat_ prefixes in atomic_16.S. Emit the
alias to __atomic_ when ifuncs are not enabled in the ENTRY macro.
Passes regress and
Hi Richard,
> This description is too brief for me. Could you say in detail how the
> new scheme works? E.g. the description doesn't explain:
>
> -if ARCH_AARCH64_HAVE_LSE128
> -AM_CPPFLAGS = -DHAVE_FEAT_LSE128
> -endif
That is not needed because we can include auto-config.h in
On Thumb-2 the use of CBZ blocks conditional execution, so change the
test to compare with a non-zero value.
gcc/testsuite/ChangeLog:
PR target/113915
* gcc.target/arm/builtin-bswap.x: Fix test to avoid emitting CBZ.
---
diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap.x
Hi Richard,
> Did you test this on a thumb1 target? It seems to me that the target parts
> that you've
> removed were likely related to that. In fact, I don't see why this test
> would need to be changed at all.
The testcase explicitly forces a Thumb-2 target (arm_arch_v6t2). The patterns
Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
Always build atomic_16.S and add aliases to the __atomic_* functions if
!HAVE_IFUNC.
Passes regress and bootstrap, OK for commit?
libatomic:
PR target/113986
* Makefile.in: Regenerated.
*
Hi Richard,
> This bit isn't. The correct fix here is to fix the pattern(s) concerned to
> add the missing predicate.
>
> Note that builtin-bswap.x explicitly mentions predicated mnemonics in the
> comments.
I fixed the patterns in v2. There are likely some more, plus we could likely
merge
Hi Richard,
> It looks like this is really doing two things at once: disabling the
> direct emission of LDP/STP Qs, and switching the GPR handling from using
> pairs of DImode moves to single TImode moves. At least, that seems to be
> the effect of...
No, it still uses TImode for the
By default most patterns can be conditionalized on Arm targets. However
Thumb-2 predication requires the "predicable" attribute be explicitly
set to "yes". Most patterns are shared between Arm and Thumb(-2) and are
marked with "predicable". Given this sharing, it does not make sense to
use a
The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC.
Given the new LDP fusion pass is good at finding LDP opportunities, change the
memcpy, memmove and memset expansions to emit single vector loads/stores.
This fixes the regression and enables more RTL optimization on
Hi Richard,
>> That tune is only used by an obsolete core. I ran the memcpy and memset
>> benchmarks from Optimized Routines on xgene-1 with and without LDP/STP.
>> There is no measurable penalty for using LDP/STP. I'm not sure why it was
>> ever added given it does not do anything useful. I'll
(follow-on based on review comments on
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641913.html)
Remove the tune AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS since it is only
used by an old core and doesn't properly support -Os. SPECINT_2017
shows that removing it has no performance
Hi,
>> Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer
>> ID).
>>
>> Passes regress, OK for commit?
>
> Ok.
Also OK to backport to GCC 13, 12 and 11?
Cheers,
Wilco
Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer ID).
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (AARCH64_CORE): Add 'cobalt-100' CPU.
* config/aarch64/aarch64-tune.md: Regenerated.
* doc/invoke.texi
Hi Richard,
>> + rtx base = strip_offset_and_salt (XEXP (x, 1), );
>
> This should be just strip_offset, so that we don't lose the salt
> during optimisation.
Fixed.
> +
> + if (offset.is_constant ())
> I'm not sure this is really required. Logically the same thing
> would apply to
GCC tends to optimistically create CONST of globals with an immediate offset.
However it is almost always better to CSE addresses of globals and add immediate
offsets separately (the offset could be merged later in single-use cases).
Splitting CONST expressions with an index in
Hi Richard,
>> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
>
> Since this isn't (AFAIK) a standard macro, there doesn't seem to be
> any need to put it in the header file. It could just go at the head
> of aarch64.cc instead.
Sure, I've moved it in v4.
>> + if (len <= 24 ||
Hi Richard,
>> Benchmarking showed that LSE and LSE2 RMW atomics have similar performance
>> once
>> the atomic is acquire, release or both. Given there is already a significant
>> overhead due
>> to the function call, PLT indirection and argument setup, it doesn't make
>> sense to add
>>
Hi,
>> Is there no benefit to using SWPPL for RELEASE here? Similarly for the
>> others.
>
> We started off implementing all possible memory orderings available.
> Wilco saw value in merging less restricted orderings into more
> restricted ones - mainly to reduce codesize in less frequently
v3: rebased to latest trunk
Cleanup memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress & bootstrap.
gcc/ChangeLog:
* config/aarch64/aarch64.h
Hi Richard,
>> Enable lock-free 128-bit atomics on AArch64. This is backwards compatible
>> with
>> existing binaries, gives better performance than locking atomics and is what
>> most users expect.
>
> Please add a justification for why it's backwards compatible, rather
> than just stating
Hi Richard,
> + rtx load[max_ops], store[max_ops];
>
> Please either add a comment explaining why 40 is guaranteed to be
> enough, or (my preference) use:
>
> auto_vec, ...> ops;
I've changed to using auto_vec since that should help reduce conflicts
with Alex' LDP changes. I double-checked
Hi Richard,
Thanks for the review, now committed.
> The new aarch64_split_compare_and_swap code looks a bit twisty.
> The approach in lse.S seems more obvious. But I'm guessing you
> didn't want to spend any time restructuring the pre-LSE
> -mno-outline-atomics code, and I agree the patch in
Hi Richard,
> +/* Maximum bytes set for an inline memset expansion. With -Os use 3 STP
> + and 1 MOVI/DUP (same size as a call). */
> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
> So it looks like this assumes we have AdvSIMD. What about
> -mgeneral-regs-only?
After my strictalign
Hi,
>>> I checked codesize on SPECINT2017, and 96 had practically identical size.
>>> Using 128 would also be a reasonable Os value with a very slight size
>>> increase,
>>> and 384 looks good for O2 - however I didn't want to tune these values
>>> as this
>>> is a cleanup patch.
>>>
>>> Cheers,
Hi Kyrill,
> + if (!(hwcap & HWCAP_CPUID))
> + return false;
> +
> + unsigned long midr;
> + asm volatile ("mrs %0, midr_el1" : "=r" (midr));
> From what I recall that midr_el1 register is emulated by the kernel and so
> userspace software
> has to check that the kernel supports that
Hi Kyrill,
> + /* Reduce the maximum size with -Os. */
> + if (optimize_function_for_size_p (cfun))
> + max_set_size = 96;
> +
> This is a new "magic" number in this code. It looks sensible, but how
> did you arrive at it?
We need 1 instruction to create the value to store (DUP or
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries
ping
From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches; Richard Sandiford
Cc: Kyrylo Tkachov
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use
ping
__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails. In
ping
Cleanup memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc
ping
v2: further cleanups, improved comments
Add support for inline memmove expansions. The generated code is identical
as for memcpy, except that all loads are emitted before stores rather than
being interleaved. The maximum size is 256 bytes which requires at most 16
registers.
Passes
ping
v2: Use UINTVAL, rename max_mops_size.
The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT.
Clean up the condition when to use MOPS.
Passes regress/bootstrap, OK for commit?
v2: Use check-function-bodies in tests
Further improve immediate generation by adding support for 2-instruction
MOV/EOR bitmask immediates. This reduces the number of 3/4-instruction
immediates in SPECCPU2017 by ~2%.
Passes regress, OK for commit?
gcc/ChangeLog:
*
Further improve immediate generation by adding support for 2-instruction
MOV/EOR bitmask immediates. This reduces the number of 3/4-instruction
immediates in SPECCPU2017 by ~2%.
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
Hi Ramana,
> I remember this to be the previous discussions and common understanding.
>
> https://gcc.gnu.org/legacy-ml/gcc/2016-06/msg00017.html
>
> and here
>
> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-02/msg00168.html
>
> Can you point any discussion recently that shows this has changed
Hi Ramana,
>> I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf
>> --build=arm-none-linux-gnueabihf --with-float=hard. However it seems that the
>> default armhf settings are incorrect. I shouldn't need the --with-float=hard
>> since
>> that is obviously implied by armhf,
Hi Ramana,
> Hope this helps.
Yes definitely!
>> Passes regress/bootstrap, OK for commit?
>
> Target ? armhf ? --with-arch , -with-fpu , -with-float parameters ?
> Please be specific.
I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf
--build=arm-none-linux-gnueabihf
The outline atomic functions have hidden visibility and can only be called
directly. Therefore we can remove the BTI at function entry. This improves
security by reducing the number of indirect entry points in a binary.
The BTI markings on the objects are still emitted.
Passes regress, OK for
Hi Ramana,
>> __sync_val_compare_and_swap may be used on 128-bit types and either calls the
>> outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic
>> if
>> the value is stored successfully using STXP, but the current implementations
>> do not perform the store if the
Add support for inline memmove expansions. The generated code is identical
as for memcpy, except that all loads are emitted before stores rather than
being interleaved. The maximum size is 256 bytes which requires at most 16
registers.
Passes regress/bootstrap, OK for commit?
Hi Richard,
> * config/aarch64/aarch64.md (cpymemdi): Remove pattern condition.
> Shouldn't this be a separate patch? It's not immediately obvious that this
> is a necessary part of this change.
You mean this?
@@ -1627,7 +1627,7 @@ (define_expand "cpymemdi"
(match_operand:BLK 1
A MOPS memmove may corrupt registers since there is no copy of the input
operands to temporary registers. Fix this by calling
aarch64_expand_cpymem_mops.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
PR target/21
* config/aarch64/aarch64.md
The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT.
Clean up the condition when to use MOPS.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
PR target/103100
*
Hi Richard,
>> Note that aarch64_internal_mov_immediate may be called after reload,
>> so it would end up even more complex.
>
> The sequence I quoted was supposed to work before and after reload. The:
>
> rtx tmp = aarch64_target_reg (dest, DImode);
>
> would create a fresh
Hi Richard,
> I was worried that reusing "dest" for intermediate results would
> prevent CSE for cases like:
>
> void g (long long, long long);
> void
> f (long long *ptr)
> {
> g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL);
> }
Note that aarch64_internal_mov_immediate may be called after
Support immediate expansion of immediates which can be created from 2 MOVKs
and a shifted ORR or BIC instruction. Change aarch64_split_dimode_const_store
to apply if we save one instruction.
This reduces the number of 4-instruction immediates in SPECINT/FP by 5%.
Passes regress, OK for commit?
__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails. In this case
List official cores first so that -mcpu=native does not show a codename with -v
or in errors/warnings.
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares.
(neoverse-v1): Place before zeus.
(neoverse-v2): Place
The v7 memory ordering model allows reordering of conditional atomic
instructions.
To avoid this, make all atomic patterns unconditional. Expand atomic loads and
stores for all architectures so the memory access can be wrapped into an UNSPEC.
Passes regress/bootstrap, OK for commit?
Hi Richard,
(that's quick!)
> + if (size > max_copy_size || size > max_mops_size)
> +return aarch64_expand_cpymem_mops (operands, is_memmove);
>
> Could you explain this a bit more? If I've followed the logic correctly,
> max_copy_size will always be 0 for movmem, so this "if" condition
A MOPS memmove may corrupt registers since there is no copy of the input
operands to temporary
registers. Fix this by calling aarch64_expand_cpymem which does this. Also
fix an issue with
STRICT_ALIGNMENT being ignored if TARGET_MOPS is true, and avoid crashing or
generating a huge
expansion
Hi Richard,
>>> Answering my own question, N1 does not officially have FEAT_LSE2.
>>
>> It doesn't indeed. However most cores support atomic 128-bit load/store
>> (part of LSE2), so we can still use the LSE2 ifunc for those cores. Since
>> there
>> isn't a feature bit for this in the CPU or
Hi Richard,
>> Why would HWCAP_USCAT not be set by the kernel?
>>
>> Failing that, I would think you would check ID_AA64MMFR2_EL1.AT.
>>
> Answering my own question, N1 does not officially have FEAT_LSE2.
It doesn't indeed. However most cores support atomic 128-bit load/store
(part of LSE2), so
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse
cores.
Passes regress, OK for commit?
libatomic/
* config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc
selection.
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries, gives better performance than locking atomics and is what
most users expect.
Note 128-bit atomic loads use a load/store exclusive loop if LSE2 is not
supported.
This results in an implicit store
ping
From: Wilco Dijkstra
Sent: 23 February 2023 15:11
To: GCC Patches
Cc: Richard Sandiford; Kyrylo Tkachov
Subject: [PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]
The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP -
without it, it effectively has
The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP -
without it, it effectively has Load-AcquirePC semantics similar to LDAPR,
which is less restrictive than what __ATOMIC_SEQ_CST requires. This patch
fixes this and adds comments to make it easier to see which sequence is
Hi,
>> + /* Return-address signing state is toggled by DW_CFA_GNU_window_save
>> (where
>> + REG_UNDEFINED means enabled), or set by a DW_CFA_expression. */
>
> Needs updating to REG_UNSAVED_ARCHEXT.
>
> OK with that changes, thanks, and sorry for the delays & runaround.
Thanks, I've
Hi,
> @Wilco, can you please send the rebased patch for review? We would
> need it in our openSUSE package soon.
Here is an updated and rebased version:
Cheers,
Wilco
v4: rebase and add REG_UNSAVED_ARCHEXT.
A recent change only initializes the regs.how[] during Dwarf unwinding
which
Hi,
> On 1/10/23 19:12, Jakub Jelinek via Gcc-patches wrote:
>> Anyway, the sooner this makes it into gcc trunk, the better, it breaks quite
>> a lot of stuff.
>
> Yep, please, we're also waiting for this patch for pushing to our gcc13
> package.
Well I'm waiting for an OK from a maintainer...
Hi Szabolcs,
> i would keep the assert: how[reg] must be either UNSAVED or UNDEFINED
> here, other how[reg] means the toggle cfi instruction is mixed with
> incompatible instructions for the pseudo reg.
>
> and i would add a comment about this e.g. saying that UNSAVED/UNDEFINED
> how[reg] is used
Hi Richard,
> Hmm, but the point of the original patch was to support code generators
> that emit DW_CFA_val_expression instead of DW_CFA_AARCH64_negate_ra_state.
> Doesn't this patch undo that?
Well, it wasn't clear from the code or comments that this was supported. I've
added that back in v2.
>
Enable TARGET_CONST_ANCHOR to allow complex constants to be created via
immediate add.
Use a 24-bit range as that enables a 3 or 4-instruction immediate to be
replaced by
2 additions. Fix the costing of immediate add to support 24-bit immediate and
12-bit shifted
immediates. The generated
Hi Andreas,
Thanks for the report, I've committed the fix:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006
Cheers,
Wilco
Ensure we only pass SI/DImode, which fixes the assert.
Committed as obvious.
gcc/
PR target/108006
* config/aarch64/aarch64.c (aarch64_expand_sve_const_vector):
Fix call to aarch64_move_imm to use SI/DI.
---
diff --git a/gcc/config/aarch64/aarch64.cc
Hi,
> i don't think how[*RA_STATE] can ever be set to REG_SAVED_OFFSET,
> this pseudo reg is not spilled to the stack, it is reset to 0 in
> each frame and then toggled within a frame.
It is just a state; we can use any state we want since it is a pseudo reg.
These registers are global and
Hi Richard,
> - scalar_int_mode imode = (mode == HFmode
> - ? SImode
> - : int_mode_for_mode (mode).require ());
> + machine_mode imode = (mode == DFmode) ? DImode : SImode;
> It looks like this might mishandle DDmode, if not now
A recent change only initializes the regs.how[] during Dwarf unwinding
which resulted in an uninitialized offset used in return address signing
and random failures during unwinding. The fix is to use REG_SAVED_OFFSET
as the state where the return address signing bit is valid, and if the
state is
Hi Richard,
> Just to make sure I understand: isn't it really just MOVN? I would have
> expected a 32-bit MOVZ to be equivalent to (and add no capabilities over)
> a 64-bit MOVZ.
The 32-bit MOVZ immediates are equivalent, MOVN never overlaps, and
MOVI has some overlaps. Since we allow all 3
Hi Richard,
>> A smart reassociation pass could form more FMAs while also increasing
>> parallelism, but the way it currently works always results in fewer FMAs.
>
> Yeah, as Richard said, that seems the right long-term fix.
> It would also avoid the hack of treating PLUS_EXPR as a signal
> of an
Hi Richard,
> I guess an obvious question is: if 1 (rather than 2) was the right value
> for cores with 2 FMA pipes, why is 4 the right value for cores with 4 FMA
> pipes? It would be good to clarify how, conceptually, the core property
> should map to the fma_reassoc_width value.
1 turns off
Hi Richard,
> Can you go into more detail about:
>
> Use :option:`-mdirect-extern-access` either in shared libraries or in
> executables, but not in both. Protected symbols used both in a shared
> library and executable may cause linker errors or fail to work correctly
>
> If this is
Add a new option -mdirect-extern-access similar to other targets. This removes
GOT indirections on external symbols with -fPIE, resulting in significantly
better code quality. With -fPIC it only affects protected symbols, allowing
for more efficient shared libraries which can be linked with
Add support for AArch64 LSE and LSE2 to libatomic. Disable outline atomics,
and use LSE ifuncs for 1-8 byte atomics and LSE2 ifuncs for 16-byte atomics.
On Neoverse V1, 16-byte atomics are ~4x faster due to avoiding locks.
Note this is safe since we swap all 16-byte atomics using the same ifunc,
Add a reassocation width for FMAs in per-CPU tuning structures. Keep the
existing setting for cores with 2 FMA pipes, and use 4 for cores with 4
FMA pipes. This improves SPECFP2017 on Neoverse V1 by ~1.5%.
Passes regress/bootstrap, OK for commit?
gcc/
PR 107413
*
Committed as trivial fix.
gcc/testsuite/
* gcc.target/aarch64/mgeneral-regs_3.c: Fix testcase.
---
diff --git a/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c
b/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c
index
Hi Richard,
Here is the immediate cleanup splitoff from the previous patch:
Simplify, refactor and improve various move immediate functions.
Allow 32-bit MOVZ/N as a valid 64-bit immediate which removes special
cases in aarch64_internal_mov_immediate. Add new constraint so the movdi
pattern
Hi Richard,
> Can you do the aarch64_mov_imm changes as a separate patch? It's difficult
> to review the two changes folded together like this.
Sure, I'll send a separate patch. So here is version 2 again:
[PATCH v2][AArch64] Improve immediate expansion [PR106583]
Improve immediate expansion
ping
Hi Richard,
>>> Sounds good, but could you put it before the mode version,
>>> to avoid the forward declaration?
>>
>> I can swap them around but the forward declaration is still required as
>> aarch64_check_bitmask is 5000 lines before aarch64_bitmask_imm.
>
> OK, how about moving them