[Bug libstdc++/104928] std::counting_semaphore on Linux can sleep forever

2023-12-11 Thread nate at thatsmathematics dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104928

--- Comment #7 from Nate Eldredge  ---
@Jonathan: Done,
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640119.html (sorry, may
not be linked to original threads).

[Bug libstdc++/104928] std::counting_semaphore on Linux can sleep forever

2023-12-10 Thread nate at thatsmathematics dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104928

--- Comment #5 from Nate Eldredge  ---
Oh wait, disregard that last, I realized that I only applied one of the two
patches.  Let me try again.

[Bug libstdc++/104928] std::counting_semaphore on Linux can sleep forever

2023-12-10 Thread nate at thatsmathematics dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104928

--- Comment #4 from Nate Eldredge  ---
@Jonathan: I think that patch set is on the right track, but it has some other
serious bugs.  For one, __atomic_wait_address calls __detail::__wait_impl with
__args._M_old uninitialized (g++ -O3 -Wall catches it).  There is another
uninitialized warning about __wait_addr that I haven't yet confirmed.  

Lastly, in __wait_impl, there is a test
`if (__args & __wait_flags::__spin_only)`, but __spin_only has two bits
set, one of which is __do_spin.  So in effect, __do_spin (which is set by
default on Linux) is taken to imply __spin_only, with the result that it
*only* ever spins, without ever sleeping.  Thus every semaphore (and maybe
other waits too) becomes a spinlock, which is Not Good.
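
To illustrate the masking pitfall (the flag values below are assumptions
for the sketch, not the actual libstdc++ definitions): testing for any-bit
overlap with a multi-bit mask fires whenever just one of its bits is set.

#include <stdio.h>

/* Hypothetical flag values, for illustration only. */
enum wait_flags {
    do_spin   = 1u << 0,             /* spin first, then sleep (Linux default) */
    spin_only = (1u << 1) | do_spin, /* two bits set, one of them do_spin */
};

int main(void) {
    unsigned args = do_spin;              /* spin_only was NOT requested */
    if (args & spin_only)                 /* wrong: true whenever do_spin is set */
        puts("treated as spin-only: spins forever, never sleeps");
    if ((args & spin_only) == spin_only)  /* correct: requires both bits */
        puts("actually spin-only");
    return 0;
}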

Should I take this up on gcc-patches, or elsewhere?

[Bug libstdc++/104928] std::counting_semaphore on Linux can sleep forever

2023-12-09 Thread nate at thatsmathematics dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104928

--- Comment #2 from Nate Eldredge  ---
This bug is still present.  Tested and reproduced with g++ 13.1.0 (Ubuntu
package), and by inspection of the source code, it's still in the trunk as
well.

Encountered on StackOverflow:
https://stackoverflow.com/questions/77626624/race-condition-in-morriss-algorithm

[Bug target/110780] New: aarch64 NEON redundant displaced ld3

2023-07-23 Thread nate at thatsmathematics dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110780

Bug ID: 110780
   Summary: aarch64 NEON redundant displaced ld3
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nate at thatsmathematics dot com
  Target Milestone: ---

Compile the following with gcc 14.0.0 20230723 on aarch64 with -O3:

#include <stdint.h>
void CSI2toBE12(uint8_t* pCSI2, uint8_t* pBE, uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        pBE[0] = pCSI2[0];
        pBE[1] = ((pCSI2[2] & 0xf) << 4) | (pCSI2[1] >> 4);
        pBE[2] = ((pCSI2[1] & 0xf) << 4) | (pCSI2[2] >> 4);
        pCSI2 += 3;
        pBE += 3;
    }
}

Godbolt: https://godbolt.org/z/WshTPKzY5

In the inner loop (.L5 of the godbolt asm) we have

        ld3     {v25.16b - v27.16b}, [x3]
        add     x6, x3, 1
        // no intervening stores
        ld3     {v25.16b - v27.16b}, [x6]

The second load is redundant: its v25 and v26 receive the same values that
the first load already placed in v26 and v27 respectively.  The value
loaded into v27 is new, but it is not used by the subsequent code.
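
As a quick sanity check of that claim (this little model is mine, not part
of the report): treat ld3 as a deinterleaving load where lane i of vector
k holds p[3*i + k], and compare the two loads.

#include <stdint.h>
#include <assert.h>

/* Model ld3 {vA.16b - vC.16b}, [p]: lane i of vector k gets p[3*i + k]. */
static void ld3_16b(const uint8_t *p, uint8_t v[3][16]) {
    for (int i = 0; i < 16; i++)
        for (int k = 0; k < 3; k++)
            v[k][i] = p[3 * i + k];
}

int main(void) {
    uint8_t buf[64], a[3][16], b[3][16];
    for (int i = 0; i < 64; i++)
        buf[i] = (uint8_t)i;
    ld3_16b(buf, a);      /* first load, at x3      */
    ld3_16b(buf + 1, b);  /* second load, at x3 + 1 */
    for (int i = 0; i < 16; i++) {
        assert(b[0][i] == a[1][i]);  /* new v25 == old v26 */
        assert(b[1][i] == a[2][i]);  /* new v26 == old v27 */
    }
    return 0;
}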

This might also account for some of the extra complexity later on, because
it means that the last 48 bytes of the input can't be handled by this loop
(otherwise the second load would read one byte out of bounds) and so must
be handled specially.

[Bug target/30527] Use of input/output operands in __asm__ templates not fully documented

2023-01-21 Thread nate at thatsmathematics dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30527

Nate Eldredge  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nate at thatsmathematics dot com

--- Comment #8 from Nate Eldredge  ---
For arm/aarch64 in particular, there is an extra wrinkle: armclang *does*
support and document the template modifiers.  See
https://developer.arm.com/documentation/100067/0610/armclang-Inline-Assembler/Inline-assembly-template-modifiers?lang=en.
As best I can tell, they are exactly the same as what gcc already supports
(albeit undocumented).

So that makes this into a compatibility issue.  People may be writing code for
armclang using the modifiers, and then want to build with gcc instead.  In
practice it will work fine, but from the gcc docs, you wouldn't know it.

On these targets, some of the modifiers are pretty important, and there are
fairly basic things that you simply can't do without them.  For example, on
aarch64, the b/h/s/d/q modifiers give the names of the various scalar views
of a vector register (v15 -> b15 / h15 / s15 / d15 / q15).  It's simply
impossible to write any scalar floating-point asm without them, or SIMD
code using the "across vector" instructions like ADDV, which need a scalar
output operand.

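For instance, a minimal sketch (the function is my own; the %b behavior
matches what armclang documents): summing the lanes of a byte vector with
ADDV requires naming the b-register view of the output.

#include <arm_neon.h>

/* ADDV writes a scalar B register.  The 'b' template modifier prints
   operand 0 as b<n> rather than v<n>; without it there is no way to
   spell this instruction in gcc inline asm. */
static inline uint8_t sum_lanes(uint8x16_t v) {
    uint8x16_t out;
    __asm__("addv %b0, %1.16b" : "=w"(out) : "w"(v));
    return vgetq_lane_u8(out, 0);
}
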
Or, the c modifier, which suppresses the leading # on an immediate.  This
one is documented for x86, where the need for it is similarly obvious, but
there is no indication in the docs that it works on arm/aarch64 as well.
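
Something along these lines, a sketch for 32-bit arm (label and constant
are arbitrary; plain %0 would print "#42" there and break the directive):

/* Embed a constant as data after a branch.  The .word directive needs a
   bare number, so %c0 strips the '#' that %0 would emit. */
static inline void emit_marker(void) {
    __asm__ volatile ("b 1f\n\t"
                      ".word %c0\n"
                      "1:" :: "i"(42));
}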

I really do think it would be a good idea for these to become officially
supported and documented by gcc, at least for these targets.

[Bug target/104110] New: AArch64 unnecessary use of call-preserved register

2022-01-18 Thread nate at thatsmathematics dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104110

Bug ID: 104110
   Summary: AArch64 unnecessary use of call-preserved register
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nate at thatsmathematics dot com
  Target Milestone: ---

gcc misses an optimization (or in some sense deoptimizes) by using a
call-preserved register to save a trivial constant across a function call.

Source code:

void bar(unsigned);
unsigned foo(unsigned c) {
    bar(1U << c);
    return 1;
}

Output from gcc -O3 on AArch64:

foo:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        str     x19, [sp, 16]
        mov     w19, 1
        lsl     w0, w19, w0
        bl      bar
        mov     w0, w19
        ldr     x19, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret

Note that x19 is used unnecessarily to save the constant 1 across the function
call, causing an unnecessary push and pop.  It would have been better to just
use some call-clobbered register for the constant 1 before the function call,
and then a simple `mov w0, 1` afterward.

Same behavior with -O, -O2, -Os.  Tested on godbolt, affects yesterday's trunk
and all the way back to 5.4.

Might be related to bug 70801 or bug 71768 but I am not sure.

[Bug target/104039] New: AArch64 Redundant instruction moving general to vector register

2022-01-14 Thread nate at thatsmathematics dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104039

Bug ID: 104039
   Summary: AArch64 Redundant instruction moving general to vector
register
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nate at thatsmathematics dot com
  Target Milestone: ---

Compiling the following code on AArch64 with -O2 or -O3:

typedef unsigned long u64x2 __attribute__((vector_size(16)));

u64x2 combine(unsigned long a, unsigned long b) {
    u64x2 v = {a, b};
    return v;
}

yields the following assembly:

combine:
        fmov    d0, x0
        ins     v0.d[1], x1
        ins     v0.d[1], x1
        ret

where the second ins is entirely redundant with the first and serves no
apparent purpose.  (Unless it is something extremely clever...)

This seems to be a regression from 8.x to 9.x; Godbolt's 8.5 looks correct,
with just one ins, but 9.3 has two.

Originally noticed by Peter Cordes on StackOverflow:
https://stackoverflow.com/questions/70717360/how-to-load-vector-registers-from-integer-registers-in-arm64-m1/70718572#comment125016906_70717360

[Bug other/97473] Spilled function parameters not aligned properly on multiple non-x86 targets

2020-10-18 Thread nate at thatsmathematics dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97473

--- Comment #2 from Nate Eldredge  ---
Possibly related to bug 84877 ?

[Bug other/97473] New: Spilled function parameters not aligned properly on multiple non-x86 targets

2020-10-17 Thread nate at thatsmathematics dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97473

Bug ID: 97473
   Summary: Spilled function parameters not aligned properly on
multiple non-x86 targets
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nate at thatsmathematics dot com
  Target Milestone: ---

Created attachment 49394
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49394&action=edit
Test case

Suppose we have a type V requiring alignment, such as with
__attribute__((aligned(N))).  In current versions of gcc, both 10.2 and recent
trunk, it appears that local (auto) variables of this type are properly aligned
on the stack, at least on all the targets I tested.  However, on many targets
other than x86, alignment is apparently not respected for function parameters
of this type when their address is taken. 

The function parameter may actually be passed in a register, in which case when
its address is taken, it must be spilled to the stack.  But on the failing
targets, the spilled copy is not sufficiently aligned, and so for instance,
other functions which receive a pointer to this variable will find it does not
have the alignment that it should.

I'm not sure if this is a bug or a limitation, but it's quite counterintuitive,
since function parameters generally can be treated like local variables for
most other purposes.  I couldn't find any mention of this in the documentation
or past bug reports.

This can be reproduced by a very short C example like the following:

typedef int V __attribute__((aligned(64)));

void g(V *);

void f(V x) {
    g(&x);
}

The function g can get a pointer that is not aligned to 64 bytes.  A more
complete test case is attached, which I tested mainly on ARM and AArch64 with
gcc 10.2 and also trunk.  It seems to happen with or without optimization, so
long as one prevents IPA of g.  Inspection of the assembly shows gcc does not
generate any code to align the objects beyond the stack alignment guaranteed by
the ABI (8 bytes for ARM, 16 bytes for AArch64).
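
For reference, a minimal standalone harness along these lines (my own
reconstruction, not the attached testcase; noinline stands in for
"preventing IPA of g"):

#include <stdint.h>
#include <stdio.h>

typedef int V __attribute__((aligned(64)));

__attribute__((noinline)) void g(V *p) {
    if ((uintptr_t)p % 64 != 0)
        printf("misaligned parameter spill: %p\n", (void *)p);
}

__attribute__((noinline)) void f(V x) {
    g(&x);  /* x may be spilled to an insufficiently aligned slot */
}

int main(void) {
    V x = 0;
    f(x);
    return 0;
}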

It fails on (complete gcc -v output below):

- aarch64-linux-gnu 10.2.0 and trunk from today
- arm-linux-gnueabihf 10.2.0 and trunk from last week
- alpha-linux-gnu 10.2.0
- sparc64-linux-gnu 10.2.0
- mips-linux-gnu 10.2.0

It succeeds on:

- x86_64-linux-gnu 10.2.0, also with -m32

On x86_64-linux-gnu, gcc generates instructions to align the stack and place
the spilled copy of x at an aligned address, and the testcase passes there. 
(Perhaps this was implemented to support AVX?)  With -m32 it copies x from its
original unaligned position on the stack into an aligned stack slot.

As noted, auto variables of the same type do get proper alignment on all the
platforms I tested, and so one can work around with `V tmp = x; g(&tmp);`.
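
Spelled out, the workaround is just:

void f_workaround(V x) {
    V tmp = x;  /* locals of type V do get the requested alignment */
    g(&tmp);    /* so this pointer is properly aligned */
}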

For what it's worth, clang on ARM and AArch64 does align the spilled copies.

I was not sure which pass of the compiler is responsible for this so I just
chose component "other".  I didn't think "target" was appropriate as this
affects many targets, though not all.

This issue was brought to my attention by StackOverflow user Alf (thanks!), see
https://stackoverflow.com/questions/64287587/memory-alignment-issues-with-gcc-vector-extension-and-arm-neon.
Alf's original program was in C++ for ARM32 with NEON and the hard-float
ABI, and involved mixing functions that passed vector types (like
int32x4_t) either by value or by reference.  In this setting they can be passed by value in
SIMD registers, but in memory they require 16-byte alignment.  This was
violated, resulting in bus errors at runtime.  So there is "real life" code
affected by this.

I tried including full `gcc -v` output from all versions tested, but it seems
to be triggering the bugzilla spam filter, so I'm omitting it.  Hopefully it
isn't needed, but let me know if it is.