[Bug libstdc++/104928] std::counting_semaphore on Linux can sleep forever
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104928 --- Comment #7 from Nate Eldredge --- @Jonathan: Done, https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640119.html (sorry, may not be linked to original threads).
[Bug libstdc++/104928] std::counting_semaphore on Linux can sleep forever
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104928 --- Comment #5 from Nate Eldredge --- Oh wait, disregard that last, I realized that I only applied one of the two patches. Let me try again.
[Bug libstdc++/104928] std::counting_semaphore on Linux can sleep forever
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104928 --- Comment #4 from Nate Eldredge --- @Jonathan: I think that patch set is on the right track, but it has some other serious bugs. For one, __atomic_wait_address calls __detail::__wait_impl with __args._M_old uninitialized (g++ -O3 -Wall catches it). There is another uninitialized warning about __wait_addr that I haven't yet confirmed. Lastly, in __wait_impl, there is a test `if (__args & __wait_flags::__spin_only)`, but __spin_only has two bits set, one of which is __do_spin. So in effect, __do_spin (which is set by default on Linux) is taken to imply __spin_only, with the result that it *only* ever spins, without ever sleeping. Thus every semaphore (and maybe other waits too) becomes a spinlock, which is Not Good. Should I take this up on gcc-patches, or elsewhere?
[Bug libstdc++/104928] std::counting_semaphore on Linux can sleep forever
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104928 --- Comment #2 from Nate Eldredge --- This bug is still present. Tested and reproduced with g++ 13.1.0 (Ubuntu package), and by inspection of the source code, it's still in the trunk as well. Encountered on StackOverflow: https://stackoverflow.com/questions/77626624/race-condition-in-morriss-algorithm
[Bug target/110780] New: aarch64 NEON redundant displaced ld3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110780 Bug ID: 110780 Summary: aarch64 NEON redundant displaced ld3 Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: nate at thatsmathematics dot com Target Milestone: ---

Compile the following with gcc 14.0.0 20230723 on aarch64 with -O3:

    #include <stdint.h>

    void CSI2toBE12(uint8_t* pCSI2, uint8_t* pBE, uint8_t* pCSI2LineEnd)
    {
        while (pCSI2 < pCSI2LineEnd) {
            pBE[0] = pCSI2[0];
            pBE[1] = ((pCSI2[2] & 0xf) << 4) | (pCSI2[1] >> 4);
            pBE[2] = ((pCSI2[1] & 0xf) << 4) | (pCSI2[2] >> 4);
            pCSI2 += 3;
            pBE += 3;
        }
    }

Godbolt: https://godbolt.org/z/WshTPKzY5

In the inner loop (.L5 of the godbolt asm) we have:

        ld3     {v25.16b - v27.16b}, [x3]
        add     x6, x3, 1
        // no intervening stores
        ld3     {v25.16b - v27.16b}, [x6]

The second load is redundant: v25 and v26 receive the same values that the first load already placed in v26 and v27 respectively, and the value it loads into v27 is new but unused by the subsequent code. This might also account for some extra complexity later on, because it means the last 48 bytes of the input can't be handled by this loop (otherwise the second load would read one byte out of bounds) and so must be handled specially.
[Bug target/30527] Use of input/output operands in __asm__ templates not fully documented
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30527 Nate Eldredge changed: What|Removed |Added CC||nate at thatsmathematics dot com --- Comment #8 from Nate Eldredge --- For arm/aarch64 in particular, there is an extra wrinkle: armclang *does* support and document the template modifiers. See https://developer.arm.com/documentation/100067/0610/armclang-Inline-Assembler/Inline-assembly-template-modifiers?lang=en. As best I can tell, they are exactly the same as what gcc already supports (albeit undocumented). So this becomes a compatibility issue: people may write code for armclang using the modifiers, and then want to build it with gcc instead. In practice it will work fine, but from the gcc docs, you wouldn't know it. On these targets, some of the modifiers are pretty important, and there are fairly basic things that you simply can't do without them. For example, on aarch64, the b/h/s/d/q modifiers give the names of the various scalar pieces of a vector register (v15 -> b15 / h15 / s15 / d15 / q15). It's simply impossible to write any scalar floating-point asm without this, or SIMD code using the "across vector" instructions like ADDV, which need a scalar output operand. Or the c modifier to suppress the leading # on an immediate: this one is documented for x86, where the need for it is similarly obvious, but there is no indication in the docs that it works on arm/aarch64 as well. I really do think it would be a good idea for these to become officially supported and documented by gcc, at least for these targets.
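As an illustration of the ADDV case mentioned above, here is a hedged sketch (aarch64-only, so it will not compile on other targets; the operand syntax follows the armclang documentation linked above, and the helper name is made up for this example):

```c
#include <arm_neon.h>

/* Sum the lanes of a vector with ADDV, whose destination must be
   written as a *scalar* register name.  Plain %0 would print as v0
   and the assembler would reject the instruction; the %b modifier
   prints the low byte alias, e.g. b0. */
static inline uint8_t sum_lanes(uint8x8_t v)
{
    uint8x8_t out;
    asm("addv %b0, %1.8b" : "=w"(out) : "w"(v));
    return vget_lane_u8(out, 0);
}
```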
[Bug target/104110] New: AArch64 unnecessary use of call-preserved register
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104110 Bug ID: 104110 Summary: AArch64 unnecessary use of call-preserved register Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: nate at thatsmathematics dot com Target Milestone: ---

gcc misses an optimization (or in some sense deoptimizes) by using a call-preserved register to save a trivial constant across a function call. Source code:

    void bar(unsigned);

    unsigned foo(unsigned c)
    {
        bar(1U << c);
        return 1;
    }

Output from gcc -O3 on AArch64:

    foo:
            stp     x29, x30, [sp, -32]!
            mov     x29, sp
            str     x19, [sp, 16]
            mov     w19, 1
            lsl     w0, w19, w0
            bl      bar
            mov     w0, w19
            ldr     x19, [sp, 16]
            ldp     x29, x30, [sp], 32
            ret

Note that x19 is used unnecessarily to save the constant 1 across the function call, causing an unnecessary push and pop. It would have been better to use some call-clobbered register for the constant 1 before the function call, and then a simple `mov w0, 1` afterward. Same behavior with -O, -O2, -Os. Tested on godbolt; affects yesterday's trunk and all the way back to 5.4. Might be related to bug 70801 or bug 71768, but I am not sure.
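For comparison, a hand-written sketch of the codegen the report is asking for (assumed, not actual compiler output): keeping the constant in the call-clobbered w1 means no callee-saved register needs to be spilled and reloaded.

```
foo:
        stp     x29, x30, [sp, -16]!
        mov     x29, sp
        mov     w1, 1
        lsl     w0, w1, w0
        bl      bar
        mov     w0, 1
        ldp     x29, x30, [sp], 16
        ret
```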
[Bug target/104039] New: AArch64 Redundant instruction moving general to vector register
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104039 Bug ID: 104039 Summary: AArch64 Redundant instruction moving general to vector register Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: nate at thatsmathematics dot com Target Milestone: ---

Compiling the following code on AArch64 with -O2 or -O3:

    typedef unsigned long u64x2 __attribute__((vector_size(16)));

    u64x2 combine(unsigned long a, unsigned long b)
    {
        u64x2 v = {a, b};
        return v;
    }

yields the following assembly:

    combine:
            fmov    d0, x0
            ins     v0.d[1], x1
            ins     v0.d[1], x1
            ret

where the second ins is entirely redundant with the first and serves no apparent purpose (unless it is something extremely clever...). This seems to be a regression from 8.x to 9.x: Godbolt's 8.5 looks correct with just one ins, but 9.3 has the two. Originally noticed by Peter Cordes on StackOverflow: https://stackoverflow.com/questions/70717360/how-to-load-vector-registers-from-integer-registers-in-arm64-m1/70718572#comment125016906_70717360
[Bug other/97473] Spilled function parameters not aligned properly on multiple non-x86 targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97473 --- Comment #2 from Nate Eldredge --- Possibly related to bug 84877 ?
[Bug other/97473] New: Spilled function parameters not aligned properly on multiple non-x86 targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97473 Bug ID: 97473 Summary: Spilled function parameters not aligned properly on multiple non-x86 targets Product: gcc Version: 11.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: other Assignee: unassigned at gcc dot gnu.org Reporter: nate at thatsmathematics dot com Target Milestone: ---

Created attachment 49394 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49394&action=edit Test case

Suppose we have a type V requiring alignment, such as with __attribute__((aligned(N))). In current versions of gcc, both 10.2 and recent trunk, it appears that local (auto) variables of this type are properly aligned on the stack, at least on all the targets I tested. However, on many targets other than x86, alignment is apparently not respected for function parameters of this type when their address is taken. The function parameter may actually be passed in a register, in which case when its address is taken, it must be spilled to the stack. But on the failing targets, the spilled copy is not sufficiently aligned, and so for instance, other functions which receive a pointer to this variable will find it does not have the alignment that it should. I'm not sure if this is a bug or a limitation, but it's quite counterintuitive, since function parameters generally can be treated like local variables for most other purposes. I couldn't find any mention of this in the documentation or past bug reports. This can be reproduced by a very short C example like the following:

    typedef int V __attribute__((aligned(64)));

    void g(V *);

    void f(V x)
    {
        g(&x);
    }

The function g can get a pointer that is not aligned to 64 bytes. A more complete test case is attached, which I tested mainly on ARM and AArch64 with gcc 10.2 and also trunk. It seems to happen with or without optimization, so long as one prevents IPA of g.
Inspection of the assembly shows gcc does not generate any code to align the objects beyond the stack alignment guaranteed by the ABI (8 bytes for ARM, 16 bytes for AArch64).

It fails on (complete gcc -v output below):
- aarch64-linux-gnu 10.2.0 and trunk from today
- arm-linux-gnueabihf 10.2.0 and trunk from last week
- alpha-linux-gnu 10.2.0
- sparc64-linux-gnu 10.2.0
- mips-linux-gnu 10.2.0

It succeeds on:
- x86_64-linux-gnu 10.2.0, also with -m32

On x86_64-linux-gnu, gcc generates instructions to align the stack and place the spilled copy of x at an aligned address, and the testcase passes there. (Perhaps this was implemented to support AVX?) With -m32 it copies x from its original unaligned position on the stack into an aligned stack slot. As noted, auto variables of the same type do get proper alignment on all the platforms I tested, and so one can work around the problem with `V tmp = x; g(&tmp);`. For what it's worth, clang on ARM and AArch64 does align the spilled copies. I was not sure which pass of the compiler is responsible for this, so I just chose component "other". I didn't think "target" was appropriate, as this affects many targets, though not all. This issue was brought to my attention by StackOverflow user Alf (thanks!), see https://stackoverflow.com/questions/64287587/memory-alignment-issues-with-gcc-vector-extension-and-arm-neon. Alf's original program was in C++ for ARM32 with NEON and the hard-float ABI, and involved mixing functions that passed vector types (like int32x4_t) either by value or by reference. In this setting they can be passed by value in SIMD registers, but in memory they require 16-byte alignment. This was violated, resulting in bus errors at runtime. So there is "real life" code affected by this. I tried including full `gcc -v` output from all versions tested, but it seems to be triggering the bugzilla spam filter, so I'm omitting it. Hopefully it isn't needed, but let me know if it is.