https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835
Bug ID: 80835
Summary: Reading a member of an atomic<struct> can load just that member, not the whole struct
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

For std::atomic<pointer_and_counter> or similar small structs, accessing just one member with foo.load().m compiles to an atomic load of the entire object. This is very slow for a 16B object on x86 (lock cmpxchg16b). It's also not free for 8-byte objects in 32-bit code (bouncing through SSE).

I think it's safe to optimize partial accesses to an atomic object into narrow loads. We need to be sure that:

* it's still atomic
* it still Synchronizes With release-stores of the whole object (i.e. that other threads' ops become globally visible to us in the right order)
* it can't lead to any of this thread's operations becoming globally visible in the wrong order

For larger (non-lock-free) objects, obviously once you hold the lock you can copy as much or as little as you want, but I think libatomic would need new entry points to support small copies out of larger objects. Acquiring the same lock that the writer used gives us acquire semantics for the load.

----------

For lock-free objects, foo.load(mo_relaxed).m can be a narrow load if a load with that size & alignment is still atomic. There are no ordering requirements for the load itself, except that it's still ordered by fences and other acq/rel operations (which is the case).

mo_acquire loads need to Synchronize With release-stores of the whole object. AFAIK, all architectures that use MESI-style coherent caches synchronize based on cache lines, not object addresses. So an acquire-load Synchronizes With a release-store if they both touch the same cache line, even if they use different addresses and sizes.
Thus it's safe to use a narrow acquire-load on part of an object, even if the object spans two cache lines but the load only touches one. (Note that for x86, cmpxchg16b requires a 16B-aligned memory operand, so a cache-line split would only be possible for an 8B or smaller object, in which case we'd need lock cmpxchg loads and xchg stores, and those would be even slower than usual because of the cl-split.) For platforms other than x86, there may be issues if synchronization primitives only synchronize a particular address, but I'm not familiar with them.

------------

You can use a union of an atomic<struct> and a struct of atomic<> members to implement efficient read-only access to a single member with current gcc/clang; this optimization is safe. See http://stackoverflow.com/questions/38984153/implement-aba-counter-with-c11-cas/38991835#38991835. It would be nice if gcc would do the optimization for us on targets where it really is safe.

If someone uses a union to store to part of an atomic object, the resulting UB is their problem. Lock-free atomic stores must always store the whole object. It usually doesn't make sense to store less than the whole object anyway, and C++ std::atomic has no way to get an lvalue referring to a single member of an atomic<struct>.

---------

One worry I had was Intel's x86 manual saying:

> 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
> The Intel-64 memory-ordering model allows a load to be reordered with an
> earlier store to a different location. However, loads are not reordered with
> stores to the same location.

We know (from http://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it) that there can be reordering when the load and store addresses don't match exactly, even if they overlap. But it seems Intel only means this within a single thread, i.e. that a thread observes its own operations to happen in program order.
Store-forwarding allows a store/reload to become globally visible load-first. But I think if the store fully overlaps the load (the opposite of what the SO question is about), there's no way you can observe that reordering, because all the load data comes from the store. That just means the store and reload are adjacent to each other in the total order, with no way for a store from another thread to sneak in between them. See the discussion in comments for more about this:
http://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it/35910141?noredirect=1#comment75186782_35910141

-----

A simplified example of loading one member of a pointer+counter struct:

// https://godbolt.org/g/k063wH for the code with a union, from the SO Q&A.
// https://godbolt.org/g/hNoJzj for this simplified code:

#include <atomic>
#include <stdint.h>

struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;          // non-atomic pointer-to-atomic
    uintptr_t count;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
    return p->load(std::memory_order_acquire).ptr;
}

gcc8 with -mx32 -O3 -march=haswell makes pretty good code:

    movq    (%edi), %rax
    movl    %eax, %eax
    ret

I didn't think it was necessary to zero-extend return values at all, unless there's something special about pointer return values in the x32 ABI. But if we do need to, it would be more efficient to avoid mov same,same, because Intel Haswell always needs an execution unit in that case: it can only eliminate MOV in the rename stage between two different architectural registers.

gcc8-snapshot with -m64 -O3 -march=haswell gives:

    subq    $8, %rsp
    movl    $2, %esi
    call    __atomic_load_16
    addq    $8, %rsp
    ret

IDK why this doesn't tail-call optimize to just mov $2,%esi / jmp __atomic_load_16. With gcc6.x and earlier, the cmpxchg16b is inlined. (https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html changed that for gcc7.)
Efficient read-mostly access with occasional cmpxchg16b updates of the pointer+counter is a potential use-case for atomic 16B objects. Maybe something to keep in mind when deciding whether to change from the current strategy...