https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83285
            Bug ID: 83285
           Summary: non-atomic stores can reorder more aggressively with
                    seq_cst on AArch64 than x86: missed x86 optimization?
           Product: gcc
           Version: 6.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

This is either an x86-64 missed optimization or an AArch64 bug. I *think* it's an x86-64 missed optimization, but it's not-a-bug on AArch64 only because any observers that could tell the difference would have data race UB.

#include <atomic>
// int na;
// std::atomic_int sync;

void seq_cst(int &na, std::atomic_int &sync) {
    na = 1;
    sync = 2;
    na = 3;
}

https://godbolt.org/g/bUwZaM

On x86, all 3 stores are there in the asm in source order (for mo_seq_cst, but not for mo_release).

On AArch64, gcc 6.3 only does sync=2; na=3;

If `na` were using relaxed atomic stores, this would be a bug (because a thread that saw `sync==2` could then see the original value of na, not na==1 or na==3).

But for non-atomic na, reading na even after synchronizing-with the `sync=2` store (with an acquire load) would be UB, because the thread that writes sync writes na again *after* that. It seems that gcc's AArch64 backend is using this as license to sink the na=1 store past the sync=2 and merge it with the na=3.

seq_cst(int&, std::atomic<int>&, std::atomic<int>&):
        mov     w2, 2       // tmp79,
        stlr    w2, [x1]    // tmp79,* sync
        mov     w1, 3       // tmp78,
        str     w1, [x0]    // tmp78, *na_2(D)
        ret

-----

If sync=2 is a release store (not seq_cst), then gcc for x86 does sink the na=1 past the release and merge. (See the godbolt link.)

In this case it's also allowed to hoist the na=3 store ahead of the release, because plain release is only a one-way barrier for earlier stores. That would be safe for relaxed-atomic as well (unlike for non-atomic), but gcc doesn't do that.

I'm slightly worried that this is unintentional and could maybe happen for relaxed atomics when it would be illegal.
(On AArch64 with seq_cst or release, and on x86 only with release.) But hopefully this is just gcc being clever and taking advantage of the fact that writing a non-atomic after a possible synchronization point means that the sync point is irrelevant for programs without data race UB.
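
To make the relaxed-atomic concern concrete, here is a minimal sketch of the variant where the merge would not be allowed. It is not taken from the godbolt link: `writer`, `reader_saw_valid_na` and `run_once` are hypothetical names, and `na` is changed to a `std::atomic_int` written with relaxed stores. A reader that acquire-loads `sync == 2` synchronizes-with the store, so it must then see na==1 or na==3, never the initial 0. If the compiler sank the na=1 store past the sync store, the reader could observe 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical variant of the report's seq_cst(): `na` is a relaxed atomic
// instead of a plain int, so a concurrent reader is well-defined.
void writer(std::atomic_int &na, std::atomic_int &sync) {
    na.store(1, std::memory_order_relaxed);
    sync.store(2, std::memory_order_seq_cst);  // also acts as a release store
    na.store(3, std::memory_order_relaxed);
}

// Spins until it observes sync == 2, then checks that na is 1 or 3.
// The acquire load synchronizes-with the store of 2, so the earlier
// na = 1 store happens-before the load of na below.
bool reader_saw_valid_na(const std::atomic_int &na, const std::atomic_int &sync) {
    while (sync.load(std::memory_order_acquire) != 2) {
        // busy-wait; the writer always stores 2 eventually
    }
    int v = na.load(std::memory_order_relaxed);
    return v == 1 || v == 3;
}

// Runs one writer and one reader concurrently and reports whether the
// reader's observation was consistent with the memory model.
bool run_once() {
    std::atomic_int na{0};
    std::atomic_int sync{0};
    bool ok = false;
    std::thread r([&] { ok = reader_saw_valid_na(na, sync); });
    std::thread w([&] { writer(na, sync); });
    w.join();
    r.join();  // join orders the write to `ok` before the read below
    return ok;
}
```

Calling `run_once()` in a loop should always return true on a conforming implementation. A stress test like this can only ever demonstrate a miscompile, not rule one out, so inspecting the generated asm for `writer` is the more reliable check.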