Good day.
During a conversation about another patch, Andres pointed out that
SpinLockAcquire has full memory barrier semantics [1].
spin.h also states:
> Load and store operations in calling code are guaranteed not to be
> reordered with respect to these operations, because they include a
> compiler barrier.
But on the ARM/ARM64 platform, tas in s_lock.h is implemented as a simple
call to __sync_lock_test_and_set, and GCC's documentation states it has
only acquire semantics, while __sync_lock_release (used as the
implementation of SpinLockRelease) has only release semantics [2].
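For reference, the definitions in question in s_lock.h boil down to roughly
this (abridged):

typedef int slock_t;

static __inline__ int
tas(volatile slock_t *lock)
{
    /* acquire semantics only, per [2] */
    return __sync_lock_test_and_set(lock, 1);
}

/* release semantics only, per [2] */
#define S_UNLOCK(lock)   __sync_lock_release(lock)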
Compiling and disassembling PostgreSQL at the -O2 optimization level on
arm64, we can see that SpinLockAcquire is compiled down to a call to
__aarch64_swp4_sync (in the optimistic case), which disassembles to:
0000000000662cc0 <__aarch64_swp4_sync>:
662cc0: d503245f bti c
662cc4: 90001bf0 adrp x16, 9de000 <hist_start+0x3d18>
662cc8: 39582210 ldrb w16, [x16, #1544]
662ccc: 34000070 cbz w16, 662cd8
662cd0: b8a08020 swpa w0, w0, [x1]
662cd4: d65f03c0 ret
662cd8: 2a0003f0 mov w16, w0
662cdc: 885f7c20 ldxr w0, [x1]
662ce0: 88117c30 stxr w17, w16, [x1]
662ce4: 35ffffd1 cbnz w17, 662cdc
662ce8: d5033bbf dmb ish
662cec: d65f03c0 ret
Here we see the 'swpa' instruction, which has only acquire semantics [3].
And SpinLockRelease generates an 'stlr' instruction, which has release
semantics, IIRC.
Godbolt shows a noticeable difference between __sync_lock_test_and_set and
__atomic_exchange_n(__ATOMIC_SEQ_CST) (and between __sync_lock_release and
__atomic_store_n(__ATOMIC_SEQ_CST)) with GCC, even with -march=armv7 [4]:
- __atomic_exchange_n and __atomic_store_n have a 'dmb ish' instruction
both before and after the load/store operation,
- but __sync_lock_test_and_set has 'dmb ish' only after the load/store
operation, and __sync_lock_release has it only before the store. (You can
see the same pattern in __aarch64_swp4_sync above, in the branch where the
'swpa' instruction is not used.)
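For anyone who wants to reproduce the comparison locally, here is a minimal
test case along these lines (not necessarily the exact code at [4]; compile
for arm/arm64 with -O2 and look at where 'dmb ish' is emitted):

/* __sync builtins: the pattern currently used by s_lock.h */
int
sync_tas(int *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

void
sync_release(int *lock)
{
    __sync_lock_release(lock);
}

/* __atomic builtins with sequentially consistent ordering, for comparison */
int
atomic_tas(int *lock)
{
    return __atomic_exchange_n(lock, 1, __ATOMIC_SEQ_CST);
}

void
atomic_release(int *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_SEQ_CST);
}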
Therefore I think neither SpinLockAcquire nor SpinLockRelease has full
memory barrier semantics on ARM/ARM64 at the moment when compiled with GCC.
Surprisingly, Clang puts 'dmb ish' both before and after the load/store for
__sync_lock_test_and_set, but does not do the same for __sync_lock_release.
As a simple fix, I suggest putting __sync_synchronize() before
__sync_lock_test_and_set and after __sync_lock_release.
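That is, roughly the following (the full patch against s_lock.h is attached
below):

static __inline__ int
tas(volatile slock_t *lock)
{
    /* full barrier before the acquire-only atomic swap */
    __sync_synchronize();
    return __sync_lock_test_and_set(lock, 1);
}

#define S_UNLOCK(lock) do { \
    __sync_lock_release(lock); \
    /* full barrier after the release-only store */ \
    __sync_synchronize(); \
} while (0)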
Initially I wanted to use __atomic_exchange_n and __atomic_store_n. But in
the absence of support for atomic (LSE) instructions, their inner loop
doesn't look safe against reordering, since it is an ldar+stlr loop
(sketched below):
- it looks like preceding operations could be reordered to after the 'ldar';
- similarly, following operations could be reordered to before the 'stlr'.
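To make the concern concrete, here is a hypothetical sketch (my reading of
the issue, not a demonstrated reordering; it assumes the exchange expands
to a load-acquire/store-release loop because LSE is unavailable):

/* Hypothetical illustration, not taken from PostgreSQL code. */
int some_global;
int other_global;

void
concern(int *lock)
{
    int after;

    some_global = 1;            /* (1) plain store before the exchange */
    __atomic_exchange_n(lock, 1, __ATOMIC_SEQ_CST);
    after = other_global;       /* (2) plain load after the exchange */

    /*
     * The worry: (1) might be reordered to after the load-acquire and (2)
     * to before the store-release, so (1) and (2) would not be separated
     * by a full barrier the way a 'dmb ish' would separate them.
     */
    (void) after;
}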
There's a very relevant discussion in GCC's bugzilla [5].
Probably this hypothesis is not valid; I believe we have to ask someone
closely familiar with ARM internals to be sure.
# Alternative.
As an alternative, we could fix the comments about
SpinLockAcquire/SpinLockRelease to state that they provide only acquire and
release barrier semantics, and then check all their uses and add explicit
memory barriers wherever they are relied upon as full memory barriers (as
in [1]).
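For a call site that relies on full-barrier behaviour, that would mean
something like the following (a hypothetical sketch with made-up names, not
a specific call site from [1]):

#include "postgres.h"
#include "port/atomics.h"
#include "storage/spin.h"

extern slock_t mylock;          /* hypothetical shared spinlock */
extern int shared_counter;      /* hypothetical shared variable */

void
update_shared(void)
{
    SpinLockAcquire(&mylock);
    pg_memory_barrier();        /* explicit full barrier, since SpinLockAcquire
                                 * would guarantee only acquire semantics */
    shared_counter++;
    pg_memory_barrier();        /* likewise before the release */
    SpinLockRelease(&mylock);
}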
[1]
https://postgr.es/m/5ccgykypol3azijw2chqnpg3rhuwjtwsmbfs3pgcqm7fu6laus%40wppo6zcfszay
[2]
https://gcc.gnu.org/onlinedocs/gcc-15.1.0/gcc/_005f_005fsync-Builtins.html#index-_005f_005fsync_005flock_005ftest_005fand_005fset
[3]
https://developer.arm.com/documentation/100069/0606/Data-Transfer-Instructions/SWPA--SWPAL--SWP--SWPL--SWPAL--SWP--SWPL
[4] https://godbolt.org/z/h5P9fjzWd
[5] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
--
regards
Yura Sokolov aka funny-falcon
From 1315faa9a01c8d768fd750ed643bae4f3eefc48a Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Tue, 13 May 2025 18:57:46 +0300
Subject: [PATCH v1] Make SpinLockAcquire and SpinLockRelease full memory
barriers.
Andres Freund stated SpinLockAcquire is used as a full memory barrier at
least once [1].
Code comment in spin.h also states:
> Load and store operations in calling code are guaranteed not to be
> reordered with respect to these operations, because they include a
> compiler barrier.
But the ARM/ARM64 code uses __sync_lock_test_and_set and
__sync_lock_release, which have only acquire and release semantics
respectively.
Fix it by adding __sync_synchronize before __sync_lock_test_and_set and
after __sync_lock_release.
---
src/include/storage/s_lock.h | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 2f73f9fcf57..7ee94302df1 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -258,10 +258,14 @@ typedef int slock_t;
static __inline__ int
tas(volatile slock_t *lock)
{
+ __sync_synchronize();
return __sync_lock_test_and_set(lock, 1);
}
-#define S_UNLOCK(lock) __sync_lock_release(lock)
+#define S_UNLOCK(lock) do { \
+ __sync_lock_release(lock); \
+ __sync_synchronize(); \
+} while(0)
#if defined(__aarch64__)
@@ -543,10 +547,14 @@ typedef int slock_t;
static __inline__ int
tas(volatile slock_t *lock)
{
+ __sync_synchronize();
return __sync_lock_test_and_set(lock, 1);
}
-#define S_UNLOCK(lock) __sync_lock_release(lock)
+#define S_UNLOCK(lock) do { \
+ __sync_lock_release(lock); \
+ __sync_synchronize(); \
+} while(0)
#elif defined(HAVE_GCC__SYNC_CHAR_TAS)
#define HAS_TEST_AND_SET
@@ -558,10 +566,14 @@ typedef char slock_t;
static __inline__ int
tas(volatile slock_t *lock)
{
+ __sync_synchronize();
return __sync_lock_test_and_set(lock, 1);
}
-#define S_UNLOCK(lock) __sync_lock_release(lock)
+#define S_UNLOCK(lock) do { \
+ __sync_lock_release(lock); \
+ __sync_synchronize(); \
+} while(0)
#endif /* HAVE_GCC__SYNC_INT32_TAS */
--
2.43.0