My colleague Salvatore Dipietro (CC'd) sent me a couple of profiles that
showed an enormous amount of s_lock() time going to the
__sync_lock_test_and_set() call in the AArch64 implementation of tas().
Upon closer inspection, I noticed that we don't implement a custom
TAS_SPIN() for this architecture, so I quickly hacked together the attached
patch and ran a couple of benchmarks that stressed the spinlock code. I
found no discussion about TAS_SPIN() on ARM in the archives, but I did
notice that the initial AArch64 support was added [0] before x86_64 started
using a non-locking test [1].
These benchmarks are for a c8g.24xlarge running a select-only pgbench with
256 clients and pg_stat_statements.track_planning enabled.
without the patch:
90.04% postgres [.] s_lock
1.07% pg_stat_statements.so [.] pgss_store
0.59% postgres [.] LWLockRelease
0.56% postgres [.] perform_spin_delay
0.31% [kernel] [k] arch_local_irq_enable
| while (TAS_SPIN(lock))
| {
| perform_spin_delay(&delayStatus);
0.12 |2c: -> bl perform_spin_delay
| tas():
| return __sync_lock_test_and_set(lock, 1);
0.01 |30: swpa w20, w1, [x19]
| s_lock():
99.87 | add x0, sp, #0x28
| while (TAS_SPIN(lock))
0.00 | ^ cbnz w1, 2c
tps = 74135.100891 (without initial connection time)
with the patch:
30.46% postgres [.] s_lock
5.88% postgres [.] perform_spin_delay
4.61% [kernel] [k] arch_local_irq_enable
3.31% [kernel] [k] next_uptodate_page
2.50% postgres [.] hash_search_with_hash_value
| while (TAS_SPIN(lock))
| {
| perform_spin_delay(&delayStatus);
0.63 |2c:+-->add x0, sp, #0x28
0.07 | |-> bl perform_spin_delay
| |while (TAS_SPIN(lock))
0.25 |34:| ldr w0, [x19]
65.19 | +--cbnz w0, 2c
| tas():
| return __sync_lock_test_and_set(lock, 1);
0.00 | swpa w20, w0, [x19]
| s_lock():
33.82 | ^ cbnz w0, 2c
tps = 549462.785554 (without initial connection time)
[0] https://postgr.es/c/5c7603c
[1] https://postgr.es/c/b03d196
--
nathan
From a69b7d38c40f0fb49f714c49bb45c292e7c8587d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <[email protected]>
Date: Tue, 22 Oct 2024 14:33:39 -0500
Subject: [PATCH v1 1/1] Use a non-locking initial test in TAS_SPIN on AArch64.
---
src/include/storage/s_lock.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index e94ed5f48b..07f343baf8 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -270,6 +270,12 @@ tas(volatile slock_t *lock)
*/
#if defined(__aarch64__)
+/*
+ * On ARM64, it's a win to use a non-locking test before the TAS proper. It
+ * may be a win on 32-bit ARM, too, but nobody's tested it yet.
+ */
+#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
+
#define SPIN_DELAY() spin_delay()
static __inline__ void
--
2.39.5 (Apple Git-154)