Hi, On 2025-01-08 12:12:19 -0600, Nathan Bossart wrote: > On Tue, Oct 22, 2024 at 02:54:57PM -0500, Nathan Bossart wrote: > > My colleague Salvatore Dipietro (CC'd) sent me a couple of profiles that > > showed an enormous amount of s_lock() time going to the > > __sync_lock_test_and_set() call in the AArch64 implementation of tas(). > > Upon closer inspection, I noticed that we don't implement a custom > > TAS_SPIN() for this architecture, so I quickly hacked together the attached > > patch and ran a couple of benchmarks that stressed the spinlock code. I > > found no discussion about TAS_SPIN() on ARM in the archives, but I did > > notice that the initial AArch64 support was added [0] before x86_64 started > > using a non-locking test [1]. > > > > These benchmarks are for a c8g.24xlarge running a select-only pgbench with > > 256 clients and pg_stat_statements.track_planning enabled. > > > > without the patch: > > > > [...] > > > > tps = 74135.100891 (without initial connection time) > > > > with the patch: > > > > [...] > > > > tps = 549462.785554 (without initial connection time) > > Are there any objections to proceeding with this change? So far, it's been > tested on a c8g.24xlarge and an Apple M3 (which seems to be too small to > show any effect). If anyone has access to a larger ARM machine, additional > testing would be greatly appreciated. I think it would be unfortunate if > this slipped to v19.
Seems reasonable. It's not surprising that spinning with an atomic operation doesn't scale very well once a you have more than a handful cores. Of course the whole way locking works in pg_stat_statements is a scalability disaster, but that's a bigger project to fix. Out of curiosity, have you measured whether this has a positive effect without pg_stat_statements? I think it'll e.g. also affect lwlocks, as they also use perform_spin_delay(). Greetings, Andres