On Tue, Nov 28, 2023 at 07:00:16PM +0100, David Geier wrote:
> PostgreSQL hit the following assertion during error cleanup, after being OOM
> in dsa_allocate0():
>
> void dshash_detach(dshash_table *hash_table) {
>     ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table);
>
> called from pgstat_shutdown_hook(), called from shmem_exit(), called from
> proc_exit(), called from the exception handler.

Nice find.

> AutoVacWorkerMain()
> pgstat_report_autovac()
> pgstat_get_entry_ref_locked()
> pgstat_get_entry_ref()
> dshash_find_or_insert()
> resize()
>
> resize() locks all partitions so the hash table can safely be resized. Then
> it calls dsa_allocate0(). If dsa_allocate0() fails to allocate, it errors
> out. The exception handler calls proc_exit() which normally calls
> LWLockReleaseAll() via AbortTransaction() but only if there's an active
> transaction. However, pgstat_report_autovac() runs before a transaction got
> started and hence LWLockReleaseAll() doesn't run before
> pgstat_shutdown_hook() is called.

From a glance, it looks to me like the problem is that pgstat_shutdown_hook
is registered as a before_shmem_exit callback, while ProcKill is registered
as an on_shmem_exit callback. However, IIUC even moving them to the same
list wouldn't be sufficient because pgstat_shutdown_hook is registered
after ProcKill, and the code that calls the callbacks walks backwards
through the list.

I would expect your patch to fix this particular issue, but I'm wondering
whether there's a bigger problem here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
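
To illustrate the ordering described above, here is a rough standalone sketch. It is a toy model, not PostgreSQL code: toy_register(), toy_prockill(), toy_stats_shutdown() and simulate_exit() are hypothetical stand-ins, and the only behavior assumed is what the mail states (two callback lists, the before_shmem_exit one run first, each walked backwards, with the stats hook registered after ProcKill).

```c
#define MAX_CALLBACKS 4

typedef void (*exit_cb) (void);

typedef struct
{
    exit_cb     cbs[MAX_CALLBACKS];
    int         n;
} cb_list;

static cb_list before_list;     /* stand-in for before_shmem_exit callbacks */
static cb_list on_list;         /* stand-in for on_shmem_exit callbacks */
static int  locks_held;         /* 1 while the toy "partition locks" are held */
static int  stats_saw_locks;    /* what the toy stats hook observed */

/* Append a callback to a list (hypothetical helper). */
static void
toy_register(cb_list *list, exit_cb cb)
{
    if (list->n < MAX_CALLBACKS)
        list->cbs[list->n++] = cb;
}

/* Stand-in for ProcKill: this is where the locks get released. */
static void
toy_prockill(void)
{
    locks_held = 0;
}

/* Stand-in for pgstat_shutdown_hook: records whether locks are still held. */
static void
toy_stats_shutdown(void)
{
    stats_saw_locks = locks_held;
}

/* Run callbacks last-registered-first, as the mail describes. */
static void
toy_run_reverse(cb_list *list)
{
    while (list->n > 0)
        list->cbs[--list->n] ();
}

/*
 * Simulate exit after an error that left locks held.  With same_list = 0 the
 * stats hook lives on the "before" list; with same_list = 1 both callbacks
 * share the "on" list.  Returns 1 if the stats hook ran with locks held.
 */
static int
simulate_exit(int same_list)
{
    before_list.n = 0;
    on_list.n = 0;
    locks_held = 1;
    stats_saw_locks = 0;

    /* Registration order from the mail: ProcKill first, stats hook later. */
    toy_register(&on_list, toy_prockill);
    toy_register(same_list ? &on_list : &before_list, toy_stats_shutdown);

    toy_run_reverse(&before_list);
    toy_run_reverse(&on_list);

    return stats_saw_locks;
}
```

In this toy, simulate_exit(0) and simulate_exit(1) both return 1: the stats hook runs before the lock release whether or not the two callbacks share a list, because the later-registered callback is called first.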