Hi,

On 2022-02-25 12:15:58 -0800, Andres Freund wrote:
> On 2022-02-25 15:07:01 -0500, Tom Lane wrote:
> > Andres Freund <and...@anarazel.de> writes:
> > Hmm. Maybe put a couple more debug messages into ReplicationSlotCleanup
> > and/or ReplicationSlotDropPtr? It doesn't seem very clear where in that
> > sequence it's hanging up.
>
> Yea, was thinking that as well.
>
> I'm also wondering whether it's worth adding an assert, or at least a
> WARNING, about no lwlocks being held, to the tail end of ShutdownPostgres?
> I don't want to add an LWLockReleaseAll() yet, before I understand what's
> actually happening.
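To make that concrete, the kind of check I have in mind would look roughly
like this (only a sketch, and it covers just ReplicationSlotControlLock,
since as far as I can see lwlock.c doesn't export a count of the lwlocks a
backend currently holds):

#include "postgres.h"

#include "storage/lwlock.h"

/*
 * Sketch only: meant to be called from the tail end of ShutdownPostgres()
 * (postinit.c).  Warns instead of asserting, and only looks at the one lock
 * the slot-drop path is known to take.
 */
static void
warn_if_slot_control_lock_held(void)
{
	if (LWLockHeldByMe(ReplicationSlotControlLock))
		elog(WARNING, "backend exiting while still holding ReplicationSlotControlLock");
}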
Did those things. Sure looks like there's some interaction with the
checkpointer. Both failures since the additional debugging show a similar
sequence:

2022-02-26 06:35:52.453 CET [6219bc37.8717f:20] DEBUG: replication slot drop: pg_basebackup_553343: removed on-disk
2022-02-26 06:35:52.453 CET [6219bc37.87168:17] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/700060 oldest xid 720 latest complete 719 next xid 720)
2022-02-26 06:35:52.453 CET [6219bc37.87168:18] DEBUG: begin invalidating obsolete replication slots older than 0/700000
2022-02-26 06:35:52.453 CET [6219bc37.87168:19] DEBUG: done invalidating obsolete replication slots
2022-02-26 06:35:52.453 CET [6219bc37.87168:20] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000006
2022-02-26 06:35:52.453 CET [6219bc37.87168:21] DEBUG: removing write-ahead log file "000000010000000000000006"
2022-02-26 06:35:52.453 CET [6219bc37.87168:22] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
2022-02-26 06:35:52.453 CET [6219bc37.87168:23] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.351 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
[...]
2022-02-26 06:35:52.970 CET [6219bc37.87139:71] LOG: received fast shutdown request
2022-02-26 06:35:52.970 CET [6219bc37.87139:72] LOG: aborting any active transactions
[...]
2022-02-26 06:35:52.971 CET [6219bc37.87168:24] LOG: shutting down
2022-02-26 06:35:52.976 CET [6219bc37.8717f:21] DEBUG: replication slot drop: pg_basebackup_553343: done
2022-02-26 06:35:52.976 CET [6219bc37.8717f:22] DEBUG: temporary replication slot cleanup: begin
2022-02-26 06:35:52.976 CET [6219bc37.8717f:23] DEBUG: temporary replication slot cleanup: 0 in use, active_pid: 553385
2022-02-26 06:35:52.976 CET [6219bc37.8717f:24] DEBUG: temporary replication slot cleanup: done
2022-02-26 06:35:52.977 CET [6219bc37.8717f:25] DEBUG: shmem_exit(0): 7 on_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:26] DEBUG: proc_exit(0): 3 callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:27] LOG: disconnection: session time: 0:00:00.996 user=bf database= host=[local]
2022-02-26 06:35:52.977 CET [6219bc37.8717f:28] DEBUG: exit(0)
2022-02-26 06:35:52.977 CET [6219bc37.8717f:29] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:30] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:31] DEBUG: proc_exit(-1): 0 callbacks to make

So the backend (8717f/553343) is in the middle of ReplicationSlotDrop(), just
after removing the slot's on-disk data. Then nothing happens for 500ms, until
the checkpointer wakes up again; as soon as it does, the slot drop continues.

Just before calling ReplicationSlotDrop() we were able to acquire
ReplicationSlotControlLock in share mode. Immediately after the point in the
code that emits the "removed on-disk" message, i.e. right where the delay
starts, ReplicationSlotControlLock is acquired in exclusive mode.

Too tired now..
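If it really is that exclusive acquisition that stalls, bracketing it with two
more debug messages should show that directly in the next run. Roughly like
this (a sketch against ReplicationSlotDropPtr(); the surrounding code and the
slot variable name are from memory, not verbatim):

	/* ... the slot's on-disk state has been removed at this point ... */
	elog(DEBUG1, "replication slot drop: %s: acquiring control lock",
		 NameStr(slot->data.name));

	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);

	elog(DEBUG1, "replication slot drop: %s: acquired control lock",
		 NameStr(slot->data.name));

	/* ... mark the slot as no longer in use, then release the lock ... */

Greetings,

Andres Freund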