Hi,

On 2022-02-25 12:15:58 -0800, Andres Freund wrote:
> On 2022-02-25 15:07:01 -0500, Tom Lane wrote:
> > Andres Freund <and...@anarazel.de> writes:
> > Hmm. Maybe put a couple more debug messages into ReplicationSlotCleanup
> > and/or ReplicationSlotDropPtr? It doesn't seem very clear where in that
> > sequence it's hanging up.
>
> Yea, was thinking that as well.
>
> I'm also wondering whether it's worth adding an assert, or at least a
> WARNING, about no lwlocks being held, to the tail end of ShutdownPostgres?
> I don't want to add an LWLockReleaseAll() yet, before I understand what's
> actually happening.
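To make that concrete, the kind of check I have in mind would look roughly
like this (only a sketch, and it covers just ReplicationSlotControlLock,
since as far as I can see lwlock.c doesn't export a count of the lwlocks a
backend currently holds):

#include "postgres.h"

#include "storage/lwlock.h"

/*
 * Sketch only: meant to be called from the tail end of ShutdownPostgres()
 * (postinit.c).  Warns instead of asserting, and only looks at the one lock
 * the slot-drop path is known to take.
 */
static void
warn_if_slot_control_lock_held(void)
{
	if (LWLockHeldByMe(ReplicationSlotControlLock))
		elog(WARNING, "backend exiting while still holding ReplicationSlotControlLock");
}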
Did those things. Sure looks like there's some interaction with the
checkpointer. Both failures since the additional debugging show a similar
sequence:

2022-02-26 06:35:52.453 CET [6219bc37.8717f:20] DEBUG: replication slot drop: pg_basebackup_553343: removed on-disk
2022-02-26 06:35:52.453 CET [6219bc37.87168:17] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/700060 oldest xid 720 latest complete 719 next xid 720)
2022-02-26 06:35:52.453 CET [6219bc37.87168:18] DEBUG: begin invalidating obsolete replication slots older than 0/700000
2022-02-26 06:35:52.453 CET [6219bc37.87168:19] DEBUG: done invalidating obsolete replication slots
2022-02-26 06:35:52.453 CET [6219bc37.87168:20] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000006
2022-02-26 06:35:52.453 CET [6219bc37.87168:21] DEBUG: removing write-ahead log file "000000010000000000000006"
2022-02-26 06:35:52.453 CET [6219bc37.87168:22] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
2022-02-26 06:35:52.453 CET [6219bc37.87168:23] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.351 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
[...]
2022-02-26 06:35:52.970 CET [6219bc37.87139:71] LOG: received fast shutdown request
2022-02-26 06:35:52.970 CET [6219bc37.87139:72] LOG: aborting any active transactions
[...]
2022-02-26 06:35:52.971 CET [6219bc37.87168:24] LOG: shutting down
2022-02-26 06:35:52.976 CET [6219bc37.8717f:21] DEBUG: replication slot drop: pg_basebackup_553343: done
2022-02-26 06:35:52.976 CET [6219bc37.8717f:22] DEBUG: temporary replication slot cleanup: begin
2022-02-26 06:35:52.976 CET [6219bc37.8717f:23] DEBUG: temporary replication slot cleanup: 0 in use, active_pid: 553385
2022-02-26 06:35:52.976 CET [6219bc37.8717f:24] DEBUG: temporary replication slot cleanup: done
2022-02-26 06:35:52.977 CET [6219bc37.8717f:25] DEBUG: shmem_exit(0): 7 on_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:26] DEBUG: proc_exit(0): 3 callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:27] LOG: disconnection: session time: 0:00:00.996 user=bf database= host=[local]
2022-02-26 06:35:52.977 CET [6219bc37.8717f:28] DEBUG: exit(0)
2022-02-26 06:35:52.977 CET [6219bc37.8717f:29] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:30] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-02-26 06:35:52.977 CET [6219bc37.8717f:31] DEBUG: proc_exit(-1): 0 callbacks to make

So the backend (8717f/553343) is in the middle of ReplicationSlotDrop(), just
after removing the slot's on-disk data. Then nothing happens for 500ms, until
the checkpointer wakes up again; as soon as it does, the slot drop continues.

Just before calling ReplicationSlotDrop() we were able to acquire
ReplicationSlotControlLock in share mode. Immediately after the point in the
code that emits the "removed on-disk" message, i.e. right where the delay
starts, ReplicationSlotControlLock is acquired in exclusive mode.

Too tired now..
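If it really is that exclusive acquisition that stalls, bracketing it with two
more debug messages should show that directly in the next run. Roughly like
this (a sketch against ReplicationSlotDropPtr(); the surrounding code and the
slot variable name are from memory, not verbatim):

	/* ... the slot's on-disk state has been removed at this point ... */
	elog(DEBUG1, "replication slot drop: %s: acquiring control lock",
		 NameStr(slot->data.name));

	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);

	elog(DEBUG1, "replication slot drop: %s: acquired control lock",
		 NameStr(slot->data.name));

	/* ... mark the slot as no longer in use, then release the lock ... */

Greetings,

Andres Freund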