Hello,

The attached patch speeds up the removal of WAL files in the old timelines.  
I'll add this to the next CF.


BACKGROUND
==================================================

We need to meet a strict availability requirement from a potential customer.  
They will use synchronous streaming replication.  The allowed failover 
duration, from the failure through failure detection to the failover 
completion, is 10 seconds.  Even one second is precious.

During testing on a fast machine with an SSD, we observed a gap of about 
2 seconds between the following two messages.  No other messages were logged 
between them.

LOG:  archive recovery complete
LOG:  MultiXact member wraparound protections are now enabled


CAUSE
==================================================

From reading the source code, RemoveNonParentXlogFiles() seems to account for 
the time.  It syncs the pg_wal directory every time it deletes a WAL file.  
max_wal_size was set to 48GB, so about 1,000 WAL files were probably deleted, 
and the pg_wal directory was synced that many times.


FIX
==================================================

unlink() the WAL files, then sync the pg_wal directory once at the end.
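
To illustrate the idea outside the server, here is a minimal standalone 
sketch.  It is not the patch itself: sync_dir() and remove_files_batched() 
are hypothetical names, and matching files by a name prefix is a simplifying 
assumption standing in for the real old-timeline check.  Error handling is 
reduced to return codes.

```c
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* fsync a directory so that earlier unlink()s in it become durable. */
static int
sync_dir(const char *dirpath)
{
    int         fd = open(dirpath, O_RDONLY);
    int         rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Remove every entry in dirpath whose name starts with prefix, then sync
 * the directory once at the end (instead of once per removed file).
 * Returns the number of files removed, or -1 on error.
 */
static int
remove_files_batched(const char *dirpath, const char *prefix)
{
    DIR        *dir = opendir(dirpath);
    struct dirent *de;
    char        path[4096];
    size_t      plen = strlen(prefix);
    int         removed = 0;

    if (dir == NULL)
        return -1;

    while ((de = readdir(dir)) != NULL)
    {
        if (strncmp(de->d_name, prefix, plen) != 0)
            continue;           /* not a file we were asked to remove */
        snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
        if (unlink(path) == 0)
            removed++;
    }
    closedir(dir);

    /* One directory sync covers all of the unlink()s above. */
    if (sync_dir(dirpath) != 0)
        return -1;
    return removed;
}
```

The per-file variant would call sync_dir() inside the loop after each 
unlink(), which is where the extra time goes when many files are removed.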

Unfortunately, the original machine is no longer available, so I confirmed 
the speedup on a VM with an HDD.

[time to remove 1,000 WAL files including the directory sync]
unpatched:  2.45 seconds
patched:    0.81 seconds


Regards
Takayuki Tsunakawa

Attachment: speedup_wal_removal.patch
Description: speedup_wal_removal.patch