On Tue, Sep 9, 2025 at 12:28 PM Amit Kapila <[email protected]> wrote:
>
> On Mon, Sep 8, 2025 at 3:03 PM Nitin Motiani <[email protected]> wrote:
> >
> > I'd like to propose a patch to allow accepting connections post recovery
> > without waiting for the removal of old xlog files.
> >
> > Why : We have seen instances where the crash recovery takes very long (tens
> > of minutes to hours) if a large number of accumulated WAL files need to be
> > cleaned up (eg : Cleaning up 2M old WAL files took close to 4 hours).
> >
> > This WAL accumulation is usually caused by :
> >
> > 1. Inactive replication slot
> > 2. PITR failing to keep up
> >
> > In the above cases when the resolution (deleting inactive slot/disabling
> > PITR) is followed by a crash (before checkpoint could run), we see the
> > recovery take a very long time. Note that in these cases the actual WAL
> > replay is done relatively quickly and most of the delay is due to
> > RemoveOldXlogFiles().
> >
>
> Isn't it better to fix the reasons for WAL accumulation? Because even
> without recovery, this can fill up the disk. For example, one can use
> idle_replication_slot_timeout for inactive slots. Similarly, we can
> see what leads to slow PITR and try to avoid that.

I agree that in an ideal world the user would set
'idle_replication_slot_timeout' correctly so that WAL never accumulates
in the first place. But that is not always the case, and there are
situations where WAL does accumulate. In this context, the goal is to
address the problem after it has already happened and to minimize
additional downtime for the user. I feel this is a reasonable goal,
although we can think more about whether it is worth issuing the extra
checkpoint to improve this situation.

--
Regards,
Dilip Kumar
Google
