At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <d...@seiler.us> wrote in
> On Tue, Aug 2, 2022 at 10:01 AM David Steele <da...@pgmasters.net> wrote:
>
> > > That makes sense. Each iteration of the restartpoint recycle loop has a
> > > 1/N chance of failing. Recovery adds >N files between restartpoints.
> > > Hence, the WAL directory grows without bound. Is that roughly the theory
> > > in mind?
> >
> > Yes, though you have formulated it better than I had in my mind.
I'm not sure I understand it correctly, but isn't the cause of the issue in
the other thread that many checkpoint records are skipped within
checkpoint_timeout?  I remember that I proposed a GUC variable to disable
that checkpoint skipping.  As another measure for that issue, we could force
replaying a checkpoint record (that is, performing a restartpoint) once
max_wal_size is already filled up, or is expected to be filled within the
next checkpoint cycle (a rough sketch of this idea is appended at the end of
this mail).  If this is correct, this patch is irrelevant to the issue.

> > Let's see if Don can confirm that he is seeing the "could not link file"
> > messages.
>
> During my latest incident, there was only one occurrence:
>
> could not link file "pg_wal/xlogtemp.18799" to
> "pg_wal/000000010000D45300000010": File exists

(I noticed that the patch in the other thread is broken :()

Hmm.  It looks like a race condition between StartupXLOG() and
RemoveXlogFile().  We need to hold ControlFileLock over a wider extent;
concretely, taking ControlFileLock before deciding the target xlog file name
in RemoveXlogFile() seems to prevent this from happening (see the second
sketch appended below).  If this is correct, it is a live issue on the
master branch.

> WAL restore/recovery seemed to continue on just fine then. And it would
> continue on until the pg_wal volume ran out of space unless I was manually
> rm'ing already-recovered WAL files from the side.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
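
Appending a rough sketch of the max_wal_size idea above.  This is only an
illustration of what I mean, not a working patch: max_wal_size_mb and
XLogRecPtr are existing symbols, but the helper name, its arguments and the
place it would be called from are made up.

/*
 * Hypothetical helper, for illustration only: returns true when the WAL
 * accumulated since the last restartpoint is about to exceed max_wal_size,
 * in which case recovery should stop skipping the next checkpoint record
 * and perform a restartpoint.
 */
static bool
RestartPointForcedByWalVolume(XLogRecPtr replayPtr, XLogRecPtr lastRestartPtr)
{
	uint64		accumulated = replayPtr - lastRestartPtr;
	uint64		limit = (uint64) max_wal_size_mb * 1024 * 1024;

	/* force a restartpoint once the WAL budget is used up */
	return accumulated >= limit;
}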
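
And a sketch of the wider ControlFileLock extent I have in mind for the
recycling path.  The point is that the target segment name is chosen and the
link/rename is performed inside one ControlFileLock critical section, so a
concurrent creator of the same file (holding the same lock) cannot slip in
between.  Again this is a simplified sketch, not the actual
RemoveXlogFile()/InstallXLogFileSegment() code; only LWLockAcquire/
LWLockRelease, ControlFileLock, XLogFilePath, durable_rename_excl and
wal_segment_size are existing symbols, everything else is made up.

/*
 * Simplified sketch: recycle oldpath as a future segment, deciding the
 * target file name only after ControlFileLock is acquired and keeping the
 * lock until the rename is done.
 */
static bool
RecycleSegmentLocked(const char *oldpath, XLogSegNo *targetSegNo,
					 XLogSegNo maxSegNo, TimeLineID tli)
{
	char		newpath[MAXPGPATH];
	struct stat st;

	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

	/* Decide the target file name while holding the lock. */
	XLogFilePath(newpath, tli, *targetSegNo, wal_segment_size);
	while (stat(newpath, &st) == 0)
	{
		if (*targetSegNo >= maxSegNo)
		{
			LWLockRelease(ControlFileLock);
			return false;		/* no free slot within the allowed range */
		}
		(*targetSegNo)++;
		XLogFilePath(newpath, tli, *targetSegNo, wal_segment_size);
	}

	/*
	 * Still holding the lock, so nobody who respects ControlFileLock can
	 * create newpath before we do, and the "File exists" failure cannot
	 * occur here.
	 */
	if (durable_rename_excl(oldpath, newpath, LOG) != 0)
	{
		LWLockRelease(ControlFileLock);
		return false;
	}

	LWLockRelease(ControlFileLock);
	return true;
}

Of course this only helps if the startup process also takes ControlFileLock
when it installs restored or newly created segments; otherwise the window
merely shrinks.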