On Mon, Oct 23, 2023 at 8:43 PM Andres Freund <and...@anarazel.de> wrote: > The source of the emode=13=DEBUG2 is that that's hardcoded in > WaitForWALToBecomeAvailable(). I guess the error ought to come from > XLogPageRead(), but all that happens is this: > > case XLREAD_FAIL: > if (readFile >= 0) > close(readFile); > readFile = -1; > readLen = 0; > readSource = XLOG_FROM_ANY; > return XLREAD_FAIL;
I've been under the impression that the guiding principle here is that we shouldn't error out upon hitting a condition that should only cause us to switch sources. I think WaitForWALToBecomeAvailable() is supposed to set things up so that XLogPageRead()'s call to pg_pread() will succeed. If it says it can't, then XLogPageRead() is only obliged to pass that information up to the caller, who can decide to wait longer for the data to show up, or give up, or whatever it wants to do. On the other hand, if WaitForWALToBecomeAvailable() says that it's fine to go ahead and call pg_pread() and pg_pread() then fails, then that means that we've got a problem with the WAL file other than it just not being available yet, like it's the wrong length or there was an I/O error, and those are reportable errors. Said differently, in the former case, the WAL is not broken, merely not currently available; in the latter case, it's broken. The legibility and maintainability of this code are certainly not great. The xlogreader mechanism is extremely useful, but maybe we should have done more cleanup of the underlying mechanism first. It's a giant ball of spaghetti code that is challenging to understand and almost impossible to validate thoroughly (as you just discovered). -- Robert Haas EDB: http://www.enterprisedb.com