On Tue, Nov 23, 2021 at 4:36 AM Drouvot, Bertrand <bdrou...@amazon.com> wrote: > Indeed, I think the logical decoding on standby patch just needs to > change the Assert in GetWALInsertionTimeLine() to check > SharedRecoveryState is RECOVERY_STATE_DONE or RECOVERY_STATE_ARCHIVE. > > Would that make sense from your point of view to add the check on > RECOVERY_STATE_ARCHIVE or do you think it would need to be more > "sophisticated"?
Unless I'm confused, that would just be incorrect. GetWALInsertionTimeLine() just returns InsertTimeLineID, which won't be initialized during archive recovery. I discussed this point a bit on some other thread with Andres. It's possible that all that you need to do here is determine the replayTLI using e.g. GetXLogReplayRecPtr. But I don't really see why that should be correct: read_local_xlog_page() has some kind of wait-and-retry logic and I'm not clear why we don't need the same thing here. After all, if we're on the primary and we determine that sendTimeLineHistoric = false and sendTimeLine = whatever, neither of those values can become incorrect afterward. But on a standby that can certainly happen, if an upstream server is promoted. Note that the comment in read_local_xlog_page() says this: * If that happens after our caller determined the TLI but before * we actually read the xlog page, we might still try to read from the * old (now renamed) segment and fail. There's not much we can do * about this, but it can only happen when we're a leaf of a cascading * standby whose primary gets promoted while we're decoding, so a * one-off ERROR isn't too bad. That seems to imply that the retry loop here is designed, at least in part, to protect against exactly this kind of scenario. -- Robert Haas EDB: http://www.enterprisedb.com