On Wed, 13 Mar 2024 at 04:56, Kyotaro Horiguchi <horikyota....@gmail.com> wrote:
>
> At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi
> <horikyota....@gmail.com> wrote in
> > Oh, I once saw the fix work, but seems not to be working after some
> > point. The new issue was a corruption of received WAL records on the
> > first standby, and it may be related to the setting.
>
> I identified the cause of the second issue. When I tried to replay the
> issue, the second standby accidentally received the old timeline's
> last page-spanning record till the end while the first standby was
> promoting (but it had not been read by recovery). In addition to that,
> on the second standby, there's a time window where the timeline
> increased but the first segment of the new timeline is not available
> yet. In this case, the second standby successfully reads the
> page-spanning record in the old timeline even after the second standby
> noticed that the timeline ID has been increased, thanks to the
> robustness of XLogFileReadAnyTLI().
>
> I think the primary change to XLogPageRead that I suggested is correct
> (assuming the use of wal_segment_size instead of the
> constant). However, still XLogFileReadAnyTLI() has a chance to read
> the segment from the old timeline after the second standby notices a
> timeline switch, leading to the second issue. The second issue was
> fixed by preventing XLogFileReadAnyTLI from reading segments from
> older timelines than those suggested by the latest timeline
> history. (In other words, disabling the "AnyTLI" part).
>
> I recall that there was a discussion for commit 4bd0ad9e44, about the
> objective of allowing reading segments from older timelines than the
> timeline history suggests. In my faint memory, we concluded to
> postpone making the decision to remove the feature due to uncertainty
> about the objective. If there's no clear reason to continue using
> XLogFileReadAnyTLI(), I suggest we stop its use and instead adopt
> XLogFileReadOnTLHistory(), which reads segments that align precisely
> with the timeline history.

This sounds very similar to the problem described in [1], and I think
both will be resolved by that change.

[1] https://postgr.es/m/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw%40mail.gmail.com
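
To make the difference between the two lookup strategies concrete, here
is a minimal standalone sketch. It is not PostgreSQL source code: the
types TimelineEntry and SegmentFile and the functions pick_any_tli() and
pick_on_history() are hypothetical names, timeline boundaries are
simplified to whole segment numbers, and "is the segment available" is
modelled as a lookup in a plain array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified timeline history entry; arrays are newest first. */
typedef struct
{
    uint32_t tli;        /* timeline ID */
    uint64_t begin_seg;  /* first segment number that belongs to this timeline */
} TimelineEntry;

/* Hypothetical record of a WAL segment file that is present locally. */
typedef struct
{
    uint32_t tli;
    uint64_t segno;
} SegmentFile;

/* Return true if the segment (tli, segno) is in the list of local files. */
static bool
segment_available(const SegmentFile *files, size_t nfiles,
                  uint32_t tli, uint64_t segno)
{
    for (size_t i = 0; i < nfiles; i++)
        if (files[i].tli == tli && files[i].segno == segno)
            return true;
    return false;
}

/*
 * "AnyTLI"-style lookup: walk the history from newest to oldest and take
 * the first timeline whose segment file exists, falling back to older
 * timelines. Returns 0 (used here as "no timeline") if nothing is found.
 */
uint32_t
pick_any_tli(const TimelineEntry *history, size_t nhistory,
             const SegmentFile *files, size_t nfiles, uint64_t segno)
{
    for (size_t i = 0; i < nhistory; i++)
    {
        if (history[i].begin_seg > segno)
            continue;           /* this timeline starts after the segment */
        if (segment_available(files, nfiles, history[i].tli, segno))
            return history[i].tli;
    }
    return 0;
}

/*
 * History-strict lookup: only the newest timeline that covers the segment
 * is acceptable. If its file is missing, report 0 instead of falling back.
 */
uint32_t
pick_on_history(const TimelineEntry *history, size_t nhistory,
                const SegmentFile *files, size_t nfiles, uint64_t segno)
{
    for (size_t i = 0; i < nhistory; i++)
    {
        if (history[i].begin_seg > segno)
            continue;           /* this timeline starts after the segment */
        return segment_available(files, nfiles, history[i].tli, segno)
            ? history[i].tli : 0;
    }
    return 0;
}
```

With a history of "TLI 2 from segment 5, TLI 1 from segment 0" and only
TLI 1's copy of segment 5 on disk, pick_any_tli() returns 1 (the old
timeline's segment), whereas pick_on_history() returns 0, so a caller
would have to wait for the new timeline's segment. That window is what
the quoted analysis describes.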