On Fri, Jun 24, 2022 at 04:17:34PM +0000, Imseih (AWS), Sami wrote:
> It has been difficult to get a generic repro, but the way we reproduce
> is through our test suite. To give more details, we are running tests
> in which we constantly fail over and promote standbys. The issue
> surfaces after we have gone through a few promotions, which occur
> every few hours or so (not really important, but to give context).
Hmm.  Could you describe exactly the failover scenario you are using?
Is the test using a set of cascading standbys linked to the promoted
one?  Are the standbys recycled from the promoted nodes with pg_rewind,
or created from scratch with a new base backup taken from the
freshly-promoted primary?  I have been looking more at this thread
through the day, but I don't see a remaining issue.  It could be
perfectly possible that we are missing a piece related to the handling
of those new overwrite contrecords in some cases, like in a rewind.

> I am adding some additional debugging to see if I can draw a better
> picture of what is happening. Will also give aborted_contrec_reset_3.patch
> a go, although I suspect it will not handle the specific case we are
> dealing with.

Yeah, this is not going to change much if you are still seeing an
issue.  This patch does not change the logic, aka it just simplifies
the tracking of the continuation record data, resetting it when a
complete record has been read.  Saying that, getting rid of the
dependency on StandbyMode because we cannot promote in the middle of a
record is nice (my memories around that were a bit blurry, but even
recovery_target_lsn would not recover in the middle of a continuation
record), and this is not a bug, so there is limited reason to
backpatch this part of the change.
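As a rough illustration of that reset, here is a sketch of the idea
(not the actual contents of aborted_contrec_reset_3.patch), assuming
the tracking lives in the abortedRecPtr/missingContrecPtr variables
used by the overwrite-contrecord machinery:

    record = XLogReadRecord(xlogreader, &errormsg);
    if (record != NULL)
    {
        /*
         * A complete record has been assembled, so any tracking of a
         * previously-aborted continuation record is stale.  Reset it
         * here rather than depending on StandbyMode.
         */
        abortedRecPtr = InvalidXLogRecPtr;
        missingContrecPtr = InvalidXLogRecPtr;
    }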
--
Michael