On 2021-Sep-25, Alvaro Herrera wrote:
>> On 2021-Sep-24, Alvaro Herrera wrote:
>>
>> > Here's the set for all branches, which I think are really final, in
>> > case somebody wants to play and reproduce their respective problem
>> > scenarios.
>>
>> I forgot to mention that I'll wait until 14.0 is tagged before getting
>> anything pushed.

Hi Alvaro,

sorry for being late to the party, but to add some reassurance that the
v2 committed fix really solves the initial production problem, I've done
limited testing on it (just like with the v1-patch idea earlier, using
wal_keep_segments, wal_init_zero=on, archive_mode=on and
archive_command='/bin/true'):

- On 12.8, I was able, like last time, to manually reproduce it in 3 out
  of 3 tries, and I got: 2x "invalid contrecord length", 1x "there is no
  contrecord flag" on the standby.

- On soon-to-become-12.9 REL_12_STABLE (with commit
  1df0a914d58f2bdb03c11dfcd2cb9cd01c286d59), in 4 out of 4 tries I got
  beautiful insight into what happened:

LOG: started streaming WAL from primary at 1/EC000000 on timeline 1
LOG: sucessfully skipped missing contrecord at 1/EBFFFFF8, overwritten at 2021-10-13 11:22:37.48305+00
CONTEXT: WAL redo at 1/EC000028 for XLOG/OVERWRITE_CONTRECORD: lsn 1/EBFFFFF8; time 2021-10-13 11:22:37.48305+00

  ...and the slave was able to carry on automatically.

In the 4th test, the cascade was tested too (m -> s1 -> s11), and both
{s1,s11} behaved properly and logged the above message. An additional
check also proved that after simulating an ENOSPC crash on the master,
the data contents were identical everywhere (m1=s1=s11).

Thank you, Alvaro, and also everybody else who participated in solving
this challenging and really nasty edge-case bug.

-J.