On 2021-Sep-25, Alvaro Herrera wrote:
>> On 2021-Sep-24, Alvaro Herrera wrote:
>> 
>> > Here's the set for all branches, which I think are really final, in
>> > case somebody wants to play and reproduce their respective problem
>> > scenarios.
>> 
>> I forgot to mention that I'll wait until 14.0 is tagged before getting
>> anything pushed.

Hi Alvaro, sorry for being late to the party, but to add some reassurance that 
the committed v2 fix really solves the initial production problem, I've done 
limited testing on it (just like with the v1 patch idea earlier, using 
wal_keep_segments, wal_init_zero=on, archive_mode=on and 
archive_command='/bin/true'):

- On 12.8, I was able, like last time, to manually reproduce it on 3 out of 3 
tries, and I got: 2x "invalid contrecord length", 1x "there is no contrecord 
flag" on the standby.

- On the soon-to-become-12.9 REL_12_STABLE (with commit 
1df0a914d58f2bdb03c11dfcd2cb9cd01c286d59), on 4 out of 4 tries, I got 
beautiful insight into what happened:
LOG:  started streaming WAL from primary at 1/EC000000 on timeline 1
LOG:  sucessfully skipped missing contrecord at 1/EBFFFFF8, overwritten at 2021-10-13 11:22:37.48305+00
CONTEXT:  WAL redo at 1/EC000028 for XLOG/OVERWRITE_CONTRECORD: lsn 1/EBFFFFF8; time 2021-10-13 11:22:37.48305+00
...and the standby was able to carry on automatically. In the 4th test, a 
cascade was tested too (m -> s1 -> s11), and both {s1,s11} behaved properly and 
logged the above message. An additional check also showed that after simulating 
an ENOSPC crash on the master, the data contents were identical everywhere 
(m=s1=s11).
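
For anyone who wants to repeat this, below is a minimal sketch (not something 
used in the tests above) of how a standby log could be scanned for the three 
messages mentioned in this mail. The function names are hypothetical; the 
matched substrings are only the messages quoted above, and the "skipped" 
pattern deliberately omits the first word so that its spelling does not matter.

```python
import re
from collections import Counter

# Substrings taken from the standby log messages quoted in this mail.
# The names on the left are illustrative labels, not PostgreSQL identifiers.
PATTERNS = {
    "invalid_contrecord_length": re.compile(r"invalid contrecord length"),
    "no_contrecord_flag": re.compile(r"there is no contrecord flag"),
    "skipped_missing_contrecord": re.compile(
        r"skipped missing contrecord at [0-9A-F]+/[0-9A-F]+"
    ),
}


def classify_standby_log(lines):
    """Count how many log lines match each of the known messages."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def recovered(lines):
    """True if the skip message was logged and neither error message was."""
    counts = classify_standby_log(lines)
    return (
        counts["skipped_missing_contrecord"] > 0
        and counts["invalid_contrecord_length"] == 0
        and counts["no_contrecord_flag"] == 0
    )
```

Feeding it the lines of a standby log file (for example via 
`open(path).read().splitlines()`) would then separate the 12.8-style failures 
from the REL_12_STABLE-style recoveries.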

Thank you, Alvaro, and also everybody else who participated in solving this 
challenging and really nasty edge-case bug.

-J.

