On 2021-Sep-25, Alvaro Herrera wrote:
>> On 2021-Sep-24, Alvaro Herrera wrote:
>>
>> > Here's the set for all branches, which I think are really final, in
>> > case somebody wants to play and reproduce their respective problem
>> > scenarios.
>>
>> I forgot to mention that I'll wait until 14.0 is tagged before getting
>> anything pushed.
Hi Alvaro, sorry for being late to the party, but to add some reassurance that
the committed v2 fix really solves the initial production problem, I've done
limited testing of it (just as with the v1 patch idea earlier, using
wal_keep_segments, wal_init_zero=on, archive_mode=on and
archive_command='/bin/true'):
- On 12.8, I was able, like last time, to manually reproduce it in 3 out of 3
tries, and I got 2x "invalid contrecord length" and 1x "there is no contrecord
flag" on the standby.
- On the soon-to-become-12.9 REL_12_STABLE (with commit
1df0a914d58f2bdb03c11dfcd2cb9cd01c286d59), in 4 out of 4 tries I got a
beautiful insight into what happened:
LOG: started streaming WAL from primary at 1/EC000000 on timeline 1
LOG: sucessfully skipped missing contrecord at 1/EBFFFFF8, overwritten at
2021-10-13 11:22:37.48305+00
CONTEXT: WAL redo at 1/EC000028 for XLOG/OVERWRITE_CONTRECORD: lsn 1/EBFFFFF8;
time 2021-10-13 11:22:37.48305+00
...and the slave was able to carry on automatically. In the 4th test, cascading
was tested too (m -> s1 -> s11), and both {s1,s11} behaved properly and logged
the above message. An additional check also confirmed that, after simulating an
ENOSPC crash on the master, the data contents were identical everywhere
(m=s1=s11).
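
For anyone repeating these runs, here is a minimal sketch of how the standby
log outcomes quoted above could be tallied per try. It is only an illustration:
the function name, the category names and the sample lines are assumptions made
for this example, not part of the test setup described here. It matches on
substrings of the three messages mentioned in this mail.

```python
import re
from collections import Counter

# Substrings taken from the standby log messages quoted in this mail.
# The category names (dict keys) are hypothetical, chosen for this sketch.
PATTERNS = {
    "skipped_contrecord": re.compile(r"skipped missing contrecord at"),
    "invalid_length": re.compile(r"invalid contrecord length"),
    "no_flag": re.compile(r"there is no contrecord flag"),
}


def classify_standby_log(lines):
    """Count how many log lines contain each contrecord-related message."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample input: the two LOG lines from the patched run, unwrapped.
sample = [
    "LOG: started streaming WAL from primary at 1/EC000000 on timeline 1",
    "LOG: sucessfully skipped missing contrecord at 1/EBFFFFF8, "
    "overwritten at 2021-10-13 11:22:37.48305+00",
]
print(dict(classify_standby_log(sample)))
```

Matching on "skipped missing contrecord at" rather than on the whole message
keeps the check independent of the exact wording at the start of the line.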
Thank you, Alvaro, and everybody else who participated in solving this
challenging and really nasty edge-case bug.
-J.