Hi Álvaro, -hackers, > I attach the patch with the change you suggested.
I've gave a shot to to the v02 patch on top of REL_12_STABLE (already including 5065aeafb0b7593c04d3bc5bc2a86037f32143fc). Previously(yesterday) without the v02 patch I was getting standby corruption always via simulation by having separate /pg_xlog dedicated fs, and archive_mode=on, wal_keep_segments=120, archive_command set to rsync to different dir on same fs, wal_init_zero at default(true). Today (with v02) I've got corruption in only initial 2 runs out of ~ >30 tries on standby. Probably the 2 failures were somehow my fault (?) or some rare condition (and in 1 of those 2 cases simply restarting standby did help). To be honest I've tried to force this error, but with v02 I simply cannot force this error anymore, so that's good! :) > I didn't have a lot of luck with a reliable reproducer script. I was able to > reproduce the problem starting with Ryo Matsumura's script and attaching > a replica; most of the time the replica would recover by restarting from a > streaming position earlier than where the problem occurred; but a few > times it would just get stuck with a WAL segment containing a bogus > record. In order to get reliable reproducer and get proper the fault injection instead of playing with really filling up fs, apparently one could substitute fd with fd of /dev/full using e.g. dup2() so that every write is going to throw this error too: root@hive:~# ./t & # simple while(1) { fprintf() flush () } testcase root@hive:~# ls -l /proc/27296/fd/3 lrwx------ 1 root root 64 Aug 25 06:22 /proc/27296/fd/3 -> /tmp/testwrite root@hive:~# gdb -q -p 27296 -- 1089 is bitmask O_WRONLY|.. (gdb) p dup2(open("/dev/full", 1089, 0777), 3) $1 = 3 (gdb) c Continuing. ==> fflush/write(): : No space left on device So I've also tried to be malicious while writing to the DB and inject ENOSPCE near places like: a) XLogWrite()->XLogFileInit() near line 3322 // assuming: if (wal_init_zero) is true, one gets classic "PANIC: could not write to file "pg_wal/xlogtemp.90670": No space left on device" b) XLogWrite() near line 2547 just after pg_pwrite // one can get "PANIC: could not write to log file 000000010000003B000000A8 at offset 0, length 15466496: No space left on device" (that would be possible with wal_init_zero=false?) c) XLogWrite() near line 2592 // just before issue_xlog_fsync to get "PANIC: could not fdatasync file "000000010000004300000004": Invalid argument" that would pretty much mean same as above but with last possible offset near end of WAL? This was done with gdb voodoo: handle SIGUSR1 noprint nostop break xlog.c:<LINE> // https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/transam/xlog.c#L3311 c print fd or openLogFile -- to verify it is 3 p dup2(open("/dev/full", 1089, 0777), 3) -- during most of walwriter runtime it has current log as fd=3 After restarting master and inspecting standby - in all of those above 3 cases - the standby didn't inhibit the "invalid contrecord length" at least here, while without corruption this v02 patch it is notorious. So if it passes the worst-case code review assumptions I would be wondering if it shouldn't even be committed as it stands right now. -J.