Hi all,

Recently, one of the test beds we use blew up once while doing streaming replication, like this:
FATAL: could not seek to end of file "base/16386/19817_fsm": No such file or directory
CONTEXT: WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197
LOG: startup process (PID 44886) exited with exit code 1

All the WAL records have been wiped out since, so I don't know exactly what happened. However, I got my hands on some FS-level logs showing a deletion, and from those I could track down that this FSM file was removed a couple of hours before the failure.

This happens in the context of a WAL record XLOG_HEAP2_CLEAN, and the redo logic is in heap_xlog_clean(), where there are FSM lookups within XLogRecordPageWithFreeSpace() -> XLogReadBufferExtended(). At the subsequent restart, recovery was able to move on past the failing record, so the FSM has been rebuilt correctly. Still, that caused an HA setup to be less... Available.

However, we are rather careful in those code paths to call smgrcreate() so that the file gets created at redo if it is not around. Before blaming a lower level of the application stack, I am wondering if we have some issue with mdfd_vfd, meaning that the file has been removed but is still tracked as opened. A quick look at the code does not show any issues. Has anyone seen this particular error recently?

The last commit on REL_11_STABLE which touched this area is this one, FWIW:
commit: 6872c2be6a97057aa736110e31f0390a53305c9e
author: Alvaro Herrera <alvhe...@alvh.no-ip.org>
date: Wed, 15 Aug 2018 18:09:29 -0300
Update FSM on WAL replay of page all-visible/frozen

Also, this setup was using 11.2 (I know this one lags behind a bit, anyway...).

Thanks,
--
Michael
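
As a small illustration of the errno in the FATAL message above, here is a minimal sketch of the plain POSIX behaviour involved. It is not PostgreSQL code. On a POSIX system, seeking to the end of a file through a descriptor that is still open keeps working after the file is unlinked, whereas reopening it by path fails with ENOENT ("No such file or directory"). The function name seek_after_unlink() and the temporary directory are hypothetical, and the file name is only borrowed from the log. Whether the failing code path really reopened the file by name is an assumption, not something the log confirms.

```python
import errno
import os
import tempfile


def seek_after_unlink():
    """Return (size seen through a still-open fd, errno from reopening by path)."""
    workdir = tempfile.mkdtemp()
    # File name borrowed from the log above, purely for illustration.
    path = os.path.join(workdir, "19817_fsm")

    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.write(fd, b"x" * 8192)   # one 8kB page worth of data
    os.unlink(path)             # the file is removed behind our back

    # Case 1: the kernel descriptor is still open, so seeking to the end
    # succeeds and reports the size of the (now nameless) file.
    size = os.lseek(fd, 0, os.SEEK_END)
    os.close(fd)

    # Case 2: the descriptor is gone and the file has to be reopened by
    # name before any seek can happen; this is where ENOENT shows up.
    try:
        fd = os.open(path, os.O_RDWR)
        os.close(fd)
        reopen_errno = 0
    except OSError as exc:
        reopen_errno = exc.errno

    os.rmdir(workdir)
    return size, reopen_errno


if __name__ == "__main__":
    size, err = seek_after_unlink()
    print(size, errno.errorcode.get(err, err))
```

If this carries over to the failure above, a "No such file or directory" on a seek would hint that the file was being opened (or reopened) by path at that point, rather than accessed through a kernel descriptor that was still held open.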