Hi all,

Recently, one of the test beds we use blew up once while doing
streaming replication, with the following error:
FATAL:  could not seek to end of file "base/16386/19817_fsm": No such file or directory
CONTEXT:  WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197
LOG:  startup process (PID 44886) exited with exit code 1

All the WAL records have been wiped out since, so I don't know exactly
what happened.  However, I got my hands on some FS-level logs, and they
show that this FSM file was deleted a couple of hours before the
failure.

This happens while replaying a WAL record XLOG_HEAP2_CLEAN.  The redo
logic is in heap_xlog_clean(), which does FSM lookups within
XLogRecordPageWithFreeSpace() -> XLogReadBufferExtended().  At the
subsequent restart, recovery was able to move on past the failing
record, so the FSM has been rebuilt correctly.  Still, that caused an
HA setup to be less...  Available.  However, we are rather careful in
those code paths to call smgrcreate() so that the file gets created at
redo if it is not around.  Before blaming a lower level of the
application stack, I am wondering if we have some issue with mdfd_vfd,
where the file has been removed but is still tracked as opened.  A
quick look at the code does not show any issues.  Has anyone seen this
particular error recently?
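
To make the mdfd_vfd question a bit more concrete, here is a small
standalone Python sketch of the two POSIX behaviors involved.  This is
not PostgreSQL code, and the helper names are made up for illustration.
It assumes a POSIX filesystem, and it assumes that a descriptor cache
may close the kernel descriptor behind the caller's back and reopen the
file by path on the next access.  Whether that is what happened here is
only a guess.

```python
import errno
import os
import tempfile


def make_file(directory, name, size):
    """Create a file of `size` zero bytes and return its path (hypothetical helper)."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(b"\0" * size)
    return path


def seek_end_after_unlink(path):
    """Unlink `path` while a descriptor is still open, then seek to its end.

    On POSIX the inode survives until the last descriptor is closed, so
    the seek succeeds and returns the file size.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        os.unlink(path)
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)


def reopen_after_unlink(path):
    """Mimic a cache that dropped its kernel descriptor before the deletion.

    Returns 0 if the reopen by path succeeds, or the errno of the failure.
    """
    fd = os.open(path, os.O_RDWR)
    os.close(fd)        # assumed: the cache evicts the real descriptor
    os.unlink(path)     # the file is removed externally
    try:
        fd = os.open(path, os.O_RDWR)   # next access reopens by name
    except OSError as exc:
        return exc.errno
    os.close(fd)
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(seek_end_after_unlink(make_file(d, "a_fsm", 8192)))
        print(errno.errorcode[reopen_after_unlink(make_file(d, "b_fsm", 8192))])
```

In other words, a seek on a descriptor that is really still open would
not be expected to fail with ENOENT after a deletion, while a reopen by
path would.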

The last commit on REL_11_STABLE which touched this area is this one
FWIW:
commit: 6872c2be6a97057aa736110e31f0390a53305c9e
author: Alvaro Herrera <alvhe...@alvh.no-ip.org>
date: Wed, 15 Aug 2018 18:09:29 -0300
Update FSM on WAL replay of page all-visible/frozen

Also, this setup was using 11.2 (I know this one lags behind a bit,
anyway...).

Thanks,
--
Michael
