Michael Paquier <mich...@paquier.xyz> writes:
> Recently, one of the test beds we use has blown up once when doing
> streaming replication like that:
> FATAL:  could not seek to end of file "base/16386/19817_fsm": No such
>    file or directory 
> CONTEXT:  WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197
> LOG:  startup process (PID 44886) exited with exit code 1

> All the WAL records have been wiped out since, so I don't know exactly
> what happened, but I could track down that this FSM file got removed
> a couple of hours before as I got my hands on some FS-level logs which
> showed a deletion.

Hm.  AFAICS the immediate issuer of the error must have been
_mdnblocks(); there are other matches to that error string but
they are in places where we can tell which file the seek must
have been applied to, and it wasn't a FSM file.

> Before blaming a lower level of
> the application stack, I am wondering if we have some issues with
> mdfd_vfd meaning that the file has been removed but that it is still
> tracked as opened.

lseek() per se presumably would never return ENOENT.  A more likely
theory is that the file wasn't actually open but only had a leftover
VFD entry, and when FileSize() -> FileAccess() tried to open it,
the open failed with ENOENT --- but _mdnblocks() would still call it
a seek failure.
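
To make that theory concrete, here is a minimal standalone sketch of the
failure mode. It is not the actual fd.c/md.c code: ToyVfd, toy_access(),
toy_size() and toy_nblocks() are hypothetical stand-ins for a VFD entry,
FileAccess(), FileSize() and _mdnblocks(), and the 8192-byte block size is
assumed to be the default BLCKSZ.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a VFD entry: a remembered path plus a kernel
 * fd that may have been closed in the meantime (-1 means "not open"). */
typedef struct ToyVfd
{
    const char *path;
    int         fd;
} ToyVfd;

/* Analog of FileAccess(): lazily reopen the file if the kernel fd is gone.
 * Returns 0 on success, -1 with errno set by open() on failure. */
static int
toy_access(ToyVfd *v)
{
    assert(v != NULL);
    if (v->fd >= 0)
        return 0;
    v->fd = open(v->path, O_RDWR);
    return (v->fd < 0) ? -1 : 0;
}

/* Analog of FileSize(): make sure the file is open, then seek to its end.
 * A failed reopen and a failed lseek() look the same to the caller. */
static off_t
toy_size(ToyVfd *v)
{
    if (toy_access(v) < 0)
        return (off_t) -1;
    return lseek(v->fd, 0, SEEK_END);
}

/* Analog of _mdnblocks(): any failure is reported as a seek failure,
 * even when the errno actually came from open(). */
static long
toy_nblocks(ToyVfd *v, char *msg, size_t msglen, int *saved_errno)
{
    off_t       len = toy_size(v);

    if (len < 0)
    {
        *saved_errno = errno;   /* capture before any other libc call */
        snprintf(msg, msglen, "could not seek to end of file \"%s\": %s",
                 v->path, strerror(*saved_errno));
        return -1;
    }
    *saved_errno = 0;
    msg[0] = '\0';
    return (long) (len / 8192); /* assumed block size */
}
```

If the kernel fd is closed (fd set to -1) and the file is then unlinked, the
next toy_nblocks() call fails with ENOENT from open(), yet the message reads
"could not seek to end of file", which matches the shape of the report above.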

So I'd opine that this is a pretty high-level failure --- what are
we doing trying to replay WAL against a table that's been dropped?
Or if it wasn't dropped, why was the FSM removed?

                        regards, tom lane

