Michael Paquier <mich...@paquier.xyz> writes:
> Recently, one of the test beds we use has blown up once when doing
> streaming replication like that:
> FATAL: could not seek to end of file "base/16386/19817_fsm": No such
> file or directory
> CONTEXT: WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197
> LOG: startup process (PID 44886) exited with exit code 1
> All the WAL records have been wiped out since, so I don't know exactly
> what happened, but I could track down that this FSM file got removed
> a couple of hours before as I got my hands on some FS-level logs which
> showed a deletion.

Hm.  AFAICS the immediate issuer of the error must have been
_mdnblocks(); there are other matches to that error string but they
are in places where we can tell which file the seek must have been
applied to, and it wasn't a FSM file.

> Before blaming a lower level of
> the application stack, I am wondering if we have some issues with
> mdfd_vfd meaning that the file has been removed but that it is still
> tracked as opened.

lseek() per se presumably would never return ENOENT.  A more likely
theory is that the file wasn't actually open but only had a leftover
VFD entry, and when FileSize() -> FileAccess() tried to open it, the
open failed with ENOENT --- but _mdnblocks() would still call it a
seek failure.

So I'd opine that this is a pretty high-level failure --- what are we
doing trying to replay WAL against a table that's been dropped?
Or if it wasn't dropped, why was the FSM removed?

			regards, tom lane
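
For illustration only, the theory above can be reproduced outside the
backend with a rough sketch. This is not the md.c/fd.c code: LazyFile,
lazy_file_size() and size_after_unlink() are hypothetical names, and the
sketch assumes a POSIX system with a writable /tmp. It keeps a path plus
a possibly-closed descriptor, reopens on demand, and reports any failure
as a seek failure, which is the behavior suspected of _mdnblocks().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a VFD entry: the path is remembered, but the
 * kernel descriptor may have been closed (fd == -1) and is reopened lazily. */
typedef struct LazyFile {
    char path[64];
    int  fd;
} LazyFile;

/* Reopen on demand, then seek to the end.  Returns -1 with errno set if
 * either open() or lseek() fails; the caller cannot tell which one it was. */
off_t lazy_file_size(LazyFile *f)
{
    if (f->fd < 0) {
        f->fd = open(f->path, O_RDWR);
        if (f->fd < 0)
            return -1;              /* errno comes from open(), e.g. ENOENT */
    }
    return lseek(f->fd, 0, SEEK_END);
}

/* Create a temp file, unlink it, then ask for its size.  With keep_open == 0
 * only a stale entry remains; with keep_open != 0 the descriptor stays open.
 * Returns 0 on success or the errno seen, and fills msg with the report. */
int size_after_unlink(int keep_open, char *msg, size_t msglen)
{
    LazyFile f;
    int      saved = 0;

    strcpy(f.path, "/tmp/lazyfile_XXXXXX");
    f.fd = mkstemp(f.path);
    assert(f.fd >= 0);

    if (!keep_open) {
        close(f.fd);                /* descriptor recycled, entry kept */
        f.fd = -1;
    }
    unlink(f.path);                 /* file removed behind the entry's back */

    msg[0] = '\0';
    errno = 0;
    if (lazy_file_size(&f) < 0) {
        saved = errno;
        /* Same wording regardless of which system call actually failed. */
        snprintf(msg, msglen, "could not seek to end of file \"%s\": %s",
                 f.path, strerror(saved));
    }
    if (f.fd >= 0)
        close(f.fd);
    return saved;
}
```

With keep_open set, the lseek() on the still-open descriptor succeeds
even though the file has been unlinked. With it clear, the reopen fails
with ENOENT and the message nonetheless blames the seek.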