Gavin Sherry <[EMAIL PROTECTED]> writes:
> It seems that by adding the following to SlruPhysicalReadPage() we can
> recover in a reasonable way here. Instead of:
> [ add non-error check to lseek() ]

But it's not the lseek() that's gonna fail. What we'll actually see, and
did see in Chris' report, is a read() that returns zero bytes, or possibly
an incomplete page. So actually this change is needed in the next step,
not the lseek.

BUT: after looking at the code more, I'm confused again about exactly how
Chris' failure happened. The backtrace he sent this morning shows that
the panic occurs while replaying a transaction-commit WAL entry --- it's
trying to set the commit status bit for that transaction number, and
finding that the clog page containing that bit doesn't exist. But there
should have been a previous WAL entry recording the ZeroCLOGPage() action
for that clog page. The only way that wouldn't have got replayed too is
if there was a checkpoint in between ... but a checkpoint should not have
been able to complete without flushing the clog buffer to disk. If there
wasn't disk space enough to hold the clog page, the checkpoint attempt
should have failed.

So it may be that allowing a short read in slru.c would be patching the
symptom of a bug that is really elsewhere. We need to understand the
sequence of events in more detail.

Chris, you said you'd saved a copy of the data directory at the time of
the failure, right? Could you send me the pg_control file and the active
segments of pg_xlog? (It should be sufficient to send the ones with file
mod times within five minutes of the crash.)

			regards, tom lane
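
For illustration of the short-read case described above, here is a minimal
C sketch. It is not the actual slru.c code; the names
read_page_allow_short, PageReadResult, and SKETCH_PAGE_SIZE are
hypothetical and exist only for this example. It shows why the check
belongs on read() rather than lseek(): seeking past end of file succeeds,
and only the following read() reveals that the page is missing or
incomplete. The sketch assumes that zero-filling the missing part is an
acceptable recovery, which is exactly the point still in question.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical page size for this sketch only. */
#define SKETCH_PAGE_SIZE 8192

typedef enum
{
    PAGE_READ_FULL,     /* the whole page was read from the file */
    PAGE_READ_SHORT,    /* EOF or partial page; remainder zero-filled */
    PAGE_READ_ERROR     /* lseek() or read() reported a real error */
} PageReadResult;

/*
 * Read one page at byte offset "offset" from fd into buf, which must hold
 * at least SKETCH_PAGE_SIZE bytes.  A read that hits end of file before
 * the page is complete is not treated as an error: the missing bytes are
 * set to zero and PAGE_READ_SHORT is returned.
 */
static PageReadResult
read_page_allow_short(int fd, off_t offset, char *buf)
{
    size_t      got = 0;

    assert(buf != NULL);

    /* Seeking beyond end of file succeeds, so this cannot detect the gap. */
    if (lseek(fd, offset, SEEK_SET) < 0)
        return PAGE_READ_ERROR;

    while (got < SKETCH_PAGE_SIZE)
    {
        ssize_t     n = read(fd, buf + got, SKETCH_PAGE_SIZE - got);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return PAGE_READ_ERROR;
        }
        if (n == 0)
            break;              /* end of file: page missing or incomplete */
        got += (size_t) n;
    }

    if (got < SKETCH_PAGE_SIZE)
    {
        /* Zero-fill whatever the file did not supply. */
        memset(buf + got, 0, SKETCH_PAGE_SIZE - got);
        return PAGE_READ_SHORT;
    }
    return PAGE_READ_FULL;
}
```

A caller would distinguish PAGE_READ_SHORT from PAGE_READ_ERROR and decide
whether to log the short read and continue or to stop. Reporting the short
read separately, instead of silently returning zeroes, keeps the event
visible in case it is a symptom of a problem elsewhere.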