On Thu, Jan 23, 2014 at 2:08 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2014-01-23 02:05:48 +0900, Fujii Masao wrote:
>> On Thu, Jan 23, 2014 at 1:21 AM, Andres Freund <and...@2ndquadrant.com> wrote:
>> > Hi,
>> >
>> > Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will
>> > write() data before fdatasync()ing them (duh, kinda obvious). But I
>> > think given the current recovery code that leaves a window where we can
>> > get into strange inconsistencies.
>> > Consider what happens if postgres (not the OS!) crashes after writing
>> > WAL data to the OS, but before fdatasync()ing it. Replay will happily
>> > read that record from disk and replay it, which is fine. At the end of
>> > recovery we then will start inserting new records, and those will be
>> > properly fsynced to disk.
>> > But if the *OS* crashes in that moment we might get into the strange
>> > situation where older records might be lost since they weren't
>> > fsync()ed, but newer records and the control file will persist.
>> >
>> > I think for a primary that window is relatively small, but I think it's
>> > a good bit bigger for a standby, especially if it's promoted.
>>
>> In normal streaming replication case, ISTM that window is not bigger for
>> the standby because basically the standby replays only the WAL data
>> which walreceiver fsync'd to the disk. But if it replays the WAL file which
>> was fetched from the archive, that WAL file might not have been flushed
>> to the disk yet. In this case, that window might become bigger...
>
> Yea, but if the walreceiver receives data and crashes/disconnects before
> fsync(), we'll read it from pg_xlog, right? And if we promote, we'll
> start inserting new records before establishing a new checkpoint.

Yeah, true. Such an unflushed WAL file can be read by the subsequent recovery...

Regards,

-- 
Fujii Masao
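
For illustration, the window discussed in this thread can be sketched in C. This is only a minimal sketch under stated assumptions, not PostgreSQL's actual code: append_and_sync() and sync_existing_file() are hypothetical helpers invented for this example. The first shows the gap between write() and fdatasync(). The second shows one conceivable way to make already-written data durable before newer data is appended after it; the thread above does not settle on that (or any) fix.

```c
#define _POSIX_C_SOURCE 200809L   /* for fdatasync() under strict -std modes */

#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not a PostgreSQL function: append len bytes to fd,
 * then force them to stable storage. Returns 0 on success, -1 on error.
 */
static int
append_and_sync(int fd, const char *buf, size_t len)
{
    size_t      done = 0;

    while (done < len)
    {
        ssize_t     n = write(fd, buf + done, len - done);

        if (n <= 0)
            return -1;
        done += (size_t) n;
    }

    /*
     * This is the window: the bytes are in the OS page cache but not yet
     * known to be on disk. If only the process dies here, a later reader
     * still sees the data; if the OS dies before the data is flushed, the
     * data may be gone.
     */

    return fdatasync(fd);
}

/*
 * Hypothetical helper: flush whatever the file at path already contains,
 * so that nothing written later can survive a crash that the existing
 * contents do not. Returns 0 on success, -1 on error.
 */
static int
sync_existing_file(const char *path)
{
    int         fd = open(path, O_RDWR);
    int         rc;

    if (fd < 0)
        return -1;

    rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A caller that has just read a file of unknown durability (for example one left behind by a crashed writer, or copied from an archive) could call sync_existing_file() on it before appending anything new with append_and_sync().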