What happens is that when we recycle WAL segments, we rename them and then sync them using fdatasync (which is the default on Linux). However, fdatasync does not force an fsync on the parent directory, so in case of a power failure the rename may get lost. The recovery won't realize those segments actually contain changes
Agreed. I ran into this some time ago, although it wasn't with Postgres.
So, what's going on? The problem is that while rename() is atomic, it is not guaranteed to be durable without an explicit fsync on the parent directory. By default we only do fdatasync on the recycled segments, which may not force an fsync on the directory (and ext4 does not do that, apparently). This impacts all current kernels (tested on 188.8.131.52, 4.0.5 and 4.4-rc1) and all supported PostgreSQL versions (tested on 9.1.19, but I believe all versions since spread checkpoints were introduced are vulnerable). FWIW this has nothing to do with storage reliability: you may have good drives, a RAID controller with BBU, reliable SSDs or whatever, and you're still not safe. This issue is at the filesystem level, not the storage level.
I plan to do more power failure testing soon, with more complex test scenarios. I suspect there might be other similar issues (e.g. when we rename a file before a checkpoint and don't fsync the directory - then the rename won't be replayed and will be lost).
It would be very useful, but I hope you will not find a new bug :)

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/