On Tue, Jan 31, 2017 at 11:07 AM, Michael Paquier <michael.paqu...@gmail.com> wrote: > On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote: >> If that can happen, don't we have the same problem in many other places? >> Like, all the SLRUs? They don't fsync the directory either. > > Right, pg_commit_ts and pg_clog enter in this category.
Implemented as attached. >> Is unlink() guaranteed to be durable, without fsyncing the directory? If >> not, then we need to fsync() the directory even if there are no files in it >> at the moment, because some might've been removed earlier in the checkpoint >> cycle. > > Hm... I am not an expert in file systems. At least on ext4 I can see > that unlink() is atomic, but not durable. So if an unlink() is > followed by a power failure, the previously unlinked file could be > here if the parent directory is not fsync'd. So I have been doing more work on this patch, with the following things done: - Flush pg_clog, pg_commit_ts and pg_twophase at checkpoint phase to ensure their durability. - Create a durable_unlink() routine to give a way to perform a durable file removal. I am now counting 111 calls to unlink() in the backend code, and looking at all of them most look fine with plain unlink() if they are not made durable as they work on temporary files (see timeline.c for example), with some exceptions: - In pg_stop_backup, the old backup_label and tablespace_map removal should be durable to avoid putting the system in a wrong state after power loss. Other calls of unlink() are followed by durable_rename so they are fine if let as such. - Removal of old WAL segments should be durable as well. There is already an effort to rename them durably in case of a segment recycled. In case of a power loss, a file that should have been removed could remain in pg_xlog. Looking around, I have bumped as well on the following bug report for SQlite which is in the same category of things: http://sqlite.1065341.n5.nabble.com/Potential-bug-in-crash-recovery-code-unlink-and-friends-are-not-synchronous-td68885.html Scary to see that in this case durability can be a problem at transaction commit... -- Michael
Description: Binary data
-- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers