On Thu, 2007-06-21 at 18:15 -0400, Tom Lane wrote:
> I've been reflecting a bit about whether the notion of deferred fsync
> for transaction commits is really safe.  The proposed patch tries to
> ensure that no consequences of a committed transaction can reach disk
> before the commit WAL record is fsync'd, but ISTM there are potential
> holes in what it's doing.  In particular the path that concerns me is
>
> (1) transaction A commits with deferred fsync;
>
> (2) transaction B observes some effect of A (eg, a committed-good tuple);
>
> (3) transaction B makes a change that is contingent on the observation.
>
> If B's changes were to reach disk in advance of A's commit record, we'd
> have a risk of logical inconsistency.

B's changes cannot reach disk before B's commit record. That is the
existing WAL-before-data rule implemented by the buffer manager.

If B can see A's changes, then A has written a commit record to the log
that is definitely before B's commit record. So B's commit will also
commit A's changes to WAL when it flushes at EOX. So whether A is a
guaranteed transaction or not, B can always rely on those changes.

I agree this feels unsafe when you first think about it, and that was
the reason I took months before publishing the idea.

> The patch is doing what it can
> to prevent *direct* effects of A from reaching disk before the commit
> record does, but it doesn't (and I think cannot) extend this to indirect
> effects perpetrated by other transactions.  An example of the sort of
> risk I'm worried about is a REINDEX omitting an index entry for a tuple
> that it sees as committed dead by A.
>
> Now this may be safe anyway, but it requires analysis that I don't
> recall anyone having put forward.  The cases that I can see are:
>
> 1. Ordinary WAL-logged change in a shared buffer page.  The change will
> not be allowed to reach disk before the associated WAL record does, and
> that WAL record must follow A's commit, so we're safe.
>
> 2. Non-WAL-logged change in a temp table.  Could reach disk in advance
> of A's commit, but we don't care since temp table contents don't survive
> crashes anyway.
>
> 3. Non-WAL-logged change made via one of the paths we have introduced
> to avoid WAL overhead for bulk updates.  In these cases it's entirely
> possible for the data to reach disk before A's commit, because B will
> fsync it down to disk without any sort of interlock, as soon as it
> finishes the bulk update.  However, I believe it's the case that all
> these paths are designed to write data that no other transaction can see
> until after B commits.
> That commit must follow A's in the WAL log,
> so until it has reached disk, the contents of the bulk-updated file
> are unimportant after a crash.
>
> So I think it's probably all OK, but this is a sufficiently long chain
> of reasoning that it had better be checked over by multiple people and
> recorded as part of the design implications of the patch.  Does anyone
> think any of this is wrong, or too fragile to survive future code
> changes?  Are there cases I've missed?

I've done the analysis, but perhaps I should finish the docs now to aid
with review of the patch on the points you make.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
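
For reviewers, the ordering argument for case 1 can be walked through with a toy model. This is only a sketch under simplifying assumptions, not PostgreSQL code: the names `Wal`, `Page`, `write_page` and `simulate` are hypothetical, the log is a plain list, an "LSN" is just a position in that list, and a flush is assumed to make a whole prefix of the log durable.

```python
from dataclasses import dataclass, field


@dataclass
class Wal:
    """Toy write-ahead log: records are appended in order, flushed as a prefix."""
    records: list = field(default_factory=list)
    flushed_upto: int = 0  # number of leading records that are durable

    def append(self, record: str) -> int:
        self.records.append(record)
        return len(self.records)  # toy LSN: position just past this record

    def flush(self, lsn: int) -> None:
        # Flushing up to lsn makes every earlier record durable as well.
        self.flushed_upto = max(self.flushed_upto, lsn)

    def durable(self) -> list:
        return self.records[: self.flushed_upto]


@dataclass
class Page:
    lsn: int = 0          # LSN of the last WAL record that changed this page
    on_disk: bool = False


def write_page(wal: Wal, page: Page) -> None:
    # WAL-before-data: flush the log up to the page's LSN, then write the page.
    wal.flush(page.lsn)
    page.on_disk = True


def simulate() -> dict:
    wal = Wal()

    # (1) A commits with deferred fsync: record appended, nothing flushed yet.
    a_commit_lsn = wal.append("commit A")
    a_durable_after_deferred_commit = "commit A" in wal.durable()

    # (2)+(3) B has seen A's effects, so B's records are appended after A's.
    page = Page()
    page.lsn = wal.append("B: update page")
    b_commit_lsn = wal.append("commit B")

    # B's dirty page is written out; the rule forces a WAL flush first.
    write_page(wal, page)
    a_durable_when_b_page_written = "commit A" in wal.durable()

    # B's own commit flush covers everything up to B's commit record.
    wal.flush(b_commit_lsn)

    return {
        "a_commit_lsn": a_commit_lsn,
        "b_commit_lsn": b_commit_lsn,
        "a_durable_after_deferred_commit": a_durable_after_deferred_commit,
        "a_durable_when_b_page_written": a_durable_when_b_page_written,
        "durable_log": wal.durable(),
    }
```

In this model A's commit record is not durable right after the deferred commit, but it becomes durable no later than the moment B's changed page is written, because the flush that precedes the page write covers the whole log prefix. Cases 2 and 3 are not modelled here, since they involve changes that are not WAL-logged.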