On Thu, 2007-06-21 at 18:15 -0400, Tom Lane wrote:
> I've been reflecting a bit about whether the notion of deferred fsync
> for transaction commits is really safe. The proposed patch tries to
> ensure that no consequences of a committed transaction can reach disk
> before the commit WAL record is fsync'd, but ISTM there are potential
> holes in what it's doing. In particular the path that concerns me is
> (1) transaction A commits with deferred fsync;
> (2) transaction B observes some effect of A (eg, a committed-good tuple);
> (3) transaction B makes a change that is contingent on the observation.
> If B's changes were to reach disk in advance of A's commit record, we'd
> have a risk of logical inconsistency.
B's changes cannot reach disk before the WAL records that describe them.
That is the existing WAL-before-data rule implemented by the buffer manager.
If B can see A's changes, then A has written a commit record to the log
that is definitely before any WAL record B writes as a result, including
B's commit record. WAL is flushed sequentially, so when B flushes at end
of transaction (EOX), that flush also makes A's commit record durable. So
whether A is a guaranteed transaction or not, B can always rely on those
changes.
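
To make the ordering argument concrete, here is a toy sketch in Python.
It is not PostgreSQL code: ToyWAL, ToyBuffer and their methods are names
made up for illustration. The only assumption modelled is that the log is
flushed sequentially, so flushing up to any record also flushes every
record inserted before it.

```python
# Toy model only; none of these names exist in PostgreSQL.
class ToyWAL:
    def __init__(self):
        self.records = []      # records in insertion (LSN) order
        self.flushed_upto = 0  # count of records durably on disk

    def insert(self, record):
        """Append a record and return its LSN (1-based position)."""
        self.records.append(record)
        return len(self.records)

    def flush(self, lsn):
        """Sequential flush: everything up to lsn becomes durable."""
        self.flushed_upto = max(self.flushed_upto, lsn)

    def durable(self):
        """Records that would survive a crash right now."""
        return self.records[:self.flushed_upto]


class ToyBuffer:
    """A dirty page that remembers the LSN of the last record touching it."""

    def __init__(self, wal):
        self.wal = wal
        self.page_lsn = 0

    def modify(self, record):
        self.page_lsn = self.wal.insert(record)

    def write_to_disk(self):
        # WAL-before-data: flush the log up to page_lsn before the page goes out.
        self.wal.flush(self.page_lsn)


wal = ToyWAL()

# (1) A commits with deferred fsync: record inserted, not yet flushed.
a_commit = wal.insert("A: commit")
durable_before_b = wal.durable()

# (2)+(3) B sees A's effect and makes a WAL-logged change based on it.
page = ToyBuffer(wal)
page.modify("B: update")

# The page can only be written after the log is flushed through B's record,
# and that flush necessarily covers A's earlier commit record.
page.write_to_disk()
durable_after_page_write = wal.durable()

# B's own commit flush covers everything before it as well.
b_commit = wal.insert("B: commit")
wal.flush(b_commit)
durable_after_b_commit = wal.durable()
```

In this model there is no point at which B's change or B's commit is
durable while A's commit record is not, which is the property argued
above. It says nothing about non-WAL-logged paths.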
I agree this feels unsafe when you first think about it, which is why I
took months before publishing the idea.
> The patch is doing what it can
> to prevent *direct* effects of A from reaching disk before the commit
> record does, but it doesn't (and I think cannot) extend this to indirect
> effects perpetrated by other transactions. An example of the sort of
> risk I'm worried about is a REINDEX omitting an index entry for a tuple
> that it sees as committed dead by A.
> Now this may be safe anyway, but it requires analysis that I don't
> recall anyone having put forward. The cases that I can see are:
> 1. Ordinary WAL-logged change in a shared buffer page. The change will
> not be allowed to reach disk before the associated WAL record does, and
> that WAL record must follow A's commit, so we're safe.
> 2. Non-WAL-logged change in a temp table. Could reach disk in advance
> of A's commit, but we don't care since temp table contents don't survive
> crashes anyway.
> 3. Non-WAL-logged change made via one of the paths we have introduced
> to avoid WAL overhead for bulk updates. In these cases it's entirely
> possible for the data to reach disk before A's commit, because B will
> fsync it down to disk without any sort of interlock, as soon as it
> finishes the bulk update. However, I believe it's the case that all
> these paths are designed to write data that no other transaction can see
> until after B commits. That commit must follow A's in the WAL log,
> so until it has reached disk, the contents of the bulk-updated file
> are unimportant after a crash.
> So I think it's probably all OK, but this is a sufficiently long chain
> of reasoning that it had better be checked over by multiple people and
> recorded as part of the design implications of the patch. Does anyone
> think any of this is wrong, or too fragile to survive future code
> changes? Are there cases I've missed?
I've done the analysis, but perhaps I should finish the docs now so that
the patch can be reviewed against the points you make.