[HACKERS] Worries about delayed-commit semantics

Tom Lane Thu, 21 Jun 2007 15:19:13 -0700

I've been reflecting a bit about whether the notion of deferred fsync
for transaction commits is really safe.  The proposed patch tries to
ensure that no consequences of a committed transaction can reach disk
before the commit WAL record is fsync'd, but ISTM there are potential
holes in what it's doing.  In particular, the path that concerns me is:

(1) transaction A commits with deferred fsync;

(2) transaction B observes some effect of A (e.g., a committed-good tuple);

(3) transaction B makes a change that is contingent on the observation.

If B's changes were to reach disk in advance of A's commit record, we'd
have a risk of logical inconsistency.  The patch is doing what it can
to prevent *direct* effects of A from reaching disk before the commit
record does, but it doesn't (and I think cannot) extend this to indirect
effects perpetrated by other transactions.  An example of the sort of
risk I'm worried about is a REINDEX omitting an index entry for a tuple
that it sees as committed dead by A.

Now this may be safe anyway, but it requires analysis that I don't
recall anyone having put forward.  The cases that I can see are:

1. Ordinary WAL-logged change in a shared buffer page.  The change will
not be allowed to reach disk before the associated WAL record does, and
that WAL record must follow A's commit, so we're safe.

2. Non-WAL-logged change in a temp table.  Could reach disk in advance
of A's commit, but we don't care since temp table contents don't survive
crashes anyway.

3. Non-WAL-logged change made via one of the paths we have introduced
to avoid WAL overhead for bulk updates.  In these cases it's entirely
possible for the data to reach disk before A's commit, because B will
fsync it down to disk without any sort of interlock, as soon as it
finishes the bulk update.  However, I believe it's the case that all
these paths are designed to write data that no other transaction can see
until after B commits.  B's commit record must follow A's in the WAL,
so until B's commit record has reached disk (by which time A's has too),
the contents of the bulk-updated file are unimportant after a crash.
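
To make the argument for case 1 concrete, here is a toy model of the
WAL-before-data rule.  It is only a sketch and not PostgreSQL code: the
names ToyWAL, ToyPage and write_page are invented for illustration, and
an "LSN" here is simply a count of WAL records.  The assumption it
encodes is that WAL is always flushed as a prefix, so forcing WAL up to
B's record necessarily makes A's earlier commit record durable too.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWAL:
    """Hypothetical in-memory WAL; an LSN is just a record count."""
    records: list = field(default_factory=list)
    flushed_upto: int = 0  # records [0, flushed_upto) are durable

    def append(self, record: str) -> int:
        # Insert a record and return the LSN just past it.
        self.records.append(record)
        return len(self.records)

    def flush(self, lsn: int) -> None:
        # WAL is flushed as a prefix: everything up to lsn becomes durable.
        self.flushed_upto = max(self.flushed_upto, lsn)

    def durable(self) -> list:
        # The records that would survive a crash right now.
        return self.records[:self.flushed_upto]


@dataclass
class ToyPage:
    """Hypothetical shared-buffer page stamped with its latest WAL LSN."""
    lsn: int = 0
    on_disk: bool = False


def write_page(wal: ToyWAL, page: ToyPage) -> None:
    # WAL-before-data: force WAL up to the page's LSN, then write the page.
    wal.flush(page.lsn)
    page.on_disk = True


# (1) A commits with deferred fsync: the commit record is inserted, not flushed.
wal = ToyWAL()
wal.append("A: commit")
assert wal.durable() == []

# (2)+(3) B sees A's effect and makes a WAL-logged change to a page.
page = ToyPage()
page.lsn = wal.append("B: change page")

# Writing B's page forces WAL through B's record, and therefore A's commit.
write_page(wal, page)
assert wal.durable() == ["A: commit", "B: change page"]
```

The sketch covers only case 1.  Cases 2 and 3 bypass write_page's
interlock entirely, which is why they need the separate visibility
arguments given above.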

So I think it's probably all OK, but this is a sufficiently long chain
of reasoning that it had better be checked over by multiple people and
recorded as part of the design implications of the patch.  Does anyone
think any of this is wrong, or too fragile to survive future code
changes?  Are there cases I've missed?

BTW: I really dislike the name "transaction guarantee" for the feature;
it sounds like marketing-speak, not to mention overpromising what we
can deliver.  Postgres can't "guarantee" anything in the face of
untrustworthy disk hardware, for instance.  I'd much rather use names
derived from "deferred commit" or "delayed commit" or some such.

                        regards, tom lane

