On Sat, 18 Jun 2005, Tom Lane wrote:
I've attached the 2PC patch as applied in case you want to have a look.
I did some fairly significant hacking on it, and I thought it'd be a
good idea to enumerate some of the changes:
I modified the in-memory data structure so that there is a separate
dummy PGPROC for each prepared transaction (GXACT), and we control
the visibility of the transaction to TransactionIdIsInProgress by
inserting or removing the PGPROC in the global ProcArray. This costs
a little space but it has a lot of benefits:
(..list of benefits..)
Yep, that looks much better. In general, a prepared transaction now looks
and behaves more like normal active transactions. That leaves less room
Another area that I changed was fsyncing of prepared transaction status
files. I really didn't like expecting checkpoint to do this --- for one
thing there are race conditions associated with fsyncing a status file
that someone else is busy deleting. (You could possibly avoid that by
having the checkpointer hold the TwoPhaseStateLock while it's fsyncing,
but the performance implications of that are unpleasant.) The
checkpoint coding "dealt with" the race condition by ignoring *all*
open() failures, which is hardly what I'd call a robust solution.
The way it's done in the committed patch, the prepare sequence is:
1. Write the status file, append a deliberately bad CRC,
and fsync it.
2. Write the WAL record and fsync it.
3. Overwrite the bad CRC with a good one and fsync.
It's annoying to have 3 fsyncs here; 2 would be nicer; but from a safety
perspective I don't think there's much choice. Steps 2 and 3 form a
critical section: if we complete 2 but fail at 3 we have to PANIC
because on-disk state is not consistent with what it says in WAL. So
I felt we had to do the full pushup in step 1 to be as certain as we
can possibly be that step 3 won't fail.
I'm not totally satisfied with this --- it's OK from a correctness
standpoint but the performance leaves something to be desired.
Ouch, that really hurts performance.
In typical 2PC use, the state files live for a very short period of time,
just long enough for the transaction manager to prepare all the resource
managers participating in the global transaction, and then commit them.
We're talking < 1 s. If we let checkpoint to fsync the state files, we
would only have to fsync those state files that happen to be alive when
the checkpoint comes. And if we fsync the state files at the end of the
checkpoint, after all the usual heap pages etc, it's very likely that
even those rare state files that were alive when the checkpoint began,
have already been deleted.
So we're really talking about 1 fsync vs. 3 fsyncs.
Can we figure out another way to solve the race condition? Would it
in fact be ok for the checkpointer to hold the TwoPhaseStateLock,
considering that it usually wouldn't be held for long, since usually the
checkpoint would have very little work to do?
Was there other reasons beside the race condition why you did this way?
I also fooled around a bit with locking of GXACTs. In the original
patch you prevent two backends from trying to commit/rollback the same
GXACT by removing it from the pool at the start of the process.
That's not workable anymore, so what I did was to add a field
"locking_xid" showing the XID of the transaction currently working on
the GXACT. As long as this XID is active according to ProcArray
(ignoring the prepared xact itself of course) the GXACT cannot be
touched by anyone else. This provides a convenient way to interlock
creation of a GXACT as well as its eventual destruction.
I also changed the handling of smgr pending deletions. The original
patch made smgr a 2PC resource manager, but that does not work because
XLOG replay doesn't (and shouldn't I think) run the resource managers;
thus replay of a PREPARE/COMMIT sequence would fail to delete whatever
files should have been deleted. The correct way to do this is that
COMMIT or ROLLBACK PREPARED must emit a WAL record that looks exactly
like a normal COMMIT or ABORT WAL record, including listing any files
to be deleted. So I pulled the files-to-delete info into the 2PC file
header rather than keeping it in resource records.
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly