Tom Lane wrote:
I've spent the past hour or so trying to consolidate the comments in
GetSnapshotData and related places into a single chunk of text to be
added to src/backend/access/transam/README. Attached is what I have so
far --- this incorporates the idea of not taking ProcArrayLock to exit
an XID-less transaction, but not yet Florian's idea. I think it'd get
simpler if we changed to that, but am posting this for comments.
Interlocking transaction begin, transaction end, and snapshots
We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot. Unfortunately, we must have some interlocking for this, because
it is critical that all backends agree on the commit order of transactions.
This is actually still a slightly stronger requirement than what we really
need I think.
To simplify the following discussion, "A -> B" shall mean that transaction A saw
B as committed. Conversely, "A !> B" shall mean that A treats B as in-progress.
If A was in read-committed mode, the visibility refers to the latest snapshot
that A used.
Now assume A and B commit at nearly the same time, and for two other
transactions C and D the following holds:
C -> A, C !> B but
D !> A, D -> B.
This would violate the requirement that the commit order is globally
agreed upon, yet as long as both A !> B and B !> A holds, there is no
conflict. (Note that if A and B are serializable, A !> B && B !> A
implies that A and B cannot have touched the same record and have
both committed - one would have been aborted due to a
I must admit, though, that this is a quite academic case, since the
prerequisite A !> B and B !> A is something we have no control over
for read-committed transactions - who knows when they might have taken
their last snapshot...
Still, I wanted to mention this because I believe that the minimal
requirement that we actually *need* to enforce is
A -> B and B -> C imply A -> C. (T1)
The actual implementation will probably always have to enforce
something slightly stronger, but it's still nice to know the
minimal guarantee needed to be able to judge correctness.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot. Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A.
In my notation this becomes: A -> B and C !> B implies C !> A.
This then follows from (T1) - Assume that A -> B, C !> B but C -> A,
then with (A) C -> B follows from C -> A and A -> B, which contradicts
C !> B.
enforce this by not allowing any transaction to exit the set of running
transactions while a snapshot is being taken. (This rule is probably
stronger than necessary, but see the next problem.) The implementation
of this is that GetSnapshotData takes the ProcArrayLock in shared mode
(thereby allowing multiple backends to take snapshots in parallel), but
xact.c must take the ProcArrayLock in exclusive mode while clearing
MyProc->xid at transaction end (either commit or abort).
Agreed. We *do* enforce a strict global ordering of committs and snapshots.
This then guarantees (T1) because if A -> B and B -> C, than A *must*
have taken it's snapshot after B committed, and B in turn *must* have
taken it's snapshot after C committed, so surely A -> C will hold too.
Here is another variant of the risk scenario:
1. Xact A is running (in Read Committed mode).
2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is
swapped out before it can acquire ProcArrayLock.
3. Xact B gets new XID (>= C's xmax), makes changes and commits.
4. Xact A changes some row R changed by xact B and commits.
5. Xact C finishes getting its snapshot data. It sees xact A as done,
but sees xact B as still running (since B >= xmax).
Now C will see R changed by xact B and then xact A, *but* does not see
other changes made by xact B. If C is supposed to be in Serializable mode,
this is wrong.
I never really grasped why we need to assume serializable here - this seems
wrong if C is read-committed too. Seeing only half of a transaction's changes
can never be right, can it?
To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock
before it calls ReadNewTransactionId. This prevents xact A from exiting
the set of running transactions seen by xact C. Therefore both A and B
will be seen as still running => no inconsistency.
Another point of view is that determining the xmax of a snapshot really *is*
part of taking the snapshot. Since we already obtained that we need to
serialize snapshotting and committing, it follows that we must not allow
committs to happen between the time we take the xmax and the time we
complete the snapshot.
In short, then, the rule is that no transactions may exit the set of
currently-running transactions between the time we fetch xmax and the time
we finish building our snapshot. However, this restriction only applies
to transactions that have an XID --- read-only transactions can end without
acquiring ProcArrayLock, since they don't affect anyone else's snapshot.
Yes, they completely fall out of the picture, because nobody ever cares if
they even committed.
We must also require GetNewTransactionId to store the new XID into the
shared ProcArray before releasing XidGenLock. This ensures that when
GetSnapshotData calls ReadNewTransactionId (which also takes XidGenLock),
all active XIDs before the returned value of nextXid are already present in
the ProcArray and can't be missed by GetSnapshotData. Unfortunately, we
can't have GetNewTransactionId take ProcArrayLock to do this, else it could
deadlock against GetSnapshotData. Therefore, we simply let
GetNewTransactionId store into MyProc->xid without any lock. We are
thereby relying on fetch/store of an XID to be atomic, else other backends
might see a partially-set XID. (NOTE: for multiprocessors that need
explicit memory access fence instructions, this means that
acquiring/releasing XidGenLock is just as necessary as acquiring/releasing
ProcArrayLock for GetSnapshotData to ensure it sees up-to-date xid fields.)
This also means that readers of the ProcArray xid fields must be careful to
fetch a value only once, rather than assume they can read it multiple times
and get the same answer each time.
In other words: It won't do for transaction to acquire an xid, then go into
hiding, and later suddenly reappear and start storing it's rather old xid
Or, again, differently put: When we take a snapshot we must be sure that any
xid < our xmax is either not in use anymore, or shows up in the proc array.
By storing the xid in the proc array while holding the XidGenLock, we
ensure that there is always is at most one xid assigned, but not showing
up in the proc array. This xid is always equal to either nextXid, or
nextXid-1. It is the "nextXid-1" case that forces us to take the XidGenLock
- it ensures that *no* xid is assigned but not showing up in the proc array.
If we used any older value for the xmax, the nextXid-1 case wouldn't hurt us -
the "hidden" xid will still be >= our xmax.
The last paragraph is in essence what guarantees the correctness of using
LargestCompletedXid instead of ReadNewTransactionId() as the xmax.
Another important activity that uses the shared ProcArray is GetOldestXmin,
which must determine a lower bound for the oldest xmin of any active MVCC
snapshot, system-wide. Each individual backend advertises the smallest
xmin of its own snapshots in MyProc->xmin, or zero if it currently has no
live snapshots (eg, if it's between transactions or hasn't yet set a
snapshot for a new transaction). GetOldestXmin takes the MIN() of the
valid xmin fields. It does this with only shared lock on ProcArrayLock,
which means there is a potential race condition against other backends
doing GetSnapshotData concurrently: we must be certain that a concurrent
backend that is about to set its xmin does not compute an xmin less than
what GetOldestXmin returns. We ensure that by including all the active
XIDs into the MIN() calculation, along with the valid xmins. The rule that
transactions can't exit without taking exclusive ProcArrayLock ensures that
concurrent holders of shared ProcArrayLock will compute the same minimum of
currently-active XIDs: no xact, in particular not the oldest, can exit
while we hold shared ProcArrayLock. So GetOldestXmin's view of the minimum
active XID will be the same as that of any concurrent GetSnapshotData, and
so it can't produce an overestimate. If there is no active transaction at
all, GetOldestXmin returns the result of ReadNewTransactionId. Note that
two concurrent executions of GetOldestXmin might not see the same result
from ReadNewTransactionId --- but if there is a difference, the intervening
execution(s) of GetNewTransactionId must have stored their XIDs into the
ProcArray, so the later execution of GetOldestXmin will see them and
compute the same global xmin anyway.
Agreed. A shorter reasoning would maybe be:
We need to ensure that no snapshot that is still in use has a xmin smaller
than what GetOldestXmin() returns. We have to check two cases, snapshot
creation and transaction exit.
During snapshot creation, we compute the xmin as MIN() of all xids in our
snapshot. A concurrent GetOldestXmin() or GlobalXmin computation cannot derive a
larger value, since it takes the published xids into account, as well as the
published xmins. This, together with the rule that no transaction can remove
it's xid from the set of running transactions while we either take a snapshot or
do GetOldestXmin() guarantees correctness.
During transaction exit, we don't need to take any lock to clear our xmin.
The xmin won't influence the xmin calculations of other transactions, only
GlobalXmin and GetOldestXmin() computations. Since we won't use our snapshot
anymore, we don't care if they take our xmin into account for that or not.
GetSnapshotData also performs an oldest-xmin calculation (which had better
match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
too expensive. Note that while it is certain that two concurrent
executions of GetSnapshotData will compute the same xmin for their own
snapshots, as argued above, it is not certain that they will arrive at the
same estimate of RecentGlobalXmin. This is because we allow XID-less
transactions to clear their MyProc->xmin asynchronously (without taking
ProcArrayLock), so one execution might see what had been the oldest xmin,
and another not. This is OK since RecentGlobalXmin need only be a valid
lower bound. As noted above, we are already assuming that fetch/store
of the xid fields is atomic, so assuming it for xmin as well is no extra
Agreed - see above for a equivalent description
I had a rather hard time understanding these things initially - at least
for me, the enlightenment came when I realized that most of the locking
in there actually ensures that (T1) holds - or in other words, that
the relation "A sees B as committed" *has* to be a transitive one.
So I'd like to see this reasoning in that README - though maybe I'm the
only one to whom this is the clearest wording...
greetings, Florian Pflug
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend