I've updated the lazy xid assignment patch, incorporating the results of the
discussion over the past few days. I'll try to explain what the patch does,
and why. This reasoning got a bit lengthy, but I felt that I should summarize
my own thought, plus the relevant parts of the discussions here...
And I might be a bit chatty too ;-)

The patch can be found at: http://soc.phlo.org/lazyxidassign.v2.patch
(Seems that large attachments don't get through on this list - or was that
just bad luck when I tried posting the first version of that patch?)

It includes a new regression test that asserts the we really don't assign xids
for pure selects. The patch got a bit large in the end, though most of it is
just the removal of XLOG_NO_TRAN and the various MyXact* flags and friends.
The interesting parts really aren't that much code.

The patch survives "make installcheck-parallel", and some light testing I did. I
didn't really hammer it yet, though.
I'll post this to the patches list if nobody finds a hole, or suggest further

XID Assignment, Snapshot creation and oldest XMIN computation
We no longer call GetNewTransactionId during transaction start.
Instead, we leave the xid set to InvalidTransactionId, similar
to what we currently do for subtransactions.
GetTopTransactionId checks if we have a xid assigned, and assigns
one using GetNewTransactionId if it's still InvalidTransactionId -
again, similar to what GetCurrentTransactionId does for subtransactions
already. There is a new function GetTopTransactionIdIfAny() bears
the same relation to GetTopTransactionId() as GetCurrentTransactionIdIfAny()
does to GetCurrentTransactionId() - it won't force xid assignment,
but rather return InvalidTransactionId if the xid is not set.

Both GetTopTransactionId() and GetCurrentTranscationId() use
AssignTranscationId() to do the actualy work - this is a modified
version of what is currently called AssignSubTransactionId().

GetOldestXmin() is changed slightly - it used to use
GetTopTransactionId() as a starting point for it's xmin
calculation, falling back to ReadNewTransactionId() if called
from outside a transaction. Now, it always starts with ReadNewTransactionId().
It also used to ignore procarray entries without a valid xmin,
which it obviously mustn't do anymore - they might still have
a valid xmin set.

The changes to GetSnapshotData() are similar. Again, we initialize
xmin to ReadNewTransactionId() instead of GetTopTransactionId().
When scanning the procarray, we don't skip our own entry, but
just process it normally.

The function xid_age computes the age of a given xid relative
to the transaction's toplevel xid. Since forcing xid assignment
just to do that computation seemed silly, I modified xid_age
to use MyProc->xmin as a reference instead. I could have used
ReadNewTransactionId(), but that involves locking overhead,
and we probably want to use something that remains constant
during a transaction.

Correctness issues:
* Users of Get{Top|Current}TransactionId{IfAny} *
Since subtransactions *already* assign their xids on-demand, most of
the code correctly distinguishes between GetCurrentTransactionId()
and GetCurrentTransactionIdIfAny(). So that code is safe. Callers
of GetTopTransactionId() have to be checked - but there are only
very few of them. In fact, all callers are discussed above, with
the exception of some logging code which I modified to use
GetCurrentTransactionIdIfAny() instead.

* Obtaining a snapshot *
The most interesting function from a correctness perspective is
GetSnapshotData(). It's correctness *does* heavily depend on
the assumption that no transaction can *exit* the set of running
transactions while it takes a snapshot - not after it set xmax
in any case. This assumption ensures that if transaction A sees B
as as committed, and B sees C as committed, than A will see C
as committed.

The assumption implies the "commit transitivity" because
because C can't commit while A and B take their snapshot,
and B can't commit while A takes it's snapshot. So for
A to see B as committed, it'll have to take it's snapshot
after B committed. And B will have to have taken it's snapshot
after C commited, if B is to see C as committed. But since
B will obviously haven taken it's snapshot before it committed,
the ordering must then be |C commit|B snapshot|B commit|A snapshot|.
Which implies that A will indeed see C as commited.

This reasoning stays valid even if we lazy assign xids (No xids
show up in the reasoning at all), provided that we see *all*
transactions as committed that did so before we took a snapshot.
But even with lazy xid assignment, the xid will surely be assigned
before commit (or not at all, so nobody cares about us anyway) -
so when we a snapshot after some commit, ReadNewTransactionId()
will return a larger value than the just committed xid. So our
snapshot's xmax will end up > that xid, the xid won't show up in
our snasphot, and so we'll see it as committed.

* Global xmin computation*
Both GetSnapshotData() and GetOldestXmin() assume that concurrent
callers will end up computing the same GlobalXmin. This is guaranteed
as long as we hold the ProcArrayLock in ExclusiveMode while doing
the GlobalXmin computation - something that my patch doesn't change.

To cut the long story short - the small modifications outlined
above are all that is needed to ensure the correctness of
both GetSnapshotData and GetOldestXmin(),
even with lazily assignment of xids.

Top- and subtransaction commit and abort:
* Deciding whether to write COMMIT/ABORT records or not *
The existing code maintains quite a few flags needed to decide
if a COMMIT or ABORT record needs to be written, if the
XLOG needs to be flushed afterwards and if so, how far.
Those flage are
* MyXactMadeTempRelUpdate
* MyXactMadeXLogEntry
* MyLastRecPtr
* ProcLastRecEnd

There is also a historical distinction between transactional and
non-transactional xlog entries - the latter are generated by
passing XLOG_NO_TRAN to XLogInsert. The only difference between
those two types is that MyLastRecPtr is only advanced for
transactional xlog entries, though.

With lazy xid assignment, we have a much cleaner way to decide if
a transaction has done anything that warrants writing a commit
record. We assume that if someone forced xid assignment by calling
GetCurrentTransactionId, than he will also have written that xid
to disk, and so we must write a commit record. If no transaction id
was assigned, then writing a commit record is impossible (there is
no xid to put into that record), and also completely useless. The
same hold true for ABORT records.

A transaction might still have created an xlog entry, even without
having been assigned an xid. An example for this is nextval(). We
still want to force those records to permanent storage, even if we
don't write a commit record. We therefore still track the end of the
latest xlog record a transaction has written. My patch renames
ProcLastRecEnd to XactLastRecEnd, and resets that value after committing
or aborting a transaction.

The rest of the flags (MyXactMadeTempRelUpdate, MyXactMadeXLogEntry
and MyLastRecPtr) are removed because they no longer serve a useful
purpose. Since removing MyLastRecPtr makes XLOG_NO_TRAN utterly
useless, that flags is remove too.

* Files scheduled for deletion on COMMIT & asynchronous commits *
Whenever a transaction drops a relation, it schedules the physical file
for deletion on COMMIT. Obviously we must only delete such a file when
we are *absolutely* sure that our transaction is committed. Since
we can only assume that when the COMMIT record has physically hit the disk,
we first *must* not skip writing the COMMIT record if there are files
scheduled for deletion, and second *must* not allow asynchronous commits
in that case.

The first requirement is already satisfied - file creation
will surely be followed by a catalog update, which will force xid assingment,
and thus writing a COMMIT record. There is an assertion in
RecordTransactionCommit that checks for no schedules file deletions when
we don't have a xid assigned.

The second requirement can be relaxed a bit. If all to-be-deleted files are
in fact temporary relations, then they will be gone after a crash anyway. So
we may delete them before the COMMIT record hits the disk.

Note that there is a small window where we might leak a file. If we flush the
COMMIT record to disk, CHECKPOINT immediatly afterwards, and then crash before
physically deleting the file, we won't do the deletion during recovery since our
COMMIT record is before the latest CHECKPOINT. For asynchronous commit that
window is actually *smaller* since the COMMIT record hits the disk later (maybe
even after deleting the files. This is therefore *only* allowed if all files
are temporary relations).

* Files scheduled for deletion on ABORT *
Whenever a transaction creates a file, it put it onto the on-abort delete list.
There is no protection against leaks here - if we crash after creating the file,
but before the transaction COMMITS or ABORTS, the file is leaked. The current
code attempts to make that window smaller by forcing a xlog flush on ABORT if
there are files to be deleted. But there is not much point in that - the
transaction might have run for an arbitrarily long time after creating the file,
and during that whole time a crash would have caused file leakage. So getting
paranoid at ABORT time seems unjustified. And flushing might even do more harm
than good here - it's entirely possible that it'll take longer to flush that it
would have to get to the file deletion point if we had not flushed.

* Suma sumarum *
This modifies RecordTransactionCommit and RecordSubTransactionCommit, plus it
joins RecordSubTransactionAbort and RecordTransactionAbort into one function,
implementing the following rules (Which seems much clearer than the ones
in the current code).

Toplevel COMMIT:
.) If we have an XID assigned, write a COMMIT record. If not, assert that there
   are no pending deletions.
.) If we're not committing asynchronously, we flush the XLOG up to
   XactLastRecEnd. Otherwise, we just do XLogSetAsyncCommitLSN(), and let
   the wal writer do the flushing.
.) If we have an XID assigned, we update the CLOG. For asynchronous commits,
   we have to pass the LSN of our COMMIT record to prevent the update from
   hitting the disk before the xlog record does.
.) Reset XactLastRecEnd

Subtransaction COMMIT:
.) Do nothing if we don't have a xid assigned, otherwise
.) Just update the CLOG. No WAL-logging needed.

Toplevel *and* Subtransaction ABORT:
.) Do nothing if we don't have a xid assigned, otherwise
.) Write an ABORT record, and then
.) update the CLOG. We don't ensure that this doesn't hit the disk early -
   there is not much difference between actively aborting, and just being
   assumed to have crashed.
.) For subtransactions, remove the relevent xids from the procarray, otherwise
   reset XactLastRecEnd.

Waiting for transactions to exit / ResourceOwnerIds
Each transaction currently holds a lock on it's xid. This is used for two quite
different purposes - one is waiting for a transaction that modified some
tuple - this still works with lazily assigned xids. But during CREATE INDEX
CONCURRENTLY, this is used to first wait for all transactions that are holding
a lock conflicting with ShareLock on the relation in question, and than to out-wait *all* currently running transactions. This obviously fails when xids
are assigned lazily. In the first waiting phase because a transaction will
acquire the lock *before* requesting an xid, in the second phase because we have
to out-wait even pure selects.

Therefore, the concept of a ResourceOwnerId is introduced. This id identifies
the toplevel resourceowner of a transaction, and the owning transaction holds
a lock on it just as it does for transaction ids. Porting the first waiting
phase to this is trivial - we just collect the RIDs instead of the XIDs of
the current lock holders, and wait for them in turn.

The second waiting phase is a bit more complicated, because the current code
just waits on all xids in it's reference snapshot - which won't work for RIDs,
because they do not show up in snapshots. So, instead, we scan the procarray,
and wait for any transaction with an xmin < the xmax of our reference snapshot.

ResourceOwnerIds are a struct composed of two uint32 fields - one sessionId that
is assigned to each backend at backend startup (From a counter in shmem), and
one localTransactionId that is just incremented locally whenever we need a new
unique RID in a backend. This ensures that RIDs create no new locking overhead,
apart from having to hold at least two locks per transaction instead of one.

Locks on ResourceOwnerIds are not saved to the 2PC state file on PREPARE, since
the RID or a transaction will be reused at time point after it was prepared.

ResourceOwnerIds & pg_locks.
Locking requests on RIDs are shown in pg_locks, just as those on xids are.
Therefore, some representation has to be chosen for RIDs. XID won't work,
because due to the local/global split we need 8 bytes to represent a RID.
After much discussion (A new datatype is to heavy-weight, leaving out bits
might hinder debugging, and using int8 is impossible as long as we support
INT64_IS_BUSTED), we settled on a string representation. This at least has
the advantage that the transition to a new full-blown type for RIDs would
be seamless shoudl we even choose to go that way.

My patch adds two columns to pg_locks - "resourceownerid" on the 'left'
side (The locking request), and "resourceowner" on the 'right' side
(the holder/waiter). This is similar to how xids are handled in that view.

Possible future Improvements:
It's not entirely clear that we really need the exclusive ProcArrayLock
when exiting a transaction that didn't hold an xid. Not holding that lock
*will* make two concurrent calculations of globalxmin get two different
results sometimes - but that might actually be OK.

With some effort, we could get rid of the lock we hold on the xid entirely
I think. We'd need to reserve a special sessionId (0xffffffff probably)
for RIDs of prepared transactions. During prepare, we'd then grab a lock
on the RID (0xffffffff, xid) *before* releasing the lock on our old RID.
In XactLockTableWait we'd have to first check if the transaction using
the xid we want to wait for is still running, and get it's RID if it is.
We'd than have to wait on that RID, and after getting that lock we'd have
to wait on the RID (0xffffffff, xid). Only after getting that lock too,
we may return. But that doesn't seem much prettier than just holding two
locks per transaction, so for now I think it's best to leave things as
they are.

greetings, Florian Pflug

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Reply via email to