Hackers, I'm (again) trying to get my hands dirty on the nested transaction thingie. Hopefully I will have a lot more commitment this time ...
I'm currently looking at whatever is missing in the description below. If you see anything that isn't being considered please let me know. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "Uno puede defenderse de los ataques; contra los elogios se esta indefenso" Request For Comments -- Nested Transactions Topics: - UNDO? Multiple XIDs? - Tuple Visibility - Memory Management - Lock management - File deletions, Deferred triggers, Asynchronous notifies UNDO ---- There was a proposal for implementing the UNDO functionality to allow an implementation of an overwriting storage manager, mainly by Vadim Mikheev. This would quasi-automatically give the ability to support subtransactions and savepoints. Not much people seemed to like the idea, and everybody agreed that there was no one willing to implement it. Also, the UNDO idea had its own problems and we don't seem to be moving in the overwriting storage manager anyway. So UNDO is discarded from the beggining. The other way to implement subtransactions seems to be using multiple transaction IDs, assigning a new one to each subxact. This has its own issues, which I discuss below. Vadim seemed to think that the transaction mechanism should be rethought completely instead of "hacked". If someone has better ideas on how this could be done, I'm all ears. Tuple Visibility ---------------- The main idea is that a transaction tree will have several XIDs; thus, some tuples written by it will have the XID of the main transaction in xmin/xmax, but other tuples can have different XIDs. The important point is that the final commit has to be atomic, that is, when the main transaction commits, all tuples created by it will be seen as committed. There would be a new storage area, called pg_subtrans, where the XID of the parent transaction of each subtransaction would be stored permanently, in a way similar to that of pg_clog. The last time this issue was discussed, Manfred Koizar proposed the following mechanism: Let the currently unused fourth state in pg_clog indicate a committed subtransaction. In pg_clog there are two bits per transaction, commit and abort, with the following meaning: a c 0 0 transaction in progress, the owning backend knows whether it is a main- or a sub-transaction, other backends don't care 1 0 aborted, nobody cares whether main- or sub-transaction 0 1 committed main-transaction or - with shortcut 2 - a sub- transaction that's known committed to all active transactions 1 1 committed sub-transaction, have to look for parent in pg_subtrans http://archives.postgresql.org/pgsql-hackers/2003-03/msg00754.php The other, older proposal (mentioned in the same article), was to use the 00 status indicate both a transaction in progress and a subtransaction whose parent we don't know whether it committed or rolled back (i.e. have to check). The main problem I see with this is that every 00 status will have to be rechecked in pg_subtrans and pg_clog, creating a lot of read traffic. Note that the four-status mechanism increments the pg_clog write traffic instead (a committed subtrans will have the 11 status written first, then the 01 later). I think this is the lesser of the two evils, because: - there is more locality (I/O increment only in pg_clog, not pg_subtrans) - read and write both use exclusive lock on page (no difference between them, none is "cheaper") - after write, pg_clog buffer is marked dirty but not physically written (unless taken out of the buffer pool) So I'm opting for Manfred's proposal. One comment I didn't understand from that proposal was If we set XMIN/MAX_IS_COMMITTED in a tuple header, we have to replace a sub-transaction xid in xmin or xmax respectively with the main-transaction xid at the same time. Otherwise we'd have to look for the main xid, whenever a tuple is touched. I suppose this is related to checking whether the xmax/xmin XID is in the snapshot used by the tester. I'm not sure how to attack this, since we cannot touch both things "at the same time". One idea would be to put all subtransactions' XIDs in all snapshots, and just don't update the xmin/xmax in the tuple. I'm not sure if this is workable. Explanations and other ideas are welcome. Memory Management ----------------- We would have two new global memory contexts: TopTransactionTreeContext This one would live for the duration of the whole transaction tree CommitContext This one would actually be several contexts, one for each subtransaction. If the subtrans rolls back, the context is deleted, but it is kept if the subtrans commits. This facilitates the handling of memory for things that have to be delayed until main transaction commit. The current TopTransactionContext would stay the same as now. Lock Management --------------- When a subtransaction aborts, it has to release all LWLocks and heavyweight locks it's currently holding. On subtrans commit, all locks are held and reassigned to the parent transaction. Only at main transaction commit are all locks released. Note that it's OK to abort a subtrans with LWLocks held, and they should be released; however, it's not OK to commit a subtrans with LWLocks held. This should be considered a programming error. Regarding regular locks, there are two ways of doing this: one is using the current TransactionId field, releasing those locks which have the current transaction ID. However this has it's own problems, because there are a couple of call sites that use the Xid for its own purposes (apparently it's only LockRelationForSession() -- the other one with a extraneous call seems to be XactLockTableInsert()/XactLockTableWait() but apparently they are OK). The other way would be keeping a List of all the Locks the current subtrans is holding and release them on abort. This could be difficult because the mechanism allows to upgrade a held lock, so aborting a subtrans that upgraded some lock should "downgrade" it to the previous level, but this sounds dangerous to me. Maybe we should consider aborting the first subtrans involved in the lock (could be very tricky to implement). Regarding LWLocks, the first mechanism is not available so we will have to stick to keeping a list within each subtrans. Note regarding deadlock detection: currently, the routine checks for deadlocks and if found aborts the current transaction. The new mechanism has to abort the innermost transaction, and then recheck for a deadlock condition again in a loop. File deletion, Deferred triggers, Asynchronous notifies -------------------------------------------------------- These mechanisms will have to keep lists of items corresponding to each subtransaction, and do something special to it on commit or on abort. Deftriggers and notifies can be dropped on transaction abort, which is very easy if we allocate them in the CommitContext. On subtrans commit, the list of them is reassigned to the parent subtrans. File deletion is different because we can drop some files on subtransaction abort. At commit, items are reassigned to parent. The one other thing to look for is on-commit actions. They require special handling but are not much different from the above. ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html