Summary of algorithm to use "promise tuples" for concurrency control
during UPSERT

1. Perform btree search to location of key, if it exists.
a) If an unkilled index tuple exists, we decide this is an UPDATE and
drop straight thru to step 2
b) If it does not exist, insert a "promise" tuple into unique index(s)
- marked with the xid of the inserting transaction, but using the key.
This happens while the page is locked, so it is not possible to insert
a second promise tuple concurrently.
Record the btree blockid on the index scan and move to step 3
When later insert scans see the promise tuple they perform
XactLockTableWait() and when they get control they look again for the
key. If they find a promise tuple with an aborted xid they replace
that value with their own xid and continue as a). Otherwise b).

2. Find existing heap tuple
Find heap tuple.
Check it is actually valid. If not, go back to (1), kill the prior
tuple and follow 1b) path
If it is valid, perform heap_update as normal.

3. Insert new heap tuple
Perform heap_insert
Re-find index tuple using the btree blockid recorded at step 1; this
may require moving right until we find the actual value we are looking
for, so block splits don't negatively affect this approach.
Once re-found we change the index tuple from a promise tuple to a
normal index tuple, by setting tid and removing promise flag. Tuple
remains same length because the value was known when promise tuple
inserted, so this is an inplace update.
Insert other index values normally.

If a transaction that inserted a promise tuple dies, how is that cleaned up?
Any user that sees a dead promise tuple with a value they want will clean it up.
Dead promise tuples can be removed as needed, just as killed tuples
currently are.
VACUUM can remove dead transactions from index as it scans, no problems.

Index bloat is less of a problem than with normal inserts since there
are additional ways of removing promise tuples. Only one index tuple
at a time can have a specific value, so we would actually reduce heap
bloat caused by concurrent inserts.

It's possible we find existing rows that we can't see according to our
snapshot. That is handled in exactly the same as the way we treat

There is a potential loop here in that we might find an index tuple,
follow it, find out the tuple is actually dead then return to insert a
promise tuple, only to find that someone else just did that and we
have to wait/re-follow the link to update the new tuple. That is an
extremely unlikely race condition, though possible; there is no heavy
locking to prevent that in this approach. In principle deadlocks are
possible from that, but that is not any different from the current
case where people that insert same key at same time might cause
deadlocks. Its not a common application pattern and not something we
should be protecting against.

All of this is only needed for unique indexes.

It can handled by a new API call for acquire_value_lock() and
release_value_lock(), or similar.

* We don't do anything weird in the heap - if this breaks, data is not corrupt
* There is no heap bloat or index bloat above existing levels
* The approach leverages existing mechanism for transaction waiting
* Optimistic approach to value locking will improve performance over
heavy weight locking

* Not written yet - <1 month to code, still possible for 9.5, doesn't look hard

Other stated possible disadvantages
* Violates existing invariant that every index tuple has a
corresponding heap tuple, which goes back to the Berkeley days.
Currently, we always create heap tuples first, and physically delete
them last. [Why is that a problem? Could be, but why?]
("Deleting them last" implies something is being touched in that
regard by this suggestion. I'm not sure what you are referring to)

* Relatedly, concern about breaking VACUUM. VACUUMing of btrees occurs
with a list of TIDs to kill in index, built from the heap. Current
bulk delete callback cannot be used for this. Super-exclusive lock is
needed to delete tuples in btree page (i.e. VACUUM). Normally skipping
of LockBufferForCleanup() (as added by bbb6e559) works okay in heap
because it tends to imply that list of tids won't be bulk deleted in
index, where we currently cannot skip for failure to quickly acquire
super exclusive lock. So this could make the problem addressed by
bbb6e559 return to some extent.
[Don't see any problems; just test the xid as we scan a promise tuple
and remove it if needed]

"Index-only" bloat becomes a possibility. Implications are not well understood.
[FUD, no reason to believe there is a problem.]

We have to re-find any items using an ordinary index scan, not a tid.
Must match our xid to that.
[Explained above, easy and efficient.]

Doesn't have a strategy for dealing with unprincipled deadlocks, at
least at the moment. Presumably some aspects of #2 could be adopted
[FUD. Unprincipled deadlock still not properly explained as yet. No
way of telling whether it will happen with this approach, or not].

Comments please.

 Simon Riggs         
 PostgreSQL Development, 24x7 Support, Training & Services

Sent via pgsql-hackers mailing list (
To make changes to your subscription:

Reply via email to