Attached revision only uses heavyweight page locks across complex
operations. I haven't benchmarked it, but it appears to perform
reasonably well. I haven't attempted to measure a regression for
regular insertion, but offhand it seems likely that any regression
would be well within the noise - more or less immeasurably small. I
won't repeat too much of what is already well commented in the patch.
For those that would like a relatively quick summary of what I've
done, I include inline a new section that I've added to the nbtree

Notes about speculative insertion

As an amcanunique AM, the btree implementation is required to support
"speculative insertion".  This means that the value locking method
through which unique index enforcement conventionally occurs is
extended and generalized, such that insertion is staggered:  the core
code attempts to get full consensus on whether values proposed for
insertion will not cause duplicate key violations.  Speculative
insertion is only possible for unique index insertion without deferred
uniqueness checking (since speculative insertion into a deferred
unique constraint's index is a contradiction in terms).

For conventional unique index insertion, the Btree implementation
exclusive locks a buffer holding the first page that the value to be
inserted could possibly be on, though only for an instant, during and
shortly after uniqueness verification.  It would not be acceptable to
hold this lock across complex operations for the duration of the
remainder of the first phase of speculative insertion.  Therefore, we
convert this exclusive buffer lock to an exclusive page lock managed
by the lock manager, thereby greatly ameliorating the consequences of
undiscovered deadlocking implementation bugs (though deadlocks are not
expected), and minimizing the impact on system interruptibility, while
not affecting index scans.

It may be useful to informally think of the page lock type acquired by
speculative insertion as similar to an intention exclusive lock, a
type of lock found in some legacy 2PL database systems that use
multi-granularity locking.  A session establishes the exclusive right
to subsequently establish a full write lock, without actually blocking
reads of the page unless and until a lock conversion actually occurs,
at which point both reads and writes are blocked.  Under this mental
model, buffer shared locks can be thought of as intention shared

As implemented, these heavyweight locks are only relevant to the
insertion case; at no other point are they actually considered, since
insertion is the only way through which new values are introduced.
The first page a value proposed for insertion into an index could be
on represents a natural choke point for our extended, though still
rather limited system of value locking.  Naturally, when we perform a
"lock escalation" and acquire an exclusive buffer lock, all other
buffer locks on the same buffer are blocked, which is how the
implementation localizes knowledge about the heavyweight lock to
insertion-related routines.  Apart from deletion, which is
concomitantly prevented by holding a pin on the buffer throughout, all
exclusive locking of Btree buffers happen as a direct or indirect
result of insertion, so this approach is sufficient. (Actually, an
exclusive lock may still be acquired without insertion to initialize a
root page, but that hardly matters.)

Note that all value locks (including buffer pins) are dropped
immediately as speculative insertion is aborted, as the implementation
waits on the outcome of another xact, or as "insertion proper" occurs.
These page-level locks are not intended to last more than an instant.
In general, the protocol for heavyweight locking Btree pages is that
heavyweight locks are acquired before any buffer locks are held, while
the locks are only released after all buffer locks are released.
While not a hard and fast rule, presently we avoid heavyweight page
locking more than one page per unique index concurrently.

Happy new year
Peter Geoghegan

Attachment: btreelock_insert_on_dup.v5.2013_12_28.patch.gz
Description: GNU Zip compressed data

Sent via pgsql-hackers mailing list (
To make changes to your subscription:

Reply via email to