On Tue, Jul 22, 2014 at 10:49 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> Basically I think it will be better if you can explain in bit more detail
>> that
>> how does "right-links at all levels and high-key" helps to detect and
>> recover from concurrent page splits.
>
> You might be right about that - perhaps I should go into more detail.

I've gone into more detail on what a high key is, and how we arguably
do not follow Lehman & Yao to the letter because we still have read
locks. L&Y's assumption of atomic page reads/writes, and cursory
handling of deletion kind of make it inevitable that read locks are
used, which I now imply. So nbtree isn't a substandard L&Y
implementation - it's a realistic one, which only needs to hold a
single read lock at a time when servicing index scans (or when finding
a place for insertion). I guess Lehman & Yao preferred to put forward
the claim "no read locks" rather than "only one read lock at a time on
internal pages, even for insertion" because there might be some uses
of their algorithm where that is actually realistic. It is really a
sympathetic way of spinning things to say that *no* read locks are
used, though.

-- 
Peter Geoghegan
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
new file mode 100644
index 4820f76..8285050
*** a/src/backend/access/nbtree/README
--- b/src/backend/access/nbtree/README
*************** use a simplified version of the deletion
*** 11,16 ****
--- 11,50 ----
  Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,
  Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).
  
+ Lehman and Yao don't require read locks, but assume that in-memory
+ copies of tree pages are unshared.  Postgres shares in-memory buffers
+ among backends.  As a result, we do page-level read locking on btree
+ pages in order to guarantee that no record is modified while we are
+ examining it.  This reduces concurrency but guarantees correct
+ behavior.  An advantage is that when trading in a read lock for a
+ write lock, we need not re-read the page after getting the write lock.
+ Since we're also holding a pin on the shared buffer containing the
+ page, we know that buffer still contains the page and is up-to-date.
+ 
+ Although it could be argued that Lehman and Yao isn't followed to the
+ letter because single pages are read locked as the tree is descended,
+ this is at least necessary to support deletion, a common requirement
+ which L&Y hardly acknowledge.  Read locks also ensure that B-tree
+ pages are self-consistent (L&Y appear to assume atomic page reads and
+ writes).  Even with these read locks, following L&Y obviates the need
+ of earlier schemes to hold multiple locks concurrently when descending
+ the tree as part of servicing index scans (pessimistic lock coupling).
+ The addition of right-links at all levels, as well as the addition of
+ a page "high key" allows detection and dynamic recovery from
+ concurrent page splits (that is, splits between unlocking an internal
+ page, and subsequently locking its child page during a descent).  When
+ a page is first locked (at every level of a descent servicing an index
+ scan), we consider the need to "move right":  if the scankey value is
+ less than (or sometimes less than or equal to) the page's existing
+ highkey value, a value which serves as an upper bound for values on
+ the page generally, then it must be necessary to move the scan to the
+ right-hand page on the same level.  It's even possible that the scan
+ needs to move right more than once.  Once the other session's
+ concurrent page split finishes, a downlink will be inserted into the
+ parent, and so assuming there are no further page splits, future index
+ scans using the same scankey value will not need to move right.  L&Y
+ Trees are sometimes referred to as "B-Link trees" in the literature.
+ 
  The Lehman and Yao Algorithm and Insertions
  -------------------------------------------
  
*************** to be inserted has a choice whether or n
*** 42,57 ****
  key could go on either page.  (Currently, we try to find a page where
  there is room for the new key without a split.)
  
- Lehman and Yao don't require read locks, but assume that in-memory
- copies of tree pages are unshared.  Postgres shares in-memory buffers
- among backends.  As a result, we do page-level read locking on btree
- pages in order to guarantee that no record is modified while we are
- examining it.  This reduces concurrency but guarantees correct
- behavior.  An advantage is that when trading in a read lock for a
- write lock, we need not re-read the page after getting the write lock.
- Since we're also holding a pin on the shared buffer containing the
- page, we know that buffer still contains the page and is up-to-date.
- 
  We support the notion of an ordered "scan" of an index as well as
  insertions, deletions, and simple lookups.  A scan in the forward
  direction is no problem, we just use the right-sibling pointers that
--- 76,81 ----
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to