On 05/27/2014 09:17 AM, Peter Geoghegan wrote:
While speccing out a new B-Tree verification tool, I had the
opportunity to revisit a thought I had during review concerning
_bt_findinsertloc(): that the new coding is unsafe in the event of
deferred split completion of a leaf page of a unique index. To recap,
we now do the following in _bt_findinsertloc():
/*
* If this page was incompletely split, finish the split now. We
* do this while holding a lock on the left sibling, which is not
* good because finishing the split could be a fairly lengthy
* operation. But this should happen very seldom.
*/
if (P_INCOMPLETE_SPLIT(lpageop))
{
_bt_finish_split(rel, rbuf, stack);
rbuf = InvalidBuffer;
continue;
}
The "left sibling" referred to here is "the first page this key could
be on", an important concept for unique index enforcement.
No, it's not "the first page this key could be on".
_bt_findinsertloc() does *not* hold a lock on the first valid page the
key could go to. It merely ensures that when it steps to the next page,
it releases the lock on the previous page only after acquiring the lock
on the next page. Throughout the operation, it will hold a lock on
*some* page that could legally hold the inserted value, and it acquires
the locks in left-to-right order. This is sufficient for the uniqueness
checks, because _bt_unique_check() scans all the pages, and
_bt_unique_check() *does* hold the first page locked while it scans the
rest of the pages. But _bt_findinsertlock() does not.
Also note that _bt_moveright() also finishes any incomplete splits it
encounters (when called for an insertion). So _bt_findinsertloc() never
gets called on a page with the incomplete-split flag set. It might
encounter one when it moves right, but never the first page.
Anyway, the concern is that there may be problems when we call
_bt_finish_split() with that left sibling locked thoughout (i.e.
finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
itself has a right sibling from the incomplete split (which is usually
the value lock page's right-right sibling)). I'm not concerned about
performance, since as the comment says it ought to be an infrequent
occurrence. I also believe that there are no deadlock hazards. But
consider this scenario:
* We insert the value 7 into an int4 unique index. We need to split
the leaf page. We run out of memory or something, and ours is an
incomplete page split. Our transaction is aborted. For the sake of
argument, suppose that there are also already a bunch of dead tuples
within the index with values of 7, 8 and 9.
If I understood correctly, the tree looks like this before the insertion:
Parent page:
+-------------+
| |
| 9 -> A |
+-------------+
Leaf A
+-------------+
| HI-key: 9 |
| |
| 7 8 9 |
+-------------+
And after insertion and incomplete split:
Parent page
+-------------+
| |
| 9 -> A |
+-------------+
Leaf A Leaf B
+--------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| (INCOMPLETE_ | | |
| SPLIT) | <-> | |
| | | |
| 7 7* 8 | | 9 |
+--------------+ +-------------+
where 7* is the newly inserted key, with value 7.
(you didn't mention at what point the split happens, but in the next
paragraph you said the new downlink has value 8, which implies the above
split)
* Another inserter of the value 7 comes along. It follows exactly the
same downlink as the first, now aborted inserter (suppose the
downlink's value is 9). It also locks the same leaf page to establish
a "value lock" in precisely the same manner. Finding no room on the
first page, it looks further right, maintaining its original "value
lock" throughout. It finishes the first inserter's split of the
non-value-lock page - a new downlink is inserted into the parent page,
with the value 8. It then releases all buffer locks except the first
one/original "value lock". A physical insertion has yet to occur.
Hmm, I think you got confused at this step. When inserting a 7, you
cannot "look further right" to find a page with more space, because the
HI-key, 8, on the first page stipulates that 7 must go on that page, not
some later page.
* A third inserter of the value comes along. It gets to a later page
than the one locked by the second inserter, preferring the newer
downlink with value 8 (the internal-page _bt_binsrch() logic ensures
this).
That's a contradiction: the downlink with value 8 points to the first
page, not some later page. After the split is finished, the tree looks
like this:
Parent page
+-------------+
| 8 -> A |
| 9 -> B |
+-------------+
Leaf A Leaf B
+-------------+ +-------------+
| HI-key: 8 | | HI-key: 9 |
| | <-> | |
| 7 7* 8 | | 9 |
+-------------+ +-------------+
Regardless of whether or not I have these details exactly right (that
is, regardless of whether or not this scenario is strictly possible) I
suggest as a code-hardening measure that _bt_findinsertloc() release
its "value lock", upon realizing it must complete splits, and then
complete the split or splits known to be required. It would finally
report that it "couldn't find" an insertion location to
_bt_doinsert(), which would then retry from the start, just like when
_bt_check_unique() finds an inconclusive conflict. The only difference
is that we don't have an xact to wait on. We haven't actually done
anything extra that makes this later "goto top;" any different to the
existing one.
This should occur so infrequently that it isn't worth trying harder,
or worth differentiating between the UNIQUE_CHECK_NO and
!UNIQUE_CHECK_NO cases when retrying. This also removes the more
general risk of holding an extra buffer lock during page split
completion.
Yeah, that would work too. It seems safe enough as it is, though, so I
don't see the point.
It kind of looks like _bt_findinsertloc() doesn't have this bug,
because in my scenario _bt_finish_split() is called with both the
value lock and its right page locked (so the right page is the left
page for _bt_finish_split()'s purposes). But when you take another
look, and realize that _bt_finish_split() releases its locks, and so
once again only the original value lock will be held when
_bt_finish_split() returns, and so the downlink is there to skip the
value locked page, you realize that the bug does exist (assuming that
I haven't failed to consider some third factor and am not otherwise
mistaken).
When inserting, the scan for the right insert location always begins
from the first page where the key can legally go to. Inserting a missing
downlink doesn't change what that page is - it just makes it faster to
find, by reducing the number of right-links you need to follow.
PS. Thanks for looking into this again! These B-tree changes really need
thorough review.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers