Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Heikki Linnakangas Sun, 10 Mar 2019 07:10:09 -0700

On 08/03/2019 23:21, Peter Geoghegan wrote:

On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <p...@bowt.ie> wrote:

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.


This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or having a negative infinity sentinel value in scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?

"descendrighttrunc" doesn't make much sense to me, either. I don'tunderstand it. Maybe a comment would make it clear, though.

I don't feel like this is an optimization. It's natural consequence ofwhat the high key means. I guess you can think of it as an optimization,in the same way that not fully scanning the whole index for every searchis an optimization, but that's not how I think of it :-).

If we don't flip the meaning of the flag, then maybe calling itsomething like "searching_for_leaf_page" would make sense:


/*
 * When set, we're searching for the leaf page with the given high key,
 * rather than leaf tuples matching the search keys.
 *
 * Normally, when !searching_for_pivot_tuple, if a page's high key
 * has truncated columns, and the columns that are present are equal to
 * the search key, the search will not descend to that page. For
 * example, if an index has two columns, and a page's high key is
 * ("foo", <omitted>), and the search key is also ("foo," <omitted>),
 * the search will not descend to that page, but its right sibling. The
 * omitted column in the high key means that all tuples on the page must
 * be strictly < "foo", so we don't need to visit it. However, sometimes
 * we perform a search to find the parent of a leaf page, using the leaf
 * page's high key as the search key. In that case, when we search for
 * ("foo", <omitted>), we do want to land on that page, not its right
 * sibling.
 */
bool    searching_for_leaf_page;

As the patch stands, you're also setting minusinfkey when dealing withv3 indexes. I think it would be better to only setsearching_for_leaf_page in nbtpage.c. In general, I think BTScanInsertshould describe the search key, regardless of whether it's a V3 or V4index. Properties of the index belong elsewhere. (We're violating thatby storing the 'heapkeyspace' flag in BTScanInsert. That wart isprobably OK, it is pretty convenient to have it there. But in principle...)


- Heikki

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Reply via email to