On 08/03/2019 12:22, Peter Geoghegan wrote:
I would like to work through these other items with you
(_bt_binsrch_insert() and so on), but I think that it would be helpful
if you made an effort to understand the minusinfkey stuff first. I
spent a lot of time improving the explanation of that within
_bt_compare(). It's important.

Ok, after thinking about it for a while, I think I understand the minus infinity stuff now. Let me try to explain it in my own words:

Imagine that you have an index with two key columns, A and B. The index has two leaf pages, with the following items:

+--------+   +--------+
| Page 1 |   | Page 2 |
|        |   |        |
|    1 1 |   |    2 1 |
|    1 2 |   |    2 2 |
|    1 3 |   |    2 3 |
|    1 4 |   |    2 4 |
|    1 5 |   |    2 5 |
+--------+   +--------+

The key space is neatly split on the first key column - probably thanks to the new magic in the page split code.

Now, what do we have as the high key of page 1? Answer: "2 -inf". The "-inf" is not stored in the key itself, the second key column is just omitted, and the search code knows to treat it implicitly as a value that's lower than any real value. Hence, "minus infinity".

However, during page deletion, we need to perform a search to find the downlink pointing to a leaf page. We do that by using the leaf page's high key as the search key. But the search needs to treat it slightly differently in that case. Normally, searching with a single key value, "2", we would land on page 2, because any real value beginning with "2" would be on that page, but in the page deletion case, we want to find page 1. Setting the BTScanInsert.minusinfkey flag tells the search code to do that.

Question: Wouldn't it be more straightforward to use "1 +inf" as page 1's high key? I.e treat any missing columns as positive infinity. That way, the search for page deletion wouldn't need to be treated differently. That's also how this used to work: all tuples on a page used to be <= high key, not strictly < high key. And it would also make the rightmost page less of a special case: you could pretend the rightmost page has a pivot tuple with all columns truncated away, which means positive infinity.

You have this comment _bt_split which touches the subject:

        /*
         * The "high key" for the new left page will be the first key that's 
going
         * to go into the new right page, or possibly a truncated version if 
this
         * is a leaf page split.  This might be either the existing data item at
         * position firstright, or the incoming tuple.
         *
         * The high key for the left page is formed using the first item on the
         * right page, which may seem to be contrary to Lehman & Yao's approach 
of
         * using the left page's last item as its new high key when splitting on
         * the leaf level.  It isn't, though: suffix truncation will leave the
         * left page's high key fully equal to the last item on the left page 
when
         * two tuples with equal key values (excluding heap TID) enclose the 
split
         * point.  It isn't actually necessary for a new leaf high key to be 
equal
         * to the last item on the left for the L&Y "subtree" invariant to hold.
         * It's sufficient to make sure that the new leaf high key is strictly
         * less than the first item on the right leaf page, and greater than or
         * equal to (not necessarily equal to) the last item on the left leaf
         * page.
         *
         * In other words, when suffix truncation isn't possible, L&Y's exact
         * approach to leaf splits is taken.  (Actually, even that is slightly
         * inaccurate.  A tuple with all the keys from firstright but the heap 
TID
         * from lastleft will be used as the new high key, since the last left
         * tuple could be physically larger despite being opclass-equal in 
respect
         * of all attributes prior to the heap TID attribute.)
         */

But it doesn't explain why it's done like that.

- Heikki

Reply via email to