Hi Peter, On Wed, 31 Dec 2025 19:21:24 +0900 Yugo Nagata <[email protected]> wrote:
> > I wonder if we also do something about these existing _bt_binsrch comments: > > > > * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber > > * of the last key < given scankey, or last key <= given scankey if nextkey > > * is true. (Since _bt_compare treats the first data key of such a page as > > * minus infinity, there will be at least one key < scankey, so the result > > * always points at one of the keys on the page.) > > > > Here we describe what happens on an internal page. This is correct, > > but *seems* to contradict the higher level comments at the end of > > _bt_first. There is actually no contradiction between the two -- not > > when you understand that _bt_first describes the whole end-to-end > > process of calling _bt_search and then calling _bt_binsrch on the leaf > > page (not of calling _bt_binsrch once, against an internal page). > > > > One could think of this _bt_binsrch internal page behavior as > > compensating for the fact that internal pages have pivot tuples that > > consist of a separator key (except for the first key, which is just > > has a -inf key/no key), plus a downlink that points to the *next* page > > after that separator key one level down (except for the internal page > > high key, which has no downlink at all). Might be useful to say > > something like that instead. > > > > This is hard to explain without an example. Right now, an internal > > page might have pivot tuples that look like this: > > > > Separator: -inf, Downlink: 1 > > Separator: 'a', Downlink: 2 > > Separator: 'c', Downlink: 3 > > Separator: 'f', Downlink: (none, this is the high key) > > > > But _bt_binsrch "pretends" that our internal page actually contains: > > > > Downlink: 1 > > Separator: 'a' > > Downlink: 2 > > Separator: 'c' > > Downlink: 3 > > Separator: 'f' > > > > So if our = scan key is (say) 'c', we should descend using the > > downlink that points to block 2 (not the one that points to block 3, > > as might be expected from looking at the real physical representation > > consisting of pivot tuples). That is what the scan needs to do to get > > to the page one level down whose high key is also 'c' -- that's where > > the scan needs to look. (If the next level down is the leaf level, > > then the leaf page we descend to might also contain a *non*-pivot > > tuple with the key value 'c', "right before" the high key with 'c', > > which the scan will need to return in _bt_readpage. Lehman & Yao allow > > the key before a leaf page's high key to be equal to the high key, but > > otherwise forbid all duplicate keys.) > > > > I find it very hard to know what explanation will be the least > > confusing to other people, at least in this area. Since you're > > interested in this area, I thought I'd ask what you think. > > I do not see any contradiction between the comment on _bt_binsrch and > the comments at the end of _bt_first. The former explicitly states that > it refers to internal (non-leaf) pages, and I understand the latter to > describe loading data from a leaf page. > > That said, we could make this clearer by explicitly mentioning the leaf > page in the first sentence, for example: > > * Now load data from the first leaf page of the scan (usually the > *page currently in so->currPos.buf). I've added a follow-up patch to clarify the _bt_binsrch comments a bit more, distinguishing internal and leaf page behavior. Does this help reduce your concern? Regards, Yugo Nagata -- Yugo Nagata <[email protected]>
>From 322f6d71f290ae8e2581539d1b1c4670f61ec7ae Mon Sep 17 00:00:00 2001 From: Yugo Nagata <[email protected]> Date: Tue, 24 Mar 2026 18:26:48 +0900 Subject: [PATCH v2 2/2] Add a berief general comment on BTScanInsertData's nextkey and backward --- src/include/access/nbtree.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index da7503c57b6..00997a67bf4 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -774,8 +774,9 @@ typedef BTStackData *BTStack; * bit, but may not when inserting into an INCLUDE index (tuple header value * is affected by the NULL-ness of both key and non-key attributes). * - * See comments in _bt_first for an explanation of the nextkey and backward - * fields. + * nextkey determines how the scankey's boundary is interpreted, and backward + * indicates a backward scan. See comments in _bt_first for a more detailed + * explanation of these fields. * * scantid is the heap TID that is used as a final tiebreaker attribute. It * is set to NULL when index scan doesn't need to find a position for a -- 2.43.0
>From bbe35ac54d5b0de5ebc61131c5deb91fc026f4de Mon Sep 17 00:00:00 2001 From: Yugo Nagata <[email protected]> Date: Tue, 24 Mar 2026 18:23:14 +0900 Subject: [PATCH v2 1/2] Clarify _bt_binsrch comments for internal vs leaf page usage --- src/backend/access/nbtree/nbtsearch.c | 28 ++++++++++++++------------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c index aae6acb7f57..5ebb6a061d0 100644 --- a/src/backend/access/nbtree/nbtsearch.c +++ b/src/backend/access/nbtree/nbtsearch.c @@ -324,17 +324,19 @@ _bt_moveright(Relation rel, /* * _bt_binsrch() -- Do a binary search for a key on a particular page. * - * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber - * of the last key < given scankey, or last key <= given scankey if nextkey - * is true. (Since _bt_compare treats the first data key of such a page as - * minus infinity, there will be at least one key < scankey, so the result - * always points at one of the keys on the page.) + * When called from _bt_search() to find a downlink on an internal (non-leaf) + * page, _bt_binsrch() returns the OffsetNumber of the last key < given scankey, + * or last key <= given scankey if nextkey is true. (Since _bt_compare treats + * the first data key of such a page as minus infinity, there will be at least + * one key < scankey, so the result always points at one of the keys on the + * page.) * - * On a leaf page, _bt_binsrch() returns the final result of the initial - * positioning process that started with _bt_first's call to _bt_search. - * We're returning a non-pivot tuple offset, so things are a little different. - * It is possible that we'll return an offset that's either past the last - * non-pivot slot, or (in the case of a backward scan) before the first slot. + * When called from _bt_first() to find a key on a leaf page, _bt_binsrch() + * returns the final result of the initial positioning process that started + * with _bt_search. Since we're returning a non-pivot tuple offset, the logic + * differs slightly. It is possible that we'll return an offset that's either + * past the last non-pivot slot, or (in the case of a backward scan) before the + * first slot. * * This procedure is not responsible for walking right, it just examines * the given page. _bt_binsrch() has no lock or refcount side effects @@ -1538,12 +1540,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) } } - /* position to the precise item on the page */ + /* position to the precise item on the leaf page */ offnum = _bt_binsrch(rel, &inskey, so->currPos.buf); /* - * Now load data from the first page of the scan (usually the page - * currently in so->currPos.buf). + * Now load data from the first leaf page of the scan (usually the + * page currently in so->currPos.buf). * * If inskey.nextkey = false and inskey.backward = false, offnum is * positioned at the first non-pivot tuple >= inskey.scankeys. -- 2.43.0
