I've been thinking about using heap TID as a tie-breaker when comparing B-Tree index tuples for a while now [1]. I'd like to make all tuples at the leaf level unique, as assumed by L&Y. This can enable "retail index tuple deletion", which I think we'll probably end up implementing in some form or another, possibly as part of the zheap project. It's also possible that this work will facilitate GIN-style deduplication based on run length encoding of TIDs, or storing versioned heap TIDs in an out-of-line nbtree-versioning structure (unique indexes only). I can see many possibilities, but we have to start somewhere.
I attach an unfinished prototype of suffix truncation, that also sometimes *adds* a new attribute in pivot tuples. It adds an extra heap TID from the leaf level when truncating away non-distinguishing attributes during a leaf page split, though only when it must. The patch also has nbtree treat heap TID as a first class part of the key space of the index. Claudio wrote a patch that did something similar, though without the suffix truncation part [2] (I haven't studied his patch, to be honest). My patch is actually a very indirect spin-off of Anastasia's covering index patch, and I want to show what I have in mind now, while it's still swapped into my head. I won't do any serious work on this project unless and until I see a way to implement retail index tuple deletion, which seems like a multi-year project that requires the buy-in of multiple senior community members. On its own, my patch regresses performance unacceptably in some workloads, probably due to interactions with kill_prior_tuple()/LP_DEAD hint setting, and interactions with page space management when there are many "duplicates" (it can still help performance in some pgbench workloads with non-unique indexes, though). Note that the approach to suffix truncation that I've taken isn't even my preferred approach [3] -- it's a medium-term solution that enables making a heap TID attribute part of the key space, which enables everything else. Cheap incremental/retail tuple deletion is the real prize here; don't lose sight of that when looking through my patch. If we're going to teach nbtree to truncate this new implicit heap TID attribute, which seems essential, then we might as well teach nbtree to do suffix truncation of other (user-visible) attributes while we're at it. This patch isn't a particularly effective implementation of suffix truncation, because that's not what I'm truly interested in improving here (plus I haven't even bothered to optimize the logic for picking a split point in light of suffix truncation). amcheck ======= This patch adds amcheck coverage, which seems like essential infrastructure for developing a feature such as this. Extensive amcheck coverage gave me confidence in my general approach. The basic idea, invariant-wise, is to treat truncated attributes (often including a truncated heap TID attribute in internal pages) as "minus infinity" attributes, which participate in comparisons if and only if we reach such attributes before the end of the scan key (a smaller keysz for the index scan could prevent this). I've generalized the minus infinity concept that _bt_compare() has always considered as a special case, extending it to individual attributes. It's actually possible to remove that old hard-coded _bt_compare() logic with this patch applied without breaking anything, since we can rely on the comparison of an explicitly 0-attribute tuple working the same way (pg_upgrade'd databases will break if we do this, however, so I didn't go that far). Note that I didn't change the logic that has _bt_binsrch() treat internal pages in a special way when tuples compare as equal. We still need that logic for cases where keysz is less than the number of indexed columns. It's only possible to avoid this _bt_binsrch() thing for internal pages when all attributes, including heap TID, were specified and compared (an insertion scan key has to have an entry for every indexed column, including even heap TID). Doing better there doesn't seem worth the trouble of teaching _bt_compare() to tell the _bt_binsrch() caller about this as a special case. That means that we still move left on equality in some cases where it isn't strictly necessary, contrary to L&Y. However, amcheck verifies that the classic "Ki < v <= Ki+1" invariant holds (as opposed to "Ki <= v <= Ki+1") when verifying parent/child relationships, which demonstrates that I have restored the classic invariant (I just don't find it worthwhile to take advantage of it within _bt_binsrch() just yet). Most of this work was done while I was an employee of VMware, though I joined Crunchy data on Monday and cleaned it up a bit more since then. I'm excited about joining Crunchy, but I should also acknowledge VMware's strong support of my work. [1] https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute [2] https://postgr.es/m/cagtbqpz-ktrqiaa13xg1gne461yowra-s-yccqptyfrpkta...@mail.gmail.com [3] https://wiki.postgresql.org/wiki/Key_normalization#Suffix_truncation_of_normalized_keys -- Peter Geoghegan
0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch
Description: Binary data