Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan Sun, 06 Nov 2016 16:04:25 -0800

On Mon, Jul 4, 2016 at 2:30 AM, Heikki Linnakangas <[email protected]> wrote:
> I think we should pack the TIDs more tightly, like GIN does with the varbyte
> encoding. It's tempting to commit this without it for now, and add the
> compression later, but I'd like to avoid having to deal with multiple
> binary-format upgrades, so let's figure out the final on-disk format that we
> want, right from the beginning.


While the idea of duplicate storage is pretty obviously compelling,
there could be other, non-obvious benefits. I think that it could
bring further benefits if we could use duplicate storage to change
this property of nbtree (this is from the README):

"""
Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
page.  This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up).  Therefore we assume
Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page.  An
insertion that sees the high key of its target page is equal to the key
to be inserted has a choice whether or not to move right, since the new
key could go on either page.  (Currently, we try to find a page where
there is room for the new key without a split.)

"""

If we could *guarantee* that all keys in the index are unique, then we
could maintain the keyspace as L&Y originally described.

The practical benefits to this would be:

* We wouldn't need to take the extra step described above -- finding a
bounding key/separator key that's fully equal to our scankey would no
longer necessitate a probably-useless descent to the left of that key.
(BTW, I wonder if we could get away with not inserting a downlink into
parent when a leaf page split finds an identical IndexTuple in parent,
*without* changing the keyspace invariant I mention -- if we're always
going to go to the left of an equal-to-scankey key in an internal
page, why even have more than one?)

* This would make suffix truncation of internal index tuples easier,
and that's important.

The traditional reason why suffix truncation is important is that it
can keep the tree a lot shorter than it would otherwise be. These
days, that might not seem that important, because even if you have
twice the number of internal pages than strictly necessary, that still
isn't that many relative to typical main memory size (and even CPU
cache sizes, perhaps).

The reason I think it's important these days is that not having suffix
truncation makes our "separator keys" overly prescriptive about what
part of the keyspace is owned by each internal page. With a pristine
index (following REINDEX), this doesn't matter much. But, I think that
we get much bigger problems with index bloat due to the poor fan-out
that we sometimes see due to not having suffix truncation, *combined*
with the page deletion algorithms restriction on deleting internal
pages (it can only be done for internal pages with *no* children).

Adding another level or two to the B-Tree makes it so that your
workload's "sparse deletion patterns" really don't need to be that
sparse in order to bloat the B-Tree badly, necessitating a REINDEX to
get back to acceptable performance (VACUUM won't do it). To avoid
this, we should make the internal pages represent the key space in the
least restrictive way possible, by applying suffix truncation so that
it's much more likely that things will *stay* balanced as churn
occurs. This is probably a really bad problem with things like
composite indexes over text columns, or indexes with many NULL values.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Reply via email to