On Tue, Sep 17, 2019 at 9:43 AM Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote:
> > 3. Third, there is incremental writing of the page itself -- avoiding
> > using a temp buffer. Not sure where I stand on this.
>
> I think it's a good idea. memmove must be much faster than copying
> items tuple by tuple.
> I'll send next patch by the end of the week.

I think that the biggest problem is that we copy all of the tuples, including existing posting list tuples that can't be merged with anything. Even if you assume that we'll never finish early (e.g. by using logic like the "if (pagesaving >= newitemsz) deduplicate = false;" thing), this can still unnecessarily slow down deduplication. Very often, _bt_dedup_one_page() is called when 1/2 - 2/3 of the space on the page is already used by a small number of very large posting list tuples.

> > The loop within _bt_dedup_one_page() is very confusing in both v13 and
> > v14 -- I couldn't figure out why the accounting worked like this:
>
> I'll look at it.

I'm currently working on merging my refactored version of _bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky. (I have finished merging the other WAL-logging stuff, though -- that was easy.)

The general idea is that the loop in _bt_dedup_one_page() now explicitly operates with a "base" tuple, rather than *always* saving the prev tuple from the last loop iteration. We always have a "pending posting list", which won't be written out as a posting list if it isn't possible to merge at least one existing page item with it. The "base" tuple doesn't change. "pagesaving" space accounting works in a way that doesn't care about whether or not the base tuple was already a posting list -- it saves the size of the IndexTuple without any existing posting list size, and calculates the contribution to the total size of the new posting list separately (heap TIDs from the original base tuple and subsequent tuples are counted together).

This has a number of advantages:

* The loop is a lot clearer now, and seems to have slightly better space utilization because of improved accounting (with or without the "if (pagesaving >= newitemsz) deduplicate = false;" thing).
* I think that we're going to need to be disciplined about which tuple is the "base" tuple for correctness reasons -- we should always use the leftmost existing tuple to form a new posting list tuple. I am concerned about rare cases where we deduplicate tuples that are equal according to _bt_keep_natts_fast()/datum_image_eq(), but that nonetheless have different sizes (and are not bitwise equal). There are rare cases involving TOAST compression where that is just about possible (see the temp comments I added to _bt_keep_natts_fast() in the patch).

* It's clearly faster, because there is far less palloc() overhead -- the "land" unlogged table test completes in about 95.5% of the time taken by v15 (I disabled "if (pagesaving >= newitemsz) deduplicate = false;" for both versions here, to keep it simple and fair).

This also suggests that making _bt_dedup_one_page() do raw page adds and page deletes to the page in shared_buffers (i.e. don't use a temp buffer page) could pay off. As I went into at the start of this e-mail, unnecessarily doing expensive things like copying large posting lists around is a real concern.

Even if it isn't truly useful for _bt_dedup_one_page() to operate in a very incremental fashion, incrementalism is probably still a good thing to aim for -- it seems to make deduplication faster in all cases.

--
Peter Geoghegan