On Tue, Sep 17, 2019 at 9:43 AM Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote:
> > 3. Third, there is incremental writing of the page itself -- avoiding
> > using a temp buffer. Not sure where I stand on this.
>
> I think it's a good idea. memmove must be much faster than copying
> items tuple by tuple.
> I'll send next patch by the end of the week.

I think that the biggest problem is that we copy all of the tuples, including existing posting list tuples that can't be merged with anything. Even if you assume that we'll never finish early (e.g. by using logic like the "if (pagesaving >= newitemsz) deduplicate = false;" thing), this can still unnecessarily slow down deduplication. Very often, _bt_dedup_one_page() is called when 1/2 - 2/3 of the space on the page is already used by a small number of very large posting list tuples.

> > The loop within _bt_dedup_one_page() is very confusing in both v13 and
> > v14 -- I couldn't figure out why the accounting worked like this:
>
> I'll look at it.

I'm currently working on merging my refactored version of _bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky. (I have finished merging the other WAL-logging stuff, though -- that was easy.)

The general idea is that the loop in _bt_dedup_one_page() now explicitly operates with a "base" tuple, rather than *always* saving the prev tuple from the last loop iteration. We always have a "pending posting list", which won't be written out as a posting list if it isn't possible to merge at least one existing page item with it. The "base" tuple doesn't change. "pagesaving" space accounting works in a way that doesn't care about whether or not the base tuple was already a posting list -- it saves the size of the IndexTuple without any existing posting list size, and calculates the contribution to the total size of the new posting list separately (heap TIDs from the original base tuple and subsequent tuples are counted together).

This has a number of advantages:

* The loop is a lot clearer now, and seems to have slightly better space utilization because of improved accounting (with or without the "if (pagesaving >= newitemsz) deduplicate = false;" thing).
* I think that we're going to need to be disciplined about which tuple is the "base" tuple for correctness reasons -- we should always use the leftmost existing tuple to form a new posting list tuple. I am concerned about rare cases where we deduplicate tuples that are equal according to _bt_keep_natts_fast()/datum_image_eq(), but that nonetheless have different sizes (and are not bitwise equal). There are rare cases involving TOAST compression where that is just about possible (see the temp comments I added to _bt_keep_natts_fast() in the patch).

* It's clearly faster, because there is far less palloc() overhead -- the "land" unlogged table test completes in about 95.5% of the time taken by v15 (I disabled "if (pagesaving >= newitemsz) deduplicate = false;" for both versions here, to keep it simple and fair).

This also suggests that making _bt_dedup_one_page() do raw page adds and page deletes to the page in shared_buffers (i.e. don't use a temp buffer page) could pay off. As I went into at the start of this e-mail, unnecessarily doing expensive things like copying large posting lists around is a real concern.

Even if it isn't truly useful for _bt_dedup_one_page() to operate in a very incremental fashion, incrementalism is probably still a good thing to aim for -- it seems to make deduplication faster in all cases.

--
Peter Geoghegan