On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <p...@bowt.ie> wrote:
> There seems to be a kind of "synergy" between the nbtsplitloc.c
> handling of pages that have lots of duplicates and posting list
> compression. It seems as if the former mechanism "sets up the bowling
> pins", while the latter mechanism "knocks them down", which is really
> cool. We should try to gain a better understanding of how that works,
> because it's possible that it could be even more effective in some
> cases.

I found another important way in which this synergy can fail to take
place, which I can fix.

With the BT_COMPRESS_THRESHOLD limit removed entirely, certain indexes
from my test suite become much smaller, while most are unaffected. The
indexes that shrink are the ones that were not helped much by the
patch before. For
example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
of duplicates of a single value (that's how it appears after an
initial TPC-E bulk load), as are a couple of other TPC-E indexes.
TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
idx_lineitem_orderkey index becomes ~15% smaller.

I believe that this happened because rightmost page splits were an
inefficient case for compression. Indexes that are heavy on rightmost
page splits and full of duplicates are not that uncommon, though.
Think of any index with many NULL values, for example.

I don't know for sure whether BT_COMPRESS_THRESHOLD should be removed;
I'm not sure what the idea behind it is. My sense is that we're likely
to benefit from delaying page splits no matter what, though I am still
looking at it purely from a space utilization point of view, at least
for now.
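
To make that concrete, here is a toy sketch of the kind of gate I have
in mind. None of this is the actual nbtree code: the type, the helper
functions, and the threshold value are all made up for illustration,
and I'm only guessing that the limit acts as a minimum item count
before compression is attempted.

/*
 * Toy, self-contained sketch (not PostgreSQL code) of a
 * BT_COMPRESS_THRESHOLD-style gate: only attempt duplicate
 * compression once a leaf page holds at least THRESHOLD items,
 * otherwise fall through to a page split right away.
 */
#include <stdbool.h>
#include <stdio.h>

#define BT_COMPRESS_THRESHOLD 10	/* hypothetical cutoff */

typedef struct
{
	int			nitems;			/* index tuples on the toy leaf page */
	int			nduplicates;	/* tuples sharing a key with a neighbor */
} ToyLeafPage;

/* Pretend compression: folds duplicates into posting lists. */
static bool
try_compress_duplicates(ToyLeafPage *page)
{
	if (page->nduplicates == 0)
		return false;
	page->nitems -= page->nduplicates;
	page->nduplicates = 0;
	return true;
}

/* Returns true if compression avoided a split, false if we must split. */
static bool
maybe_compress_before_split(ToyLeafPage *page, bool use_threshold)
{
	/*
	 * With the threshold, a page that is full of duplicates but holds
	 * fewer than THRESHOLD items splits even though deduplication
	 * could have bought more room.  Without it, we always try first.
	 */
	if (use_threshold && page->nitems < BT_COMPRESS_THRESHOLD)
		return false;
	return try_compress_duplicates(page);
}

int
main(void)
{
	ToyLeafPage page = {.nitems = 8, .nduplicates = 7};

	printf("with threshold:    %s\n",
		   maybe_compress_before_split(&page, true) ? "compressed" : "split");

	page = (ToyLeafPage) {.nitems = 8, .nduplicates = 7};
	printf("without threshold: %s\n",
		   maybe_compress_before_split(&page, false) ? "compressed" : "split");
	return 0;
}

The point is just that any early-out like this means a page can split
even when deduplication would have freed enough space to absorb the
incoming tuple.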

-- 
Peter Geoghegan

