Re: [HACKERS] SP-GiST support for inet datatypes

Tom Lane Wed, 24 Aug 2016 07:11:55 -0700

I wrote:
> I've pushed this patch with mostly (not entirely) cosmetic adjustments.
> I still think it'd be worth looking into why the produced index is larger
> than the GiST equivalent, and for that matter exactly why the GiST
> equivalent is so much slower to search.


I spent some time poking at this, and noticed that although the picksplit
rule is guaranteed to produce a non-degenerate split unless the inputs
are all identical, it's entirely capable of producing a very bad split.
Here's some instrumentation printout showing the numbers of items assigned
to the four subnodes, for the test case we've been working with:

     58 inet picksplit (227 tuples): 0 0 127 100
     65 inet picksplit (227 tuples): 0 0 223 4
     69 inet picksplit (225 tuples): 2 0 126 97
     69 inet picksplit (227 tuples): 1 0 226 0
     72 inet picksplit (227 tuples): 0 0 224 3
     72 inet picksplit (227 tuples): 1 0 132 94
     82 inet picksplit (226 tuples): 1 0 127 98
     86 inet picksplit (227 tuples): 0 0 130 97
     95 inet picksplit (227 tuples): 0 0 2 225
     99 inet picksplit (227 tuples): 0 0 225 2
    117 inet picksplit (227 tuples): 2 0 126 99
    118 inet picksplit (227 tuples): 0 0 128 99
    137 inet picksplit (227 tuples): 0 0 226 1
    151 inet picksplit (227 tuples): 0 0 1 226
    270 inet picksplit (227 tuples): 1 0 127 99
    499 inet picksplit (122 tuples): 0 0 64 58

(This is from "sort | uniq -c" output, so the first column is the number
of occurrences of identical split counts.)

Aside from the disturbing frequency of 100-to-1 split ratios, it also
looks like the inclusion of the masklen bit is hardly pulling its weight,
though that might be a artifact of this data set.

The bad splits seem to be a direct contributor to the index's relatively
poor space efficiency; they lead to SPGiST having to move nearly all of
a long leaf chain to another page, and then soon having to split it again,
resulting in another mostly-empty page, lather rinse repeat.  They can't
be very helpful for search speed either.

I think that it'd be worth investigating changing to a split rule that
uses the next two or three or four bits of the address, not just the
single next bit, to index into more than four subnodes.  It's pretty clear
how to adjust the insertion functions to support that, and a crude hack in
that line suggested that the index does get noticeably smaller.  However,
I didn't quite grok how to adjust the search (consistent) functions.
Obviously we could apply the same rules in a loop, considering each
successive address bit, but there may be a better way.

Even though I already committed the code, we're a very long way from
v10 release, so I see no reason not to consider incompatible changes
in the way this opclass works.

I also noticed that the index build wastes some space because SPGIST
remembers only one partly-full page within each of its page categories.
A prototype patch to enlarge the size of that cache helped a good bit
too.  I'll clean that up and post it later, but I was hoping you'd
work on improving this opclass's split rules.

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] SP-GiST support for inet datatypes

Reply via email to