Yeb Havinga <yebhavi...@gmail.com> writes:
> I think it is time to mark this patch ready for committer:

> The unintuitive result thus far is that sorting outperforms the R-tree 
> bounding boxes style index, as Alexander has demonstrated with several 
> different distributions on 20-11 (uniform, natural (is that a bell 
> curve?), many distinct values)

I spent some time analyzing this patch today.  A few observations:

1. The figures of merit that we're interested in are minimizing the
overlap between the bounding boxes of the split pages, and trying to
divide the items more or less equally between the two pages.  The less
overlap, the fewer subsequent searches will have to descend to both
children.  If we divide the items unequally we risk ending up with a large
(low fill ratio) index, if we're unlucky enough for the less-populated
side to receive few additional entries.  (Which is quite likely, if the
subsequently-added data has a distribution like the existing data's.)
Low fill ratio is of course bad for I/O and buffer space consumption.
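
To make the two metrics concrete, here's how they could be computed for
a candidate split in the 1-D seg case (hypothetical names, purely for
illustration):

typedef struct { double lo, hi; } Seg;

/* overlap between the two children's bounding boxes (0 if disjoint) */
static double
bbox_overlap(Seg left_bbox, Seg right_bbox)
{
	double	lo = (left_bbox.lo > right_bbox.lo) ? left_bbox.lo : right_bbox.lo;
	double	hi = (left_bbox.hi < right_bbox.hi) ? left_bbox.hi : right_bbox.hi;

	return (hi > lo) ? (hi - lo) : 0.0;
}

/* fraction of items on the left page: 0.5 is ideal, near 0 or 1 is bad */
static double
fill_balance(int nleft, int nright)
{
	return (double) nleft / (nleft + nright);
}

A query value that falls inside the overlap region forces a descent into
both children, and a balance far from 0.5 is what produces the
low-fill-ratio bloat.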

Alexander's proposed algorithm wins on the equal-division front, since
it always produces an even split, while the existing Guttman algorithm
can produce very uneven splits (more on that below).  However, it's less
clear that Alexander's patch beats the existing code in terms of getting
good bounding boxes.
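
To make the comparison concrete, the sort-based idea can be paraphrased
like this (hypothetical names; the actual patch works on GiST index
tuples, and its sort order may differ in detail):

#include <stdlib.h>

typedef struct { double lo, hi; } Seg;

/* order segments by lower bound, breaking ties on upper bound */
static int
seg_item_cmp(const void *a, const void *b)
{
	const Seg *sa = (const Seg *) a;
	const Seg *sb = (const Seg *) b;

	if (sa->lo != sb->lo)
		return (sa->lo < sb->lo) ? -1 : 1;
	if (sa->hi != sb->hi)
		return (sa->hi < sb->hi) ? -1 : 1;
	return 0;
}

/* sort, then cut at the midpoint: each child gets half the items */
static int
sort_split(Seg *items, int n)
{
	qsort(items, n, sizeof(Seg), seg_item_cmp);
	return n / 2;		/* items[0 .. n/2-1] go left, the rest right */
}

The even division is guaranteed by construction; how little the
resulting bounding boxes overlap depends entirely on how well the sort
order groups similar segments together.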

2. It's not really appropriate to compare the patch directly against git
HEAD, since we know that's broken.  I experimented with just fixing the
s/size_alpha/size_beta/ problem, and observed that search efficiency
(measured as the number of buffers touched in a single query) improved
markedly, but index size didn't, and index build time actually got worse.
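
For context: Guttman's second pass computes, for each remaining item,
how much each page's bounding box must grow to absorb it, and attaches
the item to the cheaper page.  Somewhere in the equivalent seg.c logic,
the typo wrote size_alpha where size_beta was meant.  Schematically
(paraphrased with hypothetical names, not the actual source), the fixed
decision is:

typedef struct { double lo, hi; } Seg;

static double seg_size(Seg s) { return s.hi - s.lo; }

static Seg
seg_union(Seg a, Seg b)
{
	Seg		u;

	u.lo = (a.lo < b.lo) ? a.lo : b.lo;
	u.hi = (a.hi > b.hi) ? a.hi : b.hi;
	return u;
}

/* returns 1 if the item should be attached to the left page */
static int
goes_left(Seg left_bbox, Seg right_bbox, Seg item)
{
	/* enlargement cost of absorbing the item into each page's bbox */
	double	size_alpha = seg_size(seg_union(left_bbox, item)) - seg_size(left_bbox);
	double	size_beta = seg_size(seg_union(right_bbox, item)) - seg_size(right_bbox);

	return size_alpha < size_beta;
}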

3. I think that the test cases presented by Alexander aren't really that
useful, because they all consist of large numbers of small segments.
Almost any split algorithm can do reasonably well there, because you can
always find a split that has pretty minimal overlap between the bounding
boxes of the two child pages; and in particular, changing the distribution
of where the segments are doesn't change things much.  Yeb pointed this
out upthread but didn't pursue the point.  Of course, if there are only
large segments then it's impossible to find minimally-overlapping splits,
so any split algorithm will do poorly.  I think the interesting case is
where there are a few large segments among a population of small ones.
Accordingly, I experimented with test data built this way:

create table seg_test as
  select (a || ' .. ' || a + 0.25*b)::seg as a from
  (select random() as a, random() as b from generate_series(1,10)) x
  union all
  select (a || ' .. ' || a + 0.00005*b)::seg as a from
  (select random() as a, random() as b from generate_series(1,1000000)) x;

(Note: order is important here, and it's also important to set
synchronize_seqscans off, else you won't get repeatable results.
We want to inject the large segments early in the index build process.)

What I got for this was

                        index build time    index pages    search buffers

fixed HEAD run A        114 sec             17777          16
fixed HEAD run B        113 sec             16574          6
Alexander's 0.5 run A   15.5 sec            6259           34
Alexander's 0.5 run B   16 sec              6016           3

("Search buffers" is the number of buffers touched in
select * from seg_test where a @> '0.5 .. 0.5'::seg
Run A and run B are two different data sets generated as per the
above recipe)

Unsurprisingly, the patch wins on index size, but its variance in
search efficiency seems a little worse than before.  Note however that
the search-buffers number is still a lot better than unpatched HEAD,
where I was seeing values of several hundred.

4. The main speed problem with Guttman's algorithm is that the initial
seed-finding loop is O(N^2) in the number of items on the page to be split
(which is ~260 with the default 8K pages, on my machine anyway).  This
is why Yeb found
significantly shorter index build time with 1K instead of 8K pages.
However, given the 1-D nature of the data it's not hard to think of
cheaper ways to select the seed items --- basically we want the two
"extremal" items.  I experimented with finding the smallest-lower and
largest-upper items, and also the smallest-upper and largest-lower,
which of course can be done in one pass in O(N) time (a sketch of that
scan follows the results below).  Leaving the second pass as-is, I got

                        index build time    index pages    search buffers

smallest_upper run A    34 sec              16049          7
smallest_upper run B    33 sec              15185          5
smallest_lower run A    15 sec              7327           40
smallest_lower run B    14 sec              7234           4

(I also tried smallest and largest center points, but that was worse than
these across the board.)  So there's more than one way to skin a cat here.
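
For concreteness, the one-pass seed scan can be sketched like so (the
smallest-lower/largest-upper variant; hypothetical names, since the
real code works on GiST index tuples):

typedef struct { double lo, hi; } Seg;

/*
 * Seed selection in a single pass: grab the item with the smallest
 * lower bound and the item with the largest upper bound.  Swapping the
 * two comparisons yields the smallest-upper/largest-lower variant.
 * Either way this is O(N), versus O(N^2) for Guttman's all-pairs
 * search for the most wasteful pair.  (A real implementation must also
 * handle the corner case where both extremes are the same item.)
 */
static void
find_seeds(const Seg *items, int n, int *seed1, int *seed2)
{
	int		i;

	*seed1 = 0;			/* smallest lower bound so far */
	*seed2 = 0;			/* largest upper bound so far */
	for (i = 1; i < n; i++)
	{
		if (items[i].lo < items[*seed1].lo)
			*seed1 = i;
		if (items[i].hi > items[*seed2].hi)
			*seed2 = i;
	}
}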

5. The *real* problem with Guttman's algorithm became obvious while I was
doing these experiments: it's unstable as heck.  If one of the seed items
is something of an outlier, it's not hard at all for the code to end up
attaching all the other items to the other seed's page, ie you end up with
just one item in one of the child pages.  This can happen repeatedly,
leading to severe index bloat.  Several times I ended up cancelling an
index build after it had blown past a gigabyte of index with no sign of
stopping.  In particular the algorithm is sensitive to data ordering
because of the incremental way it enlarges the two pages: if the input
data is ordered already, that can increase the bias towards the first
seed.  I also found out that it seems to be intentional, not a bug, that
the seed-finding loop excludes the last input item (which is generally the
new item that didn't fit on the page): if you don't do that then this
misbehavior becomes extremely likely to happen, anytime the new item isn't
within the range of existing items.  This undocumented assumption didn't
make me any more pleasantly disposed towards the existing code.
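
The failure mode is easy to reproduce in a toy simulation of the second
pass (made-up data and names, not the seg.c code; the seeding is
hard-wired to include the outlier, which is what a most-wasteful-pair
search would tend to pick anyway):

#include <stdio.h>
#include <stdlib.h>

typedef struct { double lo, hi; } Seg;

static double seg_size(Seg s) { return s.hi - s.lo; }

static Seg
seg_union(Seg a, Seg b)
{
	Seg		u;

	u.lo = (a.lo < b.lo) ? a.lo : b.lo;
	u.hi = (a.hi > b.hi) ? a.hi : b.hi;
	return u;
}

int
main(void)
{
	enum { N = 260 };		/* about one page's worth of items */
	Seg		items[N];
	Seg		left, right;
	int		i, nleft = 1, nright = 1;

	/* many tiny segments in [0,1), plus one big outlier */
	srand(42);
	for (i = 0; i < N - 1; i++)
	{
		double	a = (double) rand() / RAND_MAX;

		items[i].lo = a;
		items[i].hi = a + 0.00005 * ((double) rand() / RAND_MAX);
	}
	items[N - 1].lo = 0.0;
	items[N - 1].hi = 10.0;

	left = items[N - 1];	/* outlier seed */
	right = items[0];		/* ordinary small seed */

	/* second pass: attach each item to the page that grows less */
	for (i = 1; i < N - 1; i++)
	{
		double	grow_l = seg_size(seg_union(left, items[i])) - seg_size(left);
		double	grow_r = seg_size(seg_union(right, items[i])) - seg_size(right);

		if (grow_l < grow_r)
		{
			left = seg_union(left, items[i]);
			nleft++;
		}
		else
		{
			right = seg_union(right, items[i]);
			nright++;
		}
	}

	/* every small segment fits inside the outlier's box for free, so
	 * grow_l is always zero and the left page soaks up everything */
	printf("left page: %d items, right page: %d items\n", nleft, nright);
	return 0;
}

This ends up with essentially all the items on the outlier's page and
the other page holding little more than its seed: exactly the one-item
child pages described above.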


Bottom line: although there might be even better ways to do it,
Alexander's patch seems to me to be a significant improvement over what's
there.  It definitely wins on index size, and I'm not seeing any major
degradation in search efficiency.  There are some stylistic things that
could be improved, but I'm going to clean it up and commit it.

As a future TODO, I think we ought to look at whether we can't get rid of
Guttman's algorithm in the other picksplit routines.  I find it hard to
believe that it's any more stable in higher-dimensional cases than it is
here.

                        regards, tom lane
