On Tue, Nov 16, 2010 at 3:07 AM, Robert Haas <robertmh...@gmail.com> wrote:

> But on a broader note, I'm not very certain the sorting algorithm is
> sensible.  For example, suppose you have 10 segments that are exactly
> '0' and 20 segments that are exactly '1'.  Maybe I'm misunderstanding,
> but it seems like this will result in a 15/15 split when we almost
> certainly want a 10/20 split.  I think there will be problems in more
> complex cases as well.  The documentation says about the less-than and
> greater-than operators that "These operators do not make a lot of
> sense for any practical purpose but sorting."

To illustrate a real problem, we should look at GiST behavior with a large enough amount of data. For example, I extrapolated this case to 100000 segs, where 40% are (0,1) segs and 60% are
(1,2) segs. This case doesn't seem to be a problem to me.
Here are the details.

test=# insert into seg_test (select case when random()<0.4 then '0..1'::seg
else '1..2'::seg end from generate_series(1,100000));
INSERT 0 100000
Time: 7543,941 ms

test=# create index seg_test_idx on seg_test using gist(a);
CREATE INDEX
Time: 17839,450 ms

test=# explain (analyze, buffers) select count(*) from seg_test where a @>
'1.5'::seg;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=344.84..344.85 rows=1 width=0) (actual time=186.192..186.193 rows=1 loops=1)
   Buffers: shared hit=887
   ->  Bitmap Heap Scan on seg_test  (cost=5.05..344.59 rows=100 width=0) (actual time=16.066..102.586 rows=60206 loops=1)
         Recheck Cond: (a @> '1.5'::seg)
         Buffers: shared hit=887
         ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..5.03 rows=100 width=0) (actual time=15.948..15.948 rows=60206 loops=1)
               Index Cond: (a @> '1.5'::seg)
               Buffers: shared hit=396
 Total runtime: 186.306 ms
(9 rows)

test=# explain (analyze, buffers) select count(*) from seg_test where a @>
'0.5'::seg;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=344.84..344.85 rows=1 width=0) (actual time=144.293..144.295 rows=1 loops=1)
   Buffers: shared hit=754
   ->  Bitmap Heap Scan on seg_test  (cost=5.05..344.59 rows=100 width=0) (actual time=28.926..87.625 rows=39794 loops=1)
         Recheck Cond: (a @> '0.5'::seg)
         Buffers: shared hit=754
         ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..5.03 rows=100 width=0) (actual time=26.969..26.969 rows=39794 loops=1)
               Index Cond: (a @> '0.5'::seg)
               Buffers: shared hit=263
 Total runtime: 144.374 ms
(9 rows)

test=# select pg_relpages('seg_test_idx');
 pg_relpages
-------------
         656
(1 row)

The total number of pages in the index is 656, and the number of pages read by the two scans is 263+396=659, i.e. only 3 pages of overhead.
We can see the distribution of segs in the index using gevel.

test=# select a, count(1) from gist_print('seg_test_idx') as t(level int,
valid bool, a seg) group by a;
   a    | count
--------+-------
 0 .. 1 | 40054
 0 .. 2 |     2
 1 .. 2 | 60599
(3 rows)

test=# select level, a, count(1) from gist_print('seg_test_idx') as t(level
int, valid bool, a seg) group by level,a order by level, a;
 level |   a    | count
-------+--------+-------
     1 | 0 .. 1 |     1
     1 | 0 .. 2 |     1
     1 | 1 .. 2 |     1
     2 | 0 .. 1 |   259
     2 | 0 .. 2 |     1
     2 | 1 .. 2 |   392
     3 | 0 .. 1 | 39794
     3 | 1 .. 2 | 60206
(8 rows)

We have only two '0..2' segs (one on the first level and one on the second),
and all other segs are '0..1' and '1..2'. I think there will be more
problems when we have many distinct values, each with a count of about
half an index page.
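
For completeness, a rough sketch of how such a data set could be generated
(this is just my assumption of a suitable test case, not something measured
above; the 1000 distinct segs and 130 copies each are arbitrary numbers
intended to put roughly half an index page worth of entries on each value):

    -- hypothetical test data: many distinct seg values, each repeated
    -- roughly half an index page worth of times (numbers are guesses)
    insert into seg_test
        select (i::text || ' .. ' || (i + 1)::text)::seg
        from generate_series(1, 1000) i,
             generate_series(1, 130) dup;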

----
With best regards,
Alexander Korotkov.
