Re: [HACKERS] PoC: Partial sort

Alexander Korotkov Sat, 14 Dec 2013 06:23:08 -0800

Hi!

Thanks for feedback!


On Sat, Dec 14, 2013 at 4:54 PM, Andres Freund <[email protected]>wrote:

> Hi,
>
> Cool stuff.
>
> On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:
> > Currently when we need to get ordered result from table we have to choose
> > one of two approaches: get results from index in exact order we need or
> do
> > sort of tuples. However, it could be useful to mix both methods: get
> > results from index in order which partially meets our requirements and do
> > rest of work from heap.
>
> >
> ------------------------------------------------------------------------------------------------------------------------------------------
> >  Limit  (cost=69214.06..69214.08 rows=10 width=16) (actual
> > time=0.097..0.099 rows=10 loops=1)
> >    ->  Sort  (cost=69214.06..71714.06 rows=1000000 width=16) (actual
> > time=0.096..0.097 rows=10 loops=1)
> >          Sort Key: v1, v2
> >          Sort Method: top-N heapsort  Memory: 25kB
> >          ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.42
> > rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
> >  Total runtime: 0.125 ms
> > (6 rows)
>
> Is that actually all that beneficial when sorting with a bog standard
> qsort() since that doesn't generally benefit from data being pre-sorted?
> I think we might need to switch to a different algorithm to really
> benefit from mostly pre-sorted input.
>

In this patch I don't do full sort of dataset. For instance, index returns
data ordered by first column and we need to order them also by second
column. Then this node sorts groups (assumed to be small) where values of
the first column are same by value of second column. And with limit clause
only required number of such groups will be processed. But, I don't think
we should expect pre-sorted values of second column inside a group.


> > *partial-knn-1.patch*
> >
> > KNN-GiST provides ability to get ordered results from index, but this
> order
> > is based only on index information. For instance, GiST index contains
> > bounding rectangles for polygons, and we can't get exact distance to
> > polygon from index (similar situation is in PostGIS). In attached patch,
> > GiST distance method can set recheck flag (similar to consistent method).
> > This flag means that distance method returned lower bound of distance and
> > we should recheck it from heap.
>
> > See an example.
> >
> > create table test as (select id, polygon(3+(random()*10)::int,
> > circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
> > generate_series(1,1000000) id);
> > create index test_idx on test using gist (p);
> >
> > We can get results ordered by distance from polygon to point.
> >
> > postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
> > point(0.5,0.5) limit 10;
>
> >
> ----------------------------------------------------------------------------------------------------------------------------------
> >  Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
> > rows=10 loops=1)
> >    ->  Index Scan using test_idx on test  (cost=0.29..157672.29
> > rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
> >          Order By: (p <-> '(0.5,0.5)'::point)
> >  Total runtime: 0.305 ms
> > (4 rows)
>
> Rechecking from the heap means adding a sort node though, which I don't
> see here? Or am I misunderstanding something?
>

KNN-GiST contain RB-tree of scanned items. In this patch item is rechecked
inside GiST and reinserted into same RB-tree. It appears to be much easier
implementation for PoC and also looks very efficient. I'm not sure what is
actually right design for it. This is what I like to discuss.

------
With best regards,
Alexander Korotkov.

Re: [HACKERS] PoC: Partial sort

Reply via email to