Re: [HACKERS] Re: Abbreviated keys for Datum tuplesort

Andrew Gierth Sun, 25 Jan 2015 03:16:12 -0800

>>>>> "Robert" == Robert Haas <robertmh...@gmail.com> writes:


 >> Here's the cleaned-up version of the patch to allow abbreviated keys
 >> when sorting a single Datum. This also removes comments that suggest
 >> that the caller of tuplesort_begin_datum should ever have to care
 >> about the abbreviated key optimization.
 >> 
 >> I'll add this to the CF.

 Robert> I think this is a good idea.  Do you have a test case that
 Robert> shows the benefit?

The best test case for datum sort performance is to use percentile_disc,
since that has almost no overhead beyond performing the actual sort.
(Unlike, say, count(distinct) or mode(), both of which have to do an
additional comparison pass over the sorted data; but count(distinct) is
probably the most common use of the datum sort in the wild, so it's
useful to try that too.)

So given some suitable test data, such as

create table stuff as select random()::text as randtext
  from generate_series(1,1000000);  -- or however many rows

you can do

select percentile_disc(0) within group (order by randtext) from stuff;

or

select count(distinct randtext) from stuff;

The performance improvements I saw were pretty much exactly as expected
from the improvement in the ORDER BY and CREATE INDEX cases.

The best test case for checking the correct order of results is to use
array_agg(x order by x), for example as follows:

select u, u <= lag(u) over ()
  from (select unnest(a) as u
          from (select array_agg(randtext order by randtext)
                  from stuff) s1) s2;

(note that array_agg(x order by x) uses the datum sort, but
array_agg(x order by y) uses the ordinary heap tuple sort)

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: Abbreviated keys for Datum tuplesort

Reply via email to