> The query,
> select count(distinct j) from t group by t, i;
> 
> runs for 35 seconds.  However, if I change the query to,
> select count(distinct j) from t group by i, t;  -- note switching the
> ordering
> select count(distinct j) from t group by decode(t, 'escape'), i; -- convert
> t to bytea
> 
> Run times are just about 5 and 6.5 seconds.  The reason is clear, compare a
> string with collation is slow, which is well understood by pg hackers.
> However, here, the sorting order is forced by the planner, not user.
> Planner can do the following optimizations,

Interesting. I got following result:

test=# explain analyze select count(distinct j) from t group by t, i;
                                                        QUERY PLAN              
                                          
--------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=137519.84..157519.84 rows=1000000 width=22) (actual 
time=1332.937..2431.238 rows=1000000 loops=1)
   Group Key: t, i
   ->  Sort  (cost=137519.84..140019.84 rows=1000000 width=22) (actual 
time=1332.922..1507.413 rows=1000000 loops=1)
         Sort Key: t, i
         Sort Method: external merge  Disk: 33232kB
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22) (actual 
time=0.006..131.406 rows=1000000 loops=1)
 Planning time: 0.031 ms
 Execution time: 2484.271 ms
(8 rows)

Time: 2484.520 ms

test=# explain analyze select count(distinct j) from t group by i, t;
                                                        QUERY PLAN              
                                          
--------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=137519.84..157519.84 rows=1000000 width=22) (actual 
time=602.510..1632.087 rows=1000000 loops=1)
   Group Key: i, t
   ->  Sort  (cost=137519.84..140019.84 rows=1000000 width=22) (actual 
time=602.493..703.274 rows=1000000 loops=1)
         Sort Key: i, t
         Sort Method: external sort  Disk: 33240kB
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22) (actual 
time=0.014..129.213 rows=1000000 loops=1)
 Planning time: 0.176 ms
 Execution time: 1685.575 ms
(8 rows)

Time: 1687.641 ms

Not so big difference here (maybe because I use SSD) but there is
still about 50% difference in execution time. Note that I disable
locale support.

> 1. for the sort we generated for sort agg, planner can switch column
> ordering, put int before string,
> 2. for the sort we generated for sort agg, use bytea compare instead of
> string compare.
> 
> They will bring big improvement to this common query.   Is this something
> reasonable?

Not looking at sort and planner code closely, I guess planner does
not recognize the cost difference between "external merge" and
"external sort" because cost estimations for sort step in each plan
are exactly same (137519.84..140019.84). However I am not sure if we
should fix the planner or should fix our sort machinery since it is
possible that the sort machinery should not expose such a big
difference in the two sort methods.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to