I have run another experiment comparing count(*) vs count(distinct ...), on the same table schema but now with 10000000 rows. This time the Postgres version is 9.3.2 (9.3.2-1.pgdg12.4+1). The two queries produce the same query plan with the same estimated cost, but count(*) is clearly much faster.
test=> explain analyze select count(*) from t1;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169247.92..169247.93 rows=1 width=0) (actual time=37775.187..37775.188 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..144247.94 rows=9999994 width=0) (actual time=0.037..19303.022 rows=10000000 loops=1)
 Total runtime: 37775.216 ms
(3 rows)

Time: 37775.493 ms

test=> explain analyze select count(distinct col_int) from t1;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169247.92..169247.93 rows=1 width=4) (actual time=45883.192..45883.195 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..144247.94 rows=9999994 width=4) (actual time=0.037..19652.540 rows=10000000 loops=1)
 Total runtime: 45883.224 ms
(3 rows)

Time: 45883.473 ms

test=> select count(*) from t1;
  count
----------
 10000000
(1 row)

Time: 602.018 ms

test=> select count(*) from t1;
  count
----------
 10000000
(1 row)

Time: 598.291 ms

test=> select count(*) from t1;
  count
----------
 10000000
(1 row)

Time: 592.439 ms

test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)

Time: 10311.788 ms

test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)

Time: 7063.156 ms

test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)

Time: 6899.283 ms

I don't think count(*) uses a sort as well, since no sort should be needed there. But it seems the query planner cannot currently distinguish between these two cases.

regards,
jacket41142

2013/12/11 jacket41142 <jacket41...@gmail.com>

> Thanks very much.
>
> I think another problem is that the cost estimation isn't good enough to
> reflect the real cost. As we can see from "explain analyze ...",
> count(distinct ...) has the smallest estimated cost of the alternatives,
> but since it uses sorts, the time complexity should be higher, especially
> for a large number of rows.
>
> Also, I think even if we can have multiple count() expressions, the
> optimizer should be able to choose between using a sort, a HashAggregate,
> or perhaps something like a linear aggregate if no sort is needed, or
> other methods if they exist. This could be done as a single pass over the
> table for all columns of interest, or for each column separately.
>
> regards,
> jacket41142
>
>
> 2013/12/11 Jeff Janes <jeff.ja...@gmail.com>
>
>> On Tue, Dec 10, 2013 at 9:28 AM, jacket41142 <jacket41...@gmail.com> wrote:
>>
>>>
>>> test=> select distinct col_int from t1 group by col_int;
>>> Time: 1177.936 ms
>>>
>>> So the performance difference is not very large.
>>> But when I do that:
>>>
>>> test=> select count(distinct col_int) from t1;
>>>  count
>>> -------
>>>   1025
>>> (1 row)
>>>
>>> Time: 7367.476 ms
>>
>>
>> count(distinct ...) always sorts, rather than using a hash, to do its
>> work. I don't think that there is any fundamental reason that it could
>> not be changed to allow it to use hashing; it just hasn't been done yet.
>> It is complicated by the fact that you can have multiple count()
>> expressions in the same query which demand sorting/grouping on different
>> columns.
>>
>> Cheers,
>>
>> Jeff
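
P.S. Since the plain `select distinct col_int from t1` variant above ran in about 1.2 s (it can use a HashAggregate), one workaround sketch until count(distinct ...) learns to hash is to push the DISTINCT into a subquery so the planner is free to hash-aggregate it, and then count the resulting rows. This is my own suggestion, not something from Jeff's reply; it assumes the same t1(col_int) table:

```sql
-- Workaround sketch: let the planner choose HashAggregate for the DISTINCT
-- step, then count the distinct values in the outer aggregate.
SELECT count(*)
FROM (SELECT DISTINCT col_int FROM t1) AS distinct_vals;
```

On my data this should give the same result (1025) as count(distinct col_int), but with the sort avoided whenever the planner picks a hash plan for the subquery.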