I have done another experiment comparing count(*) vs count(distinct ...), on
the same table schema but with 10,000,000 rows this time. The Postgres
version is 9.3.2 (9.3.2-1.pgdg12.4+1).
The two queries get the same query plan with the same estimated cost, but
count(*) is clearly much faster.
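For reference, a minimal setup that should reproduce a similar data shape;
the real table definition isn't shown in this thread, so the column type and
value distribution (about 1025 distinct integers over 10M rows) are
assumptions based on the results below:

-- hypothetical reproduction; the original table definition is not shown
create table t1 (col_int integer);
insert into t1
    select (i % 1025) from generate_series(1, 10000000) as g(i);
analyze t1;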

test=> explain analyze select count(*) from t1;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169247.92..169247.93 rows=1 width=0) (actual time=37775.187..37775.188 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..144247.94 rows=9999994 width=0) (actual time=0.037..19303.022 rows=10000000 loops=1)
 Total runtime: 37775.216 ms
(3 rows)

Time: 37775.493 ms
test=> explain analyze select count(distinct col_int) from t1;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169247.92..169247.93 rows=1 width=4) (actual time=45883.192..45883.195 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..144247.94 rows=9999994 width=4) (actual time=0.037..19652.540 rows=10000000 loops=1)
 Total runtime: 45883.224 ms
(3 rows)

Time: 45883.473 ms



test=> select count(*) from t1;
  count
----------
 10000000
(1 row)

Time: 602.018 ms
test=> select count(*) from t1;
  count
----------
 10000000
(1 row)

Time: 598.291 ms
test=> select count(*) from t1;
  count
----------
 10000000
(1 row)

Time: 592.439 ms

test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)

Time: 10311.788 ms
test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)

Time: 7063.156 ms
test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)

Time: 6899.283 ms


I don't think count(*) uses a sort, since one should not be needed. But it
seems the query planner currently cannot distinguish between these two cases
in its cost estimates.
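
As a workaround, rewriting the query so that the de-duplication happens in a
subquery can let the planner use HashAggregate instead of the sort that
count(distinct ...) always does (the alias s below is arbitrary):

-- same result as count(distinct col_int); the inner DISTINCT may be
-- planned as a HashAggregate rather than a sort
select count(*) from (select distinct col_int from t1) as s;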

regards,
jacket41142



2013/12/11 jacket41142 <jacket41...@gmail.com>

> Thanks very much.
>
> I think another problem is that the cost estimation isn't good enough to
> reflect the real cost. As we can see from "explain analyze ...",
> count(distinct ...) has the smallest estimated cost among them, but since
> it uses a sort, the actual cost should be higher, especially for a large
> number of rows.
>
> Also, I think that even if we can have multiple count() expressions, the
> optimizer should still be able to choose between a sort, a HashAggregate,
> or maybe something like a linear aggregate if no sort is needed, or other
> methods if they exist. This could be done as a single job over all the
> columns of interest in the table, or for each column separately.
>
> regards,
> jacket41142
>
>
> 2013/12/11 Jeff Janes <jeff.ja...@gmail.com>
>
>> On Tue, Dec 10, 2013 at 9:28 AM, jacket41142 <jacket41...@gmail.com> wrote:
>>
>>
>>>
>>>  test=> select distinct col_int from t1 group by col_int;
>>> Time: 1177.936 ms
>>>
>>> So the performance difference is not very large.
>>> But when I do that:
>>>
>>> test=> select count(distinct col_int) from t1;
>>>  count
>>> -------
>>>   1025
>>> (1 row)
>>>
>>> Time: 7367.476 ms
>>>
>>
>>
>> count(distinct ...) always sorts, rather than using a hash, to do its
>> work.  I don't think that there is any fundamental reason that it could not
>> be changed to allow it to use hashing, it just hasn't been done yet.  It is
>> complicated by the fact that you can have multiple count() expressions in
>> the same query which demand sorting/grouping on different columns.
>>
>> Cheers,
>>
>> Jeff
>>
>
>
