PFC wrote:
Actually, the memory used by the hash depends on the number of
distinct values, not the number of rows processed. Consider:
SELECT a FROM ... GROUP BY a
SELECT a, count(*) FROM ... GROUP BY a
In both cases the hash only holds distinct values. So if you have
1 million rows to process but only 10 distinct values of "a", the hash
will only contain those 10 values (and the counts); it will be very
small and fast, and it will absorb a huge seq scan without problem. If,
however, you have (say) 100 million distinct values of "a", using a
hash would be a bad idea. As usual, size the memory you allow each
hash (work_mem) by dividing your RAM by the number of concurrent
connections that may need it.
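For instance, a quick sketch (the table and column names here are just
illustrative, not from the original queries):

  -- Hypothetical table: 1 million rows, only 10 distinct values of "a".
  CREATE TABLE t (a integer, payload text);
  INSERT INTO t
      SELECT i % 10, 'row ' || i
      FROM generate_series(1, 1000000) AS i;
  ANALYZE t;

  -- Both plans need one hash entry per distinct value of "a"
  -- (about 10 entries), no matter how many rows are scanned.
  EXPLAIN SELECT a FROM t GROUP BY a;
  EXPLAIN SELECT a, count(*) FROM t GROUP BY a;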
Note that "a" could be a column, several columns, anything, the
size of the hash will be proportional to the number of distinct
values, ie. the number of rows returned by the query, not the number
of rows processed (read) by the query. Same with hash joins etc,
that's why when you join a very small table to a large one Postgres
likes to use seq scan + hash join on the small table.
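A sketch of that small-table case (again, names are made up for
illustration):

  -- Small lookup table (10 rows) joined to a big table (1 million rows).
  CREATE TABLE small_lookup (id integer PRIMARY KEY, label text);
  INSERT INTO small_lookup
      SELECT i, 'label ' || i
      FROM generate_series(1, 10) AS i;

  CREATE TABLE big (lookup_id integer, data text);
  INSERT INTO big
      SELECT (i % 10) + 1, 'data ' || i
      FROM generate_series(1, 1000000) AS i;
  ANALYZE small_lookup; ANALYZE big;

  -- The planner will typically seq scan small_lookup, hash it,
  -- and probe that tiny hash while scanning big.
  EXPLAIN SELECT b.data, s.label
      FROM big b JOIN small_lookup s ON s.id = b.lookup_id;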
This surprises me: hash values are lossy, so it must still need to
confirm matches against the real values, which at a minimum should
require keeping references to the rows to check against?
Is PostgreSQL doing something beyond my imagination? :-)
Cheers,
mark
--
Mark Mielke <[EMAIL PROTECTED]>