Mark Mielke wrote:
PFC wrote:
Actually, the memory used by the hash depends on the number of
distinct values, not the number of rows which are processed...
Consider (with t standing in for any table):
SELECT a FROM t GROUP BY a
SELECT a, count(*) FROM t GROUP BY a
In both cases the hash holds only the distinct values. So if you have
1 million rows to process but only 10 distinct values of "a", the
hash will contain just those 10 values (plus the counts), so it will
be very small and fast, and it will absorb a huge seq scan without a
problem. If, however, you have (say) 100 million distinct values of
"a", using a hash would be a bad idea. As a rough rule of thumb,
divide the size of your RAM by the number of concurrent connections
to decide how much memory each such hash can afford.
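To make that concrete, here is a toy sketch in Python (not PostgreSQL's actual executor code, and the 10-value distribution is made up) of why a hash aggregate's memory scales with the number of distinct values rather than the number of rows scanned:

```python
# Hypothetical hash-aggregate sketch: 1 million input rows, but only
# 10 distinct values of "a", so the hash stays tiny.
from collections import defaultdict
import random

rows = (random.randint(0, 9) for _ in range(1_000_000))  # simulated seq scan

counts = defaultdict(int)
for a in rows:        # absorbs the whole scan...
    counts[a] += 1    # ...but keeps one entry per distinct value

print(len(counts))    # at most 10 entries, regardless of row count
```

The dictionary here plays the role of the aggregate hash table: its size tracks the distinct keys, while the scan itself streams through without being retained.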
Note that "a" could be a column, several columns, an expression,
anything: the size of the hash will be proportional to the number of
distinct values, i.e., the number of rows returned by the query, not
the number of rows processed (read) by the query. The same goes for
hash joins, etc.; that's why, when you join a very small table to a
large one, Postgres likes to use a seq scan plus a hash join on the
small table.
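The seq scan + hash join pattern can be sketched like this (table contents and column names are invented for illustration; this is the general algorithm, not PostgreSQL internals):

```python
# Illustrative hash join: build a hash table on the small table,
# then probe it once while seq-scanning the large one.
small = [(1, 'red'), (2, 'green'), (3, 'blue')]    # small table: (id, color)
large = [(i, i % 3 + 1) for i in range(100_000)]   # large table: (row_id, color_id)

# Build phase: hash the small table; memory is proportional to its size.
build = {}
for id_, color in small:
    build.setdefault(id_, []).append(color)

# Probe phase: one pass over the large table, O(1) lookups per row.
joined = [(row_id, color)
          for row_id, color_id in large
          for color in build.get(color_id, [])]
```

Only the small (build) side has to fit in memory; the large side is streamed through exactly once.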
This surprises me: hash values are lossy, so it must still need to
confirm candidate matches against the real values, which at a minimum
should require keeping references to the rows to check against?
Is PostgreSQL doing something beyond my imagination? :-)
Hmmm... you did say distinct values, so I can see how that would work
for DISTINCT. What about seq scan + hash join, though? To complete the
join, wouldn't it need a reference to each of the rows to join
against? If there are 20 distinct values and 200 rows in the small
table, wouldn't it need to store 200 references?
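To make the question concrete, a toy build phase (assumed semantics, not PostgreSQL internals; row payloads are invented) with 200 rows sharing 20 distinct join keys would bucket by key but still retain every row, since the join must emit each matching row, not each distinct key:

```python
# Toy hash-join build side: 200 rows, only 20 distinct join keys.
rows = [(i % 20, f"payload-{i}") for i in range(200)]  # (join_key, row payload)

buckets = {}
for key, payload in rows:
    buckets.setdefault(key, []).append(payload)

distinct_keys = len(buckets)                          # 20 buckets...
stored_rows = sum(len(v) for v in buckets.values())   # ...but 200 stored rows
```

So for a join, the hash table's memory is proportional to the build side's row count, not just its distinct-key count; it is the aggregate/DISTINCT case where only distinct values matter.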
Cheers,
mark
--
Mark Mielke <[EMAIL PROTECTED]>
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance