On Sun, 20 Apr 2008 17:15:36 +0200, Francisco Reyes <[EMAIL PROTECTED]> wrote:

PFC writes:

- If you process up to some percentage of your RAM worth of data, hashing is going to be a lot faster

Thanks for the excellent breakdown and explanation. I will try and get sizes of the tables in question and how much memory the machines have.

Actually, the memory used by the hash depends on the number of distinct values, not on the number of rows processed...
        Consider (for some table t):

SELECT a FROM t GROUP BY a;
SELECT a, count(*) FROM t GROUP BY a;

In both cases the hash only holds distinct values. So if you have 1 million rows to process but only 10 distinct values of "a", the hash will only contain those 10 values (and the counts), so it will be very small and fast; it will absorb a huge seq scan without problem. If, however, you have (say) 100 million distinct values for "a", using a hash would be a bad idea. As usual, divide the size of your RAM by the number of concurrent connections, or something like that.

Note that "a" could be one column, several columns, anything: the size of the hash will be proportional to the number of distinct values, i.e. the number of rows returned by the query, not the number of rows processed (read) by the query. The same goes for hash joins; that's why, when you join a very small table to a large one, Postgres likes to use a seq scan + hash join on the small table.
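A quick way to see this in action (the table t and its contents here are made up purely for illustration):

-- 1 million rows, but only 10 distinct values of "a"
CREATE TABLE t AS
  SELECT (i % 10) AS a, i AS b
  FROM generate_series(1, 1000000) AS s(i);
ANALYZE t;

-- The hash only needs room for 10 entries (value + count), so the
-- plan should be a small HashAggregate on top of a Seq Scan on t,
-- no matter how many rows the scan feeds into it:
EXPLAIN SELECT a, count(*) FROM t GROUP BY a;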


        - If you need DISTINCT ON, well, you're stuck with the Sort
        - So, for the time being, you can replace DISTINCT with GROUP BY...

I have seen a few of those already in some code (new job..), so for those is it a matter of having a good disk subsystem?

Depends on your RAM: sorting in RAM is always faster than sorting on disk, of course, unless the sort eats all the RAM and starves the other processes. Tradeoffs...
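For example (same toy table t as above; the work_mem value is just an illustration):

-- As discussed above, DISTINCT is stuck with a Sort,
-- while the equivalent GROUP BY form is allowed to hash:
SELECT DISTINCT a FROM t;     -- Unique over a Sort
SELECT a FROM t GROUP BY a;   -- HashAggregate, if the hash fits in RAM

-- For queries that do sort, work_mem decides between RAM and disk:
SET work_mem = '64MB';
EXPLAIN ANALYZE SELECT b FROM t ORDER BY b;
-- "Sort Method: quicksort Memory: ..."     -> sorted in RAM
-- "Sort Method: external merge Disk: ..."  -> spilled to disk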


