Hi!

On IRC somebody complained yesterday that a simple GROUP BY on an integer column of a 25M row table caused his backend to exhaust the 3GB process limit of his 32-bit build (on a box with 16GB of RAM). Some testing showed that the planner was seriously underestimating the number of distinct values in the column (with the default statistics target it estimated ~150k rows while there are about 19M distinct values) and was therefore choosing a HashAggregate for the aggregate. Upping the statistics target to 1000 improves the estimate to about 5M rows, which unfortunately is still not enough to make the planner switch to a GroupAggregate with work_mem set to 256000.

Some testing seems to indicate that even with perfectly matching stats, like (8.1.3 here):
foo=# create table testtable AS select a from generate_series(1,5000000) as a;
SELECT
foo=# CREATE INDEX test_idx on testtable (a);
CREATE INDEX
foo=# ANALYZE ;
ANALYZE
foo=# explain select a,count(*) from testtable group by a;
                                QUERY PLAN
-------------------------------------------------------------------------
 HashAggregate  (cost=97014.73..159504.51 rows=4999182 width=4)
   ->  Seq Scan on testtable  (cost=0.00..72018.82 rows=4999182 width=4)
(2 rows)

it will use about 2.5x of what work_mem is set to. While that is partly expected, it seems quite dangerous that one can, even with only a moderate underestimation of the expected result count (say 2x or 4x), run a server out of memory.

Stefan
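
P.S. In case somebody wants to play with this: what I mean by "upping the statistics target" is roughly the following (the 1000 and 256000 are just the values from above, and forcing a GroupAggregate by disabling hashagg for the session is only a workaround, not a fix for the estimation problem):

-- raise the per-column statistics target for the grouped column and re-analyze,
-- so ndistinct gets estimated from a larger sample (it is still only an estimate)
ALTER TABLE testtable ALTER COLUMN a SET STATISTICS 1000;
ANALYZE testtable;

-- session-level workaround: take HashAggregate out of consideration,
-- so the planner falls back to Sort + GroupAggregate
SET enable_hashagg = off;
SET work_mem = 256000;   -- in kB, i.e. ~256MB, as in the report above
EXPLAIN SELECT a, count(*) FROM testtable GROUP BY a;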