Re: Default setting for enable_hashagg_disk

Tomas Vondra Mon, 29 Jun 2020 14:33:59 -0700

On Mon, Jun 29, 2020 at 01:31:40PM -0400, Bruce Momjian wrote:

On Mon, Jun 29, 2020 at 10:20:14AM -0700, Peter Geoghegan wrote:

I have no reason to believe that the planner is any more or any less
likely to conclude that the hash table will fit in memory in v13 as
things stand (I don't know if the BufFile issue matters).


In general, grouping estimates probably aren't very good compared to
join estimates. I imagine that in either v12 or v13 the planner is
likely to incorrectly believe that it'll all fit in memory fairly
often. v12 was much too permissive about what could happen. But v13 is
too conservative.


FYI, we have improved planner statistics estimates for years, which must
have affected node spill behavior on many node types (except hash_agg),
and don't remember any complaints about it.


I think misestimates for GROUP BY are quite common and very hard to fix.
Firstly, our ndistinct estimator may give pretty bad results depending
e.g. on how is the table correlated.

I've been running some TPC-H benchmarks, and for partsupp.ps_partkey our
estimate was 4338776, when the actual value is 15000000, i.e. ~3.5x
higher. This was with statistics target increased to 1000. I can easily
imagine even worse estimates with lower values.

This ndistinct estimator is used even for extended statistics, so that
can't quite save us. Moreover, the grouping may be on top of a join, in
which case using ndistinct coefficients may not be possible :-(

So I think this is a quite real problem ...


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Default setting for enable_hashagg_disk

Reply via email to