On 06/20/2015 04:17 PM, Robert Haas wrote:
> On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> I'm currently running some tests on a 3TB TPC-H data set, and I
>> tripped over a pretty bad n_distinct underestimate, causing OOM in
>> HashAgg (which somehow illustrates the importance of the
>> memory-bounded hashagg patch Jeff Davis is working on).
>
> Stupid question, but why not just override it using ALTER TABLE ...
> ALTER COLUMN ... SET (n_distinct = ...)?

Sure, I'll do that, and it's probably the only way to do it at the moment. But I wasn't really sure why we produce such a poor estimate in the first place. Even considering how small the sample is, it seemed a bit too bad, and the proportionality to the statistics target seemed really strange.
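For the record, something like this is what I'm going to use (the column is just an illustration of the TPC-H case, and -0.25 is a made-up value):

    -- negative values are interpreted as a fraction of the row count,
    -- positive values as an absolute number of distinct values
    ALTER TABLE lineitem ALTER COLUMN l_orderkey SET (n_distinct = -0.25);

    -- the override only kicks in once the table gets analyzed again
    ANALYZE lineitem;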

Also, if we could produce a better n_distinct estimate, wouldn't that be awesome? I'd much rather not force users to use the manual override if we can avoid it.


> I think it's been discussed quite often on previous threads that you
> need to sample an awful lot of the table to get a good estimate for
> n_distinct. We could support that, but it would be expensive, and it
> would have to be done again every time the table is auto-analyzed.
> The above syntax supports nailing the estimate to either an exact
> value or a percentage of the table, and I'm not sure why that isn't
> good enough.

Not really. Using a larger sample would certainly help in most cases, but in this case the main problem is that the sample is not as random as needed. It produces way more duplicate values than it should, which utterly confuses the estimator, which assumes a random sample. This happens because the column values are correlated with the physical position in the table, so rows sampled from the same block are much more likely to share a value.
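Just to illustrate why the duplicates hurt so much: IIRC ANALYZE plugs the sample into the Duj1 estimator, roughly n*d / ((n - f1) + f1*n/N), where f1 is the number of values seen exactly once in the sample. Every extra duplicate the block sampling introduces eats into f1, and the estimate collapses towards the number of distinct values actually observed in the sample. A minimal sketch of the effect (made-up table, values stored in long runs so that rows from the same block mostly share a value):

    CREATE TABLE t AS
        SELECT i / 100 AS val   -- 100k distinct values, each in a run of 100 rows
        FROM generate_series(0, 9999999) s(i);

    ANALYZE t;

    -- estimate from the block sample vs. the real number of distinct values
    SELECT n_distinct FROM pg_stats WHERE tablename = 't' AND attname = 'val';
    SELECT count(DISTINCT val) FROM t;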

I've lobotomized the sampling a bit to really produce a random set of blocks first, and that produces way better estimates:

   statistics target    current sampler            random blocks
   -----------------------------------------------------------------
   100                     429491 (~10000x under)   334430766 (~14x under)
   1000                   4240418  (~1000x under)   439010499 (~10x under)

Also, the number of sampled blocks is not that different: with target 100 the current sampler reads ~2900 blocks, while a completely random block sampler reads 3000. So where's the benefit of the current approach?

I'm sure there are counter-examples, but that's true for all estimators. I do realize the random block sampler assumes the tuple density is about the same in all blocks (so that it can sample the blocks directly), but ISTM the current block sampler has to make basically the same assumption to skip some of the blocks.
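FWIW it's easy enough to check how uniform the tuple density actually is, e.g. by counting rows per block from ctid (made-up table name again):

    -- spread of rows-per-block across the table
    SELECT min(cnt), max(cnt), avg(cnt)
    FROM (SELECT (ctid::text::point)[0]::bigint AS blkno, count(*) AS cnt
          FROM t GROUP BY 1) b;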

I'll have to read the previous threads on this topic, I guess ...

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

