On 06/20/2015 04:17 PM, Robert Haas wrote:
> On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> I'm currently running some tests on a 3TB TPC-H data set, and I
>> tripped over a pretty bad n_distinct underestimate, causing OOM in
>> HashAgg (which somehow illustrates the importance of the
>> memory-bounded hashagg patch Jeff Davis is working on).
>
> Stupid question, but why not just override it using ALTER TABLE ...
> ALTER COLUMN ... SET (n_distinct = ...)?

Sure, I'll do that, and it's probably the only way to do it at the moment. But I wasn't really sure why we produce such a poor estimate in the first place. Even considering how small the sample is, it seemed a bit too bad, and the proportionality to the statistics target seemed really strange.
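For the record, something like this is what I'm going to use (the column is just an illustration of the TPC-H case, and -0.25 is a made-up value):

    -- negative values are interpreted as a fraction of the row count,
    -- positive values as an absolute number of distinct values
    ALTER TABLE lineitem ALTER COLUMN l_orderkey SET (n_distinct = -0.25);

    -- the override only kicks in once the table gets analyzed again
    ANALYZE lineitem;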

Also, if we could produce a better n_distinct estimate, wouldn't that be awesome? I'd much rather not force users to use the manual override if we can avoid it.


> I think it's been discussed quite often on previous threads that you
> need to sample an awful lot of the table to get a good estimate for
> n_distinct. We could support that, but it would be expensive, and it
> would have to be done again every time the table is auto-analyzed.
> The above syntax supports nailing the estimate to either an exact
> value or a percentage of the table, and I'm not sure why that isn't
> good enough.

Not really. Using a larger sample would certainly help in most cases, but in this case the main problem is that the sample is not as random as needed. It produces way more duplicate values than it should, which utterly confuses the estimator, which assumes a random sample. This happens because the column values are correlated with the physical position in the table, so rows sampled from the same block are much more likely to share a value.
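Just to illustrate why the duplicates hurt so much: IIRC ANALYZE plugs the sample into the Duj1 estimator, roughly n*d / ((n - f1) + f1*n/N), where f1 is the number of values seen exactly once in the sample. Every extra duplicate the block sampling introduces eats into f1, and the estimate collapses towards the number of distinct values actually observed in the sample. A minimal sketch of the effect (made-up table, values stored in long runs so that rows from the same block mostly share a value):

    CREATE TABLE t AS
        SELECT i / 100 AS val   -- 100k distinct values, each in a run of 100 rows
        FROM generate_series(0, 9999999) s(i);

    ANALYZE t;

    -- estimate from the block sample vs. the real number of distinct values
    SELECT n_distinct FROM pg_stats WHERE tablename = 't' AND attname = 'val';
    SELECT count(DISTINCT val) FROM t;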

I've lobotomized the sampling a bit to really produce a random set of blocks first, and that produces way better estimates:

   statistics target    current sampler            random blocks
   -----------------------------------------------------------------
   100                     429491 (~10000x under)   334430766 (~14x under)
   1000                   4240418  (~1000x under)   439010499 (~10x under)

Also, the number of sampled blocks is not that different: with target 100 the current sampler reads ~2900 blocks, while a completely random block sampler reads 3000. So where's the benefit of the current approach?

I'm sure there are counter-examples, but that's true for all estimators. I do realize the random block sampler assumes the tuple density is about the same in all blocks (so that it can sample the blocks directly), but ISTM the current block sampler has to make basically the same assumption to skip some of the blocks.
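FWIW it's easy enough to check how uniform the tuple density actually is, e.g. by counting rows per block from ctid (made-up table name again):

    -- spread of rows-per-block across the table
    SELECT min(cnt), max(cnt), avg(cnt)
    FROM (SELECT (ctid::text::point)[0]::bigint AS blkno, count(*) AS cnt
          FROM t GROUP BY 1) b;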

I'll have to read the previous threads on this topic, I guess ...

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

