On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> I initially thought about driving the shutoff strictly from the estimate
>>> of the MCV frequency, without involving the more general ndistinct
>>> computation that estimate_hash_bucketsize does.  I'm not sure how much
>>> that would do for your concern, but at least the MCV frequency doesn't
>>> involve quite as much extrapolation as ndistinct.
>> Hmm, so we could do something like: if the estimated frequency of the
>> least-common MCV is enough to make one bucket overflow work_mem, then
>> don't use a hash join?  That would still be prone to some error (in
>> both directions, really) but it seems less likely to spit out
>> completely stupid results than relying on ndistinct, which never gets
>> very big even in a 10TB table.
> No, it'd be the *most* common MCV, because we're concerned about the
> worst-case (largest) bucket size.  But that's good, really, because the
> highest MCV frequency will be the one we have most statistical
> confidence in.  There's generally a whole lot of noise in the tail-end
> MCV numbers.

Oh, right.  That's reassuring, as it seems like it has a much better
chance of actually being right.

> Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
> stats.  That would be an expected case if the column is believed unique,
> and it's probably a better fallback behavior when we simply don't have
> stats.  With the ndistinct-based rule, we'd be shutting off hashjoin
> almost always when we don't have stats.  Given how long it took us
> to recognize this problem, that's probably the wrong default.


Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to