Robert Haas <robertmh...@gmail.com> writes: > On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> I initially thought about driving the shutoff strictly from the estimate >> of the MCV frequency, without involving the more general ndistinct >> computation that estimate_hash_bucketsize does. I'm not sure how much >> that would do for your concern, but at least the MCV frequency doesn't >> involve quite as much extrapolation as ndistinct.
> Hmm, so we could do something like: if the estimated frequency of the > least-common MCV is enough to make one bucket overflow work_mem, then > don't use a hash join? That would still be prone to some error (in > both directions, really) but it seems less likely to spit out > completely stupid results than relying on ndistinct, which never gets > very big even in a 10TB table. No, it'd be the *most* common MCV, because we're concerned about the worst-case (largest) bucket size. But that's good, really, because the highest MCV frequency will be the one we have most statistical confidence in. There's generally a whole lot of noise in the tail-end MCV numbers. Also, I'd be inclined to do nothing (no shutoff) if we have no MCV stats. That would be an expected case if the column is believed unique, and it's probably a better fallback behavior when we simply don't have stats. With the ndistinct-based rule, we'd be shutting off hashjoin almost always when we don't have stats. Given how long it took us to recognize this problem, that's probably the wrong default. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers