On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Robert Haas <robertmh...@gmail.com> writes: >> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >>> I initially thought about driving the shutoff strictly from the estimate >>> of the MCV frequency, without involving the more general ndistinct >>> computation that estimate_hash_bucketsize does. I'm not sure how much >>> that would do for your concern, but at least the MCV frequency doesn't >>> involve quite as much extrapolation as ndistinct. > >> Hmm, so we could do something like: if the estimated frequency of the >> least-common MCV is enough to make one bucket overflow work_mem, then >> don't use a hash join? That would still be prone to some error (in >> both directions, really) but it seems less likely to spit out >> completely stupid results than relying on ndistinct, which never gets >> very big even in a 10TB table. > > No, it'd be the *most* common MCV, because we're concerned about the > worst-case (largest) bucket size. But that's good, really, because the > highest MCV frequency will be the one we have most statistical > confidence in. There's generally a whole lot of noise in the tail-end > MCV numbers.
Oh, right. That's reassuring, as it seems like it has a much better chance of actually being right. > Also, I'd be inclined to do nothing (no shutoff) if we have no MCV > stats. That would be an expected case if the column is believed unique, > and it's probably a better fallback behavior when we simply don't have > stats. With the ndistinct-based rule, we'd be shutting off hashjoin > almost always when we don't have stats. Given how long it took us > to recognize this problem, that's probably the wrong default. Right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers