On Fri, Feb 17, 2017 at 11:13 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Robert Haas <robertmh...@gmail.com> writes: >> On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >>> No, it'd be the *most* common MCV, because we're concerned about the >>> worst-case (largest) bucket size. But that's good, really, because the >>> highest MCV frequency will be the one we have most statistical >>> confidence in. There's generally a whole lot of noise in the tail-end >>> MCV numbers. > >> Oh, right. That's reassuring, as it seems like it has a much better >> chance of actually being right. > > Here's a version that does it that way. Unsurprisingly, it doesn't > cause any regression test changes, but you can confirm it's having an > effect with this test case: > > create table tt(f1 int); > insert into tt select 1 from generate_series(1,1000000) g; > insert into tt select g from generate_series(1,1000000) g; > analyze tt; > explain select * from tt a natural join tt b; > > Unpatched code will go for a hash join on this example.
+1 By strange coincidence, I was about to propose something along these lines on theoretical grounds, having spent a bunch of time studying the hash join code recently. It makes a lot of sense to use statistics to try to avoid the "fail" (ie fail to respect work_mem, and maybe fail to complete: maybe better called "overflow" or "explode") mode during planning. I have been wondering about a couple of different worst case execution strategies that would be better than throwing our hands up and potentially exploding memory once we detect that further partitioning is not going to help, if we still manage to reach that case despite adding stats-based defences like this due to statistics being missing, bad or confused by joins below us. 1. We could probe the fraction of the hash table that we have managed to load into work_mem so far and then rewind the outer batch and do it again for the next work_mem-sized fraction of the inner batch and so on. For outer joins we'd need to scan for unmatched tuples after each hash table refill. If we detect this condition during the initial hash build (as opposed to a later inner batch->hash table load), we'd need to disable the so called 'hybrid' optimisation and fall back to the so called 'Grace' hash join; that is, we'd need to pull in the whole outer relation and write it out to batches before we even begin probing batch 0, so that we have the ability to rewind outer batch 0 for another pass. When doing extra passes of an outer batch file, we have to make sure that we don't do the 'send this tuple to a future batch' behaviour if the number of batches happens to have increased. Modulo some details, and I may be missing something fundamental here (maybe breaks in some semi/anti case?)... 2. We could just abandon hash join for this batch. "She cannae take it, captain", so sort inner and outer batches and merge-join them instead. Same comment re switching to Grace hash join if this happens while loading inner batch 0; we'll need a complete inner batch 0 and outer batch 0, so we can't juse the hybrid optimisation. Obviously there are vanishing returns here as we add more defences making it increasingly unlikely that we hit "fail" mode. But it bothers me that hash joins in general are not 100% guaranteed to be able to complete unless you have infinite RAM. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers