On Fri, Feb 17, 2017 at 11:13 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> No, it'd be the *most* common MCV, because we're concerned about the
>>> worst-case (largest) bucket size.  But that's good, really, because the
>>> highest MCV frequency will be the one we have most statistical
>>> confidence in.  There's generally a whole lot of noise in the tail-end
>>> MCV numbers.
>> Oh, right.  That's reassuring, as it seems like it has a much better
>> chance of actually being right.
> Here's a version that does it that way.  Unsurprisingly, it doesn't
> cause any regression test changes, but you can confirm it's having an
> effect with this test case:
> create table tt(f1 int);
> insert into tt select 1 from generate_series(1,1000000) g;
> insert into tt select g from generate_series(1,1000000) g;
> analyze tt;
> explain select * from tt a natural join tt b;
> Unpatched code will go for a hash join on this example.


By strange coincidence, I was about to propose something along these
lines on theoretical grounds, having spent a bunch of time studying
the hash join code recently.  It makes a lot of sense to use
statistics to try to avoid the "fail" mode (i.e. failing to respect
work_mem, and maybe failing to complete; perhaps better called
"overflow" or "explode") during planning.

I have been wondering about a couple of worst-case execution
strategies that would be better than throwing our hands up and
potentially exploding memory once we detect that further partitioning
is not going to help.  We could still reach that case, despite
stats-based defences like this one, if statistics are missing, bad,
or confused by joins below us.

1.  We could probe the fraction of the hash table that we have managed
to load into work_mem so far and then rewind the outer batch and do it
again for the next work_mem-sized fraction of the inner batch and so
on.  For outer joins we'd need to scan for unmatched tuples after each
hash table refill.  If we detect this condition during the initial
hash build (as opposed to a later inner batch->hash table load), we'd
need to disable the so-called 'hybrid' optimisation and fall back to
the so-called 'Grace' hash join; that is, we'd need to pull in the
whole outer relation and write it out to batches before we even begin
probing batch 0, so that we have the ability to rewind outer batch 0
for another pass.  When doing extra passes of an outer batch file, we
have to make sure that we don't do the 'send this tuple to a future
batch' behaviour if the number of batches happens to have increased.
Modulo some details; I may be missing something fundamental here
(maybe it breaks in some semi/anti-join case?)...
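To make strategy 1 concrete, here's a toy Python sketch of the
multi-pass idea (not PostgreSQL code; work_mem is modelled as a tuple
count, and batches are plain lists):

```python
def multi_pass_hash_join(inner_batch, outer_batch, max_tuples_in_mem):
    """Join by repeatedly filling an in-memory hash table with the next
    work_mem-sized fraction of the inner batch and rewinding/rescanning
    the outer batch for each fraction (inner join only, for brevity)."""
    results = []
    for start in range(0, len(inner_batch), max_tuples_in_mem):
        # Build a hash table over just this fraction of the inner batch.
        fraction = inner_batch[start:start + max_tuples_in_mem]
        table = {}
        for key, payload in fraction:
            table.setdefault(key, []).append(payload)
        # "Rewind" the outer batch and probe it against this fraction.
        for key, payload in outer_batch:
            for inner_payload in table.get(key, []):
                results.append((key, inner_payload, payload))
    return results
```

Note the cost model this implies: the outer batch is scanned once per
fraction, so runtime grows linearly with the number of refills; and for
outer joins we'd additionally have to track match flags across refills,
as mentioned above.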

2.  We could just abandon hash join for this batch.  "She cannae take
it, captain", so sort inner and outer batches and merge-join them
instead.  Same comment re switching to Grace hash join if this happens
while loading inner batch 0; we'll need a complete inner batch 0 and
outer batch 0, so we can't use the hybrid optimisation.
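In the same toy style, strategy 2's per-batch fallback amounts to
sorting both sides of the overflowing batch and merge-joining them
(again illustrative Python, equijoin/inner join only):

```python
def sort_merge_batch(inner_batch, outer_batch):
    """Fallback for a batch whose hash table cannot fit in work_mem:
    sort both batches on the join key and merge-join them."""
    inner = sorted(inner_batch)
    outer = sorted(outer_batch)
    results = []
    i = 0
    for okey, oval in outer:
        # Advance the inner cursor past keys smaller than the outer key.
        while i < len(inner) and inner[i][0] < okey:
            i += 1
        # Emit every inner tuple sharing this key (without moving i,
        # so duplicate outer keys can re-match the same inner run).
        j = i
        while j < len(inner) and inner[j][0] == okey:
            results.append((okey, inner[j][1], oval))
            j += 1
    return results
```

The appeal is that sort-merge degrades gracefully with external sorts
when the batch doesn't fit, rather than exploding memory.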

Obviously there are vanishing returns here as we add more defences
making it increasingly unlikely that we hit "fail" mode.  But it
bothers me that hash joins in general are not 100% guaranteed to be
able to complete unless you have infinite RAM.

Thomas Munro

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)