Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

Robert Haas Tue, 19 Aug 2014 10:13:16 -0700

On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <t...@fuzzy.cz> wrote:
> On 12.8.2014 00:30, Tomas Vondra wrote:
>> On 11.8.2014 20:25, Robert Haas wrote:
>>> It also strikes me that when there's only 1 batch, the set of bits
>>> that map onto the batch number is zero-width, and one zero-width bit
>>> range is as good as another.  In other words, if you're only planning
>>> to do one batch, you can easily grow the number of buckets on the fly.
>>> Growing the number of buckets only becomes difficult once you have
>>> more than one batch.
>
> ...
>
>> I was considering using reversing the bits of the hash, because that's
>> pretty much the simplest solution. But I think you're right it might
>> actually work like this:
>>
>>   * Are more batches needed?
>>
>>       (yes) => just use nbuckets = my_log2(work_mem / tuple_size)
>>
>>       (no) => go ahead, until processing all tuples or hitting work_mem
>>
>>               (work_mem) => meh, use the same nbuckets above
>>
>>               (all tuples) => compute optimal nbuckets / resize
>>
>>
>> But I need to think about this a bit. So far it seems to me there's no
>> way additional batches might benefit from increasing nbuckets further.
>
> I think this is a simple and solid solution, solving the batchno
> computation issues quite nicely. Attached is v10 patch (bare and
> combined with the dense allocation), that does this:
>
> 1) when we know we'll need batching, buckets are sized for full work_mem
>    (using the estimated tuple width, etc.)
>
> 2) without the batching, we estimate the 'right number of buckets' for
>    the estimated number of tuples, and keep track of the optimal number
>    as tuples are added to the hash table
>
>    - if we discover we need to start batching, we keep the current
>      optimal value (which should be the same as the max number of
>      buckets) and don't mess with it anymore (making it possible to
>      compute batch IDs just like before)
>
>    - also, on the fist rebatch (nbatch=1 => nbatch=2) the hash table
>      is resized as part of the rebatch
>
>    - if the hash build completes without batching, we do the resize
>
> I believe the patch is pretty much perfect. I plan to do more thorough
> testing on a wide range of queries in the next few days.
>
> I also removed the 'enable_hash_resize' GUC, because it would be more
> complex to implement this properly after doing the resize as part of
> rebatch etc.. So either it would make the patch more complex, or it
> wouldn't do what the name promises.


A variety of trivial comments on this:

PostgreSQL style is un-cuddled curly braces.  Also, multi-line
comments need to start with a line containing only /* and end with a
line containing only */.  In one place you've added curly braces
around a single-line block that is otherwise unmodified; please don't
do that.  In one place, you have "becase" instead of "because".  In
another place, you write "add if after it" but it should say "add it
after it" or maybe better "add the new one after it".  Avoid using
punctuation like "=>" in comments to illustrate the connection between
sentences; instead, use a connecting word like "then" or "therefore"
or whatever is appropriate; in this instance, a period followed by the
start of a new sentence seems sufficient.  Revert the removal of a
single line of whitespace near the top of nodeHash.c.

There are too many things marked XXX in this patch.  They should
either be fixed, if they are real problems, or they should be
commented in a way that doesn't give rise to the idea that they're
problems if they aren't.

OK, now on to some more substantive stuff:

1. It's not clear to me what the overall effect of this patch on
memory utilization is.  Reducing NTUP_PER_BUCKET from 10 to 1 is going
to use, on the average, 10x as much bucket-header memory per tuple.
Specifically, I think it means we'll use about 8 bytes of
bucket-header memory per tuple instead of 0.8 bytes per tuple.  If the
tuples are narrow, that could be significant; concerns have been
expressed about that here in the past.  Increasing the number of
buckets could also increase memory usage.  On the other hand, the
dense allocation stuff probably saves a ton of memory, so maybe we end
up overall, but I'm not sure.  Your thoughts, and maybe some test
results with narrow and wide tuples, would be appreciated.

2. But, on the positive side, modulo the memory utilization questions
mentioned above, I would expect the impact on hash join performance to
be positive.  Going from 10 tuples per bucket to just 1 should help,
and on cases where the actual load factor would have ended up much
higher because of poor estimation, increasing the number of buckets on
the fly should help even more.  I haven't tested this, though.

I haven't had a chance to completely go through this yet, so these are
just some initial thoughts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

Reply via email to