On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
> On 12.9.2014 18:49, Robert Haas wrote:
>> On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
>>>> Attached is the patch split as suggested:
>>>>
>>>> (a) hashjoin-nbuckets-v14a-size.patch
>>>>
>>>> * NTUP_PER_BUCKET=1
>>>> * counting buckets towards work_mem
>>>> * changes in ExecChooseHashTableSize (due to the other changes)
>>>
>>> OK, I'm going to work on this one now.  I have some ideas about how
>>> to simplify this a bit and will post an update in a few hours once I
>>> see how that pans out.
>>
>> Here's what I came up with.  I believe this has the same semantics as
>> your patch for less code, but please check, because I might be wrong.
>> The main simplifications I made were:
>>
>> (1) In master, we decide whether or not to batch based on the size of
>> the data, without knowing the number of buckets.  If we want to
>> account for the bucket overhead, that doesn't work.  But in your
>> patch, we basically computed the number of buckets we'd need for the
>> single-batch case, then decided whether to do a single batch, and if
>> yes, computed the same values over again with the same formula phrased
>> differently.  I revised that so that the single-batch case is
>> calculated only once.  If everything fits, we're all set, and don't
>> need to recalculate anything.
>>
>> (2) In your patch, once we decide we need to batch, you set nbatch as
>> now, and then check whether the computed value is still adequate after
>> subtracting bucket_bytes from hash_table_bytes.  I revised that so
>> that we make the initial batch estimate of nbatch by dividing
>> inner_rel_bytes by hash_table_bytes - bucket_bytes rather than by
>> hash_table_bytes.  I think that avoids the need to maybe increase
>> nbatch again afterwards.
>
> Yes, I like those changes and I think your reasoning is correct in both
> cases.
> It certainly makes the method shorter and more readable - I was too
> "stuck" in the original logic, so thanks for improving this.
>
> So +1 from me to those changes.
OK, committed.  Please rebase the rest.

>> I'm comfortable with this version if you are, but (maybe as a
>> follow-on commit) I think we could make this even a bit smarter.  If
>> inner_rel_bytes + bucket_bytes > hash_table_bytes but
>> inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
>> the number of buckets by 2x or 4x to make it fit.  That would provide
>> a bit of additional safety margin against cases where this patch
>> might unluckily lose.
>
> I don't think that's a good idea. That essentially says 'we're
> shooting for NTUP_PER_BUCKET but not really'. Moreover, I often see
> cases where batching (because of smaller work_mem) actually
> significantly improves performance. If we want to make this reasoning
> properly, deciding whether to add batches or reduce buckets, we need a
> better heuristic - that's quite complex and I'd expect it to result in
> a quite complex patch.
>
> So -1 from me to this at this moment (it certainly needs much more
> thought than a mere follow-on commit).

OK, but let's discuss further.  You make it sound like treating
NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced.
I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range:
if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2
or NTUP_PER_BUCKET=4 may be perfectly reasonable.

I'm actually quite surprised that you find batching to be a better
strategy than skimping on buckets, because I would have expected the
opposite, almost categorically.  Batching means having to write out the
tuples we can't process right away and read them back in.  If that
involves disk I/O, I think the cost of that I/O is going to be FAR more
than the overhead we would have incurred by searching slightly longer
bucket chains.  If it didn't, then you could've set work_mem higher and
avoided batching in the first place.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company