On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
> On 12.9.2014 18:49, Robert Haas wrote:
>> On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
>>>> Attached is the patch split as suggested:
>>>>
>>>> (a) hashjoin-nbuckets-v14a-size.patch
>>>>
>>>> * NTUP_PER_BUCKET=1
>>>> * counting buckets towards work_mem
>>>> * changes in ExecChooseHashTableSize (due to the other changes)
>>>
>>> OK, I'm going to work on this one now.  I have some ideas about how
>>> to simplify this a bit and will post an update in a few hours once I
>>> see how that pans out.
>>
>> Here's what I came up with.  I believe this has the same semantics as
>> your patch for less code, but please check, because I might be wrong.
>> The main simplifications I made were:
>>
>> (1) In master, we decide whether or not to batch based on the size of
>> the data, without knowing the number of buckets.  If we want to
>> account for the bucket overhead, that doesn't work.  But in your
>> patch, we basically computed the number of buckets we'd need for the
>> single-batch case, then decided whether to do a single batch, and if
>> yes, computed the same values over again with the same formula phrased
>> differently.  I revised that so that the single-batch case is
>> calculated only once.  If everything fits, we're all set, and don't
>> need to recalculate anything.
>>
>> (2) In your patch, once we decide we need to batch, you set nbatch as
>> now, and then check whether the computed value is still adequate after
>> subtracting bucket_bytes from hash_table_bytes.  I revised that so
>> that we make the initial batch estimate of nbatch by dividing
>> inner_rel_bytes by hash_table_bytes - bucket_bytes rather than by
>> hash_table_bytes.  I think that avoids the need to maybe increase
>> nbatch again afterwards.
>
> Yes, I like those changes and I think your reasoning is correct in both
> cases.
> It certainly makes the method shorter and more readable - I was too
> "stuck" in the original logic, so thanks for improving this.
>
> So +1 from me to those changes.
OK, committed.  Please rebase the rest.

>> I'm comfortable with this version if you are, but (maybe as a
>> follow-on commit) I think we could make this even a bit smarter.  If
>> inner_rel_bytes + bucket_bytes > hash_table_bytes but
>> inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce
>> the number of buckets by 2x or 4x to make it fit.  That would provide
>> a bit of additional safety margin against cases where this patch
>> might unluckily lose.
>
> I don't think that's a good idea. That essentially says 'we're
> shooting for NTUP_PER_BUCKET but not really'. Moreover, I often see
> cases where batching (because of smaller work_mem) actually
> significantly improves performance. If we want to make this reasoning
> properly, deciding whether to add batches or reduce buckets, we need a
> better heuristic - that's quite complex and I'd expect it to result in
> a quite complex patch.
>
> So -1 from me to this at this moment (it certainly needs much more
> thought than a mere follow-on commit).

OK, but let's discuss further.  You make it sound like treating
NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced.
I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range:
if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2
or NTUP_PER_BUCKET=4 may be perfectly reasonable.

I'm actually quite surprised that you find batching to be a better
strategy than skimping on buckets, because I would have expected the
opposite, almost categorically.  Batching means having to write out the
tuples we can't process right away and read them back in.  If that
involves disk I/O, I think the cost of that I/O is going to be FAR more
than the overhead we would have incurred by searching slightly longer
bucket chains.  If it didn't, then you could've set work_mem higher and
avoided batching in the first place.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company