On Tue, Jul 8, 2014 at 6:35 AM, Tomas Vondra <t...@fuzzy.cz> wrote:
> On 8 Červenec 2014, 14:49, Robert Haas wrote:
> > On Wed, Jul 2, 2014 at 8:13 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
> >> I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
> >> once the table is built and there's free space in work_mem. The patch
> >> mentioned above makes implementing this possible / rather simple.
> > Another idea would be to start with NTUP_PER_BUCKET=1 and then, if we
> > run out of memory, increase NTUP_PER_BUCKET. I'd like to think that
> > the common case is that work_mem will be large enough that everything
> > fits; and if you do it that way, then you save yourself the trouble of
> > rehashing later, which as you point out might lose if there are only a
> > few probes. If it turns out that you run short of memory, you can
> > merge pairs of buckets up to three times, effectively doubling
> > NTUP_PER_BUCKET each time.
> Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
> relations it may be way cheaper to use higher NTUP_PER_BUCKET values
> instead of increasing the number of batches (resulting in repeated scans
> of the outer table). I think it's important (but difficult) to keep these
> things somehow balanced.
> With large work_mem values the amount of memory for buckets may be quite
> significant. E.g. 800MB work_mem may easily give ~120MB of memory taken by
> buckets with NTUP_PER_BUCKET=1. With NTUP_PER_BUCKET=4 it's just ~30MB.
That extra 90MB is memory well spent, in my book. Versus having to walk a
4-deep linked list which is scattered all over main memory just to find one
entry we care about in it.
It might cause some things that were very close to the edge to tip over
into multi-pass hash joins, but even that is not necessarily a bad thing.
(When I test with work_mem in the 20 to 50 MB range, I often find
batches=2 be ~30% faster than batches=1, I think because partitioning into
main memory using sequential writes is not much of a burden, and building
and walking two hash tables that both fit in L3 cache is much faster than
building 1 hash table in main memory, and more than makes up for the work
of partitioning. Presumably that situation doesn't apply to work_mem
900MB, but isn't NUMA about the same thing as L4 cache, in effect?).
And if someone had a whole bunch of hash joins which were right in that
anti-sweet spot, all they have to do is increase work_mem by (at most) 15%
to get out of it. work_mem is basically impossible to tune, so I doubt
anyone exists who has a reasonable setting for it which can' be increased
by 15% and still be reasonable. And if someone does have it tuned so
tightly, they probably won't be upgrading to new major versions without
expecting to do some re-tuning.