On Fri, Jan 13, 2017 at 2:36 PM, Peter Geoghegan <p...@heroku.com> wrote:
> [...]
> Yeah. That's basically what the BufFile unification process can
> provide you with (or will, once I get around to implementing the
> refcount thing, which shouldn't be too hard). As already noted, I'll
> also want to make it defer creation of a leader-owned segment, unless
> and until that proves necessary, which it never will for hash join.

Hi Peter,

I have broken this up into a patch series, harmonised the private vs
shared hash table code paths better and fixed many things including
the problems with rescans and regression tests mentioned upthread.
You'll see that one of the patches is that throwaway BufFile
import/export facility, which I'll replace with your code as it
becomes available.

The three 'refactor' patches change the existing hash join code to
work in terms of chunks in more places.  These may be improvements in
their own right, but mainly they pave the way for parallelism.  The
later patches introduce single-batch and then multi-batch shared
hash joins.

The patches in the attached tarball are:


A couple of regression tests would fail with later refactoring that
changes the order of unmatched rows emitted by hash joins.  So first,
let's fix that by adding ORDER BY in those places, without any code
changes.


Before making any code changes, let's add some dtrace probes so that
we can measure time spent doing different phases of hash join work
before and after the later changes.  The main problem with the probes
as I have them here (and the extra probes inserted by later patches in
the series) is that interesting query plans contain multiple hash
joins, so these all get mixed up when you're trying to measure stuff;
perhaps I should pass executor node IDs into all the probes.  More
on this later.  (If people don't want dtrace probes in the executor,
I'm happy to omit them and maintain that kind of thing locally for my
own testing purposes...)


Modify the existing hash join code to work in terms of chunks when
estimating and later tracking memory usage.  This is probably more
accurate than the current tuple-based approach, because it tries to
take into account the space used by chunk headers and the wasted space
in chunks.  In practice the difference is probably small, but it's
arguably more accurate; I did this because I need chunk-based
accounting in the later patches.  Also, make HASH_CHUNK_SIZE the actual
size of allocated chunks (ie the header information is included in
that size so we allocate exactly 32KB, not 32KB + a bit, for the
benefit of the dsa allocator, which otherwise finishes up allocating
more space than necessary.


Modify the existing hash join code to detect work_mem exhaustion at
the point where chunks are allocated, instead of checking after every
tuple insertion.  This matches the logic used for estimating, and more
importantly allows for some parallelism in later patches.


Modify the existing hash join code to handle unmatched tuples in
right/full joins chunk-by-chunk.  This is probably a cache-friendlier
scan order anyway, but the real goal is to provide a natural grain for
parallelism in a later patch.


The patch from a nearby thread previously presented as a dependency of
this project.  It might as well be considered part of this patch
series.


By the time ExecEndNode() runs in workers, ExecShutdownNode() has
already run.  That's done on purpose because, for example, the hash
table needs to survive longer than the parallel environment to allow
EXPLAIN to peek at it.  But it means that the Gather node has thrown
out the shared memory before any parallel-aware node below it gets to
run its Shutdown and End methods.  So I invented ExecDetachNode()
which runs before ExecShutdownNode(), giving parallel-aware nodes a
chance to say goodbye before their shared memory vanishes.  Better
ideas are welcome.


Introduces hash joins with "Shared Hash" and "Parallel Shared Hash"
nodes, for single-batch joins only.  If the planner has a partial
inner plan, it'll pick a Parallel Shared Hash plan to divide that over
K participants.  Failing that, if the planner has a parallel-safe
inner plan and thinks that it can avoid batching by using work_mem * K
memory (shared by all K participants), it will now use a Shared Hash.
Otherwise it'll typically use a Hash plan as before.  Without the
later patches, it will blow through work_mem * K if it turns out to
have underestimated the hash table size, because it lacks
infrastructure for dealing with batches.

The trickiest thing at this point in the series is that participants
(workers and the leader) can show up at any time, so there are three
places that provide synchronisation with a parallel hash join that is
already in progress.  Those can be seen in ExecHashTableCreate,
MultiExecHash and ExecHashJoin (HJ_BUILD_HASHTABLE case).


Simple code for sharing BufFiles between backends.  This is standing
in for Peter G's BufFile sharing facility with refcount-based cleanup.


Adds support for multi-batch joins with shared hash tables.  At this
point, more complications appear: deadlock avoidance with the leader,
batch file sharing and coordinated batch number increases (shrinking
the hash table) while building or loading.

Some thoughts:

* Although this patch series adds a ton of wait points, in the common
case of a single batch inner join there is effectively only one:
participants wait for PHJ_PHASE_BUILDING to end and PHJ_PHASE_PROBING
to begin (resizing the hash table in between if necessary).  For a
single batch outer join, there is one more wait point: participants
wait for PHJ_PHASE_PROBING to end so that PHJ_PHASE_UNMATCHED can
begin.  The length of the wait for PHJ_PHASE_BUILDING to finish is
limited by the grain of the scattered data being loaded into the hash
table: if the source of parallelism is Parallel Seq Scan, then the
worst case scenario is that you run out of tuples to insert and
twiddle your thumbs while some other participant chews on the final
pageful of tuples.  The wait for PHJ_PHASE_UNMATCHED (if applicable)
is similarly limited by the time it takes for the slowest participant
to scan the match bits of one chunk of tuples.  All other phases and
associated wait points relate to multi-batch joins: either running out
of work_mem and needing to shrink the hash table, or coordinating
the loading of various batches; in other words, ugly synchronisation only
enters the picture at the point where hash join starts doing IO
because you don't have enough work_mem.

* I wrestled with rescans for a long time; I think I have it right
now! The key thing to understand is that only the leader runs
ExecHashJoinReScan; new workers will be created for the next scan, so
the leader is able to get the barrier into the right state (attached
and fast-forwarded to PHJ_PHASE_PROBING if reusing the hash table,
detached and in the initial phase PHJ_PHASE_BEGINNING if we need to
recreate it).

* Skew table not supported yet.

* I removed the support for preloading data for the next batch; it
didn't seem to buy anything (it faithfully used up exactly all of your
work_mem for a brief moment, but since probing usually finishes very
close together in all participants anyway, no total execution time
seems to be saved) and added some complexity to the code; might be
worth revisiting but I'm not hopeful.

* The thing where different backends attach at different phases of the
hash join obviously creates a fairly large bug surface; of course we
can review the code and convince ourselves that it is correct, but
what is really needed is a test with 100% coverage that somehow
arranges for a worker to join at phases 0 to 12, and then perhaps also
for the leader to do the same; I have an idea for how to do that with
a debug build, more soon.

* Some of this needs to be more beautiful.

* With the patches up to 0008-hj-shared-single-batch.patch, I find
that typically I can get up to 3x or 4x speedups on queries like TPCH
Q9 that can benefit from a partial inner plan using Parallel Shared
Hash when work_mem is set 'just right', and at least some speedup on
queries without a partial inner plan but where the extra usable memory
available to Shared Hash can avoid the need for batching.  (The best
cases I've seen probably combine these factors: avoiding batching and
dividing work up).

* With the full patch series up to 0010-hj-shared-multi-batch.patch,
it produces some terrible plans for some TPCH queries right now, and
I'm investigating that.  Up to this point I have been focused on
getting the multi-batch code to work correctly, but will now turn some
attention to planning and efficiency and figure out what's happening
there.

Thomas Munro

Attachment: parallel-shared-hash-v4.tgz
Description: GNU Zip compressed data

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)