On Fri, Jan 13, 2017 at 2:36 PM, Peter Geoghegan <p...@heroku.com> wrote:
> [...]
> Yeah. That's basically what the BufFile unification process can
> provide you with (or will, once I get around to implementing the
> refcount thing, which shouldn't be too hard). As already noted, I'll
> also want to make it defer creation of a leader-owned segment, unless
> and until that proves necessary, which it never will for hash join.

Hi Peter,

I have broken this up into a patch series, harmonised the private vs
shared hash table code paths better and fixed many things including
the problems with rescans and regression tests mentioned upthread.
You'll see that one of the patches is that throwaway BufFile
import/export facility, which I'll replace with your code as it
becomes available.

The three 'refactor' patches change the existing hash join code to
work in terms of chunks in more places.  These may be improvements in
their own right, but mainly they pave the way for parallelism.  The
later patches introduce single-batch and then multi-batch shared
hash joins.

The patches in the attached tarball are:


A couple of regression tests would fail with later refactoring that
changes the order of unmatched rows emitted by hash joins.  So first,
let's fix that by adding ORDER BY in those places, without any code
changes.


Before making any code changes, let's add some dtrace probes so that
we can measure time spent doing different phases of hash join work
before and after the later changes.  The main problem with the probes
as I have them here (and the extra probes inserted by later patches in
the series) is that interesting query plans contain multiple hash
joins, so these all get mixed up when you're trying to measure stuff;
perhaps I should pass executor node IDs into all the probes.  More
on this later.  (If people don't want dtrace probes in the executor,
I'm happy to omit them and maintain that kind of thing locally for my
own testing purposes...)


Modify the existing hash join code to work in terms of chunks when
estimating and later tracking memory usage.  This is probably more
accurate than the current tuple-based approach, because it tries to
take into account the space used by chunk headers and the wasted space
in chunks.  In practice the difference is probably small, but it's
arguably more accurate; I did this because I need chunk-based
accounting in the later patches.  Also, make HASH_CHUNK_SIZE the actual
size of allocated chunks (ie the header information is included in
that size so we allocate exactly 32KB, not 32KB + a bit, for the
benefit of the dsa allocator, which otherwise finishes up allocating
more space than necessary.


Modify the existing hash join code to detect work_mem exhaustion at
the point where chunks are allocated, instead of checking after every
tuple insertion.  This matches the logic used for estimating, and more
importantly allows for some parallelism in later patches.


Modify the existing hash join code to handle unmatched tuples in
right/full joins chunk-by-chunk.  This is probably a cache-friendlier
scan order anyway, but the real goal is to provide a natural grain for
parallelism in a later patch.


The patch from a nearby thread previously presented as a dependency of
this project.  It might as well be considered part of this patch
series.


By the time ExecEndNode() runs in workers, ExecShutdownNode() has
already run.  That's done on purpose because, for example, the hash
table needs to survive longer than the parallel environment to allow
EXPLAIN to peek at it.  But it means that the Gather node has thrown
out the shared memory before any parallel-aware node below it gets to
run its Shutdown and End methods.  So I invented ExecDetachNode()
which runs before ExecShutdownNode(), giving parallel-aware nodes a
chance to say goodbye before their shared memory vanishes.  Better
ideas are welcome.


Introduces hash joins with "Shared Hash" and "Parallel Shared Hash"
nodes, for single-batch joins only.  If the planner has a partial
inner plan, it'll pick a Parallel Shared Hash plan to divide that over
K participants.  Failing that, if the planner has a parallel-safe
inner plan and thinks that it can avoid batching by using work_mem * K
memory (shared by all K participants), it will now use a Shared Hash.
Otherwise it'll typically use a Hash plan as before.  Without the
later patches, it will blow through work_mem * K if it turns out to
have underestimated the hash table size, because it lacks
infrastructure for dealing with batches.

The trickiest thing at this point in the series is that participants
(workers and the leader) can show up at any time, so there are three
places that provide synchronisation with a parallel hash join that is
already in progress.  Those can be seen in ExecHashTableCreate,
MultiExecHash and ExecHashJoin (HJ_BUILD_HASHTABLE case).


Simple code for sharing BufFiles between backends.  This is standing
in for Peter G's BufFile sharing facility with refcount-based cleanup.


Adds support for multi-batch joins with shared hash tables.  At this
point, more complications appear: deadlock avoidance with the leader,
batch file sharing and coordinated batch number increases (shrinking
the hash table) while building or loading.

Some thoughts:

* Although this patch series adds a ton of wait points, in the common
case of a single batch inner join there is effectively only one:
participants wait for PHJ_PHASE_BUILDING to end and PHJ_PHASE_PROBING
to begin (resizing the hash table in between if necessary).  For a
single batch outer join, there is one more wait point: participants
wait for PHJ_PHASE_PROBING to end so that PHJ_PHASE_UNMATCHED can
begin.  The length of the wait for PHJ_PHASE_BUILDING to finish is
limited by the grain of the scattered data being loaded into the hash
table: if the source of parallelism is Parallel Seq Scan, then the
worst case scenario is that you run out of tuples to insert and
twiddle your thumbs while some other participant chews on the final
pageful of tuples.  The wait for PHJ_PHASE_UNMATCHED (if applicable)
is similarly limited by the time it takes for the slowest participant
to scan the match bits of one chunk of tuples.  All other phases and
associated wait points relate to multi-batch joins: either running out
of work_mem and needing to shrink the hash table, or coordinating
the loading of various batches; in other words, ugly synchronisation only
enters the picture at the point where hash join starts doing IO
because you don't have enough work_mem.

* I wrestled with rescans for a long time; I think I have it right
now! The key thing to understand is that only the leader runs
ExecHashJoinReScan; new workers will be created for the next scan, so
the leader is able to get the barrier into the right state (attached
and fast-forwarded to PHJ_PHASE_PROBING if reusing the hash table,
detached and in the initial phase PHJ_PHASE_BEGINNING if we need to
recreate it).

* Skew table not supported yet.

* I removed the support for preloading data for the next batch; it
didn't seem to buy anything (it faithfully used up exactly all of your
work_mem for a brief moment, but since probing usually finishes very
close together in all participants anyway, no total execution time
seems to be saved) and added some complexity to the code; might be
worth revisiting but I'm not hopeful.

* The thing where different backends attach at different phases of the
hash join obviously creates a fairly large bug surface; of course we
can review the code and convince ourselves that it is correct, but
what is really needed is a test with 100% coverage that somehow
arranges for a worker to join at phases 0 to 12, and then perhaps also
for the leader to do the same; I have an idea for how to do that with
a debug build, more soon.

* Some of this needs to be more beautiful.

* With the patches up to 0008-hj-shared-single-batch.patch, I find
that typically I can get up to 3x or 4x speedups on queries like TPCH
Q9 that can benefit from a partial inner plan using Parallel Shared
Hash when work_mem is set 'just right', and at least some speedup on
queries without a partial inner plan but where the extra usable memory
available to Shared Hash can avoid the need for batching.  (The best
cases I've seen probably combine these factors: avoiding batching and
dividing work up).

* With the full patch series up to 0010-hj-shared-multi-batch.patch,
it produces some terrible plans for some TPCH queries right now, and
I'm investigating that.  Up to this point I have been focused on
getting the multi-batch code to work correctly, but will now turn some
attention to planning and efficiency and figure out what's happening
there.

Thomas Munro

Attachment: parallel-shared-hash-v4.tgz
Description: GNU Zip compressed data

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)