Re: [HACKERS] WIP: [[Parallel] Shared] Hash

Andres Freund Wed, 15 Feb 2017 18:37:22 -0800

Hi,

On 2017-02-13 23:57:00 +1300, Thomas Munro wrote:
> Here's a new version to fix the problems reported by Rafia above.  The
> patch descriptions are as before but it starts from 0002 because 0001
> was committed as 7c5d8c16 (thanks, Andres).


FWIW, I'd appreciate if you'd added a short commit message to the
individual patches - I find it helpful to have a littlebit more context
while looking at them than just the titles.  Alternatively you can
include that text when re-posting the series, but it's imo just as easy
to have a short commit message (and just use format-patch).

I'm for now using [1] as context.


0002-hj-add-dtrace-probes-v5.patch

Hm. I'm personally very unenthusiastic about addming more of these, and
would rather rip all of them out.  I tend to believe that static
problems simply aren't a good approach for anything requiring a lot of
detail.  But whatever.


0003-hj-refactor-memory-accounting-v5.patch
@@ -424,15 +422,29 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, 
bool useskew,
        if (ntuples <= 0.0)
                ntuples = 1000.0;
 
-       /*
-        * Estimate tupsize based on footprint of tuple in hashtable... note 
this
-        * does not allow for any palloc overhead.  The manipulations of 
spaceUsed
-        * don't count palloc overhead either.
-        */
+       /* Estimate tupsize based on footprint of tuple in hashtable. */

palloc overhead is still unaccounted for, no? In the chunked case that
might not be much, I realize that (so that comment should probably have
been updated when chunking was introduced).

-       Size            spaceUsed;              /* memory space currently used 
by tuples */
+       Size            spaceUsed;              /* memory space currently used 
by hashtable */

It's not really the full hashtable, is it? The ->buckets array appears
to still be unaccounted for.

Looks ok.


0004-hj-refactor-batch-increases-v5.patch

@@ -1693,10 +1689,12 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
 }
 
 /*
- * Allocate 'size' bytes from the currently active HashMemoryChunk
+ * Allocate 'size' bytes from the currently active HashMemoryChunk.  If
+ * 'respect_work_mem' is true, this may cause the number of batches to be
+ * increased in an attempt to shrink the hash table.
  */
 static void *
-dense_alloc(HashJoinTable hashtable, Size size)
+dense_alloc(HashJoinTable hashtable, Size size, bool respect_work_mem)

{
        HashMemoryChunk newChunk;
        char       *ptr;
@@ -1710,6 +1708,15 @@ dense_alloc(HashJoinTable hashtable, Size size)
         */
        if (size > HASH_CHUNK_THRESHOLD)
        {
+               if (respect_work_mem &&
+                       hashtable->growEnabled &&
+                       hashtable->spaceUsed + HASH_CHUNK_HEADER_SIZE + size >
+                       hashtable->spaceAllowed)
+               {
+                       /* work_mem would be exceeded: try to shrink hash table 
*/
+                       ExecHashIncreaseNumBatches(hashtable);
+               }
+

Isn't it kinda weird to do this from within dense_alloc()?  I mean that
dumps a lot of data to disk, frees a bunch of memory and so on - not
exactly what "dense_alloc" implies.  Isn't the free()ing part also
dangerous, because the caller might actually use some of that memory,
like e.g. in ExecHashRemoveNextSkewBucket() or such.  I haven't looked
deeply enough to check whether that's an active bug, but it seems like
inviting one if not.


0005-hj-refactor-unmatched-v5.patch

I'm a bit confused as to why unmatched tuple scan is a good parallelism
target, but I might see later...

0006-hj-barrier-v5.patch

Skipping that here.


0007-hj-exec-detach-node-v5.patch

Hm. You write elsewhere:
> By the time ExecEndNode() runs in workers, ExecShutdownNode() has
> already run.  That's done on purpose because, for example, the hash
> table needs to survive longer than the parallel environment to allow
> EXPLAIN to peek at it.  But it means that the Gather node has thrown
> out the shared memory before any parallel-aware node below it gets to
> run its Shutdown and End methods.  So I invented ExecDetachNode()
> which runs before ExecShutdownNode(), giving parallel-aware nodes a
> chance to say goodbye before their shared memory vanishes.  Better
> ideas?

To me that is a weakness in the ExecShutdownNode() API - imo child nodes
should get the chance to shutdown before the upper-level node.
ExecInitNode/ExecEndNode etc give individual nodes the freedom to do
things in the right order, but ExecShutdownNode() doesn't.  I don't
quite see why we'd want to invent a separate ExecDetachNode() that'd be
called immediately before ExecShutdownNode().

An easy way to change that would be to return in the
ExecShutdownNode()'s T_GatherState case, and delegate the responsibility
of calling it on Gather's children to ExecShutdownGather().
Alternatively we could make it a full-blown thing like ExecInitNode()
that every node needs to implement, but that seems a bit painful.

Or have I missed something here?

Random aside: Wondered before if having to provide all executor
callbacks is a weakness of our executor integration, and whether it
shouldn't be a struct of callbacks instead...


0008-hj-shared-single-batch-v5.patch

First-off: I wonder if we should get the HASHPATH_TABLE_SHARED_SERIAL
path committed first. ISTM that's already quite beneficial, and there's
a good chunk of problems that we could push out initially.

This desperately needs tests.

Have you measured whether the new branches in nodeHash[join] slow down
non-parallel executions?  I do wonder if it'd not be better to have to
put the common code in helper functions and have seperate
T_SharedHashJoin/T_SharedHash types.  If you both have a parallel and
non-parallel hash in the same query, the branches will be hard to
predict...

I think the synchronization protocol with the various phases needs to be
documented somewhere.  Probably in nodeHashjoin.c's header.

The state machine code in MultiExecHash() also needs more
comments. Including the fact that avoiding repeating work is done by
"electing" leaders via BarrierWait().

I wonder if it wouldn't be better to inline the necessary code into the
switch (with fall-throughs), instead of those gotos; putting some of the
relevant code (particularly the scanning of the child node) into
seperate functions.

+ build:
+       if (HashJoinTableIsShared(hashtable))
+       {
+               /* Make sure our local state is up-to-date so we can build. */
+               Assert(BarrierPhase(barrier) == PHJ_PHASE_BUILDING);
+               ExecHashUpdate(hashtable);
+       }
+
        /*
         * set expression context
         */
@@ -128,18 +197,78 @@ MultiExecHash(HashState *node)

Why's is the parallel code before variable initialization stuff like
setting up econtext?


> Introduces hash joins with "Shared Hash" and "Parallel Shared Hash"
> nodes, for single-batch joins only.

We don't necessarily know that ahead of time.  So this isn't something
that we could actually apply separately, right?


        /* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
-       if (hashtable->nbuckets != hashtable->nbuckets_optimal)
-               ExecHashIncreaseNumBuckets(hashtable);
+       ExecHashUpdate(hashtable);
+       ExecHashIncreaseNumBuckets(hashtable);

It's kinda weird that we had the nearly redundant nbuckets !=
nbuckets_optimal checks before...


+static void *
+dense_alloc_shared(HashJoinTable hashtable,
+                                  Size size,
+                                  dsa_pointer *shared)

Hm. I wonder if HASH_CHUNK_SIZE being only 32kb isn't going to be a
bottlenck here.


@@ -195,6 +238,40 @@ ExecHashJoin(HashJoinState *node)
                                if (TupIsNull(outerTupleSlot))
                                {
                                        /* end of batch, or maybe whole join */
+
+                                       if (HashJoinTableIsShared(hashtable))
+                                       {
+                                               /*
+                                                * An important optimization: 
if this is a
+                                                * single-batch join and not an 
outer join, there is
+                                                * no reason to synchronize 
again when we've finished
+                                                * probing.
+                                                */
+                                               
Assert(BarrierPhase(&hashtable->shared->barrier) ==
+                                                          PHJ_PHASE_PROBING);
+                                               if (hashtable->nbatch == 1 && 
!HJ_FILL_INNER(node))
+                                                       return NULL;    /* end 
of join */
+

I think it's a bit weird that the parallel path now has an exit path
that the non-parallel path doesn't have.


+        * If this is a shared hash table, there is a extra charge for inserting
+        * each tuple into the shared hash table to cover memory synchronization
+        * overhead, compared to a private hash table.  There is no extra charge
+        * for probing the hash table for outer path row, on the basis that
+        * read-only access to a shared hash table shouldn't be any more
+        * expensive.
+        *
+        * cpu_shared_tuple_cost acts a tie-breaker controlling whether we 
prefer
+        * HASHPATH_TABLE_PRIVATE or HASHPATH_TABLE_SHARED_SERIAL plans when the
+        * hash table fits in work_mem, since the cost is otherwise the same.  
If
+        * it is positive, then we'll prefer private hash tables, even though 
that
+        * means that we'll be running N copies of the inner plan.  Running N
+        * copies of the copies of the inner plan in parallel is not considered
+        * more expensive than running 1 copy of the inner plan while N-1
+        * participants do nothing, despite doing less work in total.
+        */
+       if (table_type != HASHPATH_TABLE_PRIVATE)
+               startup_cost += cpu_shared_tuple_cost * inner_path_rows;
+
+       /*
+        * If this is a parallel shared hash table, then the value we have for
+        * inner_rows refers only to the rows returned by each participant.  For
+        * shared hash table size estimation, we need the total number, so we 
need
+        * to undo the division.
+        */
+       if (table_type == HASHPATH_TABLE_SHARED_PARALLEL)
+               inner_path_rows_total *= get_parallel_divisor(inner_path);
+
+       /*

Is the per-tuple cost really the same for HASHPATH_TABLE_SHARED_SERIAL
and PARALLEL?

Don't we also need to somehow account for the more expensive hash-probes
in the HASHPATH_TABLE_SHARED_* cases? Seems quite possible that we'll
otherwise tend to use shared tables for small hashed tables that are
looked up very frequently, even though a private one will likely be
faster.


+       /*
+        * Set the table as sharable if appropriate, with parallel or serial
+        * building.  If parallel, the executor will also need an estimate of 
the
+        * total number of rows expected from all participants.
+        */

Oh. I was about to comment that sharable is wrong, just to discover it's
valid in NA. Weird.


@@ -2096,6 +2096,7 @@ create_mergejoin_path(PlannerInfo *root,
  * 'required_outer' is the set of required outer rels
  * 'hashclauses' are the RestrictInfo nodes to use as hash clauses
  *             (this should be a subset of the restrict_clauses list)
+ * 'table_type' to select [[Parallel] Shared] Hash
  */
 HashPath *
 create_hashjoin_path(PlannerInfo *root,

Reminds me that you're not denoting the Parallel bit in explain right
now - intentionally so?


 /*
- * To reduce palloc overhead, the HashJoinTuples for the current batch are
- * packed in 32kB buffers instead of pallocing each tuple individually.
+ * To reduce palloc/dsa_allocate overhead, the HashJoinTuples for the current
+ * batch are packed in 32kB buffers instead of pallocing each tuple
+ * individually.

s/palloc\/dsa_allocate/allocator/?


@@ -112,8 +121,12 @@ typedef struct HashMemoryChunkData
        size_t          maxlen;                 /* size of the buffer holding 
the tuples */
        size_t          used;                   /* number of buffer bytes 
already used */
 
-       struct HashMemoryChunkData *next;       /* pointer to the next chunk 
(linked
-                                                                               
 * list) */
+       /* pointer to the next chunk (linked list) */
+       union
+       {
+               dsa_pointer shared;
+               struct HashMemoryChunkData *unshared;
+       } next;

This'll increase memory usage on some platforms, e.g. when using
spinlock backed atomics.  I tend to think that that's fine, but it's
probably worth calling out.


--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -787,7 +787,15 @@ typedef enum
        WAIT_EVENT_MQ_SEND,
        WAIT_EVENT_PARALLEL_FINISH,
        WAIT_EVENT_SAFE_SNAPSHOT,
-       WAIT_EVENT_SYNC_REP
+       WAIT_EVENT_SYNC_REP,
+       WAIT_EVENT_HASH_BEGINNING,
+       WAIT_EVENT_HASH_CREATING,
+       WAIT_EVENT_HASH_BUILDING,
+       WAIT_EVENT_HASH_RESIZING,
+       WAIT_EVENT_HASH_REINSERTING,
+       WAIT_EVENT_HASH_UNMATCHED,
+       WAIT_EVENT_HASHJOIN_PROBING,
+       WAIT_EVENT_HASHJOIN_REWINDING
 } WaitEventIPC;

Hm. That seems a bit on the detailed side - if we're going that way it
seems likely that we'll end up with hundreds of wait events. I don't
think gradually evolving wait events into something like a query
progress framework is a good idea.



That's it for now...

- Andres

[1] 
http://archives.postgresql.org/message-id/CAEepm%3D1D4-tP7j7UAgT_j4ZX2j4Ehe1qgZQWFKBMb8F76UW5Rg%40mail.gmail.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: [[Parallel] Shared] Hash

Reply via email to