On Thu, Jan 12, 2017 at 9:07 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Wed, Jan 11, 2017 at 2:56 PM, Peter Geoghegan <p...@heroku.com> wrote:
> > On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro
> > <thomas.mu...@enterprisedb.com> wrote:
> >> Here is a new WIP patch. I have plenty of things to tidy up (see note
> >> at end), but the main ideas are now pretty clear and I'd appreciate
> >> some feedback.
> >
> > I have some review feedback for your V3. I've chosen to start with the
> > buffile.c stuff, since of course it might share something with my
> > parallel tuplesort patch. This isn't comprehensive, but I will have
> > more comprehensive feedback soon.
>
> Thanks!
>
> > I'm not surprised that you've generally chosen to make shared BufFile
> > management as simple as possible, with no special infrastructure other
> > than the ability to hold open other backend temp files concurrently
> > within a worker, and no writing to another worker's temp file, or
> > shared read pointer. As you put it, everything is immutable. I
> > couldn't see much opportunity for adding a lot of infrastructure that
> > wasn't written explicitly as parallel hash join code/infrastructure.
> > My sense is that that was a good decision. I doubted that you'd ever
> > want some advanced, generic shared BufFile thing with multiple read
> > pointers, built-in cache coherency, etc. (Robert seemed to think that
> > you'd go that way, though.)
>
> Right, this is extremely minimalist infrastructure. fd.c is
> unchanged. buffile.c only gains the power to export/import read-only
> views of BufFiles. There is no 'unification' of BufFiles: each hash
> join participant simply reads from the buffile it wrote, and then
> imports and reads from its peers' BufFiles, until all are exhausted;
> so the 'unification' is happening in caller code which knows about the
> set of participants and manages shared read positions. Clearly there
> are some ownership/cleanup issues to straighten out, but I think those
> problems are fixable (probably involving refcounts).
>
> I'm entirely willing to throw that away and use the unified BufFile
> concept, if it can be extended to support multiple readers of the
> data, where every participant unifies the set of files. I have so far
> assumed that it would be most efficient for each participant to read
> from the file that it wrote before trying to read from files written
> by other participants. I'm reading your patch now; more soon.
>
> > Anyway, some more specific observations:
> >
> > * ISTM that this is the wrong thing for shared BufFiles:
> >
> >> +BufFile *
> >> +BufFileImport(BufFileDescriptor *descriptor)
> >> +{
> > ...
> >> +	file->isInterXact = true;	/* prevent cleanup by this backend */
> >
> > There is only one user of isInterXact = true BufFiles at present,
> > tuplestore.c. It, in turn, only does so for cases that require
> > persistent tuple stores. A quick audit of these tuplestore.c callers
> > show this to just be cursor support code within portalmem.c. Here is
> > the relevant tuplestore_begin_heap() rule that that code adheres to,
> > unlike the code I've quoted above:
> >
> > * interXact: if true, the files used for on-disk storage persist beyond the
> > * end of the current transaction. NOTE: It's the caller's responsibility to
> > * create such a tuplestore in a memory context and resource owner that will
> > * also survive transaction boundaries, and to ensure the tuplestore is closed
> > * when it's no longer wanted.
>
> Hmm. Yes, that is an entirely bogus use of isInterXact. I am
> thinking about how to fix that with refcounts.
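To make the export/import arrangement described above a bit more concrete, here is a minimal sketch of how it might hang together at the caller level. Only BufFileImport() and BufFileDescriptor appear in the quoted patch excerpt; BufFileExport(), the shared struct layout, and the consume_tuples() helper are assumptions made purely for illustration.

    #include "postgres.h"
    #include "storage/buffile.h"

    /* Hypothetical helper: read tuples back from a BufFile until exhausted. */
    static void consume_tuples(BufFile *file);

    /*
     * Sketch only: a shared-memory area where each participant publishes a
     * descriptor for the temporary file it wrote.
     */
    typedef struct SharedBatchFiles
    {
        int               nparticipants;
        BufFileDescriptor files[FLEXIBLE_ARRAY_MEMBER];
    } SharedBatchFiles;

    static void
    publish_batch_file(SharedBatchFiles *shared, int my_index, BufFile *file)
    {
        /* Export enough metadata for peers to reopen this file read-only. */
        BufFileExport(file, &shared->files[my_index]);  /* assumed API */
    }

    static void
    drain_all_batch_files(SharedBatchFiles *shared, int my_index, BufFile *my_file)
    {
        int     i;

        /* Read back the file this participant wrote first... */
        consume_tuples(my_file);

        /* ...then import and drain read-only views of the peers' files. */
        for (i = 0; i < shared->nparticipants; i++)
        {
            BufFile    *peer;

            if (i == my_index)
                continue;
            peer = BufFileImport(&shared->files[i]);
            consume_tuples(peer);
            BufFileClose(peer);
        }
    }

The point of the sketch is that the "unification" lives entirely in caller code: buffile.c only has to know how to hand out and reopen read-only views, while the hash join code decides who reads what and in which order.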
>
> > I don't think it's right for buffile.c to know anything about file
> > paths directly -- I'd say that that's a modularity violation.
> > PathNameOpenFile() is called by very few callers at the moment, all of
> > them very low level (e.g. md.c), but you're using it within buffile.c
> > to open a path to the file that you obtain from shared memory
>
> Hmm. I'm not seeing the modularity violation. buffile.c uses
> interfaces already exposed by fd.c to do this: OpenTemporaryFile,
> then FilePathName to find the path, then PathNameOpenFile to open from
> another process. I see that your approach instead has client code
> provide more meta data so that things can be discovered, which may
> well be a much better idea.
>
> > directly. This is buggy because the following code won't be reached in
> > workers that call your BufFileImport() function:
> >
> > /* Mark it for deletion at close */
> > VfdCache[file].fdstate |= FD_TEMPORARY;
> >
> > /* Register it with the current resource owner */
> > if (!interXact)
> > {
> >     VfdCache[file].fdstate |= FD_XACT_TEMPORARY;
> >
> >     ResourceOwnerEnlargeFiles(CurrentResourceOwner);
> >     ResourceOwnerRememberFile(CurrentResourceOwner, file);
> >     VfdCache[file].resowner = CurrentResourceOwner;
> >
> >     /* ensure cleanup happens at eoxact */
> >     have_xact_temporary_files = true;
> > }
>
> Right, that is a problem. A refcount mode could fix that; virtual
> file descriptors would be closed in every backend using the current
> resource owner, and the files would be deleted when the last one turns
> out the lights.
>
> > Certainly, you don't want the "Mark it for deletion at close" bit.
> > Deletion should not happen at eoxact for non-owners-but-sharers
> > (within FileClose()), but you *do* want CleanupTempFiles() to call
> > FileClose() for the virtual file descriptors you've opened in the
> > backend, to do some other cleanup. In general, you want to buy into
> > resource ownership for workers. As things stand, I think that this
> > will leak virtual file descriptors. That's really well hidden because
> > there is a similar CleanupTempFiles() call at proc exit, I think.
> > (Didn't take the time to make sure that that's what masked problems.
> > I'm sure that you want minimal divergence with serial cases,
> > resource-ownership-wise, in any case.)
> >
> > Instead of all this, I suggest copying some of my changes to fd.c, so
> > that resource ownership within fd.c differentiates between a vfd that
> > is owned by the backend in the conventional sense, including having a
> > need to delete at eoxact, as well as a lesser form of ownership where
> > deletion should not happen. Maybe you'll end up using my BufFileUnify
> > interface [1] within workers (instead of just within the leader, as
> > with parallel tuplesort), and have it handle all of that for you.
> > Currently, that would mean that there'd be an unused/0 sized "local"
> > segment for the unified BufFile, but I was thinking of making that not
> > happen unless and until a new segment is actually needed, so even that
> > minor wart wouldn't necessarily affect you.
>
> Ok, I'm studying that code now.
>
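For what it's worth, the "last one turns out the lights" refcount idea could be sketched roughly as below. Everything here, including the struct and function names, is an assumption for illustration rather than anything from the patch; a real version would presumably hook into resource owners and/or an on_dsm_detach callback so the count is decremented even on error paths.

    #include "postgres.h"
    #include "storage/spin.h"

    /* Hypothetical helpers: close this backend's vfds / unlink the files. */
    static void close_my_vfds(void);
    static void unlink_shared_temp_files(void);

    /*
     * Sketch only: shared refcount for a set of temp files, kept in the DSM
     * segment and initialized with SpinLockInit() by the creating backend.
     */
    typedef struct SharedFileSetControl
    {
        slock_t     mutex;
        int         refcnt;     /* number of attached backends */
    } SharedFileSetControl;

    static void
    shared_fileset_attach(SharedFileSetControl *ctl)
    {
        SpinLockAcquire(&ctl->mutex);
        ctl->refcnt++;
        SpinLockRelease(&ctl->mutex);
    }

    static void
    shared_fileset_detach(SharedFileSetControl *ctl)
    {
        bool        last;

        SpinLockAcquire(&ctl->mutex);
        last = (--ctl->refcnt == 0);
        SpinLockRelease(&ctl->mutex);

        /* Every backend closes its own virtual file descriptors... */
        close_my_vfds();

        /* ...but only the last backend to detach deletes the files on disk. */
        if (last)
            unlink_shared_temp_files();
    }

That would let workers keep the usual resource-owner cleanup of vfds (no leak at eoxact), while physical deletion is deferred to whichever participant detaches last.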
> >> Some assorted notes on the status: I need to do some thinking about
> >> the file cleanup logic: both explicit deletes at the earliest possible
> >> time, and failure/error paths. Currently the creator of each file is
> >> responsible for cleaning it up, but I guess if the creator aborts
> >> early the file disappears underneath the others' feet, and then I
> >> guess they might raise a confusing error report that races against the
> >> root cause error report; I'm looking into that. Rescans and skew
> >> buckets not finished yet.
> >
> > The rescan code path seems to segfault when the regression tests are
> > run. There is a NULL pointer dereference here:
> >
> >> @@ -985,6 +1855,14 @@ ExecReScanHashJoin(HashJoinState *node)
> >>          node->hj_HashTable = NULL;
> >>          node->hj_JoinState = HJ_BUILD_HASHTABLE;
> >>
> >> +        if (HashJoinTableIsShared(node->hj_HashTable))
> >> +        {
> >> +            /* Coordinate a rewind to the shared hash table creation phase. */
> >> +            BarrierWaitSet(&hashNode->shared_table_data->barrier,
> >> +                           PHJ_PHASE_BEGINNING,
> >> +                           WAIT_EVENT_HASHJOIN_REWINDING3);
> >> +        }
> >> +
> >
> > Clearly, HashJoinTableIsShared() should not be called when its
> > argument (in this case node->hj_HashTable) is NULL.
> >
> > In general, I think you should try to set expectations about what
> > happens when the regression tests run up front, because that's usually
> > the first thing reviewers do.
>
> Apologies, poor form. That block can be commented out for now because
> rescan support is obviously incomplete, and I didn't mean to post it
> that way. Doing so reveals two remaining test failures: "join" and
> "rowsecurity" managed to lose a couple of rows. Oops. I will figure
> out what I broke and have a fix for that in my next version.
>
> > Various compiler warnings on my system:
> >
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHash.c:1376:7:
> > warning: variable ‘size_before_shrink’ set but not used
> > [-Wunused-but-set-variable]
> >   Size size_before_shrink = 0;
> >        ^
>
> In this case it was only used in dtrace builds; I will make sure any
> such code is compiled out when in non-dtrace builds.
>
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:
> > In function ‘ExecHashJoinCloseBatch’:
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:1548:28:
> > warning: variable ‘participant’ set but not used
> > [-Wunused-but-set-variable]
> >   HashJoinParticipantState *participant;
> >                             ^
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:
> > In function ‘ExecHashJoinRewindBatches’:
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:1587:23:
> > warning: variable ‘batch_reader’ set but not used
> > [-Wunused-but-set-variable]
> >   HashJoinBatchReader *batch_reader;
> >                        ^
> >
> > Is this change really needed?:
> >
> >> --- a/src/backend/executor/nodeSeqscan.c
> >> +++ b/src/backend/executor/nodeSeqscan.c
> >> @@ -31,6 +31,8 @@
> >>  #include "executor/nodeSeqscan.h"
> >>  #include "utils/rel.h"
> >>
> >> +#include <unistd.h>
> >> +
> >>  static void InitScanRelation(SeqScanState *node, EState *estate, int eflags);
> >>  static TupleTableSlot *SeqNext(SeqScanState *node);
>
> Right, will clean up.
>
> > That's all I have for now...
>
> Thanks! I'm away from my computer for a couple of days but will have
> a new patch series early next week, and hope to have a better handle
> on what's involved in adopting the 'unification' approach here
> instead.
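Coming back to the rescan crash discussed above: since node->hj_HashTable has just been set to NULL two lines earlier in the quoted hunk, a NULL test in front of HashJoinTableIsShared() avoids the segfault and, in effect, disables the block, which matches the suggestion to leave it out until rescan support is finished. A sketch of that stop-gap guard, using only names from the quoted hunk:

    /*
     * In ExecReScanHashJoin(): only coordinate a rewind if a shared hash
     * table still exists; node->hj_HashTable may be (and here is) NULL.
     */
    if (node->hj_HashTable != NULL &&
        HashJoinTableIsShared(node->hj_HashTable))
    {
        /* Coordinate a rewind to the shared hash table creation phase. */
        BarrierWaitSet(&hashNode->shared_table_data->barrier,
                       PHJ_PHASE_BEGINNING,
                       WAIT_EVENT_HASHJOIN_REWINDING3);
    }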
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Hi Thomas,

I was trying to analyse the performance of TPC-H queries with your patch
and came across the following results: Q9 and Q21 were crashing, and both
of them had the following backtrace in the core dump (I thought it might
be helpful):

#0  0x0000000010757da4 in pfree (pointer=0x3fff78d11000) at mcxt.c:1012
#1  0x000000001032c574 in ExecHashIncreaseNumBatches (hashtable=0x1003af6da60) at nodeHash.c:1124
#2  0x000000001032d518 in ExecHashTableInsert (hashtable=0x1003af6da60, slot=0x1003af695c0,
    hashvalue=2904801109, preload=1 '\001') at nodeHash.c:1700
#3  0x0000000010330fd4 in ExecHashJoinPreloadNextBatch (hjstate=0x1003af39118) at nodeHashjoin.c:886
#4  0x00000000103301fc in ExecHashJoin (node=0x1003af39118) at nodeHashjoin.c:376
#5  0x0000000010308644 in ExecProcNode (node=0x1003af39118) at execProcnode.c:490
#6  0x000000001031f530 in fetch_input_tuple (aggstate=0x1003af38910) at nodeAgg.c:587
#7  0x0000000010322b50 in agg_fill_hash_table (aggstate=0x1003af38910) at nodeAgg.c:2304
#8  0x000000001032239c in ExecAgg (node=0x1003af38910) at nodeAgg.c:1942
#9  0x0000000010308694 in ExecProcNode (node=0x1003af38910) at execProcnode.c:509
#10 0x0000000010302a1c in ExecutePlan (estate=0x1003af37fa0, planstate=0x1003af38910,
    use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0,
    direction=ForwardScanDirection, dest=0x1003af19390) at execMain.c:1587

In case you want to know, I was using TPC-H at scale factor 20. Please let
me know if you want any more information on this.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/