Re: [HACKERS] WIP: [[Parallel] Shared] Hash

Thomas Munro Wed, 11 Jan 2017 19:38:41 -0800

On Wed, Jan 11, 2017 at 2:56 PM, Peter Geoghegan <[email protected]> wrote:
> On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro
> <[email protected]> wrote:
>> Here is a new WIP patch.  I have plenty of things to tidy up (see note
>> at end), but the main ideas are now pretty clear and I'd appreciate
>> some feedback.
>
> I have some review feedback for your V3. I've chosen to start with the
> buffile.c stuff, since of course it might share something with my
> parallel tuplesort patch. This isn't comprehensive, but I will have
> more comprehensive feedback soon.


Thanks!

> I'm not surprised that you've generally chosen to make shared BufFile
> management as simple as possible, with no special infrastructure other
> than the ability to hold open other backend temp files concurrently
> within a worker, and no writing to another worker's temp file, or
> shared read pointer. As you put it, everything is immutable. I
> couldn't see much opportunity for adding a lot of infrastructure that
> wasn't written explicitly as parallel hash join code/infrastructure.
> My sense is that that was a good decision. I doubted that you'd ever
> want some advanced, generic shared BufFile thing with multiple read
> pointers, built-in cache coherency, etc. (Robert seemed to think that
> you'd go that way, though.)

Right, this is extremely minimalist infrastructure.  fd.c is
unchanged.  buffile.c only gains the power to export/import read-only
views of BufFiles.  There is no 'unification' of BufFiles: each hash
join participant simply reads from the BufFile it wrote, and then
imports and reads from its peers' BufFiles, until all are exhausted.
The 'unification' therefore happens in caller code, which knows about
the set of participants and manages shared read positions.  Clearly
there are some ownership/cleanup issues to straighten out, but I think
those problems are fixable (probably involving refcounts).
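
To make the read order concrete, here is a toy sketch (not code from
the patch).  The helper name read_order is hypothetical, and visiting
peers in round-robin order starting after one's own file is an
assumption for illustration; the only part taken from the description
above is "own file first, then everyone else's".

```c
#include <assert.h>

/*
 * Hypothetical helper: fill 'order' with the sequence in which
 * participant 'self' visits the batch files of all 'nparticipants'
 * participants.  Its own file comes first; peers follow in
 * round-robin order (an assumption, not something the patch states).
 */
static void
read_order(int self, int nparticipants, int *order)
{
    assert(nparticipants > 0);
    assert(self >= 0 && self < nparticipants);

    for (int i = 0; i < nparticipants; i++)
        order[i] = (self + i) % nparticipants;
}
```

With three participants, participant 1 would read file 1, then 2,
then 0, so different participants start on different files.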

I'm entirely willing to throw that away and use the unified BufFile
concept, if it can be extended to support multiple readers of the
data, where every participant unifies the set of files.  I have so far
assumed that it would be most efficient for each participant to read
from the file that it wrote before trying to read from files written
by other participants.  I'm reading your patch now; more soon.

> Anyway, some more specific observations:
>
> * ISTM that this is the wrong thing for shared BufFiles:
>
>> +BufFile *
>> +BufFileImport(BufFileDescriptor *descriptor)
>> +{
> ...
>> +   file->isInterXact = true; /* prevent cleanup by this backend */
>
> There is only one user of isInterXact = true BufFiles at present,
> tuplestore.c. It, in turn, only does so for cases that require
> persistent tuple stores. A quick audit of these tuplestore.c callers
> shows this to be just cursor support code within portalmem.c. Here is
> the relevant tuplestore_begin_heap() rule that that code adheres to,
> unlike the code I've quoted above:
>
>  * interXact: if true, the files used for on-disk storage persist beyond the
>  * end of the current transaction.  NOTE: It's the caller's responsibility to
>  * create such a tuplestore in a memory context and resource owner that will
>  * also survive transaction boundaries, and to ensure the tuplestore is closed
>  * when it's no longer wanted.

Hmm.  Yes, that is an entirely bogus use of isInterXact.  I am
thinking about how to fix that with refcounts.

> I don't think it's right for buffile.c to know anything about file
> paths directly -- I'd say that that's a modularity violation.
> PathNameOpenFile() is called by very few callers at the moment, all of
> them very low level (e.g. md.c), but you're using it within buffile.c
> to open a path to the file that you obtain from shared memory

Hmm.  I'm not seeing the modularity violation.  buffile.c uses
interfaces already exposed by fd.c to do this:  OpenTemporaryFile,
then FilePathName to find the path, then PathNameOpenFile to open from
another process.  I see that your approach instead has client code
provide more metadata so that things can be discovered, which may
well be a much better idea.

> directly. This is buggy because the following code won't be reached in
> workers that call your BufFileImport() function:
>
>     /* Mark it for deletion at close */
>     VfdCache[file].fdstate |= FD_TEMPORARY;
>
>     /* Register it with the current resource owner */
>     if (!interXact)
>     {
>         VfdCache[file].fdstate |= FD_XACT_TEMPORARY;
>
>         ResourceOwnerEnlargeFiles(CurrentResourceOwner);
>         ResourceOwnerRememberFile(CurrentResourceOwner, file);
>         VfdCache[file].resowner = CurrentResourceOwner;
>
>         /* ensure cleanup happens at eoxact */
>         have_xact_temporary_files = true;
>     }

Right, that is a problem.  A refcount mode could fix that; virtual
file descriptors would be closed in every backend using the current
resource owner, and the files would be deleted when the last one turns
out the lights.
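
Roughly what I have in mind, as a standalone toy rather than anything
resembling the eventual fd.c change: SharedFileRef and both functions
are hypothetical names, and the 'deleted' flag merely stands in for
unlinking the file.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared-memory state for one temp file. */
typedef struct SharedFileRef
{
    atomic_int  refcount;   /* backends that currently have it open */
    bool        deleted;    /* stands in for unlinking the file */
} SharedFileRef;

/* A backend opens the file: take a reference. */
static void
shared_file_attach(SharedFileRef *f)
{
    atomic_fetch_add(&f->refcount, 1);
}

/*
 * A backend closes the file (e.g. at resource owner release).  Only the
 * backend that drops the last reference deletes it.  Returns true if
 * this call did the deletion.
 */
static bool
shared_file_detach(SharedFileRef *f)
{
    int         previous = atomic_fetch_sub(&f->refcount, 1);

    assert(previous > 0);
    if (previous == 1)
    {
        f->deleted = true;
        return true;
    }
    return false;
}
```

Every backend closes its own descriptor on detach; whichever one sees
the count drop to zero also removes the file.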

> Certainly, you don't want the "Mark it for deletion at close" bit.
> Deletion should not happen at eoxact for non-owners-but-sharers
> (within FileClose()), but you *do* want CleanupTempFiles() to call
> FileClose() for the virtual file descriptors you've opened in the
> backend, to do some other cleanup. In general, you want to buy into
> resource ownership for workers. As things stand, I think that this
> will leak virtual file descriptors. That's really well hidden because
> there is a similar CleanupTempFiles() call at proc exit, I think.
> (Didn't take the time to make sure that that's what masked problems.
> I'm sure that you want minimal divergence with serial cases,
> resource-ownership-wise, in any case.)
>
> Instead of all this, I suggest copying some of my changes to fd.c, so
> that resource ownership within fd.c differentiates between a vfd that
> is owned by the backend in the conventional sense, including having a
> need to delete at eoxact, as well as a lesser form of ownership where
> deletion should not happen. Maybe you'll end up using my BufFileUnify
> interface [1] within workers (instead of just within the leader, as
> with parallel tuplesort), and have it handle all of that for you.
> Currently, that would mean that there'd be an unused/0 sized "local"
> segment for the unified BufFile, but I was thinking of making that not
> happen unless and until a new segment is actually needed, so even that
> minor wart wouldn't necessarily affect you.

Ok, I'm studying that code now.

>> Some assorted notes on the status:  I need to do some thinking about
>> the file cleanup logic: both explicit deletes at the earliest possible
>> time, and failure/error paths.  Currently the creator of each file is
>> responsible for cleaning it up, but I guess if the creator aborts
>> early the file disappears underneath the others' feet, and then I
>> guess they might raise a confusing error report that races against the
>> root cause error report; I'm looking into that.  Rescans and skew
>> buckets not finished yet.
>
> The rescan code path seems to segfault when the regression tests are
> run. There is a NULL pointer dereference here:
>
>> @@ -985,6 +1855,14 @@ ExecReScanHashJoin(HashJoinState *node)
>>             node->hj_HashTable = NULL;
>>             node->hj_JoinState = HJ_BUILD_HASHTABLE;
>>
>> +           if (HashJoinTableIsShared(node->hj_HashTable))
>> +           {
>> +               /* Coordinate a rewind to the shared hash table creation phase. */
>> +               BarrierWaitSet(&hashNode->shared_table_data->barrier,
>> +                              PHJ_PHASE_BEGINNING,
>> +                              WAIT_EVENT_HASHJOIN_REWINDING3);
>> +           }
>> +
>
> Clearly, HashJoinTableIsShared() should not be called when its
> argument (in this case  node->hj_HashTable) is NULL.
>
> In general, I think you should try to set expectations about what
> happens when the regression tests run up front, because that's usually
> the first thing reviewers do.

Apologies, poor form.  That block can be commented out for now because
rescan support is obviously incomplete, and I didn't mean to post it
that way.  Doing so reveals two remaining test failures: "join" and
"rowsecurity" managed to lose a couple of rows.  Oops.  I will figure
out what I broke and have a fix for that in my next version.
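
For the NULL dereference itself, one possible shape for a fix is to
decide whether the table is shared before the pointer is cleared.
This is only a toy with stand-in types (ToyHashTable, ToyJoinState
and both functions are hypothetical), not the real executor structs,
and not necessarily what the next version will do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the hash table and join state; not the real structs. */
typedef struct ToyHashTable
{
    bool        shared;
} ToyHashTable;

typedef struct ToyJoinState
{
    ToyHashTable *table;
    int         rewinds;    /* counts coordinated rewinds, for illustration */
} ToyJoinState;

/* Like the check in the quoted hunk: must not be given NULL. */
static bool
toy_table_is_shared(const ToyHashTable *table)
{
    assert(table != NULL);
    return table->shared;
}

static void
toy_rescan(ToyJoinState *node)
{
    /* Test shared-ness while the pointer is still valid. */
    bool        was_shared = node->table != NULL &&
        toy_table_is_shared(node->table);

    node->table = NULL;

    if (was_shared)
        node->rewinds++;    /* stands in for the barrier wait */
}
```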

> Various compiler warnings on my system:
>
> /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHash.c:1376:7:
> warning: variable ‘size_before_shrink’ set but not used
> [-Wunused-but-set-variable]
>   Size size_before_shrink = 0;
>        ^

In this case it was only used in dtrace builds; I will make sure any
such code is compiled out of non-dtrace builds.
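
Something along these lines, where TOY_TRACE_ENABLED and
TOY_TRACE_SHRINK are made-up names standing in for whatever the
dtrace build actually defines, and halving the size is just a
placeholder for the real shrink logic:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag and probe macro; the real build uses its own. */
#ifdef TOY_TRACE_ENABLED
#define TOY_TRACE_SHRINK(before, after) ((void) (before), (void) (after))
#endif

static size_t
toy_shrink(size_t size)
{
#ifdef TOY_TRACE_ENABLED
    /* Only exists when the probe that consumes it exists. */
    size_t      size_before_shrink = size;
#endif

    assert(size > 0);
    size /= 2;              /* placeholder for the real shrink */

#ifdef TOY_TRACE_ENABLED
    TOY_TRACE_SHRINK(size_before_shrink, size);
#endif
    return size;
}
```

Built without the flag, the variable is never declared, so there is
nothing for -Wunused-but-set-variable to complain about.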

> /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:
> In function ‘ExecHashJoinCloseBatch’:
> /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:1548:28:
> warning: variable ‘participant’ set but not used
> [-Wunused-but-set-variable]
>   HashJoinParticipantState *participant;
>                             ^
> /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:
> In function ‘ExecHashJoinRewindBatches’:
> /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:1587:23:
> warning: variable ‘batch_reader’ set but not used
> [-Wunused-but-set-variable]
>   HashJoinBatchReader *batch_reader;
>                        ^
>
> Is this change really needed?:
>
>> --- a/src/backend/executor/nodeSeqscan.c
>> +++ b/src/backend/executor/nodeSeqscan.c
>> @@ -31,6 +31,8 @@
>>  #include "executor/nodeSeqscan.h"
>>  #include "utils/rel.h"
>>
>> +#include <unistd.h>
>> +
>>  static void InitScanRelation(SeqScanState *node, EState *estate, int eflags);
>>  static TupleTableSlot *SeqNext(SeqScanState *node);

Right, will clean up.

> That's all I have for now...

Thanks!  I'm away from my computer for a couple of days but will have
a new patch series early next week, and hope to have a better handle
on what's involved in adopting the 'unification' approach here
instead.

-- 
Thomas Munro
http://www.enterprisedb.com

