On Thu, Jan 12, 2017 at 9:07 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Wed, Jan 11, 2017 at 2:56 PM, Peter Geoghegan <p...@heroku.com> wrote:
> > On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro
> > <thomas.mu...@enterprisedb.com> wrote:
> >> Here is a new WIP patch. I have plenty of things to tidy up (see note
> >> at end), but the main ideas are now pretty clear and I'd appreciate
> >> some feedback.
> >
> > I have some review feedback for your V3. I've chosen to start with the
> > buffile.c stuff, since of course it might share something with my
> > parallel tuplesort patch. This isn't comprehensive, but I will have
> > more comprehensive feedback soon.
>
> Thanks!
>
> > I'm not surprised that you've generally chosen to make shared BufFile
> > management as simple as possible, with no special infrastructure other
> > than the ability to hold open other backend temp files concurrently
> > within a worker, and no writing to another worker's temp file, or
> > shared read pointer. As you put it, everything is immutable. I
> > couldn't see much opportunity for adding a lot of infrastructure that
> > wasn't written explicitly as parallel hash join code/infrastructure.
> > My sense is that that was a good decision. I doubted that you'd ever
> > want some advanced, generic shared BufFile thing with multiple read
> > pointers, built-in cache coherency, etc. (Robert seemed to think that
> > you'd go that way, though.)
>
> Right, this is extremely minimalist infrastructure. fd.c is
> unchanged. buffile.c only gains the power to export/import read-only
> views of BufFiles. There is no 'unification' of BufFiles: each hash
> join participant simply reads from the buffile it wrote, and then
> imports and reads from its peers' BufFiles, until all are exhausted;
> so the 'unification' is happening in caller code which knows about the
> set of participants and manages shared read positions. Clearly there
> are some ownership/cleanup issues to straighten out, but I think those
> problems are fixable (probably involving refcounts).
>
> I'm entirely willing to throw that away and use the unified BufFile
> concept, if it can be extended to support multiple readers of the
> data, where every participant unifies the set of files. I have so far
> assumed that it would be most efficient for each participant to read
> from the file that it wrote before trying to read from files written
> by other participants. I'm reading your patch now; more soon.
>
> > Anyway, some more specific observations:
> >
> > * ISTM that this is the wrong thing for shared BufFiles:
> >
> >> +BufFile *
> >> +BufFileImport(BufFileDescriptor *descriptor)
> >> +{
> > ...
> >> +	file->isInterXact = true;	/* prevent cleanup by this backend */
> >
> > There is only one user of isInterXact = true BufFiles at present,
> > tuplestore.c. It, in turn, only does so for cases that require
> > persistent tuple stores. A quick audit of these tuplestore.c callers
> > show this to just be cursor support code within portalmem.c. Here is
> > the relevant tuplestore_begin_heap() rule that that code adheres to,
> > unlike the code I've quoted above:
> >
> > * interXact: if true, the files used for on-disk storage persist beyond the
> > * end of the current transaction. NOTE: It's the caller's responsibility to
> > * create such a tuplestore in a memory context and resource owner that will
> > * also survive transaction boundaries, and to ensure the tuplestore is closed
> > * when it's no longer wanted.
>
> Hmm. Yes, that is an entirely bogus use of isInterXact. I am
> thinking about how to fix that with refcounts.
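To make the export/import arrangement described above a bit more concrete, here is a minimal sketch of how it might hang together at the caller level. Only BufFileImport() and BufFileDescriptor appear in the quoted patch excerpt; BufFileExport(), the shared struct layout, and the consume_tuples() helper are assumptions made purely for illustration.

    #include "postgres.h"
    #include "storage/buffile.h"

    /* Hypothetical helper: read tuples back from a BufFile until exhausted. */
    static void consume_tuples(BufFile *file);

    /*
     * Sketch only: a shared-memory area where each participant publishes a
     * descriptor for the temporary file it wrote.
     */
    typedef struct SharedBatchFiles
    {
        int               nparticipants;
        BufFileDescriptor files[FLEXIBLE_ARRAY_MEMBER];
    } SharedBatchFiles;

    static void
    publish_batch_file(SharedBatchFiles *shared, int my_index, BufFile *file)
    {
        /* Export enough metadata for peers to reopen this file read-only. */
        BufFileExport(file, &shared->files[my_index]);  /* assumed API */
    }

    static void
    drain_all_batch_files(SharedBatchFiles *shared, int my_index, BufFile *my_file)
    {
        int     i;

        /* Read back the file this participant wrote first... */
        consume_tuples(my_file);

        /* ...then import and drain read-only views of the peers' files. */
        for (i = 0; i < shared->nparticipants; i++)
        {
            BufFile    *peer;

            if (i == my_index)
                continue;
            peer = BufFileImport(&shared->files[i]);
            consume_tuples(peer);
            BufFileClose(peer);
        }
    }

The point of the sketch is that the "unification" lives entirely in caller code: buffile.c only has to know how to hand out and reopen read-only views, while the hash join code decides who reads what and in which order.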
>
> > I don't think it's right for buffile.c to know anything about file
> > paths directly -- I'd say that that's a modularity violation.
> > PathNameOpenFile() is called by very few callers at the moment, all of
> > them very low level (e.g. md.c), but you're using it within buffile.c
> > to open a path to the file that you obtain from shared memory
>
> Hmm. I'm not seeing the modularity violation. buffile.c uses
> interfaces already exposed by fd.c to do this: OpenTemporaryFile,
> then FilePathName to find the path, then PathNameOpenFile to open from
> another process. I see that your approach instead has client code
> provide more meta data so that things can be discovered, which may
> well be a much better idea.
>
> > directly. This is buggy because the following code won't be reached in
> > workers that call your BufFileImport() function:
> >
> > /* Mark it for deletion at close */
> > VfdCache[file].fdstate |= FD_TEMPORARY;
> >
> > /* Register it with the current resource owner */
> > if (!interXact)
> > {
> >     VfdCache[file].fdstate |= FD_XACT_TEMPORARY;
> >
> >     ResourceOwnerEnlargeFiles(CurrentResourceOwner);
> >     ResourceOwnerRememberFile(CurrentResourceOwner, file);
> >     VfdCache[file].resowner = CurrentResourceOwner;
> >
> >     /* ensure cleanup happens at eoxact */
> >     have_xact_temporary_files = true;
> > }
>
> Right, that is a problem. A refcount mode could fix that; virtual
> file descriptors would be closed in every backend using the current
> resource owner, and the files would be deleted when the last one turns
> out the lights.
>
> > Certainly, you don't want the "Mark it for deletion at close" bit.
> > Deletion should not happen at eoxact for non-owners-but-sharers
> > (within FileClose()), but you *do* want CleanupTempFiles() to call
> > FileClose() for the virtual file descriptors you've opened in the
> > backend, to do some other cleanup. In general, you want to buy into
> > resource ownership for workers. As things stand, I think that this
> > will leak virtual file descriptors. That's really well hidden because
> > there is a similar CleanupTempFiles() call at proc exit, I think.
> > (Didn't take the time to make sure that that's what masked problems.
> > I'm sure that you want minimal divergence with serial cases,
> > resource-ownership-wise, in any case.)
> >
> > Instead of all this, I suggest copying some of my changes to fd.c, so
> > that resource ownership within fd.c differentiates between a vfd that
> > is owned by the backend in the conventional sense, including having a
> > need to delete at eoxact, as well as a lesser form of ownership where
> > deletion should not happen. Maybe you'll end up using my BufFileUnify
> > interface [1] within workers (instead of just within the leader, as
> > with parallel tuplesort), and have it handle all of that for you.
> > Currently, that would mean that there'd be an unused/0 sized "local"
> > segment for the unified BufFile, but I was thinking of making that not
> > happen unless and until a new segment is actually needed, so even that
> > minor wart wouldn't necessarily affect you.
>
> Ok, I'm studying that code now.
>
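For what it's worth, the "last one turns out the lights" refcount idea could be sketched roughly as below. Everything here, including the struct and function names, is an assumption for illustration rather than anything from the patch; a real version would presumably hook into resource owners and/or an on_dsm_detach callback so the count is decremented even on error paths.

    #include "postgres.h"
    #include "storage/spin.h"

    /* Hypothetical helpers: close this backend's vfds / unlink the files. */
    static void close_my_vfds(void);
    static void unlink_shared_temp_files(void);

    /*
     * Sketch only: shared refcount for a set of temp files, kept in the DSM
     * segment and initialized with SpinLockInit() by the creating backend.
     */
    typedef struct SharedFileSetControl
    {
        slock_t     mutex;
        int         refcnt;     /* number of attached backends */
    } SharedFileSetControl;

    static void
    shared_fileset_attach(SharedFileSetControl *ctl)
    {
        SpinLockAcquire(&ctl->mutex);
        ctl->refcnt++;
        SpinLockRelease(&ctl->mutex);
    }

    static void
    shared_fileset_detach(SharedFileSetControl *ctl)
    {
        bool        last;

        SpinLockAcquire(&ctl->mutex);
        last = (--ctl->refcnt == 0);
        SpinLockRelease(&ctl->mutex);

        /* Every backend closes its own virtual file descriptors... */
        close_my_vfds();

        /* ...but only the last backend to detach deletes the files on disk. */
        if (last)
            unlink_shared_temp_files();
    }

That would let workers keep the usual resource-owner cleanup of vfds (no leak at eoxact), while physical deletion is deferred to whichever participant detaches last.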
> >> Some assorted notes on the status: I need to do some thinking about
> >> the file cleanup logic: both explicit deletes at the earliest possible
> >> time, and failure/error paths. Currently the creator of each file is
> >> responsible for cleaning it up, but I guess if the creator aborts
> >> early the file disappears underneath the others' feet, and then I
> >> guess they might raise a confusing error report that races against the
> >> root cause error report; I'm looking into that. Rescans and skew
> >> buckets not finished yet.
> >
> > The rescan code path seems to segfault when the regression tests are
> > run. There is a NULL pointer dereference here:
> >
> >> @@ -985,6 +1855,14 @@ ExecReScanHashJoin(HashJoinState *node)
> >>          node->hj_HashTable = NULL;
> >>          node->hj_JoinState = HJ_BUILD_HASHTABLE;
> >>
> >> +        if (HashJoinTableIsShared(node->hj_HashTable))
> >> +        {
> >> +            /* Coordinate a rewind to the shared hash table creation phase. */
> >> +            BarrierWaitSet(&hashNode->shared_table_data->barrier,
> >> +                           PHJ_PHASE_BEGINNING,
> >> +                           WAIT_EVENT_HASHJOIN_REWINDING3);
> >> +        }
> >> +
> >
> > Clearly, HashJoinTableIsShared() should not be called when its
> > argument (in this case node->hj_HashTable) is NULL.
> >
> > In general, I think you should try to set expectations about what
> > happens when the regression tests run up front, because that's usually
> > the first thing reviewers do.
>
> Apologies, poor form. That block can be commented out for now because
> rescan support is obviously incomplete, and I didn't mean to post it
> that way. Doing so reveals two remaining test failures: "join" and
> "rowsecurity" managed to lose a couple of rows. Oops. I will figure
> out what I broke and have a fix for that in my next version.
>
> > Various compiler warnings on my system:
> >
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHash.c:1376:7:
> > warning: variable ‘size_before_shrink’ set but not used
> > [-Wunused-but-set-variable]
> >   Size size_before_shrink = 0;
> >        ^
>
> In this case it was only used in dtrace builds; I will make sure any
> such code is compiled out when in non-dtrace builds.
>
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:
> > In function ‘ExecHashJoinCloseBatch’:
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:1548:28:
> > warning: variable ‘participant’ set but not used
> > [-Wunused-but-set-variable]
> >   HashJoinParticipantState *participant;
> >                             ^
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:
> > In function ‘ExecHashJoinRewindBatches’:
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:1587:23:
> > warning: variable ‘batch_reader’ set but not used
> > [-Wunused-but-set-variable]
> >   HashJoinBatchReader *batch_reader;
> >                        ^
> >
> > Is this change really needed?:
> >
> >> --- a/src/backend/executor/nodeSeqscan.c
> >> +++ b/src/backend/executor/nodeSeqscan.c
> >> @@ -31,6 +31,8 @@
> >>  #include "executor/nodeSeqscan.h"
> >>  #include "utils/rel.h"
> >>
> >> +#include <unistd.h>
> >> +
> >>  static void InitScanRelation(SeqScanState *node, EState *estate, int eflags);
> >>  static TupleTableSlot *SeqNext(SeqScanState *node);
>
> Right, will clean up.
>
> > That's all I have for now...
>
> Thanks! I'm away from my computer for a couple of days but will have
> a new patch series early next week, and hope to have a better handle
> on what's involved in adopting the 'unification' approach here
> instead.
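Coming back to the rescan crash discussed above: since node->hj_HashTable has just been set to NULL two lines earlier in the quoted hunk, a NULL test in front of HashJoinTableIsShared() avoids the segfault and, in effect, disables the block, which matches the suggestion to leave it out until rescan support is finished. A sketch of that stop-gap guard, using only names from the quoted hunk:

    /*
     * In ExecReScanHashJoin(): only coordinate a rewind if a shared hash
     * table still exists; node->hj_HashTable may be (and here is) NULL.
     */
    if (node->hj_HashTable != NULL &&
        HashJoinTableIsShared(node->hj_HashTable))
    {
        /* Coordinate a rewind to the shared hash table creation phase. */
        BarrierWaitSet(&hashNode->shared_table_data->barrier,
                       PHJ_PHASE_BEGINNING,
                       WAIT_EVENT_HASHJOIN_REWINDING3);
    }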
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Hi Thomas,

I was trying to analyse the performance of TPC-H queries with your patch
and came across the following results: Q9 and Q21 were crashing, and both
of them had the following backtrace in the core dump (I thought it might
be helpful):

#0  0x0000000010757da4 in pfree (pointer=0x3fff78d11000) at mcxt.c:1012
#1  0x000000001032c574 in ExecHashIncreaseNumBatches (hashtable=0x1003af6da60) at nodeHash.c:1124
#2  0x000000001032d518 in ExecHashTableInsert (hashtable=0x1003af6da60, slot=0x1003af695c0,
    hashvalue=2904801109, preload=1 '\001') at nodeHash.c:1700
#3  0x0000000010330fd4 in ExecHashJoinPreloadNextBatch (hjstate=0x1003af39118) at nodeHashjoin.c:886
#4  0x00000000103301fc in ExecHashJoin (node=0x1003af39118) at nodeHashjoin.c:376
#5  0x0000000010308644 in ExecProcNode (node=0x1003af39118) at execProcnode.c:490
#6  0x000000001031f530 in fetch_input_tuple (aggstate=0x1003af38910) at nodeAgg.c:587
#7  0x0000000010322b50 in agg_fill_hash_table (aggstate=0x1003af38910) at nodeAgg.c:2304
#8  0x000000001032239c in ExecAgg (node=0x1003af38910) at nodeAgg.c:1942
#9  0x0000000010308694 in ExecProcNode (node=0x1003af38910) at execProcnode.c:509
#10 0x0000000010302a1c in ExecutePlan (estate=0x1003af37fa0, planstate=0x1003af38910,
    use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0,
    direction=ForwardScanDirection, dest=0x1003af19390) at execMain.c:1587

In case you want to know, I was using TPC-H at scale factor 20. Please let
me know if you want any more information on this.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/