On Sun, Mar 19, 2017 at 9:03 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>> I attach my V9 of the patch. I came up with some stuff for the design
>> of resource management that I think meets every design goal that we
>> have for shared/unified BufFiles:
>
> Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
> now pass -1 as the index block argument to compute_parallel_worker(),
> just as all callers that aren't parallel index scan do after that
> commit. This issue caused V9 to never choose parallel CREATE INDEX
> within nbtsort.c. There was also a small amount of bitrot.
>
> Attached V10 fixes this regression. I also couldn't resist adding a
> few new assertions that I thought were worth having to buffile.c, plus
> dedicated wait events for parallel tuplesort. And, I fixed a silly bug
> added in V9 around where worker_wait() should occur.
Some initial review comments:

- * This code is moderately slow (~10% slower) compared to the regular
- * btree (insertion) build code on sorted or well-clustered data.  On
- * random data, however, the insertion build code is unusable -- the
- * difference on a 60MB heap is a factor of 15 because the random
- * probes into the btree thrash the buffer pool.  (NOTE: the above
- * "10%" estimate is probably obsolete, since it refers to an old and
- * not very good external sort implementation that used to exist in
- * this module.  tuplesort.c is almost certainly faster.)

While I agree that the old comment is probably inaccurate, I don't
think dropping it without comment in a patch to implement parallel
sorting is the way to go.  How about updating it to be more current as
a separate patch?

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BTREE_SHARED      UINT64CONST(0xA000000000000001)
+#define PARALLEL_KEY_TUPLESORT         UINT64CONST(0xA000000000000002)
+#define PARALLEL_KEY_TUPLESORT_SPOOL2  UINT64CONST(0xA000000000000003)

1, 2, and 3 would probably work just as well.  The parallel
infrastructure uses high-numbered values to avoid conflict with
plan_node_id values, but this is a utility statement so there's no
such problem.  But it doesn't matter very much.

+ * Note: caller had better already hold some type of lock on the table and
+ * index.
+ */
+int
+plan_create_index_workers(Oid tableOid, Oid indexOid)

Caller should pass down the Relation rather than the Oid.  That is
better both because it avoids unnecessary work and because it more or
less automatically avoids the problem mentioned in the note.

Why is this being put in planner.c rather than something specific to
creating indexes?  Not sure that's a good idea.

+ * This should be called when workers have flushed out temp file buffers and
+ * yielded control to caller's process.
+ * Workers should hold open their
+ * BufFiles at least until the caller's process is able to call here and
+ * assume ownership of BufFile.  The general pattern is that workers make
+ * available data from their temp files to one nominated process; there is
+ * no support for workers that want to read back data from their original
+ * BufFiles following writes performed by the caller, or any other
+ * synchronization beyond what is implied by caller contract.  All
+ * communication occurs in one direction.  All output is made available to
+ * caller's process exactly once by workers, following call made here at the
+ * tail end of processing.

Thomas has designed a system for sharing files among cooperating
processes that lacks several of these restrictions.  With his system,
it's still necessary for all data to be written and flushed by the
writer before anybody tries to read it.  But the restriction that the
worker has to hold its BufFile open until the leader can assume
ownership goes away.  That's a good thing; it avoids the need for
workers to sit around waiting for the leader to assume ownership of a
resource, instead of going away faster and freeing up worker slots
for some other query, or moving on to some other computation.  The
restriction that the worker can't reread the data after handing off
the file also goes away.  The files can be read and written by any
participant in any order, as many times as you like, with only the
restriction that the caller must guarantee that data will be written
and flushed from private buffers before it can be read.  I don't see
any reason to commit both his system and your system, and his is more
general, so I think you should use it.  That would cut hundreds of
lines from this patch with no real disadvantage that I can see --
including things like worker_wait(), which are only needed because of
the shortcomings of the underlying mechanism.

+ * run.  Parallel workers always use quicksort, however.

Comment fails to mention a reason.
+			elog(LOG, "%d using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
+				 state->worker, state->availMem / 1024, numInputTapes);

I think "worker %d" or "participant %d" would be a lot better than
just starting the message with "%d".  (There are multiple instances of
this, with various messages.)

I think some of the smaller changes that this patch makes, like
extending the parallel context machinery to support SnapshotAny, could
be usefully broken out as separately-committable patches.

I haven't really dug down into the details here, but with the
exception of the buffile.c stuff which I don't like, the overall
design of this seems pretty sensible to me.  We might eventually want
to do something more clever at the sorting level, but those changes
would be confined to tuplesort.c, and all the other changes you've
introduced here would stand on their own.  Which is to say that even
if there's more win to be had here, this is a good start.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers