On Thu, Jan 11, 2018 at 8:58 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> I've caught up with you again. I just need to take a look at what I
> came up with with fresh eyes, and maybe do some more testing.
More comments:

BufFileView() looks fairly pointless. It basically creates a copy of the input and, in so doing, destroys the input, which is a lot like returning the input parameter except that it uses more cycles. It does do a few things. First, it zeroes the offsets array instead of copying the offsets. But as used, those offsets would have been 0 anyway. Second, it sets the fileset parameter to NULL. But that doesn't actually seem to be important for anything: the fileset is only used when creating new files, and the BufFile must already be marked read-only, so we won't be doing that. It seems like this function can just be entirely removed and replaced by Assert()-ing some things about the target in BufFileViewAppend, which I would just rename to BufFileAppend.

In miscadmin.h, I'd put the prototype for the new GUC next to max_worker_processes, not maintenance_work_mem.

The ereport() in index_build will, I think, confuse people when it says that there are 0 parallel workers. I suggest splitting this into two cases:

    if (indexInfo->ii_ParallelWorkers == 0)
        ereport(... "building index \"%s\" on table \"%s\" serially" ...);
    else
        ereport(... "building index \"%s\" on table \"%s\" in parallel with request for %d parallel workers" ...);

Might even need three cases to handle parallel_leader_participation without needing to assemble the message, unless we drop parallel_leader_participation support.

The logic in IndexBuildHeapRangeScan() around need_register_snapshot and OldestXmin seems convoluted and not very well-edited to me. For example, need_register_snapshot is set to false in a block that is only entered when it's already false, and the comment that follows is supposed to be associated with GetOldestXmin() and makes no sense here. I suggest that you go back to the original code organization and then just insert an additional case for a caller-supplied scan, so that the overall flow looks like this:

    if (scan != NULL)
    {
        ...
    }
    else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
    {
        ...
    }
    else
    {
        ...
    }

Along with that, I'd change the name of need_register_snapshot to need_unregister_snapshot (it's doing both jobs right now) and initialize it to false. If you enter the second of the above blocks then change it to true just after snapshot = RegisterSnapshot(GetTransactionSnapshot()). Then adjust the comment that begins "Prepare for scan of the base relation." by inserting an additional sentence just after that one: "If the caller has supplied a scan, just use it. Otherwise, in a normal index build..." and the rest as it is currently.

+ * This support code isn't reliable when called from within a parallel
+ * worker process due to the fact that our state isn't propagated. This is
+ * why parallel index builds are disallowed on catalogs. It is possible
+ * that we'll fail to catch an attempted use of a user index undergoing
+ * reindexing due the non-propagation of this state to workers, which is not
+ * ideal, but the problem is not particularly likely to go undetected due to
+ * our not doing better there.

I understand the first two sentences, but I have no idea what the third one means, especially the part that says "not particularly likely to go undetected due to our not doing better there". It sounds scary that something bad is only "not particularly likely to go undetected"; don't we need to detect bad things reliably? But also, you used the word "not" three times and also the prefix "un-", meaning "not", once. Four negations in 13 words! Perhaps I'm not entirely in a position to cast aspersions on overly-complex phraseology -- the pot calling the kettle black and all that -- but I bet that will be a lot clearer if you reduce the number of negations to either 0 or 1.

The comment change in standard_planner() doesn't look helpful to me; I'd leave it out.

+ * tableOid is the table that index is to be built on.  indexOid is the OID
+ * of a index to be created or reindexed (which must be a btree index).

I'd rewrite that first sentence to end "the table on which the index is to be built". The second sentence should say "an index" rather than "a index".

+ * leaderWorker indicates whether leader will participate as worker or not.
+ * This needs to be taken into account because leader process is guaranteed to
+ * be idle when not participating as a worker, in contrast with conventional
+ * parallel relation scans, where the leader process typically has plenty of
+ * other work to do relating to coordinating the scan, etc. For CREATE INDEX,
+ * leader is usually treated as just another participant for our scaling
+ * calculation.

OK, I get the first sentence. But the rest of this appears to be partially irrelevant and partially incorrect. The degree to which the leader is likely to be otherwise occupied isn't very relevant; as long as we think it's going to do anything at all, we have to account for it somehow. Also, the issue isn't that in a query the leader would be busy "coordinating the scan, etc." but rather that it would have to read the tuples produced by the Gather (Merge) node. I think you could just delete everything from "This needs to be..." through the end. You can cover the details of how it's used closer to the point where you do anything with leaderWorker (or, as I assume it will soon be, leaderParticipates).

But, actually, I think we would be better off just ripping leaderWorker/leaderParticipates out of this function altogether. compute_parallel_worker() is not really under any illusion that it's computing a number of *participants*; it's just computing a number of *workers*.
Deducting 1 when the leader is also participating, but only when at least 2 workers were computed, leads to an oddity: for a regular parallel sequential scan, the number of workers increases by 1 when the table size increases by a factor of 3, but here, the number of workers increases from 1 to 2 when the table size increases by a factor of 9, and then by 1 for every further multiple of 3. There doesn't seem to be any theoretical or practical justification for such behavior, or for being inconsistent with what parallel sequential scan does.

I think it's fine for parallel_leader_participation=off to simply mean that you get one fewer participant. That's actually what would happen with parallel query, too. Parallel query would consider parallel_leader_participation later, in get_parallel_divisor(), when working out the cost of one path vs. another, but it doesn't use it to choose the number of workers. So it seems to me that getting rid of all of the workerLeader considerations will make it both simpler and more consistent with what we do for queries.

To be clear, I don't think there's any real need for the cost model we choose for CREATE INDEX to be the same as the one we use for regular scans. The problem with regular scans is that it's very hard to predict how many workers we can usefully use; it depends not only on the table size but on what plan nodes get stacked on top of it higher in the plan tree. In a perfect world we'd like to add as many workers as required to avoid having the query be I/O bound and then stop, but that would require both the ability to predict future system utilization and a heck of a lot more knowledge than the planner can hope to have at this point. If you have an idea how to make a better cost model than this for CREATE INDEX, I'm willing to consider other options. If you don't, or want to propose that as a follow-up patch, then I think it's OK to use what you've got here for starters.
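The scaling oddity described above is easy to see numerically. Here's a standalone Python sketch, not the patch's C code: compute_workers() only mimics compute_parallel_worker()'s add-a-worker-per-tripling behavior under simplified assumptions (a threshold of 1 size unit and an arbitrary cap of 8), and patch_workers() is a hypothetical name for the deduct-1 behavior under review:

```python
def compute_workers(table_size, threshold=1, max_workers=8):
    # One worker once the table reaches the threshold, plus one more
    # each time the table triples in size, as parallel seq scan does.
    if table_size < threshold:
        return 0
    workers, limit = 1, threshold * 3
    while table_size >= limit and workers < max_workers:
        workers += 1
        limit *= 3
    return workers

def patch_workers(table_size, leader_participates=True):
    # The behavior under review: deduct 1 for a participating leader,
    # but only when at least 2 workers were computed.
    w = compute_workers(table_size)
    if leader_participates and w >= 2:
        w -= 1
    return w

sizes = [1, 3, 9, 27, 81]
print([compute_workers(s) for s in sizes])  # [1, 2, 3, 4, 5]
print([patch_workers(s) for s in sizes])    # [1, 1, 2, 3, 4]
```

That is, the first extra worker only arrives after a 9x size increase, and each subsequent one after a further tripling.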
I just don't want it to be more baroque than necessary.

I think that the naming of the wait events could be improved. Right now, they are named by which kind of process does the waiting, but they really should be named based on the thing for which we're waiting. I also suggest that we could just write Sort instead of Tuplesort. In short, I suggest ParallelTuplesortLeader -> ParallelSortWorkersDone and ParallelTuplesortWorker -> ParallelSortTapeHandover.

Not for this patch, but I wonder if it might be a worthwhile future optimization to allow workers to return multiple tapes to the leader. One doesn't want to go crazy with this, of course. If the worker returns 100 tapes, then the leader might get stuck doing multiple merge passes, which would be a foolish way to divide up the labor, and even if that doesn't happen, Amdahl's law argues for minimizing the amount of work that is not done in parallel. Still, what if a worker (perhaps after merging) ends up with 2 or 3 tapes? Is it really worth merging them so that the leader can do a 5-way merge instead of a 15-way merge? Maybe this case is rare in practice, because multiple merge passes will be uncommon with reasonable values of work_mem, and it might be silly to go to the trouble of firing up workers if they'll only generate a few runs in total. Just a thought.

+ * Make sure that the temp file(s) underlying the tape set are created in
+ * suitable temp tablespaces. This is only really needed for serial
+ * sorts.

This comment makes me wonder whether it is "sorta" needed for parallel sorts.

- if (trace_sort)
+ if (trace_sort && !WORKER(state))

I have a feeling we still want to get this output even from workers, but maybe I'm missing something.

+ arg5 indicates serial, parallel worker, or parallel leader sort.</entry>

I think it should say what values are used for each case.
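On the multiple-tapes-per-worker idea a few paragraphs up: the fan-in trade-off can be modeled with a toy merge. This is a standalone Python sketch using heapq; it has nothing to do with logtape.c's real on-disk tapes, and the run counts are just the 5-way-vs-15-way numbers from the example above. Five workers each finish with three sorted runs; the leader can either take all fifteen runs and do one 15-way merge, or each worker can pre-merge its own runs (an extra pass over that worker's data, done in parallel) so that the leader only does a 5-way merge:

```python
import heapq
import random

random.seed(2018)

# Five workers, each ending a parallel sort with three sorted runs.
workers = [[sorted(random.sample(range(10000), 20)) for _ in range(3)]
           for _ in range(5)]

# Option A: hand over every run; the leader does one 15-way merge.
all_runs = [run for w in workers for run in w]
fifteen_way = list(heapq.merge(*all_runs))

# Option B: each worker merges its own runs first, leaving the
# leader a cheaper 5-way merge.
premerged = [list(heapq.merge(*w)) for w in workers]
five_way = list(heapq.merge(*premerged))

assert five_way == fifteen_way  # identical output either way
```

The output is the same either way; the question is purely where the merge work happens, and whether the extra worker-side pass is worth buying the leader a smaller fan-in.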
+ /* Release worker tuplesorts within leader process as soon as possible */

IIUC, the worker tuplesorts aren't really holding onto much of anything in terms of resources. I think it might be better to phrase this as /* The sort we just did absorbed the final tapes produced by these tuplesorts, which are of no further use. */ or words to that effect.

Instead of making a special case in CreateParallelContext for serializable_okay, maybe index_build should just use SetConfigOption() to force the isolation level to READ COMMITTED right after it does NewGUCNestLevel(). The change would only be temporary because the subsequent call to AtEOXact_GUC() will revert it. The point isn't really that CREATE INDEX is somehow exempt from the problem that SIREAD locks haven't been updated to work correctly with parallelism; it's that CREATE INDEX itself is defined to ignore serializability concerns.

There is *still* more to review here, but my concentration is fading. If you could post an updated patch after adjusting for the comments above, I think that would be helpful. I'm not totally out of things to review that I haven't already looked over once, but I think I'm close.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company