On Thu, Jan 11, 2018 at 8:58 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> I've caught up with you again. I just need to take a look at what I
> came up with with fresh eyes, and maybe do some more testing.
More comments:

BufFileView() looks fairly pointless. It basically creates a copy of the input and, in so doing, destroys the input, which is a lot like returning the input parameter except that it uses more cycles. It does do a few things. First, it zeroes the offsets array instead of copying the offsets. But as used, those offsets would have been 0 anyway. Second, it sets the fileset parameter to NULL. But that doesn't actually seem to be important for anything: the fileset is only used when creating new files, and the BufFile must already be marked read-only, so we won't be doing that. It seems like this function can just be entirely removed and replaced by Assert()-ing some things about the target in BufFileViewAppend, which I would just rename to BufFileAppend.

In miscadmin.h, I'd put the prototype for the new GUC next to max_worker_processes, not maintenance_work_mem.

The ereport() in index_build will, I think, confuse people when it says that there are 0 parallel workers. I suggest splitting this into two cases:

    if (indexInfo->ii_ParallelWorkers == 0)
        ereport(... "building index \"%s\" on table \"%s\" serially" ...);
    else
        ereport(... "building index \"%s\" on table \"%s\" in parallel with request for %d parallel workers" ...);

Might even need three cases to handle parallel_leader_participation without needing to assemble the message, unless we drop parallel_leader_participation support.

The logic in IndexBuildHeapRangeScan() around need_register_snapshot and OldestXmin seems convoluted and not very well-edited to me. For example, need_register_snapshot is set to false in a block that is only entered when it's already false, and the comment that follows is supposed to be associated with GetOldestXmin() and makes no sense here. I suggest that you go back to the original code organization and then just insert an additional case for a caller-supplied scan, so that the overall flow looks like this:

    if (scan != NULL)
    {
        ...
    }
    else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
    {
        ...
    }
    else
    {
        ...
    }

Along with that, I'd change the name of need_register_snapshot to need_unregister_snapshot (it's doing both jobs right now) and initialize it to false. If you enter the second of the above blocks then change it to true just after snapshot = RegisterSnapshot(GetTransactionSnapshot()). Then adjust the comment that begins "Prepare for scan of the base relation." by inserting an additional sentence just after that one: "If the caller has supplied a scan, just use it. Otherwise, in a normal index build..." and the rest as it is currently.

+ * This support code isn't reliable when called from within a parallel
+ * worker process due to the fact that our state isn't propagated. This is
+ * why parallel index builds are disallowed on catalogs. It is possible
+ * that we'll fail to catch an attempted use of a user index undergoing
+ * reindexing due the non-propagation of this state to workers, which is not
+ * ideal, but the problem is not particularly likely to go undetected due to
+ * our not doing better there.

I understand the first two sentences, but I have no idea what the third one means, especially the part that says "not particularly likely to go undetected due to our not doing better there". It sounds scary that something bad is only "not particularly likely to go undetected"; don't we need to detect bad things reliably? But also, you used the word "not" three times and also the prefix "un-", meaning "not", once. Four negations in 13 words! Perhaps I'm not entirely in a position to cast aspersions on overly-complex phraseology -- the pot calling the kettle black and all that -- but I bet that will be a lot clearer if you reduce the number of negations to either 0 or 1.

The comment change in standard_planner() doesn't look helpful to me; I'd leave it out.

+ * tableOid is the table that index is to be built on.  indexOid is the OID
+ * of a index to be created or reindexed (which must be a btree index).

I'd rewrite that first sentence to end "the table on which the index is to be built". The second sentence should say "an index" rather than "a index".

+ * leaderWorker indicates whether leader will participate as worker or not.
+ * This needs to be taken into account because leader process is guaranteed to
+ * be idle when not participating as a worker, in contrast with conventional
+ * parallel relation scans, where the leader process typically has plenty of
+ * other work to do relating to coordinating the scan, etc. For CREATE INDEX,
+ * leader is usually treated as just another participant for our scaling
+ * calculation.

OK, I get the first sentence. But the rest of this appears to be partially irrelevant and partially incorrect. The degree to which the leader is likely to be otherwise occupied isn't very relevant; as long as we think it's going to do anything at all, we have to account for it somehow. Also, the issue isn't that in a query the leader would be busy "coordinating the scan, etc." but rather that it would have to read the tuples produced by the Gather (Merge) node. I think you could just delete everything from "This needs to be..." through the end. You can cover the details of how it's used closer to the point where you do anything with leaderWorker (or, as I assume it will soon be, leaderParticipates).

But, actually, I think we would be better off just ripping leaderWorker/leaderParticipates out of this function altogether. compute_parallel_worker() is not really under any illusion that it's computing a number of *participants*; it's just computing a number of *workers*.
Deducting 1 when the leader is also participating, but only when at least 2 workers were computed, leads to an oddity: for a regular parallel sequential scan, the number of workers increases by 1 when the table size increases by a factor of 3, but here, the number of workers increases from 1 to 2 when the table size increases by a factor of 9, and then by 1 for every further multiple of 3. There doesn't seem to be any theoretical or practical justification for such behavior, or for being inconsistent with what parallel sequential scan does.

I think it's fine for parallel_leader_participation=off to simply mean that you get one fewer participant. That's actually what would happen with parallel query, too. Parallel query would consider parallel_leader_participation later, in get_parallel_divisor(), when working out the cost of one path vs. another, but it doesn't use it to choose the number of workers. So it seems to me that getting rid of all of the workerLeader considerations will make it both simpler and more consistent with what we do for queries.

To be clear, I don't think there's any real need for the cost model we choose for CREATE INDEX to be the same as the one we use for regular scans. The problem with regular scans is that it's very hard to predict how many workers we can usefully use; it depends not only on the table size but on what plan nodes get stacked on top of it higher in the plan tree. In a perfect world we'd like to add as many workers as required to avoid having the query be I/O bound and then stop, but that would require both the ability to predict future system utilization and a heck of a lot more knowledge than the planner can hope to have at this point. If you have an idea how to make a better cost model than this for CREATE INDEX, I'm willing to consider other options. If you don't, or want to propose that as a follow-up patch, then I think it's OK to use what you've got here for starters.
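The scaling oddity described above is easy to see numerically. Here's a standalone Python sketch, not the patch's C code: compute_workers() only mimics compute_parallel_worker()'s add-a-worker-per-tripling behavior under simplified assumptions (a threshold of 1 size unit and an arbitrary cap of 8), and patch_workers() is a hypothetical name for the deduct-1 behavior under review:

```python
def compute_workers(table_size, threshold=1, max_workers=8):
    # One worker once the table reaches the threshold, plus one more
    # each time the table triples in size, as parallel seq scan does.
    if table_size < threshold:
        return 0
    workers, limit = 1, threshold * 3
    while table_size >= limit and workers < max_workers:
        workers += 1
        limit *= 3
    return workers

def patch_workers(table_size, leader_participates=True):
    # The behavior under review: deduct 1 for a participating leader,
    # but only when at least 2 workers were computed.
    w = compute_workers(table_size)
    if leader_participates and w >= 2:
        w -= 1
    return w

sizes = [1, 3, 9, 27, 81]
print([compute_workers(s) for s in sizes])  # [1, 2, 3, 4, 5]
print([patch_workers(s) for s in sizes])    # [1, 1, 2, 3, 4]
```

That is, the first extra worker only arrives after a 9x size increase, and each subsequent one after a further tripling.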
I just don't want it to be more baroque than necessary.

I think that the naming of the wait events could be improved. Right now, they are named by which kind of process does the waiting, but they really should be named based on the thing for which we're waiting. I also suggest that we could just write Sort instead of Tuplesort. In short, I suggest ParallelTuplesortLeader -> ParallelSortWorkersDone and ParallelTuplesortWorker -> ParallelSortTapeHandover.

Not for this patch, but I wonder if it might be a worthwhile future optimization to allow workers to return multiple tapes to the leader. One doesn't want to go crazy with this, of course. If the worker returns 100 tapes, then the leader might get stuck doing multiple merge passes, which would be a foolish way to divide up the labor, and even if that doesn't happen, Amdahl's law argues for minimizing the amount of work that is not done in parallel. Still, what if a worker (perhaps after merging) ends up with 2 or 3 tapes? Is it really worth merging them so that the leader can do a 5-way merge instead of a 15-way merge? Maybe this case is rare in practice, because multiple merge passes will be uncommon with reasonable values of work_mem, and it might be silly to go to the trouble of firing up workers if they'll only generate a few runs in total. Just a thought.

+ * Make sure that the temp file(s) underlying the tape set are created in
+ * suitable temp tablespaces. This is only really needed for serial
+ * sorts.

This comment makes me wonder whether it is "sorta" needed for parallel sorts.

- if (trace_sort)
+ if (trace_sort && !WORKER(state))

I have a feeling we still want to get this output even from workers, but maybe I'm missing something.

+ arg5 indicates serial, parallel worker, or parallel leader sort.</entry>

I think it should say what values are used for each case.
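On the multiple-tapes-per-worker idea a few paragraphs up: the fan-in trade-off can be modeled with a toy merge. This is a standalone Python sketch using heapq; it has nothing to do with logtape.c's real on-disk tapes, and the run counts are just the 5-way-vs-15-way numbers from the example above. Five workers each finish with three sorted runs; the leader can either take all fifteen runs and do one 15-way merge, or each worker can pre-merge its own runs (an extra pass over that worker's data, done in parallel) so that the leader only does a 5-way merge:

```python
import heapq
import random

random.seed(2018)

# Five workers, each ending a parallel sort with three sorted runs.
workers = [[sorted(random.sample(range(10000), 20)) for _ in range(3)]
           for _ in range(5)]

# Option A: hand over every run; the leader does one 15-way merge.
all_runs = [run for w in workers for run in w]
fifteen_way = list(heapq.merge(*all_runs))

# Option B: each worker merges its own runs first, leaving the
# leader a cheaper 5-way merge.
premerged = [list(heapq.merge(*w)) for w in workers]
five_way = list(heapq.merge(*premerged))

assert five_way == fifteen_way  # identical output either way
```

The output is the same either way; the question is purely where the merge work happens, and whether the extra worker-side pass is worth buying the leader a smaller fan-in.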
+ /* Release worker tuplesorts within leader process as soon as possible */

IIUC, the worker tuplesorts aren't really holding onto much of anything in terms of resources. I think it might be better to phrase this as /* The sort we just did absorbed the final tapes produced by these tuplesorts, which are of no further use. */ or words to that effect.

Instead of making a special case in CreateParallelContext for serializable_okay, maybe index_build should just use SetConfigOption() to force the isolation level to READ COMMITTED right after it does NewGUCNestLevel(). The change would only be temporary because the subsequent call to AtEOXact_GUC() will revert it. The point isn't really that CREATE INDEX is somehow exempt from the problem that SIREAD locks haven't been updated to work correctly with parallelism; it's that CREATE INDEX itself is defined to ignore serializability concerns.

There is *still* more to review here, but my concentration is fading. If you could post an updated patch after adjusting for the comments above, I think that would be helpful. I'm not totally out of things to review that I haven't already looked over once, but I think I'm close.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company