On Tue, Mar 21, 2017 at 9:10 AM, Robert Haas <robertmh...@gmail.com> wrote:
> - * This code is moderately slow (~10% slower) compared to the regular
> - * btree (insertion) build code on sorted or well-clustered data.  On
> - * random data, however, the insertion build code is unusable -- the
> - * difference on a 60MB heap is a factor of 15 because the random
> - * probes into the btree thrash the buffer pool.  (NOTE: the above
> - * "10%" estimate is probably obsolete, since it refers to an old and
> - * not very good external sort implementation that used to exist in
> - * this module.  tuplesort.c is almost certainly faster.)
>
> While I agree that the old comment is probably inaccurate, I don't
> think dropping it without comment in a patch to implement parallel
> sorting is the way to go. How about updating it to be more current as
> a separate patch?

I think that since the comment refers to code from before 1999, it can
go. Any separate patch to remove it would have an entirely negative
linediff.

> +/* Magic numbers for parallel state sharing */

> 1, 2, and 3 would probably work just as well.

Okay.

> Why is this being put in planner.c rather than something specific to
> creating indexes?  Not sure that's a good idea.

The idea is that this is the planner's domain, but since it's a utility
statement, it makes sense to put it next to the CLUSTER function that
determines whether CLUSTER sorts rather than doing an index scan. I
don't have strong feelings on how appropriate that is.

> + * This should be called when workers have flushed out temp file buffers and
> + * yielded control to caller's process.  Workers should hold open their
> + * BufFiles at least until the caller's process is able to call here and
> + * assume ownership of BufFile.  The general pattern is that workers make
> + * available data from their temp files to one nominated process; there is
> + * no support for workers that want to read back data from their original
> + * BufFiles following writes performed by the caller, or any other
> + * synchronization beyond what is implied by caller contract.  All
> + * communication occurs in one direction.  All output is made available to
> + * caller's process exactly once by workers, following call made here at the
> + * tail end of processing.
>
> Thomas has designed a system for sharing files among cooperating
> processes that lacks several of these restrictions.  With his system,
> it's still necessary for all data to be written and flushed by the
> writer before anybody tries to read it.  But the restriction that the
> worker has to hold its BufFile open until the leader can assume
> ownership goes away.  That's a good thing; it avoids the need for
> workers to sit around waiting for the leader to assume ownership of a
> resource instead of going away faster and freeing up worker slots for
> some other query, or moving on to some other computation.   The
> restriction that the worker can't reread the data after handing off
> the file also goes away.

There is no restriction on workers rereading their data. The comment
makes it clear that the restriction applies only once the leader has
written to the file. It alludes to a worker rereading its original
BufFile after the leader has written to it in order to recycle blocks
within logtape.c, which the patch never has to do unless you enable
one of the 0002-* testing GUCs to force randomAccess.

Obviously, if you write to the file in the leader, there is little
that the worker can do afterwards, but it's not a given that you'd
want to do that, and this patch actually never does. You could equally
well say that PHJ fails to provide for my requirement of having the
leader write to the files sensibly in order to recycle blocks, a
requirement that its shared BufFile mechanism expressly does not
support.
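
To make that contract concrete, here's a rough sketch of the pattern
(the helper names and the handoff step itself are made up for
illustration; only the BufFile calls are the real fd.c interface):

    #include "postgres.h"
    #include "storage/buffile.h"

    /*
     * Worker side: write everything, then keep the BufFile open until
     * the leader has assumed ownership.  Communication is one-way; the
     * worker never reads again once the leader may have written.
     */
    static void
    worker_write_run(BufFile *file, void *data, size_t len)
    {
        if (BufFileWrite(file, data, len) != len)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not write to temporary file")));
    }

    /*
     * Leader side: rewind and consume the worker's output exactly once.
     */
    static void
    leader_read_run(BufFile *file, void *buf, size_t len)
    {
        if (BufFileSeek(file, 0, 0L, SEEK_SET) != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not rewind temporary file")));
        if (BufFileRead(file, buf, len) != len)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read from temporary file")));
    }

The point is just that ownership and data move in one direction, once.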

> That would cut hundreds of
> lines from this patch with no real disadvantage that I can see --
> including things like worker_wait(), which are only needed because of
> the shortcomings of the underlying mechanism.

I think it would definitely be a significant net gain in LOC. And
worker_wait() will probably be replaced by use of the barrier
abstraction anyway; it didn't seem worth creating a dependency on that
so early, given my simple requirements. PHJ uses barriers instead,
presumably because it has much more of this kind of synchronization.
The workers generally won't have to wait at all; the wait is expected
to be pretty much instantaneous.
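
For reference, the barrier version would be tiny. A sketch, assuming
the Barrier API from Thomas's patch series (BarrierInit() and
BarrierArriveAndWait(); the wait event argument here is illustrative):

    #include "postgres.h"
    #include "storage/barrier.h"

    /*
     * Hypothetical replacement for worker_wait(): the leader initializes
     * the barrier for all participants, and each participant arrives
     * once its output is flushed.  The last arrival releases everyone,
     * at which point the leader may assume ownership of the workers'
     * files.
     */
    static void
    sort_barrier_init(Barrier *barrier, int nparticipants)
    {
        BarrierInit(barrier, nparticipants);
    }

    static void
    sort_barrier_arrive_and_wait(Barrier *barrier)
    {
        (void) BarrierArriveAndWait(barrier, 0);  /* wait event: illustrative */
    }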

> + * run.  Parallel workers always use quicksort, however.
>
> Comment fails to mention a reason.

Well, I don't think that there is any reason to use replacement
selection at all, what with the merge heap work that went in last
year. But the theory remains that RS is good when you can get one big
run and avoid merging entirely. You're not going to get that with a
parallel sort in any case, since the leader must merge the workers'
runs. Besides, any merging of a worker's own runs happens within that
worker. And the requirement of at least 32MB of workMem per
participant pretty much eliminates any case where RS would otherwise
be used.
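
Stated as code, the gate is trivial. This is purely illustrative (the
struct and fields are stand-ins, not tuplesort.c's actual state):

    #include <stdbool.h>

    typedef struct SortParticipant
    {
        int     worker;         /* worker number, or -1 when serial */
        bool    rs_would_help;  /* stand-in for the serial RS heuristic */
    } SortParticipant;

    static bool
    use_replacement_selection(SortParticipant *p)
    {
        /*
         * A worker's output is always merged by the leader, so RS's only
         * payoff (one big run, no merge at all) can never materialize in
         * a worker.  Workers therefore always quicksort.
         */
        if (p->worker >= 0)
            return false;
        return p->rs_would_help;
    }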

> I think "worker %d" or "participant %d" would be a lot better than
> just starting the message with "%d".  (There are multiple instances of
> this, with various messages.)

Okay.

> I think some of the smaller changes that this patch makes, like
> extending the parallel context machinery to support SnapshotAny, could
> be usefully broken out as separately-committable patches.

Okay.

> I haven't really dug down into the details here, but with the
> exception of the buffile.c stuff which I don't like, the overall
> design of this seems pretty sensible to me.  We might eventually want
> to do something more clever at the sorting level, but those changes
> would be confined to tuplesort.c, and all the other changes you've
> introduced here would stand on their own.  Which is to say that even
> if there's more win to be had here, this is a good start.

That's certainly how I feel about it.

I believe that the main reason you like the design I came up with, on
the whole, is that it's minimally divergent from the serial case. The
changes in logtape.c and tuplesort.c are actually very minor. But the
reason that's possible at all is that buffile.c adds some complexity
that is all about maintaining existing assumptions. You don't like
that complexity. I would suggest that it's useful that I've been able
to isolate it fairly well to buffile.c.

A quick tally of the existing assumptions this patch preserves:

1. Resource managers still work as before, which means that error
handling also works the same way as before. We cooperate with that
mechanism rather than supplanting it entirely.

2. There is only one BufFile per logical tapeset per tuplesort, in
both workers and the leader.

3. You can write to the end of a unified BufFile in the leader to have
it extended, while resource managers continue to do the right thing
despite differing requirements for each segment. This leaves things
sane for workers to read, provided the leader keeps to its own space
in the unified BufFile. (See the sketch just after this list.)

4. Temp files must go away at EoX, no matter what.
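
To show what I mean by item 3, a sketch (the UnifiedSpace bookkeeping
and how the end position gets established are hypothetical; the point
is only that the leader seeks past all worker-owned space before
writing):

    #include "postgres.h"
    #include "storage/buffile.h"

    typedef struct UnifiedSpace
    {
        int     end_fileno;     /* first segment past all worker data */
        off_t   end_offset;     /* append position within that segment */
    } UnifiedSpace;

    /*
     * Leader-only append: position at the end of the unified BufFile so
     * that worker-owned regions are never overwritten and remain stable
     * for reading.
     */
    static void
    leader_append(BufFile *unified, UnifiedSpace *space, void *data,
                  size_t len)
    {
        if (BufFileSeek(unified, space->end_fileno, space->end_offset,
                        SEEK_SET) != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not seek in unified temporary file")));
        if (BufFileWrite(unified, data, len) != len)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not write to unified temporary file")));
        space->end_offset += len;   /* simplistic; ignores segment rollover */
    }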

Thomas has created a kind of shadow resource manager in shared memory.
So, he isn't using fd.c resource management stuff. He is concerned
with a set of BufFiles, each of which has specific significance to
each parallel hash join (there is one per worker per HJ batch). PHJ
has an unpredictable number of BufFiles, while parallel tuplesort
always has exactly one, just as before. For the most part, I think
that what Thomas has
done reflects his own requirements, just as what I've done reflects my
requirements. There seems to be no excellent opportunity to use a
common infrastructure.

I think that not cooperating with the existing mechanism will prove to
be buggy. Following a quick look at the latest PHJ patch series, and
its 0008-hj-shared-buf-file-v8.patch file, I already see one example.
I notice that there could be multiple calls to
pgstat_report_tempfile() within each backend for the same BufFile
segment. Isn't that counting the same thing more than once? In
general, it seems problematic that there are now "true" fd.c temp
segments alongside shared BufFile temp segments that are never tracked
by a backend resource manager.

-- 
Peter Geoghegan

