On Wed, Aug 10, 2016 at 11:59 AM, Robert Haas <robertmh...@gmail.com> wrote:
> My view on this - currently anyway - is that we shouldn't conflate the
> tuplesort with the subsequent index generation, but that we should try
> to use parallelism within the tuplesort itself to the greatest extent
> possible.  If there is a single output stream that the leader uses to
> generate the final index, then none of the above problems arise.  They
> only arise if you've got multiple processes actually writing to the
> index.

I'm not sure if you're agreeing with my contention about parallel
CREATE INDEX not being a good target for partitioning here. Are you?

You can get some idea of how much a separate pass over the
concatenated outputs would hurt by using the test GUC with my patch
applied (the one that artificially forces randomAccess by B-Tree
tuplesort callers).

>> Suggested partitioning algorithm
>> ================================

>> The basic idea I have in mind is that we create runs in workers in the
>> same way that the parallel CREATE INDEX patch does (one output run per
>> worker). However, rather than merging in the leader, we use a
>> splitting algorithm to determine partition boundaries on-the-fly. The
>> logical tape stuff then does a series of binary searches to find those
>> exact split points within each worker's "final" tape. Each worker
>> reports the boundary points of its original materialized output run in
>> shared memory. Then, the leader instructs workers to "redistribute"
>> slices of their final runs among each other, by changing the tapeset
>> metadata to reflect that each worker has nworker input tapes with
>> redrawn offsets into a unified BufFile. Workers immediately begin
>> their own private on-the-fly merges.
> Yeah, this is pretty cool.  You end up with the final merge segmented
> into N submerges producing nonoverlapping ranges.  So you could have
> the leader perform submerge 0 itself, and while it's doing that the
> other workers can perform submerges 1..N.  By the time  the leader
> finishes submerge 0, the remaining submerges will likely be complete
> and after that the leader can just read the outputs of those submerges
> one after another and it has basically no additional work to do.

Again, I'm a little puzzled by your remarks here. Surely the really
great case for parallel sort with partitioning is the case where there
remains minimal further IPC between workers? So, while "the leader can
just read the outputs of those submerges", ideally it will be reading
as little as possible from workers. For example, it's ideal when the
workers were able to determine that their particular range in the
parallel merge join has very few tuples to return, having also
"synchronized" their range within two underlying relations
(importantly, the merge join "synchronization" can begin per worker
when the tuplesort.c on-the-fly merge begins and returns its first
tuple -- that is, it can begin very soon).

In short, partitioning when sorting is as much about avoiding a serial
dependency for the entire query tree as it is about taking advantage
of available CPU cores and disk spindles. That is my understanding, at
any rate.

While all this speculation about choice of algorithm is fun,
realistically I'm not gong to write the patch for a rainy day (nor for
parallel CREATE INDEX, at least until we become very comfortable with
all the issues I raise, which could never happen). I'd be happy to
consider helping you improve parallel query by providing
infrastructure like this, but I need someone else to write the client
of the infrastructure (e.g. a parallel merge join patch), or to at
least agree to meet me half way with an interdependent prototype of
their own. It's going to be messy, and we'll have to do a bit of
stumbling to get to a good place. I can sign up to that if I'm not the
only one that has to stumble.

Remember how I said we should work on the merging bottleneck
indirectly? I'm currently experimenting with having merging use
sifting down to replace the root in the heap. This is very loosely
based on the Jeremy Harris patch from 2014, I suppose. Anyway, this
can be far, far faster, with perhaps none of the downsides that we saw
in the context of building an initial replacement selection heap,
because we have more control of the distribution of input (tapes have
sorted tuples), and because this heap is so tiny and cache efficient
to begin with. This does really well in the event of clustering of
values, which is a common case, but also helps with totally random
initially input.

I need to do some more research before posting a patch, but right now
I can see that it makes merging presorted numeric values more than 2x
faster. And that's with 8 tapes, on my very I/O bound laptop. I bet
that the benefits would also be large for text (temporal locality is
improved, and so strcoll() comparison caching is more effective).
Serial merging still needs work, it seems.

Peter Geoghegan

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to