Over on the "Parallel tuplesort (for parallel B-Tree index creation)" thread , there has been some discussion of merging vs. partitioning. There is a concern about the fact the merge of the tuplesort used to build a B-Tree is not itself parallelized. There is a weak consensus that we'd be better off with partitioning rather than merging anyway.
I've started this new thread to discuss where we want to be with partitioning (and further parallel merging). In short: How can we maximize the benefit of parallel tuplesort? I've said or at least implied that I favor keeping the CREATE INDEX merge serial, while pursuing partitioning (for other uses) at a later date. Further parallelization is useful for parallel query much more so than parallel CREATE INDEX. To summarize, I favor not parallelizing merging because: * The scalability of parallel CREATE INDEX as implemented is apparently comparable to the scalability of similar features in other systems  (systems with very mature implementations). ISTM that we cannot really expect to do too much better than that. * We already do some merging in parallel to get runs for the leader to merge in serial, if and only if workers each produce more than one initial run (not unlikely). That's still pretty effective, and ensures that the final leader merge is cache efficient. I think that I'll probably end up changing the patch to share maintenance_work_mem among workers (not have it be per-worker), making it common for workers to merge, while the leader merge ends up needed fairly few rounds of preloading to merge the final output. * Even if it helps a bit, parallel merging is not the future; partitioning is (more on this later). I don't think partitioning is urgent for CREATE INDEX, and may be inappropriate for CREATE INDEX under any circumstances, because: * Possible problems with parallel infrastructure and writes. Can the parallel infrastructure be even made to write and WAL log an index in parallel, too? Partitioning more or less forces this, at least if you expect any benefit at all -- reading the concatenated output files in a serial pass where the index is written seems likely to hurt more than it helps. This is not much of an advantage when you're likely to be I/O bound anyway, and when the tuple startup cost *per worker* is not of particular concern. * Unbalanced B-Trees (or the risk thereof). This is the really big one for me. We can only write leaf pages out in parallel; we have to "merge together sub B-Trees" to produce internal pages that make up a finished B-Tree. That risks creating a set of internal pages that reflect a skew that any partition-to-sort approach happened to have, due to whatever problem might emerge in the choice of separators in tuplesort.c (this is a documented issue with SQL Server Enterprise Edition, in fact). Can we tolerate the risk of ending up with an unbalanced final B-Tree in the event of a misestimation? Seems like that's quite a lot worse than poor query performance (due to poor load balancing among parallel workers). DBAs are more or less used to the latter variety of problems. They are not used to the former, especially because the problem might go unnoticed for a long time. * What I've come up with is minimally divergent from the existing approach to tuplesorting. This is useful because of the code complexity of having multiple workers consuming final output (writing an index) in parallel looks to be fairly high. Consider B-Tree related features that tuplesort must care about today, which have further special considerations in a world where this simple "consume all output at once in one process" model is replaced for parallel CREATE INDEX. You'd get all kinds of subtle problems at various boundaries in the final keyspace for things like CREATE UNIQUE INDEX CONCURRENTLY. That seems like something that I'd be quite happy to avoid worrying about (while still allowing parallel CREATE UNIQUE INDEX CONCURRENTLY). Note that some other systems don't allow the equivalent of parallel CREATE INDEX CONCURRENTLY at all, presumably because of similar concerns. My patch supports CIC, and does so almost automatically (although the second pass isn't performed in parallel). I bet that CREATE INDEX won't be the last case where that simplicity and generality clearly matters. Suggested partitioning algorithm ================================ I think a hybrid partitioning + merging approach would work well for us. The paper "Parallel Sorting on a Shared-Nothing Architecture using Probabilistic Splitting"  has influenced my thinking here (this was written by prominent researchers from the influential UW-Madison Wisconsin database group). Currently, I have in mind something that is closer to what they call exact splitting to what they call probabilistic splitting, because I don't think it's going to be generally possible to have good statistics on partition boundaries immediately available (e.g., through something like their probabilistic splitting sampling the relation ahead of time). The basic idea I have in mind is that we create runs in workers in the same way that the parallel CREATE INDEX patch does (one output run per worker). However, rather than merging in the leader, we use a splitting algorithm to determine partition boundaries on-the-fly. The logical tape stuff then does a series of binary searches to find those exact split points within each worker's "final" tape. Each worker reports the boundary points of its original materialized output run in shared memory. Then, the leader instructs workers to "redistribute" slices of their final runs among each other, by changing the tapeset metadata to reflect that each worker has nworker input tapes with redrawn offsets into a unified BufFile. Workers immediately begin their own private on-the-fly merges. What's really nice about this is that not long after the initial generation of nworker runs, which has already been shown to be very scalable, workers can each *individually* start returning output tuples immediately (the first tuple can be returned by each on-the-fly merge just after the drawing of boundaries). There is almost no serial bottleneck that they block on. I think that this can be more important than improving sort throughput per se. There'd probably also be an interface for callers to tell tuplesorts what their splitter boundaries are directly, and an interface to retrieve details of what splitter boundaries a tuplesort has determined for itself (in order to have two sort nodes with matching boundaries, as with a parallel merge join). Clearly it's really hard to be sure that this is the right thing at this point, but my intuition is that this is the way to go (while avoiding anything like this for CREATE INDEX). I'd like to know how others feel about it.  https://www.postgresql.org/message-id/flat/CAM3SWZQKM=Pzc=cahzrixkjp2eo5q0jg1sofqqexfq647ji...@mail.gmail.com#CAM3SWZQKM=Pzc=cahzrixkjp2eo5q0jg1sofqqexfq647ji...@mail.gmail.com  https://www.postgresql.org/message-id/CA%2BTgmoY5JYs4R1g_ZJ-P6SkULSb19xx4zUh7S8LJiXonCgVTuQ%40mail.gmail.com  http://pages.cs.wisc.edu/~dewitt/includes/paralleldb/parsort.pdf -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers