On Sat, Aug 6, 2016 at 6:46 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: > I think here some of the factors like how many workers will be used > for merge phase might impact the performance. Having too many > workers can lead to more communication cost and having too few workers > might not yield best results for merge. One thing, I have noticed > that in general for sorting, some of the other databases uses range > partitioning , now that might not be what is good for us.
I don't disagree with anything you say here. I acknowledged that partitioning will probably be important for sorting in my introductory e-mail, after all. > I see > you mentioned above that why it is not good , but I don't > understand why you think it is a risky assumption to assume good > partition boundaries for parallelizing sort. Well, apparently there are numerous problems with partitioning in systems like SQL Server and Oracle in the worst case. For one thing, in the event of a misestimation (or failure of the dynamic sampling that I presume can sometimes be used), workers can be completely starved of work for the entire duration of the sort. And for CREATE INDEX to get much of any benefit, all workers must write their part of the index independently, too. This can affect the physical structure of the final index. SQL Server also has a caveat in its documentation about this resulting in an unbalanced final index, which I imagine could be quite bad in the worst case. I believe that it's going to be hard to get any version of this that writes the index simultaneously in each worker accepted for these reasons. This patch I came up with isn't very different from the serial case at all. Any index built in parallel by the patch ought to have relfilenode files on the filesystem that are 100% identical to those produced by the serial case, in fact (since CREATE INDEX does not set LSNs in the new index pages). I've actually developed a simple way of "fingerprinting" indexes during testing of this patch, knowing that hashing the files on disk ought to produce a perfect match compared to a master branch serial sort case. At the same time, any information that I've seen about how much parallel CREATE INDEX speeds things up in these other systems indicates that the benefits are very similar. It tends to be in the 2x - 3x range, with the same reduction in throughput seen at about 16 workers, after we peak at about 8 workers. So, I think that the benefits of partitioning are not really seen with CREATE INDEX (I think of partitioning as more of a parallel query thing). Obviously, any benefit that might still exist for CREATE INDEX in particular, when weighed against the costs, makes partitioning look pretty unattractive as a next step. I think that during the merge phase of parallel CREATE INDEX as implemented, the system generally still isn't that far from being I/O bound. Whereas, with parallel query, partitioning makes each worker able to return one tuple from its own separated range very quickly, not just one worker (presumably, each worker merges non-overlapping "ranges" from runs initially sorted in each worker. Each worker subsequently merges after a partition-wise redistribution of the initial fully sorted runs, allowing for dynamic sampling to optimize the actual range used for load balancing.). The workers can then do more CPU-bound processing in whatever node is fed by each worker's ranged merge; everything is kept busy. That's the approach that I personally had in mind for partitioning, at least. It's really nice for parallel query to be able to totally separate workers after the point of redistribution. CREATE INDEX is not far from being I/O bound anyway, though, so it benefits far less. (Consider how fast the merge phase still is at writing out the index in *absolute* terms.) Look at figure 9 in this paper: http://www.vldb.org/pvldb/vol7/p85-balkesen.pdf Even in good cases for "independent sorting", there is only a benefit seen at 8 cores. At the same time, I can only get about 6x scaling with 8 workers, just for the initial generation of runs. All of these factors are why I believe I'm able to compete well with other systems with this relatively straightforward, evolutionary approach. I have a completely open mind about partitioning, but my approach makes sense in this context. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers