On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <robertmh...@gmail.com> wrote:
> I'm not going to say it's bad to be able to do things 2-2.5x faster,
> but linear scalability this ain't - particularly because your 2.58x
> faster case is using up to 7 or 8 times as much memory. The
> single-process case would be faster in that case, too: you could
> quicksort.
Certainly, there are cases where a parallel version could benefit more from having extra memory than from actually parallelizing the underlying task. However, this case was pointedly chosen *not* to be such a case. When maintenance_work_mem exceeds about 5GB, I've observed that since 9.6 increasing it further is about as likely to hurt as to help, by roughly +/-5% (unless and until the sort fits entirely in memory, which still doesn't help much). In general, there isn't all that much point in doing a very large sort like this in memory: you just don't get that much benefit for the memory you use, because linearithmic CPU costs eventually come to dominate linear sequential I/O costs.

I think you're focusing on the fact that there is a large absolute disparity in memory used in this one benchmark, but the gains shown don't particularly hinge on that. There isn't that much difference when workers must merge their own runs, for example; it saves the serial leader merge some work, and in particular makes it more cache efficient (by having fewer runs/tapes). Finally, while about 8x as much memory is used, the memory used over and above the serial case is almost all freed when the final merge begins (the final merges are therefore very similar in both cases, including in terms of memory use). So, for as long as 8x as much memory is in use across 8 active processes, you get a 5.67x speed-up of that part alone. The leader keeps a few extra KiBs of memory for worker tapes and the like during its merge, but that's a close to negligible amount.

> I feel like for sorting, in particular, we probably ought
> to be setting the total memory budget, not the per-process memory
> budget. Or if not, then any CREATE INDEX benchmarking had better
> compare using scaled values for maintenance_work_mem; otherwise,
> you're measuring the impact of using more memory as much as anything
> else.

As I said, the benchmark was chosen to avoid that (and to be simple and reproducible).
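To show why a 5.67x speed-up of run generation alone doesn't translate into a 5.67x overall speed-up, here is a quick back-of-the-envelope sketch (illustrative Python; the phase fractions are hypothetical numbers I picked for the example, not measurements from the benchmark):

```python
# Amdahl-style arithmetic: if run generation is some fraction of serial
# CREATE INDEX time and speeds up 5.67x with 8 workers, while the final
# leader merge stays serial, the overall speed-up is much smaller.

def overall_speedup(parallel_fraction, phase_speedup):
    """Amdahl's law: serial part unchanged, parallel part divided."""
    serial_part = 1.0 - parallel_fraction
    parallel_part = parallel_fraction / phase_speedup
    return 1.0 / (serial_part + parallel_part)

# Hypothetical: run generation is 80% (or 60%) of the serial runtime.
print(overall_speedup(0.80, 5.67))  # roughly 2.9x overall
print(overall_speedup(0.60, 5.67))  # roughly 2.0x overall
```

The larger the serial merge's share of total runtime, the harder the overall number is capped, regardless of how well run generation scales.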
I am currently neutral on the question of whether maintenance_work_mem should be doled out per process or per sort operation. I do think that making it a per-process allowance is far closer to what we do for hash joins today, and is simpler. What's nice about making the workMem/maintenance_work_mem budget per sort is that it leaves the leader process with license to greatly increase the amount of memory it can use for the merge. Increasing the amount of memory used for the merge keeps helping for longer than it does for workers; I've already simulated it.

> I also think that Amdahl's law is going to pinch pretty severely here.

Doesn't that almost always happen, though? Isn't that what you generally see with queries that show off the parallel join capability?

> If the final merge phase is a significant percentage of the total
> runtime, picking an algorithm that can't parallelize the final merge
> is going to limit the speedups to small multiples. That's an OK place
> to be as a result of not having done all the work yet, but you don't
> want to get locked into it. If we're going to have a substantial
> portion of the work that can never be parallelized, maybe we've picked
> the wrong algorithm.

I suggest that this work be compared to something with similar constraints. I used Google to try to get some indication of how much of a difference parallel CREATE INDEX makes in other major database systems. This is all I could find:

https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

It seems that the degree of parallelism used for SQL Server affects index build time in a way that is strikingly similar to what I've come up with (which may be a coincidence; I don't know anything about SQL Server). So, I suspect that the performance of this is fairly good in an apples-to-apples comparison.
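To make the merge-cost point concrete: in the leader's serial merge, every output tuple must pass through a priority structure over all the runs, so each tuple costs on the order of log(k) comparisons for k runs/tapes. A minimal sketch of such a k-way merge (illustrative Python, not PostgreSQL's actual tuplesort/logtape code):

```python
# Minimal k-way merge of pre-sorted runs using a binary heap. Each emitted
# value costs O(log k) heap comparisons, which is why merge comparisons are
# relatively expensive next to quicksort's, and why having fewer runs/tapes
# makes the merge more cache efficient.
import heapq

def kway_merge(runs):
    """Merge k sorted lists into one sorted output list."""
    # Seed the heap with the first element of each non-empty run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, run_idx, pos = heapq.heappop(heap)
        out.append(value)
        nxt = pos + 1
        if nxt < len(runs[run_idx]):
            heapq.heappush(heap, (runs[run_idx][nxt], run_idx, nxt))
    return out

runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(kway_merge(runs))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Unlike run generation, this loop pulls everything through a single consumer, which is exactly the serial portion that Amdahl's law penalizes.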
Parallelizing merging can hurt or help, because there is a cost in memory bandwidth (if not I/O) for the extra passes that are used to keep more CPUs busy, which is somewhat analogous to the situation with polyphase merge. I'm not saying that we shouldn't do it even so, but I believe the returns diminish sharply. Merge tuple comparisons are much more expensive than quicksort tuple comparisons, which tend to benefit a lot from abbreviated keys.

As I've said, there is probably a good argument to be made for partitioning to increase parallelism. But that involves risks around the partitioning being driven by statistics or a cost model, and I don't think you'd be too on board with the idea of every CREATE INDEX after bulk loading needing an ANALYZE first. I tend to think of that as more of a parallel query thing, because you can often push down a lot more there, dynamic sampling might be possible, and there isn't a need to push all the tuples through one point in the end. Nothing I've done here precludes your idea of a sort-order-preserving gather node; I think that we may well need both. Since merging is a big bottleneck here, we should probably also work to address that indirectly.

> The work on making the logtape infrastructure parallel-aware seems
> very interesting and potentially useful for other things. Sadly, I
> don't have time to look at it right now.

I would be happy to look at generalizing that further, to help parallel hash join. As you know, Thomas Munro and I have discussed this privately.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers