On Wed, Nov 9, 2016 at 7:54 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> Now, on the other hand, as far as I can see, the actual amount of
>> evidence that 0001 is a good idea which has been presented in this
>> forum is pretty near zero. You've argued for it on theoretical
>> grounds several times, but theoretical arguments are not a substitute
>> for test results.
>
> See the illustration in TAOCP, vol III, page 273 in the second edition
> -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think
> that it's actually a real-world benchmark.
I don't have that publication, and I'm guessing that's not based on
PostgreSQL's implementation.  There's no substitute for tests using
the code we've actually got.

>> So, I'm now feeling pretty bullish about this patch, except for one
>> thing, which is that I think the comments are way off-base.  Peter
>> writes: "When allowedMem is significantly lower than what is required
>> for an internal sort, it is unlikely that there are benefits to
>> increasing the number of tapes beyond Knuth's "sweet spot" of 7."
>> I'm pretty sure that's totally wrong, first of all because commit
>> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
>> precisely the thing which this comment says we shouldn't.
>
> It's more complicated than that. As I said, I think that Knuth
> basically had it right with his sweet spot of 7. I think that commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
> because a one-pass merge avoided certain overheads not inherent to
> polyphase merge, like all that memory accounting stuff, extra palloc()
> traffic, etc. The expanded use of per-tape buffering we have even in
> multi-pass cases likely makes that much less true for us these days.
>
> The reason I haven't actually gone right back down to 7 with this cap
> is that it's possible that the added I/O costs outweigh the CPU costs
> in extreme cases, even though I think that polyphase merge doesn't
> have all that much to do with I/O costs, even with its 1970s
> perspective. Knuth doesn't say much about I/O costs -- it's more about
> using an extremely small amount of memory effectively (minimizing CPU
> costs with very little available main memory).
>
> Furthermore, not limiting ourselves to 7 tapes and seeing a benefit
> (benefitting from a few dozen or hundred instead) seems more possible
> with the improved merge heap maintenance logic added recently, where
> there could be perhaps hundreds of runs merged with very low CPU cost
> in the event of presorted input (or, input that is inversely
> logically/physically correlated). That would be true because we'd only
> examine the top of the heap throughout, and so I/O costs may matter
> much more.
>
> Depending on the exact details, I bet you could see a benefit with
> only 7 tapes due to CPU cache efficiency in a case like the one you
> describe. Perhaps when sorting integers, but not when sorting collated
> text. There are many competing considerations, which I've tried my
> best to balance here with a merge order of 500.

I guess that's possible, but the problem with polyphase merge is that
the increased I/O becomes a pretty significant cost in a hurry.
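To make the I/O trade-off a bit more concrete, here's a
back-of-the-envelope sketch -- plain C written just for illustration,
not anything from tuplesort.c, and it models a simple balanced
multi-pass merge rather than the polyphase schedule we actually build.
With R initial runs and merge order F you need roughly ceil(log_F(R))
merge passes, and each pass before the final one can end up rereading
and rewriting most of the input, which is where the extra I/O from a
low tape cap comes from:

/*
 * Illustrative only: balanced-merge approximation, not the polyphase
 * schedule tuplesort actually builds.  Near the boundary (R only a
 * little larger than F) only a small fraction of the runs really gets
 * rewritten, so this overstates the cost in that case.
 */
#include <math.h>
#include <stdio.h>

static int
merge_passes(double runs, double order)
{
    return (int) ceil(log(runs) / log(order));
}

int
main(void)
{
    double runs = 544;              /* initial runs from the test below */
    int    orders[] = {7, 100, 500};

    for (int i = 0; i < 3; i++)
    {
        int passes = merge_passes(runs, orders[i]);

        printf("merge order %3d: %d merge pass(es), %d non-final\n",
               orders[i], passes, passes - 1);
    }
    return 0;
}

With 544 runs that works out to one non-final pass at a merge order of
100 or 500, but three of them at a merge order of 7, and every
non-final pass is data that has to be written out and read back again.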
Here's the same test with max_sort_tapes = 100:

2016-11-09 23:02:49 UTC  LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 23:02:55 UTC  LOG: switching to external sort with 101 tapes: CPU: user: 5.72 s, system: 0.25 s, elapsed: 6.04 s
2016-11-10 00:13:00 UTC  LOG: finished writing run 544 to tape 49: CPU: user: 4003.00 s, system: 156.89 s, elapsed: 4211.33 s
2016-11-10 00:16:52 UTC  LOG: finished 51-way merge step: CPU: user: 4214.84 s, system: 161.94 s, elapsed: 4442.98 s
2016-11-10 00:25:41 UTC  LOG: finished 100-way merge step: CPU: user: 4704.14 s, system: 170.83 s, elapsed: 4972.47 s
2016-11-10 00:36:47 UTC  LOG: finished 99-way merge step: CPU: user: 5333.12 s, system: 179.94 s, elapsed: 5638.52 s
2016-11-10 00:45:32 UTC  LOG: finished 99-way merge step: CPU: user: 5821.13 s, system: 189.00 s, elapsed: 6163.53 s
2016-11-10 01:01:29 UTC  LOG: finished 100-way merge step: CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:01:29 UTC  LOG: performsort done (except 100-way final merge): CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:45:40 UTC  LOG: external sort ended, 6255949 disk blocks used: CPU: user: 9271.07 s, system: 232.26 s, elapsed: 9771.49 s

This is already worse than max_sort_tapes = 501, though the total
runtime is still better than no cap (the time-to-first-tuple is way
worse, though).  I'm going to try max_sort_tapes = 10 next, but I think
the basic pattern is already fairly clear.  As you reduce the cap on
the number of tapes, (a) the time to build the initial runs doesn't
change very much, (b) the time to perform the final merge decreases
significantly, and (c) the time to perform the non-final merges
increases even faster.  In this particular test configuration on this
particular hardware, rewriting 77 runs in the 501-tape configuration
wasn't too bad, but now that we're down to 100 tapes, we have to
rewrite 449 runs out of a total of 544, and that's actually a loss:
rewriting the bulk of your data an extra time to save on cache misses
doesn't pay.  It would probably be even less good if there were other
concurrent activity on the system.

It's possible that if your polyphase merge is actually being done all
in memory, cache efficiency might remain the dominant consideration,
but I think we should assume that a polyphase merge is doing actual
I/O, because it's sort of pointless to use that algorithm in the first
place if there's no real I/O involved.  At the moment, at least, it
looks to me as though we don't need to be afraid of a *little* bit of
polyphase merging, but a *lot* of polyphase merging is actually pretty
bad.

In other words, by imposing a limit on the number of tapes, we're
going to improve sorts that are smaller than work_mem * num_tapes *
~1.5 -- because cache efficiency will be better -- but above that
things will probably get worse because of the increased I/O cost.
From that point of view, a 500-tape limit is the same as saying that
we don't think it's entirely reasonable to try to perform a sort that
exceeds work_mem by a factor of more than ~750, whereas a 7-tape limit
is the same as saying that we don't think it's entirely reasonable to
perform a sort that exceeds work_mem by a factor of more than ~10.
That latter proposition seems entirely untenable.  Our default
work_mem setting is 4MB, and people will certainly expect to be able
to get away with, say, an 80MB sort without changing settings.
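And if it helps to see the arithmetic behind those factors spelled
out, here's an equally trivial sketch (again plain C written just for
this email, and again leaning on the same hand-wavy ~1.5x average run
length assumption as above):

/*
 * Rough arithmetic only: largest sort that still finishes with a
 * single merge pass, for a given work_mem and tape cap, assuming
 * initial runs average ~1.5x work_mem.
 */
#include <stdio.h>

int
main(void)
{
    double work_mem_mb = 4.0;       /* default work_mem */
    int    caps[] = {7, 500};

    for (int i = 0; i < 2; i++)
    {
        double ceiling_mb = work_mem_mb * caps[i] * 1.5;

        printf("tape cap %3d: single-pass ceiling ~%.0f MB (work_mem x %g)\n",
               caps[i], ceiling_mb, caps[i] * 1.5);
    }
    return 0;
}

With the default 4MB work_mem that puts the single-pass ceiling at
roughly 42MB for a 7-tape cap versus roughly 3GB for a 500-tape cap,
which is why that 80MB sort lands on the wrong side of the 7-tape
line.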
On the other hand, if they're sorting more than 3GB with work_mem =
4MB, I think we'll be justified in making a gentle suggestion that
they reconsider that setting.  Among other arguments, it's going to be
pretty slow in that case no matter what we do here.

Maybe another way of putting this is that, while there's clearly a
benefit to having some kind of a cap, it's appropriate to pick a large
value, such as 500.  Having no cap at all risks creating many extra
tapes that just waste memory, and also risks an unduly
cache-inefficient final merge.  Reining that in makes sense.  However,
we can't rein it in too far or we'll create slow polyphase merges in
cases that are reasonably likely to occur in real life.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company