Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Robert Haas Wed, 09 Nov 2016 18:58:27 -0800

On Wed, Nov 9, 2016 at 7:54 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> Now, on the other hand, as far as I can see, the actual amount of
>> evidence that 0001 is a good idea which has been presented in this
>> forum is pretty near zero.  You've argued for it on theoretical
>> grounds several times, but theoretical arguments are not a substitute
>> for test results.
>
> See the illustration in TAOCP, vol III, page 273 in the second edition
> -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think
> that it's actually a real-world benchmark.

I don't have that publication, and I'm guessing that's not based on
PostgreSQL's implementation.  There's no substitute for tests using
the code we've actually got.

>> So, I'm now feeling pretty bullish about this patch, except for one
>> thing, which is that I think the comments are way off-base. Peter
>> writes: $When allowedMem is significantly lower than what is required
>> for an internal sort, it is unlikely that there are benefits to
>> increasing the number of tapes beyond Knuth's "sweet spot" of 7.$
>> I'm pretty sure that's totally wrong, first of all because commit
>> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
>> precisely the thing which this comment says we shouldn't
>
> It's more complicated than that. As I said, I think that Knuth
> basically had it right with his sweet spot of 7. I think that commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
> because a one-pass merge avoided certain overheads not inherent to
> polyphase merge, like all that memory accounting stuff, extra palloc()
> traffic, etc. The expanded use of per tape buffering we have even in
> multi-pass cases likely makes that much less true for us these days.
>
> The reason I haven't actually gone right back down to 7 with this cap
> is that it's possible that the added I/O costs outweigh the CPU costs
> in extreme cases, even though I think that polyphase merge doesn't
> have all that much to do with I/O costs, even with its 1970s
> perspective. Knuth doesn't say much about I/O costs -- it's more about
> using an extremely small amount of memory effectively (minimizing CPU
> costs with very little available main memory).
>
> Furthermore, not limiting ourselves to 7 tapes and seeing a benefit
> (benefitting from a few dozen or hundred instead) seems more possible
> with the improved merge heap maintenance logic added recently, where
> there could be perhaps hundreds of runs merged with very low CPU cost
> in the event of presorted input (or, input that is inversely
> logically/physically correlated). That would be true because we'd only
> examine the top of the heap through, and so I/O costs may matter much
> more.
>
> Depending on the exact details, I bet you could see a benefit with
> only 7 tapes due to CPU cache efficiency in a case like the one you
> describe. Perhaps when sorting integers, but not when sorting collated
> text. There are many competing considerations, which I've tried my
> best to balance here with a merge order of 500.

I guess that's possible, but the problem with polyphase merge is that
the increased I/O becomes a pretty significant cost in a hurry.
Here's the same test with max_sort_tapes = 100:

2016-11-09 23:02:49 UTC [48551] LOG:  begin tuple sort: nkeys = 1,
workMem = 262144, randomAccess = f
2016-11-09 23:02:55 UTC [48551] LOG:  switching to external sort with
101 tapes: CPU: user: 5.72 s, system: 0.25 s, elapsed: 6.04 s
2016-11-10 00:13:00 UTC [48551] LOG:  finished writing run 544 to tape
49: CPU: user: 4003.00 s, system: 156.89 s, elapsed: 4211.33 s
2016-11-10 00:16:52 UTC [48551] LOG:  finished 51-way merge step: CPU:
user: 4214.84 s, system: 161.94 s, elapsed: 4442.98 s
2016-11-10 00:25:41 UTC [48551] LOG:  finished 100-way merge step:
CPU: user: 4704.14 s, system: 170.83 s, elapsed: 4972.47 s
2016-11-10 00:36:47 UTC [48551] LOG:  finished 99-way merge step: CPU:
user: 5333.12 s, system: 179.94 s, elapsed: 5638.52 s
2016-11-10 00:45:32 UTC [48551] LOG:  finished 99-way merge step: CPU:
user: 5821.13 s, system: 189.00 s, elapsed: 6163.53 s
2016-11-10 01:01:29 UTC [48551] LOG:  finished 100-way merge step:
CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:01:29 UTC [48551] LOG:  performsort done (except 100-way
final merge): CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58
s
2016-11-10 01:45:40 UTC [48551] LOG:  external sort ended, 6255949
disk blocks used: CPU: user: 9271.07 s, system: 232.26 s, elapsed:
9771.49 s

This is already worse than max_sort_tapes = 501, though the total
runtime is still better than no cap (the time-to-first-tuple is way
worse, though).   I'm going to try max_sort_tapes = 10 next, but I
think the basic pattern is already fairly clear.  As you reduce the
cap on the number of tapes, (a) the time to build the initial runs
doesn't change very much, (b) the time to perform the final merge
decreases significantly, and (c) the time to perform the non-final
merges increases even faster.  In this particular test configuration
on this particular hardware, rewriting 77 tapes in the 501-tape
configuration wasn't too bad, but now that we're down to 100 tapes, we
have to rewrite 449 tapes out of a total of 544, and that's actually a
loss: rewriting the bulk of your data an extra time to save on cache
misses doesn't pay.  It would probably be even less good if there were
other concurrent activity on the system.  It's possible that if your
polyphase merge is actually being done all in memory, cache efficiency
might remain the dominant consideration, but I think we should assume
that a polyphase merge is doing actual I/O, because it's sort of
pointless to use that algorithm in the first place if there's no real
I/O involved.

At the moment, at least, it looks to me as though we don't need to be
afraid of a *little* bit of polyphase merging, but a *lot* of
polyphase merging is actually pretty bad.  In other words, by imposing
a limit of the number of tapes, we're going to improve sorts that are
smaller than work_mem  * num_tapes * ~1.5 -- because cache efficiency
will be better -- but above that things will probably get worse
because of the increased I/O cost.  From that point of view, a
500-tape limit is the same as saying that it's we don't think it's
entirely reasonable to try to perform a sort that exceeds work_mem by
a factor of more than ~750, whereas a 7-tape limit is the same as
saying that we don't think it's entirely reasonable to perform a sort
that exceeds work_mem by a factor of more than ~10.  That latter
proposition seems entirely untenable.  Our default work_mem setting is
4MB, and people will certainly expect to be able to get away with,
say, an 80MB sort without changing settings.  On the other hand, if
they're sorting more than 3GB with work_mem = 4MB, I think we'll be
justified in making a gentle suggestion that they reconsider that
setting.  Among other arguments, it's going to be pretty slow in that
case no matter what we do here.

Maybe another way of putting this is that, while there's clearly a
benefit to having some kind of a cap, it's appropriate to pick a large
value, such as 500.  Having no cap at all risks creating many extra
tapes that just waste memory, and also risks an unduly
cache-inefficient final merge.  Reigning that in makes sense.
However, we can't reign it in too far or we'll create slow polyphase
merges in case that are reasonably likely to occur in real life.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Reply via email to