On Mon, Dec 7, 2015 at 9:01 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> The patched one ends with a 2-way, two sequential 233-way merges, and
> a final 233-way merge:
> LOG:  finished 2-way merge step: CPU 62.08s/435.70u sec elapsed 587.52 sec
> LOG:  finished 233-way merge step: CPU 77.94s/660.11u sec elapsed 897.51 sec
> LOG:  a multi-pass external merge sort is required (234 tape maximum)
> HINT:  Consider increasing the configuration parameter "maintenance_work_mem".
> LOG:  finished 233-way merge step: CPU 94.55s/884.63u sec elapsed 1185.17 sec
> LOG:  performsort done (except 233-way final merge): CPU
> 94.76s/884.69u sec elapsed 1192.01 sec
> LOG:  external sort ended, 2541656 disk blocks used: CPU
> 202.65s/1771.50u sec elapsed 3963.90 sec
> If you just look at the final merges of each, they should have the
> same number of tuples going through them (i.e. all of the tuples) but
> the patched one took well over twice as long, and all that time was IO
> time, not CPU time.
> I reversed out the memory pooling patch, and that shaved some time
> off, but nowhere near bringing it back to parity.
> I think what is going on here is that the different numbers of runs
> with the patched code just makes it land in an anti-sweat spot in the
> tape emulation and buffering algorithm.
> Each tape gets 256kB of buffer.  But two tapes have one third of the
> tuples each other third are spread over all the other tapes almost
> equally (or maybe one tape has 2/3 of the tuples, if the output of one
> 233-way nonfinal merge was selected as the input of the other one).
> Once the large tape(s) has depleted its buffer, the others have had
> only slightly more than 1kB each depleted.  Yet when it goes to fill
> the large tape, it also tops off every other tape while it is there,
> which is not going to get much read-ahead performance on them, leading
> to a lot of random IO.

The final merge only refills each tape buffer as that buffer gets
depleted, rather than refilling all of them whenever any is depleted,
so my explanation doesn't work. But move it back one layer.  There are
3 sequential 233-way merges.  The first one produces a giant tape run.
The second one consumes that giant tape run along with 232 small tape
runs.  At this point, the logic I describe above does come into play,
refilling each of the buffers for the small runs much too often,
freeing blocks on the tape emulation for those runs in dribs and
drabs.  Those free blocks get re-used by the giant output tape run, in
a scattered fashion.

Then in the next (final) merge, it is has to read in this huge
fragmented tape run emulation, generating a lot of random IO to read

With the patched code, the average length of reads on files in
pgsql_tmp between lseeks or changing to a different file descriptor is
8, while in the unpatched code it is 14.

> Now, I'm not sure why this same logic wouldn't apply to the unpatched
> code with 118-way merge too.  So maybe I am all wet here.  It seems
> like that imbalance would be enough to also cause the problem.

So my current theory is that it takes one large merge to generate an
unbalanced tape, one large merge where that large unbalanced tape
leads to fragmenting the output tape, and one final merge to be slowed
down by this fragmentation.

I looked at https://code.google.com/p/ioapps/ as Peter recommended,
but couldn't figure out what do with it.  The only conclusion I got
from ioprofiler was that it spend a lot of time reading files in
pgsql_tmp.   I found just doing
strace -y -ttt -T -p <pid>
And then analyzing with perl one liners to work better, but it could
just be the learning curve.

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to