On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <p...@heroku.com> wrote: > I'll start a new thread for this, since my external sorting patch has > now evolved well past the original "quicksort with spillover" > idea...although not quite how I anticipated it would. It seems like > I've reached a good point to get some feedback.
Corey Huinker has once again assisted me with this work, by doing some benchmarking on an AWS instance of his: 32 cores (c3.8xlarge, I suppose) MemTotal: 251902912 kB I believe it had one EBS volume. This testing included 2 data sets: * A data set that he happens to have that is representative of his production use-case. Corey had some complaints about the sort performance of PostgreSQL, particularly prior to 9.5, and I like to link any particular performance optimization to an improvement in an actual production workload, if at all possible. * A tool that I wrote, that works on top of sortbenchmark.org's "gensort"  data generation tool. It seems reasonable to me to drive this work in part with a benchmark devised by Jim Gray. He did after all receive a Turing award for this contribution to transaction processing. I'm certainly a fan of his work. A key practical advantage of that is that is has reasonable guarantees about determinism, making these results relatively easy to recreate independently. The modified "gensort" is available from https://github.com/petergeoghegan/gensort The python script postgres_load.py, which performs bulk-loading for Postgres using COPY FREEZE. It ought to be fairly self-documenting: $:~/gensort$ ./postgres_load.py --help usage: postgres_load.py [-h] [-w WORKERS] [-m MILLION] [-s] [-l] [-c] optional arguments: -h, --help show this help message and exit -w WORKERS, --workers WORKERS Number of gensort workers (default: 4) -m MILLION, --million MILLION Generate n million tuples (default: 100) -s, --skew Skew distribution of output keys (default: False) -l, --logged Use logged PostgreSQL table (default: False) -c, --collate Use default collation rather than C collation (default: False) For this initial report to the list, I'm going to focus on a case involving 16 billion non-skewed tuples generated using the gensort tool. I wanted to see how a sort of a ~1TB table (1017GB as reported by psql, actually) could be improved, as compared to relatively small volumes of data (in the multiple gigabyte range) that were so improved by sorts on my laptop, which has enough memory to avoid blocking on physical I/O much of the time. How the new approach deals with hundreds of runs that are actually reasonably sized is also of interest. This server does have a lot of memory, and many CPU cores. It was kind of underpowered on I/O, though. The initial load of 16 billion tuples (with a sortkey that is "C" locale text) took about 10 hours. My tool supports parallel generation of COPY format files, but serial performance of that stage isn't especially fast. Further, in order to support COPY FREEZE, and in order to ensure perfect determinism, the COPY operations occur serially in a single transaction that creates the table that we performed a CREATE INDEX on. Patch, with 3GB maintenance_work_mem: ... LOG: performsort done (except 411-way final merge): CPU 1017.95s/17615.74u sec elapsed 23910.99 sec STATEMENT: create index on sort_test (sortkey ); LOG: external sort ended, 54740802 disk blocks used: CPU 2001.81s/31395.96u sec elapsed 41648.05 sec STATEMENT: create index on sort_test (sortkey ); So just over 11 hours (11:34:08), then. The initial sorting for 411 runs took 06:38:30.99, as you can see. Master branch: ... LOG: finished writing run 202 to tape 201: CPU 1224.68s/31060.15u sec elapsed 34409.16 sec LOG: finished writing run 203 to tape 202: CPU 1230.48s/31213.55u sec elapsed 34580.41 sec LOG: finished writing run 204 to tape 203: CPU 1236.74s/31366.63u sec elapsed 34750.28 sec LOG: performsort starting: CPU 1241.70s/31501.61u sec elapsed 34898.63 sec LOG: finished writing run 205 to tape 204: CPU 1242.19s/31516.52u sec elapsed 34914.17 sec LOG: finished writing final run 206 to tape 205: CPU 1243.23s/31564.23u sec elapsed 34963.03 sec LOG: performsort done (except 206-way final merge): CPU 1243.86s/31570.58u sec elapsed 34974.08 sec LOG: external sort ended, 54740731 disk blocks used: CPU 2026.98s/48448.13u sec elapsed 55299.24 sec CREATE INDEX Time: 55299315.220 ms So 15:21:39 for master -- it's much improved, but this was still disappointing given the huge improvements on relatively small cases. Finished index was fairly large, which can be seen here by working back from "total relation size": postgres=# select pg_size_pretty(pg_total_relation_size('sort_test')); pg_size_pretty ---------------- 1487 GB (1 row) I think that this is probably due to the relatively slow I/O on this server, and because the merge step is more of a bottleneck. As we increase maintenance_work_mem, we're likely to then suffer from the lack of explicit asynchronous I/O here. It helps, still, but not dramatically. With with maintenance_work_mem = 30GB, patch is somewhat faster (no reason to think that this would help master at all, so that was untested): ... LOG: starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed 24910.38 sec LOG: finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed 25140.69 sec LOG: finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec elapsed 25234.44 sec LOG: performsort starting: CPU 1849.46s/19803.28u sec elapsed 25499.98 sec LOG: starting quicksort of run 41: CPU 1849.46s/19803.28u sec elapsed 25499.98 sec LOG: finished quicksorting run 41: CPU 1852.37s/20000.73u sec elapsed 25700.43 sec LOG: finished writing run 41 to tape 40: CPU 1864.89s/20069.09u sec elapsed 25782.93 sec LOG: performsort done (except 41-way final merge): CPU 1965.43s/20086.28u sec elapsed 25980.80 sec LOG: external sort ended, 54740909 disk blocks used: CPU 3270.57s/31595.37u sec elapsed 40376.43 sec CREATE INDEX Time: 40383174.977 ms So that takes 11:13:03 in total -- we only managed to shave about 20 minutes off the total time taken, despite a 10x increase in maintenance_work_mem. Still, at least it gets moderately better, not worse, which is certainly what I'd expect from the master branch. 60GB was half way between 3GB and 30GB in terms of performance, so it doesn't continue to help, but, again, at least things don't get much worse. Thoughts on these results: * I'd really like to know the role of I/O here. Better, low-overhead instrumentation is required to see when and how we are I/O bound. I've been doing much of that on a more-or-less ad hoc basis so far, using iotop. I'm looking into a way to usefully graph the I/O activity over many hours, to correlate with the trace_sort output that I'll also show. I'm open to suggestions on the easiest way of doing that. Having used the "perf" tool for instrumenting I/O at all in the past. * Parallelism would probably help us here *a lot*. * As I said, I think we suffer from the lack of asynchronous I/O much more at this scale. Will need to confirm that theory. * It seems kind of ill-advised to make run size (which is always in linear proportion to maintenance_work_mem with this new approach to sorting) larger, because it probably will hurt writing runs more than it will help in making merging cheaper (perhaps mostly due to the lack of asynchronous I/O to hide the latency of writes -- Linux might not do so well at this scale). * Maybe adding actual I/O bandwidth is the way to go to get a better picture. I wouldn't be surprised if we were very bottlenecked on I/O here. Might be worth using many parallel EBS volumes here, for example.  http://sortbenchmark.org/FAQ-2015.html -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers