On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan <p...@heroku.com> wrote: > Setup: > > CREATE TABLE parallel_sort_test AS > SELECT hashint8(i) randint, > md5(i::text) collate "C" padding1, > md5(i::text || '2') collate "C" padding2 > FROM generate_series(0, 1e9::bigint) i; > > CHECKPOINT; > > This leaves us with a parallel_sort_test table that is 94 GB in size. > > SET maintenance_work_mem = '8GB'; > > -- Serial case (external sort, should closely match master branch): > CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH > (parallel_workers = 0); > > Total time: 00:15:42.15 > > -- Patch with 8 tuplesort "sort-and-scan" workers (leader process > participates as a worker here): > CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH > (parallel_workers = 7); > > Total time: 00:06:03.86 > > As you can see, the parallel case is 2.58x faster
I decided to revisit this exact benchmark, using the same AWS instance type (the one with 16 HDDs, again configured in software RAID0) to see how things had changed for both parallel and serial cases. I am now testing V4. A lot changed in the last 3 months, with most of the changes that help here now already committed to the master branch. Relevant changes ================ * Heikki's major overhaul of preload memory makes CREATE INDEX merging have more sequential access patterns. It also effectively allows us to use more memory. It's possible that the biggest benefit it brings to parallel CREATE INDEX is that is eliminates almost any random I/O penalty from logtape.c fragmentation that an extra merge pass has; parallel workers now usually do their own merge to produce one big run for the leader to merge. It also improves CPU cache efficiency quite directly, I think. This is the patch that helps most. Many thanks to Heikki for driving this forward. * My patch to simplify and optimize how the K-way merge heap is maintained (as tuples fill leaf pages of the final index structure) makes the merge phase significantly less CPU bound overall. (These first two items particularly help parallel CREATE INDEX, which spends proportionally much more wall clock time merging than would be expected for similar serial cases. Serial cases do of course also benefit.) * V2 of the patch (and all subsequent versions) apportioned slices of maintenance_work_mem to workers. maintenance_work_mem became a per-utility-operation budget, regardless of number of workers launched. This means that workers have less memory than the original V1 benchmark (they simply don't make use of it now), but this seems unlikely to hurt. Possibly, it even helps. * Andres' work on md.c scalability may have helped (seems unlikely with these CREATE INDEX cases that produce indexes not in the hundreds of gigabytes, though). It would help with *extremely* large index creation, which we won't really look at here. Things now look better than ever for the parallel CREATE INDEX patch. While it's typical for about 75% of wall clock time to be spent on sorting runs with serial CREATE INDEX, with the remaining 25% going on merging/writing index, with parallel CREATE INDEX I now generally see about a 50/50 split between parallel sorting of runs (including any worker merging to produce final runs) and serial merging for final on-the-fly merge where we actually write new index out as input is merged. This is a *significant* improvement over what we saw here back in August, where it was not uncommon for parallel CREATE INDEX to spend *twice* as much time in the serial final on-the-fly merge step. All improvements to the code that we've seen since August have targeted this final on-the-fly merge bottleneck. (The final on-the-fly merge is now *consistently* able to write out the index at a rate of 150MB/sec+ in my tests, which is pretty good.) New results ========== Same setup as one quoted above -- once again, we "SET maintenance_work_mem = '8GB'". -- Patch with 8 tuplesort "sort-and-scan" workers: CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH (parallel_workers = 7); Total time: 00:04:24.93 -- Serial case: CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH (parallel_workers = 0); Total time: 00:14:25.19 3.27x faster. Not bad. As you see in the quoted text, that was 2.58x back in August, even though the implementation now uses a lot less memory in parallel workers. And, that's without even considering the general question of how much faster index creation can be compared to Postgres 9.6 -- it's roughly 3.5x faster at times. New case ======== Separately, using my gensort tool [1], I came up with a new test case. The tool generated a 2.5 billion row table, sized at 159GB. This is how long is takes to produce a 73GB index on the "sortkey" column of the resulting table: -- gensort "C" locale text parallel case: CREATE INDEX test8 on sort_test(sortkey) WITH (parallel_workers = 7); Total time: 00:16:19.63 -- gensort "C" locale text serial case: CREATE INDEX test0 on sort_test(sortkey) WITH (parallel_workers = 0); Total time: 00:45:56.96 That's a 2.81x improvement in creation time relative to a serial case. Not quite as big a difference as seen in the first case, but remember that this is just like cases that were only made something like 2x - 2.2x faster by the use of parallelism back in August (see full e-mail quoted above [2]). These are cases involving a text column, or maybe a numeric column, that have complex comparators used during merging that must handle detoasting, possibly even allocate memory, etc. This second result is therefore probably the more significant of the two results shown, since it now seems like we're more consistently close to the ~3x improvement that other major database systems also seem to top out at as parallel CREATE INDEX workers are added. (I still can't see any benefit with 16 workers; my guess is that the anti-scaling begins even before the merge starts when using that many workers. That guess is hard to verify, given the confounding factor of more workers producing more runs, leaving more work for the serial merge phase.) I'd still welcome benchmarking or performance validation from somebody else. [1] https://github.com/petergeoghegan/gensort [2] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=cahzrixkjp2eo5q0jg1sofqqexfq647ji...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers