Postgres 9.6 significantly altered the performance characteristics of
sorting. While almost every external sort will see a benefit without
any change in configuration, there is new scope for DBAs to tune the
system to take better advantage of the improved implementation of
external sorting. They should get guidance from us on this to make it
more likely that that actually happens, though.

I'd like to discuss where this guidance should be added, but first, a
little background.

Firstly, DBAs may see an appreciable benefit from making sure that
temp_tablespaces puts temp files on a fast filesystem; this would
probably have been far less helpful prior to 9.6. It seems
reasonable to suppose that explicitly setting temp_tablespaces does
not happen all that often on production installations today, since
external sorts were stalled on memory access most of the time with the
old replacement selection approach. While memory stalls remain a big
cost in 9.6, the situation is much improved. External sorts may be
significantly I/O bound far more frequently in 9.6, especially during
merging (which is also optimized in 9.6), and especially during
merging for CREATE INDEX in particular. In the latter case, it's
likely that the system writes index tuples and WAL out as it reads
runs in, with everything on the same filesystem.
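For concreteness, the setup might look like this (the mount point and
tablespace name here are hypothetical; substitute whatever fast
filesystem is available):

```sql
-- Hypothetical: /mnt/fasttmp is a filesystem on fast storage,
-- owned by the postgres OS user.
CREATE TABLESPACE fasttemp LOCATION '/mnt/fasttmp/pgtemp';

-- Send temp files (sort/hash spill files) there, cluster-wide:
ALTER SYSTEM SET temp_tablespaces = 'fasttemp';
SELECT pg_reload_conf();
```

Since temp files are not needed for crash recovery, losing this
filesystem costs nothing but in-flight queries.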

Modern filesystems are really good at read-ahead and write-behind. It
really is not at all unexpected for sequential I/O with fast *disks*
to be faster than random *memory* access [1], and so I really doubt
that there is any question that I/O will now matter more. I expect
that blocking on sequential I/O will become much more of a problem
when parallel sort is added, but it's still going to be a problem with
serial sorts on busy production systems servicing many clients. (With
only a single client, this is much less true, but that's seldom the
case on a production system.)

Secondly, with 9.6 it's roughly true that the more memory you can
spare for maintenance_work_mem and work_mem, the better. There are
some caveats that I won't rehash right now; the caveats are rather
limited, and DBAs can probably just pretend that the common, simple
intuition "more memory is better, until you OOM" is at last true. I
don't imagine more information (the caveats) is of much practical use.
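So, for a big index build, a DBA might simply raise the setting for
the session (the table and value here are made up for illustration;
on 64-bit platforms maintenance_work_mem can go well beyond the old
comfort zone):

```sql
-- Per-session override for one large index build. In 9.6, external
-- sorts no longer get *slower* as this is raised, so generous values
-- are reasonable if the memory can be spared.
SET maintenance_work_mem = '1GB';
CREATE INDEX idx_events_created_at ON events (created_at);
```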

A duality

Adding faster disks to get more sequential write bandwidth is a
practical measure available to many DBAs, which is more important for
sorting now. While not every user can afford to do so, it's nice that
some have the *option* of upgrading to get better performance. It need
not be expensive; we get plenty of benefit from cheaper SATA HDDs,
since, with care, only sequential I/O performance really matters. This
was not really the case in prior releases. At the same time, if you
can afford the memory, you should probably just increase
work_mem/maintenance_work_mem to get another performance benefit.
These two benefits will often be complementary. There may be a
feedback loop that looks a bit like this:

Quicksort is cache oblivious, which implies that we may effectively
use more working memory, which implies longer runs, which implies
fewer merge passes, which implies that sequential I/O performance is
mostly where I/O costs are paid, which implies that you can manage I/O
costs by adding (consumer grade?) SATA HDDs configured in RAID0 for
temp files, which implies that sorts finish sooner, which implies that
in aggregate memory use is lower, which implies that you can afford to
set work_mem/maintenance_work_mem higher still, which implies ... .
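To see why longer runs translate into fewer merge passes, here is a
back-of-the-envelope sketch (this is *not* tuplesort's actual cost
model; the merge order and the assumption that initial runs are about
work_mem-sized are simplifications of mine):

```python
import math

def merge_passes(input_mb, work_mem_mb, merge_order=500):
    """Rough estimate of external-sort merge passes: quicksort
    produces initial runs of about work_mem in size, and each
    merge pass combines up to merge_order runs at a time."""
    runs = math.ceil(input_mb / work_mem_mb)
    if runs <= 1:
        return 0  # a single run (or in-memory sort): no merging
    return math.ceil(math.log(runs, merge_order))

# 100GB sort: 4MB work_mem needs 2 passes over the data on disk,
# while 256MB work_mem gets it done in a single pass.
```

The point is that each avoided pass is a full read and write of the
spilled data, so modest memory increases can pay off out of
proportion to their size.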

This may seem facile, but consider the huge difference in costs I'm
focussing on. There is a huge difference between the cost of random
I/O and sequential I/O (when FS cache is not able to "amortize" the
random I/O), just as there is a huge difference between the cost of an
L1 cache access, and main memory access (a cache miss). So, if I can
be forgiven for using such a hand-wavey term, there is an interesting
duality here. We should avoid "the expensive variety" of I/O (random
I/O) wherever possible, and avoid "the expensive variety" of memory
access (involving a last-level CPU cache miss) as much as possible.
Certainly, the economics of modern hardware strongly support
increasing locality of access at every level. We're talking about
differences of perhaps several orders of magnitude.

I/O bound sorts in earlier releases

Perhaps someone has personally seen a DBA adding a new temp_tablespace
on a separate, faster filesystem. Perhaps this clearly improved
performance. I don't think that undermines my argument, though. This
*does* sometimes happen on prior versions of Postgres, but my guess is
that this is almost accidental:

* Increasing work_mem/maintenance_work_mem perversely makes external
sorts *slower* before 9.6. Iterative tuning of these settings on busy
production systems will therefore tend to guide the DBA to decrease
work_mem/maintenance_work_mem settings (maybe external sorts got so
slow that OOMs happened). Whether or not the DBA realizes it, the
counter-intuitive slowdown occurs because replacement selection is
sensitive to CPU cache size.

* As those memory settings are pushed down, runs become smaller, and
more merge steps are required for any larger external sorts performed.
Merge steps increase the amount of I/O, particularly when a few passes
are necessary; *random* I/O suddenly spikes. Of course, this could
have been avoided if work_mem/maintenance_work_mem were higher, but
that's already been ruled out.


I think we can expand "21.6. Tablespaces" to describe the implications
of these new performance characteristics. I'd like to hear opinions on
how to approach that before proposing a patch, though. The basic
guidance should, IMV, be:

* A temp tablespace with cheaper disks that have good sequential I/O
performance can speed up external sorts quite a lot. Probably not a
great idea to have many temp tablespaces. Use RAID0 instead, because
that performs better, and because it doesn't matter that temp files
are not recoverable if a disk is faulty.

* More memory for sorting and hashing is often better in PostgreSQL
9.6. Notably, the performance of hash joins that spill will tend to
degrade less predictably than the performance of sorts that spill as
less memory is made available. (Perhaps mention the number of external
sort passes?)

* Increasing work_mem/maintenance_work_mem may fail to improve
performance only because sorts then become more I/O bound. When in
doubt, testing is advised. A balance may need to be found, if only to
avoid wasting memory.


Peter Geoghegan
