Date: Fri, 17 Aug 2012 00:26:37 +0100
From: Peter Geoghegan <>
To: Jeff Janes <>
Cc: pgsql-hackers <>
Subject: Re: tuplesort memory usage: grow_memtuples
Message-ID: <>

On 27 July 2012 16:39, Jeff Janes <> wrote:
>>  Can you suggest a benchmark that will usefully exercise this patch?
>  I think the given sizes below work on most 64 bit machines.

I think this patch (or at least your observation about I/O waits
within vmstat) may point to a more fundamental issue with our sort
code: Why are we not using asynchronous I/O in our implementation?
There are anecdotal reports of other RDBMS implementations doing far
better than we do here, and I believe asynchronous I/O, pipelining,
and other such optimisations have a lot to do with that. It's
something I'd hoped to find the time to look at in detail, but
probably won't in the 9.3 cycle. One of the more obvious ways of
optimising an external sort is to use asynchronous I/O so that one run
of data can be sorted or merged while other runs are being read from
or written to disk. Our current implementation seems naive about this.
There are some interesting details about how this is exposed by POSIX

I've recently tried extending the PostgreSQL prefetch mechanism on Linux to use the POSIX (i.e. librt) aio_read and friends where possible. In other words, in PrefetchBuffer(), try getting a buffer and issuing aio_read before falling back to posix_fadvise(). It gives me about an 8% improvement in throughput relative to the posix_fadvise variety, for a workload of 16 highly disk-read-intensive applications running against 16 backends. For my test, each application runs a query chosen to have plenty of bitmap heap scans.

I can provide more details on my changes if interested.

On whether this technique might improve sort performance:

First, the disk access pattern for sorting is mostly sequential (although I think the sort module does some tricky work with reuse of pages in its "logtape" files, which may be random-like), and there are several claims on the net that Linux buffered file handling already does a pretty good job of read-ahead for a sequential access pattern, without any need for the application to help it. I can half-confirm that, in that I tried adding calls to PrefetchBuffer to the regular heap scan and did not see much improvement. But I am still pursuing that area.

But second, it would be easy enough to add some posix_fadvise calls to sort and see whether that helps. (We can't make use of PrefetchBuffer, since sort does not use the regular relation buffer pool.)

It's already anticipated that we might take advantage of libaio for
the benefit of FilePrefetch() (see its accompanying comments - it uses
posix_fadvise itself - effective_io_concurrency must be > 0 for this
to ever be called). It perhaps could be considered parallel
"low-hanging fruit" in that it allows us to offer limited though
useful backend parallelism without first resolving thorny issues
around what abstraction we might use, or how we might eventually make
backends thread-safe. AIO supports registering signal callbacks (a
SIGPOLL handler can be called), which seems relatively

I believe libaio is dead, as it depended on the old Linux kernel asynchronous file I/O, which was problematic and imposed various restrictions on the application. librt aio has no such restrictions and does a good enough job, but it uses pthreads and synchronous I/O internally, which can make the CPU overhead a bit heavy, and I believe it also causes more context switching than kernel async I/O would, whereas one of the benefits of kernel async I/O (in theory) is reduced context switching.

From what I've seen, pthreads aio can give a benefit when there is high I/O wait from mostly-read activity, the disk access pattern is not sequential (so kernel readahead can't predict it) but PostgreSQL can predict it, and there is enough spare idle CPU to run the pthreads. So it does seem that bitmap heap scan is a good choice for prefetching.

Platform support for AIO might be a bit lacking, but then you can say
the same about posix_fadvise. We don't assume that poll(2) is
available, but we already use it where it is within the latch code.
Besides, in-kernel support can be emulated if POSIX threads are
available, which I believe would make this broadly useful on Unix-like platforms.

-- Peter Geoghegan PostgreSQL Development, 24x7 Support, Training and Services
