Date: Fri, 17 Aug 2012 00:26:37 +0100
From: Peter Geoghegan <>
To: Jeff Janes <>
Cc: pgsql-hackers <>
Subject: Re: tuplesort memory usage: grow_memtuples
Message-ID: <>

On 27 July 2012 16:39, Jeff Janes <> wrote:
>>  Can you suggest a benchmark that will usefully exercise this patch?
>  I think the given sizes below work on most 64 bit machines.

I think this patch (or at least your observation about I/O waits
within vmstat) may point to a more fundamental issue with our sort
code: Why are we not using asynchronous I/O in our implementation?
There are anecdotal reports of other RDBMS implementations doing far
better than we do here, and I believe asynchronous I/O, pipelining,
and other such optimisations have a lot to do with that. It's
something I'd hoped to find the time to look at in detail, but
probably won't in the 9.3 cycle. One of the more obvious ways of
optimising an external sort is to use asynchronous I/O so that one run
of data can be sorted or merged while other runs are being read from
or written to disk. Our current implementation seems naive about this.
There are some interesting details about how this is exposed by POSIX

I've recently tried extending the PostgreSQL prefetch mechanism on Linux to use the POSIX (i.e. librt) aio_read and friends where possible. In other words, in PrefetchBuffer(), try getting a buffer and issuing aio_read before falling back to posix_fadvise(). It gives me about an 8% improvement in throughput relative to the posix_fadvise variety, for a workload of 16 highly disk-read-intensive applications running against 16 backends. For my test, each application runs a query chosen to have plenty of bitmap heap scans.

I can provide more details on my changes if interested.

On whether this technique might improve sort performance:

First, the disk access pattern for sorting is mostly sequential (although I think the sort module does some tricky work with reuse of pages in its "logtape" files, which may be random-like), and there are several claims on the net that Linux buffered file handling already does a pretty good job of read-ahead for a sequential access pattern, without any need for the application to help it. I can half-confirm that, in that I tried adding calls to PrefetchBuffer to the regular heap scan and did not see much improvement. But I am still pursuing that area.

But second, it would be easy enough to add some posix_fadvise calls to sort and see whether that helps. (We can't make use of PrefetchBuffer, since sort does not use the regular relation buffer pool.)

It's already anticipated that we might take advantage of libaio for
the benefit of FilePrefetch() (see its accompanying comments - it uses
posix_fadvise itself - effective_io_concurrency must be > 0 for this
to ever be called). It perhaps could be considered parallel
"low-hanging fruit" in that it allows us to offer limited though
useful backend parallelism without first resolving thorny issues
around what abstraction we might use, or how we might eventually make
backends thread-safe. AIO supports registering signal callbacks (a
SIGPOLL handler can be called), which seems relatively

I believe libaio is dead, as it depended on the old Linux kernel asynchronous file I/O, which was problematic and imposed various restrictions on the application. librt aio has no such restrictions and does a good enough job, but it uses pthreads and synchronous I/O internally, which can make the CPU overhead a bit heavy, and I believe it also causes more context switching than kernel async I/O would, whereas one of the benefits of kernel async I/O (in theory) is reduced context switching.

From what I've seen, pthreads aio can give a benefit when there is high I/O wait from mostly-read activity, the disk access pattern is not sequential (so kernel readahead can't predict it) but PostgreSQL can predict it, and there is enough spare idle CPU to run the pthreads. So it does seem that bitmap heap scan is a good choice for prefetching.

Platform support for AIO might be a bit lacking, but then you can say
the same about posix_fadvise. We don't assume that poll(2) is
available, but we already use it where it is within the latch code.
Besides, in-kernel support can be emulated if POSIX threads are
available, which I believe would make this broadly useful on Unix-like platforms.

-- Peter Geoghegan PostgreSQL Development, 24x7 Support, Training and Services
