On Wed, May 28, 2014 at 2:19 PM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
> How portable is POSIX aio nowadays? Googling around, it still seems that on
> Linux, it's implemented using threads. Does the thread-emulation
> implementation cause problems with the rest of the backend, which assumes
> that there is only a single thread? In any case, I think we'll want to
> encapsulate the AIO implementation behind some kind of an API, to allow
> other implementations to co-exist.

I think POSIX aio is pretty damn standard and it's a pretty fiddly
interface. If we abstract it behind an i/o interface we're going to
lose a lot of the power. Abstracting it behind a set of buffer manager
operations (initiate i/o on buffer, complete i/o on buffer, abort i/o
on buffer) should be fine but that's basically what we have, no?
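
A minimal sketch of what those three buffer-level operations could look
like on top of POSIX aio. Everything named demo_* or DEMO_*, and the
DemoBuffer struct, is hypothetical and invented for illustration; it is
not the Postgres buffer manager API. Only aio_read, aio_error,
aio_suspend, aio_return and aio_cancel are real POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L   /* expose POSIX aio and mkstemp */
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_BLCKSZ 8192          /* hypothetical block size for this sketch */

/* Hypothetical buffer descriptor: at most one in-flight read per buffer. */
typedef struct DemoBuffer
{
    struct aiocb cb;              /* POSIX aio control block */
    int          io_in_progress;  /* "i/o initiated but not completed" */
    char         data[DEMO_BLCKSZ];
} DemoBuffer;

/* Initiate i/o on buffer: queue a read of block blkno. Returns 0 or -1. */
int
demo_initiate_io(DemoBuffer *buf, int fd, off_t blkno)
{
    assert(!buf->io_in_progress);
    memset(&buf->cb, 0, sizeof(buf->cb));
    buf->cb.aio_fildes = fd;
    buf->cb.aio_buf = buf->data;
    buf->cb.aio_nbytes = DEMO_BLCKSZ;
    buf->cb.aio_offset = blkno * DEMO_BLCKSZ;
    buf->cb.aio_sigevent.sigev_notify = SIGEV_NONE;   /* caller will poll */
    if (aio_read(&buf->cb) != 0)
        return -1;
    buf->io_in_progress = 1;
    return 0;
}

/* Complete i/o on buffer: wait, then reap the result (bytes read or -1). */
ssize_t
demo_complete_io(DemoBuffer *buf)
{
    const struct aiocb *const list[1] = { &buf->cb };
    int         err;

    assert(buf->io_in_progress);
    while ((err = aio_error(&buf->cb)) == EINPROGRESS)
        (void) aio_suspend(list, 1, NULL);  /* may return early; loop re-checks */
    buf->io_in_progress = 0;
    ssize_t n = aio_return(&buf->cb);       /* reap the request exactly once */
    if (err != 0)
    {
        errno = err;
        return -1;
    }
    return n;
}

/* Abort i/o on buffer: try to cancel, then wait for and reap whatever happened. */
void
demo_abort_io(DemoBuffer *buf)
{
    if (!buf->io_in_progress)
        return;
    (void) aio_cancel(buf->cb.aio_fildes, &buf->cb);
    (void) demo_complete_io(buf);   /* waits if the request could not be canceled */
}

/* Usage example: write two blocks to a temp file, read block 1 back. */
int
demo_roundtrip(void)
{
    char        path[] = "/tmp/aio_demo_XXXXXX";
    static DemoBuffer buf;
    static char block[DEMO_BLCKSZ];
    int         fd = mkstemp(path);
    int         ok;

    if (fd < 0)
        return -1;
    unlink(path);
    memset(block, 'A', sizeof(block));
    ok = write(fd, block, sizeof(block)) == (ssize_t) sizeof(block);
    memset(block, 'B', sizeof(block));
    ok = ok && write(fd, block, sizeof(block)) == (ssize_t) sizeof(block);
    ok = ok && demo_initiate_io(&buf, fd, 1) == 0
        && demo_complete_io(&buf) == DEMO_BLCKSZ
        && buf.data[0] == 'B' && buf.data[DEMO_BLCKSZ - 1] == 'B';
    if (ok && demo_initiate_io(&buf, fd, 0) == 0)
        demo_abort_io(&buf);        /* exercise the abort path as well */
    close(fd);
    return ok ? 0 : -1;
}
```

On older glibc versions this needs to be linked with -lrt. Which aio
implementation actually services the request (kernel or threads)
depends on the library you link against, as discussed below.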

I don't think the threaded implementation on Linux is the one to use
though. I find this *super* confusing but the kernel definitely
supports aio syscalls, glibc also has a threaded implementation it
uses if run on a kernel that doesn't implement the syscalls, and I
think there are existing libaio and librt libraries from outside glibc
that do one or the other. Which you build against seems to make a big
difference. My instinct is that anything but the kernel native
implementation will be worthless. The overhead of thread communication
will completely outweigh any advantage over posix_fadvise's partial

The main advantage of posix aio is that we can actually receive the
data out of order. With posix_fadvise we can get the i/o and cpu
overlap but we will never process the later blocks until the earlier
requests are satisfied and processed in order. With aio you could do a
sequential scan, initiating i/o on 1,000 blocks and then processing
them as they arrive, initiating new requests as those blocks are

When I investigated this I found the buffer manager's I/O bits seemed
to already be able to represent the state we needed (i/o initiated on
this buffer but not completed). The problem was in ensuring that a
backend would process the i/o completion promptly when it might be in
the midst of handling other tasks and might even get an elog() stack
unwinding. The interface that actually fits Postgres best might be the
threaded interface (orthogonal to the threaded implementation
question), in which you give aio a callback that gets called on a
separate thread when the i/o completes. The alternative is to give
aio a list of operation control blocks and have it tell you the state
of all the i/o operations. But it's not clear to me how you would
arrange to do that regularly, promptly, and reliably.
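
For reference, a sketch of the callback style using POSIX SIGEV_THREAD
notification. The DemoRequest struct and all demo_* functions are
hypothetical names for illustration, and the busy-wait in
demo_wait_done is only there to make the example self-contained; it
says nothing about how a backend would really learn of the completion.
The list-based alternative would instead pass an array of aiocb
pointers to aio_suspend and check each one with aio_error.

```c
#define _POSIX_C_SOURCE 200809L   /* expose POSIX aio, nanosleep, mkstemp */
#include <aio.h>
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical request wrapper: control block plus completion state. */
typedef struct DemoRequest
{
    struct aiocb cb;
    atomic_int   done;      /* set to 1 by the notification thread */
    int          error;     /* aio_error() result at completion */
    ssize_t      nbytes;    /* aio_return() result at completion */
} DemoRequest;

/* Runs on a thread created by the aio implementation, not the caller's. */
static void
demo_on_complete(union sigval sv)
{
    DemoRequest *req = sv.sival_ptr;

    req->error = aio_error(&req->cb);
    req->nbytes = aio_return(&req->cb);
    atomic_store(&req->done, 1);    /* publish the results to the caller */
}

/* Queue a read and ask to be called back on completion. Returns 0 or -1. */
int
demo_start_read(DemoRequest *req, int fd, void *dst, size_t len, off_t off)
{
    memset(req, 0, sizeof(*req));
    atomic_init(&req->done, 0);
    req->cb.aio_fildes = fd;
    req->cb.aio_buf = dst;
    req->cb.aio_nbytes = len;
    req->cb.aio_offset = off;
    req->cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    req->cb.aio_sigevent.sigev_notify_function = demo_on_complete;
    req->cb.aio_sigevent.sigev_value.sival_ptr = req;
    return aio_read(&req->cb);
}

/* Poll the flag for up to roughly five seconds; returns 1 once done. */
int
demo_wait_done(DemoRequest *req)
{
    const struct timespec tick = { 0, 1000000 };    /* 1 ms */

    for (int i = 0; i < 5000 && !atomic_load(&req->done); i++)
        nanosleep(&tick, NULL);
    return atomic_load(&req->done);
}

/* Usage example: returns the number of bytes read back, or -1. */
long
demo_callback_roundtrip(void)
{
    char        path[] = "/tmp/aio_cb_XXXXXX";
    static char src[4096], dst[4096];
    static DemoRequest req;
    int         fd = mkstemp(path);
    long        got = -1;

    if (fd < 0)
        return -1;
    unlink(path);
    memset(src, 'x', sizeof(src));
    if (write(fd, src, sizeof(src)) == (ssize_t) sizeof(src) &&
        demo_start_read(&req, fd, dst, sizeof(dst), 0) == 0 &&
        demo_wait_done(&req) && req.error == 0 &&
        memcmp(src, dst, sizeof(src)) == 0)
        got = (long) req.nbytes;
    close(fd);
    return got;
}
```

Note that the callback only records the result and sets a flag. Doing
real work on the notification thread would run into the
single-threaded-backend assumption raised in the quoted question.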

The other gotcha here is that the kernel implementation only does
anything useful on files opened with O_DIRECT. That means you have to
do *all* the prefetching and i/o scheduling yourself. You would be
doing that
anyways for sequential scans and bitmap scans -- and we already do it
with things like synchronised scans and posix_fadvise -- but index
scans would need to get some intelligence for when it makes sense to
read more than one page at a time.  It might be possible to do
something fairly coarse like having our i/o operators keep track of
how often i/o on a relation falls within a certain number of blocks of
an earlier i/o and autotune the number of blocks to read based on that.
It might not be hard to do better than the kernel with even basic info
like what level of the index we're reading or what type of pointer
we're following.
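
A rough sketch of that kind of coarse autotuning, with no actual i/o
involved. The struct, the function names and every threshold here
(DEMO_NEAR_DISTANCE, DEMO_SAMPLE, the 75%/25% cutoffs, doubling and
halving) are made-up assumptions chosen only to make the idea
concrete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NEAR_DISTANCE 8    /* "within a certain number of blocks" */
#define DEMO_SAMPLE        16   /* re-evaluate after this many i/os */
#define DEMO_MIN_WINDOW    1
#define DEMO_MAX_WINDOW    64

/* Hypothetical per-relation state kept by the i/o layer. */
typedef struct DemoReadAhead
{
    uint32_t last_block;    /* block number of the previous i/o */
    int      have_last;     /* 0 until the first i/o is seen */
    uint32_t near_hits;     /* i/os that landed near the previous one */
    uint32_t total;         /* i/os counted in the current sample */
    uint32_t window;        /* blocks to read per request */
} DemoReadAhead;

void
demo_readahead_init(DemoReadAhead *s)
{
    assert(s != NULL);
    s->last_block = 0;
    s->have_last = 0;
    s->near_hits = 0;
    s->total = 0;
    s->window = DEMO_MIN_WINDOW;
}

/* Record an i/o on `block`; returns the number of blocks to read next time. */
uint32_t
demo_readahead_note_io(DemoReadAhead *s, uint32_t block)
{
    if (s->have_last)
    {
        uint32_t dist = block > s->last_block
            ? block - s->last_block
            : s->last_block - block;

        if (dist <= DEMO_NEAR_DISTANCE)
            s->near_hits++;
        s->total++;
    }
    s->last_block = block;
    s->have_last = 1;

    if (s->total == DEMO_SAMPLE)
    {
        /* Mostly nearby accesses: read more. Mostly scattered: read less. */
        if (s->near_hits * 4 >= s->total * 3 && s->window < DEMO_MAX_WINDOW)
            s->window *= 2;
        else if (s->near_hits * 4 <= s->total && s->window > DEMO_MIN_WINDOW)
            s->window /= 2;
        s->near_hits = 0;
        s->total = 0;
    }
    return s->window;
}
```

Feeding it 33 consecutive block numbers grows the window from 1 to 4;
16 widely scattered accesses after that shrink it back to 2. Index
level or pointer type could be added as extra inputs to the same
decision.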

Finally, when I did the posix_fadvise work I wrote a synthetic
benchmark for testing the equivalent i/o pattern of a bitmap scan. It
let me simulate bitmap scans of varying densities with varying
parameters, notably how many i/os to keep in flight at once. It
supported posix_fadvise or aio. You should look it up in the archives,
it made for some nice looking graphs. IIRC I could not find any build
environment where aio offered any performance boost at all. I think
this means I just didn't know how to build it against the right
libraries or wasn't using the right kernel or there was some skew
between them at the time.
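
This is not that benchmark, just a minimal sketch of the
posix_fadvise side of the pattern it tested: walk a list of block
numbers and keep up to in_flight POSIX_FADV_WILLNEED hints issued
ahead of the block being read. All demo_* names and DEMO_BLCKSZ are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L   /* expose posix_fadvise, pread, mkstemp */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_BLCKSZ 8192          /* hypothetical block size for this sketch */

/*
 * Read the listed blocks in order, hinting up to in_flight upcoming blocks
 * to the kernel before each read. Returns total bytes read, or -1 on error.
 * The first block is never hinted; it is simply read synchronously.
 */
long
demo_scan_with_prefetch(int fd, const off_t *blocks, size_t nblocks,
                        size_t in_flight)
{
    static char scratch[DEMO_BLCKSZ];
    size_t      next_hint = 0;      /* index of the next block to hint */
    long        total = 0;

    for (size_t i = 0; i < nblocks; i++)
    {
        if (next_hint <= i)
            next_hint = i + 1;      /* never hint the block we are about to read */
        while (next_hint < nblocks && next_hint <= i + in_flight)
        {
            (void) posix_fadvise(fd, blocks[next_hint] * DEMO_BLCKSZ,
                                 DEMO_BLCKSZ, POSIX_FADV_WILLNEED);
            next_hint++;
        }

        /* Blocks are still consumed strictly in list order. */
        ssize_t n = pread(fd, scratch, DEMO_BLCKSZ, blocks[i] * DEMO_BLCKSZ);

        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}

/* Usage example: 16-block temp file, scan four scattered blocks. */
long
demo_prefetch_roundtrip(size_t in_flight)
{
    char        path[] = "/tmp/fadvise_demo_XXXXXX";
    static char block[DEMO_BLCKSZ];
    const off_t wanted[] = { 3, 7, 1, 12 };
    int         fd = mkstemp(path);
    long        total;

    if (fd < 0)
        return -1;
    unlink(path);
    for (int i = 0; i < 16; i++)
    {
        if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
        {
            close(fd);
            return -1;
        }
    }
    total = demo_scan_with_prefetch(fd, wanted, 4, in_flight);
    close(fd);
    return total;
}
```

A real measurement would need a file much larger than RAM (or a
dropped page cache) and timing around the scan; this only shows the
shape of the loop and the in_flight knob.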


Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)