03:19 pm - Buffered async IO
We
have this infrastructure in the kernel for doing asynchronous IO, which
sits in fs/aio.c and fs/direct-io.c mainly. It works fine and it's
pretty fast. From userspace you would link with -laio and use the
functions there for hooking into the syscalls. However, it's only
really async for direct uncached IO (files opened with O_DIRECT).
This last little detail essentially means that it's largely useless to
most people. Oracle uses it, and other databases may use it too. But
nobody uses aio on the desktop or elsewhere simply because it isn't a
good fit when O_DIRECT is required - you then need
aligned IO buffers and transfer sizes and you lose readahead. The
alignment and size restrictions also make it difficult to convert
existing apps to use libaio with O_DIRECT, since it
requires more than just a straightforward conversion. It adds complexity.
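To make the alignment pain concrete, here is roughly what a single O_DIRECT read through libaio looks like (link with -laio). This is just a sketch: the 4096-byte alignment/size and the file name are illustrative assumptions, and the real granularity depends on the device and filesystem.

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch only: "somefile" and the 4096-byte alignment are assumptions. */
int main(void)
{
	io_context_t ctx;
	struct iocb cb, *cbs[1];
	struct io_event ev;
	void *buf;
	int fd;

	fd = open("somefile", O_RDONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	/* O_DIRECT wants aligned buffers, transfer sizes, and file offsets */
	if (posix_memalign(&buf, 4096, 4096))
		return 1;

	memset(&ctx, 0, sizeof(ctx));
	if (io_setup(1, &ctx))			/* create the aio context */
		return 1;

	io_prep_pread(&cb, fd, buf, 4096, 0);	/* offset must be aligned too */
	cbs[0] = &cb;
	if (io_submit(ctx, 1, cbs) != 1)
		return 1;

	io_getevents(ctx, 1, 1, &ev, NULL);	/* reap the completion */
	io_destroy(ctx);
	close(fd);
	return 0;
}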
Over
the years, several alternative approaches to async IO have been
proposed/developed. We've had several variants of
uatom/syslet/fibril/threadlet type schemes, which all boil down to
(just about) the same type of implementation - when a process is about
to block inside the kernel, a cloned process/thread returns to
userspace on behalf of the original process and informs it of the
postponed work. The completed events can then later be retrieved
through some sort of get_event/wait_event type interface, similar to
how you reap completions with other types of async IO. Or the
implementation provided a callback-type scheme, similar to how e.g. a
signal would be delivered to the process. This sort of setup performed
acceptably, but suffered from a split-personality problem: the task that
returns to userspace is not the same one that entered the kernel. Various
approaches to "fix" or alleviate this problem weren't particularly pretty.
Recently, Zach Brown started playing with
something he calls acall. The user interface is pretty neat, and (like
the syslet-style implementations above) it allows performing any system
call asynchronously. Zach took a simpler approach to making it async in
the kernel, punting every operation
to a thread in the kernel. This obviously works and means that the
submission interface is very fast, which is of course a good thing. It
also means that some operations are going to be performed by a different
process than the one that requested them, which has implications for IO
scheduling in the system. Advanced IO schedulers like CFQ track IOs on a
per-process basis, and they then need the specific process context for
both performance and
accounting reasons. Last year I added a CLONE_IO flag
for sharing the IO context across processes for situations like this,
so this part is actually easily fixable by just using that clone flag
for creation of the kernel worker thread. Obviously, doing a fork()-like
operation for every async system call isn't going to scale, so
some sort of thread pooling must be put in place for speeding it up.
Not sure what Zach has done there yet (I don't think a version with
that feature has been released yet), but even with that in place
there's still going to be identity fiddling when a thread is taking
over work from a userspace process. Apart from the overhead of
juggling these threads, there's also going to be a substantial increase
in context switch rates with this type of setup. And this last bit is
mostly why I don't think the acall approach will end up performing
acceptably for doing lots of IO, while it does seem to be quite nice
for the more casual async system call requirements.
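For reference, here is a userspace-visible sketch of what sharing the IO context with CLONE_IO looks like (0x80000000 in the kernel headers). In acall's case the flag would be applied when the kernel worker thread is created; worker(), spawn_io_worker() and the stack size below are made up purely for illustration.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>

#ifndef CLONE_IO
#define CLONE_IO	0x80000000	/* value from the kernel headers */
#endif

/* Illustrative worker; in acall this would be a kernel thread instead */
static int worker(void *arg)
{
	/* ... perform the punted operation on behalf of the submitter ... */
	return 0;
}

int spawn_io_worker(void *arg)
{
	const size_t stack_size = 64 * 1024;	/* arbitrary for the sketch */
	char *stack = malloc(stack_size);

	if (!stack)
		return -1;

	/*
	 * CLONE_IO makes the child share the parent's IO context, so CFQ
	 * accounts and schedules the worker's IO as if the parent issued it.
	 */
	return clone(worker, stack + stack_size,
		     CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_IO | SIGCHLD,
		     arg);
}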
A third and
final possibility also exists, and this is what I have been trying to
beat into submission lately. Back in the very early 2.6 kernel days,
Suparna Bhattacharya led an effort to add support for buffered IO to
the normal fs/aio.c submission path. The patches sat in Andrew's -mm
tree for some time, before they eventually got dropped due to Andrew
spending too much time massaging them into every -mm release. They work
by replacing the lock_page() call, used to wait for IO completion on a
page, with another helper function - I call it lock_page_async() in the
current patches. If the page is still locked when this function is
called, we return -EIOCBRETRY to the caller, informing it that it should
retry the call later on. When a process
in the kernel wants to wait for some type of event, a wait queue is
supplied. When that event occurs, the other end does a wake up call on
that wait queue. This last operation invokes a callback in the wait
queue, which normally just does a wake_up_process()-like call to wake
the process blocked on this event. With the async page waiting, we
simply provide our own wait queue in the process structure and the
callback given from fs/aio.c then merely adds the completed event to
the aio ringbuffer associated with the IO context for that IO and
process.
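To make that concrete, here is roughly what lock_page_async() boils down to, reconstructed from the description above rather than lifted from the patches. trylock_page() is the kernel's existing non-blocking lock attempt; the add_page_wait_queue() name and the wait argument are assumptions about the plumbing.

/*
 * Rough reconstruction of lock_page_async(), based on the description
 * above (not the actual patch). trylock_page() is the kernel's existing
 * non-blocking lock attempt; add_page_wait_queue() is an assumed helper
 * for hooking the wait entry up to the page's wait queue.
 */
static int lock_page_async(struct page *page, wait_queue_t *wait)
{
	if (trylock_page(page))
		return 0;		/* got the lock, carry on as before */

	/*
	 * Page is locked for IO. Arm the supplied wait entry so that
	 * unlock_page() fires its callback - for async IO that callback
	 * queues a completion in the aio ringbuffer instead of waking a
	 * sleeping task - and tell the caller to back out and retry.
	 */
	add_page_wait_queue(page, wait);
	return -EIOCBRETRY;
}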
This last approach means that we have to replace the
lock_page() calls in the IO path with lock_page_async() and be able to
handle an "error" return from that function. But apart from that,
things generally just work. The original process is still the one that
does the IO submission, so we don't have to play identity-theft tricks
or pay the large context switch fee for each IO. We also get
readahead. In other words, it works just like you expect buffered IO to
work. Additionally, the existing libaio interface works for this as
well. Currently my diffstat looks like this:
19 files changed, 445 insertions(+), 153 deletions(-)
for
adding support for the infrastructure, buffered async reads, and
buffered async writes for ext2 and ext3. That isn't bad, imho.
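And since the existing libaio interface carries over unchanged, submitting a buffered async read looks just like the O_DIRECT example earlier, minus the alignment dance - again a sketch, with an assumed file name.

#include <libaio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Sketch of a buffered async read: no O_DIRECT, no aligned buffer, and
 * readahead still applies. "somefile" is an assumption. */
int buffered_async_read(void)
{
	io_context_t ctx;
	struct iocb cb, *cbs[1];
	struct io_event ev;
	char buf[8192];
	int fd = open("somefile", O_RDONLY);

	if (fd < 0)
		return -1;

	memset(&ctx, 0, sizeof(ctx));
	if (io_setup(1, &ctx))
		return -1;

	io_prep_pread(&cb, fd, buf, sizeof(buf), 0);
	cbs[0] = &cb;
	if (io_submit(ctx, 1, cbs) != 1)
		return -1;

	/*
	 * With the buffered aio patches this completes asynchronously on a
	 * cache miss instead of doing the work synchronously in io_submit().
	 */
	io_getevents(ctx, 1, 1, &ev, NULL);
	io_destroy(ctx);
	close(fd);
	return 0;
}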
Initial performance testing looks encouraging. I'll be back with more
numbers and details as progress and time permit!