Nice investigation Phil!
Results look great. Isn't this approach similar to what Julian was doing
with the threaded I/O work in the kunkel branch? (with the exception of
the O_DIRECT stuff?)

It also looks like the Trove backend could benefit from the readx/writex
POSIX extension patches to the kernel and ext3, although that would be a
lot of work I guess...
thanks,
Murali

On Thu, 10 Aug 2006, Phil Carns wrote:

> Background:
>
> We have been a little suspicious of the posix aio performance on some of
> our servers. After digging in the glibc code a little, we found a
> possible problem. Glibc's aio will spawn up to 16 threads by default,
> but will never assign more than a single thread to a given fd. That
> thread will then service all operations on that fd sequentially using a
> FIFO queue. This means that if several clients are performing I/O to the
> same datafile, then all of their I/O requests get pushed to the disk
> sequentially (and probably not in order by offset).
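To make the failure mode concrete, here is a minimal sketch (not PVFS2 code) of submitting several POSIX AIO reads against one fd. With glibc's user-space implementation described above, all of these requests land in a single per-fd FIFO queue and are serviced one at a time by one thread, regardless of how many worker threads glibc has spawned:

```c
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Submit n reads of 4 bytes each against the same fd through POSIX
 * AIO, then wait for all of them.  Per the discussion above, glibc
 * queues all n requests behind a single thread for this fd. */
int read_regions(int fd, char bufs[][8], const off_t *offsets, int n)
{
    struct aiocb cbs[n];
    struct aiocb *list[n];

    memset(cbs, 0, sizeof(cbs));
    for (int i = 0; i < n; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = 4;
        cbs[i].aio_offset = offsets[i];
        if (aio_read(&cbs[i]) != 0)
            return -1;
        list[i] = &cbs[i];
    }

    /* Block until every request has completed. */
    for (int i = 0; i < n; i++) {
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend((const struct aiocb *const *)list, n, NULL);
        if (aio_return(&cbs[i]) < 0)
            return -1;
    }
    return 0;
}
```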
>
> Patch:
>
> This patch replaces the lio_listio() calls with a macro called
> LIO_LISTIO(). You can then toggle what this macro does by using a
> config file option "TroveAltAIOMode yes|no". If the option is not
> specified (or is set to no) then the normal code path is taken. If the
> option is enabled, then it looks at the arguments. If the operation is
> a single-buffer read or write, then it immediately spawns a new detached
> thread, services the operation using p{read,write}(), triggers a
> callback function, and exits. More complex operations are sent down the
> usual lio_listio() route.
>
> The idea is basically to get the requests to the kernel as quickly as
> possible, without queueing, so that the kernel can sort out how best to
> service them. Trove doesn't care about ordering at that level.
>
> Drawbacks:
>
> - This option/implementation is only reasonable for systems with NPTL,
> because of the low thread-spawning overhead. Non-NPTL systems will
> probably find the cost to be higher. As a side note, we tried an
> implementation that kept a pool of threads and dispatched operations to
> them, but we found that the overhead of synchronization and signaling
> in that approach was (surprisingly) much higher than the cost of simply
> creating a brand new thread for every operation, which requires no
> synchronization.
> - This implementation only helps contiguous reads or writes as they
> appear to Trove. You could extend it to work for other patterns by just
> doing a series of preads and pwrites to work down the list of buffers,
> but we did not handle this case.
>
> Results:
>
> We didn't see a big gain from this approach at first, but since then we
> have taken care of some other bottlenecks that make the improvement more
> obvious. It also seems that the performance boost varies quite a bit
> depending on the type of system you run it on. We have some new servers
> (results shown below) that benefitted greatly from this optimization.
>
> The numbers below show the results from a setup with 16 servers and a
> variable number of clients and number of processes per client. The
> benchmark is performing a read only access pattern with 100 MB buffers.
> All clients are accessing the same 40 GB file (we rotate among several
> to avoid caching). The file is divided into contiguous regions, one per
> process. We are using local hardware RAID at each server, and gigabit
> Ethernet for communication.
>
> Before optimization:
> client nodes x processes per node - MB/s aggregate throughput
> --------------------------------------------------------------
>
> 1 x 1 - 97.8
> 1 x 2 - 110.4
> 1 x 5 - 111.1
> 12 x 1 - 195.8
> 12 x 2 - 138.8
> 25 x 1 - 160.4
> 25 x 2 - 178.0
>
> After optimization:
> client nodes x processes per node - MB/s aggregate throughput
> --------------------------------------------------------------
> 1 x 1 - 93.4
> 1 x 2 - 109.2
> 1 x 5 - 108.9
> 12 x 1 - 443.1
> 12 x 2 - 502.6
> 25 x 1 - 496.7
> 25 x 2 - 550.7
>
> To confirm the cause of the problem, we performed a variation on the
> test where each client read an independent file, rather than the clients
> all hitting the same file. Running this benchmark with 12 client nodes
> (one process per node) resulted in a consistent 430 MB/s of aggregate
> throughput regardless of whether the new AIO path was used or not. This
> seems to confirm that the problem is a result of the sequential queueing
> that the normal AIO implementation does when multiple requests hit the
> same file.
>
> For these particular machines we were able to double or triple the read
> throughput for a parallel application that shared one large file. I am
> fairly sure that not all of our machines demonstrate this problem to
> such a drastic degree, but we will probably be testing some other setups
> later to get a better idea.
>
> -Phil
>
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers