Sam Lang wrote:
This differs from what Julian did quite a bit I guess. You've
essentially re-written the lio_listio call?
Pretty much - we were just looking for a way to work around, or at least
confirm, a specific limitation in the lio_listio() implementation rather
than addressing higher levels of the I/O path. I guess it isn't
necessarily a limitation; it's just that the implied per-fd ordering
requirement isn't really helpful for what Trove needs to do.
Does it make sense to
factor that out so that it could be used as a separate implementation
for others (LD_PRELOAD=libfastaio.so :-))? I guess the only advantage
pvfs would gain from that would be the performance comparisons of the
two different lio_listio implementations from all the aio tests
already out there. A disadvantage might be that you can't switch
between one and the other dynamically at runtime.
I would imagine this is possible, but probably more work than I have
time to tinker with :) You could probably even play some trick to make
it controllable at runtime (looking at the value of a global variable if
defined, or maybe latching hold of the aio_init() function to control it).
Julian's implementation changes the actual dbpf_bstream_rw_list call,
so it's one level up from yours. With the performance gains you see
from spawning a thread on demand instead of pooling, maybe it's time to
get rid of the dbpf queue altogether and start spawning a thread per
trove operation? The advantage we get from queuing is that we can
do our own scheduling, although your numbers are impressive, so maybe
that's not such a good idea.
I really don't know. The problem here was that glibc wasn't really
honoring the decision even when trove decided to let something through
the queue :) I don't know what the implications would be for Berkeley
DB or other operations. I also don't know that NPTL is ubiquitous
enough to make spawning that many threads the default behavior.
At any rate, I am definitely interested in looking at Julian's
implementation when it is ready for evaluation. It may be that his
stuff renders the lio_listio() workaround a moot point, although we will
definitely be using this approach in the meantime.
-Phil
-sam
On Aug 10, 2006, at 3:37 PM, Phil Carns wrote:
Background:
We have been a little suspicious of the posix aio performance on some of
our servers. After digging in the glibc code a little, we found a
possible problem. Glibc's aio will spawn up to 16 threads by default,
but will never assign more than a single thread to a given fd. That
thread will then service all operations on that fd sequentially using a
FIFO queue. This means that if several clients are performing I/O to the
same datafile, then all of their I/O requests get pushed to the disk
sequentially (and probably not in order by offset).
Patch:
This patch replaces the lio_listio() calls with a macro called
LIO_LISTIO(). You can then toggle what this macro does by using a
config file option "TroveAltAIOMode yes|no". If the option is not
specified (or is set to no) then the normal code path is taken. If the
option is enabled, then it looks at the arguments. If the operation is
a single buffer read or write, then it immediately spawns a new detached
thread, services the operation using pread()/pwrite(), triggers a
callback function, and exits. More complex operations are sent down the
usual lio_listio() route.
The idea is basically to try to get the requests off to the kernel as
quickly as possible, without queueing, so that the kernel can sort out
how best to service them. Trove doesn't care about ordering at that level.
Drawbacks:
- This option/implementation is only reasonable for systems with NPTL,
because of the low thread spawning overhead. Non-NPTL systems will
probably find the cost to be higher. As a side note, we tried an
implementation that kept a pool of threads and sent operations to those
threads, but we found that the overhead of synchronization and signaling
in this approach was (surprisingly) much higher than the cost of simply
creating a brand new thread for every operation, which requires no
synchronization.
- This implementation only helps contiguous reads or writes as they
appear to Trove. You could extend it to work for other patterns by just
doing a series of preads and pwrites to work down the list of buffers,
but we did not handle this case.
Results:
We didn't see a big gain from this approach at first, but since then we
have taken care of some other bottlenecks that make the improvement more
obvious. It also seems that the performance boost varies quite a bit
depending on the type of system you run it on. We have some new servers
(results shown below) that benefitted greatly from this optimization.
The numbers below show the results from a setup with 16 servers and a
variable number of clients and number of processes per client. The
benchmark is performing a read only access pattern with 100 MB buffers.
All clients are accessing the same 40 GB file (we rotate among
several to avoid caching). The file is divided into contiguous regions,
one per process. We are using local hardware raid at each
server, and gigabit ethernet for communication.
Before optimization:
client nodes x processes per node - MB/s aggregate throughput
--------------------------------------------------------------
1 x 1 - 97.8
1 x 2 - 110.4
1 x 5 - 111.1
12 x 1 - 195.8
12 x 2 - 138.8
25 x 1 - 160.4
25 x 2 - 178.0
After optimization:
client nodes x processes per node - MB/s aggregate throughput
--------------------------------------------------------------
1 x 1 - 93.4
1 x 2 - 109.2
1 x 5 - 108.9
12 x 1 - 443.1
12 x 2 - 502.6
25 x 1 - 496.7
25 x 2 - 550.7
To confirm the cause of the problem, we performed a variation on the
test where each client read an independent file, rather than the clients
all hitting the same file. Running this benchmark with 12 client
nodes (one process per node) resulted in a consistent 430 MB/s of
aggregate throughput regardless of whether the new AIO path was used
or not. This
seems to confirm that the problem is a result of the sequential queueing
that the normal AIO implementation does when multiple requests hit the
same file.
For these particular machines we were able to double or triple the read
throughput for a parallel application that shared one large file. I am
fairly sure that not all of our machines demonstrate this problem to
such a drastic degree, but we will probably be testing some other
setups later to get a better idea.
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers