Hi Phil,

> - define _GNU_SOURCE only in dbpf-bstream.c.  This might be the least
> intrusive option.

This seems like the best option to me..
We already have a precedent for doing it in the tree right now.. :)
./test/client/sysint/module.mk.in:MODCFLAGS_$(DIR)/io-stress.c := -D_GNU_SOURCE
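Presumably dbpf-bstream.c would get the analogous line in its module.mk.in, following the io-stress.c precedent (hypothetical, untested):

```make
MODCFLAGS_$(DIR)/dbpf-bstream.c := -D_GNU_SOURCE
```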
thanks,
Murali

>
> Any suggestions?
>
> -Phil
>
> Sam Lang wrote:
> >
> > Hi Phil,
> >
> > I went ahead and committed this patch to trunk.  The changes are
> > relatively small and you've demonstrated good perf improvements out  of
> > them!  In the longer term I'm going to try to merge Julian's  threaded
> > implementation with O_DIRECT support to trunk at some point  as well, so
> > that we can still have some control over grouping and  scheduling
> > operations.
> >
> > -sam
> >
> > On Aug 10, 2006, at 3:37 PM, Phil Carns wrote:
> >
> >> Background:
> >>
> >> We have been a little suspicious of the POSIX AIO performance on some of
> >> our servers. After digging into the glibc code a little, we found a
> >> possible problem: glibc's AIO will spawn up to 16 threads by default,
> >> but will never assign more than a single thread to a given fd. That
> >> thread then services all operations on that fd sequentially using a
> >> FIFO queue. This means that if several clients are performing I/O to the
> >> same datafile, all of their I/O requests get pushed to the disk
> >> sequentially (and probably not in order by offset).
> >>
> >> Patch:
> >>
> >> This patch replaces the lio_listio() calls with a macro called
> >> LIO_LISTIO(). You can then toggle what this macro does by using a
> >> config file option, "TroveAltAIOMode yes|no". If the option is not
> >> specified (or is set to no), the normal code path is taken. If the
> >> option is enabled, the macro inspects its arguments: if the operation
> >> is a single-buffer read or write, it immediately spawns a new detached
> >> thread, services the operation using p{read/write}, triggers a callback
> >> function, and exits. More complex operations are sent down the usual
> >> lio_listio() route.
> >>
> >> The idea is basically to hand the requests off to the kernel as
> >> quickly as possible, without queueing, so that the kernel can sort out
> >> how best to service them. Trove doesn't care about ordering at that level.
> >>
> >> Drawbacks:
> >>
> >> - This option/implementation is only reasonable for systems with NPTL,
> >> because of the low thread-spawning overhead. Non-NPTL systems will
> >> probably find the cost to be higher. As a side note, we tried an
> >> implementation that kept a pool of threads and sent operations to those
> >> threads, but we found that the overhead of synchronization and signaling
> >> in that approach was (surprisingly) much higher than the cost of simply
> >> creating a brand new thread for every operation, which requires no
> >> synchronization.
> >> - This implementation only helps reads and writes that appear
> >> contiguous to Trove. It could be extended to other patterns by doing a
> >> series of preads and pwrites to work down the list of buffers, but we
> >> did not handle that case.
> >>
> >> Results:
> >>
> >> We didn't see a big gain from this approach at first, but since then we
> >> have taken care of some other bottlenecks that make the improvement more
> >> obvious. The performance boost also seems to vary quite a bit depending
> >> on the type of system you run it on. We have some new servers (results
> >> shown below) that benefited greatly from this optimization.
> >>
> >> The numbers below show the results from a setup with 16 servers and a
> >> variable number of clients and processes per client. The benchmark
> >> performs a read-only access pattern with 100 MB buffers. All clients
> >> are accessing the same 40 GB file (we rotate among several files to
> >> avoid caching). The file is divided into contiguous regions, one per
> >> process. We are using local hardware RAID at each server, and gigabit
> >> ethernet for communication.
> >>
> >> Before optimization:
> >> client nodes x processes per node - MB/s aggregate throughput
> >> --------------------------------------------------------------
> >>
> >> 1 x 1 - 97.8
> >> 1 x 2 - 110.4
> >> 1 x 5 - 111.1
> >> 12 x 1 - 195.8
> >> 12 x 2 - 138.8
> >> 25 x 1 - 160.4
> >> 25 x 2 - 178.0
> >>
> >> After optimization:
> >> client nodes x processes per node - MB/s aggregate throughput
> >> --------------------------------------------------------------
> >> 1 x 1 - 93.4
> >> 1 x 2 - 109.2
> >> 1 x 5 - 108.9
> >> 12 x 1 - 443.1
> >> 12 x 2 - 502.6
> >> 25 x 1 - 496.7
> >> 25 x 2 - 550.7
> >>
> >> To confirm the cause of the problem, we performed a variation on the
> >> test in which each client read an independent file, rather than all
> >> clients hitting the same file. Running this benchmark with 12 client
> >> nodes (one process per node) resulted in a consistent 430 MB/s of
> >> aggregate throughput regardless of whether the new AIO path was used.
> >> This seems to confirm that the problem is a result of the sequential
> >> queueing that the normal AIO implementation does when multiple requests
> >> hit the same file.
> >>
> >> For these particular machines we were able to double or triple the read
> >> throughput for a parallel application that shared one large file. I am
> >> fairly sure that not all of our machines demonstrate this problem to
> >> such a drastic degree, but we will probably be testing some other
> >> setups later to get a better idea.
> >>
> _______________________________________________
> Pvfs2-developers mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>
>