Phil, I've done some tests for noncontiguous I/O comparing lio_listio,
aio_read/aio_write, and normal read/write.  In cases where there are a
lot of noncontiguous regions, lio_listio and aio tend to fall well
behind: at least an order of magnitude slower than normal read/write.

Avery

On Fri, 2006-03-24 at 10:49 -0600, Rob Ross wrote:
> Nice Phil. I saw this exact same sort of stalling eight years ago on 
> grendel at Clemson! But we didn't have alternative schedulers and the 
> like to play with at the time.
> 
> It might be worth our time to explore the dirty_ratio value a little 
> more in the context of both I/O and metadata tests. Perhaps once the 
> DBPF changes are merged in we can spend some time on this?
> 
> Rob
> 
> Phil Carns wrote:
> > Background:
> > 
> > This whole issue started off while trying to debug the PVFS2 
> > stall/timeout problem that ended up being caused by the ext3 reservation 
> > bug... but we found some interesting things along the way.
> > 
> > One of the things we noticed while looking at the problem is that
> > occasionally a Trove write operation would take much longer than 
> > expected; essentially stalling all I/O for a while. So we wrote some 
> > small benchmark programs to look at the issue outside of PVFS2. These 
> > benchmarks (in the cases shown here) write 8 G of data, 256K at a time. 
> > They show the stall also.  We ended up changing some PVFS2 timeouts to 
> > avoid the problem (see earlier email).
> > 
> > We then started trying to figure out why the writes stall sometimes, 
> > because that seemed like a bad thing regardless of whether the timeouts 
> > could handle it or if the kernel bug was fixed :)
> > 
> > These tests look at three possibilities:
> > 
> > A.      Is the AIO interface causing delays?
> > B.      Is the linux kernel waiting too long to start writing out its 
> > buffer cache?
> > C.      Is the linux kernel disk scheduler appropriate for PVFS2?
> > 
> > To test A:
> > 
> > The benchmark can run in two modes.  The first uses AIO (as in PVFS2), 
> > allowing a maximum of 16 concurrent writes.  The second doesn't use 
> > AIO or threads at all, but instead issues each write one at a time 
> > with the pwrite() function.
> > 
> > To test B:
> > 
> > We can change the kernel's writeback behavior by adjusting the 
> > /proc/sys/vm/dirty* files, documented in Documentation/filesystems/proc.txt 
> > in the linux kernel source.  The only one that really ended up being 
> > interesting for us (after trial and error) was the dirty_ratio file.  The 
> > explanation given in the documentation is: "Contains, as a percentage of 
> > total system memory, the number of pages at which a process which is 
> > generating disk writes will itself start writing out dirty data."  It 
> > defaults to 40, but some of the results below show what happens when it 
> > is set to 1.  There is also a dirty_background_ratio file, which 
> > controls when pdflush decides to write out data in the background.  That 
> > would seem to be the more desirable tweak, but it didn't have the effect 
> > that dirty_ratio did for some reason.
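Concretely, the knobs can be inspected and changed like this (needs root; 1 is the value used in the dirty_ratio=1 columns below):

```shell
# current writeback thresholds, as a percentage of total memory
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio

# make a process generating writes flush its own dirty pages much earlier
echo 1 > /proc/sys/vm/dirty_ratio
```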
> > 
> > To test C:
> > 
> > Reboot the machine with different I/O schedulers specified.  CFQ 
> > scheduler is the default, but we set it to the AS (anticipatory) 
> > scheduler using "elevator=as" in kernel command line.  The other 
> > scheduler options (deadline, noop) didn't change much.  The schedulers
> > also have tunable parameters in /sys/block/<DEVICE>/queue/iosched/*,
> > but they didn't seem to impact much either.  The schedulers are somewhat
> > documented in the Documentation/block subdirectory in the linux kernel
> > source.
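For reference, the scheduler selection and its tunables look like this (the device name "sda" is a placeholder for whatever device is under test):

```shell
# boot-time selection: append to the kernel command line, e.g.
#   elevator=as
# then inspect the active scheduler and its tunables at runtime:
cat /sys/block/sda/queue/scheduler
ls /sys/block/sda/queue/iosched/
```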
> > 
> > The results are listed below.  The benchmarks show 3 things: The maximum
> > time that any individual write (during the course of the entire test 
> > run) took, the average individual write time, and then the total 
> > benchmark time.  Everything is shown in seconds.
> > 
> > The maximum single write time is what would have shown up as a long
> > "stall" in the PVFS2 I/O realm, so that is the most interesting value
> > in terms of our original problem.
> > 
> > A few things to point out:
> > 
> > - the choice of aio/pwrite didn't really matter a whole lot.  Individual 
> > aio operations take longer than pwrite, but they are overlapped and end 
> > up giving basically the same overall throughput.
> > - the io scheduler and buffer cache settings can have a big impact
> > - this wasn't the point of the test, but in this particular setup the 
> > san is actually a little slower than local disk for writes (this is an 
> > old san setup)
> > 
> > local disk results:
> > - using the AS scheduler reduced the maximum stall time
> > significantly and improved total benchmark run time
> > - setting the dirty ratio to 1 further reduced the maximum stall time, 
> > but also seemed to increase the total benchmark run time a little (maybe 
> > there is a sweet spot between 40 and 1 for this value that doesn't 
> > penalize the throughput as much?)
> > 
> > san results:
> > - the AS scheduler didn't really help
> > - setting the dirty ratio to 1 reduced the maximum stall time significantly
> > 
> > Maximum single write time
> > -------------------------
> >                          default       AS            AS,dirty_ratio=1
> > aio local                30.874424     2.040070      0.907068
> > pwrite local             28.146439     4.423536      1.052867
> > 
> > aio san                  46.486595     46.813606     6.161530
> > pwrite san               17.991354     10.994622     6.119389
> > 
> > Average single write time
> > -------------------------
> >                          default       AS            AS,dirty_ratio=1
> > aio local                0.061520      0.057819      0.064450
> > pwrite local             0.003711      0.003567      0.004022
> > 
> > aio san                  0.095062      0.096853      0.095410
> > pwrite san               0.005551      0.005713      0.005619
> > 
> > Total benchmark time
> > -------------------------
> >                          default       AS            AS,dirty_ratio=1
> > aio local                252.018623    236.855234    264.018140
> > pwrite local             243.552892    234.140043    263.995362
> > 
> > aio san                  389.380213    396.724146    390.813488
> > pwrite san               364.203958    374.827604    368.691822
> > 
> > These results aren't super scientific: in all cases it is just one test 
> > run per data point with no averaging.  We also didn't exhaustively try 
> > many parameter combinations.  This is also a write-only test; no telling 
> > what these parameters do to other workloads.
> > 
> > We don't really have time to follow through with this any further, but 
> > it does show that these VM and iosched settings might be interesting to 
> > tune in some cases.
> > 
> > If anyone has any similar experiences to share we would love to hear 
> > about it.
> > 
> > -Phil
> > 
> > 
> > 
> > _______________________________________________
> > Pvfs2-developers mailing list
> > [email protected]
> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
> > 
