I think it was just on Reiser3, ext3, and ext2. It was a long time ago. But for tiny noncontiguous data accesses (e.g. 8 bytes or so), there was easily an order of magnitude of difference. It wasn't a PVFS2-specific test.
Avery

On Fri, 2006-03-24 at 11:22 -0600, Rob Ross wrote:
> Hey Avery,
>
> In what environment were you testing this?
>
> Rob
>
> Avery Ching wrote:
> > Phil, I've done some tests for noncontiguous I/O comparing the
> > lio_listio, aio_read/aio_write, and normal read/write calls. In
> > cases where there are a lot of noncontiguous regions, lio_listio and
> > aio tend to really fall behind; at least 1 order of magnitude slower
> > than normal read/write.
> >
> > Avery
> >
> > On Fri, 2006-03-24 at 10:49 -0600, Rob Ross wrote:
> >> Nice, Phil. I saw this exact same sort of stalling eight years ago
> >> on grendel at Clemson! But we didn't have alternative schedulers
> >> and the like to play with at the time.
> >>
> >> It might be worth our time to explore the dirty_ratio value a
> >> little more in the context of both I/O and metadata tests. Perhaps
> >> once the DBPF changes are merged in we can spend some time on this?
> >>
> >> Rob
> >>
> >> Phil Carns wrote:
> >>> Background:
> >>>
> >>> This whole issue started off while trying to debug the PVFS2
> >>> stall/timeout problem that ended up being caused by the ext3
> >>> reservation bug... but we found some interesting things along the
> >>> way.
> >>>
> >>> One of the things we noticed while looking at the problem is that
> >>> occasionally a Trove write operation would take much longer than
> >>> expected, essentially stalling all I/O for a while. So we wrote
> >>> some small benchmark programs to look at the issue outside of
> >>> PVFS2. These benchmarks (in the cases shown here) write 8 GB of
> >>> data, 256K at a time. They show the stall also. We ended up
> >>> changing some PVFS2 timeouts to avoid the problem (see earlier
> >>> email).
> >>>
> >>> We then started trying to figure out why the writes stall
> >>> sometimes, because that seemed like a bad thing regardless of
> >>> whether the timeouts could handle it or if the kernel bug was
> >>> fixed :)
> >>>
> >>> These tests look at three possibilities:
> >>>
> >>> A. Is the AIO interface causing delays?
> >>> B. Is the linux kernel waiting too long to start writing out its
> >>>    buffer cache?
> >>> C. Is the linux kernel disk scheduler appropriate for PVFS2?
> >>>
> >>> To test A:
> >>>
> >>> The benchmark can run in 2 modes. The first uses AIO (as in
> >>> PVFS2), allowing a maximum of 16 concurrent writes at a time. The
> >>> second doesn't use AIO or threads at all, but instead does each
> >>> write one at a time with the pwrite() function.
> >>>
> >>> To test B:
> >>>
> >>> We can change this behavior by adjusting the /proc/sys/vm/dirty*
> >>> files. They are documented in the
> >>> Documentation/filesystems/proc.txt file in the linux kernel
> >>> source. The only one that really ended up being interesting for us
> >>> (after trial and error) is the dirty_ratio file. The explanation
> >>> given in the documentation is: "Contains, as a percentage of total
> >>> system memory, the number of pages at which a process which is
> >>> generating disk writes will itself start writing out dirty data."
> >>> It defaults to 40, but some of the results below show what happens
> >>> when it is set to 1. There is also a dirty_background_ratio file,
> >>> which controls when pdflush decides to write out data in the
> >>> background. That would seem to be the more desirable tweak, but it
> >>> didn't have the effect that dirty_ratio did for some reason.
> >>>
> >>> To test C:
> >>>
> >>> Reboot the machine with different I/O schedulers specified. The
> >>> CFQ scheduler is the default, but we set it to the AS
> >>> (anticipatory) scheduler using "elevator=as" on the kernel command
> >>> line. The other scheduler options (deadline, noop) didn't change
> >>> much. The schedulers also have tunable parameters in
> >>> /sys/block/<DEVICE>/queue/iosched/*, but they didn't seem to
> >>> impact much either. The schedulers are somewhat documented in the
> >>> Documentation/block subdirectory in the linux kernel source.
> >>>
> >>> The results are listed below.
> >>> The benchmarks show 3 things: the maximum time that any
> >>> individual write (during the course of the entire test run) took,
> >>> the average individual write time, and then the total benchmark
> >>> time. Everything is shown in seconds.
> >>>
> >>> The maximum single write time is what would have shown up as a
> >>> long "stall" in the PVFS2 I/O realm, so that is the most
> >>> interesting value in terms of our original problem.
> >>>
> >>> A few things to point out:
> >>>
> >>> - the choice of aio/pwrite didn't really matter a whole lot.
> >>>   Individual aio operations take longer than pwrite, but they are
> >>>   overlapped and end up giving basically the same overall
> >>>   throughput.
> >>> - the io scheduler and buffer cache settings can have a big impact
> >>> - this wasn't the point of the test, but in this particular setup
> >>>   the san is actually a little slower than local disk for writes
> >>>   (this is an old san setup)
> >>>
> >>> local disk results:
> >>> - using the AS scheduler reduced the maximum stall time
> >>>   significantly and improved total benchmark run time
> >>> - setting the dirty ratio to 1 further reduced the maximum stall
> >>>   time, but also seemed to increase the total benchmark run time a
> >>>   little (maybe there is a sweet spot between 40 and 1 for this
> >>>   value that doesn't penalize the throughput as much?)
> >>>
> >>> san results:
> >>> - the AS scheduler didn't really help
> >>> - setting the dirty ratio to 1 reduced the maximum stall time
> >>>   significantly
> >>>
> >>> Maximum single write time
> >>> -------------------------
> >>>                 default      AS           AS,dirty_ratio=1
> >>> aio local       30.874424    2.040070     0.907068
> >>> pwrite local    28.146439    4.423536     1.052867
> >>>
> >>> aio san         46.486595    46.813606    6.161530
> >>> pwrite san      17.991354    10.994622    6.119389
> >>>
> >>> Average single write time
> >>> -------------------------
> >>>                 default      AS           AS,dirty_ratio=1
> >>> aio local       0.061520     0.057819     0.064450
> >>> pwrite local    0.003711     0.003567     0.004022
> >>>
> >>> aio san         0.095062     0.096853     0.095410
> >>> pwrite san      0.005551     0.005713     0.005619
> >>>
> >>> Total benchmark time
> >>> --------------------
> >>>                 default      AS           AS,dirty_ratio=1
> >>> aio local       252.018623   236.855234   264.018140
> >>> pwrite local    243.552892   234.140043   263.995362
> >>>
> >>> aio san         389.380213   396.724146   390.813488
> >>> pwrite san      364.203958   374.827604   368.691822
> >>>
> >>> These results aren't super scientific; in all cases it is just one
> >>> test run per data point and no averaging. We also didn't
> >>> exhaustively try many parameter combinations. This is also a
> >>> write-only test; no telling what these parameters do to other
> >>> workloads.
> >>>
> >>> We don't really have time to follow through with this any further,
> >>> but it does show that these VM and iosched settings might be
> >>> interesting to tune in some cases.
> >>>
> >>> If anyone has any similar experiences to share we would love to
> >>> hear about it.
> >>>
> >>> -Phil
> >>>
> >>> _______________________________________________
> >>> Pvfs2-developers mailing list
> >>> [email protected]
> >>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
