Phil, I've done some tests for noncontiguous I/O comparing lio_listio,
aio_read()/aio_write(), and normal read()/write(). In cases where there
are a lot of noncontiguous regions, lio_listio and aio tend to really
fall behind: at least an order of magnitude slower than normal
read/write.
Avery

On Fri, 2006-03-24 at 10:49 -0600, Rob Ross wrote:
> Nice Phil. I saw this exact same sort of stalling eight years ago on
> grendel at Clemson! But we didn't have alternative schedulers and the
> like to play with at the time.
>
> It might be worth our time to explore the dirty_ratio value a little
> more in the context of both I/O and metadata tests. Perhaps once the
> DBPF changes are merged in we can spend some time on this?
>
> Rob
>
> Phil Carns wrote:
> > Background:
> >
> > This whole issue started off while trying to debug the PVFS2
> > stall/timeout problem that ended up being caused by the ext3
> > reservation bug... but we found some interesting things along the
> > way.
> >
> > One of the things we noticed while looking at the problem is that
> > occasionally a Trove write operation would take much longer than
> > expected, essentially stalling all I/O for a while. So we wrote some
> > small benchmark programs to look at the issue outside of PVFS2.
> > These benchmarks (in the cases shown here) write 8 GB of data,
> > 256 KB at a time. They show the stall also. We ended up changing
> > some PVFS2 timeouts to avoid the problem (see earlier email).
> >
> > We then started trying to figure out why the writes stall sometimes,
> > because that seemed like a bad thing regardless of whether the
> > timeouts could handle it or the kernel bug was fixed :)
> >
> > These tests look at three possibilities:
> >
> > A. Is the AIO interface causing delays?
> > B. Is the Linux kernel waiting too long to start writing out its
> >    buffer cache?
> > C. Is the Linux kernel disk scheduler appropriate for PVFS2?
> >
> > To test A:
> >
> > The benchmark can run in 2 modes. The first uses AIO (as in PVFS2),
> > allowing a maximum of 16 concurrent writes at a time. The second
> > doesn't use AIO or threads at all, but instead does each write one
> > at a time with the pwrite() function.
> >
> > To test B:
> >
> > We can change this behavior by adjusting the /proc/sys/vm/dirty*
> > files. They are documented in Documentation/filesystems/proc.txt in
> > the Linux kernel source. The only one that ended up being
> > interesting for us (after trial and error) is the dirty_ratio file.
> > The explanation given in the documentation is: "Contains, as a
> > percentage of total system memory, the number of pages at which a
> > process which is generating disk writes will itself start writing
> > out dirty data." It defaults to 40, but some of the results below
> > show what happens when it is set to 1. There is also a
> > dirty_background_ratio file, which controls when pdflush decides to
> > write out data in the background. That would seem to be the more
> > desirable tweak, but for some reason it didn't have the effect that
> > dirty_ratio did.
> >
> > To test C:
> >
> > Reboot the machine with different I/O schedulers specified. The CFQ
> > scheduler is the default, but we set it to the AS (anticipatory)
> > scheduler using "elevator=as" on the kernel command line. The other
> > scheduler options (deadline, noop) didn't change much. The
> > schedulers also have tunable parameters in
> > /sys/block/<DEVICE>/queue/iosched/*, but they didn't seem to have
> > much impact either. The schedulers are somewhat documented in the
> > Documentation/block subdirectory of the Linux kernel source.
> >
> > The results are listed below. The benchmarks report 3 things: the
> > maximum time that any individual write took during the course of
> > the entire test run, the average individual write time, and the
> > total benchmark time. Everything is shown in seconds.
> >
> > The maximum single write time is what would have shown up as a long
> > "stall" in the PVFS2 I/O realm, so that is the most interesting
> > value in terms of our original problem.
> >
> > A few things to point out:
> >
> > - the choice of aio/pwrite didn't really matter a whole lot.
> >   Individual aio operations take longer than pwrite, but they are
> >   overlapped and end up giving basically the same overall
> >   throughput.
> > - the io scheduler and buffer cache settings can have a big impact
> > - this wasn't the point of the test, but in this particular setup
> >   the san is actually a little slower than local disk for writes
> >   (this is an old san setup)
> >
> > local disk results:
> > - using the AS scheduler reduced the maximum stall time
> >   significantly and improved total benchmark run time
> > - setting the dirty ratio to 1 further reduced the maximum stall
> >   time, but also seemed to increase the total benchmark run time a
> >   little (maybe there is a sweet spot between 40 and 1 for this
> >   value that doesn't penalize throughput as much?)
> >
> > san results:
> > - the AS scheduler didn't really help
> > - setting the dirty ratio to 1 reduced the maximum stall time
> >   significantly
> >
> > Maximum single write time
> > -------------------------
> >                default      AS           AS,dirty_ratio=1
> > aio local      30.874424    2.040070     0.907068
> > pwrite local   28.146439    4.423536     1.052867
> >
> > aio san        46.486595    46.813606    6.161530
> > pwrite san     17.991354    10.994622    6.119389
> >
> > Average single write time
> > -------------------------
> >                default      AS           AS,dirty_ratio=1
> > aio local      0.061520     0.057819     0.064450
> > pwrite local   0.003711     0.003567     0.004022
> >
> > aio san        0.095062     0.096853     0.095410
> > pwrite san     0.005551     0.005713     0.005619
> >
> > Total benchmark time
> > --------------------
> >                default      AS           AS,dirty_ratio=1
> > aio local      252.018623   236.855234   264.018140
> > pwrite local   243.552892   234.140043   263.995362
> >
> > aio san        389.380213   396.724146   390.813488
> > pwrite san     364.203958   374.827604   368.691822
> >
> > These results aren't super scientific - in all cases it is just one
> > test run per data point and no averaging. We also didn't
> > exhaustively try many parameter combinations.
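[For reference, the knobs discussed above can be inspected and changed
like this. The device name "sda" and the values are examples only;
writing these files requires root and affects the whole system.]

```shell
# buffer cache writeback thresholds (percent of total system memory)
cat /proc/sys/vm/dirty_ratio               # default 40 on this kernel
echo 1 > /proc/sys/vm/dirty_ratio          # aggressive writeback, as tested

cat /proc/sys/vm/dirty_background_ratio    # when pdflush starts writing
                                           # out data in the background

# select the anticipatory scheduler at boot: add "elevator=as" to the
# kernel command line.  Kernels with runtime switching also allow:
cat /sys/block/sda/queue/scheduler
echo anticipatory > /sys/block/sda/queue/scheduler

# per-scheduler tunables mentioned in the thread
ls /sys/block/sda/queue/iosched/
```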
> > This is also a write-only test; no telling what these parameters
> > do to other workloads.
> >
> > We don't really have time to follow through with this any further,
> > but it does show that these VM and iosched settings might be
> > interesting to tune in some cases.
> >
> > If anyone has any similar experiences to share we would love to
> > hear about it.
> >
> > -Phil

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
