I think it was just on Reiser3, ext3, and ext2. It was a long time ago. But for tiny noncontiguous data accesses (e.g. 8 bytes or so), there was easily an order of magnitude of difference. It wasn't a PVFS2-specific test.
Avery

On Fri, 2006-03-24 at 11:22 -0600, Rob Ross wrote:
> Hey Avery,
>
> In what environment were you testing this?
>
> Rob
>
> Avery Ching wrote:
> > Phil, I've done some tests for noncontiguous I/O comparing the
> > lio_listio, aio_read/aio_write, and normal read/write calls. In
> > cases where there are a lot of noncontiguous regions, lio_listio and
> > aio tend to really fall behind; at least 1 order of magnitude slower
> > than normal read/write.
> >
> > Avery
> >
> > On Fri, 2006-03-24 at 10:49 -0600, Rob Ross wrote:
> >> Nice, Phil. I saw this exact same sort of stalling eight years ago
> >> on grendel at Clemson! But we didn't have alternative schedulers
> >> and the like to play with at the time.
> >>
> >> It might be worth our time to explore the dirty_ratio value a
> >> little more in the context of both I/O and metadata tests. Perhaps
> >> once the DBPF changes are merged in we can spend some time on this?
> >>
> >> Rob
> >>
> >> Phil Carns wrote:
> >>> Background:
> >>>
> >>> This whole issue started off while trying to debug the PVFS2
> >>> stall/timeout problem that ended up being caused by the ext3
> >>> reservation bug... but we found some interesting things along the
> >>> way.
> >>>
> >>> One of the things we noticed while looking at the problem is that
> >>> occasionally a Trove write operation would take much longer than
> >>> expected, essentially stalling all I/O for a while. So we wrote
> >>> some small benchmark programs to look at the issue outside of
> >>> PVFS2. These benchmarks (in the cases shown here) write 8 GB of
> >>> data, 256K at a time. They show the stall also. We ended up
> >>> changing some PVFS2 timeouts to avoid the problem (see earlier
> >>> email).
> >>>
> >>> We then started trying to figure out why the writes stall
> >>> sometimes, because that seemed like a bad thing regardless of
> >>> whether the timeouts could handle it or if the kernel bug was
> >>> fixed :)
> >>>
> >>> These tests look at three possibilities:
> >>>
> >>> A. Is the AIO interface causing delays?
> >>> B. Is the linux kernel waiting too long to start writing out its
> >>>    buffer cache?
> >>> C. Is the linux kernel disk scheduler appropriate for PVFS2?
> >>>
> >>> To test A:
> >>>
> >>> The benchmark can run in 2 modes. The first uses AIO (as in
> >>> PVFS2), allowing a maximum of 16 concurrent writes at a time. The
> >>> second doesn't use AIO or threads at all, but instead does each
> >>> write one at a time with the pwrite() function.
> >>>
> >>> To test B:
> >>>
> >>> We can change this behavior by adjusting the /proc/sys/vm/dirty*
> >>> files. They are documented in the
> >>> Documentation/filesystems/proc.txt file in the linux kernel
> >>> source. The only one that really ended up being interesting for us
> >>> (after trial and error) is the dirty_ratio file. The explanation
> >>> given in the documentation is: "Contains, as a percentage of total
> >>> system memory, the number of pages at which a process which is
> >>> generating disk writes will itself start writing out dirty data."
> >>> It defaults to 40, but some of the results below show what happens
> >>> when it is set to 1. There is also a dirty_background_ratio file,
> >>> which controls when pdflush decides to write out data in the
> >>> background. That would seem to be the more desirable tweak, but it
> >>> didn't have the effect that dirty_ratio did for some reason.
> >>>
> >>> To test C:
> >>>
> >>> Reboot the machine with different I/O schedulers specified. The
> >>> CFQ scheduler is the default, but we set it to the AS
> >>> (anticipatory) scheduler using "elevator=as" on the kernel command
> >>> line. The other scheduler options (deadline, noop) didn't change
> >>> much. The schedulers also have tunable parameters in
> >>> /sys/block/<DEVICE>/queue/iosched/*, but they didn't seem to
> >>> impact much either. The schedulers are somewhat documented in the
> >>> Documentation/block subdirectory in the linux kernel source.
> >>>
> >>> The results are listed below.
> >>> The benchmarks show 3 things: the maximum time that any
> >>> individual write (during the course of the entire test run) took,
> >>> the average individual write time, and then the total benchmark
> >>> time. Everything is shown in seconds.
> >>>
> >>> The maximum single write time is what would have shown up as a
> >>> long "stall" in the PVFS2 I/O realm, so that is the most
> >>> interesting value in terms of our original problem.
> >>>
> >>> A few things to point out:
> >>>
> >>> - the choice of aio/pwrite didn't really matter a whole lot.
> >>>   Individual aio operations take longer than pwrite, but they are
> >>>   overlapped and end up giving basically the same overall
> >>>   throughput.
> >>> - the io scheduler and buffer cache settings can have a big impact
> >>> - this wasn't the point of the test, but in this particular setup
> >>>   the san is actually a little slower than local disk for writes
> >>>   (this is an old san setup)
> >>>
> >>> local disk results:
> >>> - using the AS scheduler reduced the maximum stall time
> >>>   significantly and improved total benchmark run time
> >>> - setting the dirty ratio to 1 further reduced the maximum stall
> >>>   time, but also seemed to increase the total benchmark run time a
> >>>   little (maybe there is a sweet spot between 40 and 1 for this
> >>>   value that doesn't penalize the throughput as much?)
> >>>
> >>> san results:
> >>> - the AS scheduler didn't really help
> >>> - setting the dirty ratio to 1 reduced the maximum stall time
> >>>   significantly
> >>>
> >>> Maximum single write time
> >>> -------------------------
> >>>                 default      AS           AS,dirty_ratio=1
> >>> aio local       30.874424    2.040070     0.907068
> >>> pwrite local    28.146439    4.423536     1.052867
> >>>
> >>> aio san         46.486595    46.813606    6.161530
> >>> pwrite san      17.991354    10.994622    6.119389
> >>>
> >>> Average single write time
> >>> -------------------------
> >>>                 default      AS           AS,dirty_ratio=1
> >>> aio local       0.061520     0.057819     0.064450
> >>> pwrite local    0.003711     0.003567     0.004022
> >>>
> >>> aio san         0.095062     0.096853     0.095410
> >>> pwrite san      0.005551     0.005713     0.005619
> >>>
> >>> Total benchmark time
> >>> --------------------
> >>>                 default      AS           AS,dirty_ratio=1
> >>> aio local       252.018623   236.855234   264.018140
> >>> pwrite local    243.552892   234.140043   263.995362
> >>>
> >>> aio san         389.380213   396.724146   390.813488
> >>> pwrite san      364.203958   374.827604   368.691822
> >>>
> >>> These results aren't super scientific; in all cases it is just one
> >>> test run per data point and no averaging. We also didn't
> >>> exhaustively try many parameter combinations. This is also a
> >>> write-only test; no telling what these parameters do to other
> >>> workloads.
> >>>
> >>> We don't really have time to follow through with this any further,
> >>> but it does show that these VM and iosched settings might be
> >>> interesting to tune in some cases.
> >>>
> >>> If anyone has any similar experiences to share we would love to
> >>> hear about it.
> >>>
> >>> -Phil
> >>>
> >>> _______________________________________________
> >>> Pvfs2-developers mailing list
> >>> [email protected]
> >>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
