http://lwn.net/Articles/10874/

A new deadline I/O scheduler

One of the results of all the recent VM improvements in the 2.5 tree is that the shortcomings of the current block I/O scheduler are becoming more apparent. The scheduler (or "elevator") tries to join adjacent requests together, and to sort requests along the disk in order to minimize seeks and optimize I/O performance. Unfortunately, the current scheduler is not all that good at treating requests fairly. In the name of overall performance, the scheduler can cause the starvation of individual requests for a very long time. Requests eventually get serviced, but if the starved request is satisfying a page fault in the X server, the user can get seriously annoyed in the meantime.

The scheduler shortcomings have always been there, but they are more apparent now because the VM subsystem does a more effective job of filling the disk request queues. Now that the scheduler problems can't hide behind VM problems, they need to be addressed. To that end, Jens Axboe has dusted off and fixed up his deadline I/O scheduler patch, and is looking for people to try it out and report on how it works. His initial results are good:

The Andrew Morton Interactive Workload (AMIW) rates the current kernel poorly, on my test machine it completes in 1-2 minutes depending on your luck. 2.5.38-BK does a lot better, but mainly because it's being extremely unfair. This deadline io scheduler finishes the AMIW in anywhere from ~0.5 seconds to ~3-4 seconds, depending on the io load.

The "AMIW" works, essentially, by creating a great deal of disk write traffic, then seeing how long it takes to list out a large directory while the writes are being handled.

The idea behind the deadline scheduler is that all read requests get satisfied within a specified time period - 1/2 second by default. (Write requests do not have specific deadlines; processes usually cannot continue until their reads are performed, but writes can generally be deferred indefinitely). The deadline performance is achieved by sorting I/O requests into several different queues; a rough sketch of these structures follows the list:

  • The sorted queues (sort_list) contain I/O requests sorted by the elevator for best disk performance. There are two queues, one for read requests, and one for writes.

  • The FIFO queue (read_fifo) contains all read requests. Since it's a FIFO, the request at the head of the queue will be the oldest, and will be the one whose deadline expires first.

  • The dispatch queue contains requests which are ready to be passed down to the block driver. Requests do not go into the dispatch queue until the scheduler decides that it is time to execute them.
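
Purely as an illustration (this is a simplified, user-space C sketch with invented field and type names, not the data structures used in the actual patch), the bookkeeping described above might be modelled like this:

    #include <time.h>

    enum { READ = 0, WRITE = 1 };

    struct request {
        unsigned long   sector;       /* starting sector; the elevator's sort key */
        unsigned int    nr_sectors;   /* length of the transfer, in sectors */
        int             data_dir;     /* READ or WRITE */
        time_t          deadline;     /* reads only: arrival time plus ~0.5s */
        struct request *sort_next;    /* linkage in the per-direction sorted list */
        struct request *fifo_next;    /* linkage in the read FIFO */
    };

    struct deadline_sched {
        struct request *sort_list[2]; /* reads and writes, each sorted by sector */
        struct request *read_fifo;    /* reads in arrival order; the head is oldest */
        struct request *dispatch;     /* requests handed on to the block driver */
    };

Note that a read sits on a sorted list and on the FIFO at the same time, which is why the sketch gives a request two separate link fields.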

With these lists, the operation of the scheduler is not that hard. Incoming requests are merged into the proper sort_list in the usual way: they are sorted by sector number, and merged with adjacent requests. (The real operation is complicated by barrier operations and a hashing mechanism designed to improve the performance of the scheduler itself, but the basic idea is as described.) Read requests are also given a deadline time stamp and put on the end of the read_fifo list.
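
Continuing the same illustrative sketch (hypothetical code, not the patch itself; the merging and hashing mentioned in the parenthesis are left out), queuing a new request amounts to something like:

    /* Queue a new request: keep the per-direction list sorted by sector,
     * and give reads a deadline and a place at the tail of the FIFO.
     * Merging with adjacent requests and the merge hash are omitted. */
    static void deadline_add_request(struct deadline_sched *dd, struct request *rq)
    {
        struct request **p = &dd->sort_list[rq->data_dir];

        while (*p && (*p)->sector < rq->sector)
            p = &(*p)->sort_next;
        rq->sort_next = *p;
        *p = rq;

        if (rq->data_dir == READ) {
            struct request **f = &dd->read_fifo;

            rq->deadline = time(NULL) + 1;  /* crude stand-in for the 0.5s default */
            rq->fifo_next = NULL;
            while (*f)                      /* walk to the FIFO tail */
                f = &(*f)->fifo_next;
            *f = rq;
        }
    }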

When the block driver is ready to start another disk I/O request, the core algorithm of the deadline scheduler comes into play. In simplified form, it works like this (a code sketch of the selection step follows the list):

  • If there are requests waiting in the dispatch queue then the scheduler need do no work, since it has already decided what it wants to do next.

  • Otherwise it's time to move a new set of requests to the dispatch queue. The scheduler looks in the following places, taking requests from (only) the first source that it finds:

    • If there are pending write requests, and the scheduler has not selected any writes for "a long time," a set of write requests is selected.

    • If there are expired read requests in the read_fifo list, take a set from there.

    • If there are pending reads in the sorted queue, some of them are moved.

    • Finally, if there are pending write operations, the dispatch queue is filled from the sorted write queue.

  • Once that's done, if there are any requests in the dispatch queue (i.e. if there's anything for the disk to do at all) the first one is passed back to the driver.
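
Restated in code (hypothetical helper names, building on the sketch structures above; this is not the logic as written in the patch), the source-selection step might read:

    /* Called only when the dispatch queue is empty: decide which list the
     * next batch of requests should be pulled from, in the order described
     * above.  "writes_starved" counts read batches sent since the last write
     * batch; two in a row means the writes have waited long enough. */
    enum batch_source { SRC_NONE, SRC_SORTED_WRITES, SRC_READ_FIFO, SRC_SORTED_READS };

    static enum batch_source deadline_pick_source(struct deadline_sched *dd,
                                                  int writes_starved)
    {
        int have_reads  = dd->sort_list[READ]  != NULL;
        int have_writes = dd->sort_list[WRITE] != NULL;

        if (have_writes && writes_starved >= 2)
            return SRC_SORTED_WRITES;

        if (dd->read_fifo && dd->read_fifo->deadline <= time(NULL))
            return SRC_READ_FIFO;

        if (have_reads)
            return SRC_SORTED_READS;

        if (have_writes)
            return SRC_SORTED_WRITES;

        return SRC_NONE;
    }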

The definition of "a long time" for write request starvation is simply two iterations of the scheduler algorithm. After two sets of read requests have been moved to the dispatch queue, the scheduler will send over a set of writes. A "set," by default, can be as many as 64 contiguous requests - but a request that requires a disk seek counts the same as 16 contiguous requests.
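
As a rough sketch of that accounting (again hypothetical, with the patch's defaults of 64 and 16 hard-coded), filling one batch from a sorted list might look like:

    #define FIFO_BATCH  64   /* maximum "cost" of one batch */
    #define SEEK_COST   16   /* a non-contiguous request counts as 16 */

    /* Move requests from the chosen sorted list to the dispatch queue until
     * the batch budget is spent.  A request that follows directly on from the
     * previous one costs 1; one that would force a seek costs SEEK_COST. */
    static void deadline_fill_batch(struct deadline_sched *dd, int data_dir)
    {
        struct request **tail = &dd->dispatch;
        unsigned long last_end = 0;
        int first = 1, cost = 0;
        struct request *rq;

        while (*tail)                       /* find the dispatch tail */
            tail = &(*tail)->sort_next;

        while (cost < FIFO_BATCH && (rq = dd->sort_list[data_dir]) != NULL) {
            dd->sort_list[data_dir] = rq->sort_next;

            cost += (!first && rq->sector == last_end) ? 1 : SEEK_COST;
            last_end = rq->sector + rq->nr_sectors;
            first = 0;

            rq->sort_next = NULL;           /* append to the dispatch queue */
            *tail = rq;
            tail = &rq->sort_next;
        }
    }

In the real patch a read moved to the dispatch queue also has to come off the read FIFO, and the contiguity check carries over from the previous batch; the sketch leaves both details out.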

There has been no word from Linus on this patch; barring problems, however, it would be surprising if some form of it were not merged in the near future.



Sorting disk I/O requests...

Posted Sep 30, 2002 0:41 UTC (Mon) by Baylink (subscriber, #755) [Link]

Ok, I suppose the possibility exists that either

a) I don't know how to read, or

b) Everyone else involved here is missing something...

but every reference I've read on current-generation hard disk technology suggests strongly, if it doesn't say so outright, that it became pretty much useless for the kernel to try to make *any* assumptions about what was where on a drive (and therefore which requests should be filed with the drive in which order) right about the time that drives started lying to the outside world about what was going on inside them.

In particular, Scott Mueller's _Upgrading and Repairing PCs_ (I have the "Linux Edition", a tweaked version of the 11th) says this pretty much explicitly.

And yet these guys are trying to do it anyway.

Since what I read meshes with what I can figure out on my own from the facts given, I have to assume that there's *something* here that I'm missing.

Or else they're just betting that it will help enough to be worth doing, even though it's not going to work "right".

But maybe it's just me.

So many things are just me.

Sorting disk I/O requests...

Posted Oct 2, 2002 11:49 UTC (Wed) by axboe (subscriber, #904) [Link]

You cannot say for sure where anything is on a hard drive, but you can make some basic assumptions that generally are true. The assumption that the deadline io scheduler makes is this:

- Generally, sectors x and x+1 are contiguous on media.

That is it. I say generally because x+1 might be remapped due to defects, for instance. But you'll find that in most cases the above is true. Of course drive makers are free to do whatever they want, but a general performance breakdown would occur in _all_ OSes if a streamed read from, e.g., LBA 0 to LBA 2048 incurred lots of seeks.

deadline doesn't even attempt to say that a seek from a to b is more or less costly than a seek from x to y.

Maybe you should read the source. I'll bet you it's a lot better source than some 'repairing pcs' book.

Sorting disk I/O requests...

Posted Oct 8, 2002 2:27 UTC (Tue) by tigerand (guest, #6150) [Link]

This doesn't sound like it takes software raid (md) or filesystem I/O footprints into account. So if either of those two subsystems has any I/O-scheduling-type optimizations, then those optimizations are almost certainly going to clash, with very unsatisfactory results.

It sounds to me like this needs a heck of a lot more thought before it should be truly adopted. Even just sorting between reads and writes could clash with optimizations in some of the fancier journaling file systems. Delayed writes and the like could cloud the algorithm. There really needs to be a way to synchronize between file system code, low-level block I/O code, and md code.

For instance, I can see this scheme making a huge difference on a single disk with multiple file systems and maybe even a couple of chuckleheads dd'ing to a partition (or, god forbid, a user-level file system like Coda or some databases that want to use raw partitions). In that case, a file system might want to wave off its own optimizations in favor of this. But if it's a single-file-system-on-a-disk situation, the file system code might know better, especially with its advance knowledge about what might be coming just behind this but isn't in the queue yet. For instance, it knows that right now there are some data writes to accomplish, but right after that there are going to be some journal writes, so save up some pending journal writes just a tad, because there are a couple more just around the corner. This is an example of special knowledge, not a conflicting optimization. But you get the idea.

Bottom line, this seems like a thing that belongs in the file systems, and the low level is probably the wrong place to "fix" this.

effects on sync(1)?

Posted Oct 2, 2002 22:34 UTC (Wed) by roelofs (subscriber, #2599) [Link]

Apologies if this should be obvious from the article, but is there any effect
on syncs? Specifically, if I want to remount a partition read-only or just
unmount it, will a few syncs still carry the day, or does this change affect how
they work in some (not so desirable) way?


