Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Thu, Jun 20, 2013 at 04:17:13PM -0400, Matthew Wilcox wrote: A paper at FAST2012 (http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed out the performance overhead of taking interrupts for low-latency block I/Os. The solution the author investigated was to spin waiting for each I/O to complete. This is inefficient as Linux submits many I/Os which are not latency-sensitive, and even when we do submit latency-sensitive I/Os (eg swap-in), we frequently submit several I/Os before waiting. This RFC takes a different approach, only spinning when we would otherwise sleep. To implement this, I add an 'io_poll' function pointer to backing_dev_info. I include a sample implementation for the NVMe driver. Next, I add an io_wait() function which will call io_poll() if it is set. It falls back to calling io_schedule() if anything goes wrong with io_poll() or the task exceeds its timeslice. Finally, all that is left is to judiciously replace calls to io_schedule() with calls to io_wait(). I think I've covered the main contenders with sleep_on_page(), sleep_on_buffer() and the DIO path. I've measured the performance benefits of this with a Chatham NVMe prototype device and a simple # dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100 The latency of each I/O reduces by about 2.5us (from around 8.0us to around 5.5us). This matches up quite well with the performance numbers shown in the FAST2012 paper (which used a similar device). Hi Matthew, I'm wondering where the 2.5us latency cut comes from. I did a simple test. In my xeon 3.4G CPU, one cpu can do about 2M/s context switch of applications. Assuming switching to idle is faster, so switching to idle and back should take less than 1us. Does the 2.5us latency cut mostly come from deep idle state latency? if so, maybe set a lower pm_qos value or have a better idle governer to prevent cpu entering deep idle state can help too. Thanks, Shaohua -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On 06/23/2013 02:29 PM, Linus Torvalds wrote: You could try to do that either *in* the idle thread (which would take the context switch overhead - maybe negating some of the advantages), or alternatively hook into the scheduler idle logic before actually doing the switch. But anything that starts polling when there are other runnable processes to be done sounds really debatable. Even if it's only 5us or so. There's a lot of real work that could be done in 5us. Having a hook into the idle code could be useful for KVM, too. In certain message passing workloads, we have seen that the system throughput suffers greatly simply by having the KVM thread's preempt notifiers save CPU state when the guest goes idle, and re-load it when the guest VCPU is activated again. Avoiding the context switch overhead when nothing else wants to run, but being immediately preempted when something does, could be a big win. Would it make sense to have some hook that says I *am* an idle thread right now, as opposed to context switch to the idle thread? That hook could even run the cpuidle code, and switch to the real idle task (saving the current task state), and put the CPU into a C-state, when the expected sleep time is long... -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On 06/20/2013 04:17 PM, Matthew Wilcox wrote: --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4527,6 +4527,36 @@ long __sched io_schedule_timeout(long timeout) return ret; } +/* + * Wait for an I/O to complete against this backing_dev_info. If the + * task exhausts its timeslice polling for completions, call io_schedule() + * anyway. If a signal comes pending, return so the task can handle it. + * If the io_poll returns an error, give up and call io_schedule(), but + * swallow the error. We may miss an I/O completion (eg if the interrupt + * handler gets to it first). Guard against this possibility by returning + * if we've been set back to TASK_RUNNING. + */ +void io_wait(struct backing_dev_info *bdi) +{ I would like something a little more generic in the scheduler code, that could also be used by other things in the kernel (say, KVM with message passing workloads). Maybe something looking a little like this? void idle_poll(struct idle_poll_info *ipi) struct idle_poll_info { int (*idle_poll_func)(void *data); int (*idle_poll_preempt)(void *data); void *data; } That way the kernel can: 1) mark the current thread as having idle priority, allowing the scheduler to preempt it if something else wants to run 2) switch to asynchronous mode if something else wants to run, or if the average wait for the process is so long that it is better to go asynchronous and avoid polling 3) poll for completion if nothing else wants to run Does that make sense? Did I forget something you need? Did I forget something KVM could need? Is this insane? If so, is it too insane? :) -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On 06/25/13 05:18, Matthew Wilcox wrote: On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote: I'm wondering, how will this scheme work if the IO completion latency is a lot more than the 5 usecs in the testcase? What if it takes 20 usecs or 100 usecs or more? There's clearly a threshold at which it stops making sense, and our current NAND-based SSDs are almost certainly on the wrong side of that threshold! I can't wait for one of the post-NAND technologies to make it to market in some form that makes it economical to use in an SSD. The problem is that some of the people who are looking at those technologies are crazy. They want to bypass the kernel and do user space I/O because the kernel is too slow. This patch is part of an effort to show them how crazy they are. And even if it doesn't convince them, at least users who refuse to rewrite their applications to take advantage of magical userspace I/O libraries will see real performance benefits. Recently I attended an interesting talk about this subject in which it was proposed not only to bypass the kernel for access to high-IOPS devices but also to allow byte-addressability for block devices. The slides that accompanied that talk can be found here (includes a performance comparison with the traditional block driver API): Bernard Metzler, On Suitability of High-Performance Networking API for Storage, OFA Int'l Developer Workshop, April 24, 2013 (http://www.openfabrics.org/ofa-documents/presentations/doc_download/559-on-suitability-of-high-performance-networking-api-for-storage.html). This approach leaves the choice of whether to use polling or an interrupt-based completion notification to the user of the new API, something the Linux InfiniBand RDMA verbs API already allows today. Bart. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Mon, 2013-06-24 at 23:07 -0400, Matthew Wilcox wrote: On Mon, Jun 24, 2013 at 08:11:02PM -0400, Steven Rostedt wrote: What about hooking into the idle_balance code? That happens if we are about to go to idle but before the full schedule switch to the idle task. In __schedule(void): if (unlikely(!rq-nr_running)) idle_balance(cpu, rq); That may be a great place to try it from the PoV of the scheduler, but are you OK with me threading a struct backing_dev_info * all the way through the scheduler to idle_balance()? :-) Well, there's other ways to pass data down besides parameters. You could attach something to the task itself. -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Mon, Jun 24 2013, Matthew Wilcox wrote: On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote: I'm wondering, how will this scheme work if the IO completion latency is a lot more than the 5 usecs in the testcase? What if it takes 20 usecs or 100 usecs or more? There's clearly a threshold at which it stops making sense, and our current NAND-based SSDs are almost certainly on the wrong side of that threshold! I can't wait for one of the post-NAND technologies to make it to market in some form that makes it economical to use in an SSD. The problem is that some of the people who are looking at those technologies are crazy. They want to bypass the kernel and do user space I/O because the kernel is too slow. This patch is part of an effort to show them how crazy they are. And even if it doesn't convince them, at least users who refuse to rewrite their applications to take advantage of magical userspace I/O libraries will see real performance benefits. Fully concur with that. At least on the read side, nand is just getting crappier and polled completions is usually not going to be great. On the write side, however, there are definite gains. Completions in the 10-15usec range aren't unusual. And once we hit PCM, well, it'll be fun. On the write side, there are plenty of super latency customers out there who would LOVE to poll when/if it's useful. Most often also the same kind of people who talk the crazy of putting everything in user space. Which is why I like the polling. If we can get sufficiently close, then we can shut some of that up. -- Jens Axboe -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Sun, Jun 23 2013, Ingo Molnar wrote: I'm wondering why this makes such a performance difference. They key ingredient here is simply not going to sleep, only to get an IRQ and get woken up very shortly again. NAPI and similar approaches work great for high IOPS cases, where you maintain a certain depth of IO. For lower queue depth or sync IO (like Willy is running here), nothing beats the pure app driven poll from a latency perspective. I've seen plenty of systems where the power management is also so aggressive, that you manage to enter lower C states very quickly and that then of course makes things even worse. Intelligent polling would make that less of a problem. Willy, I think the general design is fine, hooking in via the bdi is the only way to get back to the right place from where you need to sleep. Some thoughts: - This should be hooked in via blk-iopoll, both of them should call into the same driver hook for polling completions. - It needs to be more intelligent in when you want to poll and when you want regular irq driven IO. - With the former note, the app either needs to opt in (and hence willingly sacrifice CPU cycles of its scheduling slice) or it needs to be nicer in when it gives up and goes back to irq driven IO. -- Jens Axboe -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Sun, Jun 23 2013, Linus Torvalds wrote: nothing in common. Networking very very seldom has the kind of submit and wait for immediate result issues that disk reads do. That said, I dislike the patch intensely. I do not think it's at all a good idea to look at need_resched to say I can spin now. You're still wasting CPU cycles. So Willy, please do *not* mix this up with the scheduler, or at least not need_resched. Instead, maybe we should introduce a notion of if we are switching to the idle thread, let's see if we can try to do some IO synchronously. You could try to do that either *in* the idle thread (which would take the context switch overhead - maybe negating some of the advantages), or alternatively hook into the scheduler idle logic before actually doing the switch. It can't happen in the idle thread. If you need to take the context switch, then you've negated pretty much all of the gain of the polled approach. But anything that starts polling when there are other runnable processes to be done sounds really debatable. Even if it's only 5us or so. There's a lot of real work that could be done in 5us. IMHO that depends mostly on whether the app wants to give up those 5us of work time just spinning on the completion of important IO. You obviously can't steal the time of others. -- Jens Axboe -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
* Linus Torvalds torva...@linux-foundation.org wrote: On Sun, Jun 23, 2013 at 12:09 AM, Ingo Molnar mi...@kernel.org wrote: The spinning approach you add has the disadvantage of actively wasting CPU time, which could be used to run other tasks. In general it's much better to make sure the completion IRQs are rate-limited and just schedule. This (combined with a metric ton of fine details) is what the networking code does in essence, and they have no trouble reaching very high throughput. It's not about throughput - it's about latency. Don't ever confuse the two, they have almost nothing in common. Networking very very seldom has the kind of submit and wait for immediate result issues that disk reads do. Yeah, indeed that's true, the dd measurement Matthew did issued IO at a rate of one sector at a time and waiting for every sector to complete: dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100 So my suggestions about batching and IRQ rate control are immaterial... That said, I dislike the patch intensely. I do not think it's at all a good idea to look at need_resched to say I can spin now. You're still wasting CPU cycles. So Willy, please do *not* mix this up with the scheduler, or at least not need_resched. Instead, maybe we should introduce a notion of if we are switching to the idle thread, let's see if we can try to do some IO synchronously. You could try to do that either *in* the idle thread (which would take the context switch overhead - maybe negating some of the advantages), or alternatively hook into the scheduler idle logic before actually doing the switch. But anything that starts polling when there are other runnable processes to be done sounds really debatable. Even if it's only 5us or so. There's a lot of real work that could be done in 5us. I'm wondering, how will this scheme work if the IO completion latency is a lot more than the 5 usecs in the testcase? What if it takes 20 usecs or 100 usecs or more? Will we still burn our CPU time, wasting power, inflating this CPU's load which keeps other CPUs from balancing tasks over to this CPU, etc? In the 5 usecs case it looks beneficial to do. In the longer-latency cases I'm not so sure. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
* Jens Axboe ax...@kernel.dk wrote: - With the former note, the app either needs to opt in (and hence willingly sacrifice CPU cycles of its scheduling slice) or it needs to be nicer in when it gives up and goes back to irq driven IO. The scheduler could look at sleep latency averages of the task in question - we measure that already in most cases. If the 'average sleep latency' is below a certain threshold, the scheduler, if it sees that the CPU is about to go idle, could delay doing the context switch and do light idle-polling, for say twice the length of the expected sleep latency - assuming the CPU is otherwise idle - before it really schedules away the task and the CPU goes idle. This would still require an IRQ and a wakeup to be taken, but would avoid the context switch. Yet I have an ungood feeling about depending on actual latency values so explicitly. There will have to be a cutoff value, and if a workload is just below or just above that threshold then behavior will change markedly. Such schemes rarely worked out nicely in the past. [Might still be worth trying it.] Couldn't the block device driver itself estimate the expected latency of IO completion and simply poll if that's expected to be very short [such as there's only a single outstanding IO to a RAM backed device]? IO drivers doing some polling and waiting in the microseconds range isnt overly controversial. I'd even do that if the CPU is busy otherwise: the task should see a proportional slowdown as load increases, with no change in IO queueing behavior. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
* David Ahern dsah...@gmail.com wrote: On 6/23/13 3:09 AM, Ingo Molnar wrote: If an IO driver is implemented properly then it will batch up requests for the controller, and gets IRQ-notified on a (sub-)batch of buffers completed. If there's any spinning done then it should be NAPI-alike polling: a single is stuff completed polling pass per new block of work submitted, to opportunistically interleave completion with submission work. I don't see where active spinning brings would improve performance compared to a NAPI-alike technique. Your numbers obviously show a speedup we'd like to have, I'm just wondering whether the same speedup (or even more) could be implemented via: - smart batching that rate-limits completion IRQs in essence + NAPI-alike polling ... which would almost never result in IRQ driven completion when we are close to CPU-bound and while not yet saturating the IO controller's capacity. The spinning approach you add has the disadvantage of actively wasting CPU time, which could be used to run other tasks. In general it's much better to make sure the completion IRQs are rate-limited and just schedule. This (combined with a metric ton of fine details) is what the networking code does in essence, and they have no trouble reaching very high throughput. Networking code has a similar proposal for low latency sockets using polling: https://lwn.net/Articles/540281/ In that case it might make sense to try the generic approach I suggested in the previous mail, which would measure average sleep latencies of tasks, and would do light idle-polling instead of the more expensive switch-to-the-idle-task context switch plus associated RCU, nohz, etc. busy-CPU-tear-down and the symmetric build-up work on idle wakeup. The IO driver would still have to take an IRQ though, preferably on the CPU that runs the IO task ... Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Mon, Jun 24, 2013 at 09:17:18AM +0200, Jens Axboe wrote: On Sun, Jun 23 2013, Linus Torvalds wrote: You could try to do that either *in* the idle thread (which would take the context switch overhead - maybe negating some of the advantages), or alternatively hook into the scheduler idle logic before actually doing the switch. It can't happen in the idle thread. If you need to take the context switch, then you've negated pretty much all of the gain of the polled approach. What about hooking into the idle_balance code? That happens if we are about to go to idle but before the full schedule switch to the idle task. In __schedule(void): if (unlikely(!rq-nr_running)) idle_balance(cpu, rq); -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Mon, Jun 24, 2013 at 09:15:45AM +0200, Jens Axboe wrote: Willy, I think the general design is fine, hooking in via the bdi is the only way to get back to the right place from where you need to sleep. Some thoughts: - This should be hooked in via blk-iopoll, both of them should call into the same driver hook for polling completions. I actually started working on this, then I realised that it's actually a bad idea. blk-iopoll's poll function is to poll the single I/O queue closest to this CPU. The iowait poll function is to poll all queues that the I/O for this address_space might complete on. I'm reluctant to ask drivers to define two poll functions, but I'm even more reluctant to ask them to define one function with two purposes. - It needs to be more intelligent in when you want to poll and when you want regular irq driven IO. Oh yeah, absolutely. While the example patch didn't show it, I wouldn't enable it for all NVMe devices; only ones with sufficiently low latency. There's also the ability for the driver to look at the number of outstanding I/Os and return an error (eg -EBUSY) to stop spinning. - With the former note, the app either needs to opt in (and hence willingly sacrifice CPU cycles of its scheduling slice) or it needs to be nicer in when it gives up and goes back to irq driven IO. Yup. I like the way you framed it. If the task *wants* to spend its CPU cycles on polling for I/O instead of giving up the remainder of its time slice, then it should be able to do that. After all, it already can; it can submit an I/O request via AIO, and then call io_getevents in a tight loop. So maybe the right way to do this is with a task flag? If we go that route, I'd like to further develop this option to allow I/Os to be designated as low latency vs normal. Taking a page fault would be low latency for all tasks, not just ones that choose to spin for I/O. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Mon, Jun 24, 2013 at 08:11:02PM -0400, Steven Rostedt wrote: What about hooking into the idle_balance code? That happens if we are about to go to idle but before the full schedule switch to the idle task. In __schedule(void): if (unlikely(!rq-nr_running)) idle_balance(cpu, rq); That may be a great place to try it from the PoV of the scheduler, but are you OK with me threading a struct backing_dev_info * all the way through the scheduler to idle_balance()? :-) -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote: I'm wondering, how will this scheme work if the IO completion latency is a lot more than the 5 usecs in the testcase? What if it takes 20 usecs or 100 usecs or more? There's clearly a threshold at which it stops making sense, and our current NAND-based SSDs are almost certainly on the wrong side of that threshold! I can't wait for one of the post-NAND technologies to make it to market in some form that makes it economical to use in an SSD. The problem is that some of the people who are looking at those technologies are crazy. They want to bypass the kernel and do user space I/O because the kernel is too slow. This patch is part of an effort to show them how crazy they are. And even if it doesn't convince them, at least users who refuse to rewrite their applications to take advantage of magical userspace I/O libraries will see real performance benefits. -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
* Matthew Wilcox wi...@linux.intel.com wrote: A paper at FAST2012 (http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed out the performance overhead of taking interrupts for low-latency block I/Os. The solution the author investigated was to spin waiting for each I/O to complete. This is inefficient as Linux submits many I/Os which are not latency-sensitive, and even when we do submit latency-sensitive I/Os (eg swap-in), we frequently submit several I/Os before waiting. This RFC takes a different approach, only spinning when we would otherwise sleep. To implement this, I add an 'io_poll' function pointer to backing_dev_info. I include a sample implementation for the NVMe driver. Next, I add an io_wait() function which will call io_poll() if it is set. It falls back to calling io_schedule() if anything goes wrong with io_poll() or the task exceeds its timeslice. Finally, all that is left is to judiciously replace calls to io_schedule() with calls to io_wait(). I think I've covered the main contenders with sleep_on_page(), sleep_on_buffer() and the DIO path. I've measured the performance benefits of this with a Chatham NVMe prototype device and a simple # dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100 The latency of each I/O reduces by about 2.5us (from around 8.0us to around 5.5us). This matches up quite well with the performance numbers shown in the FAST2012 paper (which used a similar device). Nice speedup! Is backing_dev_info the right place for this function pointer? Have I made any bad assumptions about the correct way to retrieve the backing_dev_info from (eg) a struct page, buffer_head or kiocb? Should I pass a 'state' into io_wait() instead of retrieving it from 'current'? Is kernel/sched/core.c the right place for io_wait()? Should it be bdi_wait() instead? Should it be marked with __sched? Where should I add documentation for the io_poll() function pointer? NB: The NVMe driver piece of this is for illustrative purposes only and should not be applied. I've rearranged the diff so that the interesting pieces appear first. -EMISSINGDIFFSTAT diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index c388155..97f8fde 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -68,6 +68,7 @@ struct backing_dev_info { unsigned int capabilities; /* Device capabilities */ congested_fn *congested_fn; /* Function pointer if device is md/dm */ void *congested_data; /* Pointer to aux data for congested func */ + int (*io_poll)(struct backing_dev_info *); char *name; @@ -126,6 +127,8 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi); void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi); void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2); +void io_wait(struct backing_dev_info *bdi); + extern spinlock_t bdi_lock; extern struct list_head bdi_list; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 58453b8..4840065 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4527,6 +4527,36 @@ long __sched io_schedule_timeout(long timeout) return ret; } +/* + * Wait for an I/O to complete against this backing_dev_info. If the + * task exhausts its timeslice polling for completions, call io_schedule() + * anyway. If a signal comes pending, return so the task can handle it. + * If the io_poll returns an error, give up and call io_schedule(), but + * swallow the error. We may miss an I/O completion (eg if the interrupt + * handler gets to it first). Guard against this possibility by returning + * if we've been set back to TASK_RUNNING. + */ +void io_wait(struct backing_dev_info *bdi) +{ + if (bdi bdi-io_poll) { + long state = current-state; + while (!need_resched()) { + int ret = bdi-io_poll(bdi); + if ((ret 0) || signal_pending_state(state, current)) { + set_current_state(TASK_RUNNING); + return; + } + if (current-state == TASK_RUNNING) + return; + if (ret 0) + break; + cpu_relax(); + } + } + + io_schedule(); +} I'm wondering why this makes such a performance difference. If an IO driver is implemented properly then it will batch up requests for the controller, and gets IRQ-notified on a (sub-)batch of buffers completed. If there's any spinning done then it should be NAPI-alike polling: a single is stuff completed polling pass per new block of work submitted, to opportunistically interleave completion with submission work. I don't see where active spinning brings would improve performance compared to a NAPI-alike technique. Your numbers
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Sun, Jun 23, 2013 at 12:09 AM, Ingo Molnar mi...@kernel.org wrote: The spinning approach you add has the disadvantage of actively wasting CPU time, which could be used to run other tasks. In general it's much better to make sure the completion IRQs are rate-limited and just schedule. This (combined with a metric ton of fine details) is what the networking code does in essence, and they have no trouble reaching very high throughput. It's not about throughput - it's about latency. Don't ever confuse the two, they have almost nothing in common. Networking very very seldom has the kind of submit and wait for immediate result issues that disk reads do. That said, I dislike the patch intensely. I do not think it's at all a good idea to look at need_resched to say I can spin now. You're still wasting CPU cycles. So Willy, please do *not* mix this up with the scheduler, or at least not need_resched. Instead, maybe we should introduce a notion of if we are switching to the idle thread, let's see if we can try to do some IO synchronously. You could try to do that either *in* the idle thread (which would take the context switch overhead - maybe negating some of the advantages), or alternatively hook into the scheduler idle logic before actually doing the switch. But anything that starts polling when there are other runnable processes to be done sounds really debatable. Even if it's only 5us or so. There's a lot of real work that could be done in 5us. Linus -- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RFC: Allow block drivers to poll for I/O instead of sleeping
A paper at FAST2012 (http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed out the performance overhead of taking interrupts for low-latency block I/Os. The solution the author investigated was to spin waiting for each I/O to complete. This is inefficient as Linux submits many I/Os which are not latency-sensitive, and even when we do submit latency-sensitive I/Os (eg swap-in), we frequently submit several I/Os before waiting. This RFC takes a different approach, only spinning when we would otherwise sleep. To implement this, I add an 'io_poll' function pointer to backing_dev_info. I include a sample implementation for the NVMe driver. Next, I add an io_wait() function which will call io_poll() if it is set. It falls back to calling io_schedule() if anything goes wrong with io_poll() or the task exceeds its timeslice. Finally, all that is left is to judiciously replace calls to io_schedule() with calls to io_wait(). I think I've covered the main contenders with sleep_on_page(), sleep_on_buffer() and the DIO path. I've measured the performance benefits of this with a Chatham NVMe prototype device and a simple # dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100 The latency of each I/O reduces by about 2.5us (from around 8.0us to around 5.5us). This matches up quite well with the performance numbers shown in the FAST2012 paper (which used a similar device). Is backing_dev_info the right place for this function pointer? Have I made any bad assumptions about the correct way to retrieve the backing_dev_info from (eg) a struct page, buffer_head or kiocb? Should I pass a 'state' into io_wait() instead of retrieving it from 'current'? Is kernel/sched/core.c the right place for io_wait()? Should it be bdi_wait() instead? Should it be marked with __sched? Where should I add documentation for the io_poll() function pointer? NB: The NVMe driver piece of this is for illustrative purposes only and should not be applied. I've rearranged the diff so that the interesting pieces appear first. diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index c388155..97f8fde 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -68,6 +68,7 @@ struct backing_dev_info { unsigned int capabilities; /* Device capabilities */ congested_fn *congested_fn; /* Function pointer if device is md/dm */ void *congested_data; /* Pointer to aux data for congested func */ + int (*io_poll)(struct backing_dev_info *); char *name; @@ -126,6 +127,8 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi); void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi); void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2); +void io_wait(struct backing_dev_info *bdi); + extern spinlock_t bdi_lock; extern struct list_head bdi_list; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 58453b8..4840065 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4527,6 +4527,36 @@ long __sched io_schedule_timeout(long timeout) return ret; } +/* + * Wait for an I/O to complete against this backing_dev_info. If the + * task exhausts its timeslice polling for completions, call io_schedule() + * anyway. If a signal comes pending, return so the task can handle it. + * If the io_poll returns an error, give up and call io_schedule(), but + * swallow the error. We may miss an I/O completion (eg if the interrupt + * handler gets to it first). Guard against this possibility by returning + * if we've been set back to TASK_RUNNING. + */ +void io_wait(struct backing_dev_info *bdi) +{ + if (bdi bdi-io_poll) { + long state = current-state; + while (!need_resched()) { + int ret = bdi-io_poll(bdi); + if ((ret 0) || signal_pending_state(state, current)) { + set_current_state(TASK_RUNNING); + return; + } + if (current-state == TASK_RUNNING) + return; + if (ret 0) + break; + cpu_relax(); + } + } + + io_schedule(); +} + /** * sys_sched_get_priority_max - return maximum RT priority. * @policy: scheduling class. diff --git a/fs/aio.c b/fs/aio.c index 2bbcacf..7b20397 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -453,11 +453,15 @@ static void kill_ioctx(struct kioctx *ctx) */ ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { + struct backing_dev_info *bdi = NULL; + + if (iocb-ki_filp) + bdi = iocb-ki_filp-f_mapping-backing_dev_info; while (atomic_read(iocb-ki_users)) { set_current_state(TASK_UNINTERRUPTIBLE); if (!atomic_read(iocb-ki_users)) break; - io_schedule(); +