Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-07-03 Thread Shaohua Li
On Thu, Jun 20, 2013 at 04:17:13PM -0400, Matthew Wilcox wrote:
 
 A paper at FAST2012
 (http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed
 out the performance overhead of taking interrupts for low-latency block
 I/Os.  The solution the author investigated was to spin waiting for each
 I/O to complete.  This is inefficient as Linux submits many I/Os which
 are not latency-sensitive, and even when we do submit latency-sensitive
 I/Os (eg swap-in), we frequently submit several I/Os before waiting.
 
 This RFC takes a different approach, only spinning when we would
 otherwise sleep.  To implement this, I add an 'io_poll' function pointer
 to backing_dev_info.  I include a sample implementation for the NVMe
 driver.  Next, I add an io_wait() function which will call io_poll()
 if it is set.  It falls back to calling io_schedule() if anything goes
 wrong with io_poll() or the task exceeds its timeslice.  Finally, all
 that is left is to judiciously replace calls to io_schedule() with
 calls to io_wait().  I think I've covered the main contenders with
 sleep_on_page(), sleep_on_buffer() and the DIO path.
 
 I've measured the performance benefits of this with a Chatham NVMe
 prototype device and a simple
 # dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100
 The latency of each I/O reduces by about 2.5us (from around 8.0us to
 around 5.5us).  This matches up quite well with the performance numbers
 shown in the FAST2012 paper (which used a similar device).

Hi Matthew,
I'm wondering where the 2.5us latency cut comes from. I did a simple test: on
my 3.4GHz Xeon, one CPU can do about 2M application context switches per
second. Assuming switching to idle is faster, switching to idle and back
should take less than 1us. Does the 2.5us latency cut mostly come from deep
idle state exit latency? If so, maybe setting a lower pm_qos value, or having
a better idle governor that prevents the CPU from entering deep idle states,
could help too.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-27 Thread Rik van Riel

On 06/23/2013 02:29 PM, Linus Torvalds wrote:


You could try to do that either *in* the idle thread (which would take
the context switch overhead - maybe negating some of the advantages),
or alternatively hook into the scheduler idle logic before actually
doing the switch.

But anything that starts polling when there are other runnable
processes to be done sounds really debatable. Even if it's only 5us
or so. There's a lot of real work that could be done in 5us.


Having a hook into the idle code could be useful for
KVM, too.

In certain message passing workloads, we have seen
that the system throughput suffers greatly simply
by having the KVM thread's preempt notifiers save
CPU state when the guest goes idle, and re-load it
when the guest VCPU is activated again.

Avoiding the context switch overhead when nothing
else wants to run, but being immediately preempted
when something does, could be a big win.

Would it make sense to have some hook that says
"I *am* an idle thread right now", as opposed to
"context switch to the idle thread"?

That hook could even run the cpuidle code, and
switch to the real idle task (saving the current
task state), and put the CPU into a C-state, when
the expected sleep time is long...





Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-27 Thread Rik van Riel

On 06/20/2013 04:17 PM, Matthew Wilcox wrote:


--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4527,6 +4527,36 @@ long __sched io_schedule_timeout(long timeout)
return ret;
  }

+/*
+ * Wait for an I/O to complete against this backing_dev_info.  If the
+ * task exhausts its timeslice polling for completions, call io_schedule()
+ * anyway.  If a signal becomes pending, return so the task can handle it.
+ * If the io_poll returns an error, give up and call io_schedule(), but
+ * swallow the error.  We may miss an I/O completion (eg if the interrupt
+ * handler gets to it first).  Guard against this possibility by returning
+ * if we've been set back to TASK_RUNNING.
+ */
+void io_wait(struct backing_dev_info *bdi)
+{


I would like something a little more generic in the
scheduler code, that could also be used by other
things in the kernel (say, KVM with message passing
workloads).

Maybe something looking a little like this?

void idle_poll(struct idle_poll_info *ipi);

struct idle_poll_info {
	int (*idle_poll_func)(void *data);
	int (*idle_poll_preempt)(void *data);
	void *data;
};

That way the kernel can:

1) mark the current thread as having idle priority,
   allowing the scheduler to preempt it if something
   else wants to run

2) switch to asynchronous mode if something else wants
   to run, or if the average wait for the process is
   so long that it is better to go asynchronous and
   avoid polling

3) poll for completion if nothing else wants to run

Does that make sense?

Did I forget something you need?

Did I forget something KVM could need?

Is this insane? If so, is it too insane? :)


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-25 Thread Bart Van Assche

On 06/25/13 05:18, Matthew Wilcox wrote:

On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote:

I'm wondering, how will this scheme work if the IO completion latency is a
lot more than the 5 usecs in the testcase? What if it takes 20 usecs or
100 usecs or more?


There's clearly a threshold at which it stops making sense, and our
current NAND-based SSDs are almost certainly on the wrong side of that
threshold!  I can't wait for one of the post-NAND technologies to make
it to market in some form that makes it economical to use in an SSD.

The problem is that some of the people who are looking at those
technologies are crazy.  They want to bypass the kernel and do user
space I/O because the kernel is too slow.  This patch is part of an
effort to show them how crazy they are.  And even if it doesn't convince
them, at least users who refuse to rewrite their applications to take
advantage of magical userspace I/O libraries will see real performance
benefits.


Recently I attended an interesting talk about this subject in which it 
was proposed not only to bypass the kernel for access to high-IOPS 
devices but also to allow byte-addressability for block devices. The 
slides that accompanied that talk can be found here (includes a 
performance comparison with the traditional block driver API):


Bernard Metzler, On Suitability of High-Performance Networking API for 
Storage, OFA Int'l Developer Workshop, April 24, 2013 
(http://www.openfabrics.org/ofa-documents/presentations/doc_download/559-on-suitability-of-high-performance-networking-api-for-storage.html).


This approach leaves the choice of whether to use polling or an 
interrupt-based completion notification to the user of the new API, 
something the Linux InfiniBand RDMA verbs API already allows today.


Bart.


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-25 Thread Steven Rostedt
On Mon, 2013-06-24 at 23:07 -0400, Matthew Wilcox wrote:
 On Mon, Jun 24, 2013 at 08:11:02PM -0400, Steven Rostedt wrote:
  What about hooking into the idle_balance code? That happens if we are
  about to go to idle but before the full schedule switch to the idle
  task.
  
  
  In __schedule(void):
  
  if (unlikely(!rq->nr_running))
  	idle_balance(cpu, rq);
 
 That may be a great place to try it from the PoV of the scheduler, but are
 you OK with me threading a struct backing_dev_info * all the way through
 the scheduler to idle_balance()?  :-)

Well, there's other ways to pass data down besides parameters. You could
attach something to the task itself.

-- Steve




Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-25 Thread Jens Axboe
On Mon, Jun 24 2013, Matthew Wilcox wrote:
 On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote:
  I'm wondering, how will this scheme work if the IO completion latency is a 
  lot more than the 5 usecs in the testcase? What if it takes 20 usecs or 
  100 usecs or more?
 
 There's clearly a threshold at which it stops making sense, and our
 current NAND-based SSDs are almost certainly on the wrong side of that
 threshold!  I can't wait for one of the post-NAND technologies to make
 it to market in some form that makes it economical to use in an SSD.
 
 The problem is that some of the people who are looking at those
 technologies are crazy.  They want to bypass the kernel and do user
 space I/O because the kernel is too slow.  This patch is part of an
 effort to show them how crazy they are.  And even if it doesn't convince
 them, at least users who refuse to rewrite their applications to take
 advantage of magical userspace I/O libraries will see real performance
 benefits.

Fully concur with that. At least on the read side, nand is just getting
crappier and polled completions are usually not going to be great. On the
write side, however, there are definite gains. Completions in the
10-15usec range aren't unusual. And once we hit PCM, well, it'll be fun.

On the write side, there are plenty of super latency sensitive customers
out there who would LOVE to poll when/if it's useful. Most often also the
same kind of people who talk the crazy talk of putting everything in user
space. Which is why I like the polling. If we can get sufficiently close,
then we can shut some of that up.

-- 
Jens Axboe



Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Jens Axboe
On Sun, Jun 23 2013, Ingo Molnar wrote:
 I'm wondering why this makes such a performance difference.

The key ingredient here is simply not going to sleep, only to get an
IRQ and get woken up very shortly again. NAPI and similar approaches
work great for high IOPS cases, where you maintain a certain depth of
IO. For lower queue depth or sync IO (like Willy is running here),
nothing beats the pure app driven poll from a latency perspective. I've
seen plenty of systems where the power management is also so aggressive,
that you manage to enter lower C states very quickly and that then of
course makes things even worse. Intelligent polling would make that less
of a problem.

Willy, I think the general design is fine, hooking in via the bdi is the
only way to get back to the right place from where you need to sleep.
Some thoughts:

- This should be hooked in via blk-iopoll, both of them should call into
  the same driver hook for polling completions.

- It needs to be more intelligent in when you want to poll and when you
  want regular irq driven IO.

- With the former note, the app either needs to opt in (and hence
  willingly sacrifice CPU cycles of its scheduling slice) or it needs to
  be nicer in when it gives up and goes back to irq driven IO.

-- 
Jens Axboe



Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Jens Axboe
On Sun, Jun 23 2013, Linus Torvalds wrote:
  nothing in common. Networking very very seldom
 has the kind of "submit and wait for immediate result" issues that
 disk reads do.
 
 That said, I dislike the patch intensely. I do not think it's at all a
 good idea to look at need_resched to say "I can spin now". You're
 still wasting CPU cycles.
 
 So Willy, please do *not* mix this up with the scheduler, or at least
 not need_resched. Instead, maybe we should introduce a notion of "if
 we are switching to the idle thread, let's see if we can try to do
 some IO synchronously".
 
 You could try to do that either *in* the idle thread (which would take
 the context switch overhead - maybe negating some of the advantages),
 or alternatively hook into the scheduler idle logic before actually
 doing the switch.

It can't happen in the idle thread. If you need to take the context
switch, then you've negated pretty much all of the gain of the polled
approach.

 But anything that starts polling when there are other runnable
 processes to be done sounds really debatable. Even if it's only 5us
 or so. There's a lot of real work that could be done in 5us.

IMHO that depends mostly on whether the app wants to give up those 5us
of work time just spinning on the completion of important IO. You
obviously can't steal the time of others.

-- 
Jens Axboe



Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Ingo Molnar

* Linus Torvalds torva...@linux-foundation.org wrote:

 On Sun, Jun 23, 2013 at 12:09 AM, Ingo Molnar mi...@kernel.org wrote:
 
  The spinning approach you add has the disadvantage of actively wasting 
  CPU time, which could be used to run other tasks. In general it's much 
  better to make sure the completion IRQs are rate-limited and just 
  schedule. This (combined with a metric ton of fine details) is what 
  the networking code does in essence, and they have no trouble reaching 
  very high throughput.
 
 It's not about throughput - it's about latency. Don't ever confuse the 
 two, they have almost nothing in common. Networking very very seldom has 
 the kind of "submit and wait for immediate result" issues that disk 
 reads do.

Yeah, indeed that's true, the dd measurement Matthew did issued IO at a 
rate of one sector at a time and waiting for every sector to complete:

dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100

So my suggestions about batching and IRQ rate control are immaterial...

 That said, I dislike the patch intensely. I do not think it's at all a 
 good idea to look at need_resched to say "I can spin now". You're 
 still wasting CPU cycles.
 
 So Willy, please do *not* mix this up with the scheduler, or at least 
 not need_resched. Instead, maybe we should introduce a notion of "if 
 we are switching to the idle thread, let's see if we can try to do some 
 IO synchronously".
 
 You could try to do that either *in* the idle thread (which would take 
 the context switch overhead - maybe negating some of the advantages), or 
 alternatively hook into the scheduler idle logic before actually doing 
 the switch.
 
 But anything that starts polling when there are other runnable processes 
 to be done sounds really debatable. Even if it's only 5us or so. 
 There's a lot of real work that could be done in 5us.

I'm wondering, how will this scheme work if the IO completion latency is a 
lot more than the 5 usecs in the testcase? What if it takes 20 usecs or 
100 usecs or more?

Will we still burn our CPU time, wasting power, inflating this CPU's load 
which keeps other CPUs from balancing tasks over to this CPU, etc?

In the 5 usecs case it looks beneficial to do. In the longer-latency cases 
I'm not so sure.

Thanks,

Ingo


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Ingo Molnar

* Jens Axboe ax...@kernel.dk wrote:

 - With the former note, the app either needs to opt in (and hence
   willingly sacrifice CPU cycles of its scheduling slice) or it needs to 
   be nicer in when it gives up and goes back to irq driven IO.

The scheduler could look at sleep latency averages of the task in question 
- we measure that already in most cases.

If the 'average sleep latency' is below a certain threshold, the 
scheduler, if it sees that the CPU is about to go idle, could delay doing 
the context switch and do light idle-polling, for say twice the length 
of the expected sleep latency - assuming the CPU is otherwise idle - 
before it really schedules away the task and the CPU goes idle.

This would still require an IRQ and a wakeup to be taken, but would avoid 
the context switch.

Yet I have an ungood feeling about depending on actual latency values so 
explicitly. There will have to be a cutoff value, and if a workload is 
just below or just above that threshold then behavior will change 
markedly. Such schemes rarely worked out nicely in the past. [Might still 
be worth trying it.]

Couldn't the block device driver itself estimate the expected latency of 
IO completion and simply poll if that's expected to be very short [such as 
there's only a single outstanding IO to a RAM backed device]? IO drivers 
 doing some polling and waiting in the microseconds range isn't overly 
controversial. I'd even do that if the CPU is busy otherwise: the task 
should see a proportional slowdown as load increases, with no change in IO 
queueing behavior.

Thanks,

Ingo


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Ingo Molnar

* David Ahern dsah...@gmail.com wrote:

 On 6/23/13 3:09 AM, Ingo Molnar wrote:
 If an IO driver is implemented properly then it will batch up requests for
 the controller, and gets IRQ-notified on a (sub-)batch of buffers
 completed.
 
 If there's any spinning done then it should be NAPI-alike polling: a
 single "is stuff completed?" polling pass per new block of work submitted,
 to opportunistically interleave completion with submission work.
 
 I don't see where active spinning would improve performance
 compared to a NAPI-alike technique. Your numbers obviously show a speedup
 we'd like to have, I'm just wondering whether the same speedup (or even
 more) could be implemented via:
 
   - smart batching that rate-limits completion IRQs in essence
   + NAPI-alike polling
 
 ... which would almost never result in IRQ driven completion when we are
 close to CPU-bound and while not yet saturating the IO controller's
 capacity.
 
 The spinning approach you add has the disadvantage of actively wasting CPU
 time, which could be used to run other tasks. In general it's much better
 to make sure the completion IRQs are rate-limited and just schedule. This
 (combined with a metric ton of fine details) is what the networking code
 does in essence, and they have no trouble reaching very high throughput.
 
 Networking code has a similar proposal for low latency sockets using 
 polling: https://lwn.net/Articles/540281/

In that case it might make sense to try the generic approach I suggested 
in the previous mail, which would measure average sleep latencies of 
tasks, and would do light idle-polling instead of the more expensive 
switch-to-the-idle-task context switch plus associated RCU, nohz, etc. 
busy-CPU-tear-down and the symmetric build-up work on idle wakeup.

The IO driver would still have to take an IRQ though, preferably on the 
CPU that runs the IO task ...

Thanks,

Ingo


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Steven Rostedt
On Mon, Jun 24, 2013 at 09:17:18AM +0200, Jens Axboe wrote:
 On Sun, Jun 23 2013, Linus Torvalds wrote:
  
  You could try to do that either *in* the idle thread (which would take
  the context switch overhead - maybe negating some of the advantages),
  or alternatively hook into the scheduler idle logic before actually
  doing the switch.
 
 It can't happen in the idle thread. If you need to take the context
 switch, then you've negated pretty much all of the gain of the polled
 approach.

What about hooking into the idle_balance code? That happens if we are
about to go to idle but before the full schedule switch to the idle
task.


In __schedule(void):

if (unlikely(!rq->nr_running))
	idle_balance(cpu, rq);

-- Steve



Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Matthew Wilcox
On Mon, Jun 24, 2013 at 09:15:45AM +0200, Jens Axboe wrote:
 Willy, I think the general design is fine, hooking in via the bdi is the
 only way to get back to the right place from where you need to sleep.
 Some thoughts:
 
 - This should be hooked in via blk-iopoll, both of them should call into
   the same driver hook for polling completions.

I actually started working on this, then I realised that it's actually
a bad idea.  blk-iopoll's poll function is to poll the single I/O queue
closest to this CPU.  The iowait poll function is to poll all queues
that the I/O for this address_space might complete on.

I'm reluctant to ask drivers to define two poll functions, but I'm even
more reluctant to ask them to define one function with two purposes.

 - It needs to be more intelligent in when you want to poll and when you
   want regular irq driven IO.

Oh yeah, absolutely.  While the example patch didn't show it, I wouldn't
enable it for all NVMe devices; only ones with sufficiently low latency.
There's also the ability for the driver to look at the number of
outstanding I/Os and return an error (eg -EBUSY) to stop spinning.

 - With the former note, the app either needs to opt in (and hence
   willingly sacrifice CPU cycles of its scheduling slice) or it needs to
   be nicer in when it gives up and goes back to irq driven IO.

Yup.  I like the way you framed it.  If the task *wants* to spend its
CPU cycles on polling for I/O instead of giving up the remainder of its
time slice, then it should be able to do that.  After all, it already can;
it can submit an I/O request via AIO, and then call io_getevents in a
tight loop.

So maybe the right way to do this is with a task flag?  If we go that
route, I'd like to further develop this option to allow I/Os to be
designated as low latency vs normal.  Taking a page fault would be
low latency for all tasks, not just ones that choose to spin for I/O.


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Matthew Wilcox
On Mon, Jun 24, 2013 at 08:11:02PM -0400, Steven Rostedt wrote:
 What about hooking into the idle_balance code? That happens if we are
 about to go to idle but before the full schedule switch to the idle
 task.
 
 
 In __schedule(void):
 
   if (unlikely(!rq->nr_running))
   	idle_balance(cpu, rq);

That may be a great place to try it from the PoV of the scheduler, but are
you OK with me threading a struct backing_dev_info * all the way through
the scheduler to idle_balance()?  :-)


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-24 Thread Matthew Wilcox
On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote:
 I'm wondering, how will this scheme work if the IO completion latency is a 
 lot more than the 5 usecs in the testcase? What if it takes 20 usecs or 
 100 usecs or more?

There's clearly a threshold at which it stops making sense, and our
current NAND-based SSDs are almost certainly on the wrong side of that
threshold!  I can't wait for one of the post-NAND technologies to make
it to market in some form that makes it economical to use in an SSD.

The problem is that some of the people who are looking at those
technologies are crazy.  They want to bypass the kernel and do user
space I/O because the kernel is too slow.  This patch is part of an
effort to show them how crazy they are.  And even if it doesn't convince
them, at least users who refuse to rewrite their applications to take
advantage of magical userspace I/O libraries will see real performance
benefits.



Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-23 Thread Ingo Molnar

* Matthew Wilcox wi...@linux.intel.com wrote:

 
 A paper at FAST2012
 (http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed
 out the performance overhead of taking interrupts for low-latency block
 I/Os.  The solution the author investigated was to spin waiting for each
 I/O to complete.  This is inefficient as Linux submits many I/Os which
 are not latency-sensitive, and even when we do submit latency-sensitive
 I/Os (eg swap-in), we frequently submit several I/Os before waiting.
 
 This RFC takes a different approach, only spinning when we would
 otherwise sleep.  To implement this, I add an 'io_poll' function pointer
 to backing_dev_info.  I include a sample implementation for the NVMe
 driver.  Next, I add an io_wait() function which will call io_poll()
 if it is set.  It falls back to calling io_schedule() if anything goes
 wrong with io_poll() or the task exceeds its timeslice.  Finally, all
 that is left is to judiciously replace calls to io_schedule() with
 calls to io_wait().  I think I've covered the main contenders with
 sleep_on_page(), sleep_on_buffer() and the DIO path.
 
 I've measured the performance benefits of this with a Chatham NVMe
 prototype device and a simple
 # dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100
 The latency of each I/O reduces by about 2.5us (from around 8.0us to
 around 5.5us).  This matches up quite well with the performance numbers
 shown in the FAST2012 paper (which used a similar device).

Nice speedup!

 Is backing_dev_info the right place for this function pointer?
 Have I made any bad assumptions about the correct way to retrieve
 the backing_dev_info from (eg) a struct page, buffer_head or kiocb?
 Should I pass a 'state' into io_wait() instead of retrieving it from
 'current'?  Is kernel/sched/core.c the right place for io_wait()?
 Should it be bdi_wait() instead?  Should it be marked with __sched?
 Where should I add documentation for the io_poll() function pointer?
 
 NB: The NVMe driver piece of this is for illustrative purposes only and
 should not be applied.  I've rearranged the diff so that the interesting
 pieces appear first.


-EMISSINGDIFFSTAT

 diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
 index c388155..97f8fde 100644
 --- a/include/linux/backing-dev.h
 +++ b/include/linux/backing-dev.h
 @@ -68,6 +68,7 @@ struct backing_dev_info {
   unsigned int capabilities; /* Device capabilities */
   congested_fn *congested_fn; /* Function pointer if device is md/dm */
   void *congested_data;   /* Pointer to aux data for congested func */
 + int (*io_poll)(struct backing_dev_info *);
  
   char *name;
  
 @@ -126,6 +127,8 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi);
  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
  void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
  
 +void io_wait(struct backing_dev_info *bdi);
 +
  extern spinlock_t bdi_lock;
  extern struct list_head bdi_list;
  
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 58453b8..4840065 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -4527,6 +4527,36 @@ long __sched io_schedule_timeout(long timeout)
   return ret;
  }
  
 +/*
 + * Wait for an I/O to complete against this backing_dev_info.  If the
 + * task exhausts its timeslice polling for completions, call io_schedule()
 + * anyway.  If a signal becomes pending, return so the task can handle it.
 + * If the io_poll returns an error, give up and call io_schedule(), but
 + * swallow the error.  We may miss an I/O completion (eg if the interrupt
 + * handler gets to it first).  Guard against this possibility by returning
 + * if we've been set back to TASK_RUNNING.
 + */
 +void io_wait(struct backing_dev_info *bdi)
 +{
 +	if (bdi && bdi->io_poll) {
 +		long state = current->state;
 +		while (!need_resched()) {
 +			int ret = bdi->io_poll(bdi);
 +			if ((ret > 0) || signal_pending_state(state, current)) {
 +				set_current_state(TASK_RUNNING);
 +				return;
 +			}
 +			if (current->state == TASK_RUNNING)
 +				return;
 +			if (ret < 0)
 +				break;
 +			cpu_relax();
 +		}
 +	}
 +
 +	io_schedule();
 +}

I'm wondering why this makes such a performance difference.

If an IO driver is implemented properly then it will batch up requests for 
the controller, and gets IRQ-notified on a (sub-)batch of buffers 
completed.

If there's any spinning done then it should be NAPI-alike polling: a 
single "is stuff completed?" polling pass per new block of work submitted, 
to opportunistically interleave completion with submission work.

I don't see where active spinning would improve performance compared to a 
NAPI-alike technique. Your numbers obviously show a speedup we'd like to 
have, I'm just wondering whether the same speedup (or even more) could be 
implemented via:

  - smart batching that rate-limits completion IRQs in essence
  + NAPI-alike polling

... which would almost never result in IRQ driven completion when we are 
close to CPU-bound and while not yet saturating the IO controller's 
capacity.

The spinning approach you add has the disadvantage of actively wasting CPU 
time, which could be used to run other tasks. In general it's much better 
to make sure the completion IRQs are rate-limited and just schedule. This 
(combined with a metric ton of fine details) is what the networking code 
does in essence, and they have no trouble reaching very high throughput.

Thanks,

	Ingo

Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-23 Thread Linus Torvalds
On Sun, Jun 23, 2013 at 12:09 AM, Ingo Molnar mi...@kernel.org wrote:

 The spinning approach you add has the disadvantage of actively wasting CPU
 time, which could be used to run other tasks. In general it's much better
 to make sure the completion IRQs are rate-limited and just schedule. This
 (combined with a metric ton of fine details) is what the networking code
 does in essence, and they have no trouble reaching very high throughput.

It's not about throughput - it's about latency. Don't ever confuse the
two, they have almost nothing in common. Networking very very seldom
has the kind of "submit and wait for immediate result" issues that
disk reads do.

That said, I dislike the patch intensely. I do not think it's at all a
good idea to look at need_resched to say "I can spin now". You're
still wasting CPU cycles.

So Willy, please do *not* mix this up with the scheduler, or at least
not need_resched. Instead, maybe we should introduce a notion of "if
we are switching to the idle thread, let's see if we can try to do
some IO synchronously".

You could try to do that either *in* the idle thread (which would take
the context switch overhead - maybe negating some of the advantages),
or alternatively hook into the scheduler idle logic before actually
doing the switch.

But anything that starts polling when there are other runnable
processes to be done sounds really debatable. Even if it's only 5us
or so. There's a lot of real work that could be done in 5us.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RFC: Allow block drivers to poll for I/O instead of sleeping

2013-06-20 Thread Matthew Wilcox

A paper at FAST2012
(http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed
out the performance overhead of taking interrupts for low-latency block
I/Os.  The solution the author investigated was to spin waiting for each
I/O to complete.  This is inefficient as Linux submits many I/Os which
are not latency-sensitive, and even when we do submit latency-sensitive
I/Os (eg swap-in), we frequently submit several I/Os before waiting.

This RFC takes a different approach, only spinning when we would
otherwise sleep.  To implement this, I add an 'io_poll' function pointer
to backing_dev_info.  I include a sample implementation for the NVMe
driver.  Next, I add an io_wait() function which will call io_poll()
if it is set.  It falls back to calling io_schedule() if anything goes
wrong with io_poll() or the task exceeds its timeslice.  Finally, all
that is left is to judiciously replace calls to io_schedule() with
calls to io_wait().  I think I've covered the main contenders with
sleep_on_page(), sleep_on_buffer() and the DIO path.

I've measured the performance benefits of this with a Chatham NVMe
prototype device and a simple
# dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100
The latency of each I/O reduces by about 2.5us (from around 8.0us to
around 5.5us).  This matches up quite well with the performance numbers
shown in the FAST2012 paper (which used a similar device).

Is backing_dev_info the right place for this function pointer?
Have I made any bad assumptions about the correct way to retrieve
the backing_dev_info from (eg) a struct page, buffer_head or kiocb?
Should I pass a 'state' into io_wait() instead of retrieving it from
'current'?  Is kernel/sched/core.c the right place for io_wait()?
Should it be bdi_wait() instead?  Should it be marked with __sched?
Where should I add documentation for the io_poll() function pointer?

NB: The NVMe driver piece of this is for illustrative purposes only and
should not be applied.  I've rearranged the diff so that the interesting
pieces appear first.

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..97f8fde 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -68,6 +68,7 @@ struct backing_dev_info {
unsigned int capabilities; /* Device capabilities */
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data;   /* Pointer to aux data for congested func */
+   int (*io_poll)(struct backing_dev_info *);
 
char *name;
 
@@ -126,6 +127,8 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
 void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
 
+void io_wait(struct backing_dev_info *bdi);
+
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 58453b8..4840065 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4527,6 +4527,36 @@ long __sched io_schedule_timeout(long timeout)
return ret;
 }
 
+/*
+ * Wait for an I/O to complete against this backing_dev_info.  If the
+ * task exhausts its timeslice polling for completions, call io_schedule()
+ * anyway.  If a signal becomes pending, return so the task can handle it.
+ * If the io_poll returns an error, give up and call io_schedule(), but
+ * swallow the error.  We may miss an I/O completion (eg if the interrupt
+ * handler gets to it first).  Guard against this possibility by returning
+ * if we've been set back to TASK_RUNNING.
+ */
+void io_wait(struct backing_dev_info *bdi)
+{
+	if (bdi && bdi->io_poll) {
+		long state = current->state;
+		while (!need_resched()) {
+			int ret = bdi->io_poll(bdi);
+			if ((ret > 0) || signal_pending_state(state, current)) {
+				set_current_state(TASK_RUNNING);
+				return;
+			}
+			if (current->state == TASK_RUNNING)
+				return;
+			if (ret < 0)
+				break;
+			cpu_relax();
+   }
+   }
+
+   io_schedule();
+}
+
 /**
  * sys_sched_get_priority_max - return maximum RT priority.
  * @policy: scheduling class.
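As a user-space analogue of the io_wait() logic above - spin on a
completion for a bounded budget, then fall back to a blocking wait -
the same shape can be sketched with pthreads. This is purely
illustrative: completer, wait_for_io and demo are invented names, the
spin budget stands in for the timeslice check, and the condition
variable stands in for io_schedule().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_int io_done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* Stand-in for the device interrupt: completes the I/O after a delay. */
static void *completer(void *arg)
{
	(void)arg;
	usleep(1000);			/* pretend the device takes ~1ms */
	atomic_store(&io_done, 1);
	pthread_mutex_lock(&lock);
	pthread_cond_signal(&cond);	/* the "interrupt" wakeup */
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Poll up to 'budget' times; return 1 if the completion was reaped
 * while spinning, 0 if we had to fall back to the blocking wait. */
static int wait_for_io(long budget)
{
	for (long i = 0; i < budget; i++) {
		if (atomic_load(&io_done))
			return 1;	/* completed while polling */
	}
	pthread_mutex_lock(&lock);	/* analogue of io_schedule() */
	while (!atomic_load(&io_done))
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return 0;
}

/* Run one submit/wait cycle; returns 1 (polled) or 0 (slept). */
static int demo(void)
{
	pthread_t t;

	atomic_store(&io_done, 0);
	if (pthread_create(&t, NULL, completer, NULL))
		return -1;
	int polled = wait_for_io(50L * 1000 * 1000);
	pthread_join(t, NULL);
	return polled;
}
```

Whether demo() reports polling or sleeping depends on timing, which
mirrors the kernel situation: the spin only pays off when the device
completes well inside the waiter's budget.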
diff --git a/fs/aio.c b/fs/aio.c
index 2bbcacf..7b20397 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -453,11 +453,15 @@ static void kill_ioctx(struct kioctx *ctx)
  */
 ssize_t wait_on_sync_kiocb(struct kiocb *iocb)
 {
+	struct backing_dev_info *bdi = NULL;
+
+	if (iocb->ki_filp)
+		bdi = iocb->ki_filp->f_mapping->backing_dev_info;
 	while (atomic_read(&iocb->ki_users)) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		if (!atomic_read(&iocb->ki_users))
 			break;
-   io_schedule();
+