On Tue, Jan 14, 2014 at 09:54:20PM -0600, Jim Nasby wrote:
> On 1/14/14, 3:41 PM, Dave Chinner wrote:
> >On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
> >>On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <mgor...@suse.de>
> >>wrote:
> >>>Whether the problem is with the system call or the
> >>>programmer is harder to determine.
> >>I think the problem is in part that it's not exactly clear when
> >>we should call it. So
> >>suppose we want to do a checkpoint. What we used to do a long
> >>time ago is write everything, and then fsync it all, and then
> >>call it good. But that produced horrible I/O storms. So what
> >>we do now is do the writes over a period of time, with sleeps in
> >>between, and then fsync it all at the end, hoping that the
> >>kernel will write some of it before the fsyncs arrive so that we
> >>don't get a huge I/O spike. And that sorta works, and it's
> >>definitely better than doing it all at full speed, but it's
> >>pretty imprecise. If the kernel doesn't write enough of the
> >>data out in advance, then there's still a huge I/O storm when we
> >>do the fsyncs and everything grinds to a halt. If it writes out
> >>more data than needed in advance, it increases the total number
> >>of physical writes because we get less write-combining, and that
> >>hurts performance, too.
> I think there's a pretty important bit that Robert didn't mention:
> we have a specific *time* target for when we want all the fsync's
> to complete. People that have problems here tend to tune
> checkpoints to complete every 5-15 minutes, and they want the
> write traffic for the checkpoint spread out over 90% of that time
> interval. To put it another way, fsync's should be done when 90%
> of the time to the next checkpoint hits, but preferably not a lot
> before then.
I think that is pretty much understood. I don't recall anyone
mentioning a typical checkpoint period, though, so knowing the
typical timeframe of IO storms and how much data is typically
written in a checkpoint helps us understand the scale of the
problem.
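
For concreteness, the checkpoint pattern described in the quoted text
(writes spread over roughly 90% of the interval, with sleeps in
between, then an fsync at the end) might look something like the
sketch below. This is only an illustration, not PostgreSQL code: the
function name, its parameters, and the even pacing are all assumptions.

```python
import os
import time


def paced_checkpoint(path, chunks, interval_s, spread=0.9, sleep=time.sleep):
    """Write chunks to path over spread * interval_s seconds, then fsync once.

    Hypothetical helper for illustration; names and pacing are assumptions.
    """
    # Time budget for the write phase, divided evenly between chunks.
    pause = (interval_s * spread) / max(len(chunks), 1)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for chunk in chunks:
            os.write(fd, chunk)  # goes to the page cache, not yet durable
            sleep(pause)         # leave time for background writeback
        # Durability point. How long this takes depends on how much dirty
        # data the kernel still holds for this file at this moment.
        os.fsync(fd)
    finally:
        os.close(fd)
    return pause
```

With a 10 minute interval, `paced_checkpoint(path, chunks, 600)` would
spend about 9 minutes writing before the fsync. Nothing in it controls
when the kernel actually writes the data back, which is the
imprecision being discussed.
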
> >It sounds to me like you want the kernel to start background
> >writeback earlier so that it doesn't build up as much dirty data
> >before you require a flush. There are several ways to do this by
> >tweaking writeback knobs. The simplest is probably just to set
> >/proc/sys/vm/dirty_background_bytes to an appropriate threshold
> >(say 50MB) and dirty_expire_centisecs to a few seconds so that
> >background writeback starts and walks all dirty inodes almost
> >immediately. This will keep a steady stream of low level
> >background IO going, and fsync should then not take very long.
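
As a rough sketch of that tuning: the kernel exposes the expiry knob
as /proc/sys/vm/dirty_expire_centisecs, in units of 1/100 second. The
helper name below is hypothetical, and the 3 second value is just one
reading of "a few seconds", not a recommendation. Writing these files
needs root.

```python
import os


def set_writeback_knobs(background_bytes=50 * 1024 * 1024,
                        expire_centisecs=300,
                        vm_dir="/proc/sys/vm"):
    """Write the two writeback thresholds (hypothetical helper)."""
    settings = {
        # Start background writeback once ~50MB of dirty data exists.
        "dirty_background_bytes": background_bytes,
        # Dirty data older than this is eligible for writeback.
        # Units are centiseconds, so 300 means 3 seconds (assumed value).
        "dirty_expire_centisecs": expire_centisecs,
    }
    for name, value in settings.items():
        with open(os.path.join(vm_dir, name), "w") as f:
            f.write(str(value))
    return settings
```

The vm_dir parameter exists only so the function can be pointed at a
scratch directory for a dry run. The right values depend on the
storage and the workload.
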
> Except that still won't throttle writes, right? That's the big
> issue here: our users often can't tolerate big spikes in IO
> latency. They want user requests to always happen within a
> specific amount of time.
Right, but that's a different problem and one that io scheduling
tweaks can have a major effect on. e.g. the deadline scheduler
should be able to provide a maximum upper bound on read IO latency
even while writes are in progress, though how successful it is is
dependent on the nature of the write load and the architecture of
the underlying storage.
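
A minimal sketch of checking and selecting the deadline scheduler
through sysfs follows. The helper names and the 250ms figure are
assumptions for illustration. On kernels using blk-mq the scheduler
may be listed as "mq-deadline" instead, so check what the scheduler
file actually offers first.

```python
import os


def active_scheduler(text):
    """Return the bracketed entry from e.g. 'noop [deadline] cfq'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def select_deadline(device, read_expire_ms=None, sys_block="/sys/block"):
    """Select deadline for one device; optionally set its read deadline."""
    queue = os.path.join(sys_block, device, "queue")
    with open(os.path.join(queue, "scheduler"), "w") as f:
        f.write("deadline")
    if read_expire_ms is not None:
        # read_expire is the target time (in ms) for servicing a read.
        with open(os.path.join(queue, "iosched", "read_expire"), "w") as f:
            f.write(str(read_expire_ms))
```

For example, `select_deadline("sda", read_expire_ms=250)` run as root.
As noted above, how well the deadline is honoured depends on the
write load and the storage underneath.
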
However, the first problem is dealing with the IO storm problem on
fsync. Then we can measure the effect of spreading those writes out
in time and determine what triggers read starvations (if they are
apparent). Then we can look at whether IO scheduling tweaks or
whether blk-io throttling solves those problems. Or whether
something else needs to be done to make it work in environments
where problems are manifesting.
FWIW [and I know you're probably sick of hearing this by now], but
the blk-io throttling works almost perfectly with applications that
use direct IO.....
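
For anyone who wants to experiment with it, a sketch of setting a
write bandwidth limit with the cgroup v1 blkio controller is below.
The control file takes "major:minor bytes_per_second". The helper
names and the cgroup path in the usage note are hypothetical.

```python
import os


def throttle_rule(major, minor, bytes_per_sec):
    """Format one rule for blkio.throttle.write_bps_device."""
    return f"{major}:{minor} {bytes_per_sec}"


def device_numbers(dev_path):
    """Return (major, minor) for a block device node such as /dev/sdb."""
    st = os.stat(dev_path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def limit_write_bps(cgroup_dir, dev_path, bytes_per_sec):
    """Apply a write bandwidth cap to every task in the given cgroup."""
    major, minor = device_numbers(dev_path)
    rule = throttle_rule(major, minor, bytes_per_sec)
    with open(os.path.join(cgroup_dir, "blkio.throttle.write_bps_device"), "w") as f:
        f.write(rule)
    return rule
```

Usage would be along the lines of
`limit_write_bps("/sys/fs/cgroup/blkio/pg", "/dev/sdb", 10 * 1024 * 1024)`
after creating the group and adding the relevant tasks to it.
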
> So while delaying writes potentially reduces the total amount of
> data you're writing, users that run into problems here ultimately
> care more about ensuring that their foreground IO completes in a
> timely fashion.
Understood. Applications that crunch randomly through large data
sets are almost always read IO latency bound....
> >Fundamentally, though, we need bug reports from people seeing
> >these problems when they see them so we can diagnose them on
> >their systems. Trying to discuss/diagnose these problems without
> >knowing anything about the storage, the kernel version, writeback
> >thresholds, etc really doesn't work because we can't easily
> >determine a root cause.
> So is lsf...@linux-foundation.org the best way to accomplish that?
No. That is just the list for organising the LSF/MM summit. ;)
For general pagecache and writeback issues, discussions, etc,
linux-fsde...@vger.kernel.org is the list to use. LKML simply has
too much noise to be useful these days, so I'd avoid it. Otherwise
the filesystem specific lists are a good place to get help for
specific problems (e.g. linux-e...@vger.kernel.org and
x...@oss.sgi.com). We tend to cross-post to other relevant lists as
triage moves into different areas of the storage stack.
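
To make such reports easier to put together, here is a rough sketch
that collects the kernel version, the writeback thresholds, and the
IO scheduler per block device. The function names and report layout
are made up for illustration. It does not cover the storage hardware
or filesystem details, which still need describing by hand.

```python
import os
import platform

# Writeback-related sysctls under /proc/sys/vm.
KNOBS = (
    "dirty_background_bytes", "dirty_background_ratio",
    "dirty_bytes", "dirty_ratio",
    "dirty_expire_centisecs", "dirty_writeback_centisecs",
)


def read_text(path):
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def collect_report(vm_dir="/proc/sys/vm", sys_block="/sys/block"):
    """Gather basic details worth including in a writeback bug report."""
    report = {"kernel": platform.release(), "writeback": {}, "schedulers": {}}
    for knob in KNOBS:
        report["writeback"][knob] = read_text(os.path.join(vm_dir, knob))
    if os.path.isdir(sys_block):
        for dev in sorted(os.listdir(sys_block)):
            path = os.path.join(sys_block, dev, "queue", "scheduler")
            report["schedulers"][dev] = read_text(path)
    return report
```

Printing `collect_report()` on the affected machine and pasting the
result into the report gives a starting point for triage.
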
> Also, along the lines of collaboration, it would also be awesome
> to see kernel hackers at PGCon (http://pgcon.org) for further
> discussion of this stuff.
True, but I don't think I'll be one of those hackers as Ottawa is
(roughly) a 30 hour commute from where I live and I try to limit the
number of them I do every year....
Sent via pgsql-hackers mailing list (email@example.com)