On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
> On Fri 17-01-14 08:57:25, Robert Haas wrote:
> > On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton <jlay...@redhat.com> wrote:
> > > So this says to me that the WAL is a place where DIO should really be
> > > reconsidered. It's mostly sequential writes that need to hit the disk
> > > ASAP, and you need to know that they have hit the disk before you can
> > > proceed with other operations.
> > 
> > Ironically enough, we actually *have* an option to use O_DIRECT here.
> > But it doesn't work well.  See below.
> > 
> > > Also, is the WAL actually ever read under normal (non-recovery)
> > > conditions or is it write-only under normal operation? If it's seldom
> > > read, then using DIO for them also avoids some double buffering since
> > > they wouldn't go through pagecache.
> > 
> > This is the first problem: if replication is in use, then the WAL gets
> > read shortly after it gets written.  Using O_DIRECT bypasses the
> > kernel cache for the writes, but then the reads stink.
>   OK, yes, this is hard to fix with direct IO.

Actually, it's not. Block level caching is the time-honoured answer
to this problem, and it's been used very successfully on a large
scale by many organisations, e.g. Facebook with MySQL, O_DIRECT, XFS
and flashcache sitting on an SSD in front of rotating storage.
There are multiple choices for this now (bcache, dm-cache,
flashcache, etc.) and they all solve this same problem. In many
cases they do it better than the page cache does, because you can
independently scale the size of the block level cache...

And given the size of SSDs these days, being able to put half a TB
of flash cache in front of spinning disks is a pretty inexpensive
way of solving such IO problems....

> > If we're forcing the WAL out to disk because of transaction commit or
> > because we need to write the buffer protected by a certain WAL record
> > only after the WAL hits the platter, then it's fine.  But sometimes
> > we're writing WAL just because we've run out of internal buffer space,
> > and we don't want to block waiting for the write to complete.  Opening
> > the file with O_SYNC deprives us of the ability to control the timing
> > of the sync relative to the timing of the write.
>   O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
> transaction commit whenever there's any metadata changed on the filesystem.
> Since mtime & ctime of files will be changed often, that will be the case
> very often.

Therefore: O_DSYNC.

> > Maybe it'll be useful to have hints that say "always write this file
> > to disk as quick as you can" and "always postpone writing this file to
> > disk for as long as you can" for WAL and temp files respectively.  But
> > the rule for the data files, which are the really important case, is
> > not so simple.  fsync() is actually a fine API except that it tends to
> > destroy system throughput.  Maybe what we need is just for fsync() to
> > be less aggressive, or a less aggressive version of it.  We wouldn't
> > mind waiting an almost arbitrarily long time for fsync to complete if
> > other processes could still get their I/O requests serviced in a
> > reasonable amount of time in the meanwhile.
>   As I wrote in some other email in this thread, using IO priorities for
> data file checkpoint might be actually the right answer. They will work for
> IO submitted by fsync(). The downside is that currently IO priorities / IO
> scheduling classes work only with CFQ IO scheduler.

And I don't see it being implemented anywhere else because it's the
priority aware scheduling infrastructure in CFQ that causes all the
problems with IO concurrency and scalability...
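
For reference, a rough sketch of what "using IO priorities" looks like
from userspace on Linux. glibc provides no wrapper for ioprio_set(2),
so it goes through syscall(2). The constants are copied from the
kernel's linux/ioprio.h, and set_self_io_priority is a hypothetical
helper name. Whether the priority has any effect depends on the IO
scheduler in use, as discussed above.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Constants mirroring the kernel's uapi header linux/ioprio.h. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_VALUE(ioclass, data) \
    (((ioclass) << IOPRIO_CLASS_SHIFT) | (data))

enum {
    IOPRIO_CLASS_NONE,
    IOPRIO_CLASS_RT,    /* realtime */
    IOPRIO_CLASS_BE,    /* best effort, levels 0 (high) to 7 (low) */
    IOPRIO_CLASS_IDLE   /* only serviced when the disk is otherwise idle */
};

enum {
    IOPRIO_WHO_PROCESS = 1,
    IOPRIO_WHO_PGRP,
    IOPRIO_WHO_USER
};

/*
 * Set the IO scheduling class and level of the calling thread
 * (who == 0 means "the caller"). Returns 0 on success, -1 on error
 * with errno set.
 */
int set_self_io_priority(int ioclass, int level)
{
    return (int)syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                        IOPRIO_PRIO_VALUE(ioclass, level));
}
```

A checkpointing process might call
set_self_io_priority(IOPRIO_CLASS_BE, 7) before issuing its fsync()
calls and restore its previous priority afterwards.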


Dave Chinner

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)