Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jeff Garzik

Jamie Lokier wrote:
> Jeff Garzik wrote:
> > Nick Piggin wrote:
> > > Anyway, the idea of making fsync/fdatasync etc. safe by default is
> > > a good idea IMO, and it is a bad bug that we don't do that :(
> > 
> > Agreed...  it's also disappointing that [unless I'm mistaken] you have
> > to hack each filesystem to support barriers.
> > 
> > It seems far easier to make sync_blkdev() Do The Right Thing, and
> > magically make all filesystems data-safe.
> 
> Well, you need ordered metadata writes, barriers _and_ flushes with
> some filesystems.
> 
> Merely writing all the data pages then issuing a drive cache flush
> won't Do The Right Thing with those filesystems - someone already
> mentioned Btrfs, where it won't.

Oh certainly.  That's why we have a VFS :)  fsync for NFS will look
quite different, too.

> But I agree that your suggestion would make a superb default, for
> filesystems which don't provide their own function.

Yep.  That would immediately cover a bunch of filesystems.

> It's not optimal even then.
> 
>   Devices: On a software RAID, you ideally don't want to issue flushes
>   to all drives if your database did a 1 block commit entry.  (But they
>   probably use O_DIRECT anyway, changing the rules again).  But all that
>   can be optimised in generic VFS code eventually.  It doesn't need
>   filesystem assistance in most cases.

My own idea is that we create a FLUSH command for blkdev request queues,
to exist alongside READ, WRITE, and the current barrier implementation.
Then FLUSH could be passed down through MD or DM.
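As an illustrative sketch (invented names, not an existing kernel
interface), the new request type might sit alongside the others like
this:

    /* A third top-level request type alongside READ and WRITE.  MD/DM
     * would fan a FLUSH out to each member device's queue, much as
     * they propagate barriers today. */
    enum blk_request_type {
            BLK_RQ_READ,
            BLK_RQ_WRITE,
            BLK_RQ_FLUSH,   /* drain the device's volatile write cache */
    };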


Jeff




Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jörn Engel
On Tue, 26 February 2008 17:29:13 +0000, Jamie Lokier wrote:
> 
> You're right.  Though, doesn't normal page writeback enqueue the COW
> metadata changes?  If not, how do they get written in a timely
> fashion?

It does.  But this is not sufficient to guarantee that the pages in
question have been safely committed to the device by the time
sync_file_range() has returned.

Jörn

-- 
Joern's library part 5:
http://www.faqs.org/faqs/compression-faq/part2/section-9.html


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
> On Tue, 26 February 2008 15:28:10 +0000, Jamie Lokier wrote:
> > 
> > > One interesting aspect of this comes with COW filesystems like btrfs or
> > > logfs.  Writing out data pages is not sufficient, because those will get
> > > lost unless their referencing metadata is written as well.  So either we
> > > have to call fsync for those filesystems or add another callback and let
> > > filesystems override the default implementation.
> > 
> > Doesn't the ->fsync callback get called in the sys_fdatasync() case,
> > with appropriate arguments?
> 
> My paragraph above was aimed at the sync_file_range() case.  fsync and
> fdatasync do the right thing within the limitations you brought up in
> this thread.  sync_file_range() without further changes will only write
> data pages, not the metadata required to actually access those data
> pages.  This works just fine for non-COW filesystems, which covers all
> currently merged ones.
> 
> With COW filesystems it is currently impossible to do sync_file_range()
> properly.  The problem is orthogonal to your's, I just brought it up
> since you were already mentioning sync_file_range().

You're right.  Though, doesn't normal page writeback enqueue the COW
metadata changes?  If not, how do they get written in a timely
fashion?

-- Jamie


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jörn Engel
On Tue, 26 February 2008 15:28:10 +0000, Jamie Lokier wrote:
> 
> > One interesting aspect of this comes with COW filesystems like btrfs or
> > logfs.  Writing out data pages is not sufficient, because those will get
> > lost unless their referencing metadata is written as well.  So either we
> > have to call fsync for those filesystems or add another callback and let
> > filesystems override the default implementation.
> 
> Doesn't the ->fsync callback get called in the sys_fdatasync() case,
> with appropriate arguments?

My paragraph above was aimed at the sync_file_range() case.  fsync and
fdatasync do the right thing within the limitations you brought up in
this thread.  sync_file_range() without further changes will only write
data pages, not the metadata required to actually access those data
pages.  This works just fine for non-COW filesystems, which covers all
currently merged ones.

With COW filesystems it is currently impossible to do sync_file_range()
properly.  The problem is orthogonal to yours; I just brought it up
since you were already mentioning sync_file_range().


Jörn

-- 
Joern's library part 10:
http://blogs.msdn.com/David_Gristwood/archive/2004/06/24/164849.aspx


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jeff Garzik wrote:
> Nick Piggin wrote:
> >Anyway, the idea of making fsync/fdatasync etc. safe by default is
> >a good idea IMO, and it is a bad bug that we don't do that :(
> 
> Agreed...  it's also disappointing that [unless I'm mistaken] you have 
> to hack each filesystem to support barriers.
> 
> It seems far easier to make sync_blkdev() Do The Right Thing, and 
> magically make all filesystems data-safe.

Well, you need ordered metadata writes, barriers _and_ flushes with
some filesystems.

Merely writing all the data pages then issuing a drive cache flush
won't Do The Right Thing with those filesystems - someone already
mentioned Btrfs, where it won't.

But I agree that your suggestion would make a superb default, for
filesystems which don't provide their own function.

It's not optimal even then.

  Devices: On a software RAID, you ideally don't want to issue flushes
  to all drives if your database did a 1 block commit entry.  (But they
  probably use O_DIRECT anyway, changing the rules again).  But all that
  can be optimised in generic VFS code eventually.  It doesn't need
  filesystem assistance in most cases.

  Apps: don't always want a full flush; sometimes a barrier would do.

-- Jamie


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jeff Garzik

Nick Piggin wrote:
> Anyway, the idea of making fsync/fdatasync etc. safe by default is
> a good idea IMO, and it is a bad bug that we don't do that :(

Agreed...  it's also disappointing that [unless I'm mistaken] you have
to hack each filesystem to support barriers.

It seems far easier to make sync_blkdev() Do The Right Thing, and
magically make all filesystems data-safe.

	Jeff




Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Andrew Morton
On Tue, 26 Feb 2008 15:07:45 +0000 Jamie Lokier <[EMAIL PROTECTED]> wrote:

> SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
> pages which aren't already queued for write-out.  It marks those with
> a "write-out" flag, and starts write I/Os at some unspecified time in
> the near future; it can be assumed writes for all the pages will
> complete eventually if there's no errors.  When I/O completes on a
> page, it cleans the page and also clears the write-out flag.
> 
> SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
> have the "write-out" flag set.
> 
> SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
> pages for write-out.  I don't actually see the point in this.  Isn't a
> preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
> BEFORE a redundant flag?

Consider the case of pages which are dirty but are already under writeout. 
ie: someone redirtied the page after someone started writing the page out. 
For these pages the kernel needs to

a) wait for the current writeout to complete

b) start new writeout

c) wait for that writeout to complete.

Those are the three stages of sync_file_range().  They are independently
selectable and various combinations provide various results.

The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that
userspace can get as much data into the queue as possible, to permit the
kernel to optimise IO scheduling better.

If you perform a) and b) together
(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed
that all data which was dirty when sync_file_range() executed will be sent
into the queue, but you won't get as much data into the queue if the kernel
encounters dirty, under-writeout pages.  This is especially hurtful if
you're trying to feed a lot of little segments into the queue.  In that
case perhaps userspace should do an asynchronous pass
(SYNC_FILE_RANGE_WRITE) to stuff as much data as possible into the queue,
then a SYNC_FILE_RANGE_WAIT_AFTER pass, then a
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
pass to clean up any stragglers.  Which mode is best very much depends on
the application's file dirtying patterns.  One would have to experiment
with it, and tuning of sync_file_range() usage would occur alongside tuning
of the application's write() design.

It's an interesting problem, with potentially high payback.
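
For concreteness, a minimal user-space sketch of that multi-pass scheme
(illustrative only; the seg[] array of dirty ranges is assumed, and
sync_file_range() requires _GNU_SOURCE and Linux >= 2.6.17):

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <fcntl.h>

    struct seg { off64_t off; off64_t len; };  /* ranges to commit */

    static int sync_segments(int fd, const struct seg *seg, int nseg)
    {
            int i, err = 0;

            /* Pass 1: queue as much writeout as possible, no waiting. */
            for (i = 0; i < nseg; i++)
                    err |= sync_file_range(fd, seg[i].off, seg[i].len,
                                           SYNC_FILE_RANGE_WRITE);

            /* Pass 2: wait for the writeout queued above to complete. */
            for (i = 0; i < nseg; i++)
                    err |= sync_file_range(fd, seg[i].off, seg[i].len,
                                           SYNC_FILE_RANGE_WAIT_AFTER);

            /* Pass 3: catch pages redirtied while under writeout
             * (the "stragglers" above). */
            for (i = 0; i < nseg; i++)
                    err |= sync_file_range(fd, seg[i].off, seg[i].len,
                                           SYNC_FILE_RANGE_WAIT_BEFORE |
                                           SYNC_FILE_RANGE_WRITE |
                                           SYNC_FILE_RANGE_WAIT_AFTER);
            return err;
    }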


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Ric Wheeler wrote:
> >>I was surprised that fsync() doesn't do this already.  There was a lot
> >>of effort put into block I/O write barriers during 2.5, so that
> >>journalling filesystems can force correct write ordering, using disk
> >>flush cache commands.
> >>
> >>After all that effort, I was very surprised to notice that Linux 2.6.x
> >>doesn't use that capability to ensure fsync() flushes the disk cache
> >>onto stable storage.
> >
> >It's surprising you are surprised, given that this [lame] fsync behavior 
> >has remained consistently lame throughout Linux's history.
> 
> Maybe I am confused, but isn't this what fsync() does today whenever 
> barriers are enabled (the fsync() invalidates the drive's write cache)?

No, fsync() doesn't always flush the drive's write cache.  It often
does, and I think many people are under the impression it always does,
but it doesn't.

Try this code on ext3:

fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
while (1) {
char byte;
usleep (10);
pwrite (fd, &byte, 1, 0);
fsync (fd);
}

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode
has changed.  The inode mtime is changed by write only with 1 second
granularity.  Without a journal commit, there's no barrier, which
translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll hear the disk seeking more.  That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.
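
For reference, the loop with that workaround added looks like this:

    fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
    while (1) {
        char byte;
        usleep (10);
        pwrite (fd, &byte, 1, 0);
        fchmod (fd, 0644); fchmod (fd, 0664);  /* dirty the inode */
        fsync (fd);      /* now forces a journal commit each time */
    }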

It turns out even _that_ is not sufficient according to the kernel
internals.  A journal commit uses an ordered request, which isn't
necessarily the same as a flush; it just happens to use a flush in this
instance.  I'm not sure if ordered requests are actually implemented
by any drivers at the moment.  If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend
on the non-existence of block drivers which do ordered (not flush)
barrier requests.  But there's lots of things wrong with that.  Not
least, it sucks performance for database-like applications and virtual
machines, a lot due to unnecessary seeks.  That way lies wrongness.

Rightness is to make fdatasync() work well, with a genuine flush or
equivalent (see FUA) only when required (and not a mere ordered
barrier), with no inode write, and to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually
need.

[*] - or whatever.

-- Jamie


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > Yeah, sync_file_range has slightly unusual semantics and introduces
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
> 
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising "a little"
> safety.

Sometimes there is a point in "a little" safety.

There's a spectrum of durability (meaning how safely stored the data
is).  In the cases we're imagining, it's application -> main memory
cache -> disk cache -> disk surface.  There are others.

_None_ of those provide perfect safety for your data.  They are a
spectrum, and how far along you want data to be committed before you
say "fine, the data is safe enough for me" depends on your application.

For example, there are users who like to turn _off_ fdatasync() with
their SQL database of choice.  They prefer speed over safety, and they
don't mind losing an hour's data and doing regular backups (we assume
;-)).  Some blogs fall into this category; who cares if a rare crash
costs you a comment or two and a restore from backup; it's acceptable
for the speed.

There's users who would really like fdatasync() to commit data to the
drive platters, so after their database says "done", they are very
confident that a power failure won't cause committed data to be lost.
Accepting credit cards is more at this end.  So should be anyone using
a virtual machine of any kind without a journalling fs in the guest!

And there's users who like it where it is right now: a compromise,
where a system crash won't lose committed data; but a power failure
might.  (I'm making assumptions about drive behaviour on reset here.)

My problem with fdatasync() at the moment is, I can't choose what I
want from it, and there's no mechanism to give me the safest option.
Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just
isn't exported to userspace.

(A quick aside: fdatasync() et al. are actually used for two
_different_ things.  1: A program says "I've written it", and can say
so with confidence, e.g. announcing email receipt.  2: It's used for
write ordering with write-ahead logging: write, fdatasync, write.
When you tease at the details, efficient implementations of them are
different...  Think SCSI tagged commands versus cache flushes.)
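
(A sketch of use 2, write-ahead logging - the record layout, offsets
and names here are invented for illustration:)

    /* The log record must be durable before the commit mark which
     * makes it valid, so recovery never sees a mark without its
     * record. */
    pwrite (log_fd, record, record_len, record_off);
    fdatasync (log_fd);   /* ordering point 1: record is stable */
    pwrite (log_fd, &commit_mark, sizeof commit_mark, mark_off);
    fdatasync (log_fd);   /* ordering point 2: mark is stable */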

> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

Doesn't the ->fsync callback get called in the sys_fdatasync() case,
with appropriate arguments?

With barriers/flushes it certainly makes those a bit more complicated.
You have to flush not just the disks with data pages, but also the
_other_ disks in a software RAID holding data pointer metadata pages -
though ideally not all of them (think database journal commit).
not all of them (think database journal commit).

That can be implemented with per-buffer pending-barrier/flush flags
(like I described for pages in the first mail), which are equally
useful when a database-like application uses a block device.

-- Jamie


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Ric Wheeler

Jeff Garzik wrote:
> Jamie Lokier wrote:
> > By durable, I mean that fsync() should actually commit writes to
> > physical stable storage,
> 
> Yes, it should.
> 
> > I was surprised that fsync() doesn't do this already.  There was a lot
> > of effort put into block I/O write barriers during 2.5, so that
> > journalling filesystems can force correct write ordering, using disk
> > flush cache commands.
> > 
> > After all that effort, I was very surprised to notice that Linux 2.6.x
> > doesn't use that capability to ensure fsync() flushes the disk cache
> > onto stable storage.
> 
> It's surprising you are surprised, given that this [lame] fsync behavior 
> has remained consistently lame throughout Linux's history.

Maybe I am confused, but isn't this what fsync() does today whenever 
barriers are enabled (the fsync() invalidates the drive's write cache)?

ric


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > 
> > Yeah, sync_file_range has slightly unusual semantics and introduces
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
> 
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising "a little"
> safety.
> 
> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

fdatasync() is required to write data pages _and_ the necessary
metadata to reference those changed pages (btrfs tree etc.), but not
non-data metadata.

It's the filesystem's responsibility to interpret that correctly.
In-place writes don't need anything else.  Phase-tree style writes do.
Some kinds of logged writes don't.

I'm under the impression that sync_file_range() is a sort of
restricted-range asynchronous fdatasync().

By limiting the range of file data which must be written out, it
becomes more refined for database and filesystem-in-a-file type
applications.  Just as fsync() is more refined than sync() - it's
useful to sync less - same goes for syncing just part of a file.

It's still the filesystem's responsibility to sync data access
metadata appropriately.  It can sync more if it wants, but not less.

That's what I understand by
   sync_file_range(fd, start, length, SYNC_FILE_RANGE_WAIT_BEFORE
                                      | SYNC_FILE_RANGE_WRITE
                                      | SYNC_FILE_RANGE_WAIT_AFTER);
Largely because the manual says to use that combination of flags for
an equivalent to fdatasync().

The concept of "write-out" is not defined in the manual.  I'm assuming
it to mean this, as a reasonable guess:

SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
pages which aren't already queued for write-out.  It marks those with
a "write-out" flag, and starts write I/Os at some unspecified time in
the near future; it can be assumed writes for all the pages will
complete eventually if there's no errors.  When I/O completes on a
page, it cleans the page and also clears the write-out flag.

SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
have the "write-out" flag set.

SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
pages for write-out.  I don't actually see the point in this.  Isn't a
preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
BEFORE a redundant flag?

The manual says it is something to do with data-integrity, but it's
not clear to me what that means.

All this implies that the "write-out" flag is a concept userspace can rely
on.  That's not so peculiar: WRITE seems to be equivalent to AIO-style
fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be
equivalent to waiting for any previously issued such ops to complete.

Any data access metadata updates that btrfs must make for fdatasync(),
it must also make for sync_file_range(), for the limited range of
offsets.

-- Jamie


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jörn Engel
On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> 
> Yeah, sync_file_range has slightly unusual semantics and introduces
> the new concept, "writeout", to userspace (does "writeout" include
> "in drive cache"? the kernel doesn't think so, but the only way to
> make sync_file_range "safe" is if you do consider it writeout).

If sync_file_range isn't safe, it should get replaced by a noop
implementation.  There really is no point in promising "a little"
safety.

One interesting aspect of this comes with COW filesystems like btrfs or
logfs.  Writing out data pages is not sufficient, because those will get
lost unless their referencing metadata is written as well.  So either we
have to call fsync for those filesystems or add another callback and let
filesystems override the default implementation.

Jörn

-- 
There is no worse hell than that provided by the regrets
for wasted opportunities.
-- Andre-Louis Moreau in Scarabouche


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jeff Garzik wrote:
> [snip huge long proposal]
> 
> Rather than invent new APIs, we should fix the existing ones to _really_ 
> flush data to physical media.

Btw, one reason for the length is that the current block request API
isn't sufficient even to make fsync() durable with _no_ new APIs.

It offers ordering barriers only, which aren't enough.  I tried to
explain, discuss some changes and then suggest optimisations.

-- Jamie


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Nick Piggin
On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:
> Andrew Morton wrote:
> > On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <[EMAIL PROTECTED]> wrote:
> > > (It would be nicer if sync_file_range()
> > > took a vector of ranges for better elevator scheduling, but let's
> > > ignore that :-)
> >
> > Two passes:
> >
> > Pass 1: shove each of the segments into the queue with
> > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> >
> > Pass 2: wait for them all to complete and return accumulated result
> > with SYNC_FILE_RANGE_WAIT_AFTER
>
> Thanks.
>
> Seems ok, though being able to cork the I/O until the last one would
> be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)
>
> I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
> reason why you have it there?  The man page isn't very enlightening.


Yeah, sync_file_range has slightly unusual semantics and introduces
the new concept, "writeout", to userspace (does "writeout" include
"in drive cache"? the kernel doesn't think so, but the only way to
make sync_file_range "safe" is if you do consider it writeout).

If it makes it any easier to understand, we can add in
SYNC_FILE_ASYNC and SYNC_FILE_SYNC parts that just deal with the
safe/unsafe and sync/async semantics that are part of the normal
POSIX API.

Anyway, the idea of making fsync/fdatasync etc. safe by default is
a good idea IMO, and it is a bad bug that we don't do that :(



Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Andrew Morton wrote:
> On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <[EMAIL PROTECTED]> wrote:
> 
> > (It would be nicer if sync_file_range()
> > took a vector of ranges for better elevator scheduling, but let's
> > ignore that :-)
> 
> Two passes:
> 
> Pass 1: shove each of the segments into the queue with
> SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> 
> Pass 2: wait for them all to complete and return accumulated result
> with SYNC_FILE_RANGE_WAIT_AFTER

Thanks.

Seems ok, though being able to cork the I/O until the last one would
be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)

I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
reason why you have it there?  The man page isn't very enlightening.

-- Jamie


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier
Jeff Garzik wrote:
> Jamie Lokier wrote:
> >By durable, I mean that fsync() should actually commit writes to
> >physical stable storage,
> 
> Yes, it should.

Glad we agree :-)

> >I was surprised that fsync() doesn't do this already.  There was a lot
> >of effort put into block I/O write barriers during 2.5, so that
> >journalling filesystems can force correct write ordering, using disk
> >flush cache commands.
> >
> >After all that effort, I was very surprised to notice that Linux 2.6.x
> >doesn't use that capability to ensure fsync() flushes the disk cache
> >onto stable storage.
> 
> It's surprising you are surprised, given that this [lame] fsync behavior 
> has remained consistently lame throughout Linux's history.

I was surprised because of the effort put into IDE write barriers to
get it right for in-kernel filesystems, and the messages in 2004
telling concerned users that fsync would use barriers in 2.6, which it
does sometimes but not always.

> [snip huge long proposal]
> 
> Rather than invent new APIs, we should fix the existing ones to _really_ 
> flush data to physical media.
>
> Linux should default to SAFE data storage, and permit users to retain 
> the older unsafe behavior via an option.  It's completely ridiculous 
> that we default to an unsafe fsync.

Well, I agree with you.  Which is why the "new API" I suggested, being
really just an extension of an existing one, allows fsync() to be SAFE
if that's what people want.

To be fair, fsync() is rather overkill for some apps.
sync_file_range() is obviously the right place for fine tuning "less
safe" variations.

> And [anticipating a common response from others] it is completely 
> irrelevant that POSIX fsync(2) permits Linux's current behavior.  The 
> current behavior is unsafe.
> 
> Safety before performance -- ESPECIALLY when it comes to storing user data.

Especially now that people work a lot in guest VMs, where the IDE
barrier stuff doesn't work if the host fdatasync() doesn't work.

Since it happened with Mac OS X, I wouldn't be surprised if changing
fsync(), and just that, wasn't popular.  Heck, you already get people
asking "how to turn off fsync in PostgreSQL"...  (Haven't those people
heard of transactions...?)

But with changes to sync_file_range() [or whatever... I don't care] to
support databases' finely tuned commit needs, and then adoption of
that by database vendors, perhaps nobody will mind fsync() becoming
safe then.

Nobody seems bothered by its performance for other things.

-- Jamie


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jeff Garzik

Jamie Lokier wrote:

By durable, I mean that fsync() should actually commit writes to
physical stable storage,


Yes, it should.



I was surprised that fsync() doesn't do this already.  There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.


It's surprising you are surprised, given that this [lame] fsync behavior
has remained consistently lame throughout Linux's history.


[snip huge long proposal]

Rather than invent new APIs, we should fix the existing ones to _really_ 
flush data to physical media.


Linux should default to SAFE data storage, and permit users to retain 
the older unsafe behavior via an option.  It's completely ridiculous 
that we default to an unsafe fsync.


And [anticipating a common response from others] it is completely 
irrelevant that POSIX fsync(2) permits Linux's current behavior.  The 
current behavior is unsafe.


Safety before performance -- ESPECIALLY when it comes to storing user data.

Regards,

Jeff (Linux ATA driver dude)




Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Andrew Morton
On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote:

> (It would be nicer if sync_file_range()
> took a vector of ranges for better elevator scheduling, but let's
> ignore that :-)

Two passes:

Pass 1: shove each of the segments into the queue with
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE

Pass 2: wait for them all to complete and return accumulated result
with SYNC_FILE_RANGE_WAIT_AFTER
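
As a minimal userspace sketch of those two passes, assuming a
caller-supplied vector (struct range and sync_ranges() are invented
names for illustration):

	#define _GNU_SOURCE
	#include <fcntl.h>

	struct range { off64_t off; off64_t len; };

	static int sync_ranges(int fd, const struct range *r, int n)
	{
		int i, err = 0;

		/* Pass 1: start writeback on every segment */
		for (i = 0; i < n; i++)
			if (sync_file_range(fd, r[i].off, r[i].len,
					    SYNC_FILE_RANGE_WAIT_BEFORE |
					    SYNC_FILE_RANGE_WRITE))
				err = -1;

		/* Pass 2: wait for all of them, accumulating the result */
		for (i = 0; i < n; i++)
			if (sync_file_range(fd, r[i].off, r[i].len,
					    SYNC_FILE_RANGE_WAIT_AFTER))
				err = -1;

		return err;
	}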




Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier
Dear kernel,

This is a proposal to add "proper" durable fsync() and fdatasync() to Linux.

First the problem, then a proposed solution "with benefits", so to speak.

I need feedback on the details, before implementing anything.  Or
(hopefully) someone else thinks it's very important and does it
themselves :-)

By durable, I mean that fsync() should actually commit writes to
physical stable storage, not just the disk write cache when that is
enabled.  Databases and guest VMs need this, or an equivalent
feature, if they aren't to face occasional corruption after power
failure and perhaps some crashes.

The alternative is to disable the disk write cache.  But that isn't
modern practice or recommendation, since I/O write barriers were
implemented and they are much faster.

I was surprised that fsync() doesn't do this already.  There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.

I noticed this following up discussions on the Qemu mailing list,
about guest VMs and how their IDE flush cache command should translate
to fsync() to avoid data loss.  (For guest VMs, fsync() isn't
necessary if the host machine is fine, and it isn't enough (on Linux
host) if the host machine loses power or the hard disk crashes another
way.)

Then I noticed it again, when I was designing a database engine with
filesystem characteristics.  I thought "how do I ensure ordered
journal writes; can I use fdatasync()?" and was surprised to find the
answer is no, I have to use hacks like calling hdparm, and the authors
of major SQL databases seem to brush the problem under the carpet.

(Interestingly, in the Linux 2.4 patches for write barriers, fsync()
seems to be fine, if a bit slow.)

It isn't the first time this topic has come up:


http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1
("True fsync() in Linux (on IDE)")

In that thread, it was implied that it would be fixed in 2.6.  So I bet
some people are under the illusion that it's fixed in 2.6...


For a while, I've been meaning to bring it up on linux-kernel...


The fsync problem
-----------------

Chris Wedgwood wrote:
> On Mon, Feb 25, 2008 at 08:50:40PM +, Jamie Lokier wrote:
> 
> > On Linux (and other host OSes), fdatasync() and fsync() don't always
> > commit data to hard storage; it sometimes only commits it to the hard
> > drive cache.
> 
> That's a filesystem bug IMO.  People should be able to use f[data]sync
> with some level of confidence or else it's basically pointless.

I agree, I consider it a serious bug, and I would be pleased if
someone paid it some love and attention.

Right now, if you want a reliable database on Linux, you _cannot_
properly depend on fsync() or fdatasync().  Considering how much Linux
is used for critical databases that rely on these functions, this amazes me.

Also, if you have a guest VM, then the guest's filesystem journalling
is not reliable.  Not only can it lose data on power loss, it can
corrupt the guest filesystem too, due to reordering.  This is contrary
to what people expect, I think.

I'm not sure if a system reset can cause similar loss; I don't know
how disks react to that.

Also, for the person porting ZFS to run on FUSE, same applies...

Linux fsync is faulty in two ways:

   1. Database commits aren't _durable_ against power failure, because
      fsync doesn't flush the disk's cache.  This means data is not
      stored with the expected durability.

   2. It's unsafe for write-ahead logging, because it doesn't really
      guarantee any _ordering_ for the writes at the hard storage
      level.  So aside from losing committed data, it can also corrupt
      structural metadata.

With ext3 it's quite easy to verify that fsync/fdatasync don't always
write a journal entry.  (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in
/proc/diskstats.  If the current mtime second _hasn't_ changed, the
inode isn't written.  If you write data, say, 10 times a second to the
same place followed by fsync(), you'll see a little more than 10 write
I/Os, and less than 20.
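
(A minimal sketch of that experiment; the filename is arbitrary.
Watch the write-I/O column for the disk in /proc/diskstats while it
runs.)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[512];
		int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
		int i;

		memset(buf, 'x', sizeof(buf));
		for (i = 0; i < 100; i++) {
			pwrite(fd, buf, sizeof(buf), 0); /* same block each time */
			fsync(fd);
			usleep(100 * 1000);              /* ~10 commits a second */
		}
		close(fd);
		return 0;
	}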

By the way, this shows a trick for fixing #2 (ordering): use fchmod()
to toggle the file attributes, and that will force the next fsync() to
write a journal entry, which _does_ issue a write barrier.  If you do
that with each write as above (write, fchmod change, fsync 10 times a
second), you will clearly see more write I/Os, and you'll hear the
disk behaving differently: it's seeking more.
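
(A sketch of the per-commit sequence; the starting mode 0644 is an
assumption, and any harmless mode bit will do, since the point is just
to dirty the inode so that ext3 must commit a journal record.)

	#include <sys/stat.h>
	#include <unistd.h>

	static int fsync_with_barrier(int fd)
	{
		static mode_t mode = 0644; /* assumed initial file mode */

		mode ^= S_IXOTH;           /* flip a harmless bit: inode now dirty */
		if (fchmod(fd, mode))
			return -1;
		return fsync(fd);          /* writes a journal entry, which
					      issues a write barrier */
	}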

However, even this ugly trick has problems:

  3. Using the fchmod() trick or good fortune, fsync() issues a write
     barrier.  Right now, this does commit data (if the device can).
