Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jeff Garzik
Jamie Lokier wrote:
> Jeff Garzik wrote:
> > Nick Piggin wrote:
> > > Anyway, the idea of making fsync/fdatasync etc. safe by default is
> > > a good idea IMO, and is a bad bug that we don't do that :(
> >
> > Agreed... it's also disappointing that [unless I'm mistaken] you have
> > to hack each filesystem to support

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jörn Engel
On Tue, 26 February 2008 17:29:13 +, Jamie Lokier wrote:
>
> You're right. Though, doesn't normal page writeback enqueue the COW
> metadata changes? If not, how do they get written in a timely
> fashion?

It does. But this is not sufficient to guarantee that the pages in
question have been

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
> On Tue, 26 February 2008 15:28:10 +, Jamie Lokier wrote:
> >
> > > One interesting aspect of this comes with COW filesystems like btrfs or
> > > logfs. Writing out data pages is not sufficient, because those will get
> > > lost unless their referencing metadata is written

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jörn Engel
On Tue, 26 February 2008 15:28:10 +, Jamie Lokier wrote:
>
> > One interesting aspect of this comes with COW filesystems like btrfs or
> > logfs. Writing out data pages is not sufficient, because those will get
> > lost unless their referencing metadata is written as well. So either we
> > have to

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jeff Garzik wrote:
> Nick Piggin wrote:
> >Anyway, the idea of making fsync/fdatasync etc. safe by default is
> >a good idea IMO, and is a bad bug that we don't do that :(
>
> Agreed... it's also disappointing that [unless I'm mistaken] you have
> to hack each filesystem to support barriers.

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jeff Garzik
Nick Piggin wrote:
> Anyway, the idea of making fsync/fdatasync etc. safe by default is
> a good idea IMO, and is a bad bug that we don't do that :(

Agreed... it's also disappointing that [unless I'm mistaken] you have
to hack each filesystem to support barriers.

It seems far easier to make

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Andrew Morton
On Tue, 26 Feb 2008 15:07:45 + Jamie Lokier <[EMAIL PROTECTED]> wrote:

> SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
> pages which aren't already queued for write-out. It marks those with
> a "write-out" flag, and starts write I/Os at some unspecified time in
> the near

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Ric Wheeler wrote:
> >>I was surprised that fsync() doesn't do this already. There was a lot
> >>of effort put into block I/O write barriers during 2.5, so that
> >>journalling filesystems can force correct write ordering, using disk
> >>flush cache commands.
> >>
> >>After all that effort, I was very surprised to

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > Yeah, sync_file_range has slightly unusual semantics and introduces
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Ric Wheeler
Jeff Garzik wrote:
> Jamie Lokier wrote:
> > By durable, I mean that fsync() should actually commit writes to
> > physical stable storage,
>
> Yes, it should.
>
> > I was surprised that fsync() doesn't do this already. There was a lot
> > of effort put into block I/O write barriers during 2.5, so that

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> >
> > Yeah, sync_file_range has slightly unusual semantics and introduces
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jörn Engel
On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
>
> Yeah, sync_file_range has slightly unusual semantics and introduces
> the new concept, "writeout", to userspace (does "writeout" include
> "in drive cache"? the kernel doesn't think so, but the only way to
> make sync_file_range safe is if you

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jeff Garzik wrote:
> [snip huge long proposal]
>
> Rather than invent new APIs, we should fix the existing ones to _really_
> flush data to physical media.

Btw, one reason for the length is the current block request API isn't
sufficient even to make fsync() durable with _no_ new APIs. It offers

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Nick Piggin
On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:
> Andrew Morton wrote:
> > On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote:
> > > (It would be nicer if sync_file_range()
> > > took a vector of ranges for better elevator scheduling, but let's
> > > ignore that :-)

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Andrew Morton wrote:
> On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote:
>
> > (It would be nicer if sync_file_range()
> > took a vector of ranges for better elevator scheduling, but let's
> > ignore that :-)
>
> Two passes:
>
> Pass 1: shove each of the segments into the queue with

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier
Jeff Garzik wrote:
> Jamie Lokier wrote:
> >By durable, I mean that fsync() should actually commit writes to
> >physical stable storage,
>
> Yes, it should.

Glad we agree :-)

> >I was surprised that fsync() doesn't do this already. There was a lot
> >of effort put into block I/O write barriers during
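
For readers skimming the archive, here is a minimal sketch of the application-side pattern the "durable" definition above is about: write, then call fsync() or fdatasync() and treat a successful return as the commit point. The helper name `append_and_sync` and the file name are hypothetical, not from the thread. Whether a successful return really means the data reached stable storage (rather than the drive's write cache) is the open question this thread is debating, so the sketch shows the calling convention only and does not guarantee durability.

```python
import os
import tempfile


def append_and_sync(path, data, data_only=False):
    """Append data to path, then ask the kernel to flush it.

    Hypothetical helper for illustration. data_only=True uses fdatasync()
    where the platform provides it, which may skip metadata that is not
    needed to read the data back (for example timestamps).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        if data_only and hasattr(os, "fdatasync"):
            os.fdatasync(fd)
        else:
            os.fsync(fd)  # raises OSError if the flush reports a failure
    finally:
        os.close(fd)


if __name__ == "__main__":
    directory = tempfile.mkdtemp()
    log_path = os.path.join(directory, "journal.log")
    append_and_sync(log_path, b"record 1\n")
    append_and_sync(log_path, b"record 2\n", data_only=True)
```

An application would only acknowledge "record 1" to its client after `append_and_sync` returns without raising; that contract is only as strong as the kernel's and the drive's handling of the flush.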

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jeff Garzik
Jamie Lokier wrote:
> By durable, I mean that fsync() should actually commit writes to
> physical stable storage,

Yes, it should.

> I was surprised that fsync() doesn't do this already. There was a lot
> of effort put into block I/O write barriers during 2.5, so that
> journalling filesystems can

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Andrew Morton
On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote:

> (It would be nicer if sync_file_range()
> took a vector of ranges for better elevator scheduling, but let's
> ignore that :-)

Two passes:

Pass 1: shove each of the segments into the queue with
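
The preview above is cut off before the flags are named, so the following is only a sketch of what a two-pass loop over several ranges might look like from userspace. The flag choice (WAIT_BEFORE|WRITE to queue, WAIT_AFTER to wait) is an assumption based on the documented sync_file_range(2) flags, not a quotation of the message. The helper names and the fallback to fsync() are hypothetical. As discussed elsewhere in this thread, sync_file_range() does not flush the drive cache or file metadata, so this is not a durability guarantee.

```python
import ctypes
import errno
import os
import tempfile

# Flag values from the Linux headers for sync_file_range(2).
SYNC_FILE_RANGE_WAIT_BEFORE = 1
SYNC_FILE_RANGE_WRITE = 2
SYNC_FILE_RANGE_WAIT_AFTER = 4

# Assumption: these errnos mean "not usable here", so fall back to fsync().
_FALLBACK_ERRNOS = {errno.ENOSYS, errno.EOPNOTSUPP, errno.EPERM, errno.EINVAL}


def _load_sync_file_range():
    """Return the libc sync_file_range wrapper, or None if unavailable."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        fn = libc.sync_file_range
    except (OSError, AttributeError, TypeError):
        return None
    fn.argtypes = [ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint]
    fn.restype = ctypes.c_int
    return fn


def sync_ranges(fd, ranges):
    """Two-pass writeout of (offset, length) ranges; returns the method used."""
    fn = _load_sync_file_range()
    if fn is None:
        os.fsync(fd)
        return "fsync"
    passes = (
        SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE,  # pass 1: queue
        SYNC_FILE_RANGE_WAIT_AFTER,                           # pass 2: wait
    )
    for flags in passes:
        for offset, length in ranges:
            if fn(fd, offset, length, flags) != 0:
                err = ctypes.get_errno()
                if err in _FALLBACK_ERRNOS:
                    os.fsync(fd)
                    return "fsync"
                raise OSError(err, os.strerror(err))
    return "sync_file_range"


if __name__ == "__main__":
    with tempfile.TemporaryFile() as handle:
        handle.write(b"x" * 8192)
        handle.flush()  # move Python's buffer into the page cache first
        print(sync_ranges(handle.fileno(), [(0, 4096), (4096, 4096)]))
```

Queuing every range before waiting on any of them gives the I/O scheduler the whole batch at once, which is the benefit a vectored call would otherwise provide.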
