Bah, that's quite right.  Thanks for the step-by-step, I'm not sure how I 
missed it before.
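
To convince myself, here is a toy model of the sequence quoted below. It is only a sketch in Python, not CouchDB code: SimFile, latest_header and run are made-up names, and "kernel panic" is modelled as losing everything that was not fsynced, which is the worst case rather than the typical one.

```python
class SimFile:
    """Toy file: writes sit in a volatile cache until fsync is called."""

    def __init__(self):
        self.durable = []  # records that survive a kernel panic
        self.cached = []   # records visible to readers, maybe not on disk

    def write(self, record):
        self.cached.append(record)

    def fsync(self):
        self.durable = list(self.cached)

    def read(self):
        return list(self.cached)

    def kernel_panic(self):
        # Worst case: everything that was not fsynced is lost.
        self.cached = list(self.durable)


def latest_header(records):
    """Return the sequence number of the last header record, if any."""
    headers = [r for r in records if r[0] == "header"]
    return headers[-1][1] if headers else None


def run(fsync_on_open):
    db, view = SimFile(), SimFile()
    db.write(("data", 1))
    db.fsync()                      # steps 1-2
    db.write(("header", 1))
    db.fsync()                      # steps 3-4
    db.write(("data", 2))
    db.fsync()                      # steps 5-6
    db.write(("header", 2))         # step 7, not fsynced
    # Step 8: the process dies, but the OS cache still holds header 2.
    if fsync_on_open:
        db.fsync()                  # what on_file_open is meant to do
    seq = latest_header(db.read())  # the reopened server sees header 2
    view.write(("indexed_up_to", seq))
    view.fsync()                    # the view update reaches disk
    db.kernel_panic()
    view.kernel_panic()
    return latest_header(db.read()), view.read()[-1][1]
```

In this model run(False) ends with the database header back at 1 while the view claims to be indexed up to 2, and run(True) leaves both at 2.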

Adam

On Apr 14, 2010, at 11:04 AM, Robert Newson wrote:

> I think Damien is right here. Consider this sequence:
> 
> 1) update btree
> 2) fsync
> 3) write new header
> 4) fsync
> 5) more updates
> 6) fsync
> 7) write new header
> 8) process terminates
> 
> On open, the header at 7) might or might not be flushed all the way to
> disk, but couchdb would update views to include changes made at 5).
> Since the header at 7) isn't guaranteed to be fsync'ed, a second crash (say,
> a kernel panic) could revert the .couch file itself to the state at
> 4), leaving the views permanently wrong. It's hard to see it in practice
> because the header is 4k and almost always gets to disk soon enough
> anyway, especially if you do more i/o on the view indexes.
> 
> B.
> 
> On Wed, Apr 14, 2010 at 3:46 PM, Adam Kocoloski <[email protected]> wrote:
>> Thanks Damien.  I'm thinking that the situation you describe cannot occur if 
>> before_header is enabled in the fsync_options, since any data pointed to by 
>> the #db_header that the server found after the restart was already synced.  
>> Is that correct?
>> 
>> Adam
>> 
>> On Apr 14, 2010, at 10:26 AM, Damien Katz wrote:
>> 
>>> The reason for fsync on open is the server doesn't know if the data it's 
>>> reading off the file is committed fully to the disk. It's possible that the 
>>> server wrote to the file and crashed before fsync, then restarted. Then it 
>>> could refresh view indexes on the non-fsynced storage data, for example, 
>>> and crash again, losing data in the storage file, but not the updates to 
>>> the index file. Now the index is permanently out of date with the storage 
>>> file. But if you fsync on opening the storage file, that can't happen.
>>> 
>>> -Damien
>>> 
>>> 
>>> On Apr 14, 2010, at 5:52 AM, Adam Kocoloski wrote:
>>> 
>>>> Initially posted on user@, but maybe it got lost in the noise.  Does 
>>>> anyone know why we call fsync when we open a file?
>>>> 
>>>> Adam
>>>> 
>>>> Begin forwarded message:
>>>> 
>>>>> From: Adam Kocoloski <[email protected]>
>>>>> Date: April 11, 2010 10:44:03 PM EDT
>>>>> To: [email protected]
>>>>> Subject: optimal settings for [couchdb] fsync_options?
>>>>> 
>>>>> Hi folks, I wanted to assemble some concrete information about the 
>>>>> purpose of each of the three fsync_options available in CouchDB and under 
>>>>> what conditions they should be enabled/disabled.  These options are
>>>>> 
>>>>> 1) before_header - calls file:sync(Fd) before writing a DB header to 
>>>>> disk.  I believe the goal here is to prevent DB corruption by ensuring 
>>>>> that all the data referred to by the header is durably stored before the 
>>>>> header is written.  A system that preserves write ordering could safely 
>>>>> disable this option.  Does anyone know an example of such a system? 
>>>>> Perhaps a combination of a noop IO scheduler and a write-through or 
>>>>> nonvolatile disk cache?
>>>>> 
>>>>> 2) after_header - calls file:sync(Fd) immediately after writing the DB 
>>>>> header.  I think this one is done so that we don't lose too much data 
>>>>> following a CouchDB restart, and so that a client can ensure that stored 
>>>>> data will be retrievable after a restart by POSTing to 
>>>>> /db/_ensure_full_commit.  It might make sense to disable this option if 
>>>>> e.g. you're relying on replication for durability.  Although that's dicey 
>>>>> because the replicator calls ensure_full_commit for both DBs before 
>>>>> writing its own checkpoint record*, and by disabling the after_header 
>>>>> option you'd run the risk of skipping updates on the target in the face 
>>>>> of a power failure.
>>>>> 
>>>>> 3) on_file_open - calls file:sync(Fd) immediately after opening a DB 
>>>>> file.  I really don't know the purpose of this one.  Anyone?
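
The ordering that before_header and after_header describe can be sketched roughly as follows. This is an illustration in Python using os.fsync, not CouchDB's Erlang implementation (which calls file:sync(Fd)); commit and demo are hypothetical names, and the returned list exists only to make the order of operations visible.

```python
import os
import tempfile


def commit(fd, data, header, fsync_options):
    """Append data, then a header, syncing where the options ask for it."""
    calls = []
    os.write(fd, data)
    if "before_header" in fsync_options:
        os.fsync(fd)  # data is on disk before any header points at it
        calls.append("fsync_before_header")
    os.write(fd, header)
    calls.append("header_written")
    if "after_header" in fsync_options:
        os.fsync(fd)  # the header itself is now on disk
        calls.append("fsync_after_header")
    return calls


def demo():
    """Run one commit against a temporary file with both options enabled."""
    fd, path = tempfile.mkstemp()
    try:
        options = {"before_header", "after_header"}
        return commit(fd, b"doc-body", b"HEADER", options)
    finally:
        os.close(fd)
        os.remove(path)
```

With both options enabled, demo() reports a sync on each side of the header write; dropping an option from the set skips the corresponding sync.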
>>>>> 
>>>>> Best, Adam
>>>>> 
>>>>> * The reason the replicator calls ensure_full_commit on the source is to 
>>>>> detect situations where update_seqs might be reused.  I wonder if we 
>>>>> could engineer a way around that ever happening, for example by ensuring 
>>>>> that on restart the update sequence jumps by a large number.  But that's 
>>>>> a discussion for d...@.
>>>> 
>>> 
>> 
>> 
