Investigate alternative couch_file writer implementations
---------------------------------------------------------

                 Key: COUCHDB-754
                 URL: https://issues.apache.org/jira/browse/COUCHDB-754
             Project: CouchDB
          Issue Type: Improvement
         Environment: some code might be platform-specific
            Reporter: Adam Kocoloski
             Fix For: 1.1


I've got a number of possible enhancements to couch_file floating around in my 
head and wanted to write them down.

* Use fdatasync instead of fsync.  Filipe posted a patch to the OTP file driver 
[1] that adds a new file:datasync/1 function.  I suspect that we won't see much 
of a performance gain from this switch because we append to the file and thus 
need to update the file metadata anyway.  On the other hand, I'm fairly certain 
fdatasync is always safe for our needs, so if it is ever more efficient we 
should use it.  Obviously, we'll need to fall back to file:sync/1 on platforms 
where the datasync function is not available.

* Use file:pwrite/2 to batch together multiple outstanding write requests.  
This is essentially Paul's zip_server [2].  In order to take full advantage of 
it we need to patch couch_btree to update nodes in parallel.  Currently there 
should only be 1 outstanding write request in a couch_file at a time, so it 
wouldn't help at all.

* Open the file in append mode and stop seeking to eof in user space.  We never 
modify files (aside from truncating, which is rare enough to be handled 
separately), so perhaps it would help with performance if we let the kernel 
deal with the seek.  We'd still need a way to get the file size for the 
make_blocks function.  I'm wondering if file:read_file_info(Fd) is more 
efficient than file:position(Fd, eof) for this purpose.

A caveat - I'm not sure if append-only files are compatible with the previous 
enhancement.  There is no file:write/2, and I have no idea how file:pwrite 
behaves on a file which is opened append-only.  Is the Pos ignored, or is it an 
error?  Will have to test.
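
One way to get a first answer to the append-mode question is to probe the 
underlying OS calls directly.  The sketch below is Python, not Erlang, and it 
only exercises the POSIX layer (O_APPEND plus pwrite); it says nothing certain 
about how the OTP file driver maps file:pwrite onto those calls.  The function 
names and the "probe.couch" file name are made up for illustration.  It also 
reads the file size both ways, via fstat (roughly what file:read_file_info 
would do) and via a seek to eof (roughly file:position(Fd, eof)).

```python
import os
import tempfile


def probe_append_pwrite(path):
    """Report where a positioned write lands on a file opened with O_APPEND."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_APPEND
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, b"AAAA")    # plain append, the kernel picks the offset
        os.pwrite(fd, b"BB", 0)  # positioned write asking for offset 0
        size_via_stat = os.fstat(fd).st_size          # no seek involved
        size_via_seek = os.lseek(fd, 0, os.SEEK_END)  # seek to eof in user space
    finally:
        os.close(fd)

    with open(path, "rb") as f:
        data = f.read()

    if data == b"AAAABB":
        outcome = "offset ignored, data appended"
    elif data == b"BBAA":
        outcome = "offset honored, data overwritten"
    else:
        outcome = "unexpected contents: %r" % data
    return outcome, size_via_stat, size_via_seek


def run_probe():
    """Run the probe against a throwaway file (hypothetical name)."""
    with tempfile.TemporaryDirectory() as tmp:
        return probe_append_pwrite(os.path.join(tmp, "probe.couch"))


if __name__ == "__main__":
    print(run_probe())
```

The Linux pwrite man page documents that the offset is ignored on O_APPEND 
files, while POSIX says it should be honored, so the outcome is expected to 
differ by platform.  The same experiment would still need to be repeated 
through the Erlang file module.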

* Use O_DSYNC instead of fsync/fdatasync.  This one is inspired by antirez' 
recent blog post [3] and some historical discussions on pgsql-performance.  
Basically, it seems that opening a file with O_DSYNC (or O_SYNC on Linux, which 
is currently the same thing) and doing all synchronous writes is reasonably 
fast.  Antirez' tests showed 250 µs delays for (tiny) synchronous writes, 
compared to 40 ms delays for fsync and fdatasync on his ext4 system.

At the very least, this looks to be a compelling choice for file access when 
the server is running with delayed_commits = true.  We'd need to patch the OTP 
file driver again, and also investigate the cross-platform support.  In 
particular, I don't think it works on NFS.
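
To get a rough feel for the sync options discussed above on a given machine, 
something like the following Python sketch could be used.  It is not a 
reproduction of antirez' benchmark, and the helper names, payload size and 
write count are arbitrary assumptions.  It times small appends under three 
strategies: write plus fsync, write plus fdatasync (falling back to fsync where 
fdatasync is missing, mirroring the file:datasync/file:sync fallback), and 
writes on a descriptor opened with O_DSYNC where the platform defines that flag.

```python
import os
import tempfile
import time


def sync_data(fd):
    """Flush file data, preferring fdatasync and falling back to fsync."""
    if hasattr(os, "fdatasync"):
        os.fdatasync(fd)
    else:
        os.fsync(fd)


def time_appends(path, extra_flags, sync_fn, count=20, payload=b"x" * 64):
    """Return the mean seconds per append for one durability strategy."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_APPEND | extra_flags
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            if sync_fn is not None:
                sync_fn(fd)  # explicit flush after every write
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / count


def compare_strategies(directory):
    """Time each available strategy against its own file in `directory`."""
    strategies = {
        "fsync": (0, os.fsync),
        "fdatasync_or_fsync": (0, sync_data),
    }
    dsync = getattr(os, "O_DSYNC", None)  # not defined on every platform
    if dsync is not None:
        strategies["O_DSYNC"] = (dsync, None)  # the open flag does the syncing
    return {
        name: time_appends(os.path.join(directory, name), flags, fn)
        for name, (flags, fn) in strategies.items()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, secs in compare_strategies(tmp).items():
            print("%-20s %.1f us per write" % (name, secs * 1e6))
```

The numbers only mean something when the temporary directory lives on the 
filesystem and storage device of interest; on a RAM-backed tmpfs all three will 
look equally cheap.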

[1]: http://github.com/fdmanana/otp/tree/fdatasync
[2]: http://github.com/davisp/zip_server
[3]: http://antirez.com/post/fsync-different-thread-useless.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.