[
https://issues.apache.org/jira/browse/COUCHDB-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adam Kocoloski updated COUCHDB-754:
-----------------------------------
Attachment: cheaper-appending.patch
Here's a patch for the "Open the file in append mode and stop seeking to eof in
user space." task item. It extends the #file record to track EOF, which allows
us to skip the call to file:position/2 before every write. It also opens the
file with O_APPEND and thus can use file:write/2 instead of file:pwrite/3.
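The EOF-tracking idea can be sketched in C for readers who think in terms of the underlying syscalls. This is only an illustration, not the patch itself: `struct append_file`, `append_file_open` and `append_file_write` are hypothetical names standing in for the #file record and the couch_file write path, and the sketch assumes a single writer, since the cached EOF goes stale if anything else appends to the file.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical analogue of the #file record: a descriptor plus a cached EOF. */
struct append_file {
    int   fd;
    off_t eof;
};

/* Open in append mode and read the file size once, at open time. */
static int append_file_open(struct append_file *f, const char *path)
{
    struct stat st;

    f->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (f->fd < 0)
        return -1;
    if (fstat(f->fd, &st) != 0) {
        close(f->fd);
        return -1;
    }
    f->eof = st.st_size;
    return 0;
}

/*
 * Append len bytes and return the offset at which they landed, or -1 on
 * error or short write. No seek is issued: the kernel positions the write
 * because of O_APPEND, and the offset comes from the cached EOF.
 */
static off_t append_file_write(struct append_file *f, const void *buf, size_t len)
{
    off_t pos = f->eof;
    ssize_t n = write(f->fd, buf, len);

    if (n < 0)
        return -1;
    f->eof += n; /* advance by the bytes actually written */
    return n == (ssize_t)len ? pos : -1;
}
```

The returned offset plays the role that the result of the position call played before: the caller still learns where its data starts, without a per-write seek.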
I did try file:read_file_info/1, but it calls stat() and is quite slow (~35
µs). There is no Erlang interface to fstat() that I'm aware of.
I wrote a tiny little C routine to compare the cost of lseek() + pwrite() (what
we do now) to the cost of write() with an O_APPEND file (this patch). 1-byte
write times dropped from ~30 µs to ~9 µs. I can come up with more quantitative
numbers if desired. The extra time in the old code seemed to be split pretty
evenly between the lseek() and the pwrite().
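A comparison along these lines could look roughly like the sketch below. It is not the routine used for the numbers above; the function names, the file path argument and the iteration count are all assumptions made for illustration, and absolute timings will vary with the kernel and filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Old approach: locate EOF in user space, then write at that offset. */
static int append_seek_pwrite(int fd, const void *buf, size_t len)
{
    off_t eof = lseek(fd, 0, SEEK_END);

    if (eof == (off_t)-1)
        return -1;
    return pwrite(fd, buf, len, eof) == (ssize_t)len ? 0 : -1;
}

/* New approach: fd was opened with O_APPEND, so the kernel finds EOF. */
static int append_write(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len) == (ssize_t)len ? 0 : -1;
}

/*
 * Time `iterations` 1-byte appends to `path` (truncated first) and return
 * the mean cost per append in microseconds, or -1.0 on error.
 * use_append selects write() on an O_APPEND descriptor; otherwise the
 * lseek() + pwrite() pair is used.
 */
static double time_appends(const char *path, int use_append, int iterations)
{
    int flags = O_WRONLY | O_CREAT | O_TRUNC | (use_append ? O_APPEND : 0);
    int fd = open(path, flags, 0644);
    struct timespec t0, t1;

    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        int rc = use_append ? append_write(fd, "x", 1)
                            : append_seek_pwrite(fd, "x", 1);
        if (rc != 0) {
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    double elapsed_us = (t1.tv_sec - t0.tv_sec) * 1e6
                      + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return elapsed_us / iterations;
}
```

Calling `time_appends(path, 0, n)` and `time_appends(path, 1, n)` on the same filesystem gives the two per-write averages to compare.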
So, what we were doing before wasn't exactly expensive, but I believe this
patch is faster. The iolist_size calls that I've added are sub-µs, so we
shouldn't worry about that.
> Investigate alternative couch_file writer implementations
> ---------------------------------------------------------
>
> Key: COUCHDB-754
> URL: https://issues.apache.org/jira/browse/COUCHDB-754
> Project: CouchDB
> Issue Type: Improvement
> Environment: some code might be platform-specific
> Reporter: Adam Kocoloski
> Fix For: 1.1
>
> Attachments: cheaper-appending.patch
>
>
> I've got a number of possible enhancements to couch_file floating around in
> my head, wanted to write them down.
> * Use fdatasync instead of fsync. Filipe posted a patch to the OTP file
> driver [1] that adds a new file:datasync/1 function. I suspect that we won't
> see much of a performance gain from this switch because we append to the file
> and thus need to update the file metadata anyway. On the other hand, I'm
> fairly certain fdatasync is always safe for our needs, so if it is ever more
> efficient we should use it. Obviously, we'll need to fall back to
> file:sync/1 on platforms where the datasync function is not available.
> * Use file:pwrite/2 to batch together multiple outstanding write requests.
> This is essentially Paul's zip_server [2]. In order to take full advantage
> of it we need to patch couch_btree to update nodes in parallel. Currently
> there should only be 1 outstanding write request in a couch_file at a time,
> so it wouldn't help at all.
> * Open the file in append mode and stop seeking to eof in user space. We
> never modify files (aside from truncating, which is rare enough to be handled
> separately), so perhaps it would help with performance if we let the kernel
> deal with the seek. We'd still need a way to get the file size for the
> make_blocks function. I'm wondering if file:read_file_info(Fd) is more
> efficient than file:position(Fd, eof) for this purpose.
> A caveat - I'm not sure if append-only files are compatible with the previous
> enhancement. There is no file:write/2, and I have no idea how file:pwrite
> behaves on a file which is opened append-only. Is the Pos ignored, or is it
> an error? Will have to test.
> * Use O_DSYNC instead of fsync/fdatasync. This one is inspired by antirez'
> recent blog post [3] and some historical discussions on pgsql-performance.
> Basically, it seems that opening a file with O_DSYNC (or O_SYNC on Linux,
> which is currently the same thing) and doing all synchronous writes is
> reasonably fast. Antirez' tests showed 250 µs delays for (tiny) synchronous
> writes, compared to 40 ms delays for fsync and fdatasync on his ext4 system.
> At the very least, this looks to be a compelling choice for file access when
> the server is running with delayed_commits = true. We'd need to patch the
> OTP file driver again, and also investigate the cross-platform support. In
> particular, I don't think it works on NFS.
> [1]: http://github.com/fdmanana/otp/tree/fdatasync
> [2]: http://github.com/davisp/zip_server
> [3]: http://antirez.com/post/fsync-different-thread-useless.html