On Jan 5, 2009, at 1:59 PM, Geir Magnusson Jr. wrote:
On Jan 5, 2009, at 12:54 PM, Damien Katz wrote:
It was brought to my attention that commits on OS X were very slow
with the latest releases of Erlang. After I upgraded to the most
recent version, I found them to be indeed slow, slowing the tests
down to the point it was painful to run them. It appears any disk sync
from Erlang now takes somewhere between 50 and 100ms, up from the
previous times of ~5ms. This is almost certainly due to the
F_FULLFSYNC flag the Erlang file handling now uses on Darwin-based
systems, but it's surprising how bad the performance is on OS X. A
little investigation has shown that other database engines have
similar issues on OS X.
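For anyone who wants to see the difference directly, here is a minimal
sketch (not CouchDB or Erlang code, and the temp file name is made up)
that times a plain fsync() against fcntl(F_FULLFSYNC) on macOS:

    # Compare a plain fsync() with fcntl(F_FULLFSYNC) on Darwin.
    # F_FULLFSYNC also flushes the drive's write cache, which is where
    # the extra latency comes from.
    import fcntl
    import os
    import time

    def time_sync(use_full_fsync, iterations=20):
        fd = os.open("sync_test.tmp", os.O_CREAT | os.O_WRONLY, 0o600)
        try:
            start = time.time()
            for _ in range(iterations):
                os.write(fd, b"x" * 4096)
                if use_full_fsync:
                    fcntl.fcntl(fd, fcntl.F_FULLFSYNC)  # Darwin only
                else:
                    os.fsync(fd)
            return (time.time() - start) / iterations
        finally:
            os.close(fd)
            os.remove("sync_test.tmp")

    if __name__ == "__main__":
        print("fsync       : %.1f ms" % (time_sync(False) * 1000))
        print("F_FULLFSYNC : %.1f ms" % (time_sync(True) * 1000))
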
One user, though, reported a huge performance difference between
0.8 and trunk *using the same version of Erlang*.
http://mail-archives.apache.org/mod_mbox/couchdb-user/200901.mbox/%[email protected]%3e
To me, this hints that something changed in CouchDB that triggers
usage of fcntl(F_FULLFSYNC).
I haven't tried to duplicate it yet, but I will. If this can be
verified, I think that this is an important mystery to solve.
To address this problem, I implemented delayed commit
functionality. We had always intended to implement delayed commit
for performance reasons, but hadn't had the need until now. This
makes updates much faster in the general case, but with the caveat
that they aren't flushed completely to disk right away. If you can't
tolerate the possible loss of recent updates, you can use the "full
commit" option for ACID commits.
So this brings up a question. There is no way to get the same
durability semantics on Linux as you can get on OS X with F_FULLFSYNC.
On Linux, as always, it depends (on distro, file system, etc.), but
generally fsync flushes to disk, or so I've been told by those who
should know, and I've not seen credible evidence otherwise. But if
fsync is broken by default on Linux (say, on Debian-based distros),
file a bug and we'll see about getting Erlang patched with the proper
APIs (the Erlang F_FULLFSYNC change was from us too).
This means that the "full commit" option really gives you different
levels of durability, depending on whether or not you are on OS X.
And thinking more about what appears to be the perf bug/slowdown in
CouchDB code, might this warrant three options?
1) delayed commit (what you did last night)
2) fsync() commit (what I suspect Couch did on and around 0.8)
3) optional F_FULLFSYNC commit, on OS X and any other platform that
provides this level of commit
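Purely as an illustration of how those three levels differ at the
syscall layer (this is not CouchDB's Erlang implementation, and the
function and mode names here are made up), a sketch might look like:

    import fcntl
    import os

    def commit(fd, mode):
        # Hypothetical dispatch over the three durability levels above.
        if mode == "delayed":
            return                              # rely on a later batched flush
        if mode == "full" and hasattr(fcntl, "F_FULLFSYNC"):
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)  # also flush the drive's cache
        else:
            os.fsync(fd)                        # plain fsync(): best available
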
If necessary and possible, we'll patch the Erlang VM. But if a
platform doesn't support proper flushing, then it's not a platform
that can support an ACID database.
For a full ACID commit, add a header field to the doc PUT or
_bulk_docs POST like this:
X-Couch-Full-Commit:true
Then CouchDB will completely commit the change before returning.
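To make that concrete, here's a small standard-library example of a
document PUT with that header; the host/port, database name "testdb",
and document are placeholders I picked for the example:

    import http.client
    import json

    conn = http.client.HTTPConnection("localhost", 5984)
    conn.request(
        "PUT",
        "/testdb/mydoc",
        body=json.dumps({"value": 1}),
        headers={
            "Content-Type": "application/json",
            "X-Couch-Full-Commit": "true",  # ask for a full (synced) commit
        },
    )
    resp = conn.getresponse()
    print(resp.status, resp.read().decode())
    conn.close()
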
That's cool but it puts the burden on the client -
True, but they already have the API they must conform to; I don't see
this option as being particularly burdensome unless it's simply the
wrong default.
why not make it a config option, so that the db admin can choose the
durability level in general, and let clients that know they are
talking to couch override w/ a header?
Definitely, I think commit options should be settable per-database.
But for now I just wanted to address the slowdown, especially for
replication and the tests, to keep everyone productive. More commit
features and options are lower-priority work for now; I was just
addressing the most serious slowdown.
Also, having the default be delayed commit will help us flush out any
problems, especially for usability and real production testing. If
it's broken, whether by a simple bug or by design, we need to know it
as soon as possible.
-Damien
geir
Also, if you have several delayed updates and you want to make sure
they all made it to disk, you can invoke POST
/db/_ensure_full_commit and all outstanding commits are flushed to
disk.
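For example, with the same assumed host/port and the placeholder
database name "testdb" as above:

    import http.client

    conn = http.client.HTTPConnection("localhost", 5984)
    conn.request("POST", "/testdb/_ensure_full_commit",
                 headers={"Content-Type": "application/json"})
    resp = conn.getresponse()
    print(resp.status, resp.read().decode())
    conn.close()
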
The view engine has already been modified to deal with delayed
commits too; it ensures it never fully commits its own indexes to
disk if the documents indexed aren't already committed to disk.
The last remaining work item is db server crash detection, so that
clients can detect when a server has crashed and potentially lost
updates. This is pretty simple: each db server just needs a unique
ID generated at its startup. Clients retrieve this value at the
beginning of their writes and then check that the value is the same
once the data is flushed to disk. If not, we know we may have lost
some updates and we redo the replication from the last known good
commit.
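Client-side, that check might look roughly like the sketch below. The
field name "instance_start_time" and the endpoints used to read it are
assumptions on my part; the mail only says the server exposes a unique
ID generated at startup.

    import http.client
    import json

    def get_start_id(conn, db):
        # Assumed: the db info document carries the server's startup ID.
        conn.request("GET", "/" + db)
        return json.loads(conn.getresponse().read()).get("instance_start_time")

    def flush(conn, db):
        conn.request("POST", "/" + db + "/_ensure_full_commit",
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())

    conn = http.client.HTTPConnection("localhost", 5984)
    before = get_start_id(conn, "testdb")
    # ... issue a batch of (possibly delayed-commit) writes here ...
    after = flush(conn, "testdb").get("instance_start_time")
    if after != before:
        # The server restarted mid-batch; delayed updates may be lost,
        # so redo the work from the last known good commit.
        print("server restarted; re-send from the last checkpoint")
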
Right now the default is to delay the commits, because I think that
will be the most common use case but I'm really not sure. I
definitely want the commits delayed for the test suite, to keep
things running fast.
-Damien