Re: Faster updates, optional ACID

Geir Magnusson Jr. Mon, 05 Jan 2009 11:52:37 -0800


On Jan 5, 2009, at 2:32 PM, Damien Katz wrote:

On Jan 5, 2009, at 1:59 PM, Geir Magnusson Jr. wrote:
On Jan 5, 2009, at 12:54 PM, Damien Katz wrote:
It was brought to my attention that commits on OS X were very slow with the latest releases of Erlang. After I upgraded to the most recent version, I found them to be indeed slow, slowing the tests down to the point it was painful to run them. It appears any disk sync from Erlang now takes somewhere between 50 and 100ms, up from the previous times of ~5ms. This is almost certainly due to the F_FULLFSYNC flag the Erlang file handling now uses on Darwin-based systems, but it's surprising how bad performance is on OS X. A little investigation has shown other database engines have similar issues on OS X.
One user, though, reported a huge performance difference between 0.8 and trunk *using the same version of Erlang*.
http://mail-archives.apache.org/mod_mbox/couchdb-user/200901.mbox/%[email protected]%3e
To me, this hints that something changed in CouchDB that triggers usage of fcntl(F_FULLFSYNC).
I haven't tried to duplicate but will. If this can be verified, I think that this is an important mystery to solve.
To address this problem, I implemented delayed commit functionality. We had always intended to implement delayed commit for performance reasons, but hadn't had the need until now. This makes updates much faster in the general case, but with the caveat that they aren't flushed completely to disk right away. If you can't tolerate the possible loss of recent updates, you can use the "full commit" option for ACID commits.
So this brings up a question. There is no way to get the same durability semantics on linux as you can get on OS X with F_FULLFSYNC.
On linux, as always it depends (on distro, file system, etc), but generally fsync flushes to disk, or so I've been told by those who should know and I've not seen credible evidence otherwise. But if fsync is broken by default on Linux (like say Debian based distros), file a bug and we'll see about getting Erlang patched with the proper apis (the Erlang F_FULLFSYNC change was from us too).

fsync() on Linux and OS X flushes to disk. I'm not suggesting that it doesn't.

What it doesn't do is flush the write caches *on* the disk unit itself (which is what F_FULLFSYNC supposedly does, hence the sloth).

IOW, with fsync(), there's no guarantee that the bits get written to the physical media. As far as the OS knows, the FS caches are flushed to the device, but the device may still be holding them in its own RAM.
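
To make the distinction concrete, here is a minimal sketch in Python (not Erlang, and not what CouchDB or the Erlang VM actually does). The function names are made up for illustration. It assumes a POSIX system; F_FULLFSYNC is only looked up where the platform defines it.

```python
import fcntl
import os


def sync_to_device(fd):
    """Plain fsync(): the OS hands its cached data for fd to the device."""
    os.fsync(fd)
    return "fsync"


def sync_to_media(fd):
    """Ask for the device's own write cache to be flushed, where possible.

    F_FULLFSYNC is defined on Darwin-based systems only. On other
    platforms this falls back to fsync(), which is the weaker request
    discussed above.
    """
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return "F_FULLFSYNC"
        except OSError:
            # The file system may not support the request; fall back.
            pass
    os.fsync(fd)
    return "fsync"
```

The return value only reports which call was issued, so a caller can log which durability level it actually got on the current platform.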

re the FULLFSYNC change, do you have the option to not have it used, but have fsync() used instead?

This means that the "full commit" option really gives you different levels of durability, depending on whether or not you are on OS X.
And thinking more about what appears to be the perf bug/slowdown in CouchDB code, might this warrant three options?
1) delayed commit (what you did last night)
2) fsync() commit (what I suspect Couch did on and around 0.8)
3) optional F_FULLFSYNC commit, on OS X and any other platform that provides this level of commit
If necessary and possible, we'll patch the Erlang VM.

That seems like a bad idea to me - I'd think you'd want to stay out of the VM business.

But if a platform doesn't support proper flushing, then it's not a platform that can support an ACID database.


We're not communicating well here.

"proper flushing" depends on what you want to do - if you need your data to be in confirmed permanent storage so that it can survive a crash or power cut, then w/o special configuration (e.g. battery-backed RAID), I don't think that you're going to get assurance on linux.


Do you see what I'm saying?

For full ACID commit, add a header field to the doc PUT or _bulk_docs POST like this:
X-Couch-Full-Commit:true

Then couchdb will completely commit the change before returning.
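
For illustration, a client could attach the header like this. This is a sketch only: the host, port, database name and document ID are assumptions, and the request is built but not sent.

```python
import json
import urllib.request


def build_full_commit_put(doc_url, doc):
    """Build (but do not send) a doc PUT that asks for a full commit."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        doc_url,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # The header described above; the server commits fully
            # before it returns.
            "X-Couch-Full-Commit": "true",
        },
    )


# Hypothetical URL: a local server, a database "mydb", a document "doc1".
req = build_full_commit_put("http://localhost:5984/mydb/doc1", {"value": 1})
# urllib.request.urlopen(req) would send it to a running server.
```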
That's cool but it puts the burden on the client -
True, but they already have the API they must conform to. I don't see this option as being particularly burdensome unless it's simply the wrong default.

But it keeps adding requirements to the API that aren't really in the application domain necessarily.

why not make it a config option, so that the db admin can choose the durability level in general, and let clients that know they are talking to couch override w/ a header?
Definitely, I think commit options should be settable per-database. But for now I was just wanting to address the slowdown, especially for replication and the tests, to keep everyone productive. More commit features and options are lower priority work for now, I was just addressing the most serious slowdown.
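
The precedence being proposed (admin default, client header override) could be sketched like this. Nothing here exists in CouchDB: the function name and the config_default parameter are hypothetical.

```python
def wants_full_commit(headers, config_default):
    """Decide whether a request should be fully committed before replying.

    headers: dict of request headers as received from the client.
    config_default: bool chosen by the db admin (hypothetical setting).
    """
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "x-couch-full-commit":
            return value.strip().lower() == "true"
    # No header: the client did not express a preference.
    return config_default
```

A client that knows nothing about couch sends no header and gets whatever the admin configured; a client that does know can opt in or out per request.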

That makes sense, but IMO you papered over the root problem. It's good to keep people working, but I think the issue deserves a look. I don't know erlang, or I would look myself.


geir

Also, having the default be delayed commit will help us flush out any problems, especially for usability and real production testing. If it's broken, either a simple bug or by design, we need to know it as soon as possible.
-Damien
geir
Also, if you have several delayed updates and you want to make sure they all made it to disk, you can invoke POST /db/_ensure_full_commit and all outstanding commits are flushed to disk.
The view engine has already been modified to deal with delayed commits too, it ensures it never fully commits its own indexes to disk if the documents indexed aren't already committed to disk.
The last remaining work item is db server crash detection, so that clients can detect when a server has crashed and potentially lost updates. This is pretty simple: each db server just needs a unique ID generated at its startup. The client retrieves this value at the beginning of the writes and then checks that the value is the same once done and flushed to disk. If not, we know we may have lost some updates and we redo the replication from the last known good commit.
Right now the default is to delay the commits, because I think that will be the most common use case but I'm really not sure. I definitely want the commits delayed for the test suite, to keep things running fast.
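
A minimal sketch of that check, using an in-memory stand-in instead of a real server. The class, its methods and the crash_after parameter are all hypothetical and exist only to show the compare-the-startup-ID idea.

```python
import uuid


class FakeDbServer:
    """In-memory stand-in for a db server (hypothetical, for illustration)."""

    def __init__(self):
        self.instance_id = uuid.uuid4().hex  # unique ID generated at startup
        self.committed = []  # updates that reached disk
        self.pending = []    # delayed updates not yet flushed

    def write(self, doc):
        self.pending.append(doc)

    def ensure_full_commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def crash_and_restart(self):
        self.pending = []                    # delayed updates are lost
        self.instance_id = uuid.uuid4().hex  # a new ID marks the restart


def replicate(server, docs, crash_after=None):
    """Write docs, flush, and report whether the server stayed up throughout.

    crash_after simulates a crash just before the doc at that index.
    """
    start_id = server.instance_id
    for i, doc in enumerate(docs):
        if crash_after is not None and i == crash_after:
            server.crash_and_restart()
        server.write(doc)
    server.ensure_full_commit()
    # A changed ID means some delayed updates may have been lost.
    return server.instance_id == start_id
```

When replicate() returns False, the caller would redo the replication from the last known good commit rather than trust that every write survived.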
-Damien
