On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <[email protected]> wrote:
> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>
>> Hi,
>>
>> Regarding the change introduced by the ticket:
>>
>> https://issues.apache.org/jira/browse/COUCHDB-767
>>
>> (opening the same file in a different process and calling fsync on the
>> new file descriptor through the new process)
>>
>> I found out that it's not a recommended practice. I posted the
>> following question to the ext4 development mailing list:
>>
>> http://www.spinics.net/lists/linux-ext4/msg21388.html
>
> Reading that thread it seems quite unlikely that we're getting into any
> trouble with this patch. But it does seem like we should be considering
> alternatives.

Right, but unlikely does not mean never, unfortunately.

For anyone following along, what the COUCHDB-767 patch does is roughly
this (a simplified sketch, not the actual couch_file code; the function
name is invented):
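    %% Sketch: fsync the file through a second fd opened by a helper
    %% process, so the couch_file process can keep serving requests
    %% while the fsync is in flight.
    sync_in_other_process(FilePath) ->
        Caller = self(),
        spawn_link(fun() ->
            %% A second fd on the same underlying file. Whether an
            %% fsync here is guaranteed to flush writes done through
            %% the original fd is exactly what is in question.
            {ok, Fd} = file:open(FilePath, [read, raw, binary]),
            Result = file:sync(Fd),
            ok = file:close(Fd),
            Caller ! {sync_done, Result}
        end),
        receive
            {sync_done, Result} -> Result
        end.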
>
>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>> 1' command) that when running a writes only test with relaximation
>> (200 write processes), disk write activity is not continuous. Without
>> this patch, there's continuous (every 1 second) write activity.
>
> I'm confused by this statement. You must be talking about relaximation runs
> with delayed_commits = true, right? Why do you think you see larger
> intervals between write activity with the optimization from COUCHDB-767?
> Have you measured the time it takes to open the extra FD? In my tests that
> was a sub-millisecond operation, but maybe you've uncovered something else.

No, it happens for tests with delayed_commits = false. The only
explanation I can see for the variance is the Erlang VM scheduler's
decisions about when to start/run that helper process. I don't know the
exact cause, but the fsync frequency varies a lot.

>
>> This also makes performance comparison tests with relaximation much harder
>> to analyse, as the peak variation is much higher and not periodic.
>
> Still confused. I'm guessing the increased variance only shows up on the
> writes graph - reads certainly ought to have decreased variance because the
> fsync occurs out of band. Is the issue that the fsync takes longer when it
> requires a new FD, or am I reading this all wrong?

Yes, for writes.

>
> <rant>Any variance-related statements based on relaximation results are
> purely qualitative, since relaximation does not report measurement variance.
> Similarly, any claims about relative improvements in response times have an
> unknown statistical significance.</rant>

Right. I didn't mean to say that relaximation is always right or that we
should be blindly guided by relaximation reports.

>
>> For the goal of not having readers getting blocked by fsync calls (and
>> write calls), I would propose using a separate couch_file process just
>> for read operations. I have a branch in my github for this (with
>> COUCHDB-767 reverted). It needs to be polished, but the relaximation
>> tests are very positive, both reads and writes get better response
>> times and throughput:
>>
>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>
> I'd like to propose an alternative optimization, which is to keep a dedicated
> file descriptor open in the couch_db_updater process and use that file
> descriptor for _all_ IO initiated by the db_updater. The advantage is that
> the db_updater does not need to do any message passing for disk IO, and thus
> does not slow down when the incoming message queue is large. A message queue
> much much larger than the number of concurrent writers can occur if a user
> writes with batch=ok, and it can also happen rather easily in a BigCouch
> cluster.

I don't see how that will improve things, since all write operations
will still be done in a serialized manner. Since only couch_db_updater
writes to the DB file, and since access to couch_db_updater is
serialized, it seems to me that your solution only avoids one level of
indirection (the couch_file process). I don't see how, when using a
couch_file only for writes, the message queue of that couch_file process
fills up with write messages. Also, what I did on that branch is a bit
more generic, as it works for view index files as well, and doesn't
introduce significant changes anywhere except in couch_file.erl. In
essence the branch does something like this (a simplified sketch; the
record name is invented, and the real branch keeps the two fds managed
inside couch_file.erl itself):
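    %% Sketch: two couch_file processes on the same file, one used
    %% only by readers, the other only by the db updater. A slow
    %% fsync on the write fd never queues behind (or blocks) reads.
    -record(fds, {read_fd, write_fd}).

    open_db_file(FilePath) ->
        {ok, ReadFd} = couch_file:open(FilePath),
        {ok, WriteFd} = couch_file:open(FilePath),
        %% readers use e.g. couch_file:pread_binary(ReadFd, Pos);
        %% the updater appends and syncs through WriteFd only.
        {ok, #fds{read_fd = ReadFd, write_fd = WriteFd}}.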
Of course your solution could easily be extended to the view updater
process as well, and I don't have anything against it. Anyway, +1. If I
understand it correctly, it amounts to something like this (a
hypothetical sketch; the record and function names are invented):
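    %% Sketch: the db updater holds its own raw fd and does its disk
    %% IO directly, with no gen_server calls to couch_file, so a large
    %% couch_file message queue can't slow the updater down.
    -record(updater_st, {direct_fd, eof}).

    append_bin(#updater_st{direct_fd = Fd, eof = Pos} = St, Bin) ->
        ok = file:pwrite(Fd, Pos, Bin),
        {ok, Pos, St#updater_st{eof = Pos + byte_size(Bin)}}.

    commit(#updater_st{direct_fd = Fd}) ->
        file:sync(Fd).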
>
>> http://graphs.mikeal.couchone.com/#/graph/62b286fbb7aa55a4b0c4cc913c00e659
>> (relaximation test)
>>
>> The test does a direct comparison with trunk (also with COUCHDB-767
>> reverted) and was run like this:
>>
>> $ node tests/compare_write_and_read.js --wclients 300 --rclients 150 \
>>     -name1 2_couch_files_no_batch_reads -name2 trunk \
>>     -url1 http://localhost:5984/ -url2 http://localhost:5985/ \
>>     --duration 120
>
> Is the graph you posted a comparison with trunk, or with trunk minus
> COUCHDB-767? I couldn't parse your statement.

With trunk minus COUCHDB-767.

>
> Side note: I linked to a relaximation test in COUCHDB-767, but that link (to
> mikeal.couchone.com) went dead. Is the graph still available somewhere?

No idea.

>
> That patch looks like it probably improves the requests per second for both
> reads and writes, as it should if you have a multicore server and/or you've
> turned on the async thread pool. Pretty difficult to say if the change is
> statistically significant, though.

Well, yes. But at least write operations (started by couch_db_updater,
e.g.) no longer wait for read operations to be processed (issued by
couch_db.erl, e.g.), and the other way around. This was tested with the
+A flag set to 4 (the default on trunk) and set to 16. In both cases
read and write performance was better (according to the relaximation
tests).

>
>> This approach, of using a file descriptor just for reads and another
>> one just for writes (but both referring to the same file), seems to be
>> safe:
>>
>> http://www.spinics.net/lists/linux-ext4/msg21429.html
>>
>> Race conditions shouldn't happen, as each read call needs an offset,
>> and the offset is only known after a previous write call has finished.
>
> I'm definitely not worried about race conditions, our DB updating logic seems
> to prevent that several times over.
>
>> Thoughts on this? Should we revert COUCHDB-767? Integrate the 2
>> couch_files strategy into trunk? (only after 1.1 has its own branch)
>
> I'm against reverting COUCHDB-767, but I guess it deserves more research to
> see if any of our supported platforms actually implement an fsync() that does
> not sync writes from another file descriptor for the same file. If we do
> find a platform like that we should certainly revert.

The fact that the behaviour is not guaranteed by POSIX or the Single
Unix Specification worries me (not to mention Windows).

>
> I'm probably in favor of a two file descriptors per DB approach, but I think
> attaching the second fd directly to the db_updater will be the winning
> strategy. Best,
>
> Adam
>
>>
>> cheers
>>
>> --
>> Filipe David Manana,
>> [email protected], [email protected]
>>
>> "Reasonable men adapt themselves to the world.
>> Unreasonable men adapt the world to themselves.
>> That's why all progress depends on unreasonable men."
>

--
Filipe David Manana,
[email protected], [email protected]

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."