On Mon, Nov 8, 2010 at 3:04 PM, Randall Leeds <[email protected]> wrote:
> Whoops. Hit send too early, but I think I got everything in there that
> I wanted to say.
>
> As for the ref counter bottleneck, I just pushed to
> https://github.com/tilgovi/couchdb/tree/ets_ref_count
> This branch uses a public ets table for the ref_counter. I think I
> managed to linearize the updates over the {total, RefCtr} keys in the
> ets table such that there should be no race conditions, but please,
> please take a look at this if you have time.
>
> It seems to pass the ref_counter tests, but I still need to handle
> giving away ownership of the ets table. Right now I use couch_server
> as the heir so I can use only one ETS table for all couch_ref_counter
> processes, but the couch_server just crashes if it actually receives
> the 'ETS-TRANSFER' message. If I can't find an easy way to hand the
> table to another couch_ref_counter whenever the owner exits, I may
> just break the encapsulation of the module a bit by leaving
> couch_server as the owner and ignoring that message.
>
> Thanks, guys. My gut says we're going to get some nice numbers when
> all this is done.
>
> -Randall
>
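For anyone skimming the thread, here is roughly the shape I picture for
the public-ets counter (a sketch based on Randall's description, not the
branch code; the module name and the add/drop API are made up, though
the {total, RefCtr} key layout is his):

    -module(ets_refctr_sketch).
    -export([create_table/0, new/1, add/1, drop/1]).

    -define(TABLE, couch_ref_counters).

    %% One public table shared by all counters, so any process can
    %% update a count directly, with no gen_server call in the hot
    %% path. (Randall additionally passes a {heir, Pid, Data} option,
    %% which is what raises the 'ETS-TRANSFER' question.)
    create_table() ->
        ets:new(?TABLE, [set, public, named_table]).

    new(RefCtr) ->
        true = ets:insert(?TABLE, {{total, RefCtr}, 0}).

    %% ets:update_counter/3 is atomic, so concurrent updates
    %% linearize inside ETS rather than in a process mailbox.
    add(RefCtr) ->
        ets:update_counter(?TABLE, {total, RefCtr}, 1).

    drop(RefCtr) ->
        ets:update_counter(?TABLE, {total, RefCtr}, -1).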
> On Mon, Nov 8, 2010 at 11:56, Randall Leeds <[email protected]> wrote:
>> Thanks to both of you for getting this conversation going again and
>> for the work on the patch and testing, Filipe.
>>
>> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski <[email protected]> wrote:
>>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>>
>>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <[email protected]> wrote:
>>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>>
>>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <[email protected]>
>>>>>> wrote:
>>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>>
>>>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool
>>>>>>>> iostat 1' command) that when running a writes-only test with
>>>>>>>> relaximation (200 write processes), disk write activity is not
>>>>>>>> continuous. Without this patch, there's continuous (every 1
>>>>>>>> second) write activity.
>>>>>>>
>>>>>>> I'm confused by this statement. You must be talking about
>>>>>>> relaximation runs with delayed_commits = true, right? Why do you
>>>>>>> think you see larger intervals between write activity with the
>>>>>>> optimization from COUCHDB-767? Have you measured the time it
>>>>>>> takes to open the extra FD? In my tests that was a
>>>>>>> sub-millisecond operation, but maybe you've uncovered something
>>>>>>> else.
>>>>>>
>>>>>> No, it happens for tests with delayed_commits = false. The only
>>>>>> possible explanation I see for the variance might be related to
>>>>>> the Erlang VM scheduler's decisions about when to start/run that
>>>>>> process. Nevertheless, I don't know the exact cause, but the
>>>>>> fsync run frequency varies a lot.
>>>>>
>>>>> I think it's worth investigating. I couldn't reproduce it on my
>>>>> plain-old spinning-disk MacBook with 200 writers in relaximation;
>>>>> the IOPS reported by iostat stayed very uniform.
>>>>>
>>>>>>>> For the goal of not having readers blocked by fsync calls (and
>>>>>>>> write calls), I would propose using a separate couch_file
>>>>>>>> process just for read operations. I have a branch in my github
>>>>>>>> for this (with COUCHDB-767 reverted). It needs to be polished,
>>>>>>>> but the relaximation tests are very positive; both reads and
>>>>>>>> writes get better response times and throughput:
>>>>>>>>
>>>>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>>>>
>>>>>>> I'd like to propose an alternative optimization, which is to
>>>>>>> keep a dedicated file descriptor open in the couch_db_updater
>>>>>>> process and use that file descriptor for _all_ IO initiated by
>>>>>>> the db_updater. The advantage is that the db_updater does not
>>>>>>> need to do any message passing for disk IO, and thus does not
>>>>>>> slow down when the incoming message queue is large. A message
>>>>>>> queue much, much larger than the number of concurrent writers
>>>>>>> can occur if a user writes with batch=ok, and it can also happen
>>>>>>> rather easily in a BigCouch cluster.
>>>>>>
>>>>>> I don't see how that will improve things, since all write
>>>>>> operations will still be done in a serialized manner. Since only
>>>>>> couch_db_updater writes to the DB file, and since access to the
>>>>>> couch_db_updater is serialized, to me it seems that your solution
>>>>>> only avoids one level of indirection (the couch_file process). I
>>>>>> don't see how, when using a couch_file only for writes, you get
>>>>>> the message queue for that couch_file process full of write
>>>>>> messages.
>>>>>
>>>>> It's the db_updater which gets a large message queue, not the
>>>>> couch_file. The db_updater ends up with a big backlog of
>>>>> update_docs messages that get in the way when it needs to make
>>>>> gen_server calls to the couch_file process for IO. It's a
>>>>> significant problem in R13B, probably less so in R14B because of
>>>>> some cool optimizations by the OTP team.
>>>>
>>>> So, let me see if I get it. The couch_db_updater process is slow to
>>>> pick up the results of its calls to the couch_file process because
>>>> its mailbox is full of update_docs messages?
>>>
>>> Correct. Each call to the couch_file requires a selective receive on
>>> the part of the db_updater in order to get the response, and prior
>>> to R14 that selective receive needed to match against every message
>>> in the mailbox. It's really a bigger problem in couch_server, which
>>> uses a gen_server call to increment a reference counter before
>>> handing the #db{} to the client, since every request to any DB has
>>> to talk to couch_server first. Best,
>>>
>>> Adam
>>
>> Adam,
>> I think the problem is made worse by a backed-up db_updater, but the
>> db_updater becomes backed up because it makes more synchronous calls
>> to the couch_file than a reader does, handling only one update
>> operation at a time while readers queue up on the couch_file in
>> parallel.
>>
>> Filipe,
>> Using a separate fd for writes at the couch_file level is not the
>> answer. The db_updater has to read the btree before it can write,
>> incurring multiple trips through the couch_file message queue between
>> queuing append_term requests and processing its own message queue for
>> new updates. Using two file descriptors keeps the readers out of the
>> way of the writers only if you select which fd to use at the
>> db-operation level and not the file-operation level. Perhaps two
>> couch_file processes would be better. Fairness should be left to the
>> operating system I/O scheduler once reads don't have to queue behind
>> writes. This seems like the best way forward to me right now. Let's
>> try to crunch some numbers on it soon.
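To make the db-operation-level split concrete, something like this is
what I have in mind (a sketch only; the #db{} fields and function names
are invented, and I'm assuming couch_file's pread_term/append_term
interface):

    %% Hypothetical #db{} carrying two couch_file pids instead of one.
    -record(db, {reader_fd, updater_fd}).

    %% Interactive reads go to the reader couch_file...
    read_doc_body(#db{reader_fd = Fd}, Pos) ->
        couch_file:pread_term(Fd, Pos).

    %% ...while the db_updater does *all* of its IO, btree reads
    %% included, on the other one, so its lookups never wait behind a
    %% queue of reader requests.
    updater_read(#db{updater_fd = Fd}, Pos) ->
        couch_file:pread_term(Fd, Pos).

    updater_append(#db{updater_fd = Fd}, Term) ->
        couch_file:append_term(Fd, Term).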
>>
>> I couldn't find a solution I liked that was fair to readers and
>> writers under any workload with only one file descriptor. The btree
>> cache alleviates this problem a bit, because the read path becomes
>> much faster, which improves both database reads and writes.
>>
>> As to the patch, I think we need the readers and writers separated
>> into two separate couch_files. That way the updater can perform its
>> reads on the "writer" fd; otherwise writers suffer starvation,
>> because readers go directly into the couch_file queue in parallel
>> instead of serializing through something like the db_updater.
>>
>
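As an aside, the selective-receive cost Adam mentions above is easy to
see if you strip gen_server:call down to its essentials (this is a
simplification for illustration; the real implementation uses a monitor
reference and a timeout):

    %% Simplified shape of a gen_server call.
    call(Pid, Request) ->
        Ref = make_ref(),
        Pid ! {'$gen_call', {self(), Ref}, Request},
        receive
            {Ref, Reply} ->
                %% With N unrelated update_docs messages already
                %% queued, this receive scans all N of them before
                %% R14; R14 can skip messages that arrived before
                %% the make_ref() above.
                Reply
        end.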
Wasn't there a branch or patch somewhere that just removed the
ref_counter code entirely and used monitors/links to make sure
everything behaved correctly? I'm not sure I ever saw it, so I can't
say how dramatic and/or scary it was, but it might be another approach
to consider.
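In case it helps the discussion, the monitor-based idea would look
something like this (a minimal sketch, not the branch in question; all
names are invented): the process that owns the resource monitors each
client instead of counting references, and stops when the last monitor
fires.

    %% Minimal sketch of ref-counting via monitors.
    -module(monitor_refs_sketch).
    -behaviour(gen_server).
    -export([start_link/0, attach/2]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    -record(state, {refs = []}).

    start_link() ->
        gen_server:start_link(?MODULE, [], []).

    %% A client registers interest; the server monitors it rather
    %% than incrementing a shared counter.
    attach(Server, ClientPid) ->
        gen_server:call(Server, {attach, ClientPid}).

    init([]) ->
        {ok, #state{}}.

    handle_call({attach, Pid}, _From, #state{refs = Refs} = State) ->
        Ref = erlang:monitor(process, Pid),
        {reply, ok, State#state{refs = [Ref | Refs]}}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    %% The 'DOWN' message replaces the explicit decrement: when the
    %% last monitored client exits, the server stops itself.
    handle_info({'DOWN', Ref, process, _Pid, _Reason},
                #state{refs = Refs} = State) ->
        case lists:delete(Ref, Refs) of
            [] -> {stop, normal, State#state{refs = []}};
            Rest -> {noreply, State#state{refs = Rest}}
        end.

    terminate(_Reason, _State) ->
        ok.

    code_change(_OldVsn, State, _Extra) ->
        {ok, State}.

The nice part is that an abnormal client exit cleans up automatically;
the catch is that clients still register through a gen_server call, so
by itself this doesn't remove the couch_server serialization Adam
mentioned.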
