On Mon, Nov 8, 2010 at 3:04 PM, Randall Leeds <[email protected]> wrote:
> Whoops. Hit send too early, but I think I got everything in there that
> I wanted to say.
>
> As for the ref counter bottleneck, I just pushed to
> https://github.com/tilgovi/couchdb/tree/ets_ref_count
> This branch uses a public ets table for the ref_counter. I think I
> managed to linearize the updates over the {total, RefCtr} keys in the
> ets table such that there should be no race conditions, but please,
> please take a look at this if you have time.
>
> It seems to pass the ref_counter tests, but I still need to handle
> giving away ownership of the ets table. Right now I use couch_server
> as the heir so I can use only one ETS table for all couch_ref_counter
> processes, but the couch_server just crashes if it actually receives
> the 'ETS-TRANSFER' message. If I can't find an easy way to hand the
> table to another couch_ref_counter whenever the owner exits, I may
> just break the encapsulation of the module a bit by leaving
> couch_server as the owner and ignoring that message.
>
> Thanks, guys. My gut says we're going to get some nice numbers when
> all this is done.
>
> -Randall
>
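For anyone skimming the thread, here is roughly the shape I picture for
the public-ets counter (a sketch based on Randall's description, not the
branch code; the module name and the add/drop API are made up, though
the {total, RefCtr} key layout is his):

    -module(ets_refctr_sketch).
    -export([create_table/0, new/1, add/1, drop/1]).

    -define(TABLE, couch_ref_counters).

    %% One public table shared by all counters, so any process can
    %% update a count directly, with no gen_server call in the hot
    %% path. (Randall additionally passes a {heir, Pid, Data} option,
    %% which is what raises the 'ETS-TRANSFER' question.)
    create_table() ->
        ets:new(?TABLE, [set, public, named_table]).

    new(RefCtr) ->
        true = ets:insert(?TABLE, {{total, RefCtr}, 0}).

    %% ets:update_counter/3 is atomic, so concurrent updates
    %% linearize inside ETS rather than in a process mailbox.
    add(RefCtr) ->
        ets:update_counter(?TABLE, {total, RefCtr}, 1).

    drop(RefCtr) ->
        ets:update_counter(?TABLE, {total, RefCtr}, -1).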
> On Mon, Nov 8, 2010 at 11:56, Randall Leeds <[email protected]> wrote:
>> Thanks to both of you for getting this conversation going again and
>> for the work on the patch and testing, Filipe.
>>
>> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski <[email protected]> wrote:
>>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>>
>>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <[email protected]> wrote:
>>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>>
>>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <[email protected]>
>>>>>> wrote:
>>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>>
>>>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool
>>>>>>>> iostat 1' command) that when running a writes-only test with
>>>>>>>> relaximation (200 write processes), disk write activity is not
>>>>>>>> continuous. Without this patch, there's continuous (every 1
>>>>>>>> second) write activity.
>>>>>>>
>>>>>>> I'm confused by this statement. You must be talking about
>>>>>>> relaximation runs with delayed_commits = true, right? Why do you
>>>>>>> think you see larger intervals between write activity with the
>>>>>>> optimization from COUCHDB-767? Have you measured the time it
>>>>>>> takes to open the extra FD? In my tests that was a
>>>>>>> sub-millisecond operation, but maybe you've uncovered something
>>>>>>> else.
>>>>>>
>>>>>> No, it happens for tests with delayed_commits = false. The only
>>>>>> possible explanation I see for the variance might be related to
>>>>>> the Erlang VM scheduler's decisions about when to start/run that
>>>>>> process. Nevertheless, I don't know the exact cause, but the
>>>>>> fsync run frequency varies a lot.
>>>>>
>>>>> I think it's worth investigating. I couldn't reproduce it on my
>>>>> plain-old spinning-disk MacBook with 200 writers in relaximation;
>>>>> the IOPS reported by iostat stayed very uniform.
>>>>>
>>>>>>>> For the goal of not having readers blocked by fsync calls (and
>>>>>>>> write calls), I would propose using a separate couch_file
>>>>>>>> process just for read operations. I have a branch in my github
>>>>>>>> for this (with COUCHDB-767 reverted). It needs to be polished,
>>>>>>>> but the relaximation tests are very positive; both reads and
>>>>>>>> writes get better response times and throughput:
>>>>>>>>
>>>>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>>>>
>>>>>>> I'd like to propose an alternative optimization, which is to
>>>>>>> keep a dedicated file descriptor open in the couch_db_updater
>>>>>>> process and use that file descriptor for _all_ IO initiated by
>>>>>>> the db_updater. The advantage is that the db_updater does not
>>>>>>> need to do any message passing for disk IO, and thus does not
>>>>>>> slow down when the incoming message queue is large. A message
>>>>>>> queue much, much larger than the number of concurrent writers
>>>>>>> can occur if a user writes with batch=ok, and it can also happen
>>>>>>> rather easily in a BigCouch cluster.
>>>>>>
>>>>>> I don't see how that will improve things, since all write
>>>>>> operations will still be done in a serialized manner. Since only
>>>>>> couch_db_updater writes to the DB file, and since access to the
>>>>>> couch_db_updater is serialized, to me it seems that your solution
>>>>>> only avoids one level of indirection (the couch_file process). I
>>>>>> don't see how, when using a couch_file only for writes, you get
>>>>>> the message queue for that couch_file process full of write
>>>>>> messages.
>>>>>
>>>>> It's the db_updater which gets a large message queue, not the
>>>>> couch_file. The db_updater ends up with a big backlog of
>>>>> update_docs messages that get in the way when it needs to make
>>>>> gen_server calls to the couch_file process for IO. It's a
>>>>> significant problem in R13B, probably less so in R14B because of
>>>>> some cool optimizations by the OTP team.
>>>>
>>>> So, let me see if I get it. The couch_db_updater process is slow to
>>>> pick up the results of its calls to the couch_file process because
>>>> its mailbox is full of update_docs messages?
>>>
>>> Correct. Each call to the couch_file requires a selective receive on
>>> the part of the db_updater in order to get the response, and prior
>>> to R14 that selective receive needed to match against every message
>>> in the mailbox. It's really a bigger problem in couch_server, which
>>> uses a gen_server call to increment a reference counter before
>>> handing the #db{} to the client, since every request to any DB has
>>> to talk to couch_server first. Best,
>>>
>>> Adam
>>
>> Adam,
>> I think the problem is made worse by a backed-up db_updater, but the
>> db_updater becomes backed up because it makes more synchronous calls
>> to the couch_file than a reader does, handling only one update
>> operation at a time while readers queue up on the couch_file in
>> parallel.
>>
>> Filipe,
>> Using a separate fd for writes at the couch_file level is not the
>> answer. The db_updater has to read the btree before it can write,
>> incurring multiple trips through the couch_file message queue between
>> queuing append_term requests and processing its own message queue for
>> new updates. Using two file descriptors keeps the readers out of the
>> way of the writers only if you select which fd to use at the
>> db-operation level and not the file-operation level. Perhaps two
>> couch_file processes would be better. Fairness should be left to the
>> operating system I/O scheduler once reads don't have to queue behind
>> writes. This seems like the best way forward to me right now. Let's
>> try to crunch some numbers on it soon.
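To make the db-operation-level split concrete, something like this is
what I have in mind (a sketch only; the #db{} fields and function names
are invented, and I'm assuming couch_file's pread_term/append_term
interface):

    %% Hypothetical #db{} carrying two couch_file pids instead of one.
    -record(db, {reader_fd, updater_fd}).

    %% Interactive reads go to the reader couch_file...
    read_doc_body(#db{reader_fd = Fd}, Pos) ->
        couch_file:pread_term(Fd, Pos).

    %% ...while the db_updater does *all* of its IO, btree reads
    %% included, on the other one, so its lookups never wait behind a
    %% queue of reader requests.
    updater_read(#db{updater_fd = Fd}, Pos) ->
        couch_file:pread_term(Fd, Pos).

    updater_append(#db{updater_fd = Fd}, Term) ->
        couch_file:append_term(Fd, Term).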
>>
>> I couldn't find a solution I liked that was fair to readers and
>> writers under any workload with only one file descriptor. The btree
>> cache alleviates this problem a bit, because the read path becomes
>> much faster, which improves both database reads and writes.
>>
>> As to the patch, I think we need the readers and writers separated
>> into two separate couch_files. That way the updater can perform its
>> reads on the "writer" fd; otherwise writers suffer starvation,
>> because readers go directly into the couch_file queue in parallel
>> instead of serializing through something like the db_updater.
>>
>
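As an aside, the selective-receive cost Adam mentions above is easy to
see if you strip gen_server:call down to its essentials (this is a
simplification for illustration; the real implementation uses a monitor
reference and a timeout):

    %% Simplified shape of a gen_server call.
    call(Pid, Request) ->
        Ref = make_ref(),
        Pid ! {'$gen_call', {self(), Ref}, Request},
        receive
            {Ref, Reply} ->
                %% With N unrelated update_docs messages already
                %% queued, this receive scans all N of them before
                %% R14; R14 can skip messages that arrived before
                %% the make_ref() above.
                Reply
        end.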
Wasn't there a branch or patch somewhere that just removed the
ref_counter code entirely and used monitors/links to make sure
everything behaved correctly? I'm not sure I ever saw it, so I can't
say how dramatic and/or scary it was, but it might be another approach
to consider.
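In case it helps the discussion, the monitor-based idea would look
something like this (a minimal sketch, not the branch in question; all
names are invented): the process that owns the resource monitors each
client instead of counting references, and stops when the last monitor
fires.

    %% Minimal sketch of ref-counting via monitors.
    -module(monitor_refs_sketch).
    -behaviour(gen_server).
    -export([start_link/0, attach/2]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    -record(state, {refs = []}).

    start_link() ->
        gen_server:start_link(?MODULE, [], []).

    %% A client registers interest; the server monitors it rather
    %% than incrementing a shared counter.
    attach(Server, ClientPid) ->
        gen_server:call(Server, {attach, ClientPid}).

    init([]) ->
        {ok, #state{}}.

    handle_call({attach, Pid}, _From, #state{refs = Refs} = State) ->
        Ref = erlang:monitor(process, Pid),
        {reply, ok, State#state{refs = [Ref | Refs]}}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    %% The 'DOWN' message replaces the explicit decrement: when the
    %% last monitored client exits, the server stops itself.
    handle_info({'DOWN', Ref, process, _Pid, _Reason},
                #state{refs = Refs} = State) ->
        case lists:delete(Ref, Refs) of
            [] -> {stop, normal, State#state{refs = []}};
            Rest -> {noreply, State#state{refs = Rest}}
        end.

    terminate(_Reason, _State) ->
        ok.

    code_change(_OldVsn, State, _Extra) ->
        {ok, State}.

The nice part is that an abnormal client exit cleans up automatically;
the catch is that clients still register through a gen_server call, so
by itself this doesn't remove the couch_server serialization Adam
mentioned.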
