Whoops. Hit send too early, but I think I got everything in there that I wanted to say.
As for the ref counter bottleneck, I just pushed to
https://github.com/tilgovi/couchdb/tree/ets_ref_count

This branch uses a public ets table for the ref_counter. I think I managed to linearize the updates over the {total, RefCtr} keys in the ets table such that there should be no race conditions, but please, please take a look at this if you have time. It seems to pass the ref_counter tests, but I still need to handle giving away ownership of the ets table. Right now I use couch_server as the heir so I can use only one ETS table for all couch_ref_counter processes, but the couch_server just crashes if it actually receives the 'ETS-TRANSFER' message. If I can't find an easy way to hand the table to another couch_ref_counter whenever the owner exits, I may just break the encapsulation of the module a bit by leaving couch_server as the owner and ignoring that message.

I've appended two rough sketches below the quoted thread: one of the shared ETS table idea, and one of what I mean by picking the fd at the db-operation level (from my previous mail, quoted below).

Thanks, guys. My gut says we're going to get some nice numbers when all this is done.

-Randall

On Mon, Nov 8, 2010 at 11:56, Randall Leeds <[email protected]> wrote:
> Thanks to both of you for getting this conversation going again and for the work on the patch and testing, Filipe.
>
> On Sun, Nov 7, 2010 at 12:49, Adam Kocoloski <[email protected]> wrote:
>> On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
>>
>>> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <[email protected]> wrote:
>>>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>>>
>>>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <[email protected]> wrote:
>>>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>>>
>>>>>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat 1' command) that when running a writes-only test with relaximation (200 write processes), disk write activity is not continuous. Without this patch, there's continuous (every 1 second) write activity.
>>>>>>
>>>>>> I'm confused by this statement. You must be talking about relaximation runs with delayed_commits = true, right? Why do you think you see larger intervals between write activity with the optimization from COUCHDB-767? Have you measured the time it takes to open the extra FD? In my tests that was a sub-millisecond operation, but maybe you've uncovered something else.
>>>>>
>>>>> No, it happens for tests with delayed_commits = false. The only possible explanation I see for the variance might be related to the Erlang VM scheduler decisions about when to start/run that process. Nevertheless, I don't know the exact cause, but the fsync run frequency varies a lot.
>>>>
>>>> I think it's worth investigating. I couldn't reproduce it on my plain-old spinning-disk MacBook with 200 writers in relaximation; the IOPS reported by iostat stayed very uniform.
>>>>
>>>>>>> For the goal of not having readers getting blocked by fsync calls (and write calls), I would propose using a separate couch_file process just for read operations. I have a branch in my github for this (with COUCHDB-767 reverted).
>>>>>>> It needs to be polished, but the relaximation tests are very positive; both reads and writes get better response times and throughput:
>>>>>>>
>>>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>>>
>>>>>> I'd like to propose an alternative optimization, which is to keep a dedicated file descriptor open in the couch_db_updater process and use that file descriptor for _all_ IO initiated by the db_updater. The advantage is that the db_updater does not need to do any message passing for disk IO, and thus does not slow down when the incoming message queue is large. A message queue much, much larger than the number of concurrent writers can occur if a user writes with batch=ok, and it can also happen rather easily in a BigCouch cluster.
>>>>>
>>>>> I don't see how that will improve things, since all write operations will still be done in a serialized manner. Since only couch_db_updater writes to the DB file, and since access to the couch_db_updater is serialized, to me it only seems that your solution avoids one level of indirection (the couch_file process). I don't see how, when using a couch_file only for writes, you get the message queue for that couch_file process full of write messages.
>>>>
>>>> It's the db_updater which gets a large message queue, not the couch_file. The db_updater ends up with a big backlog of update_docs messages that get in the way when it needs to make gen_server calls to the couch_file process for IO. It's a significant problem in R13B, probably less so in R14B because of some cool optimizations by the OTP team.
>>>
>>> So, let me see if I get it. The couch_db_updater process is slow picking up the results of the calls to the couch_file process because its mailbox is full of update_docs messages?
>>
>> Correct. Each call to the couch_file requires a selective receive on the part of the db_updater in order to get the response, and prior to R14 that selective receive needed to match against every message in the mailbox. It's really a bigger problem in couch_server, which uses a gen_server call to increment a reference counter before handing the #db{} to the client, since every request to any DB has to talk to couch_server first. Best,
>>
>> Adam
>
> Adam,
> I think the problem is made worse by a backed-up db_updater, but the db_updater becomes backed up because it makes more synchronous calls to the couch_file than a reader does, handling only one update operation at a time while readers queue up on the couch_file in parallel.
>
> Filipe,
> Using a separate fd for writes at the couch_file level is not the answer. The db_updater has to read the btree before it can write, incurring multiple trips through the couch_file message queue between queuing append_term requests and processing its message queue for new updates. Using two file descriptors keeps the readers out of the way of the writers only if you select which fd to use at the db-operation level and not the file-operation level. Perhaps two couch_file processes is better. Fairness should be left to the operating system I/O scheduler once reads don't queue up behind writes. This seems like the best way forward to me right now. Let's try to crunch some numbers on it soon.
>
> I couldn't find a solution I liked that was fair to readers and writers at any workload with only one file descriptor.
> The btree cache alleviates this problem a bit because the read path becomes much faster and therefore improves database reads and writes.
>
> As to the patch, I'd think we need the readers and writers separated into two separate couch_files. That way the updater can perform its reads on the "writer" fd; otherwise writers suffer starvation because readers go directly into the couch_file queue in parallel instead of serializing through something like db_updater.
>
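Here's the first sketch I mentioned up top: the shared, public ETS table for ref counting. It's illustrative only; the module, table, and function names are made up rather than what the ets_ref_count branch actually uses, and the small window between insert_new and update_counter is glossed over.

%% Sketch only: names here are illustrative, not the branch's real code.
-module(ref_count_sketch).
-export([create_table/1, add_ref/1, drop_ref/1]).

-define(TABLE, couch_ref_counters).

%% A long-lived process (couch_server in my branch) creates one shared,
%% public table and names itself as heir, so if a temporary owner dies
%% the table is handed back via an 'ETS-TRANSFER' message instead of
%% being destroyed.
create_table(HeirPid) ->
    ets:new(?TABLE, [set, public, named_table,
                     {heir, HeirPid, ref_counter_table}]).

%% ets:update_counter/3 is atomic per key, so concurrent increments on
%% the same {total, RefCtr} key are linearized by ETS itself.
add_ref(RefCtr) ->
    case ets:insert_new(?TABLE, {{total, RefCtr}, 1}) of
        true  -> 1;
        false -> ets:update_counter(?TABLE, {total, RefCtr}, {2, 1})
    end.

%% Returns the new count; cleaning up rows that reach zero is left to
%% the owning couch_ref_counter process in the real code.
drop_ref(RefCtr) ->
    ets:update_counter(?TABLE, {total, RefCtr}, {2, -1}).

Handling that 'ETS-TRANSFER' message cleanly in couch_server is the part I still need to sort out.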
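And the second sketch: what I mean by selecting the fd at the db-operation level rather than the file-operation level. The #db_fds{} record and wrapper functions are made up for illustration; only the couch_file calls themselves are the existing API.

%% Sketch only: this record and these wrappers exist to show where the
%% fd choice happens, not as actual patch code.
-record(db_fds, {
    reader_fd,   % couch_file pid shared by concurrent readers
    updater_fd   % couch_file pid reserved for couch_db_updater
}).

%% Reader-path lookups always use the reader fd, so they never sit in a
%% queue behind append_term/sync work from the updater.
read_doc(#db_fds{reader_fd = Fd}, Ptr) ->
    couch_file:pread_term(Fd, Ptr).

%% The updater does all of its I/O on its own fd -- including the btree
%% reads it needs before it can write -- so it never waits behind a pile
%% of parallel reader requests either.
updater_read(#db_fds{updater_fd = Fd}, Ptr) ->
    couch_file:pread_term(Fd, Ptr).

updater_append(#db_fds{updater_fd = Fd}, Term) ->
    couch_file:append_term(Fd, Term).

updater_sync(#db_fds{updater_fd = Fd}) ->
    couch_file:sync(Fd).

With that split, fairness between the two streams really is left to the OS I/O scheduler, which is the property I was after.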
