On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <[email protected]> wrote:
>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>
>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <[email protected]> wrote:
>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>
>>>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat 1'
>>>>> command) that when running a writes-only test with relaximation
>>>>> (200 write processes), disk write activity is not continuous. Without
>>>>> this patch, there's continuous (every 1 second) write activity.
>>>>
>>>> I'm confused by this statement. You must be talking about relaximation
>>>> runs with delayed_commits = true, right? Why do you think you see larger
>>>> intervals between write activity with the optimization from COUCHDB-767?
>>>> Have you measured the time it takes to open the extra FD? In my tests
>>>> that was a sub-millisecond operation, but maybe you've uncovered something
>>>> else.
>>>
>>> No, it happens for tests with delayed_commits = false. The only
>>> explanation I can think of for the variance is the Erlang VM
>>> scheduler's decisions about when to run that process. I don't know
>>> the exact cause, but the fsync frequency varies a lot.
>>
>> I think it's worth investigating. I couldn't reproduce it on my plain-old
>> spinning disk MacBook with 200 writers in relaximation; the IOPS reported by
>> iostat stayed very uniform.
>>
>>>>> To keep readers from getting blocked by fsync (and write) calls, I
>>>>> would propose using a separate couch_file process just for read
>>>>> operations. I have a branch on my GitHub for this (with COUCHDB-767
>>>>> reverted). It needs to be polished, but the relaximation tests are
>>>>> very positive: both reads and writes get better response times and
>>>>> throughput:
>>>>>
>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>
>>>> I'd like to propose an alternative optimization, which is to keep a
>>>> dedicated file descriptor open in the couch_db_updater process and use
>>>> that file descriptor for _all_ IO initiated by the db_updater. The
>>>> advantage is that the db_updater does not need to do any message passing
>>>> for disk IO, and thus does not slow down when the incoming message queue
>>>> is large. A message queue much much larger than the number of concurrent
>>>> writers can occur if a user writes with batch=ok, and it can also happen
>>>> rather easily in a BigCouch cluster.
>>>
>>> I don't see how that will improve things, since all write operations
>>> will still be done in a serialized manner. Since only couch_db_updater
>>> writes to the DB file, and since access to the couch_db_updater is
>>> serialized, it seems to me that your solution only avoids one level
>>> of indirection (the couch_file process). I also don't see how, when a
>>> couch_file is used only for writes, its message queue ends up full of
>>> write messages.
>>
>> It's the db_updater which gets a large message queue, not the couch_file.
>> The db_updater ends up with a big backlog of update_docs messages that get
>> in the way when it needs to make gen_server calls to the couch_file process
>> for IO. It's a significant problem in R13B, probably less so in R14B
>> because of some cool optimizations by the OTP team.
>
> So, let me see if I get it. The couch_db_updater process is slow to
> pick up the results of its calls to the couch_file process because its
> mailbox is full of update_docs messages?
Correct. Each call to the couch_file process requires a selective receive on
the part of the db_updater in order to get the response, and prior to R14
that selective receive needed to match against every message in the mailbox.
It's really a bigger problem in couch_server, which uses a gen_server call to
increment a reference counter before handing the #db{} to the client, since
every request to any DB has to talk to couch_server first.
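
To make the selective receive cost concrete, here's a rough, stand-alone
sketch (made-up module, not CouchDB code): a process that already has N
unrelated messages queued, like a db_updater sitting on a backlog of
update_docs requests, makes a call-style request to another process and
then waits for the tagged reply. Prior to R14 the receive has to inspect
every one of those N queued messages before it finds that reply.

-module(mbox_demo).
-export([run/1]).

%% Not CouchDB code -- only illustrates the cost of a selective receive
%% when the caller's mailbox already holds a large backlog.
run(N) ->
    %% Stand-in for couch_file: answers a single read request and exits.
    File = spawn(fun() ->
                     receive
                         {pread, ReplyTo, ReqRef} -> ReplyTo ! {ReqRef, done}
                     end
                 end),
    %% Flood our own mailbox, like a backlog of update_docs requests.
    lists:foreach(fun(I) -> self() ! {update_docs, I} end, lists:seq(1, N)),
    Ref = make_ref(),
    File ! {pread, self(), Ref},
    T0 = erlang:now(),
    %% Selective receive: before R14 this scans past all N queued messages;
    %% R14 can often skip everything queued before the make_ref/0 above.
    receive {Ref, done} -> ok end,
    io:format("reply matched after ~p us with ~p queued messages~n",
              [timer:now_diff(erlang:now(), T0), N]).

Compile it and try e.g. mbox_demo:run(200000) from a shell. The update_docs
messages are left in the caller's mailbox afterwards; the point is only to
show how the backlog slows down the call.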
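
And to sketch the dedicated file descriptor idea from earlier in the thread
(names are illustrative, this is not the actual couch_db_updater): the
updater opens its own fd when the db is opened and performs its writes and
commits on it directly, so the write path never makes a gen_server call to
couch_file and therefore never pays for a selective receive over its
backlog. Readers would keep going through the couch_file process as before.

%% Illustrative sketch only, not the real couch_db_updater.
-module(updater_fd_sketch).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-record(state, {fd}).

start_link(FilePath) ->
    gen_server:start_link(?MODULE, FilePath, []).

init(FilePath) ->
    %% Private, append-only descriptor owned by the updater itself.
    {ok, Fd} = file:open(FilePath, [append, raw, binary]),
    {ok, #state{fd = Fd}}.

handle_call({update_docs, IoData}, _From, #state{fd = Fd} = State) ->
    ok = file:write(Fd, IoData),   %% direct write, no message passing
    ok = file:sync(Fd),            %% commit, again without calling couch_file
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.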
Best,
Adam