On Nov 7, 2010, at 3:29 PM, Filipe David Manana wrote:
> On Sun, Nov 7, 2010 at 8:09 PM, Adam Kocoloski <[email protected]> wrote:
>> On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
>>
>>> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <[email protected]> wrote:
>>>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>>>
>>>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat 1'
>>>>> command) that when running a writes-only test with relaximation
>>>>> (200 write processes), disk write activity is not continuous. Without
>>>>> this patch, there's continuous (every 1 second) write activity.
>>>>
>>>> I'm confused by this statement. You must be talking about relaximation
>>>> runs with delayed_commits = true, right? Why do you think you see larger
>>>> intervals between write activity with the optimization from COUCHDB-767?
>>>> Have you measured the time it takes to open the extra FD? In my tests
>>>> that was a sub-millisecond operation, but maybe you've uncovered something
>>>> else.
>>>
>>> No, it happens for tests with delayed_commits = false. The only
>>> explanation I can think of for the variance is the Erlang VM
>>> scheduler's decisions about when to run that process. I don't know
>>> the exact cause, but the fsync frequency varies a lot.
>>
>> I think it's worth investigating. I couldn't reproduce it on my plain-old
>> spinning disk MacBook with 200 writers in relaximation; the IOPS reported by
>> iostat stayed very uniform.
>>
>>>>> To keep readers from getting blocked by fsync (and write) calls, I
>>>>> would propose using a separate couch_file process just for read
>>>>> operations. I have a branch on my GitHub for this (with COUCHDB-767
>>>>> reverted). It needs to be polished, but the relaximation tests are
>>>>> very positive: both reads and writes get better response times and
>>>>> throughput:
>>>>>
>>>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>>>
>>>> I'd like to propose an alternative optimization, which is to keep a
>>>> dedicated file descriptor open in the couch_db_updater process and use
>>>> that file descriptor for _all_ IO initiated by the db_updater. The
>>>> advantage is that the db_updater does not need to do any message passing
>>>> for disk IO, and thus does not slow down when the incoming message queue
>>>> is large. A message queue much much larger than the number of concurrent
>>>> writers can occur if a user writes with batch=ok, and it can also happen
>>>> rather easily in a BigCouch cluster.
>>>
>>> I don't see how that will improve things, since all write operations
>>> will still be done in a serialized manner. Since only couch_db_updater
>>> writes to the DB file, and since access to the couch_db_updater is
>>> serialized, it seems to me that your solution only avoids one level
>>> of indirection (the couch_file process). I also don't see how, when a
>>> couch_file is used only for writes, its message queue ends up full of
>>> write messages.
>>
>> It's the db_updater which gets a large message queue, not the couch_file.
>> The db_updater ends up with a big backlog of update_docs messages that get
>> in the way when it needs to make gen_server calls to the couch_file process
>> for IO. It's a significant problem in R13B, probably less so in R14B
>> because of some cool optimizations by the OTP team.
>
> So, let me see if I get it. The couch_db_updater process is slow to
> pick up the results of its calls to the couch_file process because its
> mailbox is full of update_docs messages?
Correct. Each call to the couch_file process requires a selective receive on
the part of the db_updater in order to get the response, and prior to R14
that selective receive needed to match against every message in the mailbox.
It's really a bigger problem in couch_server, which uses a gen_server call to
increment a reference counter before handing the #db{} to the client, since
every request to any DB has to talk to couch_server first.
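
To make the selective receive cost concrete, here's a rough, stand-alone
sketch (made-up module, not CouchDB code): a process that already has N
unrelated messages queued, like a db_updater sitting on a backlog of
update_docs requests, makes a call-style request to another process and
then waits for the tagged reply. Prior to R14 the receive has to inspect
every one of those N queued messages before it finds that reply.

-module(mbox_demo).
-export([run/1]).

%% Not CouchDB code -- only illustrates the cost of a selective receive
%% when the caller's mailbox already holds a large backlog.
run(N) ->
    %% Stand-in for couch_file: answers a single read request and exits.
    File = spawn(fun() ->
                     receive
                         {pread, ReplyTo, ReqRef} -> ReplyTo ! {ReqRef, done}
                     end
                 end),
    %% Flood our own mailbox, like a backlog of update_docs requests.
    lists:foreach(fun(I) -> self() ! {update_docs, I} end, lists:seq(1, N)),
    Ref = make_ref(),
    File ! {pread, self(), Ref},
    T0 = erlang:now(),
    %% Selective receive: before R14 this scans past all N queued messages;
    %% R14 can often skip everything queued before the make_ref/0 above.
    receive {Ref, done} -> ok end,
    io:format("reply matched after ~p us with ~p queued messages~n",
              [timer:now_diff(erlang:now(), T0), N]).

Compile it and try e.g. mbox_demo:run(200000) from a shell. The update_docs
messages are left in the caller's mailbox afterwards; the point is only to
show how the backlog slows down the call.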
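
And to sketch the dedicated file descriptor idea from earlier in the thread
(names are illustrative, this is not the actual couch_db_updater): the
updater opens its own fd when the db is opened and performs its writes and
commits on it directly, so the write path never makes a gen_server call to
couch_file and therefore never pays for a selective receive over its
backlog. Readers would keep going through the couch_file process as before.

%% Illustrative sketch only, not the real couch_db_updater.
-module(updater_fd_sketch).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-record(state, {fd}).

start_link(FilePath) ->
    gen_server:start_link(?MODULE, FilePath, []).

init(FilePath) ->
    %% Private, append-only descriptor owned by the updater itself.
    {ok, Fd} = file:open(FilePath, [append, raw, binary]),
    {ok, #state{fd = Fd}}.

handle_call({update_docs, IoData}, _From, #state{fd = Fd} = State) ->
    ok = file:write(Fd, IoData),   %% direct write, no message passing
    ok = file:sync(Fd),            %% commit, again without calling couch_file
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.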
Best,
Adam