On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <[email protected]> wrote:
> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>
>> Hi,
>>
>> Regarding the change introduced by the ticket:
>>
>> https://issues.apache.org/jira/browse/COUCHDB-767
>>
>> (opening the same file in a different process and calling fsync on
>> the new file descriptor through the new process)
>>
>> I found out that it's not a recommended practice. I posted the
>> following question to the ext4 development mailing list:
>>
>> http://www.spinics.net/lists/linux-ext4/msg21388.html
>
> Reading that thread it seems quite unlikely that we're getting into any 
> trouble with this patch.  But it does seem like we should be considering 
> alternatives.

Right, but unlikely does not mean never, unfortunately.
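
For anyone following along, the pattern COUCHDB-767 introduced is
roughly this (a minimal sketch with illustrative names, not the
actual couch_file.erl code):

    %% Spawn a helper process that opens its own descriptor for the
    %% same file and fsyncs through it, so the caller never blocks on
    %% the fsync.
    sync_in_other_process(FilePath) ->
        spawn(fun() ->
            {ok, Fd} = file:open(FilePath, [read, raw, binary]),
            ok = file:sync(Fd),    % fsync(2) on the new descriptor
            ok = file:close(Fd)
        end).

Whether an fsync on that second descriptor is guaranteed to flush
writes made through the original one is exactly what the ext4 thread
above puts in doubt.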

>
>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>> 1' command) that when running a writes-only test with relaximation
>> (200 write processes), disk write activity is not continuous. Without
>> this patch, there's continuous (every 1 second) write activity.
>
> I'm confused by this statement. You must be talking about relaximation runs 
> with delayed_commits = true, right?  Why do you think you see larger 
> intervals between write activity with the optimization from COUCHDB-767?  
> Have you measured the time it takes to open the extra FD?  In my tests that 
> was a sub-millisecond operation, but maybe you've uncovered something else.

No, it happens for tests with delayed_commits = false. The only
explanation I can see for the variance is the Erlang VM scheduler's
decisions about when to start/run that process. I don't know the
exact cause, but the fsync frequency varies a lot.

>
>> This also makes performance comparison tests with relaximation much harder
>> to analyse, as the peak variation is much higher and not periodic.
>
> Still confused.  I'm guessing the increased variance only shows up on the 
> writes graph - reads certainly ought to have decreased variance because the 
> fsync occurs out of band.  Is the issue that the fsync takes longer when it 
> requires a new FD, or am I reading this all wrong?

Yes, for writes.

>
> <rant>Any variance-related statements based on relaximation results are 
> purely qualitative, since relaximation does not report measurement variance.  
> Similarly, any claims about relative improvements in response times have an 
> unknown statistical significance.</rant>

Right. I didn't mean to say that relaximation is always right or that
we should be blindly guided by its reports.

>
>> For the goal of not having readers blocked by fsync calls (and
>> write calls), I would propose using a separate couch_file process just
>> for read operations. I have a branch on my GitHub for this (with
>> COUCHDB-767 reverted). It needs to be polished, but the relaximation
>> tests are very positive; both reads and writes get better response
>> times and throughput:
>>
>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>
> I'd like to propose an alternative optimization, which is to keep a dedicated 
> file descriptor open in the couch_db_updater process and use that file 
> descriptor for _all_ IO initiated by the db_updater.  The advantage is that 
> the db_updater does not need to do any message passing for disk IO, and thus 
> does not slow down when the incoming message queue is large.  A message queue 
> much much larger than the number of concurrent writers can occur if a user 
> writes with batch=ok, and it can also happen rather easily in a BigCouch 
> cluster.

I don't see how that will improve things, since all write operations
will still be done in a serialized manner. Since only couch_db_updater
writes to the DB file, and since access to the couch_db_updater is
serialized, it seems to me that your solution only avoids one level
of indirection (the couch_file process). I don't see how, when using a
couch_file only for writes, you get the message queue of that
couch_file process full of write messages.
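
Just to make sure we're talking about the same thing, here's the
difference between the two approaches as I understand it, in sketch
form (names are mine, not actual code):

    %% Today the db_updater writes through the couch_file gen_server,
    %% so every append is a call/reply round trip:
    %%
    %%     {ok, Pos} = couch_file:append_binary(Fd, Bin)
    %%
    %% Your proposal, as I read it: keep a private raw fd in the
    %% db_updater state and write to it directly, no message passing.
    -record(state, {raw_fd, eof = 0}).

    append_directly(#state{raw_fd = RawFd, eof = Eof} = State, Bin) ->
        ok = file:pwrite(RawFd, Eof, Bin),
        {ok, Eof, State#state{eof = Eof + iolist_size(Bin)}}.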

Also, what I did on that branch is a bit more generic, as it works for
view index files as well and doesn't introduce significant changes
outside of couch_file.erl. Of course, your solution could easily be
extended to the view updater process as well; I don't have anything
against it.
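
For concreteness, the split on my branch boils down to something like
this (simplified; the real changes live in couch_file.erl on the
branch):

    %% Two couch_file processes over the same underlying file: one
    %% serves only reads, the other only writes and fsyncs, so readers
    %% never sit in a mailbox behind a slow fsync.
    open_pair(FilePath, Options) ->
        {ok, WriterFd} = couch_file:open(FilePath, Options),
        {ok, ReaderFd} = couch_file:open(FilePath, Options),
        {ok, {WriterFd, ReaderFd}}.

    pread({_W, ReaderFd}, Pos)  -> couch_file:pread_binary(ReaderFd, Pos).
    append({WriterFd, _R}, Bin) -> couch_file:append_binary(WriterFd, Bin).
    sync({WriterFd, _R})        -> couch_file:sync(WriterFd).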

Anyway, +1.

>
>> http://graphs.mikeal.couchone.com/#/graph/62b286fbb7aa55a4b0c4cc913c00e659
>>  (relaximation test)
>>
>> The test does a direct comparison with trunk (also with COUCHDB-767
>> reverted) and was run like this:
>>
>> $ node tests/compare_write_and_read.js --wclients 300 --rclients 150 \
>>    -name1 2_couch_files_no_batch_reads -name2 trunk \
>>    -url1 http://localhost:5984/ -url2 http://localhost:5985/ \
>>    --duration 120
>
> Is the graph you posted a comparison with trunk, or with trunk minus 
> COUCHDB-767?  I couldn't parse your statement.

With trunk minus COUCHDB-767.
>
> Side note: I linked to a relaximation test in COUCHDB-767, but that link (to 
> mikeal.couchone.com) went dead. Is the graph still available somewhere?

No idea.

>
> That patch looks like it probably improves the requests per second for both 
> reads and writes, as it should if you have a multicore server and/or you've 
> turned on the async thread pool.  Pretty difficult to say if the change is 
> statistically significant, though.

Well, yes. But at least write operations (started, e.g., by
couch_db_updater) no longer wait for read operations (issued, e.g., by
couch_db.erl) to be processed, and vice versa. This was tested with
the +A flag set to 4 (the default on trunk) and set to 16. In both
cases read and write performance was better (according to the
relaximation tests).
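
(For reference, that's just the emulator's async thread pool size,
set when the VM starts, e.g.:

    $ erl +A 16 ...

or via ERL_FLAGS="+A 16" in the environment the couchdb script runs
under.)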

>
>> This approach, of using a file descriptor just for reads and another
>> one just for writes (but both referring to the same file), seems to be
>> safe:
>>
>> http://www.spinics.net/lists/linux-ext4/msg21429.html
>>
>> Race conditions shouldn't happen, as each read call needs an offset,
>> and the offset is only known after a previous write call has finished.
>
> I'm definitely not worried about race conditions, our DB updating logic seems 
> to prevent that several times over.
>
>> Thoughts on this? Should we revert COUCHDB-767? Integrate the 2
>> couch_files strategy into trunk? (only after 1.1 has its own branch)
>
> I'm against reverting COUCHDB-767, but I guess it deserves more research to 
> see if any of our supported platforms actually implement an fsync() that does 
> not sync writes from another file descriptor for the same file.  If we do 
> find a platform like that we should certainly revert.

The fact that the behaviour is not guaranteed by POSIX or the Single
Unix Specification worries me (not to mention Windows). POSIX only
describes what fsync() does for the descriptor it is given; whether
that flushes writes made through another descriptor of the same file
is left to the implementation.

>
> I'm probably in favor of a two file descriptors per DB approach, but I think 
> attaching the second fd directly to the db_updater will be the winning 
> strategy.  Best,
>
> Adam
>
>>
>> cheers
>>
>
>



-- 
Filipe David Manana,
[email protected], [email protected]

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
