On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:

> Hi,
> 
> Regarding the change introduced by the ticket:
> 
> https://issues.apache.org/jira/browse/COUCHDB-767
> 
> (opening the same file in a different process and call fsync on the
> new file descriptor through the new process)
> 
> I found out that it's not a recommended practice. I posted the
> following question to the ext4 development mailing list:
> 
> http://www.spinics.net/lists/linux-ext4/msg21388.html

Reading that thread, it seems quite unlikely that we're getting into any trouble 
with this patch.  But it does seem like we should be considering alternatives.
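
For anyone following along, the pattern under discussion boils down to
something like the sketch below (a minimal Erlang sketch, not the actual
couch_file code, and sync_in_new_process is a made-up name): a throwaway
process opens a second descriptor for the same file and fsyncs through it, so
the couch_file process stays free to serve reads while the sync runs.

%% Minimal sketch of the COUCHDB-767 pattern.  The caller still waits
%% for durability, but the fsync happens on a fresh fd in a fresh
%% process, so the couch_file gen_server is never blocked by it.
sync_in_new_process(FilePath) ->
    {Pid, Ref} = spawn_monitor(fun() ->
        {ok, Fd} = file:open(FilePath, [read, raw]),
        ok = file:sync(Fd),
        ok = file:close(Fd)
    end),
    receive
        {'DOWN', Ref, process, Pid, normal} -> ok;
        {'DOWN', Ref, process, Pid, Reason} -> {error, Reason}
    end.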

> Also, with this patch I verified (on Solaris, with the 'zpool iostat
> 1' command) that when running a writes-only test with relaximation
> (200 write processes), disk write activity is not continuous. Without
> this patch, there's continuous (every 1 second) write activity.

I'm confused by this statement. You must be talking about relaximation runs 
with delayed_commits = true, right?  Why do you think you see larger intervals 
between write activity with the optimization from COUCHDB-767?  Have you 
measured the time it takes to open the extra FD?  In my tests that was a 
sub-millisecond operation, but maybe you've uncovered something else.

> This also makes performance comparison tests with relaximation much harder
> to analyse, as the peak variation is much higher and not periodic.

Still confused.  I'm guessing the increased variance only shows up on the 
writes graph - reads certainly ought to have decreased variance because the 
fsync occurs out of band.  Is the issue that the fsync takes longer when it 
requires a new FD, or am I reading this all wrong?

<rant>Any variance-related statements based on relaximation results are purely 
qualitative, since relaximation does not report measurement variance.  
Similarly, any claims about relative improvements in response times have an 
unknown statistical significance.</rant>

> To keep readers from being blocked by fsync calls (and write calls), I
> propose using a separate couch_file process just for read operations. I
> have a branch on my GitHub for this (with
> COUCHDB-767 reverted). It needs to be polished, but the relaximation
> tests are very positive, both reads and writes get better response
> times and throughput:
> 
> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads

I'd like to propose an alternative optimization, which is to keep a dedicated 
file descriptor open in the couch_db_updater process and use that file 
descriptor for _all_ IO initiated by the db_updater.  The advantage is that the 
db_updater does not need to do any message passing for disk IO, and thus does 
not slow down when the incoming message queue is large.  A message queue much, 
much larger than the number of concurrent writers can occur if a user writes 
with batch=ok, and it can also happen rather easily in a BigCouch cluster.
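
Roughly, I have something like this in mind (a hypothetical sketch with
made-up names; the real change would live in couch_db_updater's state):

%% Hypothetical sketch: the updater keeps its own raw fd and performs
%% IO with direct calls instead of gen_server:call-ing couch_file, so
%% its writes never queue behind a large couch_file mailbox.
open_updater_fd(FilePath) ->
    {ok, Fd} = file:open(FilePath, [read, write, raw, binary]),
    Fd.

append_bin(Fd, Eof, Bin) ->
    %% append at the current end-of-file; no message passing involved
    ok = file:pwrite(Fd, Eof, Bin),
    {ok, Eof + byte_size(Bin)}.  %% new end-of-file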

> http://graphs.mikeal.couchone.com/#/graph/62b286fbb7aa55a4b0c4cc913c00e659
>  (relaximation test)
> 
> The test does a direct comparison with trunk (also with COUCHDB-767
> reverted) and was run like this:
> 
> $ node tests/compare_write_and_read.js --wclients 300 --rclients 150 \
>    -name1 2_couch_files_no_batch_reads -name2 trunk \
>    -url1 http://localhost:5984/ -url2 http://localhost:5985/ \
>    --duration 120

Is the graph you posted a comparison with trunk, or with trunk minus 
COUCHDB-767?  I couldn't parse your statement.

Side note: I linked to a relaximation test in COUCHDB-767, but that link (to 
mikeal.couchone.com) went dead. Is the graph still available somewhere?

That patch looks like it probably improves the requests per second for both 
reads and writes, as it should if you have a multicore server and/or you've 
turned on the async thread pool.  Pretty difficult to say if the change is 
statistically significant, though.

> This approach, of using a file descriptor just for reads and another
> one just for writes (but both referring to the same file), seems to be
> safe:
> 
> http://www.spinics.net/lists/linux-ext4/msg21429.html
> 
> Race conditions shouldn't happen, as each read call needs an offset,
> and the offset is only known after the previous write call has finished.

I'm definitely not worried about race conditions; our DB updating logic seems 
to prevent that several times over.
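
To spell out why the append-only design makes the split safe, assuming the
usual POSIX behavior where all descriptors for a file share the same page
cache (write_then_read is a made-up illustration, not real code):

%% The offset handed to readers only exists after pwrite has returned,
%% so a pread through the second descriptor can never outrun the write
%% it depends on.
write_then_read(WriteFd, ReadFd, Eof, Bin) ->
    ok = file:pwrite(WriteFd, Eof, Bin),
    %% only now does any reader learn that Eof is a valid offset
    {ok, Bin} = file:pread(ReadFd, Eof, byte_size(Bin)),
    ok.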
 
> Thoughts on this? Should we revert COUCHDB-767? Integrate the 2
> couch_files strategy into trunk? (only after 1.1 has its own branch)

I'm against reverting COUCHDB-767, but I guess it deserves more research to see 
if any of our supported platforms actually implement an fsync() that does not 
sync writes made through another file descriptor to the same file.  If we do find a 
platform like that we should certainly revert.

I'm probably in favor of a two-file-descriptor-per-DB approach, but I think 
attaching the second fd directly to the db_updater will be the winning 
strategy.  Best,

Adam

> 
> cheers
> 
> -- 
> Filipe David Manana,
> [email protected], [email protected]
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."
