[ 
https://issues.apache.org/jira/browse/CASSANDRA-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034773#comment-13034773
 ] 

Jonathan Ellis commented on CASSANDRA-2660:
-------------------------------------------

bq. What are people's thoughts on creating separate abstractions for streaming 
I/O that can perhaps be a lot more simple, and use BRAF only for random reads 
in response to live traffic? (Not as part of this JIRA, just asking in general.)

Every time I've looked at doing this I've put it aside because making all 
writes two-pass (first pass to compute size, so we don't have to seek back 
after serializing the row itself) is such a pain.

> BRAF.sync() bug can cause massive commit log write magnification
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-2660
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2660
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>         Attachments: CASSANDRA-2660-075.txt
>
>
> This was discovered, fixed and tested on 0.7.5. Cursory examination shows it 
> should still be an issue on trunk/0.8. If people otherwise agree with the 
> patch I can rebase if necessary.
> Problem:
> BRAF.flush() is actually broken in the sense that it cannot be called without 
> close co-operation with the caller. rebuffer() does the co-op by adjusting 
> bufferOffset and validateBufferBytes appropriately, by sync() doesn't. This 
> means sync() is broken, and sync() is used by the commit log.
> The attached patch moves the bufferOffset/validateBufferBytes handling out 
> into resetBuffer() and has both flush() and rebuffer() call that. This makes 
> sync() safe.
> What happened was that for batched commit log mode, every time sync() was 
> called the data buffered so far would get written to the OS and fsync():ed. 
> But until rebuffer() is called for other reasons as part of the write path, 
> all subsequent sync():s would result in the very same data (plus whatever was 
> written since last time) being re-written and fsync():ed again. So first you 
> write+fsync N bytes, then N+N1, then N+N1+N2... (each N being a batch), until 
> at some point you trigger a rebuffer() and it starts all over again.
> The result is that you see *a lot* more writes to the commit log than are in 
> fact written to the BRAF. And these writes translate into actual real writes 
> to the underlying storage device due to fsync(). We had crazy numbers where 
> we saw spikes upwards of 80 mb/second where the actual throughput was more 
> like ~ 1 mb second of data to the commit log.
> (One can make a possibly weak argument that it is also functionally incorrect 
> as I can imagine implementations where re-writing the same blocks does 
> copy-on-write in such a way that you're not necessarily guaranteed to see 
> before-or-after data, particularly in case of partial page writes. However 
> that's probably not a practical issue.)
> Worthy of noting is that this probably causes added difficulties in fsync() 
> latencies since the average fsync() will contain a lot more data. Depending 
> on I/O scheduler and underlying device characteristics, the extra writes 
> *may* not have a detrimental effect, but I think it's pretty easy to point to 
> cases where it will be detrimental - in particular if the commit log is on a 
> non-battery backed drive. Even with a nice battery backed RAID with the 
> commit log on, the size of the writes probably contributes to difficulty in 
> making the write requests propagate down without being starved by reads (but 
> this is speculation, not tested, other than that I've observed commit log 
> writer starvation that seemed excessive).
> This isn't the first subtle BRAF bug. What are people's thoughts on creating 
> separate abstractions for streaming I/O that can perhaps be a lot more 
> simple, and use BRAF only for random reads in response to live traffic? (Not 
> as part of this JIRA, just asking in general.)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to