[ 
https://issues.apache.org/jira/browse/CASSANDRA-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Schuller reassigned CASSANDRA-2660:
-----------------------------------------

    Assignee: Peter Schuller

> BRAF.sync() bug can cause massive commit log write magnification
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-2660
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2660
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>         Attachments: CASSANDRA-2660-075.txt
>
>
> This was discovered, fixed and tested on 0.7.5. Cursory examination shows it 
> should still be an issue on trunk/0.8. If people otherwise agree with the 
> patch I can rebase if necessary.
> Problem:
> BRAF.flush() is actually broken in the sense that it cannot be called without 
> close co-operation with the caller. rebuffer() does the co-op by adjusting 
> bufferOffset and validateBufferBytes appropriately, by sync() doesn't. This 
> means sync() is broken, and sync() is used by the commit log.
> The attached patch moves the bufferOffset/validateBufferBytes handling out 
> into resetBuffer() and has both flush() and rebuffer() call that. This makes 
> sync() safe.
> What happened was that for batched commit log mode, every time sync() was 
> called the data buffered so far would get written to the OS and fsync():ed. 
> But until rebuffer() is called for other reasons as part of the write path, 
> all subsequent sync():s would result in the very same data (plus whatever was 
> written since last time) being re-written and fsync():ed again. So first you 
> write+fsync N bytes, then N+N1, then N+N1+N2... (each N being a batch), until 
> at some point you trigger a rebuffer() and it starts all over again.
> The result is that you see *a lot* more writes to the commit log than are in 
> fact written to the BRAF. And these writes translate into actual real writes 
> to the underlying storage device due to fsync(). We had crazy numbers where 
> we saw spikes upwards of 80 mb/second where the actual throughput was more 
> like ~ 1 mb second of data to the commit log.
> (One can make a possibly weak argument that it is also functionally incorrect 
> as I can imagine implementations where re-writing the same blocks does 
> copy-on-write in such a way that you're not necessarily guaranteed to see 
> before-or-after data, particularly in case of partial page writes. However 
> that's probably not a practical issue.)
> Worthy of noting is that this probably causes added difficulties in fsync() 
> latencies since the average fsync() will contain a lot more data. Depending 
> on I/O scheduler and underlying device characteristics, the extra writes 
> *may* not have a detrimental effect, but I think it's pretty easy to point to 
> cases where it will be detrimental - in particular if the commit log is on a 
> non-battery backed drive. Even with a nice battery backed RAID with the 
> commit log on, the size of the writes probably contributes to difficulty in 
> making the write requests propagate down without being starved by reads (but 
> this is speculation, not tested, other than that I've observed commit log 
> writer starvation that seemed excessive).
> This isn't the first subtle BRAF bug. What are people's thoughts on creating 
> separate abstractions for streaming I/O that can perhaps be a lot more 
> simple, and use BRAF only for random reads in response to live traffic? (Not 
> as part of this JIRA, just asking in general.)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to