BRAF.sync() bug can cause massive commit log write magnification
----------------------------------------------------------------
Key: CASSANDRA-2660
URL: https://issues.apache.org/jira/browse/CASSANDRA-2660
Project: Cassandra
Issue Type: Bug
Reporter: Peter Schuller
Attachments: CASSANDRA-2660-075.txt
This was discovered, fixed and tested on 0.7.5. Cursory examination shows it
should still be an issue on trunk/0.8. If people otherwise agree with the patch
I can rebase if necessary.
Problem:
BRAF.flush() is actually broken in the sense that it cannot be called without
close co-operation with the caller. rebuffer() does the co-op by adjusting
bufferOffset and validateBufferBytes appropriately, by sync() doesn't. This
means sync() is broken, and sync() is used by the commit log.
The attached patch moves the bufferOffset/validateBufferBytes handling out into
resetBuffer() and has both flush() and rebuffer() call that. This makes sync()
safe.
What happened was that for batched commit log mode, every time sync() was
called the data buffered so far would get written to the OS and fsync():ed. But
until rebuffer() is called for other reasons as part of the write path, all
subsequent sync():s would result in the very same data (plus whatever was
written since last time) being re-written and fsync():ed again. So first you
write+fsync N bytes, then N+N1, then N+N1+N2... (each N being a batch), until
at some point you trigger a rebuffer() and it starts all over again.
The result is that you see *a lot* more writes to the commit log than are in
fact written to the BRAF. And these writes translate into actual real writes to
the underlying storage device due to fsync(). We had crazy numbers where we saw
spikes upwards of 80 mb/second where the actual throughput was more like ~ 1 mb
second of data to the commit log.
(One can make a possibly weak argument that it is also functionally incorrect
as I can imagine implementations where re-writing the same blocks does
copy-on-write in such a way that you're not necessarily guaranteed to see
before-or-after data, particularly in case of partial page writes. However
that's probably not a practical issue.)
Worthy of noting is that this probably causes added difficulties in fsync()
latencies since the average fsync() will contain a lot more data. Depending on
I/O scheduler and underlying device characteristics, the extra writes *may* not
have a detrimental effect, but I think it's pretty easy to point to cases where
it will be detrimental - in particular if the commit log is on a non-battery
backed drive. Even with a nice battery backed RAID with the commit log on, the
size of the writes probably contributes to difficulty in making the write
requests propagate down without being starved by reads (but this is
speculation, not tested, other than that I've observed commit log writer
starvation that seemed excessive).
This isn't the first subtle BRAF bug. What are people's thoughts on creating
separate abstractions for streaming I/O that can perhaps be a lot more simple,
and use BRAF only for random reads in response to live traffic? (Not as part of
this JIRA, just asking in general.)
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira