[
https://issues.apache.org/jira/browse/CASSANDRA-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645845#comment-14645845
]
Benedict commented on CASSANDRA-9669:
-------------------------------------
So, I have a patch available for this
[here|https://github.com/belliottsmith/cassandra/tree/9669-2.0].
I managed to make it less invasive than I had anticipated, but it still
requires an sstable version increment. The patch:
* Introduces a commitLogLowerBound to the memtable, which tracks the commit log
position at its creation
* Changes sstable metadata's "replayPosition" into "commitLogLowerBound" and
"commitLogUpperBound" in the new sstable version
* Delays exposing a new sstable to the compaction strategy until all of its
preceding flushes have completed
* On compaction, extends the new sstable's lower/upper bounds to the min/max of
all sstables we're replacing. Given the third point above (delayed exposure to
the compaction strategy), we only extend over ranges that are known to already
be covered by other sstables.
* On replay, we take any range covered by an sstable to not need replay (and
any range prior to the earliest known safe range is also ignored)
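As a rough illustration of the last two points, here is a minimal Python sketch of the bounds bookkeeping. It is not the patch's code: the names (SSTableMeta, compacted_bounds, needs_replay, needs_replay_legacy) are hypothetical, commit log positions are simplified to plain integers, and each sstable is assumed to cover the half-open range [lower, upper).

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SSTableMeta:
    # Hypothetical stand-in for the new sstable metadata fields.
    commit_log_lower_bound: int
    commit_log_upper_bound: int


def compacted_bounds(inputs: Iterable[SSTableMeta]) -> SSTableMeta:
    """Bounds for a compaction result: min/max over the sstables it replaces."""
    inputs = list(inputs)
    return SSTableMeta(
        min(s.commit_log_lower_bound for s in inputs),
        max(s.commit_log_upper_bound for s in inputs),
    )


def merged_intervals(sstables: Iterable[SSTableMeta]) -> List[Tuple[int, int]]:
    """Sort the covered ranges and merge those that touch or overlap."""
    spans = sorted(
        (s.commit_log_lower_bound, s.commit_log_upper_bound) for s in sstables
    )
    merged: List[Tuple[int, int]] = []
    for lo, hi in spans:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def needs_replay(position: int, sstables: Iterable[SSTableMeta]) -> bool:
    """Replay a record unless an sstable covers it, or it precedes the
    earliest known covered range."""
    intervals = merged_intervals(sstables)
    if not intervals:
        return True  # nothing flushed: replay everything
    if position < intervals[0][0]:
        return False  # prior to the earliest known safe range
    return not any(lo <= position < hi for lo, hi in intervals)


def needs_replay_legacy(position: int, sstables: Iterable[SSTableMeta]) -> bool:
    """Behaviour described in the issue: ignore anything before the maximum
    replay position of any sstable on disk."""
    uppers = [s.commit_log_upper_bound for s in sstables]
    return not uppers or position >= max(uppers)


# An older sstable covers [0, 50). The flush for [50, 100) has not finished
# when the node crashes, but the later, smaller flush for [100, 120) has.
on_disk = [SSTableMeta(0, 50), SSTableMeta(100, 120)]
print(needs_replay(75, on_disk))         # record in the gap is replayed
print(needs_replay_legacy(75, on_disk))  # record in the gap is skipped (the bug)
```

In this model, compacting the two sstables in on_disk would produce bounds (0, 120) and hide the gap, which is why an sstable is not exposed to the compaction strategy until all of its preceding flushes have completed.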
Test Engineering: there are failures on dtests, but I cannot tell if these are
new or existing. Mostly they look like flaky tests. The one that looks most
worrisome to me is the counter upgrade test, but could you take a look and tell
me what you think of the test situation in general? Modifying 2.0 makes me
uncomfortable.
> If sstable flushes complete out of order, on restart we can fail to replay
> necessary commit log records
> -------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-9669
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9669
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Benedict
> Assignee: Benedict
> Priority: Critical
> Labels: correctness
> Fix For: 3.x, 2.1.x, 2.2.x, 3.0.x
>
>
> While {{postFlushExecutor}} ensures it never expires CL entries out-of-order,
> on restart we simply take the maximum replay position of any sstable on disk,
> and ignore anything prior.
> It is quite possible for there to be two flushes triggered for a given table,
> and for the second to finish first by virtue of containing a much smaller
> quantity of live data (or perhaps the disk is just under less pressure). If
> we crash before the first sstable has been written, then on restart the data
> it would have represented will disappear, since we will not replay the CL
> records.
> This looks to be a bug present since time immemorial, and also seems pretty
> serious.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)