[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss
[ https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972639#comment-16972639 ]

Dimitar Dimitrov commented on CASSANDRA-15368:

Thanks for chasing this down, [~benedict]! I'm glad it turned out that, as initially suspected, you're pretty good at this stuff, and the issue was not lurking from before, but more or less necessitated by the fix for CASSANDRA-15367.
Then I guess it makes the most sense if you continue and take care of this.

> Failing to flush Memtable without terminating process results in permanent
> data loss
>
> Key: CASSANDRA-15368
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15368
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Commit Log, Local/Memtable
> Reporter: Benedict Elliott Smith
> Priority: Normal
> Fix For: 4.0, 2.2.x, 3.0.x, 3.11.x
>
> {{Memtable}} do not contain records that cover a precise contiguous range of
> {{ReplayPosition}}, since there are only weak ordering constraints when
> rolling over to a new {{Memtable}} - the last operations for the old
> {{Memtable}} may obtain their {{ReplayPosition}} after the first operations
> for the new {{Memtable}}.
> Unfortunately, we treat the {{Memtable}} range as contiguous, and invalidate
> the entire range on flush. Ordinarily we only invalidate records when all
> prior {{Memtable}} have also successfully flushed. However, in the event of
> a flush that does not terminate the process (either because of disk failure
> policy, or because it is a software error), the later flush is able to
> invalidate the region of the commit log that includes records that should
> have been flushed in the prior {{Memtable}}
> More problematically, this can also occur on restart without any associated
> flush failure, as we use commit log boundaries written to our flushed
> sstables to filter {{ReplayPosition}} on recovery, which is meant to
> replicate our {{Memtable}} flush behaviour above. However, we do not know
> that earlier flushes have completed, and they may complete successfully
> out-of-order. So any flush that completes before the process terminates, but
> began after another flush that _doesn’t_ complete before the process
> terminates, has the potential to cause permanent data loss.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss
[ https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972280#comment-16972280 ]

Benedict Elliott Smith commented on CASSANDRA-15368:

Hmm. I had convinced myself that this occurred already, and that CASSANDRA-15367 simply exploited the fact that the start of the commit log region owned by a Memtable could actually occur in either Memtable (as opposed to the end, which was contiguous). But now that I attempt to properly construct the scenario, I see that I was wrong, and past me that wrote the bounds logic was better at this stuff. You're right, CASSANDRA-15367 introduces, rather than exploits, this issue.
[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss
[ https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972149#comment-16972149 ]

Branimir Lambov commented on CASSANDRA-15368:

Does this mean that this issue is only an artifact of the fix to CASSANDRA-15367?
[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss
[ https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971743#comment-16971743 ]

Benedict Elliott Smith commented on CASSANDRA-15368:

Hi [~dimitarndimitrov],

I think you have it the wrong way around; in your parlance, we need:
* oldMemtable.accepts() returns false
* oldMemtable.accepts() returns false
* newMemtable.accepts() returns true
* newMemtable.accepts() returns true

If you look at the new documentation introduced in CASSANDRA-15367 [here|https://github.com/belliottsmith/cassandra/commit/ed6adf5eabe62f8ce6a1341e0c5423ba53036197#diff-f0a15c3588b56c5ce53ece7c48e325b5R109], you'll see that there is a region at the start of all memtables where some records from the prior {{group}}, which may have been arbitrarily delayed in obtaining their {{ReplayPosition}}, are intermixed with those of the later group. This region is essentially owned by both memtables, but only the later memtable invalidates the relevant commit log records.

The problem occurs if the earlier flush fails (and we do not terminate the process), _or_ if the process terminates with the later flush having completed (since we will use the start/end {{ReplayPosition}} associated with the sstable to invalidate the commit log in the same way).
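The failure mode described above (a commit log region shared by two memtables, invalidated only by the later one) can be illustrated with a toy model. This is only a sketch under stated assumptions, not Cassandra code: `Interval`, `Entry`, `replayedPositions` and `lostPositions` are hypothetical names, commit log positions are reduced to plain `long` values, and the flushed range is assumed to run contiguously from the new memtable's first position to its last.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReplayLossSketch {
    /** Hypothetical: closed range of commit log positions recorded with a flushed sstable. */
    static final class Interval {
        final long start, end;
        Interval(long start, long end) { this.start = start; this.end = end; }
        boolean covers(long position) { return start <= position && position <= end; }
    }

    /** Hypothetical: one commit log record and the memtable that holds its data. */
    static final class Entry {
        final long position;
        final String memtable;
        Entry(long position, String memtable) { this.position = position; this.memtable = memtable; }
    }

    /** Positions that replay would apply: those not covered by any flushed interval. */
    static List<Long> replayedPositions(List<Entry> commitLog, List<Interval> flushed) {
        List<Long> replayed = new ArrayList<>();
        for (Entry e : commitLog) {
            boolean covered = false;
            for (Interval i : flushed) covered |= i.covers(e.position);
            if (!covered) replayed.add(e.position);
        }
        return replayed;
    }

    /** Positions held only by the unflushed memtable that replay nevertheless skips. */
    static List<Long> lostPositions(List<Entry> commitLog, List<Interval> flushed, String unflushed) {
        List<Long> replayed = replayedPositions(commitLog, flushed);
        List<Long> lost = new ArrayList<>();
        for (Entry e : commitLog)
            if (e.memtable.equals(unflushed) && !replayed.contains(e.position)) lost.add(e.position);
        return lost;
    }

    /** Position 11 belongs to the OLD memtable but was allocated after the NEW memtable's first write (10). */
    static List<Entry> exampleLog() {
        return Arrays.asList(new Entry(9, "old"), new Entry(10, "new"),
                             new Entry(11, "old"), new Entry(12, "new"));
    }

    public static void main(String[] args) {
        // Only the new memtable's flush completed; its range is treated as contiguous.
        List<Interval> flushed = Arrays.asList(new Interval(10, 12));
        System.out.println("replayed: " + replayedPositions(exampleLog(), flushed)); // replayed: [9]
        System.out.println("lost: " + lostPositions(exampleLog(), flushed, "old"));  // lost: [11]
    }
}
```

In this model the record at position 11 is neither in an sstable (the old memtable never flushed) nor replayed (the new sstable's interval covers it), which is the permanent loss the ticket describes.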
[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss
[ https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968613#comment-16968613 ]

Dimitar Dimitrov commented on CASSANDRA-15368:

Thanks for the super-quick reply, [~benedict]!

I'll definitely check out your patch for CASSANDRA-15367. As for this problem, if me taking potentially (much) longer to fix it isn't a problem for you, I can surely take a stab.

Also here's the analysis that I mentioned in my previous reply - all comments appreciated.

h4. Defining the problem

Let's assume we have a single table (no indexes, no MVs) that's being continuously written to when a single flush for it is requested.
We want to examine if we can have the old memtable accepting a write with a higher CL position, and the new memtable accepting a write with a lower CL position - the latter also implies the old memtable rejecting that write.
* Below we'll be calling the write with the higher CL position *HW*, its assigned {{OpOrder.Group}} (and the action for assigning it) *HW group*, and its assigned CL position (and the action for assigning it) *HW position*.
* Similar for the write with the lower CL position - *LW*, *LW group*, and *LW position*.

So to get the (un)desired ordering, we need the following specific results from 3 executions of {{Memtable.accepts(OpOrder.Group, CommitLogPosition)}}:
- {{oldMemtable.accepts()}} (called *HW accept?* below), which should return true
- {{oldMemtable.accepts()}} (called *LW accept?* below), which should return false
- {{newMemtable.accepts()}}, which should return true (not necessary for the analysis below)

h4. Some constraints

A. For each of the writes, the {{OpOrder.Group}} assignment happens-before the CL position allocation for the corresponding write, which happens-before the {{Memtable.accepts(OpOrder.Group, CommitLogPosition)}} call for the corresponding write.
* HW group --hb-> HW position --hb-> HW accept?
* LW group --hb-> LW position --hb-> LW accept?

B. The CL position allocations are totally (and numerically) ordered by happens-before, due to the way {{CommitLogSegment}}-s are advanced and the way their internal {{allocatePosition}} markers are CAS-ed.
* LW position --hb-> HW position

C. If {{writeBarrier.issue()}} in the {{Flush}} ctor happens-before HW group, then the final upper CL bound for the old memtable (called *UB* below) has been set, and is guaranteed to be less than HW position, but then HW accept? is guaranteed to return false (because it will see {{writeBarrier}} as not {{null}}, and HW position would be guaranteed to be more than UB) => contradiction
* If {{writeBarrier.issue()}} --hb-> HW group => UB --hb-> HW group => UB --hb-> HW position => contradiction
* Therefore HW group --hb-> {{writeBarrier.issue()}}
* Note that this was not true before the fix for CASSANDRA-8383.

D. If {{writeBarrier.issue()}} happens-before LW group, then UB has been set, and is guaranteed to be less than LW position, and therefore less than HW position. Also {{writeBarrier.issue()}} would happen-before HW position, which would happen-before HW accept?. That means that HW accept? will see {{writeBarrier}} as not {{null}}, and UB as set and less than HW position, so is guaranteed to return false => contradiction
* If {{writeBarrier.issue()}} --hb-> LW group => UB --hb-> LW position --hb-> HW position && {{writeBarrier.issue()}} --hb-> HW accept? => contradiction
* Therefore LW group --hb-> {{writeBarrier.issue()}}

E. As a corollary of C. and D., LW group and HW group should both be before the barrier issued by the flush, and therefore *the placements of LW and HW will both be determined by LW position, HW position, and UB*.

h4. The case work

In order for HW accept? to return true:
# ...it could be seeing {{writeBarrier}} as {{null}}, which means to have started before the {{writeBarrier}} is set in {{oldMemtable.setDiscarding}}.
## This implies that LW accept? is started after HW accept? has started - otherwise LW accept? would also have seen {{writeBarrier}} as {{null}} and returned true already => contradiction
## So LW accept? has started after HW accept? has started, and needs to return false because of LW position (see E. why it cannot be due to LW group). This could happen only if UB has been set and is less than LW position. But as setting UB happens after {{oldMemtable.setDiscarding}}, and HW accept? had started before the {{writeBarrier}} is set in {{oldMemtable.setDiscarding}}, UB should be at least HW position, which is more than LW position => contradiction
#* If HW accept? start --hb-> writeBarrier set in {{oldMemtable.setDiscarding}} => HW position --hb-> writeBarrier set in {{oldMemtable.setDiscarding}} --hb-> UB => LW position --hb-> UB => contradiction
#* Therefore writeBarrier set in {{oldMemtable.setDiscarding}} --hb-> HW accept? start
# ...it could have been
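As a rough aid to following the analysis above, here is a deliberately simplified, single-threaded model of the acceptance decision it reasons about. It is an assumption-laden sketch, not the actual {{Memtable.accepts}} implementation: the class `ToyMemtable`, the use of a `long` epoch in place of {{OpOrder.Group}}, and the `finalizeUpperBound` method are all hypothetical, and the real code relies on barriers and CAS operations that are not modelled here.

```java
public class AcceptsSketch {
    /** Hypothetical stand-in for a memtable; epochs replace OpOrder groups, longs replace CL positions. */
    static final class ToyMemtable {
        private Long barrierEpoch;                 // null until a flush sets the write barrier
        private long upperBound = Long.MIN_VALUE;  // stand-in for UB in the analysis
        private boolean upperBoundFinal;           // true once the flush has fixed UB

        /** Models the flush marking this memtable as discarding (barrier becomes visible). */
        void setDiscarding(long epoch) { barrierEpoch = epoch; }

        /** Models the flush fixing UB from the commit log's current position. */
        void finalizeUpperBound(long currentLogPosition) {
            upperBound = Math.max(upperBound, currentLogPosition);
            upperBoundFinal = true;
        }

        boolean accepts(long groupEpoch, long position) {
            // No barrier seen yet: this memtable still takes every write.
            if (barrierEpoch == null) return true;
            // Group started at or after the barrier: the write belongs to a later memtable.
            if (groupEpoch >= barrierEpoch) return false;
            // Group is before the barrier and UB is fixed: placement depends only on position vs UB.
            if (upperBoundFinal) return position <= upperBound;
            // Group is before the barrier and UB is not yet fixed: accept and raise the bound.
            upperBound = Math.max(upperBound, position);
            return true;
        }
    }

    public static void main(String[] args) {
        ToyMemtable old = new ToyMemtable();
        old.setDiscarding(5);        // barrier issued at epoch 5
        old.finalizeUpperBound(100); // UB fixed at position 100
        System.out.println(old.accepts(4, 90));  // true: before barrier, position <= UB
        System.out.println(old.accepts(4, 110)); // false: before barrier, position > UB
        System.out.println(old.accepts(5, 95));  // false: group not before the barrier
    }
}
```

Under these assumptions, once both groups are before the barrier (constraint E.), the only thing separating an accepted write from a rejected one is how its position compares with UB.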
[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss
[ https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968267#comment-16968267 ]

Benedict Elliott Smith commented on CASSANDRA-15368:

Hi [~dimitarndimitrov],

Thanks for your interest. I'm about to post a patch for CASSANDRA-15367, after which the improved comments and explanations may help you understand (and will help me explain). So I'll elaborate after that is up.

This is quite a hairy bit of the codebase, with a lot of parts we'd like to burn with fire, but if you want to volunteer to take a stab at it I certainly won't stop you.
[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss
[ https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968261#comment-16968261 ]

Dimitar Dimitrov commented on CASSANDRA-15368:

[~benedict], I assume this is something that you're planning to take up yourself, but let me know if you can use a volunteer in any way.

Also can you please help me understand some of the details around the pre-conditions for this problem? I'm probably missing something, but I still can't understand:
* how _*the last operations for the old Memtable may obtain their ReplayPosition after the first operations for the new Memtable*_ can hold true after CASSANDRA-8383.
* how _*Unfortunately, we treat the Memtable range as contiguous, and invalidate the entire range on flush*_ can hold true after CASSANDRA-11828 (with some interaction with CASSANDRA-9669).

I'm also wondering, is _*More problematically, this can also occur on restart without any associated flush failure, as we use commit log boundaries written to our flushed sstables to filter ReplayPosition on recovery*_ related to {{CommitLogReplayer#firstNotCovered(Collection>)}} and its caveats?

P.S. Specifically for the upper bound of the old memtable being above the lower bound of the new memtable, I've tried to explicitly write down the possible orderings, and I can't see how that could happen - I'll format and post my notes in a separate comment a bit later.