[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss

2019-11-12 Thread Dimitar Dimitrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972639#comment-16972639
 ] 

Dimitar Dimitrov commented on CASSANDRA-15368:
--

Thanks for chasing this down, [~benedict]!

I'm glad it turned out that, as initially suspected, you're pretty good at this 
stuff, and that the issue was not lurking from before but was more or less 
introduced by the fix for CASSANDRA-15367. In that case I guess it makes the 
most sense for you to continue and take care of this.

> Failing to flush Memtable without terminating process results in permanent 
> data loss
> 
>
> Key: CASSANDRA-15368
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15368
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Commit Log, Local/Memtable
>Reporter: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0, 2.2.x, 3.0.x, 3.11.x
>
>
> {{Memtable}}s do not contain records that cover a precise contiguous range of 
> {{ReplayPosition}}s, since there are only weak ordering constraints when 
> rolling over to a new {{Memtable}} - the last operations for the old 
> {{Memtable}} may obtain their {{ReplayPosition}} after the first operations 
> for the new {{Memtable}}.
> Unfortunately, we treat the {{Memtable}} range as contiguous, and invalidate 
> the entire range on flush.  Ordinarily we only invalidate records when all 
> prior {{Memtable}}s have also successfully flushed.  However, in the event of 
> a flush failure that does not terminate the process (either because of the 
> disk failure policy, or because it is a software error), the later flush is 
> able to invalidate the region of the commit log that includes records that 
> should have been flushed in the prior {{Memtable}}.
> More problematically, this can also occur on restart without any associated 
> flush failure, as we use commit log boundaries written to our flushed 
> sstables to filter {{ReplayPosition}} on recovery, which is meant to 
> replicate our {{Memtable}} flush behaviour above.  However, we do not know 
> that earlier flushes have completed, and they may complete successfully 
> out-of-order.  So any flush that completes before the process terminates, but 
> began after another flush that _doesn’t_ complete before the process 
> terminates, has the potential to cause permanent data loss.
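
To make the failure mode described above concrete, here is a minimal, 
hypothetical sketch (invented names and positions; not Cassandra code) of the 
ordering being described:

{code:java}
// Hypothetical sketch (invented names, not Cassandra code) of the overlap
// described above: the old memtable's last write obtains its commit log
// position *after* the new memtable's first writes, so the two ranges overlap.
final class OverlapSketch
{
    public static void main(String[] args)
    {
        // The new memtable accepted writes at positions 101..110 and flushes
        // successfully; the old memtable's straggler obtained position 105 but
        // the old memtable's flush failed without terminating the process.
        long newMemtableStart = 101, newMemtableEnd = 110;
        long stragglerFromOldMemtable = 105;

        // Treating the new memtable's range as contiguous and invalidating all
        // of [101, 110] also discards the straggler, even though it was never
        // persisted to an sstable.
        boolean lost = stragglerFromOldMemtable >= newMemtableStart
                    && stragglerFromOldMemtable <= newMemtableEnd;
        System.out.println("straggler invalidated despite failed flush: " + lost); // true
    }
}
{code}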





[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss

2019-11-12 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972280#comment-16972280
 ] 

Benedict Elliott Smith commented on CASSANDRA-15368:


Hmm.  I had convinced myself that this occurred already, and that 
CASSANDRA-15367 simply exploited the fact that the start of the commit log 
region owned by a Memtable could actually occur in either Memtable (as opposed 
to the end, which was contiguous).  But now that I attempt to properly 
construct the scenario, I see that I was wrong, and that past me, who wrote the 
bounds logic, was better at this stuff.  You're right, CASSANDRA-15367 
introduces, rather than exploits, this issue.



[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss

2019-11-11 Thread Branimir Lambov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972149#comment-16972149
 ] 

Branimir Lambov commented on CASSANDRA-15368:
-

Does this mean that this issue is only an artifact of the fix to 
CASSANDRA-15367?



[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss

2019-11-11 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971743#comment-16971743
 ] 

Benedict Elliott Smith commented on CASSANDRA-15368:


Hi [~dimitarndimitrov],

I think you have it the wrong way around; in your parlance, we need:

* oldMemtable.accepts() returns false
* oldMemtable.accepts() returns false
* newMemtable.accepts() returns true
* newMemtable.accepts() returns true

If you look at the new documentation introduced in CASSANDRA-15367 
[here|https://github.com/belliottsmith/cassandra/commit/ed6adf5eabe62f8ce6a1341e0c5423ba53036197#diff-f0a15c3588b56c5ce53ece7c48e325b5R109],
 you'll see that there is a region at the start of all memtables where some 
records from the prior {{group}}, that may have arbitrarily delayed obtaining 
their {{ReplayPosition}}, are intermixed with those of the later group.  This 
region is essentially owned by both memtables, but only the later memtable 
invalidates the relevant commit log records.  The problem occurs if the earlier 
flush fails (and we do not terminate the process), _or_ if the process 
terminates with the later flush having completed (since we will use the 
start/end {{ReplayPosition}} associated with the sstable to invalidate the 
commit log in the same way).
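
To make the restart case concrete, here is a minimal, hypothetical sketch 
(invented names; not the actual replay code) of how filtering replay by the 
persisted sstable bounds can silently drop a record from the earlier, failed 
flush:

{code:java}
import java.util.List;

// Hypothetical sketch, not Cassandra code: on recovery, a record is skipped if
// its position falls inside an interval recorded as flushed by some sstable.
final class ReplayFilterSketch
{
    static final class Interval
    {
        final long start, end;
        Interval(long start, long end) { this.start = start; this.end = end; }
        boolean covers(long position) { return position >= start && position <= end; }
    }

    static boolean shouldReplay(long recordPosition, List<Interval> flushedBounds)
    {
        return flushedBounds.stream().noneMatch(b -> b.covers(recordPosition));
    }

    public static void main(String[] args)
    {
        // Only the later memtable's flush completed before the process died,
        // recording bounds [101, 110] in its sstable metadata.
        List<Interval> flushedBounds = List.of(new Interval(101, 110));

        // A record that belongs to the earlier (never flushed) memtable obtained
        // position 105, inside the later memtable's bounds, so it is not replayed:
        // permanent data loss.
        System.out.println(shouldReplay(105, flushedBounds)); // false
    }
}
{code}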





[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss

2019-11-06 Thread Dimitar Dimitrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968613#comment-16968613
 ] 

Dimitar Dimitrov commented on CASSANDRA-15368:
--

Thanks for the super-quick reply, [~benedict]!
I'll definitely check out your patch for CASSANDRA-15367.
As for this problem, if it isn't an issue for you that I may take (much) longer 
to fix it, I can certainly take a stab at it.

Also here's the analysis that I mentioned in my previous reply - all comments 
appreciated.

h4. Defining the problem

Let's assume we have a single table (no indexes, no MVs) that's being 
continuously written to when a single flush for it is requested.
 We want to examine if we can have the old memtable accepting a write with a 
higher CL position, and the new memtable
 accepting a write with a lower CL position - the latter also implies the old 
memtable rejecting that write.
 * Below we'll be calling the write with the higher CL position *HW*, its 
assigned {{OpOrder.Group}} (and the action for assigning it) *HW group*, and 
its assigned CL position (and the action for assigning it) *HW position*.
 * Similar for the write with the lower CL position - *LW*, *LW group*, and *LW 
position*.

So to get the (un)desired ordering, we need the following specific results from 
3 executions of {{Memtable.accepts(OpOrder.Group, CommitLogPosition)}}:
 - {{oldMemtable.accepts()}} (called *HW accept?* below), which should 
return true
 - {{oldMemtable.accepts()}} (called *LW accept?* below), which should 
return false
 - {{newMemtable.accepts()}}, which should return true (not necessary for 
the analysis below)
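
For context, the shape of {{Memtable.accepts}} that these three calls go 
through is roughly the following simplified sketch (paraphrased from my 
reading, with stand-in types and field names; not the actual code):

{code:java}
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-ins for the real org.apache.cassandra types.
final class MemtableAcceptsSketch
{
    interface Group {}
    interface Barrier { boolean isAfter(Group g); }
    interface CommitLogPosition extends Comparable<CommitLogPosition> {}
    // Marker subtype used once the flush has finalised the upper bound (UB).
    interface LastCommitLogPosition extends CommitLogPosition {}

    volatile Barrier writeBarrier;  // set by setDiscarding() when the flush begins
    final AtomicReference<CommitLogPosition> commitLogUpperBound = new AtomicReference<>();

    // Paraphrased shape of Memtable.accepts(OpOrder.Group, CommitLogPosition):
    boolean accepts(Group opGroup, CommitLogPosition position)
    {
        Barrier barrier = writeBarrier;
        if (barrier == null)
            return true;                       // no flush in progress: accept everything
        if (!barrier.isAfter(opGroup))
            return false;                      // group started at/after the barrier: a later memtable owns it
        while (true)
        {
            CommitLogPosition ub = commitLogUpperBound.get();
            if (ub instanceof LastCommitLogPosition)
                return ub.compareTo(position) >= 0;    // UB finalised: accept only positions <= UB
            if (ub != null && ub.compareTo(position) >= 0)
                return true;
            // UB not finalised yet: try to raise it to cover this write
            if (commitLogUpperBound.compareAndSet(ub, position))
                return true;
        }
    }
}
{code}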

h4. Some constraints

 A. For each of the writes, the {{OpOrder.Group}} assignment happens-before the 
CL position allocation for the corresponding write, which happens-before the 
{{Memtable.accepts(OpOrder.Group, CommitLogPosition)}} call for the 
corresponding write.
 * HW group --hb-> HW position --hb-> HW accept?
 * LW group --hb-> LW position --hb-> LW accept?

B. The CL position allocations are totally (and numerically) ordered by 
happens-before, due to the way {{CommitLogSegment}}-s are advanced and the way 
their internal {{allocatePosition}} markers are CAS-ed.
 * LW position --hb-> HW position
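
For reference, the kind of CAS loop that gives this total order looks roughly 
like the following simplified sketch (not the actual {{CommitLogSegment}} 
code):

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch, not the actual CommitLogSegment code: positions within a
// segment are handed out by a CAS loop on a single counter, so any two
// successful allocations are totally ordered.
final class PositionAllocatorSketch
{
    private final AtomicInteger allocatePosition = new AtomicInteger(0);
    private final int endOfBuffer;

    PositionAllocatorSketch(int endOfBuffer) { this.endOfBuffer = endOfBuffer; }

    /** Returns the start offset for a write of the given size, or -1 if the segment is full. */
    int allocate(int size)
    {
        while (true)
        {
            int prev = allocatePosition.get();
            int next = prev + size;
            if (next > endOfBuffer)
                return -1;                 // caller must advance to a fresh segment
            if (allocatePosition.compareAndSet(prev, next))
                return prev;               // the successful CAS orders this allocation
        }
    }
}
{code}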

C. If {{writeBarrier.issue()}} in the {{Flush}} ctor happens-before HW group, 
then the final upper CL bound for the old memtable (called *UB* below) has been 
set, and is guaranteed to be less than HW position, but then HW accept? is 
guaranteed to return false (because it will see {{writeBarrier}} as not 
{{null}}, and HW position would be guaranteed to be more than UB) => 
contradiction
 * If {{writeBarrier.issue()}} --hb-> HW group => UB --hb-> HW group => UB 
--hb-> HW position => contradiction
 * Therefore HW group --hb-> {{writeBarrier.issue()}}
 * Note that this was not true before the fix for CASSANDRA-8383.

D. If {{writeBarrier.issue()}} happens-before LW group, then UB has been set, 
and is guaranteed to be less than LW position, and therefore less than HW 
position. Also {{writeBarrier.issue()}} would happen-before HW position, which 
would happen-before HW accept?. That means that HW accept? will see 
{{writeBarrier}} as not {{null}}, and UB as set and less than HW position, so 
is guaranteed to return false => contradiction
 * If {{writeBarrier.issue()}} --hb-> LW group => UB --hb-> LW position --hb-> 
HW position && {{writeBarrier.issue()}} --hb-> HW accept? => contradiction
 * Therefore LW group --hb-> {{writeBarrier.issue()}}

E. As a corollary of C. and D., LW group and HW group should both be before the 
barrier issued by the flush, and therefore *the placements of LW and HW will 
both be determined by LW position, HW position, and UB*.

h4. The case work

In order for HW accept? to return true:
# ...it could be seeing {{writeBarrier}} as {{null}}, which means it has 
started before the {{writeBarrier}} is set in {{oldMemtable.setDiscarding}}.
## This implies that LW accept? is started after HW accept? has started - 
otherwise LW accept? would also have seen {{writeBarrier}} as {{null}} and 
returned true already => contradiction
 ## So LW accept? has started after HW accept? has started, and needs to return 
false because of LW position (see E. why it cannot be due to LW group).
 This could happen only if UB has been set and is less than LW position. But as 
setting UB happens after {{oldMemtable.setDiscarding}}, and HW accept? had 
started before the {{writeBarrier}} is set in {{oldMemtable.setDiscarding}}, UB 
should be at least HW position, which is more than LW position => contradiction
 #* If HW accept? start --hb-> writeBarrier set in 
{{oldMemtable.setDiscarding}} => HW position --hb-> writeBarrier set in 
{{oldMemtable.setDiscarding}} --hb-> UB => LW position --hb-> UB => 
contradiction
 #* Therefore writeBarrier set in {{oldMemtable.setDiscarding}} --hb-> HW 
accept? start
 # ...it could have been 

[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss

2019-11-06 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968267#comment-16968267
 ] 

Benedict Elliott Smith commented on CASSANDRA-15368:


Hi [~dimitarndimitrov],

Thanks for your interest.  I'm about to post a patch for CASSANDRA-15367, after 
which the improved comments and explanations may help you understand (and will 
help me explain).  So I'll elaborate after that is up.  This is quite a hairy 
bit of the codebase, with a lot of parts we'd like to burn with fire, but if 
you want to volunteer to take a stab at it I certainly won't stop you.



[jira] [Commented] (CASSANDRA-15368) Failing to flush Memtable without terminating process results in permanent data loss

2019-11-06 Thread Dimitar Dimitrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968261#comment-16968261
 ] 

Dimitar Dimitrov commented on CASSANDRA-15368:
--

[~benedict], I assume this is something that you're planning to take up 
yourself, but let me know if you can use a volunteer in any way. 

Also can you please help me understand some of the details around the 
pre-conditions for this problem?

I'm probably missing something, but I still can't understand:
 * how _*the last operations for the old Memtable may obtain their 
ReplayPosition after the first operations for the new Memtable*_ can hold true 
after CASSANDRA-8383.
 * how _*Unfortunately, we treat the Memtable range as contiguous, and 
invalidate the entire range on flush*_ can hold true after CASSANDRA-11828 
(with some interaction with CASSANDRA-9669).

I'm also wondering, is _*More problematically, this can also occur on restart 
without any associated flush failure, as we use commit log boundaries written 
to our flushed sstables to filter ReplayPosition on recovery*_ related to 
{{CommitLogReplayer#firstNotCovered(Collection>)}}
 and its caveats?

P.S. Specifically for the upper bound of the old memtable being above the lower 
bound of the new memtable, I've tried to explicitly write down the possible 
orderings, and I can't see how that could happen - I'll format and post my 
notes in a separate comment a bit later.
