[jira] [Comment Edited] (CASSANDRA-15327) Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds

2019-09-17 Thread James Baker (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931356#comment-16931356
 ] 

James Baker edited comment on CASSANDRA-15327 at 9/17/19 11:56 AM:
---

Hi [~jmeredithco]

At first glance, it's not obvious to me that the solution given in 
CASSANDRA-6343 cannot suffer from the exact same issue.

Imagine I have SSTables A and B, both of which have been repaired, and A has a 
tombstone for a value contained in B. A new node is bootstrapped and downloads 
A, but then a pause occurs or the streaming order changes, so it doesn't 
download B for some time.

What stops the tombstone from being eligible for compaction at this point? I 
would presume all SSTables on the new node are considered repaired (unless 
bootstrap marks streamed SSTables as unrepaired).


was (Author: jbaker200):
Hi [~jmeredithco] 

At first glance, it's not obvious to me that the solution given in 
CASSANDRA-6343 cannot suffer from the exact same issue.

Imagine I have SSTables A and B which have both been repaired, and A has a 
tombstone for a value contained in B. A new node is bootstrapped and it 
downloads A, and then for some reason a pause occurs.

What stops the tombstone from being compacted away at this point? All SSTables 
on the new node have been repaired, I would presume (unless bootstrap marks 
SSTables as being unrepaired).

> Deleted data can re-appear if range movement streaming time exceeds 
> gc_grace_seconds
> 
>
> Key: CASSANDRA-15327
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15327
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: Leon Zaruvinsky
>Priority: Normal
> Fix For: 2.2.15, 2.1.x
>
> Attachments: CASSANDRA-15327-2.1.txt, CASSANDRA-15327-2.2.txt
>
>
> Hey,
> We've come across a scenario in production (noticed on Cassandra 2.2.14) 
> where data that is deleted from Cassandra at consistency {{ALL}} can be 
> resurrected.  I've added a reproduction in a comment.
> If a {{delete}} is issued during a range movement (i.e. bootstrap, 
> decommission, move), and {{gc_grace_seconds}} is surpassed before the stream 
> is finished, then the tombstones from the {{delete}} can be purged from the 
> recipient node before the data is streamed. Once the move is complete, the 
> data now exists on the recipient node without a tombstone.
> We noticed this because our bootstrapping time occasionally exceeds our 
> configured {{gc_grace_seconds}}, so we lose the consistency guarantee.  As an 
> operator, it would be great to not have to worry about this edge case.
> I've attached a patch that we have tested and successfully used in 
> production, and haven't noticed any ill effects.  Happy to submit patches for 
> more recent versions, I'm not sure how cleanly this will actually merge since 
> there was some refactoring to this logic in 3.x.
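The mechanism quoted above can be sketched as a toy timeline. This is illustrative Python, not Cassandra internals; the event names, the purge rule, and the timings are all assumptions, chosen only to show the ordering that loses the tombstone.

{code:python}
# Toy replay of the recipient node's view during a range movement.
GC_GRACE_SECONDS = 1

def run_timeline(events):
    """Replay (time, action, key) events on the recipient node and return
    the keys it considers live at the end."""
    tombstones = {}  # key -> deletion time
    data = set()
    for t, action, key in events:
        if action == "delete":        # tombstone forwarded to the joining node
            tombstones[key] = t
            data.discard(key)
        elif action == "compact":     # purge tombstones older than gc_grace
            tombstones = {k: dt for k, dt in tombstones.items()
                          if t - dt <= GC_GRACE_SECONDS}
        elif action == "stream":      # pre-delete data finally arrives
            if key not in tombstones:  # nothing left to shadow it
                data.add(key)
    return data

# delete at t=0, compaction after gc_grace at t=10, stream finishes at t=20
print(run_timeline([(0, "delete", "row1"),
                    (10, "compact", None),
                    (20, "stream", "row1")]))  # {'row1'}: deleted data is back
{code}

If the stream finishes before the compaction, the tombstone still covers the streamed data and nothing resurrects.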



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15327) Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds

2019-09-16 Thread Leon Zaruvinsky (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930940#comment-16930940
 ] 

Leon Zaruvinsky edited comment on CASSANDRA-15327 at 9/16/19 10:54 PM:
---

h2. Reproduction (with Bootstrap)
h3. Cluster setup

Apache Cassandra 2.2.14, 3 nodes with SimpleSeedProvider. 
{code:java}
$ nodetool status
Datacenter: DC11
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.180  101.68 KB  512     66.7%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  101.64 KB  512     67.6%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  101.69 KB  512     65.7%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}
h3. Seed Data

We create a table {{datadump.ddt}} using {{LeveledCompactionStrategy}}, with 
{{RF = 3}} and {{gc_grace_seconds = 1}}. We write, at {{QUORUM}}, 20,000 rows 
in the form (row, value), where row is a partition key in [0, 20000) and 
value is a one-megabyte blob.
{code:java}
cqlsh> describe datadump;

CREATE KEYSPACE datadump WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': '3'} AND durable_writes = true;

CREATE TABLE datadump.ddt (
 row bigint,
 col bigint,
 value blob static,
 PRIMARY KEY (row, col)
) WITH CLUSTERING ORDER BY (col ASC)
 AND bloom_filter_fp_chance = 0.1
 AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
 AND comment = ''
 AND compaction = {'class': 
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
 AND compression = {}
 AND dclocal_read_repair_chance = 0.1
 AND default_time_to_live = 0
 AND gc_grace_seconds = 1
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair_chance = 0.0
 AND speculative_retry = '99.0PERCENTILE';

cqlsh> SELECT COUNT(row) from datadump.ddt;

 system.count(row)
-------------------
             20000

$ nodetool status
Datacenter: DC11
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.180  19.26 GB  512     100.0%            5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.26 GB  512     100.0%            20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  18.71 GB  512     100.0%            1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}
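For reference, the seed load can be sketched as follows. This assumes, per the description above, that row ranges over [0, 20000) with one ~1 MiB blob each; it builds CQL text only and makes no cluster connection, so the blob itself is left as a bind placeholder.

{code:python}
# Sketch of the seed load (assumed shape: 20,000 partitions, one ~1 MiB
# blob each, matching the ~19 GB per-node load above). CQL text only.
NUM_ROWS = 20_000

def seed_statements(n=NUM_ROWS):
    # '?' would be bound to a 1 MiB blob by a driver (e.g. cassandra-driver)
    for row in range(n):
        yield f"INSERT INTO datadump.ddt (row, col, value) VALUES ({row}, 0, ?);"

stmts = list(seed_statements())
print(len(stmts))   # 20000
print(stmts[0])
{code}

In the real run these inserts were executed at {{QUORUM}} against the three-node cluster.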
h3. Bootstrap + Delete

We bootstrap a fourth node to the cluster, and as soon as it begins to stream 
data, we delete at {{ALL}} the 20,000 rows that we inserted earlier. The 
deletes will be forwarded to the bootstrapping node immediately. As soon as the 
delete queries complete, we run a flush and major compaction on the 
bootstrapping node to clear the tombstones.
{code:java}
$ nodetool status
Datacenter: DC11
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UJ  x.x.x.129  14.43 KB  512     ?                 8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  19.26 GB  512     100.0%            5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.26 GB  512     100.0%            20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  18.71 GB  512     100.0%            1df2908e-5697-4b1a-9110-dcf581511510  RAC1
// First, we trigger deletes on one of the original three nodes.
// Then, we run the below on the bootstrapping node
$ nodetool flush && nodetool compact
$ nodetool compactionhistory
{code}
Compaction History:
{code:java}
id                                   keyspace_name columnfamily_name compacted_at  bytes_in bytes_out rows_merged
b63e7e30-d00a-11e9-a7b9-f18b3a65a899 datadump      ddt               1567708003731 366768   0         {}
7e2253a0-d00a-11e9-a7b9-f18b3a65a899 system        local             1567707909594 10705    10500     {4:1}
{code}
h3. Final State

Once the bootstrap has completed, the cluster looks like this:
{code:java}
$ nodetool status
Datacenter: DC11
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.129  14.93 GB  512     76.1%             8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  19.54 GB  512     74.5%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.54 GB  512     75.3%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  19.54 GB  512     74.1%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}
We run a flush and major compaction on every node. The original three nodes 
drop everything, but the bootstrapped node holds onto nearly 75% of the data.
{code:java}
$ nodetool status
Datacenter: DC11
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.129  14.93 GB   512     76.1%             8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  143.39 KB  512     74.5%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  128.6 KB   512     75.3%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  149.97 KB  512     74.1%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}
If we run the count query again, we see the data get read-repaired back into 
the other nodes.
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns