[ https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930940#comment-16930940 ]
Leon Zaruvinsky commented on CASSANDRA-15327:
---------------------------------------------

h2. Reproduction (with Bootstrap)

h3. Cluster setup

Apache Cassandra 2.2.14, 3 nodes with SimpleSeedProvider.

{code:java}
$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.180  101.68 KB  512     66.7%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  101.64 KB  512     67.6%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  101.69 KB  512     65.7%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}

h3. Seed Data

We create a table {{datadump.ddt}} using {{LeveledCompactionStrategy}}, with {{RF = 3}} and {{gc_grace_seconds = 1}}. We write, at {{QUORUM}}, 20,000 rows of data in the format (row, value), where row is a partition key in [0, 20000) and value is a one-megabyte blob.

{code:java}
cqlsh> describe datadump;

CREATE KEYSPACE datadump WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'} AND durable_writes = true;

CREATE TABLE datadump.ddt (
    row bigint,
    col bigint,
    value blob static,
    PRIMARY KEY (row, col)
) WITH CLUSTERING ORDER BY (col ASC)
    AND bloom_filter_fp_chance = 0.1
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 1
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

cqlsh> SELECT COUNT(row) from datadump.ddt;

 system.count(row)
-------------------
             20000

$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.180  19.26 GB  512     100.0%            5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.26 GB  512     100.0%            20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  18.71 GB  512     100.0%            1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}

h3. Bootstrap + Delete

We bootstrap a fourth node into the cluster, and as soon as it begins to stream data, we delete, at {{ALL}}, the 20,000 rows that we inserted earlier. The deletes are forwarded to the bootstrapping node immediately. As soon as the delete queries complete, we run a flush and major compaction on the bootstrapping node to clear the tombstones. A sketch of this insert/delete workload follows the compaction history below.

{code:java}
$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UJ  x.x.x.129  14.43 KB  512     ?                 8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  19.26 GB  512     100.0%            5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.26 GB  512     100.0%            20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  18.71 GB  512     100.0%            1df2908e-5697-4b1a-9110-dcf581511510  RAC1

// First, we trigger deletes on one of the original three nodes.
// Then, we run the below on the bootstrapping node
$ nodetool flush && nodetool compact
$ nodetool compactionhistory
{code}

Compaction History:

{code:java}
id                                    keyspace_name  columnfamily_name  compacted_at   bytes_in  bytes_out  rows_merged
b63e7e30-d00a-11e9-a7b9-f18b3a65a899  datadump       ddt                1567708003731  366768    0          {}
7e2253a0-d00a-11e9-a7b9-f18b3a65a899  system         local              1567707909594  10705     10500      {4:1}
{code}
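For reference, the seed writes (at {{QUORUM}}) and the deletes issued during the bootstrap (at {{ALL}}) were driven from a client. The sketch below is a minimal illustration of that workload using the DataStax Python driver, not the exact script we ran; the contact point and the single clustering value are placeholders.

{code}
# Sketch of the seed-write / delete workload described above (illustrative only).
# Assumes the DataStax Python driver (cassandra-driver); the contact point is a placeholder.
import os

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['x.x.x.180'])
session = cluster.connect('datadump')

blob = os.urandom(1024 * 1024)  # one-megabyte value per row

# Seed phase (before bootstrap): 20,000 rows written at QUORUM.
insert = SimpleStatement(
    "INSERT INTO ddt (row, col, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
for i in range(20000):
    session.execute(insert, (i, 0, blob))

# Delete phase (run while the fourth node is bootstrapping): every partition
# is deleted at ALL, so the tombstones also reach the joining node.
delete = SimpleStatement(
    "DELETE FROM ddt WHERE row = %s",
    consistency_level=ConsistencyLevel.ALL)
for i in range(20000):
    session.execute(delete, (i,))

cluster.shutdown()
{code}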
h3. Final State

Once the bootstrap has completed, the cluster looks like this:

{code:java}
$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.129  14.93 GB  512     76.1%             8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  19.54 GB  512     74.5%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.54 GB  512     75.3%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  19.54 GB  512     74.1%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}

We run a flush and major compaction on every node. The original three nodes drop everything, but the bootstrapped node holds onto nearly 75% of the data.

{code:java}
$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.129  14.93 GB   512     76.1%             8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  143.39 KB  512     74.5%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  128.6 KB   512     75.3%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  149.97 KB  512     74.1%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}

If we run the count query again, we see the data get read-repaired back into the other nodes.

{code:java}
$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.129  14.93 GB  512     76.1%             8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  9.5 GB    512     74.5%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  9.58 GB   512     75.3%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  9.5 GB    512     74.1%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1

cqlsh> SELECT COUNT(row) from datadump.ddt;

 system.count(row)
-------------------
             15282

(1 rows)
{code}


> Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds
> -------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15327
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15327
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Leon Zaruvinsky
>            Priority: Normal
>         Attachments: CASSANDRA-15327-2.1.txt
>
>
> Hey,
> We've come across a scenario in production (noticed on Cassandra 2.2.14) where data that is deleted from Cassandra at consistency {{ALL}} can be resurrected. I've added a reproduction in a comment.
> If a {{delete}} is issued during a range movement (i.e. bootstrap, decommission, move), and {{gc_grace_seconds}} is surpassed before the stream is finished, then the tombstones from the {{delete}} can be purged from the recipient node before the data is streamed. Once the move is complete, the data now exists on the recipient node without a tombstone.
> We noticed this because our bootstrapping time occasionally exceeds our configured gc_grace_seconds, so we lose the consistency guarantee. As an operator, it would be great to not have to worry about this edge case.
> I've attached a patch that we have tested and successfully used in production, and haven't noticed any ill effects.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org