[
https://issues.apache.org/jira/browse/CASSANDRA-18118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Caleb Rackliffe updated CASSANDRA-18118:
----------------------------------------
Reviewers: Caleb Rackliffe (was: Caleb Rackliffe)
Status: Review In Progress (was: Patch Available)
> Do not leak 2015 memtable synthetic Epoch
> -----------------------------------------
>
> Key: CASSANDRA-18118
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18118
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Memtable
> Reporter: Berenguer Blasi
> Assignee: Berenguer Blasi
> Priority: Normal
> Fix For: 3.11.x, 4.0.x
>
>
> This
> [Epoch|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/EncodingStats.java#L48]
> can
> [leak|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/Memtable.java#L392],
> affecting all timestamp-based logic. In a production environment it has been
> observed to, for example, prevent proper sstable and tombstone cleanup.
> To reproduce, create the following table:
> {noformat}
> drop keyspace test;
> create keyspace test WITH replication = {'class':'SimpleStrategy',
> 'replication_factor' : 1};
> CREATE TABLE test.test (
> key text PRIMARY KEY,
> id text
> ) WITH bloom_filter_fp_chance = 0.01
> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
> AND comment = ''
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '2', 'tombstone_compaction_interval':
> '3000', 'tombstone_threshold': '0.1', 'unchecked_tombstone_compaction':
> 'true'}
> AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND crc_check_chance = 1.0
> AND dclocal_read_repair_chance = 0.0
> AND default_time_to_live = 10
> AND gc_grace_seconds = 10
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99PERCENTILE';
> CREATE INDEX id_idx ON test.test (id);
> {noformat}
> And stress-load it with:
> {noformat}
> insert into test.test (key,id) values('$RANDOM_UUID $RANDOM_UUID',
> 'eaca36a1-45f1-469c-a3f6-3ba54220363f') USING TTL 10
> {noformat}
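> The {{$RANDOM_UUID}} placeholders above stand for freshly generated UUIDs. As
> a hypothetical way to drive that statement in a loop (not from the ticket;
> the {{build_insert}} helper name is invented, and actually running the loop
> assumes cqlsh on PATH and a local node):

```shell
# Hypothetical helper (not from the ticket): builds the insert statement from
# the repro above. The key value is whatever the caller passes in, e.g. two
# freshly generated UUIDs in place of the $RANDOM_UUID placeholders.
build_insert() {
  printf "insert into test.test (key,id) values('%s', 'eaca36a1-45f1-469c-a3f6-3ba54220363f') USING TTL 10;" "$1"
}

# Example load loop (commented out; needs a running node and cqlsh on PATH):
#   while true; do cqlsh -e "$(build_insert "$(uuidgen) $(uuidgen)")"; done
build_insert "demo-key demo-key"
```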
> Notice that all inserts carry a 10s TTL, and that the table's default TTL and
> gc_grace_seconds are also 10s; this is to speed up the repro:
> - Run the load for a couple of minutes and track sstable disk usage. You will
> see it only increases; nothing gets cleaned up and it doesn't stop growing
> (notice all this is well past the 10s gc_grace and TTL).
> - Running a flush and a compaction while under load, against the keyspace,
> table, or index, doesn't solve the issue.
> - Stopping the load and running a compaction doesn't solve the issue;
> flushing does, though.
> - In the original observation, where TTL was around 600s and gc_grace around
> 1800s, GBs of sstables weren't cleaned up or compacted away after hours of
> work.
> - The issue can also be reproduced on plain sstables by repeatedly
> inserting/deleting/overwriting the same values, without 2i indices or TTLs
> being involved.
> The root cause seems to be
> [EncodingStats|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/EncodingStats.java#L48]
> using a synthetic 2015 epoch as its baseline, chosen so that vint
> serialization stays compact. Unfortunately, {{Memtable}} uses that value when
> tracking its {{minTimestamp}}, which can leak the 2015 epoch and confuse any
> logic consuming that timestamp. In this particular case, purgeable tombstones
> and fully expired sstables weren't properly detected.
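> To make the mechanism concrete, here is a minimal, self-contained sketch (not
> Cassandra code: the constant mirrors {{EncodingStats.TIMESTAMP_EPOCH}}, and
> {{updateMinTimestamp}} is a simplified stand-in for how {{Memtable}} folds
> per-update stats into its {{minTimestamp}}):

```java
import java.util.Calendar;
import java.util.TimeZone;

public class EpochLeakSketch {
    // Mirrors EncodingStats.TIMESTAMP_EPOCH: microseconds since the Unix epoch
    // for 2015-09-22 00:00:00 UTC, the synthetic baseline used so that vint
    // deltas against it stay small.
    static final long TIMESTAMP_EPOCH;
    static {
        Calendar c = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        c.set(Calendar.YEAR, 2015);
        c.set(Calendar.MONTH, Calendar.SEPTEMBER);
        c.set(Calendar.DAY_OF_MONTH, 22);
        c.set(Calendar.HOUR_OF_DAY, 0);
        c.set(Calendar.MINUTE, 0);
        c.set(Calendar.SECOND, 0);
        c.set(Calendar.MILLISECOND, 0);
        TIMESTAMP_EPOCH = c.getTimeInMillis() * 1000; // millis -> micros
    }

    // Simplified stand-in for the Memtable update path: the memtable keeps the
    // minimum write timestamp seen across all partition updates.
    static long updateMinTimestamp(long currentMin, long statsMinTimestamp) {
        return Math.min(currentMin, statsMinTimestamp);
    }

    public static void main(String[] args) {
        long min = Long.MAX_VALUE;

        // A genuine write carries a current-time timestamp in microseconds.
        long realWrite = System.currentTimeMillis() * 1000;
        min = updateMinTimestamp(min, realWrite);

        // An update whose stats default to the synthetic baseline reports the
        // 2015 epoch as its minTimestamp...
        min = updateMinTimestamp(min, TIMESTAMP_EPOCH);

        // ...so the memtable now claims its oldest data dates from 2015, which
        // defeats any minTimestamp-based purge and fully-expired-sstable checks.
        System.out.println(min == TIMESTAMP_EPOCH); // prints "true"
    }
}
```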
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]