maxwellguo created CASSANDRA-18278:
--------------------------------------
Summary: Add a tool to clean redundant data for native secondary
index
Key: CASSANDRA-18278
URL: https://issues.apache.org/jira/browse/CASSANDRA-18278
Project: Cassandra
Issue Type: Improvement
Components: Feature/2i Index, Tool/nodetool
Reporter: maxwellguo
Assignee: maxwellguo
As we know, Cassandra's secondary index is a local secondary index. For
every data update that hits the indexed columns, the old, now redundant
entry can stay in the index table; such entries are removed only when the
data are read (perhaps a little like read repair).
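To make the behaviour above concrete, here is a minimal toy model in Python.
It is only a sketch and not Cassandra code: the class ToyIndexedTable and its
methods are hypothetical names, and it assumes the write path never sees the
old value (as when every update has already been flushed), so stale index
entries are dropped only by a read.

```python
# Toy in-memory model of a local secondary index; not Cassandra code.
# All names here are hypothetical and exist only for illustration.

class ToyIndexedTable:
    def __init__(self):
        self.base = {}    # pk -> current c1 value (the base table)
        self.index = {}   # c1 value -> set of pks (the index table)

    def upsert(self, pk, c1):
        # Blind write: the old value is not read, so its index entry stays.
        self.base[pk] = c1
        self.index.setdefault(c1, set()).add(pk)

    def query_by_c1(self, c1):
        # Read path: validate each index hit against the base table and
        # drop the entries that no longer match.
        hits = self.index.get(c1, set())
        live = {pk for pk in hits if self.base.get(pk) == c1}
        stale = hits - live
        if stale:
            hits -= stale          # mutates the set stored in self.index
            if not hits:
                del self.index[c1]
        return sorted(live)

    def index_entries(self):
        # Flatten the index into sorted (indexed value, pk) pairs.
        return sorted((c1, pk) for c1, pks in self.index.items() for pk in pks)


table = ToyIndexedTable()
for value in (1, 2, 3):
    table.upsert(pk=1, c1=value)

print(table.index_entries())   # three entries, two of them stale
print(table.query_by_c1(1))    # reading c1 = 1 removes that stale entry
print(table.index_entries())   # the entry for c1 = 2 is still there
```

In this model the entry for c1 = 2 stays in the index until somebody queries
c1 = 2, which is the kind of leftover data the proposed tool would remove.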
So old, useless data may remain in the index table if it is never read, and
we would like to provide a tool that can remove it. See the example below:
we create a table with a secondary index on column c1, then update the data
with the same pk but different c1 values, flushing after every update. After
that we force a major compaction on the index table. See the sstable dump
for the secondary index (the dump tool cannot normally be used on secondary
index sstables, but fortunately we use
[CASSANDRA-17698|https://issues.apache.org/jira/browse/CASSANDRA-17698]),
where we can see the content of the index sstable.
Below are the CQL and the dump result.
{code:java}
cqlsh> DESC ks.tb
CREATE TABLE ks.tb (
    pk int PRIMARY KEY,
    c1 int
) WITH additional_write_policy = '99p'
    AND allow_auto_snapshot = true
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';
CREATE INDEX idx ON ks.tb (c1);
cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 1);
cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 2);
cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 3);
cqlsh>
{code}
Alongside this, we flush after every update and force a major compaction at
the end.
{code:java}
➜ bin git:(trunk) ✗ ./nodetool flush
➜ bin git:(trunk) ✗ ./nodetool flush
➜ bin git:(trunk) ✗ ./nodetool flush
➜ bin git:(trunk) ✗ ./nodetool compact ks tb.idx
➜ bin git:(trunk) ✗ ../tools/bin/sstabledump ../data/data/ks/tb-65d902b0b2bc11ed86ed81daebeca99d/.idx/nb-13-big-Data.db
[
  {
    "table kind" : "INDEX",
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2023-02-23T03:21:57.638558Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "table kind" : "INDEX",
    "partition" : {
      "key" : [ "2" ],
      "position" : 29
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 47,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2023-02-23T03:22:19.834466Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "table kind" : "INDEX",
    "partition" : {
      "key" : [ "3" ],
      "position" : 61
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 79,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2023-02-23T03:22:27.532174Z" },
        "cells" : [ ]
      }
    ]
  }
]
{code}
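The dump still holds index entries for c1 = 1 and c1 = 2 even though the base
row now has c1 = 3. As a rough sketch of the check such a tool might perform,
the Python below parses a condensed copy of the dump above and flags entries
whose indexed value no longer matches the base table. The function names
index_entries and stale_entries are hypothetical, the base table content is
passed in as a plain dict rather than read from Cassandra, and the layout
assumed here (index partition key = indexed c1 value, clustering = base pk) is
taken from the dump itself.

```python
import json

# Condensed copy of the sstabledump output shown above (same fields and values).
DUMP = """
[
  {"table kind": "INDEX", "partition": {"key": ["1"], "position": 0},
   "rows": [{"type": "row", "position": 18, "clustering": [1],
             "liveness_info": {"tstamp": "2023-02-23T03:21:57.638558Z"},
             "cells": []}]},
  {"table kind": "INDEX", "partition": {"key": ["2"], "position": 29},
   "rows": [{"type": "row", "position": 47, "clustering": [1],
             "liveness_info": {"tstamp": "2023-02-23T03:22:19.834466Z"},
             "cells": []}]},
  {"table kind": "INDEX", "partition": {"key": ["3"], "position": 61},
   "rows": [{"type": "row", "position": 79, "clustering": [1],
             "liveness_info": {"tstamp": "2023-02-23T03:22:27.532174Z"},
             "cells": []}]}
]
"""


def index_entries(dump_text):
    """Yield (indexed_value, base_pk) pairs from an index sstable dump."""
    for partition in json.loads(dump_text):
        # Assumption: the key is printed as a string and c1 is an int column.
        indexed_value = int(partition["partition"]["key"][0])
        for row in partition["rows"]:
            yield indexed_value, row["clustering"][0]


def stale_entries(dump_text, base_rows):
    """Return index entries that disagree with base_rows (pk -> current c1)."""
    return [(value, pk) for value, pk in index_entries(dump_text)
            if base_rows.get(pk) != value]


# After the three inserts the base table holds pk = 1, c1 = 3.
base_rows = {1: 3}
print(stale_entries(DUMP, base_rows))
```

A real tool would have to read the base table itself and write tombstones for
the stale entries; this sketch only shows the comparison step.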