maxwellguo created CASSANDRA-18278:
--------------------------------------

             Summary: Add a tool to clean redundant data for native secondary index
                 Key: CASSANDRA-18278
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18278
             Project: Cassandra
          Issue Type: Improvement
          Components: Feature/2i Index, Tool/nodetool
            Reporter: maxwellguo
            Assignee: maxwellguo


As we know, Cassandra's native secondary index is a local index. When an 
update changes the value of an indexed column, the old index entry may remain 
in the index table. Such stale entries are cleaned up only when they are read 
(a little like read repair).
So old, useless data can accumulate in the index table if it is never read, 
and we would like to add a tool that removes it.
See the example below: we create a table with a secondary index on column c1, 
then write the same pk three times with different c1 values, flushing after 
every write. After that we force a major compaction on the index table and 
dump the index sstable. (The dump tool cannot normally be used on secondary 
index sstables, but here we use the change from 
[CASSANDRA-17698|https://issues.apache.org/jira/browse/CASSANDRA-17698], so 
we can see the content of the index sstable.)
Below are the CQL and the dump result.

{code:java}
cqlsh> DESC ks.tb

CREATE TABLE ks.tb (
    pk int PRIMARY KEY,
    c1 int
) WITH additional_write_policy = '99p'
    AND allow_auto_snapshot = true
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';

CREATE INDEX idx ON ks.tb (c1);
cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 1);
cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 2);
cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 3);
cqlsh> 
{code}

We flush after every update and force a major compaction on the index table at the end.

{code:java}
➜  bin git:(trunk) ✗ ./nodetool flush
➜  bin git:(trunk) ✗ ./nodetool flush
➜  bin git:(trunk) ✗ ./nodetool flush
➜  bin git:(trunk) ✗ ./nodetool compact ks tb.idx
➜  bin git:(trunk) ✗ ../tools/bin/sstabledump ../data/data/ks/tb-65d902b0b2bc11ed86ed81daebeca99d/.idx/nb-13-big-Data.db
[
  {
    "table kind" : "INDEX",
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2023-02-23T03:21:57.638558Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "table kind" : "INDEX",
    "partition" : {
      "key" : [ "2" ],
      "position" : 29
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 47,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2023-02-23T03:22:19.834466Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "table kind" : "INDEX",
    "partition" : {
      "key" : [ "3" ],
      "position" : 61
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 79,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2023-02-23T03:22:27.532174Z" },
        "cells" : [ ]
      }
    ]
  }
]
{code}
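
In the dump above, the index still holds entries for c1 = 1, 2 and 3, all pointing at pk = 1, while the base row now holds only c1 = 3. The check a cleanup tool would need to make can be sketched roughly as follows. This is a toy in-memory model in Python, not Cassandra code; the function name and the data structures are hypothetical and only illustrate the idea of comparing each index entry against the current base row.

```python
# Toy in-memory model of stale secondary index entries.
# Not Cassandra code: all names and structures here are hypothetical.

def find_stale_entries(index_entries, base_table):
    """Return index entries whose indexed value no longer matches the base row.

    index_entries: list of (indexed_value, pk) pairs, one per index row.
    base_table:    dict mapping pk -> current value of the indexed column.
    """
    stale = []
    for indexed_value, pk in index_entries:
        current = base_table.get(pk)  # None if the base row no longer exists
        if current != indexed_value:
            stale.append((indexed_value, pk))
    return stale


# Base table after the three INSERTs: pk=1 now holds c1=3.
base_table = {1: 3}

# Index rows as seen in the dump: (index partition key, clustering = base pk).
index_entries = [(1, 1), (2, 1), (3, 1)]

stale = find_stale_entries(index_entries, base_table)
live = [entry for entry in index_entries if entry not in stale]

print("stale:", stale)  # stale: [(1, 1), (2, 1)]
print("live:", live)    # live: [(3, 1)]
```

In this model the entries for c1 = 1 and c1 = 2 are the redundant data the proposed tool would remove, and only the entry for c1 = 3 is still valid. A real tool would also have to account for timestamps, tombstones and concurrent writes, which this sketch ignores.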




