[ https://issues.apache.org/jira/browse/CASSANDRA-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sam Tunnicliffe updated CASSANDRA-9107:
---------------------------------------
Attachment: 9107-v2.txt
Attached a v2.
bq. Should we start to switch terms from 'row' to 'partition' in new metrics
when we mean 'partition'?
We should, but having a mixture of "row" and "partition" referring to the same
thing would be worse IMHO. How about a 3.0 ticket to review the naming of all
externally facing metrics?
bq. Also, should we include the count in flush-pending memtables?
Good call, added in v2.
> More accurate row count estimates
> ---------------------------------
>
> Key: CASSANDRA-9107
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9107
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Chris Lohfink
> Assignee: Chris Lohfink
> Attachments: 9107-cassandra2-1.patch, 9107-v2.txt
>
>
> Currently the estimated row count from cfstats is the sum of the number of
> rows in all the sstables. This becomes very inaccurate with wide rows or
> heavily updated datasets, since the same partition will exist in many
> sstables. For example:
> {code}
> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': 1};
> create TABLE wide (key text PRIMARY KEY, value text) WITH compaction =
> {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 30,
> 'max_threshold': 100};
> -------------------------------
> insert INTO wide (key, value) VALUES ('key', 'value');
> // flush
> // cfstats output: Number of keys (estimate): 1 (128 in older version from index)
> insert INTO wide (key, value) VALUES ('key', 'value');
> // flush
> // cfstats output: Number of keys (estimate): 2 (256 in older version from index)
> // ... etc
> {code}
> Previously the estimate came from the index, but it was still computed per
> sstable and summed, so it also became less accurate as the number of sstables
> grew (and by a much larger margin).
> With new versions of sstables we can merge the cardinalities to resolve this,
> with a slight hit to accuracy in the case where every sstable contains
> completely unique partitions.
> Furthermore, I think it would take fairly minimal effort to include the
> number of rows in the memtables in this count. We won't have the cardinality
> merging between memtables and sstables, but I would consider that a relatively
> minor drawback.
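
For anyone following along, here is a rough sketch of the difference between summing per-sstable estimates and merging cardinalities, as described above. It is not the patch: the ticket does not say which estimator the sstables store, so a toy HyperLogLog-style sketch is assumed, and ToyHLL, sketch_of, hot_keys and memtable_keys are made-up names for illustration only.

```python
import hashlib
import math


class ToyHLL:
    """Minimal HyperLogLog-style sketch (illustrative, not Cassandra's estimator)."""

    def __init__(self, p=10):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, key):
        # Take a 64-bit hash of the key.
        h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging is a register-wise max, so keys seen by both sketches count once.
        merged = ToyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def sketch_of(keys):
    hll = ToyHLL()
    for k in keys:
        hll.add(k)
    return hll


# Heavily updated dataset: the same 1000 partitions show up in 30 sstables.
hot_keys = [f"key{i}" for i in range(1000)]
sstables = [sketch_of(hot_keys) for _ in range(30)]

# Old behaviour: estimate each sstable separately and add the results up.
summed_estimate = sum(s.estimate() for s in sstables)

# Proposed behaviour: merge the sketches first, then estimate once.
merged = sstables[0]
for s in sstables[1:]:
    merged = merged.merge(s)
merged_estimate = merged.estimate()

# Memtable count is simply added on top (assumption: an exact partition count
# is available). Keys already present in sstables are counted twice here.
memtable_keys = {f"key{i}" for i in range(950, 1050)}
total_estimate = round(merged_estimate) + len(memtable_keys)

print(round(summed_estimate), round(merged_estimate), total_estimate)
```

In this toy setup the summed figure grows with the number of sstables while the merged figure stays close to the true partition count. The memtable term overcounts by the number of partitions already flushed, which is the minor drawback mentioned in the description.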