[jira] [Updated] (CASSANDRA-9107) More accurate row count estimates

Sam Tunnicliffe (JIRA) Tue, 28 Apr 2015 08:00:06 -0700

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sam Tunnicliffe updated CASSANDRA-9107:
---------------------------------------
    Attachment: 9107-v2.txt

Attached a v2

bq. Should we start to switch terms from 'row' to 'partition' in new metrics 
when we mean 'partition'?

We should, but having a mixture of "row" and "partition" referring to the same 
thing would be worse IMHO. How about a 3.0 ticket to review the naming of all 
externally facing metrics? 

bq. Also, should we include the count in flush-pending memtables?

Good call, added in v2.

> More accurate row count estimates
> ---------------------------------
>
>                 Key: CASSANDRA-9107
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9107
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Chris Lohfink
>            Assignee: Chris Lohfink
>         Attachments: 9107-cassandra2-1.patch, 9107-v2.txt
>
>
> Currently the estimated row count from cfstats is the sum of the number of 
> rows in all the sstables. This becomes very inaccurate with wide rows or 
> heavily updated datasets since the same partition would exist in many 
> sstables. For example:
> {code}
> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> CREATE TABLE wide (key text PRIMARY KEY, value text) WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 30, 'max_threshold': 100};
> // -------------------------------
> INSERT INTO wide (key, value) VALUES ('key', 'value');
> // flush
> // cfstats output: Number of keys (estimate): 1  (128 in older version from index)
> INSERT INTO wide (key, value) VALUES ('key', 'value');
> // flush
> // cfstats output: Number of keys (estimate): 2  (256 in older version from index)
> // ... etc
> {code}
> Previously it used the index, but it still computed the estimate per sstable 
> and summed the results, so it had the same problem as the number of sstables 
> grew (only much worse). With new versions of sstables we can merge the 
> cardinalities to resolve this, with a slight hit to accuracy in the case of 
> every sstable having completely unique partitions.
> Furthermore, I think it would be pretty minimal effort to include the number 
> of rows in the memtables in this count. We won't have the cardinality merging 
> between memtables and sstables, but I would consider that a relatively minor 
> negative.
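
To illustrate the difference between summing per-sstable estimates and merging cardinalities, here is a minimal sketch in Python. It assumes a HyperLogLog-style estimator per sstable; TinyHLL, sketch_of, summed_estimate and merged_estimate are hypothetical names made up for this example and are not taken from the attached patches or from Cassandra's code.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog sketch; a hypothetical stand-in for a per-sstable estimator."""

    def __init__(self, p=10):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, key):
        # Take a 64-bit hash of the partition key.
        h = int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        assert self.p == other.p
        merged = TinyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def sketch_of(keys):
    """Build the sketch one sstable would carry for its partition keys."""
    s = TinyHLL()
    for k in keys:
        s.add(k)
    return s


def summed_estimate(sketches):
    """Old behaviour: estimate each sstable separately and add the results."""
    return sum(s.estimate() for s in sketches)


def merged_estimate(sketches, memtable_partition_counts=()):
    """Merge the sstable sketches first, then add memtable counts unmerged."""
    merged = TinyHLL()
    for s in sketches:
        merged = merged.merge(s)
    return merged.estimate() + sum(memtable_partition_counts)


if __name__ == "__main__":
    # Same partition flushed twice, as in the CQL example above.
    sstables = [sketch_of(["key"]), sketch_of(["key"])]
    print("summed:", round(summed_estimate(sstables)))
    print("merged:", round(merged_estimate(sstables)))
```

In this sketch the memtable counts are simply added on top of the merged sstable estimate, so a partition present in both a memtable and an sstable is still counted twice. That is the "relatively minor negative" mentioned in the description.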



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
