SELECT COUNT(*) probably works (with internal paging) on many datasets, given 
enough time and assuming you don’t have any partitions large enough to kill 
the query.

No, it doesn’t count extra replicas / duplicates.

The old way to do this (before paging / fetch size) was to use manual paging 
based on tokens/clustering keys:

https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html – SELECT’s 
WHERE clause can use token(), which is what you’d want to use to page through 
the whole token space. 

You could, in theory, issue thousands of queries in parallel, each for a 
different token range, and then sum the results. That’s what something like 
Spark would be doing. If you want to determine rows per node, limit the token 
range to the one owned by that node (easier with 1 token than with vnodes; 
with vnodes, repeat num_tokens times).
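
As a rough sketch of the token-range idea above (not a tested production tool), the Python below splits the full token space into N contiguous sub-ranges and builds one COUNT query per sub-range. It assumes the default Murmur3Partitioner, whose tokens are signed 64-bit integers; the keyspace, table and partition key names (my_ks, my_table, id) are made up for illustration, and the helper names are hypothetical.

```python
# Sketch only: assumes Murmur3Partitioner (signed 64-bit tokens).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ranges(n):
    """Split the whole token space into n contiguous, non-overlapping
    inclusive (low, high) ranges."""
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 possible token values
    bounds = [MIN_TOKEN + (total * i) // n for i in range(n + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(n)]


def count_queries(keyspace, table, partition_key, n):
    """Build one SELECT COUNT(*) statement per token sub-range.

    keyspace, table and partition_key are placeholders supplied by the
    caller; nothing here talks to a cluster.
    """
    template = (
        "SELECT COUNT(*) FROM {ks}.{tbl} "
        "WHERE token({pk}) >= {lo} AND token({pk}) <= {hi}"
    )
    return [
        template.format(ks=keyspace, tbl=table, pk=partition_key, lo=lo, hi=hi)
        for lo, hi in split_token_ranges(n)
    ]


if __name__ == "__main__":
    # Hypothetical names; print the statements for inspection.
    for query in count_queries("my_ks", "my_table", "id", 4):
        print(query)
```

Each statement would then be run through whatever driver you use (in parallel if you like) and the returned counts summed. Smaller ranges mean more queries, but each one touches less data and is less likely to time out.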



From:  Jack Krupansky
Reply-To:  "user@cassandra.apache.org"
Date:  Friday, April 8, 2016 at 3:48 PM
To:  "user@cassandra.apache.org"
Subject:  1, 2, 3...

I'm afraid I don't have the solid answer to this obvious question: How do I get 
a fairly accurate count of (CQL) rows in a Cassandra table? 

Does SELECT COUNT (*) FROM <table-name> actually do it?

Does it really count (CQL) rows across all nodes and exclude replicated rows?

Is there a better/preferred technique? For example, is it more efficient to 
query the row count one node at a time?

And for bonus points: How do you count (CQL) rows for each node? Again, 
excluding replication.

-- Jack Krupansky
