SELECT COUNT(*) probably works (with internal paging) on many datasets with enough time and assuming you don’t have any partitions that will kill you.
No, it doesn’t count extra replicas / duplicates.

The old way to do this (before paging / fetch size) was to use manual paging based on tokens/clustering keys:

https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html

SELECT’s WHERE clause can use token(), which is what you’d want to use to page through the whole token space.

You could, in theory, issue thousands of queries in parallel, all for different token ranges, and then sum the results. That’s what something like Spark would be doing.

If you want to determine rows per node, limit the token range to that owned by the node (easier with 1 token than vnodes; with vnodes, repeat num_tokens times).

From: Jack Krupansky
Reply-To: "user@cassandra.apache.org"
Date: Friday, April 8, 2016 at 3:48 PM
To: "user@cassandra.apache.org"
Subject: 1, 2, 3...

I'm afraid I don't have the solid answer to this obvious question:

How do I get a fairly accurate count of (CQL) rows in a Cassandra table?

Does SELECT COUNT (*) FROM <table-name> actually do it? Does it really count (CQL) rows across all nodes and exclude replicated rows?

Is there a better/preferred technique? For example, is it more efficient to query the row count one node at a time?

And for bonus points: How do you count (CQL) rows for each node? Again, excluding replication.

-- Jack Krupansky
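
For illustration, here is a rough sketch of the token-range splitting described in the reply above. It assumes the cluster uses Murmur3Partitioner (token space from -2^63 to 2^63 - 1); the helper names split_token_ranges and count_queries, and the keyspace, table and key names passed to them, are hypothetical. It only builds the per-range COUNT statements. Running them (in parallel or not) and summing the results is left to whatever driver you use.

```python
# Assumption: Murmur3Partitioner, whose tokens are signed 64-bit integers.
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def split_token_ranges(num_splits, lo=MIN_TOKEN, hi=MAX_TOKEN):
    """Split the inclusive range [lo, hi] into num_splits contiguous,
    non-overlapping inclusive (start, end) sub-ranges."""
    total = hi - lo + 1
    if num_splits < 1 or num_splits > total:
        raise ValueError("num_splits must be between 1 and the range size")
    base, extra = divmod(total, num_splits)
    ranges = []
    start = lo
    for i in range(num_splits):
        # Spread any remainder over the first `extra` sub-ranges.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def count_queries(keyspace, table, partition_key, num_splits):
    """Build one SELECT COUNT(*) statement per token sub-range.

    partition_key is the partition key column list as it would appear
    inside token(...), e.g. "id" or "tenant, id" for a composite key.
    """
    template = (
        "SELECT COUNT(*) FROM {ks}.{tbl} "
        "WHERE token({pk}) >= {start} AND token({pk}) <= {end}"
    )
    return [
        template.format(ks=keyspace, tbl=table, pk=partition_key,
                        start=start, end=end)
        for start, end in split_token_ranges(num_splits)
    ]


if __name__ == "__main__":
    # Hypothetical keyspace/table/column names, four sub-ranges.
    for query in count_queries("my_ks", "my_table", "id", 4):
        print(query)
```

To count rows per node, the same idea applies with lo and hi set to the bounds of each token range that node owns as primary, rather than the full token space.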