Unless I'm mistaken, nodetool tablestats gives you the number of partitions (partition keys), not the number of primary keys. IOW, the term "keys" is ambiguous. That's why I phrased the original question as count of (CQL) rows, to distinguish from the pre-CQL3 concept of a partition being treated as a single row.
-- Jack Krupansky

On Mon, Apr 11, 2016 at 11:46 AM, Emīls Šolmanis <emils.solma...@gmail.com> wrote:

> Wouldn't the "number of keys" part of *nodetool cfstats* run on every
> node, summed and divided by the replication factor give you a decent
> approximation? Or are you really after a completely precise number?
>
> On Mon, 11 Apr 2016 at 16:18 Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
>> Agreed that anything requiring a full table scan, short of batch
>> analytics, is an antipattern, although the goal is not to do a full scan
>> per se, but just to get the row count. It still surprises people that
>> Cassandra cannot quickly compute COUNT(*). The easy answer: use DSE Search
>> and run a Solr query for q=*:*, which will return the total row count very
>> quickly. I presume that Stratio will handle this fine as well.
>>
>> -- Jack Krupansky
>>
>> On Mon, Apr 11, 2016 at 11:10 AM, <sean_r_dur...@homedepot.com> wrote:
>>
>>> Cassandra is not good for table-scan-type queries (which count(*)
>>> typically is). While there are some attempts to do that (as noted
>>> below), this is a path I avoid.
>>>
>>> Sean Durity
>>>
>>> *From:* Max C [mailto:mc_cassan...@core43.com]
>>> *Sent:* Saturday, April 09, 2016 6:19 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: 1, 2, 3...
>>>
>>> Looks like this guy (Brian Hess) wrote a script to split the token range
>>> and run count(*) on each subrange:
>>>
>>> https://github.com/brianmhess/cassandra-count
>>>
>>> - Max
>>>
>>> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>>>
>>> SELECT COUNT(*) probably works (with internal paging) on many datasets,
>>> given enough time and assuming you don't have any partitions that will
>>> kill you.
>>>
>>> No, it doesn't count extra replicas / duplicates.
>>>
>>> The old way to do this (before paging / fetch size) was to use manual
>>> paging based on tokens/clustering keys:
>>>
>>> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
>>> SELECT's WHERE clause can use token(), which is what you'd want to use
>>> to page through the whole token space.
>>>
>>> You could, in theory, issue thousands of queries in parallel, all for
>>> different token ranges, and then sum the results. That's what something
>>> like Spark would be doing. If you want to determine rows per node,
>>> limit the token range to that owned by the node (easier with 1 token
>>> than vnodes; with vnodes, repeat num_tokens times).
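The token-range technique Jeff describes (and that cassandra-count implements) can be sketched as below: split the full Murmur3 token space into contiguous subranges and issue one COUNT(*) per subrange, summing the results. The keyspace, table, and partition-key names are hypothetical, and actually executing the queries would require a driver session (e.g. `execute_async` in the DataStax Python driver):

```python
# Split the Murmur3 token space into subranges and generate a per-range
# COUNT(*) query for each one. Summing the per-range counts gives the
# total CQL row count without one monolithic full-table scan.

MIN_TOKEN = -2**63      # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1   # Murmur3Partitioner maximum token


def token_subranges(splits, lo=MIN_TOKEN, hi=MAX_TOKEN):
    """Split [lo, hi] into `splits` contiguous, non-overlapping
    (start, end) subranges that together cover the whole token space."""
    total = hi - lo + 1
    step = total // splits
    ranges = []
    start = lo
    for i in range(splits):
        # The last range absorbs any remainder so coverage is exact.
        end = hi if i == splits - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def count_query(keyspace, table, pk, start, end):
    # token() in the WHERE clause restricts the scan to one subrange.
    return (f"SELECT COUNT(*) FROM {keyspace}.{table} "
            f"WHERE token({pk}) >= {start} AND token({pk}) <= {end}")


# Run each query (in parallel, for throughput) and sum the counts.
for start, end in token_subranges(4):
    print(count_query("my_ks", "my_table", "id", start, end))
```

Because the subranges partition the token space exactly, each row is counted once, and the per-range queries are cheap enough to retry individually if one times out.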