Wouldn't running *nodetool cfstats* on every node, then summing the "number
of keys" values and dividing by the replication factor, give you a decent
approximation? Or do you really need a completely precise number?
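The arithmetic behind that estimate can be sketched as follows (the node
names and per-node figures are hypothetical, standing in for the "Number of
keys (estimate)" line that cfstats prints per table):

```python
# Rough row-count estimate from per-node `nodetool cfstats` key estimates.
# Node names and counts below are made-up placeholders.
per_node_key_estimates = {
    "node1": 1_200_000,
    "node2": 1_180_000,
    "node3": 1_220_000,
}
replication_factor = 3

# Every row is stored on RF nodes, so summing the per-node estimates
# over-counts by roughly a factor of RF.
approx_rows = sum(per_node_key_estimates.values()) // replication_factor
print(approx_rows)
```

Note this is only an approximation: the cfstats figure is itself an estimate,
and it counts partition keys rather than CQL rows.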

On Mon, 11 Apr 2016 at 16:18 Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> Agreed that anything requiring a full table scan, short of batch
> analytics, is an antipattern, although the goal is not to do a full scan
> per se, but just to get the row count. It still surprises people that
> Cassandra cannot quickly compute COUNT(*). The easy answer: use DSE Search
> and run a Solr query for q=*:*, which will return the total row count very
> quickly. I presume that Stratio will handle this fine as well.
>
>
> -- Jack Krupansky
>
> On Mon, Apr 11, 2016 at 11:10 AM, <sean_r_dur...@homedepot.com> wrote:
>
>> Cassandra is not good for table scan type queries (which count(*)
>> typically is). While there are some attempts to do that (as noted below),
>> this is a path I avoid.
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Max C [mailto:mc_cassan...@core43.com]
>> *Sent:* Saturday, April 09, 2016 6:19 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: 1, 2, 3...
>>
>>
>>
>> Looks like this guy (Brian Hess) wrote a script to split the token range
>> and run count(*) on each subrange:
>>
>>
>>
>> https://github.com/brianmhess/cassandra-count
>>
>>
>>
>> - Max
>>
>>
>>
>> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>> wrote:
>>
>>
>>
>> SELECT COUNT(*) probably works (with internal paging) on many datasets
>> with enough time and assuming you don’t have any partitions that will kill
>> you.
>>
>>
>>
>> No, it doesn’t count extra replicas / duplicates.
>>
>>
>>
>> The old way to do this (before paging / fetch size) was to use manual
>> paging based on tokens/clustering keys:
>>
>>
>>
>> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
>> SELECT’s WHERE clause can use token(), which is what you’d want to use to
>> page through the whole token space.
>>
>>
>>
>> You could, in theory, issue thousands of queries in parallel, all for
>> different token ranges, and then sum the results. That’s what something
>> like spark would be doing. If you want to determine rows per node, limit
>> the token range to that owned by the node (easier with 1 token than vnodes,
>> with vnodes repeat num_tokens times).
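Splitting the token space for those parallel COUNT(*) queries is simple
arithmetic. A minimal sketch, assuming Murmur3Partitioner (whose tokens span
-2^63 to 2^63-1); each (start, end) pair would back one
`WHERE token(pk) >= start AND token(pk) <= end` query:

```python
# Split the full Murmur3 token ring into n contiguous subranges so COUNT(*)
# can be issued per subrange and the results summed.
MIN_TOKEN = -2**63       # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1    # Murmur3Partitioner maximum token

def split_token_range(n):
    """Yield n contiguous (start, end) subranges covering the whole ring."""
    total = MAX_TOKEN - MIN_TOKEN + 1
    step = total // n
    start = MIN_TOKEN
    for i in range(n):
        # Last subrange absorbs any remainder so the ring is fully covered.
        end = MAX_TOKEN if i == n - 1 else start + step - 1
        yield (start, end)
        start = end + 1

ranges = list(split_token_range(4))
```

This is essentially what the cassandra-count script linked earlier does, and
what Spark's Cassandra connector does under the hood.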
>>
>>
>>
>>
>
>
