[jira] [Commented] (CASSANDRA-11314) Inconsistent select count(*)

Benjamin Lerer (JIRA) Wed, 30 Mar 2016 05:06:08 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217860#comment-15217860
 ]


Benjamin Lerer commented on CASSANDRA-11314:
--------------------------------------------

I am not sure why your replicas did not contain the same data, given that the 
servers seem to be on the same rack, but once you are in that situation the 
results make sense and your explanation seems to be the right one.

As your {{read_repair_chance}} is {{0.0}}, your data will not be corrected by 
read repair.

{quote}From these results ... it's clear that 55300 came from counting the 
items from node1 and that 55677 came from node3 .... but what about 55634, 
55342, 55352 ? Where were these results coming from ... ?{quote}

The situation is not that simple because of {{paging}}. Cassandra and the driver 
will split your request into multiple page requests, and each of them might be 
executed on either of the 2 replicas. The mechanism is slightly different for 
simple requests and aggregate ones, but the result is similar: {{a random 
number of rows will be missing from the result}}.
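
The effect of paging described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Cassandra's actual read path: the function name {{paged_count}}, the page size of 100, and the synthetic replicas (one complete with 1000 rows, one missing 50 rows) are all hypothetical and chosen for illustration. Each page is served by a randomly chosen replica and resumes after the last clustering key already returned, so the total depends on which replica answered which page.

```python
import random
from bisect import bisect_right


def paged_count(replicas, page_size, rng):
    """Toy model: count rows page by page, each page served by a random replica.

    `replicas` is a list of sorted lists of clustering keys (hypothetical data).
    """
    count = 0
    cursor = None  # clustering key of the last row returned so far
    while True:
        rows = rng.choice(replicas)  # any replica may serve this page
        start = 0 if cursor is None else bisect_right(rows, cursor)
        page = rows[start:start + page_size]
        count += len(page)
        if len(page) < page_size:
            # A short page means the serving replica has no more rows.
            return count
        cursor = page[-1]


# Synthetic, made-up data: replica_b is complete, replica_a lacks 50 rows.
rng = random.Random(42)
full = list(range(1000))
replica_a = sorted(rng.sample(full, 950))
replica_b = full

counts = [paged_count([replica_a, replica_b], 100, rng) for _ in range(5)]
print(counts)
```

In this model every run returns a value between the size of the incomplete replica and the size of the complete one, and the exact value changes with the sequence of replicas that happened to serve the pages. That is the same pattern as the intermediate counts quoted above.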

Unless you still have some issues, I will close the ticket as "Not a Problem".

> Inconsistent select count(*)
> ----------------------------
>
>                 Key: CASSANDRA-11314
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11314
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>         Environment: Ubuntu 14.04 LTS
>            Reporter: Mircea Lemnaru
>            Assignee: Benjamin Lerer
>         Attachments: testrun.log, vnodes_and_hosts
>
>
> Hello,
> I currently have this setup: 
> Cassandra 3.3 (Community edition downloaded from Datastax) installed on 3 
> nodes and I have created this table:
> CREATE TABLE billing.collected_data_day (
>     collection_day int,
>     timestamp timestamp,
>     record_id uuid,
>     dimensions map<text, text>,
>     entity_id text,
>     measurements map<text, text>,
>     source_id text,
>     PRIMARY KEY (collection_day, timestamp, record_id)
> ) WITH CLUSTERING ORDER BY (timestamp ASC, record_id ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND comment = ''
>     AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
> 'max_threshold': '32', 'min_threshold': '4'}
>     AND compression = {'chunk_length_in_kb': '64', 'class': 
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 1.0
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
> This table, as you can see, is partitioned by collection_day. This is because 
> at the end of the day we need to have fast access to all the data generated 
> in a day. collection_day is the day number counted from 1970.
> In this table we have inserted roughly 12 million rows for testing purposes 
> and we did a simple count. As you can see, the results vary ... 
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55341
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55372
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55300
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55300
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55300
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55303
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55374
> (1 rows)
> I am running the query from the seed node of the Cassandra cluster. As you 
> can see, most of the results vary and I don't know the reason for this. 
> We are not writing anything into the cluster at this time; we are only 
> querying the cluster, and only through cqlsh.
> This is very similar to CASSANDRA-8940, but that is targeted at 2.1.x.
> Could it be that we are having the same issue in 3.3? 
> Please let me know what extra info I can provide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
