[ 
https://issues.apache.org/jira/browse/CASSANDRA-11314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184854#comment-15184854
 ] 

Mircea Lemnaru commented on CASSANDRA-11314:
--------------------------------------------

I have run several tests to try to isolate the problem, and I have the 
following results:

The attached file testrun.log is the output of a Spark + Scala program I wrote 
to test fetching and counting data. It connects to the Cassandra cluster 
through Spark and does the following:

for (i <- 1 to 10) {
    val rowsCollected = sc.cassandraTable("billing", "collected_data_day")
      .where("collection_day in ?", Set(16462)).collect()
    println("Count (done on client side after data fetch): " +
      rowsCollected.length + " for day " + 16462)
}

The code fetches all the data "where collection_day in (16462)" to the client 
side and then counts it, also on the client side. The attached file shows the 
following output:

Count (done on client side after data fetch): 55300 for day 16462
Count (done on client side after data fetch): 55316 for day 16462
Count (done on client side after data fetch): 55300 for day 16462
Count (done on client side after data fetch): 55677 for day 16462
Count (done on client side after data fetch): 55300 for day 16462
Count (done on client side after data fetch): 55300 for day 16462
Count (done on client side after data fetch): 55300 for day 16462
Count (done on client side after data fetch): 55300 for day 16462
Count (done on client side after data fetch): 55677 for day 16462
Count (done on client side after data fetch): 55677 for day 16462

Because we fetch the data to the client side and only then count it, this 
indicates that the data fetch itself is also flawed: for some reason we are 
missing chunks of data. I have not analysed the data yet, just counted it.
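
One way to go beyond counting would be to compare the primary keys returned by 
two runs. Below is a minimal sketch, assuming the rows of each run have already 
been reduced to their primary key (collection_day, timestamp, record_id). 
RowKey and RunDiff are hypothetical names used only for illustration; they are 
not part of the test program or of the connector.

```scala
import java.util.UUID

// Hypothetical key type: the table's full primary key
// (collection_day, timestamp, record_id), with the timestamp as epoch millis.
final case class RowKey(collectionDay: Int, timestampMillis: Long, recordId: UUID)

object RunDiff {
  // Keys that the fuller run returned but the shorter run did not.
  def missingFrom(fuller: Set[RowKey], shorter: Set[RowKey]): Set[RowKey] =
    fuller.diff(shorter)

  // Earliest and latest timestamp among the given keys, or None if there are none.
  def timeRange(keys: Set[RowKey]): Option[(Long, Long)] =
    if (keys.isEmpty) None
    else {
      val ts = keys.toSeq.map(_.timestampMillis)
      Some((ts.min, ts.max))
    }
}
```

If the missing keys fall into a narrow timestamp range, the gap is probably one 
contiguous chunk; if they are spread over the whole day, it is not.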
-----------------

for (i <- 1 to 20) {
    val result = sc.cassandraTable("billing", "collected_data_day")
      .where("collection_day in ?", Set(16462)).cassandraCount()
    println("Count (done on cassandra cluster): " + result)
}

In the code above we count the number of rows in the partition on the 
Cassandra side, so there is no client-side data fetch; all the processing is 
done on the Cassandra / Spark nodes.

The output for this code:

Count (done on cassandra cluster): 55300
Count (done on cassandra cluster): 55677
Count (done on cassandra cluster): 55677
Count (done on cassandra cluster): 55300
Count (done on cassandra cluster): 55677
Count (done on cassandra cluster): 55677
Count (done on cassandra cluster): 55634
Count (done on cassandra cluster): 55530
Count (done on cassandra cluster): 55677
Count (done on cassandra cluster): 55342
Count (done on cassandra cluster): 55300
Count (done on cassandra cluster): 55630
Count (done on cassandra cluster): 55677
Count (done on cassandra cluster): 55297
Count (done on cassandra cluster): 55677
Count (done on cassandra cluster): 55677
Count (done on cassandra cluster): 55352
Count (done on cassandra cluster): 55300
Count (done on cassandra cluster): 55677
Count (done on cassandra cluster): 55300

As you can see, the counts are still inconsistent.
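
To see which values recur, the 20 cluster-side counts listed above can be 
tallied. This is only a small sketch over the numbers already shown; the object 
name CountTally is made up for illustration.

```scala
object CountTally {
  // The 20 cluster-side counts reported above, in run order.
  val counts: Seq[Int] = Seq(
    55300, 55677, 55677, 55300, 55677, 55677, 55634, 55530, 55677, 55342,
    55300, 55630, 55677, 55297, 55677, 55677, 55352, 55300, 55677, 55300)

  // Map each distinct count to the number of runs that returned it.
  def tally(xs: Seq[Int]): Map[Int, Int] =
    xs.groupBy(identity).map { case (value, runs) => value -> runs.size }

  def main(args: Array[String]): Unit = {
    // Print the most frequent values first.
    tally(counts).toSeq.sortBy { case (_, n) => -n }.foreach {
      case (value, n) => println(s"$value returned by $n run(s)")
    }
    println(s"spread between min and max: ${counts.max - counts.min}")
  }
}
```

On these numbers, 55677 comes back in 9 of the 20 runs and 55300 in 5; the 
other six values appear once each.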

Any ideas? What should I look at next?

Thanks
Mircea

> Inconsistent select count(*)
> ----------------------------
>
>                 Key: CASSANDRA-11314
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11314
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>         Environment: Ubuntu 14.04 LTS
>            Reporter: Mircea Lemnaru
>            Assignee: Benjamin Lerer
>         Attachments: vnodes_and_hosts
>
>
> Hello,
> I currently have this setup: 
> Cassandra 3.3 (Community edition downloaded from Datastax) installed on 3 
> nodes and I have created this table:
> CREATE TABLE billing.collected_data_day (
>     collection_day int,
>     timestamp timestamp,
>     record_id uuid,
>     dimensions map<text, text>,
>     entity_id text,
>     measurements map<text, text>,
>     source_id text,
>     PRIMARY KEY (collection_day, timestamp, record_id)
> ) WITH CLUSTERING ORDER BY (timestamp ASC, record_id ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND comment = ''
>     AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
> 'max_threshold': '32', 'min_threshold': '4'}
>     AND compression = {'chunk_length_in_kb': '64', 'class': 
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 1.0
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
> As you can see, this table is partitioned by collection_day, because at the 
> end of each day we need fast access to all the data generated that day. 
> collection_day is the number of days since 1970.
> In this table we have inserted roughly 12 million rows for testing purposes 
> and ran a simple count. As you can see, the results vary:
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55341
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55372
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55300
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55300
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55300
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55303
> (1 rows)
> cqlsh:billing> select count(*) from collected_data_day where 
> collection_day=16462;
>  count
> -------
>  55374
> (1 rows)
> I am running the query from the seed node of the Cassandra cluster. As you 
> can see, most of the results differ, and I don't know the reason for this. 
> We are not writing anything to the cluster at this time; we are only 
> querying it, and only through this cqlsh session.
> This is very similar to CASSANDRA-8940, but that issue targets 2.1.x.
> Could it be that we are having the same issue in 3.3?
> Please let me know what extra info I can provide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
