[
https://issues.apache.org/jira/browse/CASSANDRA-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535458#comment-14535458
]
Benjamin Lerer commented on CASSANDRA-8940:
-------------------------------------------
[~frensjan], you were not able to reproduce the problem with ccm because, by
default, ccm disables {{vnodes}}. This means that the data will be distributed
over only 3 contiguous ranges and the {{StoreProxy}} will have to perform at
most 2 requests.
If the ccm cluster is created with the {{vnodes}} option, the data will be
distributed over 256 * 3 = 768 ranges and the problem is easily reproducible.
In our scenario, for each page of data (5000 CQL rows), the {{StoreProxy}} will
first issue a single request and check whether enough results were returned. If
not enough results have been returned, it will estimate, based on the amount of
data returned for the first range, how many more ranges it needs to query and
will query them in parallel.
In the worst case, where no results have been found in the first range, the
{{StoreProxy}} will assume that we only have a small amount of data per range
and will issue 767 concurrent requests to get the remaining data.
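The estimate described above can be illustrated with a minimal sketch. This is
not the actual Cassandra code: the function name, its parameters and the exact
rounding are assumptions made for illustration only.

```python
import math

VNODES_PER_NODE = 256   # vnodes per node, as in the scenario above
NODES = 3
TOTAL_RANGES = VNODES_PER_NODE * NODES   # 768 ranges in total


def ranges_to_query_next(rows_from_first_range, page_size, remaining_ranges):
    """Hypothetical estimate of how many ranges to query in parallel
    after the first range has been read."""
    missing_rows = page_size - rows_from_first_range
    if missing_rows <= 0:
        # The first range already filled the page.
        return 0
    if rows_from_first_range == 0:
        # Nothing found: assume very little data per range, query everything.
        return remaining_ranges
    # Assume the other ranges hold about as many rows as the first one.
    needed = math.ceil(missing_rows / rows_from_first_range)
    return min(remaining_ranges, needed)


# Worst case: an empty first range triggers all remaining ranges at once.
worst_case = ranges_to_query_next(0, 5000, TOTAL_RANGES - 1)
```

With an empty first range and a page size of 5000, the sketch asks for all 767
remaining ranges at once, which is the worst case described above.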
A third of those requests will target ranges of data located on the
coordinator node.
Cassandra optimises those local requests by not serializing and deserializing
them.
The problem was that the {{SliceQueryFilter}}, which is part of the request and
is used to filter the data, ended up being shared between the threads. It
should not have been shared, as it is mutable.
I described the worst-case scenario, but the problem could also occur with a
smaller number of concurrent requests.
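To illustrate why sharing a mutable filter loses rows, here is a small sketch.
{{LimitFilter}} and {{read_range}} are hypothetical stand-ins and do not
reproduce the internals of {{SliceQueryFilter}}. The two reads run one after
the other so that the outcome is deterministic. With real threads the
interleaving, and therefore the number of lost rows, would presumably vary from
run to run.

```python
import copy


class LimitFilter:
    """Hypothetical mutable filter: accepts at most `limit` rows."""

    def __init__(self, limit):
        self.limit = limit
        self.seen = 0  # mutable state that must belong to a single request

    def accept(self, row):
        if self.seen >= self.limit:
            return False
        self.seen += 1
        return True


def read_range(rows, row_filter):
    """Return the rows of one range that pass the filter."""
    return [row for row in rows if row_filter.accept(row)]


range_a = ["a1", "a2", "a3"]
range_b = ["b1", "b2", "b3"]

# Buggy variant: both local requests reuse the same filter instance,
# so the second read starts with the counter left by the first one.
shared = LimitFilter(limit=4)
shared_result = read_range(range_a, shared) + read_range(range_b, shared)

# Fixed variant: every request works on its own copy of the filter.
template = LimitFilter(limit=4)
isolated_result = (read_range(range_a, copy.copy(template))
                   + read_range(range_b, copy.copy(template)))

print(len(shared_result), len(isolated_result))
```

In this sketch the shared filter drops rows from the second range, while the
per-request copies return every row of both ranges.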
> Inconsistent select count and select distinct
> ---------------------------------------------
>
> Key: CASSANDRA-8940
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8940
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: 2.1.2
> Reporter: Frens Jan Rumph
> Assignee: Benjamin Lerer
> Attachments: 7b74fb00-e935-11e4-b10c-317579db7eb4.csv,
> 8d5899d0-e935-11e4-847b-2d06da75a6cd.csv, Vagrantfile, install_cassandra.sh,
> setup_hosts.sh
>
>
> When performing {{select count( * ) from ...}} I expect the results to be
> consistent over multiple query executions if the table at hand is not written
> to / deleted from in the meantime. However, in my set-up they are not. The
> counts returned vary considerably (by several percent). The same holds for
> {{select distinct partition-key-columns from ...}}.
> I have a table in a keyspace with replication_factor = 1 which is something
> like:
> {code}
> CREATE TABLE tbl (
> id frozen<id_type>,
> bucket bigint,
> offset int,
> value double,
> PRIMARY KEY ((id, bucket), offset)
> )
> {code}
> The frozen udt is:
> {code}
> CREATE TYPE id_type (
> tags map<text, text>
> );
> {code}
> The table contains around 35k rows (I'm not trying to be funny here ...). The
> consistency level for the queries was ONE.