[ 
https://issues.apache.org/jira/browse/CASSANDRA-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frens Jan Rumph updated CASSANDRA-8940:
---------------------------------------
    Attachment: Vagrantfile
                setup_hosts.sh
                install_cassandra.sh

Great [~blerer]!

As I said before, I had trouble reproducing the issue with CCM. The set-up in 
which I could reproduce it was built on Vagrant + LXC ... which I didn't want 
to bother you with ;) So I put some effort into a more general set-up based on 
Vagrant + VirtualBox; see the attached files.

The Vagrantfile creates a 3-node cluster on CentOS 7 with Cassandra 2.1 (2.1.4 
at the time of writing; this depends on the packaging by DataStax, so it might 
bump to a newer patch version in the future).

At first I thought I had the same issue as when trying to use CCM, but 
apparently I needed to increase the number of rows written from 50k to 500k 
(with 5 ids, 10 buckets each (so 50 partitions) and 10k rows per partition).
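
As a sanity check on the numbers above (plain arithmetic, not part of the test 
script):

```python
# Partition math for the 500k-row configuration described above.
ids = 5           # distinct id values
buckets = 10      # buckets per id
offsets = 10000   # rows (offsets) per partition

partitions = ids * buckets      # 50 unique (id, bucket) partition keys
total_rows = partitions * offsets

print(partitions, total_rows)   # 50 500000
```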

Example output from my setup:
{code}
connecting to 192.168.33.11, 192.168.33.12, 192.168.33.13
setting up schema
inserting data for 5 ids, 10 buckets and 10000 offsets
inserted 500000 rows
queried count was 494495 (fail)
queried count was 493530 (fail)
queried count was 494604 (fail)
queried count was 490000 (fail)
queried count was 500000
queried count was 494382 (fail)
queried count was 494204 (fail)
queried count was 494625 (fail)
queried count was 500000
queried count was 494758 (fail)
{code}

Note that I have slightly modified the script to accept contact points for 
{{cassandra.cluster.Cluster(...)}} and also increased the number of rows 
inserted, as mentioned before. It can be executed with e.g. {{python2 test.py 
192.168.33.11 192.168.33.12 192.168.33.13}}.
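
The contact-point handling I added is roughly the following (a sketch; the 
actual script may differ, and the helper name is mine):

```python
import sys

def contact_points(argv):
    """Contact points from the command line, defaulting to localhost.

    The resulting list is what gets passed to
    cassandra.cluster.Cluster(contact_points=...) in the test script
    (the driver call itself is not shown here).
    """
    return argv[1:] or ['127.0.0.1']

# e.g. invoked as: python2 test.py 192.168.33.11 192.168.33.12 192.168.33.13
print(contact_points(['test.py', '192.168.33.11', '192.168.33.12', '192.168.33.13']))
```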

I haven't had the time to do something like a proper sweep of the variables, 
but I tried a configuration with 5 ids, 1 bucket per id (so 5 unique partition 
keys) and 100k rows per partition, which also seems to fail, but in a 
different and perhaps interesting way, for example:

{code}
setting up schema
inserting data for 5 ids, 1 buckets and 100000 offsets
inserted 500000 rows
queried count was 500000
queried count was 500000
queried count was 403172 (fail)
queried count was 500000
queried count was 500000
queried count was 302821 (fail)
queried count was 500000
queried count was 500000
queried count was 304049 (fail)
queried count was 500000
{code}

With 5 ids, 100 buckets per id and 1k rows per partition, things do seem to 
pan out better in my set-up: only one failure out of ten (in a particular run):
{code}
connecting to 192.168.33.11, 192.168.33.12, 192.168.33.13
setting up schema
inserting data for 5 ids, 100 buckets and 1000 offsets
inserted 500000 rows
queried count was 500000
queried count was 500000
queried count was 500000
queried count was 500000
queried count was 500000
queried count was 498740 (fail)
queried count was 500000
queried count was 500000
queried count was 500000
queried count was 500000
{code}
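
For reference, the repeated count check amounts to something like this (a 
sketch, assuming the script simply re-runs the same count query; the function 
and parameter names are illustrative):

```python
def check_counts(run_count_query, expected, attempts=10):
    """Execute the same count query repeatedly, flagging mismatches.

    run_count_query: callable returning the row count; with a real driver
    session this would be built on something like
    session.execute('SELECT COUNT(*) FROM tbl').
    """
    matches = []
    for _ in range(attempts):
        count = run_count_query()
        suffix = '' if count == expected else ' (fail)'
        print('queried count was %d%s' % (count, suffix))
        matches.append(count == expected)
    return matches
```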

> Inconsistent select count and select distinct
> ---------------------------------------------
>
>                 Key: CASSANDRA-8940
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8940
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 2.1.2
>            Reporter: Frens Jan Rumph
>            Assignee: Benjamin Lerer
>         Attachments: Vagrantfile, install_cassandra.sh, setup_hosts.sh
>
>
> When performing {{select count(*) from ...}} I expect the results to be 
> consistent over multiple query executions if the table at hand is not written 
> to / deleted from in the meantime. However, in my set-up it is not: the 
> counts returned vary considerably (by several percent). The same holds for 
> {{select distinct partition-key-columns from ...}}.
> I have a table in a keyspace with replication_factor = 1 which is something 
> like:
> {code}
> CREATE TABLE tbl (
>     id frozen<id_type>,
>     bucket bigint,
>     offset int,
>     value double,
>     PRIMARY KEY ((id, bucket), offset)
> )
> {code}
> The frozen udt is:
> {code}
> CREATE TYPE id_type (
>     tags map<text, text>
> );
> {code}
> The table contains around 35k rows (I'm not trying to be funny here ...). The 
> consistency level for the queries was ONE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
