[jira] [Commented] (CASSANDRA-8940) Inconsistent select count and select distinct

Frens Jan Rumph (JIRA) Mon, 13 Apr 2015 10:31:01 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492700#comment-14492700 ]


Frens Jan Rumph commented on CASSANDRA-8940:
--------------------------------------------

[~blerer], sorry for the delay; I've been a bit busy the past few weeks.

I've whipped up a script which should reproduce my problems: 

{code}
import cassandra.cluster
import cassandra.concurrent

import string
import sys


def setup_schema(session):
        print("setting up schema")

        session.execute("CREATE KEYSPACE IF NOT EXISTS count_test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};")
        session.set_keyspace("count_test")

        session.execute("""
                CREATE TABLE IF NOT EXISTS tbl (
                        id text,
                        bucket bigint,
                        offset int,
                        value double,
                        PRIMARY KEY ((id, bucket), offset)
                )
        """)


def insert_test_data(session):
        # setup parameters for the inserts
        ids = string.ascii_lowercase[:5]
        bucket_count = 10
        offset_count = 1000

        print('inserting data for %s ids, %s buckets and %s offsets' % (len(ids), bucket_count, offset_count))

        # clear the table
        session.execute("TRUNCATE tbl;")

        # prepare the insert
        insert = session.prepare("INSERT INTO tbl (id, bucket, offset, value) VALUES (?, ?, ?, ?)")

        # insert a CQL row for each tag, bucket and offset
        inserts = [
                (insert, (t, b, o, 0))
                for t in ids
                for b in range(bucket_count)
                for o in range(offset_count)
        ]
        _ = cassandra.concurrent.execute_concurrent(session, inserts)

        return len(inserts)


if __name__ == '__main__':
        contact_points = ['cas-1', 'cas-2', 'cas-3']
        session = cassandra.cluster.Cluster(contact_points).connect()

        try:
                setup_schema(session)
                inserted = insert_test_data(session)
                print("inserted %s rows" % inserted)

                for count in (session.execute("SELECT count(*) FROM tbl") for _ in range(10)):
                        print('queried count was %s%s' % (count[0].count, '' if count[0].count == inserted else ' (fail)'))
        finally:
                session.shutdown()
{code}

In my setup this yields (on a particular run):
{code}
setting up schema
inserting data for 5 ids, 10 buckets and 1000 offsets
inserted 50000 rows
queried count was 50000
queried count was 49396 (fail)
queried count was 49918 (fail)
queried count was 50000
queried count was 50000
queried count was 50000
queried count was 49993 (fail)
queried count was 48997 (fail)
queried count was 49772 (fail)
queried count was 49551 (fail)
{code}

As you can see, the counts vary. The number of failures seems to be correlated 
with the number of rows in the cluster: with only 1000 rows, for example, there 
are no wrong counts.
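
To put a number on how far off the counts are, a minimal sketch along these lines could summarise a run. It is not part of the script above: the {{summarise}} helper is hypothetical, and the values are simply copied from the output shown earlier.

```python
# Counts copied from the run shown above; `expected` is the number of rows
# the script reported as inserted.
expected = 50000
observed = [50000, 49396, 49918, 50000, 50000,
            50000, 49993, 48997, 49772, 49551]


def summarise(observed, expected):
    """Return (number of wrong counts, largest shortfall, largest shortfall in %)."""
    shortfalls = [expected - count for count in observed]
    wrong = sum(1 for s in shortfalls if s != 0)
    worst = max(shortfalls)
    return wrong, worst, 100.0 * worst / expected


wrong, worst, worst_pct = summarise(observed, expected)
print("%d of %d counts were wrong; worst shortfall %d rows (%.2f%%)"
      % (wrong, len(observed), worst, worst_pct))
```

In this run every wrong count is too low, never too high, which may itself be a hint that rows are being skipped rather than double counted.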

As for my set-up: I'm using a three-node cluster (cas-1, cas-2 and cas-3) which 
runs on Vagrant + LXC. I had planned to write the script using CCM to make it 
portable, but I wasn't able to reproduce the results with CCM! I've tried both 
Cassandra 2.1.2 and 2.1.4 with CCM. That was rather disappointing. Or, looking 
at it differently, it might be considered a clue as to where things go wrong ...

Any of this ring a bell? Do you perhaps have pointers for me to dig deeper?

> Inconsistent select count and select distinct
> ---------------------------------------------
>
>                 Key: CASSANDRA-8940
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8940
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 2.1.2
>            Reporter: Frens Jan Rumph
>            Assignee: Benjamin Lerer
>
> When performing {{select count( * ) from ...}} I expect the results to be 
> consistent over multiple query executions if the table at hand is not written 
> to / deleted from in the meantime. However, in my set-up it is not. The 
> counts returned vary considerably (several percent). The same holds for 
> {{select distinct partition-key-columns from ...}}.
> I have a table in a keyspace with replication_factor = 1 which is something 
> like:
> {code}
> CREATE TABLE tbl (
>     id frozen<id_type>,
>     bucket bigint,
>     offset int,
>     value double,
>     PRIMARY KEY ((id, bucket), offset)
> )
> {code}
> The frozen udt is:
> {code}
> CREATE TYPE id_type (
>     tags map<text, text>
> );
> {code}
> The table contains around 35k rows (I'm not trying to be funny here ...). The 
> consistency level for the queries was ONE.
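
The description above reports the same variation for {{select distinct}} on the partition key columns. A hedged sketch of how the reproduction script might be extended to cover that query follows. The {{repeat_query}} helper is hypothetical; it only assumes a session object whose {{execute}} method returns an iterable of rows, as the Python driver's session does.

```python
def repeat_query(session, cql, expected, runs=10):
    """Run `cql` `runs` times; return the row counts that differ from `expected`."""
    mismatches = []
    for _ in range(runs):
        # Materialise the result so that all pages are fetched and counted.
        rows = list(session.execute(cql))
        if len(rows) != expected:
            mismatches.append(len(rows))
    return mismatches


# Possible usage with the test table from the script (5 ids x 10 buckets,
# so 50 distinct partitions are expected):
#
#   bad = repeat_query(session, "SELECT DISTINCT id, bucket FROM tbl", expected=5 * 10)
#   print("wrong distinct counts: %s" % bad)
```

An empty list would mean that every run returned the expected number of partitions.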



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
