[
https://issues.apache.org/jira/browse/CASSANDRA-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492700#comment-14492700
]
Frens Jan Rumph commented on CASSANDRA-8940:
--------------------------------------------
[~blerer], sorry for the delay ... been a bit busy past few weeks.
I've whipped up a script which should reproduce my problems:
{code}
import cassandra.cluster
import cassandra.concurrent
import string


def setup_schema(session):
    print("setting up schema")
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS count_test WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': 1};")
    session.set_keyspace("count_test")
    session.execute("""
        CREATE TABLE IF NOT EXISTS tbl (
            id text,
            bucket bigint,
            offset int,
            value double,
            PRIMARY KEY ((id, bucket), offset)
        )
    """)


def insert_test_data(session):
    # setup parameters for the inserts
    # (ascii_lowercase and range work on both Python 2 and 3)
    ids = string.ascii_lowercase[:5]
    bucket_count = 10
    offset_count = 1000
    print('inserting data for %s ids, %s buckets and %s offsets' %
          (len(ids), bucket_count, offset_count))

    # clear the table
    session.execute("TRUNCATE tbl;")

    # prepare the insert
    insert = session.prepare(
        "INSERT INTO tbl (id, bucket, offset, value) VALUES (?, ?, ?, ?)")

    # insert a CQL row for each id, bucket and offset
    inserts = [
        (insert, (t, b, o, 0.0))
        for t in ids
        for b in range(bucket_count)
        for o in range(offset_count)
    ]
    _ = cassandra.concurrent.execute_concurrent(session, inserts)
    return len(inserts)


if __name__ == '__main__':
    contact_points = ['cas-1', 'cas-2', 'cas-3']
    session = cassandra.cluster.Cluster(contact_points).connect()
    try:
        setup_schema(session)
        inserted = insert_test_data(session)
        print("inserted %s rows" % inserted)

        # run the same count ten times and flag every result that is off
        for _ in range(10):
            count = session.execute("SELECT count(*) FROM tbl")[0].count
            print('queried count was %s%s' %
                  (count, '' if count == inserted else ' (fail)'))
    finally:
        session.shutdown()
{code}
In my setup this yields (on a particular run):
{code}
setting up schema
inserting data for 5 ids, 10 buckets and 1000 offsets
inserted 50000 rows
queried count was 50000
queried count was 49396 (fail)
queried count was 49918 (fail)
queried count was 50000
queried count was 50000
queried count was 50000
queried count was 49993 (fail)
queried count was 48997 (fail)
queried count was 49772 (fail)
queried count was 49551 (fail)
{code}
As you can see, the counts vary. The number of failures seems to be correlated
with the number of rows in the cluster. E.g. with only 1000 rows there are no
wrong counts.
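To put a number on the variation, here is a minimal sketch that summarizes the
ten counts from the run shown above. The helper name summarize_counts is made up
for illustration; the input values are copied verbatim from that output.

```python
def summarize_counts(counts, expected):
    """Summarize how often and how far repeated counts deviate from `expected`."""
    wrong = [c for c in counts if c != expected]
    # largest absolute deviation; 0 when every count matched
    worst = max([abs(expected - c) for c in wrong] or [0])
    return {
        'runs': len(counts),
        'failures': len(wrong),
        'worst_deviation': worst,
        'worst_deviation_pct': 100.0 * worst / expected,
    }


# counts copied from the run shown above (50000 rows were inserted)
observed = [50000, 49396, 49918, 50000, 50000,
            50000, 49993, 48997, 49772, 49551]
summary = summarize_counts(observed, 50000)
print('%(failures)s of %(runs)s counts were wrong; worst deviation was '
      '%(worst_deviation)s rows (%(worst_deviation_pct).2f%%)' % summary)
```

For this particular run that is 6 wrong counts out of 10, with the worst one
missing 1003 rows (about 2%).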
As for my set-up: I'm using a three node cluster (cas-1, cas-2 and cas-3) which
runs on Vagrant + LXC. I planned on writing a script using CCM to be portable,
but I wasn't able to reproduce the results with CCM! I've tried both Cassandra
2.1.2 and 2.1.4 with CCM. That was rather disappointing. Or, looking at it
differently ... it might be considered a clue to where things go wrong ...
Any of this ring a bell? Do you perhaps have pointers for me to dig deeper?
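One possible way to dig deeper, sketched under assumptions: count each
partition separately and list the ones that come up short, to see whether the
missing rows cluster in particular partitions. The helper name
find_short_partitions is hypothetical, it assumes a session whose keyspace is
already set to count_test, and it is unknown whether per-partition counts show
the same problem as the full-table count.

```python
def find_short_partitions(session, ids, bucket_count, offset_count):
    """Return {(id, bucket): count} for partitions whose count != offset_count.

    Assumption: `session` is a connected driver session with the keyspace
    already set to count_test, as in the reproduction script.
    """
    select = session.prepare(
        "SELECT count(*) FROM tbl WHERE id = ? AND bucket = ?")
    short = {}
    for i in ids:
        for b in range(bucket_count):
            # each partition should hold exactly offset_count rows
            count = session.execute(select, (i, b))[0].count
            if count != offset_count:
                short[(i, b)] = count
    return short
```

With the parameters from the script above this would be called as
find_short_partitions(session, string.ascii_lowercase[:5], 10, 1000); an empty
result would suggest the rows are only lost when scanning across partitions.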
> Inconsistent select count and select distinct
> ---------------------------------------------
>
> Key: CASSANDRA-8940
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8940
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: 2.1.2
> Reporter: Frens Jan Rumph
> Assignee: Benjamin Lerer
>
> When performing {{select count( * ) from ...}} I expect the results to be
> consistent over multiple query executions if the table at hand is not written
> to / deleted from in the meantime. However, in my set-up they are not. The
> counts returned vary considerably (several percent). The same holds for
> {{select distinct partition-key-columns from ...}}.
> I have a table in a keyspace with replication_factor = 1 which is something
> like:
> {code}
> CREATE TABLE tbl (
> id frozen<id_type>,
> bucket bigint,
> offset int,
> value double,
> PRIMARY KEY ((id, bucket), offset)
> )
> {code}
> The frozen udt is:
> {code}
> CREATE TYPE id_type (
> tags map<text, text>
> );
> {code}
> The table contains around 35k rows (I'm not trying to be funny here ...). The
> consistency level for the queries was ONE.