Igor Novgorodov created CASSANDRA-13379:
-------------------------------------------
             Summary: SASI index returns duplicate rows
                 Key: CASSANDRA-13379
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13379
             Project: Cassandra
          Issue Type: Bug
          Components: sasi
            Reporter: Igor Novgorodov


{code}
CREATE TABLE bulks_recipients (
    bulk_id uuid,
    recipient text,
    bulk_id_idx uuid,
    status int,
    ts timestamp,
    PRIMARY KEY ((bulk_id, recipient))
)
{code}

*bulk_id_idx* is just a copy of *bulk_id*, because SASI does not work on partition key components at all for some reason.

{code}
CREATE CUSTOM INDEX bulks_recipients_bulk_id ON bulks_recipients (bulk_id_idx) USING 'org.apache.cassandra.index.sasi.SASIIndex';
{code}

Then I insert 1 million rows with the same *bulk_id* and different *recipient* values.

Then
{code}
> select count(*) from bulks_recipients ;

 count
---------
 1000000

(1 rows)
{code}

OK, it's fine here. Now let's query by SASI:
{code}
> select count(*) from bulks_recipients where bulk_id_idx = fedd95ec-2cc8-4040-8619-baf69647700b;

 count
---------
 1010101

(1 rows)
{code}

Hmm, a very strange count: 10101 extra rows.

OK, I've dumped the query result into a text file:
{code}
# cat sasi.txt | wc -l
1000200
{code}

Here we have 200 extra rows for some reason.

Let's check if these are duplicates:
{code}
# cat sasi.txt | sort | uniq | wc -l
1000000
{code}

Yep, it looks like they are.

Recreating the index does not help.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
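
The `sort | uniq | wc -l` check above confirms that duplicates exist, but not which rows are repeated or how often. A minimal sketch of a more detailed check follows, assuming the dump (`sasi.txt` in the report) holds one result row per line. The function names `duplicate_rows` and `summarize` are hypothetical helpers for illustration and are not part of Cassandra or cqlsh.

```python
from collections import Counter


def duplicate_rows(lines):
    """Return {row: occurrences} for every row that appears more than once."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return {row: n for row, n in counts.items() if n > 1}


def summarize(lines):
    """Return (total rows, distinct rows, surplus copies) for a dump.

    `lines` is any iterable of strings, for example an open file with one
    result row per line (an assumption about the dump format).
    """
    rows = [line.rstrip("\n") for line in lines]
    dups = duplicate_rows(rows)
    # Each duplicated row contributes (occurrences - 1) surplus copies.
    surplus = sum(n - 1 for n in dups.values())
    return len(rows), len(set(rows)), surplus


# Possible usage against the dump from the report:
#     with open("sasi.txt") as f:
#         total, distinct, surplus = summarize(f)
```

Listing the repeated rows from `duplicate_rows` may help show whether the duplicates cluster around particular *recipient* values, which the aggregate counts alone cannot reveal.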