[ 
https://issues.apache.org/jira/browse/CASSANDRA-12756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15554686#comment-15554686
 ] 

Alex Petrov commented on CASSANDRA-12756:
-----------------------------------------

Individual SSTables cannot (and should not) contain any duplicates. The current 
storage format operates under the invariant that there is exactly one row for a 
given key. The merge iterators (which would normally reconcile results correctly, 
combining a live row with its tombstone and returning nothing if the tombstone 
supersedes the live row) also rely on that same invariant.
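
To make that invariant concrete, here is a minimal sketch (purely illustrative, 
not Cassandra's actual merge code) of the timestamp comparison that decides 
whether a tombstone shadows a live cell; the example timestamps are taken from 
the dump further down:

{code:title=reconciliation sketch (illustrative)|borderStyle=solid}
public class ReconcileSketch {

    // A deletion shadows a cell when its timestamp is >= the cell's write
    // timestamp; a shadowed cell is never returned to the client.
    static boolean isShadowed(long cellTimestamp, long deletionTimestamp) {
        return deletionTimestamp >= cellTimestamp;
    }

    public static void main(String[] args) {
        long cellTs = 1470878906618000L;      // humidity[0] write timestamp from the dump
        long tombstoneTs = 1470878906617999L; // deletedAt of the collection tombstone

        // The cell was written 1µs after the deletion, so it survives the merge
        // and should come back to the client exactly once.
        System.out.println(isShadowed(cellTs, tombstoneTs)
                ? "tombstone wins: nothing returned"
                : "live cell wins: returned once");
    }
}
{code}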

What you describe sounds a lot like [CASSANDRA-12144], and the sstabledump output 
looks just like what we've seen in that issue. I would also expect the rows to be 
undeletable (I would appreciate it if you tried to remove those items, perhaps in 
a test environment, or locally if you copy and restore the sstables).

If this is the case, it is already fixed in {{3.0.8}}, which includes fixes for 
both the upgrade path and scrub.
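
For reference, once a node is on {{3.0.8}} or later, scrubbing the affected table 
should rewrite the problematic sstables; roughly like this (keyspace and table 
taken from the report below, adjust as needed):

{code:title=scrub (illustrative invocation)|borderStyle=solid}
# online, after upgrading the node to 3.0.8+
nodetool scrub climate climate_1510

# or offline, with the node stopped
sstablescrub climate climate_1510
{code}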


> Duplicate (cql)rows for the same primary key
> --------------------------------------------
>
>                 Key: CASSANDRA-12756
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12756
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction, CQL
>         Environment: Linux, Cassandra 3.7 (upgraded at one point from 2.?).
>            Reporter: Andreas Wederbrand
>            Priority: Minor
>
> I observe what looks like duplicates when I run CQL queries against a table. 
> It only shows for rows written during a couple of hours on a specific date, but 
> it shows for several partitions and several clustering keys per partition 
> during that time range.
> We've loaded data in two ways: 
> 1) through a normal insert
> 2) through sstableloader, with sstables created using update statements (to 
> append to the map) and an older version of SSTableWriter; an illustrative 
> statement is shown below. During this process several months of data were 
> re-loaded. 
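> The append-style updates were roughly of this shape (illustrative, built from 
> the values visible in the rows further down; not the exact statements used): 
> {code:title=illustrative update|borderStyle=solid}
> UPDATE climate.climate_1510
>    SET humidity = humidity + {0: 51}, temperature = temperature + {0: 24.37891}
>  WHERE installation_id = 133235 AND node_id = 35453983 AND time_bucket = 189
>    AND gateway_time = '2016-08-10 20:23:28';
> {code}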
> The table DDL is 
> {code:title=create statement|borderStyle=solid}
> CREATE TABLE climate.climate_1510 (
>     installation_id bigint,
>     node_id bigint,
>     time_bucket int,
>     gateway_time timestamp,
>     humidity map<int, float>,
>     temperature map<int, float>,
>     PRIMARY KEY ((installation_id, node_id, time_bucket), gateway_time)
> ) WITH CLUSTERING ORDER BY (gateway_time DESC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND comment = ''
>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
>     AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 1.0
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
> {code}
> and the result from the SELECT is
> {code:title=cql output|borderStyle=solid}
> > select * from climate.climate_1510 where installation_id = 133235 and node_id = 35453983 and time_bucket = 189 and gateway_time > '2016-08-10 20:00:00' and gateway_time < '2016-08-10 21:00:00' ;
>  installation_id | node_id  | time_bucket | gateway_time             | humidity | temperature
> -----------------+----------+-------------+--------------------------+----------+---------------
>           133235 | 35453983 |         189 | 20160810 20:23:28.000000 |  {0: 51} | {0: 24.37891}
>           133235 | 35453983 |         189 | 20160810 20:23:28.000000 |  {0: 51} | {0: 24.37891}
>           133235 | 35453983 |         189 | 20160810 20:23:28.000000 |  {0: 51} | {0: 24.37891}
> {code}
> I've used Andrew Tolbert's sstable-tools to dump the JSON for this specific 
> time, and this is what I find. 
> {code:title=json dump|borderStyle=solid}
> [133235:35453983:189] Row[info=[ts=1470878906618000] ]: gateway_time=2016-08-10 22:23+0200 | del(humidity)=deletedAt=1470878906617999, localDeletion=1470878906, [humidity[0]=51.0 ts=1470878906618000], del(temperature)=deletedAt=1470878906617999, localDeletion=1470878906, [temperature[0]=24.378906 ts=1470878906618000]
> [133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470864506441999, localDeletion=1470864506 ]: gateway_time=2016-08-10 22:23+0200 | , [humidity[0]=51.0 ts=1470878906618000], , [temperature[0]=24.378906 ts=1470878906618000]
> [133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470868106489000, localDeletion=1470868106 ]: gateway_time=2016-08-10 22:23+0200 | 
> [133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470871706530999, localDeletion=1470871706 ]: gateway_time=2016-08-10 22:23+0200 | 
> [133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470878906617999, localDeletion=1470878906 ]: gateway_time=2016-08-10 22:23+0200 | , [humidity[0]=51.0 ts=1470878906618000], , [temperature[0]=24.378906 ts=1470878906618000]
> {code}
> From my understanding this should be impossible. Even if we have duplicates 
> in the sstables (which is normal), they should be filtered out before being 
> returned to the client.
> I'm happy to add details to this bug if anything is missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
