Re: Repair completes successfully but data is still inconsistent
On Thu, Nov 27, 2014 at 2:38 AM, André Cruz <andre.c...@co.sapo.pt> wrote:
> On 26 Nov 2014, at 19:07, Robert Coli <rc...@eventbrite.com> wrote:
>> Yes. Do you know if 5748 was created as a result of compaction or via a flush from a memtable?
>
> It was the result of a compaction:

Ok, so in theory if you had the input SSTables to that compaction, you could use them to diagnose this bug, by seeing what values they had for that row before compaction. This is why something like tablesnap can be useful in practice...

=Rob
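The tablesnap idea Rob mentions boils down to periodically recording the live sstable set and diffing snapshots. A minimal sketch (the function names and snapshot shape are illustrative assumptions, not tablesnap's actual interface):

```python
import os

def live_sstables(data_dir):
    """Names of the live -Data.db files in a column family's data directory."""
    return {f for f in os.listdir(data_dir) if f.endswith("-Data.db")}

def compaction_delta(before, after):
    """Given snapshots taken before and after a compaction, return
    (inputs, outputs): files that vanished were the compaction's inputs,
    new files are its outputs."""
    return sorted(before - after), sorted(after - before)
```

Diffing the snapshot taken just before a compaction against the one taken after recovers exactly the input sstables needed to diagnose a bug like this one.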
Re: Repair completes successfully but data is still inconsistent
On 26 Nov 2014, at 19:07, Robert Coli <rc...@eventbrite.com> wrote:
> Yes. Do you know if 5748 was created as a result of compaction or via a flush from a memtable?

It was the result of a compaction:

INFO [CompactionExecutor:22422] 2014-11-13 13:08:41,926 CompactionTask.java (line 262) Compacted 2 sstables to [/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5744,]. 10,732,071 bytes to 10,824,798 (~100% of original) in 2,950ms = 3.499435MB/s. 36,632 total rows, 35,545 unique. Row merge counts were {1:34460, 2:1086, }
INFO [CompactionExecutor:22528] 2014-11-13 14:55:09,944 CompactionTask.java (line 262) Compacted 2 sstables to [/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5746,]. 15,790,121 bytes to 14,913,599 (~94% of original) in 3,491ms = 4.074110MB/s. 46,167 total rows, 43,745 unique. Row merge counts were {1:41323, 2:2422, }
INFO [CompactionExecutor:22590] 2014-11-13 15:26:50,087 CompactionTask.java (line 262) Compacted 2 sstables to [/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5748,]. 16,088,983 bytes to 16,110,649 (~100% of original) in 2,741ms = 5.605367MB/s. 48,332 total rows, 47,392 unique. Row merge counts were {1:46452, 2:940, }
---
INFO [CompactionExecutor:22718] 2014-11-13 18:05:36,326 CompactionTask.java (line 262) Compacted 2 sstables to [/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5750,]. 21,508,530 bytes to 21,342,786 (~99% of original) in 3,461ms = 5.880979MB/s. 66,361 total rows, 63,219 unique. Row merge counts were {1:60077, 2:3142, }
INFO [CompactionExecutor:22817] 2014-11-13 19:06:04,564 CompactionTask.java (line 262) Compacted 2 sstables to [/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5752,]. 23,232,660 bytes to 23,087,822 (~99% of original) in 3,144ms = 7.003264MB/s. 68,968 total rows, 67,602 unique. Row merge counts were {1:66236, 2:1366, }

Regarding flushes I have:

INFO [FlushWriter:3079] 2014-11-13 13:08:38,972 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5743-Data.db (4698473 bytes) for commitlog position ReplayPosition(segmentId=1413900469571, position=6240)
INFO [FlushWriter:3093] 2014-11-13 14:55:06,436 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5745-Data.db (4965323 bytes) for commitlog position ReplayPosition(segmentId=1413900469603, position=2518)
INFO [FlushWriter:3101] 2014-11-13 15:26:47,336 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5747-Data.db (1175384 bytes) for commitlog position ReplayPosition(segmentId=1413900469635, position=9984)
INFO [FlushWriter:3121] 2014-11-13 18:05:32,853 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5749-Data.db (5397881 bytes) for commitlog position ReplayPosition(segmentId=1413900469667, position=8533)
INFO [FlushWriter:3134] 2014-11-13 19:06:01,416 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5751-Data.db (1889874 bytes) for commitlog position ReplayPosition(segmentId=1413900469699, position=108)
INFO [FlushWriter:3147] 2014-11-13 21:20:58,312 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5753-Data.db (3283519 bytes) for commitlog position ReplayPosition(segmentId=1413900469731, position=6848)

André
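For cross-checking the lineage above, a small sketch that pulls the output sstable and byte counts out of CompactionTask log lines like these (a hypothetical helper, not Cassandra tooling; the field names are made up):

```python
import re

# Matches lines like:
#   Compacted 2 sstables to [/path/to/Disco-NamespaceFile2-ic-5748,]. 16,088,983 bytes to 16,110,649 ...
LOG_RE = re.compile(
    r"Compacted (?P<n>\d+) sstables to \[(?P<out>[^\]]*)\]\.\s+"
    r"(?P<before>[\d,]+) bytes to (?P<after>[\d,]+)"
)

def parse_compaction(line):
    """Extract input count, output sstable path, and byte sizes from one
    CompactionTask log line; returns None for non-matching lines."""
    m = LOG_RE.search(line)
    if not m:
        return None
    return {
        "inputs": int(m.group("n")),
        "output": m.group("out").rstrip(","),
        "bytes_before": int(m.group("before").replace(",", "")),
        "bytes_after": int(m.group("after").replace(",", "")),
    }
```

Running this over system.log makes it easy to trace which compaction produced a suspect generation such as ic-5748.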
Re: Repair completes successfully but data is still inconsistent
On 24 Nov 2014, at 18:54, Robert Coli <rc...@eventbrite.com> wrote:
> But for any given value on any given node, you can verify the value it has in 100% of SStables... that's what both the normal read path and repair should do when reconciling row fragments into the materialized row? Hard to understand a case where repair fails, and I might provide that set of SStables attached to an Apache JIRA.

These were the sstables present on node 3 on the 13th:

-rw-r--r-- 1 andre staff 167794724 Nov 13 17:57 Disco-NamespaceFile2-ic-232-Data.db
-rw-r--r-- 1 andre staff 167809485 Nov 13 17:58 Disco-NamespaceFile2-ic-3608-Data.db
-rw-r--r-- 1 andre staff 167800404 Nov 13 17:58 Disco-NamespaceFile2-ic-3609-Data.db
-rw-r--r-- 1 andre staff 59773136 Nov 13 17:58 Disco-NamespaceFile2-ic-3610-Data.db
-rw-r--r-- 1 andre staff 167804631 Nov 13 17:59 Disco-NamespaceFile2-ic-4022-Data.db
-rw-r--r-- 1 andre staff 167795369 Nov 13 17:59 Disco-NamespaceFile2-ic-4023-Data.db
-rw-r--r-- 1 andre staff 30930871 Nov 13 18:00 Disco-NamespaceFile2-ic-4024-Data.db
-rw-r--r-- 1 andre staff 167806410 Nov 13 18:00 Disco-NamespaceFile2-ic-5334-Data.db
-rw-r--r-- 1 andre staff 167789187 Nov 13 18:01 Disco-NamespaceFile2-ic-5336-Data.db
-rw-r--r-- 1 andre staff 167786564 Nov 13 18:02 Disco-NamespaceFile2-ic-5337-Data.db
-rw-r--r-- 1 andre staff 167794911 Nov 13 18:02 Disco-NamespaceFile2-ic-5338-Data.db
-rw-r--r-- 1 andre staff 167813170 Nov 13 18:03 Disco-NamespaceFile2-ic-5339-Data.db
-rw-r--r-- 1 andre staff 64246179 Nov 13 18:03 Disco-NamespaceFile2-ic-5340-Data.db
-rw-r--r-- 1 andre staff 167794716 Nov 13 18:04 Disco-NamespaceFile2-ic-5719-Data.db
-rw-r--r-- 1 andre staff 167782343 Nov 13 18:04 Disco-NamespaceFile2-ic-5721-Data.db
-rw-r--r-- 1 andre staff 167789502 Nov 13 18:05 Disco-NamespaceFile2-ic-5722-Data.db
-rw-r--r-- 1 andre staff 167800931 Nov 13 18:05 Disco-NamespaceFile2-ic-5723-Data.db
-rw-r--r-- 1 andre staff 167810143 Nov 13 18:06 Disco-NamespaceFile2-ic-5724-Data.db
-rw-r--r-- 1 andre staff 167800713 Nov 13 18:07 Disco-NamespaceFile2-ic-5725-Data.db
-rw-r--r-- 1 andre staff 167812647 Nov 13 18:07 Disco-NamespaceFile2-ic-5726-Data.db
-rw-r--r-- 1 andre staff 167782180 Nov 13 18:08 Disco-NamespaceFile2-ic-5727-Data.db
-rw-r--r-- 1 andre staff 167797488 Nov 13 18:09 Disco-NamespaceFile2-ic-5728-Data.db
-rw-r--r-- 1 andre staff 12950043 Nov 13 18:09 Disco-NamespaceFile2-ic-5729-Data.db
-rw-r--r-- 1 andre staff 167798810 Nov 13 18:09 Disco-NamespaceFile2-ic-5730-Data.db
-rw-r--r-- 1 andre staff 167805918 Nov 13 18:10 Disco-NamespaceFile2-ic-5731-Data.db
-rw-r--r-- 1 andre staff 167805189 Nov 13 18:10 Disco-NamespaceFile2-ic-5732-Data.db
-rw-r--r-- 1 andre staff 12563937 Nov 13 18:10 Disco-NamespaceFile2-ic-5733-Data.db
-rw-r--r-- 1 andre staff 16110649 Nov 13 18:11 Disco-NamespaceFile2-ic-5748-Data.db

Of these, the row in question was present on:

Disco-NamespaceFile2-ic-5337-Data.db - tombstone column
Disco-NamespaceFile2-ic-5719-Data.db - no trace of that column
Disco-NamespaceFile2-ic-5748-Data.db - live column with original timestamp

Is this the information you requested?

Best regards,
André
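For context on why these three files matter, here is a minimal sketch of last-write-wins column reconciliation (the tuple shape is an assumption for illustration; Cassandra's actual logic lives in its Java column classes). While the ic-5337 tombstone is live it shadows the older value, but once gc_grace passes and the tombstone is purged by compaction, the stale copy in ic-5748 is the only version left and "wins" by default:

```python
from functools import reduce

def reconcile(a, b):
    """Reconcile two versions of one column, each a (timestamp, value) pair
    where value None marks a tombstone. Higher timestamp wins; on a
    timestamp tie the tombstone wins."""
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    return a if a[1] is None else b

def materialize(versions):
    """Fold the per-sstable fragments of a column into the value a read returns."""
    return reduce(reconcile, versions)
```

With the tombstone present, `materialize([(1, "v"), (2, None)])` yields the tombstone; with the tombstone purged, `materialize([(1, "v")])` resurfaces the stale value, which is exactly the zombie behaviour in this thread.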
Re: Repair completes successfully but data is still inconsistent
On Wed, Nov 26, 2014 at 10:17 AM, André Cruz <andre.c...@co.sapo.pt> wrote:
> Of these, the row in question was present on:
>
> Disco-NamespaceFile2-ic-5337-Data.db - tombstone column
> Disco-NamespaceFile2-ic-5719-Data.db - no trace of that column
> Disco-NamespaceFile2-ic-5748-Data.db - live column with original timestamp
>
> Is this the information you requested?

Yes. Do you know if 5748 was created as a result of compaction or via a flush from a memtable?

=Rob
Re: Repair completes successfully but data is still inconsistent
On 21 Nov 2014, at 19:01, Robert Coli <rc...@eventbrite.com> wrote:
>> 2- Why won’t repair propagate this column value to the other nodes? Repairs have run every day and the value is still missing on the other nodes.
> No idea. Are you sure it's not expired via TTL or masked in some other way? When you ask that node for it at CL.ONE, do you get this value?

This data does not use TTLs. What other reason could there be for a mask? If I connect using cassandra-cli to that specific node, which becomes the coordinator, is it guaranteed not to ask another node when CL is ONE and it contains that row?

“Cassandra has never stored data consistently except by fortunate accident.” I wish I had read that a few years back. :)

Thank you,
André
Re: Repair completes successfully but data is still inconsistent
On Mon, Nov 24, 2014 at 10:39 AM, André Cruz <andre.c...@co.sapo.pt> wrote:
> This data does not use TTLs. What other reason could there be for a mask? If I connect using cassandra-cli to that specific node, which becomes the coordinator, is it guaranteed not to ask another node when CL is ONE and it contains that row?

Other than rare cases of writing doomstones (deletes with timestamps in the future), I'm not sure in what case this might occur. But for any given value on any given node, you can verify the value it has in 100% of SStables... that's what both the normal read path and repair should do when reconciling row fragments into the materialized row? Hard to understand a case where repair fails, and I might provide that set of SStables attached to an Apache JIRA.

=Rob
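One quick way to rule doomstones in or out is to scan for deletion markers whose write timestamps lie in the future. A minimal sketch (the cell-tuple shape is an assumption; in practice you would extract the fields from sstable2json output):

```python
import time

def find_doomstones(cells, now_us=None):
    """Return names of tombstoned cells whose write timestamps (microseconds
    since the epoch, the timestamp convention seen in this thread) lie in
    the future relative to now_us."""
    if now_us is None:
        now_us = int(time.time() * 1_000_000)
    return [name for name, ts_us, is_tombstone in cells
            if is_tombstone and ts_us > now_us]
```

An empty result over the relevant sstables would support André's reading that doomstones are not the cause here.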
Re: Repair completes successfully but data is still inconsistent
On 19 Nov 2014, at 19:53, Robert Coli <rc...@eventbrite.com> wrote:
> My hunch is that you originally triggered this by picking up some obsolete SSTables during the 1.2 era. Probably if you clean up the existing zombies you will not encounter them again, unless you encounter another obsolete sstables marked live bug. I agree that your compaction exceptions in the 1.2 era are likely implicated.

I’m still in the “1.2 era”; I upgraded from 1.2.16 to 1.2.19. You mentioned that obsolete sstables may have been picked up, and each node had 3-9 exceptions like this when they were brought down prior to being updated:

ERROR [CompactionExecutor:15173] 2014-10-21 15:04:01,923 CassandraDaemon.java (line 191) Exception in thread Thread[CompactionExecutor:15173,1,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1dad30c0 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@555b9c78[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14052]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor.submit(Unknown Source)
    at org.apache.cassandra.io.sstable.SSTableDeletingTask.schedule(SSTableDeletingTask.java:65)

Can it be that they were all in the middle of a compaction (Leveled compaction) and the new sstables were written but the old ones were not deleted? Will Cassandra blindly pick up old and new sstables when it restarts? If so, I have a few questions:

1- What is the correct sequence of commands to bring down a node safely? I know that “drain” was used here, because it is in the log. I’ve read somewhere that drain should not be used and that “disablethrift”, “disablegossip”, “flush” and then waiting a while is the correct way.
2- Why won’t repair propagate this column value to the other nodes? Repairs have run every day and the value is still missing on the other nodes.
3- If I have these rogue sstables loaded, this seems like a time bomb. Down the line I will again delete columns that will reappear after some time. Is there a way I can find these sstables that should not be there? I thought the timestamp of the file would help, but this zombie column is present on one of the latest sstables.

Thanks,
André
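The "disable traffic, flush, then drain" sequence from question 1 can be sketched as a series of nodetool invocations (a sketch only; it assumes nodetool is on the PATH, and whether drain alone suffices on a given 1.2.x depends on the drain bugs discussed later in this thread):

```python
import subprocess

def shutdown_commands(host="localhost"):
    """The conservative shutdown sequence from the thread, expressed as
    nodetool invocations, in order."""
    steps = ("disablethrift", "disablegossip", "flush", "drain")
    return [["nodetool", "-h", host, step] for step in steps]

def stop_node_safely(host="localhost"):
    """Run each step in order, failing fast if any nodetool call errors."""
    for cmd in shutdown_commands(host):
        subprocess.run(cmd, check=True)
```

Pausing between disabling gossip and flushing gives in-flight writes a chance to land in the memtable before it is flushed.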
Re: Repair completes successfully but data is still inconsistent
On Fri, Nov 21, 2014 at 3:11 AM, André Cruz <andre.c...@co.sapo.pt> wrote:
> Can it be that they were all in the middle of a compaction (Leveled compaction) and the new sstables were written but the old ones were not deleted? Will Cassandra blindly pick up old and new sstables when it restarts?

Yes.

https://issues.apache.org/jira/browse/CASSANDRA-6756

Also note linked tickets there, like:

https://issues.apache.org/jira/browse/CASSANDRA-6503

6503 is fix version 1.2.14 ...

> 1- What is the correct sequence of commands to bring down a node safely? I know that “drain” was used here, because it is in the log. I’ve read somewhere that drain should not be used and that “disablethrift”, “disablegossip”, “flush” and then waiting a while is the correct way.

drain is the canonical answer. But drain has historically not always worked. In general when it hasn't worked, it has flushed properly but not marked clean properly.

https://issues.apache.org/jira/browse/CASSANDRA-4446
https://issues.apache.org/jira/browse/CASSANDRA-5911

5911 is too late for 1.2 and will not be merged there.

> 2- Why won’t repair propagate this column value to the other nodes? Repairs have run every day and the value is still missing on the other nodes.

No idea. Are you sure it's not expired via TTL or masked in some other way? When you ask that node for it at CL.ONE, do you get this value?

> 3- If I have these rogue sstables loaded, this seems like a time bomb. Down the line I will again delete columns that will reappear after some time. Is there a way I can find these sstables that should not be there? I thought the timestamp of the file would help, but this zombie column is present on one of the latest sstables.

Unfortunately, not really. You'd need the patch from CASSANDRA-6568, and you'd need to have been continuously capturing your live SStable set to determine when.

https://issues.apache.org/jira/browse/CASSANDRA-6568

The experience you're having is, AFAICT, irrecoverably fatal to consistency. This is why, at the summit this year, when Apple announced it had encountered and fixed like 5 bugs of this type, I summarized their talk in chats as the talk where Apple showed that, despite what you've heard about quorum reads and writes, Cassandra has never stored data consistently except by fortunate accident.

=Rob
http://twitter.com/rcolidba
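As background for question 2: repair compares per-range hashes of each replica's (already reconciled) data and streams only the ranges whose hashes differ. A toy model of that comparison (the data shapes here are simplifying assumptions, loosely analogous to the Merkle-tree exchange). A replica that genuinely holds a divergent zombie column should produce a hash mismatch and get repaired, which is what makes the observed "fully synced" runs puzzling:

```python
import hashlib

def range_hash(rows):
    """Hash a replica's reconciled rows for one token range."""
    h = hashlib.md5()
    for key, value in sorted(rows):
        h.update(repr((key, value)).encode())
    return h.hexdigest()

def ranges_to_stream(a, b):
    """Token ranges whose hashes disagree between two replicas; ranges with
    identical hashes are skipped entirely."""
    return sorted(r for r in a if range_hash(a[r]) != range_hash(b.get(r, [])))
```

In this model any difference in the materialized rows changes the hash, so "fully synced" implies the replicas' reconciled views agreed at the moment each repair ran.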
Re: Repair completes successfully but data is still inconsistent
On 19 Nov 2014, at 00:43, Robert Coli <rc...@eventbrite.com> wrote:
> @OP : can you repro if you run a major compaction between the deletion and the tombstone collection?

This happened in production and, AFAIK, for the first time in a system that has been running for 2 years. We upgraded the Cassandra version last month so there’s that difference, but the upgrade happened before the original delete of this column. I have found more examples of zombie columns like this (approx. 30k columns out of a 1.2M total) and they are all in this same row of this CF. I should point out that we have a sister CF where we do similar inserts/deletes, but it uses STCS, and it doesn’t exhibit this problem. I don’t think I can reproduce this easily in a test environment.

> Basically, I am conjecturing that a compaction bug or one of the handful of unmask previously deleted data bugs are resulting in the unmasking of a non-tombstone row which is sitting in a SStable. OP could also support this conjecture by running sstablekeys on other SSTables on 3rd replica and determining what masked values there are for the row prior to deletion. If the data is sitting in an old SStable, this is suggestive.

There are 3 sstables that have this row on the 3rd replica:

Disco-NamespaceFile2-ic-5337-Data.db.json - Has the column tombstone
Disco-NamespaceFile2-ic-5719-Data.db.json - Has no value for this column
Disco-NamespaceFile2-ic-5748-Data.db.json - Has the original value

> One last question for OP would be whether the nodes were restarted during the time period this bug was observed. An assortment of the unmask previously deleted data bugs come from dead sstables in the data directory being marked live on a restart.

All the nodes were restarted on 21-23 October, for the upgrade (1.2.16 -> 1.2.19) I mentioned. The delete happened after. I should also point out that we were experiencing problems related to CASSANDRA-4206 and CASSANDRA-7808:

ERROR 15:01:51,885 Exception in thread Thread[CompactionExecutor:15172,1,main]
java.lang.AssertionError: originally calculated column size of 78041151 but now it is 78041303
    at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:135)
    at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
    at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
    at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
    at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

We saw that line multiple times in the logs, always for the same row, judging by the recurring 78041151 and 78041303 sizes, even though the data seemed fine. Could that row be the one experiencing problems now? Maybe with the upgrade the new Cassandra correctly compacted this row and all hell broke loose? If so, is there an easy way to fix this? Shouldn’t repair also propagate this zombie column to the other nodes?

Thank you and best regards,
André Cruz
Re: Repair completes successfully but data is still inconsistent
On 19 Nov 2014, at 11:37, André Cruz <andre.c...@co.sapo.pt> wrote:
> All the nodes were restarted on 21-23 October, for the upgrade (1.2.16 -> 1.2.19) I mentioned. The delete happened after. I should also point out that we were experiencing problems related to CASSANDRA-4206 and CASSANDRA-7808.

Another possible cause are these exceptions I found in the log as the nodes were shut down and brought up with the new version:

INFO [RMI TCP Connection(270364)-10.134.101.18] 2014-10-21 15:04:00,867 StorageService.java (line 939) DRAINED
ERROR [CompactionExecutor:15173] 2014-10-21 15:04:01,923 CassandraDaemon.java (line 191) Exception in thread Thread[CompactionExecutor:15173,1,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1dad30c0 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@555b9c78[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14052]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor.submit(Unknown Source)
    at org.apache.cassandra.io.sstable.SSTableDeletingTask.schedule(SSTableDeletingTask.java:65)
    at org.apache.cassandra.io.sstable.SSTableReader.releaseReference(SSTableReader.java:976)
    at org.apache.cassandra.db.DataTracker.removeOldSSTablesSize(DataTracker.java:370)
    at org.apache.cassandra.db.DataTracker.postReplace(DataTracker.java:335)
    at org.apache.cassandra.db.DataTracker.replace(DataTracker.java:329)
    at org.apache.cassandra.db.DataTracker.replaceCompactedSSTables(DataTracker.java:232)
    at org.apache.cassandra.db.ColumnFamilyStore.replaceCompactedSSTables(ColumnFamilyStore.java:995)
    at org.apache.cassandra.db.compaction.CompactionTask.replaceCompactedSSTables(CompactionTask.java:270)
    at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:230)
    at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
    at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Each node has 4-9 of these exceptions as it is going down after being drained. It seems Cassandra was trying to delete an sstable. Can this be related?

Best regards,
André Cruz
Re: Repair completes successfully but data is still inconsistent
On Wed, Nov 19, 2014 at 5:18 AM, André Cruz <andre.c...@co.sapo.pt> wrote:
> Each node has 4-9 of these exceptions as it is going down after being drained. It seems Cassandra was trying to delete an sstable. Can this be related?

That seems plausible, though the fact that the files you indicate now hold the current versions of the data suggests that the file was eventually successfully deleted.

My hunch is that you originally triggered this by picking up some obsolete SSTables during the 1.2 era. Probably if you clean up the existing zombies you will not encounter them again, unless you encounter another obsolete sstables marked live bug. I agree that your compaction exceptions in the 1.2 era are likely implicated.

=Rob
Re: Repair completes successfully but data is still inconsistent
On 18 Nov 2014, at 01:08, Michael Shuler <mich...@pbandjelly.org> wrote:
> André, does `nodetool gossipinfo` show all the nodes in schema agreement?

Yes:

$ nodetool -h XXX.XXX.XXX.XXX gossipinfo | grep -i schema
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e

Best regards,
André
Re: Repair completes successfully but data is still inconsistent
`nodetool cleanup` also looks interesting as an option. -- Michael
Re: Repair completes successfully but data is still inconsistent
On Tue, Nov 18, 2014 at 12:46 PM, Michael Shuler <mich...@pbandjelly.org> wrote:
> `nodetool cleanup` also looks interesting as an option.

I don't understand why cleanup or scrub would help with a case where data is being un-tombstoned.

> 1 November - column is deleted - gc_grace_period is 10 days
> 8 November - all 3 replicas have tombstone
> 13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the original value (!), with the original timestamp…

To me the smoking gun here is that 3rd replica has the original value.

@OP : can you repro if you run a major compaction between the deletion and the tombstone collection?

Basically, I am conjecturing that a compaction bug or one of the handful of unmask previously deleted data bugs are resulting in the unmasking of a non-tombstone row which is sitting in a SStable. OP could also support this conjecture by running sstablekeys on other SSTables on 3rd replica and determining what masked values there are for the row prior to deletion. If the data is sitting in an old SStable, this is suggestive.

One last question for OP would be whether the nodes were restarted during the time period this bug was observed. An assortment of the unmask previously deleted data bugs come from dead sstables in the data directory being marked live on a restart.

=Rob
http://twitter.com/rcolidba
Re: Repair completes successfully but data is still inconsistent
On 14 Nov 2014, at 18:44, André Cruz <andre.c...@co.sapo.pt> wrote:
> On 14 Nov 2014, at 18:29, Michael Shuler <mich...@pbandjelly.org> wrote:
>> On 11/14/2014 12:12 PM, André Cruz wrote:
>>> Some extra info. I checked the backups and on the 8th of November, all 3 replicas had the tombstone of the deleted column. So:
>>> 1 November - column is deleted - gc_grace_period is 10 days
>>> 8 November - all 3 replicas have tombstone
>>> 13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the original value (!), with the original timestamp…
>> After seeing your first post, this is helpful info. I'm curious what the logs show between the 8th-13th, particularly around the 10th-11th :)
> Which logs in particular, just the ones from the 3rd machine which has the zombie column? What should I be looking for? :)

I have checked the logs of the 3 replicas for that period and nothing really jumps out. Still, repairs have been running daily, the log reports that the CF is synced, and as of this moment one of the replicas still returns the zombie column, so they don’t agree on the data.

André
Re: Repair completes successfully but data is still inconsistent
On 11/17/2014 05:22 AM, André Cruz wrote:
> I have checked the logs of the 3 replicas for that period and nothing really jumps out. Still, repairs have been running daily, the log reports that the CF is synced, and as of this moment one of the replicas still returns the zombie column so they don’t agree on the data.

I'm at a bit of a loss. Readers, is `nodetool scrub` going to be helpful here, or any other suggestions of things to look for?

André, does `nodetool gossipinfo` show all the nodes in schema agreement?

--
Michael
Repair completes successfully but data is still inconsistent
Hello. So, I have detected a data inconsistency between my nodes (consistency level is ONE):

[disco@Disco] get NamespaceFile2['45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX']['XXXFILEXXX'];
Value was not found
Elapsed time: 48 msec(s).
[disco@Disco] get NamespaceFile2['45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX']['XXXFILEXXX'];
=> (name=XXXFILEXXX, value=a96a7f54-49d4-11e4-8185-e0db55018fa4, timestamp=1412213845215797)
Elapsed time: 7.45 msec(s).
[disco@Disco]

I have a replication factor of 3, and if I read with a consistency level of QUORUM, the result converges to the column being present. The strange thing is: I deleted it on 2014-11-01 03:39:25 and my writes use QUORUM. My gc_grace_period is 10 days, and if I had not been running repairs during this period this could be explained, but the fact is that repairs have been running every day with no sign of problems.

First of all, some data:

Cassandra version: 1.2.19 (we upgraded from 1.2.16 on 2014-10-22)
NamespaceFile2 compaction strategy: LeveledCompactionStrategy

CFStats:
Read Count: 3376397
Read Latency: 0.24978254956392865 ms.
Write Count: 21254817
Write Latency: 0.04713691710448507 ms.
Pending Tasks: 0
Column Family: NamespaceFile2
SSTable count: 28
SSTables in each level: [1, 10, 17, 0, 0, 0, 0, 0, 0]
Space used (live): 4961751613
Space used (total): 5021967044
SSTable Compression Ratio: 0.5773464014899713
Number of Keys (estimate): 9369856
Memtable Columns Count: 82901
Memtable Data Size: 19352636
Memtable Switch Count: 283
Read Count: 3376397
Read Latency: NaN ms.
Write Count: 21254817
Write Latency: 0.049 ms.
Pending Tasks: 0
Bloom Filter False Positives: 21904
Bloom Filter False Ratio: 0.0
Bloom Filter Space Used: 7405728
Compacted row minimum size: 61
Compacted row maximum size: 74975550
Compacted row mean size: 795
Average live cells per slice (last five minutes): 1025.0
Average tombstones per slice (last five minutes): 0.0

This particular row has 1.1M columns, 70934390 bytes, and the offending key encoded by the partitioner is 001045fc899641bc429ba3825da9294eb59c072f646f7267656d00.

First I checked which nodes were responsible for this particular row:

$ nodetool -h XXX.XXX.XXX.112 getendpoints Disco NamespaceFile2 '45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX'
XXX.XXX.XXX.17
XXX.XXX.XXX.14
XXX.XXX.XXX.18

This is the repair log of these particular nodes:

--XXX.XXX.XXX.14--
INFO [AntiEntropySessions:366] 2014-11-14 04:34:55,740 AntiEntropyService.java (line 651) [repair #956244b0-6bb7-11e4-8eec-f5962c02982e] new session: will sync /XXX.XXX.XXX.14, /XXX.XXX.XXX.12, /XXX.XXX.XXX.18 on range (14178431955039102644307275309657008810,28356863910078205288614550619314017621] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-14 04:36:51,125 AntiEntropyService.java (line 764) [repair #956244b0-6bb7-11e4-8eec-f5962c02982e] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:366] 2014-11-14 04:36:51,126 AntiEntropyService.java (line 698) [repair #956244b0-6bb7-11e4-8eec-f5962c02982e] session completed successfully
...
INFO [AntiEntropySessions:367] 2014-11-14 04:36:51,103 AntiEntropyService.java (line 651) [repair #da2543e0-6bb7-11e4-8eec-f5962c02982e] new session: will sync /XXX.XXX.XXX.14, /XXX.XXX.XXX.11, /XXX.XXX.XXX.17 on range (155962751505430129087380028406227096910,0] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-14 04:38:20,949 AntiEntropyService.java (line 764) [repair #da2543e0-6bb7-11e4-8eec-f5962c02982e] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:367] 2014-11-14 04:38:20,949 AntiEntropyService.java (line 698) [repair #da2543e0-6bb7-11e4-8eec-f5962c02982e] session completed successfully
...
INFO [AntiEntropySessions:368] 2014-11-14 04:38:20,930 AntiEntropyService.java (line 651) [repair #0fafc710-6bb8-11e4-8eec-f5962c02982e] new session: will sync /XXX.XXX.XXX.14, /XXX.XXX.XXX.17, /XXX.XXX.XXX.18 on range (0,14178431955039102644307275309657008810] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-14 04:40:34,237 AntiEntropyService.java (line 764) [repair #0fafc710-6bb8-11e4-8eec-f5962c02982e] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:368] 2014-11-14 04:40:34,237 AntiEntropyService.java (line 698) [repair #0fafc710-6bb8-11e4-8eec-f5962c02982e] session completed successfully

--XXX.XXX.XXX.17--
INFO [AntiEntropySessions:362] 2014-11-13 04:33:01,974 AntiEntropyService.java (line 651) [repair #27293450-6aee-11e4-aabc-9b69569c95c3] new session: will sync /XXX.XXX.XXX.17,
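For reference, getendpoints resolves replicas from the partitioner token of the row key; the token ranges in the repair log above come from the same ring. A minimal sketch of RandomPartitioner-style token derivation and range ownership (an illustration, not Cassandra's code; the helper names are made up):

```python
import hashlib

def random_partitioner_token(key: bytes) -> int:
    """RandomPartitioner derives a token as the absolute value of the signed
    128-bit MD5 of the raw key, so tokens fall within [0, 2**127]."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

def owning_range(token, ranges):
    """Find the (start, end] token range that owns a token; a range with
    start > end wraps around the top of the ring."""
    for start, end in ranges:
        if start < end:
            if start < token <= end:
                return (start, end)
        elif token > start or token <= end:
            return (start, end)
    return None
```

The replica set for the row is then the node owning that range plus the next RF-1 nodes on the ring, which is how the three endpoints above are chosen.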
Re: Repair completes successfully but data is still inconsistent
Some extra info. I checked the backups and on the 8th of November, all 3 replicas had the tombstone of the deleted column. So:

1 November - column is deleted - gc_grace_period is 10 days
8 November - all 3 replicas have tombstone
13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the original value (!), with the original timestamp…

Is there a logical explanation for this behaviour?

Thank you,
André Cruz
Re: Repair completes successfully but data is still inconsistent
On 11/14/2014 12:12 PM, André Cruz wrote:
> Some extra info. I checked the backups and on the 8th of November, all 3 replicas had the tombstone of the deleted column. So:
> 1 November - column is deleted - gc_grace_period is 10 days
> 8 November - all 3 replicas have tombstone
> 13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the original value (!), with the original timestamp…

After seeing your first post, this is helpful info. I'm curious what the logs show between the 8th-13th, particularly around the 10th-11th :)

> Is there a logical explanation for this behaviour?

Not yet!

--
Kind regards,
Michael
Re: Repair completes successfully but data is still inconsistent
On 14 Nov 2014, at 18:29, Michael Shuler <mich...@pbandjelly.org> wrote:
> On 11/14/2014 12:12 PM, André Cruz wrote:
>> Some extra info. I checked the backups and on the 8th of November, all 3 replicas had the tombstone of the deleted column. So:
>> 1 November - column is deleted - gc_grace_period is 10 days
>> 8 November - all 3 replicas have tombstone
>> 13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the original value (!), with the original timestamp…
> After seeing your first post, this is helpful info. I'm curious what the logs show between the 8th-13th, particularly around the 10th-11th :)

Which logs in particular, just the ones from the 3rd machine which has the zombie column? What should I be looking for? :)

Thanks for the help,
André