Re: Repair completes successfully but data is still inconsistent

2014-12-01 Thread Robert Coli
On Thu, Nov 27, 2014 at 2:38 AM, André Cruz andre.c...@co.sapo.pt wrote:

 On 26 Nov 2014, at 19:07, Robert Coli rc...@eventbrite.com wrote:
 
  Yes. Do you know if 5748 was created as a result of compaction or via a
 flush from a memtable?

 It was the result of a compaction:


Ok, so in theory if you had the input SSTables to that compaction, you
could use them to diagnose this bug, by seeing what values they had for
that row before compaction. This is why something like tablesnap can be
useful in practice...
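
Something like the following, run against archived copies of those inputs,
would show the column exactly as each input SSTable saw it. This is only a
sketch: the path and the -ic-XXXX name are placeholders for wherever/whatever
the actual inputs were, the hex row key is the one you posted earlier, and
sstable2json needs the schema (cassandra.yaml / system tables) available where
it runs:

$ sstable2json /backups/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-XXXX-Data.db \
    -k 001045fc899641bc429ba3825da9294eb59c072f646f7267656d00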

=Rob


Re: Repair completes successfully but data is still inconsistent

2014-11-27 Thread André Cruz
On 26 Nov 2014, at 19:07, Robert Coli rc...@eventbrite.com wrote:
 
 Yes. Do you know if 5748 was created as a result of compaction or via a flush 
 from a memtable?

It was the result of a compaction:

 INFO [CompactionExecutor:22422] 2014-11-13 13:08:41,926 CompactionTask.java 
(line 262) Compacted 2 sstables to 
[/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5744,].
  10,732,071 bytes to 10,824,798 (~100% of original) in 2,950ms = 3.499435MB/s. 
 36,632 total rows, 35,545 unique.  Row merge counts were {1:34460, 2:1086, }
 INFO [CompactionExecutor:22528] 2014-11-13 14:55:09,944 CompactionTask.java 
(line 262) Compacted 2 sstables to 
[/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5746,].
  15,790,121 bytes to 14,913,599 (~94% of original) in 3,491ms = 4.074110MB/s.  
46,167 total rows, 43,745 unique.  Row merge counts were {1:41323, 2:2422, }
 INFO [CompactionExecutor:22590] 2014-11-13 15:26:50,087 CompactionTask.java 
(line 262) Compacted 2 sstables to 
[/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5748,].
  16,088,983 bytes to 16,110,649 (~100% of original) in 2,741ms = 5.605367MB/s. 
 48,332 total rows, 47,392 unique.  Row merge counts were {1:46452, 2:940, } 
---
 INFO [CompactionExecutor:22718] 2014-11-13 18:05:36,326 CompactionTask.java 
(line 262) Compacted 2 sstables to 
[/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5750,].
  21,508,530 bytes to 21,342,786 (~99% of original) in 3,461ms = 5.880979MB/s.  
66,361 total rows, 63,219 unique.  Row merge counts were {1:60077, 2:3142, }
 INFO [CompactionExecutor:22817] 2014-11-13 19:06:04,564 CompactionTask.java 
(line 262) Compacted 2 sstables to 
[/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5752,].
  23,232,660 bytes to 23,087,822 (~99% of original) in 3,144ms = 7.003264MB/s.  
68,968 total rows, 67,602 unique.  Row merge counts were {1:66236, 2:1366, }


Regarding flushes I have:

INFO [FlushWriter:3079] 2014-11-13 13:08:38,972 Memtable.java (line 436) 
Completed flushing 
/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5743-Data.db
 (4698473 bytes) for commitlog position ReplayPosition(segmentId=1413900469571, 
position=6240)
 INFO [FlushWriter:3093] 2014-11-13 14:55:06,436 Memtable.java (line 436) 
Completed flushing 
/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5745-Data.db
 (4965323 bytes) for commitlog position ReplayPosition(segmentId=1413900469603, 
position=2518)
 INFO [FlushWriter:3101] 2014-11-13 15:26:47,336 Memtable.java (line 436) 
Completed flushing 
/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5747-Data.db
 (1175384 bytes) for commitlog position ReplayPosition(segmentId=1413900469635, 
position=9984)
 INFO [FlushWriter:3121] 2014-11-13 18:05:32,853 Memtable.java (line 436) 
Completed flushing 
/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5749-Data.db
 (5397881 bytes) for commitlog position ReplayPosition(segmentId=1413900469667, 
position=8533)
 INFO [FlushWriter:3134] 2014-11-13 19:06:01,416 Memtable.java (line 436) 
Completed flushing 
/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5751-Data.db
 (1889874 bytes) for commitlog position ReplayPosition(segmentId=1413900469699, 
position=108)
 INFO [FlushWriter:3147] 2014-11-13 21:20:58,312 Memtable.java (line 436) 
Completed flushing 
/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5753-Data.db
 (3283519 bytes) for commitlog position ReplayPosition(segmentId=1413900469731, 
position=6848)


André

Re: Repair completes successfully but data is still inconsistent

2014-11-26 Thread André Cruz
On 24 Nov 2014, at 18:54, Robert Coli rc...@eventbrite.com wrote:
 
 But for any given value on any given node, you can verify the value it has in 
 100% of SSTables... that's what both the normal read path and repair should 
 do when reconciling row fragments into the materialized row, right? It's hard 
 to understand a case where repair fails there, and I might suggest providing 
 that set of SSTables attached to an Apache JIRA.

These were the sstables present on node 3 on the 13th:

-rw-r--r--  1 andre  staff  167794724 Nov 13 17:57 
Disco-NamespaceFile2-ic-232-Data.db
-rw-r--r--  1 andre  staff  167809485 Nov 13 17:58 
Disco-NamespaceFile2-ic-3608-Data.db
-rw-r--r--  1 andre  staff  167800404 Nov 13 17:58 
Disco-NamespaceFile2-ic-3609-Data.db
-rw-r--r--  1 andre  staff   59773136 Nov 13 17:58 
Disco-NamespaceFile2-ic-3610-Data.db
-rw-r--r--  1 andre  staff  167804631 Nov 13 17:59 
Disco-NamespaceFile2-ic-4022-Data.db
-rw-r--r--  1 andre  staff  167795369 Nov 13 17:59 
Disco-NamespaceFile2-ic-4023-Data.db
-rw-r--r--  1 andre  staff   30930871 Nov 13 18:00 
Disco-NamespaceFile2-ic-4024-Data.db
-rw-r--r--  1 andre  staff  167806410 Nov 13 18:00 
Disco-NamespaceFile2-ic-5334-Data.db
-rw-r--r--  1 andre  staff  167789187 Nov 13 18:01 
Disco-NamespaceFile2-ic-5336-Data.db
-rw-r--r--  1 andre  staff  167786564 Nov 13 18:02 
Disco-NamespaceFile2-ic-5337-Data.db
-rw-r--r--  1 andre  staff  167794911 Nov 13 18:02 
Disco-NamespaceFile2-ic-5338-Data.db
-rw-r--r--  1 andre  staff  167813170 Nov 13 18:03 
Disco-NamespaceFile2-ic-5339-Data.db
-rw-r--r--  1 andre  staff   64246179 Nov 13 18:03 
Disco-NamespaceFile2-ic-5340-Data.db
-rw-r--r--  1 andre  staff  167794716 Nov 13 18:04 
Disco-NamespaceFile2-ic-5719-Data.db
-rw-r--r--  1 andre  staff  167782343 Nov 13 18:04 
Disco-NamespaceFile2-ic-5721-Data.db
-rw-r--r--  1 andre  staff  167789502 Nov 13 18:05 
Disco-NamespaceFile2-ic-5722-Data.db
-rw-r--r--  1 andre  staff  167800931 Nov 13 18:05 
Disco-NamespaceFile2-ic-5723-Data.db
-rw-r--r--  1 andre  staff  167810143 Nov 13 18:06 
Disco-NamespaceFile2-ic-5724-Data.db
-rw-r--r--  1 andre  staff  167800713 Nov 13 18:07 
Disco-NamespaceFile2-ic-5725-Data.db
-rw-r--r--  1 andre  staff  167812647 Nov 13 18:07 
Disco-NamespaceFile2-ic-5726-Data.db
-rw-r--r--  1 andre  staff  167782180 Nov 13 18:08 
Disco-NamespaceFile2-ic-5727-Data.db
-rw-r--r--  1 andre  staff  167797488 Nov 13 18:09 
Disco-NamespaceFile2-ic-5728-Data.db
-rw-r--r--  1 andre  staff   12950043 Nov 13 18:09 
Disco-NamespaceFile2-ic-5729-Data.db
-rw-r--r--  1 andre  staff  167798810 Nov 13 18:09 
Disco-NamespaceFile2-ic-5730-Data.db
-rw-r--r--  1 andre  staff  167805918 Nov 13 18:10 
Disco-NamespaceFile2-ic-5731-Data.db
-rw-r--r--  1 andre  staff  167805189 Nov 13 18:10 
Disco-NamespaceFile2-ic-5732-Data.db
-rw-r--r--  1 andre  staff   12563937 Nov 13 18:10 
Disco-NamespaceFile2-ic-5733-Data.db
-rw-r--r--  1 andre  staff   16110649 Nov 13 18:11 
Disco-NamespaceFile2-ic-5748-Data.db


Of these, the row in question was present on:
Disco-NamespaceFile2-ic-5337-Data.db - tombstone column
Disco-NamespaceFile2-ic-5719-Data.db - no trace of that column
Disco-NamespaceFile2-ic-5748-Data.db - live column with original timestamp

Is this the information you requested?

Best regards,
André

Re: Repair completes successfully but data is still inconsistent

2014-11-26 Thread Robert Coli
On Wed, Nov 26, 2014 at 10:17 AM, André Cruz andre.c...@co.sapo.pt wrote:

 Of these, the row in question was present on:
 Disco-NamespaceFile2-ic-5337-Data.db - tombstone column
 Disco-NamespaceFile2-ic-5719-Data.db - no trace of that column
 Disco-NamespaceFile2-ic-5748-Data.db - live column with original timestamp

 Is this the information you requested?


Yes. Do you know if 5748 was created as a result of compaction or via a
flush from a memtable?

=Rob


Re: Repair completes successfully but data is still inconsistent

2014-11-24 Thread André Cruz
On 21 Nov 2014, at 19:01, Robert Coli rc...@eventbrite.com wrote:
 
 2- Why won’t repair propagate this column value to the other nodes? Repairs 
 have run every day and the value is still missing on the other nodes.
 
 No idea. Are you sure it's not expired via TTL or masked in some other way? 
 When you ask that node for it at CL.ONE, do you get this value?

This data does not use TTLs. What other reason could there be for a mask? If I 
connect using cassandra-cli to that specific node, which then becomes the 
coordinator, is it guaranteed not to ask another node when CL is ONE and it 
contains that row?
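
For reference, the check I have in mind is just the get from my first mail, 
pointed at that node with the CL lowered (host placeholder below; whether this 
actually forces a purely local read is exactly what I am unsure about):

$ cassandra-cli -h XXX.XXX.XXX.NN   # the replica that still returns the column
[disco@Disco] consistencylevel as ONE;
[disco@Disco] get NamespaceFile2['45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX']['XXXFILEXXX'];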

  
 “Cassandra has never stored data consistently except by fortunate accident.”

I wish I had read that a few years back. :)

Thank you,
André

Re: Repair completes successfully but data is still inconsistent

2014-11-24 Thread Robert Coli
On Mon, Nov 24, 2014 at 10:39 AM, André Cruz andre.c...@co.sapo.pt wrote:

 This data does not use TTLs. What other reason could there be for a mask?
 If I connect using cassandra-cli to that specific node, which becomes the
 coordinator, is it guaranteed to not ask another node when CL is ONE and it
 contains that row?


Other than rare cases of writing doomstones (deletes with timestamps in the 
future), I'm not sure in what case this might occur.

But for any given value on any given node, you can verify the value it has
in 100% of SSTables... that's what both the normal read path and repair
should do when reconciling row fragments into the materialized row, right?
It's hard to understand a case where repair fails there, and I might suggest
providing that set of SSTables attached to an Apache JIRA.
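
Something along these lines (illustrative only; it uses the hex row key from
your original mail, and assumes sstable2json and the schema are available on
that node) would dump that row from every live SSTable, so you can see exactly
which file contributes which value:

$ cd /servers/storage/cassandra-data/Disco/NamespaceFile2
$ for f in Disco-NamespaceFile2-ic-*-Data.db; do
    echo "== $f"
    sstable2json "$f" -k 001045fc899641bc429ba3825da9294eb59c072f646f7267656d00
  done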

=Rob


Re: Repair completes successfully but data is still inconsistent

2014-11-21 Thread André Cruz
On 19 Nov 2014, at 19:53, Robert Coli rc...@eventbrite.com wrote:
 
 My hunch is that you originally triggered this by picking up some obsolete 
 SSTables during the 1.2 era. Probably if you clean up the existing zombies 
 you will not encounter them again, unless you encounter another "obsolete 
 sstables marked live" bug. I agree that your compaction exceptions in the 1.2 
 era are likely implicated.

I’m still in the “1.2 era”; I upgraded from 1.2.16 to 1.2.19. You mentioned 
that obsolete sstables may have been picked up, and each node had 3-9 
exceptions like this when they were brought down prior to being updated:

ERROR [CompactionExecutor:15173] 2014-10-21 15:04:01,923 CassandraDaemon.java 
(line 191) Exception in thread Thread[CompactionExecutor:15173,1,main]
java.util.concurrent.RejectedExecutionException: Task 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1dad30c0 
rejected from 
org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@555b9c78[Terminated,
 pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14052]
   at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown 
Source)
   at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
   at 
java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(Unknown Source)
   at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(Unknown 
Source)
   at java.util.concurrent.ScheduledThreadPoolExecutor.submit(Unknown 
Source)
   at 
org.apache.cassandra.io.sstable.SSTableDeletingTask.schedule(SSTableDeletingTask.java:65)

Can it be that they were all in the middle of a compaction (Leveled compaction) 
and the new sstables were written but the old ones were not deleted? Will 
Cassandra blindly pick up old and new sstables when it restarts?

If so, I have a few questions:

1- What is the correct sequence of commands to bring down a node safely? I know 
that “drain” was used here, because it is in the log. I’ve read somewhere that 
drain should not be used, and that “disablethrift”, “disablegossip”, “flush” and 
then waiting a while is the correct way.

2- Why won’t repair propagate this column value to the other nodes? Repairs 
have run every day and the value is still missing on the other nodes.

3- If I have these rogue sstables loaded this seems like a time bomb. Down the 
line I will again delete columns that will reappear after some time. Is there a 
way I can find these sstables that should not be there? I thought the timestamp 
of the file would help but this zombie column is present on one of the latest 
sstables.

Thanks,
André

Re: Repair completes successfully but data is still inconsistent

2014-11-21 Thread Robert Coli
On Fri, Nov 21, 2014 at 3:11 AM, André Cruz andre.c...@co.sapo.pt wrote:

 Can it be that they were all in the middle of a compaction (Leveled
 compaction) and the new sstables were written but the old ones were not
 deleted? Will Cassandra blindly pick up old and new sstables when it
 restarts?


Yes.

https://issues.apache.org/jira/browse/CASSANDRA-6756

Also note linked tickets there like :

https://issues.apache.org/jira/browse/CASSANDRA-6503

6503 is fix version 1.2.14 ...


 1- What is the correct sequence of commands to bring down a node safely? I
 know that “drain” was used here, because it is in the log. I’ve read
 somewhere that drain should not be used, and that “disablethrift”,
 “disablegossip”, “flush” and then waiting a while is the correct way.


drain is the canonical answer. But drain has historically not always
worked. In general, when it hasn't worked, it has flushed properly but not
marked the commitlog clean properly.

https://issues.apache.org/jira/browse/CASSANDRA-4446

https://issues.apache.org/jira/browse/CASSANDRA-5911

5911 is too late for 1.2 and will not be merged there.
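
If you want to be extra conservative on 1.2, a belt-and-suspenders shutdown
looks roughly like the sketch below (the host is a placeholder, and the last
line stands in for however you actually stop the process):

$ nodetool -h <node> disablethrift   # stop serving Thrift clients
$ nodetool -h <node> disablegossip   # mark the node down in gossip so coordinators stop routing to it
$ sleep 30                           # let in-flight requests finish
$ nodetool -h <node> flush           # flush memtables to SSTables
$ nodetool -h <node> drain           # flush again and stop accepting writes
$ sudo service cassandra stop        # placeholder; stop the JVM however you normally do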


 2- Why won’t repair propagate this column value to the other nodes?
 Repairs have run every day and the value is still missing on the other nodes.


No idea. Are you sure it's not expired via TTL or masked in some other way?
When you ask that node for it at CL.ONE, do you get this value?


 3- If I have these rogue sstables loaded this seems like a time bomb. Down
 the line I will again delete columns that will reappear after some time. Is
 there a way I can find these sstables that should not be there? I thought
 the timestamp of the file would help but this zombie column is present on
 one of the latest sstables.


Unfortunately, not really. You'd need the patch from CASSANDRA-6568, and
you'd need to have been continuously capturing your live SSTable set to
determine when the obsolete SSTables were picked up again.

https://issues.apache.org/jira/browse/CASSANDRA-6568
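
Even a crude periodic listing of the data directory gives you that history.
Purely illustrative (the output directory is made up, and tablesnap does this
properly by watching the data directory):

$ crontab -l
# hourly record of the live SSTable set for this CF; assumes /var/tmp/sstable-sets exists
0 * * * * ls -l /servers/storage/cassandra-data/Disco/NamespaceFile2/*-Data.db > /var/tmp/sstable-sets/NamespaceFile2-$(date +\%Y\%m\%d\%H).txt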

The experience you're having is, AFAICT, irrecoverably fatal to
consistency. This is why, at the summit this year, when Apple announced it
had encountered and fixed like 5 bugs of this type, I summarized their talk
in chats as the talk where Apple showed that, despite what you've heard
about quorum reads and writes, Cassandra has never stored data consistently
except by fortunate accident.

=Rob
http://twitter.com/rcolidba


Re: Repair completes successfully but data is still inconsistent

2014-11-19 Thread André Cruz
On 19 Nov 2014, at 00:43, Robert Coli rc...@eventbrite.com wrote:
 
 @OP : can you repro if you run a major compaction between the deletion and 
 the tombstone collection?

This happened in production and, AFAIK, for the first time in a system that has 
been running for 2 years. We upgraded the Cassandra version last month, so 
there’s that difference, but the upgrade happened before the original delete of 
this column.

I have found more examples of zombie columns like this (approximately 30k 
columns out of a 1.2M total) and they are all in this same row of this CF. I 
should point out that we have a sister CF where we do similar inserts/deletes, 
but it uses STCS, and it doesn’t exhibit this problem. 

I don’t think I can reproduce this easily in a test environment.

 
 Basically, I am conjecturing that a compaction bug or one of the handful of 
 "unmask previously deleted data" bugs is resulting in the unmasking of a 
 non-tombstone row which is sitting in an SSTable.
 
 OP could also support this conjecture by running sstablekeys on other 
 SSTables on the 3rd replica and determining what masked values there are for 
 the row prior to deletion. If the data is sitting in an old SSTable, this is 
 suggestive.

There are 3 sstables that have this row on the 3rd replica:

Disco-NamespaceFile2-ic-5337-Data.db.json - Has the column tombstone
Disco-NamespaceFile2-ic-5719-Data.db.json - Has no value for this column
Disco-NamespaceFile2-ic-5748-Data.db.json - Has the original value

 
 One last question for OP would be whether the nodes were restarted during the 
 time period this bug was observed. An assortment of the "unmask previously 
 deleted data" bugs come from dead sstables in the data directory being 
 marked live on a restart.

All the nodes were restarted on 21-23 October, for the upgrade (1.2.16 -> 
1.2.19) I mentioned. The delete happened after. I should also point out that we 
were experiencing problems related to CASSANDRA-4206 and CASSANDRA-7808.

ERROR 15:01:51,885 Exception in thread Thread[CompactionExecutor:15172,1,main]
java.lang.AssertionError: originally calculated column size of 78041151 but now 
it is 78041303
at 
org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:135)
at 
org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
at 
org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
at 
org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at 
org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
at 
org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

We saw that line multiple times in the logs, always for the same row judging by 
the 78041151 and 78041303 sizes, even though the data seemed fine. Could that 
row be the one experiencing problems now? Maybe with the upgrade the new 
Cassandra correctly compacted this row and all hell broke loose?

If so, is there an easy way to fix this? Shouldn’t repair also propagate this 
zombie column to the other nodes?
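
If manual cleanup turns out to be the answer, I assume something along these 
lines (re-issuing the delete at a higher consistency level, before gc_grace 
expires again) would do it for each affected column:

[disco@Disco] consistencylevel as ALL;
[disco@Disco] del NamespaceFile2['45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX']['XXXFILEXXX'];

But I would rather understand the root cause first.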

Thank you and best regards,
André Cruz




Re: Repair completes successfully but data is still inconsistent

2014-11-19 Thread André Cruz
On 19 Nov 2014, at 11:37, André Cruz andre.c...@co.sapo.pt wrote:
 
 All the nodes were restarted on 21-23 October, for the upgrade (1.2.16 -> 
 1.2.19) I mentioned. The delete happened after. I should also point out that 
 we were experiencing problems related to CASSANDRA-4206 and CASSANDRA-7808.

Another possible cause is these exceptions I found in the log as the nodes 
were shut down and brought up with the new version:

INFO [RMI TCP Connection(270364)-10.134.101.18] 2014-10-21 15:04:00,867 
StorageService.java (line 939) DRAINED
ERROR [CompactionExecutor:15173] 2014-10-21 15:04:01,923 CassandraDaemon.java 
(line 191) Exception in thread Thread[CompactionExecutor:15173,1,main]
java.util.concurrent.RejectedExecutionException: Task 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1dad30c0 
rejected from 
org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@555b9c78[Terminated,
 pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14052]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown 
Source)
at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
at 
java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(Unknown 
Source)
at java.util.concurrent.ScheduledThreadPoolExecutor.submit(Unknown 
Source)
at 
org.apache.cassandra.io.sstable.SSTableDeletingTask.schedule(SSTableDeletingTask.java:65)
at 
org.apache.cassandra.io.sstable.SSTableReader.releaseReference(SSTableReader.java:976)
at 
org.apache.cassandra.db.DataTracker.removeOldSSTablesSize(DataTracker.java:370)
at org.apache.cassandra.db.DataTracker.postReplace(DataTracker.java:335)
at org.apache.cassandra.db.DataTracker.replace(DataTracker.java:329)
at 
org.apache.cassandra.db.DataTracker.replaceCompactedSSTables(DataTracker.java:232)
at 
org.apache.cassandra.db.ColumnFamilyStore.replaceCompactedSSTables(ColumnFamilyStore.java:995)
at 
org.apache.cassandra.db.compaction.CompactionTask.replaceCompactedSSTables(CompactionTask.java:270)
at 
org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:230)
at 
org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at 
org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
at 
org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)


Each node has 4-9 of these exceptions as it is going down after being drained. 
It seems Cassandra was trying to delete an sstable. Can this be related?

Best regards,
André Cruz

Re: Repair completes successfully but data is still inconsistent

2014-11-19 Thread Robert Coli
On Wed, Nov 19, 2014 at 5:18 AM, André Cruz andre.c...@co.sapo.pt wrote:

 Each node has 4-9 of these exceptions as it is going down after being
 drained. It seems Cassandra was trying to delete an sstable. Can this be
 related?


That seems plausible, though the fact that the files you indicate now hold the
versions of the data they do suggests that the file was eventually deleted
successfully.

My hunch is that you originally triggered this by picking up some obsolete
SSTables during the 1.2 era. Probably if you clean up the existing zombies
you will not encounter them again, unless you encounter another "obsolete
sstables marked live" bug. I agree that your compaction exceptions in the
1.2 era are likely implicated.

=Rob


Re: Repair completes successfully but data is still inconsistent

2014-11-18 Thread André Cruz
On 18 Nov 2014, at 01:08, Michael Shuler mich...@pbandjelly.org wrote:
 
 André, does `nodetool gossipinfo` show all the nodes in schema agreement?
 

Yes:

$ nodetool -h XXX.XXX.XXX.XXX gossipinfo |grep -i schema
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e


Best regards,
André

Re: Repair completes successfully but data is still inconsistent

2014-11-18 Thread Michael Shuler

`nodetool cleanup` also looks interesting as an option.

--
Michael



Re: Repair completes successfully but data is still inconsistent

2014-11-18 Thread Robert Coli
On Tue, Nov 18, 2014 at 12:46 PM, Michael Shuler mich...@pbandjelly.org
wrote:

 `nodetool cleanup` also looks interesting as an option.


I don't understand why cleanup or scrub would help with a case where data
is being un-tombstoned.


1 November - column is deleted - gc_grace_period is 10 days
8 November - all 3 replicas have tombstone
13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has
the original value (!), with the original timestamp…


To me the smoking gun here is that the 3rd replica has the original value.

@OP : can you repro if you run a major compaction between the deletion and
the tombstone collection?

Basically, I am conjecturing that a compaction bug or one of the handful of
"unmask previously deleted data" bugs is resulting in the unmasking of a
non-tombstone row which is sitting in an SSTable.

OP could also support this conjecture by running sstablekeys on other
SSTables on the 3rd replica and determining what masked values there are for
the row prior to deletion. If the data is sitting in an old SSTable, this
is suggestive.
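
i.e. something like the loop below (illustrative; it greps for the hex row key
from the original post, in the same form sstablekeys prints it) would at least
show which SSTables on that replica contain the row at all:

$ for f in /servers/storage/cassandra-data/Disco/NamespaceFile2/*-Data.db; do
    sstablekeys "$f" | grep -qi 001045fc899641bc429ba3825da9294eb59c072f646f7267656d00 \
      && echo "$f has the row"
  done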

One last question for OP would be whether the nodes were restarted during
the time period this bug was observed. An assortment of the "unmask
previously deleted data" bugs come from dead sstables in the data
directory being marked live on a restart.

=Rob
http://twitter.com/rcolidba


Re: Repair completes successfully but data is still inconsistent

2014-11-17 Thread André Cruz
On 14 Nov 2014, at 18:44, André Cruz andre.c...@co.sapo.pt wrote:
 
 On 14 Nov 2014, at 18:29, Michael Shuler mich...@pbandjelly.org wrote:
 
 On 11/14/2014 12:12 PM, André Cruz wrote:
 Some extra info. I checked the backups and on the 8th of November, all 3 
 replicas had the tombstone of the deleted column. So:
 
 1 November - column is deleted - gc_grace_period is 10 days
 8 November - all 3 replicas have tombstone
 13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has 
 the original value (!), with the original timestamp…
 
 After seeing your first post, this is helpful info. I'm curious what the 
 logs show between the 8th-13th, particularly around the 10th-11th :)
 
 Which logs in particular, just the ones from the 3rd machine which has the 
 zombie column? What should I be looking for? :)

I have checked the logs of the 3 replicas for that period and nothing really 
jumps out. Still, repairs have been running daily, the log reports that the CF 
is synced, and as of this moment one of the replicas still returns the zombie 
column so they don’t agree on the data.

André

Re: Repair completes successfully but data is still inconsistent

2014-11-17 Thread Michael Shuler

On 11/17/2014 05:22 AM, André Cruz wrote:

I have checked the logs of the 3 replicas for that period and nothing
really jumps out. Still, repairs have been running daily, the log
reports that the CF is synced, and as of this moment one of the
replicas still returns the zombie column so they don’t agree on the
data.


I'm at a bit of a loss. Readers, is `nodetool scrub` going to be helpful 
here, or any other suggestions of things to look for?


André, does `nodetool gossipinfo` show all the nodes in schema agreement?

--
Michael


Repair completes successfully but data is still inconsistent

2014-11-14 Thread André Cruz
Hello.

So, I have detected a data inconsistency between my nodes:

(Consistency level is ONE)
[disco@Disco] get 
NamespaceFile2['45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX']['XXXFILEXXX'];
Value was not found
Elapsed time: 48 msec(s).
[disco@Disco] get 
NamespaceFile2['45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX']['XXXFILEXXX'];
= (name=XXXFILEXXX, value=a96a7f54-49d4-11e4-8185-e0db55018fa4, 
timestamp=1412213845215797)
Elapsed time: 7.45 msec(s).
[disco@Disco]

I have a replication factor of 3, and if I read with a consistency level of 
QUORUM, the result converges to the column being present. The strange thing is: 
I deleted it on 2014-11-01 03:39:25, and my writes use QUORUM. My 
gc_grace_period is 10 days, and if I had not been running repairs during this 
period this could be explained, but the fact is that repairs have been running 
every day with no sign of problems. First of all, some data:

Cassandra version: 1.2.19 (we upgraded from 1.2.16 on 2014-10-22)
NamespaceFile2 compaction strategy: LeveledCompactionStrategy
CFStats:
Read Count: 3376397
Read Latency: 0.24978254956392865 ms.
Write Count: 21254817
Write Latency: 0.04713691710448507 ms.
Pending Tasks: 0
Column Family: NamespaceFile2
SSTable count: 28
SSTables in each level: [1, 10, 17, 0, 0, 0, 0, 0, 0]
Space used (live): 4961751613
Space used (total): 5021967044
SSTable Compression Ratio: 0.5773464014899713
Number of Keys (estimate): 9369856
Memtable Columns Count: 82901
Memtable Data Size: 19352636
Memtable Switch Count: 283
Read Count: 3376397
Read Latency: NaN ms.
Write Count: 21254817
Write Latency: 0.049 ms.
Pending Tasks: 0
Bloom Filter False Positives: 21904
Bloom Filter False Ratio: 0.0
Bloom Filter Space Used: 7405728
Compacted row minimum size: 61
Compacted row maximum size: 74975550
Compacted row mean size: 795
Average live cells per slice (last five minutes): 1025.0
Average tombstones per slice (last five minutes): 0.0

This particular row has 1.1M columns, 70934390 bytes, and the offending key 
encoded by the partitioner is 
001045fc899641bc429ba3825da9294eb59c072f646f7267656d00.

First I checked which nodes were responsible for this particular row:

$ nodetool -h XXX.XXX.XXX.112 getendpoints Disco NamespaceFile2 
'45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX'
XXX.XXX.XXX.17
XXX.XXX.XXX.14
XXX.XXX.XXX.18


This is the repair log of these particular nodes:

--XXX.XXX.XXX.14--
INFO [AntiEntropySessions:366] 2014-11-14 04:34:55,740 AntiEntropyService.java 
(line 651) [repair #956244b0-6bb7-11e4-8eec-f5962c02982e] new session: will 
sync /XXX.XXX.XXX.14, /XXX.XXX.XXX.12, /XXX.XXX.XXX.18 on range 
(14178431955039102644307275309657008810,28356863910078205288614550619314017621] 
for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-14 04:36:51,125 AntiEntropyService.java (line 
764) [repair #956244b0-6bb7-11e4-8eec-f5962c02982e] NamespaceFile2 is fully 
synced
INFO [AntiEntropySessions:366] 2014-11-14 04:36:51,126 AntiEntropyService.java 
(line 698) [repair #956244b0-6bb7-11e4-8eec-f5962c02982e] session completed 
successfully
...
INFO [AntiEntropySessions:367] 2014-11-14 04:36:51,103 AntiEntropyService.java 
(line 651) [repair #da2543e0-6bb7-11e4-8eec-f5962c02982e] new session: will 
sync /XXX.XXX.XXX.14, /XXX.XXX.XXX.11, /XXX.XXX.XXX.17 on range 
(155962751505430129087380028406227096910,0] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-14 04:38:20,949 AntiEntropyService.java (line 
764) [repair #da2543e0-6bb7-11e4-8eec-f5962c02982e] NamespaceFile2 is fully 
synced
INFO [AntiEntropySessions:367] 2014-11-14 04:38:20,949 AntiEntropyService.java 
(line 698) [repair #da2543e0-6bb7-11e4-8eec-f5962c02982e] session completed 
successfully
...
INFO [AntiEntropySessions:368] 2014-11-14 04:38:20,930 AntiEntropyService.java 
(line 651) [repair #0fafc710-6bb8-11e4-8eec-f5962c02982e] new session: will 
sync /XXX.XXX.XXX.14, /XXX.XXX.XXX.17, /XXX.XXX.XXX.18 on range 
(0,14178431955039102644307275309657008810] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-14 04:40:34,237 AntiEntropyService.java (line 
764) [repair #0fafc710-6bb8-11e4-8eec-f5962c02982e] NamespaceFile2 is fully 
synced
INFO [AntiEntropySessions:368] 2014-11-14 04:40:34,237 AntiEntropyService.java 
(line 698) [repair #0fafc710-6bb8-11e4-8eec-f5962c02982e] session completed 
successfully


--XXX.XXX.XXX.17--
INFO [AntiEntropySessions:362] 2014-11-13 04:33:01,974 AntiEntropyService.java 
(line 651) [repair #27293450-6aee-11e4-aabc-9b69569c95c3] new session: will 
sync /XXX.XXX.XXX.17, 

Re: Repair completes successfully but data is still inconsistent

2014-11-14 Thread André Cruz
Some extra info. I checked the backups and on the 8th of November, all 3 
replicas had the tombstone of the deleted column. So:

1 November - column is deleted - gc_grace_period is 10 days
8 November - all 3 replicas have tombstone
13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the 
original value (!), with the original timestamp…

Is there a logical explanation for this behaviour?

Thank you,
André Cruz

Re: Repair completes successfully but data is still inconsistent

2014-11-14 Thread Michael Shuler

On 11/14/2014 12:12 PM, André Cruz wrote:

Some extra info. I checked the backups and on the 8th of November, all 3 
replicas had the tombstone of the deleted column. So:

1 November - column is deleted - gc_grace_period is 10 days
8 November - all 3 replicas have tombstone
13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the 
original value (!), with the original timestamp…


After seeing your first post, this is helpful info. I'm curious what the 
logs show between the 8th-13th, particularly around the 10th-11th :)



Is there a logical explanation for this behaviour?


Not yet!

--
Kind regards,
Michael


Re: Repair completes successfully but data is still inconsistent

2014-11-14 Thread André Cruz
On 14 Nov 2014, at 18:29, Michael Shuler mich...@pbandjelly.org wrote:
 
 On 11/14/2014 12:12 PM, André Cruz wrote:
 Some extra info. I checked the backups and on the 8th of November, all 3 
 replicas had the tombstone of the deleted column. So:
 
 1 November - column is deleted - gc_grace_period is 10 days
 8 November - all 3 replicas have tombstone
 13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the 
 original value (!), with the original timestamp…
 
 After seeing your first post, this is helpful info. I'm curious what the logs 
 show between the 8th-13th, particularly around the 10th-11th :)

Which logs in particular, just the ones from the 3rd machine which has the 
zombie column? What should I be looking for? :)

Thanks for the help,
André