On 19 Nov 2014, at 19:53, Robert Coli <rc...@eventbrite.com> wrote:
> 
> My hunch is that you originally triggered this by picking up some obsolete 
> SSTables during the 1.2 era. Probably if you clean up the existing zombies 
> you will not encounter them again, unless you encounter another "obsolete 
> sstables marked live" bug. I agree that your compaction exceptions in the 1.2 
> era are likely implicated.

I’m still in the “1.2 era”; I upgraded from 1.2.16 to 1.2.19. You mentioned that obsolete sstables may have been picked up, and each node logged 3-9 exceptions like the one below when it was brought down prior to being upgraded:

ERROR [CompactionExecutor:15173] 2014-10-21 15:04:01,923 CassandraDaemon.java (line 191) Exception in thread Thread[CompactionExecutor:15173,1,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1dad30c0 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@555b9c78[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14052]
       at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown Source)
       at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
       at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(Unknown Source)
       at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(Unknown Source)
       at java.util.concurrent.ScheduledThreadPoolExecutor.submit(Unknown Source)
       at org.apache.cassandra.io.sstable.SSTableDeletingTask.schedule(SSTableDeletingTask.java:65)

Could it be that they were all in the middle of a (leveled) compaction, so the new sstables had been written but the old ones were never deleted? Will Cassandra blindly pick up both the old and the new sstables when it restarts?

If so, I have a few questions:

1- What is the correct sequence of commands to bring down a node safely? I know that “drain” was used here, because it is in the log. I’ve read somewhere that drain should not be used, and that “disablethrift”, “disablegossip”, “flush” and then waiting a while is the correct way; I’ve put the sequence I have in mind in the first sketch below.

2- Why won’t repair propagate this column value to the other nodes? Repairs have run every day and the value is still missing on the other nodes.

3- If these rogue sstables are loaded, this seems like a time bomb: down the line I will again delete columns that will reappear after some time. Is there a way I can find the sstables that should not be there? I thought the timestamp of the file would help, but this zombie column is present in one of the latest sstables. The second sketch below is the only approach I could think of.
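
For question 1, this is the shutdown sequence I have in mind, pieced together from what I’ve read; the wait time is arbitrary and the last line stands for however the process is normally stopped, so please correct it if it’s wrong:

    nodetool disablethrift        # stop accepting client (Thrift) requests
    nodetool disablegossip        # leave the ring so other nodes stop routing to this one
    nodetool flush                # flush memtables to sstables
    sleep 60                      # arbitrary pause to let in-flight writes and compactions settle
    sudo service cassandra stop   # placeholder for however the node is normally stopped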
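
For question 3, the only approach I could come up with is dumping every sstable of the affected column family and checking which ones still contain the deleted column. The paths and names below are placeholders, and I’m assuming sstable2json from the 1.2 tools is the right tool for this (and that the column name shows up as text in the dump rather than hex, which depends on the comparator):

    cd /var/lib/cassandra/data/MyKeyspace/MyColumnFamily
    for f in *-Data.db; do
        # dump the sstable and look for the column that keeps coming back
        if sstable2json "$f" | grep -q 'zombie_column_name'; then
            echo "$f still contains the column"
        fi
    done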

Thanks,
André
