[jira] Updated: (CASSANDRA-1463) Failed bootstrap can cause NPE in batch_mutate on every node, taking down the entire cluster

Jonathan Ellis (JIRA) Fri, 03 Sep 2010 14:11:17 -0700

     [ https://issues.apache.org/jira/browse/CASSANDRA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jonathan Ellis updated CASSANDRA-1463:
--------------------------------------

    Attachment: 1463.txt

The CME (ConcurrentModificationException) is a red herring; the real problem 
is the NPE, caused by the IP being cleared out of the gossip records (as 
indicated in the log) but not out of the pending ranges.

The attached patch should fix the NPE. I'm looking at how painful it would be 
to fix the root cause (the orphaned pending range).
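
To illustrate the failure mode described above, here is a minimal sketch, not the attached patch. It assumes a liveness check that looks up per-endpoint state in a map and dereferences the result. The names LivenessSketch, EndpointInfo, isAliveUnguarded and isAliveGuarded are hypothetical, and endpoints are plain strings for brevity; none of this is Cassandra's actual code. Once the endpoint is removed from the gossip records while something else (here, the pending ranges) still refers to it, the unguarded lookup throws an NPE and a null check avoids it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for gossip state plus a liveness check.
// This is an illustration only, not Cassandra's FailureDetector.
class LivenessSketch {
    // Minimal per-endpoint record (hypothetical).
    static final class EndpointInfo {
        private final boolean alive;
        EndpointInfo(boolean alive) { this.alive = alive; }
        boolean isAlive() { return alive; }
    }

    // Endpoint address -> state; stands in for the gossip records.
    private final Map<String, EndpointInfo> gossipState = new ConcurrentHashMap<>();

    void markAlive(String endpoint) {
        gossipState.put(endpoint, new EndpointInfo(true));
    }

    // Mirrors the gossiper dropping an endpoint that has been silent too long.
    void removeFromGossip(String endpoint) {
        gossipState.remove(endpoint);
    }

    // Unguarded lookup: throws NullPointerException for an unknown endpoint.
    boolean isAliveUnguarded(String endpoint) {
        return gossipState.get(endpoint).isAlive();
    }

    // Guarded lookup: an endpoint with no gossip state is treated as not alive.
    boolean isAliveGuarded(String endpoint) {
        EndpointInfo info = gossipState.get(endpoint);
        return info != null && info.isAlive();
    }

    public static void main(String[] args) {
        LivenessSketch sketch = new LivenessSketch();
        String orphan = "10.251.243.191"; // address taken from the log in the report
        sketch.markAlive(orphan);
        sketch.removeFromGossip(orphan);
        // A caller still holding the address (e.g. via pending ranges) asks about it.
        System.out.println(sketch.isAliveGuarded(orphan));
    }
}
```

A guard like this only treats the symptom: the stale address is still handed to the liveness check on every write, which is why removing the orphaned pending range remains the real fix.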

> Failed bootstrap can cause NPE in batch_mutate on every node, taking down the 
> entire cluster
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1463
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1463
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.6
>
>         Attachments: 1463.txt
>
>
> While adding a node to the cluster, the bootstrap failed (still investigating 
> the cause). An hour later, the entire cluster failed and stopped accepting 
> writes. This exception started appearing in the logs:
> {quote}
>  INFO [Timer-0] 2010-09-03 12:23:33,282 Gossiper.java (line 402) FatClient /10.251.243.191 has been silent for 3600000ms, removing from gossip
> ERROR [Timer-0] 2010-09-03 12:23:33,318 Gossiper.java (line 99) Gossip error
> java.util.ConcurrentModificationException
>         at java.util.Hashtable$Enumerator.next(Hashtable.java:1048)
>         at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
>         at org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
>         at java.util.TimerThread.mainLoop(Timer.java:534)
>         at java.util.TimerThread.run(Timer.java:484)
> ERROR [pool-1-thread-69153] 2010-09-03 12:23:33,857 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
>         at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:135)
>         at org.apache.cassandra.locator.AbstractReplicationStrategy.getHintedEndpoints(AbstractReplicationStrategy.java:85)
>         at org.apache.cassandra.service.StorageProxy.mutateBlocking(StorageProxy.java:204)
>         at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:415)
>         at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:1651)
>         at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1166)
>         at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:636)
> ERROR [pool-1-thread-69154] 2010-09-03 12:23:33,869 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
>         at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:135)
>         at org.apache.cassandra.locator.AbstractReplicationStrategy.getHintedEndpoints(AbstractReplicationStrategy.java:85)
>         at org.apache.cassandra.service.StorageProxy.mutateBlocking(StorageProxy.java:204)
>         at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:415)
>         at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:1651)
>         at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1166)
>         at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:636)
> {quote}
> After a large number of iterations of that (at least thousands), the printed 
> exception was shortened (this shortening is what made me mistakenly file 
> #1462) to
> {quote}
> ERROR [pool-1-thread-68869] 2010-09-03 12:39:22,857 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
> ERROR [pool-1-thread-68869] 2010-09-03 12:39:22,883 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
> ERROR [pool-1-thread-68869] 2010-09-03 12:39:22,894 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
> ERROR [pool-1-thread-68970] 2010-09-03 12:39:22,985 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
> ERROR [pool-1-thread-68970] 2010-09-03 12:39:23,084 Cassandra.java (line 1659) Internal error processing batch_mutate
> java.lang.NullPointerException
> {quote}
> Rolling a restart over the cluster fixed it, but every node had to be 
> restarted before it started accepting writes again.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
