Failed bootstrap can cause NPE in batch_mutate on every node, taking down the
entire cluster
--------------------------------------------------------------------------------------------
Key: CASSANDRA-1463
URL: https://issues.apache.org/jira/browse/CASSANDRA-1463
Project: Cassandra
Issue Type: Bug
Components: Core
Affects Versions: 0.6.5
Reporter: David King
In adding a node to the cluster, the bootstrap failed (still investigating the
cause). An hour later, the entire cluster failed, preventing any writes from
being accepted. This exception started being printed to the logs:
{quote}
INFO [Timer-0] 2010-09-03 12:23:33,282 Gossiper.java (line 402) FatClient
/10.251.243.191 has been silent for 3600000ms, removing from gossip
ERROR [Timer-0] 2010-09-03 12:23:33,318 Gossiper.java (line 99) Gossip error
java.util.ConcurrentModificationException
at java.util.Hashtable$Enumerator.next(Hashtable.java:1048)
at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
at
org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
at java.util.TimerThread.mainLoop(Timer.java:534)
at java.util.TimerThread.run(Timer.java:484)
ERROR [pool-1-thread-69153] 2010-09-03 12:23:33,857 Cassandra.java (line 1659)
Internal error processing batch_mutate
java.lang.NullPointerException
at
org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:135)
at
org.apache.cassandra.locator.AbstractReplicationStrategy.getHintedEndpoints(AbstractReplicationStrategy.java:85)
at
org.apache.cassandra.service.StorageProxy.mutateBlocking(StorageProxy.java:204)
at
org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:415)
at
org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:1651)
at
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1166)
at
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
ERROR [pool-1-thread-69154] 2010-09-03 12:23:33,869 Cassandra.java (line 1659)
Internal error processing batch_mutate
java.lang.NullPointerException
at
org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:135)
at
org.apache.cassandra.locator.AbstractReplicationStrategy.getHintedEndpoints(AbstractReplicationStrategy.java:85)
at
org.apache.cassandra.service.StorageProxy.mutateBlocking(StorageProxy.java:204)
at
org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:415)
at
org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:1651)
at
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:1166)
at
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
{quote}
After a large number of iterations of that (at least thousands), the printed
exception was shortened (this shortening is what made me mistakenly file #1462)
to
{quote}
ERROR [pool-1-thread-68869] 2010-09-03 12:39:22,857 Cassandra.java (line 1659)
Internal error processing batch_mutate
java.lang.NullPointerException
ERROR [pool-1-thread-68869] 2010-09-03 12:39:22,883 Cassandra.java (line 1659)
Internal error processing batch_mutate
java.lang.NullPointerException
ERROR [pool-1-thread-68869] 2010-09-03 12:39:22,894 Cassandra.java (line 1659)
Internal error processing batch_mutate
java.lang.NullPointerException
ERROR [pool-1-thread-68970] 2010-09-03 12:39:22,985 Cassandra.java (line 1659)
Internal error processing batch_mutate
java.lang.NullPointerException
ERROR [pool-1-thread-68970] 2010-09-03 12:39:23,084 Cassandra.java (line 1659)
Internal error processing batch_mutate
java.lang.NullPointerException
{quote}
Rolling a restart over the cluster fixed it, but every node had to be restarted
before it started accepting writes again.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.