NPE in AntiEntropyService$RepairSession.completed()
---------------------------------------------------
Key: CASSANDRA-3548
URL: https://issues.apache.org/jira/browse/CASSANDRA-3548
Project: Cassandra
Issue Type: Bug
Components: Core
Affects Versions: 1.0.1
Environment: FreeBSD 8.2, JVM vendor/version: OpenJDK 64-Bit Server VM/1.6.0
Reporter: Aaron Morton
Assignee: Aaron Morton
Priority: Minor
This may be related to CASSANDRA-3519 (the cluster it was observed on is still
running 1.0.1); however, I think there is still a race condition.
Observed on a two-DC cluster, during a repair that spanned the DCs.
{noformat}
INFO [AntiEntropyStage:1] 2011-11-28 06:22:56,225 StreamingRepairTask.java (line 136) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] Forwarding streaming repair of 8602 ranges to /10.6.130.70 (to be streamed with /10.37.114.10)
...
INFO [AntiEntropyStage:66] 2011-11-29 11:20:57,109 StreamingRepairTask.java (line 253) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] task succeeded
ERROR [AntiEntropyStage:66] 2011-11-29 11:20:57,109 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[AntiEntropyStage:66,5,main]
java.lang.NullPointerException
	at org.apache.cassandra.service.AntiEntropyService$RepairSession.completed(AntiEntropyService.java:712)
	at org.apache.cassandra.service.AntiEntropyService$RepairSession$Differencer$1.run(AntiEntropyService.java:912)
	at org.apache.cassandra.streaming.StreamingRepairTask$2.run(StreamingRepairTask.java:186)
	at org.apache.cassandra.streaming.StreamingRepairTask$StreamingRepairResponse.doVerb(StreamingRepairTask.java:255)
	at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:679)
{noformat}
One of the nodes involved in the repair session failed. (I am not sure whether
the log below is from the same repair session as the streaming task above, but
it illustrates the issue.)
{noformat}
ERROR [AntiEntropySessions:1] 2011-11-28 19:39:52,507 AntiEntropyService.java (line 688) [repair #2bf19860-197f-11e1-0000-5ff37d368cb6] session completed with the following error
java.io.IOException: Endpoint /10.29.60.10 died
	at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:725)
	at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:762)
	at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:192)
	at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
	at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
	at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:679)
ERROR [GossipTasks:1] 2011-11-28 19:39:52,507 StreamOutSession.java (line 232) StreamOutSession /10.29.60.10 failed because {} died or was restarted/removed
ERROR [GossipTasks:1] 2011-11-28 19:39:52,571 Gossiper.java (line 172) Gossip error
java.util.ConcurrentModificationException
	at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:782)
	at java.util.ArrayList$Itr.next(ArrayList.java:754)
	at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:190)
	at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
	at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
	at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:679)
{noformat}
When a node is marked as failed,
AntiEntropyService.RepairSession.forceShutdown() clears the activejobs map. But
the jobs to other nodes will continue, and will eventually call completed().
RepairSession.terminated should stop completed() from checking the map, but
there is a race with the map being cleared, and if there is an error in the
finally block the flag won't be set.
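
To make the interleaving concrete, here is a minimal, single-threaded sketch of
the race as I understand it. It is not the Cassandra code: the class
RepairSessionSketch, the Job type and the methods clearJobs(), markTerminated(),
completedUnsafe() and completedSafe() are all hypothetical names invented for
illustration. It only assumes what is described above, namely a map of active
jobs that gets cleared on failure and a terminated flag that is set separately.
A null check on the map lookup is one possible defence, not necessarily the
right fix.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RepairSessionSketch {
    /** Hypothetical stand-in for a repair job; counts outstanding differencers. */
    static final class Job {
        private int remaining;
        Job(int remaining) { this.remaining = remaining; }
        /** Returns true when the last outstanding differencer has completed. */
        synchronized boolean completedOne() { return --remaining == 0; }
    }

    private final Map<String, Job> activeJobs = new ConcurrentHashMap<>();
    private volatile boolean terminated = false;

    void addJob(String cfname, int differencers) {
        activeJobs.put(cfname, new Job(differencers));
    }

    // Shutdown split into two steps so the window between them is visible.
    void clearJobs() { activeJobs.clear(); }      // step 1: map is emptied
    void markTerminated() { terminated = true; }  // step 2: flag set later (or never)

    /** Relies on the flag alone; throws NPE if called between step 1 and step 2. */
    boolean completedUnsafe(String cfname) {
        if (terminated) return false;
        Job job = activeJobs.get(cfname);
        return job.completedOne(); // job is null once the map has been cleared
    }

    /** Assumed defensive variant: tolerates a job that has already been removed. */
    boolean completedSafe(String cfname) {
        Job job = activeJobs.get(cfname);
        if (job == null) return false; // session was shut down underneath us
        return job.completedOne();
    }

    public static void main(String[] args) {
        RepairSessionSketch session = new RepairSessionSketch();
        session.addJob("cf1", 1);
        session.clearJobs(); // a node was convicted; terminated is not set yet

        try {
            session.completedUnsafe("cf1");
            System.out.println("completedUnsafe: no exception");
        } catch (NullPointerException e) {
            System.out.println("completedUnsafe: NullPointerException");
        }
        System.out.println("completedSafe returned " + session.completedSafe("cf1"));
    }
}
```

In the sketch the window is forced by calling clearJobs() without
markTerminated(); in a real session the two steps would run on a different
thread from the streaming callback, so the same ordering can occur by chance.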