[ https://issues.apache.org/jira/browse/CASSANDRA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aaron Morton updated CASSANDRA-3548:
------------------------------------
Attachment: 0001-3548.patch
check for null
> NPE in AntiEntropyService$RepairSession.completed()
> ---------------------------------------------------
>
> Key: CASSANDRA-3548
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3548
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.1
> Environment: FreeBSD 8.2, JVM vendor/version: OpenJDK 64-Bit Server VM/1.6.0
> Reporter: Aaron Morton
> Assignee: Aaron Morton
> Priority: Minor
> Attachments: 0001-3548.patch
>
>
> This may be related to CASSANDRA-3519 (the cluster it was observed on is still
> running 1.0.1); however, I think there is still a race condition.
> Observed on a two-DC cluster, during a repair that spanned both DCs.
> {noformat}
> INFO [AntiEntropyStage:1] 2011-11-28 06:22:56,225 StreamingRepairTask.java (line 136) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] Forwarding streaming repair of 8602 ranges to /10.6.130.70 (to be streamed with /10.37.114.10)
> ...
> INFO [AntiEntropyStage:66] 2011-11-29 11:20:57,109 StreamingRepairTask.java (line 253) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] task succeeded
> ERROR [AntiEntropyStage:66] 2011-11-29 11:20:57,109 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[AntiEntropyStage:66,5,main]
> java.lang.NullPointerException
> at org.apache.cassandra.service.AntiEntropyService$RepairSession.completed(AntiEntropyService.java:712)
> at org.apache.cassandra.service.AntiEntropyService$RepairSession$Differencer$1.run(AntiEntropyService.java:912)
> at org.apache.cassandra.streaming.StreamingRepairTask$2.run(StreamingRepairTask.java:186)
> at org.apache.cassandra.streaming.StreamingRepairTask$StreamingRepairResponse.doVerb(StreamingRepairTask.java:255)
> at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:679)
> {noformat}
> One of the nodes involved in the repair session failed. For example (I am
> not sure whether this is from the same repair session as the streaming task
> above, but it illustrates the issue):
> {noformat}
> ERROR [AntiEntropySessions:1] 2011-11-28 19:39:52,507 AntiEntropyService.java (line 688) [repair #2bf19860-197f-11e1-0000-5ff37d368cb6] session completed with the following error
> java.io.IOException: Endpoint /10.29.60.10 died
> at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:725)
> at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:762)
> at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:192)
> at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
> at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
> at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:679)
> ERROR [GossipTasks:1] 2011-11-28 19:39:52,507 StreamOutSession.java (line 232) StreamOutSession /10.29.60.10 failed because {} died or was restarted/removed
> ERROR [GossipTasks:1] 2011-11-28 19:39:52,571 Gossiper.java (line 172) Gossip error
> java.util.ConcurrentModificationException
> at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:782)
> at java.util.ArrayList$Itr.next(ArrayList.java:754)
> at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:190)
> at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
> at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
> at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:679)
> {noformat}
> When a node is marked as failed,
> AntiEntropyService.RepairSession.forceShutdown() clears the activejobs map.
> But the jobs to other nodes will continue, and will eventually call
> completed().
> RepairSession.terminated should stop completed() from checking the map, but
> there is a race between the map being cleared and the flag being set. Also,
> if there is an error in the finally block, the flag won't be set.
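
The race described above can be sketched in isolation. This is a minimal illustration, not the Cassandra source and not necessarily what the attached patch does: the class RepairSessionSketch, the Job record, and the methods addJob(), clearJobsOnly(), completedUnchecked() and completed() are all hypothetical names chosen for the example. clearJobsOnly() stands in for the window in which the map has already been cleared but the terminated flag has not yet been set.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for a repair session (assumption, not Cassandra code).
class RepairSessionSketch {

    // Hypothetical job record: counts the differences still outstanding.
    static final class Job {
        private final AtomicInteger remaining;

        Job(int pending) {
            this.remaining = new AtomicInteger(pending);
        }

        // Returns true when the last outstanding difference has completed.
        boolean markDone() {
            return remaining.decrementAndGet() == 0;
        }
    }

    private final ConcurrentMap<String, Job> activeJobs = new ConcurrentHashMap<>();
    private volatile boolean terminated = false;

    void addJob(String name, int pending) {
        activeJobs.put(name, new Job(pending));
    }

    // Simulates the race window: the map is cleared, the flag is not yet set.
    void clearJobsOnly() {
        activeJobs.clear();
    }

    // Full shutdown as described in the report: clear the map, then set the flag.
    void forceShutdown() {
        activeJobs.clear();
        terminated = true;
    }

    // Unguarded variant: relies on the flag alone, so it throws
    // NullPointerException if the job was cleared before the flag was set.
    boolean completedUnchecked(String name) {
        if (terminated) {
            return false;
        }
        return activeJobs.get(name).markDone();
    }

    // Guarded variant: treats a missing job as "session already shut down".
    boolean completed(String name) {
        if (terminated) {
            return false;
        }
        Job job = activeJobs.get(name);
        if (job == null) {
            return false;
        }
        return job.markDone();
    }
}
```

With a job registered and clearJobsOnly() called, completedUnchecked() throws a NullPointerException while completed() returns false. The null check makes the late callback harmless regardless of whether the flag was set in time.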