[
https://issues.apache.org/jira/browse/CASSANDRA-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908605#action_12908605
]
Dan Retzlaff commented on CASSANDRA-1494:
-----------------------------------------
The two nodes that kept going up and down are actually running Cassandra 0.6.1.
(Sorry for omitting this earlier.) I'm new to Cassandra, but it seems entirely
possible that they also hit the ConcurrentModificationException, and that the
then-broken timer task caused repeated, faulty FailureDetector evictions. This
failure mode bit us twice (we decommissioned two nodes), so if there's still
uncertainty I'm sure I can reproduce it. Unfortunately, the log files on those
two machines have rotated since the failures, so I can't look for more evidence
from those events.
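
For anyone unfamiliar with this class of exception: the stack trace quoted
below shows it being thrown from a Hashtable iterator inside
Gossiper.doStatusCheck. The following is only a minimal, standalone sketch of
how that kind of failure can arise and one common way to avoid it. It is not
the Cassandra code and not the fix from the ticket. The class name, method
names, node names, and timestamps are all hypothetical, and the snapshot
approach is just one generic technique.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;

public class StatusCheckSketch {
    // Hypothetical stand-in for an endpoint table: name -> last-heard time (ms).
    static Map<String, Long> newEndpointTable() {
        Map<String, Long> table = new Hashtable<>();
        table.put("node-a", 0L);      // stale
        table.put("node-b", 1_000L);  // stale
        table.put("node-c", 9_500L);  // recently heard from
        return table;
    }

    // Unsafe: removes entries while iterating over the live key set.
    // Returns true if the iterator detected the modification and threw.
    static boolean unsafeEvictThrows(Map<String, Long> table, long now, long timeoutMs) {
        try {
            for (String endpoint : table.keySet()) {
                if (now - table.get(endpoint) > timeoutMs) {
                    table.remove(endpoint); // structural change during iteration
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Safer: iterate over a snapshot of the keys, so removals from the
    // underlying table cannot invalidate the iterator in use.
    static List<String> safeEvict(Map<String, Long> table, long now, long timeoutMs) {
        List<String> evicted = new ArrayList<>();
        for (String endpoint : new ArrayList<>(table.keySet())) {
            Long lastHeard = table.get(endpoint);
            if (lastHeard != null && now - lastHeard > timeoutMs) {
                table.remove(endpoint);
                evicted.add(endpoint);
            }
        }
        return evicted;
    }

    public static void main(String[] args) {
        long now = 10_000L;
        long timeoutMs = 5_000L;
        System.out.println("unsafe loop threw: "
                + unsafeEvictThrows(newEndpointTable(), now, timeoutMs));
        System.out.println("safe loop evicted: "
                + safeEvict(newEndpointTable(), now, timeoutMs));
    }
}
```

The snapshot costs one copy of the key set per pass, which seems acceptable for
a table with one entry per node. Removing through the iterator itself
(Iterator.remove) would be another option when only a single thread touches
the table.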
> Gossiper ConcurrentModificationException after Decommissioning
> --------------------------------------------------------------
>
> Key: CASSANDRA-1494
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1494
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.6.5
> Environment: Linux 2.6.33.8-149.fc13.x86_64 #1 SMP Tue Aug 17
> 22:53:15 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
> Reporter: Dan Retzlaff
> Assignee: Brandon Williams
> Fix For: 0.6.6
>
>
> After decommissioning 192.168.2.147, the Gossiper caused a
> ConcurrentModificationException in 192.168.2.55. This cascaded into
> 192.168.2.55 thinking that 192.168.2.148 and 192.168.2.149 repeatedly went UP
> and then DOWN. Eventually this left so many intranode (storage port) TCP
> connections in CLOSE_WAIT that other nodes started failing with "too many
> open files" exceptions.
> INFO [Timer-0] 2010-09-08 17:00:02,398 Gossiper.java (line 402) FatClient
> /192.168.2.147 has been silent for 3600000ms, removing from gossip
> ERROR [Timer-0] 2010-09-08 17:00:02,418 Gossiper.java (line 99) Gossip error
> java.util.ConcurrentModificationException
> at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
> at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
> at org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> INFO [Timer-0] 2010-09-08 17:00:12,398 Gossiper.java (line 180) InetAddress
> /192.168.2.148 is now dead.
> INFO [Timer-0] 2010-09-08 17:00:14,399 Gossiper.java (line 180) InetAddress
> /192.168.2.149 is now dead.
> INFO [GMFD:1] 2010-09-08 17:00:19,400 Gossiper.java (line 578) InetAddress
> /192.168.2.149 is now UP
> INFO [HINTED-HANDOFF-POOL:1] 2010-09-08 17:00:19,400
> HintedHandOffManager.java (line 165) Started hinted handoff for endPoint
> /192.168.2.149
> INFO [HINTED-HANDOFF-POOL:1] 2010-09-08 17:00:19,401
> HintedHandOffManager.java (line 222) Finished hinted handoff of 0 rows to
> endpoint /192.168.2.149
> INFO [Timer-0] 2010-09-08 17:00:20,399 Gossiper.java (line 180) InetAddress
> /192.168.2.149 is now dead.
> INFO [GMFD:1] 2010-09-08 17:00:43,409 Gossiper.java (line 578) InetAddress
> /192.168.2.148 is now UP
> INFO [HINTED-HANDOFF-POOL:1] 2010-09-08 17:00:43,409
> HintedHandOffManager.java (line 165) Started hinted handoff for endPoint
> /192.168.2.148
> INFO [HINTED-HANDOFF-POOL:1] 2010-09-08 17:00:43,410
> HintedHandOffManager.java (line 222) Finished hinted handoff of 0 rows to
> endpoint /192.168.2.148
> INFO [Timer-0] 2010-09-08 17:00:44,404 Gossiper.java (line 180) InetAddress
> /192.168.2.148 is now dead.
> INFO [GMFD:1] 2010-09-08 17:01:18,415 Gossiper.java (line 578) InetAddress
> /192.168.2.149 is now UP
> (UP/DOWN cycle repeats until the target node *really* goes DOWN due to too
> many TCP sockets in CLOSE_WAIT.)