[jira] Commented: (CASSANDRA-1494) Gossiper ConcurrentModificationException after Decommissioning

Dan Retzlaff (JIRA) Sun, 12 Sep 2010 21:24:16 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908605#action_12908605 ]


Dan Retzlaff commented on CASSANDRA-1494:
-----------------------------------------

The two nodes going up/down are actually running Cassandra 0.6.1. (Sorry for 
omitting this fact earlier.) I'm new to Cassandra, but it seems entirely 
possible that they also hit the ConcurrentModificationException, and that the 
then-broken timer task caused repeated, faulty FailureDetector evictions. The 
failure mode bit us twice (we decommissioned two nodes), so if there's still 
uncertainty I'm sure I can reproduce it. Unfortunately, the log files on those 
two machines have rotated since the failures, so I can't look for more evidence 
from those events.

> Gossiper ConcurrentModificationException after Decommissioning
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-1494
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1494
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.5
>         Environment: Linux 2.6.33.8-149.fc13.x86_64 #1 SMP Tue Aug 17 
> 22:53:15 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Dan Retzlaff
>            Assignee: Brandon Williams
>             Fix For: 0.6.6
>
>
> After decommissioning 192.168.2.147, the Gossiper caused a 
> ConcurrentModificationException in 192.168.2.55. This cascaded into 
> 192.168.2.55 thinking that 192.168.2.148 and 192.168.2.149 repeatedly went UP 
> and then DOWN. Eventually this left so many intranode (storage port) TCP 
> connections in CLOSE_WAIT that other nodes started failing with "too many 
> open files" exceptions.
>  INFO [Timer-0] 2010-09-08 17:00:02,398 Gossiper.java (line 402) FatClient /192.168.2.147 has been silent for 3600000ms, removing from gossip
> ERROR [Timer-0] 2010-09-08 17:00:02,418 Gossiper.java (line 99) Gossip error
> java.util.ConcurrentModificationException
>         at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
>         at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
>         at org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
>         at java.util.TimerThread.mainLoop(Timer.java:512)
>         at java.util.TimerThread.run(Timer.java:462)
>  INFO [Timer-0] 2010-09-08 17:00:12,398 Gossiper.java (line 180) InetAddress /192.168.2.148 is now dead.
>  INFO [Timer-0] 2010-09-08 17:00:14,399 Gossiper.java (line 180) InetAddress /192.168.2.149 is now dead.
>  INFO [GMFD:1] 2010-09-08 17:00:19,400 Gossiper.java (line 578) InetAddress /192.168.2.149 is now UP
>  INFO [HINTED-HANDOFF-POOL:1] 2010-09-08 17:00:19,400 HintedHandOffManager.java (line 165) Started hinted handoff for endPoint /192.168.2.149
>  INFO [HINTED-HANDOFF-POOL:1] 2010-09-08 17:00:19,401 HintedHandOffManager.java (line 222) Finished hinted handoff of 0 rows to endpoint /192.168.2.149
>  INFO [Timer-0] 2010-09-08 17:00:20,399 Gossiper.java (line 180) InetAddress /192.168.2.149 is now dead.
>  INFO [GMFD:1] 2010-09-08 17:00:43,409 Gossiper.java (line 578) InetAddress /192.168.2.148 is now UP
>  INFO [HINTED-HANDOFF-POOL:1] 2010-09-08 17:00:43,409 HintedHandOffManager.java (line 165) Started hinted handoff for endPoint /192.168.2.148
>  INFO [HINTED-HANDOFF-POOL:1] 2010-09-08 17:00:43,410 HintedHandOffManager.java (line 222) Finished hinted handoff of 0 rows to endpoint /192.168.2.148
>  INFO [Timer-0] 2010-09-08 17:00:44,404 Gossiper.java (line 180) InetAddress /192.168.2.148 is now dead.
>  INFO [GMFD:1] 2010-09-08 17:01:18,415 Gossiper.java (line 578) InetAddress /192.168.2.149 is now UP
> (UP/DOWN cycle repeats until the target node *really* goes DOWN due to too 
> many TCP sockets in CLOSE_WAIT.)
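
For readers unfamiliar with this exception: the stack trace above shows it 
being thrown from Hashtable$Enumerator.next inside doStatusCheck. One common 
way to get that is to remove an entry from a Hashtable while iterating over 
it. The sketch below is not Cassandra's code; the class, method names, and 
timestamps are hypothetical and only illustrate the general Java behaviour, 
plus one way to sweep stale entries safely with Iterator.remove().

```java
import java.util.ConcurrentModificationException;
import java.util.Hashtable;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for a gossip endpoint table; not Cassandra's actual code.
final class GossipIterationSketch {

    // Maps an endpoint address to the time (ms) it was last heard from.
    // Two entries are stale so the failure below does not depend on
    // Hashtable's iteration order (fail-fast detection is best-effort and
    // can be missed if the removed entry happens to be the last one visited).
    static Map<String, Long> newEndpointTable() {
        Map<String, Long> table = new Hashtable<>();
        table.put("192.168.2.147", 0L);          // stale
        table.put("192.168.2.148", 100_000L);    // stale
        table.put("192.168.2.149", 3_999_000L);  // recently heard from
        return table;
    }

    // Unsafe: calls table.remove() while a key-set iterator is active.
    // Returns true if a ConcurrentModificationException was thrown.
    static boolean unsafeSweepThrows(Map<String, Long> table, long now, long timeoutMs) {
        try {
            for (String endpoint : table.keySet()) {
                if (now - table.get(endpoint) >= timeoutMs) {
                    table.remove(endpoint); // structural change behind the iterator's back
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Safe: removes through the iterator, which keeps its bookkeeping in sync.
    // Returns the number of entries removed.
    static int safeSweep(Map<String, Long> table, long now, long timeoutMs) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = table.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (now - entry.getValue() >= timeoutMs) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        long now = 4_000_000L;
        long timeoutMs = 3_600_000L; // same order of magnitude as the FatClient timeout in the log

        System.out.println("unsafe sweep threw: "
                + unsafeSweepThrows(newEndpointTable(), now, timeoutMs));

        Map<String, Long> table = newEndpointTable();
        System.out.println("safe sweep removed: " + safeSweep(table, now, timeoutMs)
                + ", remaining: " + table.keySet());
    }
}
```

Note that Iterator.remove() only covers removal by the iterating thread 
itself. If a second thread can modify the table during the scan, iterating 
over a snapshot or using a concurrent map would be needed instead. Whether 
either applies to the Gossiper is for the fix to determine.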

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
