[ https://issues.apache.org/jira/browse/CASSANDRA-5154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553804#comment-13553804 ]

Brandon Williams commented on CASSANDRA-5154:
---------------------------------------------

Some things here are strange:

* the restarted nodes should have picked up the dead state from the 
not-restarted nodes
* the dead states have an expire time of 2013-01-06 and should have been purged
* the restarted nodes shouldn't be connecting to a node they don't know about 
in gossip

For the first point, I suspect they did actually receive the dead state, noticed it 
was past its expire time, and purged it.  For the second point, my best guess 
is that the clock on these machines is wrong, so they don't know to purge the 
dead state.  The last point is a mystery, since they can't connect to something 
they don't know about...
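
To illustrate the clock-skew theory: if purging a dead state is gated on the local clock having passed the state's expire time, a node whose clock lags will never consider the state expired. The sketch below is a hypothetical simplification (names like {{shouldPurge}} are mine, not actual Cassandra source), just to show the effect of a lagging clock:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not actual Cassandra code: a dead state is
// purged only once the local clock passes its expire time, so a node
// whose clock lags behind the expire time keeps the state forever.
public class ExpireCheck {
    static boolean shouldPurge(long expireTimeMillis, long localNowMillis) {
        return localNowMillis > expireTimeMillis;
    }

    public static void main(String[] args) {
        long expire = 1357430400000L;  // roughly 2013-01-06 UTC, as in the report
        long correctClock = expire + TimeUnit.DAYS.toMillis(7);
        long laggingClock = expire - TimeUnit.DAYS.toMillis(7);
        System.out.println(shouldPurge(expire, correctClock)); // true: purged
        System.out.println(shouldPurge(expire, laggingClock)); // false: retained
    }
}
```

Under this assumption, a node whose clock reads earlier than 2013-01-06 would retain (and keep gossiping) the dead state indefinitely.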
                
> Gossip sends removed node which causes restarted nodes to constantly create 
> new threads
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5154
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5154
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.7
>         Environment: centos 6, JVM 1.6.0_37
>            Reporter: Mariusz Gronczewski
>
> Our Cassandra cluster had 14 nodes, but it was mostly idle, so about 2 weeks 
> ago we removed 3 of them (via a standard decommission) and moved tokens to 
> balance the load.
> Since then no node was restarted, but last week, after restarting 2 of them, 
> we observed that both spawn threads ( WRITE-/1.2.3.4, where 1.2.3.4 is the IP 
> of one of the removed nodes ) until they hit the limit ( which is 800 on our 
> system ), and then Cassandra dies. Nodes that were not restarted do not do 
> this, and there are no outgoing connections to those dead nodes.
> I noticed the dead nodes are still in nodetool gossipinfo on the 
> non-restarted nodes but not on the restarted ones, so it seems they were not 
> properly removed from gossip.
> Would a rolling restart fix this, or is a full cluster stop-start required?
> trace from hanging threads:
> {code}
>  "WRITE-/1.2.3.4" daemon prio=10 tid=0x00007f5fe8194000 nid=0x2fb2 waiting on
> condition [0x00007f6020de0000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for <0x00000007536a1160> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>       at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>       at 
> org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:104)
> {code}
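
The trace above shows each WRITE-/1.2.3.4 thread parked inside LinkedBlockingQueue.take(), waiting for a message to send. A minimal standalone reproduction of that pattern (not Cassandra code, just the same blocking-queue idiom) shows why such threads report WAITING rather than consuming CPU:

```java
import java.util.concurrent.LinkedBlockingQueue;

// Minimal reproduction of the stack-trace pattern: a consumer thread
// blocked in LinkedBlockingQueue.take() parks on the queue's condition
// and reports Thread.State.WAITING, exactly like the WRITE-/1.2.3.4
// threads in the trace. The danger is not CPU use but thread count:
// each such connection thread holds a stack until the process limit.
public class QueueWait {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread consumer = new Thread(() -> {
            try {
                queue.take();                 // blocks until an element arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "WRITE-/1.2.3.4");
        consumer.start();
        Thread.sleep(200);                    // give the thread time to park
        System.out.println(consumer.getState()); // WAITING
        queue.put("done");                    // unblock so the JVM can exit
        consumer.join();
    }
}
```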

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
