[ 
https://issues.apache.org/jira/browse/CASSANDRA-5154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553934#comment-13553934
 ] 

Mariusz Gronczewski commented on CASSANDRA-5154:
------------------------------------------------

I've tried to restart both nodes and:

{code}
[15:34:23]dev40:~☠ nodetool gossipinfo
/10.0.100.51
  RELEASE_VERSION:1.1.7
  RPC_ADDRESS:0.0.0.0
  LOAD:2.671090404E9
  STATUS:NORMAL,56713727820156407428984779325531226112
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
/10.0.100.52
/10.0.100.50
  RELEASE_VERSION:1.1.7
  RPC_ADDRESS:0.0.0.0
  LOAD:1.139624484E9
  STATUS:NORMAL,0
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137

[15:34:30]dev40:~☠ nodetool ring
Note: Ownership information does not include topology, please specify a keyspace.
Address         DC          Rack        Status State   Load            Owns                Token
                                                                                           165262560952710176606119205433922453072
10.0.100.50     datacenter1 rack1       Up     Normal  1.06 GB         2.87%               0
10.0.100.51     datacenter1 rack1       Up     Normal  2.49 GB         33.33%              56713727820156407428984779325531226112
10.0.100.52     datacenter1 rack1       Down   Normal  ?               63.80%              165262560952710176606119205433922453072
{code}

{code}
[15:32:43]dev41:~ᛯ nodetool ring
Note: Ownership information does not include topology, please specify a keyspace.
Address         DC          Rack        Status State   Load            Owns                Token
                                                                                           56713727820156407428984779325531226112
10.0.100.50     datacenter1 rack1       Down   Normal  1.06 GB         66.67%              0
10.0.100.51     datacenter1 rack1       Up     Normal  2.49 GB         33.33%              56713727820156407428984779325531226112
[15:32:44]dev41:~ᛯ nodetool gossipinfo
/10.0.100.51
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  RPC_ADDRESS:0.0.0.0
  STATUS:NORMAL,56713727820156407428984779325531226112
  RELEASE_VERSION:1.1.7
  LOAD:2.671090248E9
/10.0.100.50
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  RPC_ADDRESS:0.0.0.0
  STATUS:NORMAL,0
  RELEASE_VERSION:1.1.7
  LOAD:1.139615032E9
{code}
So one node sees the wrong state.
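
For anyone who wants to check this on more nodes: a rough sketch that parses `nodetool gossipinfo` output and flags endpoints that gossip still lists but that carry no STATUS, like /10.0.100.52 in the dev40 output above. The function names are made up for illustration, and treating a missing STATUS as "suspicious" is my assumption, not something Cassandra defines.

```python
def parse_gossipinfo(text):
    """Map each endpoint in `nodetool gossipinfo` output to its attributes."""
    endpoints = {}
    current = None
    for line in text.splitlines():
        if line.startswith("/"):
            # Endpoint header line, e.g. "/10.0.100.51"
            current = line.strip().lstrip("/")
            endpoints[current] = {}
        elif current is not None and line[:1] in (" ", "\t") and ":" in line:
            # Indented attribute line, e.g. "  STATUS:NORMAL,0"
            key, _, value = line.strip().partition(":")
            endpoints[current][key] = value
    return endpoints


def endpoints_without_status(endpoints):
    """Endpoints that gossip lists but that have no STATUS attribute (assumed suspicious)."""
    return sorted(ep for ep, attrs in endpoints.items() if "STATUS" not in attrs)


# Sample copied from the dev40 output in this comment.
SAMPLE = """/10.0.100.51
  RELEASE_VERSION:1.1.7
  RPC_ADDRESS:0.0.0.0
  LOAD:2.671090404E9
  STATUS:NORMAL,56713727820156407428984779325531226112
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
/10.0.100.52
/10.0.100.50
  RELEASE_VERSION:1.1.7
  RPC_ADDRESS:0.0.0.0
  LOAD:1.139624484E9
  STATUS:NORMAL,0
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
"""

if __name__ == "__main__":
    print(endpoints_without_status(parse_gossipinfo(SAMPLE)))
```

Comparing the key sets of `parse_gossipinfo()` results from two nodes would also show endpoints that only one of them knows about.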
What fixed it on our dev servers was:

* stopping node
* deleting all but schema* dirs in system/
* starting node
* repeating for other nodes
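
The delete step above could be scripted roughly as below. This is only a sketch: it assumes you pass it the path of the `system` keyspace directory yourself (where that lives depends on your data directory configuration), it does not stop or start the node, and the function names are made up. It defaults to a dry run because the deletion is not reversible.

```python
import fnmatch
import shutil
from pathlib import Path


def non_schema_dirs(system_dir, keep_pattern="schema*"):
    """Subdirectories of system_dir whose names do NOT match keep_pattern."""
    return sorted(
        p for p in Path(system_dir).iterdir()
        if p.is_dir() and not fnmatch.fnmatchcase(p.name, keep_pattern)
    )


def purge_non_schema_dirs(system_dir, dry_run=True):
    """Delete everything but schema* dirs; run only while the node is stopped.

    Returns the names of the directories that were (or, in a dry run,
    would be) removed.
    """
    targets = non_schema_dirs(system_dir)
    for path in targets:
        if not dry_run:
            shutil.rmtree(path)
    return [p.name for p in targets]
```

Calling `purge_non_schema_dirs(path)` first and reading the returned list before rerunning with `dry_run=False` is the intended usage.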

                
> Gossip sends removed node which causes restarted nodes to constantly create 
> new threads
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5154
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5154
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.7
>         Environment: centos 6, JVM 1.6.0_37
>            Reporter: Mariusz Gronczewski
>
> Our Cassandra cluster had 14 nodes but it was mostly idle, so about 2 weeks 
> ago we removed 3 of them (via standard decommission) and moved tokens to 
> balance load.
> Since then no node had been restarted, but last week, after restarting 2 of 
> them, we observed that both spawn threads (WRITE-/1.2.3.4, where 1.2.3.4 is 
> the IP of one of the removed nodes) until they hit the limit (800 on our 
> system), and then Cassandra dies. Nodes that were not restarted do not do 
> this. There are no outgoing connections to those dead nodes.
> I noticed the dead nodes are still in nodetool gossipinfo on the 
> non-restarted nodes but not on the restarted ones, so it seems they are not 
> properly removed from gossip.
> Would a rolling restart fix this, or is a full cluster stop-start required?
> Trace from the hanging threads:
> {code}
> "WRITE-/1.2.3.4" daemon prio=10 tid=0x00007f5fe8194000 nid=0x2fb2 waiting on condition [0x00007f6020de0000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for <0x00000007536a1160> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>       at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>       at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:104)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
