[ 
https://issues.apache.org/jira/browse/CASSANDRA-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861624#comment-15861624
 ] 

Jason Brown edited comment on CASSANDRA-13204 at 2/10/17 6:24 PM:
------------------------------------------------------------------

bq. It's not clear to me that you need to move the backlog.clear()?

I think you are correct (that it probably does not need to move). I have a 
(possibly unfounded) preference for referencing shared state variables like 
this in the same order from different spots. wdyt?

And, yes, your analysis is correct :) 


was (Author: jasobrown):
bq. It's not clear to me that you need to move the backlog.clear()?

I think you are correct. I have a (possibly unfounded) preference for 
referencing shared state variables like this in the same order from different 
spots. wdyt?

And, yes, your analysis is correct :) 

> Thread Leak in OutboundTcpConnection
> ------------------------------------
>
>                 Key: CASSANDRA-13204
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13204
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: sankalp kohli
>            Assignee: Jason Brown
>             Fix For: 3.0.11, 2.1.x, 2.2.x, 3.11.x
>
>
> We found threads leaking from OutboundTcpConnection to machines which are not 
> part of the cluster and still in Gossip for some reason. There are two issues 
> here, this JIRA will cover the second one which is most important. 
> 1) First issue is that Gossip has information about machines not in the ring 
> which has been replaced out. It causes Cassandra to connect to those machines 
> but due to internode auth, it wont be able to connect to them at the socket 
> level.  
> 2) Second issue is a race between creating a connection and closing a 
> connections which is triggered by the gossip bug explained above. Let me try 
> to explain it using the code
> In OutboundTcpConnection, we are calling closeSocket(true) which will set 
> isStopped=true and also put a close sentinel into the queue to exit the 
> thread. On the ack connection, Gossip tries to send a message which calls 
> connect() which will block for 10 seconds which is RPC timeout. The reason we 
> will block is because Cassandra might not be running there or internode auth 
> will not let it connect. During this 10 seconds, if Gossip calls closeSocket, 
> it will put close sentinel into the queue. When we return from the connect 
> method after 10 seconds, we will clear the backlog queue causing this thread 
> to leak. 
> Proofs from the heap dump of the affected machine which is leaking threads 
> 1. Only ack connection is leaking and not the command connection which is not 
> used by Gossip. 
> 2. We see thread blocked on the backlog queue, isStopped=true and backlog 
> queue is empty. This is happening on the threads which have already leaked. 
> 3. A running thread was blocked on the connect waiting for timeout(10 
> seconds) and we see backlog queue to contain the close sentinel. Once the 
> connect will return false, we will clear the backlog and this thread will 
> have leaked.  
> Interesting bits from j stack 
> 1282 number of threads for "MessagingService-Outgoing-/<IP-Address>"
> Thread which is about to leak:
> "MessagingService-Outgoing-/<IP Address>" 
>    java.lang.Thread.State: RUNNABLE
>       at sun.nio.ch.Net.connect0(Native Method)
>       at sun.nio.ch.Net.connect(Net.java:454)
>       at sun.nio.ch.Net.connect(Net.java:446)
>       at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>       - locked <> (a java.lang.Object)
>       - locked <> (a java.lang.Object)
>       - locked <> (a java.lang.Object)
>       at 
> org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:137)
>       at 
> org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:119)
>       at 
> org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:381)
>       at 
> org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:217)
> Thread already leaked:
> "MessagingService-Outgoing-/<IP Address>"
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>       at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>       at 
> org.apache.cassandra.utils.CoalescingStrategies$DisabledCoalescingStrategy.coalesceInternal(CoalescingStrategies.java:482)
>       at 
> org.apache.cassandra.utils.CoalescingStrategies$CoalescingStrategy.coalesce(CoalescingStrategies.java:213)
>       at 
> org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:190)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to