[jira] [Commented] (CASSANDRA-9630) Killing cassandra process results in unclosed connections
[ https://issues.apache.org/jira/browse/CASSANDRA-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323944#comment-16323944 ]

Robert Stupp commented on CASSANDRA-9630:
-----------------------------------------

+1 it should fix the issue.

> Killing cassandra process results in unclosed connections
> ---------------------------------------------------------
>
> Key: CASSANDRA-9630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9630
> Project: Cassandra
> Issue Type: Bug
> Components: Distributed Metadata, Streaming and Messaging
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Priority: Minor
> Fix For: 3.11.x
>
> Attachments: apache-cassandra-3.0.8-SNAPSHOT.jar
>
>
> After upgrading Cassandra from 2.0.12 to 2.0.15, whenever we killed a
> cassandra process (with SIGTERM), some other nodes maintained a connection
> with the killed node in the CLOSE_WAIT state on port 7000 for about 5-20
> minutes.
> So, when we started the killed node again, other nodes could not establish a
> handshake because of the connections in the CLOSE_WAIT state, so they
> remained in the DOWN state to each other until the initial connection expired.
> The problem did not happen if I ran a nodetool disablegossip before killing
> the node.
> I was able to fix this issue by reverting the CASSANDRA-8336 commits
> (including CASSANDRA-9238). After reverting this, cassandra now closes
> connections correctly when killed with -TERM, but leaves connections in the
> CLOSE_WAIT state if I run nodetool disablethrift before killing the nodes.
> I did not try to reproduce the problem in a clean environment.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-9630) Killing cassandra process results in unclosed connections
[ https://issues.apache.org/jira/browse/CASSANDRA-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320733#comment-16320733 ]

Paulo Motta commented on CASSANDRA-9630:
----------------------------------------

Even though we didn't hear back from anyone who tested the patch, I'm quite confident this will fix the hanging sockets problem, so I will set this to patch available. Would you mind having a look, [~snazy]?

Patch [here|https://github.com/pauloricardomg/cassandra/tree/3.0-9630]. Submitted CI, will update after results.
[jira] [Commented] (CASSANDRA-9630) Killing cassandra process results in unclosed connections
[ https://issues.apache.org/jira/browse/CASSANDRA-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308966#comment-16308966 ]

SathishKumar Alwar commented on CASSANDRA-9630:
-----------------------------------------------

Is there a plan to fix this issue? We are observing the same behavior.
[jira] [Commented] (CASSANDRA-9630) Killing cassandra process results in unclosed connections
[ https://issues.apache.org/jira/browse/CASSANDRA-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398338#comment-15398338 ]

Paulo Motta commented on CASSANDRA-9630:
----------------------------------------

I noticed we're not closing the socket [if there is an exception while connecting|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L496] on {{OutboundTcpConnection}}, so a race during a node's shutdown might cause a failed connection attempt to that node to remain in the {{CLOSE_WAIT}} state until the next GC, which could potentially cause this.

[~farzad.panahi] Are you willing to try out [this patch|https://github.com/pauloricardomg/cassandra/commit/3f46d414b06afb607b6a97152661b10c53c103e6] to see if it fixes it? You need to replace your {{lib/apache-cassandra-3.0.8.jar}} with [apache-cassandra-3.0.8-SNAPSHOT.jar|https://issues.apache.org/jira/secure/attachment/12820814/apache-cassandra-3.0.8-SNAPSHOT.jar], perform a rolling restart on some of the nodes, and check whether this fixes the issue on those nodes (if you prefer, you can generate your own jar by cloning [this branch|https://github.com/pauloricardomg/cassandra/tree/3.0.6-9630] and running {{ant clean jar}}).

If this does not solve it, it would be nice if you could set the logging level of the {{org.apache.cassandra.net}} package to {{TRACE}}, either via {{nodetool setlogginglevel org.apache.cassandra.net TRACE}} or by adding {{}} to the end of your {{conf/logback.xml}}. After this, please attach the relevant information from the logs of the affected nodes to this ticket for further analysis.
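The leak described in this comment (a socket left open when the connection attempt throws) can be illustrated with a minimal Java sketch. This is not the linked patch and not the actual {{OutboundTcpConnection}} code: the class {{SafeConnect}} and the method {{connectOrClose}} are hypothetical names introduced only for illustration. The idea is simply that the caller closes the socket explicitly on failure instead of leaving the descriptor to be reclaimed at the next GC.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, for illustration only; not part of Cassandra.
final class SafeConnect {
    private SafeConnect() {}

    /**
     * Connects the given socket to the address. If the attempt fails, the
     * socket is closed before the exception is rethrown, so the underlying
     * file descriptor is released immediately rather than whenever the
     * garbage collector finalizes the abandoned Socket object.
     */
    static Socket connectOrClose(Socket socket, InetSocketAddress address, int timeoutMillis)
            throws IOException {
        try {
            socket.connect(address, timeoutMillis);
            return socket;
        } catch (IOException e) {
            try {
                socket.close();
            } catch (IOException closeFailure) {
                // Keep the original failure as the primary exception.
                e.addSuppressed(closeFailure);
            }
            throw e;
        }
    }
}
```

A caller would use it as {{SafeConnect.connectOrClose(new Socket(), new InetSocketAddress(host, 7000), 2000)}}, where the host and timeout are placeholders. Passing the socket in, rather than creating it inside the helper, lets the caller confirm after a failure that {{isClosed()}} is true.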
> Killing cassandra process results in unclosed connections
> ---------------------------------------------------------
>
> Key: CASSANDRA-9630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9630
> Project: Cassandra
> Issue Type: Bug
> Components: Distributed Metadata, Streaming and Messaging
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Priority: Minor
> Fix For: 3.x
>
> Attachments: apache-cassandra-3.0.8-SNAPSHOT.jar
>
>
> After upgrading Cassandra from 2.0.12 to 2.0.15, whenever we killed a
> cassandra process (with SIGTERM), some other nodes maintained a connection
> with the killed node in the CLOSE_WAIT state on port 7000 for about 5-20
> minutes.
> So, when we started the killed node again, other nodes could not establish a
> handshake because of the connections in the CLOSE_WAIT state, so they
> remained in the DOWN state to each other until the initial connection expired.
> The problem did not happen if I ran a nodetool disablegossip before killing
> the node.
> I was able to fix this issue by reverting the CASSANDRA-8336 commits
> (including CASSANDRA-9238). After reverting this, cassandra now closes
> connections correctly when killed with -TERM, but leaves connections in the
> CLOSE_WAIT state if I run nodetool disablethrift before killing the nodes.
> I did not try to reproduce the problem in a clean environment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9630) Killing cassandra process results in unclosed connections
[ https://issues.apache.org/jira/browse/CASSANDRA-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15396756#comment-15396756 ]

Farzad Panahi commented on CASSANDRA-9630:
------------------------------------------

I am experiencing a similar issue.

Cassandra version: 3.0.8
Environment: Amazon EC2

Error Case:
When I restart the Cassandra service on a node, after the node comes up it sees some or all of the other nodes as DN even though the other nodes see this node as UN.

Here is the output of netstat and nodetool status for this error case:

1. right after stopping cassandra service on node 10.4.68.222:
{code}
-- ip-10-4-54-176
tcp        0      0 10.4.54.176:51268   10.4.68.222:7000    TIME_WAIT
tcp        0      0 10.4.54.176:56135   10.4.68.222:7000    TIME_WAIT
tcp        1      0 10.4.54.176:43697   10.4.68.222:7000    CLOSE_WAIT
tcp        0      0 10.4.54.176:52372   10.4.68.222:7000    TIME_WAIT
--
-- ip-10-4-54-177
tcp        0      0 10.4.54.177:56960   10.4.68.222:7000    TIME_WAIT
tcp        0      0 10.4.54.177:54539   10.4.68.222:7000    TIME_WAIT
tcp        0      0 10.4.54.177:32823   10.4.68.222:7000    TIME_WAIT
tcp        1      0 10.4.54.177:48985   10.4.68.222:7000    CLOSE_WAIT
--
-- ip-10-4-68-222
tcp        0      0 10.4.68.222:7000    10.4.54.176:43697   FIN_WAIT2
tcp        0      0 10.4.68.222:7000    10.4.54.177:48985   FIN_WAIT2
tcp        0      0 10.4.68.222:7000    10.4.68.222:54419   TIME_WAIT
tcp        0      0 10.4.68.222:7000    10.4.43.65:43197    FIN_WAIT2
tcp        0      0 10.4.68.222:7000    10.4.68.221:44149   FIN_WAIT2
tcp        0      0 10.4.68.222:7000    10.4.68.222:41302   TIME_WAIT
tcp        0      0 10.4.68.222:7000    10.4.43.66:54321    FIN_WAIT2
--
-- ip-10-4-68-221
tcp        0      0 10.4.68.221:49599   10.4.68.222:7000    TIME_WAIT
tcp        0      0 10.4.68.221:55033   10.4.68.222:7000    TIME_WAIT
tcp        0      0 10.4.68.221:51628   10.4.68.222:7000    TIME_WAIT
tcp        1      0 10.4.68.221:44149   10.4.68.222:7000    CLOSE_WAIT
--
-- ip-10-4-43-66
tcp        0      0 10.4.43.66:55930    10.4.68.222:7000    TIME_WAIT
tcp        1      0 10.4.43.66:54321    10.4.68.222:7000    CLOSE_WAIT
tcp        0      0 10.4.43.66:60968    10.4.68.222:7000    TIME_WAIT
tcp        0      0 10.4.43.66:49087    10.4.68.222:7000    TIME_WAIT
--
-- ip-10-4-43-65
tcp        1      0 10.4.43.65:43197    10.4.68.222:7000    CLOSE_WAIT
tcp        0      0 10.4.43.65:36467    10.4.68.222:7000    TIME_WAIT
tcp        0      0 10.4.43.65:53317    10.4.68.222:7000    TIME_WAIT
tcp        0      0 10.4.43.65:54897    10.4.68.222:7000    TIME_WAIT
--
{code}

2. a bit after stopping cassandra service on node 10.4.68.222:
{code}
-- ip-10-4-54-176
tcp        1      0 10.4.54.176:43697   10.4.68.222:7000    CLOSE_WAIT
--
-- ip-10-4-54-177
--
-- ip-10-4-68-222
--
-- ip-10-4-68-221
tcp        1      0 10.4.68.221:44149   10.4.68.222:7000    CLOSE_WAIT
--
-- ip-10-4-43-66
tcp        1      0 10.4.43.66:54321    10.4.68.222:7000    CLOSE_WAIT
--
-- ip-10-4-43-65
tcp        1      0 10.4.43.65:43197    10.4.68.222:7000    CLOSE_WAIT
--
{code}

3. after starting cassandra service on node 10.4.68.222:
{code}
-- ip-10-4-54-176
tcp        0      0 10.4.54.176:42460   10.4.68.222:7000    ESTABLISHED
tcp        1 303403 10.4.54.176:43697   10.4.68.222:7000    CLOSE_WAIT
tcp        0      0 10.4.54.176:42109   10.4.68.222:7000
[jira] [Commented] (CASSANDRA-9630) Killing cassandra process results in unclosed connections
[ https://issues.apache.org/jira/browse/CASSANDRA-9630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280325#comment-15280325 ]

T Jake Luciani commented on CASSANDRA-9630:
-------------------------------------------

This is likely caused by MessageService not setting SO_LINGER to 0 on the sockets.

> Killing cassandra process results in unclosed connections
> ---------------------------------------------------------
>
> Key: CASSANDRA-9630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9630
> Project: Cassandra
> Issue Type: Bug
> Components: Distributed Metadata, Streaming and Messaging
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Priority: Minor
> Fix For: 3.x
>
>
> After upgrading Cassandra from 2.0.12 to 2.0.15, whenever we killed a
> cassandra process (with SIGTERM), some other nodes maintained a connection
> with the killed node in the CLOSE_WAIT state on port 7000 for about 5-20
> minutes.
> So, when we started the killed node again, other nodes could not establish a
> handshake because of the connections in the CLOSE_WAIT state, so they
> remained in the DOWN state to each other until the initial connection expired.
> The problem did not happen if I ran a nodetool disablegossip before killing
> the node.
> I was able to fix this issue by reverting the CASSANDRA-8336 commits
> (including CASSANDRA-9238). After reverting this, cassandra now closes
> connections correctly when killed with -TERM, but leaves connections in the
> CLOSE_WAIT state if I run nodetool disablethrift before killing the nodes.
> I did not try to reproduce the problem in a clean environment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
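For readers unfamiliar with the SO_LINGER suggestion in the comment above, here is a minimal Java sketch of what setting it to 0 looks like. The class {{LingerZero}} and the method {{enableAbortiveClose}} are hypothetical names, not Cassandra code, and whether this option addresses the ticket is the commenter's hypothesis rather than a confirmed fix. With linger enabled and a timeout of 0, {{close()}} aborts the connection (the peer sees a reset) instead of going through the normal FIN handshake, and any unsent data is discarded.

```java
import java.net.Socket;
import java.net.SocketException;

// Hypothetical helper, for illustration only; not part of Cassandra.
final class LingerZero {
    private LingerZero() {}

    /**
     * Enables SO_LINGER with a zero timeout on the socket. A later close()
     * then resets the connection instead of performing an orderly shutdown.
     * Trade-off: data still queued for sending is dropped.
     */
    static Socket enableAbortiveClose(Socket socket) throws SocketException {
        socket.setSoLinger(true, 0);
        return socket;
    }
}
```

After the call, {{socket.getSoLinger()}} returns 0; a return value of -1 would mean the option is disabled.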