[ https://issues.apache.org/jira/browse/CASSANDRA-14424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458787#comment-16458787 ]
Joseph Lynch commented on CASSANDRA-14424:
------------------------------------------

[~alienth] Yeah, this looks really similar (in particular, the 10-20 minute delay looks a lot like the timeout produced by Linux's default {{tcp_retries2}} setting). Note that if this is CASSANDRA-14358, you won't get retransmits on the restarted node but on the old node, which has a bad {{OutboundTcpConnection}}. The restarted node is 100% fine; it's the old node that is endlessly retransmitting on a blackholed TCP connection.

By any chance are you using VPC or security groups (SGs in particular)? The core finding of CASSANDRA-14358, imo, is that those network products can swallow packets on an {{ESTABLISHED}} TCP connection (rather than sending the more useful {{RST}}).

If you are able to get a reproduction, you can see if this is happening by running the following on the _old_ host (the one that the restarted node can't see, not the restarted node itself):

{noformat}
sudo netstat -o -n -p | grep <CASSANDRA PID> | grep <RESTARTED IP> | grep <STORAGE PORT>
{noformat}

If it is the same issue as CASSANDRA-14358, you should see {{ESTABLISHED}} connections from an ephemeral port on the old node to the restarted node with non-zero retransmission counters (e.g. {{keepalive (X/Y/Z)}} where Y > 0). On the restarted node, you should not see the ephemeral port from the non-restarted node in an {{ESTABLISHED}} state.

I know that reproducing this issue is tricky. If you can't reproduce it to check the connection state, you could try the mitigation proposed in CASSANDRA-14358 by lowering the sysctl on the entire cluster and then restarting nodes:

{noformat}
sudo sysctl net.ipv4.tcp_retries2=5
{noformat}

If it is CASSANDRA-14358, your partitions should shrink to ~2 minutes rather than 15-20.
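For anyone who wants to script that check across a cluster, here is a rough Python sketch of the same idea. It is only a sketch under assumptions: it parses the {{netstat -o -n}} timer column in the {{on (X/Y/Z)}} / {{keepalive (X/Y/Z)}} form described above (middle number = retransmission counter), it defaults the storage port to 7000, and the script name in the usage comment is hypothetical. Run it on the old node, passing the restarted node's IP.

{noformat}
#!/usr/bin/env python3
"""Flag ESTABLISHED connections to a peer that are accumulating TCP
retransmissions. Rough sketch only: assumes the netstat -o timer field
looks like "on (X/Y/Z)" or "keepalive (X/Y/Z)", where the middle number
is the retransmission counter."""
import re
import subprocess
import sys

def retransmitting_conns(peer_ip, storage_port="7000"):
    # -o prints the timer column, -n skips name resolution
    out = subprocess.run(["netstat", "-o", "-n"],
                         capture_output=True, text=True).stdout
    suspects = []
    for line in out.splitlines():
        if "ESTABLISHED" not in line or f"{peer_ip}:{storage_port}" not in line:
            continue
        timer = re.search(r"\(([\d.]+)/(\d+)/(\d+)\)", line)
        if timer and int(timer.group(2)) > 0:  # middle field: retransmits so far
            suspects.append(line)
    return suspects

if __name__ == "__main__":
    # usage (hypothetical script name): check_retransmits.py <RESTARTED IP> [<STORAGE PORT>]
    ip = sys.argv[1]
    port = sys.argv[2] if len(sys.argv) > 2 else "7000"
    for conn in retransmitting_conns(ip, port):
        print(conn)
{noformat}

Any line it prints is a connection matching the CASSANDRA-14358 pattern described above.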
> Gossip EchoMessages not being handled somewhere after node restart
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-14424
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14424
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: cassandra 3.11.2 - brand new ring - 18 nodes.
>                      ubuntu 16.04
>                      AWS - cross AZ, with GossipingPropertyFileSnitch setting the rack to the AZs.
>            Reporter: Jason Harvey
>            Priority: Major
>             Fix For: 3.11.x, 4.x
>
>
> Noticing this behaviour on a brand new 3.11.2 ring:
> # Restart a random node in the ring.
> # When that node comes back up, around 30% of the time it sees a single other node down. No other node in the ring sees that node is down.
> # After 10-20 minutes, the DOWN node suddenly appears UP to the restarted node.
>
> After digging through tracing logs, here's what I know:
>
> The node seen as DOWN has not gone down, but simply hasn't been seen as UP yet. The restarted node is attempting to `markAlive()` the target node.
>
> Relevant logs from the restarted node's POV:
>
> {{INFO [GossipStage:1] 2018-04-27 14:03:50,950 Gossiper.java:1053 - Node /10.0.225.147 has restarted, now UP}}
> {{INFO [GossipStage:1] 2018-04-27 14:03:50,969 StorageService.java:2292 - Node /10.0.225.147 state jump to NORMAL}}
> {{INFO [HANDSHAKE-/10.0.225.147] 2018-04-27 14:03:50,976 OutboundTcpConnection.java:560 - Handshaking version with /10.0.225.147}}
> {{INFO [GossipStage:1] 2018-04-27 14:03:50,977 TokenMetadata.java:479 - Updating topology for /10.0.225.147}}
> {{INFO [GossipStage:1] 2018-04-27 14:03:50,977 TokenMetadata.java:479 - Updating topology for /10.0.225.147}}
>
> (note that despite the Gossip seeing the DOWN node as 'UP', nodetool status still shows it as 'DOWN', as markAlive has not completed, and will not actually be seen as 'UP' for 20 more minutes)
>
> The restarted node is repeatedly sending Echo messages to the DOWN node as part of the `markAlive()` call. The DOWN node is receiving those, and claims to be sending a response. However, the restarted node is not marking the DOWN node as UP even after the DOWN node sends the Echo response.
>
> Relevant logs from the restarted node's POV:
>
> {{TRACE [GossipStage:1] 2018-04-27 14:11:28,792 MessagingService.java:945 - 10.0.103.45 sending ECHO to 99248@/10.0.225.147}}
> {{TRACE [GossipTasks:1] 2018-04-27 14:11:29,792 MessagingService.java:945 - 10.0.103.45 sending GOSSIP_DIGEST_SYN to 99631@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:29,792 MessagingService.java:945 - 10.0.103.45 sending ECHO to 99632@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:29,793 MessagingService.java:945 - 10.0.103.45 sending GOSSIP_DIGEST_ACK2 to 99633@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:29,793 MessagingService.java:945 - 10.0.103.45 sending ECHO to 99635@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:31,794 MessagingService.java:945 - 10.0.103.45 sending ECHO to 100348@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:33,750 MessagingService.java:945 - 10.0.103.45 sending ECHO to 101157@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:35,412 MessagingService.java:945 - 10.0.103.45 sending ECHO to 101753@/10.0.225.147}}
>
> Relevant logs from the DOWN node's POV:
>
> {{TRACE [GossipStage:1] 2018-04-27 14:18:16,500 EchoVerbHandler.java:39 - Sending a EchoMessage reply /10.0.103.45}}
> {{TRACE [GossipStage:1] 2018-04-27 14:18:16,500 MessagingService.java:945 - 10.0.225.147 sending REQUEST_RESPONSE to 328389@/10.0.103.45}}
> {{TRACE [GossipStage:1] 2018-04-27 14:18:17,679 EchoVerbHandler.java:39 - Sending a EchoMessage reply /10.0.103.45}}
> {{TRACE [GossipStage:1] 2018-04-27 14:18:17,679 MessagingService.java:945 - 10.0.225.147 sending REQUEST_RESPONSE to 329412@/10.0.103.45}}
> {{TRACE [GossipStage:1] 2018-04-27 14:18:18,680 EchoVerbHandler.java:39 - Sending a EchoMessage reply /10.0.103.45}}
> {{TRACE [GossipStage:1] 2018-04-27 14:18:18,680 MessagingService.java:945 - 10.0.225.147 sending REQUEST_RESPONSE to 330185@/10.0.103.45}}
>
> The metrics on the restarted node show that the MessagingService has a large number of TimeoutsPerHost for the DOWN node, and all other nodes have 0 timeouts.
>
> Eventually, `realMarkAlive()` is called and the restarted node finally sees the DOWN node as coming up, and it spams several UP messages when this happens:
>
> {{INFO [RequestResponseStage-7] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
> {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
> {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
> {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
> {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
> {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
> {{INFO [RequestResponseStage-12] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
> {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
> {{INFO [RequestResponseStage-12] 2018-04-27 14:19:27,211 Gossiper.java:1019 - InetAddress /10.0.225.147 is now UP}}
>
> A tcpdump shows no packet loss or other oddities between these two hosts. The restarted node is sending Echo messages, and the DOWN node is ACKing them.
>
> The fact that a burst comes through at the very end suggests to me that perhaps the Echo messages are getting queued up somewhere.
>
> The issue happens roughly 30% of the time a given node in the ring is restarted.
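As a reading aid only, here is a highly simplified model of the echo exchange described above. This is illustrative Python, not the actual Gossiper/EchoVerbHandler code; the names markAlive/realMarkAlive and the idea that the old node replies over its own (possibly blackholed, per CASSANDRA-14358) outbound connection come from the logs and discussion above, and everything else is assumed for the sketch. It just shows why the restarted node keeps the peer DOWN until an echo reply actually arrives.

{noformat}
"""Illustrative model only -- not Cassandra source."""

class Node:
    def __init__(self, addr):
        self.addr = addr
        self.alive_peers = set()
        self.outbound_blackholed = False  # 14358: stale ESTABLISHED conn silently drops writes

    def mark_alive(self, peer):
        # Restarted node: sends an ECHO and only marks the peer UP in the reply callback.
        peer.handle_echo(self)

    def handle_echo(self, requester):
        # Old node: the echo reply goes out over its *own* outbound connection.
        if self.outbound_blackholed:
            return  # reply is written to a dead connection and never delivered
        requester.real_mark_alive(self)

    def real_mark_alive(self, peer):
        self.alive_peers.add(peer.addr)

restarted = Node("10.0.103.45")
old = Node("10.0.225.147")
old.outbound_blackholed = True     # the suspected CASSANDRA-14358 state
restarted.mark_alive(old)
print(restarted.alive_peers)       # empty: the old node stays DOWN until the dead conn times out
{noformat}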