[jira] [Updated] (CASSANDRA-14424) Gossip EchoMessages not being handled somewhere after node restart

Jason Harvey (JIRA) Fri, 27 Apr 2018 18:06:14 -0700

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-14424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jason Harvey updated CASSANDRA-14424:
-------------------------------------
    Environment: 
cassandra 3.11.2 - brand new ring - 18 nodes.

ubuntu 16.04

AWS - cross AZ, with GossipingPropertyFileSnitch setting the rack to the AZs.

  was:
cassandra 3.11.2 - brand new ring.

ubuntu 16.04

AWS


> Gossip EchoMessages not being handled somewhere after node restart
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-14424
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14424
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: cassandra 3.11.2 - brand new ring - 18 nodes.
> ubuntu 16.04
> AWS - cross AZ, with GossipingPropertyFileSnitch setting the rack to the AZs.
>            Reporter: Jason Harvey
>            Priority: Major
>             Fix For: 3.11.x, 4.x
>
>
> Noticing this behaviour on a brand new 3.11.2 ring:
>  # Restart a random node in the ring.
>  # When that node comes back up, around 30% of the time it sees a single 
> other node down. No other node in the ring sees that node is down.
>  # After 10-20 minutes, the DOWN node suddenly appears UP to the restarted 
> node.
>  
> After digging through tracing logs, here's what I know:
>  
> The node seen as DOWN has not gone down, but simply hasn't been seen as UP 
> yet. The restarted node is attempting to `markAlive()` the target node. 
> Relevant logs from the restarted node's POV:
>  
> {{INFO [GossipStage:1] 2018-04-27 14:03:50,950 Gossiper.java:1053 - Node 
> /10.0.225.147 has restarted, now UP}}
>  {{INFO [GossipStage:1] 2018-04-27 14:03:50,969 StorageService.java:2292 - 
> Node /10.0.225.147 state jump to NORMAL}}
>  {{INFO [HANDSHAKE-/10.0.225.147] 2018-04-27 14:03:50,976 
> OutboundTcpConnection.java:560 - Handshaking version with /10.0.225.147}}
>  {{INFO [GossipStage:1] 2018-04-27 14:03:50,977 TokenMetadata.java:479 - 
> Updating topology for /10.0.225.147}}
>  {{INFO [GossipStage:1] 2018-04-27 14:03:50,977 TokenMetadata.java:479 - 
> Updating topology for /10.0.225.147}}
>  
> (note that despite the Gossip seeing the DOWN node as 'UP', nodetool status 
> still shows it as 'DOWN', as markAlive has not completed, and will not 
> actually be seen as 'UP' for 20 more minutes)
>  
> The restarted node is repeatedly sending Echo messages to the DOWN node as 
> part of the `markAlive()` call. The DOWN node is receiving those, and claims 
> to be sending a response. However, the restarted node is not marking the DOWN 
> node as UP even after the DOWN node sends the Echo response.
>  
> Relevant logs from the restarted node's POV:
>  
> {{TRACE [GossipStage:1] 2018-04-27 14:11:28,792 MessagingService.java:945 - 
> 10.0.103.45 sending ECHO to 99248@/10.0.225.147}}
> {{TRACE [GossipTasks:1] 2018-04-27 14:11:29,792 MessagingService.java:945 - 
> 10.0.103.45 sending GOSSIP_DIGEST_SYN to 99631@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:29,792 MessagingService.java:945 - 
> 10.0.103.45 sending ECHO to 99632@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:29,793 MessagingService.java:945 - 
> 10.0.103.45 sending GOSSIP_DIGEST_ACK2 to 99633@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:29,793 MessagingService.java:945 - 
> 10.0.103.45 sending ECHO to 99635@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:31,794 MessagingService.java:945 - 
> 10.0.103.45 sending ECHO to 100348@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:33,750 MessagingService.java:945 - 
> 10.0.103.45 sending ECHO to 101157@/10.0.225.147}}
> {{TRACE [GossipStage:1] 2018-04-27 14:11:35,412 MessagingService.java:945 - 
> 10.0.103.45 sending ECHO to 101753@/10.0.225.147}}
>  
>  
> Relevant logs from the DOWN node's POV:
>  
> {{TRACE [GossipStage:1] 2018-04-27 14:18:16,500 EchoVerbHandler.java:39 - 
> Sending a EchoMessage reply /10.0.103.45}}
>  {{TRACE [GossipStage:1] 2018-04-27 14:18:16,500 MessagingService.java:945 - 
> 10.0.225.147 sending REQUEST_RESPONSE to 328389@/10.0.103.45}}
> {{TRACE [GossipStage:1] 2018-04-27 14:18:17,679 EchoVerbHandler.java:39 - 
> Sending a EchoMessage reply /10.0.103.45}}
>  {{TRACE [GossipStage:1] 2018-04-27 14:18:17,679 MessagingService.java:945 - 
> 10.0.225.147 sending REQUEST_RESPONSE to 329412@/10.0.103.45}}
> {{TRACE [GossipStage:1] 2018-04-27 14:18:18,680 EchoVerbHandler.java:39 - 
> Sending a EchoMessage reply /10.0.103.45}}
>  {{TRACE [GossipStage:1] 2018-04-27 14:18:18,680 MessagingService.java:945 - 
> 10.0.225.147 sending REQUEST_RESPONSE to 330185@/10.0.103.45}}
>  
>  
> The metrics on the restarted node show that the MessagingService has a large 
> number of TimeoutsPerHost for the DOWN node, and all other nodes have 0 
> timeouts.
>  
>  
> Eventually, `realMarkAlive()` is called and the restarted node finally sees 
> DOWN node as coming up, and it spams several UP messages when this happens:
>  
>  
> {{INFO [RequestResponseStage-7] 2018-04-27 14:19:27,210 Gossiper.java:1019 - 
> InetAddress /10.0.225.147 is now UP}}
>  {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 
> - InetAddress /10.0.225.147 is now UP}}
>  {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 
> - InetAddress /10.0.225.147 is now UP}}
>  {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 
> - InetAddress /10.0.225.147 is now UP}}
>  {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 
> - InetAddress /10.0.225.147 is now UP}}
>  {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 
> - InetAddress /10.0.225.147 is now UP}}
>  {{INFO [RequestResponseStage-12] 2018-04-27 14:19:27,210 Gossiper.java:1019 
> - InetAddress /10.0.225.147 is now UP}}
>  {{INFO [RequestResponseStage-11] 2018-04-27 14:19:27,210 Gossiper.java:1019 
> - InetAddress /10.0.225.147 is now UP}}
>  {{INFO [RequestResponseStage-12] 2018-04-27 14:19:27,211 Gossiper.java:1019 
> - InetAddress /10.0.225.147 is now UP}}
>  
>  
>  
> A tcpdump shows no packet loss or other oddities between these two hosts. The 
> restarted node is sending Echo messages, and the DOWN node is ACKing them.
>  
> The fact that a burst comes through at the very end suggests to me that 
> perhaps the Echo messages are getting queued up somewhere.
>  
> The issue happens roughly 30% of the time a given node in the ring is 
> restarted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (CASSANDRA-14424) Gossip EchoMessages not being handled somewhere after node restart

Reply via email to