[ 
https://issues.apache.org/jira/browse/CASSANDRA-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-13984:
-------------------------------------------
    Bug Category: Parent values: Availability(12983)Level 1 values: 
Unavailable(12994)

> dead Small OutboundTcpConnection to restarted nodes blocking hint delivery
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13984
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13984
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Startup and Shutdown
>         Environment: - We run entirely a counter based workload the clusters 
> we've been seeing this issue on (although it might happen for non counter 
> clusters).
> - We originally had a 48 node cassandra 3.7 cluster (averaging ~500GB of load 
> on each node) running on c4.4xlarge instances in EC2.
> - We've since split this cluster into two cassandra 3.11.0 clusters (for 
> other performance reasons), one with 1/3 of our dataset on 48 c4.4xlarge 
> instances, and the other 2/3 of our data set onto a 96 node c4.4xlarge 
> cluster.
> - All of these clusters run on ubuntu 14.04 with enhanced networking enabled.
>            Reporter: John Meichle
>            Priority: Normal
>             Fix For: 3.11.x
>
>         Attachments: 3.11_node_restart_with_disablegossip.svg, 
> 3.11_node_restart_without_disablegossip.svg
>
>
> Hello. For the last two months we've been fighting performance issues 
> relating to node restarts and hint playback, and were able to get a pretty 
> good bit of proof for the issue last night when debugging one of these 
> restarts.
> The main issue we've been fighting with these clusters is very slow and 
> unstable node restarts which seem to be due to hint playback with logs 
> indicating "Finished hinted handoff of file ... hints to endpoint: 1-2-3-4, 
> partially". We've posted about this on the mailing list as well and this bug 
> seems to be the cause of this issue 
> https://lists.apache.org/thread.html/41e9cb7b20696dd177fe925aba30d372614fcfe11d98275c7d9579cc@%3Cuser.cassandra.apache.org%3E
> Our process for restarting nodes is to run nodetool drain and then restart 
> the service via init. When originally watching the logs we saw on the 
> restarting node the standard cassandra startup process of initializing 
> keyspaces, loading sstables, and finally starting to handshake with the 
> cluster, which takes about 5 minutes. After this, the logs are very quiet 
> until 15 minutes later.
> During this 15 minute period, some peers (between 0 and half of the cluster) 
> are reported as DN status by nodetool status.
> When checking one of the nodes that is reported as DN, we see hint playback 
> logging lines such as:
> {code}
> INFO [HintsDispatcher:12] 2017-10-31 01:06:18,876 
> HintsDispatchExecutor.java:289 - Finished hinted handoff of file 
> 8724f664-dff1-4c20-887b-6a26ae54a9b5-1509410768866-1.hints to endpoint 
> /20.0.131.175: 8724f664-dff1-4c20-887b-6a26ae54a9b5, partially
> {code}
> We traced the codepath from this log line, starting in HintsDispatcher. It 
> appears the callback that is created for the hint sending is timing out as we 
> verified this the JMX metrics for the HintsService:
> {code}
> org.apache.cassandra.metrics:name=HintsFailed,type=HintsService
> org.apache.cassandra.metrics:name=HintsSucceeded,type=HintsService
> org.apache.cassandra.metrics:name=HintsTimedOut,type=HintsService
> {code}
> in which HintsTimedOut was incrementing during this period. We suspected this 
> might be due to the restarting node being overloaded, as we do see heavy IO 
> on the restart. However that IO pattern is not for the complete duration of 
> the "partial" hint playback period. We also generated a flame graph (using 
> the jvm-tools stcap and ssa) of the restarting node during this time period, 
> attached as 3.11_node_restart_without_disablegossip.svg. This seemed to 
> confirm that the node isn't doing much during the idle period as threads are 
> just parked.
> What is interesting is checking the debug logs on one of the nodes that is 
> marked as DN and is partially playing back hints. While the log lines for 
> hint playback are not helpful, what lead us to this apparent bug is the debug 
> logs for OutboundTcpConnection. Below is a pruned snippet of these logs:
> {code}
>  [GossipStage:1] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:205 - 
> Enqueuing socket close for /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:00,586 
> OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,586 
> OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,598 
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
>  [HANDSHAKE-/20.0.131.175] 2017-10-31 00:46:00,599 
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,601 
> OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,795 
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,797 
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,899 
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:48,002 
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
>     ... repeats until the node restarts gossip transport ...
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:49:04,703 
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:49:04,806 
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:50:46,545 
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
>  [HANDSHAKE-/20.0.131.175] 2017-10-31 00:50:46,547 
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:50:46,600 
> OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,376 
> OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,377 
> OutboundTcpConnection.java:349 - Error writing to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,382 
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
>  [HANDSHAKE-/20.0.131.175] 2017-10-31 01:06:25,383 
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,385 
> OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
> {code}
> The timeline of events here is:
> {code}
> T 00:00 - restarting node closed the small + gossip socket to the hints host, 
> and restarted
> T 00:00 - hints host then reattempted a connection and handshaked for small 
> instantly after
> (node restarted)
> T 00:47 - T4:46 hints host repeatedly attempts outbound gossip connections 
> for this time range hints are still playing back "partially"
> T 21:27 - hints hosts logs the previously (reattempted) small connection as 
> errored (timeout)
> T 21:27 - hints host retries the connection
> T 21:27 - hints work, things are done.
> {code}
> Non pruned debug log snippet for the restarting node in the time period of 
> the reconnection is:
> {code}
> debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,537 
> StorageService.java:1439 - DRAINING: starting drain process
> debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,537 
> HintsService.java:220 - Paused hints dispatch
> debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,578 
> Server.java:176 - Stop listening for CQL clients
> debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,581 
> Gossiper.java:1530 - Announcing shutdown
> debug.log:DEBUG [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,582 
> StorageService.java:2252 - Node /20.0.131.175 state shutdown, token [ .... 
> token range .... ]
> debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,582 
> StorageService.java:2255 - Node /20.0.131.175 state jump to shutdown
> {code}
> and on the hint playing host:
> {code}
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:45:59,642 
> FailureDetector.java:457 - Ignoring interval time of 2408381657 for 
> /20.0.132.6
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:45:59,642 
> FailureDetector.java:457 - Ignoring interval time of 3000155164 for 
> /20.0.131.252
> debug.log.8.zip:INFO  [GossipStage:1] 2017-10-31 00:46:00,586 
> Gossiper.java:1044 - InetAddress /20.0.131.175 is now DOWN
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
> OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
> OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
> OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Gossip] 
> 2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to 
> /20.0.131.175 closed
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 
> 2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to 
> /20.0.131.175 closed
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
> FailureDetector.java:323 - Forcing conviction of /20.0.131.175
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 
> 2017-10-31 00:46:00,598 OutboundTcpConnection.java:424 - Attempting to 
> connect to /20.0.131.175
> debug.log.8.zip:INFO  [HANDSHAKE-/20.0.131.175] 2017-10-31 00:46:00,599 
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 
> 2017-10-31 00:46:00,601 OutboundTcpConnection.java:532 - Done connecting to 
> /20.0.131.175
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 
> FailureDetector.java:457 - Ignoring interval time of 2001490285 for 
> /20.0.131.6
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 
> FailureDetector.java:457 - Ignoring interval time of 2001552636 for 
> /20.0.132.8
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 
> FailureDetector.java:457 - Ignoring interval time of 2001576461 for 
> /20.0.131.8
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2001599351 for 
> /20.0.130.18
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2930107432 for 
> /20.0.131.148
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2001646958 for 
> /20.0.130.154
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2001665876 for 
> /20.0.132.26
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2001689764 for 
> /20.0.130.31
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2001624719 for 
> /20.0.130.42
> {code}
> and lastly we generated a flame graph with the disablegossip step included, 
> attached as 3.11_node_restart_with_disablegossip.svg
> Based on this, it would appear that nodetool drain is correctly gossiping to 
> its peers that it is shutting down, and the peers teardown their own gossip 
> and small connections. The bug appears to be the fact that peers nearly 
> instantly reconnect the small connection which then lingers around in a dead 
> state until a 20 minute timeout kills it. The OutboundTcpConnection class 
> does not seem to detect this condition, and OutboundTcpConnectionPool reuses 
> this dead connection for hint playback which is causing the partially log 
> lines and the JMX timeout metrics for the hints service.
> We have worked around this issue by issuing a nodetool disablegossip after 
> nodetool drain to further kill any of these connections and have not seen any 
> issues on the 8 node restarts we've done since that process change. This has 
> then caused the small connection to get killed correctly, which allows it to 
> reconnect correctly once the node restarts and have hints playback correctly.
> I have also not been able to recreate this issue on a smaller non production 
> cluster or locally via CCM. It appears to be a two folded issue:
> - On the peer nodes there seems to be some kind of race condition in the 
> handling of of the gossip message indicating the restarting node is shutting 
> down (resulting from nodetool drain) and whichever thread is using the small 
> tcp connection.
> - On the restarting node a new small connection is accepted within the 
> duration of nodetool drain.
> If there is anything else I should provide here, please let me know. 
> We've been discussing this with [~rustyrazorblade] as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to