[
https://issues.apache.org/jira/browse/CASSANDRA-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brandon Williams updated CASSANDRA-13984:
-----------------------------------------
Fix Version/s: 3.11.x
> dead Small OutboundTcpConnection to restarted nodes blocking hint delivery
> --------------------------------------------------------------------------
>
> Key: CASSANDRA-13984
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13984
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Startup and Shutdown
> Environment: - We run entirely a counter based workload the clusters
> we've been seeing this issue on (although it might happen for non counter
> clusters).
> - We originally had a 48 node cassandra 3.7 cluster (averaging ~500GB of load
> on each node) running on c4.4xlarge instances in EC2.
> - We've since split this cluster into two cassandra 3.11.0 clusters (for
> other performance reasons), one with 1/3 of our dataset on 48 c4.4xlarge
> instances, and the other 2/3 of our data set onto a 96 node c4.4xlarge
> cluster.
> - All of these clusters run on ubuntu 14.04 with enhanced networking enabled.
> Reporter: John Meichle
> Priority: Normal
> Fix For: 3.11.x
>
> Attachments: 3.11_node_restart_with_disablegossip.svg,
> 3.11_node_restart_without_disablegossip.svg
>
>
> Hello. For the last two months we've been fighting performance issues
> relating to node restarts and hint playback, and were able to get a pretty
> good bit of proof for the issue last night when debugging one of these
> restarts.
> The main issue we've been fighting with these clusters is very slow and
> unstable node restarts which seem to be due to hint playback with logs
> indicating "Finished hinted handoff of file ... hints to endpoint: 1-2-3-4,
> partially". We've posted about this on the mailing list as well and this bug
> seems to be the cause of this issue
> https://lists.apache.org/thread.html/41e9cb7b20696dd177fe925aba30d372614fcfe11d98275c7d9579cc@%3Cuser.cassandra.apache.org%3E
> Our process for restarting nodes is to run nodetool drain and then restart
> the service via init. When originally watching the logs we saw on the
> restarting node the standard cassandra startup process of initializing
> keyspaces, loading sstables, and finally starting to handshake with the
> cluster, which takes about 5 minutes. After this, the logs are very quiet
> until 15 minutes later.
> During this 15 minute period, some peers (between 0 and half of the cluster)
> are reported as DN status by nodetool status.
> When checking one of the nodes that is reported as DN, we see hint playback
> logging lines such as:
> {code}
> INFO [HintsDispatcher:12] 2017-10-31 01:06:18,876
> HintsDispatchExecutor.java:289 - Finished hinted handoff of file
> 8724f664-dff1-4c20-887b-6a26ae54a9b5-1509410768866-1.hints to endpoint
> /20.0.131.175: 8724f664-dff1-4c20-887b-6a26ae54a9b5, partially
> {code}
> We traced the codepath from this log line, starting in HintsDispatcher. It
> appears the callback that is created for the hint sending is timing out as we
> verified this the JMX metrics for the HintsService:
> {code}
> org.apache.cassandra.metrics:name=HintsFailed,type=HintsService
> org.apache.cassandra.metrics:name=HintsSucceeded,type=HintsService
> org.apache.cassandra.metrics:name=HintsTimedOut,type=HintsService
> {code}
> in which HintsTimedOut was incrementing during this period. We suspected this
> might be due to the restarting node being overloaded, as we do see heavy IO
> on the restart. However that IO pattern is not for the complete duration of
> the "partial" hint playback period. We also generated a flame graph (using
> the jvm-tools stcap and ssa) of the restarting node during this time period,
> attached as 3.11_node_restart_without_disablegossip.svg. This seemed to
> confirm that the node isn't doing much during the idle period as threads are
> just parked.
> What is interesting is checking the debug logs on one of the nodes that is
> marked as DN and is partially playing back hints. While the log lines for
> hint playback are not helpful, what lead us to this apparent bug is the debug
> logs for OutboundTcpConnection. Below is a pruned snippet of these logs:
> {code}
> [GossipStage:1] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:205 -
> Enqueuing socket close for /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:00,586
> OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
> [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,586
> OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
> [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,598
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
> [HANDSHAKE-/20.0.131.175] 2017-10-31 00:46:00,599
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,601
> OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,795
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,797
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,899
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:48,002
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
> ... repeats until the node restarts gossip transport ...
> [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:49:04,703
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:49:04,806
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:50:46,545
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
> [HANDSHAKE-/20.0.131.175] 2017-10-31 00:50:46,547
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:50:46,600
> OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,376
> OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
> [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,377
> OutboundTcpConnection.java:349 - Error writing to /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,382
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
> [HANDSHAKE-/20.0.131.175] 2017-10-31 01:06:25,383
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
> [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,385
> OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
> {code}
> The timeline of events here is:
> {code}
> T 00:00 - restarting node closed the small + gossip socket to the hints host,
> and restarted
> T 00:00 - hints host then reattempted a connection and handshaked for small
> instantly after
> (node restarted)
> T 00:47 - T4:46 hints host repeatedly attempts outbound gossip connections
> for this time range hints are still playing back "partially"
> T 21:27 - hints hosts logs the previously (reattempted) small connection as
> errored (timeout)
> T 21:27 - hints host retries the connection
> T 21:27 - hints work, things are done.
> {code}
> Non pruned debug log snippet for the restarting node in the time period of
> the reconnection is:
> {code}
> debug.log:INFO [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,537
> StorageService.java:1439 - DRAINING: starting drain process
> debug.log:INFO [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,537
> HintsService.java:220 - Paused hints dispatch
> debug.log:INFO [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,578
> Server.java:176 - Stop listening for CQL clients
> debug.log:INFO [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,581
> Gossiper.java:1530 - Announcing shutdown
> debug.log:DEBUG [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,582
> StorageService.java:2252 - Node /20.0.131.175 state shutdown, token [ ....
> token range .... ]
> debug.log:INFO [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,582
> StorageService.java:2255 - Node /20.0.131.175 state jump to shutdown
> {code}
> and on the hint playing host:
> {code}
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:45:59,642
> FailureDetector.java:457 - Ignoring interval time of 2408381657 for
> /20.0.132.6
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:45:59,642
> FailureDetector.java:457 - Ignoring interval time of 3000155164 for
> /20.0.131.252
> debug.log.8.zip:INFO [GossipStage:1] 2017-10-31 00:46:00,586
> Gossiper.java:1044 - InetAddress /20.0.131.175 is now DOWN
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586
> OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586
> OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586
> OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Gossip]
> 2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to
> /20.0.131.175 closed
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small]
> 2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to
> /20.0.131.175 closed
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586
> FailureDetector.java:323 - Forcing conviction of /20.0.131.175
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small]
> 2017-10-31 00:46:00,598 OutboundTcpConnection.java:424 - Attempting to
> connect to /20.0.131.175
> debug.log.8.zip:INFO [HANDSHAKE-/20.0.131.175] 2017-10-31 00:46:00,599
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small]
> 2017-10-31 00:46:00,601 OutboundTcpConnection.java:532 - Done connecting to
> /20.0.131.175
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642
> FailureDetector.java:457 - Ignoring interval time of 2001490285 for
> /20.0.131.6
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642
> FailureDetector.java:457 - Ignoring interval time of 2001552636 for
> /20.0.132.8
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642
> FailureDetector.java:457 - Ignoring interval time of 2001576461 for
> /20.0.131.8
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643
> FailureDetector.java:457 - Ignoring interval time of 2001599351 for
> /20.0.130.18
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643
> FailureDetector.java:457 - Ignoring interval time of 2930107432 for
> /20.0.131.148
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643
> FailureDetector.java:457 - Ignoring interval time of 2001646958 for
> /20.0.130.154
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643
> FailureDetector.java:457 - Ignoring interval time of 2001665876 for
> /20.0.132.26
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643
> FailureDetector.java:457 - Ignoring interval time of 2001689764 for
> /20.0.130.31
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643
> FailureDetector.java:457 - Ignoring interval time of 2001624719 for
> /20.0.130.42
> {code}
> and lastly we generated a flame graph with the disablegossip step included,
> attached as 3.11_node_restart_with_disablegossip.svg
> Based on this, it would appear that nodetool drain is correctly gossiping to
> its peers that it is shutting down, and the peers teardown their own gossip
> and small connections. The bug appears to be the fact that peers nearly
> instantly reconnect the small connection which then lingers around in a dead
> state until a 20 minute timeout kills it. The OutboundTcpConnection class
> does not seem to detect this condition, and OutboundTcpConnectionPool reuses
> this dead connection for hint playback which is causing the partially log
> lines and the JMX timeout metrics for the hints service.
> We have worked around this issue by issuing a nodetool disablegossip after
> nodetool drain to further kill any of these connections and have not seen any
> issues on the 8 node restarts we've done since that process change. This has
> then caused the small connection to get killed correctly, which allows it to
> reconnect correctly once the node restarts and have hints playback correctly.
> I have also not been able to recreate this issue on a smaller non production
> cluster or locally via CCM. It appears to be a two folded issue:
> - On the peer nodes there seems to be some kind of race condition in the
> handling of of the gossip message indicating the restarting node is shutting
> down (resulting from nodetool drain) and whichever thread is using the small
> tcp connection.
> - On the restarting node a new small connection is accepted within the
> duration of nodetool drain.
> If there is anything else I should provide here, please let me know.
> We've been discussing this with [~rustyrazorblade] as well.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]