[ 
https://issues.apache.org/jira/browse/CASSANDRA-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716054#comment-17716054
 ] 

Eric Evans commented on CASSANDRA-13984:
----------------------------------------

 I can confirm this issue; I am seeing what is described here on a 3.11.13 
cluster, and I believe it is rather more serious than a 20 minute delay in 
hints delivery.

We have a number of clusters (all 3.11.13), but this manifests only on one of 
them (unfortunately, our most critical use-case).  The cluster in question has 
3 nodes in each of 2 data-centers, replication factor of 3 per data-center (6 
total).  We do LOCAL_QUORUM reads & writes, and EACH_QUORUM deletes.  When this 
happens,  the affected node is for all intents and purposes partitioned from 
the rest, and clients making queries to it are unable to obtain quorum.  The 
only remedy (short of waiting 20 minutes) is to restart the other nodes in the 
cluster, which exacerbates the issue.

I can make our logs available, but they look like those posted in the 
description.  The other nodes of the cluster see the shutdown, terminate the 
outbound messaging connection, and then immediately attempt to reconnect.  If 
you are doing a simple restart, then the affected node seems to come back up in 
time for this connection to succeed, if not (if for example you are rebooting 
the host), then these outbound connections must (apparently) timeout and 
reconnect (20 minutes later).

> dead Small OutboundTcpConnection to restarted nodes blocking hint delivery
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13984
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13984
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Startup and Shutdown
>         Environment: - We run entirely a counter based workload the clusters 
> we've been seeing this issue on (although it might happen for non counter 
> clusters).
> - We originally had a 48 node cassandra 3.7 cluster (averaging ~500GB of load 
> on each node) running on c4.4xlarge instances in EC2.
> - We've since split this cluster into two cassandra 3.11.0 clusters (for 
> other performance reasons), one with 1/3 of our dataset on 48 c4.4xlarge 
> instances, and the other 2/3 of our data set onto a 96 node c4.4xlarge 
> cluster.
> - All of these clusters run on ubuntu 14.04 with enhanced networking enabled.
>            Reporter: John Meichle
>            Priority: Normal
>         Attachments: 3.11_node_restart_with_disablegossip.svg, 
> 3.11_node_restart_without_disablegossip.svg
>
>
> Hello. For the last two months we've been fighting performance issues 
> relating to node restarts and hint playback, and were able to get a pretty 
> good bit of proof for the issue last night when debugging one of these 
> restarts.
> The main issue we've been fighting with these clusters is very slow and 
> unstable node restarts which seem to be due to hint playback with logs 
> indicating "Finished hinted handoff of file ... hints to endpoint: 1-2-3-4, 
> partially". We've posted about this on the mailing list as well and this bug 
> seems to be the cause of this issue 
> https://lists.apache.org/thread.html/41e9cb7b20696dd177fe925aba30d372614fcfe11d98275c7d9579cc@%3Cuser.cassandra.apache.org%3E
> Our process for restarting nodes is to run nodetool drain and then restart 
> the service via init. When originally watching the logs we saw on the 
> restarting node the standard cassandra startup process of initializing 
> keyspaces, loading sstables, and finally starting to handshake with the 
> cluster, which takes about 5 minutes. After this, the logs are very quiet 
> until 15 minutes later.
> During this 15 minute period, some peers (between 0 and half of the cluster) 
> are reported as DN status by nodetool status.
> When checking one of the nodes that is reported as DN, we see hint playback 
> logging lines such as:
> {code}
> INFO [HintsDispatcher:12] 2017-10-31 01:06:18,876 
> HintsDispatchExecutor.java:289 - Finished hinted handoff of file 
> 8724f664-dff1-4c20-887b-6a26ae54a9b5-1509410768866-1.hints to endpoint 
> /20.0.131.175: 8724f664-dff1-4c20-887b-6a26ae54a9b5, partially
> {code}
> We traced the codepath from this log line, starting in HintsDispatcher. It 
> appears the callback that is created for the hint sending is timing out as we 
> verified this the JMX metrics for the HintsService:
> {code}
> org.apache.cassandra.metrics:name=HintsFailed,type=HintsService
> org.apache.cassandra.metrics:name=HintsSucceeded,type=HintsService
> org.apache.cassandra.metrics:name=HintsTimedOut,type=HintsService
> {code}
> in which HintsTimedOut was incrementing during this period. We suspected this 
> might be due to the restarting node being overloaded, as we do see heavy IO 
> on the restart. However that IO pattern is not for the complete duration of 
> the "partial" hint playback period. We also generated a flame graph (using 
> the jvm-tools stcap and ssa) of the restarting node during this time period, 
> attached as 3.11_node_restart_without_disablegossip.svg. This seemed to 
> confirm that the node isn't doing much during the idle period as threads are 
> just parked.
> What is interesting is checking the debug logs on one of the nodes that is 
> marked as DN and is partially playing back hints. While the log lines for 
> hint playback are not helpful, what lead us to this apparent bug is the debug 
> logs for OutboundTcpConnection. Below is a pruned snippet of these logs:
> {code}
>  [GossipStage:1] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:205 - 
> Enqueuing socket close for /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:00,586 
> OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,586 
> OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,598 
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
>  [HANDSHAKE-/20.0.131.175] 2017-10-31 00:46:00,599 
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,601 
> OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,795 
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,797 
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,899 
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:48,002 
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
>     ... repeats until the node restarts gossip transport ...
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:49:04,703 
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:49:04,806 
> OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:50:46,545 
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
>  [HANDSHAKE-/20.0.131.175] 2017-10-31 00:50:46,547 
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:50:46,600 
> OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,376 
> OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,377 
> OutboundTcpConnection.java:349 - Error writing to /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,382 
> OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
>  [HANDSHAKE-/20.0.131.175] 2017-10-31 01:06:25,383 
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
>  [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,385 
> OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
> {code}
> The timeline of events here is:
> {code}
> T 00:00 - restarting node closed the small + gossip socket to the hints host, 
> and restarted
> T 00:00 - hints host then reattempted a connection and handshaked for small 
> instantly after
> (node restarted)
> T 00:47 - T4:46 hints host repeatedly attempts outbound gossip connections 
> for this time range hints are still playing back "partially"
> T 21:27 - hints hosts logs the previously (reattempted) small connection as 
> errored (timeout)
> T 21:27 - hints host retries the connection
> T 21:27 - hints work, things are done.
> {code}
> Non pruned debug log snippet for the restarting node in the time period of 
> the reconnection is:
> {code}
> debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,537 
> StorageService.java:1439 - DRAINING: starting drain process
> debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,537 
> HintsService.java:220 - Paused hints dispatch
> debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,578 
> Server.java:176 - Stop listening for CQL clients
> debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,581 
> Gossiper.java:1530 - Announcing shutdown
> debug.log:DEBUG [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,582 
> StorageService.java:2252 - Node /20.0.131.175 state shutdown, token [ .... 
> token range .... ]
> debug.log:INFO  [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,582 
> StorageService.java:2255 - Node /20.0.131.175 state jump to shutdown
> {code}
> and on the hint playing host:
> {code}
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:45:59,642 
> FailureDetector.java:457 - Ignoring interval time of 2408381657 for 
> /20.0.132.6
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:45:59,642 
> FailureDetector.java:457 - Ignoring interval time of 3000155164 for 
> /20.0.131.252
> debug.log.8.zip:INFO  [GossipStage:1] 2017-10-31 00:46:00,586 
> Gossiper.java:1044 - InetAddress /20.0.131.175 is now DOWN
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
> OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
> OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
> OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Gossip] 
> 2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to 
> /20.0.131.175 closed
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 
> 2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to 
> /20.0.131.175 closed
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 
> FailureDetector.java:323 - Forcing conviction of /20.0.131.175
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 
> 2017-10-31 00:46:00,598 OutboundTcpConnection.java:424 - Attempting to 
> connect to /20.0.131.175
> debug.log.8.zip:INFO  [HANDSHAKE-/20.0.131.175] 2017-10-31 00:46:00,599 
> OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
> debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 
> 2017-10-31 00:46:00,601 OutboundTcpConnection.java:532 - Done connecting to 
> /20.0.131.175
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 
> FailureDetector.java:457 - Ignoring interval time of 2001490285 for 
> /20.0.131.6
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 
> FailureDetector.java:457 - Ignoring interval time of 2001552636 for 
> /20.0.132.8
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 
> FailureDetector.java:457 - Ignoring interval time of 2001576461 for 
> /20.0.131.8
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2001599351 for 
> /20.0.130.18
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2930107432 for 
> /20.0.131.148
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2001646958 for 
> /20.0.130.154
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2001665876 for 
> /20.0.132.26
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2001689764 for 
> /20.0.130.31
> debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 
> FailureDetector.java:457 - Ignoring interval time of 2001624719 for 
> /20.0.130.42
> {code}
> and lastly we generated a flame graph with the disablegossip step included, 
> attached as 3.11_node_restart_with_disablegossip.svg
> Based on this, it would appear that nodetool drain is correctly gossiping to 
> its peers that it is shutting down, and the peers teardown their own gossip 
> and small connections. The bug appears to be the fact that peers nearly 
> instantly reconnect the small connection which then lingers around in a dead 
> state until a 20 minute timeout kills it. The OutboundTcpConnection class 
> does not seem to detect this condition, and OutboundTcpConnectionPool reuses 
> this dead connection for hint playback which is causing the partially log 
> lines and the JMX timeout metrics for the hints service.
> We have worked around this issue by issuing a nodetool disablegossip after 
> nodetool drain to further kill any of these connections and have not seen any 
> issues on the 8 node restarts we've done since that process change. This has 
> then caused the small connection to get killed correctly, which allows it to 
> reconnect correctly once the node restarts and have hints playback correctly.
> I have also not been able to recreate this issue on a smaller non production 
> cluster or locally via CCM. It appears to be a two folded issue:
> - On the peer nodes there seems to be some kind of race condition in the 
> handling of of the gossip message indicating the restarting node is shutting 
> down (resulting from nodetool drain) and whichever thread is using the small 
> tcp connection.
> - On the restarting node a new small connection is accepted within the 
> duration of nodetool drain.
> If there is anything else I should provide here, please let me know. 
> We've been discussing this with [~rustyrazorblade] as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to