[
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630690#comment-16630690
]
Kevin Zhang commented on CASSANDRA-14358:
-----------------------------------------
[~jolynch]: thanks for the answer. I did the kernel workaround and it worked.
> OutboundTcpConnection can hang for many minutes when nodes restart
> ------------------------------------------------------------------
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux
> 4.13 within the same AWS region. The smallest cluster I've seen the problem
> on is 12 nodes; it reproduces more reliably on 40+ nodes, and 300-node
> clusters consistently reproduce it on at least one node.
> So all the connections are SSL and we're connecting on the internal IP
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
> Reporter: Joseph Lynch
> Assignee: Joseph Lynch
> Priority: Major
> Labels: 4.0-feature-freeze-review-requested
> Fix For: 4.0, 2.1.x, 2.2.x, 3.0.x, 3.11.x
>
> Attachments: 10 Minute Partition.pdf
>
>
> edit summary: This primarily impacts networks with stateful firewalls such as
> AWS. I'm working on a proper patch for trunk but unfortunately it relies on
> the Netty refactor in 4.0 so it will be hard to backport to previous
> versions. A workaround for earlier versions is to set the
> {{net.ipv4.tcp_retries2}} sysctl to ~5. This can be done with the following:
> {noformat}
> $ cat /etc/sysctl.d/20-cassandra-tuning.conf
> net.ipv4.tcp_retries2=5
> $ # Reload all sysctls
> $ sysctl --system
> {noformat}
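> Back-of-the-envelope for why this helps (assuming Linux's 200ms minimum RTO,
> exponential backoff, and the 120s RTO cap; a live connection's measured RTO
> is usually larger, so real timeouts are longer):
> {noformat}
> tcp_retries2=15 (default): 0.2 + 0.4 + ... + 102.4 + 6 x 120s ~= 924.6s (~15 minutes)
> tcp_retries2=5:            0.2 + 0.4 + 0.8 + 1.6 + 3.2 + 6.4  ~= 12.6s
> {noformat}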
> Original Bug Report:
> I've been trying to debug nodes not being able to see each other during
> longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x, which can
> contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and
> 2.1.x clusters for us. I think I finally have a lead. It appears that prior
> to trunk (with the awesome Netty refactor) we do not set socket connect
> timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x), nor, as far as I
> can tell, do we set {{SO_TIMEOUT}} on outbound connections. I believe this
> means we could potentially block forever on {{connect}} or {{recv}} syscalls,
> and we could block forever on the SSL handshake as well. I think the OS will
> protect us somewhat (and that may be what's causing the eventual timeout),
> but given the right network conditions our {{OutboundTcpConnection}} threads
> can just be stuck, never making any progress, until the OS intervenes.
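> For illustration only, a minimal sketch of the difference using the plain
> {{java.net.Socket}} API (the address and the 2s/2m values here are
> placeholders, not what Cassandra uses):
> {code:java}
> import java.io.IOException;
> import java.net.InetSocketAddress;
> import java.net.Socket;
>
> public class ConnectTimeoutSketch {
>     public static void main(String[] args) throws IOException {
>         // Roughly what the report describes: connect() blocks with no timeout,
>         // and reads block indefinitely because no SO_TIMEOUT is set.
>         Socket unbounded = new Socket("10.0.0.1", 7001);
>
>         // Bounded version: explicit connect timeout plus a read timeout.
>         Socket bounded = new Socket();
>         bounded.connect(new InetSocketAddress("10.0.0.1", 7001), 2000); // 2s connect timeout
>         bounded.setSoTimeout(120_000);                                  // 2 minute SO_TIMEOUT
>     }
> }
> {code}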
> I have attached some logs of such a network partition during a rolling
> restart where an old node in the cluster has a completely foobarred
> {{OutboundTcpConnection}} for ~10 minutes before finally getting a
> {{java.net.SocketException: Connection timed out (Write failed)}} and
> immediately successfully reconnecting. I conclude that the old node is the
> problem because the new node (the one that restarted) is sending ECHOs to the
> old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new
> node's ECHOs, but the new node never receives the old node's ECHOs. This
> appears to me to indicate that the old node's {{OutboundTcpConnection}}
> thread is just stuck and can't make any forward progress. By the time we
> could notice this and turn on TRACE logging, the only thing we see is, ~10
> minutes later, a {{SocketException}} inside {{writeConnected}}'s flush and an
> immediate recovery. It is interesting to me that the exception happens in
> {{writeConnected}} and is a _connection timeout_ (and since we see {{Write
> failed}} I believe it can't be a connection reset), because my understanding
> is that we should have a fully handshaked SSL connection at that point in
> the code.
> Current theory:
> # "New" node restarts, "Old" node calls
> [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433]
> # Old node starts [creating a
> new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141]
> SSL socket
> # SSLSocket calls
> [createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98],
> which conveniently calls connect with a default timeout of "forever". We
> could hang here forever until the OS kills us (see the sketch after this
> list).
> # If we continue, we get to
> [writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L263]
> which eventually calls
> [flush|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L341]
> on the output stream and also can hang forever. I think the probability is
> especially high when a node is restarting and is overwhelmed with SSL
> handshakes and such.
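> To make step 3 concrete, here is a hypothetical sketch (not the actual
> {{SSLFactory}} code) of creating the SSL socket unconnected and then
> connecting with an explicit timeout; the 2-second value mirrors the trunk
> timeout referenced below:
> {code:java}
> import java.io.IOException;
> import java.net.InetSocketAddress;
> import javax.net.ssl.SSLSocket;
> import javax.net.ssl.SSLSocketFactory;
>
> public class SslConnectSketch {
>     // Hypothetical helper: create the socket unconnected, then connect with a
>     // bounded timeout instead of the default "wait forever" behaviour.
>     static SSLSocket openWithTimeout(SSLSocketFactory factory, String host, int port)
>             throws IOException {
>         SSLSocket socket = (SSLSocket) factory.createSocket(); // unconnected
>         socket.connect(new InetSocketAddress(host, port), 2000); // 2s connect timeout
>         // The handshake itself can still block indefinitely unless a read
>         // timeout (SO_TIMEOUT) is also set before startHandshake().
>         socket.startHandshake();
>         return socket;
>     }
> }
> {code}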
> I don't fully understand the attached traceback, as it appears we are
> getting a {{Connection Timeout}} from a {{send}} failure (my understanding is
> that you can only get a connection timeout prior to a send), but I think it's
> reasonable that we have a timeout configuration issue. I'd like to try to
> make Cassandra more robust to networking issues like this. Possible fixes:
> # Change the {{SSLSocket}} {{getSocket}} methods to provide connection
> timeouts of 2s (equivalent to trunk's
> [timeout|https://github.com/apache/cassandra/blob/11496039fb18bb45407246602e31740c56d28157/src/java/org/apache/cassandra/net/async/NettyFactory.java#L329])
> # Appropriately set recv timeouts via {{SO_TIMEOUT}}, maybe something like 2
> minutes (in old versions via
> [setSoTimeout|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#setSoTimeout-int-],
> in trunk via
> [SO_TIMEOUT|http://netty.io/4.0/api/io/netty/channel/ChannelOption.html#SO_TIMEOUT])
> # Since we can't set send timeouts afaik (thanks, Java), maybe we can have
> some kind of watchdog that ensures {{OutboundTcpConnection}} is making
> progress on its queue and, if it makes no progress for ~30s-1m, forces a
> disconnect (a rough sketch of this idea follows the list).
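> A rough, hypothetical sketch of the watchdog idea from point 3 (names and
> thresholds are made up, and this assumes the connection thread bumps a
> progress marker whenever it flushes a message; it is not a proposed patch):
> {code:java}
> import java.net.Socket;
> import java.util.Queue;
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicLong;
>
> public class ConnectionWatchdog {
>     private final AtomicLong lastProgressNanos = new AtomicLong(System.nanoTime());
>     private final ScheduledExecutorService scheduler =
>             Executors.newSingleThreadScheduledExecutor();
>
>     // The connection thread would call this every time it successfully flushes a message.
>     public void markProgress() {
>         lastProgressNanos.set(System.nanoTime());
>     }
>
>     // If there is a backlog but nothing has been flushed for ~60s, close the
>     // socket so the blocked connection thread gets an exception and reconnects.
>     public void watch(Socket socket, Queue<?> backlog) {
>         scheduler.scheduleAtFixedRate(() -> {
>             long stalled = System.nanoTime() - lastProgressNanos.get();
>             if (!backlog.isEmpty() && stalled > TimeUnit.SECONDS.toNanos(60)) {
>                 try {
>                     socket.close();
>                 } catch (Exception ignored) {
>                 }
>             }
>         }, 10, 10, TimeUnit.SECONDS);
>     }
> }
> {code}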
> If anyone has insight or suggestions, I'd be grateful. I am going to try to
> rule out whether this is the keepalive duration by setting
> tcp_keepalive_probes to ~1 and maybe tcp_retries2 to ~8 to get more
> information about the state of the TCP connections the next time this
> happens. It's a very rare bug, and when it does happen I only have ~10
> minutes to jump on the nodes before it fixes itself, so I'll do my best.
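> For anyone else trying to reproduce this, those diagnostic settings can be
> applied temporarily with something like:
> {noformat}
> $ sudo sysctl -w net.ipv4.tcp_keepalive_probes=1
> $ sudo sysctl -w net.ipv4.tcp_retries2=8
> {noformat}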