[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424033#comment-16424033 ]
Jason Brown commented on CASSANDRA-14358:
-----------------------------------------

For trunk, we already use {{IdleStateHandler}} [in the outbound pipeline|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/OutboundHandshakeHandler.java#L196] for the "not making progress sending data" case. IIRC, {{WriteTimeoutHandler}} / {{ReadTimeoutHandler}} do not fit our use case for messaging; I'd have to look again to page it back in.

Further, [~jolynch], I thought your problem was in pre-4.0? There is a problem (or, at least, a non-optimization) with {{MessagingService.SocketThread}}: the single-threaded accept thread handles receiving the first message of the c* internode messaging protocol, which may require executing the TLS handshake before that message can be received (if using internode TLS). Thus a single newly establishing connection may block the accept thread for 500-1000 ms on a cross-datacenter connection. Even worse, if you are using the {{Ec2MultiRegionSnitch}}, all connections for the local region will be torn down and built anew as we switch from the public IP to the region-local private IP - even more traffic on the accept thread. So one slow connection interaction stalls the whole accept thread.

I have an internal patch for 3.0 which, when a new socket is accepted, moves all processing (internode protocol handshake, TLS handshake) to a separate thread (via an executor group). So, if one connection runs into a problem, the other incoming connections are not held up. I've been reticent about OSS'ing this late in the 3.0/3.11 branches, even though it's not that clever or invasive. Let me know if you'd be interested in giving it a shot and I can post it. If it's useful/helpful we can consider committing it as well.
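The accept-thread offload described above can be sketched as follows. This is an illustrative pattern, not the actual patch: the class name {{AcceptOffload}} and the pool size are hypothetical, and {{handshake}} stands in for the real TLS + internode-protocol handshake work.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AcceptOffload {
    // Pool that absorbs slow per-connection setup (TLS handshake, first protocol
    // message) so the single accept thread never blocks on one slow peer.
    // Daemon threads so the pool does not keep the JVM alive.
    private static final ExecutorService HANDSHAKE_POOL = Executors.newFixedThreadPool(8, r -> {
        Thread t = new Thread(r, "handshake-pool");
        t.setDaemon(true);
        return t;
    });

    public static void acceptLoop(ServerSocket server) throws IOException {
        while (!server.isClosed()) {
            // Fast: the kernel has already completed the TCP handshake.
            Socket socket = server.accept();
            // Slow work moves off the accept thread; a stall here only
            // ties up one pool thread, not the accept loop.
            HANDSHAKE_POOL.submit(() -> handshake(socket));
        }
    }

    private static void handshake(Socket socket) {
        // Placeholder for the real work: TLS handshake plus reading the first
        // internode protocol message.
        try { socket.close(); } catch (IOException ignored) {}
    }
}
```

With this shape, a cross-datacenter connection that takes 500-1000 ms to handshake no longer delays every other incoming connection behind it on the accept thread.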
> OutboundTcpConnection can hang for many minutes when nodes restart
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-14358
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 4.13 within the same AWS region. The smallest cluster I've seen the problem on is 12 nodes; it reproduces more reliably on 40+ nodes, and 300 node clusters consistently reproduce it on at least one node.
> So all the connections are SSL and we're connecting on the internal IP addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>            Reporter: Joseph Lynch
>            Priority: Major
>         Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x, which can contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 2.1.x clusters for us. I think I finally have a lead. It appears that prior to trunk (with the awesome Netty refactor) we do not set socket connect timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x), nor do we set {{SO_TIMEOUT}}, as far as I can tell, on outbound connections either. I believe that this means that we could potentially block forever on {{connect}} or {{recv}} syscalls, and we could block forever on the SSL handshake as well.
> I think that the OS will protect us somewhat (and that may be what's causing the eventual timeout), but I think that given the right network conditions our {{OutboundTcpConnection}} threads can just be stuck, never making any progress, until the OS intervenes.
> I have attached some logs of such a network partition during a rolling restart where an old node in the cluster has a completely foobarred {{OutboundTcpConnection}} for ~10 minutes before finally getting a {{java.net.SocketException: Connection timed out (Write failed)}} and immediately successfully reconnecting. I conclude that the old node is the problem because the new node (the one that restarted) is sending ECHOs to the old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new node's ECHOs, but the new node is never getting the ECHOs. This appears, to me, to indicate that the old node's {{OutboundTcpConnection}} thread is just stuck and can't make any forward progress. By the time we could notice this and slap TRACE logging on, the only thing we see is, ~10 minutes later, a {{SocketException}} inside {{writeConnected}}'s flush and an immediate recovery. It is interesting to me that the exception happens in {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write failed}} I believe that this can't be a connection reset), because my understanding is that we should have a fully handshaked SSL connection at that point in the code.
> Current theory:
> # "New" node restarts, "Old" node calls [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433]
> # Old node starts [creating a new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141] SSL socket
> # SSLSocket calls [createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98], which conveniently calls connect with a default timeout of "forever". We could hang here forever until the OS kills us.
> # If we continue, we get to [writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L263], which eventually calls [flush|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L341] on the output stream and also can hang forever. I think the probability is especially high when a node is restarting and is overwhelmed with SSL handshakes and such.
> I don't fully understand the attached traceback, as it appears we are getting a {{Connection Timeout}} from a {{send}} failure (my understanding is you can only get a connection timeout prior to a send), but I think it's reasonable that we have a timeout configuration issue.
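The hang in step 3 of the theory above comes down to which {{connect}} overload is used: {{java.net.Socket#connect(SocketAddress)}} (and the connecting constructors) block with no timeout, while the two-argument overload bounds the wait. A minimal sketch of the difference, with a hypothetical helper name:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectTimeoutDemo {
    /**
     * Connect with an explicit bound. new Socket(host, port) or
     * connect(addr) with no timeout argument blocks until the OS gives up,
     * which can take minutes; passing a timeout caps the wait.
     */
    public static boolean tryConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (Exception e) {
            // SocketTimeoutException after timeoutMillis, or a fast refusal.
            return false;
        }
    }
}
```

Under this reading, the fix for the connect-phase hang is a one-argument change at the socket-creation call site; the flush-phase hang in step 4 is harder because the blocking write has no equivalent timeout parameter.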
> I'd like to try to make Cassandra robust to networking issues like this via maybe:
> # Change the {{SSLSocket}} {{getSocket}} methods to provide connection timeouts of 2s (equivalent to trunk's [timeout|https://github.com/apache/cassandra/blob/11496039fb18bb45407246602e31740c56d28157/src/java/org/apache/cassandra/net/async/NettyFactory.java#L329])
> # Appropriately set recv timeouts via {{SO_TIMEOUT}}, maybe something like 2 minutes (in old versions via [setSoTimeout|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#setSoTimeout-int-], in trunk via [SO_TIMEOUT|http://netty.io/4.0/api/io/netty/channel/ChannelOption.html#SO_TIMEOUT])
> # Since we can't set send timeouts afaik (thanks, Java), maybe we can have some kind of watchdog that ensures OutboundTcpConnection is making progress in its queue, and if it doesn't make any progress for ~30s-1m, forces a disconnect.
> If anyone has insight or suggestions, I'd be grateful. I am going to rule out whether this is keepalive duration by setting tcp_keepalive_probes to 1 and maybe tcp_retries2 to 8 to get more information about the state of the TCP connections the next time this happens. It's a very rare bug, and when it does happen I only have 10 minutes to jump on the nodes and investigate before it fixes itself, so I'll do my best.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
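The watchdog in proposal 3 above (a workaround for the missing send timeout) could look roughly like the following. This is an illustrative sketch, not a proposed patch: the class name {{ProgressWatchdog}} is hypothetical, and the {{onStall}} callback stands in for force-closing the socket so the blocked writer thread gets a {{SocketException}} and reconnects.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ProgressWatchdog extends Thread {
    private final AtomicLong lastProgressMillis = new AtomicLong(System.currentTimeMillis());
    private final long stallThresholdMillis;
    private final Runnable onStall; // e.g. force-close the socket to unblock the writer

    public ProgressWatchdog(long stallThresholdMillis, Runnable onStall) {
        this.stallThresholdMillis = stallThresholdMillis;
        this.onStall = onStall;
        setDaemon(true);
    }

    /** Called by the connection thread each time a message is successfully flushed. */
    public void madeProgress() {
        lastProgressMillis.set(System.currentTimeMillis());
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            long stalled = System.currentTimeMillis() - lastProgressMillis.get();
            if (stalled > stallThresholdMillis) {
                onStall.run();   // disconnect; the reconnect path rebuilds the socket
                madeProgress();  // reset so we don't fire repeatedly on one stall
            }
            try {
                Thread.sleep(Math.max(1, stallThresholdMillis / 4));
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}
```

The key design point is that only the writer thread can be stuck in a blocking {{send}}, so detection and recovery have to come from a second thread that observes queue progress from outside.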