[
https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph Lynch updated CASSANDRA-14358:
-------------------------------------
Description:
edit summary: This primarily impacts networks with stateful firewalls, such as
AWS. I'm working on a proper patch for trunk, but unfortunately it relies on the
Netty refactor in 4.0, so it will be hard to backport to previous versions. A
workaround for earlier versions is to set the {{net.ipv4.tcp_retries2}} sysctl
to ~5. This can be done with the following:
{code:bash}
$ cat /etc/sysctl.d/20-cassandra-tuning.conf
net.ipv4.tcp_retries2=5
$ # Reload all sysctls
$ sysctl --system{code}
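For intuition on why lowering {{net.ipv4.tcp_retries2}} helps: Linux retransmits
unacknowledged data with an exponentially backed-off RTO before declaring the
connection dead, so this sysctl bounds how long a connection to a silently dead
peer can stay stuck. Below is a rough back-of-the-envelope model, assuming a
200ms base RTO and a 120s backoff cap (the real kernel timeout also scales with
the connection's measured RTT, so treat the numbers as ballpark only):
{code:java}
// Rough model of how long Linux keeps retransmitting before giving up on an
// established TCP connection, as a function of net.ipv4.tcp_retries2.
// Assumptions: a 200ms base RTO and a 120s backoff cap; the real kernel
// timeout also scales with the measured RTT, so this is ballpark only.
public final class TcpRetries2Model
{
    private static final double BASE_RTO_SECONDS = 0.2;   // assumed starting RTO (TCP_RTO_MIN)
    private static final double MAX_RTO_SECONDS  = 120.0; // backoff cap (TCP_RTO_MAX)

    // Approximate seconds of silence tolerated for a given tcp_retries2 value.
    static double approxTimeoutSeconds(int tcpRetries2)
    {
        double total = 0.0;
        double rto = BASE_RTO_SECONDS;
        // Counting one extra interval after the last retransmission lines this up
        // with the ~924.6s the kernel docs quote for the default of 15.
        for (int attempt = 0; attempt <= tcpRetries2; attempt++)
        {
            total += rto;
            rto = Math.min(rto * 2, MAX_RTO_SECONDS); // exponential backoff, capped
        }
        return total;
    }

    public static void main(String[] args)
    {
        System.out.printf("tcp_retries2=15 -> ~%.1f s%n", approxTimeoutSeconds(15)); // ~924.6s (~15 min)
        System.out.printf("tcp_retries2=5  -> ~%.1f s%n", approxTimeoutSeconds(5));  // ~12.6s
    }
}
{code}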
Original Bug Report:
I've been trying to debug nodes not being able to see each other during longer
(~5 minute+) Cassandra restarts, which can contribute to
{{UnavailableExceptions}} during rolling restarts of 3.0.x and 2.1.x clusters
for us. I think I finally have a lead. It appears that prior to trunk (with the
awesome Netty refactor) we do not set socket connect timeouts on SSL
connections (in 2.1.x, 3.0.x, or 3.11.x), nor, as far as I can tell, do we set
{{SO_TIMEOUT}} on outbound connections. I believe this means we could
potentially block forever on {{connect}} or {{recv}} syscalls, and we could
block forever on the SSL handshake as well. I think the OS will protect us
somewhat (and that may be what's causing the eventual timeout), but given the
right network conditions our {{OutboundTcpConnection}} threads can just be
stuck, never making any progress, until the OS intervenes.
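To make this concrete, here is a minimal JDK-only sketch (not Cassandra code;
the peer address is a placeholder) of the two unbounded blocking calls in
question:
{code:java}
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal sketch of the two unbounded blocking calls described above.
// Plain JDK, not Cassandra code; 10.0.0.1:7001 is just a placeholder peer.
public final class UnboundedBlockingSketch
{
    public static void main(String[] args) throws Exception
    {
        Socket socket = new Socket();

        // connect() with no timeout (or a timeout of 0) blocks until the kernel's
        // own SYN retry logic (net.ipv4.tcp_syn_retries) gives up; Java itself
        // imposes no limit.
        socket.connect(new InetSocketAddress("10.0.0.1", 7001));

        // With no SO_TIMEOUT set, read() blocks indefinitely: only TCP keepalive
        // (off for ~2 hours by default) or a retransmission timeout on an
        // outstanding write will ever break it.
        InputStream in = socket.getInputStream();
        System.out.println("read returned: " + in.read());
    }
}
{code}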
I have attached some logs of such a network partition during a rolling restart
where an old node in the cluster has a completely foobarred
{{OutboundTcpConnection}} for ~10 minutes before finally getting a
{{java.net.SocketException: Connection timed out (Write failed)}} and
immediately successfully reconnecting. I conclude that the old node is the
problem because the new node (the one that restarted) is sending ECHOs to the
old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new
node's ECHOs, but the new node is never getting the ECHOs. This appears, to
me, to indicate that the old node's {{OutboundTcpConnection}} thread is just
stuck and can't make any forward progress. By the time we could notice this and
slap TRACE logging on, the only thing we see is, ~10 minutes later, a
{{SocketException}} inside {{writeConnected}}'s flush and an immediate
recovery. It is interesting to me that the exception happens in
{{writeConnected}} and it's a _connection timeout_ (and since we see {{Write
failed}} I believe this can't be a connection reset), because my
understanding is that we should have a fully handshaked SSL connection at that
point in the code.
Current theory:
# "New" node restarts, "Old" node calls
[newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433]
# Old node starts [creating a
new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141]
SSL socket
# {{SSLFactory}} calls
[createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98],
which conveniently calls connect with a default timeout of "forever". We could
hang here forever until the OS kills us (see the sketch just after this list).
# If we continue, we get to
[writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L263]
which eventually calls
[flush|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L341]
on the output stream, which can also hang forever. I think the probability is
especially high when a node is restarting and is overwhelmed with SSL
handshakes and such.
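To illustrate step 3: the {{createSocket(host, port)}} overload connects
immediately with no timeout, whereas creating an unconnected socket first lets
you bound both the {{connect}} and, via {{SO_TIMEOUT}}, the SSL handshake. A
rough JDK-only sketch of the difference (the endpoint and timeout values are
placeholders, and this is not the actual {{SSLFactory}} code):
{code:java}
import java.net.InetSocketAddress;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Rough JDK-only sketch of step 3 above and a bounded alternative.
// Not the actual SSLFactory code; endpoint and timeout values are placeholders.
public final class BoundedSslConnectSketch
{
    public static void main(String[] args) throws Exception
    {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

        // What the 3.x code path effectively does: this overload connects
        // immediately with no timeout, so it can block until the OS gives up.
        //   SSLSocket unbounded = (SSLSocket) factory.createSocket("10.0.0.1", 7001);

        // Bounded alternative: unconnected socket, explicit connect timeout, and
        // an SO_TIMEOUT that also caps how long the SSL handshake can block.
        SSLSocket socket = (SSLSocket) factory.createSocket();
        socket.connect(new InetSocketAddress("10.0.0.1", 7001), 2_000); // 2s connect timeout
        socket.setSoTimeout(120_000); // blocking reads (incl. handshake) give up after 2 minutes
        socket.startHandshake();      // now throws SocketTimeoutException instead of hanging
        System.out.println("negotiated " + socket.getSession().getProtocol());
    }
}
{code}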
I don't fully understand the attached traceback, as it appears we are getting a
{{Connection Timeout}} from a {{send}} failure (my understanding is you can
only get a connection timeout prior to a send), but I think it's reasonable
that we have a timeout configuration issue. I'd like to try to make Cassandra
robust to networking issues like this, maybe by:
# Change the {{SSLFactory}} {{getSocket}} methods to provide connection
timeouts of 2s (equivalent to trunk's
[timeout|https://github.com/apache/cassandra/blob/11496039fb18bb45407246602e31740c56d28157/src/java/org/apache/cassandra/net/async/NettyFactory.java#L329])
# Appropriately set recv timeouts via {{SO_TIMEOUT}}, maybe something like 2
minutes (in old versions via
[setSoTimeout|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#setSoTimeout-int-],
in trunk via
[SO_TIMEOUT|http://netty.io/4.0/api/io/netty/channel/ChannelOption.html#SO_TIMEOUT])
# Since we can't set send timeouts afaik (thanks, Java), maybe we can have some
kind of watchdog that ensures {{OutboundTcpConnection}} is making progress on
its queue and, if it doesn't make any progress for ~30s-1m, forces a disconnect
(a rough sketch of this idea follows below).
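For the watchdog idea, a very rough sketch of what I mean (hypothetical class
and names, not a patch): a scheduled task that records when the connection last
drained anything from its queue and force-closes the socket if it has been
stuck too long, so the writer thread gets an immediate {{SocketException}} and
goes through the normal reconnect path instead of waiting on the OS. A real
version would also need to check that something is actually queued before
treating the stall as a failure.
{code:java}
import java.io.IOException;
import java.net.Socket;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the watchdog proposal: force-close a stuck outbound
// socket so the writer thread fails fast instead of blocking until the OS
// gives up. Names and thresholds are illustrative, not a real patch.
public final class OutboundProgressWatchdog
{
    private static final long STALL_THRESHOLD_MS = 60_000; // ~1 minute without progress

    private final AtomicLong lastProgressMillis = new AtomicLong(System.currentTimeMillis());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // The writer thread calls this every time it successfully flushes a message.
    public void recordProgress()
    {
        lastProgressMillis.set(System.currentTimeMillis());
    }

    // Watch the given socket; close it if no progress is recorded for the threshold.
    // A real version would also confirm the queue is non-empty before firing.
    public void watch(Socket socket)
    {
        scheduler.scheduleAtFixedRate(() -> {
            long stalledMs = System.currentTimeMillis() - lastProgressMillis.get();
            if (stalledMs > STALL_THRESHOLD_MS && !socket.isClosed())
            {
                try
                {
                    // Closing from another thread makes a blocked write/flush throw
                    // a SocketException immediately, triggering reconnect logic.
                    socket.close();
                }
                catch (IOException ignored)
                {
                }
            }
        }, STALL_THRESHOLD_MS, STALL_THRESHOLD_MS, TimeUnit.MILLISECONDS);
    }

    public void shutdown()
    {
        scheduler.shutdownNow();
    }
}
{code}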
If anyone has insight or suggestions, I'd be grateful. I am going to rule out
whether this is a keepalive duration issue by setting {{tcp_keepalive_probes}}
to like 1 and maybe {{tcp_retries2}} to like 8 to get more information about
the state of the TCP connections the next time this happens. It's a very rare
bug, and when it does happen I only have ~10 minutes to jump on the nodes and
investigate before it fixes itself, so I'll do my best.
> OutboundTcpConnection can hang for many minutes when nodes restart
> ------------------------------------------------------------------
>
> Key: CASSANDRA-14358
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running
> with {{internode_encryption: all}} and the EC2 multi region snitch on Linux
> 4.13 within the same AWS region. The smallest cluster I've seen the problem on
> is 12 nodes; it reproduces more reliably on 40+ nodes, and 300 node clusters
> consistently reproduce it on at least one node.
> So all the connections are SSL and we're connecting on the internal IP
> addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
> Reporter: Joseph Lynch
> Assignee: Joseph Lynch
> Priority: Major
> Attachments: 10 Minute Partition.pdf
>
>