[
https://issues.apache.org/jira/browse/CASSANDRA-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174751#comment-15174751
]
Paulo Motta commented on CASSANDRA-11286:
-----------------------------------------
In order to verify that the socket timeout was indeed not being respected, I added
a new property {{cassandra.dtest.sleep_during_stream_write}} to [this
branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:11286-unpatched]
and set it to a value much longer than {{streaming_socket_timeout_in_ms}} on
[this dtest|https://github.com/pauloricardomg/cassandra-dtest/tree/11286], and
verified that the bootstrap streaming session hung forever.
Two changes are necessary for the stream socket timeout to be enforced
during a stream session:
* Create the read channel via {{Channels.newChannel(socket.getInputStream())}} in
{{ConnectionHandler.getReadChannel(socket)}}, as suggested in the [blog
post|https://technfun.wordpress.com/2009/01/29/networking-in-java-non-blocking-nio-blocking-nio-and-io/].
* Set the socket timeout on the follower side in {{IncomingStreamingConnection}}.
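As a rough illustration of the first change, the sketch below wraps the socket's input stream in a channel and checks that a blocking read on a silent connection ends in a {{SocketTimeoutException}}. It is a standalone example, not the patch itself: the class {{StreamReadTimeoutSketch}}, the helper {{timeoutAwareReadChannel}} and the loopback setup are hypothetical names invented for this example.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SocketChannel;

public class StreamReadTimeoutSketch
{
    // Hypothetical helper (not Cassandra code): build the read channel on top of
    // the socket's InputStream so that reads go through the stream, which is
    // where the timeout set by setSoTimeout() applies.
    static ReadableByteChannel timeoutAwareReadChannel(Socket socket) throws IOException
    {
        return Channels.newChannel(socket.getInputStream());
    }

    // Connects to a local peer that never writes anything and reports whether
    // the blocking read was interrupted by a SocketTimeoutException.
    static boolean readTimesOut(int timeoutMs) throws IOException
    {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             SocketChannel client = SocketChannel.open(new InetSocketAddress(loopback, server.getLocalPort()));
             Socket silentPeer = server.accept())
        {
            Socket socket = client.socket();
            // Both sides of a stream session would need this call.
            socket.setSoTimeout(timeoutMs);

            ReadableByteChannel in = timeoutAwareReadChannel(socket);
            try
            {
                in.read(ByteBuffer.allocate(1));
                return false; // data or EOF arrived: not expected with a silent peer
            }
            catch (SocketTimeoutException e)
            {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException
    {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

The socket here is obtained from a {{SocketChannel}}, which is assumed to resemble how the streaming connection is set up; reading from {{socket.getChannel()}} directly is the path that would ignore the timeout.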
After these changes, I re-executed the previous dtest on [a fixed
branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:11286-testing]
and verified that the bootstrap stream session did not hang, but instead
failed. The stream fails instead of being retried because
{{SocketTimeoutException}} is an {{IOException}}, so it is caught by [this
block|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/streaming/messages/IncomingFileMessage.java#L52]
(and not
[this|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/streaming/messages/IncomingFileMessage.java#L58])
in {{IncomingFileMessage}}.
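The catch dispatch described above can be reproduced in isolation. The sketch below is a simplified stand-in and not the actual {{IncomingFileMessage}} code: the class {{CatchOrderSketch}}, the {{Read}} interface and the returned strings are hypothetical. It only assumes the general shape of a specific {{IOException}} handler that fails, followed by a catch-all handler that retries.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class CatchOrderSketch
{
    // Hypothetical stand-in for the operation that reads a file from the stream.
    interface Read
    {
        void run() throws Exception;
    }

    // Returns which branch handled the outcome of the read.
    static String handle(Read read)
    {
        try
        {
            read.run();
            return "ok";
        }
        catch (IOException e)
        {
            // SocketTimeoutException extends IOException (through
            // InterruptedIOException), so a read timeout lands here.
            return "fail";
        }
        catch (Throwable t)
        {
            // Only exceptions that are not IOExceptions reach this branch.
            return "retry";
        }
    }

    public static void main(String[] args)
    {
        System.out.println(handle(() -> { throw new SocketTimeoutException("read timed out"); }));
        System.out.println(handle(() -> { throw new IllegalStateException("corrupted data"); }));
    }
}
```

Making a timeout retry instead would require catching {{SocketTimeoutException}} before the {{IOException}} handler.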
This behavior of failing on a socket timeout is not the one documented in
{{cassandra.yaml}}:
{noformat}
# Enable socket timeout for streaming operation.
# When a timeout occurs during streaming, streaming is retried from the start
# of the current file. This _can_ involve re-streaming an important amount of
# data, so you should avoid setting the value too low.
# Default value is 3600000, which means streams timeout after an hour.
# streaming_socket_timeout_in_ms: 3600000
{noformat}
So I updated it to:
{noformat}
# Set socket timeout for streaming operation.
# The stream session is failed if no data is received by any of the
# participants within that period.
# Default value is 3600000, which means streams timeout after an hour.
# streaming_socket_timeout_in_ms: 3600000
{noformat}
While retrying when receiving corrupted data is probably the right approach,
I'm not sure retrying on a socket timeout is desirable here. I can see two main
reasons for the socket to time out:
* Connection was broken/reset on only one side of the socket (rare but possible
situation)
* Deadlock or protocol error on the sender side
In both scenarios, I think failing the stream is the correct approach, rather than
retrying and dealing with unexpected error conditions. WDYT [~yukim]?
Below are branches with the suggested changes and tests.
||2.1||2.2||3.0||trunk||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-11286]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-11286]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-11286]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-11286]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-11286-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-11286-dtest/lastCompletedBuild/testReport/]|
commit info: minor conflict on 2.2, but other than that it merges cleanly
upwards.
> streaming socket never times out
> --------------------------------
>
> Key: CASSANDRA-11286
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11286
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Reporter: Paulo Motta
> Assignee: Paulo Motta
>
> While trying to reproduce CASSANDRA-8343 I was not able to trigger a
> {{SocketTimeoutException}} by adding an artificial sleep longer than
> {{streaming_socket_timeout_in_ms}}.
> After investigation, I detected two problems:
> * {{ReadableByteChannel}} creation via {{socket.getChannel()}}, as done in
> {{ConnectionHandler.getReadChannel(socket)}}, does not respect
> {{socket.setSoTimeout()}}, as explained in this [blog
> post|https://technfun.wordpress.com/2009/01/29/networking-in-java-non-blocking-nio-blocking-nio-and-io/]
> ** bq. The only difference between “blocking NIO” and “NIO wrapped around IO”
> is that you can’t use socket timeout with SocketChannels. Why ? Read a
> javadoc for setSocketTimeout(). It says that this timeout is used only by
> streams.
> * {{socketSoTimeout}} is never set on "follower" side, only on initiator side
> via {{DefaultConnectionFactory.createConnection(peer)}}.
> This may cause streaming to hang indefinitely, as exemplified by
> CASSANDRA-8621:
> bq. For the scenario that prompted this ticket, it appeared that the
> streaming process was completely stalled. One side of the stream (the sender
> side) had an exception that appeared to be a connection reset. The receiving
> side appeared to think that the connection was still active, at least in
> terms of the netstats reported by nodetool. We were unable to verify whether
> this was specifically the case in terms of connected sockets due to the fact
> that there were multiple streams for those peers, and there is no simple way
> to correlate a specific stream to a tcp session.