[ 
https://issues.apache.org/jira/browse/CASSANDRA-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174751#comment-15174751
 ] 

Paulo Motta commented on CASSANDRA-11286:
-----------------------------------------

In order to verify that the socket timeout was indeed not being respected, I 
added a new property {{cassandra.dtest.sleep_during_stream_write}} to [this 
branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:11286-unpatched]
 and set it to a value much longer than {{streaming_socket_timeout_in_ms}} in 
[this dtest|https://github.com/pauloricardomg/cassandra-dtest/tree/11286], and 
verified that the bootstrap streaming session hung forever.

Two changes are necessary for the stream socket timeout to be enforced 
during a stream session:
* Create the read channel via {{Channels.newChannel(socket.getInputStream());}} in 
{{ConnectionHandler.getReadChannel(socket)}}, as suggested in the [blog 
post|https://technfun.wordpress.com/2009/01/29/networking-in-java-non-blocking-nio-blocking-nio-and-io/].
* Set the socket timeout on the follower side, in {{IncomingStreamingConnection}}.
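
To illustrate the first point, here is a minimal standalone sketch (not the actual patch). It opens a loopback connection whose peer never writes, sets {{SO_TIMEOUT}} on the socket, and reads either through {{socket.getChannel()}} or through {{Channels.newChannel(socket.getInputStream())}}. The class name {{StreamTimeoutSketch}}, the method {{readOutcome}} and the timeout values are made up for this example.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class, not part of Cassandra.
class StreamTimeoutSketch
{
    /**
     * Reads from a connection whose peer never writes and reports what the
     * read did within waitMs: "timed out", "still blocked", "returned" or "failed: ...".
     */
    static String readOutcome(boolean wrapInputStream, int timeoutMs, long waitMs) throws Exception
    {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             SocketChannel client = SocketChannel.open(new InetSocketAddress(loopback, server.getLocalPort()));
             Socket silentPeer = server.accept()) // accepted, but never written to
        {
            Socket socket = client.socket();
            socket.setSoTimeout(timeoutMs);

            // The only difference between the two cases is how the read channel is obtained.
            ReadableByteChannel in = wrapInputStream
                                   ? Channels.newChannel(socket.getInputStream())
                                   : socket.getChannel();

            AtomicReference<String> outcome = new AtomicReference<>("returned");
            Thread reader = new Thread(() -> {
                try { in.read(ByteBuffer.allocate(16)); }
                catch (SocketTimeoutException e) { outcome.set("timed out"); }
                catch (IOException e) { outcome.set("failed: " + e); }
            });
            reader.setDaemon(true);
            reader.start();
            reader.join(waitMs); // give the read a bounded amount of time

            // Closing the resources on exit unblocks a reader that is still waiting.
            return reader.isAlive() ? "still blocked" : outcome.get();
        }
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println("socket.getChannel(): " + readOutcome(false, 200, 1000));
        System.out.println("Channels.newChannel(socket.getInputStream()): " + readOutcome(true, 200, 3000));
    }
}
```

With a 200 ms timeout, the read through the wrapped input stream is expected to end with a {{SocketTimeoutException}}, while the read through {{socket.getChannel()}} should still be blocked when the wait expires. The second change (calling {{setSoTimeout}} on the follower side) matters for the same reason: without a timeout on the accepted socket, neither variant would ever time out there.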

After these changes, I re-executed the previous dtest on [a fixed 
branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:11286-testing]
 and verified that the bootstrap stream session did not hang, but instead 
failed. The stream fails rather than being retried because 
{{SocketTimeoutException}} is an {{IOException}}, so it is caught by [this 
block|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/streaming/messages/IncomingFileMessage.java#L52]
 (and not 
[this|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/streaming/messages/IncomingFileMessage.java#L58])
 in {{IncomingFileMessage}}.
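
The catch ordering can be sketched in isolation. The snippet below is a hypothetical stand-in, not the code in {{IncomingFileMessage}}: it only assumes a {{catch (IOException)}} block followed by a broader {{catch (Throwable)}} block, and treats the second one as the retry path as described above. {{CatchOrderSketch}} and {{handle}} are names invented for this example.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical stand-in for the two catch blocks; not the actual Cassandra code.
class CatchOrderSketch
{
    static String handle(Callable<?> read)
    {
        try
        {
            read.call();
            return "ok";
        }
        catch (IOException e)
        {
            // SocketTimeoutException extends InterruptedIOException, which extends
            // IOException, so a read timeout always lands in this first block.
            return "fail";
        }
        catch (Throwable t)
        {
            // Assumed retry path: only reached for non-I/O errors.
            return "retry";
        }
    }

    public static void main(String[] args)
    {
        System.out.println(handle(() -> { throw new SocketTimeoutException("read timed out"); }));
        System.out.println(handle(() -> { throw new IllegalStateException("corrupted data"); }));
    }
}
```

Since the first matching {{catch}} clause wins, a timeout can never reach the second block without an explicit {{catch (SocketTimeoutException)}} placed before the {{IOException}} one.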

This behavior of failing on socket timeout is not the one documented in 
{{cassandra.yaml}}:
{noformat}
# Enable socket timeout for streaming operation.
# When a timeout occurs during streaming, streaming is retried from the start
# of the current file. This _can_ involve re-streaming an important amount of
# data, so you should avoid setting the value too low.
# Default value is 3600000, which means streams timeout after an hour.
# streaming_socket_timeout_in_ms: 3600000
{noformat}

So I updated it to:
{noformat}
# Set socket timeout for streaming operation.
# The stream session is failed if no data is received by any of the
# participants within that period.
# Default value is 3600000, which means streams timeout after an hour.
# streaming_socket_timeout_in_ms: 3600000
{noformat}

While retrying when receiving corrupted data is probably the right approach, 
I'm not sure retrying on a socket timeout is desirable here. I can see two main 
reasons for the socket to time out:
* Connection was broken/reset on only one side of the socket (rare but possible 
situation)
* Deadlock or protocol error on the sender side

In both scenarios, I think failing the stream is the correct approach, rather 
than retrying and dealing with unexpected error conditions. WDYT [~yukim]?

Below are branches with the suggested changes and tests.

||2.1||2.2||3.0||trunk||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-11286]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-11286]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-11286]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-11286]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-11286-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-11286-dtest/lastCompletedBuild/testReport/]|

commit info: minor conflict on 2.2, but other than that it merges cleanly 
upwards.

> streaming socket never times out
> --------------------------------
>
>                 Key: CASSANDRA-11286
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11286
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>
> While trying to reproduce CASSANDRA-8343 I was not able to trigger a 
> {{SocketTimeoutException}} by adding an artificial sleep longer than 
> {{streaming_socket_timeout_in_ms}}.
> After investigation, I detected two problems:
> * {{ReadableByteChannel}} creation via {{socket.getChannel()}}, as done in 
> {{ConnectionHandler.getReadChannel(socket)}}, does not respect 
> {{socket.setSoTimeout()}}, as explained in this [blog 
> post|https://technfun.wordpress.com/2009/01/29/networking-in-java-non-blocking-nio-blocking-nio-and-io/]
> ** bq. The only difference between “blocking NIO” and “NIO wrapped around IO” 
> is that you can’t use socket timeout with SocketChannels. Why ? Read a 
> javadoc for setSocketTimeout(). It says that this timeout is used only by 
> streams.
> * {{socketSoTimeout}} is never set on "follower" side, only on initiator side 
> via {{DefaultConnectionFactory.createConnection(peer)}}.
> This may cause streaming to hang indefinitely, as exemplified by 
> CASSANDRA-8621:
> bq. For the scenario that prompted this ticket, it appeared that the 
> streaming process was completely stalled. One side of the stream (the sender 
> side) had an exception that appeared to be a connection reset. The receiving 
> side appeared to think that the connection was still active, at least in 
> terms of the netstats reported by nodetool. We were unable to verify whether 
> this was specifically the case in terms of connected sockets due to the fact 
> that there were multiple streams for those peers, and there is no simple way 
> to correlate a specific stream to a tcp session.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)