[
https://issues.apache.org/jira/browse/CASSANDRA-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185564#comment-15185564
]
Paulo Motta commented on CASSANDRA-11286:
-----------------------------------------
There is a relatively common scenario when the receiver receives a {{COMPLETE}}
message from the sender, changes its state to {{WAIT_COMPLETE}} and then the
{{IncomingMessageHandler}} blocks on the socket again, and meanwhile the
{{OnCompletionRunnable}} finishes processing the received stream and closes the
session via {{maybeCompleted()}}. Even though the session is terminated
correctly on both sides, after {{streaming_socket_timeout_in_ms}} the user may
get a harmless but scary {{StreamSocketTimeout}} log message from the
previously blocked socket (got this in dtests a few times). In order to fix
this I updated the patch to close the {{OutgoingStreamHandler}} after sending
the last {{COMPLETE}} message, and the {{IncomingStreamHandler}} after
receiving the last {{COMPLETE}} message, so all resources are properly released.
Below is an example of this scenario:
{noformat}
DEBUG [STREAM-IN-/127.0.0.1] 2016-03-08 15:31:18,632 ConnectionHandler.java:269
- [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] Received Complete
DEBUG [STREAM-OUT-/127.0.0.1] 2016-03-08 15:31:19,004
ConnectionHandler.java:341 - [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741]
Sending Complete
DEBUG [StreamReceiveTask:1] 2016-03-08 15:31:19,004 ConnectionHandler.java:110
- [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] Closing stream connection
handler on /127.0.0.1
INFO [StreamReceiveTask:1] 2016-03-08 15:31:19,005 StreamResultFuture.java:182
- [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] Session with /127.0.0.1 is
complete
INFO [StreamReceiveTask:1] 2016-03-08 15:31:19,025 StreamResultFuture.java:214
- [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] All sessions completed
.... {{streaming_socket_timeout}} passes...
ERROR [STREAM-IN-/127.0.0.1] 2016-03-08 15:31:19,638 StreamSession.java:524 -
[Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] Streaming error occurred
java.net.SocketTimeoutException: null
at
sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211)
~[na:1.8.0_66]
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
~[na:1.8.0_66]
at
java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
~[na:1.8.0_66]
at
org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:53)
~[main/:na]
at
org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:268)
~[main/:na]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66]
{noformat}
I also noticed that
{{replace_address_test.TestReplaceAddress.replace_with_reset_resume_state_test}}
(CASSANDRA-11246) and
{{replace_address_test.TestReplaceAddress.resumable_replace_test}} was failing
for the same reason as
{{bootstrap_test.py:TestBootstrap.resumable_bootstrap_test}} (CASSANDRA-10912),
so I applied the fix of CASSANDRA-10167 (decrease the socket timeout) to the
[dtest PR|https://github.com/riptano/cassandra-dtest/pull/841], which should
avoid hanging on these tests due to default {{streaming_socket_timeout_in_ms}}
of 1 hour.
I rebased and updated branches and resubmitted tests:
||2.1||2.2||3.0||3.5||trunk||dtest||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-11286]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-11286]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-11286]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.5...pauloricardomg:3.5-11286]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-11286]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:11286-dtest]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.5-11286-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-11286-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.5-11286-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-11286-dtest/lastCompletedBuild/testReport/]|
> streaming socket never times out
> --------------------------------
>
> Key: CASSANDRA-11286
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11286
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Reporter: Paulo Motta
> Assignee: Paulo Motta
>
> While trying to reproduce CASSANDRA-8343 I was not able to trigger a
> {{SocketTimeoutException}} by adding an artificial sleep longer than
> {{streaming_socket_timeout_in_ms}}.
> After investigation, I detected two problems:
> * {{ReadableByteChannel}} creation via {{socket.getChannel()}}, as done in
> {{ConnectionHandler.getReadChannel(socket)}}, does not respect
> {{socket.setSoTimeout()}}, as explained in this [blog
> post|https://technfun.wordpress.com/2009/01/29/networking-in-java-non-blocking-nio-blocking-nio-and-io/]
> ** bq. The only difference between “blocking NIO” and “NIO wrapped around IO”
> is that you can’t use socket timeout with SocketChannels. Why ? Read a
> javadoc for setSocketTimeout(). It says that this timeout is used only by
> streams.
> * {{socketSoTimeout}} is never set on "follower" side, only on initiator side
> via {{DefaultConnectionFactory.createConnection(peer)}}.
> This may cause streaming to hang indefinitely, as exemplified by
> CASSANDRA-8621:
> bq. For the scenario that prompted this ticket, it appeared that the
> streaming process was completely stalled. One side of the stream (the sender
> side) had an exception that appeared to be a connection reset. The receiving
> side appeared to think that the connection was still active, at least in
> terms of the netstats reported by nodetool. We were unable to verify whether
> this was specifically the case in terms of connected sockets due to the fact
> that there were multiple streams for those peers, and there is no simple way
> to correlate a specific stream to a tcp session.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)