[ https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008609#comment-14008609 ]

Marcus Eriksson commented on CASSANDRA-3569:
--------------------------------------------

What I see on the sending side is:
{code}
INFO  06:02:48 InetAddress /192.168.1.50 is now DOWN
ERROR 06:03:28 [Stream #44eea080-e49b-11e3-8245-79bb5a6fc73b] Streaming error occurred
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
        at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[na:1.7.0_55]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[main/:na]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) ~[main/:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
INFO  06:03:28 [Stream #44eea080-e49b-11e3-8245-79bb5a6fc73b] Session with /192.168.1.50 is complete
WARN  06:03:28 [Stream #44eea080-e49b-11e3-8245-79bb5a6fc73b] Stream failed
ERROR 06:03:29 [Stream #45724f70-e49b-11e3-8245-79bb5a6fc73b] Streaming error occurred
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
        at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[na:1.7.0_55]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[main/:na]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) ~[main/:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
INFO  06:03:29 [Stream #45724f70-e49b-11e3-8245-79bb5a6fc73b] Session with /192.168.1.50 is complete
WARN  06:03:29 [Stream #45724f70-e49b-11e3-8245-79bb5a6fc73b] Stream failed
ERROR 06:03:30 [Stream #4663b450-e49b-11e3-8245-79bb5a6fc73b] Streaming error occurred
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
        at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[na:1.7.0_55]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[main/:na]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) ~[main/:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
INFO  06:03:30 [Stream #4663b450-e49b-11e3-8245-79bb5a6fc73b] Session with /192.168.1.50 is complete
WARN  06:03:30 [Stream #4663b450-e49b-11e3-8245-79bb5a6fc73b] Stream failed
ERROR 06:03:30 [Stream #46832330-e49b-11e3-8245-79bb5a6fc73b] Streaming error occurred
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
        at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[na:1.7.0_55]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[main/:na]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) ~[main/:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
INFO  06:03:30 [Stream #46832330-e49b-11e3-8245-79bb5a6fc73b] Session with /192.168.1.50 is complete
WARN  06:03:30 [Stream #46832330-e49b-11e3-8245-79bb5a6fc73b] Stream failed
{code}

but netstats still shows:

{code}
Mode: NORMAL
Repair 4663b450-e49b-11e3-8245-79bb5a6fc73b
    /192.168.1.50
        Sending 1 files, 1961099 bytes total
Repair 46832330-e49b-11e3-8245-79bb5a6fc73b
    /192.168.1.50
        Sending 1 files, 16671730 bytes total
Repair 44eea080-e49b-11e3-8245-79bb5a6fc73b
    /192.168.1.50
        Sending 1 files, 2071813 bytes total
Repair 45724f70-e49b-11e3-8245-79bb5a6fc73b
    /192.168.1.50
        Sending 1 files, 3856163 bytes total
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         1            533
Responses                       n/a        83           1285
{code}

And if I add a check for a -1 return value from skip(..) on the receiving 
side, it works (and the streaming session is cleared out correctly). Nice catch.
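
For illustration, the kind of check described above might look roughly like the sketch below. This is not the actual patch: the class DrainSketch, the method drain and the test stream BrokenStream are hypothetical names, and the sketch assumes the receiving side drains leftover bytes in a loop and that the underlying stream's skip(..) can report -1 once the connection is gone. Without the check, a -1 return would make the remaining count grow rather than shrink, so the loop would presumably never finish.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DrainSketch {
    /**
     * Skips up to toSkip bytes from the stream (hypothetical helper).
     * Returns the number of bytes that could NOT be skipped; 0 means fully drained.
     */
    public static long drain(InputStream in, long toSkip) throws IOException {
        while (toSkip > 0) {
            long skipped = in.skip(toSkip);
            // Assumption: -1 signals that the stream is no longer usable.
            // Stop here instead of subtracting -1 and looping forever.
            if (skipped == -1) {
                break;
            }
            toSkip -= skipped;
        }
        return toSkip;
    }

    /** Hypothetical test double: a stream whose skip(..) always reports -1. */
    public static class BrokenStream extends InputStream {
        @Override
        public int read() {
            return -1;
        }

        @Override
        public long skip(long n) {
            return -1;
        }
    }

    public static void main(String[] args) throws IOException {
        // Healthy stream with exactly 16 bytes: everything is skipped.
        System.out.println(drain(new ByteArrayInputStream(new byte[16]), 16)); // prints 0
        // Broken stream: the loop exits at once and reports 16 bytes left.
        System.out.println(drain(new BrokenStream(), 16)); // prints 16
    }
}
```

A caller could treat a non-zero return as a failed transfer and tear the session down, which would match the "session is cleared out correctly" behaviour seen above.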

> Failure detector downs should not break streams
> -----------------------------------------------
>
>                 Key: CASSANDRA-3569
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Peter Schuller
>            Assignee: Joshua McKenzie
>             Fix For: 2.1.1
>
>         Attachments: 3569-2.0.txt, 3569_v1.txt
>
>
> CASSANDRA-2433 introduced this behavior just to keep repairs from sitting 
> there waiting forever. In my opinion the correct fix to that problem is to 
> use TCP keep alive. Unfortunately the TCP keep alive period is insanely high 
> by default on a modern Linux, so just doing that is not entirely good either.
> But using the failure detector seems nonsensical to me. We have a 
> communication method which is the TCP transport, that we know is used for 
> long-running processes that you don't want to incorrectly be killed for no 
> good reason, and we are using a failure detector tuned to detecting when not 
> to send real-time sensitive requests to nodes in order to actively kill a 
> working connection.
> So, rather than add complexity with protocol based ping/pongs and such, I 
> propose that we simply use TCP keep alive for streaming connections and 
> instruct operators of production clusters to tweak 
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or the equivalent 
> on their OS).
> I can submit the patch. Awaiting opinions.
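
As a rough illustration of the TCP keep alive proposal in the quoted description, enabling SO_KEEPALIVE on a socket in Java is a single call. The sketch below is an assumption about how a streaming connection could be configured, not the attached patch; KeepAliveSketch and newStreamingSocket are hypothetical names. The probe timing itself is not set here; it comes from the operating system settings mentioned in the description.

```java
import java.io.IOException;
import java.net.Socket;

public class KeepAliveSketch {
    /**
     * Creates an unconnected socket with SO_KEEPALIVE turned on
     * (hypothetical helper; the caller would connect it afterwards).
     */
    public static Socket newStreamingSocket() throws IOException {
        Socket socket = new Socket();
        // Ask the OS to send keep-alive probes on an otherwise idle connection,
        // so a dead peer is eventually detected by the TCP stack itself.
        socket.setKeepAlive(true);
        return socket;
    }

    public static void main(String[] args) throws IOException {
        try (Socket socket = newStreamingSocket()) {
            System.out.println("SO_KEEPALIVE enabled: " + socket.getKeepAlive());
        }
    }
}
```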



--
This message was sent by Atlassian JIRA
(v6.2#6252)
