[ https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008609#comment-14008609 ]
Marcus Eriksson commented on CASSANDRA-3569:
--------------------------------------------

What I see on the sending side is:
{code}
INFO 06:02:48 InetAddress /192.168.1.50 is now DOWN
ERROR 06:03:28 [Stream #44eea080-e49b-11e3-8245-79bb5a6fc73b] Streaming error occurred
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
    at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[na:1.7.0_55]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
    at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[main/:na]
    at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) ~[main/:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
INFO 06:03:28 [Stream #44eea080-e49b-11e3-8245-79bb5a6fc73b] Session with /192.168.1.50 is complete
WARN 06:03:28 [Stream #44eea080-e49b-11e3-8245-79bb5a6fc73b] Stream failed
ERROR 06:03:29 [Stream #45724f70-e49b-11e3-8245-79bb5a6fc73b] Streaming error occurred
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
    at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[na:1.7.0_55]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
    at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[main/:na]
    at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) ~[main/:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
INFO 06:03:29 [Stream #45724f70-e49b-11e3-8245-79bb5a6fc73b] Session with /192.168.1.50 is complete
WARN 06:03:29 [Stream #45724f70-e49b-11e3-8245-79bb5a6fc73b] Stream failed
ERROR 06:03:30 [Stream #4663b450-e49b-11e3-8245-79bb5a6fc73b] Streaming error occurred
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
    at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[na:1.7.0_55]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
    at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[main/:na]
    at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) ~[main/:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
INFO 06:03:30 [Stream #4663b450-e49b-11e3-8245-79bb5a6fc73b] Session with /192.168.1.50 is complete
WARN 06:03:30 [Stream #4663b450-e49b-11e3-8245-79bb5a6fc73b] Stream failed
ERROR 06:03:30 [Stream #46832330-e49b-11e3-8245-79bb5a6fc73b] Streaming error occurred
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
    at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[na:1.7.0_55]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
    at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51) ~[main/:na]
    at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) ~[main/:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
INFO 06:03:30 [Stream #46832330-e49b-11e3-8245-79bb5a6fc73b] Session with /192.168.1.50 is complete
WARN 06:03:30 [Stream #46832330-e49b-11e3-8245-79bb5a6fc73b] Stream failed
{code}
but netstats still shows:
{code}
Mode: NORMAL
Repair 4663b450-e49b-11e3-8245-79bb5a6fc73b
    /192.168.1.50
        Sending 1 files, 1961099 bytes total
Repair 46832330-e49b-11e3-8245-79bb5a6fc73b
    /192.168.1.50
        Sending 1 files, 16671730 bytes total
Repair 44eea080-e49b-11e3-8245-79bb5a6fc73b
    /192.168.1.50
        Sending 1 files, 2071813 bytes total
Repair 45724f70-e49b-11e3-8245-79bb5a6fc73b
    /192.168.1.50
        Sending 1 files, 3856163 bytes total
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         1            533
Responses                       n/a        83           1285
{code}

And if I add a check for -1 on the return value of skip(..) on the receiving side, it works (and the streaming session is cleared out correctly). Nice catch.

> Failure detector downs should not break streams
> -----------------------------------------------
>
>                 Key: CASSANDRA-3569
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Peter Schuller
>            Assignee: Joshua McKenzie
>             Fix For: 2.1.1
>
>         Attachments: 3569-2.0.txt, 3569_v1.txt
>
>
> CASSANDRA-2433 introduced this behavior just to keep repairs from sitting
> there waiting forever. In my opinion the correct fix to that problem is to
> use TCP keep alive. Unfortunately the TCP keep alive period is insanely high
> by default on a modern Linux, so just doing that is not entirely good either.
> But using the failure detector seems nonsensical to me. We have a
> communication method, the TCP transport, that we know is used for
> long-running processes that you don't want to be incorrectly killed for no
> good reason, and we are using a failure detector tuned to detect when not
> to send real-time-sensitive requests to nodes in order to actively kill a
> working connection.
>
> So, rather than add complexity with protocol-based ping/pongs and such, I
> propose that we simply use TCP keep alive for streaming connections and
> instruct operators of production clusters to tweak
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or the equivalent
> on their OS).
>
> I can submit the patch. Awaiting opinions.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
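
For readers following along, here is a minimal Java sketch of the two ideas discussed in this thread: enabling TCP keep alive on a streaming connection (as proposed in the description) and treating a -1 from skip(..) as end of stream on the receiving side (as mentioned in the comment). It is not the attached patch. The class and method names (`StreamingSketch`, `enableKeepAlive`, `drain`) are hypothetical, and the actual Cassandra code may well be structured differently.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

// Hypothetical illustration only; none of these names come from Cassandra.
final class StreamingSketch
{
    private StreamingSketch() {}

    // Ask the OS to send TCP keep-alive probes on this connection.
    // How soon a dead peer is noticed is decided by OS-level settings,
    // not by this call.
    static void enableKeepAlive(SocketChannel channel) throws IOException
    {
        channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
    }

    // Skip exactly `remaining` bytes of the stream, or fail with EOFException.
    // Without the negative-return check, a stream whose skip(..) keeps
    // returning -1 would make `remaining` grow and the loop never end.
    static void drain(InputStream in, long remaining) throws IOException
    {
        while (remaining > 0)
        {
            long skipped = in.skip(remaining);
            if (skipped < 0)
                throw new EOFException("stream ended with " + remaining + " bytes left to skip");

            if (skipped == 0)
            {
                // skip(..) may return 0 without being at end of stream,
                // so read a single byte to tell the two cases apart.
                if (in.read() < 0)
                    throw new EOFException("stream ended with " + remaining + " bytes left to skip");
                skipped = 1;
            }
            remaining -= skipped;
        }
    }
}
```

With SO_KEEPALIVE set, detection time still depends on the operating system's keep-alive settings, such as the net.ipv4.tcp_keepalive_{probes,intvl} values mentioned in the description.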