[jira] [Commented] (CASSANDRA-3569) Failure detector downs should not break streams

Peter Schuller (Commented) (JIRA) Sat, 04 Feb 2012 17:03:19 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200613#comment-13200613
 ]


Peter Schuller commented on CASSANDRA-3569:
-------------------------------------------

There is some sanity! :) It turns out that on Linux specifically you can set 
per-socket keep-alive socket options. From tcp(7):

{code}
       TCP_KEEPCNT (since Linux 2.4)
              The maximum number of keepalive probes TCP should send before
              dropping the connection.  This option should not be used in
              code intended to be portable.

       TCP_KEEPIDLE (since Linux 2.4)
              The time (in seconds) the connection needs to remain idle
              before TCP starts sending keepalive probes, if the socket
              option SO_KEEPALIVE has been set on this socket.  This option
              should not be used in code intended to be portable.

       TCP_KEEPINTVL (since Linux 2.4)
              The time (in seconds) between individual keepalive probes.
              This option should not be used in code intended to be portable.
{code}

This suddenly makes TCP keep-alive far more usable for us, since we can tune 
it per socket instead of system-wide, with the caveat of portability.
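
As a rough illustration of the per-socket options quoted above, here is a minimal Python sketch (not the proposed Cassandra patch). The helper name enable_keepalive and the timing values are hypothetical; the option constants are looked up at runtime because, as tcp(7) warns, they are not portable, so any that the platform lacks are simply skipped.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=6):
    """Enable SO_KEEPALIVE and apply per-socket timing where the OS allows.

    The default values are illustrative assumptions, not recommendations.
    Returns a dict of the options that were applied, read back from the socket.
    """
    # SO_KEEPALIVE itself is portable; the timing knobs below are not.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    wanted = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the drop
    )
    applied = {}
    for name, value in wanted:
        opt = getattr(socket, name, None)
        if opt is None:
            # Constant not exposed on this platform: keep the OS default.
            continue
        sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        applied[name] = sock.getsockopt(socket.IPPROTO_TCP, opt)
    return applied


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print(enable_keepalive(s))
    finally:
        s.close()
```

On platforms without these constants the socket still gets SO_KEEPALIVE and falls back to the system-wide timing, which is the portability caveat in practice.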

                
> Failure detector downs should not break streams
> -----------------------------------------------
>
>                 Key: CASSANDRA-3569
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>
> CASSANDRA-2433 introduced this behavior just to keep repairs from sitting 
> there waiting forever. In my opinion the correct fix to that problem is to 
> use TCP keep alive. Unfortunately the TCP keep alive period is insanely high 
> by default on a modern Linux, so just doing that is not entirely good either.
> But using the failure detector seems nonsensical to me. We have a 
> communication method, the TCP transport, that we know is used for 
> long-running processes that you don't want killed for no good reason, and 
> we are using a failure detector, tuned to detect when not to send 
> real-time-sensitive requests to nodes, in order to actively kill a 
> working connection.
> So, rather than add complexity with protocol-based ping/pongs and such, I 
> propose that we simply use TCP keep alive for streaming connections and 
> instruct operators of production clusters to tweak 
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or whatever equivalent 
> on their OS).
> I can submit the patch. Awaiting opinions.
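
To put a number on the "insanely high" default in the description above, a small Python sketch of the arithmetic follows. It assumes the stock Linux sysctl values (tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9); those figures are not stated in the issue, and the function name and the tuned values are hypothetical.

```python
def worst_case_detection_seconds(idle, interval, probes):
    """Approximate time before keep-alive gives up on a silent peer.

    The connection must first sit idle for `idle` seconds, after which
    `probes` probes are sent `interval` seconds apart before the drop.
    """
    return idle + interval * probes


# Assumed stock Linux defaults (not taken from the issue text).
default_seconds = worst_case_detection_seconds(idle=7200, interval=75, probes=9)

# Hypothetical tuned values an operator might pick for streaming.
tuned_seconds = worst_case_detection_seconds(idle=60, interval=10, probes=6)

if __name__ == "__main__":
    print(default_seconds, tuned_seconds)
```

Under those assumed defaults a dead peer goes unnoticed for a little over two hours, which is why tuning only the probe count and interval still leaves the long idle period in place.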

