[jira] [Comment Edited] (CASSANDRA-3569) Failure detector downs should not break streams

Omid Aladini (JIRA) Thu, 21 May 2015 16:20:54 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554770#comment-14554770
 ]


Omid Aladini edited comment on CASSANDRA-3569 at 5/21/15 11:19 PM:
-------------------------------------------------------------------

{quote}
May have better luck asking on CASSANDRA-7560 - I'm sure Yuki can better speak 
to why he decided to only partially back-port this ticket to 2.0; my guess is 
that he did the minimum necessary to the stable branch to rectify the hang that 
7560 is supposed to address.
{quote}
Right. Thought I'd keep the conversation here as the rest of the patch 
(streaming part) isn't relevant to that thread. 

If everyone's ok with that, I can submit a partial patch that applies the rest 
of this change-set on the current 2.0 before it becomes frozen.

[~yukim]: thoughts? 


was (Author: omid):
{quote}
May have better luck asking on CASSANDRA-7560 - I'm sure Yuki can better speak 
to why he decided to only partially back-port this ticket to 2.0; my guess is 
that he did the minimum necessary to the stable branch to rectify the hang that 
7560 is supposed to address.
{quote}
Right. Thought I'd keep the conversation here as the rest of the patch 
(streaming part) isn't relevant to that thread. 

If anyone's ok with that, I can submit a partial patch that applies the rest of 
this change-set on the current 2.0 before it becomes frozen.

[~yukim]: thoughts? 

> Failure detector downs should not break streams
> -----------------------------------------------
>
>                 Key: CASSANDRA-3569
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Peter Schuller
>            Assignee: Joshua McKenzie
>             Fix For: 2.1.1
>
>         Attachments: 3569-2.0.txt, 3569_v1.txt
>
>
> CASSANDRA-2433 introduced this behavior just so that repairs don't sit 
> there waiting forever. In my opinion the correct fix to that problem is to 
> use TCP keep alive. Unfortunately the TCP keep alive period is insanely high 
> by default on a modern Linux, so just doing that is not entirely good either.
> But using the failure detector seems nonsensical to me. We have a 
> communication method which is the TCP transport, that we know is used for 
> long-running processes that you don't want to incorrectly be killed for no 
> good reason, and we are using a failure detector tuned to detecting when not 
> to send real-time sensitive requests to nodes in order to actively kill a 
> working connection.
> So, rather than add complexity with protocol based ping/pongs and such, I 
> propose that we simply use TCP keep alive for streaming connections and 
> instruct operators of production clusters to tweak 
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or whatever equivalent 
> on their OS).
> I can submit the patch. Awaiting opinions.
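
For illustration, a minimal sketch of what the description proposes: enabling 
SO_KEEPALIVE on a streaming socket through the standard java.net.Socket API. 
The class and method names (KeepAliveSketch, enableKeepAlive) are hypothetical 
and are not taken from the attached patches. Probe timing is still governed by 
the OS settings named above, such as net.ipv4.tcp_keepalive_{probes,intvl}.

```java
import java.io.IOException;
import java.net.Socket;

// Hypothetical sketch, not Cassandra code.
class KeepAliveSketch {
    // Turns on SO_KEEPALIVE so the kernel probes an idle peer and
    // eventually fails the connection if the peer is gone.
    static Socket enableKeepAlive(Socket socket) throws IOException {
        socket.setKeepAlive(true);
        return socket;
    }

    public static void main(String[] args) throws IOException {
        // An unconnected socket is enough to show the option being set.
        try (Socket socket = enableKeepAlive(new Socket())) {
            System.out.println("SO_KEEPALIVE enabled: " + socket.getKeepAlive());
        }
    }
}
```

With only this option set, a dead peer is detected after the kernel's 
keep-alive timeout, which is why the description asks operators to tune it.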



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
