[
https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joshua McKenzie updated CASSANDRA-3569:
---------------------------------------
Attachment: 3569_v1.txt
Bit of a pain to test, but yanking a cable and/or hard-killing a VM appears to
do the trick. With the default tcp_keepalive_intvl of 75 and
tcp_keepalive_probes of 9, we'll have about 16 minutes before any StreamSession
is considered dead with this v1 patch (300 seconds of keepalive_time + 675
seconds for intvl * probes).
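For anyone who wants to check the arithmetic with their own sysctl values, here is a minimal sketch. The function name is made up for illustration, and the formula is just the one in the parentheses above (idle time plus interval times probe count), not anything taken from the patch itself.

```python
def keepalive_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Seconds an idle connection to a dead peer survives before the kernel drops it."""
    # The first probe goes out after keepalive_time of idleness; the connection is
    # declared dead once keepalive_probes probes, keepalive_intvl apart, go unanswered.
    return keepalive_time + keepalive_intvl * keepalive_probes


# Values quoted in this comment: 300 s idle, 75 s interval, 9 probes.
patched = keepalive_detection_seconds(300, 75, 9)
print(patched, "seconds =", patched / 60, "minutes")

# For comparison, Linux's stock tcp_keepalive_time of 7200 s with the same interval and probes.
stock = keepalive_detection_seconds(7200, 75, 9)
print(stock, "seconds =", stock / 60, "minutes")
```

With the numbers above this gives 975 seconds (16.25 minutes), versus 7875 seconds (a bit over two hours) when keepalive_time is left at the stock Linux value.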
This initial version removes all Gossip and FailureDetector registration from
the StreamSessions and completely relies on tcp_keepalive_time for health
detection on those components.
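As a rough illustration of what relying on keepalive means at the socket level, the sketch below turns on SO_KEEPALIVE for a TCP socket. It is written in Python purely for brevity and is not the code in the attached patch (Cassandra itself is Java); the helper name and its optional per-socket overrides are assumptions for illustration. Without overrides, the timers come from the OS-wide settings such as net.ipv4.tcp_keepalive_time.

```python
import socket


def enable_keepalive(sock, idle=None, interval=None, probes=None):
    """Enable SO_KEEPALIVE; optionally override timers where the platform exposes them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides are platform specific (these constants exist on Linux),
    # so each one is applied only if this build of the socket module defines it.
    overrides = (("TCP_KEEPIDLE", idle), ("TCP_KEEPINTVL", interval), ("TCP_KEEPCNT", probes))
    for name, value in overrides:
        if value is not None and hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
# A nonzero value means the kernel will probe this connection when it goes idle.
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", keepalive_on)
sock.close()
```

Leaving the overrides unset mirrors the approach described here, where operators tune the system-wide values rather than the application hard-coding them.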
I'll follow up with a ticket for DSE changes and open a ticket for Windows
install script changes once this ticket is reviewed and any necessary changes
are complete.
> Failure detector downs should not break streams
> -----------------------------------------------
>
> Key: CASSANDRA-3569
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
> Project: Cassandra
> Issue Type: New Feature
> Reporter: Peter Schuller
> Assignee: Joshua McKenzie
> Fix For: 2.1.1
>
> Attachments: 3569-2.0.txt, 3569_v1.txt
>
>
> CASSANDRA-2433 introduced this behavior just to keep repairs from sitting
> there waiting forever. In my opinion the correct fix to that problem is to
> use TCP keep alive. Unfortunately the TCP keep alive period is insanely high
> by default on a modern Linux, so just doing that is not entirely good either.
> But using the failure detector seems nonsensical to me. We have a
> communication method, the TCP transport, that we know is used for
> long-running processes that you don't want to be incorrectly killed for no
> good reason, and we are using a failure detector tuned to detecting when not
> to send real-time-sensitive requests to nodes in order to actively kill a
> working connection.
> So, rather than add complexity with protocol-based ping/pongs and such, I
> propose that we simply use TCP keep alive for streaming connections and
> instruct operators of production clusters to tweak
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or the equivalent
> on their OS).
> I can submit the patch. Awaiting opinions.