[jira] [Commented] (CASSANDRA-9536) The failure detector becomes more sensitive when the network is flakey

Jonathan Ellis (JIRA) Tue, 02 Jun 2015 14:43:06 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-9536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569811#comment-14569811 ]


Jonathan Ellis commented on CASSANDRA-9536:
-------------------------------------------

Curious if [~jasobrown] has any thoughts here.

> The failure detector becomes more sensitive when the network is flakey
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-9536
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9536
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ron Kuris
>            Priority: Minor
>             Fix For: 2.2.x
>
>
> I added considerable instrumentation to the failure detector, and then 
> blocked port 7000 for a random 5-6 second interval, then resumed traffic for 
> another random 5-6 second interval, with a script like:
> {code}
> while :
> do
>     # Block inbound traffic on port 7000 for a random 5.xxxxx seconds
>     iptables -A INPUT -p tcp --destination-port 7000 -j DROP
>     v=$((100 + (RANDOM % 100)))$((1000 + (RANDOM % 1000)))
>     s=5.${v:1:2}${v:4:3}
>     echo "offline for $s"
>     sleep "$s"
>     # Flush all filter-table rules, which removes the DROP rule above
>     iptables -F
>     v=$((100 + (RANDOM % 100)))$((1000 + (RANDOM % 1000)))
>     s=5.${v:1:2}${v:4:3}
>     echo "online for $s"
>     sleep "$s"
> done
> {code}
> When I do this and watch the values being reported to the FailureDetector, the 
> mean actually goes down, to as low as 850ms. The reason is that the very slow 
> packets are not recorded (their intervals exceed MAX_INTERVAL_IN_NANO, which 
> is 2 seconds), while the retransmitted packets arrive in quick succession, 
> lowering the overall average.
> Once the average is lowered, the node becomes much more sensitive to shorter 
> outages. If you run this script for a while, the average drops to 800ms or 
> less, which means that a node will be marked down about 20% sooner than expected.
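
The effect described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Cassandra's actual FailureDetector code: it assumes one heartbeat is sent per second, that heartbeats sent while the port is blocked are all delivered the moment it reopens, and that the detection time scales linearly with the recorded mean. The names `arrival_times`, `mean_interval`, and `fraction_sooner` are hypothetical; only the 2-second cutoff comes from the report. The numbers it produces are not meant to reproduce the 850ms figure.

```python
# Toy model of heartbeat arrival intervals under a flapping network.
# All constants except MAX_INTERVAL_MS are illustrative assumptions.

GOSSIP_INTERVAL_MS = 1000  # assumed: one heartbeat sent per second
MAX_INTERVAL_MS = 2000     # from the report: MAX_INTERVAL_IN_NANO is 2 seconds
OUTAGE_MS = 5500           # assumed: port open ~5.5 s, then blocked ~5.5 s


def arrival_times(n_heartbeats):
    """Arrival times in ms; heartbeats sent while blocked arrive at reopen."""
    cycle = 2 * OUTAGE_MS
    arrivals = []
    for i in range(n_heartbeats):
        sent = i * GOSSIP_INTERVAL_MS
        phase = sent % cycle
        if phase < OUTAGE_MS:
            arrivals.append(sent)                  # network up: delivered at once
        else:
            arrivals.append(sent - phase + cycle)  # network down: delivered at reopen
    return arrivals


def mean_interval(arrivals, max_interval=None):
    """Mean inter-arrival time, optionally discarding intervals above max_interval."""
    intervals = [b - a for a, b in zip(arrivals, arrivals[1:])]
    if max_interval is not None:
        intervals = [iv for iv in intervals if iv <= max_interval]
    return sum(intervals) / len(intervals)


def fraction_sooner(expected_mean, observed_mean):
    """Assumes detection time is proportional to the recorded mean."""
    return 1 - observed_mean / expected_mean


if __name__ == "__main__":
    arrivals = arrival_times(110)
    print("mean with 2 s cutoff:   ", mean_interval(arrivals, MAX_INTERVAL_MS))
    print("mean without any cutoff:", mean_interval(arrivals))
    print("sooner at 800 ms mean:  ", fraction_sooner(1000, 800))
```

In this model the long gap across each outage is dropped while the near-zero gaps between the queued heartbeats are kept, so the filtered mean lands near 500 ms even though the unfiltered mean stays close to the 1000 ms send interval. Under the linear-scaling assumption, a recorded mean of 800 ms against an expected 1000 ms corresponds to the roughly 20% figure in the report.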



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
