[ https://issues.apache.org/jira/browse/CASSANDRA-9536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-9536:
--------------------------------------
Priority: Minor (was: Major)
> The failure detector becomes more sensitive when the network is flakey
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-9536
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9536
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Ron Kuris
> Priority: Minor
>
> I added considerable instrumentation to the failure detector, then repeatedly
> blocked port 7000 for a random 5-6 second interval and resumed traffic for
> another random 5-6 second interval, using a script like:
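> {code}while :
> do
>     # Drop inbound TCP traffic to port 7000.
>     iptables -A INPUT -p tcp --destination-port 7000 -j DROP
>     # Build a zero-padded random fraction so the duration is 5.00000-5.99999 s.
>     v=$((100 + RANDOM % 100))$((1000 + RANDOM % 1000))
>     s=5.${v:1:2}${v:4:3}
>     echo offline for $s
>     sleep $s
>     # Flush all rules (this also removes any unrelated rules) to restore traffic.
>     iptables -F
>     v=$((100 + RANDOM % 100))$((1000 + RANDOM % 1000))
>     s=5.${v:1:2}${v:4:3}
>     echo online for $s
>     sleep $s
> done{code}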
> While this runs, I watch the values being reported to the FailureDetector. The
> median actually goes down, to as low as 850ms. The reason is that the very slow
> packets are not recorded (they exceed MAX_INTERVAL_IN_NANO, which is 2
> seconds), while the retransmitted packets arrive in very quick succession,
> lowering the overall average.
> Once the average is lowered, the node becomes much more sensitive to shorter
> outages. If you run this script for a while, the average drops to 800ms or
> less, which means that a node will be marked down about 20% sooner than
> expected.
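
The averaging effect described in the report can be illustrated with a minimal sketch. This is a hypothetical model, not Cassandra's code: the function name `mean_recorded_interval`, the sample interval lists, and the 1000 ms steady-state spacing between arrivals are assumptions chosen for illustration. Only the 2 second cutoff comes from the report.

```python
# Hypothetical model of the effect in the report; not Cassandra's implementation.
MAX_INTERVAL_MS = 2000  # the 2 second cutoff (MAX_INTERVAL_IN_NANO) named in the report


def mean_recorded_interval(intervals_ms, max_interval_ms=MAX_INTERVAL_MS):
    """Return the mean of the intervals that do not exceed the cutoff.

    Intervals above the cutoff are discarded, mirroring the report's
    observation that very slow packets are not recorded.
    Returns None when no interval survives the cutoff.
    """
    recorded = [i for i in intervals_ms if i <= max_interval_ms]
    if not recorded:
        return None
    return sum(recorded) / len(recorded)


# Assumed steady traffic: arrivals spaced 1000 ms apart.
steady = [1000] * 6

# Assumed flaky traffic: one 5500 ms gap while the port is blocked,
# then two delayed messages arriving 50 ms apart, then normal spacing.
flaky = [1000, 1000, 5500, 50, 50, 1000, 1000]

steady_mean = mean_recorded_interval(steady)
flaky_mean = mean_recorded_interval(flaky)

print(f"steady mean: {steady_mean:.0f} ms")
print(f"flaky mean:  {flaky_mean:.0f} ms")
```

With these made-up samples, the 5500 ms gap is dropped while the two 50 ms intervals are kept, so the mean for the flaky run ends up well below the steady one even though the network was strictly worse.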
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)