[
https://issues.apache.org/jira/browse/CASSANDRA-9536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569642#comment-14569642
]
Ron Kuris commented on CASSANDRA-9536:
--------------------------------------
Some additional notes from my research on how this code works:
FailureDetector's report() is called each time the gossiper receives a message
from that node. The message should arrive every second (this is not tunable).
The failure detector records the interval since the previous message, keeping up
to 1000 samples, which takes about 16 minutes, 40 seconds to fill at one message
per second. It does not record the interval if more than 2 seconds have elapsed
since the previous message.
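As a sketch of that sampling behavior (this is not the actual Cassandra source;
MAX_INTERVAL_IN_NANO and the 1000-sample window come from the code, but the
class and method names here are simplified for illustration):

```java
import java.util.ArrayDeque;

// Illustrative sketch of the interval sampling done on each report().
// Only MAX_INTERVAL_IN_NANO and the sample-window size mirror Cassandra;
// everything else is simplified.
class ArrivalWindow {
    static final long MAX_INTERVAL_IN_NANO = 2_000_000_000L; // 2 seconds
    static final int SAMPLE_SIZE = 1000; // ~16 min 40 s at one message/second

    private final ArrayDeque<Long> intervals = new ArrayDeque<>();
    private long lastNanos = -1;

    void report(long nowNanos) {
        if (lastNanos >= 0) {
            long interval = nowNanos - lastNanos;
            // Intervals longer than 2 s are NOT recorded, which is what
            // lets the sampled mean drift downward during an outage.
            if (interval <= MAX_INTERVAL_IN_NANO) {
                if (intervals.size() == SAMPLE_SIZE) intervals.removeFirst();
                intervals.addLast(interval);
            }
        }
        lastNanos = nowNanos;
    }

    double meanNanos() {
        return intervals.stream().mapToLong(Long::longValue).average().orElse(0);
    }
}
```

Note how a 3-second gap leaves no sample at all, while the fast retransmits
after an outage add small samples that pull the mean down.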
In the gossiper's status check, it calls FailureDetector's interpret() method.
This calculates "phi", the time since the last message divided by the mean of
the collected samples. It then multiplies by a constant, PHI_FACTOR (1/ln 10,
about 0.434), and compares the result to phi_convict_threshold. If the adjusted
phi exceeds the threshold, the node is marked down.
If the mean of the collected samples drops below 1 second, phi rises faster for
a given silence, so the detector becomes more sensitive.
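Concretely, the conviction test can be sketched as follows (PHI_FACTOR = 1/ln 10
and the default phi_convict_threshold of 8 are from Cassandra; the helper names
here are illustrative, not the real API):

```java
// Illustrative sketch of the phi computation in interpret();
// helper names are not the real Cassandra API.
class PhiSketch {
    // PHI_FACTOR = 1 / ln(10) ≈ 0.4343 in the Cassandra source.
    static final double PHI_FACTOR = 1.0 / Math.log(10.0);

    static double phi(long nowNanos, long lastNanos, double meanIntervalNanos) {
        return PHI_FACTOR * (nowNanos - lastNanos) / meanIntervalNanos;
    }

    // With the default phi_convict_threshold of 8, conviction happens once
    // the silence exceeds roughly 8 / PHI_FACTOR ≈ 18.4 times the mean
    // sampled interval -- so a smaller mean means a faster conviction.
    static boolean convicted(double phi, double threshold) {
        return phi > threshold;
    }
}
```

With a healthy 1-second mean, a node survives a 15-second silence; shrink the
mean to 800 ms and the same 15-second silence convicts it.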
I ran several additional tests, including changing the packet-dropping time to
0-1 seconds, which severely lowers the average and causes the node to flip-flop
between UP and DOWN.
I'm very open to helping to fix this, but I'm afraid it might require a
different implementation of the FailureDetector. Perhaps I should make it
pluggable, so that an alternative implementation of IFailureDetector could be
dropped in? I'm open to suggestions.
> The failure detector becomes more sensitive when the network is flakey
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-9536
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9536
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Ron Kuris
> Priority: Minor
>
> I added considerable instrumentation into the failure detector, and then
> blocked port 7000 for a random 5-6 second interval, then resumed traffic for
> the same amount of time, with a script like:
> {code}while :
> do
>     # drop inbound internode traffic on port 7000
>     iptables -A INPUT -p tcp --destination-port 7000 -j DROP
>     # build a random 5.xxxxx-second interval
>     v=$[100 + (RANDOM % 100)]$[1000 + (RANDOM % 1000)]
>     s=5.${v:1:2}${v:4:3}
>     echo offline for $s
>     sleep $s
>     # restore traffic
>     iptables -F
>     v=$[100 + (RANDOM % 100)]$[1000 + (RANDOM % 1000)]
>     s=5.${v:1:2}${v:4:3}
>     echo online for $s
>     sleep $s
> done{code}
> When I do this, I watch the values being reported to the FailureDetector. The
> median actually goes down, as low as 850ms. The reason is that the very slow
> packets are not recorded (they exceed MAX_INTERVAL_IN_NANO, which is 2
> seconds), and the retransmitted packets then arrive in very quick succession,
> lowering the overall average.
> Once the average is lowered, the node becomes much more sensitive to shorter
> outages. If you run this code for a while, the average drops to 800ms or
> less, which means the node will be convicted about 20% sooner than expected.
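A back-of-the-envelope calculation of that 20% figure (a hypothetical sketch,
using Cassandra's PHI_FACTOR = 1/ln 10 and the default phi_convict_threshold
of 8; convictAfterMs is an illustrative helper, not a real method):

```java
// Conviction time is proportional to the mean sampled interval, so a mean
// of 800 ms instead of 1000 ms convicts ~20% sooner. Hypothetical helper.
class ConvictionTime {
    static final double PHI_FACTOR = 1.0 / Math.log(10.0);
    static final double THRESHOLD = 8.0; // default phi_convict_threshold

    // Smallest silence (in ms) at which PHI_FACTOR * silence / mean
    // exceeds the threshold.
    static double convictAfterMs(double meanIntervalMs) {
        return THRESHOLD / PHI_FACTOR * meanIntervalMs;
    }
}
```

At a 1-second mean the node survives ~18.4 s of silence; at an 800 ms mean,
only ~14.7 s -- exactly the proportional drop described above.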
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)