[
https://issues.apache.org/jira/browse/CASSANDRA-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401410#comment-13401410
]
Brandon Williams commented on CASSANDRA-4375:
---------------------------------------------
What if we set MAX_INTERVAL_IN_MS to the greater of DD.getRpcTimeout() or the
gossip interval * 2?
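Roughly, something like this (a sketch only; Gossiper.intervalInMillis is
assumed to be the gossip round interval, 1000ms by default):
{code}
// take whichever is larger: the rpc timeout, or two full gossip rounds,
// so a short rpc_timeout can no longer starve the FD of interval samples
private final double MAX_INTERVAL_IN_MS =
    Math.max(DatabaseDescriptor.getRpcTimeout(), 2 * Gossiper.intervalInMillis);
{code}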
> FD incorrectly using RPC timeout to ignore gossip heartbeats
> ------------------------------------------------------------
>
> Key: CASSANDRA-4375
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4375
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Peter Schuller
> Priority: Critical
>
> Short version: You can't run a cluster with short RPC timeouts because nodes
> just constantly flap up/down.
> Long version:
> CASSANDRA-3273 tried to fix a problem resulting from the way the failure
> detector works, but did so by introducing a much more severe bug: with an RPC
> timeout lower than the typical gossip propagation time, all nodes in the
> cluster constantly flap each other up and down.
> The cause is this:
> {code}
> +    // in the event of a long partition, never record an interval longer than the rpc timeout,
> +    // since if a host is regularly experiencing connectivity problems lasting this long we'd
> +    // rather mark it down quickly instead of adapting
> +    private final double MAX_INTERVAL_IN_MS = DatabaseDescriptor.getRpcTimeout();
> {code}
> And then:
> {code}
> -        tLast_ = value;
> -        arrivalIntervals_.add(interArrivalTime);
> +        if (interArrivalTime <= MAX_INTERVAL_IN_MS)
> +            arrivalIntervals_.add(interArrivalTime);
> +        else
> +            logger_.debug("Ignoring interval time of {}", interArrivalTime);
> {code}
> Using the RPC timeout to ignore unreasonably long intervals is not correct,
> as the RPC timeout is completely orthogonal to gossip propagation delay (see
> CASSANDRA-3927 for a quick description of how the FD works).
> In practice, the propagation delay ends up being in the 0-3 second range on a
> cluster with good local latency. With a low RPC timeout of, say, 200 ms, very
> few heartbeat updates arrive fast enough to avoid being ignored by the
> failure detector. This in turn means that the FD records a completely skewed
> average heartbeat interval, which in turn means that nodes almost always get
> flapped down on interpret() unless they happen to have *just* had their
> heartbeat updated. They then flap back up whenever the next heartbeat comes
> in (since it gets brought up immediately).
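> To make the skew concrete, here is purely illustrative arithmetic (the
> numbers are examples, not measurements):
> {code}
> // With rpc_timeout = 200ms, almost every real inter-arrival time (the ~1s
> // gossip interval plus 0-3s of propagation) exceeds MAX_INTERVAL_IN_MS and is
> // thrown away, so the window's mean is built only from the rare small samples.
> double typicalInterval = 1000 + 2000; // heartbeat interval + propagation delay, in ms
> double recordedMean    = 150;         // only sub-200ms outliers ever get recorded
> double skew            = typicalInterval / recordedMean; // ~20x underestimate of the true mean
> {code}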
> In our build, we are replacing the FD with an implementation that simply uses
> a fixed {{N}}-second timeout to convict, because this is just one of many ways
> in which the current FD hurts, while we have yet to find a way in which it
> actually helps relative to the trivial fixed-timeout conviction policy.
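> A minimal sketch of that fixed-timeout policy (the constant, map and hook
> names are illustrative, not our actual patch):
> {code}
> private static final long CONVICT_AFTER_MS = 30 * 1000; // the fixed "N seconds"
> private final Map<InetAddress, Long> lastHeartbeatMillis = new ConcurrentHashMap<InetAddress, Long>();
>
> public void interpret(InetAddress ep)
> {
>     Long last = lastHeartbeatMillis.get(ep);
>     // convict only when no heartbeat has been recorded for the fixed window
>     if (last != null && System.currentTimeMillis() - last > CONVICT_AFTER_MS)
>         notifyConvict(ep); // hypothetical hook that tells registered listeners to convict ep
> }
> {code}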
> For upstream, assuming people won't agree to changing it to a fixed timeout,
> I suggest, at minimum, never using a value lower than something like 10
> seconds when deciding whether to ignore an interval. Slightly better would be
> to make it a config option.
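> Concretely, the floor could be as simple as (sketch only; the 10-second
> figure is just the example above, and whether it should come from config is
> left open):
> {code}
> private final double MAX_INTERVAL_IN_MS =
>     Math.max(DatabaseDescriptor.getRpcTimeout(), 10 * 1000);
> {code}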
> (I should note that if propagation delays are significantly off from the
> expected level, things other than the FD already break - such as the whole
> concept of {{RING_DELAY}}, which assumes the propagation time stays roughly
> constant with, e.g., cluster size.)