[
https://issues.apache.org/jira/browse/CASSANDRA-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joel Knighton updated CASSANDRA-10244:
--------------------------------------
Assignee: (was: Joel Knighton)
> Replace heartbeats with locally recorded metrics for failure detection
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-10244
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10244
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jason Brown
>
> In the current implementation, the primary purpose of sending gossip messages
> is for delivering the updated heartbeat values of each node in a cluster. The
> other data that is passed in gossip (node metadata such as status, dc, rack,
> tokens, and so on) changes very infrequently (or rarely), such that the
> eventual delivery of that data is entirely reasonable. Heartbeats, however,
> are quite different. A continuous and nearly consistent delivery time of
> updated heartbeats is critical for the stability of a cluster. It is through
> the receipt of the updated heartbeat that a node determines the reachability
> (UP/DOWN status) of all peers in the cluster. The current implementation of
> FailureDetector measures the time differences between the heartbeat updates
> received about a peer (Note: I said about a peer, not from the peer directly,
> as those values are disseminated via gossip). Without a consistent time
> delivery of those updates, the FD, via it's use of the PHI-accrual
> algorigthm, will mark the peer as DOWN (unreachable). The two nodes could be
> sending all other traffic without problem, but if the heartbeats are not
> propagated correctly, each of the nodes will mark the other as DOWN, which is
> clearly suboptimal to cluster health. Further, heartbeat updates are the only
> mechanism we use to determine reachability (UP/DOWN) of a peer; dynamic
> snitch measurements, for example, are not included in the determination.
> To illustrate this, in the current implementation, assume a cluster of nodes:
> A, B, and C. A partition starts between nodes A and C (no communication
> succeeds), but both nodes can communicate with B. As B will get the updated
> heartbeats from both A and C, it will, via gossip, send those over to the
> other node. Thus, A thinks C is UP, and C thinks A is UP. Unfortunately, due
> to the partition between them, all communication between A and C will fail,
> yet neither node will mark the other as down because each is receiving,
> transitively via B, the updated heartbeat about the other. While it's true
> that the other node is alive, only having transitive knowledge about a peer,
> and allowing that to be the sole determinant of UP/DOWN reachability status,
> is not sufficient for a correct and effieicently operating cluster.
> This transitive availability is suboptimal, and I propose we drop the
> heartbeat concept altogether. Instead, the dynamic snitch should become more
> intelligent, and it's measurements ultimately become the input for
> determining the reachability status of each peer(as fed into a revamped FD).
> As we already capture latencies in the dsntich, we can reasonably extend it
> to include timeouts/missed responses, and make that the basis for the UP/DOWN
> decisioning. Thus we will have more accurate and relevant peer statueses that
> is tailored to the local node.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)