[jira] [Comment Edited] (CASSANDRA-9237) Gossip messages subject to head of line blocking by other intra-cluster traffic

Jason Brown (JIRA) Wed, 22 Jul 2015 15:44:22 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-9237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637787#comment-14637787
 ]


Jason Brown edited comment on CASSANDRA-9237 at 7/22/15 10:43 PM:
------------------------------------------------------------------

To be clear, let's understand what the real implication is here with the 
'gossip' messages. The purpose of sending the gossip messages in the current 
implementation is that it is the primary vehicle for delivering updated 
heartbeat values of nodes in the cluster. The other data that is passed in 
gossip (node metadata such as status, dc, rack, tokens, and so on) changes very 
infrequently (or rarely), such that the eventual (or delayed!) delivery of that 
data is reasonable. Heartbeats, however, are quite different. A continuous and 
nearly consistent delivery time of updated heartbeats is critical for the 
stability of a cluster. You see, it is through the receipt of the updated 
heartbeat that a node determines the reachability (UP/DOWN status) of all peers 
in the cluster. The current implementation of FailureDetector measures the time 
differences between the heartbeat updates received about a peer (Note: I said 
*about* a peer, not *from* the peer directly, as those values are disseminated 
via gossip). Without a consistent time delivery of those updates, the FD, via 
it's use of the PHI-accrual algorigthm, will mark the peer as DOWN 
(unreachable). The two nodes could be sending all other traffic without 
problem, but if the heartbeats are not propagated correctly, each of the nodes 
will mark the other as DOWN, which is clearly suboptimal to cluster health. 
Note that heartbeat updates are the only mechanism we use to determine 
reachability (UP/DOWN) of a peer; dynamic snitch measurements, for example, are 
not included in the determination.  Hence, CASSANDRA-8789 could be quite a 
problem with regard to cluster stability, and raised by this ticket. (shame on 
me for not raising any concerns earlier).

Now, all this being said, we have a dilemma about what to do with regard to to 
heartbeat dissemination. I propose we drop the heartbeat concept altogether. 
The functionality we would lose immediately is the ability to declare a peer 
node as UP or DOWN. To make up for that, the dynamic snitch becomes much more 
intelligent and it's measurements ultimately become responsible for determining 
the reachability status (input to a revamped FD). As we already capture 
latencies in the dsntich, we can reasonably extend this to include 
timeouts/missed responses, and make that the basis for the UP/DOWN decisioning. 
Not only will this be more efficient as we will only need to connect to and 
track the responses of peer that a node actually connects to, it will lead to 
more relevant decisions about the reachability of a peer.

To illustrate this last point, in the current implementation, assume a cluster 
of nodes: A, B, and C. A partition starts between nodes A and C (no 
communication succeeds), but both nodes can communicate with B. As B will get 
the updated heartbeats from both A and C, it will, via gossip, send those over 
to the other node. Thus, A thinks C is UP, and C thinks A is UP. Unfortunately, 
due to the partition between them, all communication between A and C will fail, 
yet neither node will mark the other as down because each is receiving, 
transitively via B, the updated heartbeat about the other. While it's true that 
the other node is alive, only having transitive knowledge about a peer, and 
allowing that to be the sole determinant of UP/DOWN reachability status, is not 
sufficient for a correct and effieicently operating cluster. Thus, if the 
dynamic snitch had an expanded role, where it's observations, based on actual 
in-use communication paths, fed directly into the FailureDetector (for the 
UP/DOWN status), I think we would have a better system, and one that more 
accurately reflects the state of the reachable cluster that is available to 
each node.

I expect there would be some subtleties and complications with this idea, but I 
feel those are surmountable implementation details.

Going back to this ticket, then, I am convinced that if we eliminate the 
time-sensitive delivery of the heartbeats (which drives our current notions of 
peer availability), then we don't need to be overly concerned about the HoL 
issues raised here.


was (Author: jasobrown):
To be clear, let's understand what the real implication is here with the 
'gossip' messages. The purpose of sending the gossip messages in the current 
implementation is that it is the primary vehicle for delivering updated 
heartbeat values of nodes in the cluster. The other data that is passed in 
gossip (node metadata such as status, dc, rack, tokens, and so on) changes very 
infrequently (or rarely), such that the eventual (or delayed!) delivery of that 
data is reasonable. Heartbeats, however, are quite different. A continuous and 
nearly consistent delivery time of updated heartbeats is critical for the 
stability of a cluster. You see, it is through the receipt of the updated 
heartbeat that a node determines the reachability (UP/DOWN status) of all peers 
in the cluster. The current implementation of FailureDetector measures the time 
differences between the heartbeat updates received about a peer (Note: I said 
*about* a peer, not from athe directly, as those values are disseminated via 
gossip). Without a consistent time delivery of those updates, the FD, via it's 
use of the PHI-accrual algorigthm, will mark the peer as DOWN (unreachable). 
The two nodes could be sending all other traffic without problem, but if the 
heartbeats are not propagated correctly, each of the nodes will mark the other 
as DOWN, which is clearly suboptimal to cluster health. Note that heartbeat 
updates are the only mechanism we use to determine reachability (UP/DOWN) of a 
peer; dynamic snitch measurements, for example, are not included in the 
determination.  Hence, CASSANDRA-8789 could be quite a problem with regard to 
cluster stability, and raised by this ticket. (shame on me for not raising any 
concerns earlier).

Now, all this being said, we have a dilemma about what to do with regard to to 
heartbeat dissemination. I propose we drop the heartbeat concept altogether. 
The functionality we would lose immediately is the ability to declare a peer 
node as UP or DOWN. To make up for that, the dynamic snitch becomes much more 
intelligent and it's measurements ultimately become responsible for determining 
the reachability status (input to a revamped FD). As we already capture 
latencies in the dsntich, we can reasonably extend this to include 
timeouts/missed responses, and make that the basis for the UP/DOWN decisioning. 
Not only will this be more efficient as we will only need to connect to and 
track the responses of peer that a node actually connects to, it will lead to 
more relevant decisions about the reachability of a peer.

To illustrate this last point, in the current implementation, assume a cluster 
of nodes: A, B, and C. A partition starts between nodes A and C (no 
communication succeeds), but both nodes can communicate with B. As B will get 
the updated heartbeats from both A and C, it will, via gossip, send those over 
to the other node. Thus, A thinks C is UP, and C thinks A is UP. Unfortunately, 
due to the partition between them, all communication between A and C will fail, 
yet neither node will mark the other as down because each is receiving, 
transitively via B, the updated heartbeat about the other. While it's true that 
the other node is alive, only having transitive knowledge about a peer, and 
allowing that to be the sole determinant of UP/DOWN reachability status, is not 
sufficient for a correct and effieicently operating cluster. Thus, if the 
dynamic snitch had an expanded role, where it's observations, based on actual 
in-use communication paths, fed directly into the FailureDetector (for the 
UP/DOWN status), I think we would have a better system, and one that more 
accurately reflects the state of the reachable cluster that is available to 
each node.

I expect there would be some subtleties and complications with this idea, but I 
feel those are surmountable implementation details.

Going back to this ticket, then, I am convinced that if we eliminate the 
time-sensitive delivery of the heartbeats (which drives our current notions of 
peer availability), then we don't need to be overly concerned about the HoL 
issues raised here.

> Gossip messages subject to head of line blocking by other intra-cluster 
> traffic
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9237
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9237
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>
> Reported as an issue over less than perfect networks like VPNs between data 
> centers.
> Gossip goes over the small message socket where small is 64k which isn't 
> particularly small. This is done for performance to keep most traffic on one 
> hot socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (CASSANDRA-9237) Gossip messages subject to head of line blocking by other intra-cluster traffic

Reply via email to