[
https://issues.apache.org/jira/browse/CASSANDRA-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16327795#comment-16327795
]
Joseph Lynch commented on CASSANDRA-14001:
------------------------------------------
[~jasonstack] thanks for the reply :-)
If I understand you correctly, you're saying since the remote node is still
heartbeating to other nodes (versions are increasing) we'll end up in
[{{applyStateLocally}}|https://github.com/apache/cassandra/blob/6d324f9d769f24ac209f6ea7649fee02b0200ba0/src/java/org/apache/cassandra/gms/Gossiper.java#L1153-L1166]
and that will cause us to call {{markAlive}} again which sends another Echo
after marking dead? That makes sense, let me spend some more time trying to get
a reproduction so we can narrow this down. I can simulate a partitioned cluster
using {{CCM}} and {{ipfw}}, and use {{wireshark}} to determine whether Cassandra
continues sending echo messages.
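To make sure I've got the mechanism right, here's a toy model I threw together (my own sketch, not Cassandra's actual code — names like {{ToyGossiper}} are made up) of the echo flow described above: {{markAlive}} marks the peer dead first, sends an Echo, and only the ack callback flips it back to alive, so a dropped Echo leaves the peer dead until a later gossip round re-triggers {{markAlive}} via {{applyStateLocally}}:

```python
# Toy model (NOT Cassandra's actual implementation) of the echo-based
# liveness flow: mark_alive() first marks the peer dead, sends an Echo,
# and only the Echo ack's callback marks it alive. If the Echo is lost,
# the peer stays "dead" until a later gossip round retries mark_alive.

class ToyGossiper:
    def __init__(self):
        self.alive = {}          # endpoint -> bool
        self.network_up = True   # flip to False to simulate a partition

    def mark_alive(self, endpoint):
        self.alive[endpoint] = False        # markDead first
        self._send_echo(endpoint)

    def _send_echo(self, endpoint):
        if self.network_up:                 # ack arrives -> callback fires
            self._on_echo_ack(endpoint)
        # else: Echo dropped; no callback, endpoint stays dead

    def _on_echo_ack(self, endpoint):
        self.alive[endpoint] = True         # realMarkAlive

    def apply_state_locally(self, endpoint):
        # a newer heartbeat version arrived via gossip: retry mark_alive
        if not self.alive.get(endpoint, False):
            self.mark_alive(endpoint)

g = ToyGossiper()
g.network_up = False
g.mark_alive("10.0.0.2")            # initial Echo lost during the partition
assert g.alive["10.0.0.2"] is False # stuck dead

g.network_up = True
g.apply_state_locally("10.0.0.2")   # later gossip round re-sends the Echo
assert g.alive["10.0.0.2"] is True  # finally marked up
```

If that toy model matches reality, the "stuck dead" window lasts exactly until the next gossip exchange with the partitioned node, which is what the convergence math below is about.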
bq. Roughly speaking, there should be one node ( N * #dead / (#live +1) ) in
the cluster that will talk to "dead" node per second. Convergence time is O(
log2(N) ) where N is size of cluster.
bq. I suppose that the long convergence time you observed is not related to
gossip peer selection..
Hm, even in that case isn't it probabilistically O(log2(N))? I guess we'd need
something like the broadcast trees as proposed in CASSANDRA-12345 to guarantee
O(log2(N)) convergence? I'll work on modeling the variance of the existing
system since I think that may be the issue (rather than the expected value).
bq. Could you share the phi_convict_threshold value?
We don't tune {{phi_convict_threshold}}, so I believe it's the default of 8.
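As a first step on modeling that variance, here's a quick sketch (my own, not part of the patch; function names are made up) that checks the min-of-two-geometrics numbers from the issue description against a Monte Carlo simulation:

```python
import random

def min_geom_stats(n):
    """Closed-form E[Z] and V[Z] for Z = min(X, Y), X, Y ~ geom(1/n).

    Matches the model in the issue description: each of the two
    partitioned nodes independently picks the other with probability
    ~1/n per one-second gossip round.
    """
    p = 1 - (1 - 1.0 / n) ** 2  # success probability of the minimum
    return 1 / p, (1 - p) / p ** 2

def simulate_rounds(n, trials=20_000, seed=42):
    """Monte Carlo: gossip rounds until either node picks the other."""
    rng = random.Random(seed)
    p = 1.0 / n
    total = 0
    for _ in range(trials):
        rounds = 1
        while rng.random() >= p and rng.random() >= p:
            rounds += 1
        total += rounds
    return total / trials

for n in (100, 1000):
    mean, var = min_geom_stats(n)
    print(f"{n} nodes: E[Z]={mean:.0f}s, stdev={var ** 0.5:.0f}s")
print(f"simulated mean for 100 nodes: {simulate_rounds(100):.1f}s")
```

Running this gives E[Z]≈50s with stdev≈50s for 100 nodes and E[Z]≈500s with stdev≈500s for 1000 nodes, and the simulated mean agrees with the closed form, which lines up with the back-of-the-envelope numbers in the description.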
> Gossip after node restart can take a long time to converge about "down" nodes
> in large clusters
> -----------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-14001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14001
> Project: Cassandra
> Issue Type: Improvement
> Components: Lifecycle
> Reporter: Joseph Lynch
> Priority: Minor
>
> When nodes restart in a large cluster, they mark all nodes as "alive", which
> first calls {{markDead}}, then creates an {{EchoMessage}}, and only marks the
> node alive in that message's callback. This works great, except when that
> initial echo fails for whatever reason and the node is left marked dead, in
> which case it will remain dead for a long while.
> We mostly see this on 100+ node clusters, and almost always when nodes are in
> different datacenters with unreliable network connections (e.g., cross
> region in AWS) and I think that it comes down to a combination of:
> 1. Only a node itself can mark another node as "UP"
> 2. Nodes only gossip with dead nodes with probability {{#dead / (#live +1)}}
> In particular, the algorithm in #2 leads to long convergence times because the
> number of dead nodes is typically very small compared to the cluster size. My
> back of the envelope model of this algorithm indicates that for a 100 node
> cluster this would take an average of ~50 seconds with a stdev of 50 seconds,
> which means we might be waiting _minutes_ for the nodes to gossip with each
> other. I'm modeling this as the minimum of two [geometric
> distributions|https://en.wikipedia.org/wiki/Geometric_distribution] with
> parameter {{p=1/#nodes}}, yielding a geometric distribution with parameter
> {{p=1-(1-1/#nodes)^2}}. So for a 100 node cluster:
> {noformat}
> 100 node cluster =>
> X = rounds until node1 gossips with node2 ~ geom(0.01)
> Y = rounds until node2 gossips with node1 ~ geom(0.01)
> Z = min(X, Y) ~ geom(1 - (1 - 0.01)^2) ≈ geom(0.02)
> E[Z] = 1/0.02 = 50
> V[Z] = (1-0.02)/(0.02)^2 = 2450
> 1000 node cluster =>
> Z ~ geom(1 - (1 - 0.001)^2) ≈ geom(0.002)
> E[Z] = 500
> V[Z] = (1-0.002)/(0.002)^2 = 249500
> {noformat}
> Since we gossip every second, that means that in expectation these nodes
> would see each other after about a minute in a 100 node cluster, and after
> ~8 minutes in a thousand node cluster. The variance is astounding, and means
> that in particularly bad edge cases we might be waiting hours before these
> nodes gossip with each other.
> I'm thinking of writing a patch which either:
> # Makes the gossip order a shuffled list that includes dead nodes, à la [SWIM
> gossip|https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf]. This would
> waste some rounds on dead nodes but guarantee a linear bound on gossip
> convergence.
> # Adds an endpoint that re-triggers gossip with all nodes. Operators could
> call this after a restart a few times if they detect a gossip inconsistency.
> # Bounds the probability that we gossip with a dead node from below at some
> reasonable floor like 1/10. This might cause a lot of gossip load in large
> clusters when a node is actually down, but would also act to bound the
> variance.
> # Something else?
> I've got a WIP
> [branch|https://github.com/apache/cassandra/compare/cassandra-3.11...jolynch:force_gossip]
> on 3.11 which implements options #1 and #2, but I can reduce/change/modify
> as needed if people think there is a better way. The patch doesn't pass tests
> yet, but I'm not going to change/add the tests unless we think moving to
> time-bounded gossip for down nodes is a good idea.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)