[ 
https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206745#comment-13206745
 ] 

Peter Schuller commented on CASSANDRA-3830:
-------------------------------------------

I had written a response here, but I assume I must have failed to submit it and 
lost track of the browser tab or something.

What you describe is not the behavior of the Gossiper. It picks a random node 
to gossip to. Then, unless the node *happened* to also be a seed node, it picks 
a random *seed node* to gossip to *as well*.

The "less than number of seeds" you're mentioning is presumably due to the 
comments in the code before the gossip to seed:

{code}
                    /* Gossip to a seed if we did not do so above, or we have seen less nodes
                       than there are seeds.  This prevents partitions where each group of nodes
                       is only gossiping to a subset of the seeds.

                       The most straightforward check would be to check that all the seeds have been
                       verified either as live or unreachable.  To avoid that computation each round,
                       we reason that:

                       either all the live nodes are seeds, in which case non-seeds that come online
                       will introduce themselves to a member of the ring by definition,

                       or there is at least one non-seed node in the list, in which case eventually
                       someone will gossip to it, and then do a gossip to a random seed from the
                       gossipedToSeed check.

                       See CASSANDRA-150 for more exposition. */
                    if (!gossipedToSeed || liveEndpoints.size() < seeds.size())
                        doGossipToSeed(prod);
{code}

If you look carefully though, you'll see that the number of live endpoints is 
*only* relevant in the sense that it forces *always* gossiping to a seed even 
when we already did. In the normal case (i.e., almost all cases), we have more 
live endpoints than seeds, and we'll still gossip to a seed because of 
{{!gossipedToSeed}}.
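
To make the round structure concrete, here is a minimal sketch of the logic described above. This is illustrative only; the class and method names are hypothetical and do not match the real Gossiper code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of one gossip round as described above; names are
// illustrative and do not mirror the actual Gossiper implementation.
public class GossipRoundSketch
{
    private final Random random = new Random();

    /** Returns the endpoints gossiped to in this round, in order. */
    public List<String> doGossipRound(List<String> liveEndpoints, Set<String> seeds)
    {
        List<String> targets = new ArrayList<>();
        boolean gossipedToSeed = false;

        // Step 1: gossip to one randomly chosen live endpoint.
        if (!liveEndpoints.isEmpty())
        {
            String target = liveEndpoints.get(random.nextInt(liveEndpoints.size()));
            targets.add(target);
            gossipedToSeed = seeds.contains(target);
        }

        // Step 2: unless that endpoint happened to be a seed (or we have seen
        // fewer live nodes than there are seeds), also gossip to a random seed.
        // This is the condition quoted in the code block above.
        if ((!gossipedToSeed || liveEndpoints.size() < seeds.size()) && !seeds.isEmpty())
        {
            List<String> seedList = new ArrayList<>(seeds);
            targets.add(seedList.get(random.nextInt(seedList.size())));
        }
        return targets;
    }
}
```

Note how a seed is contacted on effectively every round: either the random target already was a seed, or {{!gossipedToSeed}} forces a second, seed-directed gossip.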


                
> gossip-to-seeds is not obviously independent of failure detection algorithm 
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3830
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3830
>             Project: Cassandra
>          Issue Type: Task
>          Components: Core
>            Reporter: Peter Schuller
>            Priority: Minor
>
> The failure detector, ignoring all the theory, boils down to an
> extremely simple algorithm. The FD keeps track of a sliding window
> (currently of 1000 entries) of heartbeat intervals for a given
> host. Meaning, we have a track record of the last 1000 times we saw
> an updated heartbeat for a host.
> At any given moment, a host has a score which is simply the time since
> the last heartbeat, over the *mean* interval in the sliding
> window. For historical reasons a simple scaling factor is applied to
> this prior to checking the phi conviction threshold.
> (CASSANDRA-2597 has details, but thanks to Paul's work there it's now
> trivial to understand what it does based on gut feeling)
> So in effect, a host is considered down if we haven't heard from it in
> some time which is significantly longer than the "average" time we
> expect to hear from it.
> This seems reasonable, but it does assume that under normal conditions
> the average time between heartbeats does not change for reasons other
> than those that would be plausible reasons to think a node is
> unhealthy.
> This assumption *could* be violated by the gossip-to-seed
> feature. There is an argument to avoid gossip-to-seed for other
> reasons (see CASSANDRA-3829), but this is a concrete case in which the
> gossip-to-seed could cause a negative side-effect of the general kind
> mentioned in CASSANDRA-3829 (see notes at end about the case w/o seeds
> not being continuously tested). Normally, due to gossip to seed,
> everyone essentially sees the latest information within very few
> heartbeats (assuming only 2-3 seeds). But should all seeds be down,
> suddenly we flip a switch and start relying on generalized propagation
> in the gossip system, rather than the seed special case.
> The potential problem I foresee here is that if the average propagation
> time suddenly spikes when all seeds become unavailable, it could cause
> bogus flapping of nodes into the down state.
> In order to test this, I deployed a ~180 node cluster with a version
> that logs heartbeat information on each interpret(), similar to:
>  INFO [GossipTasks:1] 2012-02-01 23:29:58,746 FailureDetector.java (line 187) ep /XXX.XXX.XXX.XXX is at phi 0.0019521638443084342, last interval 7.0, mean is 1557.2777777777778
> It turns out that, at least at 180 nodes, with 4 seed nodes, whether
> or not seeds are running *does not* seem to matter significantly. In
> both cases, the mean interval is around 1500 milliseconds.
> I don't feel I have a good grasp of whether this is incidental or
> guaranteed, and it would be good to at least empirically test
> propagation time w/o seeds at different cluster sizes; it's supposed
> to be unaffected by cluster size ({{RING_DELAY}} is static for this
> reason, is my understanding). It would be nice to see that this is the case.
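
The scoring the issue description attributes to the failure detector (time since last heartbeat over the mean interval in a sliding window, with a scaling factor applied before checking the phi threshold) can be sketched as follows. Names and the scaling constant are hypothetical stand-ins, not the real FailureDetector code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the interval-based phi scoring described above;
// WINDOW_SIZE matches the 1000-entry window mentioned in the issue, while
// SCALE is an illustrative stand-in for the historical scaling factor.
public class PhiSketch
{
    private static final int WINDOW_SIZE = 1000;
    private static final double SCALE = 1.0;

    private final Deque<Double> intervals = new ArrayDeque<>();
    private double lastHeartbeatMillis = -1;

    /** Record a heartbeat arrival, tracking the interval since the last one. */
    public void report(double nowMillis)
    {
        if (lastHeartbeatMillis >= 0)
        {
            if (intervals.size() >= WINDOW_SIZE)
                intervals.removeFirst();
            intervals.addLast(nowMillis - lastHeartbeatMillis);
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** Time since the last heartbeat over the mean recorded interval, scaled. */
    public double phi(double nowMillis)
    {
        if (intervals.isEmpty())
            return 0;
        double mean = intervals.stream().mapToDouble(Double::doubleValue).average().getAsDouble();
        return SCALE * (nowMillis - lastHeartbeatMillis) / mean;
    }
}
```

This makes the issue's concern easy to see: phi depends on the *mean* interval in the window, so if gossip-to-seed keeps that mean artificially low, losing all seeds (and hence lengthening real intervals) inflates phi for healthy nodes.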

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
