[
https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808260#comment-13808260
]
Quentin Conner commented on CASSANDRA-6127:
-------------------------------------------
Brandon,
You said Patch #3 will make it take much longer for a rebooted node to know
who's actually up or down, exacerbating CASSANDRA-4288. I've given this some
thought and want to see if I understand your concern.
Patch #3 serves to report a phi of zero for newly-discovered nodes until an
accurate calculation of the variance is possible. At one gossip heartbeat per
second, that works out to roughly 40 seconds, and it applies to new nodes only.
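For clarity, here is a minimal, self-contained sketch of the kind of guard Patch
#3 describes; it is not the patch itself, and the class, field and constant names
are assumptions made for this example. The idea is simply that phi stays at zero
until the inter-arrival window holds enough samples for the statistics to mean
anything.
{code:java}
// Illustrative sketch only -- not the actual FailureDetector code. Names and
// constants are assumptions for this example.
import java.util.ArrayDeque;
import java.util.Deque;

public class PhiGuardSketch
{
    private static final int MIN_SAMPLES = 40;    // samples needed before phi is trusted
    private static final int WINDOW_SIZE = 1000;  // sliding window of inter-arrival times

    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private long lastArrivalMillis = -1;

    /** Record a heartbeat arrival (one per gossip round, roughly every second). */
    public void report(long nowMillis)
    {
        if (lastArrivalMillis > 0)
        {
            if (intervalsMillis.size() == WINDOW_SIZE)
                intervalsMillis.removeFirst();
            intervalsMillis.addLast(nowMillis - lastArrivalMillis);
        }
        lastArrivalMillis = nowMillis;
    }

    /** Phi for the accrual failure detector; forced to 0.0 until MIN_SAMPLES exist. */
    public double phi(long nowMillis)
    {
        if (intervalsMillis.size() < MIN_SAMPLES)
            return 0.0;  // too few samples: the node cannot be convicted yet

        double mean = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(1.0);
        long sinceLast = nowMillis - lastArrivalMillis;
        // Simplified phi; the detector under discussion also folds in the variance.
        return (sinceLast / mean) / Math.log(10.0);
    }
}
{code}
The consequence described next falls straight out of that early return: with phi
pinned at zero, conviction is impossible until the window fills.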
However (and this is what I'm looking for you to confirm): if a new node comes
online but is stopped again within 40 seconds of start-up, the failure detector
will not "convict" it until the end of that 40 seconds.
I suspect this occurs less frequently than adding a node to a cluster, but that
probably depends on your use case (dev vs. prod).
In my view, we can't escape the math and the need to amass 40 samples; that is
why the bug exists today. I agree we should look at tying thrift to a healthy
startup as a compensating measure.
Instead of a fixed amount of time (a fixed number of gossip rounds), perhaps we
should consider adding a hold-down timer based on a statistical measure?
Such a hold-down timer could be applied to newly discovered nodes to suppress
interaction until gossip "stabilizes". Just as we have a high-water mark for phi
to denote failure, we could set a low-water mark and call it a trust threshold:
we wouldn't enable thrift communications to a new node until its phi value is
below this low-water mark.
So the condition for "recognizing" a new node for thrift purposes could be
twofold (sketched below):
1. a valid variance computation (40 samples obtained in the 1000-sample window)
2. an accurate phi value that is indeed below the low-water mark
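To make that two-fold condition concrete, here is a hedged sketch. The
isTrusted() helper, the class name, and the low-water-mark value of 1.0 are
assumptions for illustration; only the 40-sample minimum comes from the
discussion above, and 8 is Cassandra's default phi_convict_threshold.
{code:java}
// Hypothetical helper illustrating the proposed trust threshold -- not an existing API.
public final class TrustThresholdSketch
{
    private static final int MIN_SAMPLES = 40;               // needed for a valid variance estimate
    private static final double PHI_LOW_WATER_MARK = 1.0;    // "trust" threshold (value is a guess)
    private static final double PHI_CONVICT_THRESHOLD = 8.0; // existing high-water mark (default)

    /** Conditions 1 and 2 above: only then would we enable thrift traffic to the node. */
    public static boolean isTrusted(int samplesInWindow, double phi)
    {
        boolean varianceValid = samplesInWindow >= MIN_SAMPLES;  // 1. enough samples
        boolean belowLowWaterMark = phi < PHI_LOW_WATER_MARK;    // 2. phi under the low-water mark
        return varianceValid && belowLowWaterMark;
    }

    /** Existing failure condition, shown for contrast: conviction uses the high-water mark. */
    public static boolean isConvicted(double phi)
    {
        return phi > PHI_CONVICT_THRESHOLD;
    }
}
{code}
With something like this in place the hold-down needs no separate timer: a node
that flaps during its first 40 samples simply never satisfies both conditions,
so thrift traffic to it is held back until gossip stabilizes.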
> vnodes don't scale to hundreds of nodes
> ---------------------------------------
>
> Key: CASSANDRA-6127
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6127
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Any cluster that has vnodes and consists of hundreds of
> physical nodes.
> Reporter: Tupshin Harper
> Assignee: Jonathan Ellis
> Attachments: 6000vnodes.patch, AdjustableGossipPeriod.patch,
> delayEstimatorUntilStatisticallyValid.patch
>
>
> There are a lot of gossip-related issues with very wide clusters that also
> have vnodes enabled. Let's use this ticket as a master in case there are
> sub-tickets.
> The most obvious symptom I've seen is with 1000 nodes in EC2 on m1.xlarge
> instances, each node configured with 32 vnodes.
> Without vnodes, the cluster spins up fine and is ready to handle requests
> within 30 minutes or less.
> With vnodes, nodes were reporting constant up/down flapping messages with no
> external load on the cluster. After a couple of hours, they were still
> flapping, had very high CPU load, and the cluster never looked like it was
> going to stabilize or be useful for traffic.