[
https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804362#comment-13804362
]
Quentin Conner commented on CASSANDRA-6127:
-------------------------------------------
*Analysis*
My first experiments aimed to quantify the length of Gossip messages and
determine what factors drive the message length. I found that the size of
certain Gossip messages increases proportionally with the number of vnodes
(num_tokens in cassandra.yaml). I recorded message size across a range of
num_tokens values (64, 128, 256, 512, ...) and node counts (32, 64, 128, 256,
512). I also made non-rigorous observations of user and kernel CPU usage
(Ubuntu 10.04 LTS).
My hunch is that both vnode count and node count have a mild effect on user CPU
resource usage.
What is the rough estimate of bytes sent for certain Gossip messages, and why
does this matter? The Phi Accrual Failure Detector (Hayashibara, et al.)
assumes fixed-length heartbeat messages, while Cassandra uses variable-length
messages. I observed a correlation between larger messages, higher vnode
counts and false-positive detections by the Gossip FailureDetector. These
observations, IMHO, are not explained by the research paper. I formed a
hypothesis that the false positives are due to jitter in the heartbeat
interarrival intervals, and wondered whether integrating over a longer
baseline of samples would reduce the jitter.
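To make the jitter argument concrete, here is a minimal sketch (my own
illustration, not Cassandra's FailureDetector code) of how phi can be derived
from a sliding window of heartbeat interarrival times, using the common
exponential-distribution simplification. The class name, method names and
window size are hypothetical:
{code:java}
// Illustrative sketch only -- not Cassandra's FailureDetector.
// phi is computed from a sliding window of heartbeat interarrival times,
// using the exponential-distribution simplification phi = (t / mean) * log10(e).
// Jitter in the recorded intervals moves the mean and therefore moves phi;
// a longer window ("baseline") smooths that out.
import java.util.ArrayDeque;
import java.util.Deque;

class PhiEstimator
{
    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private final int windowSize;      // how many intervals to integrate over
    private long lastHeartbeat = -1;

    PhiEstimator(int windowSize)
    {
        this.windowSize = windowSize;
    }

    void reportHeartbeat(long nowMillis)
    {
        if (lastHeartbeat > 0)
        {
            intervalsMillis.addLast(nowMillis - lastHeartbeat);
            if (intervalsMillis.size() > windowSize)
                intervalsMillis.removeFirst();   // drop the oldest sample
        }
        lastHeartbeat = nowMillis;
    }

    double phi(long nowMillis)
    {
        if (intervalsMillis.isEmpty())
            return 0.0;                          // no history yet
        double mean = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double t = nowMillis - lastHeartbeat;    // time since last heartbeat
        // P(interval > t) = exp(-t / mean) under an exponential model, so
        // phi = -log10(P) = (t / mean) * log10(e)
        return (t / mean) * Math.log10(Math.E);
    }
}
{code}
With a larger windowSize, a handful of jittery intervals shift the mean less,
which is the intuition behind integrating over a longer baseline.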
I have a second theory to follow up on. A newly added node will not have a
long history of Gossip heartbeat interarrival times. At least 40 samples are
needed to compute the mean and variance with any statistical significance. It
is possible the phi estimation algorithm is simply invalid for newly created
nodes, and that is why we see them flap shortly after creation.
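If that second theory holds, one simple mitigation would be a minimum-sample
guard in front of the conviction check. The sketch below builds on the
hypothetical PhiEstimator above and is purely illustrative; the 40-sample
threshold is the one from the reasoning above, not a value taken from the
codebase:
{code:java}
// Hypothetical guard (illustration only): never convict a peer until at
// least 40 interarrival samples exist, so a freshly-joined node is not
// judged on a statistically meaningless window.
class GuardedConvictor
{
    static final int MIN_SAMPLES = 40;   // threshold from the reasoning above

    static boolean shouldConvict(int sampleCount, double phi, double phiConvictThreshold)
    {
        if (sampleCount < MIN_SAMPLES)
            return false;                // not enough history: keep treating the node as up
        return phi > phiConvictThreshold;
    }
}
{code}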
In any case, the message of interest is the GossipDigestAck2 (GDA2) because it
is the largest of the Gossip messages. GDA2 contains the set of
EndpointStateMaps (node metadata) for newly-discovered nodes, i.e. those nodes
just added to an existing cluster. When each node becomes aware of a joining
node, it gossips that state to three randomly chosen other nodes. The GDA2
message is tailored to contain only the delta of new node metadata that the
receiving node is unaware of.
For a single node, the upper limit on GDA message size is roughly 3 * N * k * V,
where N is the number of nodes in the cluster,
V is the number of tokens (vnodes) per node, and
k is a constant, approximately 64 bytes, representing a serialized token plus
some other endpoint metadata.
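For a rough sense of scale, here is the upper bound evaluated for a couple of
assumed cluster shapes (this is the worst-case bound from the formula above,
not the typical delta actually shipped in a GDA2; the specific sizes are
hypothetical):
{code:java}
// Back-of-the-envelope only: plugs assumed values into 3 * N * k * V.
class GossipSizeEstimate
{
    static long upperBoundBytes(long nodes, long vnodesPerNode, long bytesPerToken)
    {
        return 3L * nodes * bytesPerToken * vnodesPerNode;
    }

    public static void main(String[] args)
    {
        long k = 64;   // approx. serialized token + endpoint metadata, per the estimate above
        System.out.println(upperBoundBytes(100, 256, k));    // ~4.9 MB
        System.out.println(upperBoundBytes(1000, 32, k));    // ~6.1 MB
    }
}
{code}
The second case mirrors the 1000-node, 32-vnode configuration described in the
ticket below.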
If one is running hundreds of nodes in a cluster, the Gossip message traffic
created when a node joins can be significant, and it increases with the number
of nodes. I believe this is the first-order effect, and it probably violates
one of the assumptions of Phi Accrual Failure Detection: that heartbeat
messages are small enough not to consume a significant amount of compute or
communication resources. The variable transmission time (due to
variable-length messages) is a clear violation of the paper's assumptions, if
I've read the source code correctly.
On a related topic, there is a hard-coded limit on the number of vnodes due to
the serialization of the GDA messages. No more than 1720 vnodes can be
configured without producing a serialized vnode String larger than 32 KB. A
patch is provided below for future use should this become an issue.
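As a rough illustration of where that ceiling comes from (assuming the vnode
list is flattened into a single String of decimal tokens plus separators, and
the serializer caps a String at Short.MAX_VALUE bytes; the actual on-wire
encoding may differ):
{code:java}
// Rough illustration only -- the actual encoding in Cassandra may differ.
// With roughly 19 bytes of String per token and a Short.MAX_VALUE cap on
// the serialized String, the ceiling lands in the same ballpark as the
// 1720 figure quoted above.
class VnodeStringLimit
{
    public static void main(String[] args)
    {
        int bytesPerToken = 19;              // assumed: decimal token + separator
        int stringCap = Short.MAX_VALUE;     // 32,767 bytes, the assumed 32K cap
        System.out.println(stringCap / bytesPerToken);   // ~1724 tokens
    }
}
{code}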
In clusters with hundreds of nodes, GDA2 messages can reach 200 KB, or 2 MB if
many nodes join simultaneously. This is not an issue if the machine
experiences no latency from competing workloads. In the real world, nodes are
added because the cluster load has grown, either in retained data or in
transaction arrival rate. This means node resources may already be fully
utilized at exactly the time when adding new nodes is attempted.
It occurred to me that we have another use case to accommodate. It is common
to experience transient failure modes, even in modern data centers with
disciplined maintenance practices. Ethernet cables get moved; switches and
routers get rebooted. BGP route errors and other temporary interruptions may
occur in the network fabric in real-world scenarios. People make mistakes,
plans change, and preventative maintenance often causes short-lived
interruptions to the network, CPU and disk subsystems.
> vnodes don't scale to hundreds of nodes
> ---------------------------------------
>
> Key: CASSANDRA-6127
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6127
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Any cluster that has vnodes and consists of hundreds of
> physical nodes.
> Reporter: Tupshin Harper
> Assignee: Jonathan Ellis
>
> There are a lot of gossip-related issues related to very wide clusters that
> also have vnodes enabled. Let's use this ticket as a master in case there are
> sub-tickets.
> The most obvious symptom I've seen is with 1000 nodes in EC2 with m1.xlarge
> instances. Each node configured with 32 vnodes.
> Without vnodes, cluster spins up fine and is ready to handle requests within
> 30 minutes or less.
> With vnodes, nodes are reporting constant up/down flapping messages with no
> external load on the cluster. After a couple of hours, they were still
> flapping, had very high cpu load, and the cluster never looked like it was
> going to stabilize or be useful for traffic.
--
This message was sent by Atlassian JIRA
(v6.1#6144)