[
https://issues.apache.org/jira/browse/CASSANDRA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jaakko Laine updated CASSANDRA-617:
-----------------------------------
Attachment: spread_90.png
spread_50.png
averages.png
I've run some simulated gossiper tests and there are some issues with Gossiper
scalability:
(1) Overall capacity to spread state information:
Due to gossiper packet size, at most 117 gossip digest messages fit into one
GossipDigestSynMessage. This means that if there are more than 117 nodes, at
one round only a subset of all digests (endpoint states) can be gossiped. This
can cause starvation as some nodes' state information has to potentially queue
for long time before getting a chance to spread (for example if there are 234
nodes, there's only 50% chance for state information to spread at each stage).
Due to this, the time it takes for one state information to reach all nodes,
grows logarithmically only to 117 nodes. Growth rate after this seems to be
linearish (it is not exponential as randomness and multiple paths between any
two nodes dampen the effect of worsening max_digest_size / all_endpoints_size
ratio). Attached chart (averages.png) shows average amount of rounds it takes
for a piece of gossip to reach given percentage of nodes. As can be seen, after
cluster size exceeds 117 nodes, the curves take a sharp turn upwards.
As can also be seen, when cluster size grows, 98% curve stays considerably
lower than 100%. When there are only a few nodes without certain piece of
gossip, max_digest_size / all_endpoints_size ratio plays bigger role and it is
more difficult for certain state information to reach the last nodes.
Attached there are also spread_100.png, spread_90.png and spread_50.png charts.
These show minimum, maximum and average time to reach 100%, 90% and 50% of the
nodes respectively. Samples per one stage (number of nodes) were only 12
gossips, so there is some room for error, but should give some indication as to
how the gossiper behaves in larger clusters.
(2) Ability of Ack message to spread endpoint state information
(a) Size of GossipDigestAckMessage is similarly restricted, so the amount of
gossip digests and endpoint states it can carry is also limited. In the
simulation, once cluster size reaches about 80-90 nodes, not all endpoint
states can be included in Ack message. This means that on top of delays caused
by Syn message limitations, also Ack message capacity will cause delays in
endpoint state propagation.
(b) Another related issue is how GossipDigestAckMessage size is divided between
digests and endpoint states. Currently all digests are included first, and
then, if there is room, as many endpoint states will be included as will fit.
In the simulation this was not a problem, but it might happen that digests take
so much room that not many endpoint states can be included.
(3) New nodes entering a large cluster
(a) When a new node enters a ring, it will first gossip to a seed and let it
know that it has joined the cluster. However, the way SYN/ACK/ACK2 exchange
works, the seed will not volunteer information about any other nodes it knows
about. Only when a seed randomly chooses this node to gossip to, the new node
will know (through the arriving Syn message) about other nodes. In a large
cluster, this chance might be very small (less than a percent), so the node
might need to wait for considerable time before getting knowledge of other
nodes.
(b) Another related issue is endpoint state size. There are currently four
"move" application states, which make the whole endpoint state rather big.
During normal operation these states are of course delivered as deltas, but
when a new node enters the ring, full state needs to be delivered. Only 9-10 of
these states fit to one Ack message, so it will take some time before all data
is delivered in a big cluster.
> gossiper scalability
> --------------------
>
> Key: CASSANDRA-617
> URL: https://issues.apache.org/jira/browse/CASSANDRA-617
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.5
> Reporter: Jaakko Laine
> Fix For: 0.5
>
> Attachments: averages.png, spread_100.png, spread_50.png,
> spread_90.png
>
>
> Improve gossiper scalability.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.