[ 
https://issues.apache.org/jira/browse/CASSANDRA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaakko Laine updated CASSANDRA-617:
-----------------------------------

    Attachment: spread_90.png
                spread_50.png
                averages.png

I've run some simulated gossiper tests and there are some issues with Gossiper 
scalability:

(1) Overall capacity to spread state information:
Due to gossiper packet size, at most 117 gossip digest messages fit into one 
GossipDigestSynMessage. This means that if there are more than 117 nodes, at 
one round only a subset of all digests (endpoint states) can be gossiped. This 
can cause starvation as some nodes' state information has to potentially queue 
for long time before getting a chance to spread (for example if there are 234 
nodes, there's only 50% chance for state information to spread at each stage). 
Due to this, the time it takes for one state information to reach all nodes, 
grows logarithmically only to 117 nodes. Growth rate after this seems to be 
linearish (it is not exponential as randomness and multiple paths between any 
two nodes dampen the effect of worsening max_digest_size / all_endpoints_size 
ratio). Attached chart (averages.png) shows average amount of rounds it takes 
for a piece of gossip to reach given percentage of nodes. As can be seen, after 
cluster size exceeds 117 nodes, the curves take a sharp turn upwards.

As can also be seen, when cluster size grows, 98% curve stays considerably 
lower than 100%. When there are only a few nodes without certain piece of 
gossip, max_digest_size / all_endpoints_size ratio plays bigger role and it is 
more difficult for certain state information to reach the last nodes.

Attached there are also spread_100.png, spread_90.png and spread_50.png charts. 
These show minimum, maximum and average time to reach 100%, 90% and 50% of the 
nodes respectively. Samples per one stage (number of nodes) were only 12 
gossips, so there is some room for error, but should give some indication as to 
how the gossiper behaves in larger clusters.

(2) Ability of Ack message to spread endpoint state information
(a) Size of GossipDigestAckMessage is similarly restricted, so the amount of 
gossip digests and endpoint states it can carry is also limited. In the 
simulation, once cluster size reaches about 80-90 nodes, not all endpoint 
states can be included in Ack message. This means that on top of delays caused 
by Syn message limitations, also Ack message capacity will cause delays in 
endpoint state propagation.

(b) Another related issue is how GossipDigestAckMessage size is divided between 
digests and endpoint states. Currently all digests are included first, and 
then, if there is room, as many endpoint states will be included as will fit. 
In the simulation this was not a problem, but it might happen that digests take 
so much room that not many endpoint states can be included.

(3) New nodes entering a large cluster
(a) When a new node enters a ring, it will first gossip to a seed and let it 
know that it has joined the cluster. However, the way SYN/ACK/ACK2 exchange 
works, the seed will not volunteer information about any other nodes it knows 
about. Only when a seed randomly chooses this node to gossip to, the new node 
will know (through the arriving Syn message) about other nodes. In a large 
cluster, this chance might be very small (less than a percent), so the node 
might need to wait for considerable time before getting knowledge of other 
nodes.

(b) Another related issue is endpoint state size. There are currently four 
"move" application states, which make the whole endpoint state rather big. 
During normal operation these states are of course delivered as deltas, but 
when a new node enters the ring, full state needs to be delivered. Only 9-10 of 
these states fit to one Ack message, so it will take some time before all data 
is delivered in a big cluster.


> gossiper scalability
> --------------------
>
>                 Key: CASSANDRA-617
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-617
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.5
>            Reporter: Jaakko Laine
>             Fix For: 0.5
>
>         Attachments: averages.png, spread_100.png, spread_50.png, 
> spread_90.png
>
>
> Improve gossiper scalability.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to