Pengchao Wang created CASSANDRA-14660:
-----------------------------------------
Summary: Improve TokenMetaData cache populating performance for
large cluster
Key: CASSANDRA-14660
URL: https://issues.apache.org/jira/browse/CASSANDRA-14660
Project: Cassandra
Issue Type: Improvement
Components: Coordination
Environment: Benchmark is on MacOSX 10.13.5, 2017 MBP
Reporter: Pengchao Wang
Fix For: 4.0.x
Attachments: TokenMetaDataBenchmark.java
TokenMetaData#cachedOnlyTokenMap is a method C* used to get a consistent token
and topology view on coordinations without paying read lock cost. Upon first
read the method acquire a synchronize lock and generate a copy of major token
meta data structures and cached it, and upon every token meta data changes(due
to gossip changes), the cache get cleared and next read will taking care of
cache population.
For small to medium size clusters this strategy works pretty well. But large
clusters can actually be suffered from the locking since cache populating is
much slower. On one of our largest cluster (~1000 nodes, 125k tokens, C*
3.0.15) each cache population take about 500~700ms, and during that there are
no requests can go through since synchronize lock was acquired. This caused
waves of timeouts errors when there are large amount gossip messages
propagating cross the cluster, such as in the case of cluster restarting.
Base on profiling we found that the cost mostly comes from copying
tokenToEndpointMap. It is a SortedBiMultiValueMap made from a forward map use
TreeMap and a reverse map use guava TreeMultiMap. There is an optimization in
TreeMap helps reduce copying complexity from O(n*log(n)) to O(n) when copying
from already ordered data. But guava's TreeMultiMap copying missed that
optimization and make it ~10 times slower than it actually need to be on our
size of cluster.
The patch attached to the issue replace the reverse TreeMultiMap<K, V> to a
vanilla TreeMap<K, TreeSet<V>> in SortedBiMultiValueMap to make sure we can
copy it O(n) time.
I also attached a benchmark script (TokenMetaDataBenchmark.java), which
simulates a large cluster then measures average latency for TokenMetaData cache
populating.
Benchmark result before and after that patch:
{code:java}
trunk:
before 100ms, after 13ms
3.0.x:
before 199ms, after 15ms
{code}
(On 3.0.x even the forward TreeMap copying is slow, the O(n*log(n)) to O(n)
optimization is not applied because the key comparator is dynamically created
and TreeMap cannot determine the source and dest are in same order)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]