Pengchao Wang created CASSANDRA-14660:
-----------------------------------------

             Summary: Improve TokenMetaData cache populating performance for 
large cluster
                 Key: CASSANDRA-14660
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14660
             Project: Cassandra
          Issue Type: Improvement
          Components: Coordination
         Environment: Benchmark is on MacOSX 10.13.5, 2017 MBP
            Reporter: Pengchao Wang
             Fix For: 4.0.x
         Attachments: TokenMetaDataBenchmark.java

TokenMetaData#cachedOnlyTokenMap is a method C* used to get a consistent token 
and topology view on coordinations without paying read lock cost. Upon first 
read the method acquire a synchronize lock and generate a copy of major token 
meta data structures and cached it, and upon every token meta data changes(due 
to gossip changes), the cache get cleared and next read will taking care of 
cache population.

 

For small to medium size clusters this strategy works pretty well. But large 
clusters can actually be suffered from the locking since cache populating is 
much slower. On one of our largest cluster (~1000 nodes,  125k tokens, C* 
3.0.15)  each cache population take about 500~700ms, and during that there are 
no requests can go through since synchronize lock was acquired. This caused 
waves of timeouts errors when there are large amount gossip messages 
propagating cross the cluster, such as in the case of cluster restarting.

 

Base on profiling we found that the cost mostly comes from copying 
tokenToEndpointMap. It is a SortedBiMultiValueMap made from a forward map use 
TreeMap and a reverse map use guava TreeMultiMap. There is an optimization in 
TreeMap helps reduce copying complexity from O(n*log(n)) to O(n) when copying 
from already ordered data. But guava's TreeMultiMap copying missed that 
optimization and make it ~10 times slower than it actually need to be on our 
size of cluster.

 

The patch attached to the issue replace the reverse TreeMultiMap<K, V> to a 
vanilla TreeMap<K, TreeSet<V>> in SortedBiMultiValueMap to make sure we can 
copy it O(n) time.

 

I also attached a benchmark script (TokenMetaDataBenchmark.java), which 
simulates a large cluster then measures average latency for TokenMetaData cache 
populating.

Benchmark result before and after that patch:

 
{code:java}
trunk: 
before 100ms, after 13ms
3.0.x: 
before 199ms, after 15ms
{code}
 

 

(On 3.0.x even the forward TreeMap copying is slow, the O(n*log(n)) to O(n) 
optimization is not applied because the key comparator is dynamically created 
and TreeMap cannot determine the source and dest are in same order)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to