[ https://issues.apache.org/jira/browse/CASSANDRA-14660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589216#comment-16589216 ]

Pengchao Wang commented on CASSANDRA-14660:
-------------------------------------------

[~benedict]: Thanks for replying. I will work on a 3.0 branch patch as well and
let you decide on the best schedule for landing it. The inefficient token map
copying can be traced back to very early versions of the TokenMetaData
implementation, so it may not be urgent enough to make the next release if that
is already in the feature-freeze stage. But I do hope we can get this into
trunk so that people running large clusters can benefit from the fix in the
future.

Regarding recalculating during read versus write: I think the problem is that
cache invalidation is protected by the write lock, so whether we rebuild the
cache on the read path or the write path we pay the full blocking cost either
way, and holding the write lock for longer looks more harmful (see the sketch
below). But I do agree with you that the locking design of TokenMetaData looks
quite heavy and probably needs a rethink.
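
A minimal sketch of the pattern as I read it (hypothetical class and field
names, not the actual TokenMetaData code): the writer only clears the cache,
which is cheap, while the expensive copy is deferred to the first read after an
invalidation; rebuilding eagerly on the write path would pay the same copy cost
while still holding the write lock.

{code:java}
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch only -- not the actual TokenMetaData implementation.
public class CachedTokenView
{
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final NavigableMap<Long, String> tokenToEndpoint = new TreeMap<>();
    private final AtomicReference<NavigableMap<Long, String>> cached = new AtomicReference<>();

    // Current scheme: the writer only clears the cache, which is cheap; the
    // expensive copy is deferred to the first read after the invalidation.
    public void update(long token, String endpoint)
    {
        lock.writeLock().lock();
        try
        {
            tokenToEndpoint.put(token, endpoint);
            cached.set(null);
        }
        finally
        {
            lock.writeLock().unlock();
        }
    }

    public NavigableMap<Long, String> cachedOnlyView()
    {
        NavigableMap<Long, String> snapshot = cached.get();
        if (snapshot != null)
            return snapshot;
        synchronized (this)            // cache-missing readers serialize here
        {
            snapshot = cached.get();
            if (snapshot != null)
                return snapshot;
            lock.readLock().lock();
            try
            {
                snapshot = new TreeMap<>(tokenToEndpoint); // the expensive copy
            }
            finally
            {
                lock.readLock().unlock();
            }
            cached.set(snapshot);
            return snapshot;
        }
    }

    // The alternative discussed above would rebuild the snapshot inside
    // update(), i.e. pay the same copy cost while still holding the write lock.
}
{code}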

> Improve TokenMetaData cache populating performance for large cluster
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-14660
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14660
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>         Environment: Benchmark is on MacOSX 10.13.5, 2017 MBP
>            Reporter: Pengchao Wang
>            Priority: Critical
>              Labels: Performance
>             Fix For: 3.0.x, 3.11.x, 4.x
>
>         Attachments: 14660-trunk.txt, TokenMetaDataBenchmark.java
>
>
> TokenMetaData#cachedOnlyTokenMap is the method C* uses to get a consistent
> token and topology view during coordination without paying the read-lock
> cost. On the first read the method acquires a synchronized lock, makes a copy
> of the major token metadata structures and caches it; on every token metadata
> change (due to gossip), the cache is cleared and the next read takes care of
> repopulating it.
> For small to medium size clusters this strategy works pretty well. But large
> clusters can actually suffer from the locking, since cache population is much
> slower. On one of our largest clusters (~1000 nodes, 125k tokens, C* 3.0.15)
> each cache population takes about 500~700ms, and during that time no requests
> can go through because the synchronized lock is held. This caused waves of
> timeout errors whenever a large amount of gossip messages propagated across
> the cluster, such as during a cluster restart.
> Based on profiling, we found that the cost mostly comes from copying
> tokenToEndpointMap. It is a SortedBiMultiValueMap made from a forward map
> using a TreeMap and a reverse map using guava's TreeMultimap. There is an
> optimization in TreeMap that reduces copying complexity from O(N*log(N)) to
> O(N) when copying from already ordered data, but guava's TreeMultimap copy
> misses that optimization, making it ~10 times slower than it needs to be at
> our cluster size, as illustrated below.
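>
> For illustration, a rough standalone comparison of the two copy paths
> (made-up data sizes, no JMH warm-up; guava's TreeMultimap here stands in for
> the reverse map inside SortedBiMultiValueMap):
>
> {code:java}
> import com.google.common.collect.TreeMultimap;
> import java.util.TreeMap;
>
> // Rough, hypothetical illustration of the copy-cost asymmetry described above.
> public class CopyCostDemo
> {
>     public static void main(String[] args)
>     {
>         TreeMap<Long, String> forward = new TreeMap<>();            // token -> endpoint
>         TreeMultimap<String, Long> reverse = TreeMultimap.create(); // endpoint -> tokens
>         for (long token = 0; token < 125_000; token++)
>         {
>             String endpoint = "10.0.0." + (token % 1000);
>             forward.put(token, endpoint);
>             reverse.put(endpoint, token);
>         }
>
>         long start = System.nanoTime();
>         TreeMap<Long, String> forwardCopy = new TreeMap<>(forward); // O(N): buildFromSorted
>         System.out.printf("TreeMap copy:      %d entries, %d ms%n",
>                           forwardCopy.size(), (System.nanoTime() - start) / 1_000_000);
>
>         start = System.nanoTime();
>         TreeMultimap<String, Long> reverseCopy = TreeMultimap.create(reverse); // O(N*log(N)): re-inserts each entry
>         System.out.printf("TreeMultimap copy: %d entries, %d ms%n",
>                           reverseCopy.size(), (System.nanoTime() - start) / 1_000_000);
>     }
> }
>  {code}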
> The patch attached to the issue replaces the reverse TreeMultimap<K, V> with
> a vanilla TreeMap<K, TreeSet<V>> in SortedBiMultiValueMap, so that we can
> copy it in O(N) time; a simplified sketch follows.
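>
> For reference, a simplified sketch of that idea (hypothetical class and
> method names; the attached 14660-trunk.txt is the authoritative change, and
> the linear deep copy shown here is just one way to do it):
>
> {code:java}
> import java.util.TreeMap;
> import java.util.TreeSet;
>
> // Simplified, hypothetical sketch -- not the patch itself.
> public class BiMultiValueMapSketch<K extends Comparable<K>, V extends Comparable<V>>
> {
>     final TreeMap<K, V> forward = new TreeMap<>();          // e.g. token -> endpoint
>     final TreeMap<V, TreeSet<K>> reverse = new TreeMap<>(); // e.g. endpoint -> tokens,
>                                                             // replacing guava's TreeMultimap
>     public void put(K key, V value)
>     {
>         forward.put(key, value);
>         reverse.computeIfAbsent(value, v -> new TreeSet<>()).add(key);
>     }
>
>     // Copy both sides in O(N): TreeMap.putAll into an empty map from a sorted
>     // source with the same comparator, and TreeSet(SortedSet), both take the
>     // linear buildFromSorted path.
>     public BiMultiValueMapSketch<K, V> copy()
>     {
>         BiMultiValueMapSketch<K, V> copy = new BiMultiValueMapSketch<>();
>         copy.forward.putAll(forward);
>         copy.reverse.putAll(reverse);
>         copy.reverse.replaceAll((endpoint, tokens) -> new TreeSet<>(tokens));
>         return copy;
>     }
> }
>  {code}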
> I also attached a benchmark script (TokenMetaDataBenchmark.java), which
> simulates a large cluster and then measures the average latency of
> TokenMetaData cache population.
> Benchmark result before and after that patch:
> {code:java}
> trunk: 
> before 100ms, after 13ms
> 3.0.x: 
> before 199ms, after 15ms
>  {code}
> (On 3.0.x even the forward TreeMap copy is slow: the O(N*log(N)) to O(N)
> optimization is not applied because the key comparator is dynamically
> created, so TreeMap cannot determine that the source and destination are in
> the same order; see the illustration below.)
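>
> A small, self-contained demonstration of that JDK behaviour (made-up data;
> TreeMap.putAll only takes the linear buildFromSorted path when the source is
> a SortedMap whose comparator compares equal to the destination's, and
> comparators created on the fly only have identity equals()):
>
> {code:java}
> import java.util.Comparator;
> import java.util.TreeMap;
>
> // Hypothetical demo of the TreeMap.putAll fast path: it is only taken when the
> // destination is still empty and the source is a SortedMap whose comparator is
> // equal to the destination's. Comparators created on the fly (lambdas,
> // anonymous classes) only have identity equals(), so two maps built with two
> // such instances never take the fast path even though they sort identically.
> public class ComparatorEqualityDemo
> {
>     public static void main(String[] args)
>     {
>         Comparator<Long> shared = Comparator.naturalOrder();
>         TreeMap<Long, String> src = new TreeMap<>(shared);
>         for (long i = 0; i < 1_000_000; i++)
>             src.put(i, "ep" + (i % 1000));
>
>         TreeMap<Long, String> sameComparator = new TreeMap<>(shared);
>         time("same comparator instance ", () -> sameComparator.putAll(src));  // linear buildFromSorted
>
>         Comparator<Long> fresh = (a, b) -> Long.compare(a, b);                // identity equals()
>         TreeMap<Long, String> freshComparator = new TreeMap<>(fresh);
>         time("fresh comparator instance", () -> freshComparator.putAll(src)); // generic O(N*log(N)) path
>     }
>
>     private static void time(String label, Runnable copy)
>     {
>         long start = System.nanoTime();
>         copy.run();
>         System.out.printf("%s: %d ms%n", label, (System.nanoTime() - start) / 1_000_000);
>     }
> }
>  {code}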



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
