[jira] [Comment Edited] (CASSANDRA-14660) Improve TokenMetaData cache populating performance for large cluster

Benedict (JIRA) Thu, 23 Aug 2018 15:27:06 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-14660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590859#comment-16590859
 ]


Benedict edited comment on CASSANDRA-14660 at 8/23/18 10:26 PM:
----------------------------------------------------------------

bq. In HasMultimap the value collection is a HashSet, which does not work for 
us. E.g getting tokens for a specific endpoints will not give us sorted token 
anymore.

{{MultimapBuilder}} can be used to construct a {{Multimap}} with hashed keys 
and {{TreeSet}} values.  We would not want to use the {{build(Multimap<K, V>)}} 
method, as it invokes {{put(K, V)}} for each entry, but if we instead iterated 
over {{asMap().entrySet()}} and added each {{TreeSet}} individually, we should 
get the linear complexity construction of the {{TreeSet}} containers.

Thinking about it, it's probable this change alone (iterating over 
{{asMap().entrySet()}}) would be sufficient to restore adequate performance, 
even if we continued to maintain a {{TreeMap}} for the key lookup (i.e. 
continued to use a {{TreeMultimap}}).  This should be a 3 line change, in 
{{SortedBiMultiValMap}} and tide us over until we can properly overhaul 
{{TokenMetadata}}, and have no wider semantic implications to worry about.


was (Author: benedict):
bq. In HasMultimap the value collection is a HashSet, which does not work for 
us. E.g getting tokens for a specific endpoints will not give us sorted token 
anymore.

{{MultimapBuilder}} can be used to construct a {{Multimap}} with hashed keys 
and {{TreeSet}} values.  We would not want to use the {{build(Multimap<K, V>)}} 
method, as it invokes {{put(K, V)}} for each entry, but if we instead iterated 
over {{asMap()}} and added each {{TreeSet}} individually, we should get the 
linear complexity construction of the {{TreeSet}} containers.

Thinking about it, it's probable this change alone would be sufficient to 
restore adequate performance, even if we continued to maintain a {{TreeMap}} 
for the key lookup (i.e. continued to use a TreeMultimap).  This should be a 3 
line change, in {{SortedBiMultiValMap}} and tide us over until we can properly 
overhaul {{TokenMetadata}}, and have no wider semantic implications to worry 
about.

> Improve TokenMetaData cache populating performance for large cluster
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-14660
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14660
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>         Environment: Benchmark is on MacOSX 10.13.5, 2017 MBP
>            Reporter: Pengchao Wang
>            Priority: Critical
>              Labels: Performance
>             Fix For: 3.0.x, 3.11.x, 4.x
>
>         Attachments: 14660-trunk.txt, TokenMetaDataBenchmark.java
>
>
> TokenMetaData#cachedOnlyTokenMap is a method C* used to get a consistent 
> token and topology view on coordinations without paying read lock cost. Upon 
> first read the method acquire a synchronize lock and generate a copy of major 
> token meta data structures and cached it, and upon every token meta data 
> changes(due to gossip changes), the cache get cleared and next read will 
> taking care of cache population.
> For small to medium size clusters this strategy works pretty well. But large 
> clusters can actually be suffered from the locking since cache populating is 
> much slower. On one of our largest cluster (~1000 nodes,  125k tokens, C* 
> 3.0.15)  each cache population take about 500~700ms, and during that there 
> are no requests can go through since synchronize lock was acquired. This 
> caused waves of timeouts errors when there are large amount gossip messages 
> propagating cross the cluster, such as in the case of cluster restarting.
> Base on profiling we found that the cost mostly comes from copying 
> tokenToEndpointMap. It is a SortedBiMultiValueMap made from a forward map use 
> TreeMap and a reverse map use guava TreeMultiMap. There is an optimization in 
> TreeMap helps reduce copying complexity from O(N*log(N)) to O(N) when copying 
> from already ordered data. But guava's TreeMultiMap copying missed that 
> optimization and make it ~10 times slower than it actually need to be on our 
> size of cluster.
> The patch attached to the issue replace the reverse TreeMultiMap<K, V> to a 
> vanilla TreeMap<K, TreeSet<V>> in SortedBiMultiValueMap to make sure we can 
> copy it O(N) time.
> I also attached a benchmark script (TokenMetaDataBenchmark.java), which 
> simulates a large cluster then measures average latency for TokenMetaData 
> cache populating.
> Benchmark result before and after that patch:
> {code:java}
> trunk: 
> before 100ms, after 13ms
> 3.0.x: 
> before 199ms, after 15ms
>  {code}
> (On 3.0.x even the forward TreeMap copying is slow, the O(N*log(N)) to O(N) 
> optimization is not applied because the key comparator is dynamically created 
> and TreeMap cannot determine the source and dest are in same order)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (CASSANDRA-14660) Improve TokenMetaData cache populating performance for large cluster

Reply via email to