[
https://issues.apache.org/jira/browse/CASSANDRA-14660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590593#comment-16590593
]
Pengchao Wang edited comment on CASSANDRA-14660 at 8/23/18 5:47 PM:
--------------------------------------------------------------------
{quote}So a super simple quick fix would be to delete {{SortedBiMultiValMap}}
and move its static methods to {{BiMultiValMap}} but called e.g.
{{createWithSortedKeys}}. In this method we could provide a forward {{TreeMap}}
and reverse {{HashMultimap}}.
{quote}
In HasMultimap the value collection is a HashSet, which does not work for us.
E.g getting tokens for a specific endpoints will not give us sorted token
anymore.
{code:java}
Set<Token> tokens = tokenToEndpointMap.inverse().get(endpoint);
{code}
{quote}If we were to use a map based on our BTree, we could eliminate the
distinction between {{sortedTokens}} and this map, since we can treat a
{{BTreeSet}} as a {{List}}, entirely eliminating the currently experienced
pauses and all copying.
{quote}
The BTree based immutable map idea sounds very interesting. I can explore that
options when I got some time.
was (Author: wpc):
{quote}So a super simple quick fix would be to delete {{SortedBiMultiValMap}}
and move its static methods to {{BiMultiValMap}} but called e.g.
{{createWithSortedKeys}}. In this method we could provide a forward {{TreeMap}}
and reverse {{HashMultimap}}.
{quote}
In HasMultimap the value collection is a HashSet, which probably will not work
for us. E.g getting tokens for a specific endpoints will not give us sorted
token anymore.
{code:java}
Set<Token> tokens = tokenToEndpointMap.inverse().get(endpoint);
{code}
{quote}If we were to use a map based on our BTree, we could eliminate the
distinction between {{sortedTokens}} and this map, since we can treat a
{{BTreeSet}} as a {{List}}, entirely eliminating the currently experienced
pauses and all copying.
{quote}
The BTree based immutable map idea sounds very interesting. I can explore that
options when I got some time.
> Improve TokenMetaData cache populating performance for large cluster
> --------------------------------------------------------------------
>
> Key: CASSANDRA-14660
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14660
> Project: Cassandra
> Issue Type: Improvement
> Components: Coordination
> Environment: Benchmark is on MacOSX 10.13.5, 2017 MBP
> Reporter: Pengchao Wang
> Priority: Critical
> Labels: Performance
> Fix For: 3.0.x, 3.11.x, 4.x
>
> Attachments: 14660-trunk.txt, TokenMetaDataBenchmark.java
>
>
> TokenMetaData#cachedOnlyTokenMap is a method C* used to get a consistent
> token and topology view on coordinations without paying read lock cost. Upon
> first read the method acquire a synchronize lock and generate a copy of major
> token meta data structures and cached it, and upon every token meta data
> changes(due to gossip changes), the cache get cleared and next read will
> taking care of cache population.
> For small to medium size clusters this strategy works pretty well. But large
> clusters can actually be suffered from the locking since cache populating is
> much slower. On one of our largest cluster (~1000 nodes, 125k tokens, C*
> 3.0.15) each cache population take about 500~700ms, and during that there
> are no requests can go through since synchronize lock was acquired. This
> caused waves of timeouts errors when there are large amount gossip messages
> propagating cross the cluster, such as in the case of cluster restarting.
> Base on profiling we found that the cost mostly comes from copying
> tokenToEndpointMap. It is a SortedBiMultiValueMap made from a forward map use
> TreeMap and a reverse map use guava TreeMultiMap. There is an optimization in
> TreeMap helps reduce copying complexity from O(N*log(N)) to O(N) when copying
> from already ordered data. But guava's TreeMultiMap copying missed that
> optimization and make it ~10 times slower than it actually need to be on our
> size of cluster.
> The patch attached to the issue replace the reverse TreeMultiMap<K, V> to a
> vanilla TreeMap<K, TreeSet<V>> in SortedBiMultiValueMap to make sure we can
> copy it O(N) time.
> I also attached a benchmark script (TokenMetaDataBenchmark.java), which
> simulates a large cluster then measures average latency for TokenMetaData
> cache populating.
> Benchmark result before and after that patch:
> {code:java}
> trunk:
> before 100ms, after 13ms
> 3.0.x:
> before 199ms, after 15ms
> {code}
> (On 3.0.x even the forward TreeMap copying is slow, the O(N*log(N)) to O(N)
> optimization is not applied because the key comparator is dynamically created
> and TreeMap cannot determine the source and dest are in same order)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]