[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828097#comment-13828097 ]

Rick Branson commented on CASSANDRA-6345:
-----------------------------------------

+100 on removing those pub/sub callbacks :)

The concurrency issues I brought up probably stem from my unfamiliarity with the 
"guarantees" TokenMetadata updates actually need. It looks like the current release 
code is already subject to the issue I raised: method calls that change TokenMetadata 
state return successfully before every thread applying mutations has "seen" the 
update, so some in-flight mutations will still be applying writes against "stale" 
token data even after a TokenMetadata write method has returned. So this does not 
appear to be a regression; I'm just being overly cautious, having been burned by 
these sorts of double-caching scenarios before. You make the point that over-broad 
operations are fine, and I agree, but I'm more concerned about operations that end 
up too narrow. Unless I'm missing something, either is possible with the current 
release code, and therefore with these patches as well (including mine).

TokenMetadata#updateNormalTokens (implicitly) relies on the removeFromMoving call to 
bump the version, but the tokenToEndpointMap is updated afterwards, which means 
internal data changes after the version has already been bumped. IMHO, to be 
defensive, any time the write lock is acquired in TokenMetadata the version should 
be bumped in the finally block, just before the lock is released. I don't think this 
exposes a bug in the existing patch, though, because cloneOnlyTokenMap will block 
until the write lock is released in that finally block.
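
To make that concrete, here's a standalone sketch of the pattern I mean, with the 
version bump in the finally block. The class and field names are purely 
illustrative, not TokenMetadata's actual members:

{code}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: a stand-in for TokenMetadata-style state, showing the
// "bump the version in the finally block whenever the write lock is taken"
// idea. None of these names come from the real class.
public class VersionedRingState
{
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final AtomicLong version = new AtomicLong();

    public void mutateRing(Runnable mutation)
    {
        lock.writeLock().lock();
        try
        {
            // update tokenToEndpointMap-style internal state here
            mutation.run();
        }
        finally
        {
            // Bump the version while still holding the write lock, so no
            // reader can observe new data paired with a stale version
            // (or the reverse) between the mutation and the bump.
            version.incrementAndGet();
            lock.writeLock().unlock();
        }
    }

    public long getVersion()
    {
        return version.get();
    }
}
{code}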

Is the idea behind the striped lock on the endpoint cache in 
AbstractReplicationStrategy to smooth out the stampede when the "global" lock on the 
cached TM gets released after the fill? Do you think it's worth the extra 
complexity? FWIW, my v2 patch has this issue and it hasn't reared its head in 
production, though the write load on the cluster I've been looking at is 
comparatively low: 6-7k/sec peak on an 8-core box.
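
Just to check my reading of the striped-lock idea, here's a rough standalone sketch 
of what I think it buys: misses for different tokens take different stripes, so the 
post-invalidation fill isn't serialized behind one lock. All the names here are made 
up for illustration, not taken from the patch:

{code}
import java.net.InetAddress;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;

import com.google.common.util.concurrent.Striped;

// Illustrative endpoint cache with striped locking on the fill path.
public class StripedEndpointCache<T>
{
    public interface EndpointComputer<K>
    {
        List<InetAddress> compute(K token);
    }

    private final ConcurrentHashMap<T, List<InetAddress>> cache =
            new ConcurrentHashMap<T, List<InetAddress>>();
    private final Striped<Lock> stripes = Striped.lock(64);

    public List<InetAddress> get(T token, EndpointComputer<T> computer)
    {
        List<InetAddress> endpoints = cache.get(token);
        if (endpoints != null)
            return endpoints;

        Lock stripe = stripes.get(token);
        stripe.lock();
        try
        {
            // Re-check after winning the stripe; another thread chasing the
            // same token may already have done the expensive computation.
            endpoints = cache.get(token);
            if (endpoints == null)
            {
                endpoints = computer.compute(token);
                cache.put(token, endpoints);
            }
            return endpoints;
        }
        finally
        {
            stripe.unlock();
        }
    }

    public void invalidate()
    {
        cache.clear();
    }
}
{code}

With a single lock instead of stripes, every write thread that misses after an 
invalidation queues behind whichever thread fills first, which is the stampede I was 
asking about.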

> Endpoint cache invalidation causes CPU spike (on vnode rings?)
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-6345
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6345
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: 30 nodes total, 2 DCs
> Cassandra 1.2.11
> vnodes enabled (256 per node)
>            Reporter: Rick Branson
>            Assignee: Jonathan Ellis
>             Fix For: 1.2.13
>
>         Attachments: 6345-rbranson-v2.txt, 6345-rbranson.txt, 6345-v2.txt, 
> 6345-v3.txt, 6345-v4.txt, 6345.txt, 
> half-way-thru-6345-rbranson-patch-applied.png
>
>
> We've observed that events which cause invalidation of the endpoint cache 
> (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy 
> result in several seconds of thundering herd behavior on the entire cluster. 
> A thread dump shows over a hundred threads (I stopped counting at that point) 
> with a backtrace like this:
>         at java.net.Inet4Address.getAddress(Inet4Address.java:288)
>         at org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106)
>         at org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103)
>         at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351)
>         at java.util.TreeMap.getEntry(TreeMap.java:322)
>         at java.util.TreeMap.get(TreeMap.java:255)
>         at com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200)
>         at com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117)
>         at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74)
>         at com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273)
>         at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74)
>         at org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60)
>         at org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598)
>         at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104)
>         at org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671)
>         at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375)
> It looks like there's a large amount of cost in the 
> TokenMetadata.cloneOnlyTokenMap that 
> AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is 
> a cache miss for an endpoint. It seems as if this would only impact clusters 
> with large numbers of tokens, so it's probably a vnodes-only issue.
>
> Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the 
> cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), 
> wrapping it with a lock to prevent stampedes, and clearing it in 
> clearEndpointCache(). Thoughts?
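
(For reference, a minimal standalone sketch of the caching approach described in the 
proposal above: one cached clone, rebuilt under a single lock so only one thread 
pays the cloneOnlyTokenMap() cost per invalidation. The class, field, and method 
names are illustrative, not from any of the attached patches.)

{code}
import org.apache.cassandra.locator.TokenMetadata;

// Illustrative only: cache the cloned TokenMetadata used for endpoint lookups
// and drop it whenever the endpoint cache is cleared.
public class CachedTokenMetadata
{
    private final TokenMetadata tokenMetadata;   // live ring state
    private volatile TokenMetadata cachedClone;  // snapshot used for lookups
    private final Object fillLock = new Object();

    public CachedTokenMetadata(TokenMetadata tokenMetadata)
    {
        this.tokenMetadata = tokenMetadata;
    }

    public TokenMetadata get()
    {
        TokenMetadata snapshot = cachedClone;
        if (snapshot != null)
            return snapshot;

        synchronized (fillLock)
        {
            // Only one thread clones after an invalidation; the rest wait
            // here instead of each paying the cloneOnlyTokenMap() cost.
            if (cachedClone == null)
                cachedClone = tokenMetadata.cloneOnlyTokenMap();
            return cachedClone;
        }
    }

    public void invalidate()
    {
        cachedClone = null;
    }
}
{code}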


