[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-6345:
--------------------------------------

    Attachment: 6345-v5.txt

bq. It seems that unless I'm missing something either is possible with the 
current release code, and thus these patches as well

Technically correct, but in practice we're in pretty good shape.  The sequence 
is:

# Add the changing node to pending ranges
# Sleep for RING_DELAY so everyone else starts including the new target in 
their writes
# Flush data to be transferred
# Send over data for writes that happened before (1)

Step 1 happens on every coordinator; steps 2-4 happen only on the node that is 
giving up a token range.

The guarantee we need is that any write that happens before the pending range 
change also completes before the subsequent flush.

Even if we used TM.lock to protect the entire ARS sequence (guaranteeing that 
no local write is in progress once the pending range change happens), we could 
still receive writes from other nodes that processed the change later.

So we rely on the RING_DELAY (30s) sleep.  I suppose a GC pause, for instance, 
at just the wrong time could theoretically mean a mutation against the old 
state gets sent out late, but I don't see how we can improve it.
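
To make the ordering concrete, here is a rough sketch of the donor-side sequence 
above.  The method names (announcePendingRange, flushMemtables, 
streamFlushedData) and the plain String parameters are hypothetical stand-ins 
for the real gossip/flush/streaming machinery, not Cassandra's actual API:

{code:java}
// Rough sketch only -- method names are hypothetical, not Cassandra's real API.
public final class TokenHandoffSketch
{
    private static final long RING_DELAY_MS = 30_000; // the 30s sleep discussed above

    public void relinquishRange(String newOwner, String range) throws InterruptedException
    {
        // Step 1: publish the pending range so every coordinator starts
        // writing to the new target as well as the old owner.
        announcePendingRange(newOwner, range);

        // Step 2: wait out RING_DELAY so writes issued against the old ring
        // state have time to land here before the data is snapshotted.
        Thread.sleep(RING_DELAY_MS);

        // Step 3: flush, capturing any write that completed during the sleep.
        flushMemtables(range);

        // Step 4: stream the flushed data (covering writes from before step 1)
        // to the new owner.
        streamFlushedData(newOwner, range);
    }

    // Hypothetical placeholders for the real machinery.
    private void announcePendingRange(String newOwner, String range) {}
    private void flushMemtables(String range) {}
    private void streamFlushedData(String newOwner, String range) {}
}
{code}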

bq. IMHO to be defensive, any time the write lock is acquired in TokenMetadata, 
the version should be bumped in the finally block before the lock is released

Haven't thought this through as much.  What are you saying we should bump that 
we weren't calling invalidate on before?
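
If the suggestion is the usual defensive pattern, it amounts to something like 
the sketch below: bump a version counter in the finally block before releasing 
the write lock, so any view built from an older version is detectably stale.  
The field and method names here are illustrative, not the actual TokenMetadata 
members:

{code:java}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only -- not the real TokenMetadata members.
final class VersionedRingSketch
{
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final AtomicLong version = new AtomicLong(); // consumers compare this to the
                                                         // version their cached clone saw

    void mutateRing(Runnable mutation)
    {
        lock.writeLock().lock();
        try
        {
            mutation.run();
        }
        finally
        {
            // Defensive: bump unconditionally so a cached clone can never be
            // mistaken for current, even if the mutation path forgot to invalidate.
            version.incrementAndGet();
            lock.writeLock().unlock();
        }
    }

    long currentVersion()
    {
        return version.get();
    }
}
{code}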

bq. Is the idea with the striped lock on the endpoint cache in 
AbstractReplicationStrategy to help smooth out the stampede effect when the 
"global" lock on the cached TM gets released after the fill?

I'm trying to avoid a minor stampede on calculateNaturalEndpoints 
(CASSANDRA-3881), but it's probably premature optimization.  v5 attached w/o 
that.
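
For reference, the striping idea (dropped again in v5) looks roughly like this: 
key the miss-path lock by token so that concurrent misses for different tokens 
don't all serialize on one monitor after the cache is cleared.  Everything below 
is a hand-rolled illustration, not the patch itself, and 
calculateNaturalEndpointsSketch is a placeholder:

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hand-rolled illustration of lock striping on the endpoint-cache miss path.
final class StripedEndpointCacheSketch
{
    private static final int STRIPES = 64;
    private final Object[] stripes = new Object[STRIPES];
    private final ConcurrentMap<Long, List<String>> cache = new ConcurrentHashMap<Long, List<String>>();

    StripedEndpointCacheSketch()
    {
        for (int i = 0; i < STRIPES; i++)
            stripes[i] = new Object();
    }

    List<String> getNaturalEndpoints(long token)
    {
        List<String> endpoints = cache.get(token);
        if (endpoints != null)
            return endpoints;

        // Only misses that hash to the same stripe contend with each other;
        // a full cache clear no longer funnels every write through one monitor.
        synchronized (stripes[(int) Math.abs(token % STRIPES)])
        {
            endpoints = cache.get(token); // re-check under the stripe lock
            if (endpoints == null)
            {
                endpoints = calculateNaturalEndpointsSketch(token);
                cache.put(token, endpoints);
            }
            return endpoints;
        }
    }

    private List<String> calculateNaturalEndpointsSketch(long token)
    {
        return Collections.emptyList(); // placeholder for the real replica calculation
    }
}
{code}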

> Endpoint cache invalidation causes CPU spike (on vnode rings?)
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-6345
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6345
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: 30 nodes total, 2 DCs
> Cassandra 1.2.11
> vnodes enabled (256 per node)
>            Reporter: Rick Branson
>            Assignee: Jonathan Ellis
>             Fix For: 1.2.13
>
>         Attachments: 6345-rbranson-v2.txt, 6345-rbranson.txt, 6345-v2.txt, 
> 6345-v3.txt, 6345-v4.txt, 6345-v5.txt, 6345.txt, 
> half-way-thru-6345-rbranson-patch-applied.png
>
>
> We've observed that events which cause invalidation of the endpoint cache 
> (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy 
> result in several seconds of thundering herd behavior on the entire cluster. 
> A thread dump shows over a hundred threads (I stopped counting at that point) 
> with a backtrace like this:
>         at java.net.Inet4Address.getAddress(Inet4Address.java:288)
>         at org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106)
>         at org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103)
>         at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351)
>         at java.util.TreeMap.getEntry(TreeMap.java:322)
>         at java.util.TreeMap.get(TreeMap.java:255)
>         at com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200)
>         at com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117)
>         at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74)
>         at com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273)
>         at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74)
>         at org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60)
>         at org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598)
>         at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104)
>         at org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671)
>         at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375)
>
> It looks like there's a large amount of cost in the 
> TokenMetadata.cloneOnlyTokenMap that 
> AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is 
> a cache miss for an endpoint. It seems as if this would only impact clusters 
> with large numbers of tokens, so it's probably a vnodes-only issue.
>
> Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the 
> cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), 
> wrapping it with a lock to prevent stampedes, and clearing it in 
> clearEndpointCache(). Thoughts?
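
For what it's worth, here is a minimal sketch of the caching scheme proposed 
above, assuming a cachedTokenMap field guarded by a simple double-checked lock; 
the type and method names are illustrative, not the committed patch:

{code:java}
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the proposal: cache the cloned TokenMetadata and clear it
// together with the endpoint cache.  Names are illustrative only.
abstract class CachedCloneSketch
{
    // Stand-in for the cloned TokenMetadata returned by cloneOnlyTokenMap().
    interface TokenMetadataView {}

    // null means "invalidated"; the next miss rebuilds it.
    private final AtomicReference<TokenMetadataView> cachedTokenMap = new AtomicReference<TokenMetadataView>();

    // The expensive call sitting at the bottom of the stack trace above.
    abstract TokenMetadataView cloneOnlyTokenMap();

    TokenMetadataView getCachedTokenMap()
    {
        TokenMetadataView tm = cachedTokenMap.get();
        if (tm != null)
            return tm;

        // Lock so only one thread pays for the clone; the rest block briefly
        // and then read the freshly cached copy instead of cloning again.
        synchronized (this)
        {
            tm = cachedTokenMap.get();
            if (tm == null)
            {
                tm = cloneOnlyTokenMap();
                cachedTokenMap.set(tm);
            }
            return tm;
        }
    }

    void clearEndpointCache()
    {
        // ... existing per-endpoint cache clearing would go here ...
        cachedTokenMap.set(null); // drop the cached clone so the next miss rebuilds it
    }
}
{code}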



