[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rick Branson updated CASSANDRA-6345:
------------------------------------

    Attachment: 6345-rbranson-v2.txt

Attached a new patch with the deadlock fixed. We're running this on a production cluster. The primary issue was the invalidation callback from TokenMetadata to all of the registered AbstractReplicationStrategy instances. That design was asking for trouble anyway, so in the patch I replaced the "push" invalidation with simple versioning of the TokenMetadata endpoints: TokenMetadata bumps its version number each time the cache would need to be invalidated, and AbstractReplicationStrategy checks that version when it needs to do a read, invalidating its cache if necessary. This moves the invalidation work off the gossip threads and onto the RPC threads, which is probably a good thing. The only part I'm not super crazy about is the extra hot-path read-lock acquisition in TokenMetadata.getEndpointVersion(), which might be avoidable. (A simplified sketch of the versioning scheme is appended at the end of this message.)

> Endpoint cache invalidation causes CPU spike (on vnode rings?)
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-6345
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6345
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: 30 nodes total, 2 DCs
>                      Cassandra 1.2.11
>                      vnodes enabled (256 per node)
>            Reporter: Rick Branson
>            Assignee: Jonathan Ellis
>             Fix For: 1.2.12, 2.0.3
>
>         Attachments: 6345-rbranson-v2.txt, 6345-rbranson.txt, 6345-v2.txt, 6345.txt, half-way-thru-6345-rbranson-patch-applied.png
>
>
> We've observed that events which cause invalidation of the endpoint cache (update keyspace, add/remove nodes, etc.) in AbstractReplicationStrategy result in several seconds of thundering-herd behavior on the entire cluster. A thread dump shows over a hundred threads (I stopped counting at that point) with a backtrace like this:
> at java.net.Inet4Address.getAddress(Inet4Address.java:288)
> at org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106)
> at org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103)
> at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351)
> at java.util.TreeMap.getEntry(TreeMap.java:322)
> at java.util.TreeMap.get(TreeMap.java:255)
> at com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200)
> at com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117)
> at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74)
> at com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273)
> at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74)
> at org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60)
> at org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598)
> at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104)
> at org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671)
> at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375)
> It looks like a large part of the cost is in the TokenMetadata.cloneOnlyTokenMap call that AbstractReplicationStrategy.getNaturalEndpoints makes on every cache miss for an endpoint. This should only impact clusters with large numbers of tokens, so it's probably a vnodes-only issue.
> Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), wrapping it with a lock to prevent stampedes, and clearing it in clearEndpointCache(). Thoughts?

--
This message was sent by Atlassian JIRA
(v6.1#6144)
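
To make the versioning approach described in the comment above concrete, here is a minimal standalone Java sketch. The names used here (RingMetadata, EndpointCache, markChanged, computeNaturalEndpoints) are invented for illustration and are not the actual patch or Cassandra's TokenMetadata/AbstractReplicationStrategy API; the point is only the mechanism: the ring metadata bumps a counter whenever it changes, and each strategy's endpoint cache lazily re-validates against that counter on the read path instead of being invalidated by a callback from the gossip thread.

// Minimal sketch of version-based endpoint-cache invalidation. All names here
// (RingMetadata, EndpointCache, computeNaturalEndpoints) are invented for the
// example and are NOT the actual Cassandra or patch APIs.
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class RingMetadata
{
    private final AtomicLong version = new AtomicLong();

    // Called from the gossip path whenever the ring changes (node added/removed,
    // keyspace updated, etc.). Instead of pushing an invalidation callback to
    // every registered strategy, we only bump a counter.
    public void markChanged()
    {
        version.incrementAndGet();
    }

    public long getVersion()
    {
        return version.get();
    }

    // Stand-in for the expensive cloneOnlyTokenMap()-style recomputation.
    public List<String> computeNaturalEndpoints(String token)
    {
        return Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"); // placeholder
    }
}

// Per-strategy cache that validates itself lazily on the read (RPC) path.
class EndpointCache
{
    private final RingMetadata ring;
    private final Map<String, List<String>> endpoints = new ConcurrentHashMap<>();
    private volatile long cachedVersion = -1;

    public EndpointCache(RingMetadata ring)
    {
        this.ring = ring;
    }

    public List<String> getNaturalEndpoints(String token)
    {
        long current = ring.getVersion();
        if (current != cachedVersion)
        {
            // The ring has moved since we last checked: clear stale entries once,
            // here on the read path, rather than from the gossip thread.
            synchronized (this)
            {
                if (current != cachedVersion)
                {
                    endpoints.clear();
                    cachedVersion = current;
                }
            }
        }
        return endpoints.computeIfAbsent(token, ring::computeNaturalEndpoints);
    }
}

Note that this sketch reads an AtomicLong rather than taking a read lock to fetch the version, which is one way the hot-path lock acquisition mentioned in the comment above might be avoided; whether that is safe in the real code depends on how the version bump is ordered relative to the ring mutation it represents.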