[ https://issues.apache.org/jira/browse/CASSANDRA-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552826#comment-17552826 ]
Brandon Williams commented on CASSANDRA-17691:
----------------------------------------------
If the topology is changing, then examining the peers and their tokens to
reconcile it makes sense.
> Gossip/Decommission tasklock contention on large clusters
> ---------------------------------------------------------
>
> Key: CASSANDRA-17691
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17691
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip, Cluster/Membership
> Reporter: BugFinder
> Priority: Normal
>
> Hi,
> I am a researcher working on finding scalability issues in distributed
> systems. I have been analyzing Cassandra 4.0.0 and found a potential issue
> on the gossip path. The method
> 'org.apache.cassandra.gms.Gossiper.addLocalApplicationStates' (line 1958)
> holds the task lock on a path that can end up invoking getAddressReplicas,
> like this (format is [method][line number]):
> [org.apache.cassandra.gms.Gossiper.addLocalApplicationStates] [1958]
>   Type=EXPLICIT_LOCK, start=1960, end=1970 // Lock being held along these lines
>     [org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal] [1965]
>     [org.apache.cassandra.gms.Gossiper.doOnChangeNotifications] [1950]
>     [org.apache.cassandra.gms.IEndpointStateChangeSubscriber.onChange] [1551]
>     [org.apache.cassandra.service.StorageService.onChange] [1551]
>     [org.apache.cassandra.service.StorageService.handleStateRemoving] [2308]
>     [org.apache.cassandra.service.StorageService.restoreReplicaCount] [2921]
>     [org.apache.cassandra.service.StorageService.getChangedReplicasForLeaving] [3128]
>     [org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas] [3203]
>     [org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas] [284]
>       [line=243, dimensions=[Peers * Tokens]] // Approx. complexity of this loop
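> To make those dimensions concrete: in a hypothetical 1,000-node cluster
> running 256 vnodes per node, a single pass through that loop is on the
> order of 1,000 * 256 = 256,000 iterations, all executed while the task
> lock is held.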
>
> This seems to affect the decommission path, and the complexity is at least
> proportional to the number of tokens and peers in the cluster. When
> decommissioning a node in a cluster with many peers and tokens, this path
> will hold the Gossiper's task lock for a long time, which could cause
> gossip flapping.
> This is likely to affect other 4.x versions too.
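> A minimal, self-contained sketch of the contention pattern (all class,
> field, and method names here are illustrative stand-ins, not the actual
> Cassandra source, and the sizes are hypothetical):
> {code:java}
> import java.util.concurrent.locks.ReentrantLock;
>
> // One thread holds a shared "task lock" across a Peers * Tokens
> // recomputation while a second thread, standing in for a gossip round,
> // blocks on the same lock.
> public class TaskLockContentionDemo
> {
>     static final ReentrantLock taskLock = new ReentrantLock();
>
>     // Stand-in for the replica recomputation: cost grows with the number
>     // of peers and the number of tokens per peer.
>     static long recomputeReplicas(int peers, int tokensPerPeer)
>     {
>         long placements = 0;
>         for (int p = 0; p < peers; p++)
>             for (int t = 0; t < tokensPerPeer; t++)
>                 placements += (p * 31L + t) % 7; // placeholder placement math
>         return placements;
>     }
>
>     public static void main(String[] args) throws InterruptedException
>     {
>         Thread decommission = new Thread(() -> {
>             taskLock.lock();
>             try
>             {
>                 // onChange -> ... -> getAddressReplicas, all under the lock
>                 System.out.println("placements: " + recomputeReplicas(5_000, 256));
>             }
>             finally
>             {
>                 taskLock.unlock();
>             }
>         });
>
>         Thread gossipRound = new Thread(() -> {
>             long start = System.nanoTime();
>             taskLock.lock(); // blocks until the recomputation finishes
>             try
>             {
>                 System.out.printf("gossip round waited %.1f ms for the lock%n",
>                                   (System.nanoTime() - start) / 1e6);
>             }
>             finally
>             {
>                 taskLock.unlock();
>             }
>         });
>
>         decommission.start();
>         Thread.sleep(20); // let the decommission thread take the lock first
>         gossipRound.start();
>         decommission.join();
>         gossipRound.join();
>     }
> }
> {code}
> With larger peer and token counts, the second thread's printed wait time
> grows accordingly; gossip rounds stalling behind the replica recomputation
> is the flapping mechanism described above.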