[ 
https://issues.apache.org/jira/browse/CASSANDRA-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BugFinder updated CASSANDRA-17691:
----------------------------------
    Description: 
Hi,

I am a researcher working on finding scale issues in distributed systems. I 
have been analyzing Cassandra 4.0.0 and found a potential issue on the Gossip 
path. The method 'org.apache.cassandra.gms.Gossiper.addLocalApplicationStates' 
(line 1958) holds the task lock across a call chain that can end in 
getAddressReplicas, as shown below (format is [method][lineNumber]):

[org.apache.cassandra.gms.Gossiper.addLocalApplicationStates] [1958]
{{*Type=EXPLICIT_LOCK, start=1960, end=1970 // Lock being held along these lines*}}
{{ [org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal][1965]}}
{{  [org.apache.cassandra.gms.Gossiper.doOnChangeNotifications][1950]}}
{{    [org.apache.cassandra.gms.IEndpointStateChangeSubscriber.onChange][1551]}}
{{      [org.apache.cassandra.service.StorageService.onChange][1551]}}
{{       [org.apache.cassandra.service.StorageService.handleStateRemoving][2308]}}
{{        [org.apache.cassandra.service.StorageService.restoreReplicaCount][2921]}}
{{         [org.apache.cassandra.service.StorageService.getChangedReplicasForLeaving][3128]}}
{{          [org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas][3203]}}
{{           [org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas][284]}}
{{           *[line=243, dimensions=[Peers * Tokens]] // Approx. Complexity of this loop*}}

 

This appears to affect the decommission path. The cost of the loop depends at 
least on the number of peers and tokens in the cluster, so decommissioning a 
node in a cluster with many peers and tokens can hold the Gossiper's task lock 
for a long time.

This likely affects other 4.x versions too.
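
To make the reported pattern concrete, here is a minimal, self-contained Java 
sketch. It is not Cassandra code: the class TaskLockSketch, the Subscriber 
interface and the method addLocalState are hypothetical stand-ins, and the 
nested loop only imitates the approximate Peers * Tokens cost from the trace 
above. It shows a subscriber callback running while the lock is still held, 
so the lock hold time grows with peers * tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the reported pattern; none of these names are Cassandra's.
class TaskLockSketch {
    // Stand-in for a state-change subscriber that is notified under the lock.
    interface Subscriber {
        void onChange(int peers, int tokensPerPeer);
    }

    private final ReentrantLock taskLock = new ReentrantLock();
    private final List<Subscriber> subscribers = new ArrayList<>();

    long iterationsUnderLock = 0;
    boolean lockHeldDuringCallback = false;

    TaskLockSketch() {
        // Assumed callback: work proportional to peers * tokens, imitating
        // the loop flagged in the trace.
        subscribers.add((peers, tokensPerPeer) -> {
            lockHeldDuringCallback = taskLock.isHeldByCurrentThread();
            for (int p = 0; p < peers; p++) {
                for (int t = 0; t < tokensPerPeer; t++) {
                    iterationsUnderLock++;
                }
            }
        });
    }

    // Notifies subscribers while holding the lock, so every other thread
    // that needs the lock waits for the whole peers * tokens loop.
    void addLocalState(int peers, int tokensPerPeer) {
        taskLock.lock();
        try {
            for (Subscriber s : subscribers) {
                s.onChange(peers, tokensPerPeer);
            }
        } finally {
            taskLock.unlock();
        }
    }

    public static void main(String[] args) {
        TaskLockSketch sketch = new TaskLockSketch();
        sketch.addLocalState(100, 16);
        System.out.println("lock held during callback: " + sketch.lockHeldDuringCallback);
        System.out.println("iterations under lock: " + sketch.iterationsUnderLock);
    }
}
```

In this sketch, any other thread calling addLocalState would block until the 
loop finishes, which is the contention the report is concerned about.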

  was:
Hi,

I am a researcher working on finding scale issues in distributed systems. I 
have been analyzing Cassandra 4.0.0 and found a potential issue on the Gossip 
path. The method 'org.apache.cassandra.gms.Gossiper.addLocalApplicationStates' 
(line 1958) holds the tasklock that could end up in the invocation of 
getAddressRepplicas, like this (format is [method][lineNumber]):

{{[org.apache.cassandra.gms.Gossiper.addLocalApplicationStates]}}
{{*Type=EXPLICIT_LOCK, start=1960, end=1970*}}
{{  [org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal][1965]}}
{{    [org.apache.cassandra.gms.Gossiper.doOnChangeNotifications][1950]}}
{{      [org.apache.cassandra.gms.IEndpointStateChangeSubscriber.onChange][1551]}}
{{        [org.apache.cassandra.service.StorageService.onChange][1551]}}
{{          [org.apache.cassandra.service.StorageService.handleStateRemoving][2308]}}
{{            [org.apache.cassandra.service.StorageService.restoreReplicaCount][2921]}}
{{              [org.apache.cassandra.service.StorageService.getChangedReplicasForLeaving][3128]}}
{{                [org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressReplicas][3203]}}
{{                  [org.apache.cassandra.locator.AbstractReplicationStrategy.{*}getAddressReplicas{*}][284]}}
{{                  *[line=243, dimensions=[Peers * Tokens]]*}}

 

This seems to be affecting decommission path and the complexity is at least 
dependent on the number of tokens and peers in the cluster, thus when 
decommissioning a node with a large number of peers and tokens this path will 
end up holding the Gossiper's task lock for a long time.

This is likely to be affecting other 4.x versions too.


> Gossip/Decommission tasklock contention on large clusters
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-17691
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17691
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip, Cluster/Membership
>            Reporter: BugFinder
>            Priority: Normal


