[ 
https://issues.apache.org/jira/browse/CASSANDRA-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Goodness Ayinmode updated CASSANDRA-19941:
------------------------------------------
    Description: 
To execute the gossip protocol and exchange state info with other nodes, 
_[GossipTask.run()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L321]_
 invokes 
_[doGossipToLiveMember()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L955]_,
 
_[maybeGossipToUnreachableMember()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L964]_
 and 
_[maybeGossipToSeed()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L982]_,
 with all 3 methods invoking 
_[sendGossip()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L933]_
 to send a gossip message to a randomly selected endpoint. Because 
GossipTask.run() calls sendGossip() while holding a lock 
([_taskLock_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L328]),
 the lock is held during network-bound operations, which creates a potential 
synchronization bottleneck. GossipTask.run() also directly calls 
[_waitUntilListening()_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L326]
 which waits for the MessagingService to start listening, so the task could be 
delayed if the messaging service is slow to start or has issues. Also, if 
sendGossip() encounters network-related delays (e.g. network latency, timeouts, 
slow or unresponsive nodes) when there is a large number of nodes, the taskLock 
could be held for longer periods, possibly increasing the risk of a backlog of 
waiting threads (if delays are frequent) and also delaying the scheduling of 
subsequent tasks. 

One potential optimization is to move network operations outside the taskLock, 
so that the lock is released before the time-consuming network operations are 
performed. Before doing so, I would like to know whether the analysis above is 
correct and whether this is worth optimizing.
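
As a rough illustration of the idea, the sketch below contrasts sending while 
the lock is held with sending after it is released. It is not the Gossiper 
code: the class GossipLockSketch, its methods, and the string "digest" message 
are hypothetical stand-ins, and send() only records whether the lock was held 
instead of doing any network I/O.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, NOT Cassandra code: every name here is made up for illustration.
public class GossipLockSketch {
    private final ReentrantLock taskLock = new ReentrantLock();
    private final List<String> liveEndpoints = new ArrayList<>();
    private final Random random = new Random();

    // Records, per send, whether taskLock was held by the sending thread.
    // Only meant for this single-threaded demonstration.
    public final List<Boolean> lockHeldDuringSend = new ArrayList<>();

    public GossipLockSketch(List<String> endpoints) {
        liveEndpoints.addAll(endpoints);
    }

    // Stand-in for a network send; a real send could be slow or time out.
    private void send(String endpoint, String message) {
        lockHeldDuringSend.add(taskLock.isHeldByCurrentThread());
    }

    // Picks a random endpoint, or null when none are known.
    private String pickTarget() {
        if (liveEndpoints.isEmpty()) return null;
        return liveEndpoints.get(random.nextInt(liveEndpoints.size()));
    }

    // Pattern described in the analysis: the send happens inside the critical section.
    public void runHoldingLock() {
        taskLock.lock();
        try {
            String target = pickTarget();
            if (target != null) send(target, "digest");
        } finally {
            taskLock.unlock();
        }
    }

    // Proposed pattern: decide what to send under the lock, then send after releasing it.
    public void runReleasingLock() {
        String target;
        String message;
        taskLock.lock();
        try {
            target = pickTarget();   // shared state is only read under the lock
            message = "digest";
        } finally {
            taskLock.unlock();
        }
        if (target != null) send(target, message);
    }

    public static void main(String[] args) {
        GossipLockSketch sketch = new GossipLockSketch(Arrays.asList("10.0.0.1", "10.0.0.2"));
        sketch.runHoldingLock();
        sketch.runReleasingLock();
        System.out.println(sketch.lockHeldDuringSend); // prints [true, false]
    }
}
```

The second variant assumes that everything needed for the send can be copied 
into local variables while the lock is held; whether that holds for the real 
gossip digests is part of what would need to be checked.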



> Move network operations outside the lock in Gossiper$GossipTask
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-19941
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19941
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Cluster/Gossip
>            Reporter: Goodness Ayinmode
>            Priority: Normal
>


