Goodness Ayinmode created CASSANDRA-19941:
---------------------------------------------
Summary: Move network operations outside the lock in
Gossiper$GossipTask
Key: CASSANDRA-19941
URL: https://issues.apache.org/jira/browse/CASSANDRA-19941
Project: Cassandra
Issue Type: Improvement
Components: Cluster/Gossip
Reporter: Goodness Ayinmode
To execute the gossip protocol and exchange state information with other nodes,
[_GossipTask.run()_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L321]
invokes
[_doGossipToLiveMember_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L955],
[_maybeGossipToUnreachableMember_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L964],
and
[_maybeGossipToSeed_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L982].
All three methods invoke
[_sendGossip()_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L933]
to send a gossip message to a randomly selected endpoint. The interaction
between GossipTask.run() and sendGossip() creates a potential synchronization
bottleneck because the lock
([_taskLock_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L328])
is held during network-bound operations. GossipTask.run() directly calls
[_waitUntilListening()_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L326],
which waits for the MessagingService to start listening, so there could
be delays if the messaging service is slow to start or has issues. Also, if
sendGossip() encounters network-related delays (e.g. network latency, timeouts,
slow or unresponsive nodes) in a cluster with a large number of nodes, taskLock
could be held for longer periods, increasing the risk of a backlog of
waiting threads (if delays are frequent) and delaying the scheduling of
subsequent tasks.
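
The general shape of the proposed change can be sketched as follows. This is only an illustration of the "decide under the lock, send outside the lock" pattern, not Cassandra's actual code: the class GossipLockSketch, the Plan holder, the digest string, and the sender callback are all hypothetical stand-ins, and the real Gossiper state and message types are considerably richer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BiConsumer;

// Hypothetical sketch: none of these names come from Cassandra itself.
final class GossipLockSketch {
    // Stand-in for taskLock: guards the shared gossip state below.
    private final ReentrantLock taskLock = new ReentrantLock();
    private final List<String> liveEndpoints;
    private int generation = 0;

    // Immutable snapshot of everything one round needs in order to send.
    static final class Plan {
        final String target;
        final String digest;

        Plan(String target, String digest) {
            this.target = target;
            this.digest = digest;
        }
    }

    GossipLockSketch(List<String> liveEndpoints) {
        this.liveEndpoints = new ArrayList<>(liveEndpoints);
    }

    // Critical section: touch shared state, pick a random target, build the
    // payload, and return. No network call happens while the lock is held.
    Plan prepareRound() {
        taskLock.lock();
        try {
            if (liveEndpoints.isEmpty()) {
                return null;
            }
            generation++;
            int index = ThreadLocalRandom.current().nextInt(liveEndpoints.size());
            return new Plan(liveEndpoints.get(index), "digest-" + generation);
        } finally {
            taskLock.unlock();
        }
    }

    // One gossip round: the potentially slow send runs after the lock has
    // been released, so a slow peer cannot extend the lock hold time.
    boolean runRound(BiConsumer<String, String> sender) {
        Plan plan = prepareRound();
        if (plan == null) {
            return false;
        }
        sender.accept(plan.target, plan.digest);
        return true;
    }

    // Exposed only so a caller can check that the send happens unlocked.
    boolean lockHeldByCurrentThread() {
        return taskLock.isHeldByCurrentThread();
    }
}
```

In this sketch a caller would pass its own sender, for example `sketch.runRound((target, digest) -> transport.send(target, digest))`, where `transport` is again hypothetical. The trade-off is that the snapshot may be slightly stale by the time it is sent, so any real change would need to confirm that the gossip protocol tolerates that.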