[ https://issues.apache.org/jira/browse/CASSANDRA-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Goodness Ayinmode updated CASSANDRA-19941:
------------------------------------------
Description: To run the gossip protocol and exchange state information with
other nodes,
[_GossipTask.run()_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L321]
invokes
[_doGossipToLiveMember_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L955],
[_maybeGossipToUnreachableMember_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L964],
and
[_maybeGossipToSeed_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L982].
All three methods call
[_sendGossip()_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L933]
to send a gossip message to a randomly selected endpoint. The interaction
between GossipTask.run() and sendGossip() creates a potential synchronization
bottleneck because the lock
([_taskLock_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L328])
is held during network-bound operations. GossipTask.run() also directly calls
[_waitUntilListening()_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L326],
which waits for the MessagingService to start listening, so it can stall if
the messaging service is slow to start or has issues. In addition, if
sendGossip() encounters network-related delays (e.g. network latency, timeouts,
slow or unresponsive nodes) in a cluster with a large number of nodes, the
taskLock could be held for longer periods, increasing the risk of a backlog of
waiting threads (if delays are frequent) and delaying the scheduling of
subsequent tasks.
> Move network operations outside the lock in Gossiper$GossipTask
> ---------------------------------------------------------------
>
> Key: CASSANDRA-19941
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19941
> Project: Cassandra
> Issue Type: Improvement
> Components: Cluster/Gossip
> Reporter: Goodness Ayinmode
> Priority: Normal
>
> To run the gossip protocol and exchange state information with other nodes,
> [_GossipTask.run()_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L321]
> invokes
> [_doGossipToLiveMember_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L955],
> [_maybeGossipToUnreachableMember_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L964],
> and
> [_maybeGossipToSeed_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L982].
> All three methods call
> [_sendGossip()_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L933]
> to send a gossip message to a randomly selected endpoint. The interaction
> between GossipTask.run() and sendGossip() creates a potential
> synchronization bottleneck because the lock
> ([_taskLock_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L328])
> is held during network-bound operations. GossipTask.run() also directly
> calls
> [_waitUntilListening()_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L326],
> which waits for the MessagingService to start listening, so it can stall if
> the messaging service is slow to start or has issues. In addition, if
> sendGossip() encounters network-related delays (e.g. network latency,
> timeouts, slow or unresponsive nodes) in a cluster with a large number of
> nodes, the taskLock could be held for longer periods, increasing the risk
> of a backlog of waiting threads (if delays are frequent) and delaying
> the scheduling of subsequent tasks.
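
The change proposed in the title could look roughly like the sketch below. It is not taken from Gossiper.java: apart from the name taskLock, every identifier here (GossipLockSketch, runRound, the transport callback, the string "digest" message) is a hypothetical stand-in, and the real gossip state is far richer than a list of endpoint strings. The only point illustrated is the ordering: choose the target and build the message while holding the lock, release the lock, and only then perform the potentially slow send.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BiConsumer;

/** Hypothetical sketch; only the name taskLock mirrors the issue text. */
public class GossipLockSketch {
    private final ReentrantLock taskLock = new ReentrantLock();
    private final List<String> liveEndpoints = new ArrayList<>();
    private final Random random = new Random();
    // Stand-in for the network layer: (endpoint, message) -> send.
    private final BiConsumer<String, String> transport;
    private int generation = 0;

    public GossipLockSketch(List<String> endpoints, BiConsumer<String, String> transport) {
        this.liveEndpoints.addAll(endpoints);
        this.transport = transport;
    }

    /** One gossip round: decide under the lock, send after releasing it. */
    public void runRound() {
        String target;
        String message;
        taskLock.lock();
        try {
            if (liveEndpoints.isEmpty())
                return;
            // Only in-memory work happens while the lock is held.
            generation++;
            target = liveEndpoints.get(random.nextInt(liveEndpoints.size()));
            message = "digest-" + generation;
        } finally {
            taskLock.unlock();
        }
        // Potentially slow network work runs here, with the lock released.
        transport.accept(target, message);
    }

    /** Exposed so callers can check that a send is not made under the lock. */
    public boolean isLockHeldByCurrentThread() {
        return taskLock.isHeldByCurrentThread();
    }
}
```

With this ordering, a slow or unresponsive peer delays only the thread doing the send; other threads waiting on taskLock can proceed as soon as the message has been built. Whether the real sendGossip() can be restructured this way depends on which shared state it reads, which this sketch does not model.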
--
This message was sent by Atlassian Jira
(v8.20.10#820010)