[ https://issues.apache.org/jira/browse/CASSANDRA-18543 ]
Cameron Zemek deleted comment on CASSANDRA-18543:
-------------------------------------------
was (Author: cam1982):
[^gossip4.patch]
Here is the patch applied to 4.0.4. I haven't done any testing of this patch
against 4.x.
> Waiting for gossip to settle does not wait for live endpoints
> -------------------------------------------------------------
>
> Key: CASSANDRA-18543
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18543
> Project: Cassandra
> Issue Type: Bug
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: gossip.patch, gossip4.patch
>
>
> When a node starts it gets endpoint states (via the shadow round) but has
> all nodes marked as down. The problem is that the wait to settle only checks
> that the size of endpoint states is stable before starting native transport.
> Once native transport starts, the node receives queries and fails consistency
> levels such as LOCAL_QUORUM because it still thinks nodes are down.
> This is a problem for a number of our customers' large clusters. The cluster
> has quorum, but due to this issue a node restart causes a bunch of query
> errors.
> My initial solution was to check the live endpoints size in addition to the
> size of endpoint states. This worked, but while testing the fix I noticed
> that there is also a lot of duplicated checking of the same node (via Echo
> messages) for liveness. So the patch also removes this duplicate checking
> that a node is UP in markAlive.
> The final problem I found while testing is that sometimes I still could not
> see a change in live endpoints because of the 1 second polling interval, so
> the patch allows overriding the settle parameters. I could not reliably
> reproduce this, but I think it's worth providing a way to override these
> hardcoded values.
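
For readers without the attachments open, here is a rough sketch of the idea in the description above: poll both the endpoint state count and the live endpoint count, and only treat gossip as settled once neither has changed for several consecutive polls. This is not the code from gossip.patch or from Cassandra's Gossiper. The class name, method name, parameters and return convention are all hypothetical, and the two suppliers stand in for whatever the real code uses to read those counts.

```java
import java.util.function.IntSupplier;

/** Illustrative sketch only; every name here is hypothetical, not Cassandra API. */
final class GossipSettleSketch {

    /**
     * Polls until BOTH counts are unchanged for requiredStablePolls consecutive
     * polls. Returns the number of polls taken, or -1 if maxPolls is reached
     * (or the thread is interrupted) before the counts settle.
     */
    static int waitToSettle(IntSupplier endpointStateCount,
                            IntSupplier liveEndpointCount,
                            long pollIntervalMillis,
                            int requiredStablePolls,
                            int maxPolls) {
        int lastStates = endpointStateCount.getAsInt();
        int lastLive = liveEndpointCount.getAsInt();
        int stablePolls = 0;

        for (int poll = 1; poll <= maxPolls; poll++) {
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return -1;
            }

            int states = endpointStateCount.getAsInt();
            int live = liveEndpointCount.getAsInt();

            if (states == lastStates && live == lastLive) {
                // Neither count moved since the previous poll.
                stablePolls++;
                if (stablePolls >= requiredStablePolls) {
                    return poll;
                }
            } else {
                // Something changed (e.g. a node was just marked alive): start over.
                stablePolls = 0;
            }
            lastStates = states;
            lastLive = live;
        }
        return -1;
    }
}
```

Taking the poll interval, the required number of stable polls and the poll cap as arguments mirrors the "override the settle parameters" part of the description. With a longer interval or more required polls, a slow change in live endpoints is less likely to be missed between two polls. Note that a stable live count only means liveness has stopped changing, not that every node has been marked UP.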