[
https://issues.apache.org/jira/browse/CASSANDRA-16238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332737#comment-17332737
]
Brandon Williams commented on CASSANDRA-16238:
----------------------------------------------
Branch [here|https://github.com/driftx/cassandra/tree/CASSANDRA-16238].
[!https://ci-cassandra.apache.org/job/Cassandra-devbranch/714/badge/icon!|https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/714/pipeline]
{noformat}
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:05,381
Gossiper.java:1296 - Node /127.0.0.1:7012 is now part of the cluster
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,381
StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token
[-3074457345618258603]
[junit-timeout] INFO [node4_GossipTasks:1] node4 2021-04-23 16:23:05,393
Gossiper.java:997 - FatClient /127.0.0.1:7012 has been silent for 0ms, removing
from gossip
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,398
StorageService.java:2677 - New node /127.0.0.1:7012 at token
-3074457345618258603
{noformat}
We can see here that .1 is detected for the first time via gossip, and as it is
going through StorageService but before it is added to TokenMetatadata, the
gossiper's status check has begun running. Since the quarantine delay is
overridden to zero, without a presence in TMD the node is not a member yet and
thus deemed a fat client, and removed.
{noformat}
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:05,407
TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:05,414
TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,422
StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token
[-3074457345618258603]
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:05,422
StorageService.java:2733 - Node /127.0.0.1:7012 state jump to NORMAL
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,447
Gossiper.java:1243 - removing expire time for endpoint : /127.0.0.1:7012
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:05,447
Gossiper.java:1244 - InetAddress /127.0.0.1:7012 is now UP
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:05,447
Gossiper.java:579 - removed /127.0.0.1:7012 from seeds, updated seeds list = []
[junit-timeout] WARN 16:23:05 Seeds list is now empty!
[junit-timeout] WARN [node4_GossipStage:1] node4 2021-04-23 16:23:05,447
Gossiper.java:581 - Seeds list is now empty!
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,452
Gossiper.java:590 - removing endpoint /127.0.0.1:7012
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,452
Gossiper.java:561 - evicting /127.0.0.1:7012 from gossip
[junit-timeout] DEBUG [node4_GossipTasks:1] node4 2021-04-23 16:23:06,453
Gossiper.java:1025 - 0 elapsed, /127.0.0.1:7012 gossip quarantine over
{noformat}
Crucially, as part of this removal the node is also removed from the seeds
list, since it is listed there. The warning about the empty seed list is added
from my branch.
{noformat}
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:07,176
Gossiper.java:1296 - Node /127.0.0.1:7012 is now part of the cluster
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,177
StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token
[-3074457345618258603]
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,189
StorageService.java:2677 - New node /127.0.0.1:7012 at token
-3074457345618258603
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:07,198
TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:07,201
TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,208
StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token
[-3074457345618258603]
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:07,208
StorageService.java:2733 - Node /127.0.0.1:7012 state jump to NORMAL
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,220
Gossiper.java:1243 - removing expire time for endpoint : /127.0.0.1:7012
[junit-timeout] INFO [node4_GossipStage:1] node4 2021-04-23 16:23:07,220
Gossiper.java:1244 - InetAddress /127.0.0.1:7012 is now UP
[junit-timeout] DEBUG [node4_BatchlogTasks:1] node4 2021-04-23 16:23:07,383
BatchlogManager.java:246 - Updating batchlog replay throttle to 1024 KB/s, 256
KB/s per endpoint
[junit-timeout] DEBUG [node4_isolatedExecutor:1] node4 2021-04-23 16:23:12,457
Gossiper.java:2142 - Gossip looks settled.
{noformat}
Another round of gossip mostly papers over this problem, except:
{noformat}
[junit-timeout] Unable to contact any seeds: []
[junit-timeout] java.lang.IllegalStateException: Unable to contact any seeds: []
[junit-timeout] at
org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1752)
[junit-timeout] at
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1054)
[junit-timeout] at
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1015)
[junit-timeout] at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:799)
[junit-timeout] at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
[junit-timeout] at
org.apache.cassandra.distributed.impl.Instance.lambda$startup$10(Instance.java:541)
[junit-timeout] at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
[junit-timeout] at
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[junit-timeout] at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[junit-timeout] at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[junit-timeout] at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[junit-timeout] at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}
There's (correctly) no logic to add a removed node back as a seed. This branch
also prints the seeds list when this happens to disambiguate between problems
contacting seeds, and problems like this where the problem is not having any
seeds left to contact.
Ultimately, quarantine delay can't be set this low. I tried low non-zero
values, but it needs to be set to at least one gossiper interval to avoid not
accidentally reaping nodes as fat clients prematurely, and since ring delay was
already being overridden and quarantine delay derived from that, I just removed
the quarantine delay override.
> Fix flaky jvm-dtests that fail with Unable to contact any seeds
> ---------------------------------------------------------------
>
> Key: CASSANDRA-16238
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16238
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: David Capwell
> Assignee: Brandon Williams
> Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: 16238-archived-failures.txt
>
>
> https://app.circleci.com/pipelines/github/dcapwell/cassandra/745/workflows/1c7e589e-b5af-4a56-b40a-43da424602c7/jobs/4231
> {code}
> test teardown failure
> Unexpected error found in node logs (see stdout for full details). Errors:
> [ERROR [main] 2020-10-29 17:38:13,808 CassandraDaemon.java:817 - Exception
> encountered during startup
> java.lang.IllegalStateException: Unable to contact any seeds!
> at
> org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1601)
> at
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:931)
> at
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:892)
> at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:699)
> at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:635)
> at
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:407)
> at
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:671)
> at
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:795),
> ERROR [main] 2020-10-29 17:38:13,808 CassandraDaemon.java:817 - Exception
> encountered during startup
> java.lang.IllegalStateException: Unable to contact any seeds!
> at
> org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1601)
> at
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:931)
> at
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:892)
> at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:699)
> at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:635)
> at
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:407)
> at
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:671)
> at
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:795)]
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]