[ 
https://issues.apache.org/jira/browse/CASSANDRA-16238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332737#comment-17332737
 ] 

Brandon Williams commented on CASSANDRA-16238:
----------------------------------------------

Branch [here|https://github.com/driftx/cassandra/tree/CASSANDRA-16238].

[!https://ci-cassandra.apache.org/job/Cassandra-devbranch/714/badge/icon!|https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/714/pipeline]

{noformat}
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,381 
Gossiper.java:1296 - Node /127.0.0.1:7012 is now part of the cluster
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,381 
StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token 
[-3074457345618258603]
[junit-timeout] INFO  [node4_GossipTasks:1] node4 2021-04-23 16:23:05,393 
Gossiper.java:997 - FatClient /127.0.0.1:7012 has been silent for 0ms, removing 
from gossip
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,398 
StorageService.java:2677 - New node /127.0.0.1:7012 at token 
-3074457345618258603
{noformat}

We can see here that .1 is detected for the first time via gossip, and as it is 
going through StorageService but before it is added to TokenMetatadata, the 
gossiper's status check has begun running.  Since the quarantine delay is 
overridden to zero, without a presence in TMD the node is not a member yet and 
thus deemed a fat client, and removed.

{noformat}
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,407 
TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,414 
TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,422 
StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token 
[-3074457345618258603]
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,422 
StorageService.java:2733 - Node /127.0.0.1:7012 state jump to NORMAL
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,447 
Gossiper.java:1243 - removing expire time for endpoint : /127.0.0.1:7012
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,447 
Gossiper.java:1244 - InetAddress /127.0.0.1:7012 is now UP
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:05,447 
Gossiper.java:579 - removed /127.0.0.1:7012 from seeds, updated seeds list = []
[junit-timeout] WARN  16:23:05 Seeds list is now empty!
[junit-timeout] WARN  [node4_GossipStage:1] node4 2021-04-23 16:23:05,447 
Gossiper.java:581 - Seeds list is now empty!
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,452 
Gossiper.java:590 - removing endpoint /127.0.0.1:7012
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:05,452 
Gossiper.java:561 - evicting /127.0.0.1:7012 from gossip
[junit-timeout] DEBUG [node4_GossipTasks:1] node4 2021-04-23 16:23:06,453 
Gossiper.java:1025 - 0 elapsed, /127.0.0.1:7012 gossip quarantine over
{noformat}

Crucially, as part of this removal the node is also removed from the seeds 
list, since it is listed there. The warning about the empty seed list is added 
from my branch.

{noformat}
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:07,176 
Gossiper.java:1296 - Node /127.0.0.1:7012 is now part of the cluster
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,177 
StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token 
[-3074457345618258603]
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,189 
StorageService.java:2677 - New node /127.0.0.1:7012 at token 
-3074457345618258603
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:07,198 
TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:07,201 
TokenMetadata.java:505 - Updating topology for /127.0.0.1:7012
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,208 
StorageService.java:2730 - Node /127.0.0.1:7012 state NORMAL, token 
[-3074457345618258603]
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:07,208 
StorageService.java:2733 - Node /127.0.0.1:7012 state jump to NORMAL
[junit-timeout] DEBUG [node4_GossipStage:1] node4 2021-04-23 16:23:07,220 
Gossiper.java:1243 - removing expire time for endpoint : /127.0.0.1:7012
[junit-timeout] INFO  [node4_GossipStage:1] node4 2021-04-23 16:23:07,220 
Gossiper.java:1244 - InetAddress /127.0.0.1:7012 is now UP
[junit-timeout] DEBUG [node4_BatchlogTasks:1] node4 2021-04-23 16:23:07,383 
BatchlogManager.java:246 - Updating batchlog replay throttle to 1024 KB/s, 256 
KB/s per endpoint
[junit-timeout] DEBUG [node4_isolatedExecutor:1] node4 2021-04-23 16:23:12,457 
Gossiper.java:2142 - Gossip looks settled.
{noformat}

Another round of gossip mostly papers over this problem, except:

{noformat}
[junit-timeout] Unable to contact any seeds: []
[junit-timeout] java.lang.IllegalStateException: Unable to contact any seeds: []
[junit-timeout]         at 
org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1752)
[junit-timeout]         at 
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1054)
[junit-timeout]         at 
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1015)
[junit-timeout]         at 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:799)
[junit-timeout]         at 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
[junit-timeout]         at 
org.apache.cassandra.distributed.impl.Instance.lambda$startup$10(Instance.java:541)
[junit-timeout]         at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
[junit-timeout]         at 
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[junit-timeout]         at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[junit-timeout]         at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[junit-timeout]         at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[junit-timeout]         at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}

There's (correctly) no logic to add a removed node back as a seed.  This branch 
also prints the seeds list when this happens to disambiguate between problems 
contacting seeds, and problems like this where the problem is not having any 
seeds left to contact.

Ultimately, quarantine delay can't be set this low.  I tried low non-zero 
values, but it needs to be set to at least one gossiper interval to avoid not 
accidentally reaping nodes as fat clients prematurely, and since ring delay was 
already being overridden and quarantine delay derived from that, I just removed 
the quarantine delay override.

> Fix flaky jvm-dtests that fail with Unable to contact any seeds
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-16238
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16238
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: David Capwell
>            Assignee: Brandon Williams
>            Priority: Normal
>             Fix For: 4.0-rc
>
>         Attachments: 16238-archived-failures.txt
>
>
> https://app.circleci.com/pipelines/github/dcapwell/cassandra/745/workflows/1c7e589e-b5af-4a56-b40a-43da424602c7/jobs/4231
> {code}
> test teardown failure
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [main] 2020-10-29 17:38:13,808 CassandraDaemon.java:817 - Exception 
> encountered during startup
> java.lang.IllegalStateException: Unable to contact any seeds!
>       at 
> org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1601)
>       at 
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:931)
>       at 
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:892)
>       at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:699)
>       at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:635)
>       at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:407)
>       at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:671)
>       at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:795), 
> ERROR [main] 2020-10-29 17:38:13,808 CassandraDaemon.java:817 - Exception 
> encountered during startup
> java.lang.IllegalStateException: Unable to contact any seeds!
>       at 
> org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1601)
>       at 
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:931)
>       at 
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:892)
>       at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:699)
>       at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:635)
>       at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:407)
>       at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:671)
>       at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:795)]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to