[ https://issues.apache.org/jira/browse/CASSANDRA-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277716#comment-17277716 ]

Caleb Rackliffe commented on CASSANDRA-16387:
---------------------------------------------

Alright, after digging around even more, I'm starting to wonder how tests that 
start 3 or more nodes (and don't use the GOSSIP feature) could be expected to 
avoid this problem. In {{AbstractCluster#startup()}}, we start the first node 
and any nodes using {{auto_bootstrap}} sequentially. After that, the other 
nodes are started in parallel. This seems to open a whole Pandora's box of 
races in the initialization of the messaging service, gossiper, and migration 
manager.
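
As a rough illustration of that two-phase ordering, here is a toy sketch. It 
is not the actual {{AbstractCluster}} code: {{StartupOrderSketch}} and 
{{ToyNode}} are made-up names, and "starting" a node only records its id. It 
assumes the first node and any {{auto_bootstrap}} nodes go first, one by one, 
and the rest are handed to a thread pool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class StartupOrderSketch {
    /** Hypothetical stand-in for a cluster instance; not the in-JVM dtest Instance. */
    static final class ToyNode {
        final int id;
        final boolean autoBootstrap;

        ToyNode(int id, boolean autoBootstrap) {
            this.id = id;
            this.autoBootstrap = autoBootstrap;
        }
    }

    /** Returns node ids in the order in which their (toy) startup ran. */
    static List<Integer> startup(List<ToyNode> nodes) {
        List<Integer> startOrder = Collections.synchronizedList(new ArrayList<Integer>());
        List<ToyNode> deferred = new ArrayList<>();

        // Phase 1: the first node and any auto_bootstrap nodes start one by one.
        for (int i = 0; i < nodes.size(); i++) {
            ToyNode node = nodes.get(i);
            if (i == 0 || node.autoBootstrap) {
                startOrder.add(node.id);
            } else {
                deferred.add(node);
            }
        }

        // Phase 2: everything else starts concurrently, so the relative order
        // of these nodes is not fixed from run to run.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, deferred.size()));
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (ToyNode node : deferred) {
                pending.add(pool.submit(() -> { startOrder.add(node.id); }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every parallel startup to finish
            }
        } catch (Exception e) {
            throw new IllegalStateException("toy startup failed", e);
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(startOrder);
    }

    public static void main(String[] args) {
        List<ToyNode> nodes = Arrays.asList(
            new ToyNode(1, false), new ToyNode(2, false), new ToyNode(3, false));
        System.out.println("start order: " + startup(nodes));
    }
}
```

With three nodes and no {{auto_bootstrap}}, node 1 always comes first, while 
nodes 2 and 3 can appear in either order.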

In this case nodes 2 and 3 start up in parallel. By default, 
{{UpgradeTestBase}} doesn't use {{auto_bootstrap}}, and it doesn't use the 
GOSSIP or NETWORK features. For the 3.0 {{Instance}} this means we rely on 
{{Gossiper#initializeNodeUnsafe()}} to initialize the endpoint state map entry 
for the local node and then immediately treat all other nodes as being live via 
a call to {{Gossiper#realMarkAlive()}}. Importantly, this happens after mock 
messaging is set up. Lastly, a call to {{StorageService#ensureTraceKeyspace()}} 
makes sure the traces keyspace is created and _pushes the schema change to 
other live nodes_. The problem is that, because every other node has already 
been marked live, all of them, even the ones that haven't finished calling 
{{Gossiper#initializeNodeUnsafe()}}, will get a {{DEFINITIONS_UPDATE}} message 
and try to access their own gossip state in the endpoint state map, where the 
entry doesn't exist yet.
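
To make the ordering problem concrete, here is a toy model of it. This is a 
sketch under simplifying assumptions, not Cassandra code: 
{{EndpointStateRaceSketch}}, {{ToyNode}}, {{initializeLocalState()}} and 
{{onSchemaPush()}} are made-up names, the map holds plain strings, and a 
missing entry is reported by returning {{false}} rather than by tripping an 
assertion. The two orderings are forced by hand rather than raced, so the 
result is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class EndpointStateRaceSketch {
    /** Hypothetical stand-in for one node's gossip state; not the real Gossiper. */
    static final class ToyNode {
        final String address;
        final Map<String, String> endpointStateMap = new ConcurrentHashMap<>();

        ToyNode(String address) {
            this.address = address;
        }

        // Plays the role of Gossiper#initializeNodeUnsafe(): creates the
        // endpoint state entry for the local node.
        void initializeLocalState() {
            endpointStateMap.put(address, "schema=none");
        }

        // Plays the role of handling a pushed schema change: the node must
        // update its own entry, which only works if that entry exists.
        boolean onSchemaPush(String newVersion) {
            if (!endpointStateMap.containsKey(address)) {
                return false; // the real code hits an AssertionError instead
            }
            endpointStateMap.put(address, "schema=" + newVersion);
            return true;
        }
    }

    /** The push arrives before the local entry exists (the failing order). */
    static boolean pushBeforeInit() {
        ToyNode node3 = new ToyNode("127.0.0.3");
        boolean applied = node3.onSchemaPush("v1");
        node3.initializeLocalState(); // initialized, but too late for the push
        return applied;
    }

    /** The local entry exists before the push arrives (the safe order). */
    static boolean pushAfterInit() {
        ToyNode node3 = new ToyNode("127.0.0.3");
        node3.initializeLocalState();
        return node3.onSchemaPush("v1");
    }

    public static void main(String[] args) {
        System.out.println("push before init applied: " + pushBeforeInit()); // false
        System.out.println("push after init applied:  " + pushAfterInit());  // true
    }
}
```

In the real cluster nothing enforces the second ordering for nodes started in 
parallel, which would be consistent with the failure only showing up 
sporadically.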

With some extra debugging and stack traces, this is what the progression looks 
like in the logs:

1.) Node 2 creates the traces keyspace, announces it, and pushes the schema 
change mutations.

{noformat}
INFO  [node2_isolatedExecutor:3] node2 2021-02-02 19:33:54,101 StorageService.java:1293 - NORMAL
java.lang.RuntimeException
        at org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:433)
        at org.apache.cassandra.service.MigrationManager.announceGlobally(MigrationManager.java:419)
        at java.util.Optional.ifPresent(Optional.java:159)
        at org.apache.cassandra.service.StorageService.ensureTraceKeyspace(StorageService.java:1080)
        at org.apache.cassandra.distributed.impl.Instance.lambda$startup$7(Instance.java:609)
{noformat}

2.) Node 3 gets the message.

{noformat}
INFO  [node3_MigrationStage:1] node3 2021-02-02 19:33:54,119 DefinitionsUpdateVerbHandler.java:48 - Received schema mutation push from /127.0.0.2 for keyspaces [system_schema]
{noformat}

3.) Node 3 tries to update its own schema version, but can't since its own 
endpoint state is very much missing.

{noformat}
DEBUG [node3_MigrationStage:1] node3 2021-02-02 19:33:54,186 Schema.java:485 - Adding org.apache.cassandra.config.CFMetaData@3262feed[cfId=c5e99f16-8677-3914-b17e-960613512345,ksName=system_traces
Exception null occurred on thread node3_MigrationStage:1
java.lang.AssertionError
        at org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal(Gossiper.java:1555)
        at org.apache.cassandra.gms.Gossiper.addLocalApplicationStates(Gossiper.java:1579)
        at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1569)
        at org.apache.cassandra.service.MigrationManager.passiveAnnounce(MigrationManager.java:479)
        at org.apache.cassandra.config.Schema.updateVersionAndAnnounce(Schema.java:600)
        at org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1336)
        at org.apache.cassandra.db.DefinitionsUpdateVerbHandler$1.runMayThrow(DefinitionsUpdateVerbHandler.java:54)
{noformat}

4.) Node 3 finally gets initialized, but it's too late.

{noformat}
INFO  [node3_GossipStage:1] node3 2021-02-02 19:33:54,304 Gossiper.java:1633 - Initializing state for /127.0.0.3 on /127.0.0.3
java.lang.RuntimeException
        at org.apache.cassandra.gms.Gossiper.initializeNodeUnsafe(Gossiper.java:1634)
        at org.apache.cassandra.distributed.impl.Instance.lambda$addToRing$8(Instance.java:666)
{noformat}


> UpgradeTest sporadically failing on schema updates
> --------------------------------------------------
>
>                 Key: CASSANDRA-16387
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16387
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.0-rc
>
>
> We’ve observed {{UpgradeTest}} failing during what appears to be a schema 
> change:
> https://app.circleci.com/pipelines/github/maedhroz/cassandra/192/workflows/ed5305e6-e4f9-420e-9f0a-6153333746dc/jobs/1068
> It almost looks like the Gossiper can’t find its own endpoint state in the 
> endpoint state map, and the failure is not consistent, which might suggest a 
> race.


