[
https://issues.apache.org/jira/browse/CASSANDRA-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277716#comment-17277716
]
Caleb Rackliffe commented on CASSANDRA-16387:
---------------------------------------------
Alright, after digging around even more, I'm starting to wonder how tests that
start 3 or more nodes (and don't use the GOSSIP feature) could be expected to
avoid this problem. In {{AbstractCluster#startup()}}, we start the first node
and any nodes using {{auto_bootstrap}} sequentially, one at a time. After that,
the other nodes are started in parallel. This seems to open a whole Pandora's
box of races in the initialization of the messaging service, gossiper, and
migration manager.
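To make the ordering concrete, here is a minimal sketch of the startup shape described above. It is not the actual {{AbstractCluster#startup()}} code: {{ToyNode}}, {{StartupPlan}}, {{plan()}} and {{startup()}} are hypothetical names, and "starting" a node is reduced to recording its id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

public class StartupOrderSketch {
    // Hypothetical stand-in for a cluster node; not a real in-JVM dtest class.
    static final class ToyNode {
        final int id;
        final boolean autoBootstrap;

        ToyNode(int id, boolean autoBootstrap) {
            this.id = id;
            this.autoBootstrap = autoBootstrap;
        }
    }

    // Which node ids start one at a time, and which start concurrently.
    static final class StartupPlan {
        final List<Integer> sequential = new ArrayList<>();
        final List<Integer> parallel = new ArrayList<>();
    }

    // The first node and any auto_bootstrap node go in the sequential phase;
    // everything else is left for the parallel phase.
    static StartupPlan plan(List<ToyNode> nodes) {
        StartupPlan plan = new StartupPlan();
        for (int i = 0; i < nodes.size(); i++) {
            ToyNode node = nodes.get(i);
            if (i == 0 || node.autoBootstrap)
                plan.sequential.add(node.id);
            else
                plan.parallel.add(node.id);
        }
        return plan;
    }

    // Runs the sequential phase, then launches the remaining nodes together.
    // Nothing orders the parallel nodes relative to each other.
    static List<Integer> startup(List<ToyNode> nodes) {
        StartupPlan plan = plan(nodes);
        List<Integer> started = new CopyOnWriteArrayList<>();
        for (int id : plan.sequential)
            started.add(id);

        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (int id : plan.parallel)
            futures.add(CompletableFuture.runAsync(() -> started.add(id)));
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return started;
    }

    public static void main(String[] args) {
        List<ToyNode> nodes = new ArrayList<>();
        for (int id = 1; id <= 3; id++)
            nodes.add(new ToyNode(id, false));
        StartupPlan plan = plan(nodes);
        System.out.println("sequential: " + plan.sequential + ", parallel: " + plan.parallel);
    }
}
```

With three nodes and no {{auto_bootstrap}}, only node 1 lands in the sequential phase, so nodes 2 and 3 initialize concurrently with no ordering between them.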
In this case nodes 2 and 3 start up in parallel. By default,
{{UpgradeTestBase}} doesn't use {{auto_bootstrap}}, and it doesn't use the
GOSSIP or NETWORK features. For the 3.0 {{Instance}} this means we rely on
{{Gossiper#initializeNodeUnsafe()}} to initialize the endpoint state map entry
for the local node and then immediately treat all other nodes as being live via
a call to {{Gossiper#realMarkAlive()}}. Importantly, this happens after mock
messaging is set up. Lastly, a call to {{StorageService#ensureTraceKeyspace()}}
makes sure the traces keyspace is created and _pushes the schema change to
other live nodes_. The problem is that this means all nodes in the cluster, even
the ones that haven't finished calling {{Gossiper#initializeNodeUnsafe()}}, will
get a {{DEFINITIONS_UPDATE}} message and try to access the local node's gossip
state in an endpoint state map entry that doesn't exist yet.
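The failure mode can be reproduced deterministically in a toy model. This is only a sketch under assumptions: {{ToyGossiper}} and its methods are hypothetical stand-ins, not the real {{Gossiper}} API, and the real code fails with an {{AssertionError}} rather than the {{IllegalStateException}} used here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class EndpointStateRaceSketch {
    // Hypothetical stand-in for the gossiper: maps an address to a schema version.
    static final class ToyGossiper {
        private final Map<String, String> endpointState = new ConcurrentHashMap<>();
        private final String localAddress;

        ToyGossiper(String localAddress) {
            this.localAddress = localAddress;
        }

        // Plays the role of initializeNodeUnsafe(): creates the local entry.
        void initializeLocalState() {
            endpointState.put(localAddress, "initial");
        }

        // Plays the role of handling a schema push: updates the local schema
        // version, which requires the local entry to exist already.
        void onSchemaPush(String newVersion) {
            if (!endpointState.containsKey(localAddress))
                throw new IllegalStateException("no local endpoint state for " + localAddress);
            endpointState.put(localAddress, newVersion);
        }

        String localVersion() {
            return endpointState.get(localAddress);
        }
    }

    // The bad interleaving: the push arrives before local state exists.
    static boolean pushBeforeInitFails() {
        ToyGossiper node3 = new ToyGossiper("127.0.0.3");
        try {
            node3.onSchemaPush("v2");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // The good interleaving: local state is created first, then the push lands.
    static String pushAfterInit() {
        ToyGossiper node3 = new ToyGossiper("127.0.0.3");
        node3.initializeLocalState();
        node3.onSchemaPush("v2");
        return node3.localVersion();
    }

    public static void main(String[] args) {
        System.out.println("push before init fails: " + pushBeforeInitFails());
        System.out.println("version after init then push: " + pushAfterInit());
    }
}
```

Which of the two interleavings a real test run hits depends on how far the receiving node has progressed when the push arrives, which would be consistent with the failure being sporadic.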
With some extra debugging and stack traces, this is what the progression looks
like in the logs:
1.) Node 2 creates the traces keyspace, announces it, and pushes the schema
change mutations.
{noformat}
INFO [node2_isolatedExecutor:3] node2 2021-02-02 19:33:54,101 StorageService.java:1293 - NORMAL
java.lang.RuntimeException
    at org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:433)
    at org.apache.cassandra.service.MigrationManager.announceGlobally(MigrationManager.java:419)
    at java.util.Optional.ifPresent(Optional.java:159)
    at org.apache.cassandra.service.StorageService.ensureTraceKeyspace(StorageService.java:1080)
    at org.apache.cassandra.distributed.impl.Instance.lambda$startup$7(Instance.java:609)
{noformat}
2.) Node 3 gets the message.
{noformat}
INFO [node3_MigrationStage:1] node3 2021-02-02 19:33:54,119 DefinitionsUpdateVerbHandler.java:48 - Received schema mutation push from /127.0.0.2 for keyspaces [system_schema]
{noformat}
3.) Node 3 tries to update its own schema version, but can't since its own
endpoint state is very much missing.
{noformat}
DEBUG [node3_MigrationStage:1] node3 2021-02-02 19:33:54,186 Schema.java:485 - Adding org.apache.cassandra.config.CFMetaData@3262feed[cfId=c5e99f16-8677-3914-b17e-960613512345,ksName=system_traces
Exception null occurred on thread node3_MigrationStage:1
java.lang.AssertionError
    at org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal(Gossiper.java:1555)
    at org.apache.cassandra.gms.Gossiper.addLocalApplicationStates(Gossiper.java:1579)
    at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1569)
    at org.apache.cassandra.service.MigrationManager.passiveAnnounce(MigrationManager.java:479)
    at org.apache.cassandra.config.Schema.updateVersionAndAnnounce(Schema.java:600)
    at org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1336)
    at org.apache.cassandra.db.DefinitionsUpdateVerbHandler$1.runMayThrow(DefinitionsUpdateVerbHandler.java:54)
{noformat}
4.) Node 3 finally gets initialized, but it's too late.
{noformat}
INFO [node3_GossipStage:1] node3 2021-02-02 19:33:54,304 Gossiper.java:1633 - Initializing state for /127.0.0.3 on /127.0.0.3
java.lang.RuntimeException
    at org.apache.cassandra.gms.Gossiper.initializeNodeUnsafe(Gossiper.java:1634)
    at org.apache.cassandra.distributed.impl.Instance.lambda$addToRing$8(Instance.java:666)
{noformat}
> UpgradeTest sporadically failing on schema updates
> --------------------------------------------------
>
> Key: CASSANDRA-16387
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16387
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: Caleb Rackliffe
> Assignee: Caleb Rackliffe
> Priority: Normal
> Fix For: 4.0-rc
>
>
> We’ve observed {{UpgradeTest}} failing during what appears to be a schema
> change:
> https://app.circleci.com/pipelines/github/maedhroz/cassandra/192/workflows/ed5305e6-e4f9-420e-9f0a-6153333746dc/jobs/1068
> It almost looks like the Gossiper can’t find its own endpoint state in the
> endpoint state map, and the failure is not consistent, which might suggest a
> race.