[
https://issues.apache.org/jira/browse/CASSANDRA-15155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486218#comment-17486218
]
Jai Bheemsen Rao Dhanwada commented on CASSANDRA-15155:
-------------------------------------------------------
[~cowardlydragon] just curious, were able to narrow down this issue to see why
it happened?
> Bootstrapping node finishes 'successfully' before schema synced, data not
> streamed
> ----------------------------------------------------------------------------------
>
> Key: CASSANDRA-15155
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15155
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Membership, Consistency/Bootstrap and
> Decommission
> Reporter: Constance Eustace
> Priority: Normal
> Attachments: debug.log.zip
>
>
> Bootstrap of a new node to expand an existing cluster is completing the
> bootstrapping "successfully", joining the cluster as UN in nodetool status,
> but has no data and no active streams. Writes and reads start being served.
> Environment: AWS, IPV6, three datacenters, asia / europe / us
> Version: 2.2.13
> We have previously scaled the europe and us datacenters from 5 nodes to 25
> nodes (one node at a time) without incident.
> Asia (tokyo) is a different story. We have seen multiple failure scenarios,
> but the most troubling is a string of attempted node bootstrappings where the
> bootstrap completes and the node joins the ring, but there is no data.
> We were able to expand Asia by four nodes by increasing ring delay to 100
> seconds, but that has not worked anymore.
> Attached Log: Our autoscaling + Ansible initial setup creates the node, but
> the ansible has not run yet, so the autostarted cassandra fails to load, but
> it has no security group yet so it did not communicate with any other node.
> That is the 15:15:XX series log messages at the very top.
> Then 15:20:XX series messages begin after ansible has completed setup of the
> box, and the data dirs and commit log dirs have been scrubbed.
> This same process ran for EU and US expansions without incident.
> From what I can tell of the log (DEBUG was enabled):
> Ring information collection begins, so some sort of gossip/cluster
> communication is healthy:
> INFO [main] 2019-06-12 15:20:05,468 StorageService.java:1142 - JOINING:
> waiting for ring information
> But almost all of those checks output:
> DEBUG [GossipStage:1] 2019-06-12 15:20:07,673 MigrationManager.java:96 - Not
> pulling schema because versions match or shouldPullSchemaFrom returned false
> Which seems weird, as we shall see soon. After all the nodes have reported in
> a similar way, most not pulling because of the above message, and a couple
> that were interpreted as DOWN, it then does:
> INFO [main] 2019-06-12 15:21:45,486 StorageService.java:1142 - JOINING:
> schema complete, ready to bootstrap
> INFO [main] 2019-06-12 15:21:45,487 StorageService.java:1142 - JOINING:
> waiting for pending range calculation
> INFO [main] 2019-06-12 15:21:45,487 StorageService.java:1142 - JOINING:
> calculation complete, ready to bootstrap
> We then get a huge number of
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
> cfId=dd5d7fa0-1e50-11e9-a62d-0d41c97b2404
> INFO [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING:
> Starting to bootstrap...
> INFO [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream
> #05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for Bootstrap
> INFO [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap
> completed! for the tokens [-7314981925085449175, ...<tokens>...
> 5499447097629838103]
> Here are the log messages for MIgrationManager for schema gossiping:
> DEBUG [main] 2019-06-12 15:20:05,423 MigrationManager.java:493 - Gossiping my
> schema version 59adb24e-f3cd-3e02-97f0-5b395827453f
> DEBUG [MigrationStage:1] 2019-06-12 15:23:25,694 MigrationManager.java:493 -
> Gossiping my schema version 3d1a9d9e-1120-37ae-abe0-e064cd147a99
> DEBUG [MigrationStage:1] 2019-06-12 15:23:25,775 MigrationManager.java:493 -
> Gossiping my schema version 0bf74f5a-4b39-3525-b217-e9ccf7a1b6cb
> DEBUG [MigrationStage:1] 2019-06-12 15:23:25,905 MigrationManager.java:493 -
> Gossiping my schema version b145475a-02dc-370c-8af7-a9aba2d61362
> DEBUG [InternalResponseStage:12] 2019-06-12 15:24:26,445
> MigrationManager.java:493 - Gossiping my schema version
> 9c2ed14a-8db5-39b3-af48-6cdb5463c772
> the schema version ending in -6cdb5463c772 is the proper version in the other
> nodes per gossipinfo. But as can be seen, the bootstrap completion message
> (15:23:25,526) is logged before four or five intermediate schema versions are
> created, which seem to be due to system_distributed and other keyspaces being
> created.
> The Bootstrap completed! message comes from a Future in StorageService line
> 1199 for cassandra 2.2.13 source:
> {code:java}
> ListenableFuture<StreamState> bootstrapStream =
> bootstrapper.bootstrap(streamStateStore, !replacing && useStrictConsistency);
> // handles token update Futures.addCallback(bootstrapStream, new
> FutureCallback<StreamState>() { @Override public void onSuccess(StreamState
> streamState) { isBootstrapMode = false; logger.info("Bootstrap completed! for
> the tokens {}", tokens); }
> {code}
> That future seems to complete a bit too quickly. Perhaps the schema did not
> synchronize quickly enough and therefore a list of migration/bootstrap tasks
> not created in time for the bootstrapper future to handle. We suspect the AWS
> asia datacenter networking/WAN links are very slow, or IPV6 is introducing
> other delays.
> But the code should not allow a bootstrap to silently/falsely complete
> "successfully".
> Smells like a race condition.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]