[
https://issues.apache.org/jira/browse/CASSANDRA-13569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061094#comment-16061094
]
Michael Fong commented on CASSANDRA-13569:
------------------------------------------
Hi, [[email protected]]
I agree with you that even a ScheduledExecutor on MigrationTask would fail in
rare cases.
In CASSANDRA-11748, we patched our own v2.0 source code with a similar idea
that limits schema pulls to once per endpoint. However, we later observed a
corner case: when two nodes with different schema versions boot up at the same
time, one node may run slightly faster (by a few seconds) than the other. The
first node requests a schema pull, which fails because the other node has not
yet finished initialization.
There is a huge difference between the v2.0 and 3.x code bases, and I do not
know if this corner case still persists. Here is the problematic code snippet
for your reference.
{code:java}
if (epState == null) {
{code} would probably not prevent this. In your patch, if the ScheduledFuture
reports its state as done, things could get much messier, since the schema
migration would never happen.
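To make the concern concrete, here is a minimal sketch of a once-per-endpoint guard that releases its entry even when the pull fails, so a later gossip round can schedule a retry. Everything in it is an assumption for illustration: SchemaPullScheduler, maybeSchedule, isPending and shutdown are hypothetical names, endpoints are plain strings, and none of it is taken from the actual patch or from MigrationManager.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not Cassandra code): at most one pending pull per endpoint.
class SchemaPullScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    // Endpoints that currently have a pull waiting for, or in, execution.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final long delayMs;

    SchemaPullScheduler(long delayMs) {
        this.delayMs = delayMs;
    }

    /** Returns true if a pull was scheduled, false if one is already pending. */
    boolean maybeSchedule(String endpoint, Runnable pull) {
        if (!pending.add(endpoint)) {
            return false; // a task for this endpoint is already waiting
        }
        try {
            executor.schedule(() -> {
                try {
                    pull.run();
                } finally {
                    // Release the slot on success AND on failure, so a pull
                    // against a peer that was not ready can be retried later.
                    pending.remove(endpoint);
                }
            }, delayMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (RejectedExecutionException e) {
            pending.remove(endpoint); // do not leak the slot if scheduling fails
            throw e;
        }
    }

    boolean isPending(String endpoint) {
        return pending.contains(endpoint);
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

The sketch deliberately does not inspect the state of a ScheduledFuture; the slot is tied to the task body instead, so a finished-but-failed task cannot block later pulls for that endpoint.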
Sincerely,
Michael Fong
> Schedule schema pulls just once per endpoint
> --------------------------------------------
>
> Key: CASSANDRA-13569
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13569
> Project: Cassandra
> Issue Type: Improvement
> Components: Distributed Metadata
> Reporter: Stefan Podkowinski
> Assignee: Stefan Podkowinski
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Schema mismatches detected through gossip will get resolved by calling
> {{MigrationManager.maybeScheduleSchemaPull}}. This method may decide to
> schedule execution of {{MigrationTask}}, but only after using a
> {{MIGRATION_DELAY_IN_MS = 60000}} delay (for reasons unclear to me).
> Meanwhile, as long as the migration task hasn't been executed, we'll continue
> to have schema mismatches reported by gossip and will have corresponding
> {{maybeScheduleSchemaPull}} calls, which will schedule further tasks with the
> mentioned delay. Some local testing shows that dozens of tasks for the same
> endpoint will eventually be executed, causing the same stormy behavior
> for this very endpoint.
> My proposal would be to simply not schedule new tasks for the same endpoint
> if we still have pending tasks waiting for execution after
> {{MIGRATION_DELAY_IN_MS}}.