[
https://issues.apache.org/jira/browse/CASSANDRA-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507563#comment-17507563
]
Stefan Miklosovic commented on CASSANDRA-15355:
-----------------------------------------------
[~jebaker] is this still relevant?
> Schema push/pull race on continuous schema changes
> --------------------------------------------------
>
> Key: CASSANDRA-15355
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15355
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: James Baker
> Priority: Normal
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/CASSANDRA-5025, pull-based schema
> updates were scheduled 1 minute after the schema change was first visible, so
> as to prefer the push codepath as much as possible.
> Unfortunately, this does not handle the case where many schema changes happen
> in quick succession. Imagine a scenario where we create a table every 5
> seconds for 2 minutes: the first update tasks execute 60 seconds in, and the
> schemas may well still be out of sync between nodes at that point.
> In this case, there is some chance that when the task runs, the schemas are
> out of sync because a subsequent schema update has occurred, and so the same
> push/pull race has happened.
> A fix is to modify the codepath such that the scheduled task is only run if
> the other node's schema version is the same as when the task was scheduled. A
> different (later scheduled) task should run otherwise.
> For us, what we see is that when we have a reasonably large number of
> changes, a few schema changes can have the unfortunate outcome of causing our
> nodes to run out of memory and crash. If we have a 30-node cluster, create a
> table every second for 2 minutes, and for some reason pause for 10 seconds
> after 60 seconds with no progress, we can easily end up concurrently running
> 300 schema pulls for a single node. These can pile up further, which causes
> cascading failures. This change stops that.
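
For reference, the fix proposed in the description (run a scheduled pull only
if the other node's schema version is unchanged since scheduling) could be
sketched roughly as below. This is a minimal illustration and not Cassandra's
actual code: the class SchemaPullSketch, its methods, and its counters are all
hypothetical, and a plain queue stands in for the delayed executor so the
behaviour can be followed without real timers.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.UUID;

// Hypothetical sketch; none of these names are Cassandra's real classes.
final class SchemaPullSketch {
    // Latest schema version each peer has advertised (e.g. learned via gossip).
    private final Map<String, UUID> advertised = new HashMap<>();
    // Stand-in for the delayed executor: tasks wait here until the delay elapses.
    private final Queue<Runnable> delayed = new ArrayDeque<>();
    int pullsSent = 0;
    int pullsSkipped = 0;

    // Called whenever a peer advertises a schema version that differs from ours.
    void onVersionChange(String peer, UUID version) {
        advertised.put(peer, version);
        final UUID versionAtScheduling = version;
        delayed.add(() -> {
            // Run only if the peer still advertises the version we scheduled for.
            if (versionAtScheduling.equals(advertised.get(peer))) {
                pullsSent++;    // a real implementation would send the pull here
            } else {
                pullsSkipped++; // a later-scheduled task covers the newer version
            }
        });
    }

    // Simulates the delay elapsing for every pending task.
    void runDelayedTasks() {
        Runnable task;
        while ((task = delayed.poll()) != null) {
            task.run();
        }
    }
}
```

In this sketch, five successive version changes from one peer followed by
runDelayedTasks() result in a single pull: only the task scheduled for the
most recent version passes the check, and the four older tasks are skipped.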