[ https://issues.apache.org/jira/browse/CASSANDRA-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Jirsa updated CASSANDRA-13441:
-----------------------------------
    Description: 
In versions < 3.0, during a rolling upgrade (say 2.0 -> 2.1), the first node to 
upgrade to 2.1 would add the new tables, setting the new 2.1 version ID, and 
subsequently upgraded hosts would settle on that version.

When a 3.0 node upgrades and writes its own new-in-3.0 system tables, it'll 
write the same tables that exist in the schema with brand new timestamps. As 
written, this will cause all nodes in the cluster to change schema (to the 
version with the newest timestamp). On a sufficiently large cluster with a 
non-trivial schema, this could cause (literally) millions of migration tasks to 
needlessly bounce across the cluster.



  was:
In versions < 3.0, schema was essentially deterministic - a given schema always 
hashed to the same version, so during a rolling upgrade (say 2.0 -> 2.1), the 
first node to upgrade to 2.1 would add the new tables, setting the new 2.1 
version ID, and subsequently upgraded hosts would settle on that version.

In 3.0, we delegate the digest calculation to the post-8099 data structures, 
which are the same digest calculators used in the read path for digest 
match/mismatch - which means it includes timestamps (and ttls).
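
To make the difference concrete, here is a minimal sketch (not Cassandra's 
actual code) of a content-only digest versus one that also folds in write 
timestamps. The row layout, the table and column names, the helper names, and 
the use of SHA-256 are all assumptions chosen for illustration.

```python
import hashlib

# Hypothetical schema rows: (table, column, value, write_timestamp_micros).
# This layout is an illustration, not Cassandra's real schema storage format.

def content_digest(rows):
    """Digest over names and values only; write timestamps are ignored."""
    h = hashlib.sha256()  # stand-in hash, chosen only for the sketch
    for table, column, value, _ts in sorted(rows):
        h.update(f"{table}|{column}|{value}".encode())
    return h.hexdigest()

def timestamped_digest(rows):
    """Digest that also covers the write timestamp of every cell."""
    h = hashlib.sha256()
    for table, column, value, ts in sorted(rows):
        h.update(f"{table}|{column}|{value}|{ts}".encode())
    return h.hexdigest()

# Two nodes hold identical schema content, written at different times.
node_a = [("ks.example_table", "comment", "same definition", 1000)]
node_b = [("ks.example_table", "comment", "same definition", 2000)]

same_content = content_digest(node_a) == content_digest(node_b)
same_with_ts = timestamped_digest(node_a) == timestamped_digest(node_b)
print(same_content, same_with_ts)
```

With the content-only digest the two nodes agree on a version; once timestamps 
are included, identical schema written at different times yields different 
versions, which is the mismatch described here.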

Since schema will never use TTL, we don't care about TTL fields. Similarly, 
when a 3.0 node upgrades and writes its own new-in-3.0 system tables, it'll 
write the same tables that exist in the schema with brand new timestamps. As 
written, this will cause all nodes in the cluster to change schema (to the 
version with the newest timestamp), and then change a second time as the 
non-system schema is propagated to the newly upgraded nodes.

On a sufficiently large cluster with a non-trivial schema, this could cause 
(literally) millions of migration tasks to needlessly bounce across the cluster.

Up for discussion: if we fix this in 3.0 (say 3.0.X where X >= 14), then any 
3.0 node below that version will always mismatch and cause the ping-ponging 
described in CASSANDRA-11050. However, if we don't fix it, we create a 
situation that's potentially an outage on rolling upgrade. I'm leaning towards 
a strong warning in NEWS about the right way to upgrade, and fixing it in 4.x, 
but wouldn't mind hearing opinions from [~slebresne], [~iamaleksey], and 
[~amorton], since you three already talked about this on CASSANDRA-11050.


> Schema version uses built-in digest which includes timestamps, causing 
> migration storms
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13441
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13441
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Schema
>            Reporter: Jeff Jirsa
>            Assignee: Jeff Jirsa
>             Fix For: 4.x
>
>
> In versions < 3.0, during a rolling upgrade (say 2.0 -> 2.1), the first node 
> to upgrade to 2.1 would add the new tables, setting the new 2.1 version ID, 
> and subsequently upgraded hosts would settle on that version.
> When a 3.0 node upgrades and writes its own new-in-3.0 system tables, it'll 
> write the same tables that exist in the schema with brand new timestamps. As 
> written, this will cause all nodes in the cluster to change schema (to the 
> version with the newest timestamp). On a sufficiently large cluster with a 
> non-trivial schema, this could cause (literally) millions of migration tasks 
> to needlessly bounce across the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
