[ https://issues.apache.org/jira/browse/CASSANDRA-9539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joshua McKenzie updated CASSANDRA-9539:
---------------------------------------
    Fix Version/s:     (was: 3.0.0 rc2)
                       3.0.x
                       3.x
> Race condition in schema propagation with dependence for cluster stability
> --------------------------------------------------------------------------
>
> Key: CASSANDRA-9539
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9539
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Benedict
> Fix For: 3.0.x
>
>
> Follow-up to CASSANDRA-8099, split out into its own ticket for discussion
> following a brief exchange over GitHub.
> My initial comment in SerializationHeader:
> {quote}
> // TODO 8099: this looks like a potential race condition with schema changes to me: within a given node we
> // can accept writes to a column not present in the metadata, or receive stream data without them.
> // This shouldn't cause deserialization to fail
> {quote}
> And [~slebresne]'s response:
> {quote}
> I've also somewhat edited the comment in {{SerializationHeader}}. It's true
> that we're theoretically racy, but it's not a new thing to 8099 nor isolated
> to this specific part of the code. In fact, I suspect we're not terribly
> likely to get a problem at this particular point of the code because while
> nodes are not prevented from taking writes for columns they don't know about
> yet, we'll complain before it reaches the memtable (in the CQL layer if
> that's the coordinator, in message deserialization otherwise). And while we
> could get it through streams, given how schema propagation works and where
> streaming is used, it feels quite unlikely that streaming would reach a node
> before a schema change.
> Anyway, don't mean by that that we shouldn't improve all of this, just adding
> my bit of context.
> {quote}
> My concern is that we expose ourselves to nodes failing to start up if there
> is a bug or problem with schema propagation, or if the race condition manages
> to present purely through timing, say due to flapping network problems
> (either is possible, but the former is more likely). Right now we would
> continue to function in this scenario, but after 8099 the node will fail on
> opening its sstables. I think this is something we should fix, preferably
> before the release or early on in it. We know our schema propagation code is
> not brilliant, and tightly coupling the stability of the cluster to it
> concerns me.
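
The tolerance asked for in the SerializationHeader comment above (data for a
column missing from local metadata should not make deserialization fail) can
be sketched roughly as follows. This is a toy model, not Cassandra code:
TolerantRowReader, ReadResult, and the map-based row representation are all
hypothetical names made up for illustration. It also assumes that skipping an
unknown column is acceptable, which a real fix would have to weigh against
losing that data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these types exist in Cassandra.
public class TolerantRowReader {

    /** Outcome of reading one serialized row against the locally known schema. */
    public static final class ReadResult {
        public final Map<String, String> known = new LinkedHashMap<>();
        public final List<String> skipped = new ArrayList<>();
    }

    /**
     * Reads a row (modeled here as column name -> value) using only the
     * columns present in local metadata. Columns the node has not learned
     * about yet are recorded and skipped instead of raising an error.
     */
    public static ReadResult read(Map<String, String> serializedRow,
                                  Set<String> localColumns) {
        ReadResult result = new ReadResult();
        for (Map.Entry<String, String> cell : serializedRow.entrySet()) {
            if (localColumns.contains(cell.getKey())) {
                result.known.put(cell.getKey(), cell.getValue());
            } else {
                // Schema change may still be propagating: do not fail the read.
                result.skipped.add(cell.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "42");
        row.put("name", "alice");
        row.put("email", "a@example.org"); // column this node has not seen yet

        Set<String> localColumns = new HashSet<>(Arrays.asList("id", "name"));
        ReadResult r = read(row, localColumns);
        System.out.println("known=" + r.known + " skipped=" + r.skipped);
    }
}
```

The strict alternative would throw on the first unknown column, which
corresponds to the post-8099 behaviour the description worries about: a node
that cannot open its sstables because its schema is behind.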
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)