[ https://issues.apache.org/jira/browse/CASSANDRA-9539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joshua McKenzie updated CASSANDRA-9539:
---------------------------------------
    Fix Version/s:     (was: 3.x)

> Race condition in schema propagation with dependence for cluster stability
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9539
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9539
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benedict
>             Fix For: 3.0.x
>
>
> Follow-up from CASSANDRA-8099, split out into its own ticket for discussion
> following a brief exchange on GitHub.
> My initial comment in {{SerializationHeader}}:
> {quote}
> // TODO 8099: this looks like a potential race condition with schema changes to me: within a given node we
> // can accept writes to a column not present in the metadata, or receive stream data without them.
> // This shouldn't cause deserialization to fail
> {quote}
> And [~slebresne]'s response:
> {quote}
> I've also somewhat edited the comment in {{SerializationHeader}}. It's true 
> that we're theoretically racy, but it's not a new thing to 8099 nor isolated 
> to this specific part of the code. In fact, I suspect we're not terribly 
> likely to get a problem at this particular point of the code because while 
> nodes are not prevented from taking writes for columns they don't know about 
> yet, we'll complain before it reaches the memtable (in the CQL layer if 
> that's the coordinator, in message deserialization otherwise). And while we 
> could get it through streams, given how schema propagation works and where
> streaming is used, it feels quite unlikely that streaming would reach a node 
> before a schema change.
> Anyway, I don't mean by that that we shouldn't improve all of this; I'm just
> adding my bit of context.
> {quote}
> My concern is that we expose ourselves to nodes failing to start up if there
> is a bug or problem with schema propagation, or if the race condition manages
> to manifest purely through timing, say due to flapping network problems
> (either is possible, but the former is more likely). Right now we would
> continue to function in this scenario, but after 8099 the node will fail on
> opening its sstables. I think this is something we should fix, preferably
> before release or early in the release cycle. We know our schema propagation
> code is not brilliant, and tightly coupling cluster stability to it concerns me.
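
The failure mode described above can be illustrated with a small Java sketch. It assumes a hypothetical, length-prefixed toy row format. This is not Cassandra's actual SerializationHeader or sstable layout. The names `SchemaTolerantReader`, `DecodedRow`, `read` and `write` are also invented for illustration. Setting `strict` makes decoding abort on a column the local schema does not know, which models a node that fails to open its sstables. Clearing `strict` makes the decoder skip the column via its length prefix and record its name, so the node keeps functioning.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy row format, NOT Cassandra's on-disk format:
// [int columnCount], then per column [UTF name][int valueLength][value bytes].
final class SchemaTolerantReader {

    // Values for columns the local schema knows, plus names of skipped columns.
    static final class DecodedRow {
        final Map<String, byte[]> values = new LinkedHashMap<>();
        final List<String> unknownColumns = new ArrayList<>();
    }

    static byte[] write(Map<String, byte[]> row) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(row.size());
        for (Map.Entry<String, byte[]> e : row.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().length);
            out.write(e.getValue());
        }
        out.flush();
        return bytes.toByteArray();
    }

    // strict = true fails as soon as the data names a column missing from the
    // local schema; strict = false records that column and keeps decoding.
    static DecodedRow read(byte[] data, Set<String> knownColumns, boolean strict) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        DecodedRow row = new DecodedRow();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();
            byte[] value = new byte[in.readInt()];
            in.readFully(value); // always consume the value so the next column stays aligned
            if (knownColumns.contains(name)) {
                row.values.put(name, value);
            } else if (strict) {
                throw new IOException("Unknown column " + name);
            } else {
                row.unknownColumns.add(name);
            }
        }
        return row;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> written = new LinkedHashMap<>();
        written.put("id", new byte[] {1});
        written.put("new_col", new byte[] {2, 3}); // from a schema change this node has not seen yet
        written.put("name", new byte[] {4});
        byte[] data = write(written);
        Set<String> localSchema = Set.of("id", "name");

        DecodedRow row = read(data, localSchema, false);
        System.out.println(row.values.keySet() + " skipped=" + row.unknownColumns);
        // prints "[id, name] skipped=[new_col]"
        try {
            read(data, localSchema, true);
        } catch (IOException e) {
            System.out.println("strict: " + e.getMessage()); // prints "strict: Unknown column new_col"
        }
    }
}
```

The lenient path only works here because every value carries its length, so unknown bytes can be consumed without understanding them. The trade-off is that the skipped column's data is invisible until the schema change arrives. A real fix would likely want to log or otherwise surface those names rather than drop them quietly, and this sketch only collects them.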



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
