Benedict created CASSANDRA-9539:
-----------------------------------

             Summary: Race condition in schema propagation with dependence for 
cluster stability
                 Key: CASSANDRA-9539
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9539
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Benedict
             Fix For: 3.0.x


Follow-up from CASSANDRA-8099. Split out into its own ticket for discussion 
following a brief exchange on GitHub.

My initial comment in {{SerializationHeader}}:
{quote}
// TODO 8099: this looks like a potential race condition with schema changes to 
me: within a given node we
// can accept writes to a column not present in the metadata, or receive stream 
data without them.
// This shouldn't cause deserialization to fail
{quote}

And [~slebresne]'s response:
{quote}
I've also somewhat edited the comment in {{SerializationHeader}}. It's true 
that we're theoretically racy, but it's not a new thing to 8099 nor isolated to 
this specific part of the code. In fact, I suspect we're not terribly likely to 
get a problem at this particular point of the code because while nodes are not 
prevented from taking writes for columns they don't know about yet, we'll 
complain before it reaches the memtable (in the CQL layer if that's the 
coordinator, in message deserialization otherwise). And while we could get it 
through streams, given how schema propagation work and where streaming is used, 
it feels quite unlikely that streaming would reach a node before a schema 
change.

Anyway, don't mean by that that we shouldn't improve all of this, just adding 
my bit of context.
{quote}

My concern is that we expose ourselves to nodes failing to start up if there is 
a bug or problem with schema propagation, or if the race condition manifests 
purely through timing, say due to flapping network problems (either is 
possible, but the former is more likely). Right now we would continue to 
function in this scenario, but after 8099 the node will fail on opening its 
sstables. I think this is something we should fix, preferably before the 
release, or early on in it. We know our schema propagation code is not 
brilliant, and tightly coupling the stability of the cluster to it concerns me.
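
To illustrate the distinction between failing and degrading gracefully, here is 
a minimal sketch. It is not Cassandra's actual {{SerializationHeader}} code: 
every class and method name below is hypothetical, and it assumes cells arrive 
as already-delimited strings, so an unknown column can be skipped without 
knowing its type (a real on-disk format would need the header to carry enough 
type or length information to do so).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: these names are illustrative, not Cassandra's API.
final class TolerantHeaderSketch {

    /** Outcome of a tolerant read: the cells kept and the columns skipped. */
    static final class DecodedRow {
        final Map<String, String> values = new LinkedHashMap<>();
        final List<String> skipped = new ArrayList<>();
    }

    /** Strict read: a header column missing from the local schema is fatal. */
    static Map<String, String> decodeStrict(List<String> headerColumns,
                                            List<String> cells,
                                            Set<String> knownColumns) {
        Map<String, String> values = new LinkedHashMap<>();
        for (int i = 0; i < headerColumns.size(); i++) {
            String name = headerColumns.get(i);
            if (!knownColumns.contains(name))
                throw new IllegalStateException("Unknown column in header: " + name);
            values.put(name, cells.get(i));
        }
        return values;
    }

    /** Tolerant read: unknown columns are recorded and skipped, not fatal. */
    static DecodedRow decodeTolerant(List<String> headerColumns,
                                     List<String> cells,
                                     Set<String> knownColumns) {
        DecodedRow row = new DecodedRow();
        for (int i = 0; i < headerColumns.size(); i++) {
            String name = headerColumns.get(i);
            if (knownColumns.contains(name))
                row.values.put(name, cells.get(i));
            else
                row.skipped.add(name); // keep a record so the gap can be logged
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("id", "name", "email");
        List<String> cells = Arrays.asList("1", "alice", "a@example.com");
        // The local schema has not yet learned about the "email" column.
        Set<String> known = new HashSet<>(Arrays.asList("id", "name"));

        DecodedRow row = decodeTolerant(header, cells, known);
        System.out.println("kept=" + row.values + " skipped=" + row.skipped);

        try {
            decodeStrict(header, cells, known);
        } catch (IllegalStateException e) {
            System.out.println("strict read failed: " + e.getMessage());
        }
    }
}
```

The strict path models the post-8099 behaviour described above, where a stale 
local schema turns into a hard failure. The tolerant path models the kind of 
decoupling being asked for. Whether skipped data should be dropped, retained 
opaquely, or re-read once the schema catches up is left open here.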



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
