Propagation of schema changes got out of sync with node's notion of ring
------------------------------------------------------------------------

                 Key: CASSANDRA-2015
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2015
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Peter Schuller


I have a three-node 0.7.0 test cluster: nodes 1, 2 and 3. Nodes 1 and 2 are 
seeds; node 3 is not.

I had a situation where the following was observed:

* Schema changes submitted to node 1 would not propagate to any other node 
(observational method: tail syslog and don't see any flushing of system 
memtables/etc except on node 1).
* Schema changes submitted to node 2 or 3 would propagate between them, or to 
all (not sure which).
* Mutations submitted on node 1 *would* get propagated to node 3.
* All nodes knew of each other and considered themselves up according to 
'nodetool ring'.
* Because node 3 never got schema migrations, writes submitted to node 1 that 
got sent to node 3 blocked for extended periods of time on node 1, while 
triggering an exception on node 3 because of an invalid cfid in the row mutation.
* I cannot be entirely sure whether just a regular restart would have fixed 
the problem.
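
One way to make this kind of divergence visible is to compare a schema version 
identifier per node and group the nodes that agree. The following is only a 
minimal sketch: the node names and version strings are hypothetical 
placeholders, and how the versions are collected from a real cluster is left 
out entirely.

```python
from collections import defaultdict


def group_by_schema_version(versions):
    """Map each schema version to the sorted list of nodes reporting it."""
    groups = defaultdict(list)
    for node, version in versions.items():
        groups[version].append(node)
    return {version: sorted(nodes) for version, nodes in groups.items()}


def schema_in_agreement(versions):
    """True when every node reports the same schema version."""
    return len(set(versions.values())) <= 1


# Hypothetical values, not taken from the cluster in this report:
# node1 has applied a migration that the other two never received.
observed = {"node1": "v2", "node2": "v1", "node3": "v1"}
groups = group_by_schema_version(observed)
agreed = schema_in_agreement(observed)
```

More than one group means the nodes disagree on the schema, even if 
'nodetool ring' shows them all as up.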

Unfortunately, I was not aware of the problem until running some unit tests 
against the cluster and I cannot say for sure which order the machines were 
bootstrapped in.

After initial discovery I switched to manually submitting 'create keyspace x;' 
via cassandra-cli on each node (for different ks:es or interleaving 
create/drop), and observing results in syslog.

The observations w.r.t. row mutations did not come from the manual test, but 
rather from the unit test that failed, so there is some chance that it hit a 
different mode of failure than my cassandra-cli tests did.

Stopping all nodes, wiping the data directories, and restarting fixed the 
problem, and so far I have not been able to trigger it again. I am not sure 
whether just restarting the nodes would have fixed it.

It definitely seems like a problem to me that schema changes did not propagate 
even though node 1 was apparently sufficiently aware of the other node (3) to 
send mutations to it, even if the original problem may have been due to some 
kind of operational error.

I'd be interested in hearing speculation about what the likely triggers may be.


