Propagation of schema changes got out of sync with node's notion of ring
------------------------------------------------------------------------
Key: CASSANDRA-2015
URL: https://issues.apache.org/jira/browse/CASSANDRA-2015
Project: Cassandra
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Peter Schuller
I have a three-node 0.7.0 test cluster with nodes 1, 2 and 3. Nodes 1 and 2
are seeds; node 3 is not.
I had a situation where the following was observed:
* Schema changes submitted to node 1 would not propagate to any other node
(observational method: tail syslog and don't see any flushing of system
memtables/etc except on node 1).
* Schema changes submitted to node 2 or 3 would propagate between them, or to
all (not sure which).
* Mutations submitted on node 1 *would* get propagated to node 3.
* All nodes knew of each other and considered themselves up according to
'nodetool ring'.
* Because node 3 never got schema migrations, writes submitted to node 1 that
got sent to node 3 blocked for extended periods of time on node 1, while
triggering an exception on node 3 because of an invalid cfid in the row mutation.
* I cannot be entirely sure whether just a regular restart would have fixed
the problem.
Unfortunately, I was not aware of the problem until I ran some unit tests
against the cluster, and I cannot say for sure in which order the machines were
bootstrapped.
After the initial discovery I switched to manually submitting
'create keyspace x;' via cassandra-cli on each node (using different keyspaces,
or interleaving create/drop), and observing the results in syslog.
The observations w.r.t. row mutations did not come from the manual test, but
rather from the unit test that failed, so there is some chance that it hit a
different mode of failure than my cassandra-cli tests did.
Stopping all nodes, wiping the data directories and restarting fixed the
problem, and so far I have not been able to trigger it again. I am not sure
whether just restarting the nodes would have fixed it.
It definitely seems like a problem to me that schema changes did not propagate
even though node 1 was apparently sufficiently aware of the other node (3) to
send mutations to it, even if the original problem may have been due to some
kind of operational error.
I'd be interested in hearing speculation of what likely triggers may be.
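For anyone trying to reproduce this, the symptom above (ring looks healthy, but
schema changes do not reach every node) may be easier to spot with a small check
than by tailing syslog. The following is only a minimal sketch: it assumes the
schema version reported by each node has already been collected by some means,
and the `versions_by_node` mapping, the version strings and both function names
are hypothetical, not part of Cassandra.

```python
from collections import defaultdict


def group_by_schema_version(versions_by_node):
    """Group node names by the schema version each node reports.

    `versions_by_node` is a hypothetical mapping of node name -> schema
    version string, gathered however is convenient for the cluster.
    """
    groups = defaultdict(list)
    for node, version in versions_by_node.items():
        groups[version].append(node)
    # Sort node names so the result is stable and easy to compare.
    return {version: sorted(nodes) for version, nodes in groups.items()}


def schema_in_agreement(versions_by_node):
    """Return True when every node reports the same schema version."""
    return len(set(versions_by_node.values())) <= 1


# Hypothetical snapshot resembling the situation described above:
# node 1 applied a schema change that nodes 2 and 3 never received.
snapshot = {"node1": "v2", "node2": "v1", "node3": "v1"}

if not schema_in_agreement(snapshot):
    for version, nodes in group_by_schema_version(snapshot).items():
        print(version, nodes)
```

More than one group in the output would indicate that at least one node is
missing a migration, independently of what 'nodetool ring' reports.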