Hi everyone,

We recently updated our 4-node cluster from 1.2 to 1.3.1. The cluster was reasonably stable until the update.

We updated in the following order:

1. Performed a clean installation of Riak 1.3.1 on the first node (192.168.0.2), so the Riak settings and data directory on that node were wiped out;
2. Connected the first node to the rest of the cluster;
3. Updated the other three nodes to 1.3.1, leaving their data directories unchanged.

We use Ubuntu Server 12.04 on the first node and 10.04 on the others; the backend is LevelDB.

When the update was done, member-status showed that every node was connected to the cluster and everything was fine:

    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid      25.0%      --      '[email protected]'
    valid      25.0%      --      '[email protected]'   <-- first node
    valid      25.0%      --      '[email protected]'
    valid      25.0%      --      '[email protected]'
    -------------------------------------------------------------------------------
    Valid:4 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

A week later the first node detached from the cluster on its own, and now member-status on the first node shows this:

    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid     100.0%      --      '[email protected]'
    -------------------------------------------------------------------------------
    Valid:1 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

However, the other nodes report that they still participate in the full cluster:

    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    joining    25.0%      --      '[email protected]'
    valid      25.0%      --      '[email protected]'   <-- first node
    valid      25.0%      --      '[email protected]'
    valid      25.0%      --      '[email protected]'
    -------------------------------------------------------------------------------
    Valid:3 / Leaving:0 / Exiting:0 / Joining:1 / Down:0

Although node 192.168.0.2 appears detached according to its own member-status output, data written through it is still available throughout the cluster:

    $ curl -XPOST http://192.168.0.2:8098/buckets/test/keys/foo -d BAR
    $ curl http://192.168.0.2:8098/buckets/test/keys/foo
    BAR
    $ curl http://192.168.0.3:8098/buckets/test/keys/foo
    BAR

Here are the questions:

1. Why is the member-status output inconsistent between nodes, and what should we do about it?
2. Sometimes the first node returns the error "{insufficient_vnodes,0,need,2}" on read queries. The error goes away for a few days after the node is restarted. I suspect this is related to the first question.
3. The node on 192.168.0.3 has been trying to join the cluster for at least 10 days, still with no luck (overall data size is about 60 GB; the network is 100 Mbit Ethernet).

What additional information would help?

As I said before, things went wrong when the cluster was updated to 1.3.1. So maybe the right way to fix these problems is to install 1.3.1 cleanly on every node and then restore the data from backup?

Thanks in advance!

Pavel.
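For anyone who wants to compare the member-status views mechanically, here is a minimal Python sketch that parses the text output shown above and lists the members on which the reporting nodes disagree. It is only an illustration: the node names of the form 'riak@<ip>' and the addresses 192.168.0.4 and 192.168.0.5 are assumptions made for the example (the real names are not visible in the output above), and the helper names parse_member_status and diff_views are hypothetical.

```python
import re

# Matches rows such as:  valid  25.0%  --  'riak@192.168.0.2'
ROW = re.compile(r"^\s*(\w+)\s+([\d.]+)%\s+(\S+)\s+'([^']+)'")


def parse_member_status(text):
    """Return {node_name: status} parsed from member-status text output."""
    members = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            status, _ring, _pending, node = match.groups()
            members[node] = status
    return members


def diff_views(views):
    """Given {reporter: {member: status}}, return the members whose
    presence or status differs between reporters."""
    all_members = set()
    for view in views.values():
        all_members.update(view)
    disagreements = {}
    for member in sorted(all_members):
        seen = {rep: view.get(member, "absent") for rep, view in views.items()}
        if len(set(seen.values())) > 1:
            disagreements[member] = seen
    return disagreements


# Sample data modeled on the outputs above; node names are assumed.
VIEW_FIRST = """
valid     100.0%      --      'riak@192.168.0.2'
Valid:1 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
"""
VIEW_OTHER = """
joining    25.0%      --      'riak@192.168.0.3'
valid      25.0%      --      'riak@192.168.0.2'
valid      25.0%      --      'riak@192.168.0.4'
valid      25.0%      --      'riak@192.168.0.5'
Valid:3 / Leaving:0 / Exiting:0 / Joining:1 / Down:0
"""

VIEWS = {
    "riak@192.168.0.2": parse_member_status(VIEW_FIRST),
    "riak@192.168.0.3": parse_member_status(VIEW_OTHER),
}

if __name__ == "__main__":
    for member, seen in diff_views(VIEWS).items():
        print(member, seen)
```

With the sample data, the first node's view lacks the three other members entirely, which is the kind of split described above.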
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
