Inconsistent cluster membership

Pavel Kirienko Mon, 06 May 2013 03:30:46 -0700

Hi everyone,

We recently updated our 4-node cluster from Riak 1.2 to 1.3.1. The cluster
was reasonably stable before the update.


We performed the update in the following order:
1. Performed clean installation of Riak 1.3.1 on the first node
(192.168.0.2) (thus, the Riak settings and data directory were wiped out);
2. Connected the first node to the rest of cluster;
3. Updated the other three nodes to 1.3.1, leaving data directories
unchanged.

We use Ubuntu Server 12.04 on the first node and 10.04 on the others;
backend is LevelDB.

When the update was done, we could see that every node had been connected
to the cluster and everything was fine:
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'   <-- first node
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
-------------------------------------------------------------------------------
Valid:4 / Leaving:0 / Exiting:0 / Joining:0 / Down:0


A week later the first node detached from the cluster on its own, and now
member-status on the first node shows this:
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid     100.0%      --      '[email protected]'
-------------------------------------------------------------------------------
Valid:1 / Leaving:0 / Exiting:0 / Joining:0 / Down:0


However, the other nodes report that they are still participating in the
full cluster:
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
joining    25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'   <-- first node
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
-------------------------------------------------------------------------------
Valid:3 / Leaving:0 / Exiting:0 / Joining:1 / Down:0
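
To make the disagreement between the views easier to spot, the member rows
can be extracted from saved member-status output and compared side by side.
This is only a minimal sketch: the function name member_rows and the file
names in the usage note are made up for illustration, and it assumes the
table layout shown above (status, ring, pending, node).

```shell
#!/bin/sh
# Hypothetical helper: read a saved member-status table on stdin and print
# "status ring node" for each member row, sorted by node name.
# Header, separator and summary lines are skipped because their first
# field is not one of the lowercase status words used in the table.
member_rows() {
  awk '$1 ~ /^(valid|joining|leaving|exiting|down)$/ { print $1, $2, $4 }' |
    sort -k3
}
```

For example, after saving the output of "riak-admin member-status" from two
nodes into first.txt and other.txt (assumed names), running
"member_rows < first.txt" and "member_rows < other.txt" gives two short
lists that can be compared by eye or with diff.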


Despite the fact that the node 192.168.0.2 is said to be detached
(according to its member-status output), its data is still available
throughout the cluster:

$ curl -XPOST http://192.168.0.2:8098/buckets/test/keys/foo -d BAR
$ curl http://192.168.0.2:8098/buckets/test/keys/foo
BAR
$ curl http://192.168.0.3:8098/buckets/test/keys/foo
BAR


Here are the questions:
1. Why is the member-status output inconsistent across nodes, and what
should we do about it?
2. Sometimes the first node returns the error
"{insufficient_vnodes,0,need,2}" on read queries. The error goes away for a
few days after the node is restarted. I suspect this is related to the
first question.
3. The node at 192.168.0.3 has been trying to join the cluster for at least
10 days, still with no luck. (The overall data size is about 60GB; the
network is 100Mbit Ethernet.)
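
To put the numbers in question 3 in perspective, here is a rough
back-of-the-envelope sketch. It assumes decimal gigabytes, a fully
available 100Mbit link and no protocol overhead, and it pessimistically
assumes that the whole 60GB has to be moved; the function name is made up
for illustration.

```shell
#!/bin/sh
# Hypothetical helper: lower bound on transfer time in seconds.
#   $1 = data size in GB (decimal, 1 GB = 8000 Mbit)
#   $2 = link speed in Mbit/s
min_transfer_seconds() {
  echo $(( $1 * 8 * 1000 / $2 ))
}

min_transfer_seconds 60 100   # prints 4800 (seconds), i.e. 80 minutes
```

Even allowing for a large overhead factor, that is far below 10 days, so
the link speed alone does not seem to explain the stuck join.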

What additional information would help?

As I said before, things went wrong when the cluster was updated to 1.3.1.
So maybe the right way to fix these problems is to do a clean installation
of 1.3.1 and then restore the data from backup?

Thanks in advance!
Pavel.

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
