Out of curiosity, what was the reason for the change in the 'join' command's behaviour?



Regards,

Jens



-----------------------------------------------------------

Date: Wed, 7 Sep 2011 18:12:40 -0600
From: Joseph Blomstedt <[email protected]>
To: riak-users Users <[email protected]>
Subject: Riak Clustering Changes in 1.0
Message-ID: <CANvk2KRPpath-ZJaDuhZo0b+pEByn0MhxyJ-9LEB+b_12d=v...@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1



Given that 1.0 prerelease packages are now available, I wanted to mention some changes to Riak's clustering capabilities in 1.0. In particular, there are some subtle semantic differences in the riak-admin commands. More complete docs will be updated in the near future, but I hope a quick email suffices for now.



[nodeB/riak-admin join nodeA] is now strictly one-way. It joins nodeB to the cluster that nodeA is a member of. This is semantically different from pre-1.0 Riak, in which join essentially joined clusters together rather than joining a node to a cluster. As part of this change, the joining node (nodeB in this case) must be a singleton (1-node) cluster.
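For example (a minimal sketch; the riak@nodeA / riak@nodeB Erlang node names are illustrative, so substitute your own), the join is run on the node that wants to join:

    # Run on nodeB, a fresh singleton node, to join the cluster nodeA belongs to:
    riak-admin join riak@nodeA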



In pre-1.0, leave and remove were essentially the same operation, with leave just being an alias for 'remove this-node'. This has changed. Leave and remove are now very different operations.



[nodeB/riak-admin leave] is the only safe way to have a node leave the cluster, and it must be executed by the node that you want to remove. In this case, nodeB will start leaving the cluster, and will not leave the cluster until after it has handed off all its data. Even if nodeB is restarted (crashed/shutdown/whatever), it will remain in the leave state and continue handing off partitions until done. After handoff, it will leave the cluster and eventually shut down.
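For example (illustrative node name; leave takes no arguments), the graceful departure is initiated on the leaving node itself:

    # Run on nodeB, the node that should hand off all of its data and depart:
    riak-admin leave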



[nodeA/riak-admin remove nodeB] immediately removes nodeB from the cluster, without handing off its data. All replicas held by nodeB are therefore lost, and will need to be re-generated through read-repair. Use this command carefully. It's intended for nodes that are permanently unrecoverable, and for which handoff therefore doesn't make sense. By the final 1.0 release, this command may be renamed "force-remove" just to make the distinction clear.



There are now two new commands that provide additional insight into the cluster: [riak-admin member_status] and [riak-admin ring_status].
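Both can be run on any member of the cluster, e.g.:

    # Run on any node to inspect membership and ring/transfer state:
    riak-admin member_status
    riak-admin ring_status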



Underneath, the clustering protocol has been mostly re-written. The new approach has the following advantages:

1. It is no longer necessary to wait on [riak-admin ringready] in between adding/removing nodes from the cluster, and adding/removing is also much more sound/graceful. Starting up 16 nodes and issuing [nodeX: riak-admin join node1] for X=1:16 should just work (see the sketch after this list).



2. Data is first transferred to new partition owners before handing over partition ownership. This change fixes numerous bugs, such as 404s/not_founds during ownership changes. The Ring/Pending columns in [riak-admin member_status] visualize this at a high level, and the full transfer status in [riak-admin ring_status] provides additional insight.



3. All partition ownership decisions are now made by a single node in the cluster (the claimant). Any node can be the claimant, and the duty is automatically taken over if the previous claimant is removed from the cluster. [riak-admin member_status] will list the current claimant.



4. Handoff related to ownership changes can now occur under load; hinted handoff still only occurs when a vnode is inactive. This change allows a cluster to scale up/down under load, although this needs to be further benchmarked and tuned before 1.0.
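As a sketch of point 1 (assuming ssh access from an admin host and Erlang node names of the form riak@nodeN -- both are assumptions for illustration, not part of the release), joining a batch of freshly started nodes to node1 could look like:

    # Each fresh node joins the cluster that node1 belongs to; no need to
    # wait on ringready between joins.
    for i in $(seq 2 16); do
      ssh node$i "riak-admin join riak@node1"
    done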



To support all of the above, a new limitation has been introduced. Cluster changes (member addition/removal, ring rebalance, etc.) can only occur when all nodes are up and reachable. [riak-admin ring_status] will complain when this is not the case. If a node is down, you must issue [riak-admin down <node>] to mark the node as down, and the remaining nodes will then proceed to converge as usual. Once the down node comes back online, it will automatically re-integrate into the cluster. However, there is nothing preventing client requests from being served by a down node before it re-integrates. Before issuing [down <node>], make sure to update your load balancers / connection pools to not include this node. Future releases of Riak may make offlining a node an automatic operation, but it's a user-initiated action in 1.0.
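For example (illustrative node name), if nodeC has crashed and is blocking cluster changes, mark it down from any reachable member after pulling it from your load balancer:

    # Run on any reachable node once nodeC is out of the load balancer:
    riak-admin down riak@nodeC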



-Joe

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
