Extremely relevant to this conversation:

http://blog.basho.com/2011/09/09/Riak-Cluster-Membership-Overview/

Buy Joe a beer next time you see him :)

Mark

On Fri, Sep 9, 2011 at 2:37 PM, Jeff Pollard <[email protected]> wrote:
>> Data is first transferred to new partition owners before handing over
>> partition ownership. This change fixes numerous bugs, such
>> as 404s/not_founds during ownership changes. The Ring/Pending columns
>> in [riak-admin member_status] visualize this at a high level, and the full
>> transfer status in [riak-admin ring_status] provides additional insight.
>
> At present (0.14.x series) my understanding is that when a new node is added
> to the cluster, it claims a portion of the ring and services requests for
> that portion before all the data is actually present on the node.  Is that
> correct?  If so, as long as you're able to meet the R value of a read (i.e.
> R=2, N=3) by servicing reads from nodes with replicas of the same data you
> shouldn't see any 404s.  Is that also correct?
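Jeff's R-value reasoning can be sketched as a toy quorum read. This is a minimal model for illustration only (the function and its structure are not Riak's actual read path): a get succeeds as long as at least R of the N replica vnodes return a value, even if one replica lives on a freshly joined node that has no data yet.

```python
# Toy quorum read: succeeds when at least R of the N replicas answer.
# Illustrative model only, not Riak's internals.

def quorum_read(replicas, r):
    """replicas: list of values; None marks a replica that is missing,
    e.g. a new partition owner that has not received the data yet."""
    found = [v for v in replicas if v is not None]
    if len(found) >= r:
        return found[0]             # enough replicas answered: no 404
    raise LookupError("not_found")  # fewer than R replicas had the key

# N=3, one replica on a freshly joined node with no data yet; R=2 still works:
assert quorum_read(["v1", "v1", None], r=2) == "v1"

# If two of three replicas are missing, an R=2 read fails (a 404):
try:
    quorum_read(["v1", None, None], r=2)
except LookupError:
    pass
```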
>
> I should add that we're planning on adding our first node to our production
> cluster soon and wanted to make sure we had our story straight :)  That
> said, I'm very excited to see all the clustering improvements in 1.0 and
> hoping we can upgrade before adding a new node.
>
> On Fri, Sep 9, 2011 at 3:10 AM, Jens Rantil <[email protected]> wrote:
>>
>> Thanks for a very well written answer. I appreciate it, mate.
>>
>> Jens
>>
>> -----Original message-----
>> From: Joseph Blomstedt [mailto:[email protected]]
>> Sent: 8 September 2011 17:42
>> To: Jens Rantil
>> Cc: [email protected]
>> Subject: Re: Riak Clustering Changes in 1.0
>>
>> > Out of curiosity, what was the reason for the 'join' command
>> > behaviour to change?
>>
>> 1. Existing bugs/limitations. For example, joining two entire clusters
>> together was not an entirely safe operation. In some cases, the newly formed
>> cluster would not correctly converge, leaving the ring/cluster in flux.
>> Likewise, we realized that many users were often joining two clusters
>> together by accident and would prefer additional safety. In particular,
>> joining two clusters together with overlapping data but no common vector
>> clock relationship could result in data loss as unintended siblings were
>> reconciled.
>>
>> 2. It was a necessary consequence of how the new cluster code works. In the
>> new cluster, the cluster state / ring is only ever mutated by a single node
>> at a time. This is done by having a cluster-wide claimant, as mentioned in
>> my original email. Given the claimant approach, all cluster state / ring
>> changes are totally ordered. When a new node joins an existing cluster, it
>> throws away its existing ring and replaces it with a copy of the ring from
>> the target cluster, thus joining into the same cluster history. If you were
>> to join two clusters together, we would need to deterministically merge two
>> independent cluster histories and elect a single new claimant for the new
>> cluster. This is easy in cases where there are no node failures or
>> net-splits during joining, but less trivial when there are errors. The
>> entire new cluster code was heavily modeled before implementation, and in
>> the modeling work several corner cases related to failures were found that
>> were hard to address in a cluster/cluster join but easy to fix in a
>> node/cluster join. Thus, I went with the simple and correct approach.
>>
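Joe's description of the one-way join can be sketched as a small model: the joining node discards its own ring and adopts a copy of the target cluster's ring, so there is a single cluster history with one claimant. Class and field names here are illustrative only, not Riak's implementation.

```python
# Toy model of a 1.0-style one-way node/cluster join.
# Names and structures are illustrative, not Riak's actual code.

class Node:
    def __init__(self, name):
        self.name = name
        # Every node starts as a singleton cluster owning its own toy ring.
        self.ring = {"members": [name], "claimant": name, "version": 0}

    def join(self, target):
        # Strictly one-way: only a singleton node may join a cluster.
        if len(self.ring["members"]) != 1:
            raise RuntimeError("joining node must be a singleton cluster")
        # Throw away our ring and adopt the target cluster's history,
        # so all further ring changes stay totally ordered under one claimant.
        self.ring = target.ring
        self.ring["members"].append(self.name)
        self.ring["version"] += 1

a, b = Node("nodeA"), Node("nodeB")
b.join(a)                        # nodeB joins nodeA's cluster
assert a.ring is b.ring          # both nodes share one cluster history
assert a.ring["claimant"] == "nodeA"
```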
>> -Joe
>>
>> --
>> Joseph Blomstedt <[email protected]>
>> Software Engineer
>> Basho Technologies, Inc.
>> http://www.basho.com/
>>
>> On Thu, Sep 8, 2011 at 5:19 AM, Jens Rantil <[email protected]>
>> wrote:
>> > Out of curiosity, what was the reason for the 'join' command
>> > behaviour to change?
>> >
>> > Regards,
>> >
>> > Jens
>> >
>> > -----------------------------------------------------------
>> >
>> > Date: Wed, 7 Sep 2011 18:12:40 -0600
>> > From: Joseph Blomstedt <[email protected]>
>> > To: riak-users Users <[email protected]>
>> > Subject: Riak Clustering Changes in 1.0
>> > Message-ID:
>> > <CANvk2KRPpath-ZJaDuhZo0b+pEByn0MhxyJ-9LEB+b_12d=v...@mail.gmail.com>
>> > Content-Type: text/plain; charset=ISO-8859-1
>> >
>> >
>> > Given that 1.0 prerelease packages are now available, I wanted to
>> > mention some changes to Riak's clustering capabilities in 1.0. In
>> > particular, there are some subtle semantic differences in the
>> > riak-admin commands. More complete docs will be updated in the near
>> > future, but I hope a quick email suffices for now.
>> >
>> > [nodeB/riak-admin join nodeA] is now strictly one-way. It joins nodeB
>> > to the cluster that nodeA is a member of. This is semantically
>> > different than pre-1.0 Riak in which join essentially joined clusters
>> > together rather than joined a node to a cluster. As part of this
>> > change, the joining node (nodeB in this case) must be a singleton
>> > (1-node) cluster.
>> >
>> > In pre-1.0, leave and remove were essentially the same operation, with
>> > leave just being an alias for 'remove this-node'. This has changed.
>> > Leave and remove are now very different operations.
>> >
>> > [nodeB/riak-admin leave] is the only safe way to have a node leave the
>> > cluster, and it must be executed by the node that you want to remove.
>> > In this case, nodeB will start leaving the cluster, and will not leave
>> > the cluster until after it has handed off all its data. Even if nodeB
>> > is restarted (crashed/shutdown/whatever), it will remain in the leave
>> > state and continue handing off partitions until done. After handoff,
>> > it will leave the cluster, and eventually shut down.
>> >
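The leave lifecycle described above (a leaving node keeps its state even across restarts and only exits once every partition has been handed off) can be sketched roughly as a small state machine. The class and state names are illustrative only.

```python
# Toy state machine for 'riak-admin leave'. Illustrative model only.

class LeavingNode:
    def __init__(self, partitions):
        self.state = "valid"
        self.partitions = list(partitions)  # partitions still to hand off

    def leave(self):
        self.state = "leaving"   # persisted: survives a crash/restart

    def restart(self):
        pass                     # state and remaining partitions persist

    def handoff_one(self):
        self.partitions.pop()
        if not self.partitions and self.state == "leaving":
            self.state = "exiting"  # all data handed off; node leaves

node = LeavingNode(partitions=[1, 2, 3])
node.leave()
node.handoff_one()
node.restart()                   # crash/restart mid-leave
assert node.state == "leaving"   # still leaving, not back to normal
node.handoff_one()
node.handoff_one()
assert node.state == "exiting"   # handoff done; ready to shut down
```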
>> > [nodeA/riak-admin remove nodeB] immediately removes nodeB from the
>> > cluster, without handing off its data. All replicas held by nodeB are
>> > therefore lost, and will need to be re-generated through read-repair.
>> > Use this command carefully. It's intended for nodes that are
>> > permanently unrecoverable and for which handoff therefore doesn't make
>> > sense. By the final 1.0 release, this command may be renamed
>> > "force-remove" just to make the distinction clear.
>> >
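Why force-removal loses replicas, and how read-repair regenerates them, can be illustrated with a toy model. Node names and the repair function here are hypothetical, not Riak's code.

```python
# Toy model of remove + read-repair. Node names are hypothetical.

replicas = {"nodeA": "v1", "nodeB": "v1", "nodeC": "v1"}  # N=3 copies

# riak-admin remove nodeB: no handoff, so its replica is simply gone.
del replicas["nodeB"]

def read_repair(replicas, all_owners):
    """On a read, copy the value back onto owners missing their replica."""
    surviving = next(iter(replicas.values()))
    for n in all_owners:
        if n not in replicas:
            replicas[n] = surviving   # regenerate the lost replica
    return surviving

# A replacement node (nodeD) takes over the partition; a read repairs it.
value = read_repair(replicas, ["nodeA", "nodeD", "nodeC"])
assert value == "v1"
assert replicas["nodeD"] == "v1"
```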
>> > There are now two new commands that provide additional insight into
>> > the cluster: [riak-admin member_status] and [riak-admin ring_status].
>> >
>> > Underneath, the clustering protocol has been mostly re-written. The
>> > new approach has the following advantages:
>> >
>> > 1. It is no longer necessary to wait on [riak-admin ringready] in
>> > between adding/removing nodes from the cluster, and adding/removing is
>> > also much more sound/graceful. Starting up 16 nodes and issuing
>> > [nodeX: riak-admin join node1] for X=1:16 should just work.
>> >
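That batch join might look something like the loop below. The host names (riak1..riak16) and the ssh-style invocation are assumptions for illustration; the point is that no ringready wait is needed between joins.

```shell
# Hypothetical hosts riak2..riak16 each join riak1's cluster.
# In 1.0 there is no need to run 'riak-admin ringready' between joins.
for X in $(seq 2 16); do
  echo "on riak${X}: riak-admin join riak1"   # e.g. run via ssh in practice
done
```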
>> > 2. Data is first transferred to new partition owners before handing
>> > over partition ownership. This change fixes numerous bugs, such as
>> > 404s/not_founds during ownership changes. The Ring/Pending columns in
>> > [riak-admin member_status] visualize this at a high level, and the
>> > full transfer status in [riak-admin ring_status] provides additional
>> > insight.
>> >
>> > 3. All partition ownership decisions are now made by a single node in
>> > the cluster (the claimant). Any node can be the claimant, and the duty
>> > is automatically taken over if the previous claimant is removed from
>> > the cluster. [riak-admin member_status] will list the current
>> > claimant.
>> >
>> > 4. Handoff related to ownership changes can now occur under load;
>> > hinted handoff still only occurs when a vnode is inactive. This change
>> > allows a cluster to scale up/down under load, although this needs to
>> > be further benchmarked and tuned before 1.0.
>> >
>> > To support all of the above, a new limitation has been introduced.
>> > Cluster changes (member addition/removal, ring rebalance, etc) can
>> > only occur when all nodes are up and reachable. [riak-admin
>> > ring_status] will complain when this is not the case. If a node is
>> > down, you must issue [riak-admin down <node>] to mark the node as
>> > down, and the remaining nodes will then proceed to converge as usual.
>> > Once the down node comes back online, it will automatically
>> > re-integrate into the cluster. However, there is nothing preventing
>> > client requests from being served by a down node before it
>> > re-integrates. Before issuing [down <node>], make sure to update your
>> > load balancers / connection pools to not include this node. Future
>> > releases of Riak may make offlining a node an automatic operation, but
>> > it's a user-initiated action in 1.0.
>> >
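The all-nodes-reachable rule can be modeled roughly as below: membership changes stall while any member is unreachable, until the operator explicitly marks it down. This is a sketch with made-up names, not Riak's actual bookkeeping.

```python
# Toy model of the 'all nodes up or marked down' rule. Illustrative only.

class Cluster:
    def __init__(self, members):
        self.members = set(members)
        self.reachable = set(members)
        self.down = set()

    def can_change_membership(self):
        # Changes proceed only if every member is reachable or has been
        # explicitly marked down by the operator.
        return self.members <= (self.reachable | self.down)

    def mark_down(self, node):
        self.down.add(node)      # models 'riak-admin down <node>'

cluster = Cluster(["node1", "node2", "node3"])
cluster.reachable.remove("node3")         # node3 crashes
assert not cluster.can_change_membership()  # ring_status would complain
cluster.mark_down("node3")                # operator marks it down
assert cluster.can_change_membership()    # convergence proceeds as usual
```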
>> > -Joe
>> >
>> > _______________________________________________
>> > riak-users mailing list
>> > [email protected]
>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> >
>> >
>>
>
>
>
>

