Re: Riak Clustering Changes in 1.0

Joseph Blomstedt Sun, 11 Sep 2011 11:26:08 -0700

> At present (0.14.x series) my understanding is that when a new node is added
> to the cluster, it claims a portion of the ring and services requests for
> that portion before all the data is actually present on the node.  Is that
> correct?


Yes, that's correct.

>  If so, as long as you're able to meet the R value of a read (i.e.
> R=2, N=3) by servicing reads from nodes with replicas of the same data you
> shouldn't see any 404s.  Is that also correct?

It depends.

In Riak a read request is always sent out to N nodes, and the read
coordinator waits for at least R replies before replying to the
client. If there are R successful replies, then the coordinator
returns success to the client with the appropriate data. However, in
cases where there are not successful replies, the coordinator tries to
exit out early if it can prove no more replies will help meet the R
quota. For example, if N is 4 and R=3, then 2 failed replies is enough
to fail the request even before seeing the responses from the two
remaining nodes.

In 0.14.2, not_founds count as failures against R. Also, in 0.14.2, we
have a built-in "basic quorum" check in addition to R. This is checked
against failures/not_founds. Basic quorum = truncate(N/2)+1. So, if
N=3, R=1 is sent to a cluster, and you get not_found, not_found,
success, the coordinate still fails the request because the basic
quorum wasn't met.

Given how 0.14.2 would immediately reassign partition ownership before
actually transferring the data, it was entirely possible to have 2 or
more replicas temporarily in the not_found state even if there was
actual data for the item somewhere in the cluster (the old partition
owner).

As already mentioned, the new changes fix this by not reassigning
ownership until after transfer. Also, the other points I just talked
about are now configurable in 1.0. There's an option that now counts
not_found as success, and another option that disables the basic
quorum check. Although, those changes were introduced for purposes
other than the discussion in this thread. The new clustering code
should work fine even with the default behavior of not_founds=fail,
check basic_quorum.

-Joe

On Fri, Sep 9, 2011 at 3:37 PM, Jeff Pollard <[email protected]> wrote:
>> Data is first transferred to new partition owners before handing over
>> partition ownership. This change fixes numerous bugs, such
>> as 404s/not_founds during ownership changes. The Ring/Pending columns
>> in [riak admin member_status] visualize this at a high-level, and the full
>> transfer status in [riak-admin ring_status] provide additional insight.
>
> At present (0.14.x series) my understanding is that when a new node is added
> to the cluster, it claims a portion of the ring and services requests for
> that portion before all the data is actually present on the node.  Is that
> correct?  If so, as long as you're able to meet the R value of a read (i.e.
> R=2, N=3) by servicing reads from nodes with replicas of the same data you
> shouldn't see any 404s.  Is that also correct?
>
> I should add that we're planning on adding our first node to our production
> cluster soon and wanted to make sure we had our story straight :)  That
> said, I'm very excited to see all the clustering improvements in 1.0 and
> hoping we can upgrade before adding a new node.
>
> On Fri, Sep 9, 2011 at 3:10 AM, Jens Rantil <[email protected]> wrote:
>>
>> Thanks for very well written answer. I appreciate it, mate.
>>
>> Jens
>>
>> -----Ursprungligt meddelande-----
>> Från: Joseph Blomstedt [mailto:[email protected]]
>> Skickat: den 8 september 2011 17:42
>> Till: Jens Rantil
>> Kopia: [email protected]
>> Ämne: Re: Riak Clustering Changes in 1.0
>>
>> > Out of curiousity, what was the reason for the 'join' command
>> > behaviour to change?
>>
>> 1. Existing bugs/limitations. For example, joining two entire clusters
>> together was not an entirely safe operation. In some cases, the newly formed
>> cluster would not correctly converge, leaving the ring/cluster in flux.
>> Likewise, we realized that many users were often joining two clusters
>> together by accident and would prefer additional safety. In particular,
>> joining two clusters together with overlapping data but no common vector
>> clock relationship could result in data loss as unintended siblings were
>> reconciled.
>>
>> 2. It was necessary consequence of how the new cluster code works. In the
>> new cluster, the cluster state / ring is only ever mutated by a single node
>> at a time. This is done by having a cluster-wide claimant, as mentioned in
>> my original email. Given the claimant approach, all cluster state / ring
>> changes are totally ordered. When a new node joins an existing cluster, it
>> throws away it's existing ring and replaces it with a copy of the ring from
>> the target cluster, thus joining into the same cluster history. If you were
>> to join two clusters together, we would need to deterministically merge two
>> independent cluster histories and elect a single new claimant for the new
>> cluster. This is easy in cases where there are no node failures or
>> net-splits during joining, but less trivial when there are errors. The
>> entire new cluster code was heavily modeled before implementation, and in
>> the modeling work several corner cases related to failures were found that
>> were hard to address in a cluster/cluster join but easy to fix in a
>> node/cluster join. Thus, I went with the simple and correct approach.
>>
>> -Joe
>>
>> --
>> Joseph Blomstedt <[email protected]>
>> Software Engineer
>> Basho Technologies, Inc.
>> http://www.basho.com/
>>
>> On Thu, Sep 8, 2011 at 5:19 AM, Jens Rantil <[email protected]>
>> wrote:
>> > Out of curiousity, what was the reason for the 'join' command
>> > behaviour to change?
>> >
>> >
>> >
>> > Regards,
>> >
>> > Jens
>> >
>> >
>> >
>> > -----------------------------------------------------------
>> >
>> > Date: Wed, 7 Sep 2011 18:12:40 -0600
>> >
>> > From: Joseph Blomstedt <[email protected]>
>> >
>> > To: riak-users Users <[email protected]>
>> >
>> > Subject: Riak Clustering Changes in 1.0
>> >
>> > Message-ID:
>> >
>> >
>> > <CANvk2KRPpath-ZJaDuhZo0b+pEByn0MhxyJ-9LEB+b_12d=v...@mail.gmail.com>
>> >
>> > Content-Type: text/plain; charset=ISO-8859-1
>> >
>> >
>> >
>> > Given that 1.0 prerelease packages are now available, I wanted to
>> >
>> > mention some changes to Riak's clustering capabilities in 1.0. In
>> >
>> > particular, there are some subtle semantic differences in the
>> >
>> > riak-admin commands. More complete docs will be updated in the near
>> >
>> > future, but I hope a quick email suffices for now.
>> >
>> >
>> >
>> > [nodeB/riak-admin join nodeA] is now strictly one-way. It joins nodeB
>> >
>> > to the cluster that nodeA is a member of. This is semantically
>> >
>> > different than pre-1.0 Riak in which join essentially joined clusters
>> >
>> > together rather than joined a node to a cluster. As part of this
>> >
>> > change, the joining node (nodeB in this case) must be a singleton
>> >
>> > (1-node) cluster.
>> >
>> >
>> >
>> > In pre-1.0, leave and remove were essentially the same operation, with
>> >
>> > leave just being an alias for 'remove this-node'. This has changed.
>> >
>> > Leave and remove are now very different operations.
>> >
>> >
>> >
>> > [nodeB/riak-admin leave] is the only safe way to have a node leave the
>> >
>> > cluster, and it must be executed by the node that you want to remove.
>> >
>> > In this case, nodeB will start leaving the cluster, and will not leave
>> >
>> > the cluster until after it has handed off all its data. Even if nodeB
>> >
>> > is restarted (crashed/shutdown/whatever), it will remain in the leave
>> >
>> > state and continue handing off partitions until done. After handoff,
>> >
>> > it will leave the cluster, and eventually shutdown.
>> >
>> >
>> >
>> > [nodeA/riak-admin remove nodeB] immediately removes nodeB from the
>> >
>> > cluster, without handing off its data. All replicas held by nodeB are
>> >
>> > therefore lost, and will need to be re-generated through read-repair.
>> >
>> > Use this command carefully. It's intended for nodes that are
>> >
>> > permanently unrecoverable and therefore for which handoff doesn't make
>> >
>> > sense. By the final 1.0 release, this command may be renamed
>> >
>> > "force-remove" just to make the distinction clear.
>> >
>> >
>> >
>> > There are now two new commands that provide additional insight into
>> >
>> > the cluster. [riak-admin member_status] and [riak-admin ring_status].
>> >
>> >
>> >
>> > Underneath, the clustering protocol has been mostly re-written. The
>> >
>> > new approach has the following advantages:
>> >
>> > 1. It is no longer necessary to wait on [riak-admin ringready] in
>> >
>> > between adding/removing nodes from the cluster, and adding/removing is
>> >
>> > also much more sound/graceful. Starting up 16 nodes and issuing
>> >
>> > [nodeX: riak-admin join node1] for X=1:16 should just work.
>> >
>> >
>> >
>> > 2. Data is first transferred to new partition owners before handing
>> >
>> > over partition ownership. This change fixes numerous bugs, such as
>> >
>> > 404s/not_founds during ownership changes. The Ring/Pending columns in
>> >
>> > [riak-admin member_status] visualize this at a high-level, and the
>> >
>> > full transfer status in [riak-admin ring_status] provide additional
>> >
>> > insight.
>> >
>> >
>> >
>> > 3. All partition ownership decisions are now made by a single node in
>> >
>> > the cluster (the claimant). Any node can be the claimant, and the duty
>> >
>> > is automatically taken over if the previous claimant is removed from
>> >
>> > the cluster. [riak-admin member_status] will list the current
>> >
>> > claimant.
>> >
>> >
>> >
>> > 4. Handoff related to ownership changes can now occur under load;
>> >
>> > hinted handoff still only occurs when a vnode is inactive. This change
>> >
>> > allows a cluster to scale up/down under load, although this needs to
>> >
>> > be further benchmarked and tuned before 1.0.
>> >
>> >
>> >
>> > To support all of the above, a new limitation has been introduced.
>> >
>> > Cluster changes (member addition/removal, ring rebalance, etc) can
>> >
>> > only occur when all nodes are up and reachable. [riak-admin
>> >
>> > ring_status] will complain when this is not the case. If a node is
>> >
>> > down, you must issue [riak-admin down <node>] to mark the node as
>> >
>> > down, and the remaining nodes will then proceed to converge as usual.
>> >
>> > Once the down node comes back online, it will automatically
>> >
>> > re-integrate into the cluster. However, there is nothing preventing
>> >
>> > client requests being served by a down node before it re-integrates.
>> >
>> > Before issuing [down <node>], make sure to update your load balancers
>> >
>> > / connection pools to not include this node. Future releases of Riak
>> >
>> > may make offlining a node an automatic operation, but it's a
>> >
>> > user-initiated action in 1.0.
>> >
>> >
>> >
>> > -Joe
>> >
>> >
>> >
>> > _______________________________________________
>> > riak-users mailing list
>> > [email protected]
>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> >
>> >
>>
>> _______________________________________________
>> riak-users mailing list
>> [email protected]
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Riak Clustering Changes in 1.0

Reply via email to