I've also been asking for this, and the current master has code to remedy these
things, but it's not in an official release yet.
At the Erlang level, you can specify options to RiakClient:get as follows:

    -type option() :: {r, pos_integer()} |          %% Minimum number of successful responses
                      {pr, non_neg_integer()} |     %% Minimum number of primary vnodes participating
                      {basic_quorum, boolean()} |   %% Whether to use basic quorum (return early
                                                    %% in some failure cases).
                      {notfound_ok, boolean()} |    %% Count notfound responses as successful.
                      {timeout, pos_integer() | infinity}. %% Timeout for vnode responses

And so to get the semantics I think you're asking for, do GET (assuming N=3)
with
[{r,2},{pr, 2}, {basic_quorum, false}, {notfound_ok, true}]
So this will work as you want as long as there is only one node down.
During handoff you may see a new kind of error
HTTP 503 / {error r_value_unsatisfied ...}
which is the behavior when basic quorum is disabled, i.e. the alternative to
getting a notfound just because there was some node which did not have the
value.
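
One way a client might cope with that new error is to retry on 503 while still treating 404 as a real notfound. The following is only a minimal sketch under that assumption: `get_with_retry` and the `fetch` callable are hypothetical names, not part of any Riak client, and whether retrying is appropriate depends on your application.

```python
import time


def get_with_retry(fetch, attempts=5, delay=0.5):
    """Call fetch() until it returns something other than 503.

    fetch is a hypothetical callable returning (status, body).
    Any non-503 status, including 404, is returned immediately.
    """
    status, body = None, None
    for attempt in range(attempts):
        status, body = fetch()
        if status != 503:
            return status, body
        # Sleep between attempts, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return status, body


# Stub standing in for a real HTTP GET: two 503s, then success.
replies = [(503, "r_value_unsatisfied"), (503, "r_value_unsatisfied"), (200, "value")]
result = get_with_retry(lambda: replies.pop(0), delay=0)
print(result)
```

The stub only simulates the responses; a real `fetch` would issue the HTTP GET and return its status code and body.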
Each of these options is also available as a query parameter when doing an HTTP GET:
curl 'http://127.0.0.1:8091/riak/buck/key?r=2&basic_quorum=false&notfound_ok=true'
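
If you build the request from code rather than curl, the same query string can be assembled programmatically. This is a rough sketch, assuming the host, port, bucket, and key from the curl example; `riak_get_url` is a hypothetical helper, not part of any Riak client library.

```python
from urllib.parse import quote, urlencode


def riak_get_url(base, bucket, key, **options):
    """Build a GET URL that carries per-request options as query parameters."""
    # Render booleans as lowercase true/false, matching the curl example.
    params = {
        name: str(value).lower() if isinstance(value, bool) else value
        for name, value in options.items()
    }
    path = f"/riak/{quote(bucket, safe='')}/{quote(key, safe='')}"
    return f"{base}{path}?{urlencode(params)}"


url = riak_get_url(
    "http://127.0.0.1:8091", "buck", "key",
    r=2, pr=2, basic_quorum=False, notfound_ok=True,
)
print(url)
# http://127.0.0.1:8091/riak/buck/key?r=2&pr=2&basic_quorum=false&notfound_ok=true
```

Note that when the URL is pasted into a shell, it has to be quoted, since an unquoted `&` ends the command.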
I'm also looking forward to a release which has this, and I'm hoping that the
defaults can somehow be simplified / strengthened so people new to this don't
need to be so surprised about these things.
Kresten
On May 5, 2011, at 8:18 AM, Greg Nelson wrote:
I just added node #5 to our cluster, and once again the experience during the
subsequent 60-minute handoff period was pretty awful! I just don't understand
why this would be expected behavior while adding a node. There doesn't seem to
be any realistic way to join a node to an online cluster. As far as I'm
concerned this is a huge defect in Riak.
Read-repair didn't seem to kick in immediately for data. My application was
configured to retry GETs (with a few seconds of backoff), and still got 404s.
I manually requested an object repeatedly for over 20 minutes until finally
getting a result.
I think bug #992 (https://issues.basho.com/show_bug.cgi?id=992) describes the
defect, but I'm wondering if there is more to it than this? Especially since
read-repair didn't quite seem to work.
Could what Daniel describes on that bug ("Only return not found when all vnodes
have reported not found (or error)") be implemented as a configurable option?
Maybe something one could kick in when a node joins until all handoffs are
complete?
What can we do to remedy this before I add node #6, #7, etc.? We're storing
huge amounts of data, which means that a) we'll be adding nodes often, and b)
the amount of data handoff will be large, which means long periods of handoff
where we don't want to have downtime.
Greg
On Tuesday, May 3, 2011 at 2:30 AM, Nico Meyer wrote:
Hi everyone,
I just want to note that I observed similar behaviour with a somewhat
larger cluster of 10 or so nodes. I first noticed that handoff activity
after node join (or leave for that matter) involved a lot more
partitions than I would have expected. By comparing the old and the new
ring file, I found out that more than 80 percent of partitions had to be
moved to another node.
My naive expectation was that joining a node to a cluster of size X
would result in roughly ring_creation_size/(X+1) partitions to be handed
off, which would also be the minimum if one expects a balanced cluster
afterwards.
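
To put numbers on that estimate, here is a small sketch of the arithmetic. It uses the ring_creation_size of 2048 and the 3-to-4 node join from Greg's original message; `expected_handoff` is just an illustrative name, and the formula is the naive expectation above, not how Riak actually assigns partitions.

```python
def expected_handoff(ring_creation_size, nodes_before):
    """Naive estimate: partitions a new node must receive for a balanced ring."""
    return ring_creation_size / (nodes_before + 1)


# Greg's cluster: ring_creation_size = 2048, growing from 3 to 4 nodes.
moved = expected_handoff(2048, 3)
print(moved, moved / 2048)  # 512.0 0.25

# The expected fraction depends only on the node count. For a join
# into a 10-node cluster it is 1/11, whatever the ring size.
print(round(1 / (10 + 1), 3))  # 0.091
```

Under this estimate, a join into a 10-node cluster should move about 9 percent of partitions, against the more than 80 percent observed.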
Furthermore it would in theory be possible to move partitions in such a
way that at least one partition from each preflist stays on the same
node. Maybe for X>N it should even be possible to guarantee this for a
basic quorum of each preflist, eliminating the notfound problem
completely, but I am not sure about that.
I may be able to provide some ring files to analyze this behaviour if
someone from basho is interested.
Cheers,
Nico
On Monday, May 2, 2011 at 23:14 -0400, Ryan Zezeski wrote:
Greg,
Your expectations are fair; just because you added a node doesn't mean
Riak should return notfounds. Unfortunately, we aren't quite there
yet. This is a side effect of how Riak currently implements handoff
in that it immediately updates/gossips the ring causing
many partitions to handoff immediately. If a request comes in that
relies on these partitions then it will get a notfound and perform
read repair. Your situation is multiplied by the fact that you are
going from 3 nodes to 4. More vnode shuffling occurs because of the
small cluster size.
We're well aware of this and have it on our radar for improvement in a
future release.
All this said, your data will be eventually consistent. That is, all
your data will eventually be handed off and things will work as
normal. It's only during the handoff that you _may_ encounter
notfounds. In this case it would be best to add a new node to your
cluster at the lowest-load times, and if you can spare additional
hardware, starting with a few more nodes is an even easier option.
-Ryan
On Mon, May 2, 2011 at 9:48 PM, Greg Nelson wrote:
Hello riak users!
I have a 4 node cluster that started out as 3 nodes.
ring_creation_size = 2048, target_n_val is default (4), and
all buckets have n_val = 3.
When I joined the 4th node, for a few minutes some GETs were
returning 'not found' for data that was already in riak.
Eventually the data was returned, due to read repair I would
assume. Is this expected? It seems that 'not found' and read
repairs should only happen when something goes wrong, like a
node goes down. Not when adding a node to the cluster, which
is supposed to be part of normal operation!
Any help or insight is appreciated!
Greg
________________________________________________
riak-users mailing list
[email protected]<mailto:[email protected]>
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com