> > Crap, the second after I hit "send" the lightbulb goes on! Why is that?
>
> The quorum _was_ met (all vnodes just migrated to the one machine) but since
> some of them were fail-overs they didn't have the value yet (or the wrong
> value)? In this case a read repair happened and subsequent gets worked.
>
Your understanding is correct. However, when I say "quorum was met" I usually mean "it had R successful replies". Minor semantic quibble. You are also correct that the wiki is misleading: read repair happens whenever any successful reply reaches the FSM, even if "not found" was returned to the client, that is, even when quorum was not met. We'll get that fixed.

> I'm still dark on the second question.
>
> > 2) Why doesn't r=1 work?
>
> In the IRC session, you claimed that r=1 would not have helped this problem.
> Just like the OP, this confused me. You then went on to say it was because
> of some optimization and then mentioned a "basic quorum."
>
> I took a few minutes to think about this and the only conclusion I came to is
> that when r=1 you will treat the first response as the final response, and in
> this case the notfound response will always come back first? I'm not sure if
> what I just said makes sense but I would have expected r=1 to work, just like
> the OP. I'll admit that I still haven't read all the wiki docs yet (but I've
> read Read Repair 3 times now), so I'd be happy to hear RTFM.

A number of months ago, we ran into issues with a cluster where "not found" responses were not returning in a reasonable amount of time, especially when R=1. That is, those requests took MUCH longer than a SUCCESSFUL read. We determined that this happened because one of the partitions was too busy to reply, so the request sat waiting until the timeout expired.

So we added a special case called "basic quorum" (n_val/2 + 1, i.e. 2 of 3 replicas with the default n_val of 3) that is invoked only when a replica returns "not found". The idea is that if a simple majority of the replica partitions report "not found", the object is probably not there. This way, you don't sit around waiting for the last lonely partition to reply when R=1, and your successful reads are still fast because you only wait for one replica.

It's an availability tradeoff: returning a potentially incorrect response vs. appearing unavailable (timing out). We chose the former.

Hope that helps,

Sean Cribbs <[email protected]>
Developer Advocate
Basho Technologies, Inc.
http://basho.com/
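The get-side decision described in this reply (answer "ok" after R successful replies, answer "not found" early once a basic quorum of n_val/2 + 1 replicas report "not found", and keep collecting late replies for read repair) can be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not Riak's actual get FSM: `GetFSM`, `on_reply`, `needs_read_repair`, and the `NOTFOUND` sentinel are hypothetical names, integer division is assumed for n_val/2 + 1, and timeouts, vector clocks, and stale-value comparison are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sentinel standing in for a replica's "not found" reply.
NOTFOUND = object()


@dataclass
class GetFSM:
    """Toy model of a get coordinator (hypothetical; not Riak's real get FSM).

    Tracks replies from n_val replicas and decides what to tell the client.
    Timeouts, vector clocks, and sibling resolution are deliberately left out.
    """
    n_val: int
    r: int
    values: List[object] = field(default_factory=list)
    notfounds: int = 0
    result: Optional[str] = None  # "ok" or "notfound" once the client is answered

    @property
    def basic_quorum(self) -> int:
        # Simple majority of replicas: n_val/2 + 1, assuming integer division.
        return self.n_val // 2 + 1

    def on_reply(self, reply: object) -> Optional[str]:
        """Record one replica reply; return the client-facing result, if decided."""
        if reply is NOTFOUND:
            self.notfounds += 1
        else:
            self.values.append(reply)

        if self.result is None:
            if len(self.values) >= self.r:
                # R successful replies: answer right away.
                self.result = "ok"
            elif self.notfounds >= self.basic_quorum:
                # Basic quorum short-circuit: a majority said "not found",
                # so stop waiting for the remaining (possibly busy) replicas.
                self.result = "notfound"
        # Replies arriving after the decision are still recorded,
        # so they can drive read repair below.
        return self.result

    def needs_read_repair(self) -> bool:
        # Simplification (assumption): repair if any replica returned a value
        # while others returned "not found", whatever the client was told.
        return bool(self.values) and self.notfounds > 0
```

Replaying the scenario from this thread with `GetFSM(n_val=3, r=1)`: two fallback replicas answer `NOTFOUND` first, so the second `on_reply` call already returns `"notfound"` (basic quorum of 2 reached) even though R=1. When the busy primary's value arrives afterwards, the client-facing result stays `"notfound"`, but `needs_read_repair()` is `True`, which is consistent with later gets succeeding once the repair lands.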
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
