BTW, I wasn't suggesting that claim v3 would choose a plan with violations
over one without - so I don't think there is a bug.

The plan below which I said scored well, but was "worse" than a purely
sequential plan, is worse only in the sense that it does not cope as well
with dual node failures (which is not something the algorithm even tries to
score for).  So in the two plans below the first plan scores better for
diversity, but there are numerous dual-node failures (e.g n1, n2) that
would lead to writes to some partitions being stored on just two physical
nodes.  There are no target_n_val violations in either cluster.

> ScoreFun([n1, n2, n3, n4, n1, n2, n3, n4, n5, n6, n7, n8, n5, n6, n7,
n8], 4).

> ScoreFun([n1, n2, n3, n4, n5, n6, n7, n8, n1, n2, n3, n4, n5, n6, n7,
n8], 4).

Perhaps the algorithm could be expanded to include other factors such as
the minimum spacing for any node, but then I assume other anomalies will
occur.  I suspect the scoring algorithm could be made ever more complex
without ever managing to please all of the people all of the time.  It is
because of this that I think claim v3 is perhaps best suited as a secondary
option rather than it being the default algorithm: and we should work on
ironing out some of the v2 issues rather than just switching to v3.



On 19 May 2017 at 05:06, Jon Meredith <> wrote:

> That's what I get for doing things from memory and not running the
> simulator :)
> I think you're right about the actual operation of the preference lists -
> but i haven't had a chance to look over the code or run some simulations.
> The effect isn't quite as severe, but as you say unevenly loads the cluster
> which is best avoided.
> I took a quick look at the v3 code from my phone and it looks like the
> plans are ordered by target nval violations, node ownership balance and
> finally diversity.
> It shouldn't ever pick a plan that increases violations, however if the
> starting plan has violations that aren't resolved (which is very possible
> for lots of combinations of Q and S) then it will start optimizing on the
> balance and then diversity. Maybe there's a bug somewhere.
> One thing I thought about but never implemented was adding some freedom to
> the algorithm to move a small number of unnecessary partitions to see if
> that would free things up and get better balance/diversity.
> For improving v2 you might try seeing if there's a way to
> deterministically shuffle the starting point where it starts to look for
> patitions to take - rather than always from the beginning or end.
> Jon
> On Thu, May 18, 2017 at 12:21 PM Martin Sumner <
>> wrote:
>> Jon,
>> With regards to this snippet below, I think I get your point, but I don't
>> think the example is valid:
>> >>>>>>>
>> If with N=3 if a node goes down, all of the responsibility for that
>> node is shift to another single node in the cluster.
>> n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4    (Q=8 S=4,TargetN4)
>> Partition   All Up     n4 down
>> (position)
>>     0       n2 n3 n4   n2 n3 n1
>>     1       n3 n4 n1   n3 n1 n1
>>     2       n4 n1 n2   n1 n1 n2
>>     3       n1 n2 n3   n1 n2 n3
>>     4       n2 n3 n4   n2 n3 n1
>>     5       n3 n4 n1   n3 n1 n1
>>     6       n4 n1 n2   n1 n1 n2
>>     7       n1 n2 n3   n1 n2 n3
>> With all nodes up, the number of times each node appears in a preflist
>> is equal.  6 * n1, 6 * n2, 6 * n3, 6 * n4 each appears (TN*Q/S)
>> But during single node failure
>> 12 * n1, 6 * n2, 6 * n3, n4 down.
>> The load on n1 is doubled.
>> >>>>>>>
>> This is not how I understood fallback election works.  My understanding
>> is that the fallback is the node which owns the first vnode after the
>> preflist, where the node is up.  Not the node which owns the first vnode
>> after the unavailable vnode.  That is to say to find fallbacks for a
>> preflist, Riak will iterate around the ring from after the primaries, not
>> iterate around the ring from after the unavailable vnode.
>> So I would expect the following arrangement on n4 going down.
>> ring = n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4     (Q=8 S=4,TargetN4)
>> Partition   All Up
>> (position)
>>     0       n2       n3       n4 (p3)
>>     1       n3       n4 (p3) n1
>>     2       n4 (p3) n1       n2
>>     3       n1       n2        n3
>>     4       n2       n3        n4 (p7)
>>     5       n3       n4 (p7) n1
>>     6       n4 (p7) n1       n2
>>     7       n1        n2       n3
>> Partition   n4 down
>> (position)
>>     0       n2  n3  n1 (fb for p3)
>>     1       n3  n1  n2 (fb for p3)
>>     2       n1  n2  n3 (fb for p3)
>>     3       n1  n2  n3
>>     4       n2  n3  n1 (fb for p7)
>>     5       n3  n1  n2 (fb for p7)
>>     6       n1  n2  n3 (fb for p7)
>>     7       n1  n2  n3
>> So there isn't a biasing of load in this case, all nodes 33.3% more
>> load.  Interestingly we do go from having 8 vnodes live on 4 nodes, to
>> having 12 vnodes live on 3 nodes when one node fails in this case - so the
>> number of vnodes active does double on all nodes (not sure if the dynamic
>> memory allocation in leveldb handles this?).
>> I still think you have a valid point about about simple sequenced
>> allocations being bad though.  If we have a ring-size of 16 and a node
>> count of 8:
>> ring = n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n1 | n2 | n3 | n4 | n5 |
>> n6 | n7 | n8    (Q=16 S=8,TargetN4)
>> Then on failure of node 4 - n5, n6, n7 have a 33% increase in load, but
>> all other nodes remain with their previous load.  This is true for any
>> failure in this diagonalised ring.
>> Whereas if we shuffle the ring this way:
>> ring = n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n4 | n3 | n2 | n1 | n8 |
>> n7 | n6 | n5    (Q=16 S=8,TargetN4)
>> Now node 4 down will lead to n1, n2, n3, n5, n6, and n7 each having 16.7%
>> extra load each.  This more even balance of load is also true for the
>> failure of any node in this ring.
>> So I think your point is valid - but I think the example is wrong.
>> And claim v3 does do a good job on these two rings, as the scoring for
>> the second ring is much better:
>> > ScoreFun = fun(L, N) -> riak_core_claim_util:score_am(
>> lists:sort(riak_core_claim_util:adjacency_matrix_from_al(
>> riak_core_claim_util:adjacency_list(L))),N) end.
>> #Fun<erl_eval.12.80484245>
>> > ScoreFun([n1, n2, n3, n4, n5, n6, n7, n8, n1, n2, n3, n4, n5, n6, n7,
>> n8], 4).
>> 109.7142857142855
>> > ScoreFun([n1, n2, n3, n4, n5, n6, n7, n8, n4, n3, n2, n1, n8, n7, n6,
>> n5], 4).
>> 61.71428571428584
>> Neither can I think of a way of bettering this by improving claim v2.
>> However as mentioned in the long read, one of the issues with claim v3
>> might be that rings with bad properties (such as loss of physical diversity
>> on dual node failures) can also get good scores:
>> > ScoreFun([n1, n2, n3, n4, n1, n2, n3, n4, n5, n6, n7, n8, n5, n6, n7,
>> n8], 4).
>> 66.0
>> Regards
>> Martin
>> On 17 May 2017 at 16:34, Jon Meredith <> wrote:
>>> Thanks for the excellent writeup.
>>> I have a few notes on your writeup and then a little history to help
>>> explain the motivation for the v3 work.
>>> The Claiming Problem
>>>   One other property of the broader claim algorithm + claimant + handoff
>>>   manager group of processes that's worth mentioning is safety during
>>>   transition.  The cluster should ensure that target N-val copies
>>>   are always available even during transitions.  Much earlier in Riak's
>>>   life the claim would just execute and ownership transfer immediately,
>>>   without putting the data in place (fine, it's eventually consistent,
>>> right?)
>>>   but that meant if more than two vnodes in a preference list changed
>>>   ownership then clients would read not found until at least one of the
>>>   objects it was receiving had transferred. The claimant now shepherds
>>> those
>>>   transitions so it should be safe.  The solution of transferring the
>>>   data before ownership has fixed the notfound problem, but Riak lost
>>>   agility in adding capacity to the cluster - existing data has to
>>> transfer
>>>   to new nodes before they are freed up, and they continue to grow
>>>   while waiting.  In hindsight, Ryan Zezeski's plan of just adding new
>>>   capacity and proxying back to the original vnode is probably a better
>>>   option.
>>>   Predicting load on the cluster is also difficult with the single
>>>   ring with a target n-val set at creation time being used for all
>>>   buckets despite their n-value.  To compute the operations sent to
>>>   each vnode you need to know the proportion of access to each N-value.
>>>   There's also the problem that if a bucket is created with an N-value
>>>   larger than target N all bets are off about the number of physical
>>> nodes
>>>   values are written to (*cough* strong consistency N-5)
>>>   Having a partitioning-scheme-per-N-value is one way of sidestepping
>>> the
>>>   load prediction and max-N problems.
>>> Promixity of Vnodes
>>>   An alternate solution to the target_n_val problem is to change the way
>>>   fallback partitions are added and apply an additional uniqueness
>>> constraint
>>>   as target nodes are added.  That provides safety against multiple node
>>>   failures (although can potentially cause loading problems).  I think
>>>   you imply this a couple of points when you talk about 'at runtime'.
>>> Proximity of vnodes as the partition list wraps
>>>   One kludge I considered solving the wraparound problem is to go from
>>>   a ring to a 'spiral' where you add extra target_n_val-1 additional
>>>   vnodes that alias the few vnodes in the ring.
>>>   Using the pathalogically bad (vnodes) Q=4, (nodes) S=3, (nval) N=3
>>> ```
>>>   v0 | v1 | v2 | v3
>>>   nA | nB | nC | nA
>>>   p0 = [ {v1, nB} {v2, Nc} {v3, nA} ]
>>>   p1 = [ {v2, Nc} {v3, nA} {v0, nA} ] <<< Bad
>>>   p2 = [ {v3, nA} {v0, nA} {v1, nB} ] <<< Bad
>>>   p3 = [ {v0, nA} {v1, nB} {v2, nC} ]
>>> ```
>>>   You get 2/4 preflists violating target_n_val=3.
>>>   If you extend the ring to allow aliasing (i.e. go beyond 2^160) but
>>>   only use it for assignment
>>> ```
>>>   v0 | v1 | v2 | v3 | v0' | v1'
>>>   nA | nB | nC | nA | nB  | nC
>>>   p0 = [ {v1, nB} {v2, Nc}  {v3, nA} ]
>>>   p1 = [ {v2, Nc} {v3, nA}  {v0', nB} ]
>>>   p2 = [ {v3, nA} {v0', nB} {v1', nB} ]
>>>   p3 = [ {v0, nA} {v1, nB}  {v2, nC} ]
>>> ```
>>>   The additional vnodes can never be hashed directly, just during
>>>   wraparound.
>>> As you say, the v3 algorithm was written (by me) a long time ago and
>>> never made it to production.  It was due to a few factors, partly
>>> the non-determinism, partly because I didn't like the (very stupid)
>>> optimization system tying up the claimant node for multiple seconds,
>>> but more troublingly when we did some commissioning tests for a large
>>> customer that ran with a ring size of 256 with 60 nodes we experienced
>>> a performance drop of around 5% when the cluster was maxed out for
>>> reads.  The diversity measurements were much 'better' in that the
>>> v3 claimed cluster was far more diverse and performed better during
>>> node failures, but the (unproven) fear that having a greater number
>>> of saturated disterl connections between nodes dropped performance
>>> without explanation stopped me from promoting it to default.
>>> The reason the v3 algorithm was created was to resolve problems with
>>> longer lived clusters created with the v2 claim that had had nodes
>>> added and removed over time.  I don't remember all the details now,
>>> but I think the cluster had a ring size of 1024 (to future proof,
>>> as no 2I/listkey on that cluster) and somewhere between 15-30 nodes.
>>> In that particular configuration, the v2 algorithm had left the original
>>> sequential node assignment (n1, n2, ..., n15, n1, n2, ...) and assigned
>>> new nodes in place, but that left many places were the original
>>> sequential
>>> assignments still existed.
>>> What we hadn't realized at the time is that sequential node assignment
>>> is the *worst* possible plan for handling fallback load.
>>> If with N=3 if a node goes down, all of the responsibility for that
>>> node is shift to another single node in the cluster.
>>> n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4    (Q=8 S=4,TargetN4)
>>> Partition   All Up     n4 down
>>> (position)
>>>     0       n2 n3 n4   n2 n3 n1
>>>     1       n3 n4 n1   n3 n1 n1
>>>     2       n4 n1 n2   n1 n1 n2
>>>     3       n1 n2 n3   n1 n2 n3
>>>     4       n2 n3 n4   n2 n3 n1
>>>     5       n3 n4 n1   n3 n1 n1
>>>     6       n4 n1 n2   n1 n1 n2
>>>     7       n1 n2 n3   n1 n2 n3
>>> With all nodes up, the number of times each node appears in a preflist
>>> is equal.  6 * n1, 6 * n2, 6 * n3, 6 * n4 each appears (TN*Q/S)
>>> But during single node failure
>>> 12 * n1, 6 * n2, 6 * n3, n4 down.
>>> The load on n1 is doubled.
>>> In the real scenario, although it was no longer sequentially assigned
>>> there were still a large number of very similar preference lists to
>>> the original assignment (as growing a few nodes on that ring size
>>> only reassigns preference lists in proportion to the new nodes claiming
>>> partitions).
>>> The production cluster was running fairly close to capacity, so the
>>> increased loading during failure, even though it wasn't as bad as doubled
>>> was enough to push it over the performance 'step' lowering tail latencies
>>> and slowed it down enough to overload the vnodes and exhaust memory
>>> crashing the next node causing a cascade.  This was before vnodes had
>>> overload protection so would present differently now.
>>> Another pre-claimant problem that shaped some of the earlier claim
>>> code vnode 'want' threshods was that when the nodes were individually
>>> allowed to say if they wanted to claim more vnodes (with the
>>> wants_claim function, before calling choose_claim), there were some
>>> states
>>> the cluster would get into where two nodes both decided they were under
>>> capacity and continually tried to claim, causing the vnode to flip/flop
>>> back and forth between them (that was a reason for writing one of the
>>> early
>>> QuickCheck tests).
>>> I'm not sure if you've encountered it or not, but the riak_core_claim_sim
>>> is also a good tool for testing the behavior of the claim functions and
>>> the claimant.  You don't mention it in your write up, but one of the
>>> important functions of the claimant is to make sure it only performs
>>> safe transitions between rings.  It makes sure that the n val is not
>>> violated during handoff.
>>> What to do?
>>>   Fixing the claim algorithm is one way of doing things, but I worry
>>>   it has a number of problems that are hard to solve (multi-AZ,
>>> multi-Nval
>>>   etc).
>>>   One more radical option is to dump the ring and just publish a table
>>>   per-vnode of the nodes and vnode hash you'd like to service them.
>>>   Riak doesn't really need consistent hashing - it doesn't *really* use
>>>   it's original form (the Dynamo A scheme), and is more of a hybrid
>>>   of the B/C schemes.
>>>   Use cluster metadata and publish out the tables, update riak_core_apl
>>>   to take the new data and serve up the preference lists.  Obviously
>>>   it trickles into things like the vnode and handoff managers, but it
>>>   may be possible.
>>>   That gives you the advantage of no longer being constrained in how
>>>   you assign the nodes - a separation of policy and execution.  You
>>>   could keep the existing ring based algorithms, or you could do
>>> something
>>>   better.
>>>   It may be interesting to change the number of vnodes/hashing algorithm
>>>   too.  Jordan West was a big fan of Consistent Jump Hashing at one
>>> point.
>>>   The thing you give up if you lose the power-of-2 partitioning scheme
>>>   is the ability to split and combine partitions.  Each partition in
>>>   a 64 vnode ring maps to exactly two (non-consecutive) partitions in a
>>> 128
>>>   vnode ring.  Which is a very nice for replicating between clusters
>>>   with different ring sizes and localizing where to look for data.
>>> Good luck!
>>> On Wed, May 17, 2017 at 6:37 AM Daniel Abrahamsson <>
>>> wrote:
>>>> Thanks for the writeup and detailed investigation, Martin.
>>>> We ran into these issues a few months when we expanded a 5 node cluster
>>>> into a 8 node cluster. We ended up rebuilding the cluster and writing a
>>>> small escript to verify that the generated riak ring lived up to our
>>>> requirements (which were 1: to survive an AZ outage, and 2: to survive any
>>>> 2 nodes going down at the same time).
>>>> This will be a great document to refer to when explaining the
>>>> subtleties of setting up a Riak cluster.
>>>> //Daniel
>>>> _______________________________________________
>>>> riak-users mailing list
riak-users mailing list

Reply via email to