No.  AAE is unrelated to the handoff subsystem.  I am not familiar
enough with the lowest level of its workings to know whether it would
reproduce the TTL stuff on nodes that don't have it.

I am not totally sure about your timeline here.

When did you start seeing these errors, before or after your upgrade
to 1.3?  When did you start your cluster transition?  What cluster
transitions have you initiated?
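
If it's easier to answer from the cluster's current state, the output
of the usual riak-admin subcommands (run from any node) should show
both the membership and any transfers that are stuck.  These should
all exist on 1.3, but run riak-admin with no arguments to check if any
of them is missing from your packaging:

    riak-admin member-status
    riak-admin ring-status
    riak-admin transfers
    riak-admin aae-status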

If these errors started with 1.3, an interesting experiment would be
to disable AAE and do a rolling restart of the cluster, which should
lead to empty memory backends that won't be populated by AAE with
anything suspicious.  That said, if you've had cluster balance
problems for a while, it's possible that these messages (even this
whole issue) are just masking some other problem.
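
If you do try that experiment, I believe the relevant knob on 1.3 is
in the riak_kv section of app.config; double-check it against the
default config that ships with your package before copying this
verbatim:

    {riak_kv, [
        %% ... existing riak_kv settings ...

        %% turn active anti-entropy off for the duration of the experiment
        {anti_entropy, {off, []}}
    ]}

Set that on every node, do the rolling restart, and watch
riak-admin transfers; remember to flip it back to {on, []} once you're
done.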

On Thu, Mar 28, 2013 at 3:24 PM, Giri Iyengar
<giri.iyen...@sociocast.com> wrote:
> Evan,
>
> All nodes have been restarted (more than once, in fact) after the config
> changes. Using riak-admin aae-status, I noticed that the anti-entropy repair
> is still proceeding across the cluster.
> It has been less than 24 hours since I upgraded to 1.3 and maybe I have to
> wait till the first complete build of the index trees happens for the
> cluster to start rebalancing itself.
> Could that be the case?
>
> -giri
>
>
> On Thu, Mar 28, 2013 at 5:49 PM, Evan Vigil-McClanahan
> <emcclana...@basho.com> wrote:
>>
>> Giri,
>>
>> if all of the nodes are using identical app.config files (including
>> the joining node) and have been restarted since those files changed,
>> it may be some other, related issue.
>>
>> On Thu, Mar 28, 2013 at 2:46 PM, Giri Iyengar
>> <giri.iyen...@sociocast.com> wrote:
>> > Evan,
>> >
>> > I reconfirmed that all the servers are using identical app.configs. They all
>> > use the multi-backend schema. Are you saying that some of the vnodes are on
>> > the memory backend on one physical node and on the eleveldb backend on
>> > another physical node? If so, how can I fix the offending vnodes?
>> >
>> > Thanks,
>> >
>> > -giri
>> >
>> > On Thu, Mar 28, 2013 at 5:18 PM, Evan Vigil-McClanahan
>> > <emcclana...@basho.com> wrote:
>> >>
>> >> It would if some of the nodes weren't migrated to the new
>> >> multi-backend schema; if a memory node was trying to hand off to an
>> >> eleveldb-backed node, you'd see this.
>> >>
>> >> On Thu, Mar 28, 2013 at 2:05 PM, Giri Iyengar
>> >> <giri.iyen...@sociocast.com> wrote:
>> >> > Evan,
>> >> >
>> >> > I verified that all of the memory backends have the same ttl settings and
>> >> > have done rolling restarts but it doesn't seem to make a difference. One
>> >> > thing to note though -- I remember this problem starting roughly around the
>> >> > time I migrated a bucket from being backed by leveldb to being backed by
>> >> > memory. I did this by setting the bucket properties via curl and let Riak
>> >> > do the migration of the objects in that bucket. Would that cause such
>> >> > issues?
>> >> >
>> >> > Thanks for your help.
>> >> >
>> >> > -giri
>> >> >
>> >> >
>> >> > On Thu, Mar 28, 2013 at 4:55 PM, Evan Vigil-McClanahan
>> >> > <emcclana...@basho.com> wrote:
>> >> >>
>> >> >> Giri, I've seen similar issues in the past when someone was adjusting
>> >> >> their ttl setting on the memory backend.  Because one memory backend
>> >> >> has the new setting and the other does not, handoff fails.  The solution
>> >> >> then was to make sure that all memory backend settings are the same
>> >> >> and then do a rolling restart of the cluster (ignoring a lot of errors
>> >> >> along the way).  I am not sure that this is applicable to your case,
>> >> >> but it's something to look at.
>> >> >>
>> >> >> On Thu, Mar 28, 2013 at 10:22 AM, Giri Iyengar
>> >> >> <giri.iyen...@sociocast.com> wrote:
>> >> >> > Godefroy:
>> >> >> >
>> >> >> > Thanks. Your email exchange on the mailing list was what prompted me to
>> >> >> > consider switching to Riak 1.3. I do see repair messages in the console
>> >> >> > logs and so some healing is happening. However, there are a bunch of
>> >> >> > hinted handoffs and ownership handoffs that are simply not proceeding
>> >> >> > because the same vnodes keep coming up for transfer and fail. Perhaps
>> >> >> > there is a manual way to forcibly repair and push the vnodes around.
>> >> >> >
>> >> >> > -giri
>> >> >> >
>> >> >> >
>> >> >> > On Thu, Mar 28, 2013 at 1:19 PM, Godefroy de Compreignac
>> >> >> > <godef...@eklablog.com> wrote:
>> >> >> >>
>> >> >> >> I have exactly the same problem with my cluster. If anyone knows what
>> >> >> >> those errors mean... :-)
>> >> >> >>
>> >> >> >> Godefroy
>> >> >> >>
>> >> >> >>
>> >> >> >> 2013/3/28 Giri Iyengar <giri.iyen...@sociocast.com>
>> >> >> >>>
>> >> >> >>> Hello,
>> >> >> >>>
>> >> >> >>> We are running a 6-node Riak 1.3.0 cluster in production. We recently
>> >> >> >>> upgraded to 1.3. Prior to this, we were running Riak 1.2 on the same
>> >> >> >>> 6-node cluster.
>> >> >> >>> We are finding that the nodes are not balanced. For instance:
>> >> >> >>>
>> >> >> >>> ================================= Membership ==================================
>> >> >> >>> Status     Ring    Pending    Node
>> >> >> >>> -------------------------------------------------------------------------------
>> >> >> >>> valid       0.0%      0.0%    'riak@172.16.25.106'
>> >> >> >>> valid      34.4%     20.3%    'riak@172.16.25.107'
>> >> >> >>> valid      21.9%     20.3%    'riak@172.16.25.113'
>> >> >> >>> valid      19.5%     20.3%    'riak@172.16.25.114'
>> >> >> >>> valid       8.6%     19.5%    'riak@172.16.25.121'
>> >> >> >>> valid      15.6%     19.5%    'riak@172.16.25.122'
>> >> >> >>> -------------------------------------------------------------------------------
>> >> >> >>> Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> When we look at the logs in the largest node (riak@172.16.25.107), we
>> >> >> >>> see error messages that look like this:
>> >> >> >>>
>> >> >> >>> 2013-03-28 13:04:16.957 [error]
>> >> >> >>> <0.10957.1462>@riak_core_handoff_sender:start_fold:226 hinted_handoff
>> >> >> >>> transfer of riak_kv_vnode from 'riak@172.16.25.107'
>> >> >> >>> 148433760041419827630061740822747494183805648896 to 'riak@172.16.25.121'
>> >> >> >>> 148433760041419827630061740822747494183805648896 failed because of
>> >> >> >>> error:{badmatch,{error,{worker_crash,{function_clause,[{riak_core_pb,encode,[{ts,{1364,476737,222223}},{{ts,{1364,476737,222223}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,84,73,84,89,95,83,69,83,83,109,0,0,0,36,67,54,57,95,48,48,51,56,100,56,102,50,52,49,52,99,97,97,54,102,99,52,56,53,52,99,99,101,51,98,50,48,102,53,98,52,108,0,0,0,1,104,3,100,0,9,114,95,99,111,110,116,101,110,116,104,9,100,0,4,100,105,99,116,97,5,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,108,0,0,0,2,108,0,0,0,11,109,0,0,0,12,99,111,110,116,101,110,116,45,116,121,112,101,97,116,97,101,97,120,97,116,97,47,97,112,97,108,97,97,97,105,97,110,106,108,0,0,0,23,109,0,0,0,11,88,45,82,105,97,107,45,86,84,97,103,97,51,97,120,97,105,97,101,97,120,97,66,97,120,97,107,97,119,97,101,97,75,97,117,97,122,97,111,97,55,97,85,97,104,97,85,97,107,97,112,97,120,97,107,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,5,105,110,100,101,120,106,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,20,88,45,82,105,97,107,45,76,97,115,116,45,77,111,100,105,102,105,101,100,104,3,98,0,0,5,84,98,0,7,70,65,98,0,3,99,115,106,106,108,0,0,0,1,108,0,0,0,6,109,0,0,0,7,99,104,97,114,115,101,116,97,85,97,84,97,70,97,45,97,56,106,106,109,0,0,0,36,52,54,55,98,54,51,98,50,45,50,99,56,52,45,52,56,50,99,45,97,48,99,54,45,56,53,50,100,53,99,57,97,98,98,53,101,106,108,0,0,0,1,104,2,109,0,0,0,8,0,69,155,215,81,84,63,31,104,2,97,1,110,5,0,65,191,200,202,14,106,104,9,100,0,4,100,105,99,116,97,1,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,108,0,0,0,1,108,0,0,0,1,100,0,5,99,108,101,97,110,100,0,4,116,114,117,101,106,106,100,0,9,117,110,100,101,102,105,110,101,100>>}],[{file,"src/riak_core_pb.erl"},{line,40}]},{riak_core_pb,pack,5,...},...]},...}}}
>> >> >> >>> [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,170}]}]
>> >> >> >>> 2013-03-28 13:04:16.961 [error] <0.29352.909> CRASH REPORT Process
>> >> >> >>> <0.29352.909> with 0 neighbours exited with reason: no function clause
>> >> >> >>> matching riak_core_pb:encode({ts,{1364,476737,222223}},
>> >> >> >>> {{ts,{1364,476737,222223}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>})
>> >> >> >>> line 40 in gen_server:terminate/6 line 747
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> 2013-03-28 13:04:13.888 [error]
>> >> >> >>> <0.12680.1435>@riak_core_handoff_sender:start_fold:226 ownership_handoff
>> >> >> >>> transfer of riak_kv_vnode from 'riak@172.16.25.107'
>> >> >> >>> 11417981541647679048466287755595961091061972992 to 'riak@172.16.25.113'
>> >> >> >>> 11417981541647679048466287755595961091061972992 failed because of
>> >> >> >>> error:{badmatch,{error,{worker_crash,{function_clause,[{riak_core_pb,encode,[{ts,{1364,458917,232318}},{{ts,{1364,458917,232318}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,84,73,84,89,95,83,69,83,83,109,0,0,0,36,67,54,57,95,48,48,48,54,52,98,99,52,53,51,49,52,55,101,50,101,53,97,102,101,102,49,57,99,50,55,99,97,49,53,54,99,108,0,0,0,1,104,3,100,0,9,114,95,99,111,110,116,101,110,116,104,9,100,0,4,100,105,99,116,97,5,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,108,0,0,0,2,108,0,0,0,11,109,0,0,0,12,99,111,110,116,101,110,116,45,116,121,112,101,97,116,97,101,97,120,97,116,97,47,97,112,97,108,97,97,97,105,97,110,106,108,0,0,0,23,109,0,0,0,11,88,45,82,105,97,107,45,86,84,97,103,97,54,97,88,97,76,97,66,97,69,97,69,97,116,97,73,97,104,97,118,97,77,97,86,97,48,97,81,97,103,97,110,97,119,97,73,97,51,97,85,97,72,97,53,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,5,105,110,100,101,120,106,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,20,88,45,82,105,97,107,45,76,97,115,116,45,77,111,100,105,102,105,101,100,104,3,98,0,0,5,84,98,0,7,0,165,98,0,3,138,179,106,106,108,0,0,0,1,108,0,0,0,6,109,0,0,0,7,99,104,97,114,115,101,116,97,85,97,84,97,70,97,45,97,56,106,106,109,0,0,0,36,55,102,98,52,50,54,54,53,45,57,100,56,48,45,52,54,98,97,45,98,53,97,100,45,56,55,52,52,54,54,97,97,50,56,53,99,106,108,0,0,0,1,104,2,109,0,0,0,8,0,69,155,215,81,59,179,219,104,2,97,1,110,5,0,165,121,200,202,14,106,104,9,100,0,4,100,105,99,116,97,1,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,108,0,0,0,1,108,0,0,0,1,100,0,5,99,108,101,97,110,100,0,4,116,114,117,101,106,106,100,0,9,117,110,100,101,102,105,110,101,100>>}],[{file,"src/riak_core_pb.erl"},{line,40}]},{riak_core_pb,pack,5,[{...},...]},...]},...}}}
>> >> >> >>> [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,170}]}]
>> >> >> >>> 2013-03-28 13:04:14.255 [error] <0.1120.0> CRASH REPORT Process
>> >> >> >>> <0.1120.0> with 0 neighbours exited with reason: no function clause
>> >> >> >>> matching riak_core_pb:encode({ts,{1364,458917,232318}},
>> >> >> >>> {{ts,{1364,458917,232318}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>})
>> >> >> >>> line 40 in gen_server:terminate/6 line 747
>> >> >> >>>
>> >> >> >>> This has been going on for days and the cluster doesn't seem to be
>> >> >> >>> rebalancing itself. We see this issue with both hinted_handoffs and
>> >> >> >>> ownership_handoffs. It looks like we have some corrupt data in our
>> >> >> >>> cluster. I checked through the leveldb LOGs and did not see any
>> >> >> >>> compaction errors. I was hoping that upgrading to 1.3.0 would slowly
>> >> >> >>> start repairing the cluster. However, that doesn't seem to be happening.
>> >> >> >>>
>> >> >> >>> Any help/hints would be much appreciated.
>> >> >> >>>
>> >> >> >>> -giri
>> >> >> >>> --
>> >> >> >>> GIRI IYENGAR, CTO
>> >> >> >>> SOCIOCAST
>> >> >> >>> Simple. Powerful. Predictions.
>> >> >> >>>
>> >> >> >>> 36 WEST 25TH STREET, 7TH FLOOR NEW YORK CITY, NY 10010
>> >> >> >>> O: 917.525.2466x104   M: 914.924.7935   F: 347.943.6281
>> >> >> >>> E: giri.iyen...@sociocast.com W: www.sociocast.com
>> >> >> >>>
>> >> >> >>> Facebook's Ad Guru Joins Sociocast - http://bit.ly/NjPQBQ
>> >> >> >>>
>> >> >> >>> _______________________________________________
>> >> >> >>> riak-users mailing list
>> >> >> >>> riak-users@lists.basho.com
>> >> >> >>>
>> >> >> >>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> >> >> >>>
>> >> >> >>

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
