Giri, I've seen similar issues in the past when someone was adjusting
the ttl setting on the memory backend. Because one memory backend has
it and the other does not, the handoff fails. The solution then was to
make sure that all memory backend settings are identical across the
cluster and then do a rolling restart (ignoring a lot of errors along
the way). I'm not sure this is applicable to your case, but it's
something to look at.
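
For reference, in case it helps: if I remember the 1.2/1.3 config layout
correctly, the memory backend knobs live in the riak_kv section of
app.config, and something along these lines (values purely illustrative)
would need to be identical on every node before the rolling restart:

    {riak_kv, [
        %% standalone memory backend; with multi_backend the same
        %% properties go inside each backend definition instead
        {storage_backend, riak_kv_memory_backend},
        {memory_backend, [
            {max_memory, 4096},   %% per-vnode cap in MB
            {ttl, 86400}          %% object lifetime in seconds
        ]}
    ]}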
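
As for the manual repair Giri asks about below: since 1.2 there is, as far
as I know, a per-partition repair you can trigger from an attached console.
Roughly (double-check against the docs for your version; the list here is
just the partitions owned by the node you attach to):

    $ riak attach
    %% in the Erlang shell on riak@172.16.25.107
    {ok, Ring} = riak_core_ring_manager:get_my_ring().
    Partitions = [P || {P, N} <- riak_core_ring:all_owners(Ring), N =:= node()].
    [riak_kv_vnode:repair(P) || P <- Partitions].

Whether that helps when the handoff itself keeps dying in
riak_core_pb:encode, I can't say, but it should at least rebuild the data
in those partitions from their replicas.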

On Thu, Mar 28, 2013 at 10:22 AM, Giri Iyengar
<giri.iyen...@sociocast.com> wrote:
> Godefroy:
>
> Thanks. Your email exchange on the mailing list was what prompted me to
> consider switching to Riak 1.3. I do see repair messages in the console
> logs, so some healing is happening. However, there are a bunch of hinted
> handoffs and ownership handoffs that are simply not proceeding: the same
> vnodes keep coming up for transfer and failing. Perhaps there is a manual
> way to forcibly repair and push the vnodes around.
>
> -giri
>
>
> On Thu, Mar 28, 2013 at 1:19 PM, Godefroy de Compreignac
> <godef...@eklablog.com> wrote:
>>
>> I have exactly the same problem with my cluster. If anyone knows what
>> those errors mean... :-)
>>
>> Godefroy
>>
>>
>> 2013/3/28 Giri Iyengar <giri.iyen...@sociocast.com>
>>>
>>> Hello,
>>>
>>> We are running a 6-node Riak 1.3.0 cluster in production. We recently
>>> upgraded to 1.3. Prior to this, we were running Riak 1.2 on the same 6-node
>>> cluster.
>>>
>>> We are finding that the nodes are not balanced. For instance:
>>>
>>> ================================= Membership ==================================
>>> Status     Ring    Pending    Node
>>> -------------------------------------------------------------------------------
>>> valid       0.0%      0.0%    'riak@172.16.25.106'
>>> valid      34.4%     20.3%    'riak@172.16.25.107'
>>> valid      21.9%     20.3%    'riak@172.16.25.113'
>>> valid      19.5%     20.3%    'riak@172.16.25.114'
>>> valid       8.6%     19.5%    'riak@172.16.25.121'
>>> valid      15.6%     19.5%    'riak@172.16.25.122'
>>> -------------------------------------------------------------------------------
>>> Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
>>>
>>>
>>> When we look at the logs on the largest node (riak@172.16.25.107), we
>>> see error messages that look like this:
>>>
>>> 2013-03-28 13:04:16.957 [error]
>>> <0.10957.1462>@riak_core_handoff_sender:start_fold:226 hinted_handoff
>>> transfer of riak_kv_vnode from 'riak@172.16.25.107'
>>> 148433760041419827630061740822747494183805648896 to 'riak@172.16.25.121'
>>> 148433760041419827630061740822747494183805648896 failed because of
>>> error:{badmatch,{error,{worker_crash,{function_clause,[{riak_core_pb,encode,[{ts,{1364,476737,222223}},{{ts,{1364,476737,222223}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,84,73,84,89,95,83,69,83,83,109,0,0,0,36,67,54,57,95,48,48,51,56,100,56,102,50,52,49,52,99,97,97,54,102,99,52,56,53,52,99,99,101,51,98,50,48,102,53,98,52,108,0,0,0,1,104,3,100,0,9,114,95,99,111,110,116,101,110,116,104,9,100,0,4,100,105,99,116,97,5,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,108,0,0,0,2,108,0,0,0,11,109,0,0,0,12,99,111,110,116,101,110,116,45,116,121,112,101,97,116,97,101,97,120,97,116,97,47,97,112,97,108,97,97,97,105,97,110,106,108,0,0,0,23,109,0,0,0,11,88,45,82,105,97,107,45,86,84,97,103,97,51,97,120,97,105,97,101,97,120,97,66,97,120,97,107,97,119,97,101,97,75,97,117,97,122,97,111,97,55,97,85,97,104,97,85,97,107,97,112,97,120,97,107,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,5,105,110,100,101,120,106,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,20,88,45,82,105,97,107,45,76,97,115,116,45,77,111,100,105,102,105,101,100,104,3,98,0,0,5,84,98,0,7,70,65,98,0,3,99,115,106,106,108,0,0,0,1,108,0,0,0,6,109,0,0,0,7,99,104,97,114,115,101,116,97,85,97,84,97,70,97,45,97,56,106,106,109,0,0,0,36,52,54,55,98,54,51,98,50,45,50,99,56,52,45,52,56,50,99,45,97,48,99,54,45,56,53,50,100,53,99,57,97,98,98,53,101,106,108,0,0,0,1,104,2,109,0,0,0,8,0,69,155,215,81,84,63,31,104,2,97,1,110,5,0,65,191,200,202,14,106,104,9,100,0,4,100,105,99,116,97,1,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,108,0,0,0,1,108,0,0,0,1,100,0,5,99,108,101,97,110,100,0,4,116,114,117,101,106,106,100,0,9,117,110,100,101,102,105,110,101,100>>}],[{file,"src/riak_core_pb.erl"},{line,40}]},{riak_core_pb,pack,5,...},...]},...}}}
>>> [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,170}]}]
>>> 2013-03-28 13:04:16.961 [error] <0.29352.909> CRASH REPORT Process
>>> <0.29352.909> with 0 neighbours exited with reason: no function clause
>>> matching riak_core_pb:encode({ts,{1364,476737,222223}},
>>> {{ts,{1364,476737,222223}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>})
>>> line 40 in gen_server:terminate/6 line 747
>>>
>>>
>>> 2013-03-28 13:04:13.888 [error]
>>> <0.12680.1435>@riak_core_handoff_sender:start_fold:226 ownership_handoff
>>> transfer of riak_kv_vnode from 'riak@172.16.25.107'
>>> 11417981541647679048466287755595961091061972992 to 'riak@172.16.25.113'
>>> 11417981541647679048466287755595961091061972992 failed because of
>>> error:{badmatch,{error,{worker_crash,{function_clause,[{riak_core_pb,encode,[{ts,{1364,458917,232318}},{{ts,{1364,458917,232318}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,84,73,84,89,95,83,69,83,83,109,0,0,0,36,67,54,57,95,48,48,48,54,52,98,99,52,53,51,49,52,55,101,50,101,53,97,102,101,102,49,57,99,50,55,99,97,49,53,54,99,108,0,0,0,1,104,3,100,0,9,114,95,99,111,110,116,101,110,116,104,9,100,0,4,100,105,99,116,97,5,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,108,0,0,0,2,108,0,0,0,11,109,0,0,0,12,99,111,110,116,101,110,116,45,116,121,112,101,97,116,97,101,97,120,97,116,97,47,97,112,97,108,97,97,97,105,97,110,106,108,0,0,0,23,109,0,0,0,11,88,45,82,105,97,107,45,86,84,97,103,97,54,97,88,97,76,97,66,97,69,97,69,97,116,97,73,97,104,97,118,97,77,97,86,97,48,97,81,97,103,97,110,97,119,97,73,97,51,97,85,97,72,97,53,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,5,105,110,100,101,120,106,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,20,88,45,82,105,97,107,45,76,97,115,116,45,77,111,100,105,102,105,101,100,104,3,98,0,0,5,84,98,0,7,0,165,98,0,3,138,179,106,106,108,0,0,0,1,108,0,0,0,6,109,0,0,0,7,99,104,97,114,115,101,116,97,85,97,84,97,70,97,45,97,56,106,106,109,0,0,0,36,55,102,98,52,50,54,54,53,45,57,100,56,48,45,52,54,98,97,45,98,53,97,100,45,56,55,52,52,54,54,97,97,50,56,53,99,106,108,0,0,0,1,104,2,109,0,0,0,8,0,69,155,215,81,59,179,219,104,2,97,1,110,5,0,165,121,200,202,14,106,104,9,100,0,4,100,105,99,116,97,1,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,108,0,0,0,1,108,0,0,0,1,100,0,5,99,108,101,97,110,100,0,4,116,114,117,101,106,106,100,0,9,117,110,100,101,102,105,110,101,100>>}],[{file,"src/riak_core_pb.erl"},{line,40}]},{riak_core_pb,pack,5,[{...},...]},...]},...}}}
>>> [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,170}]}]
>>> 2013-03-28 13:04:14.255 [error] <0.1120.0> CRASH REPORT Process
>>> <0.1120.0> with 0 neighbours exited with reason: no function clause matching
>>> riak_core_pb:encode({ts,{1364,458917,232318}},
>>> {{ts,{1364,458917,232318}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>})
>>> line 40 in gen_server:terminate/6 line 747
>>>
>>> This has been going on for days and the cluster doesn't seem to be
>>> rebalancing itself. We see this issue with both hinted_handoffs and
>>> ownership_handoffs. It looks like we have some corrupt data in our
>>> cluster; I checked through the leveldb LOGs and did not see any
>>> compaction errors. I was hoping that upgrading to 1.3.0 would slowly
>>> start repairing the cluster, but that doesn't seem to be happening.
>>>
>>> Any help/hints would be much appreciated.
>>>
>>> -giri
>>> --
>>> GIRI IYENGAR, CTO
>>> SOCIOCAST
>>> Simple. Powerful. Predictions.
>>>
>>> 36 WEST 25TH STREET, 7TH FLOOR NEW YORK CITY, NY 10010
>>> O: 917.525.2466x104   M: 914.924.7935   F: 347.943.6281
>>> E: giri.iyen...@sociocast.com W: www.sociocast.com
>>>
>>> Facebook's Ad Guru Joins Sociocast - http://bit.ly/NjPQBQ
>>>
>>
>
>

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
