Giri, if all of the nodes are using identical app.config files (including the joining node) and have been restarted since those files changed, it may be some other, related issue.
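For reference, this is roughly the shape of the multi_backend section I'd expect inside the riak_kv section of app.config on every node. This is only a minimal sketch -- the backend names, data_root path, and ttl value below are placeholders, not your actual settings -- the point is just that this section (including any memory backend ttl) needs to be identical on all six nodes, with every node restarted after the last change:

    %% sketch only: inside the riak_kv section of app.config; backend
    %% names, paths, and ttl are placeholders -- keep whatever your
    %% cluster already uses, just keep it identical on every node
    {riak_kv, [
        {storage_backend, riak_kv_multi_backend},
        {multi_backend_default, <<"eleveldb_mult">>},
        {multi_backend, [
            {<<"eleveldb_mult">>, riak_kv_eleveldb_backend, [
                {data_root, "/var/lib/riak/leveldb"}
            ]},
            {<<"memory_mult">>, riak_kv_memory_backend, [
                %% a ttl present on some nodes but not others (or set to
                %% a different value) is exactly the kind of mismatch
                %% that breaks handoff between memory vnodes
                {ttl, 86400}
            ]}
        ]}
    ]}

Diffing that section across the nodes is usually the quickest way to rule this out.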
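On the question further down the thread about a manual way to repair and push vnodes around: 1.3 has a per-partition repair that can be kicked off from the Erlang console. A sketch, assuming you run it via riak attach on the node that currently owns the partition, and using the partition index from your handoff error purely as an example:

    %% run from `riak attach` on the node that owns the partition;
    %% the index below is just the one from the handoff error above
    riak_kv_vnode:repair(148433760041419827630061740822747494183805648896).

riak-admin transfers will show whether the stuck handoffs eventually drain, but repair rebuilds a partition from its neighbours and probably won't, by itself, clear a handoff that is crashing inside riak_core_pb:encode, so the backend/ttl mismatch is still the first thing to rule out.
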
On Thu, Mar 28, 2013 at 2:46 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:
> Evan,
>
> I reconfirmed that all the servers are using identical app.configs. They all
> use the multi-backend schema. Are you saying that some of the vnodes are on
> the memory backend on one physical node and on the eleveldb backend on
> another physical node? If so, how can I fix the offending vnodes?
>
> Thanks,
>
> -giri
>
> On Thu, Mar 28, 2013 at 5:18 PM, Evan Vigil-McClanahan
> <emcclana...@basho.com> wrote:
>>
>> It would if some of the nodes weren't migrated to the new multi-backend
>> schema; if a memory-backed vnode was trying to hand off to an
>> eleveldb-backed node, you'd see this.
>>
>> On Thu, Mar 28, 2013 at 2:05 PM, Giri Iyengar
>> <giri.iyen...@sociocast.com> wrote:
>> > Evan,
>> >
>> > I verified that all of the memory backends have the same ttl settings and
>> > have done rolling restarts, but it doesn't seem to make a difference. One
>> > thing to note, though -- I remember this problem starting roughly around
>> > the time I migrated a bucket from being backed by leveldb to being backed
>> > by memory. I did this by setting the bucket properties via curl and let
>> > Riak do the migration of the objects in that bucket. Would that cause
>> > such issues?
>> >
>> > Thanks for your help.
>> >
>> > -giri
>> >
>> > On Thu, Mar 28, 2013 at 4:55 PM, Evan Vigil-McClanahan
>> > <emcclana...@basho.com> wrote:
>> >>
>> >> Giri, I've seen similar issues in the past when someone was adjusting
>> >> their ttl setting on the memory backend. Because one memory backend
>> >> has it and the other does not, it fails on handoff. The solution then
>> >> was to make sure that all memory backend settings are the same and
>> >> then do a rolling restart of the cluster (ignoring a lot of errors
>> >> along the way). I am not sure that this is applicable to your case,
>> >> but it's something to look at.
>> >>
>> >> On Thu, Mar 28, 2013 at 10:22 AM, Giri Iyengar
>> >> <giri.iyen...@sociocast.com> wrote:
>> >> > Godefroy:
>> >> >
>> >> > Thanks. Your email exchange on the mailing list was what prompted me
>> >> > to consider switching to Riak 1.3. I do see repair messages in the
>> >> > console logs, so some healing is happening. However, there are a
>> >> > bunch of hinted handoffs and ownership handoffs that are simply not
>> >> > proceeding, because the same vnodes keep coming up for transfer and
>> >> > fail. Perhaps there is a manual way to forcibly repair and push the
>> >> > vnodes around.
>> >> >
>> >> > -giri
>> >> >
>> >> > On Thu, Mar 28, 2013 at 1:19 PM, Godefroy de Compreignac
>> >> > <godef...@eklablog.com> wrote:
>> >> >>
>> >> >> I have exactly the same problem with my cluster. If anyone knows
>> >> >> what those errors mean... :-)
>> >> >>
>> >> >> Godefroy
>> >> >>
>> >> >> 2013/3/28 Giri Iyengar <giri.iyen...@sociocast.com>
>> >> >>>
>> >> >>> Hello,
>> >> >>>
>> >> >>> We are running a 6-node Riak 1.3.0 cluster in production. We
>> >> >>> recently upgraded to 1.3. Prior to this, we were running Riak 1.2
>> >> >>> on the same 6-node cluster.
>> >> >>>
>> >> >>> We are finding that the nodes are not balanced.
>> >> >>> For instance:
>> >> >>>
>> >> >>> ================================= Membership ==================================
>> >> >>> Status     Ring    Pending    Node
>> >> >>> -------------------------------------------------------------------------------
>> >> >>> valid       0.0%      0.0%    'riak@172.16.25.106'
>> >> >>> valid      34.4%     20.3%    'riak@172.16.25.107'
>> >> >>> valid      21.9%     20.3%    'riak@172.16.25.113'
>> >> >>> valid      19.5%     20.3%    'riak@172.16.25.114'
>> >> >>> valid       8.6%     19.5%    'riak@172.16.25.121'
>> >> >>> valid      15.6%     19.5%    'riak@172.16.25.122'
>> >> >>> -------------------------------------------------------------------------------
>> >> >>> Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
>> >> >>>
>> >> >>> When we look at the logs in the largest node (riak@172.16.25.107),
>> >> >>> we see error messages that look like this:
>> >> >>>
>> >> >>> 2013-03-28 13:04:16.957 [error] <0.10957.1462>@riak_core_handoff_sender:start_fold:226 hinted_handoff transfer of riak_kv_vnode from 'riak@172.16.25.107' 148433760041419827630061740822747494183805648896 to 'riak@172.16.25.121' 148433760041419827630061740822747494183805648896 failed because of error:{badmatch,{error,{worker_crash,{function_clause,[{riak_core_pb,encode,[{ts,{1364,476737,222223}},{{ts,{1364,476737,222223}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,84,73,84,89,95,83,69,83,83,109,0,0,0,36,67,54,57,95,48,48,51,56,100,56,102,50,52,49,52,99,97,97,54,102,99,52,56,53,52,99,99,101,51,98,50,48,102,53,98,52,108,0,0,0,1,104,3,100,0,9,114,95,99,111,110,116,101,110,116,104,9,100,0,4,100,105,99,116,97,5,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,108,0,0,0,2,108,0,0,0,11,109,0,0,0,12,99,111,110,116,101,110,116,45,116,121,112,101,97,116,97,101,97,120,97,116,97,47,97,112,97,108,97,97,97,105,97,110,106,108,0,0,0,23,109,0,0,0,11,88,45,82,105,97,107,45,86,84,97,103,97,51,97,120,97,105,97,101,97,120,97,66,97,120,97,107,97,119,97,101,97,75,97,117,97,122,97,111,97,55,97,85,97,104,97,85,97,107,97,112,97,120,97,107,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,5,105,110,100,101,120,106,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,20,88,45,82,105,97,107,45,76,97,115,116,45,77,111,100,105,102,105,101,100,104,3,98,0,0,5,84,98,0,7,70,65,98,0,3,99,115,106,106,108,0,0,0,1,108,0,0,0,6,109,0,0,0,7,99,104,97,114,115,101,116,97,85,97,84,97,70,97,45,97,56,106,106,109,0,0,0,36,52,54,55,98,54,51,98,50,45,50,99,56,52,45,52,56,50,99,45,97,48,99,54,45,56,53,50,100,53,99,57,97,98,98,53,101,106,108,0,0,0,1,104,2,109,0,0,0,8,0,69,155,215,81,84,63,31,104,2,97,1,110,5,0,65,191,200,202,14,106,104,9,100,0,4,100,105,99,116,97,1,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,108,0,0,0,1,108,0,0,0,1,100,0,5,99,108,101,97,110,100,0,4,116,114,117,101,106,106,100,0,9,117,110,100,101,102,105,110,101,100>>}],[{file,"src/riak_core_pb.erl"},{line,40}]},{riak_core_pb,pack,5,...},...]},...}}} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,170}]}]
>> >> >>> 2013-03-28 13:04:16.961 [error] <0.29352.909> CRASH REPORT Process <0.29352.909> with 0 neighbours exited with reason: no function clause matching riak_core_pb:encode({ts,{1364,476737,222223}}, {{ts,{1364,476737,222223}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>}) line 40 in gen_server:terminate/6 line 747
>> >> >>>
>> >> >>> 2013-03-28 13:04:13.888 [error] <0.12680.1435>@riak_core_handoff_sender:start_fold:226 ownership_handoff transfer of riak_kv_vnode from 'riak@172.16.25.107' 11417981541647679048466287755595961091061972992 to 'riak@172.16.25.113' 11417981541647679048466287755595961091061972992 failed because of error:{badmatch,{error,{worker_crash,{function_clause,[{riak_core_pb,encode,[{ts,{1364,458917,232318}},{{ts,{1364,458917,232318}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,84,73,84,89,95,83,69,83,83,109,0,0,0,36,67,54,57,95,48,48,48,54,52,98,99,52,53,51,49,52,55,101,50,101,53,97,102,101,102,49,57,99,50,55,99,97,49,53,54,99,108,0,0,0,1,104,3,100,0,9,114,95,99,111,110,116,101,110,116,104,9,100,0,4,100,105,99,116,97,5,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,108,0,0,0,2,108,0,0,0,11,109,0,0,0,12,99,111,110,116,101,110,116,45,116,121,112,101,97,116,97,101,97,120,97,116,97,47,97,112,97,108,97,97,97,105,97,110,106,108,0,0,0,23,109,0,0,0,11,88,45,82,105,97,107,45,86,84,97,103,97,54,97,88,97,76,97,66,97,69,97,69,97,116,97,73,97,104,97,118,97,77,97,86,97,48,97,81,97,103,97,110,97,119,97,73,97,51,97,85,97,72,97,53,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,5,105,110,100,101,120,106,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,20,88,45,82,105,97,107,45,76,97,115,116,45,77,111,100,105,102,105,101,100,104,3,98,0,0,5,84,98,0,7,0,165,98,0,3,138,179,106,106,108,0,0,0,1,108,0,0,0,6,109,0,0,0,7,99,104,97,114,115,101,116,97,85,97,84,97,70,97,45,97,56,106,106,109,0,0,0,36,55,102,98,52,50,54,54,53,45,57,100,56,48,45,52,54,98,97,45,98,53,97,100,45,56,55,52,52,54,54,97,97,50,56,53,99,106,108,0,0,0,1,104,2,109,0,0,0,8,0,69,155,215,81,59,179,219,104,2,97,1,110,5,0,165,121,200,202,14,106,104,9,100,0,4,100,105,99,116,97,1,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,108,0,0,0,1,108,0,0,0,1,100,0,5,99,108,101,97,110,100,0,4,116,114,117,101,106,106,100,0,9,117,110,100,101,102,105,110,101,100>>}],[{file,"src/riak_core_pb.erl"},{line,40}]},{riak_core_pb,pack,5,[{...},...]},...]},...}}} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,170}]}]
>> >> >>> 2013-03-28 13:04:14.255 [error] <0.1120.0> CRASH REPORT Process <0.1120.0> with 0 neighbours exited with reason: no function clause matching riak_core_pb:encode({ts,{1364,458917,232318}}, {{ts,{1364,458917,232318}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>}) line 40 in gen_server:terminate/6 line 747
>> >> >>>
>> >> >>> This has been going on for days and the cluster doesn't seem to be
>> >> >>> rebalancing itself. We see this issue with both hinted_handoffs and
>> >> >>> ownership_handoffs. Looks like we have some corrupt data in our
>> >> >>> cluster. I checked through the leveldb LOGs and did not see any
>> >> >>> compaction errors.
>> >> >>> I was hoping that upgrading to 1.3.0 would slowly start repairing
>> >> >>> the cluster. However, that doesn't seem to be happening.
>> >> >>>
>> >> >>> Any help/hints would be much appreciated.
>> >> >>>
>> >> >>> -giri
>
> --
> GIRI IYENGAR, CTO
> SOCIOCAST
> Simple. Powerful. Predictions.
>
> 36 WEST 25TH STREET, 7TH FLOOR NEW YORK CITY, NY 10010
> O: 917.525.2466x104 M: 914.924.7935 F: 347.943.6281
> E: giri.iyen...@sociocast.com W: www.sociocast.com
>
> Facebook's Ad Guru Joins Sociocast - http://bit.ly/NjPQBQ

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com