You may wish to make the memory limit a little smaller; as configured it works out to roughly 20GB of memory backend per node, which might put undue memory pressure on leveldb. I'd also recommend setting {max_open_files, 400} in the eleveldb section (and perhaps tuning the write buffers back down), as that's important for high-quantile latencies.
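For concreteness, a rough sketch of what the eleveldb section of app.config could look like with those suggestions applied; the write-buffer numbers below are illustrative values, not tested recommendations:

    {<<"eleveldb_backend">>, riak_kv_eleveldb_backend, [
        %% bound the number of open .sst files each vnode keeps,
        %% which is what matters most for high-quantile latencies
        {max_open_files, 400},
        %% example of tuning the write buffers back down from the
        %% current 60MB/120MB range (values are placeholders only)
        {write_buffer_size_min, 31457280},   %% 30MB
        {write_buffer_size_max, 62914560}    %% 60MB
    ]},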
Both the TTL and the memory limit put per-operation load on a cluster, but neither should be too expensive, depending on how heavily the cluster is used. I'd keep TTL off for now. I'll keep looking into it.

On Wed, Apr 3, 2013 at 1:23 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:

Evan,

We have about 28G per partition. We are running leveldb + memory (a multi-backend). Here is the relevant section of our app.config:

    {storage_backend, riak_kv_multi_backend},
    {multi_backend_default, <<"eleveldb_backend">>},
    {multi_backend, [
        {<<"eleveldb_backend">>, riak_kv_eleveldb_backend, [
            {write_buffer_size_max, 125829120},
            {write_buffer_size_min, 62914560}
        ]},
        {<<"memory_backend">>, riak_kv_memory_backend, [
            {max_memory, 1024}
        ]}
    ]},

We are running a 6-node cluster and the ring size is 128.

Best regards, and thanks for your help on this.

-giri

On Wed, Apr 3, 2013 at 3:58 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote:

How much data do you have in each partition? Are you running leveldb or bitcask? If the former, what does your eleveldb config look like?

On Wed, Apr 3, 2013 at 6:26 AM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:

Evan,

I tried re-introducing the TTL. It brings back the issue of vnodes not transferring successfully. This time it is the hinted_handoffs, as expected, since the ownership transfers have already happened. I have been specifying both a memory limit and a TTL. Should I be specifying only one?

Also, on a separate note, I notice that if I turn AAE on, nodes tend to go offline while the repairs happen, causing a lot of hinted_handoffs. I tried leaving AAE on for about 48 hours but it still didn't settle, so I ended up turning AAE off, and the cluster quiesced quickly after that.

-giri

On Tue, Apr 2, 2013 at 1:22 PM, Godefroy de Compreignac <godef...@eklablog.com> wrote:

Thanks Evan, it helped me a lot with my cluster!

Godefroy

2013/3/29 Evan Vigil-McClanahan <emcclana...@basho.com>

That's an interesting result. Once it's fully rebalanced, I'd turn it back on and see if the fallback handoffs still fail. If they do, I'd recommend using memory limits, rather than TTL, to limit growth (also, remember that memory limits are *per vnode*, rather than per node). They're slower, but they don't seem to have this problem. I'll do my best to figure out what is going on and get a patch (if one is needed) into the next version.

On Fri, Mar 29, 2013 at 3:15 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:

Evan,

As recommended by you, I disabled the TTL on the memory backends and did a rolling restart of the cluster. Now things are rebalancing quite nicely. Do you think I can turn the TTL back on once the rebalancing completes? I'd like to ensure that the vnodes in memory don't keep growing forever.

-giri

On Thu, Mar 28, 2013 at 6:50 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:

Evan,

This has been happening for a while now (about 3.5 weeks), even prior to our upgrade to 1.3.

-giri
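To make the "per vnode, not per node" point above concrete: with a ring size of 128 spread over 6 nodes, each physical node hosts roughly 21-22 vnodes, so the configured max_memory of 1024 (assuming it is in megabytes, as I believe it is by default) works out to about 21-22GB of memory backend per node in the worst case, which is where the ~20GB figure at the top of this thread comes from. A smaller per-vnode cap with TTL left off might look like the sketch below; 512MB is only an example value, not a recommendation:

    {<<"memory_backend">>, riak_kv_memory_backend, [
        %% per-vnode cap in MB; with ~21 vnodes per node this is
        %% roughly 10-11GB of memory backend per physical node
        {max_memory, 512}
        %% no {ttl, ...} entry for now, per the advice above
    ]}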
On Thu, Mar 28, 2013 at 6:36 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote:

No. AAE is unrelated to the handoff subsystem. I am not familiar enough with the lowest level of its workings to know whether it would reproduce the TTL'd data on nodes that don't have it.

I am not totally sure about your timeline here.

When did you start seeing these errors, before or after your upgrade to 1.3? When did you start your cluster transition? What cluster transitions have you initiated?

If these errors started with 1.3, an interesting experiment would be to disable AAE and do a rolling restart of the cluster, which should lead to empty memory backends that won't be populated by AAE with anything suspicious. That said: if you've had cluster balance problems for a while, it's possible that these messages (even this whole issue) are just masking some other problem.

On Thu, Mar 28, 2013 at 3:24 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:

Evan,

All nodes have been restarted (more than once, in fact) after the config changes. Using riak-admin aae-status, I noticed that the anti-entropy repair is still proceeding across the cluster. It has been less than 24 hours since I upgraded to 1.3, so maybe I have to wait until the first complete build of the index trees happens for the cluster to start rebalancing itself. Could that be the case?

-giri

On Thu, Mar 28, 2013 at 5:49 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote:

Giri,

If all of the nodes are using identical app.config files (including the joining node) and have been restarted since those files changed, it may be some other, related issue.

On Thu, Mar 28, 2013 at 2:46 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:

Evan,

I reconfirmed that all the servers are using identical app.configs. They all use the multi-backend schema. Are you saying that some of the vnodes are in the memory backend on one physical node and in the eleveldb backend on another physical node? If so, how can I fix the offending vnodes?

Thanks,

-giri

On Thu, Mar 28, 2013 at 5:18 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote:

It would if some of the nodes weren't migrated to the new multi-backend schema; if a memory node was trying to hand off to an eleveldb-backed node, you'd see this.
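As a reference for the AAE experiment suggested a few messages up: the switch lives in the riak_kv section of app.config, and as far as I recall the 1.3 on/off tuple looks like the sketch below, applied before the rolling restart:

    {riak_kv, [
        %% turn active anti-entropy off while the handoff problem is
        %% investigated; set back to {on, []} once things settle
        {anti_entropy, {off, []}}
        %% (other riak_kv settings unchanged)
    ]}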
On Thu, Mar 28, 2013 at 2:05 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:

Evan,

I verified that all of the memory backends have the same ttl settings and have done rolling restarts, but it doesn't seem to make a difference. One thing to note, though -- I remember this problem starting roughly around the time I migrated a bucket from being backed by leveldb to being backed by memory. I did this by setting the bucket properties via curl and letting Riak do the migration of the objects in that bucket. Would that cause such issues?

Thanks for your help.

-giri

On Thu, Mar 28, 2013 at 4:55 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote:

Giri, I've seen similar issues in the past when someone was adjusting their ttl setting on the memory backend. Because one memory backend has it and the other does not, it fails on handoff. The solution then was to make sure that all memory backend settings are the same and then do a rolling restart of the cluster (ignoring a lot of errors along the way). I am not sure that this is applicable to your case, but it's something to look at.

On Thu, Mar 28, 2013 at 10:22 AM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:

Godefroy:

Thanks. Your email exchange on the mailing list was what prompted me to consider switching to Riak 1.3. I do see repair messages in the console logs, so some healing is happening. However, there are a bunch of hinted handoffs and ownership handoffs that are simply not proceeding, because the same vnodes keep coming up for transfer and fail. Perhaps there is a manual way to forcibly repair and push the vnodes around.

-giri
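For anyone reproducing the bucket migration mentioned a couple of messages up, the property change itself is a single HTTP call against the bucket's props; the bucket name below is hypothetical, and the backend name is the one from the multi_backend config earlier in the thread:

    # point an existing bucket at the memory backend of the multi-backend
    # (bucket name "sessions" is made up for illustration)
    curl -XPUT http://127.0.0.1:8098/riak/sessions \
         -H "Content-Type: application/json" \
         -d '{"props":{"backend":"memory_backend"}}'

As far as I know, this only changes which backend the bucket maps to from then on; objects already stored under the old backend are not rewritten automatically.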
On Thu, Mar 28, 2013 at 1:19 PM, Godefroy de Compreignac <godef...@eklablog.com> wrote:

I have exactly the same problem with my cluster. If anyone knows what those errors mean... :-)

Godefroy

2013/3/28 Giri Iyengar <giri.iyen...@sociocast.com>

Hello,

We are running a 6-node Riak 1.3.0 cluster in production. We recently upgraded to 1.3; prior to this, we were running Riak 1.2 on the same 6-node cluster.

We are finding that the nodes are not balanced. For instance:

    ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid       0.0%     0.0%     'riak@172.16.25.106'
    valid      34.4%    20.3%     'riak@172.16.25.107'
    valid      21.9%    20.3%     'riak@172.16.25.113'
    valid      19.5%    20.3%     'riak@172.16.25.114'
    valid       8.6%    19.5%     'riak@172.16.25.121'
    valid      15.6%    19.5%     'riak@172.16.25.122'
    -------------------------------------------------------------------------------
    Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
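The listing above is riak-admin member-status output; when handoffs appear stuck like this, the usual companion commands for a closer look are the standard riak-admin subcommands:

    riak-admin member-status    # ring ownership per node, as shown above
    riak-admin ring-status      # claimant/ring state and any unreachable nodes
    riak-admin transfers        # partitions currently waiting to hand off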
When we look at the logs on the largest node (riak@172.16.25.107), we see error messages that look like this:

    2013-03-28 13:04:16.957 [error] <0.10957.1462>@riak_core_handoff_sender:start_fold:226 hinted_handoff transfer of riak_kv_vnode from 'riak@172.16.25.107' 148433760041419827630061740822747494183805648896 to 'riak@172.16.25.121' 148433760041419827630061740822747494183805648896 failed because of error:{badmatch,{error,{worker_crash,{function_clause,[{riak_core_pb,encode,[{ts,{1364,476737,222223}},{{ts,{1364,476737,222223}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>}],[{file,"src/riak_core_pb.erl"},{line,40}]},{riak_core_pb,pack,5,...},...]},...}}} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,170}]}]
    2013-03-28 13:04:16.961 [error] <0.29352.909> CRASH REPORT Process <0.29352.909> with 0 neighbours exited with reason: no function clause matching riak_core_pb:encode({ts,{1364,476737,222223}}, {{ts,{1364,476737,222223}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>}) line 40 in gen_server:terminate/6 line 747
    2013-03-28 13:04:13.888 [error] <0.12680.1435>@riak_core_handoff_sender:start_fold:226 ownership_handoff transfer of riak_kv_vnode from 'riak@172.16.25.107' 11417981541647679048466287755595961091061972992 to 'riak@172.16.25.113' 11417981541647679048466287755595961091061972992 failed because of error:{badmatch,{error,{worker_crash,{function_clause,[{riak_core_pb,encode,[{ts,{1364,458917,232318}},{{ts,{1364,458917,232318}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>}],[{file,"src/riak_core_pb.erl"},{line,40}]},{riak_core_pb,pack,5,[{...},...]},...]},...}}} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,170}]}]
    2013-03-28 13:04:14.255 [error] <0.1120.0> CRASH REPORT Process <0.1120.0> with 0 neighbours exited with reason: no function clause matching riak_core_pb:encode({ts,{1364,458917,232318}}, {{ts,{1364,458917,232318}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>}) line 40 in gen_server:terminate/6 line 747

This has been going on for days and the cluster doesn't seem to be rebalancing itself. We see this issue with both hinted_handoffs and ownership_handoffs.
Looks like we have some corrupt data in our cluster. I checked through the leveldb LOGs and did not see any compaction errors. I was hoping that upgrading to 1.3.0 would slowly start repairing the cluster; however, that doesn't seem to be happening.

Any help/hints would be much appreciated.

-giri

--
GIRI IYENGAR, CTO
SOCIOCAST
Simple. Powerful. Predictions.

36 WEST 25TH STREET, 7TH FLOOR NEW YORK CITY, NY 10010
O: 917.525.2466x104 M: 914.924.7935 F: 347.943.6281
E: giri.iyen...@sociocast.com W: www.sociocast.com

Facebook's Ad Guru Joins Sociocast - http://bit.ly/NjPQBQ
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com