Excellent. I'm glad to hear it's cleared up.
All hail Sparrow: slayer of issues; master of drum kits.
Mark
On Wednesday, November 20, 2013, Jeppe Toustrup wrote:
> I've got the problem solved thanks to Brian Sparrow on the IRC channel.
>
> Here are the steps we tried during the troubleshooting session:
>
> 1. We first tried to delete the data folders on the receiving node for
> the two partitions, while the node was stopped, to see if it would
> retrigger the ownership handoff. It didn't change anything.
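> For anyone repeating this, a sketch of what step 1 amounted to (the
> bitcask path and per-partition directory layout here are assumptions;
> adjust for your backend and install):

```shell
# On the receiving node, with the node stopped (sketch, not a recipe).
riak stop

# Remove the data directories for the two stuck partitions; the
# partition IDs are the ones reported by "riak-admin ring-status".
# The bitcask path is an assumed default.
rm -rf /var/lib/riak/bitcask/696496874040508421956443553091353626554780352512
rm -rf /var/lib/riak/bitcask/239777612374601260017792042867515182912301432832

riak start
```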
>
> 2. We then tried to run the following Erlang code on the sending
> node, to see if it would retrigger the ownership handoff. The
> partition IDs are those of the partitions needing to be transferred:
>
> IdxList = [696496874040508421956443553091353626554780352512,
>            239777612374601260017792042867515182912301432832],
> Mod = riak_kv,
> %% ring_trans/2 passes the current ring into the fun, so the result
> %% of get_my_ring/0 is not needed here.
> riak_core_ring_manager:ring_trans(
>     fun(Ring, _) ->
>         Ring2 = lists:foldl(
>                     fun(Idx, R) ->
>                         riak_core_ring:handoff_complete(R, Idx, Mod)
>                     end,
>                     Ring,
>                     IdxList),
>         {new_ring, Ring2}
>     end, []).
>
> That piece of code didn't help either. The output of the command
> showed the two partitions still in the "awaiting" state:
>
> [{239777612374601260017792042867515182912301432832,
> '[email protected]','[email protected]',
> [riak_kv,riak_kv_vnode,riak_pipe_vnode],
> awaiting},
> {696496874040508421956443553091353626554780352512,
> '[email protected]','[email protected]',
> [riak_kv,riak_kv_vnode,riak_pipe_vnode],
> awaiting}],
>
> 3. Brian suggested that I should run
> "riak_core_ring_events:force_update()." in the Erlang console as well,
> but that didn't have any effect.
>
> 4. I sent the ring directories from the source and destination nodes
> to Brian, and he came back with the following Erlang code, which
> solved the problem for us:
>
> IdxList = [696496874040508421956443553091353626554780352512,
>            239777612374601260017792042867515182912301432832],
> Mod = riak_kv_vnode,
> %% ring_trans/2 passes the current ring into the fun, so the result
> %% of get_my_ring/0 is not needed here.
> riak_core_ring_manager:ring_trans(
>     fun(Ring, _) ->
>         %% Element 7 of the ring record holds the list of pending
>         %% transfers, one {Index, Owner, NextOwner, Mods, Status}
>         %% tuple per handoff. Strip the stale riak_kv entry from
>         %% each tuple's module list.
>         Next = [{I, O, N, [M || M <- Mods, M /= riak_kv], S}
>                 || {I, O, N, Mods, S} <- element(7, Ring)],
>         Ring1 = setelement(7, Ring, Next),
>         %% Then mark the handoff complete for each stuck partition.
>         Ring2 = lists:foldl(
>                     fun(Idx, R) ->
>                         riak_core_ring:handoff_complete(R, Idx, Mod)
>                     end,
>                     Ring1,
>                     IdxList),
>         {new_ring, Ring2}
>     end, []).
>
> The output of the command showed the handoffs were complete:
>
> [{239777612374601260017792042867515182912301432832,
> '[email protected]','[email protected]',
> [riak_kv_vnode,riak_pipe_vnode],
> complete},
> {696496874040508421956443553091353626554780352512,
> '[email protected]','[email protected]',
> [riak_kv_vnode,riak_pipe_vnode],
> complete}],
>
> And I could confirm that with the usual "ring-status", "member-status"
> and "transfers" commands. There were no pending transfers, no pending
> ownership handoffs and the cluster didn't show the rebalancing to be
> in progress any more.
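> For completeness, these are the commands used for that confirmation;
> each should report no pending ownership handoffs or transfers once the
> fix has taken effect:

```shell
# Ring/membership state and any unfinished ownership handoffs
riak-admin ring-status
riak-admin member-status

# Active and pending handoffs across the cluster
riak-admin transfers
```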
>
> Thanks a lot to Brian for helping solve this issue. I hope anybody
> else who may encounter it can use the above info.
>
> --
> Jeppe Fihl Toustrup
> Operations Engineer
> Falcon Social
>
>
> On 20 November 2013 17:52, Mark Phillips <[email protected]> wrote:
> > Hmm. The fact that you've disabled Search probably changes things but I'm
> > not entirely sure how.
> >
> > Ryan et al - any ideas?
> >
> > Mark
> >
> > On Wednesday, November 20, 2013, Jeppe Toustrup wrote:
> >>
> >> Hi
> >>
> >> Thank you for the guide. I stopped two of the nodes (the source and
> >> the destination of the partition transfers), renamed the folders
> >> inside the merge_index folder and started them again. However, the
> >> ownership handoff does not seem to be retried.
> >>
> >> Looking at the logs, it seems the last attempt was 48 hours ago.
> >> Is there any logic inside Riak which causes it to give up after a
> >> certain number of tries?
> >> Is there a way I can retrigger the handoffs?
> >> I have tried to set the transfer-limit on the cluster to 0 and then
> >> back to 2, but it doesn't seem to do anything.
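> >> (For reference, the limit was toggled with the commands below; the
> >> cluster-wide form of "riak-admin transfer-limit" is assumed here, per
> >> Riak 1.4, and may differ on other versions:)

```shell
# Disable handoff concurrency cluster-wide, then restore it
riak-admin transfer-limit 0
riak-admin transfer-limit 2
```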
> >>
> >> I wonder if we need the merge_index folder at all, as we have disabled
> >> Riak search since the initial configuration of the cluster. We found a
> >> better way to query our data so that we don't need Riak search
> >> anymore. We disabled it by resetting the properties on the buckets
> >> where search was enabled, and then disabled search in app.config
> >> followed by a restart of each of the nodes. This was done after the
> >> ownership handoff issue first occurred.
> >>
> >> --
> >> Jeppe Fihl Toustrup
> >> Operations Engineer
> >> Falcon Social
> >>
> >>
> >> On 19 November 2013 23:17, Mark Phillips <[email protected]> wrote:
> >> > Hi Jeppe,
> >> >
> >> >
> >> >
> >> > As you suspected, this looks like index corruption in Search that's
> >> > preventing handoff from finishing. Specifically, you'll need to
> >> > delete the segment files for the two partitions' indexes and rebuild
> >> > those indexes post-transfer.
> >> >
> >> >
> >> > Here's the full process:
> >> >
> >> > - Stop each node that owns the partitions in question.
> >> > - Delete the data directory for each partition (which contains the
> >> >   segment files). It should be something like:
> >> >
> >> >   "rm -rf /var/lib/riak/merge_index/<p>"
> >> >
> >> > - Restart each node.
> >> > - Wait for the transfers to complete.
> >> > - Rebuild the indexes in question [1]
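> >> > Put together as a shell sketch (node-local; <p> stands for the
> >> > partition ID, and the merge_index path is the assumed default):

```shell
# Run on each node that owns one of the affected partitions.
riak stop
# Delete the corrupt Search segment files for that partition.
rm -rf /var/lib/riak/merge_index/<p>
riak start

# Watch the transfers drain before rebuilding the indexes.
riak-admin transfers
```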
> >> >
> >> >
> >> > Let us know if you run into any further issues.
> >> >
> >> >
> >> >
> >> > Mark
> >> >
> >> >
> >> > [1]
> >> >
> >> >
> http://docs.basho.com/riak/latest/ops/running/recovery/repairing-indexes/
> >> >
> >> >
> >> >
> >> > On Tue, Nov 19, 2013 at 4:26 AM, Jeppe Toustrup
> >> > <[email protected]> wrote:
> >> >>
> >> >> Hi
> >> >>
> >> >> I have recently added two extra nodes to the now seven-node Riak
> >> >> cluster. The rebalancing following the expansion worked fine, except
> >> >> for two partitions which do not seem to be able to go through.
> >> >> Running "riak-admin ring-status" shows the following:
> >> >>
> >> >> ============================== Ownership Handoff
> >> >> ==============================
> >> >> Owner: [email protected]
> >> >> Next Owner: [email protected]
> >> >>
> >> >> Index: 239777612374601260017792042867515182912301432832
> >> >> Waiting on: []
> >> >> Complete: [riak_kv_vnode,riak_pipe_<
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com