Special thanks to Engel Sanchez, who quickly came up with those snippets to mark the handoffs as complete.

--
Brian Sparrow
Developer Advocate
Basho Technologies

Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, November 20, 2013 at 1:50 PM, Mark Phillips wrote:

> Excellent. I'm glad to hear it's cleared up.
>
> All hail Sparrow: slayer of issues; master of drum kits.
>
> Mark
>
> On Wednesday, November 20, 2013, Jeppe Toustrup wrote:
> > I've got the problem solved thanks to Brian Sparrow on the IRC channel.
> >
> > Here are the steps we tried during the troubleshooting session:
> >
> > 1. We first tried to delete the data folders on the receiving node for
> > the two partitions, while the node was stopped, to see if it would
> > retrigger the ownership handoff. It didn't change anything.
> >
> > 2. We then tried to run the following Erlang code on the sending
> > node, in order to see if it would retrigger the ownership handoff. The
> > partition IDs were those of the partitions needing to be transferred:
> >
> > IdxList = [696496874040508421956443553091353626554780352512,
> >            239777612374601260017792042867515182912301432832],
> > Mod = riak_kv,
> > Ring = riak_core_ring_manager:get_my_ring(),
> > riak_core_ring_manager:ring_trans(
> >     fun(Ring, _) ->
> >         Ring2 = lists:foldl(
> >             fun(Idx, Ring) ->
> >                 riak_core_ring:handoff_complete(Ring, Idx, Mod)
> >             end,
> >             Ring,
> >             IdxList),
> >         {new_ring, Ring2}
> >     end, []).
> >
> > That piece of code didn't help either. The output of the command showed
> > the two partitions to be in the "awaiting" state:
> >
> > [{239777612374601260017792042867515182912301432832,
> >   '[email protected]','[email protected]',
> >   [riak_kv,riak_kv_vnode,riak_pipe_vnode],
> >   awaiting},
> >  {696496874040508421956443553091353626554780352512,
> >   '[email protected]','[email protected]',
> >   [riak_kv,riak_kv_vnode,riak_pipe_vnode],
> >   awaiting}],
> >
> > 3. Brian suggested that I should run "riak_core_ring_events:force_update()."
> > in the Erlang console as well, but that didn't have any effect.
> >
> > 4. I sent the ring directories from the source and destination nodes
> > to Brian, and he came back with the following Erlang code, which
> > solved the problem for us:
> >
> > IdxList = [696496874040508421956443553091353626554780352512,
> >            239777612374601260017792042867515182912301432832],
> > Mod = riak_kv_vnode,
> > Ring = riak_core_ring_manager:get_my_ring(),
> > riak_core_ring_manager:ring_trans(
> >     fun(Ring, _) ->
> >         Ring1 = begin
> >                     A = element(7, Ring),
> >                     B = [{B1, B2, B3,
> >                           [B4E || B4E <- B4, B4E /= riak_kv],
> >                           B5} || {B1, B2, B3, B4, B5} <- A],
> >                     setelement(7, Ring, B)
> >                 end,
> >         Ring2 = lists:foldl(
> >             fun(Idx, R) ->
> >                 riak_core_ring:handoff_complete(R, Idx, Mod)
> >             end,
> >             Ring1,
> >             IdxList),
> >         {new_ring, Ring2}
> >     end, []).
> >
> > The output of the command showed the handoffs were complete:
> >
> > [{239777612374601260017792042867515182912301432832,
> >   '[email protected]','[email protected]',
> >   [riak_kv_vnode,riak_pipe_vnode],
> >   complete},
> >  {696496874040508421956443553091353626554780352512,
> >   '[email protected]','[email protected]',
> >   [riak_kv_vnode,riak_pipe_vnode],
> >   complete}],
> >
> > And I could confirm that with the usual "ring-status", "member-status"
> > and "transfers" commands. There were no pending transfers, no pending
> > ownership handoffs, and the cluster didn't show the rebalancing to be
> > in progress any more.
> >
> > Thanks a lot to Brian for helping solve this issue. I hope anybody
> > else who may encounter it can use the above info.
> >
> > --
> > Jeppe Fihl Toustrup
> > Operations Engineer
> > Falcon Social
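A note for readers working through the same symptoms: the tuples Jeppe pasted above come from the ring's pending-transfer ("next") list, and that list can be inspected read-only before attempting any ring_trans surgery. Below is a minimal sketch, run from "riak attach", assuming riak_core_ring:pending_changes/1 is exported by the riak_core release you are running; it is not part of the thread.

    %% Read-only check: list ownership transfers that are not yet complete.
    %% Each entry has the shape {Index, Owner, NextOwner, Mods, Status},
    %% where Status is typically awaiting or complete.
    {ok, Ring} = riak_core_ring_manager:get_my_ring(),
    Pending = riak_core_ring:pending_changes(Ring),
    [Entry || {_Idx, _Owner, _Next, _Mods, Status} = Entry <- Pending,
              Status =/= complete].

After a fix like the one above, this filter should come back empty even if already-completed entries still appear in the raw pending list.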
> >
> > On 20 November 2013 17:52, Mark Phillips <[email protected]> wrote:
> > > Hmm. The fact that you've disabled Search probably changes things but I'm
> > > not entirely sure how.
> > >
> > > Ryan et al - any ideas?
> > >
> > > Mark
> > >
> > > On Wednesday, November 20, 2013, Jeppe Toustrup wrote:
> > > >
> > > > Hi
> > > >
> > > > Thank you for the guide. I stopped two of the nodes (the source and
> > > > the destination of the partition transfers), renamed the folders
> > > > inside the merge_index folder and started them again. However, the
> > > > ownership handoff does not seem to be retried.
> > > >
> > > > Looking at the logs it seems like the last attempt was 48 hours ago.
> > > > Is there any logic inside Riak which causes it to give up after a
> > > > certain number of tries?
> > > > Is there a way I can retrigger the handoffs?
> > > > I have tried to set the transfer-limit on the cluster to 0 and then
> > > > back to 2, but it doesn't seem to do anything.
> > > >
> > > > I wonder if we need the merge_index folder at all, as we have disabled
> > > > Riak search since the initial configuration of the cluster. We found a
> > > > better way to query our data so that we don't need Riak search
> > > > anymore. We disabled it by resetting the properties on the buckets
> > > > where search was enabled, and then disabled search in app.config
> > > > followed by a restart of each of the nodes. This was done after the
> > > > ownership handoff issue first occurred.
> > > >
> > > > --
> > > > Jeppe Fihl Toustrup
> > > > Operations Engineer
> > > > Falcon Social
> > > >
> > > > On 19 November 2013 23:17, Mark Phillips <[email protected]> wrote:
> > > > > Hi Jeppe,
> > > > >
> > > > > As you suspected, this looks like index corruption in Search that's
> > > > > preventing handoff from finishing. Specifically, you'll need to delete
> > > > > the segment files for the two partitions' indexes and rebuild those
> > > > > indexes post-transfer.
> > > > >
> > > > > Here's the full process:
> > > > >
> > > > > - Stop each node that owns the partitions in question.
> > > > > - Delete the data directory for each partition (which contains the
> > > > >   segment files). It should be something like:
> > > > >
> > > > >   "rm -rf /var/lib/riak/merge_index/<p>"
> > > > >
> > > > > - Restart each node
> > > > > - Wait for the transfers to complete
> > > > > - Rebuild the indexes in question [1]
> > > > >
> > > > > Let us know if you run into any further issues.
> > > > >
> > > > > Mark
> > > > >
> > > > > [1] http://docs.basho.com/riak/latest/ops/running/recovery/repairing-indexes/
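For the last step of Mark's process, the repairing-indexes page in [1] amounts to triggering a per-partition index repair from the Erlang console of the node that owns each partition once the transfers finish. A minimal sketch follows; riak_search_vnode:repair/1 is an assumption here, so treat the linked doc as authoritative for the exact call on your Riak version.

    %% Hedged sketch, not from the thread: rebuild the Search (merge_index)
    %% data for the two affected partitions after handoff completes.
    %% Run from `riak attach` on the node that now owns each partition.
    %% riak_search_vnode:repair/1 is an assumption; consult the
    %% repairing-indexes doc linked above for your Riak version.
    Partitions = [696496874040508421956443553091353626554780352512,
                  239777612374601260017792042867515182912301432832],
    [riak_search_vnode:repair(P) || P <- Partitions].

In this particular thread the rebuild turned out not to matter, since Search had already been disabled on the cluster.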
> > > > >
> > > > > On Tue, Nov 19, 2013 at 4:26 AM, Jeppe Toustrup <[email protected]> wrote:
> > > > > > Hi
> > > > > >
> > > > > > I have recently added two extra nodes to the now seven-node Riak
> > > > > > cluster. The rebalancing following the expansion worked fine, except
> > > > > > for two partitions which do not seem to be able to go through. Running
> > > > > > "riak-admin ring-status" shows the following:
> > > > > >
> > > > > > ============================== Ownership Handoff ==============================
> > > > > > Owner: [email protected]
> > > > > > Next Owner: [email protected]
> > > > > >
> > > > > > Index: 239777612374601260017792042867515182912301432832
> > > > > > Waiting on: []
> > > > > > Complete: [riak_kv_vnode,riak_pipe_<
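Closing note for future readers: Jeppe's earlier question about retriggering the handoffs usually has a gentler first answer than ring surgery, namely asking each node's vnode manager to re-attempt its pending handoffs. It may not help when, as here, the ring entries themselves are stuck, but it is a low-risk first step. A minimal sketch from "riak attach", assuming riak_core_vnode_manager:force_handoffs/0 is available in your riak_core release:

    %% Hedged sketch: ask every node in the cluster to re-attempt pending
    %% handoffs. This only nudges the vnode managers; it does not modify
    %% the ring, so it is safe to run more than once.
    rpc:multicall([node() | nodes()], riak_core_vnode_manager, force_handoffs, []).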
