I've got the problem solved, thanks to Brian Sparrow on the IRC channel.
Here are the steps we tried during the troubleshooting session:
1. We first tried deleting the data folders for the two partitions on
the receiving node, while the node was stopped, to see if it would
retrigger the ownership handoff. It didn't change anything.
2. We then tried running the following Erlang code in the console on
the sending node, to see if it would retrigger the ownership handoff.
The partition IDs are those of the partitions needing to be transferred:
IdxList = [696496874040508421956443553091353626554780352512,
           239777612374601260017792042867515182912301432832],
Mod = riak_kv,
%% get_my_ring/0 returns {ok, Ring}; the binding isn't actually used,
%% since ring_trans passes the current ring to the fun.
{ok, Ring} = riak_core_ring_manager:get_my_ring(),
riak_core_ring_manager:ring_trans(
    fun(Ring0, _) ->
        Ring2 = lists:foldl(
            fun(Idx, R) ->
                riak_core_ring:handoff_complete(R, Idx, Mod)
            end,
            Ring0,
            IdxList),
        {new_ring, Ring2}
    end, []).
That piece of code didn't help either. The output of the command
showed the two partitions still in the "awaiting" state:
[{239777612374601260017792042867515182912301432832,
'[email protected]','[email protected]',
[riak_kv,riak_kv_vnode,riak_pipe_vnode],
awaiting},
{696496874040508421956443553091353626554780352512,
'[email protected]','[email protected]',
[riak_kv,riak_kv_vnode,riak_pipe_vnode],
awaiting}],
3. Brian suggested that I should run
"riak_core_ring_events:force_update()." in the Erlang console as well,
but that didn't have any effect.
4. I sent the ring directories from the source and destination nodes
to Brian, and he came back with the following Erlang code, which
solved the problem for us:
IdxList = [696496874040508421956443553091353626554780352512,
           239777612374601260017792042867515182912301432832],
Mod = riak_kv_vnode,
%% get_my_ring/0 returns {ok, Ring}; ring_trans passes the current
%% ring to the fun, so the binding isn't actually used.
{ok, Ring} = riak_core_ring_manager:get_my_ring(),
riak_core_ring_manager:ring_trans(
    fun(Ring0, _) ->
        %% Element 7 of the ring record is the list of pending
        %% ownership transfers; drop the stale riak_kv module from
        %% each entry's module list.
        Ring1 = begin
            A = element(7, Ring0),
            B = [{B1, B2, B3,
                  [B4E || B4E <- B4, B4E /= riak_kv],
                  B5} || {B1, B2, B3, B4, B5} <- A],
            setelement(7, Ring0, B)
        end,
        %% Then mark the handoffs complete for both partitions.
        Ring2 = lists:foldl(
            fun(Idx, R) ->
                riak_core_ring:handoff_complete(R, Idx, Mod)
            end,
            Ring1,
            IdxList),
        {new_ring, Ring2}
    end, []).
The output of the command showed the handoffs were complete:
[{239777612374601260017792042867515182912301432832,
'[email protected]','[email protected]',
[riak_kv_vnode,riak_pipe_vnode],
complete},
{696496874040508421956443553091353626554780352512,
'[email protected]','[email protected]',
[riak_kv_vnode,riak_pipe_vnode],
complete}],
And I could confirm that with the usual "ring-status", "member-status"
and "transfers" commands. There were no pending transfers, no pending
ownership handoffs and the cluster didn't show the rebalancing to be
in progress any more.
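As an aside, for anyone who doesn't read Erlang list comprehensions
easily, the ring surgery in step 4 does the equivalent of the
following. This is a purely illustrative Python sketch, not anything
you would run against Riak; the field names are my own, chosen to
match the entries shown in the output above:

```python
# Illustrative only: mirrors the Erlang list comprehension, does not
# talk to Riak. Each pending-transfer entry in the ring is a tuple of
# (index, owner, next_owner, modules, status); the fix removes the
# stale 'riak_kv' module from every entry's module list.
pending = [
    (239777612374601260017792042867515182912301432832,
     "[email protected]", "[email protected]",
     ["riak_kv", "riak_kv_vnode", "riak_pipe_vnode"], "awaiting"),
    (696496874040508421956443553091353626554780352512,
     "[email protected]", "[email protected]",
     ["riak_kv", "riak_kv_vnode", "riak_pipe_vnode"], "awaiting"),
]

fixed = [
    (idx, owner, next_owner,
     [m for m in mods if m != "riak_kv"],  # drop riak_kv, keep the rest
     status)
    for (idx, owner, next_owner, mods, status) in pending
]

print(fixed[0][3])  # ['riak_kv_vnode', 'riak_pipe_vnode']
```

With riak_kv gone from the module lists, handoff_complete can then
mark the remaining modules done, which is what the foldl does.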
Thanks a lot to Brian for helping solve this issue. I hope this
information will be useful to anybody else who encounters it.
--
Jeppe Fihl Toustrup
Operations Engineer
Falcon Social
On 20 November 2013 17:52, Mark Phillips <[email protected]> wrote:
> Hmm. The fact that you've disabled Search probably changes things but I'm
> not entirely sure how.
>
> Ryan et al - any ideas?
>
> Mark
>
> On Wednesday, November 20, 2013, Jeppe Toustrup wrote:
>>
>> Hi
>>
>> Thank you for the guide. I stopped two of the nodes (the source and
>> the destination of the partition transfers), renamed the folders
>> inside the merge_index folder, and started them again. However, the
>> ownership handoff does not seem to be retried.
>>
>> Looking at the logs, it seems the last attempt was 48 hours ago.
>> Is there any logic inside Riak which causes it to give up after a
>> certain number of tries?
>> Is there a way I can retrigger the handoffs?
>> I have tried to set the transfer-limit on the cluster to 0 and then
>> back to 2, but it doesn't seem to do anything.
>>
>> I wonder if we need the merge_index folder at all, as we have disabled
>> Riak Search since the initial configuration of the cluster. We found a
>> better way to query our data, so we don't need Riak Search anymore. We
>> disabled it by resetting the properties on the buckets where search
>> was enabled, then disabling search in app.config and restarting each
>> of the nodes. This was done after the ownership handoff issue first
>> occurred.
>>
>> --
>> Jeppe Fihl Toustrup
>> Operations Engineer
>> Falcon Social
>>
>>
>> On 19 November 2013 23:17, Mark Phillips <[email protected]> wrote:
>> > Hi Jeppe,
>> >
>> > As you suspected, this looks like index corruption in Search that's
>> > preventing handoff from finishing. Specifically, you'll need to delete
>> > the segment files for the two partitions' indexes and rebuild those
>> > indexes post-transfer.
>> >
>> > Here's the full process:
>> >
>> > - Stop each node that owns the partitions in question.
>> > - Delete the data directory for each partition (which contains the
>> >   segment files). It should be something like:
>> >   "rm -rf /var/lib/riak/merge_index/<p>"
>> > - Restart each node
>> > - Wait for the transfers to complete
>> > - Rebuild the indexes in question [1]
>> >
>> > Let us know if you run into any further issues.
>> >
>> > Mark
>> >
>> > [1]
>> > http://docs.basho.com/riak/latest/ops/running/recovery/repairing-indexes/
>> >
>> > On Tue, Nov 19, 2013 at 4:26 AM, Jeppe Toustrup <[email protected]>
>> > wrote:
>> >>
>> >> Hi
>> >>
>> >> I have recently added two extra nodes to the now seven-node Riak
>> >> cluster. The rebalancing following the expansion worked fine, except
>> >> for two partitions which do not seem to be able to go through. Running
>> >> "riak-admin ring-status" shows the following:
>> >>
>> >> ============================== Ownership Handoff
>> >> ==============================
>> >> Owner: [email protected]
>> >> Next Owner: [email protected]
>> >>
>> >> Index: 239777612374601260017792042867515182912301432832
>> >> Waiting on: []
>> >> Complete: [riak_kv_vnode,riak_pipe_vnode]
>> >>
>> >> Index: 696496874040508421956443553091353626554780352512
>> >> Waiting on: []
>> >> Complete: [riak_kv_vnode,riak_pipe_vnode]
>> >>
>> >>
>> >>
>> >> -------------------------------------------------------------------------------
>> >>
>> >> I can see from the log file on the source node (10.0.0.96) that it has
>> >> made numerous attempts to transfer the partitions, but they end up
>> >> failing every time. Here's an excerpt of the log file showing the
>> >> lines from when a transfer attempt fails:
>> >>
>> >> 2013-11-18 12:29:03.694 [error] emulator Error in process <0.5745.8>
>> >> on node '[email protected]' with exit value:
>> >> {badarg,[{erlang,binary_to_term,[<<29942
>> >>
>> >>
>> >> bytes>>],[]},{mi_segment,iterate_all_bytes,2,[{file,"src/mi_segment.erl"},{line,167}]},{mi_server,'-group_iterator/2-fun-1-',2,[{file,"src/mi_server.erl"},{line,725}]},{mi_server,'-group_iterator/2-fun-0-'...
>> >> 2013-11-18 12:29:03.885 [error] <0.3269.0>@mi_server:handle_info:524
>> >> lookup/range failure:
>> >>
>> >>
>> >> {badarg,[{erlang,binary_to_term,[<<131,109,0,0,244,240,108,109,102,97,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com