I found this message repeated in the logs for hours:

2013-05-12 04:39:13.002 UTC [info] 
<0.188.0>@riak_core_handoff_manager:handle_info:279 An outbound handoff of 
partition riak_kv_vnode 884893569477695126256137301058686984557302906880 was 
terminated for reason: {shutdown,max_concurrency}
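
To get a feel for how often this happened, here is a rough Python sketch that counts these terminations per partition and reason. The regular expression is only inferred from the single log line above (not from the Riak source), and it assumes each message sits on one line in the real log file rather than wrapped as in this mail. The name count_terminations is just something I made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the handoff-manager message shown above.
# Assumption: each log entry is on a single line in the actual log file.
TERMINATED = re.compile(
    r"outbound handoff of\s+partition\s+(\S+)\s+(\d+)\s+was\s+"
    r"terminated for reason:\s*(.+)$"
)


def count_terminations(lines):
    """Count terminated outbound handoffs per (partition, reason)."""
    counts = Counter()
    for line in lines:
        m = TERMINATED.search(line)
        if m:
            _vnode, partition, reason = m.groups()
            counts[(partition, reason.strip())] += 1
    return counts
```

It takes any iterable of lines, so an open log file can be passed in directly; a partition that shows up many times with {shutdown,max_concurrency} would be one that keeps getting refused.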


After the rolling restart, handoff started working:

2013-05-12 05:11:52.277 UTC [info] 
<0.4696.0>@riak_core_handoff_sender:start_fold:130 Starting hinted_handoff 
transfer of riak_kv_vnode from '[email protected]' 
1255977969581244695331291653115555720016817029120 to '[email protected]' 
1255977969581244695331291653115555720016817029120


AAE also suddenly started working:

2013-05-12 06:44:35.889 UTC [info] 
<0.7820.3>@riak_kv_exchange_fsm:key_exchange:204 Repaired 1 keys during active 
anti-entropy exchange of {108470824645652950960429733678161630365088743424,12} 
between {137015778499772148581595453067151533092743675904,'[email protected]'} 
and {142724769270595988105828596944949513638274662400,'[email protected]'}

The situation started too long ago, so I have no log files from the start of this
event except some error logs, which only show:

** Removing (timedout) connection **
2013-05-05 00:35:30 UTC =ERROR REPORT====
** Node '[email protected]' not responding **
** Removing (timedout) connection **
2013-05-05 00:35:30 UTC =ERROR REPORT====
** Node '[email protected]' not responding **
** Removing (timedout) connection **
2013-05-05 00:35:30 UTC =ERROR REPORT====
** Node '[email protected]' not responding **
** Removing (timedout) connection **
2013-05-05 00:36:30 UTC =ERROR REPORT====
** Node '[email protected]' not responding **
** Removing (timedout) connection **
2013-05-05 00:36:30 UTC =ERROR REPORT====
** Node '[email protected]' not responding **

This "could" be the start of the problem: we had some weird network issues
between two DCs in that timeframe, with some broken TCP connections. But it
looks like Riak wasn't able to get out of this situation by itself without a
rolling restart.
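
To notice this kind of stall earlier next time, something like the following Python sketch could compare two samples of the "riak-admin transfers" output (quoted below) taken a few minutes apart. The line format is assumed from that quoted output only, and the function names waiting_handoffs and looks_stalled are made up for illustration.

```python
import re

# Line format assumed from the `riak-admin transfers` output quoted below,
# e.g. "'<node>' waiting to handoff 30 partitions".
WAITING = re.compile(r"^'([^']+)' waiting to handoff (\d+) partitions?$")


def waiting_handoffs(output):
    """Return {node: partitions waiting} parsed from the command's text output."""
    waiting = {}
    for line in output.splitlines():
        m = WAITING.match(line.strip())
        if m:
            waiting[m.group(1)] = int(m.group(2))
    return waiting


def looks_stalled(before, after):
    """True if handoffs are pending and no count changed between two samples."""
    return bool(before) and before == after
```

Identical non-empty counts across two samples only hint at a stall; a slow but healthy transfer could look the same if the interval is too short.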

Any ideas?

Cheers,
Simon


On Sun, 12 May 2013 07:34:50 +0200
Simon Effenberg <[email protected]> wrote:

> Hi list,
> 
> this morning I did a "riak-admin transfers" on the riak machines and
> saw this:
> 
> [root@kriak46-1:~]# riak-admin transfers
> Attempting to restart script through sudo -H -u riak
> '[email protected]' waiting to handoff 30 partitions
> '[email protected]' waiting to handoff 3 partitions
> '[email protected]' waiting to handoff 15 partitions
> '[email protected]' waiting to handoff 1 partitions
> '[email protected]' waiting to handoff 1 partitions
> '[email protected]' waiting to handoff 29 partitions
> '[email protected]' waiting to handoff 16 partitions
> '[email protected]' waiting to handoff 17 partitions
> '[email protected]' waiting to handoff 3 partitions
> 
> Active Transfers:
> 
> and this didn't change for at least 15 minutes while I was checking.
> 
> I also did on each node:
> 
> riak-admin ring-status
> 
> and every node said:
> 
> Attempting to restart script through sudo -H -u riak
> ================================== Claimant ===================================
> Claimant:  '[email protected]'
> Status:     up
> Ring Ready: true
> 
> ============================== Ownership Handoff ==============================
> No pending changes.
> 
> ============================== Unreachable Nodes ==============================
> All nodes are up and reachable
> 
> 
> So why do I not see any "Ownership Handoff" on some nodes (or is this 
> "another" handoff)?
> Also how could I get rid of the waiting handoffs?
> 
> My solution was a rolling restart, which helped. But I don't know how it was
> possible to get into this situation, and it would be nice to be able to resolve
> it without a rolling restart.
> 
> Any ideas?
> 
> Cheers,
> Simon
> 



_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
