help rebalance hangs

Michael Liu Mon, 15 Dec 2014 18:02:28 -0800

How does one go about debugging why a node fails to rebalance?

1. The node in question fails to complete cbtransfer/cbbackup. See 
https://groups.google.com/d/msg/couchbase/D90J8zwhLos/OAUX_obzCygJ


2. The node was once part of a 2-node cluster. I failed over a node and 
removed it from the cluster. Now, adding a second node back to the 
cluster always hangs on rebalance. From the attached logs, it looks like the 
rebalancer hits some kind of error in the middle of transferring data, 
never recovers, and just starts and stops continuously from then on.

3. I tried both 3.0.1 and 3.0.2, with similar results.

It seems to start rebalancing, error out, and then start again at 
the same place, over and over:

[ns_server:debug,2014-12-16T1:56:06.793,ns_1@union-2:<0.1111.0>:
ns_rebalance_observer:docs_left_updater_loop:359]Starting 
docs_left_updater_loop:"loop"

[{move_state,961,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',3,3,<<>>}]},
 {move_state,968,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',4,4,<<>>}]},
 {move_state,973,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',9,9,<<>>}]},
 {move_state,986,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',4,4,<<>>}]},
 {move_state,987,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',6,6,<<>>}]},
 {move_state,991,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',3,3,<<>>}]},
 {move_state,998,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',4,4,<<>>}]},
 {move_state,1000,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',5,5,<<>>}]},
 {move_state,1007,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',1,1,<<>>}]},
 {move_state,1012,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',7,7,<<>>}]},
 {move_state,1016,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',5,5,<<>>}]},
 {move_state,1020,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',1,1,<<>>}]},
 {move_state,1022,
             ['ns_1@union-1',undefined],
             ['ns_1@union-2','ns_1@union-1'],
             dcp,
             [{replica_building_stats,'ns_1@union-2',1,1,<<>>}]}]




Then something goes wrong with the rebalance:
[rebalance:debug,2014-12-16T1:51:37.169,ns_1@union-2:<0.3239.0>:
janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no 
persistence. Will try again
[rebalance:debug,2014-12-16T1:51:37.265,ns_1@union-2:<0.3259.0>:
janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no 
persistence. Will try again 
[rebalance:debug,2014-12-16T1:51:40.582,ns_1@union-2:<0.4018.0>:
janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no 
persistence. Will try again 
[rebalance:debug,2014-12-16T1:51:41.233,ns_1@union-2:<0.1316.0>:
janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no 
persistence. Will try again 
[rebalance:debug,2014-12-16T1:51:41.703,ns_1@union-2:<0.4279.0>:
janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no 
persistence. Will try again 
[ns_server:debug,2014-12-16T1:51:41.793,ns_1@union-2:<0.1111.0>:
ns_rebalance_observer:docs_left_updater_loop:359]Starting 
docs_left_updater_loop:"loop"
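
To see how often these retries happen and from which processes, the entries above could be pulled out of the debug log with a short script. This is only a rough sketch: the function name `etmpfail_retries` is made up for illustration, and the pattern assumes the log lines look exactly like the ones quoted above (including the mail client's line wrapping).

```python
import re

# Matches the janitor_agent retry entries quoted above.
# \s* and \s+ tolerate the line breaks introduced by mail wrapping.
ETMPFAIL_RE = re.compile(
    r"\[rebalance:debug,([^,]+),([^:]+):(<[\d.]+>):\s*"
    r"janitor_agent:do_wait_seqno_persisted:\d+\]Got\s+etmpfail"
)


def etmpfail_retries(log_text):
    """Return a list of (timestamp, node, pid) tuples, one per etmpfail retry."""
    return ETMPFAIL_RE.findall(log_text)
```

Calling `len(etmpfail_retries(text))` on the log contents gives the total number of retries, and the timestamps show how tightly they are spaced.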


This replication start/stop cycle just keeps repeating without getting 
anywhere.

The memcached logs have a bunch of these:

Tue Dec 16 01:34:00.745866 UTC 3: (loop) Notified the timeout on checkpoint 
persistence for vbucket 998, id 19, cookie 0x577f900
Tue Dec 16 01:34:02.207600 UTC 3: (loop) Notified the timeout on checkpoint 
persistence for vbucket 991, id 8, cookie 0x577f300 
Tue Dec 16 01:34:03.119728 UTC 3: (loop) Notified the timeout on checkpoint 
persistence for vbucket 987, id 13, cookie 0x577f000 
Tue Dec 16 01:34:03.228693 UTC 3: (loop) Notified the timeout on checkpoint 
persistence for vbucket 986, id 11, cookie 0x577ed00 
Tue Dec 16 01:34:06.542319 UTC 3: (loop) Notified the timeout on checkpoint 
persistence for vbucket 973, id 43, cookie 0x577ea00 
Tue Dec 16 01:34:07.187235 UTC 3: (loop) Notified the timeout on checkpoint 
persistence for vbucket 1022, id 7, cookie 0x57ccc00 
Tue Dec 16 01:34:07.644722 UTC 3: (loop) Notified the timeout on checkpoint 
persistence for vbucket 968, id 21, cookie 0x577e700 
Tue Dec 16 01:34:08.681139 UTC 3: (loop) Notified the timeout on checkpoint 
persistence for vbucket 1016, id 11, cookie 0x57cc300 
Tue Dec 16 01:34:09.614624 UTC 3: (loop) Notified the timeout on checkpoint 
persistence for vbucket 961, id 14, cookie 0x577e400 
Tue Dec 16 01:34:17.389983 UTC 3: (loop) Notified the timeout on checkpoint 
persistence for vbucket 1020, id 4, cookie 0x57cc900
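
One way to check whether the timeouts line up with the vbuckets listed in the move_state dump is to tally them per vbucket. A minimal sketch, assuming the memcached log messages have exactly the wording quoted above; `count_persistence_timeouts` is a hypothetical helper name, not part of any Couchbase tool.

```python
import re
from collections import Counter

# Matches the checkpoint persistence timeout message quoted above.
# \s+ between words tolerates the line breaks introduced by mail wrapping.
TIMEOUT_RE = re.compile(
    r"Notified\s+the\s+timeout\s+on\s+checkpoint\s+persistence"
    r"\s+for\s+vbucket\s+(\d+)"
)


def count_persistence_timeouts(log_text):
    """Return a Counter mapping vbucket id -> number of timeout notifications."""
    return Counter(int(vb) for vb in TIMEOUT_RE.findall(log_text))
```

Feeding it the contents of the memcached log and comparing the keys against the vbucket ids in the move_state list would show whether every stuck move has a matching persistence timeout.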



How do I figure out what the rebalancer does not like about the first 
node?

