How does one go about debugging why a node fails to rebalance?
1. The node in question fails to complete cbtransfer/cbbackup. See
https://groups.google.com/d/msg/couchbase/D90J8zwhLos/OAUX_obzCygJ
2. The node was once part of a 2-node cluster. I failed over a node and
removed it from the cluster. Now, trying to add a second node back to the
cluster always hangs on rebalance. From the attached logs, it seems that the
rebalancer hits some kind of error in the middle of transferring data,
never recovers from it, and just starts and stops continuously from then on.
3. I tried both 3.0.1 and 3.0.2, with similar results.
It seems to start rebalancing, error out, and then start again at
the same place, over and over:
[ns_server:debug,2014-12-16T1:56:06.793,ns_1@union-2:<0.1111.0>:
ns_rebalance_observer:docs_left_updater_loop:359]Starting
docs_left_updater_loop:"loop"
[{move_state,961,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
3,3,<<>>}]},
{move_state,968,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
4,4,<<>>}]},
{move_state,973,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
9,9,<<>>}]},
{move_state,986,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
4,4,<<>>}]},
{move_state,987,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
6,6,<<>>}]},
{move_state,991,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
3,3,<<>>}]},
{move_state,998,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
4,4,<<>>}]},
{move_state,1000,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
5,5,<<>>}]},
{move_state,1007,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
1,1,<<>>}]},
{move_state,1012,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
7,7,<<>>}]},
{move_state,1016,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
5,5,<<>>}]},
{move_state,1020,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
1,1,<<>>}]},
{move_state,1022,
['ns_1@union-1',undefined],
['ns_1@union-2',
'ns_1@union-1'],
dcp,
[{replica_building_stats,'ns_1@union-2',
1,1,<<>>}]}]
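To make a dump like this easier to compare between loop iterations, here is a minimal Python sketch that pulls the vBucket ID and the two counters out of each move_state entry. The function name parse_move_states is hypothetical, and the regular expression only assumes the layout shown in the excerpt above. It does not assume what the two integers in replica_building_stats mean, and it would mis-pair entries if a move_state had no replica_building_stats tuple.

```python
import re

# Matches one move_state entry as printed by docs_left_updater_loop above:
# the vBucket id, then (lazily) the first replica_building_stats tuple that
# follows it. The log is wrapped across lines, so whitespace is optional
# everywhere and '.' is allowed to match newlines.
MOVE_RE = re.compile(
    r"\{move_state,\s*(\d+),"
    r".*?\{replica_building_stats,\s*'([^']+)',\s*(\d+),\s*(\d+),",
    re.S,
)


def parse_move_states(text):
    """Return a list of (vbucket, node, first_counter, second_counter)."""
    return [
        (int(vb), node, int(first), int(second))
        for vb, node, first, second in MOVE_RE.findall(text)
    ]
```

Running this over two consecutive dumps and comparing the results would show whether any vBucket's counters change between iterations or whether the same entries repeat unchanged.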
Then something goes wrong with the rebalance:
[rebalance:debug,2014-12-16T1:51:37.169,ns_1@union-2:<0.3239.0>:
janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no
persistence. Will try again
[rebalance:debug,2014-12-16T1:51:37.265,ns_1@union-2:<0.3259.0>:
janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no
persistence. Will try again
[rebalance:debug,2014-12-16T1:51:40.582,ns_1@union-2:<0.4018.0>:
janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no
persistence. Will try again
[rebalance:debug,2014-12-16T1:51:41.233,ns_1@union-2:<0.1316.0>:
janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no
persistence. Will try again
[rebalance:debug,2014-12-16T1:51:41.703,ns_1@union-2:<0.4279.0>:
janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no
persistence. Will try again
[ns_server:debug,2014-12-16T1:51:41.793,ns_1@union-2:<0.1111.0>:
ns_rebalance_observer:docs_left_updater_loop:359]Starting
docs_left_updater_loop:"loop"
This replication start/stop cycle just keeps going without getting
anywhere.
The memcached logs have a bunch of these:
Tue Dec 16 01:34:00.745866 UTC 3: (loop) Notified the timeout on checkpoint
persistence for vbucket 998, id 19, cookie 0x577f900
Tue Dec 16 01:34:02.207600 UTC 3: (loop) Notified the timeout on checkpoint
persistence for vbucket 991, id 8, cookie 0x577f300
Tue Dec 16 01:34:03.119728 UTC 3: (loop) Notified the timeout on checkpoint
persistence for vbucket 987, id 13, cookie 0x577f000
Tue Dec 16 01:34:03.228693 UTC 3: (loop) Notified the timeout on checkpoint
persistence for vbucket 986, id 11, cookie 0x577ed00
Tue Dec 16 01:34:06.542319 UTC 3: (loop) Notified the timeout on checkpoint
persistence for vbucket 973, id 43, cookie 0x577ea00
Tue Dec 16 01:34:07.187235 UTC 3: (loop) Notified the timeout on checkpoint
persistence for vbucket 1022, id 7, cookie 0x57ccc00
Tue Dec 16 01:34:07.644722 UTC 3: (loop) Notified the timeout on checkpoint
persistence for vbucket 968, id 21, cookie 0x577e700
Tue Dec 16 01:34:08.681139 UTC 3: (loop) Notified the timeout on checkpoint
persistence for vbucket 1016, id 11, cookie 0x57cc300
Tue Dec 16 01:34:09.614624 UTC 3: (loop) Notified the timeout on checkpoint
persistence for vbucket 961, id 14, cookie 0x577e400
Tue Dec 16 01:34:17.389983 UTC 3: (loop) Notified the timeout on checkpoint
persistence for vbucket 1020, id 4, cookie 0x57cc900
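One way to cross-check these timeouts against the vBuckets being moved is sketched below in Python. The function names timeout_counts and never_timed_out are hypothetical, and the pattern only assumes the message wording shown in the excerpt above (including the line wrap between "checkpoint" and "persistence").

```python
import re
from collections import Counter

# The message is wrapped across two lines in the excerpt, so allow any
# whitespace (including a newline) between "checkpoint" and "persistence".
TIMEOUT_RE = re.compile(
    r"Notified the timeout on checkpoint\s+persistence for vbucket (\d+)"
)


def timeout_counts(log_text):
    """Count checkpoint-persistence timeouts per vBucket id."""
    return Counter(int(vb) for vb in TIMEOUT_RE.findall(log_text))


def never_timed_out(moving_vbuckets, log_text):
    """Return the moving vBuckets that have no timeout line in log_text."""
    seen = timeout_counts(log_text)
    return sorted(vb for vb in moving_vbuckets if vb not in seen)
```

Applied to the excerpts in this post, 10 of the 13 vBuckets in the move_state list show up in the timeout lines, while 1000, 1007, and 1012 do not. The two excerpts cover different times, so this is only suggestive.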
How do I go about figuring out what the rebalancer does not like about the
first node?