Re: help rebalance hangs

Aliaksey Kandratsenka Mon, 15 Dec 2014 18:25:49 -0800

On Mon, Dec 15, 2014 at 6:01 PM, Michael Liu <[email protected]> wrote:
>
> How does one go about debugging why a node fails to rebalance?
>
> 1. The node in question fails to complete cbtransfer/cbbackup. See
> https://groups.google.com/d/msg/couchbase/D90J8zwhLos/OAUX_obzCygJ
>
> 2. The node was once part of a 2-node cluster. I failed over a node and
> removed it from the cluster. Now, trying to add a second node back to the
> cluster always hangs on rebalance. From the attached logs, it seems that the
> rebalancer hits some kind of error in the middle of transferring data,
> never recovers, and just starts and stops continuously from then on.
>
> 3. Tried both 3.0.1 and 3.0.2, with similar results.
>
> It seems to start rebalancing, error out, and then start again at
> the same place, over and over:
>
> [ns_server:debug,2014-12-16T1:56:06.793,ns_1@union-2:<0.1111.0>:
> ns_rebalance_observer:docs_left_updater_loop:359]Starting
> docs_left_updater_loop:"loop"
>
> [{move_state,961,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',3,3,<<>>}]},
>  {move_state,968,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',4,4,<<>>}]},
>  {move_state,973,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',9,9,<<>>}]},
>  {move_state,986,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',4,4,<<>>}]},
>  {move_state,987,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',6,6,<<>>}]},
>  {move_state,991,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',3,3,<<>>}]},
>  {move_state,998,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',4,4,<<>>}]},
>  {move_state,1000,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',5,5,<<>>}]},
>  {move_state,1007,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',1,1,<<>>}]},
>  {move_state,1012,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',7,7,<<>>}]},
>  {move_state,1016,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',5,5,<<>>}]},
>  {move_state,1020,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',1,1,<<>>}]},
>  {move_state,1022,
>              ['ns_1@union-1',undefined],
>              ['ns_1@union-2','ns_1@union-1'],
>              dcp,
>              [{replica_building_stats,'ns_1@union-2',1,1,<<>>}]}]
>
> Then something goes wrong with the rebalance:
> [rebalance:debug,2014-12-16T1:51:37.169,ns_1@union-2:<0.3239.0>:
> janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no
> persistence. Will try again
> [rebalance:debug,2014-12-16T1:51:37.265,ns_1@union-2:<0.3259.0>:
> janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no
> persistence. Will try again
> [rebalance:debug,2014-12-16T1:51:40.582,ns_1@union-2:<0.4018.0>:
> janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no
> persistence. Will try again
> [rebalance:debug,2014-12-16T1:51:41.233,ns_1@union-2:<0.1316.0>:
> janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no
> persistence. Will try again
> [rebalance:debug,2014-12-16T1:51:41.703,ns_1@union-2:<0.4279.0>:
> janitor_agent:do_wait_seqno_persisted:1119]Got etmpfail waiting for seq no
> persistence. Will try again
> [ns_server:debug,2014-12-16T1:51:41.793,ns_1@union-2:<0.1111.0>:
> ns_rebalance_observer:docs_left_updater_loop:359]Starting
> docs_left_updater_loop:"loop"
>
>
> This replication start/stop cycle just keeps going, without getting
> anywhere...
>
> The memcached logs have a bunch of these:
>
> Tue Dec 16 01:34:00.745866 UTC 3: (loop) Notified the timeout on
> checkpoint persistence for vbucket 998, id 19, cookie 0x577f900
> Tue Dec 16 01:34:02.207600 UTC 3: (loop) Notified the timeout on
> checkpoint persistence for vbucket 991, id 8, cookie 0x577f300
> Tue Dec 16 01:34:03.119728 UTC 3: (loop) Notified the timeout on
> checkpoint persistence for vbucket 987, id 13, cookie 0x577f000
> Tue Dec 16 01:34:03.228693 UTC 3: (loop) Notified the timeout on
> checkpoint persistence for vbucket 986, id 11, cookie 0x577ed00
> Tue Dec 16 01:34:06.542319 UTC 3: (loop) Notified the timeout on
> checkpoint persistence for vbucket 973, id 43, cookie 0x577ea00
> Tue Dec 16 01:34:07.187235 UTC 3: (loop) Notified the timeout on
> checkpoint persistence for vbucket 1022, id 7, cookie 0x57ccc00
> Tue Dec 16 01:34:07.644722 UTC 3: (loop) Notified the timeout on
> checkpoint persistence for vbucket 968, id 21, cookie 0x577e700
> Tue Dec 16 01:34:08.681139 UTC 3: (loop) Notified the timeout on
> checkpoint persistence for vbucket 1016, id 11, cookie 0x57cc300
> Tue Dec 16 01:34:09.614624 UTC 3: (loop) Notified the timeout on
> checkpoint persistence for vbucket 961, id 14, cookie 0x577e400
> Tue Dec 16 01:34:17.389983 UTC 3: (loop) Notified the timeout on
> checkpoint persistence for vbucket 1020, id 4, cookie 0x57cc900
>
>
>
> How do I go about figuring out what the rebalancer does not like about
> the first node?
>


Hi.

Glad to see someone willing to inspect logs.

While the messages you posted might seem to indicate that Couchbase is stuck,
that is not necessarily the case. Note that those messages actually refer to
different vbuckets.

On the other hand, it might still be the case that one or a few vbuckets are
stuck persisting things to disk. If you can see the _same_ vbucket repeatedly
timing out waiting for persistence, then you can safely conclude that
something isn't right in persistence. If that's not the case, then it is
possible that Couchbase is making some progress moving vbuckets, just very
slowly.
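
If it helps, here is a rough sketch of how one could check for a repeatedly
timing-out vbucket without eyeballing the log. It only assumes the memcached
message format quoted above ("Notified the timeout on checkpoint persistence
for vbucket N"). The function names and the threshold of 2 are my own
illustrative choices, not anything Couchbase ships.

```python
import re
from collections import Counter

# Matches the memcached message quoted in this thread. \s+ is used between
# words so the pattern also works on text that a mail client has re-wrapped.
TIMEOUT_RE = re.compile(
    r"Notified\s+the\s+timeout\s+on\s+checkpoint\s+persistence\s+"
    r"for\s+vbucket\s+(\d+)"
)


def count_persistence_timeouts(log_text):
    """Return a Counter mapping vbucket id -> number of timeout messages."""
    return Counter(int(vb) for vb in TIMEOUT_RE.findall(log_text))


def repeated_vbuckets(counts, threshold=2):
    """Return vbucket ids seen at least `threshold` times, sorted by id.

    The threshold is an arbitrary assumption; raise it for long logs.
    """
    return sorted(vb for vb, n in counts.items() if n >= threshold)


# Small demo on two lines taken from the excerpt in this thread.
sample = (
    "Tue Dec 16 01:34:00.745866 UTC 3: (loop) Notified the timeout on\n"
    "checkpoint persistence for vbucket 998, id 19, cookie 0x577f900\n"
    "Tue Dec 16 01:34:02.207600 UTC 3: (loop) Notified the timeout on\n"
    "checkpoint persistence for vbucket 991, id 8, cookie 0x577f300\n"
)
counts = count_persistence_timeouts(sample)
print("timeouts per vbucket:", dict(counts))
print("vbuckets timing out repeatedly:", repeated_vbuckets(counts))
```

To use it on a real node, read the whole memcached log file into a string and
pass it to count_persistence_timeouts. If repeated_vbuckets keeps returning the
same ids over a long stretch of log, that points at persistence being stuck for
those vbuckets. If the ids keep changing, the rebalance is probably progressing
slowly.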

The timeouts themselves might be "harmless". They certainly indicate
something, namely that persistence is taking longer than we expect. Why that
is the case is hard to say, but it is not necessarily a bug.
