I'm currently testing fail-over with a two-node active-active cluster
(the nodes are called dig and dag): both nodes are up, then one is
manually killed. CTDB on the surviving node should perform a recovery,
after which everything should work again.
What occasionally happens instead is this:
After I kill the pacemaker process on dag (and dag is consequently
fenced), dig's CTDB tries to take the recovery lock and fails. Since
there is no other node online that could take the recovery lock and
thereby finish CTDB's recovery, dig's CTDB keeps retrying until it is
stopped manually.
The only way to get CTDB back to work is to restart OCFS2's distributed
lock manager.
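For what it's worth, one way to check whether the cluster filesystem still serves byte-range locks after the fence is a small fcntl probe. This is only a sketch: the path below is a placeholder, and on the live node it would point at a file on the OCFS2 mount (e.g. next to /srv/digedag1/samba/ctdb.lock). As far as I understand, CTDB takes its recovery lock as an exclusive fcntl write lock, so if this probe fails or hangs on the mount, the DLM is presumably still holding the dead node's lock state:

```python
import fcntl
import os
import tempfile

# Placeholder path for illustration; on the cluster this should be a
# file on the OCFS2 mount, next to the configured ctdb.lock file.
path = os.path.join(tempfile.gettempdir(), "reclock-test")

with open(path, "wb") as f:
    try:
        # CTDB acquires its recovery lock as an exclusive fcntl
        # byte-range lock; LOCK_NB makes the probe fail immediately
        # instead of blocking if the lock cannot be granted.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("fcntl lock acquired")
        fcntl.lockf(f, fcntl.LOCK_UN)
    except OSError:
        print("fcntl lock NOT available")
```

The ping_pong utility shipped with ctdb exercises the same locking path across nodes and may be a more thorough test.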
Log files and the pacemaker configuration are attached; any help would
be greatly appreciated :)
Regards,
Uwe
Our setup:
Two nodes, directly connected via LAN, running openSUSE 11.3 and
sharing a SAN drive that is attached via two interfaces using multipath.
pacemaker 1.1.2
corosync 1.2.1
cluster-glue 1.0.5-1.4
ctdb 1.0.114-2.20
ocfs2 1.4.3-1.4
multipath 0.4.8-51.3
node dag
node dig
primitive CTDB ocf:heartbeat:CTDB \
params ctdb_recovery_lock="/srv/digedag1/samba/ctdb.lock" smb_private_dir="/srv/digedag1/samba/private" ctdb_debuglevel="3" \
op monitor interval="10" timeout="160" depth="0" \
op start interval="0" timeout="290" \
op stop interval="0" timeout="200"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip="X.X.9.206" cidr_netmask="22" clusterip_hash="sourceip-sourceport" \
op monitor interval="60"
primitive DLM-Ocfs2 ocf:pacemaker:controld \
op monitor interval="120" \
op start interval="0" timeout="90" \
op stop interval="0" timeout="100"
primitive O2CB ocf:ocfs2:o2cb \
op monitor interval="120" \
op start interval="0" timeout="90" \
op stop interval="0" timeout="100"
primitive SbdStonith stonith:external/sbd \
params sbd_device="/dev/mapper/sbd"
primitive digedag1 ocf:heartbeat:Filesystem \
params device="/dev/mapper/digedag1-part1" directory="/srv/digedag1" fstype="ocfs2" options="acl" \
op monitor interval="120" timeout="40" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60"
clone CTDB-clone CTDB \
meta globally-unique="false" interleave="true"
clone DLM-Ocfs2-clone DLM-Ocfs2 \
meta globally-unique="false" interleave="true"
clone IP-Clone ClusterIP \
meta globally-unique="true"
clone O2CB-clone O2CB \
meta globally-unique="false" interleave="true"
clone digedag1-clone digedag1 \
meta interleave="true" ordered="true"
colocation CTDB-with-digedag1 inf: CTDB-clone digedag1-clone
colocation IP-with-CTDB inf: IP-Clone CTDB-clone
colocation digedag1-with-O2CB inf: digedag1-clone O2CB-clone
order start-CTDB-after-digedag1 inf: digedag1-clone CTDB-clone
order start-IP-after-CTDB inf: CTDB-clone IP-Clone
order start-O2CB-after-DLM-Ocfs2 inf: DLM-Ocfs2-clone O2CB-clone
order start-digedag1-after-O2CB inf: O2CB-clone digedag1-clone
property $id="cib-bootstrap-options" \
dc-version="1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="true" \
no-quorum-policy="ignore" \
stonith-timeout="150s" \
last-lrm-refresh="1299253687"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
2011/03/11 11:47:12.004372 [ 3679]: server/eventscript.c:701 Starting
eventscript monitor
2011/03/11 11:47:12.034905 [ 3679]: Connected client with pid:6570
2011/03/11 11:47:12.059019 [ 3679]: No public addresses file found. Nothing to
do for 10.interfaces
2011/03/11 11:47:12.211681 [ 3679]: server/eventscript.c:488 Eventscript
monitor finished with state 0
2011/03/11 11:47:12.890352 [ 3679]: Connected client with pid:6628
2011/03/11 11:47:22.898283 [ 3679]: Connected client with pid:6710
2011/03/11 11:47:23.329367 [ 3717]: client/ctdb_client.c:771 control timed out.
reqid:5130 opcode:91 dstnode:1
2011/03/11 11:47:23.329424 [ 3717]: client/ctdb_client.c:882 ctdb_control_recv
failed
2011/03/11 11:47:23.329449 [ 3717]: client/ctdb_client.c:1225 ctdb_control for
getnodes failed ret:-1 res:-1
2011/03/11 11:47:23.329466 [ 3717]: server/ctdb_recoverd.c:3074 Unable to get
nodemap from recovery master 1
2011/03/11 11:47:27.212378 [ 3679]: server/eventscript.c:701 Starting
eventscript monitor
2011/03/11 11:47:27.242495 [ 3679]: Connected client with pid:6722
2011/03/11 11:47:27.266210 [ 3679]: No public addresses file found. Nothing to
do for 10.interfaces
2011/03/11 11:47:27.418848 [ 3679]: server/eventscript.c:488 Eventscript
monitor finished with state 0
2011/03/11 11:47:30.692368 [ 3679]: dead count reached for node 1
2011/03/11 11:47:30.692434 [ 3679]: 192.168.1.2:4379: node 192.168.1.3:4379 is
dead: 0 connected
2011/03/11 11:47:30.692497 [ 3679]: Tearing down connection to dead node :1
2011/03/11 11:47:30.692519 [ 3717]: Dropped orphaned reply control with
reqid:5130
2011/03/11 11:47:30.692562 [ 3717]: ctdb_control error: 'node is disconnected'
2011/03/11 11:47:30.692605 [ 3717]: client/ctdb_client.c:1225 ctdb_control for
getnodes failed ret:-1 res:-1
2011/03/11 11:47:30.692635 [ 3717]: server/ctdb_recoverd.c:3074 Unable to get
nodemap from recovery master 1
2011/03/11 11:47:31.697869 [ 3717]: Recmaster node 1 is disconnected. Force
reelection
2011/03/11 11:47:31.697920 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:47:31.697978 [ 3679]: Freeze priority 1
2011/03/11 11:47:31.698330 [ 6775]: locking database 0x2607456f priority:1
g_lock.tdb
2011/03/11 11:47:31.698470 [ 6775]: locking database 0x17055d90 priority:1
connections.tdb
2011/03/11 11:47:31.698517 [ 6775]: locking database 0xc0bdde6a priority:1
sessionid.tdb
2011/03/11 11:47:31.698535 [ 6775]: locking database 0x1421fb78 priority:1
brlock.tdb
2011/03/11 11:47:31.698553 [ 6775]: locking database 0x42fe72c5 priority:1
locking.tdb
2011/03/11 11:47:31.698571 [ 6775]: locking database 0x7bbbd26c priority:1
passdb.tdb
2011/03/11 11:47:31.698588 [ 6775]: locking database 0x2672a57f priority:1
idmap2.tdb
2011/03/11 11:47:31.698620 [ 6775]: locking database 0x63501287 priority:1
share_info.tdb
2011/03/11 11:47:31.698637 [ 6775]: locking database 0xb775fff6 priority:1
secrets.tdb
2011/03/11 11:47:31.698654 [ 6775]: locking database 0xf2a58948 priority:1
registry.tdb
2011/03/11 11:47:31.698672 [ 6775]: locking database 0x92380e87 priority:1
account_policy.tdb
2011/03/11 11:47:31.698691 [ 6775]: locking database 0x435d3410 priority:1
notify.tdb
2011/03/11 11:47:31.698715 [ 6775]: locking database 0x1c904dfd priority:1
notify_onelevel.tdb
2011/03/11 11:47:31.698849 [ 3679]: Freeze priority 2
2011/03/11 11:47:31.699204 [ 3679]: Freeze priority 3
2011/03/11 11:47:31.699552 [ 3679]: server/ctdb_recover.c:673 Recovery mode set
to ACTIVE
2011/03/11 11:47:31.699681 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:47:32.915519 [ 3679]: Connected client with pid:6781
2011/03/11 11:47:34.701362 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:47:35.618975 [ 3679]: Connected client with pid:6782
2011/03/11 11:47:35.706359 [ 3717]: server/ctdb_recoverd.c:2260 Node:0 was in
recovery mode. Restart recovery process
2011/03/11 11:47:35.706402 [ 3717]: server/ctdb_recoverd.c:1313 Starting
do_recovery
2011/03/11 11:47:35.706417 [ 3717]: Taking out recovery lock from recovery
daemon
2011/03/11 11:47:35.706436 [ 3717]: Take the recovery lock
2011/03/11 11:47:42.932623 [ 3679]: Connected client with pid:6786
2011/03/11 11:47:52.949752 [ 3679]: Connected client with pid:6790
2011/03/11 11:48:02.963978 [ 3679]: Connected client with pid:6839
2011/03/11 11:48:12.977325 [ 3679]: Connected client with pid:6852
2011/03/11 11:48:23.000752 [ 3679]: Connected client with pid:6908
2011/03/11 11:48:31.700370 [ 3679]: server/ctdb_recover.c:637 Been in recovery
mode for too long. Dropping all IPS
2011/03/11 11:48:33.018190 [ 3679]: Connected client with pid:6912
2011/03/11 11:48:35.708376 [ 3679]: Recovery daemon ping timeout. Count : 0
2011/03/11 11:48:43.035359 [ 3679]: Connected client with pid:6916
2011/03/11 11:48:53.042577 [ 3679]: Connected client with pid:6920
2011/03/11 11:49:03.056439 [ 3679]: Connected client with pid:6969
2011/03/11 11:49:13.073716 [ 3679]: Connected client with pid:6973
2011/03/11 11:49:23.092184 [ 3679]: Connected client with pid:7165
2011/03/11 11:49:23.478826 [ 3717]: ctdb_recovery_lock: Failed to get recovery
lock on '/srv/digedag1/samba/ctdb.lock'
2011/03/11 11:49:23.478943 [ 3717]: Unable to get recovery lock - aborting
recovery
2011/03/11 11:49:24.482314 [ 3717]: server/ctdb_recoverd.c:1313 Starting
do_recovery
2011/03/11 11:49:24.482380 [ 3717]: Taking out recovery lock from recovery
daemon
2011/03/11 11:49:24.482397 [ 3717]: Take the recovery lock
2011/03/11 11:49:24.482502 [ 3717]: ctdb_recovery_lock: Failed to get recovery
lock on '/srv/digedag1/samba/ctdb.lock'
2011/03/11 11:49:24.482526 [ 3717]: Unable to get recovery lock - aborting
recovery
2011/03/11 11:49:25.486289 [ 3717]: server/ctdb_recoverd.c:1313 Starting
do_recovery
2011/03/11 11:49:25.486351 [ 3717]: Taking out recovery lock from recovery
daemon
2011/03/11 11:49:25.486377 [ 3717]: Take the recovery lock
2011/03/11 11:49:25.486537 [ 3717]: ctdb_recovery_lock: Failed to get recovery
lock on '/srv/digedag1/samba/ctdb.lock'
2011/03/11 11:49:25.486576 [ 3717]: Unable to get recovery lock - aborting
recovery
2011/03/11 11:49:26.490313 [ 3717]: server/ctdb_recoverd.c:1313 Starting
do_recovery
2011/03/11 11:49:26.490377 [ 3717]: Taking out recovery lock from recovery
daemon
2011/03/11 11:49:26.490393 [ 3717]: Take the recovery lock
2011/03/11 11:49:26.490580 [ 3717]: ctdb_recovery_lock: Failed to get recovery
lock on '/srv/digedag1/samba/ctdb.lock'
2011/03/11 11:49:26.490628 [ 3717]: Unable to get recovery lock - aborting
recovery
2011/03/11 11:49:27.494318 [ 3717]: server/ctdb_recoverd.c:1313 Starting
do_recovery
2011/03/11 11:49:27.494371 [ 3717]: Node 0 has caused 4 recoveries recently -
banning it for 300 seconds
2011/03/11 11:49:27.494388 [ 3717]: Banning node 0 for 300 seconds
2011/03/11 11:49:27.494442 [ 3679]: SET BAN STATE
2011/03/11 11:49:27.494467 [ 3679]: Banning this node for 300 seconds
2011/03/11 11:49:27.494521 [ 3717]: Taking out recovery lock from recovery
daemon
2011/03/11 11:49:27.494540 [ 3717]: Take the recovery lock
2011/03/11 11:49:27.494644 [ 3717]: ctdb_recovery_lock: Failed to get recovery
lock on '/srv/digedag1/samba/ctdb.lock'
2011/03/11 11:49:27.494666 [ 3717]: Unable to get recovery lock - aborting
recovery
2011/03/11 11:49:28.497981 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:28.498031 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:28.498116 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:31.501360 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:49:32.505954 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:32.506001 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:32.506087 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:33.109455 [ 3679]: Connected client with pid:7172
2011/03/11 11:49:34.133796 [ 3679]: Connected client with pid:7173
2011/03/11 11:49:34.134018 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=13791600006582132985
2011/03/11 11:49:34.134101 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=7173
2011/03/11 11:49:34.134179 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=4294967296
2011/03/11 11:49:34.134251 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=18302628885633695744
2011/03/11 11:49:34.134448 [ 3679]: Connected client with pid:7174
2011/03/11 11:49:34.134659 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=251054750753042733
2011/03/11 11:49:34.134750 [ 3679]: Connected client with pid:7175
2011/03/11 11:49:34.134990 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=12616595777325935393
2011/03/11 11:49:34.135265 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=7174
2011/03/11 11:49:34.135418 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=4294967296
2011/03/11 11:49:34.135490 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=7175
2011/03/11 11:49:34.135533 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=18302628885633695744
2011/03/11 11:49:34.135605 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=4294967296
2011/03/11 11:49:34.135667 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=18302628885633695744
2011/03/11 11:49:34.143078 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=17509995351216488448
2011/03/11 11:49:34.143254 [ 3679]: Could not add client IP X.X.9.206. This is
not a public address.
2011/03/11 11:49:34.143469 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=17509995351216488448
2011/03/11 11:49:34.143606 [ 3679]: Could not add client IP X.X.9.206. This is
not a public address.
2011/03/11 11:49:34.143691 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=17509995351216488448
2011/03/11 11:49:34.143828 [ 3679]: Could not add client IP X.X.9.206. This is
not a public address.
2011/03/11 11:49:35.509360 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:49:36.513964 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:36.514022 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:36.514302 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:39.517360 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:49:40.521961 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:40.522013 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:40.522101 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:43.126577 [ 3679]: Connected client with pid:7181
2011/03/11 11:49:43.525362 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:49:44.529971 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:44.530019 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:44.530116 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:47.533360 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:49:48.537967 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:48.538027 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:48.538279 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:51.541360 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
[...]
--
To unsubscribe from this list go to the following URL and read the
instructions: https://lists.samba.org/mailman/options/samba