I'm currently testing fail-over with a two-node active-active cluster
(the nodes are called dig and dag): both nodes are up, then one is
manually killed. CTDB on the surviving node should perform a recovery,
after which everything should work again.
What occasionally happens instead is this:
After I kill the pacemaker process on dag (and dag is consequently
fenced), dig's CTDB tries to take the recovery lock and fails. Since
there is no other node online that could take the recovery lock and
thereby finish CTDB's recovery, dig's CTDB keeps retrying until it is
stopped manually.
The only way to get CTDB back to work is to restart OCFS2's distributed
lock manager.
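For what it's worth, one way to check whether the cluster filesystem still serves byte-range locks after the fence is a small fcntl probe. This is only a sketch: the path below is a placeholder, and on the live node it would point at a file on the OCFS2 mount (e.g. next to /srv/digedag1/samba/ctdb.lock). As far as I understand, CTDB takes its recovery lock as an exclusive fcntl write lock, so if this probe fails or hangs on the mount, the DLM is presumably still holding the dead node's lock state:

```python
import fcntl
import os
import tempfile

# Placeholder path for illustration; on the cluster this should be a
# file on the OCFS2 mount, next to the configured ctdb.lock file.
path = os.path.join(tempfile.gettempdir(), "reclock-test")

with open(path, "wb") as f:
    try:
        # CTDB acquires its recovery lock as an exclusive fcntl
        # byte-range lock; LOCK_NB makes the probe fail immediately
        # instead of blocking if the lock cannot be granted.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("fcntl lock acquired")
        fcntl.lockf(f, fcntl.LOCK_UN)
    except OSError:
        print("fcntl lock NOT available")
```

The ping_pong utility shipped with ctdb exercises the same locking path across nodes and may be a more thorough test.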
Log files and the pacemaker configuration are attached; any help would
be greatly appreciated :)
Regards,
Uwe
Our setup:
Two nodes, directly connected via LAN, running openSUSE 11.3 and
sharing a SAN drive that is attached via two interfaces using multipath.
pacemaker 1.1.2
corosync 1.2.1
cluster-glue 1.0.5-1.4
ctdb 1.0.114-2.20
ocfs2 1.4.3-1.4
multipath 0.4.8-51.3
node dag
node dig
primitive CTDB ocf:heartbeat:CTDB \
params ctdb_recovery_lock="/srv/digedag1/samba/ctdb.lock" smb_private_dir="/srv/digedag1/samba/private" ctdb_debuglevel="3" \
op monitor interval="10" timeout="160" depth="0" \
op start interval="0" timeout="290" \
op stop interval="0" timeout="200"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip="X.X.9.206" cidr_netmask="22" clusterip_hash="sourceip-sourceport" \
op monitor interval="60"
primitive DLM-Ocfs2 ocf:pacemaker:controld \
op monitor interval="120" \
op start interval="0" timeout="90" \
op stop interval="0" timeout="100"
primitive O2CB ocf:ocfs2:o2cb \
op monitor interval="120" \
op start interval="0" timeout="90" \
op stop interval="0" timeout="100"
primitive SbdStonith stonith:external/sbd \
params sbd_device="/dev/mapper/sbd"
primitive digedag1 ocf:heartbeat:Filesystem \
params device="/dev/mapper/digedag1-part1" directory="/srv/digedag1" fstype="ocfs2" options="acl" \
op monitor interval="120" timeout="40" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60"
clone CTDB-clone CTDB \
meta globally-unique="false" interleave="true"
clone DLM-Ocfs2-clone DLM-Ocfs2 \
meta globally-unique="false" interleave="true"
clone IP-Clone ClusterIP \
meta globally-unique="true"
clone O2CB-clone O2CB \
meta globally-unique="false" interleave="true"
clone digedag1-clone digedag1 \
meta interleave="true" ordered="true"
colocation CTDB-with-digedag1 inf: CTDB-clone digedag1-clone
colocation IP-with-CTDB inf: IP-Clone CTDB-clone
colocation digedag1-with-O2CB inf: digedag1-clone O2CB-clone
order start-CTDB-after-digedag1 inf: digedag1-clone CTDB-clone
order start-IP-after-CTDB inf: CTDB-clone IP-Clone
order start-O2CB-after-DLM-Ocfs2 inf: DLM-Ocfs2-clone O2CB-clone
order start-digedag1-after-O2CB inf: O2CB-clone digedag1-clone
property $id="cib-bootstrap-options" \
dc-version="1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="true" \
no-quorum-policy="ignore" \
stonith-timeout="150s" \
last-lrm-refresh="1299253687"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
2011/03/11 11:47:12.004372 [ 3679]: server/eventscript.c:701 Starting
eventscript monitor
2011/03/11 11:47:12.034905 [ 3679]: Connected client with pid:6570
2011/03/11 11:47:12.059019 [ 3679]: No public addresses file found. Nothing to
do for 10.interfaces
2011/03/11 11:47:12.211681 [ 3679]: server/eventscript.c:488 Eventscript
monitor finished with state 0
2011/03/11 11:47:12.890352 [ 3679]: Connected client with pid:6628
2011/03/11 11:47:22.898283 [ 3679]: Connected client with pid:6710
2011/03/11 11:47:23.329367 [ 3717]: client/ctdb_client.c:771 control timed out.
reqid:5130 opcode:91 dstnode:1
2011/03/11 11:47:23.329424 [ 3717]: client/ctdb_client.c:882 ctdb_control_recv
failed
2011/03/11 11:47:23.329449 [ 3717]: client/ctdb_client.c:1225 ctdb_control for
getnodes failed ret:-1 res:-1
2011/03/11 11:47:23.329466 [ 3717]: server/ctdb_recoverd.c:3074 Unable to get
nodemap from recovery master 1
2011/03/11 11:47:27.212378 [ 3679]: server/eventscript.c:701 Starting
eventscript monitor
2011/03/11 11:47:27.242495 [ 3679]: Connected client with pid:6722
2011/03/11 11:47:27.266210 [ 3679]: No public addresses file found. Nothing to
do for 10.interfaces
2011/03/11 11:47:27.418848 [ 3679]: server/eventscript.c:488 Eventscript
monitor finished with state 0
2011/03/11 11:47:30.692368 [ 3679]: dead count reached for node 1
2011/03/11 11:47:30.692434 [ 3679]: 192.168.1.2:4379: node 192.168.1.3:4379 is
dead: 0 connected
2011/03/11 11:47:30.692497 [ 3679]: Tearing down connection to dead node :1
2011/03/11 11:47:30.692519 [ 3717]: Dropped orphaned reply control with
reqid:5130
2011/03/11 11:47:30.692562 [ 3717]: ctdb_control error: 'node is disconnected'
2011/03/11 11:47:30.692605 [ 3717]: client/ctdb_client.c:1225 ctdb_control for
getnodes failed ret:-1 res:-1
2011/03/11 11:47:30.692635 [ 3717]: server/ctdb_recoverd.c:3074 Unable to get
nodemap from recovery master 1
2011/03/11 11:47:31.697869 [ 3717]: Recmaster node 1 is disconnected. Force
reelection
2011/03/11 11:47:31.697920 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:47:31.697978 [ 3679]: Freeze priority 1
2011/03/11 11:47:31.698330 [ 6775]: locking database 0x2607456f priority:1
g_lock.tdb
2011/03/11 11:47:31.698470 [ 6775]: locking database 0x17055d90 priority:1
connections.tdb
2011/03/11 11:47:31.698517 [ 6775]: locking database 0xc0bdde6a priority:1
sessionid.tdb
2011/03/11 11:47:31.698535 [ 6775]: locking database 0x1421fb78 priority:1
brlock.tdb
2011/03/11 11:47:31.698553 [ 6775]: locking database 0x42fe72c5 priority:1
locking.tdb
2011/03/11 11:47:31.698571 [ 6775]: locking database 0x7bbbd26c priority:1
passdb.tdb
2011/03/11 11:47:31.698588 [ 6775]: locking database 0x2672a57f priority:1
idmap2.tdb
2011/03/11 11:47:31.698620 [ 6775]: locking database 0x63501287 priority:1
share_info.tdb
2011/03/11 11:47:31.698637 [ 6775]: locking database 0xb775fff6 priority:1
secrets.tdb
2011/03/11 11:47:31.698654 [ 6775]: locking database 0xf2a58948 priority:1
registry.tdb
2011/03/11 11:47:31.698672 [ 6775]: locking database 0x92380e87 priority:1
account_policy.tdb
2011/03/11 11:47:31.698691 [ 6775]: locking database 0x435d3410 priority:1
notify.tdb
2011/03/11 11:47:31.698715 [ 6775]: locking database 0x1c904dfd priority:1
notify_onelevel.tdb
2011/03/11 11:47:31.698849 [ 3679]: Freeze priority 2
2011/03/11 11:47:31.699204 [ 3679]: Freeze priority 3
2011/03/11 11:47:31.699552 [ 3679]: server/ctdb_recover.c:673 Recovery mode set
to ACTIVE
2011/03/11 11:47:31.699681 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:47:32.915519 [ 3679]: Connected client with pid:6781
2011/03/11 11:47:34.701362 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:47:35.618975 [ 3679]: Connected client with pid:6782
2011/03/11 11:47:35.706359 [ 3717]: server/ctdb_recoverd.c:2260 Node:0 was in
recovery mode. Restart recovery process
2011/03/11 11:47:35.706402 [ 3717]: server/ctdb_recoverd.c:1313 Starting
do_recovery
2011/03/11 11:47:35.706417 [ 3717]: Taking out recovery lock from recovery
daemon
2011/03/11 11:47:35.706436 [ 3717]: Take the recovery lock
2011/03/11 11:47:42.932623 [ 3679]: Connected client with pid:6786
2011/03/11 11:47:52.949752 [ 3679]: Connected client with pid:6790
2011/03/11 11:48:02.963978 [ 3679]: Connected client with pid:6839
2011/03/11 11:48:12.977325 [ 3679]: Connected client with pid:6852
2011/03/11 11:48:23.000752 [ 3679]: Connected client with pid:6908
2011/03/11 11:48:31.700370 [ 3679]: server/ctdb_recover.c:637 Been in recovery
mode for too long. Dropping all IPS
2011/03/11 11:48:33.018190 [ 3679]: Connected client with pid:6912
2011/03/11 11:48:35.708376 [ 3679]: Recovery daemon ping timeout. Count : 0
2011/03/11 11:48:43.035359 [ 3679]: Connected client with pid:6916
2011/03/11 11:48:53.042577 [ 3679]: Connected client with pid:6920
2011/03/11 11:49:03.056439 [ 3679]: Connected client with pid:6969
2011/03/11 11:49:13.073716 [ 3679]: Connected client with pid:6973
2011/03/11 11:49:23.092184 [ 3679]: Connected client with pid:7165
2011/03/11 11:49:23.478826 [ 3717]: ctdb_recovery_lock: Failed to get recovery
lock on '/srv/digedag1/samba/ctdb.lock'
2011/03/11 11:49:23.478943 [ 3717]: Unable to get recovery lock - aborting
recovery
2011/03/11 11:49:24.482314 [ 3717]: server/ctdb_recoverd.c:1313 Starting
do_recovery
2011/03/11 11:49:24.482380 [ 3717]: Taking out recovery lock from recovery
daemon
2011/03/11 11:49:24.482397 [ 3717]: Take the recovery lock
2011/03/11 11:49:24.482502 [ 3717]: ctdb_recovery_lock: Failed to get recovery
lock on '/srv/digedag1/samba/ctdb.lock'
2011/03/11 11:49:24.482526 [ 3717]: Unable to get recovery lock - aborting
recovery
2011/03/11 11:49:25.486289 [ 3717]: server/ctdb_recoverd.c:1313 Starting
do_recovery
2011/03/11 11:49:25.486351 [ 3717]: Taking out recovery lock from recovery
daemon
2011/03/11 11:49:25.486377 [ 3717]: Take the recovery lock
2011/03/11 11:49:25.486537 [ 3717]: ctdb_recovery_lock: Failed to get recovery
lock on '/srv/digedag1/samba/ctdb.lock'
2011/03/11 11:49:25.486576 [ 3717]: Unable to get recovery lock - aborting
recovery
2011/03/11 11:49:26.490313 [ 3717]: server/ctdb_recoverd.c:1313 Starting
do_recovery
2011/03/11 11:49:26.490377 [ 3717]: Taking out recovery lock from recovery
daemon
2011/03/11 11:49:26.490393 [ 3717]: Take the recovery lock
2011/03/11 11:49:26.490580 [ 3717]: ctdb_recovery_lock: Failed to get recovery
lock on '/srv/digedag1/samba/ctdb.lock'
2011/03/11 11:49:26.490628 [ 3717]: Unable to get recovery lock - aborting
recovery
2011/03/11 11:49:27.494318 [ 3717]: server/ctdb_recoverd.c:1313 Starting
do_recovery
2011/03/11 11:49:27.494371 [ 3717]: Node 0 has caused 4 recoveries recently -
banning it for 300 seconds
2011/03/11 11:49:27.494388 [ 3717]: Banning node 0 for 300 seconds
2011/03/11 11:49:27.494442 [ 3679]: SET BAN STATE
2011/03/11 11:49:27.494467 [ 3679]: Banning this node for 300 seconds
2011/03/11 11:49:27.494521 [ 3717]: Taking out recovery lock from recovery
daemon
2011/03/11 11:49:27.494540 [ 3717]: Take the recovery lock
2011/03/11 11:49:27.494644 [ 3717]: ctdb_recovery_lock: Failed to get recovery
lock on '/srv/digedag1/samba/ctdb.lock'
2011/03/11 11:49:27.494666 [ 3717]: Unable to get recovery lock - aborting
recovery
2011/03/11 11:49:28.497981 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:28.498031 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:28.498116 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:31.501360 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:49:32.505954 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:32.506001 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:32.506087 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:33.109455 [ 3679]: Connected client with pid:7172
2011/03/11 11:49:34.133796 [ 3679]: Connected client with pid:7173
2011/03/11 11:49:34.134018 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=13791600006582132985
2011/03/11 11:49:34.134101 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=7173
2011/03/11 11:49:34.134179 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=4294967296
2011/03/11 11:49:34.134251 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=18302628885633695744
2011/03/11 11:49:34.134448 [ 3679]: Connected client with pid:7174
2011/03/11 11:49:34.134659 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=251054750753042733
2011/03/11 11:49:34.134750 [ 3679]: Connected client with pid:7175
2011/03/11 11:49:34.134990 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=12616595777325935393
2011/03/11 11:49:34.135265 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=7174
2011/03/11 11:49:34.135418 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=4294967296
2011/03/11 11:49:34.135490 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=7175
2011/03/11 11:49:34.135533 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=18302628885633695744
2011/03/11 11:49:34.135605 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=4294967296
2011/03/11 11:49:34.135667 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=18302628885633695744
2011/03/11 11:49:34.143078 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=17509995351216488448
2011/03/11 11:49:34.143254 [ 3679]: Could not add client IP X.X.9.206. This is
not a public address.
2011/03/11 11:49:34.143469 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=17509995351216488448
2011/03/11 11:49:34.143606 [ 3679]: Could not add client IP X.X.9.206. This is
not a public address.
2011/03/11 11:49:34.143691 [ 3679]: server/ctdb_daemon.c:157 Registered message
handler for srvid=17509995351216488448
2011/03/11 11:49:34.143828 [ 3679]: Could not add client IP X.X.9.206. This is
not a public address.
2011/03/11 11:49:35.509360 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:49:36.513964 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:36.514022 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:36.514302 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:39.517360 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:49:40.521961 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:40.522013 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:40.522101 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:43.126577 [ 3679]: Connected client with pid:7181
2011/03/11 11:49:43.525362 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:49:44.529971 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:44.530019 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:44.530116 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:47.533360 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
2011/03/11 11:49:48.537967 [ 3717]: Recmaster node 0 no longer available. Force
reelection
2011/03/11 11:49:48.538027 [ 3717]: server/ctdb_recoverd.c:2078 Force an
election
2011/03/11 11:49:48.538279 [ 3717]: server/ctdb_recoverd.c:1756 Send election
request to all active nodes
2011/03/11 11:49:51.541360 [ 3717]: server/ctdb_recoverd.c:908 Election timed
out
[...]
--
To unsubscribe from this list go to the following URL and read the
instructions: https://lists.samba.org/mailman/options/samba