Executive overview: when I bring a node back from standby to test failover, the Filesystem RA on the _slave_ node - the one that has just relinquished the resource - tries to mount the filesystem again after it has already handed it back to the master. That mount fails and leaves the cluster with a failure event that won't let the slave mount it again until I clear it with crm_resource --resource fs_asterisk -C.
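For reference, the cleanup I keep having to run after each failover test is just the command above (I believe the crm shell equivalent is 'crm resource cleanup fs_asterisk'):

    # crm_resource --resource fs_asterisk -C
    # crm status

(after which the failed action disappears from 'crm status' until the next failover)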
So, background: SL6 x86_64, drbd 8.4.0rc2, pacemaker-1.1.2-7.el6, resource-agents-3.0.12-15.el6.

I've added some debugging to /usr/lib/ocf/resource.d/heartbeat/Filesystem with a nice unique XYZZY tag logged whenever Filesystem_start or Filesystem_stop is called (see the snippet at the end of this mail).

# crm status
(other resources snipped for readability)
============
Last updated: Mon Jun 20 15:02:05 2011
Stack: openais
Current DC: slave - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
10 Resources configured.
============

Online: [ master slave ]

 Master/Slave Set: ms_drbd_asterisk
     Masters: [ master ]
     Slaves: [ slave ]
 Resource Group: asterisk
     fs_asterisk   (ocf::heartbeat:Filesystem):   Started master
     ip_asterisk   (ocf::heartbeat:IPaddr2):      Started master
[...snip...]

# echo > /var/log/messages
# crm node standby master
# crm status
============
Last updated: Mon Jun 20 15:07:52 2011
Stack: openais
Current DC: slave - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
10 Resources configured.
============

Node master: standby
Online: [ slave ]

 Master/Slave Set: ms_drbd_asterisk
     Masters: [ slave ]
     Stopped: [ drbd_asterisk:1 ]
 Resource Group: asterisk
     fs_asterisk   (ocf::heartbeat:Filesystem):   Started slave
     ip_asterisk   (ocf::heartbeat:IPaddr2):      Started slave

(still no errors, all good)

# grep XYZZY /var/log/messages
Jun 20 15:07:37 localhost Filesystem[9879]: WARNING: XYZZY: Starting Filesystem /dev/drbd1

Now, if I bring 'master' back online (I've stickied that resource to master, so it moves back straight away):

# echo > /var/log/messages
# crm node online master
# crm status
============
Last updated: Mon Jun 20 15:11:29 2011
Stack: openais
Current DC: slave - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
10 Resources configured.
============

Online: [ master slave ]

 Master/Slave Set: ms_drbd_asterisk
     Masters: [ master ]
     Slaves: [ slave ]
 Resource Group: asterisk
     fs_asterisk   (ocf::heartbeat:Filesystem):   Started master
     ip_asterisk   (ocf::heartbeat:IPaddr2):      Started master

Failed actions:
    fs_asterisk_start_0 (node=slave, call=809, rc=1, status=complete): unknown error

And grepping produces:

# grep XYZZY /var/log/messages
Jun 20 15:10:51 localhost Filesystem[13889]: WARNING: XYZZY: Stopping Filesystem /dev/drbd1
Jun 20 15:10:56 localhost Filesystem[15338]: WARNING: XYZZY: Starting Filesystem /dev/drbd1
Jun 20 15:10:58 localhost Filesystem[15593]: WARNING: XYZZY: Stopping Filesystem /dev/drbd1

Oddly enough, it is the _same 3_ of my 5 resource groups that do this, every time.

Complete 'crm configure show' output is here: http://pastebin.com/0S7UVZ5U

(Note that I've tried it with a longer monitor interval - 59s - and without the --target-role=started on a couple of the resources.) Running 'crm status' while it's failing over, you can see it say 'FAILED: slave' before the resource starts up on the master.

If there's any useful information I can provide from /var/log/messages I'd be happy to send it, but I need to know which parts you consider important, as there's tonnes of it!
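In case it matters, the XYZZY debugging I added to the RA is nothing clever - just an ocf_log call at the top of Filesystem_start and Filesystem_stop, roughly like this (paraphrased from memory; the rest of each function is the stock code, and the exact device variable name may differ):

    Filesystem_start()
    {
            # XYZZY marker so start calls are easy to grep out of syslog
            ocf_log warn "XYZZY: Starting Filesystem $DEVICE"
            # ...stock start logic continues unchanged...
    }

    Filesystem_stop()
    {
            # XYZZY marker so stop calls are easy to grep out of syslog
            ocf_log warn "XYZZY: Stopping Filesystem $DEVICE"
            # ...stock stop logic continues unchanged...
    }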
--Rob