On 3/29/12 3:19 AM, Andrew Beekhof wrote:
> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
> <[email protected]> wrote:
>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2;
>> spec files and versions below.
>>
>> Problem: If I restart both nodes at the same time, or even just start
>> pacemaker on both nodes at the same time, the drbd ms resource starts,
>> but both nodes stay in slave mode. They'll both stay in slave mode until
>> one of the following occurs:
>>
>> - I manually type "crm resource cleanup <ms-resource-name>"
>>
>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the
>>   ms resources are promoted.
>>
>> The key resource definitions:
>>
>> primitive AdminDrbd ocf:linbit:drbd \
>>         params drbd_resource="admin" \
>>         op monitor interval="59s" role="Master" timeout="30s" \
>>         op monitor interval="60s" role="Slave" timeout="30s" \
>>         op stop interval="0" timeout="100" \
>>         op start interval="0" timeout="240" \
>>         meta target-role="Master"
>> ms AdminClone AdminDrbd \
>>         meta master-max="2" master-node-max="1" clone-max="2" \
>>         clone-node-max="1" notify="true" interleave="true"
>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
>> clone FilesystemClone FilesystemGroup \
>>         meta interleave="true" target-role="Started"
>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>>
>> Note that I stuck in "target-role" options to try to solve the problem;
>> no effect.
>>
>> When I look in /var/log/messages, I see no error messages or indications
>> why the promotion should be delayed. The 'admin' drbd resource is
>> reported as UpToDate on both nodes.
>> There are no error messages when I force the issue with:
>>
>> crm resource cleanup AdminClone
>>
>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
>> resource is ready to be promoted.
>>
>> This is not just an abstract case for me. At my site, it's not uncommon
>> for there to be lengthy power outages that will bring down the cluster.
>> Both systems will come up when power is restored, and I need for cluster
>> services to be available shortly afterward, not 15 minutes later.
>>
>> Any ideas?
>
> Not without any logs
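As an aside: the 15-minute delay described above matches Pacemaker's default cluster-recheck-interval, which drives the "PEngine Recheck Timer". A hedged workaround sketch (not from the thread; assumes the crm shell is available, and the 2-minute value is illustrative) that shortens the recheck so a stalled promotion is re-evaluated sooner:

```shell
# Sketch, not from the thread: shorten the PEngine recheck timer so the
# cluster re-runs the policy engine every 2 minutes instead of every 15.
# cluster-recheck-interval is a standard Pacemaker cluster property;
# "2min" here is an illustrative value, not a recommendation.
crm configure property cluster-recheck-interval="2min"

# Confirm the property was stored in the CIB:
crm configure show | grep cluster-recheck-interval
```

This only shortens the wait; it does not address why the promotion is not scheduled at startup in the first place.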
Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>

Before you click on the link (it's a big wall of text), here are what I
think are the landmarks:

- The extract starts just after the node boots, at the start of syslog at
  time 10:49:21.

- I've highlighted when pacemakerd starts, at 10:49:46.

- I've highlighted when drbd reports that the 'admin' resource is UpToDate,
  at 10:50:10.

- One last highlight: when pacemaker finally promotes the drbd resource to
  Primary on both nodes, at 11:05:11.

> Details:
>>
>> # rpm -q kernel cman pacemaker drbd
>> kernel-2.6.32-220.4.1.el6.x86_64
>> cman-3.0.12.1-23.el6.x86_64
>> pacemaker-1.1.6-3.el6.x86_64
>> drbd-8.4.1-1.el6.x86_64
>>
>> Output of crm_mon after two-node reboot or pacemaker restart:
>> <http://pastebin.com/jzrpCk3i>
>> cluster.conf: <http://pastebin.com/sJw4KBws>
>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
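For reference, the manual "kick" described earlier in the thread can be scripted. A sketch using the resource names from the thread (the polling loop and its 5-second interval are illustrative, not from the thread):

```shell
# Sketch: wait until the local DRBD disk state for the 'admin' resource
# is UpToDate, then re-probe the ms resource so the policy engine
# reconsiders the promotion instead of waiting for the recheck timer.
until drbdadm dstate admin | grep -q '^UpToDate'; do
    sleep 5   # illustrative polling interval
done
crm resource cleanup AdminClone

# One-shot cluster status; both nodes should show up as Master shortly.
crm_mon -1
```

`drbdadm dstate` prints the local/peer disk states (e.g. "UpToDate/UpToDate"), so the grep anchors on the local state only.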
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
