On 3/27/12 6:12 PM, William Seligman wrote:
> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2;
> spec files and versions below.
>
> Problem: If I restart both nodes at the same time, or even just start
> pacemaker on both nodes at the same time, the drbd ms resource starts, but
> both nodes stay in slave mode. They'll both stay in slave mode until one of
> the following occurs:
>
> - I manually type "crm resource cleanup <ms-resource-name>"
>
> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
>   resources are promoted.
>
> The key resource definitions:
>
> primitive AdminDrbd ocf:linbit:drbd \
>   params drbd_resource="admin" \
>   op monitor interval="59s" role="Master" timeout="30s" \
>   op monitor interval="60s" role="Slave" timeout="30s" \
>   op stop interval="0" timeout="100" \
>   op start interval="0" timeout="240" \
>   meta target-role="Master"
> ms AdminClone AdminDrbd \
>   meta master-max="2" master-node-max="1" clone-max="2" \
>   clone-node-max="1" notify="true" interleave="true"
> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
> clone FilesystemClone FilesystemGroup \
>   meta interleave="true" target-role="Started"
> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>
> Note that I stuck in "target-role" options to try to solve the problem; no
> effect.
>
> When I look in /var/log/messages, I see no error messages or indications why
> the promotion should be delayed. The 'admin' drbd resource is reported as
> UpToDate on both nodes. There are no error messages when I force the issue
> with:
>
> crm resource cleanup AdminClone
>
> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
> resource is ready to be promoted.
>
> This is not just an abstract case for me. At my site, it's not uncommon for
> there to be lengthy power outages that will bring down the cluster. Both
> systems will come up when power is restored, and I need cluster services to
> be available shortly afterward, not 15 minutes later.
>
> Any ideas?
>
> Details:
>
> # rpm -q kernel cman pacemaker drbd
> kernel-2.6.32-220.4.1.el6.x86_64
> cman-3.0.12.1-23.el6.x86_64
> pacemaker-1.1.6-3.el6.x86_64
> drbd-8.4.1-1.el6.x86_64
>
> Output of crm_mon after two-node reboot or pacemaker restart:
> <http://pastebin.com/jzrpCk3i>
> cluster.conf: <http://pastebin.com/sJw4KBws>
> "crm configure show": <http://pastebin.com/MgYCQ2JH>
> "drbdadm dump all": <http://pastebin.com/NrY6bskk>
Well, I can't say that I've "solved" this one, but I have a work-around: if I
turn on both machines at once there's a 15-minute delay, but if I turn on one
machine, wait a couple of minutes, and then turn on the other, at least the
resources start promptly on the first machine. The second machine joins the
cluster, but there's still a 15-minute delay until its DRBD partition is
promoted by pacemaker.

The reason why DRBD is promoted on the first machine has to do with the
previous issue I posted to this list:

<http://www.gossamer-threads.com/lists/linuxha/users/78691?do=post_view_threaded>

When doing the initial resource probe of the AdminLvm resource, it times out
due to the one-node LVM issue I discuss in that thread. That error causes the
pengine on the node to re-probe the resources and promote the DRBD partition,
which in turn leads to all the other resources starting on that node.

So I have a work-around, but not a solution. I'll take what I can get!

-- 
Bill Seligman              | Phone: (914) 591-2823
Nevis Labs, Columbia Univ  | mailto://[email protected]
PO Box 137                 |
Irvington NY 10533 USA     | http://www.nevis.columbia.edu/~seligman/
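For what it's worth, the 15-minute figure matches Pacemaker's default
cluster-recheck-interval. A possible stopgap, purely an assumption on my part
and not something I've verified, would be to shorten that property so a
stalled promotion gets re-evaluated sooner after a simultaneous power-up:

  # untested sketch: ask the policy engine to re-run every 2 minutes
  # instead of the default 15
  crm configure property cluster-recheck-interval="2min"

That would only narrow the window, though; it wouldn't explain why the
promotion doesn't happen at startup in the first place.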
