On 3/27/12 6:12 PM, William Seligman wrote:
> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
> files and versions below.
> 
> Problem: If I restart both nodes at the same time, or even just start
> pacemaker on both nodes at the same time, the drbd ms resource starts,
> but both nodes stay in slave mode. They'll both stay in slave mode until
> one of the following occurs:
> 
> - I manually type "crm resource cleanup <ms-resource-name>"
> 
> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
> resources are promoted.
> 
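An aside, since it comes up again below: the "PEngine Recheck Timer" fires
at Pacemaker's cluster-recheck-interval property, which defaults to 15
minutes. Shortening it should at least shrink the delay, though it is a
stopgap that masks whatever is blocking the promotion; assuming crmsh
syntax:

crm configure property cluster-recheck-interval="5min"
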
> The key resource definitions:
> 
> primitive AdminDrbd ocf:linbit:drbd \
>         params drbd_resource="admin" \
>         op monitor interval="59s" role="Master" timeout="30s" \
>         op monitor interval="60s" role="Slave" timeout="30s" \
>         op stop interval="0" timeout="100" \
>         op start interval="0" timeout="240" \
>         meta target-role="Master"
> ms AdminClone AdminDrbd \
>         meta master-max="2" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true" interleave="true"
> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
> clone FilesystemClone FilesystemGroup \
>         meta interleave="true" target-role="Started"
> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
> 
> Note that I stuck in the "target-role" options to try to solve the
> problem; they had no effect.
> 
> When I look in /var/log/messages, I see no error messages or indications
> why the promotion should be delayed. The 'admin' drbd resource is reported
> as UpToDate on both nodes. There are no error messages when I force the
> issue with:
> 
> crm resource cleanup AdminClone
> 
> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
> resource is ready to be promoted.
> 
> This is not just an abstract case for me. At my site, it's not uncommon
> for there to be lengthy power outages that bring down the cluster. Both
> systems will come up when power is restored, and I need cluster services
> to be available shortly afterward, not 15 minutes later.
> 
> Any ideas?
> 
> Details:
> 
> # rpm -q kernel cman pacemaker drbd
> kernel-2.6.32-220.4.1.el6.x86_64
> cman-3.0.12.1-23.el6.x86_64
> pacemaker-1.1.6-3.el6.x86_64
> drbd-8.4.1-1.el6.x86_64
> 
> Output of crm_mon after two-node reboot or pacemaker restart:
> <http://pastebin.com/jzrpCk3i>
> cluster.conf: <http://pastebin.com/sJw4KBws>
> "crm configure show": <http://pastebin.com/MgYCQ2JH>
> "drbdadm dump all": <http://pastebin.com/NrY6bskk>

Well, I can't say that I've "solved" this one, but I have a workaround: if
I turn on both machines at once, there's a 15-minute delay; but if I turn
on one machine, wait a couple of minutes, and then turn on the other, the
resources at least start promptly on the first machine. The second machine
joins the cluster, but there's still a 15-minute delay before its DRBD
partition is promoted by pacemaker.
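
If it helps anyone, the stagger itself can be automated. This is only a
rough sketch, assuming the stock RHEL6 init scripts and a guessed-at
two-minute delay: on the second node, take pacemaker out of chkconfig and
start it late from rc.local:

# /etc/rc.d/rc.local on the second node (pacemaker removed from chkconfig;
# cman still starts normally at boot). The 120-second delay is a guess;
# tune it to however long the first node takes to finish promoting.
sleep 120
service pacemaker start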

The reason DRBD is promoted promptly on the first machine has to do with a
previous issue I posted to this list:

<http://www.gossamer-threads.com/lists/linuxha/users/78691?do=post_view_threaded>

When pacemaker does the initial resource probe of the AdminLvm resource,
the probe times out due to the one-node LVM issue I discuss in that thread.
This error causes the pengine on that node to re-probe the resources and
promote the DRBD partition, which in turn leads to all the other resources
starting on that node.
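
That suggests a gentler "kick" than the full cleanup: force a re-probe
deliberately once both nodes are up. I haven't tried whether this also
triggers the promotion here, but crm has a reprobe subcommand for exactly
this sort of nudge:

crm resource reprobe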

So I have a work-around, but not a solution. I'll take what I can get!
-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
