On 3/29/12 3:19 AM, Andrew Beekhof wrote:
> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
> <[email protected]> wrote:
>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
>> files and versions below.
>>
>> Problem: If I restart both nodes at the same time, or even just start
>> pacemaker on both nodes at the same time, the drbd ms resource starts, but
>> both nodes stay in slave mode. They'll both stay in slave mode until one of
>> the following occurs:
>>
>> - I manually type "crm resource cleanup <ms-resource-name>"
>>
>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
>> resources are promoted.
>>
>> The key resource definitions:
>>
>> primitive AdminDrbd ocf:linbit:drbd \
>>         params drbd_resource="admin" \
>>         op monitor interval="59s" role="Master" timeout="30s" \
>>         op monitor interval="60s" role="Slave" timeout="30s" \
>>         op stop interval="0" timeout="100" \
>>         op start interval="0" timeout="240" \
>>         meta target-role="Master"
>> ms AdminClone AdminDrbd \
>>         meta master-max="2" master-node-max="1" clone-max="2" \
>>         clone-node-max="1" notify="true" interleave="true"
>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
>> clone FilesystemClone FilesystemGroup \
>>         meta interleave="true" target-role="Started"
>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>>
>> Note that I stuck in "target-role" options to try to solve the problem;
>> no effect.
>>
>> When I look in /var/log/messages, I see no error messages or indications
>> why the promotion should be delayed. The 'admin' drbd resource is reported
>> as UpToDate on both nodes. There are no error messages when I force the
>> issue with:
>>
>> crm resource cleanup AdminClone
>>
>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
>> resource is ready to be promoted.
>>
>> This is not just an abstract case for me. At my site, it's not uncommon
>> for there to be lengthy power outages that will bring down the cluster.
>> Both systems will come up when power is restored, and I need cluster
>> services to be available shortly afterward, not 15 minutes later.
>>
>> Any ideas?
> 
> Not without any logs

Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>

Before you click on the link (it's a big wall of text), here are what I think
are the landmarks:

- The extract starts just after the node boots, at the start of syslog at time
10:49:21.
- I've highlighted when pacemakerd starts, at 10:49:46.
- I've highlighted when drbd reports that the 'admin' resource is UpToDate, at
10:50:10.
- One last highlight: When pacemaker finally promotes the drbd resource to
Primary on both nodes, at 11:05:11.
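For what it's worth, that 15-minute gap matches Pacemaker's default
cluster-recheck-interval (15min), which would explain why it's the recheck
timer that finally triggers the promotion. As a stopgap (my assumption, not a
root-cause fix), shortening that interval should at least shrink the window:

```shell
# Stopgap sketch: re-run the policy engine more often so the missed
# promotion is retried sooner. "2min" is an arbitrary choice of mine.
crm configure property cluster-recheck-interval="2min"
```

That only papers over the delay, of course; the promotion should really
happen on the first transition after startup.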

> Details:
>>
>> # rpm -q kernel cman pacemaker drbd
>> kernel-2.6.32-220.4.1.el6.x86_64
>> cman-3.0.12.1-23.el6.x86_64
>> pacemaker-1.1.6-3.el6.x86_64
>> drbd-8.4.1-1.el6.x86_64
>>
>> Output of crm_mon after two-node reboot or pacemaker restart:
>> <http://pastebin.com/jzrpCk3i>
>> cluster.conf: <http://pastebin.com/sJw4KBws>
>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>
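
In case it helps anyone else hitting this, here's the interim "kick" I run
after boot. It's a sketch of my workaround, not a fix, and it assumes the
drbd resource name 'admin' and the ms resource 'AdminClone' from the config
above:

```shell
#!/bin/sh
# Interim workaround: wait until the local 'admin' DRBD resource reports
# UpToDate, then clean up the ms resource's status so the PEngine recomputes
# placement and promotes right away instead of waiting for the recheck timer.
until drbdadm dstate admin | grep -q '^UpToDate'; do
    sleep 5
done
crm resource cleanup AdminClone
```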

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
