On Fri, Mar 30, 2012 at 2:57 AM, William Seligman
<[email protected]> wrote:
> On 3/29/12 3:19 AM, Andrew Beekhof wrote:
>> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
>> <[email protected]> wrote:
>>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2;
>>> spec files and versions below.
>>>
>>> Problem: If I restart both nodes at the same time, or even just start
>>> pacemaker on both nodes at the same time, the drbd ms resource starts,
>>> but both nodes stay in slave mode. They'll both stay in slave mode until
>>> one of the following occurs:
>>>
>>> - I manually type "crm resource cleanup <ms-resource-name>"
>>>
>>> - 15 minutes elapse. Then the "PEngine Recheck Timer" fires, and the ms
>>> resources are promoted.
>>>
>>> The key resource definitions:
>>>
>>> primitive AdminDrbd ocf:linbit:drbd \
>>>         params drbd_resource="admin" \
>>>         op monitor interval="59s" role="Master" timeout="30s" \
>>>         op monitor interval="60s" role="Slave" timeout="30s" \
>>>         op stop interval="0" timeout="100" \
>>>         op start interval="0" timeout="240" \
>>>         meta target-role="Master"
>>> ms AdminClone AdminDrbd \
>>>         meta master-max="2" master-node-max="1" clone-max="2" \
>>>         clone-node-max="1" notify="true" interleave="true"
>>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
>>> clone FilesystemClone FilesystemGroup \
>>>         meta interleave="true" target-role="Started"
>>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
>>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>>>
>>> Note that I stuck in the "target-role" options to try to solve the
>>> problem; they had no effect.
>>>
>>> When I look in /var/log/messages, I see no error messages or indications
>>> of why the promotion should be delayed. The 'admin' drbd resource is
>>> reported as UpToDate on both nodes. There are no error messages when I
>>> force the issue with:
>>>
>>> crm resource cleanup AdminClone
>>>
>>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
>>> resource is ready to be promoted.
>>>
>>> This is not just an abstract case for me. At my site, it's not uncommon
>>> for there to be lengthy power outages that bring down the cluster. Both
>>> systems come up when power is restored, and I need cluster services to be
>>> available shortly afterward, not 15 minutes later.
>>>
>>> Any ideas?
>>
>> Not without any logs
>
> Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>
>
> Before you click on the link (it's a big wall of text),

I'm used to trawling the logs.  Grep is a wonderful thing :-)
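
(For the record, this is the sort of trawl I mean: pulling the
promotion-related pacemaker lines out of a mixed syslog. The daemon names
here assume the pacemaker 1.1 layout on RHEL 6; adjust the patterns and
LOG path to taste.)

```shell
# Pull promotion-relevant pacemaker lines out of a mixed syslog.
# Daemon names (pengine, crmd, lrmd) match pacemaker 1.1 on RHEL 6;
# override LOG if your syslog lands somewhere else.
LOG=${LOG:-/var/log/messages}
grep -hE 'pengine|crmd|lrmd' "$LOG" 2>/dev/null | grep -iE 'promote|master|transition'
```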

At this stage it is apparent that I need to see
/var/lib/pengine/pe-input-4.bz2 from hypatia-corosync.
Do you have this file still?
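
One stopgap worth trying while we dig: the 15-minute delay matches the
default cluster-recheck-interval, so (assuming that timer is what finally
triggers the promotion) lowering it should at least shrink the window. To
be clear, this papers over the symptom rather than fixing whatever is
suppressing the initial promote:

```shell
# Stopgap only: make the policy engine re-evaluate the cluster sooner,
# so a missed promotion is retried after ~2 minutes instead of the
# default 15. This does not fix the missing promotion trigger itself.
crm configure property cluster-recheck-interval="2min"

# Verify the property took effect:
crm configure show | grep cluster-recheck-interval
```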

> here are what I think
> are the landmarks:
>
> - The extract starts just after the node boots, at the start of syslog at time
> 10:49:21.
> - I've highlighted when pacemakerd starts, at 10:49:46.
> - I've highlighted when drbd reports that the 'admin' resource is UpToDate, at
> 10:50:10.
> - One last highlight: When pacemaker finally promotes the drbd resource to
> Primary on both nodes, at 11:05:11.
>
>>> Details:
>>>
>>> # rpm -q kernel cman pacemaker drbd
>>> kernel-2.6.32-220.4.1.el6.x86_64
>>> cman-3.0.12.1-23.el6.x86_64
>>> pacemaker-1.1.6-3.el6.x86_64
>>> drbd-8.4.1-1.el6.x86_64
>>>
>>> Output of crm_mon after two-node reboot or pacemaker restart:
>>> <http://pastebin.com/jzrpCk3i>
>>> cluster.conf: <http://pastebin.com/sJw4KBws>
>>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
>>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>
>
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://[email protected]
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
