On 3/30/12 1:13 AM, Andrew Beekhof wrote:
> On Fri, Mar 30, 2012 at 2:57 AM, William Seligman
> <[email protected]> wrote:
>> On 3/29/12 3:19 AM, Andrew Beekhof wrote:
>>> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
>>> <[email protected]> wrote:
>>>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2;
>>>> spec files and versions below.
>>>>
>>>> Problem: If I restart both nodes at the same time, or even just start
>>>> pacemaker on both nodes at the same time, the drbd ms resource starts,
>>>> but both nodes stay in slave mode. They'll both stay in slave mode until
>>>> one of the following occurs:
>>>>
>>>> - I manually type "crm resource cleanup <ms-resource-name>"
>>>>
>>>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
>>>>   resources are promoted.
>>>>
>>>> The key resource definitions:
>>>>
>>>> primitive AdminDrbd ocf:linbit:drbd \
>>>>         params drbd_resource="admin" \
>>>>         op monitor interval="59s" role="Master" timeout="30s" \
>>>>         op monitor interval="60s" role="Slave" timeout="30s" \
>>>>         op stop interval="0" timeout="100" \
>>>>         op start interval="0" timeout="240" \
>>>>         meta target-role="Master"
>>>> ms AdminClone AdminDrbd \
>>>>         meta master-max="2" master-node-max="1" clone-max="2" \
>>>>         clone-node-max="1" notify="true" interleave="true"
>>>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
>>>> clone FilesystemClone FilesystemGroup \
>>>>         meta interleave="true" target-role="Started"
>>>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
>>>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>>>>
>>>> Note that I stuck in "target-role" options to try to solve the problem; no
>>>> effect.
>>>>
>>>> When I look in /var/log/messages, I see no error messages or indications
>>>> why the promotion should be delayed. The 'admin' drbd resource is reported
>>>> as UpToDate on both nodes. There are no error messages when I force the
>>>> issue with:
>>>>
>>>> crm resource cleanup AdminClone
>>>>
>>>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
>>>> resource is ready to be promoted.
>>>>
>>>> This is not just an abstract case for me. At my site, it's not uncommon for
>>>> there to be lengthy power outages that will bring down the cluster. Both
>>>> systems will come up when power is restored, and I need for cluster
>>>> services to be available shortly afterward, not 15 minutes later.
>>>>
>>>> Any ideas?
>>>
>>> Not without any logs
>>
>> Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>
>>
>> Before you click on the link (it's a big wall of text),
>
> I'm used to trawling the logs. Grep is a wonderful thing :-)
>
> At this stage it is apparent that I need to see
> /var/lib/pengine/pe-input-4.bz2 from hypatia-corosync.
> Do you have this file still?

No, so I re-ran the test. Here's the log extract from the test I did today
<http://pastebin.com/6QYH2jkf>.

Based on what you asked for from the previous extract, I think what you want
from this test is pe-input-5. Just to play it safe, I copied and bunzip2'ed all
three pe-input files mentioned in the log messages:

pe-input-4: <http://pastebin.com/Txx50BJp>
pe-input-5: <http://pastebin.com/zzppL6DF>
pe-input-6: <http://pastebin.com/1dRgURK5>

I pray to the gods of Grep that you find a clue in all of that!

>> here are what I think
>> are the landmarks:
>>
>> - The extract starts just after the node boots, at the start of syslog at
>>   time 10:49:21.
>> - I've highlighted when pacemakerd starts, at 10:49:46.
>> - I've highlighted when drbd reports that the 'admin' resource is UpToDate,
>>   at 10:50:10.
>> - One last highlight: When pacemaker finally promotes the drbd resource to
>>   Primary on both nodes, at 11:05:11.
>>
>>> Details:
>>>>
>>>> # rpm -q kernel cman pacemaker drbd
>>>> kernel-2.6.32-220.4.1.el6.x86_64
>>>> cman-3.0.12.1-23.el6.x86_64
>>>> pacemaker-1.1.6-3.el6.x86_64
>>>> drbd-8.4.1-1.el6.x86_64
>>>>
>>>> Output of crm_mon after two-node reboot or pacemaker restart:
>>>> <http://pastebin.com/jzrpCk3i>
>>>> cluster.conf: <http://pastebin.com/sJw4KBws>
>>>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
>>>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
