On 3/27/12 4:52 AM, emmanuel segura wrote:
> So now your cluster is OK?

*Laughs* No! There's another problem I have to solve, but it's completely
unrelated to this one. I'll work on it some more, and if I can't solve it
I'll start a new thread. Thanks for asking, Emmanuel. (I want to prove I can
spell your name correctly!)

> On 27 March 2012 00:33, William Seligman <[email protected]> wrote:
>
>> On 3/26/12 5:31 PM, William Seligman wrote:
>>> On 3/26/12 5:17 PM, William Seligman wrote:
>>>> On 3/26/12 4:28 PM, emmanuel segura wrote:
>>>>> And I suggest you start clvmd at boot time:
>>>>>
>>>>>   chkconfig clvmd on
>>>>
>>>> I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I
>>>> get:
>>>>
>>>>   Mounting GFS2 filesystem (/usr/nevis): invalid device path
>>>>   "/dev/mapper/ADMIN-usr"
>>>>                                                           [FAILED]
>>>>
>>>> ... and so on, because the ADMIN volume group was never loaded by
>>>> clvmd. Without a "vgscan" in there somewhere, the system can't see the
>>>> volume groups on the drbd resource.
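>>>>
>>>> (With clvmd actually up, the by-hand equivalent of what's missing here
>>>> is just the standard LVM pair:
>>>>
>>>>   vgscan                # rescan devices, now including the drbd disk
>>>>   vgchange -a y ADMIN   # activate the clustered volume group
>>>>
>>>> after which the /dev/mapper/ADMIN-* device nodes exist; the trick is
>>>> getting something to run this before gfs2 tries to mount.)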
>>>
>>> Wait a second... there's an ocf:heartbeat:LVM resource! Testing...
>>
>> Emmanuel, you did it!
>>
>> For the sake of future searches, and possibly future documentation, let
>> me start with my original description of the problem:
>>
>>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in
>>> "Clusters From Scratch." Fencing works by forcibly rebooting a node,
>>> cutting and restoring its power via UPS.
>>>
>>> My fencing/failover tests have revealed a problem. If I gracefully turn
>>> off one node ("crm node standby"; "service pacemaker stop"; "shutdown
>>> -r now"), all the resources transfer to the other node with no
>>> problems. If I cut power to one node (as would happen if it were
>>> fenced), the lsb::clvmd resource on the remaining node eventually
>>> fails. Since all the other resources depend on clvmd, all the resources
>>> on the remaining node stop, and the cluster is left with nothing
>>> running.
>>>
>>> I've traced why lsb::clvmd fails: its monitor/status command includes
>>> "vgdisplay", which hangs indefinitely, so the monitor always times out.
>>>
>>> So this isn't a problem with pacemaker, but with clvmd/dlm: if a node
>>> is cut off, the cluster isn't handling it properly. Has anyone on this
>>> list seen this before? Any ideas?
>>>
>>> Details:
>>>
>>> Versions:
>>>   Red Hat Enterprise Linux 6.2 (kernel 2.6.32)
>>>   cman-3.0.12.1
>>>   corosync-1.4.1
>>>   pacemaker-1.1.6
>>>   lvm2-2.02.87
>>>   lvm2-cluster-2.02.87
>>
>> The problem is that clvmd on the surviving node will hang if there's a
>> substantial period during which the other node is running cman but not
>> clvmd. I never tracked down why this happens, but there's a practical
>> solution: minimize any interval in which that can be true. To ensure
>> this, take clvmd outside the resource manager's control and start all
>> three services at boot:
>>
>>   chkconfig cman on
>>   chkconfig clvmd on
>>   chkconfig pacemaker on
>>
>> On RHEL 6.2 these services start in the above order, and clvmd starts
>> within a few seconds of cman.
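>>
>> (A sanity check worth doing here, with the caveat that the exact
>> S-numbers depend on each package's chkconfig priorities, is to confirm
>> that the init links really sort cman before clvmd before pacemaker:
>>
>>   # lower S-numbers start first at boot (runlevel 3 shown)
>>   ls /etc/rc.d/rc3.d/ | egrep 'cman|clvmd|pacemaker'
>>
>> If clvmd doesn't fall between the other two, the window this fix is
>> meant to close is still open.)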
>>
>> Here's my cluster.conf <http://pastebin.com/GUr0CEgZ> and the output of
>> "crm configure show" <http://pastebin.com/f9D4Ui5Z>. The key lines from
>> the latter are:
>>
>>   primitive AdminDrbd ocf:linbit:drbd \
>>           params drbd_resource="admin"
>>   primitive AdminLvm ocf:heartbeat:LVM \
>>           params volgrpname="ADMIN" \
>>           op monitor interval="30" timeout="100" depth="0"
>>   primitive Gfs2 lsb:gfs2
>>   group VolumeGroup AdminLvm Gfs2
>>   ms AdminClone AdminDrbd \
>>           meta master-max="2" master-node-max="1" \
>>           clone-max="2" clone-node-max="1" \
>>           notify="true" interleave="true"
>>   clone VolumeClone VolumeGroup \
>>           meta interleave="true"
>>   colocation Volume_With_Admin inf: VolumeClone AdminClone:Master
>>   order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start
>>
>> What I learned: if you extend the example in "Clusters From Scratch" to
>> include logical volumes, you must start clvmd at boot time and put any
>> volume groups into ocf:heartbeat:LVM resources that start before gfs2.
>>
>> Note the long timeout on the ocf:heartbeat:LVM resource. This is a good
>> idea because, while the crashed node is booting, there will still be an
>> interval of a few seconds when cman is running but clvmd isn't. During
>> my tests, the LVM monitor would fail if it fired during that interval
>> with a timeout shorter than the time clvmd took to start on the crashed
>> node. That was annoying: all resources dependent on AdminLvm would be
>> stopped until AdminLvm recovered (a few more seconds). Increasing the
>> timeout avoids this.
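>>
>> (You can watch that monitor-killing window by hand. Assuming the GNU
>> coreutils "timeout" command that ships with RHEL 6, probe the clustered
>> VG while the other node is coming up:
>>
>>   # vgdisplay blocks on the cluster lock until clvmd answers, so cap it;
>>   # GNU timeout exits with status 124 if it had to kill the command
>>   timeout 10 vgdisplay ADMIN; echo "exit status: $?"
>>
>> A 124 means vgdisplay hung, which is exactly the hang that was failing
>> the lsb::clvmd monitor.)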
>>
>> It also means that during any recovery procedure on the crashed node for
>> which I turn off all the services, I have to minimize the interval
>> between the start of cman and the start of clvmd if I've turned the
>> services off at boot; e.g.:
>>
>>   service drbd start   # ... and fix any split-brain problems or whatever
>>   service cman start; service clvmd start   # put these on one line
>>   service pacemaker start
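>>
>> (To avoid fat-fingering that under pressure, a sketch like the following
>> could live in a recovery script; the "|| exit" guards and the pause
>> prompt are my own habit, not anything the init scripts require:
>>
>>   #!/bin/bash
>>   # bring a repaired node back up, keeping the cman-to-clvmd gap minimal
>>   service drbd start || exit 1
>>   # stop here and resolve any drbd split-brain before going on
>>   read -p "drbd healthy? Press Enter to continue... "
>>   service cman start || exit 1
>>   service clvmd start || exit 1   # immediately after cman, no gap
>>   service pacemaker start
>>
>> Anything that issues the cman and clvmd starts back to back will do.)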
>>
>> I thank everyone on this list who was patient with me as I pounded on
>> this problem for two weeks!

--
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
