On 3/24/12 4:47 AM, emmanuel segura wrote:
> How do you configure clvmd?
>
> with cman or with pacemaker?

Pacemaker. Here's the output of 'crm configure show': <http://pastebin.com/426CdVwN>
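In case that pastebin ever expires: the clvmd piece of it is just the usual
cloned primitive. Something like the following, purely as an illustration --
the resource agent name and the timeouts vary by distro, and the pastebin has
my actual configuration:

  primitive clvmd lsb:clvmd \
          op start interval="0" timeout="90s" \
          op stop interval="0" timeout="100s" \
          op monitor interval="30s"
  clone clvmd-clone clvmd \
          meta interleave="true"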
> Il giorno 23 marzo 2012 22:14, William Seligman
> <[email protected]> ha scritto:
>
>> On 3/23/12 5:03 PM, emmanuel segura wrote:
>>
>>> Sorry, but I would like to know if you can show me your
>>> /etc/cluster/cluster.conf.
>>
>> Here it is: <http://pastebin.com/GUr0CEgZ>
>>
>>> Il giorno 23 marzo 2012 21:50, William Seligman
>>> <[email protected]> ha scritto:
>>>
>>>> On 3/22/12 2:43 PM, William Seligman wrote:
>>>>> On 3/20/12 4:55 PM, Lars Ellenberg wrote:
>>>>>> On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
>>>>>>> On 3/16/12 12:12 PM, William Seligman wrote:
>>>>>>>> On 3/16/12 7:02 AM, Andreas Kurz wrote:
>>>>>>>>>
>>>>>>>>> s----- ... DRBD suspended I/O, most likely because of its
>>>>>>>>> fencing policy. For valid dual-primary setups you have to use the
>>>>>>>>> "resource-and-stonith" policy and a working "fence-peer" handler.
>>>>>>>>> In this mode I/O is suspended until fencing of the peer has
>>>>>>>>> succeeded. The question is why the peer does _not_ also suspend
>>>>>>>>> its I/O, because obviously fencing was not successful.
>>>>>>>>>
>>>>>>>>> So with a correct DRBD configuration, one of your nodes should
>>>>>>>>> already have been fenced because of the connection loss between
>>>>>>>>> the nodes (on the DRBD replication link).
>>>>>>>>>
>>>>>>>>> You can use e.g. this nice fencing script:
>>>>>>>>>
>>>>>>>>> http://goo.gl/O4N8f
>>>>>>>>
>>>>>>>> This is the output of "drbdadm dump admin":
>>>>>>>> <http://pastebin.com/kTxvHCtx>
>>>>>>>>
>>>>>>>> So I've got resource-and-stonith. I gather from an earlier thread
>>>>>>>> that obliterate-peer.sh is more or less equivalent in
>>>>>>>> functionality to stonith_admin-fence-peer.sh:
>>>>>>>>
>>>>>>>> <http://www.gossamer-threads.com/lists/linuxha/users/78504#78504>
>>>>>>>>
>>>>>>>> At the moment I'm pursuing the possibility that I'm returning the
>>>>>>>> wrong return codes from my fencing agent:
>>>>>>>>
>>>>>>>> <http://www.gossamer-threads.com/lists/linuxha/users/78572>
>>>>>>>
>>>>>>> I cleaned up my fencing agent, making sure its return codes
>>>>>>> matched those returned by the other fence_* agents in /usr/sbin,
>>>>>>> and allowing for some delay issues in reading the UPS status.
>>>>>>> But...
>>>>>>>
>>>>>>>> After that, I'll look at another suggestion involving lvm.conf:
>>>>>>>>
>>>>>>>> <http://www.gossamer-threads.com/lists/linuxha/users/78796#78796>
>>>>>>>>
>>>>>>>> Then I'll try DRBD 8.4.1. Hopefully one of these is the source of
>>>>>>>> the issue.
>>>>>>>
>>>>>>> Failure on all three counts.
>>>>>>
>>>>>> May I suggest you double-check the permissions on your fence-peer
>>>>>> script? I suspect you may simply have forgotten the "chmod +x".
>>>>>>
>>>>>> Test with "drbdadm fence-peer minor-0" from the command line.
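(For anyone who finds this in the archives: the DRBD side of what Andreas and
Lars describe above looks roughly like the stanza below. This is a paraphrase
from memory, not a copy of my configuration -- the "drbdadm dump" pastebin
above has the real settings, the resource name "admin" is just the one
referenced there, the fence-peer script path depends on where you installed
it, and which section the "fencing" option belongs in can vary with the DRBD
version, so check the drbd.conf man page.)

  resource admin {
    net {
      allow-two-primaries;            # dual-primary setup
    }
    disk {
      fencing resource-and-stonith;   # suspend I/O until the peer is fenced
    }
    handlers {
      # called by DRBD when the replication link to the peer is lost
      fence-peer "/usr/lib/drbd/stonith_admin-fence-peer.sh";
    }
  }

  # Exercise the handler by hand, as Lars suggests:
  #   drbdadm fence-peer minor-0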
>>>>> I still haven't solved the problem, but this advice has gotten me
>>>>> further than before.
>>>>>
>>>>> First, Lars was correct: I did not have execute permissions set on my
>>>>> fence-peer scripts. (D'oh!) I turned them on, but that did not change
>>>>> anything: cman+clvmd still hung on the vgdisplay command if I crashed
>>>>> the peer node.
>>>>>
>>>>> I started up both nodes again (cman+pacemaker+drbd+clvmd) and tried
>>>>> Lars' suggested command. I didn't save the response for this message
>>>>> (d'oh again!) but it said that the fence-peer script had failed.
>>>>>
>>>>> Hmm. The peer was definitely shutting down, so my fencing script was
>>>>> working. I went over it, comparing the return codes to those of the
>>>>> existing scripts, and made some changes. Here's my current script:
>>>>> <http://pastebin.com/nUnYVcBK>.
>>>>>
>>>>> Up until now my fence-peer scripts had been either Lon Hohberger's
>>>>> obliterate-peer.sh or Digimer's rhcs_fence. I decided to try the
>>>>> stonith_admin-fence-peer.sh script that Andreas Kurz recommended;
>>>>> unlike the first two scripts, which fence using fence_node, the latter
>>>>> script just calls stonith_admin.
>>>>>
>>>>> When I tried the stonith_admin-fence-peer.sh script, it worked:
>>>>>
>>>>>   # drbdadm fence-peer minor-0
>>>>>   stonith_admin-fence-peer.sh[10886]: stonith_admin successfully
>>>>>   fenced peer orestes-corosync.nevis.columbia.edu.
>>>>>
>>>>> Power was cut on the peer, and the remaining node stayed up. Then I
>>>>> brought up the peer with:
>>>>>
>>>>>   stonith_admin -U orestes-corosync.nevis.columbia.edu
>>>>>
>>>>> BUT: when the restored peer came up and started to run cman, clvmd
>>>>> hung on the main node again.
>>>>>
>>>>> After cycling through some more tests, I found that if I brought down
>>>>> the peer with drbdadm, then brought up the peer with no HA services,
>>>>> then started drbd and then cman, the cluster remained intact.
>>>>>
>>>>> If I crashed the peer, the scheme in the previous paragraph didn't
>>>>> work. I bring up drbd, check that the disks are both UpToDate, then
>>>>> bring up cman. At that point vgdisplay on the main node takes so long
>>>>> to run that clvmd times out:
>>>>>
>>>>>   vgdisplay
>>>>>     Error locking on node orestes-corosync.nevis.columbia.edu:
>>>>>     Command timed out
>>>>>
>>>>> I timed how long it took vgdisplay to run. I might be able to work
>>>>> around this by setting the timeout on my clvmd resource to 300s, but
>>>>> that seems to be a band-aid for an underlying problem. Any suggestions
>>>>> on what else I could check?
>>>>
>>>> I've done some more tests. Still no solution, just an observation: the
>>>> "death mode" appears to be:
>>>>
>>>> - Two nodes running cman+pacemaker+drbd+clvmd.
>>>> - Take one node down = one remaining node with cman+pacemaker+drbd+clvmd.
>>>> - Start up the dead node. If it ever gets into a state in which it's
>>>>   running cman but not clvmd, clvmd on the uncrashed node hangs.
>>>> - Conversely, if I bring up drbd, make it primary, then start
>>>>   cman+clvmd, there's no problem on the uncrashed node.
>>>>
>>>> My guess is that clvmd is getting the number of nodes it expects from
>>>> cman. When the formerly-dead node starts running cman, the number of
>>>> cluster nodes goes to 2 (I checked with 'cman_tool status'), but the
>>>> number of nodes running clvmd is still 1, hence the hang.
>>>>
>>>> Does this guess make sense?
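If anyone wants to poke at that guess on their own cluster, the comparison I
have in mind is roughly the following (command names from memory; the output
and tool names differ a bit between cluster stack versions):

  # What cman thinks the membership is:
  cman_tool status
  cman_tool nodes

  # What the DLM thinks -- clvmd should show up as a lockspace here:
  dlm_tool ls
  group_tool ls

If cman reports two nodes but the clvmd lockspace still shows only one member
after the peer rejoins, that would match the hang described above.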
-- 
Bill Seligman             | mailto://[email protected]
Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
PO Box 137                |
Irvington NY 10533 USA    | Phone: (914) 591-2823
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
