Re: [DRBD-user] DRBD+Pacemaker: Won't promote with only one node

Dan Barker Thu, 05 Jan 2012 09:32:23 -0800

I see your point. I was only thinking about the power failure scenario. Maybe 
you could shut down differently depending on the reason for the STONITH?

The problem I had was that the APC-provided shutdown process was pretty dumb. 
I had two hours of battery, and it takes about 45 minutes to gracefully shut 
everything down. So my program monitors the battery remaining and, when it 
gets to 45 minutes, starts shutting down VMs. If the power comes back on, it 
restarts the VMs it has stopped. The APC version, once begun, would continue 
the shutdown even if the power came back on. I finally implemented this using:

string UPSOff = ".1.3.6.1.4.1.318.1.1.1.6.2.1.0 i 3" ; 
string UPSSleep = ".1.3.6.1.4.1.318.1.1.1.6.2.3.0 i 3" ; 
read the Utility voltage to see if the power has come back on 
(.1.3.6.1.4.1.318.1.1.1.3.2.1.0)
Battery remaining (.1.3.6.1.4.1.318.1.1.1.2.2.3.0)
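
For anyone wanting to try these from a script, here is a minimal sketch of how the strings above might map onto Net-SNMP command lines. It only builds the argument lists and does not run anything. The host name, the community string, the choice of SNMP v1, and the helper names snmpset_args / snmpget_args are all assumptions for illustration; only the OIDs and the "i 3" values come from the message.

```python
# OIDs and values exactly as quoted in the message; everything else
# (host, community, SNMP version, helper names) is hypothetical.
UPS_OFF = (".1.3.6.1.4.1.318.1.1.1.6.2.1.0", "i", "3")
UPS_SLEEP = (".1.3.6.1.4.1.318.1.1.1.6.2.3.0", "i", "3")
UTILITY_VOLTAGE_OID = ".1.3.6.1.4.1.318.1.1.1.3.2.1.0"
BATTERY_REMAINING_OID = ".1.3.6.1.4.1.318.1.1.1.2.2.3.0"


def snmpset_args(host, community, setting):
    """Build an argument list in the form: snmpset OPTIONS AGENT OID TYPE VALUE."""
    oid, value_type, value = setting
    return ["snmpset", "-v", "1", "-c", community, host, oid, value_type, value]


def snmpget_args(host, community, oid):
    """Build an argument list in the form: snmpget OPTIONS AGENT OID."""
    return ["snmpget", "-v", "1", "-c", community, host, oid]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    print(" ".join(snmpset_args("ups1.example", "private", UPS_SLEEP)))
    print(" ".join(snmpget_args("ups1.example", "public", UTILITY_VOLTAGE_OID)))
```

The lists could be handed to subprocess.run once the host and community values are known to be right for the UPS network card.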

I needed to use UPSSleep; I had been using UPSOff. I use Battery Remaining and 
Utility voltage to decide whether to abandon or reverse my suspends. After I've 
suspended all the VMs, shut down the ESXi hosts and stopped the SAN, a UPSSleep 
will only go lights out until the utility power comes back on. OID codes were 
found at: 

http://support.ipmonitor.com/mibs/POWERNET-MIB/oids.aspx (says it's 
unmaintained, but worked for my SUA3000XL / SUA48XLBP).
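
The decision logic described above could be sketched roughly like this. It is not my actual program, just an illustration: the function name next_action, the action labels, the 100 V cutoff for "power is back", and the assumption that battery remaining has already been converted to minutes are all made up for the example. Only the 45-minute threshold and the overall flow come from the description.

```python
# Threshold from the message: a graceful shutdown takes about 45 minutes.
SHUTDOWN_THRESHOLD_MIN = 45
# Hypothetical cutoff for deciding that utility power has returned.
MIN_UTILITY_VOLTS = 100.0


def next_action(battery_minutes, utility_volts, suspended_vms, all_down):
    """Pick the next step from one poll of the UPS.

    battery_minutes: battery runtime remaining, already converted to minutes
    utility_volts:   utility (input) voltage reading
    suspended_vms:   number of VMs this program has suspended so far
    all_down:        True once VMs, hosts and SAN are all stopped
    """
    if utility_volts >= MIN_UTILITY_VOLTS:
        # Power is back: reverse any suspends, otherwise nothing to do.
        return "resume" if suspended_vms > 0 else "wait"
    if battery_minutes > SHUTDOWN_THRESHOLD_MIN:
        # On battery, but still enough runtime to hold off.
        return "wait"
    if not all_down:
        # At or below the threshold: keep shutting things down.
        return "suspend_next"
    # Everything is stopped: sleep the UPS so it returns with the mains.
    return "sleep_ups"


if __name__ == "__main__":
    print(next_action(120, 0.0, 0, False))
    print(next_action(45, 0.0, 0, False))
    print(next_action(30, 120.0, 3, False))
    print(next_action(10, 0.0, 5, True))
```

Keeping the decision in a pure function like this makes it easy to check each case without touching a real UPS.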

(I don't see any DRBD in my reply. I guess we've drifted off topic<g>)

Dan

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of William Seligman
Sent: Thursday, January 05, 2012 12:06 PM
To: [email protected]
Subject: Re: [DRBD-user] DRBD+Pacemaker: Won't promote with only one node

> Message: 1
> Date: Wed, 4 Jan 2012 15:58:09 -0500
> From: "Dan Barker" <[email protected]>
> Subject: Re: [DRBD-user] DRBD+Pacemaker: Won't promote with only one
>       node
> To: <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain;     charset="UTF-8"
> 
> I'd say the error is in the STONITH method. You evidently are giving the UPS 
> a "SHUTDOWN" command when you should be giving it a "SLEEP" or "SUSPEND" 
> command (whatever your UPS vendor's idea is of powering off the outlets only 
> until the mains come on and have charged the batteries to above 5% or whatever). 
> With the APC family and a Network Card, there are very fine controls over 
> this sort of action. The APIs published are fairly primitive. I had to write 
> SNMP routines to make my APC do what I wanted, when I wanted. The doc was 
> like pulling teeth to find. If you have APC equipment, I can share. If not, 
> what do you have? What controls does it publish?

I'm using APC SMART-UPSes, and issuing the SHUTDOWN command as you suspected. I 
wouldn't mind seeing the SNMP write-up you've got on the obscure APC API.

However, I believe what I've got is what I want to do. Suppose one node 
STONITHs another for reasons that have nothing to do with a power outage. I 
don't want the STONITHed UPS to come back on for any reason. I'm concerned 
about the (admittedly unlikely) possibility that a node is STONITHed because it 
goes wonky, and then there's a power outage. Upon power recovery I'd get the 
wonky node trying to rejoin the cluster again.

So I think I've got STONITH set up satisfactorily. I just need help figuring 
out why a single node's DRBD resource is not being promoted to primary after a 
restart.

On 1/4/12 3:10 PM, William Seligman wrote:
> I'll give the technical details in a moment, but I thought I'd start 
> with a description of the problem.
> 
> I have a two-node active/passive cluster, with DRBD controlled by 
> Pacemaker. I upgraded to DRBD 8.4.x about six months ago (probably too 
> soon); everything was fine. Then last week we did some power-outage tests on 
> our cluster.
> 
> Each node in the cluster is attached to its own uninterruptible power 
> supply; the STONITH mechanism is to turn off the other node's UPS. In 
> the event of an extended power outage (this happens 2-3 times a year 
> at my site), it's likely that one node will STONITH the other when the 
> other node's UPS runs out of power and shuts it down. This means that 
> when power comes back on, only one node will come back up, since the 
> STONITHed UPS won't turn on again without manual intervention.
> 
> The problem is that with only one node, Pacemaker+DRBD won't promote 
> the DRBD resource to primary; it just sits there at secondary and 
> won't start up any DRBD-dependent resources. Only when the second node 
> comes back up will Pacemaker assign one of them the primary role. I've 
> confirmed this by shutting down corosync on both nodes, then bringing it up 
> again on just one of them.
> 
> I'm pretty sure that this is due to a mistake I've made in my 
> DRBD configuration when I fiddled with it during the 8.4.x upgrade. 
> I've attached the files. Can one of you kind folks spot the error?
> 
> Technical details:
> 
> Two-node configuration: hypatia and orestes
> OS: Scientific Linux 5.5, kernel 2.6.18-238.19.1.el5xen
> Packages:
> drbd-8.4.1-1
> corosync-1.2.7-1.1.el5
> pacemaker-1.0.12-1.el5.centos
> openais-1.1.3-1.6.el5

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
