And finally, my ha.cf:

[EMAIL PROTECTED] ha.d]# egrep -v "^#|^$" ha.cf
keepalive 2
deadtime 30
warntime 10
initdead 120
udpport 695
bcast   eth0 eth2       # Linux
auto_failback off
node    dtbaims
node    itbaims
debug 1
use_logd yes
conn_logd_time 60
compression     bz2
crm respawn

Is there anything that can be added to this to ensure failover on
complete power loss to the primary server?
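A side note on one possible direction (an untested sketch, not something from this thread; the id values are made up): since the RSA card is powered by the host, a fencing device that does not depend on host power could let fencing complete after a total power loss. The heartbeat meatware stonith plugin is one such fallback - it waits for an operator to confirm the dead node with meatclient. A CIB fragment might look like:

```xml
      <primitive id="r_stonith-manual" class="stonith" type="meatware">
        <instance_attributes id="instance_attributes.meatware">
          <nvpair id="nvpair.meatware-hostlist" name="hostlist"
                  value="dtbaims itbaims"/>
        </instance_attributes>
      </primitive>
```

The operator would then run `meatclient -c dtbaims` on the surviving node to confirm the node is really powered off, after which the cluster proceeds with failover. This trades automation for safety, but it avoids blocking forever on an unreachable RSA card.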



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Alex Strachan
Sent: Wednesday, 29 October 2008 4:52 PM
To: 'General Linux-HA mailing list'
Subject: RE: [Linux-HA] Stonith, 2 node cluster - on loss of power
to primary node; failover to secondary didn't happen.

It looks like it may be possible to power the card via a separate power
adapter, but this still doesn't help in the case of a complete power failure.


The stonith seems to be working fine.  I have a Filesystem resource set to
'fence' on failure.  I triggered this on the primary server and stonith
kicked in from the secondary, reset the primary, and then started running
the resources - fantastic!


Hmmm - that only leaves how to recover from a complete power loss, where the
RSA card is not available.

I have attached my cib.

Any pointers would be great.

Thanks

Alex



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Alex Strachan
Sent: Wednesday, 29 October 2008 4:07 PM
To: 'General Linux-HA mailing list'
Subject: RE: [Linux-HA] Stonith, 2 node cluster - on loss of power
to primary node; failover to secondary didn't happen.


When power was restored, the resources restarted on the ex-primary, dtbaims.

Last error from crm_verify -
 
[EMAIL PROTECTED] ~]# crm_verify -L -V
crm_verify[31741]: 2008/10/29_16:00:20 notice: main: Required feature set:
2.0
crm_verify[31741]: 2008/10/29_16:00:20 WARN: main: Your configuration was
internally updated to the latest version (pacemaker-1.0)
crm_verify[31741]: 2008/10/29_16:00:20 notice: unpack_config: On loss of CCM
Quorum: Ignore
crm_verify[31741]: 2008/10/29_16:00:20 WARN: unpack_rsc_op: Processing
failed op r_stonith-dtbaims_start_0 on itbaims: Error
crm_verify[31741]: 2008/10/29_16:00:20 WARN: unpack_rsc_op: Compatibility
handling for failed op r_stonith-dtbaims_start_0 on itbaims
crm_verify[31741]: 2008/10/29_16:00:20 WARN: native_color: Resource
r_stonith-dtbaims cannot run anywhere


Is the "WARN: unpack_rsc_op: Compatibility handling for failed op
r_stonith-dtbaims_start_0 on itbaims" indicative of a more serious
configuration error?



My stonith CIB config is:
      <primitive id="r_stonith-dtbaims" class="stonith"
type="external/ibmrsa-telnet">
        <operations>
          <op name="monitor" interval="60" id="r_stonith-dtbaims-mon"
timeout="300" requires="nothing"/>
          <op name="start" interval="0" id="r_stonith-dtbaims-start"
timeout="180"/>
          <op name="stop" interval="0" id="r_stonith-dtbaims-stop"
timeout="180"/>
        </operations>
        <instance_attributes id="instance_attributes.id49828">
          <nvpair id="nvpair.id49835" name="nodename" value="dtbaims"/>
          <nvpair id="nvpair.id49844" name="ip_address"
value="192.168.201.37"/>
          <nvpair id="nvpair.id49853" name="username" value="########"/>
          <nvpair id="nvpair.id49862" name="password" value="########"/>
        </instance_attributes>
        <meta_attributes id="primitive-r_stonith-dtbaims.meta">
          <nvpair id="resource_stickiness.meta.auto-7"
name="resource-stickiness" value="INFINITY"/>
        </meta_attributes>
      </primitive>
      <primitive id="r_stonith-itbaims" class="stonith"
type="external/ibmrsa-telnet">
        <operations>
          <op name="monitor" interval="60" id="r_stonith-itbaims-mon"
timeout="300" requires="nothing"/>
          <op name="start" interval="0" id="r_stonith-itbaims-start"
timeout="180"/>
          <op name="stop" interval="0" id="r_stonith-itbaims-stop"
timeout="180"/>
        </operations>
        <instance_attributes id="instance_attributes.id49921">
          <nvpair id="nvpair.id49928" name="nodename" value="itbaims"/>
          <nvpair id="nvpair.id49937" name="ip_address"
value="192.168.201.38"/>
          <nvpair id="nvpair.id49946" name="username" value="########"/>
          <nvpair id="nvpair.id49955" name="password" value="########"/>
        </instance_attributes>
        <meta_attributes id="primitive-r_stonith-itbaims.meta">
          <nvpair id="resource_stickiness.meta.auto-33"
name="resource-stickiness" value="INFINITY"/>
        </meta_attributes>
      </primitive>

And the constraints (I use a non-symmetric cluster):

      <rsc_location id="r_stonith-dtbaims_hates_dtbaims"
rsc="r_stonith-dtbaims">
        <rule id="r_stonith-dtbaims_hates_dtbaims_rule" score="-INFINITY">
          <expression attribute="#uname" id="expression.id49985"
operation="eq" value="dtbaims"/>
        </rule>
      </rsc_location>
      <rsc_location id="r_stonith-dtbaims_loves_itbaims"
rsc="r_stonith-dtbaims">
        <rule id="r_stonith-dtbaims_loves_itbaims_rule" score="INFINITY">
          <expression attribute="#uname" id="expression.id50013"
operation="eq" value="itbaims"/>
        </rule>
      </rsc_location>
      <rsc_location id="r_stonith-itbaims_hates_itbaims"
rsc="r_stonith-itbaims">
        <rule id="r_stonith-itbaims_hates_itbaims_rule" score="-INFINITY">
          <expression attribute="#uname" id="expression.id50012"
operation="eq" value="itbaims"/>
        </rule>
      </rsc_location>
      <rsc_location id="r_stonith-itbaims_loves_dtbaims"
rsc="r_stonith-itbaims">
        <rule id="r_stonith-itbaims_loves_dtbaims_rule" score="INFINITY">
          <expression attribute="#uname" id="expression.id50014"
operation="eq" value="dtbaims"/>
        </rule>
      </rsc_location>




-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Alex Strachan
Sent: Wednesday, 29 October 2008 3:26 PM
To: 'General Linux-HA mailing list'
Subject: [Linux-HA] Stonith, 2 node cluster - on loss of power to
primary node; failover to secondary didn't happen.

Finally configured Stonith for an HA cluster - believe me, doing this made me
happy!

 

Versions - heartbeat 2.99.1, pacemaker 1.0, redhat 4 x86_64

 

I have two nodes, dtbaims, itbaims. Stonith device ibmrsa-telnet is being
used; failover is fine when doing a reset via the RSA card.  Complete loss
of power seems to be an issue.  The RSA card is powered via the host.

 

Status -    dtbaims is primary (DRBD) and running all of the resources.

            itbaims is secondary

 

Status before power loss:

============

Last updated: Wed Oct 29 14:43:15 2008

Current DC: dtbaims (4f1614ac-d465-49db-b847-bac60f9dac6c)

2 Nodes configured.

3 Resources configured.

============

 

Node: dtbaims (4f1614ac-d465-49db-b847-bac60f9dac6c): online

Node: itbaims (96595e56-e3db-42da-b13b-1e2d3a956529): online

 

Full list of resources:

 

Resource Group: group_its

    resource_its_drbd   (heartbeat:its_drbddisk):       Started dtbaims

    resource_its_fs     (ocf::heartbeat:its_Filesystem):        Started
dtbaims

    resource_its_vip    (ocf::heartbeat:IPaddr):        Started dtbaims

    resource_its_oracle (ocf::heartbeat:its_oracle):    Started dtbaims

    resource_its_oralsnr        (ocf::heartbeat:its_oralsnr):   Started
dtbaims

    resource_its_aims   (lsb:its_aims): Started dtbaims

    resource_its_apache (ocf::heartbeat:its_apache):    Started dtbaims

    resource_its_smb    (lsb:its_smb):  Started dtbaims

    resource_its_dhcpd  (lsb:its_dhcpd):        Started dtbaims

r_stonith-dtbaims       (stonith:external/ibmrsa-telnet):       Started
itbaims

r_stonith-itbaims       (stonith:external/ibmrsa-telnet):       Started
dtbaims

 

Migration summary::

* Node itbaims:

* Node dtbaims:

 

Status after power loss (on the secondary host):

My expectation was that the DC would be transferred to itbaims (this
happened) and the resources would start on itbaims (this did not).  It looks
like HA is waiting on the Stonith action to complete.

 

[EMAIL PROTECTED] ~]# /etc/init.d/drbd status

drbd driver loaded OK; device status:

version: 8.2.6 (api:88/proto:86-88)

GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
[EMAIL PROTECTED], 2008-06-04 16:15:48

m:res  cs            st                 ds                 p  mounted
fstype

0:r0   WFConnection  Secondary/Unknown  UpToDate/DUnknown  C

 

 

============

Last updated: Wed Oct 29 15:11:48 2008

Current DC: itbaims (96595e56-e3db-42da-b13b-1e2d3a956529)

2 Nodes configured.

3 Resources configured.

============

 

Node: dtbaims (4f1614ac-d465-49db-b847-bac60f9dac6c): OFFLINE

Node: itbaims (96595e56-e3db-42da-b13b-1e2d3a956529): online

 

Full list of resources:

 

Resource Group: group_its

    resource_its_drbd   (heartbeat:its_drbddisk):       Started dtbaims

    resource_its_fs     (ocf::heartbeat:its_Filesystem):        Started
dtbaims

    resource_its_vip    (ocf::heartbeat:IPaddr):        Started dtbaims

    resource_its_oracle (ocf::heartbeat:its_oracle):    Started dtbaims

    resource_its_oralsnr        (ocf::heartbeat:its_oralsnr):   Started
dtbaims

    resource_its_aims   (lsb:its_aims): Started dtbaims

    resource_its_apache (ocf::heartbeat:its_apache):    Started dtbaims

    resource_its_smb    (lsb:its_smb):  Started dtbaims

    resource_its_dhcpd  (lsb:its_dhcpd):        Started dtbaims

r_stonith-dtbaims       (stonith:external/ibmrsa-telnet):       Started
itbaims FAILED

r_stonith-itbaims       (stonith:external/ibmrsa-telnet):       Started
dtbaims

 

Migration summary::

* Node itbaims:

   r_stonith-dtbaims: migration-threshold=0 fail-count=1000000

 

Failed actions:

    r_stonith-dtbaims_monitor_60000 (node=itbaims, call=14, rc=14): complete

    r_stonith-dtbaims_start_0 (node=itbaims, call=17, rc=1): complete

 

 

The HA cluster doesn't start the resources until power is restored to the
ex-primary host.

 

Running crm_verify -L -V just shows lots of:

crm_verify[31645]: 2008/10/29_15:19:47 notice: NoRoleChange: Move resource
resource_its_dhcpd   (Started dtbaims -> itbaims)

crm_verify[31645]: 2008/10/29_15:19:47 notice: StopRsc:   dtbaims       Stop
resource_its_dhcpd

crm_verify[31645]: 2008/10/29_15:19:47 notice: StartRsc:  itbaims
Start resource_its_dhcpd

crm_verify[31645]: 2008/10/29_15:19:47 notice: RecurringOp:  Start recurring
monitor (360s) for resource_its_dhcpd on itbaims

crm_verify[31645]: 2008/10/29_15:19:47 info: native_stop_constraints:
r_stonith-itbaims_stop_0 is implicit after dtbaims is fenced

 

It looks like it wants to start the resources but is waiting for the failed
op to be cleared.
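For what it's worth, the fail-count of 1000000 on r_stonith-dtbaims suggests the failed start has to be cleaned up before the policy engine will retry anything. A hedged sketch of a manual cleanup (assuming the pacemaker 1.0 CLI tools; the resource and node names are taken from the output above, and exact option spellings can differ between versions):

```shell
# Clean up the failed op record for r_stonith-dtbaims on itbaims,
# then reset its fail-count so the PE stops treating it as unrunnable.
crm_resource -C -r r_stonith-dtbaims -H itbaims
crm_failcount -D -U itbaims -r r_stonith-dtbaims
```

This only helps once an operator intervenes, of course - it does not by itself make the failover automatic after a power loss.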

 

 

What can I do to ensure that failover occurs in the event of a complete
power loss to the primary host?

 

 

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

