Re: [Linux-HA] Stonith, 2 node cluster - on loss of power to primary node; failure to secondary didn't happen.

Rolf Schmidt Wed, 29 Oct 2008 01:35:44 -0700

Hi Alex,

This is a double failure. The host and its STONITH device share the
same power source, so when that source goes away both are gone at
once. With two elements of your cluster failing together you are ...
well, in trouble ...


I had the same issue with iLO. STONITH on the surviving node does not
succeed in killing the other node, so the surviving node does not
start any resources.

You can try a disk-based STONITH instead if you have a shared block
device. I haven't done it myself yet.
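
To make the reasoning above concrete, here is a toy model in Python. It is
only a sketch and not how Pacemaker is implemented: the names FenceDevice,
fence and may_take_over are made up for illustration, and the "independent"
device simply stands in for any fencing method that does not lose power
together with the host (for example a disk-based one, which I have not
tested).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FenceDevice:
    """Hypothetical stand-in for a STONITH device (not a Pacemaker type)."""
    name: str
    shares_power_with_target: bool


def fence(device: FenceDevice, target_has_power: bool) -> bool:
    """Return True only if this device can confirm the target is down."""
    if device.shares_power_with_target and not target_has_power:
        # The device lost power together with the host, so it cannot
        # answer and the fencing attempt fails.
        return False
    return True


def may_take_over(devices: Iterable[FenceDevice], target_has_power: bool) -> bool:
    """The survivor starts resources only after a confirmed fence."""
    return any(fence(d, target_has_power) for d in devices)


# The RSA card from the original report is powered via the host.
rsa = FenceDevice("ibmrsa-telnet", shares_power_with_target=True)
# Assumption: some second device with its own power path.
independent = FenceDevice("independent-device", shares_power_with_target=False)

print(may_take_over([rsa], target_has_power=True))
print(may_take_over([rsa], target_has_power=False))
print(may_take_over([rsa, independent], target_has_power=False))
```

With only the RSA card, a reset of a powered host can be confirmed, but a
complete power loss cannot, which matches what Alex sees. Adding a device
that does not depend on the host's power is what lets the survivor proceed.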



Rolf


On Wed, 29 Oct 2008, Alex Strachan wrote:

I finally configured Stonith for an HA cluster - believe me, doing this
made me happy!



Versions - heartbeat 2.99.1, pacemaker 1.0, redhat 4 x86_64



I have two nodes, dtbaims, itbaims. Stonith device ibmrsa-telnet is being
used; failover is fine when doing a reset via the RSA card.  Complete loss
of power seems to be an issue.  The RSA card is powered via the host.



Status -    dtbaims is primary (DRBD) and running all of the resources.

           itbaims is secondary



Status before power loss..

============
Last updated: Wed Oct 29 14:43:15 2008
Current DC: dtbaims (4f1614ac-d465-49db-b847-bac60f9dac6c)
2 Nodes configured.
3 Resources configured.
============

Node: dtbaims (4f1614ac-d465-49db-b847-bac60f9dac6c): online
Node: itbaims (96595e56-e3db-42da-b13b-1e2d3a956529): online

Full list of resources:

Resource Group: group_its
   resource_its_drbd   (heartbeat:its_drbddisk):       Started dtbaims
   resource_its_fs     (ocf::heartbeat:its_Filesystem):        Started dtbaims
   resource_its_vip    (ocf::heartbeat:IPaddr):        Started dtbaims
   resource_its_oracle (ocf::heartbeat:its_oracle):    Started dtbaims
   resource_its_oralsnr        (ocf::heartbeat:its_oralsnr):   Started dtbaims
   resource_its_aims   (lsb:its_aims): Started dtbaims
   resource_its_apache (ocf::heartbeat:its_apache):    Started dtbaims
   resource_its_smb    (lsb:its_smb):  Started dtbaims
   resource_its_dhcpd  (lsb:its_dhcpd):        Started dtbaims
r_stonith-dtbaims       (stonith:external/ibmrsa-telnet):       Started itbaims
r_stonith-itbaims       (stonith:external/ibmrsa-telnet):       Started dtbaims

Migration summary::
* Node itbaims:
* Node dtbaims:



Status after power loss (on the secondary host):

My expectation was that the DC role would be transferred to itbaims (this
was done) and that resources would start on itbaims (not done?). It looks
like HA is waiting for the Stonith action to complete.



[EMAIL PROTECTED] ~]# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by [EMAIL PROTECTED], 2008-06-04 16:15:48
m:res  cs            st                 ds                 p  mounted  fstype
0:r0   WFConnection  Secondary/Unknown  UpToDate/DUnknown  C





============
Last updated: Wed Oct 29 15:11:48 2008
Current DC: itbaims (96595e56-e3db-42da-b13b-1e2d3a956529)
2 Nodes configured.
3 Resources configured.
============

Node: dtbaims (4f1614ac-d465-49db-b847-bac60f9dac6c): OFFLINE
Node: itbaims (96595e56-e3db-42da-b13b-1e2d3a956529): online

Full list of resources:

Resource Group: group_its
   resource_its_drbd   (heartbeat:its_drbddisk):       Started dtbaims
   resource_its_fs     (ocf::heartbeat:its_Filesystem):        Started dtbaims
   resource_its_vip    (ocf::heartbeat:IPaddr):        Started dtbaims
   resource_its_oracle (ocf::heartbeat:its_oracle):    Started dtbaims
   resource_its_oralsnr        (ocf::heartbeat:its_oralsnr):   Started dtbaims
   resource_its_aims   (lsb:its_aims): Started dtbaims
   resource_its_apache (ocf::heartbeat:its_apache):    Started dtbaims
   resource_its_smb    (lsb:its_smb):  Started dtbaims
   resource_its_dhcpd  (lsb:its_dhcpd):        Started dtbaims
r_stonith-dtbaims       (stonith:external/ibmrsa-telnet):       Started itbaims FAILED
r_stonith-itbaims       (stonith:external/ibmrsa-telnet):       Started dtbaims

Migration summary::
* Node itbaims:
  r_stonith-dtbaims: migration-threshold=0 fail-count=1000000

Failed actions:
   r_stonith-dtbaims_monitor_60000 (node=itbaims, call=14, rc=14): complete
   r_stonith-dtbaims_start_0 (node=itbaims, call=17, rc=1): complete





The HA cluster doesn't start the resources until power is restored to the
primary host.



Running crm_verify -L -V just shows lots of lines like these:

crm_verify[31645]: 2008/10/29_15:19:47 notice: NoRoleChange: Move resource resource_its_dhcpd   (Started dtbaims -> itbaims)
crm_verify[31645]: 2008/10/29_15:19:47 notice: StopRsc:   dtbaims       Stop resource_its_dhcpd
crm_verify[31645]: 2008/10/29_15:19:47 notice: StartRsc:  itbaims Start resource_its_dhcpd
crm_verify[31645]: 2008/10/29_15:19:47 notice: RecurringOp:  Start recurring monitor (360s) for resource_its_dhcpd on itbaims
crm_verify[31645]: 2008/10/29_15:19:47 info: native_stop_constraints: r_stonith-itbaims_stop_0 is implicit after dtbaims is fenced



It looks like it wants to start the resources but is waiting to clear the
failed op.





What can I do to ensure that the failover occurs in the event of a complete
power loss to the primary host?





_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Kind regards,

  Rolf Schmidt

--
SUSE LINUX GmbH             -o)   Tel: +49-(0)911-740 53 380
Maxfeldstr. 5               /\\   Fax: +49-(0)911-740 53 679
90409 Nuernberg, Germany   _\_v

SUSE LINUX GmbH, GF: Volker Smid, HRB 21284 (AG Nürnberg)

PLEASE NOTE:  This e-mail may contain confidential and privileged
material for the sole use of the intended recipient. Any review,
distribution or other use by anyone else is strictly prohibited. If
you are not an intended recipient, please contact the sender and
delete all copies. Thank you.
