Hi Alex,
This is a double failure. The host and its STONITH device are both gone,
because they share the same power source. Once that source is lost,
two elements of your cluster have failed at the same time, and then you
are, well, in trouble.
I had the same issue with iLO. The surviving node cannot complete the
STONITH operation against the other node, so it does not start any
resources.
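To illustrate why the surviving node just sits there, here is a minimal
sketch in Python. It is a toy model, not Pacemaker code: the Node class and
the fence() and recover_resources() functions are hypothetical names made up
for this example, and the only assumption it encodes is the rule described
above, that resources are taken over only after fencing is confirmed.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    host_powered: bool
    # True when the management card (RSA, iLO, ...) is fed by the host's own power
    fence_device_shares_power: bool


def fence(target: Node) -> bool:
    """Return True only if the fence device can be reached to confirm the kill."""
    # A card that shares the host's power supply dies together with the host.
    return target.host_powered or not target.fence_device_shares_power


def recover_resources(failed: Node, survivor: Node) -> str:
    """Toy version of the rule: no confirmed fencing, no resource takeover."""
    if fence(failed):
        return f"resources started on {survivor.name}"
    return f"recovery blocked on {survivor.name}: fencing of {failed.name} unconfirmed"


if __name__ == "__main__":
    dtbaims = Node("dtbaims", host_powered=False, fence_device_shares_power=True)
    itbaims = Node("itbaims", host_powered=True, fence_device_shares_power=True)
    print(recover_resources(dtbaims, itbaims))
```

With the card on an independent power feed (fence_device_shares_power=False)
the same call reports that the resources were started, which is the point of
looking for a fencing method that does not die with the host.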
You can try a disk-based STONITH instead if you have a shared
block device. I haven't done that myself yet.
Rolf
On Wed, 29 Oct 2008, Alex Strachan wrote:
I finally configured Stonith for an HA cluster. Believe me, doing this made
me happy!
Versions - heartbeat 2.99.1, pacemaker 1.0, redhat 4 x86_64
I have two nodes, dtbaims, itbaims. Stonith device ibmrsa-telnet is being
used; failover is fine when doing a reset via the RSA card. Complete loss
of power seems to be an issue. The RSA card is powered via the host.
Status - dtbaims is primary (DRBD) and running all of the resources;
itbaims is secondary.
Status before power loss:
============
Last updated: Wed Oct 29 14:43:15 2008
Current DC: dtbaims (4f1614ac-d465-49db-b847-bac60f9dac6c)
2 Nodes configured.
3 Resources configured.
============
Node: dtbaims (4f1614ac-d465-49db-b847-bac60f9dac6c): online
Node: itbaims (96595e56-e3db-42da-b13b-1e2d3a956529): online
Full list of resources:
Resource Group: group_its
    resource_its_drbd (heartbeat:its_drbddisk): Started dtbaims
    resource_its_fs (ocf::heartbeat:its_Filesystem): Started dtbaims
    resource_its_vip (ocf::heartbeat:IPaddr): Started dtbaims
    resource_its_oracle (ocf::heartbeat:its_oracle): Started dtbaims
    resource_its_oralsnr (ocf::heartbeat:its_oralsnr): Started dtbaims
    resource_its_aims (lsb:its_aims): Started dtbaims
    resource_its_apache (ocf::heartbeat:its_apache): Started dtbaims
    resource_its_smb (lsb:its_smb): Started dtbaims
    resource_its_dhcpd (lsb:its_dhcpd): Started dtbaims
r_stonith-dtbaims (stonith:external/ibmrsa-telnet): Started itbaims
r_stonith-itbaims (stonith:external/ibmrsa-telnet): Started dtbaims
Migration summary::
* Node itbaims:
* Node dtbaims:
Status after power loss (on the secondary host):
My expectation was that the DC role would be transferred to itbaims (this
happened) and that the resources would start on itbaims (this did not happen).
It looks like HA is waiting for the Stonith action to complete.
[EMAIL PROTECTED] ~]# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
[EMAIL PROTECTED], 2008-06-04 16:15:48
m:res  cs            st                 ds                 p  mounted  fstype
0:r0   WFConnection  Secondary/Unknown  UpToDate/DUnknown  C
============
Last updated: Wed Oct 29 15:11:48 2008
Current DC: itbaims (96595e56-e3db-42da-b13b-1e2d3a956529)
2 Nodes configured.
3 Resources configured.
============
Node: dtbaims (4f1614ac-d465-49db-b847-bac60f9dac6c): OFFLINE
Node: itbaims (96595e56-e3db-42da-b13b-1e2d3a956529): online
Full list of resources:
Resource Group: group_its
    resource_its_drbd (heartbeat:its_drbddisk): Started dtbaims
    resource_its_fs (ocf::heartbeat:its_Filesystem): Started dtbaims
    resource_its_vip (ocf::heartbeat:IPaddr): Started dtbaims
    resource_its_oracle (ocf::heartbeat:its_oracle): Started dtbaims
    resource_its_oralsnr (ocf::heartbeat:its_oralsnr): Started dtbaims
    resource_its_aims (lsb:its_aims): Started dtbaims
    resource_its_apache (ocf::heartbeat:its_apache): Started dtbaims
    resource_its_smb (lsb:its_smb): Started dtbaims
    resource_its_dhcpd (lsb:its_dhcpd): Started dtbaims
r_stonith-dtbaims (stonith:external/ibmrsa-telnet): Started itbaims FAILED
r_stonith-itbaims (stonith:external/ibmrsa-telnet): Started dtbaims
Migration summary::
* Node itbaims:
r_stonith-dtbaims: migration-threshold=0 fail-count=1000000
Failed actions:
r_stonith-dtbaims_monitor_60000 (node=itbaims, call=14, rc=14): complete
r_stonith-dtbaims_start_0 (node=itbaims, call=17, rc=1): complete
The HA cluster doesn't start the resources until power is restored to the
primary host.
Running crm_verify -L -V just shows lots of
crm_verify[31645]: 2008/10/29_15:19:47 notice: NoRoleChange: Move resource resource_its_dhcpd (Started dtbaims -> itbaims)
crm_verify[31645]: 2008/10/29_15:19:47 notice: StopRsc: dtbaims Stop resource_its_dhcpd
crm_verify[31645]: 2008/10/29_15:19:47 notice: StartRsc: itbaims Start resource_its_dhcpd
crm_verify[31645]: 2008/10/29_15:19:47 notice: RecurringOp: Start recurring monitor (360s) for resource_its_dhcpd on itbaims
crm_verify[31645]: 2008/10/29_15:19:47 info: native_stop_constraints: r_stonith-itbaims_stop_0 is implicit after dtbaims is fenced
It looks like it wants to start the resources but is waiting for the
failed op to be cleared.
What can I do to ensure that failover occurs in the event of a complete
power loss to the primary host?
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Kind regards,
Rolf Schmidt
--
SUSE LINUX GmbH -o) Tel: +49-(0)911-740 53 380
Maxfeldstr. 5 /\\ Fax: +49-(0)911-740 53 679
90409 Nuernberg, Germany _\_v
SUSE LINUX GmbH, GF: Volker Smid, HRB 21284 (AG Nürnberg)