Fellows
I have a cluster with two nodes (db-sql1 and db-sql3) and am trying to
configure STONITH over IPMI between them. IPMI itself is working fine:
I successfully tested that I can reboot the hosts using the same
command that the driver's implementation uses (/usr/bin/ipmitool
-I lan -H ${ipaddr} -U ${userid} -P ${passwd} power reset).
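For reference, the manual check can be sketched as a small shell script. This is only a minimal sketch: the values mirror the placeholder parameters of the db-sql1-shooter resource, the ipmi_cmd helper and the RUN variable are hypothetical names introduced here for illustration, and it uses "power status" instead of "power reset" so that it can be run without rebooting anything.

```shell
#!/bin/sh
# Placeholder values, mirroring the db-sql1-shooter resource parameters.
ipaddr="db-sql1-ipmi"
userid="guess-who"
passwd="keep-trying"

# ipmi_cmd (hypothetical helper): print the ipmitool command line for a
# given power action, e.g. "status" or "reset".
ipmi_cmd() {
    printf '%s\n' "/usr/bin/ipmitool -I lan -H ${ipaddr} -U ${userid} -P ${passwd} power $1"
}

# By default only print the command; set RUN=1 (hypothetical switch) to
# really query the BMC. "power status" is non-destructive.
if [ "${RUN:-0}" = "1" ]; then
    /usr/bin/ipmitool -I lan -H "$ipaddr" -U "$userid" -P "$passwd" power status
else
    ipmi_cmd status
fi
```

Replacing "status" with "reset" gives the command I used for the actual reboot test.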
I wrote the following XML configuration, following the DTD
shipped with my version of the heartbeat software (CentOS, 2.1.3-21.1,
with kernel 2.6.18-53.1.14.el5):
<primitive id="db-sql1-shooter" class="stonith" type="external/ipmi"
provider="heartbeat">
<operations>
<op id="op-sql1-shooter-stop" name="stop" timeout="60s"/>
<op id="op-sql1-shooter-start" name="start" timeout="30s"/>
<op id="op-sql1-shooter-monitor" name="monitor" timeout="5s"
interval="10s"/>
</operations>
<instance_attributes id="df585416-074e-4955-a431-862f529e5b0b">
<attributes>
<nvpair name="hostname" value="db-sql1"
id="ee671a7b-9322-487e-8a19-689c85b0df65"/>
<nvpair name="ipaddr" value="db-sql1-ipmi"
id="2931750a-4304-48df-ac80-eb65619f1d33"/>
<nvpair name="userid" value="guess-who"
id="bbd8a701-b5aa-4a93-8c2e-7a838d5d3558"/>
<nvpair name="passwd" value="keep-trying"
id="3c6ce903-6e7f-45e7-86cb-b6c63b92ddb0"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="db-sql3-shooter" class="stonith" type="external/ipmi"
provider="heartbeat">
<operations>
<op id="op-sql3-shooter-stop" name="stop" timeout="60s"/>
<op id="op-sql3-shooter-start" name="start" timeout="30s"/>
<op id="op-sql3-shooter-monitor" name="monitor" timeout="5s"
interval="10s"/>
</operations>
<instance_attributes id="2a0dba3a-f121-4695-9ed2-5c3407d2d38f">
<attributes>
<nvpair name="hostname" value="db-sql3"
id="8a57bbb6-06f0-4916-a2fe-1ef54cba637f"/>
<nvpair name="ipaddr" value="db-sql3-ipmi"
id="9b544cea-8da5-4945-a30c-84320536463f"/>
<nvpair name="userid" value="guess-who"
id="bf1cb9f4-5af5-4cd3-906f-239d87b04e66"/>
<nvpair name="passwd" value="keep-trying"
id="0e784680-871e-4c55-bec5-9f534620c032"/>
</attributes>
</instance_attributes>
</primitive>
</resources>
Also, on Lars' recommendation, I added two constraints to my constraint
set to prevent silly warnings during the startup of the stonith resources:
<rsc_location id="db-sql1-shooter-run-on-sql3"
description="SQL1 shooter must be at SQL3"
rsc="db-sql1-shooter" node="db-sql3" score="+INFINITY"/>
<rsc_location id="db-sql3-shooter-run-on-sql1"
description="SQL3 shooter must be at SQL1"
rsc="db-sql3-shooter" node="db-sql1" score="+INFINITY"/>
As the final result, crm_mon sees my cluster like this:
============
Last updated: Thu Nov 20 11:53:52 2008
Current DC: db-sql3.ripe.net (bdee5d1b-405a-4630-9836-66e8758e81f1)
2 Nodes configured.
4 Resources configured.
============
Node: db-sql3.ripe.net (bdee5d1b-405a-4630-9836-66e8758e81f1): online
Node: db-sql1.ripe.net (46818264-663c-43dd-b5e4-7b7cd7f85022): online
Master/Slave Set: database-disk
    database-storage-drbd:0 (heartbeat::ocf:drbd): Master db-sql3.ripe.net
    database-storage-drbd:1 (heartbeat::ocf:drbd): Started db-sql1.ripe.net
Resource Group: db-cluster-service
    database-filesystem (heartbeat::ocf:Filesystem): Started db-sql3.ripe.net
    database-ip (heartbeat::ocf:IPaddr): Started db-sql3.ripe.net
    database-server (heartbeat::ocf:mysql): Started db-sql3.ripe.net
db-sql1-shooter (stonith:external/ipmi): Started db-sql3.ripe.net
db-sql3-shooter (stonith:external/ipmi): Started db-sql1.ripe.net
Failed actions:
    db-sql1-shooter_start_0 (node=db-sql1.ripe.net, call=14, rc=1): complete
and crm_verify -VL reports several warnings and an error that I am
unable to interpret correctly:
crm_verify[29480]: 2008/11/20_11:55:03 ERROR: unpack_rsc_op: Remapping db-sql1-shooter_start_0 (rc=1) on db-sql1.ripe.net to an ERROR
crm_verify[29480]: 2008/11/20_11:55:03 WARN: unpack_rsc_op: Processing failed op db-sql1-shooter_start_0 on db-sql1.ripe.net: Error
crm_verify[29480]: 2008/11/20_11:55:03 WARN: unpack_rsc_op: Compatability handling for failed op db-sql1-shooter_start_0 on db-sql1.ripe.net
I am stuck and need help. I don't know how to diagnose this, or which
part of the source code to read to find out what is going on. Any
suggestions, pointers, tips, tricks, or support will be highly appreciated.
Many thanks in advance.
Kind regards
--
Luis Motta Campos is a software engineer,
Perl Programmer, foodie and photographer.