Hi,
On Thu, Nov 20, 2008 at 11:57:51AM +0100, Luis Motta Campos wrote:
> Fellows
>
> I have a cluster with two nodes, (db-sql1, db-sql3) and am trying to
> configure STONITH over IPMI between them. I got my IPMI working fine, and
> successfully tested to see if I can reboot the hosts using the same command
> that is used in the driver's implementation (/usr/bin/ipmitool -I lan -H
> ${ipaddr} -U ${userid} -P ${passwd} power reset).
>
> I wrote the following XML configuration, following the DTD shipped
> with my version of the heartbeat software (CentOS, 2.1.3-21.1, with kernel
> 2.6.18-53.1.14.el5):
>
> <primitive id="db-sql1-shooter" class="stonith" type="external/ipmi"
> provider="heartbeat">
There are no providers for class stonith. Just drop that
attribute.
> <operations>
> <op id="op-sql1-shooter-stop" name="stop" timeout="60s"/>
> <op id="op-sql1-shooter-start" name="start" timeout="30s"/>
> <op id="op-sql1-shooter-monitor" name="monitor" timeout="5s"
> interval="10s"/>
The monitor timeout and the interval are both way too short. How
likely is it that your stonith device fails within 10 seconds just
when it's actually required to reset a node? Also, the start
timeout should equal the monitor timeout.
> </operations>
> <instance_attributes id="df585416-074e-4955-a431-862f529e5b0b">
> <attributes>
> <nvpair name="hostname" value="db-sql1"
> id="ee671a7b-9322-487e-8a19-689c85b0df65"/>
> <nvpair name="ipaddr" value="db-sql1-ipmi"
> id="2931750a-4304-48df-ac80-eb65619f1d33"/>
> <nvpair name="userid" value="guess-who"
> id="bbd8a701-b5aa-4a93-8c2e-7a838d5d3558"/>
> <nvpair name="passwd" value="keep-trying"
> id="3c6ce903-6e7f-45e7-86cb-b6c63b92ddb0"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="db-sql3-shooter" class="stonith" type="external/ipmi"
> provider="heartbeat">
> <operations>
> <op id="op-sql3-shooter-stop" name="stop" timeout="60s"/>
> <op id="op-sql3-shooter-start" name="start" timeout="30s"/>
> <op id="op-sql3-shooter-monitor" name="monitor" timeout="5s"
> interval="10s"/>
> </operations>
> <instance_attributes id="2a0dba3a-f121-4695-9ed2-5c3407d2d38f">
> <attributes>
> <nvpair name="hostname" value="db-sql3"
> id="8a57bbb6-06f0-4916-a2fe-1ef54cba637f"/>
> <nvpair name="ipaddr" value="db-sql3-ipmi"
> id="9b544cea-8da5-4945-a30c-84320536463f"/>
> <nvpair name="userid" value="guess-who"
> id="bf1cb9f4-5af5-4cd3-906f-239d87b04e66"/>
> <nvpair name="passwd" value="keep-trying"
> id="0e784680-871e-4c55-bec5-9f534620c032"/>
> </attributes>
> </instance_attributes>
> </primitive>
> </resources>
>
> Also, under Lars' recommendation, I added two constraints to my constraint set
> to prevent silly warnings during the start up of the stonith resources:
>
> <rsc_location id="db-sql1-shooter-run-on-sql3" description="SQL1 shooter
> must be at SQL3" rsc="db-sql1-shooter" node="db-sql3" score="+INFINITY"/>
> <rsc_location id="db-sql3-shooter-run-on-sql1" description="SQL3 shooter
> must be at SQL1" rsc="db-sql3-shooter" node="db-sql1" score="+INFINITY"/>
>
> As the final result, crm_mon sees my cluster like this:
>
> ============
> Last updated: Thu Nov 20 11:53:52 2008
> Current DC: db-sql3.ripe.net (bdee5d1b-405a-4630-9836-66e8758e81f1)
> 2 Nodes configured.
> 4 Resources configured.
> ============
>
> Node: db-sql3.ripe.net (bdee5d1b-405a-4630-9836-66e8758e81f1): online
> Node: db-sql1.ripe.net (46818264-663c-43dd-b5e4-7b7cd7f85022): online
>
> Master/Slave Set: database-disk
> database-storage-drbd:0 (heartbeat::ocf:drbd): Master
> db-sql3.ripe.net
> database-storage-drbd:1 (heartbeat::ocf:drbd): Started
> db-sql1.ripe.net
> Resource Group: db-cluster-service
> database-filesystem (heartbeat::ocf:Filesystem): Started
> db-sql3.ripe.net
> database-ip (heartbeat::ocf:IPaddr): Started db-sql3.ripe.net
> database-server (heartbeat::ocf:mysql): Started db-sql3.ripe.net
> db-sql1-shooter (stonith:external/ipmi): Started db-sql3.ripe.net
> db-sql3-shooter (stonith:external/ipmi): Started db-sql1.ripe.net
>
> Failed actions:
> db-sql1-shooter_start_0 (node=db-sql1.ripe.net, call=14, rc=1):
> complete
>
> and crm_verify -VL points me to several warnings and an error that I am
> unable to interpret correctly:
>
> crm_verify[29480]: 2008/11/20_11:55:03 ERROR: unpack_rsc_op: Remapping
> db-sql1-shooter_start_0 (rc=1) on db-sql1.ripe.net to an ERROR
> crm_verify[29480]: 2008/11/20_11:55:03 WARN: unpack_rsc_op: Processing
> failed op db-sql1-shooter_start_0 on db-sql1.ripe.net: Error
> crm_verify[29480]: 2008/11/20_11:55:03 WARN: unpack_rsc_op: Compatability
> handling for failed op db-sql1-shooter_start_0 on db-sql1.ripe.net
The stonith resource failed to start. Your configuration looks
OK, apart from the far too tight timeouts. Did you try running
the stonith program by hand with your device and this
configuration?
stonith -d -t external/ipmi ...
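
For reference, here is a rough sketch of how the first primitive might
look with the provider attribute dropped and the timeouts relaxed as
suggested above. The 60s and 3600s values are only illustrative
assumptions, not numbers tested against your device; pick whatever
suits your IPMI board. The XML comments are annotations for reading
and can be left out when loading this into the CIB.

```xml
<!-- Sketch only: timeout and interval values are assumed, not tested. -->
<primitive id="db-sql1-shooter" class="stonith" type="external/ipmi">
  <!-- No provider attribute: class stonith has no providers. -->
  <operations>
    <op id="op-sql1-shooter-stop" name="stop" timeout="60s"/>
    <!-- Start timeout set equal to the monitor timeout. -->
    <op id="op-sql1-shooter-start" name="start" timeout="60s"/>
    <!-- Longer timeout and a much longer interval than 5s/10s. -->
    <op id="op-sql1-shooter-monitor" name="monitor" timeout="60s"
        interval="3600s"/>
  </operations>
  <instance_attributes id="df585416-074e-4955-a431-862f529e5b0b">
    <attributes>
      <nvpair name="hostname" value="db-sql1"
              id="ee671a7b-9322-487e-8a19-689c85b0df65"/>
      <nvpair name="ipaddr" value="db-sql1-ipmi"
              id="2931750a-4304-48df-ac80-eb65619f1d33"/>
      <nvpair name="userid" value="guess-who"
              id="bbd8a701-b5aa-4a93-8c2e-7a838d5d3558"/>
      <nvpair name="passwd" value="keep-trying"
              id="3c6ce903-6e7f-45e7-86cb-b6c63b92ddb0"/>
    </attributes>
  </instance_attributes>
</primitive>
```

The same changes would apply to db-sql3-shooter.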
Thanks,
Dejan
> I am stuck, and need help. Don't know how to diagnose this, or which part
> of the source code to read to find out what's going on. Any suggestions,
> pointers, tips, tricks, or support will be highly appreciated.
>
> Many thanks in advance.
> Kind regards
> --
> Luis Motta Campos is a software engineer,
> Perl Programmer, foodie and photographer.
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems