*Thanks again, Jakob.
I see this in crm_mon:*
Resource Group: web_cluster
failover-ip (ocf::heartbeat:IPaddr): Started
apauat1b.intranet.aeroplan.com
failover-apache (lsb:httpd): Stopped
Failed actions:
failover-apache_monitor_15000 (node=apauat1b.intranet.aeroplan.com,
call=2566, rc=7, status=complete): not running
*Note that with each failed attempt to start httpd, the call figure above
increments by 1, so it appears to be a counter of some kind.*
*And the logs show this:*
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: notice:
run_graph: Transition 2528 (Complete=8, Pending=0, Fired=0, Skipped=0,
Incomplete=0, Source=/usr/var/lib/pengine/pe-input-3384.bz2): Complete
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
te_graph_trigger: Transition 2528 is now complete
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com attrd: [13458]: info:
attrd_local_callback: Expanded fail-count-failover-apache=value++ to 1280
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com attrd: [13458]: info:
attrd_trigger_update: Sending flush op to all hosts for:
fail-count-failover-apache (1280)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
do_state_transition: State transition S_TRANSITION_ENGINE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
do_state_transition: All 2 cluster nodes are eligible to run resources.
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com attrd: [13458]: info:
attrd_perform_update: Sent update 3175: fail-count-failover-apache=1280
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
do_pe_invoke: Query 6702: Requesting the current CIB: S_POLICY_ENGINE
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
abort_transition_graph: te_update_diff:146 - Triggered transition abort
(complete=1, tag=transient_attributes,
id=86b5c3f4-8202-45f7-91a8-64e17163bb7a, magic=NA, cib=0.9.5422) :
Transient attribute: update
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
do_pe_invoke_callback: Invoking the PE: query=6702,
ref=pe_calc-dc-1270669810-6395, seq=2, quorate=1
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
do_pe_invoke: Query 6703: Requesting the current CIB: S_POLICY_ENGINE
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
do_pe_invoke_callback: Invoking the PE: query=6703,
ref=pe_calc-dc-1270669810-6396, seq=2, quorate=1
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
determine_online_status: Node apauat1b.intranet.mydomain.com is online
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: WARN:
unpack_rsc_op: Processing failed op failover-apache_monitor_15000 on
apauat1b.intranet.mydomain.com: not running (7)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
determine_online_status: Node apauat1a.intranet.mydomain.com is online
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
group_print: Resource Group: web_cluster
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
native_print: failover-ip (ocf::heartbeat:IPaddr): Started
apauat1b.intranet.mydomain.com
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
native_print: failover-apache (lsb:httpd): Started
apauat1b.intranet.mydomain.com FAILED
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
RecurringOp: Start recurring monitor (15s) for failover-apache on
apauat1b.intranet.mydomain.com
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
LogActions: Leave resource failover-ip (Started
apauat1b.intranet.mydomain.com)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
LogActions: Recover resource failover-apache (Started
apauat1b.intranet.mydomain.com)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
handle_response: pe_calc calculation pe_calc-dc-1270669810-6395 is obsolete
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
process_pe_message: Transition 2529: PEngine Input stored in:
/usr/var/lib/pengine/pe-input-3385.bz2
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
determine_online_status: Node apauat1b.intranet.mydomain.com is online
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: WARN:
unpack_rsc_op: Processing failed op failover-apache_monitor_15000 on
apauat1b.intranet.mydomain.com: not running (7)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
determine_online_status: Node apauat1a.intranet.mydomain.com is online
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
group_print: Resource Group: web_cluster
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
native_print: failover-ip (ocf::heartbeat:IPaddr): Started
apauat1b.intranet.mydomain.com
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
native_print: failover-apache (lsb:httpd): Started
apauat1b.intranet.mydomain.com FAILED
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
RecurringOp: Start recurring monitor (15s) for failover-apache on
apauat1b.intranet.mydomain.com
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
LogActions: Leave resource failover-ip (Started
apauat1b.intranet.mydomain.com)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
LogActions: Recover resource failover-apache (Started
apauat1b.intranet.mydomain.com)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
process_pe_message: Transition 2530: PEngine Input stored in:
/usr/var/lib/pengine/pe-input-3386.bz2
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
do_state_transition: State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
unpack_graph: Unpacked transition 2530: 8 actions in 8 synapses
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
do_te_invoke: Processing graph 2530 (ref=pe_calc-dc-1270669810-6396)
derived from /usr/var/lib/pengine/pe-input-3386.bz2
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
te_pseudo_action: Pseudo action 13 fired and confirmed
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
te_rsc_command: Initiating action 3: stop failover-apache_stop_0 on
apauat1b.intranet.mydomain.com (local)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info:
cancel_op: operation monitor[3844] on lsb::httpd::failover-apache for
client 13459, its parameters: CRM_meta_interval=[15000]
CRM_meta_timeout=[20000] crm_feature_set=[3.0.1]
CRM_meta_name=[monitor] cancelled
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
do_lrm_rsc_op: Performing
key=3:2530:0:6b7b1df2-29a1-4b90-a254-b8fdf7df3632
op=failover-apache_stop_0 )
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info:
rsc:failover-apache:3845: stop
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13639]: WARN: For
LSB init script, no additional parameters are needed.
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
process_lrm_event: LRM operation failover-apache_monitor_15000
(call=3844, status=1, cib-update=0, confirmed=true) Cancelled
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
output: (failover-apache:stop:stdout) Stopping httpd:
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
output: (failover-apache:stop:stdout) [
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
output: (failover-apache:stop:stdout) FAILED
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
output: (failover-apache:stop:stdout) ]
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
output: (failover-apache:stop:stdout)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
output: (failover-apache:stop:stdout)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info:
Managed failover-apache:stop process 13639 exited with return code 0.
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
process_lrm_event: LRM operation failover-apache_stop_0 (call=3845,
rc=0, cib-update=6704, confirmed=true) ok
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
match_graph_event: Action failover-apache_stop_0 (3) confirmed on
apauat1b.intranet.mydomain.com (rc=0)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
te_pseudo_action: Pseudo action 14 fired and confirmed
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
te_pseudo_action: Pseudo action 4 fired and confirmed
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
te_pseudo_action: Pseudo action 11 fired and confirmed
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
te_rsc_command: Initiating action 10: start failover-apache_start_0 on
apauat1b.intranet.mydomain.com (local)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
do_lrm_rsc_op: Performing
key=10:2530:0:6b7b1df2-29a1-4b90-a254-b8fdf7df3632
op=failover-apache_start_0 )
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info:
rsc:failover-apache:3846: start
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13649]: WARN: For
LSB init script, no additional parameters are needed.
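
To keep an eye on how fast the failures pile up, the attrd lines above can be filtered for the fail-count attribute. This is only a rough sketch: the regular expression is my own guess based on the two attrd lines in this excerpt (the "Sent update ... fail-count-failover-apache=1280" form), and the function name is made up for illustration.

```python
import re

# Matches attribute updates such as "fail-count-failover-apache=1280".
# The "=value++ to 1280" form is skipped on purpose: there is no number
# directly after the "=" sign, so it would only duplicate the update line.
FAIL_COUNT_RE = re.compile(r"fail-count-([A-Za-z0-9_.-]+)=(\d+)")


def fail_counts(log_text):
    """Return {resource: highest fail-count seen} for a chunk of log text."""
    counts = {}
    for resource, value in FAIL_COUNT_RE.findall(log_text):
        counts[resource] = max(counts.get(resource, 0), int(value))
    return counts


if __name__ == "__main__":
    sample = (
        "attrd_local_callback: Expanded "
        "fail-count-failover-apache=value++ to 1280\n"
        "attrd_perform_update: Sent update 3175: "
        "fail-count-failover-apache=1280\n"
    )
    print(fail_counts(sample))
```

Run against the full log file, this should show whether fail-count for failover-apache keeps climbing while the resource stays on the same node.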
Jakob Curdes wrote:
> What do the logs say?
> jc
>
> mike wrote:
>> Thank you Jakob,
>> I put them in a resource group as indicated, but I am still seeing the
>> same behavior: if I stop httpd manually and then prevent it from
>> restarting (by editing out the contents of /etc/init.d/httpd), the
>> cluster simply sits there and spins its wheels trying to restart httpd
>> on the primary node over and over again. At no point is a
>> failover initiated. Does anyone know why stopping httpd in this manner
>> will not result in a failover?
>>
>> Here is my cib.xml
>> <cib crm_feature_set="3.0.1"
>> dc-uuid="86b5c3f4-8202-45f7-91a8-64e17163bb7a" have-quorum="1"
>> remote-tls-port="0" validate-with="pacemaker-1.0"
>> epoch="9" admin_epoch="0" num_updates="0" cib-last-written="Wed Apr
>> 7 15:11:06 2010">
>> <configuration>
>> <crm_config>
>> <cluster_property_set id="cib-bootstrap-options">
>> <nvpair id="nvpair.id17897268" name="symmetric-cluster"
>> value="true"/>
>> <nvpair id="nvpair.id17897737" name="no-quorum-policy"
>> value="stop"/>
>> <nvpair id="nvpair.id17897746"
>> name="default-resource-stickiness" value="0"/>
>> <nvpair id="nvpair.id17897755"
>> name="default-resource-failure-stickiness" value="0"/>
>> <nvpair id="nvpair.id17897413" name="stonith-enabled"
>> value="false"/>
>> <nvpair id="nvpair.id17897422" name="stonith-action"
>> value="reboot"/>
>> <nvpair id="nvpair.id17897431" name="startup-fencing"
>> value="true"/>
>> <nvpair id="nvpair.id17897704" name="stop-orphan-resources"
>> value="true"/>
>> <nvpair id="nvpair.id17897714" name="stop-orphan-actions"
>> value="true"/>
>> <nvpair id="nvpair.id17897723" name="remove-after-stop"
>> value="false"/>
>> <nvpair id="nvpair.id17898021" name="short-resource-names"
>> value="true"/>
>> <nvpair id="nvpair.id17898030" name="transition-idle-timeout"
>> value="5min"/>
>> <nvpair id="nvpair.id17898040" name="default-action-timeout"
>> value="20s"/>
>> <nvpair id="nvpair.id17897626" name="is-managed-default"
>> value="true"/>
>> <nvpair id="nvpair.id17897635" name="cluster-delay" value="60s"/>
>> <nvpair id="nvpair.id17897643" name="pe-error-series-max"
>> value="-1"/>
>> <nvpair id="nvpair.id17897653" name="pe-warn-series-max"
>> value="-1"/>
>> <nvpair id="nvpair.id17897329" name="pe-input-series-max"
>> value="-1"/>
>> <nvpair id="nvpair.id17897338" name="dc-version"
>> value="1.0.8-5443ff1ab132449ad5b236169403c6a23cf4168b"/>
>> <nvpair id="nvpair.id17897347" name="cluster-infrastructure"
>> value="Heartbeat"/>
>> </cluster_property_set>
>> </crm_config>
>> <nodes>
>> <node id="86b5c3f4-8202-45f7-91a8-64e17163bb7a"
>> uname="apauat1b.intranet.mydomain.com" type="normal"/>
>> <node id="dbd6016a-aab6-4130-87fb-80e954353b3b"
>> uname="apauat1a.intranet.mydomain.com" type="normal"/>
>> </nodes>
>> <resources>
>> <group id="web_cluster">
>> <primitive class="ocf" id="failover-ip" provider="heartbeat"
>> type="IPaddr">
>> <instance_attributes id="failover-ip-instance_attributes">
>> <nvpair id="failover-ip-instance_attributes-ip" name="ip"
>> value="172.28.185.55"/>
>> </instance_attributes>
>> <operations>
>> <op id="failover-ip-monitor-10s" interval="10s"
>> name="monitor"/>
>> </operations>
>> </primitive>
>> <primitive class="lsb" id="failover-apache" type="httpd">
>> <operations>
>> <op id="failover-apache-monitor-15s" interval="15s"
>> name="monitor"/>
>> </operations>
>> </primitive>
>> </group>
>> </resources>
>> <constraints/>
>> <rsc_defaults/>
>> <op_defaults/>
>> </configuration>
>> </cib>
>>
>> Jakob Curdes wrote:
>>> mike wrote:
>>>> Thank you Jakob,
>>>> I did as you suggested (good idea, btw) and what I saw was that
>>>> LinuxHA continually tried to restart it on the primary node. Is
>>>> there a setting with which I can say "After X failed restart
>>>> attempts, fail over"?
>>>>
>>> I think you need to read further down that page and use the settings in
>>>
>>> "Failover IP Service in a Group"
>>>
>>> What you probably actually want is to have IP and service running
>>> always on the same node.
>>> (plus, as a last step, on the node with the best connectivity).
>>>
>>> HTH,
>>> Jakob Curdes
>>>
>>>
>
>
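
Coming back to my earlier question about failing over after X restart attempts: as far as I can tell, Pacemaker has a resource meta attribute called migration-threshold for this. Once fail-count for a resource on a node reaches that value, the resource is supposed to be moved away from that node. My cib.xml above does not set it anywhere, which may be why the cluster keeps recovering httpd in place. The sketch below only builds the modified XML; the element ids and the function name are made up, the value 3 is an arbitrary example, and the result would still have to be loaded through the normal CIB tools rather than by editing the file on disk.

```python
import xml.etree.ElementTree as ET


def add_migration_threshold(cib_xml, resource_id, threshold):
    """Return CIB XML with migration-threshold set on one primitive."""
    root = ET.fromstring(cib_xml)
    primitive = root.find(f".//primitive[@id='{resource_id}']")
    if primitive is None:
        raise ValueError(f"no primitive with id {resource_id!r}")

    # Hypothetical ids; any ids that are unique within the CIB would do.
    meta = ET.Element("meta_attributes",
                      {"id": f"{resource_id}-meta_attributes"})
    ET.SubElement(meta, "nvpair", {
        "id": f"{resource_id}-meta-migration-threshold",
        "name": "migration-threshold",
        "value": str(threshold),
    })
    # Put the block ahead of <operations>, as the first child.
    primitive.insert(0, meta)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<resources>'
        '<primitive class="lsb" id="failover-apache" type="httpd">'
        '<operations>'
        '<op id="failover-apache-monitor-15s" interval="15s" name="monitor"/>'
        '</operations>'
        '</primitive>'
        '</resources>'
    )
    print(add_migration_threshold(sample, "failover-apache", 3))
```

Since fail-count for failover-apache is already far above any small threshold, I would expect it to need clearing after such a change before the node is considered eligible for the resource again.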
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems