Thanks again, Jakob.

I see this in crm_mon:
Resource Group: web_cluster
     failover-ip        (ocf::heartbeat:IPaddr):        Started apauat1b.intranet.aeroplan.com
     failover-apache    (lsb:httpd):    Stopped

Failed actions:
    failover-apache_monitor_15000 (node=apauat1b.intranet.aeroplan.com, call=2566, rc=7, status=complete): not running
Note that with each failed attempt to start httpd, the call figure above 
increments by 1, so it appears to be a counter of some kind.

And the logs show this:
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: notice: 
run_graph: Transition 2528 (Complete=8, Pending=0, Fired=0, Skipped=0, 
Incomplete=0, Source=/usr/var/lib/pengine/pe-input-3384.bz2): Complete
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
te_graph_trigger: Transition 2528 is now complete
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com attrd: [13458]: info: 
attrd_local_callback: Expanded fail-count-failover-apache=value++ to 1280
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com attrd: [13458]: info: 
attrd_trigger_update: Sending flush op to all hosts for: 
fail-count-failover-apache (1280)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
do_state_transition: State transition S_TRANSITION_ENGINE -> 
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
do_state_transition: All 2 cluster nodes are eligible to run resources.
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com attrd: [13458]: info: 
attrd_perform_update: Sent update 3175: fail-count-failover-apache=1280
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
do_pe_invoke: Query 6702: Requesting the current CIB: S_POLICY_ENGINE
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
abort_transition_graph: te_update_diff:146 - Triggered transition abort 
(complete=1, tag=transient_attributes, 
id=86b5c3f4-8202-45f7-91a8-64e17163bb7a, magic=NA, cib=0.9.5422) : 
Transient attribute: update
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
do_pe_invoke_callback: Invoking the PE: query=6702, 
ref=pe_calc-dc-1270669810-6395, seq=2, quorate=1
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
do_pe_invoke: Query 6703: Requesting the current CIB: S_POLICY_ENGINE
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
do_pe_invoke_callback: Invoking the PE: query=6703, 
ref=pe_calc-dc-1270669810-6396, seq=2, quorate=1
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info: 
unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info: 
determine_online_status: Node apauat1b.intranet.mydomain.com is online
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: WARN: 
unpack_rsc_op: Processing failed op failover-apache_monitor_15000 on 
apauat1b.intranet.mydomain.com: not running (7)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info: 
determine_online_status: Node apauat1a.intranet.mydomain.com is online
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
group_print:  Resource Group: web_cluster
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
native_print:      failover-ip    (ocf::heartbeat:IPaddr):    Started 
apauat1b.intranet.mydomain.com
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
native_print:      failover-apache    (lsb:httpd):    Started 
apauat1b.intranet.mydomain.com FAILED
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
RecurringOp:  Start recurring monitor (15s) for failover-apache on 
apauat1b.intranet.mydomain.com
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
LogActions: Leave resource failover-ip    (Started 
apauat1b.intranet.mydomain.com)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
LogActions: Recover resource failover-apache    (Started 
apauat1b.intranet.mydomain.com)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
handle_response: pe_calc calculation pe_calc-dc-1270669810-6395 is obsolete
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info: 
process_pe_message: Transition 2529: PEngine Input stored in: 
/usr/var/lib/pengine/pe-input-3385.bz2
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info: 
unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info: 
determine_online_status: Node apauat1b.intranet.mydomain.com is online
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: WARN: 
unpack_rsc_op: Processing failed op failover-apache_monitor_15000 on 
apauat1b.intranet.mydomain.com: not running (7)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info: 
determine_online_status: Node apauat1a.intranet.mydomain.com is online
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
group_print:  Resource Group: web_cluster
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
native_print:      failover-ip    (ocf::heartbeat:IPaddr):    Started 
apauat1b.intranet.mydomain.com
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
native_print:      failover-apache    (lsb:httpd):    Started 
apauat1b.intranet.mydomain.com FAILED
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
RecurringOp:  Start recurring monitor (15s) for failover-apache on 
apauat1b.intranet.mydomain.com
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
LogActions: Leave resource failover-ip    (Started 
apauat1b.intranet.mydomain.com)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice: 
LogActions: Recover resource failover-apache    (Started 
apauat1b.intranet.mydomain.com)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info: 
process_pe_message: Transition 2530: PEngine Input stored in: 
/usr/var/lib/pengine/pe-input-3386.bz2
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
do_state_transition: State transition S_POLICY_ENGINE -> 
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE 
origin=handle_response ]
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
unpack_graph: Unpacked transition 2530: 8 actions in 8 synapses
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
do_te_invoke: Processing graph 2530 (ref=pe_calc-dc-1270669810-6396) 
derived from /usr/var/lib/pengine/pe-input-3386.bz2
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
te_pseudo_action: Pseudo action 13 fired and confirmed
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
te_rsc_command: Initiating action 3: stop failover-apache_stop_0 on 
apauat1b.intranet.mydomain.com (local)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: 
cancel_op: operation monitor[3844] on lsb::httpd::failover-apache for 
client 13459, its parameters: CRM_meta_interval=[15000] 
CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] 
CRM_meta_name=[monitor]  cancelled
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
do_lrm_rsc_op: Performing 
key=3:2530:0:6b7b1df2-29a1-4b90-a254-b8fdf7df3632 
op=failover-apache_stop_0 )
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: 
rsc:failover-apache:3845: stop
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13639]: WARN: For 
LSB init script, no additional parameters are needed.
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
process_lrm_event: LRM operation failover-apache_monitor_15000 
(call=3844, status=1, cib-update=0, confirmed=true) Cancelled
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA 
output: (failover-apache:stop:stdout) Stopping httpd:
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA 
output: (failover-apache:stop:stdout) [
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA 
output: (failover-apache:stop:stdout) FAILED
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA 
output: (failover-apache:stop:stdout) ]
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA 
output: (failover-apache:stop:stdout)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA 
output: (failover-apache:stop:stdout)

Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: 
Managed failover-apache:stop process 13639 exited with return code 0.
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
process_lrm_event: LRM operation failover-apache_stop_0 (call=3845, 
rc=0, cib-update=6704, confirmed=true) ok
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
match_graph_event: Action failover-apache_stop_0 (3) confirmed on 
apauat1b.intranet.mydomain.com (rc=0)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
te_pseudo_action: Pseudo action 14 fired and confirmed
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
te_pseudo_action: Pseudo action 4 fired and confirmed
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
te_pseudo_action: Pseudo action 11 fired and confirmed
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
te_rsc_command: Initiating action 10: start failover-apache_start_0 on 
apauat1b.intranet.mydomain.com (local)
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info: 
do_lrm_rsc_op: Performing 
key=10:2530:0:6b7b1df2-29a1-4b90-a254-b8fdf7df3632 
op=failover-apache_start_0 )
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: 
rsc:failover-apache:3846: start
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13649]: WARN: For 
LSB init script, no additional parameters are needed.
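
As an aside, the fail-count updates in an excerpt like the one above can be pulled out with a short script. This is only a minimal sketch, assuming log lines shaped like the attrd_perform_update entries shown; the function name, the regular expression, and the sample text are made up for illustration and are not part of Pacemaker or Heartbeat.

```python
import re

# Matches attrd lines such as:
#   attrd_perform_update: Sent update 3175: fail-count-failover-apache=1280
# Assumption: the resource name contains no '=' and the count is a plain integer.
FAIL_COUNT_RE = re.compile(
    r"attrd_perform_update: Sent update \d+: "
    r"fail-count-(?P<rsc>[^=\s]+)=(?P<count>\d+)"
)


def extract_fail_counts(log_text):
    """Return a list of (resource, fail_count) tuples in log order."""
    # Mail clients hard-wrap long log lines at spaces, so rejoin the text
    # into one string before searching.
    joined = " ".join(line.strip() for line in log_text.splitlines())
    return [
        (m.group("rsc"), int(m.group("count")))
        for m in FAIL_COUNT_RE.finditer(joined)
    ]


# Hypothetical sample, copied from the wrapped excerpt above.
SAMPLE = """\
Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com attrd: [13458]: info:
attrd_perform_update: Sent update 3175: fail-count-failover-apache=1280
"""

if __name__ == "__main__":
    for rsc, count in extract_fail_counts(SAMPLE):
        print(f"{rsc}: fail-count {count}")
```

Running it over a longer stretch of the log would show whether the fail-count keeps climbing on the same node while no failover is attempted.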


Jakob Curdes wrote:
> What do the logs say?
> jc
>
> mike wrote:
>> Thank you Jakob,
>> I put them in a resource group as indicated, but I am still seeing the 
>> same behavior: if I stop httpd manually and then prevent it from 
>> restarting (by editing out the contents of /etc/init.d/httpd), the 
>> cluster simply sits there and spins its wheels trying to restart httpd 
>> on the primary node over and over again. At no point is a 
>> failover initiated. Does anyone know why stopping httpd in this manner 
>> does not result in a failover?
>>
>> Here is my cib.xml:
>> <cib crm_feature_set="3.0.1" 
>> dc-uuid="86b5c3f4-8202-45f7-91a8-64e17163bb7a" have-quorum="1" 
>> remote-tls-port="0" validate-with="pacemaker-1.0"
>> epoch="9" admin_epoch="0" num_updates="0" cib-last-written="Wed Apr  
>> 7 15:11:06 2010">
>>  <configuration>
>>    <crm_config>
>>      <cluster_property_set id="cib-bootstrap-options">
>>        <nvpair id="nvpair.id17897268" name="symmetric-cluster" 
>> value="true"/>
>>        <nvpair id="nvpair.id17897737" name="no-quorum-policy" 
>> value="stop"/>
>>        <nvpair id="nvpair.id17897746" 
>> name="default-resource-stickiness" value="0"/>
>>        <nvpair id="nvpair.id17897755" 
>> name="default-resource-failure-stickiness" value="0"/>
>>        <nvpair id="nvpair.id17897413" name="stonith-enabled" 
>> value="false"/>
>>        <nvpair id="nvpair.id17897422" name="stonith-action" 
>> value="reboot"/>
>>        <nvpair id="nvpair.id17897431" name="startup-fencing" 
>> value="true"/>
>>        <nvpair id="nvpair.id17897704" name="stop-orphan-resources" 
>> value="true"/>
>>        <nvpair id="nvpair.id17897714" name="stop-orphan-actions" 
>> value="true"/>
>>        <nvpair id="nvpair.id17897723" name="remove-after-stop" 
>> value="false"/>
>>        <nvpair id="nvpair.id17898021" name="short-resource-names" 
>> value="true"/>
>>        <nvpair id="nvpair.id17898030" name="transition-idle-timeout" 
>> value="5min"/>
>>        <nvpair id="nvpair.id17898040" name="default-action-timeout" 
>> value="20s"/>
>>        <nvpair id="nvpair.id17897626" name="is-managed-default" 
>> value="true"/>
>>        <nvpair id="nvpair.id17897635" name="cluster-delay" value="60s"/>
>>        <nvpair id="nvpair.id17897643" name="pe-error-series-max" 
>> value="-1"/>
>>        <nvpair id="nvpair.id17897653" name="pe-warn-series-max" 
>> value="-1"/>
>>        <nvpair id="nvpair.id17897329" name="pe-input-series-max" 
>> value="-1"/>
>>        <nvpair id="nvpair.id17897338" name="dc-version" 
>> value="1.0.8-5443ff1ab132449ad5b236169403c6a23cf4168b"/>
>>        <nvpair id="nvpair.id17897347" name="cluster-infrastructure" 
>> value="Heartbeat"/>
>>      </cluster_property_set>
>>    </crm_config>
>>    <nodes>
>>      <node id="86b5c3f4-8202-45f7-91a8-64e17163bb7a" 
>> uname="apauat1b.intranet.mydomain.com" type="normal"/>
>>      <node id="dbd6016a-aab6-4130-87fb-80e954353b3b" 
>> uname="apauat1a.intranet.mydomain.com" type="normal"/>
>>    </nodes>
>>    <resources>
>>      <group id="web_cluster">
>>        <primitive class="ocf" id="failover-ip" provider="heartbeat" 
>> type="IPaddr">
>>          <instance_attributes id="failover-ip-instance_attributes">
>>            <nvpair id="failover-ip-instance_attributes-ip" name="ip" 
>> value="172.28.185.55"/>
>>          </instance_attributes>
>>          <operations>
>>            <op id="failover-ip-monitor-10s" interval="10s" 
>> name="monitor"/>
>>          </operations>
>>        </primitive>
>>        <primitive class="lsb" id="failover-apache" type="httpd">
>>          <operations>
>>            <op id="failover-apache-monitor-15s" interval="15s" 
>> name="monitor"/>
>>          </operations>
>>        </primitive>
>>      </group>
>>    </resources>
>>    <constraints/>
>>    <rsc_defaults/>
>>    <op_defaults/>
>>  </configuration>
>> </cib>
>>
>> Jakob Curdes wrote:
>>> mike wrote:
>>>> Thank you Jakob,
>>>> I did as you suggested (good idea, btw) and what I saw was that 
>>>> LinuxHA continually tried to restart it on the primary node. Is 
>>>> there a setting that lets me say "After X failed restart attempts, 
>>>> fail over"?
>>>>   
>>> I think you need to read further down that page and use the settings in
>>>
>>> "Failover IP Service in a Group"
>>>
>>> What you probably want is to have the IP and the service always 
>>> running on the same node
>>> (and, as a last step, on the node with the best connectivity).
>>>
>>> HTH,
>>> Jakob Curdes
>>>
>>>
>
>

_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
