I believe what you're looking for is migration-threshold.
In the following Pacemaker snippet, if Apache is stopped, if the website
http://localhost/index.html doesn't respond, or if the HTML body doesn't
contain "node", then WebSite's failcount is incremented by one and Apache is
restarted. Once the failcount reaches two, WebSite is moved to the other
node.
Monitor the failcount with "crm_mon --failcounts", and clear it afterwards
with "crm resource cleanup WebSite [<nodename>]".
primitive WebSite ocf:heartbeat:apache \
    params configfile="/etc/httpd/conf/httpd.conf" \
        statusurl="http://localhost/index.html" \
        testregex="node" \
    op monitor interval="30s" \
    op start interval="0" timeout="90s" \
    op stop interval="0" timeout="100s" \
    meta migration-threshold="2"
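(For anyone editing the CIB directly rather than using the crm shell, the same
migration-threshold setting corresponds to a meta_attributes block on the
primitive -- a sketch only; the id values are arbitrary:)

```xml
<primitive class="ocf" provider="heartbeat" type="apache" id="WebSite">
  <!-- params and operations as above -->
  <meta_attributes id="WebSite-meta_attributes">
    <nvpair id="WebSite-migration-threshold"
            name="migration-threshold" value="2"/>
  </meta_attributes>
</primitive>
```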
p.s. Cluster experts -- should the migration threshold be tied to WebSite, or
should it be tied to the group that contains the website and the virtual IP
address?
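(For concreteness, tying it to the group instead would look something like the
following sketch -- "WebGroup" and "ClusterIP" are hypothetical names, not
from the configuration above:)

```
group WebGroup ClusterIP WebSite \
    meta migration-threshold="2"
```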
John Simpson
Senior Software Engineer, I. T. Engineering and Operations
> -----Original Message-----
> From: [email protected] [mailto:linux-ha-
> [email protected]] On Behalf Of mike
> Sent: Thursday, April 08, 2010 9:08 AM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Clarify Apache failover please?
>
> So thanks to Jakob, I confirmed that the init script is LSB compliant.
> So my question is: Does anyone have an LSB Apache cluster set up where
> if they kill httpd and leave it down (by editing out the start portion
> of the script) that a failover is initiated? I've gone through plenty of
> web sites where people set up the cluster and they test it by issuing
> service heartbeat stop or rebooting the other node. That works fine for
> me. But I have not seen a working cluster where killing httpd causes a
> failover. This is what I am looking for. I do not want it to restart it
> on the current node, at least not right now. I want to simulate a case
> where httpd will not start. Right now, all that appears to happen is the
> cluster keeps trying to start httpd on the primary node. I'm obviously
> missing something, because the way it is set up is certainly not highly
> available.
>
>
> mike wrote:
> > *Thanks again Jakob.
> >
> > I see this on crm_mon*
> > Resource Group: web_cluster
> > failover-ip (ocf::heartbeat:IPaddr): Started
> > apauat1b.intranet.aeroplan.com
> > failover-apache (lsb:httpd): Stopped
> >
> > Failed actions:
> > failover-apache_monitor_15000 (node=apauat1b.intranet.aeroplan.com,
> > call=2566, rc=7, status=complete): not running
> > *Note that with each failed attempt to start httpd the call figure above
> > increments by 1. So that is a counter of some description.*
> >
> > *And the logs show this:*
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: notice:
> > run_graph: Transition 2528 (Complete=8, Pending=0, Fired=0, Skipped=0,
> > Incomplete=0, Source=/usr/var/lib/pengine/pe-input-3384.bz2): Complete
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > te_graph_trigger: Transition 2528 is now complete
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com attrd: [13458]: info:
> > attrd_local_callback: Expanded fail-count-failover-apache=value++ to
> 1280
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com attrd: [13458]: info:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > fail-count-failover-apache (1280)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > do_state_transition: State transition S_TRANSITION_ENGINE ->
> > S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=notify_crmd ]
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > do_state_transition: All 2 cluster nodes are eligible to run resources.
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com attrd: [13458]: info:
> > attrd_perform_update: Sent update 3175: fail-count-failover-apache=1280
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > do_pe_invoke: Query 6702: Requesting the current CIB: S_POLICY_ENGINE
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > abort_transition_graph: te_update_diff:146 - Triggered transition abort
> > (complete=1, tag=transient_attributes,
> > id=86b5c3f4-8202-45f7-91a8-64e17163bb7a, magic=NA, cib=0.9.5422) :
> > Transient attribute: update
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > do_pe_invoke_callback: Invoking the PE: query=6702,
> > ref=pe_calc-dc-1270669810-6395, seq=2, quorate=1
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > do_pe_invoke: Query 6703: Requesting the current CIB: S_POLICY_ENGINE
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > do_pe_invoke_callback: Invoking the PE: query=6703,
> > ref=pe_calc-dc-1270669810-6396, seq=2, quorate=1
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
> > unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
> > determine_online_status: Node apauat1b.intranet.mydomain.com is online
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: WARN:
> > unpack_rsc_op: Processing failed op failover-apache_monitor_15000 on
> > apauat1b.intranet.mydomain.com: not running (7)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
> > determine_online_status: Node apauat1a.intranet.mydomain.com is online
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > group_print: Resource Group: web_cluster
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > native_print: failover-ip (ocf::heartbeat:IPaddr): Started
> > apauat1b.intranet.mydomain.com
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > native_print: failover-apache (lsb:httpd): Started
> > apauat1b.intranet.mydomain.com FAILED
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > RecurringOp: Start recurring monitor (15s) for failover-apache on
> > apauat1b.intranet.mydomain.com
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > LogActions: Leave resource failover-ip (Started
> > apauat1b.intranet.mydomain.com)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > LogActions: Recover resource failover-apache (Started
> > apauat1b.intranet.mydomain.com)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > handle_response: pe_calc calculation pe_calc-dc-1270669810-6395 is
> obsolete
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
> > process_pe_message: Transition 2529: PEngine Input stored in:
> > /usr/var/lib/pengine/pe-input-3385.bz2
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
> > unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
> > determine_online_status: Node apauat1b.intranet.mydomain.com is online
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: WARN:
> > unpack_rsc_op: Processing failed op failover-apache_monitor_15000 on
> > apauat1b.intranet.mydomain.com: not running (7)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
> > determine_online_status: Node apauat1a.intranet.mydomain.com is online
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > group_print: Resource Group: web_cluster
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > native_print: failover-ip (ocf::heartbeat:IPaddr): Started
> > apauat1b.intranet.mydomain.com
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > native_print: failover-apache (lsb:httpd): Started
> > apauat1b.intranet.mydomain.com FAILED
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > RecurringOp: Start recurring monitor (15s) for failover-apache on
> > apauat1b.intranet.mydomain.com
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > LogActions: Leave resource failover-ip (Started
> > apauat1b.intranet.mydomain.com)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: notice:
> > LogActions: Recover resource failover-apache (Started
> > apauat1b.intranet.mydomain.com)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com pengine: [13466]: info:
> > process_pe_message: Transition 2530: PEngine Input stored in:
> > /usr/var/lib/pengine/pe-input-3386.bz2
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > do_state_transition: State transition S_POLICY_ENGINE ->
> > S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> > origin=handle_response ]
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > unpack_graph: Unpacked transition 2530: 8 actions in 8 synapses
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > do_te_invoke: Processing graph 2530 (ref=pe_calc-dc-1270669810-6396)
> > derived from /usr/var/lib/pengine/pe-input-3386.bz2
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > te_pseudo_action: Pseudo action 13 fired and confirmed
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > te_rsc_command: Initiating action 3: stop failover-apache_stop_0 on
> > apauat1b.intranet.mydomain.com (local)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info:
> > cancel_op: operation monitor[3844] on lsb::httpd::failover-apache for
> > client 13459, its parameters: CRM_meta_interval=[15000]
> > CRM_meta_timeout=[20000] crm_feature_set=[3.0.1]
> > CRM_meta_name=[monitor] cancelled
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > do_lrm_rsc_op: Performing
> > key=3:2530:0:6b7b1df2-29a1-4b90-a254-b8fdf7df3632
> > op=failover-apache_stop_0 )
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info:
> > rsc:failover-apache:3845: stop
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13639]: WARN: For
> > LSB init script, no additional parameters are needed.
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > process_lrm_event: LRM operation failover-apache_monitor_15000
> > (call=3844, status=1, cib-update=0, confirmed=true) Cancelled
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
> > output: (failover-apache:stop:stdout) Stopping httpd:
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
> > output: (failover-apache:stop:stdout) [
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
> > output: (failover-apache:stop:stdout) FAILED
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
> > output: (failover-apache:stop:stdout) ]
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
> > output: (failover-apache:stop:stdout)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info: RA
> > output: (failover-apache:stop:stdout)
> >
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info:
> > Managed failover-apache:stop process 13639 exited with return code 0.
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > process_lrm_event: LRM operation failover-apache_stop_0 (call=3845,
> > rc=0, cib-update=6704, confirmed=true) ok
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > match_graph_event: Action failover-apache_stop_0 (3) confirmed on
> > apauat1b.intranet.mydomain.com (rc=0)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > te_pseudo_action: Pseudo action 14 fired and confirmed
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > te_pseudo_action: Pseudo action 4 fired and confirmed
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > te_pseudo_action: Pseudo action 11 fired and confirmed
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > te_rsc_command: Initiating action 10: start failover-apache_start_0 on
> > apauat1b.intranet.mydomain.com (local)
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com crmd: [13459]: info:
> > do_lrm_rsc_op: Performing
> > key=10:2530:0:6b7b1df2-29a1-4b90-a254-b8fdf7df3632
> > op=failover-apache_start_0 )
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13456]: info:
> > rsc:failover-apache:3846: start
> > Apr 07 15:50:10 APAUAT1B.intranet.mydomain.com lrmd: [13649]: WARN: For
> > LSB init script, no additional parameters are needed.
> >
> >
> > Jakob Curdes wrote:
> >
> >> What do the logs say?
> >> jc
> >>
> >> mike wrote:
> >>
> >>> Thank you Jakob,
> >>> I put them in a resource group as indicated but I am still seeing the
> >>> same behavior, i.e. if I stop httpd manually and then stop it from
> >>> restarting (by editing out the contents of /etc/init.d/httpd) the
> >>> cluster simply sits there and spins its wheels trying to restart httpd
> >>> on the primary node over and over and over again. At no point is a
> >>> failover initiated. Anyone know why stopping httpd in this manner
> >>> will not result in a failover?
> >>>
> >>> Here is my cib.xml
> >>> <cib crm_feature_set="3.0.1"
> >>> dc-uuid="86b5c3f4-8202-45f7-91a8-64e17163bb7a" have-quorum="1"
> >>> remote-tls-port="0" validate-with="pacemaker-1.0"
> >>> epoch="9" admin_epoch="0" num_updates="0" cib-last-written="Wed Apr
> >>> 7 15:11:06 2010">
> >>> <configuration>
> >>> <crm_config>
> >>> <cluster_property_set id="cib-bootstrap-options">
> >>> <nvpair id="nvpair.id17897268" name="symmetric-cluster"
> >>> value="true"/>
> >>> <nvpair id="nvpair.id17897737" name="no-quorum-policy"
> >>> value="stop"/>
> >>> <nvpair id="nvpair.id17897746"
> >>> name="default-resource-stickiness" value="0"/>
> >>> <nvpair id="nvpair.id17897755"
> >>> name="default-resource-failure-stickiness" value="0"/>
> >>> <nvpair id="nvpair.id17897413" name="stonith-enabled"
> >>> value="false"/>
> >>> <nvpair id="nvpair.id17897422" name="stonith-action"
> >>> value="reboot"/>
> >>> <nvpair id="nvpair.id17897431" name="startup-fencing"
> >>> value="true"/>
> >>> <nvpair id="nvpair.id17897704" name="stop-orphan-resources"
> >>> value="true"/>
> >>> <nvpair id="nvpair.id17897714" name="stop-orphan-actions"
> >>> value="true"/>
> >>> <nvpair id="nvpair.id17897723" name="remove-after-stop"
> >>> value="false"/>
> >>> <nvpair id="nvpair.id17898021" name="short-resource-names"
> >>> value="true"/>
> >>> <nvpair id="nvpair.id17898030" name="transition-idle-timeout"
> >>> value="5min"/>
> >>> <nvpair id="nvpair.id17898040" name="default-action-timeout"
> >>> value="20s"/>
> >>> <nvpair id="nvpair.id17897626" name="is-managed-default"
> >>> value="true"/>
> >>> <nvpair id="nvpair.id17897635" name="cluster-delay"
> value="60s"/>
> >>> <nvpair id="nvpair.id17897643" name="pe-error-series-max"
> >>> value="-1"/>
> >>> <nvpair id="nvpair.id17897653" name="pe-warn-series-max"
> >>> value="-1"/>
> >>> <nvpair id="nvpair.id17897329" name="pe-input-series-max"
> >>> value="-1"/>
> >>> <nvpair id="nvpair.id17897338" name="dc-version"
> >>> value="1.0.8-5443ff1ab132449ad5b236169403c6a23cf4168b"/>
> >>> <nvpair id="nvpair.id17897347" name="cluster-infrastructure"
> >>> value="Heartbeat"/>
> >>> </cluster_property_set>
> >>> </crm_config>
> >>> <nodes>
> >>> <node id="86b5c3f4-8202-45f7-91a8-64e17163bb7a"
> >>> uname="apauat1b.intranet.mydomain.com" type="normal"/>
> >>> <node id="dbd6016a-aab6-4130-87fb-80e954353b3b"
> >>> uname="apauat1a.intranet.mydomain.com" type="normal"/>
> >>> </nodes>
> >>> <resources>
> >>> <group id="web_cluster">
> >>> <primitive class="ocf" id="failover-ip" provider="heartbeat"
> >>> type="IPaddr">
> >>> <instance_attributes id="failover-ip-instance_attributes">
> >>> <nvpair id="failover-ip-instance_attributes-ip" name="ip"
> >>> value="172.28.185.55"/>
> >>> </instance_attributes>
> >>> <operations>
> >>> <op id="failover-ip-monitor-10s" interval="10s"
> >>> name="monitor"/>
> >>> </operations>
> >>> </primitive>
> >>> <primitive class="lsb" id="failover-apache" type="httpd">
> >>> <operations>
> >>> <op id="failover-apache-monitor-15s" interval="15s"
> >>> name="monitor"/>
> >>> </operations>
> >>> </primitive>
> >>> </group>
> >>> </resources>
> >>> <constraints/>
> >>> <rsc_defaults/>
> >>> <op_defaults/>
> >>> </configuration>
> >>> </cib>
> >>>
> >>> Jakob Curdes wrote:
> >>>
> >>>> mike schrieb:
> >>>>
> >>>>> Thank you Jakob,
> >>>>> I did as you suggested (good idea btw) and what I saw was that
> >>>>> LinuxHA continually tried to restart it on the primary node. Is
> >>>>> there a setting that I can say "After X number of times trying to
> >>>>> restart, fail over" ?
> >>>>>
> >>>>>
> >>>> I think you need to read further down that page and use the settings
> in
> >>>>
> >>>> "Failover IP Service in a Group"
> >>>>
> >>>> What you probably actually want is to have IP and service running
> >>>> always on the same node.
> >>>> (plus- last step - on the node with best connectivity).
> >>>>
> >>>> HTH,
> >>>> Jakob Curdes
> >>>>
> >>>>
> >>>>
> >>
> >
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
> >
> >
>