Re: [Linux-HA] resource monitor timeout, Killing with signal SIGTERM (15).

Lars Ellenberg Tue, 13 Nov 2012 13:24:16 -0800

On Fri, Oct 19, 2012 at 02:22:09PM +0700, Thanachit Wichianchai wrote:
> Hello Linux-HA community,
> 
> 
> Current Setup:
> 
> Linux HA Version: 2.1.4
> Red Hat Enterprise Linux 5.5
> Active/Standby mode.
> 
> 
> 1. I would like to know about "Killing with signal SIGTERM (15)." after
> a resource monitor operation times out.
> I assume the resource monitor process was killed in order to
> start a new resource monitoring process. Is this correct?


Yes.

This "Killing with signal SIGTERM" is about killing the child process
that was created for the monitoring operation.

It is not about the actual resource.

The actual resource would then be in "failed" state,
and the policy engine would try to recover.

Typically, such a recovery action would be to first "stop", then "start"
the resource (on the same node, or on another available node, depending
on "a number of things").

But you told it to ignore failures ("on_fail=ignore"), so the resource
itself would be left alone.
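
To illustrate what that log message is about, here is a minimal Python
sketch of the pattern: run the monitor as a child process, and if it
does not finish within the timeout, send SIGTERM to that child only.
This is not lrmd's actual code. The function name run_monitor, the
grace_s parameter and the escalation to SIGKILL are assumptions made
for illustration.

```python
import signal
import subprocess
import sys


def run_monitor(cmd, timeout_s, grace_s=5.0):
    """Run a monitor command as a child process.

    Returns (returncode, timed_out). On timeout only the child is
    signalled; the monitored resource itself is never touched.
    """
    child = subprocess.Popen(cmd)
    try:
        return child.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # Monitor took too long: ask the child to terminate.
        child.send_signal(signal.SIGTERM)
        try:
            child.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            # Assumed escalation if the child ignores SIGTERM.
            child.kill()
            child.wait()
        return child.returncode, True


if __name__ == "__main__":
    # Hypothetical stand-in for a hanging "status" operation.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_monitor(hanging, timeout_s=1.0))
```

Whether the caller then treats the timed-out operation as a failure,
and what it does about it, is a separate policy decision (on_fail).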

That said, I too would recommend that you
either fall back to "haresources" mode,
or use a recent version of Pacemaker.

That would not prevent your monitor operations from timing out.
But Pacemaker behaves much better when dealing with failures,
has much improved usability,
and has better features for analysing "bad" behavior.

> lrmd[4484]: 2012/10/17_00:07:04 WARN: resource_broker7:status process (PID
> 3665) timed out (try 1).  Killing with signal SIGTERM (15).
> lrmd[4484]: 2012/10/17_00:07:04 WARN: operation status[12] on
> heartbeat::broker7::resource_broker7 for client 4487, its parameters:
> CRM_meta_interval=[5000] CRM_meta_prereq=[nothing]
> CRM_meta_start_delay=[15000] CRM_meta_role=[Started]
> CRM_meta_id=[Mon_broker7] CRM_meta_timeout=[30000]
> CRM_meta_on_fail=[ignore] crm_feature_set=[2.0]
> CRM_meta_disabled=[false] CRM_meta_description=[Mon_broker7]
> CRM_meta_name=[status] : pid [3665] timed out
> crmd[4487]: 2012/10/17_00:07:04 ERROR: process_lrm_event: LRM operation
> resource_broker7_status_5000 (12) Timed Out (timeout=30000ms)
> 
> 
> 
> 2.  I really need to understand the resource agent monitoring process.
> From the configuration below, is it correct that
> "the status operation is performed every 5 seconds and has 15 seconds to
> complete before Linux HA assumes the resource has failed"?
> When the resource has failed, Linux HA will look at the on_fail value and
> take some action (ignore, stop).

I think so.
Though it has been a while since I tried to make sense of the xml directly,
let alone the older-style xml that was used with the heartbeat 2.x crmd.

> 3. In case that on_fail="stop":
> when resource monitoring times out, will Linux HA stop that resource?

Yes.

> and failover to standby node?

Not necessarily.
It may also be restarted on the same node.
That would be determined by some "failure stickiness arithmetic",
which iirc was really cumbersome to get "right", if at all.

That has since been replaced by the "fail count" concept,
which is much easier to handle.
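
The idea behind a fail count can be sketched roughly as follows. This
is a toy model in Python, not Pacemaker code: the class FailCounts, its
methods and the node names are hypothetical. In Pacemaker, the role of
the threshold is played by the migration-threshold meta attribute.

```python
from collections import defaultdict


class FailCounts:
    """Toy model: count failures per (resource, node) pair."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = defaultdict(int)

    def record_failure(self, resource, node):
        self.counts[(resource, node)] += 1

    def pick_node(self, resource, current, nodes):
        """Stay on the current node until its count reaches the threshold."""
        if self.counts[(resource, current)] < self.threshold:
            return current  # restart in place
        for node in nodes:
            if node != current and self.counts[(resource, node)] < self.threshold:
                return node  # move to another node
        return None  # nowhere left to run


fc = FailCounts(threshold=2)
fc.record_failure("resource_broker7", "node1")
first = fc.pick_node("resource_broker7", "node1", ["node1", "node2"])
fc.record_failure("resource_broker7", "node1")
second = fc.pick_node("resource_broker7", "node1", ["node1", "node2"])
```

With a threshold of 2, the first failure leads to a restart on the same
node, and the second one moves the resource away.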

> <primitive id="resource_broker7" class="heartbeat" type="broker7"
> provider="heartbeat">
> 
> <meta_attributes id="resource_broker7_meta_attrs">
> 
> <attributes>
> 
> <nvpair id="resource_broker7_metaattr_target_role" name="target_role"
> value="started"/>
> 
> </attributes>
> 
> </meta_attributes>
> 
> <operations>
> 
> <op name="status" description="Mon_broker7" interval="5" timeout="15"
> start_delay="15"
> disabled="false" role="Started" on_fail="ignore" id="Mon_broker7"
> prereq="nothing"/>
> 
> </operations>
> 
> </primitive>
> 
> 
> Thank you very much.
> 
> Thanachit.

Hope that helps.

Cheers,

        Lars

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
