Re: [Linux-HA] timed out of monitor

Dejan Muhamedagic Tue, 18 Aug 2009 07:03:26 -0700

Hi,

On Tue, Aug 18, 2009 at 05:40:51PM +0400, Ivan Gromov wrote:
> Hello, Dejan
> 
> Thanks for your last letter. Much appreciated!
> > The cluster tried to start it or intended to start it at this
> > node and failed.
> Do you mean that the start method failed?


I'm really not sure. Looking at the code just now, it seems that
the resource was already in the Started state at the time it
failed (due to the monitor timeout). I'm not sure whether that is
documented; the best thing is to check the logs.

> >> How does heartbeat use it?
> >Don't understand the question. The '_n' suffix means that the
> >given operation will be repeated at the interval 'n'.
> Yes, the question wasn't clear. I'll try to rephrase it.
> I have timeout=90 and interval=30 for the res1 monitor. Assume that
> heartbeat starts the monitor. Will heartbeat start the monitor method
> for res1 again if the previous monitor hasn't completed (timeout > interval)?

No. The next monitor operation is started one interval after the
previous one has finished. So the monitor operations do not run
every 30 seconds, but every 30 seconds plus the execution time.
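
To make the timing concrete, here is a minimal sketch in Python. It is
not heartbeat code, only an illustration of the rule described above; the
function name and the example durations are made up, and every run is
assumed to finish within the configured timeout.

```python
def monitor_start_times(interval_s, durations_s):
    """Return the start time (in seconds) of each monitor run.

    Assumption: the next run starts interval_s seconds after the
    previous run has finished, not after it has started.
    """
    starts = []
    t = 0
    for duration in durations_s:
        starts.append(t)
        # The next run is scheduled only once this one has completed.
        t += duration + interval_s
    return starts


# interval=30s, each monitor run takes 5s (hypothetical durations)
print(monitor_start_times(30, [5, 5, 5]))  # [0, 35, 70]
```

With a 5 second monitor the runs start 35 seconds apart, not 30.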

Thanks,

Dejan

> Best wishes,
> Ivan
> 
> * Dejan Muhamedagic <[email protected]> [Mon, 17 Aug 2009 23:37:56 +0200]:
> > Hi,
> >
> > On Mon, Aug 17, 2009 at 07:26:07PM +0400, Ivan Gromov wrote:
> > > Dear Dejan,
> > > Thanks a lot.
> > > Could you explain what this means: >> resource_res1 (ocf::heartbeat:res1):
> > > Started node2 FAILED?  Does it mean that the resource failed while it
> > > was starting?
> >
> > The cluster tried to start it or intended to start it at this
> > node and failed.
> >
> > > > > resource_res1_monitor_30000 (node=node2, call=35, rc=-2): Timed Out
> > > > It's a repeating monitor operation (30000 is 30s represented in
> > > > ms). monitor_0 is what is also called a resource probe, i.e. used
> > > > when the cluster wants to establish the resource status initially.
> > > Does heartbeat use the monitor method for monitor_0 and monitor_30000?
> >
> > Yes.
> >
> > > How does heartbeat use it?
> >
> > Don't understand the question. The '_n' suffix means that the
> > given operation will be repeated at the interval 'n'.
> >
> > > What do the parameters call=35 and rc=-2 mean?
> >
> > You don't have to worry about the call id; that's internal. The rc is
> > the exit code; in this case it means timeout. Normally, there is
> > an explanation for the exit code.
> >
> > Thanks,
> >
> > Dejan
> >
> >
> > > Best wishes,
> > > Ivan
> > >
> > > * Dejan Muhamedagic <[email protected]> [Mon, 17 Aug 2009 16:05:00
> > > +0200]:
> > > > Hi,
> > > >
> > > > On Mon, Aug 17, 2009 at 04:44:01PM +0400, Ivan Gromov wrote:
> > > > > Dear all,
> > > > > I had a problem with my resource res1, but I can't understand
> > > > > where the problem was.
> > > > > The information obtained from the crm_mon is
> > > > >  
> > > > > Node: node2 (237ceb38-a061-d99d-f4bf-944dd057ab5d): online
> > > > > Node: node1 (965e45c6-19c4-241e-ff9d-4904882ef868): standby
> > > > >  
> > > > >  
> > > > > resource_res1      (ocf::heartbeat:res1):     Started node2 FAILED
> > > > > RESOURCE2   (ocf::heartbeat:Resource):  Started node2
> > > > >  
> > > > > Failed actions:
> > > > > resource_res1_monitor_30000 (node=node2, call=35, rc=-2): Timed Out
> > > > >  
> > > > > The definition of res1 is:
> > > > >  
> > > > > <primitive id="resource_res1" class="ocf" type="res1"
> > > > > provider="heartbeat">
> > > > > <operations>
> > > > > <op id="34" name="monitor" interval="30s" timeout="90s"
> > > > > start_delay="0s" on_fail="restart"/>
> > > > > <op id="35" name="start" timeout="30s"/>
> > > > > <op id="36" name="stop" timeout="30s"/>
> > > > > </operations>
> > > > > <instance_attributes id="resource_res1_instance_attrs">
> > > > > <attributes>
> > > > > <nvpair name="target_role" id="resource_res1_target_role"
> > > > > value="started"/>
> > > > > </attributes>
> > > > > </instance_attributes>
> > > > > <meta_attributes id="resource_res1_meta">
> > > > > <attributes>
> > > > > <nvpair name="resource_stickiness" id="resource_res1_Rs" value="150"/>
> > > > > <nvpair name="resource_failure_stickiness" id="resource_res1_FRs"
> > > > > value="-100"/>
> > > > > </attributes>
> > > > > </meta_attributes>
> > > > > </primitive>
> > > > >  
> > > > > Is it a mistake in the monitor method of res1?
> > > >
> > > > Yes, the monitor operation timed out.
> > > >
> > > > > I wanted to reproduce this situation, so I put sleep (100) in the
> > > > > monitor method.
> > > > > I received this from crm_mon:
> > > > >  
> > > > > Node: node2(237ceb38-a061-d99d-f4bf-944dd057ab5d): online
> > > > > Node: node1 (965e45c6-19c4-241e-ff9d-4904882ef868): OFFLINE
> > > > >  
> > > > > RESOURCE2   (ocf::heartbeat:Resource):  Started node2
> > > > >  
> > > > > Failed actions:
> > > > > resource_res1_monitor_0 (node=node2, call=24, rc=-2): Timed Out
> > > > >  
> > > > > monitor_0 and monitor_30000 are not the same monitor method, right?
> > > > > What is monitor_30000?
> > > >
> > > > It's a repeating monitor operation (30000 is 30s represented in
> > > > ms). monitor_0 is what is also called a resource probe, i.e. used
> > > > when the cluster wants to establish the resource status initially.
> > > >
> > > > > What should I do to find where my mistake is?
> > > >
> > > > Take a look at your resource agent :)
> > > >
> > > > Thanks,
> > > >
> > > > Dejan
> > > >
> > > > > --
> > > > >  
> > > > > Best regards,
> > > > > Ivan Gromov.
> > > > > _______________________________________________
> > > > > Linux-HA mailing list
> > > > > [email protected]
> > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > > > See also: http://linux-ha.org/ReportingProblems
