Re: [Linux-HA] timed out of monitor

Ivan Gromov Tue, 18 Aug 2009 06:41:10 -0700

Hello, Dejan

Thanks for your last letter. Much appreciated!
> The cluster tried to start it or intended to start it at this
> node and failed.
Do you mean that the start method failed?


>> How does heartbeat use it?
>Don't understand the question. The '_n' suffix means that the
>certain operation will be repeated at the 'n' interval.
Yes, my question wasn't clear. I'll try to rephrase it.
I have timeout=90s and interval=30s for the res1 monitor. Suppose heartbeat
starts the monitor. Will heartbeat start the monitor method for res1 again
if the previous monitor run has not completed yet (timeout > interval)?
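
To make the naming discussed above concrete, here is a minimal Python sketch
of how the names shown by crm_mon can be read. It only restates what was
explained in this thread (the numeric suffix is the interval in ms, and
monitor_0 is the probe). The helper names parse_op_key and describe are
hypothetical and not part of heartbeat, and the sketch assumes the operation
name itself contains no underscore.

```python
from typing import NamedTuple


class OpKey(NamedTuple):
    resource: str
    operation: str
    interval_ms: int


def parse_op_key(key: str) -> OpKey:
    """Split '<resource>_<operation>_<interval in ms>' (hypothetical helper)."""
    # Split from the right so underscores inside the resource id are kept.
    # Assumption: the operation name itself has no underscore.
    resource, operation, interval = key.rsplit("_", 2)
    return OpKey(resource, operation, int(interval))


def describe(key: str) -> str:
    """Return a one-line, human-readable reading of an operation key."""
    op = parse_op_key(key)
    if op.interval_ms == 0:
        # Interval 0 means the operation is not repeated; for monitor
        # this is the probe that establishes the initial resource status.
        kind = "probe" if op.operation == "monitor" else "one-off"
        return f"{op.resource}: {op.operation} ({kind})"
    return f"{op.resource}: {op.operation} repeated every {op.interval_ms // 1000}s"


if __name__ == "__main__":
    for name in ("resource_res1_monitor_0", "resource_res1_monitor_30000"):
        print(describe(name))
```

Note that this only decodes the names; it says nothing about how heartbeat
schedules a repeated monitor when timeout > interval, which is the open
question above.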

Best wishes,
Ivan

* Dejan Muhamedagic <[email protected]> [Mon, 17 Aug 2009 23:37:56 +0200]:
> Hi,
>
> On Mon, Aug 17, 2009 at 07:26:07PM +0400, Ivan Gromov wrote:
> > Dear, Dejan
> > Thanks a lot.
> > Could you explain what this means: resource_res1 (ocf::heartbeat:res1):
> > Started node2 FAILED? Does it mean that the resource failed while it
> > was starting?
>
> The cluster tried to start it or intended to start it at this
> node and failed.
>
> > > > resource_res1_monitor_30000 (node=node2, call=35, rc=-2): Timed Out
> > > It's a repeating monitor operation (30000 is 30s represented in
> > > ms). monitor_0 is what is also called a resource probe, i.e. used
> > > when the cluster wants to establish the resource status initially.
> > Does heartbeat use the monitor method for monitor_0 and monitor_30000?
>
> Yes.
>
> > How does heartbeat use it?
>
> Don't understand the question. The '_n' suffix means that the
> certain operation will be repeated at the 'n' interval.
>
> > What do the parameters call=35 and rc=-2 mean?
>
> Don't have to worry about the call id, that's internal. The rc is
> the exit code, in this case it means timeout. Normally, there is
> an explanation for the exit code.
>
> Thanks,
>
> Dejan
>
>
> > Best wishes,
> > Ivan
> >
> > * Dejan Muhamedagic <[email protected]> [Mon, 17 Aug 2009 16:05:00 +0200]:
> > > Hi,
> > >
> > > On Mon, Aug 17, 2009 at 04:44:01PM +0400, Ivan Gromov wrote:
> > > > Dear all,
> > > > I had a problem with my resource res1, but I can't understand
> > > > where the problem was.
> > > > The output from crm_mon is:
> > > >  
> > > > Node: node2 (237ceb38-a061-d99d-f4bf-944dd057ab5d): online
> > > > Node: node1 (965e45c6-19c4-241e-ff9d-4904882ef868): standby
> > > >  
> > > >  
> > > > resource_res1      (ocf::heartbeat:res1):     Started node2 FAILED
> > > > RESOURCE2   (ocf::heartbeat:Resource):  Started node2
> > > >  
> > > > Failed actions:
> > > > resource_res1_monitor_30000 (node=node2, call=35, rc=-2): Timed Out
> > > >  
> > > > The definition of res1 is:
> > > >  
> > > > <primitive id="resource_res1" class="ocf" type="res1"
> > > > provider="heartbeat">
> > > > <operations>
> > > > <op id="34" name="monitor" interval="30s" timeout="90s"
> > > > start_delay="0s" on_fail="restart"/>
> > > > <op id="35" name="start" timeout="30s"/>
> > > > <op id="36" name="stop" timeout="30s"/>
> > > > </operations>
> > > > <instance_attributes id="resource_res1_instance_attrs">
> > > > <attributes>
> > > > <nvpair name="target_role" id="resource_res1_target_role"
> > > > value="started"/>
> > > > </attributes>
> > > > </instance_attributes>
> > > > <meta_attributes id="resource_res1_meta">
> > > > <attributes>
> > > > <nvpair name="resource_stickiness" id="resource_res1_Rs"
> > > > value="150"/>
> > > > <nvpair name="resource_failure_stickiness" id="resource_res1_FRs"
> > > > value="-100"/>
> > > > </attributes>
> > > > </meta_attributes>
> > > > </primitive>
> > > >  
> > > > Is it a mistake in the monitor method of res1?
> > >
> > > Yes, the monitor operation timed out.
> > >
> > > > I wanted to reproduce this situation, so I added sleep(100) to the
> > > > monitor method.
> > > > I received this from crm_mon:
> > > >  
> > > > Node: node2(237ceb38-a061-d99d-f4bf-944dd057ab5d): online
> > > > Node: node1 (965e45c6-19c4-241e-ff9d-4904882ef868): OFFLINE
> > > >  
> > > > RESOURCE2   (ocf::heartbeat:Resource):  Started node2
> > > >  
> > > > Failed actions:
> > > > resource_res1_monitor_0 (node=node2, call=24, rc=-2): Timed Out
> > > >  
> > > > monitor_0 and monitor_30000 are not the same monitor method, right?
> > > > What is monitor_30000?
> > >
> > > It's a repeating monitor operation (30000 is 30s represented in
> > > ms). monitor_0 is what is also called a resource probe, i.e. used
> > > when the cluster wants to establish the resource status initially.
> > >
> > > > What should I do to find where my mistake is?
> > >
> > > Take a look at your resource agent :)
> > >
> > > Thanks,
> > >
> > > Dejan
> > >
> > > > --
> > > >  
> > > > Best regards,
> > > > Ivan Gromov.
> > > > _______________________________________________
> > > > Linux-HA mailing list
> > > > [email protected]
> > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > > See also: http://linux-ha.org/ReportingProblems


