Re: Failing to synchronize 'unmonitor' actions with ongoing checks: Solaris 10 monit 5.5 possible bug

Nestor Urquiza Wed, 26 Sep 2012 06:06:31 -0700

Hi Martin,

Thanks for the clarification on the "recovery alert" side.


On the "false positive" side, the current behavior as you described it is
probably OK: monit unmonitors the checks one by one, without first checking
whether any of them is in progress. When it reaches a check that is still
running, it waits for that check to complete and unmonitors it afterwards.

However, the "false positive" issue still remains: under the described
circumstances monit reports the remote port as down when it is actually
up.
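
A rough sketch of the behavior that would avoid the false positive, written
in Python only for illustration. This is not monit's code (monit is written
in C), and every name in it (check_port, PortState, Service, run_check) is
made up. It assumes two things: a check cut short by "Interrupted system
call" is treated as inconclusive rather than as a failure, and the result of
a check is dropped if the service was unmonitored while the check was
running.

```python
import enum
import socket


class PortState(enum.Enum):
    UP = "up"
    DOWN = "down"
    INCONCLUSIVE = "inconclusive"  # check was interrupted; nothing learned


def check_port(host, port, timeout=15.0, connect=socket.create_connection):
    """Probe a TCP port, keeping 'interrupted' separate from 'failed'."""
    try:
        conn = connect((host, port), timeout=timeout)
    except InterruptedError:
        # EINTR: a signal cut the call short; the port may well be fine.
        return PortState.INCONCLUSIVE
    except OSError:
        # Refused, unreachable, timed out, and so on.
        return PortState.DOWN
    conn.close()
    return PortState.UP


class Service:
    """Hypothetical stand-in for one monitored host/port entry."""

    def __init__(self, name, host, port):
        self.name, self.host, self.port = name, host, port
        self.monitored = True
        self.alerts = []

    def unmonitor(self):
        self.monitored = False

    def run_check(self, connect=socket.create_connection):
        if not self.monitored:
            return None  # unmonitored services are skipped entirely
        state = check_port(self.host, self.port, connect=connect)
        # Re-read the flag: unmonitor may have arrived while the check ran.
        if self.monitored and state is PortState.DOWN:
            self.alerts.append("connection failed")
        return state
```

The connect parameter exists only so the probe can be replaced in a test; in
normal use run_check() would call socket.create_connection directly.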

Were you able to recreate the problem on your side?

Best regards,
-Nestor


On Wed, Sep 26, 2012 at 7:49 AM, Martin Pala <[email protected]> wrote:

> Hi,
>
> when the monitoring is disabled (unmonitor), or monit is stopped then the
> service's error state is reset. When the monitoring is enabled (or monit
> started) again and the service is running, no recovery alert is sent as the
> service monitoring starts from clean state.
>
> The unmonitor is performed at the start of the service check - if by
> coincidence the test of some service is in progress, it allows it to
> complete the test of that single service and doesn't interrupt the pending
> check. When monit goes to next service (in the same cycle) it disables the
> monitoring. I think it's OK to let the pending test complete - it's kind of
> corner case with low impact.
>
> Regards,
> Martin
>
>
> On Sep 24, 2012, at 4:14 PM, Nestor Urquiza <[email protected]>
> wrote:
>
> Hi guys,
>
> Not sure if this is a problem in other OSs as well but I believe I have
> found a bug in monit 5.5 which at least for Solaris 10 is failing to
> synchronize unmonitor actions with ongoing checks. Here is how to recreate
> it (tested on two different physical Solaris boxes, both Intel):
>
> 1. Configure monit to check every minute. Create several instances like
> the below, checking several external ports and servers:
>
> check host myhost with address myhost
>     if failed port myport type tcp with timeout 15 seconds
>     then alert
>
> 2. Issue the command below at exactly the moment monit runs its cycle (when
> the clock shows hh:mm:59):
>
> monit unmonitor all
>
> 3. At random, you get an alert for at least one of the host/port
> combinations even though the host/port is actually available. As an example:
>
>
> Action: alert, Description: connection failed, INET[mssql:1433] via TCP is
> not ready for i|o -- Interrupted system call, Service: ptrsvr, Tested From
> Host: myhost
>
> 4. After issuing 'monit monitor all' no alert about the service being back
> up is sent but 'monit status' does show the service is up.
>
>
> IMO monit has a bug where basically it does not synchronize the calls to
> unmonitor and the checks to be performed. If monit receives "unmonitor all"
> it should: (wait for all current checks to finish OR cancel them AND ignore
> any alert messages to be sent).
>
>
> Makes sense?
>
>
> Thanks!
>
> -Nestor
> --
> To unsubscribe:
> https://lists.nongnu.org/mailman/listinfo/monit-general

