Re: Failing to synchronize 'unmonitor' actions with ongoing checks: Solaris 10 monit 5.5 possible bug

Nestor Urquiza Wed, 24 Oct 2012 05:34:49 -0700

Hi Martin,

Were you able to replicate the problem by following my steps? Should we file
a bug for this at https://savannah.nongnu.org/bugs/?func=additem&group=monit


Thanks!
-Nestor

On Wed, Sep 26, 2012 at 9:06 AM, Nestor Urquiza <[email protected]> wrote:

> Hi Martin,
>
> Thanks for the clarification on the "recovery alert" side.
>
> On the "false positive" side, the current behavior as you described it is
> probably OK: monit disables the checks one by one, without first testing
> whether any of them is in progress. When it reaches a check that is in
> progress, it waits for that check to complete and unmonitors it afterwards,
> as you said.
>
> However, the "false positive" issue still remains: under the described
> circumstances monit reports that the remote port is down when it is not
> down.
>
> Were you able to recreate the problem on your side?
>
> Best regards,
> -Nestor
>
>
> On Wed, Sep 26, 2012 at 7:49 AM, Martin Pala <[email protected]> wrote:
>
>> Hi,
>>
>> When the monitoring is disabled (unmonitor), or monit is stopped, the
>> service's error state is reset. When the monitoring is enabled (or monit
>> started) again and the service is running, no recovery alert is sent, as
>> the service monitoring starts from a clean state.
>>
>> The unmonitor is performed at the start of the service check. If, by
>> coincidence, the test of some service is in progress, monit lets the test
>> of that single service complete and doesn't interrupt the pending check.
>> When monit goes to the next service (in the same cycle) it disables the
>> monitoring. I think it's OK to let the pending test complete - it's kind
>> of a corner case with low impact.
>>
>> Regards,
>> Martin
>>
>>
>> On Sep 24, 2012, at 4:14 PM, Nestor Urquiza <[email protected]>
>> wrote:
>>
>> Hi guys,
>>
>> Not sure if this is a problem on other OSs as well, but I believe I have
>> found a bug in monit 5.5 which, at least on Solaris 10, fails to
>> synchronize unmonitor actions with ongoing checks. Here is how to recreate
>> it (tested on two different physical Solaris (Intel) boxes):
>>
>> 1. Configure monit to check every minute. Create several instances like
>> the below, checking several external ports and servers:
>>
>> check host myhost with address myhost
>>    if failed port myport type tcp with timeout 15 seconds
>>    then alert
>>
>> 2. Issue the command below right as the monit check cycle starts (when
>> the clock shows hh:mm:59):
>>
>> monit unmonitor all
>>
>> 3. Intermittently, you get an alert for at least one of the host/port
>> combinations even though the host/port is actually available. For example:
>>
>>
>> Action: alert, Description: connection failed, INET[mssql:1433] via TCP
>> is not ready for i|o -- Interrupted system call, Service: ptrsvr, Tested
>> From Host: myhost
>>
>> 4. After issuing 'monit monitor all' no alert about the service being
>> back up is sent but 'monit status' does show the service is up.
>>
>>
>> IMO monit has a bug: it does not synchronize calls to unmonitor with the
>> checks being performed. If monit receives "unmonitor all" it should wait
>> for all current checks to finish or cancel them, and ignore any alert
>> messages to be sent.
>>
>>
>> Makes sense?
>>
>>
>> Thanks!
>>
>> -Nestor
>> --
>> To unsubscribe:
>> https://lists.nongnu.org/mailman/listinfo/monit-general
>>
>>
>>
>>
>
>
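
The synchronization proposed in the original report (let an in-flight check
finish or cancel it, and drop any alert it would raise once "unmonitor all"
has arrived) can be sketched roughly as follows. This is only an illustration
in Python, not monit's actual code: the names Service, Monitor, run_cycle and
unmonitor_all are hypothetical, and the probe is a plain callable standing in
for the TCP port test.

```python
import threading


class Service:
    """Hypothetical stand-in for one 'check host' entry."""

    def __init__(self, name, probe):
        self.name = name
        self.probe = probe        # callable: True if the port answers
        self.monitored = True


class Monitor:
    """Hypothetical check loop that synchronizes unmonitor with checks."""

    def __init__(self, services):
        self.services = services
        self.alerts = []
        self._lock = threading.Lock()   # guards 'monitored' flags and alerts

    def unmonitor_all(self):
        # May be called from another thread while a probe is running.
        with self._lock:
            for service in self.services:
                service.monitored = False

    def run_cycle(self):
        for service in self.services:
            with self._lock:
                if not service.monitored:
                    continue
            # Run the probe without holding the lock so that an
            # unmonitor request is never blocked by a slow check.
            ok = service.probe()
            with self._lock:
                # Re-check the flag: if the service was unmonitored while
                # the probe was running, discard the result, no alert.
                if service.monitored and not ok:
                    self.alerts.append("connection failed: " + service.name)


def demo():
    """Simulate 'unmonitor all' arriving in the middle of a check."""
    monitor = Monitor([])

    def interrupted_probe():
        monitor.unmonitor_all()   # the command arrives mid-check
        return False              # probe fails, e.g. interrupted system call

    monitor.services = [
        Service("mssql:1433", interrupted_probe),
        Service("other:80", lambda: False),
    ]
    monitor.run_cycle()
    return monitor.alerts
```

In this sketch demo() returns an empty list: the failure seen by the
interrupted probe is discarded because the service was unmonitored before the
result was evaluated, and the second service is skipped entirely. A failing
probe on a service that is still monitored raises an alert as usual.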

