This is what I see in nagios.debug:

[1274384366.242081] [008.1] [pid=4007] Next High Priority Event Time: Thu May 20 15:39:27 2010
[1274384366.242093] [008.1] [pid=4007] Next Low Priority Event Time: Thu May 20 15:41:31 2010
[1274384366.242099] [008.1] [pid=4007] Current/Max Service Checks: 0/50
[1274384366.242107] [001.0] [pid=4007] check_for_external_commands()
[1274384366.242114] [001.0] [pid=4007] process_external_command1()
[1274384366.242195] [064.1] [pid=4007] Making callbacks (type 24)...
[1274384366.242205] [001.0] [pid=4007] process_external_command2()
[1274384366.242210] [128.1] [pid=4007] External Command Type: 301
[1274384366.242215] [128.1] [pid=4007] Command Entry Time: 1274384366
[1274384366.242219] [128.1] [pid=4007] Command Arguments: hostgroup_name
Nothing unusual, right?

Rafael

On Thu, May 20, 2010 at 3:35 PM, Rafael Carneiro
<rafael.carne...@gmail.com> wrote:

> It's a distributed environment, where everything but 20 boxes is
> monitored by slaves (2 clusters of 2 slaves, about 600 hosts being
> monitored).
>
> I seem to be able to replicate it by scheduling and then deleting
> downtime for a host group.
>
> I've changed debug_level=-1 and am still only able to see this in
> nagios.log before it crashes:
>
>   [1274383762] EXTERNAL COMMAND: DEL_HOSTGROUP_SVC_DOWNTIME;hostgroup_name
>
> I had core dumps enabled, but I don't know where to look for them (and
> I'm not sure they're being created).
>
> Rafael
>
> On Thu, May 20, 2010 at 3:10 PM, Ton Voon <ton.v...@opsera.com> wrote:
>
>> On 20 May 2010, at 18:50, Rafael Carneiro wrote:
>>
>>> Since upgrading to 3.7 (on Ubuntu 10.04 x86_64, Build: 3.7.0.4272)
>>> I've seen it happen a couple of times.
>>> It seems that nagios crashes after you cancel the downtime (this is
>>> the last line in nagios.log after the crash:
>>> [1274376186] EXTERNAL COMMAND: DEL_HOST_SVC_DOWNTIME;CVH-VMS-001).
>>> I remember doing the same thing from the Opsview interface the time
>>> before it crashed, so I believe DEL_HOST_SVC_DOWNTIME is causing it.
>>
>> I've just done a test on etch, sol10, lenny and rhel5: I can set
>> downtime for a host group and then cancel it without the daemon dying.
>>
>> I've also done a current downtime plus a future downtime on Ubuntu
>> lucid, and that is okay too.
>>
>> Is this on a distributed master or slave?
>>
>>> Anyone else seeing anything like that? Where could I look for clues?
>>
>> Is there a core dump file? Enable nagios core dumps in System
>> Preferences. You may want to increase nagios debugging in nagios.cfg.
>>
>> An strace on the process while you deliver the downtime might give
>> some clues too.
>>
>>> Were there any changes to the way that's handled by Opsview?
>>
>> There were additions to downtime handling in 3.5.2, but nothing
>> specifically springs to mind for 3.7.0. We did upgrade Nagios to
>> 3.2.1, but I don't think there was anything particular there.
>>
>> Ton
>
> --
> Rafael Carneiro
> http://ca.linkedin.com/in/rcarneiro

--
Rafael Carneiro
http://ca.linkedin.com/in/rcarneiro
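The debug trace above cuts off while the daemon is processing external
command type 301, which nagios.log records as DEL_HOSTGROUP_SVC_DOWNTIME.
For anyone trying to narrow this down, here is a minimal sketch of driving
the same two commands through the external command pipe by hand (the
command-file path and the author/comment fields are assumptions for a
stock source install; Opsview may keep the command file elsewhere):

    # Assumed command-file location for a stock source install.
    CMDFILE=/usr/local/nagios/var/rw/nagios.cmd
    NOW=$(date +%s)

    # Schedule 10 minutes of downtime for every service in the host group.
    # "hostgroup_name" is the placeholder group name from the log above.
    printf '[%s] SCHEDULE_HOSTGROUP_SVC_DOWNTIME;hostgroup_name;%s;%s;1;0;600;tester;repro\n' \
        "$NOW" "$NOW" "$((NOW + 600))" > "$CMDFILE"

    # ...then cancel it, the step that appears to kill the daemon here.
    printf '[%s] DEL_HOSTGROUP_SVC_DOWNTIME;hostgroup_name\n' "$NOW" > "$CMDFILE"

If the daemon survives that on a plain Nagios box, the crash is more
likely specific to the Opsview build or the distributed setup than to
the command itself.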
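On the core-dump and strace suggestions, a rough recipe for capturing the
crash (the lock-file and binary paths are assumptions based on a default
source install; adjust for Opsview's layout):

    # Allow the daemon to dump core: run this in the shell or init
    # script that starts nagios, before starting the daemon.
    ulimit -c unlimited

    # Core files normally land in the daemon's working directory; the
    # kernel pattern controls where they go and how they are named.
    cat /proc/sys/kernel/core_pattern

    # Attach strace while cancelling the downtime; the last syscalls
    # before the process dies are usually the best clue.
    strace -f -tt -o /tmp/nagios.strace -p "$(cat /usr/local/nagios/var/nagios.lock)"

    # Once a core file appears, get a backtrace from it
    # (replace /path/to/core with the actual file).
    gdb --batch -ex bt /usr/local/nagios/bin/nagios /path/to/core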