This is one of those bugs that's not easy to reproduce. It happened to me this morning, but I wasn't running strace at the time, and when I tried to add and remove the downtime again it didn't crash Nagios. These crashes also do not create a core dump in /usr/local/nagios/etc.
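For anyone else trying to catch this: the usual way to make sure a crash leaves evidence is to enable core dumps in the shell that (re)starts Nagios and to attach strace to the running daemon. A minimal sketch, assuming a default source install where the daemon binary is named `nagios` (adjust the paths and the PID lookup for your system):

```shell
# Show whether core dumps are enabled in this shell; "0" means disabled.
ulimit -c

# Allow unlimited-size core files for processes started from this shell.
# (May fail if a hard limit of 0 is set, hence the fallback message.)
ulimit -c unlimited 2>/dev/null || echo "could not raise core file limit"

# On Linux, this pattern controls where and how the kernel names core files.
cat /proc/sys/kernel/core_pattern 2>/dev/null || true

# Attach strace to the running daemon so the next crash is captured.
# -f follows forked children, -tt adds timestamps, -o writes to a file.
NAGIOS_PID=$(pidof nagios || true)   # empty if the daemon is not running
if [ -n "$NAGIOS_PID" ] && command -v strace >/dev/null; then
    strace -f -tt -o /tmp/nagios.strace -p "$NAGIOS_PID"
else
    echo "nagios is not running (or strace is not installed)"
fi
```

Note that ulimit only affects processes started from that shell, so the daemon has to be restarted from it (or the init script changed) for the new core limit to apply.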
I left strace running now, so the next time it happens I'll catch it.

Ton, if you still think it may be helpful to look at our setup, we can arrange something. One thing I didn't mention is that we recently migrated from etch 32-bit, and it never happened there. That might be an indication that this problem only happens on 64-bit installs. Our database was already running on a separate box, so I don't think that is the problem.

@u...@bgc.se: are you running a 32 or 64 bit install? Did you migrate between architectures?

Rafael

On Fri, May 21, 2010 at 8:08 AM, Ton Voon <ton.v...@opsera.com> wrote:
> I've raised https://secure.opsera.com/jira/browse/OPS-1165 for this problem.
>
> I would appreciate it if we could have more information about this particular problem, so we can look into why it is happening.
>
> If you can consistently reproduce it, can we have access to your system?
>
> Ton
>
> On 21 May 2010, at 12:34, unix wrote:
>> We have had this problem in versions 3.0, 3.1 and now 3.5.2.
>> We are running RHEL 5.3 in a distributed environment: a clustered master and 2 slaves.
>> The cluster service always starts Opsview again after it has crashed, so our only problem is that a lot of service results are stale when it happens.
>> About 10-20% of our downtime cancellations crash Opsview.
>> We also had a single Opsview server, and on that server Opsview never crashed.
>> But of course it works flawlessly at the moment. If we can find a way to provoke it so it happens every time, we will trace it.
>>
>> On 2010-05-20 19:56, Ton Voon wrote:
>>> On 20 May 2010, at 20:35, Rafael Carneiro wrote:
>>>> It's a distributed environment, where everything but 20 boxes is monitored by slaves (2 clusters of 2 slaves, about 600 hosts being monitored).
>>>>
>>>> I seem to be able to replicate it by scheduling and then deleting downtime for a host group.
>>>>
>>>> I've changed debug_level=-1 and am still only able to see this in the nagios.log before it crashes:
>>>> [1274383762] EXTERNAL COMMAND: DEL_HOSTGROUP_SVC_DOWNTIME;hostgroup_name
>>>>
>>>> I had core dumps enabled, but don't know where to look for them (not sure if they're being created).
>>>
>>> They should be created in the /usr/local/nagios/etc directory.
>>>
>>> An strace would be helpful.
>>>
>>> Ton

--
Rafael Carneiro
http://ca.linkedin.com/in/rcarneiro
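For context on that log line: Nagios receives commands such as DEL_HOSTGROUP_SVC_DOWNTIME through its external command file, a named pipe where each line is a bracketed Unix timestamp followed by the command name and its semicolon-separated arguments. A minimal sketch of building such a line by hand, which could help reproduce the crash on demand (the pipe path and the hostgroup name `my_hostgroup` are assumptions; check the command_file directive in your nagios.cfg):

```shell
# Default location of the external command pipe on a source install;
# confirm with the command_file directive in nagios.cfg.
CMD_FILE=/usr/local/nagios/var/rw/nagios.cmd

# External commands have the form "[timestamp] COMMAND;arg1;arg2;..."
NOW=$(date +%s)
CMD_LINE=$(printf '[%s] DEL_HOSTGROUP_SVC_DOWNTIME;my_hostgroup' "$NOW")
echo "$CMD_LINE"

# To actually submit it, write the line into the pipe (requires a running
# Nagios and write permission on the pipe):
# echo "$CMD_LINE" > "$CMD_FILE"
```

Replaying the exact command from the log this way, while strace is attached, would be one way to trigger and trace the crash in a controlled test.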
_______________________________________________
Opsview-users mailing list
Opsview-users@lists.opsview.org
http://lists.opsview.org/lists/listinfo/opsview-users