This is what I see in nagios.debug:

[1274384366.242081] [008.1] [pid=4007] Next High Priority Event Time: Thu May 20 15:39:27 2010
[1274384366.242093] [008.1] [pid=4007] Next Low Priority Event Time: Thu May 20 15:41:31 2010
[1274384366.242099] [008.1] [pid=4007] Current/Max Service Checks: 0/50
[1274384366.242107] [001.0] [pid=4007] check_for_external_commands()
[1274384366.242114] [001.0] [pid=4007] process_external_command1()
[1274384366.242195] [064.1] [pid=4007] Making callbacks (type 24)...
[1274384366.242205] [001.0] [pid=4007] process_external_command2()
[1274384366.242210] [128.1] [pid=4007] External Command Type: 301
[1274384366.242215] [128.1] [pid=4007] Command Entry Time: 1274384366
[1274384366.242219] [128.1] [pid=4007] Command Arguments: hostgroup_name
Nothing unusual, right?

Rafael

On Thu, May 20, 2010 at 3:35 PM, Rafael Carneiro
<rafael.carne...@gmail.com> wrote:

> It's a distributed environment, where everything but 20 boxes is
> monitored by slaves (2 clusters of 2 slaves, about 600 hosts being
> monitored).
>
> I seem to be able to replicate it by scheduling and then deleting
> downtime for a host group.
>
> I've changed debug_level=-1 and am still only able to see this in
> nagios.log before it crashes:
>
>   [1274383762] EXTERNAL COMMAND: DEL_HOSTGROUP_SVC_DOWNTIME;hostgroup_name
>
> I had core dumps enabled, but I don't know where to look for them (and
> I'm not sure they're being created).
>
> Rafael
>
> On Thu, May 20, 2010 at 3:10 PM, Ton Voon <ton.v...@opsera.com> wrote:
>
>> On 20 May 2010, at 18:50, Rafael Carneiro wrote:
>>
>>> Since upgrading to 3.7 (on Ubuntu 10.04 x86_64, Build: 3.7.0.4272)
>>> I've seen it happen a couple of times.
>>> It seems that nagios crashes after you cancel the downtime (this is
>>> the last line in nagios.log after the crash:
>>> [1274376186] EXTERNAL COMMAND: DEL_HOST_SVC_DOWNTIME;CVH-VMS-001).
>>> I remember doing the same thing from the Opsview interface the time
>>> before it crashed, so I believe DEL_HOST_SVC_DOWNTIME is causing it.
>>
>> I've just done a test on etch, sol10, lenny and rhel5: I can set
>> downtime for a host group and then cancel it without the daemon dying.
>>
>> I've also done a current downtime plus a future downtime on Ubuntu
>> lucid, and that is okay too.
>>
>> Is this on a distributed master or slave?
>>
>>> Anyone else seeing anything like that? Where could I look for clues?
>>
>> Is there a core dump file? Enable nagios core dumps in System
>> Preferences. You may want to increase nagios debugging in nagios.cfg.
>>
>> An strace on the process while you deliver the downtime might give
>> some clues too.
>>
>>> Were there any changes to the way that's handled by Opsview?
>>
>> There were additions to downtime handling in 3.5.2, but nothing
>> specifically springs to mind for 3.7.0. We did upgrade Nagios to
>> 3.2.1, but I don't think there was anything particular there.
>>
>> Ton
>
> --
> Rafael Carneiro
> http://ca.linkedin.com/in/rcarneiro

--
Rafael Carneiro
http://ca.linkedin.com/in/rcarneiro
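The debug trace above cuts off while the daemon is processing external
command type 301, which nagios.log records as DEL_HOSTGROUP_SVC_DOWNTIME.
For anyone trying to narrow this down, here is a minimal sketch of driving
the same two commands through the external command pipe by hand (the
command-file path and the author/comment fields are assumptions for a
stock source install; Opsview may keep the command file elsewhere):

    # Assumed command-file location for a stock source install.
    CMDFILE=/usr/local/nagios/var/rw/nagios.cmd
    NOW=$(date +%s)

    # Schedule 10 minutes of downtime for every service in the host group.
    # "hostgroup_name" is the placeholder group name from the log above.
    printf '[%s] SCHEDULE_HOSTGROUP_SVC_DOWNTIME;hostgroup_name;%s;%s;1;0;600;tester;repro\n' \
        "$NOW" "$NOW" "$((NOW + 600))" > "$CMDFILE"

    # ...then cancel it, the step that appears to kill the daemon here.
    printf '[%s] DEL_HOSTGROUP_SVC_DOWNTIME;hostgroup_name\n' "$NOW" > "$CMDFILE"

If the daemon survives that on a plain Nagios box, the crash is more
likely specific to the Opsview build or the distributed setup than to
the command itself.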
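On the core-dump and strace suggestions, a rough recipe for capturing the
crash (the lock-file and binary paths are assumptions based on a default
source install; adjust for Opsview's layout):

    # Allow the daemon to dump core: run this in the shell or init
    # script that starts nagios, before starting the daemon.
    ulimit -c unlimited

    # Core files normally land in the daemon's working directory; the
    # kernel pattern controls where they go and how they are named.
    cat /proc/sys/kernel/core_pattern

    # Attach strace while cancelling the downtime; the last syscalls
    # before the process dies are usually the best clue.
    strace -f -tt -o /tmp/nagios.strace -p "$(cat /usr/local/nagios/var/nagios.lock)"

    # Once a core file appears, get a backtrace from it
    # (replace /path/to/core with the actual file).
    gdb --batch -ex bt /usr/local/nagios/bin/nagios /path/to/core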