[EMAIL PROTECTED] wrote:
> Hey Fellow Nagios-ites:
>
> I've been having this *exact* same segfault problem for the last couple o'
> months.
>
> And, after looking at David's stack trace output, it is segfaulting for
> him in the exact same way/place as it is for me.
>
> Here's what I've found:
>
> The core dumps that I've examined are all segfaulting when handling the
> expiration of a scheduled downtime.
>
> Since David's stack trace looks identical to mine, I don't think it is in
> the external command processing, as he believes, but it is in the downtime
> expiration handling, as well.
>
> Having examined about a dozen of these identical core dumps, I see that it
> is a corruption of the entire scheduled_downtime structure that is being
> passed into the handle_scheduled_downtime() function.
>
> The handle_scheduled_downtime() function is being invoked by the high
> priority event processing logic in the event_execution_loop(). So it
> pulls an EVENT_SCHEDULED_DOWNTIME timed_event structure off of the high
> priority event list, and then hands it to handle_timed_event(), which in
> turn invokes the handle_scheduled_downtime() routine to handle the
> expiration of the specified downtime event.
>
> The problem is, the scheduled_downtime structure is already corrupted
> while sitting in the high_priority list - well before it is dequeued by
> the event_execution_loop() logic.
>
> I've walked the high priority list in memory with gdb to examine other
> timed_event structures and have noticed that only the scheduled_downtime
> structures associated with EVENT_SCHEDULED_DOWNTIME timed events are
> affected by the memory corruption. In fact, one time, I found nine
> scheduled downtime expiration events sequentially listed in the high
> priority list and the first three had their scheduled_downtime structures
> corrupted and the remaining six were in pristine condition.
>
>
> So, I've narrowed it down to a couple of possibilities (feel free to add
> your own!):
>
> 1. The scheduled_downtime structure is already corrupted when it is being
> added to the high priority timed event scheduling list, or
>
>
> 2. The scheduled_downtime structure is OK when it is added to the high
> priority list, but perhaps a bad pointer access is overwriting it with
> garbage at some other point in the program. This might be somewhat
> painful to track down.
>
>
> Of the two, I suspect that the second one is the more likely candidate.
>

I think the first, as it only happens with scheduled downtime stuff.
Otherwise you'd see it on other high-prio events as well (unless you're
extremely unlucky each time the crash happens).

>
> Some other notes:
>
> 1. The timed event expirations that segfault Nagios seem to be "randomly"
> chosen.
>
> We have some regularly submitted (via cron) scheduled downtimes that will
> work fine for weeks, and then one of them will come up for expiration and
> trigger this scheduled-downtime-expiration bug. I've also seen it happen
> with ad-hoc scheduled downtime submissions via the CGI interface.
>
> I've seen it happen with "regular" scheduled downtimes as well as the new
> "triggered" scheduled downtime. We thought it might have been related to
> the new triggered downtime, since that was one of the first events causing
> a segfault. But then after eliminating the use of triggered downtimes
> altogether, the segfaults still occur with the regular scheduled downtime
> expirations.
>
> 2. I've had this problem with Nagios 2.4, 2.5 and 2.6. So, "upgrading"
> hasn't gotten rid of it.
>
> 3. We are currently running Nagios 2.6 on a 64-bit Linux platform: SLES-9
> x86-64, Kernel 2.6.5-7.267-smp
>

This is the culprit, I guess. As this isn't a widespread problem, I
wouldn't be surprised if it's related to 64-bit archs (kernel-2.6.5 is
fairly ancient too, but that shouldn't matter as this is the only app
you're seeing it in).

I'm guessing this actually is an SMP system and that SuSE doesn't install
SMP kernels on all systems, correct? If so, this could also be a source of
problems for you. Nagios doesn't follow the pthread guidelines very closely
and does some pretty inappropriate things post-fork() for a threaded
application. This could be one of those problems that doesn't happen on
single-cpu systems because the only cpu doesn't have anything to compete
with when racing for the memory.

> 4. We don't have any other segfault problems with other apps on this
> system.
>
>
> So I'm still trying to figure out *what* is overwriting the
> scheduled_downtime structures with garbage in memory.
>
> Any ideas, based upon this additional information?
>

Upgrade glibc and the kernel and pray. Other than that, I guess running
it in valgrind and/or gdb for a long period of time, or chucking assert()'s
and printf()'s at the Nagios code and seeing where it breaks, is the only
solution.

btw, thanks for the nicely detailed problem report.

--
Andreas Ericsson                   [EMAIL PROTECTED]
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

_______________________________________________
Nagios-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nagios-users
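
A rough sketch of the assert() approach mentioned above, in case it helps.
It is not Nagios code: demo_downtime is a hypothetical stand-in for the real
scheduled_downtime structure, and DT_MAGIC, the guard fields and the demo_*
functions are all made up for illustration. The idea is to bracket the
record with known guard values and run the same sanity check when the event
is queued and again when it is dequeued. A failure at queue time would point
at possibility 1; a failure only at dequeue time would point at possibility 2.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define DT_MAGIC 0x44544d45UL   /* arbitrary guard value (assumption) */

/* Hypothetical stand-in, NOT the real Nagios scheduled_downtime layout. */
typedef struct demo_downtime {
    unsigned long guard_head;   /* must always equal DT_MAGIC */
    char host_name[64];
    time_t start_time;
    time_t end_time;
    unsigned long downtime_id;
    unsigned long guard_tail;   /* must always equal DT_MAGIC */
} demo_downtime;

/* Fill in a record and arm both guards. */
void demo_downtime_init(demo_downtime *dt, const char *host,
                        time_t start, time_t end, unsigned long id)
{
    memset(dt, 0, sizeof(*dt));
    dt->guard_head = DT_MAGIC;
    dt->guard_tail = DT_MAGIC;
    snprintf(dt->host_name, sizeof(dt->host_name), "%s", host);
    dt->start_time = start;
    dt->end_time = end;
    dt->downtime_id = id;
}

/* Return 1 if the record still looks sane, 0 otherwise. */
int demo_downtime_ok(const demo_downtime *dt)
{
    if (dt == NULL)
        return 0;
    if (dt->guard_head != DT_MAGIC || dt->guard_tail != DT_MAGIC)
        return 0;
    if (dt->end_time < dt->start_time)
        return 0;
    /* host_name must still be NUL-terminated inside its buffer */
    if (memchr(dt->host_name, '\0', sizeof(dt->host_name)) == NULL)
        return 0;
    return 1;
}

/* Log where the check failed, then abort so a core dump is left behind. */
void demo_downtime_check(const demo_downtime *dt, const char *where)
{
    if (!demo_downtime_ok(dt))
        fprintf(stderr, "downtime record corrupt at %s\n", where);
    assert(demo_downtime_ok(dt));
}
```

Calling demo_downtime_check(dt, "enqueue") where the event is added to the
list and demo_downtime_check(dt, "dequeue") just before it is handled
narrows down the window in which the damage happens. Guards only catch
overwrites that actually touch them or the checked fields, so a clean run
doesn't prove the record is intact.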
