I've been using OMSA 5.5 on RHEL3 for quite some time now, and have no issues with Trond's Nagios plugin.
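If anyone does end up resorting to the "sweep the orphans from cron" workaround Nick mentions below, a minimal sketch could look like the following. The owner name and the exact ipcrm syntax are assumptions to adapt per host; note also that this removes *all* of that user's semaphore arrays, live or orphaned, so it should only run when no check is in flight:

```shell
#!/bin/sh
# Sketch of a cron-able sweep for leaked SysV semaphore arrays.
# Assumption: the orphaned arrays are owned by the Nagios user, as in
# the ipcs output quoted below; adjust the owner for your environment.

sweep_semaphores() {
    owner="$1"
    # ipcs -s data columns: key, semid, owner, perms, nsems
    ipcs -s | awk -v o="$owner" '$3 == o { print $2 }' |
    while read -r semid; do
        # Modern util-linux takes "ipcrm -s <semid>"; RHEL3-era ipcrm
        # uses the older "ipcrm sem <semid>" form instead.
        ipcrm -s "$semid"
    done
}

# Destructive - run only on an affected host, e.g.:
# sweep_semaphores nagios
```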
-Finn Andersen

On Wed, Apr 7, 2010 at 5:51 PM, Trond Hasle Amundsen <[email protected]> wrote:
> Nick Silkey <[email protected]> writes:
>
>> We have quite a few production RHEL3u5 WS hosts running OMSA
>> 5.1.0-354. We find pointing Trond Amundsen's check_poweredge Nagios
>> plugin at them eventually leads to semaphore resource exhaustion, but
>> only when the machine is under high load (low load == semaphores come,
>> do their job, and go as expected; high load == orphaned semaphores
>> build up over time and eventually hit kernel limits).
>>
>> Example syslog from an affected host:
>>
>> Apr 5 12:00:53 qweqaz Server Administrator (SMIL): Data Engine
>> EventID: 0 A semaphore set has to be created but the system limit for
>> the maximum number of semaphore sets has been exceeded
>>
>> Semaphore state on an affected machine:
>>
>> -bash-2.05b$ ipcs
>>
>> ------ Shared Memory Segments --------
>> key        shmid      owner    perms    bytes    nattch   status
>> 0x0001ffb8 0          root     666      76       3
>> 0x00025990 32769      root     666      8308     3
>> 0x00027cb9 65538      root     666      132256   1
>> 0x00027cba 98307      root     666      132256   1
>> 0x00027cdc 323092484  nagios   666      132256   0
>> 0x00027cdd 393412613  nagios   666      132256   0
>> 0x00027cde 393838598  nagios   666      132256   0
>> 0x00027cdf 394231815  nagios   666      132256   0
>> 0x00027ce0 413794312  nagios   666      132256   0
>> 0x00027ce1 455770121  nagios   666      132256   0
>> 0x00027ce2 483229706  nagios   666      132256   0
>>
>> ------ Semaphore Arrays --------
>> key        semid      owner    perms    nsems
>> 0x00000000 393216     root     666      1
>> 0x00000000 425985     root     666      1
>> 0x00000000 458754     root     666      1
>> 0x00000000 491523     root     666      1
>> 0x00000000 524292     root     666      1
>> 0x00000000 557061     root     666      1
>> 0x00000000 622599     root     666      1
>> 0x00000000 655368     root     666      1
>> 0x0001ffb8 688137     root     666      1
>> 0x000251c0 720906     root     666      1
>> 0x000255a8 753675     root     666      1
>> 0x00025990 786444     root     666      1
>> 0x00000000 1179661    root     666      1
>> 0x000278d1 884750     root     666      1
>> 0x00027cb9 917519     root     666      1
>> 0x00000000 950288     root     666      1
>> 0x00000000 983057     root     666      1
>> 0x00000000 1015826    root     666      1
>> 0x00000000 1048595    root     666      1
>> 0x00000000 1081364    root     666      1
>> 0x000278d2 1114133    root     666      1
>> 0x00027cba 1146902    root     666      1
>> 0x000278f4 197754904  nagios   666      1
>> 0x00027cdc 197787673  nagios   666      1
>> 0x000278f5 378044442  nagios   666      1
>> 0x00027cdd 378077211  nagios   666      1
>> 0x000278f6 379322396  nagios   666      1
>> 0x00027cde 379355165  nagios   666      1
>> 0x000278f7 380502046  nagios   666      1
>> 0x00027cdf 380534815  nagios   666      1
>> 0x000278f8 430571552  nagios   666      1
>> 0x00027ce0 430604321  nagios   666      1
>> 0x000278f9 538050594  nagios   666      1
>> 0x00027ce1 538083363  nagios   666      1
>> 0x000278fa 608436260  nagios   666      1
>> 0x00027ce2 608469029  nagios   666      1
>>
>> ------ Message Queues --------
>> key        msqid      owner    perms    used-bytes   messages
>>
>> Kernel proc bits on affected machines:
>>
>> -bash-2.05b$ cat /proc/sys/kernel/{sem,shm{all,max,mni}}
>> 250 32000 32 128
>> 2097152
>> 33554432
>> 4096
>>
>> This is the latest version of OMSA supported under RHEL3 per Dell Support.
>>
>> We are also several point releases behind the current check_openmanage
>> Nagios plugin. However, the changelog doesn't appear to indicate that
>> there is a helpful bugfix. I mention this only because we have run a
>> local heap of Big Brother bash mess against local OMSA on these
>> high-load hosts for years without any semaphore problems.
>>
>> We _could_ bump kernel limits _or_ splay Nagios checking to prolong
>> intervals between semaphore exhaustion _or_ have something like cron
>> or cfengine sweep in and ungracefully blast the orphaned semaphores at
>> a fixed interval, but we would welcome a more elegant fix. Curious to
>> know if 1.) others have experienced this bug and 2.) how they have
>> gotten around this issue.
>
> Hi Nick,
>
> I've been trying to reproduce this on a 2650 running RHEL 3.9 with OMSA
> 5.4.0. No problems with semaphores whatsoever. Semaphores are created
> but they are cleaned up promptly.
> I can't pinpoint whatever causes the semaphore leakage at your end, but
> things are seemingly better with later RHEL3 and/or newer OMSA. My guess
> is that OMSA is the culprit and that upgrading would help.
>
> OMSA 5.1.0 is the latest version listed for RHEL3 on support.dell.com,
> but at least on my 2650 later versions work OK. This is not surprising,
> as ESX 3.5 was supported with OMSA up to and including 5.5.0 IIRC, and
> ESX 3.5 is "based" on RHEL3.
>
> Is OMSA 5.1.0 really the latest supported version on RHEL3? Perhaps some
> Dell folks can shed some light on this...
>
> As for check_openmanage versions... your assumptions are correct: there
> are no recent changes that I can think of that are relevant to your
> problem. In local mode, the plugin simply executes omreport commands to
> determine the current status of the components you wish to monitor
> (usually all of them). If omreport leaks semaphores, there isn't much
> that check_openmanage can do about it.
>
> PS. We don't monitor hardware health on our RHEL3 boxes. Luckily, RHEL3
> is soon EOL :)
>
> Cheers,
> --
> Trond H. Amundsen <[email protected]>
> Center for Information Technology Services, University of Oslo
>
> _______________________________________________
> Linux-PowerEdge mailing list
> [email protected]
> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq
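A footnote on the "bump kernel limits" option from Nick's list: the four numbers in his `/proc/sys/kernel/sem` output are SEMMSL, SEMMNS, SEMOPM and SEMMNI, and it is the last one (the maximum number of semaphore sets, 128 on his hosts) that the quoted syslog message says is being exhausted. A sketch of inspecting and raising it follows; the new values are illustrative only, not a recommendation:

```shell
#!/bin/sh
# kernel.sem fields: SEMMSL SEMMNS SEMOPM SEMMNI
#   SEMMSL - max semaphores per array
#   SEMMNS - max semaphores system-wide
#   SEMOPM - max operations per semop() call
#   SEMMNI - max number of semaphore arrays (the limit being hit)
read semmsl semmns semopm semmni < /proc/sys/kernel/sem
echo "SEMMNI (max semaphore sets): $semmni"

# Raising the limits needs root; the values below are illustrative
# (SEMMNS kept at SEMMSL * SEMMNI). Uncomment to apply:
# echo "250 64000 32 256" > /proc/sys/kernel/sem
# ...or persist it across reboots via /etc/sysctl.conf:
# kernel.sem = 250 64000 32 256
```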
