Nick Silkey <[email protected]> writes: > We have quite a few production RHEL3u5 WS hosts running OMSA > 5.1.0-354. We find pointing Trond Amundsen's check_poweredge Nagios > plugin at them eventually leads to semaphore resource exhaustion, but > only when the machine is under high load (low load == semaphores come, > do their job, and go as expected ; high load == orphaned semaphores > build up over time and eventually hit kernel limits). > > Example syslog from an affected host: > > Apr 5 12:00:53 qweqaz Server Administrator (SMIL): Data Engine > EventID: 0 A semaphore set has to be created but the system limit for > the maximum number of semaphore sets has been exceeded > > Semaphore state on an affected machine: > > -bash-2.05b$ ipcs > > ------ Shared Memory Segments -------- > key shmid owner perms bytes nattch status > 0x0001ffb8 0 root 666 76 3 > 0x00025990 32769 root 666 8308 3 > 0x00027cb9 65538 root 666 132256 1 > 0x00027cba 98307 root 666 132256 1 > 0x00027cdc 323092484 nagios 666 132256 0 > 0x00027cdd 393412613 nagios 666 132256 0 > 0x00027cde 393838598 nagios 666 132256 0 > 0x00027cdf 394231815 nagios 666 132256 0 > 0x00027ce0 413794312 nagios 666 132256 0 > 0x00027ce1 455770121 nagios 666 132256 0 > 0x00027ce2 483229706 nagios 666 132256 0 > > ------ Semaphore Arrays -------- > key semid owner perms nsems > 0x00000000 393216 root 666 1 > 0x00000000 425985 root 666 1 > 0x00000000 458754 root 666 1 > 0x00000000 491523 root 666 1 > 0x00000000 524292 root 666 1 > 0x00000000 557061 root 666 1 > 0x00000000 622599 root 666 1 > 0x00000000 655368 root 666 1 > 0x0001ffb8 688137 root 666 1 > 0x000251c0 720906 root 666 1 > 0x000255a8 753675 root 666 1 > 0x00025990 786444 root 666 1 > 0x00000000 1179661 root 666 1 > 0x000278d1 884750 root 666 1 > 0x00027cb9 917519 root 666 1 > 0x00000000 950288 root 666 1 > 0x00000000 983057 root 666 1 > 0x00000000 1015826 root 666 1 > 0x00000000 1048595 root 666 1 > 0x00000000 1081364 root 666 1 > 0x000278d2 1114133 root 666 1 > 
0x00027cba 1146902 root 666 1 > 0x000278f4 197754904 nagios 666 1 > 0x00027cdc 197787673 nagios 666 1 > 0x000278f5 378044442 nagios 666 1 > 0x00027cdd 378077211 nagios 666 1 > 0x000278f6 379322396 nagios 666 1 > 0x00027cde 379355165 nagios 666 1 > 0x000278f7 380502046 nagios 666 1 > 0x00027cdf 380534815 nagios 666 1 > 0x000278f8 430571552 nagios 666 1 > 0x00027ce0 430604321 nagios 666 1 > 0x000278f9 538050594 nagios 666 1 > 0x00027ce1 538083363 nagios 666 1 > 0x000278fa 608436260 nagios 666 1 > 0x00027ce2 608469029 nagios 666 1 > > ------ Message Queues -------- > key msqid owner perms used-bytes messages > > Kernel proc bits on affected machines: > > -bash-2.05b$ cat /proc/sys/kernel/{sem,shm{all,max,mni}} > 250 32000 32 128 > 2097152 > 33554432 > 4096 > > This is the latest version of OMSA supported under RHEL3 per Dell Support. > > We are also several point releases behind the current check_openmanage > Nagios plugin. However the changelog doesnt appear to indicate that > there is a helpful bugfix. I mention this only as we have run a local > heap of Big Brother bash mess against local OMSA on these high load > hosts for years without any semaphore problems. > > We _could_ bump kernel limits _or_ splay Nagios checking to prolong > intervals between semaphore exhaustion _or_ have something like cron > or cfengine sweep in and ungracefully blast the orphaned semaphores at > a fixed interval, but we would welcome a more elegant fix. Curious to > know if 1.) others have experienced this bug and 2.) how they have > gotten around this issue.
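For reference, the "ungraceful blast at a fixed interval" workaround mentioned above could be sketched roughly as follows. This is only a sketch under assumptions: it assumes the semaphores to reap are those owned by the `nagios` user (as in the ipcs output above), and that it runs from cron between plugin invocations; adjust the user name and scheduling for your site.

```shell
#!/bin/sh
# Hedged sketch: reap SysV semaphore arrays left behind by the
# monitoring user. WARNING: this is the "ungraceful" option -- it
# will also remove a semaphore that a currently-running check is
# still using, so schedule it away from the check interval.
SWEEP_USER=${1:-nagios}

# Show current limits for reference. The four fields are:
# semmsl (max sems per set), semmns (system-wide max sems),
# semopm (max ops per semop call), semmni (max number of sets --
# the 128 here is the limit being exhausted).
cat /proc/sys/kernel/sem

# In `ipcs -s` output, column 3 is the owner and column 2 the semid.
ipcs -s | awk -v u="$SWEEP_USER" '$3 == u { print $2 }' |
while read semid; do
    # Newer util-linux syntax; the older (RHEL3-era) form is
    # `ipcrm sem "$semid"`.
    ipcrm -s "$semid" && echo "removed semid $semid"
done
```

To buy headroom instead, the fourth field of /proc/sys/kernel/sem (semmni) can be raised, e.g. `echo "250 32000 32 1024" > /proc/sys/kernel/sem`, but as the original poster notes, that only postpones exhaustion while the leak continues.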
Hi Nick,

I've been trying to reproduce this on a 2650 running RHEL3.9 with OMSA
5.4.0. No problems with semaphores whatsoever: semaphores are created,
but they are cleaned up promptly.

I can't pinpoint whatever causes the semaphore leakage at your end, but
things are seemingly better with later RHEL3 and/or newer OMSA. My guess
is that OMSA is the culprit and that upgrading would help. OMSA 5.1.0 is
the latest version listed for RHEL3 on support.dell.com, but at least on
my 2650 later versions work OK. This is not surprising, as ESX 3.5 was
supported with OMSA up to and including 5.5.0 IIRC, and ESX 3.5 is
"based" on RHEL3. Is OMSA 5.1.0 really the latest supported version on
RHEL3? Perhaps some Dell folks can shed some light on this..

As for check_openmanage versions: your assumption is correct; there are
no recent changes that I can think of that are relevant to your problem.
In local mode, the plugin simply executes omreport commands to determine
the current status of the components you wish to monitor (usually all of
them). If omreport leaks semaphores, there isn't much that
check_openmanage can do about it.

PS. We don't monitor hardware health on our RHEL3 boxes. Luckily, RHEL3
is soon EOL :)

Cheers,

-- 
Trond H. Amundsen <[email protected]>
Center for Information Technology Services, University of Oslo

_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
