We have quite a few production RHEL3u5 WS hosts running OMSA
5.1.0-354. We find that pointing Trond Amundsen's check_openmanage
Nagios plugin at them eventually leads to semaphore resource
exhaustion, but only when the machine is under high load (low load ==
semaphores come, do their job, and go as expected; high load ==
orphaned semaphores build up over time and eventually hit kernel limits).
Example syslog from an affected host:
Apr 5 12:00:53 qweqaz Server Administrator (SMIL): Data Engine
EventID: 0 A semaphore set has to be created but the system limit for
the maximum number of semaphore sets has been exceeded
Semaphore state on an affected machine:
-bash-2.05b$ ipcs
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x0001ffb8 0 root 666 76 3
0x00025990 32769 root 666 8308 3
0x00027cb9 65538 root 666 132256 1
0x00027cba 98307 root 666 132256 1
0x00027cdc 323092484 nagios 666 132256 0
0x00027cdd 393412613 nagios 666 132256 0
0x00027cde 393838598 nagios 666 132256 0
0x00027cdf 394231815 nagios 666 132256 0
0x00027ce0 413794312 nagios 666 132256 0
0x00027ce1 455770121 nagios 666 132256 0
0x00027ce2 483229706 nagios 666 132256 0
------ Semaphore Arrays --------
key semid owner perms nsems
0x00000000 393216 root 666 1
0x00000000 425985 root 666 1
0x00000000 458754 root 666 1
0x00000000 491523 root 666 1
0x00000000 524292 root 666 1
0x00000000 557061 root 666 1
0x00000000 622599 root 666 1
0x00000000 655368 root 666 1
0x0001ffb8 688137 root 666 1
0x000251c0 720906 root 666 1
0x000255a8 753675 root 666 1
0x00025990 786444 root 666 1
0x00000000 1179661 root 666 1
0x000278d1 884750 root 666 1
0x00027cb9 917519 root 666 1
0x00000000 950288 root 666 1
0x00000000 983057 root 666 1
0x00000000 1015826 root 666 1
0x00000000 1048595 root 666 1
0x00000000 1081364 root 666 1
0x000278d2 1114133 root 666 1
0x00027cba 1146902 root 666 1
0x000278f4 197754904 nagios 666 1
0x00027cdc 197787673 nagios 666 1
0x000278f5 378044442 nagios 666 1
0x00027cdd 378077211 nagios 666 1
0x000278f6 379322396 nagios 666 1
0x00027cde 379355165 nagios 666 1
0x000278f7 380502046 nagios 666 1
0x00027cdf 380534815 nagios 666 1
0x000278f8 430571552 nagios 666 1
0x00027ce0 430604321 nagios 666 1
0x000278f9 538050594 nagios 666 1
0x00027ce1 538083363 nagios 666 1
0x000278fa 608436260 nagios 666 1
0x00027ce2 608469029 nagios 666 1
------ Message Queues --------
key msqid owner perms used-bytes messages
Kernel proc bits on affected machines:
-bash-2.05b$ cat /proc/sys/kernel/{sem,shm{all,max,mni}}
250 32000 32 128
2097152
33554432
4096
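For reference, the four values in /proc/sys/kernel/sem are SEMMSL (max
semaphores per set), SEMMNS (max semaphores system-wide), SEMOPM (max
operations per semop call), and SEMMNI (max semaphore sets), so the Data
Engine error above means we are hitting the SEMMNI ceiling of 128 sets.
If we do resort to bumping limits, the stopgap would be something like
the following (the 256 is illustrative, not a tuned recommendation):

```shell
# Double SEMMNI (max number of semaphore sets); the other three fields
# (SEMMSL, SEMMNS, SEMOPM) are left at their current values.
# Runtime change only -- lost on reboot:
echo "250 32000 32 256" > /proc/sys/kernel/sem

# Persistent equivalent, added to /etc/sysctl.conf:
#   kernel.sem = 250 32000 32 256
```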
This is the latest version of OMSA supported under RHEL3 per Dell Support.
We are also several point releases behind the current check_openmanage
Nagios plugin; however, the changelog doesn't appear to contain a
relevant bugfix. I mention this only because we have run a heap of
local Big Brother bash mess against OMSA on these same high-load hosts
for years without any semaphore problems.
We _could_ bump kernel limits _or_ splay the Nagios checks to prolong
the intervals between semaphore exhaustion _or_ have something like cron
or cfengine sweep in and ungracefully blast the orphaned semaphores at
a fixed interval, but we would welcome a more elegant fix. Curious to
know 1.) whether others have experienced this bug and 2.) how they have
gotten around it.
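For what it's worth, the cron/cfengine sweep would look roughly like the
sketch below. The user name and the `ipcrm -s` form are assumptions --
the older util-linux shipped with RHEL3 wants `ipcrm sem <id>` instead
-- and blasting a semaphore a check is actively holding is obviously
unsafe, which is exactly why we consider it ungraceful:

```shell
#!/bin/bash
# Crude sweep of SysV semaphore arrays owned by the Nagios user.
TARGET_USER="nagios"

# Pull semaphore IDs for TARGET_USER out of `ipcs -s` output
# (columns: key semid owner perms nsems).
list_semids() {
    awk -v u="$TARGET_USER" '$3 == u { print $2 }'
}

# Ungracefully remove each matching array. On RHEL3's util-linux this
# would be `ipcrm sem "$semid"` rather than the newer `ipcrm -s` form.
ipcs -s | list_semids | while read -r semid; do
    ipcrm -s "$semid"
done
```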
Input appreciated. Thanks.
--
Nick Silkey
_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq