We have quite a few production RHEL3u5 WS hosts running OMSA
5.1.0-354. We find that pointing Trond Amundsen's check_openmanage
Nagios plugin at them eventually leads to semaphore resource
exhaustion, but only when the machine is under high load (low load ==
semaphores come, do their job, and go as expected; high load ==
orphaned semaphores build up over time and eventually hit kernel limits).
Example syslog from an affected host:
Apr 5 12:00:53 qweqaz Server Administrator (SMIL): Data Engine
EventID: 0 A semaphore set has to be created but the system limit for
the maximum number of semaphore sets has been exceeded
Semaphore state on an affected machine:
-bash-2.05b$ ipcs
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x0001ffb8 0 root 666 76 3
0x00025990 32769 root 666 8308 3
0x00027cb9 65538 root 666 132256 1
0x00027cba 98307 root 666 132256 1
0x00027cdc 323092484 nagios 666 132256 0
0x00027cdd 393412613 nagios 666 132256 0
0x00027cde 393838598 nagios 666 132256 0
0x00027cdf 394231815 nagios 666 132256 0
0x00027ce0 413794312 nagios 666 132256 0
0x00027ce1 455770121 nagios 666 132256 0
0x00027ce2 483229706 nagios 666 132256 0
------ Semaphore Arrays --------
key semid owner perms nsems
0x00000000 393216 root 666 1
0x00000000 425985 root 666 1
0x00000000 458754 root 666 1
0x00000000 491523 root 666 1
0x00000000 524292 root 666 1
0x00000000 557061 root 666 1
0x00000000 622599 root 666 1
0x00000000 655368 root 666 1
0x0001ffb8 688137 root 666 1
0x000251c0 720906 root 666 1
0x000255a8 753675 root 666 1
0x00025990 786444 root 666 1
0x00000000 1179661 root 666 1
0x000278d1 884750 root 666 1
0x00027cb9 917519 root 666 1
0x00000000 950288 root 666 1
0x00000000 983057 root 666 1
0x00000000 1015826 root 666 1
0x00000000 1048595 root 666 1
0x00000000 1081364 root 666 1
0x000278d2 1114133 root 666 1
0x00027cba 1146902 root 666 1
0x000278f4 197754904 nagios 666 1
0x00027cdc 197787673 nagios 666 1
0x000278f5 378044442 nagios 666 1
0x00027cdd 378077211 nagios 666 1
0x000278f6 379322396 nagios 666 1
0x00027cde 379355165 nagios 666 1
0x000278f7 380502046 nagios 666 1
0x00027cdf 380534815 nagios 666 1
0x000278f8 430571552 nagios 666 1
0x00027ce0 430604321 nagios 666 1
0x000278f9 538050594 nagios 666 1
0x00027ce1 538083363 nagios 666 1
0x000278fa 608436260 nagios 666 1
0x00027ce2 608469029 nagios 666 1
------ Message Queues --------
key msqid owner perms used-bytes messages
Kernel proc bits on affected machines:
-bash-2.05b$ cat /proc/sys/kernel/{sem,shm{all,max,mni}}
250 32000 32 128
2097152
33554432
4096
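For reference, the four values in /proc/sys/kernel/sem are SEMMSL (max
semaphores per set), SEMMNS (max semaphores system-wide), SEMOPM (max
operations per semop call), and SEMMNI (max semaphore sets), so the Data
Engine error above means we are hitting the SEMMNI ceiling of 128 sets.
If we do resort to bumping limits, the stopgap would be something like
the following (the 256 is illustrative, not a tuned recommendation):

```shell
# Double SEMMNI (max number of semaphore sets); the other three fields
# (SEMMSL, SEMMNS, SEMOPM) are left at their current values.
# Runtime change only -- lost on reboot:
echo "250 32000 32 256" > /proc/sys/kernel/sem

# Persistent equivalent, added to /etc/sysctl.conf:
#   kernel.sem = 250 32000 32 256
```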
This is the latest version of OMSA supported under RHEL3 per Dell Support.
We are also several point releases behind the current check_openmanage
Nagios plugin; however, the changelog doesn't appear to contain a
relevant bugfix. I mention this only because we have run a heap of
local Big Brother bash mess against OMSA on these same high-load hosts
for years without any semaphore problems.
We _could_ bump kernel limits _or_ splay the Nagios checks to prolong
the intervals between semaphore exhaustion _or_ have something like cron
or cfengine sweep in and ungracefully blast the orphaned semaphores at
a fixed interval, but we would welcome a more elegant fix. Curious to
know 1.) whether others have experienced this bug and 2.) how they have
gotten around it.
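For what it's worth, the cron/cfengine sweep would look roughly like the
sketch below. The user name and the `ipcrm -s` form are assumptions --
the older util-linux shipped with RHEL3 wants `ipcrm sem <id>` instead
-- and blasting a semaphore a check is actively holding is obviously
unsafe, which is exactly why we consider it ungraceful:

```shell
#!/bin/bash
# Crude sweep of SysV semaphore arrays owned by the Nagios user.
TARGET_USER="nagios"

# Pull semaphore IDs for TARGET_USER out of `ipcs -s` output
# (columns: key semid owner perms nsems).
list_semids() {
    awk -v u="$TARGET_USER" '$3 == u { print $2 }'
}

# Ungracefully remove each matching array. On RHEL3's util-linux this
# would be `ipcrm sem "$semid"` rather than the newer `ipcrm -s` form.
ipcs -s | list_semids | while read -r semid; do
    ipcrm -s "$semid"
done
```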
Input appreciated. Thanks.
--
Nick Silkey
_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq