I've been using OMSA 5.5 on RHEL3 for quite some time now, and have no issues with Trond's Nagios plugin.
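If anyone does end up resorting to the "sweep the orphans from cron" workaround Nick mentions below, a minimal sketch could look like the following. The owner name and the exact ipcrm syntax are assumptions to adapt per host; note also that this removes *all* of that user's semaphore arrays, live or orphaned, so it should only run when no check is in flight:

```shell
#!/bin/sh
# Sketch of a cron-able sweep for leaked SysV semaphore arrays.
# Assumption: the orphaned arrays are owned by the Nagios user, as in
# the ipcs output quoted below; adjust the owner for your environment.

sweep_semaphores() {
    owner="$1"
    # ipcs -s data columns: key, semid, owner, perms, nsems
    ipcs -s | awk -v o="$owner" '$3 == o { print $2 }' |
    while read -r semid; do
        # Modern util-linux takes "ipcrm -s <semid>"; RHEL3-era ipcrm
        # uses the older "ipcrm sem <semid>" form instead.
        ipcrm -s "$semid"
    done
}

# Destructive - run only on an affected host, e.g.:
# sweep_semaphores nagios
```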
-Finn Andersen

On Wed, Apr 7, 2010 at 5:51 PM, Trond Hasle Amundsen <[email protected]> wrote:
> Nick Silkey <[email protected]> writes:
>
>> We have quite a few production RHEL3u5 WS hosts running OMSA
>> 5.1.0-354. We find pointing Trond Amundsen's check_poweredge Nagios
>> plugin at them eventually leads to semaphore resource exhaustion, but
>> only when the machine is under high load (low load == semaphores come,
>> do their job, and go as expected; high load == orphaned semaphores
>> build up over time and eventually hit kernel limits).
>>
>> Example syslog from an affected host:
>>
>> Apr 5 12:00:53 qweqaz Server Administrator (SMIL): Data Engine
>> EventID: 0 A semaphore set has to be created but the system limit for
>> the maximum number of semaphore sets has been exceeded
>>
>> Semaphore state on an affected machine:
>>
>> -bash-2.05b$ ipcs
>>
>> ------ Shared Memory Segments --------
>> key        shmid      owner    perms    bytes    nattch   status
>> 0x0001ffb8 0          root     666      76       3
>> 0x00025990 32769      root     666      8308     3
>> 0x00027cb9 65538      root     666      132256   1
>> 0x00027cba 98307      root     666      132256   1
>> 0x00027cdc 323092484  nagios   666      132256   0
>> 0x00027cdd 393412613  nagios   666      132256   0
>> 0x00027cde 393838598  nagios   666      132256   0
>> 0x00027cdf 394231815  nagios   666      132256   0
>> 0x00027ce0 413794312  nagios   666      132256   0
>> 0x00027ce1 455770121  nagios   666      132256   0
>> 0x00027ce2 483229706  nagios   666      132256   0
>>
>> ------ Semaphore Arrays --------
>> key        semid      owner    perms    nsems
>> 0x00000000 393216     root     666      1
>> 0x00000000 425985     root     666      1
>> 0x00000000 458754     root     666      1
>> 0x00000000 491523     root     666      1
>> 0x00000000 524292     root     666      1
>> 0x00000000 557061     root     666      1
>> 0x00000000 622599     root     666      1
>> 0x00000000 655368     root     666      1
>> 0x0001ffb8 688137     root     666      1
>> 0x000251c0 720906     root     666      1
>> 0x000255a8 753675     root     666      1
>> 0x00025990 786444     root     666      1
>> 0x00000000 1179661    root     666      1
>> 0x000278d1 884750     root     666      1
>> 0x00027cb9 917519     root     666      1
>> 0x00000000 950288     root     666      1
>> 0x00000000 983057     root     666      1
>> 0x00000000 1015826    root     666      1
>> 0x00000000 1048595    root     666      1
>> 0x00000000 1081364    root     666      1
>> 0x000278d2 1114133    root     666      1
>> 0x00027cba 1146902    root     666      1
>> 0x000278f4 197754904  nagios   666      1
>> 0x00027cdc 197787673  nagios   666      1
>> 0x000278f5 378044442  nagios   666      1
>> 0x00027cdd 378077211  nagios   666      1
>> 0x000278f6 379322396  nagios   666      1
>> 0x00027cde 379355165  nagios   666      1
>> 0x000278f7 380502046  nagios   666      1
>> 0x00027cdf 380534815  nagios   666      1
>> 0x000278f8 430571552  nagios   666      1
>> 0x00027ce0 430604321  nagios   666      1
>> 0x000278f9 538050594  nagios   666      1
>> 0x00027ce1 538083363  nagios   666      1
>> 0x000278fa 608436260  nagios   666      1
>> 0x00027ce2 608469029  nagios   666      1
>>
>> ------ Message Queues --------
>> key        msqid      owner    perms    used-bytes   messages
>>
>> Kernel proc bits on affected machines:
>>
>> -bash-2.05b$ cat /proc/sys/kernel/{sem,shm{all,max,mni}}
>> 250 32000 32 128
>> 2097152
>> 33554432
>> 4096
>>
>> This is the latest version of OMSA supported under RHEL3 per Dell Support.
>>
>> We are also several point releases behind the current check_openmanage
>> Nagios plugin. However, the changelog doesn't appear to indicate that
>> there is a helpful bugfix. I mention this only because we have run a
>> local heap of Big Brother bash mess against local OMSA on these
>> high-load hosts for years without any semaphore problems.
>>
>> We _could_ bump kernel limits _or_ splay Nagios checking to prolong
>> intervals between semaphore exhaustion _or_ have something like cron
>> or cfengine sweep in and ungracefully blast the orphaned semaphores at
>> a fixed interval, but we would welcome a more elegant fix. Curious to
>> know if 1.) others have experienced this bug and 2.) how they have
>> gotten around this issue.
>
> Hi Nick,
>
> I've been trying to reproduce this on a 2650 running RHEL 3.9 with OMSA
> 5.4.0. No problems with semaphores whatsoever. Semaphores are created
> but they are cleaned up promptly.
> I can't pinpoint whatever causes the semaphore leakage at your end, but
> things are seemingly better with later RHEL3 and/or newer OMSA. My guess
> is that OMSA is the culprit and that upgrading would help.
>
> OMSA 5.1.0 is the latest version listed for RHEL3 on support.dell.com,
> but at least on my 2650 later versions work OK. This is not surprising,
> as ESX 3.5 was supported with OMSA up to and including 5.5.0 IIRC, and
> ESX 3.5 is "based" on RHEL3.
>
> Is OMSA 5.1.0 really the latest supported version on RHEL3? Perhaps some
> Dell folks can shed some light on this...
>
> As for check_openmanage versions... your assumptions are correct: there
> are no recent changes that I can think of that are relevant to your
> problem. In local mode, the plugin simply executes omreport commands to
> determine the current status of the components you wish to monitor
> (usually all of them). If omreport leaks semaphores, there isn't much
> that check_openmanage can do about it.
>
> PS. We don't monitor hardware health on our RHEL3 boxes. Luckily, RHEL3
> is soon EOL :)
>
> Cheers,
> --
> Trond H. Amundsen <[email protected]>
> Center for Information Technology Services, University of Oslo
>
> _______________________________________________
> Linux-PowerEdge mailing list
> [email protected]
> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
> Please read the FAQ at http://lists.us.dell.com/faq
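A footnote on the "bump kernel limits" option from Nick's list: the four numbers in his `/proc/sys/kernel/sem` output are SEMMSL, SEMMNS, SEMOPM and SEMMNI, and it is the last one (the maximum number of semaphore sets, 128 on his hosts) that the quoted syslog message says is being exhausted. A sketch of inspecting and raising it follows; the new values are illustrative only, not a recommendation:

```shell
#!/bin/sh
# kernel.sem fields: SEMMSL SEMMNS SEMOPM SEMMNI
#   SEMMSL - max semaphores per array
#   SEMMNS - max semaphores system-wide
#   SEMOPM - max operations per semop() call
#   SEMMNI - max number of semaphore arrays (the limit being hit)
read semmsl semmns semopm semmni < /proc/sys/kernel/sem
echo "SEMMNI (max semaphore sets): $semmni"

# Raising the limits needs root; the values below are illustrative
# (SEMMNS kept at SEMMSL * SEMMNI). Uncomment to apply:
# echo "250 64000 32 256" > /proc/sys/kernel/sem
# ...or persist it across reboots via /etc/sysctl.conf:
# kernel.sem = 250 64000 32 256
```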
