resending without the valgrind attachments to get around the 40k moderator approval limit...
On Thu, 4 Mar 2010, Scott Dworkis wrote:

> ok here's a couple more valgrinds with --leak-check=full --show-reachable=yes.
> still reporting a very small amount of memory lost if i'm reading this right.
>
> also, i think i've isolated the cause (woot).
>
> i had a couple different custom gmetric spoof metrics for my network
> switches... one had our internal fqdn for the switches, and the other just
> had the short switchnames. when i normalized both scripts to use the fqdn,
> the leak looks like it has mostly if not completely stopped. probably would
> have worked to normalize both to the short hostname also, but fqdn makes more
> sense in my setup. so, the first valgrind is before my gmetric hostname
> normalizations, the second is after.
>
> anyone know if i should submit a bug?
>
> -scott
>
> On Wed, 3 Mar 2010, Martin Knoblauch wrote:
>
>> ------------------------------------------------------
>> Martin Knoblauch
>> email: k n o b i AT knobisoft DOT de
>> www: http://www.knobisoft.de
>>
>> ----- Original Message ----
>>> From: Scott Dworkis <[email protected]>
>>> To: Martin Knoblauch <[email protected]>
>>> Cc: [email protected]
>>> Sent: Wed, March 3, 2010 5:21:32 AM
>>> Subject: Re: [Ganglia-general] gmond memory leaks
>>>
>>> finally had some time to do a few attempts at valgrind... so far it
>>> doesn't seem to be telling me much... the numbers it reports are in the
>>> megabyte and not gigabyte range that i'm seeing.
>>>
>>> after a couple hours of valgrind i see:
>>>
>>> ==31952== LEAK SUMMARY:
>>> ==31952==    definitely lost: 532 bytes in 23 blocks.
>>> ==31952==    indirectly lost: 271 bytes in 16 blocks.
>>> ==31952==      possibly lost: 13,872 bytes in 30 blocks.
>>> ==31952==    still reachable: 1,626,182 bytes in 2,188 blocks.
>>> ==31952==         suppressed: 0 bytes in 0 blocks.
>>> ==31952== Reachable blocks (those to which a pointer was found) are not shown.
>>> ==31952== To see them, rerun with: --leak-check=full --show-reachable=yes
>>>
>>> this doesn't grow much even after valgrinding overnight
>>>
>>> ==24957== LEAK SUMMARY:
>>> ==24957==    definitely lost: 2,404 bytes in 179 blocks.
>>> ==24957==    indirectly lost: 271 bytes in 16 blocks.
>>> ==24957==      possibly lost: 13,872 bytes in 30 blocks.
>>> ==24957==    still reachable: 1,626,182 bytes in 2,188 blocks.
>>> ==24957==         suppressed: 0 bytes in 0 blocks.
>>>
>>> in fact most of these numbers are identical, so they must be fixed losses
>>> in terms of valgrind accounting.
>>>
>>
>> Did you try "--leak-check=full --show-reachable=yes"? I believe that is
>> supposed to show all allocations. Might be a bit of output, but as far as I
>> can see you are able to reproduce early.
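
[For reference, running gmond in the foreground under valgrind with those
options might look roughly like the sketch below. The binary and config paths
and the gmond command-line options are assumptions based on a typical 3.1.x
install, not something taken from this thread; check `gmond --help` and
`valgrind --help` on the actual host, and build gmond with debug info (-g) so
the leak records carry usable stack traces.

    # sketch only: paths and gmond's -c/-d options are assumed, verify locally
    valgrind --leak-check=full --show-reachable=yes \
             --log-file=/tmp/gmond-valgrind.log \
             /usr/sbin/gmond -c /etc/ganglia/gmond.conf -d 2
    # -d 2 is intended to keep gmond in the foreground (debug level > 1);
    # let it run for a while, interrupt with Ctrl-C, then read the leak report

]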
>>
>>> this does not really reflect the growth of my gmond process (running under
>>> valgrind here, so reported as "memcheck"), which i tracked with 5 minute
>>> samples of top for an hour; it shows a linear leak of over 1GB during that
>>> period:
>>>
>>> (s...@admin3:16:43:/home/admin/monitoring/scripts) while [ 1 ];do top -n 1 | grep mem;sleep 300;done
>>> 24957 nobody 20 0 5623m 3.5g 3648 R  80 11.1 121:49.98 memcheck
>>> 24957 nobody 20 0 5753m 3.6g 3648 R  76 11.4 126:43.25 memcheck
>>> 24957 nobody 20 0 5948m 3.7g 3652 R 101 11.8 131:36.26 memcheck
>>> 24957 nobody 20 0 6108m 3.8g 3652 R  99 12.1 136:29.35 memcheck
>>> 24957 nobody 20 0 6267m 3.9g 3652 R  97 12.4 141:17.02 memcheck
>>> 24957 nobody 20 0 6436m 4.0g 3652 R  97 12.7 146:07.58 memcheck
>>> 24957 nobody 20 0 6547m 4.1g 3652 R  63 13.0 150:56.74 memcheck
>>> 24957 nobody 20 0 6707m 4.2g 3652 R  99 13.3 155:47.88 memcheck
>>> 24957 nobody 20 0 6917m 4.3g 3652 R  99 13.7 160:40.30 memcheck
>>> 24957 nobody 20 0 7055m 4.4g 3652 R  97 14.0 165:28.40 memcheck
>>> 24957 nobody 20 0 7201m 4.5g 3652 R 101 14.3 170:20.32 memcheck
>>> 24957 nobody 20 0 7340m 4.6g 3652 R  99 14.6 175:08.75 memcheck
>>>
>>> if i understand valgrind right, it's only orphaned data that's counted as
>>> lost... perhaps some structure is not orphaned but bloating?
>>>
>>> one other accidental observation, i have a job that generates 70k metrics
>>> every 5 minutes (a few dozen for every port on each of our switches)...
>>> these are all "spoof" ip metrics. this job had been accidentally disabled
>>> for a few days and i noticed that the leak virtually stopped. i can play
>>> some more with various parameters of this job and see if i find anything
>>> more... could be the spoof thing is coincidental but Rick Cobb also
>>> mentioned his leak seemed to be spoof related. i'll also see if sending
>>> heartbeats for the spoof ips helps anything.
>>>
>>
>> spoofing might indeed be a hint.
>>
>> Martin
>>
>>> -scott
>>>
>>>> Message: 2
>>>> Date: Thu, 18 Feb 2010 07:15:33 -0800 (PST)
>>>> From: Martin Knoblauch
>>>> Subject: Re: [Ganglia-general] gmond memory leaks
>>>> To: Scott Dworkis
>>>> Cc: [email protected]
>>>> Message-ID: <[email protected]>
>>>> Content-Type: text/plain; charset=us-ascii
>>>>
>>>> ----- Original Message ----
>>>>
>>>>> From: Scott Dworkis
>>>>> To: Martin Knoblauch
>>>>> Cc: [email protected]
>>>>> Sent: Wed, February 17, 2010 8:32:32 PM
>>>>> Subject: Re: [Ganglia-general] gmond memory leaks
>>>>>
>>>>> 3.1.2 on gentoo (that solaris must be a sourceforge ad?). i have zero
>>>>> experience with valgrind... i'll have a look but a smidge of guidance
>>>>> would be appreciated. :)
>>>>>
>>>>
>>>> Just get valgrind and run the leaking "gmond" under its control. "gmond"
>>>> should be configured to not run in background. After some time interrupt
>>>> it and you will get a report of valgrind's findings.
>>>>
>>>> For example, a simple program leaking 8x1MB will produce:
>>>>
>>>> [mknob...@l6g0223j ~]$ valgrind ./memeat
>>>> ==13647== Memcheck, a memory error detector.
>>>> ==13647== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
>>>> ==13647== Using LibVEX rev 1658, a library for dynamic binary translation.
>>>> ==13647== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
>>>> ==13647== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
>>>> ==13647== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
>>>> ==13647== For more details, rerun with: -v
>>>> ==13647==
>>>> ^C
>>>> ==13647==
>>>> ==13647== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
>>>> ==13647== malloc/free: in use at exit: 8,000,000 bytes in 8 blocks.
>>>> ==13647== malloc/free: 8 allocs, 0 frees, 8,000,000 bytes allocated.
>>>> ==13647== For counts of detected errors, rerun with: -v
>>>> ==13647== searching for pointers to 8 not-freed blocks.
>>>> ==13647== checked 66,440 bytes.
>>>> ==13647==
>>>> ==13647== LEAK SUMMARY:
>>>> ==13647==    definitely lost: 8,000,000 bytes in 8 blocks.
>>>> ==13647==      possibly lost: 0 bytes in 0 blocks.
>>>> ==13647==    still reachable: 0 bytes in 0 blocks.
>>>> ==13647==         suppressed: 0 bytes in 0 blocks.
>>>> ==13647== Use --leak-check=full to see details of leaked memory.
>>>>
>>>> If you use "--leak-check=full", it will tell you where the leaking memory
>>>> was allocated. "gmond" needs to be compiled with debug info (-g).
>>>>
>>>> A few questions.
>>>>
>>>> - What is your setup? I assume quite a few hosts monitoring metrics (the
>>>>   collectors) and one aggregating the results.
>>>> - Which of the "gmond"s leak? The "collectors", the "aggregator" or both?
>>>>
>>>> Cheers
>>>> Martin
>>>>
>>>>> yeah 150k metrics is a lot... i have an interest in scaling this thing.
>>>>> i'll post another thread about things i've done to scale so far that seem
>>>>> to be working well.
>>>>>
>>>>> On Wed, 17 Feb 2010, Martin Knoblauch wrote:
>>>>>
>>>>>> Hi Scott,
>>>>>>
>>>>>> which version of Ganglia and which operating environment do you have
>>>>>> (guessing Solaris from your signature :-)? Any chance that you could run
>>>>>> valgrind or equivalent on your setup? 10GB/day is a lot, as is 150k
>>>>>> metrics.
>>>>>>
>>>>>> Cheers
>>>>>> Martin
>>>>>> ------------------------------------------------------
>>>>>> Martin Knoblauch
>>>>>> email: k n o b i AT knobisoft DOT de
>>>>>> www: http://www.knobisoft.de
>>>>>>
>>>>>> ----- Original Message ----
>>>>>>> From: Scott Dworkis
>>>>>>> To: [email protected]
>>>>>>> Sent: Wed, February 17, 2010 3:08:26 AM
>>>>>>> Subject: [Ganglia-general] gmond memory leaks
>>>>>>>
>>>>>>> (sorry if this is a repost... i tried previously without having first
>>>>>>> subscribed to the list, and fear i got lost somewhere along the
>>>>>>> moderation path)
>>>>>>>
>>>>>>> hi all - i am seeing gmond leak about 10GB/day on about 150k metrics
>>>>>>> collected. it seemed like things worsened when i added dmax to all my
>>>>>>> custom metrics, but maybe it was always bad. is this a known issue?
>>>>>>>
>>>>>>> sorry if it is already known... i couldn't see that there was a good way
>>>>>>> to search the forums or if there is a bug tracker to search.
>>>>>>>
>>>>>>> -scott
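
[The spoofed-metric detail that eventually mattered in this thread is easiest
to see as a command. The sketch below is illustrative only: the metric names,
values, addresses and switch hostnames are invented, and the option spellings
should be checked against `gmetric --help` for the installed version. The
point is simply what Scott describes: every spoofed sample for a given switch
should carry one consistent ip:host string.

    # hypothetical per-port spoofed metric; one script used the short switch name...
    gmetric --name port_gi0_1_octets_in --value 123456 --type uint32 --units bytes \
            --dmax 600 --spoof 10.0.0.5:switch17

    # ...while another used the fqdn for the same switch, so gmond tracked two hosts:
    gmetric --name port_gi0_1_octets_in --value 123456 --type uint32 --units bytes \
            --dmax 600 --spoof 10.0.0.5:switch17.internal.example.com

    # normalizing both scripts to the fqdn (or both to the short name) gives every
    # spoofed sample the same ip:host string, which is what stopped the growth here

]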

