------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
----- Original Message ----
> From: Scott Dworkis <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: [email protected]
> Sent: Wed, March 3, 2010 5:21:32 AM
> Subject: Re: [Ganglia-general] gmond memory leaks
>
> finally had some time to do a few attempts at valgrind... so far it
> doesn't seem to be telling me much... the numbers it reports are in the
> megabyte and not the gigabyte range that i'm seeing.
>
> after a couple of hours of valgrind i see:
>
> ==31952== LEAK SUMMARY:
> ==31952==    definitely lost: 532 bytes in 23 blocks.
> ==31952==    indirectly lost: 271 bytes in 16 blocks.
> ==31952==      possibly lost: 13,872 bytes in 30 blocks.
> ==31952==    still reachable: 1,626,182 bytes in 2,188 blocks.
> ==31952==         suppressed: 0 bytes in 0 blocks.
> ==31952== Reachable blocks (those to which a pointer was found) are not shown.
> ==31952== To see them, rerun with: --leak-check=full --show-reachable=yes
>
> this doesn't grow much even after valgrinding overnight:
>
> ==24957== LEAK SUMMARY:
> ==24957==    definitely lost: 2,404 bytes in 179 blocks.
> ==24957==    indirectly lost: 271 bytes in 16 blocks.
> ==24957==      possibly lost: 13,872 bytes in 30 blocks.
> ==24957==    still reachable: 1,626,182 bytes in 2,188 blocks.
> ==24957==         suppressed: 0 bytes in 0 blocks.
>
> in fact most of these numbers are identical, so they must be fixed losses
> in terms of valgrind accounting.

Did you try "--leak-check=full --show-reachable=yes"? I believe that is
supposed to show all allocations. It might be a lot of output, but as far
as I can see you are able to reproduce the problem early on.

> this does not really reflect the growth of my gmond process (running under
> valgrind here, so reported as "memcheck"), which i tracked with 5 minute
> samples of top for an hour; it shows a linear leak of over 1GB during that
> period:
>
> (s...@admin3:16:43:/home/admin/monitoring/scripts) while [ 1 ];do top -n 1 | grep mem;sleep 300;done
> 24957 nobody  20   0 5623m 3.5g 3648 R   80 11.1 121:49.98 memcheck
> 24957 nobody  20   0 5753m 3.6g 3648 R   76 11.4 126:43.25 memcheck
> 24957 nobody  20   0 5948m 3.7g 3652 R  101 11.8 131:36.26 memcheck
> 24957 nobody  20   0 6108m 3.8g 3652 R   99 12.1 136:29.35 memcheck
> 24957 nobody  20   0 6267m 3.9g 3652 R   97 12.4 141:17.02 memcheck
> 24957 nobody  20   0 6436m 4.0g 3652 R   97 12.7 146:07.58 memcheck
> 24957 nobody  20   0 6547m 4.1g 3652 R   63 13.0 150:56.74 memcheck
> 24957 nobody  20   0 6707m 4.2g 3652 R   99 13.3 155:47.88 memcheck
> 24957 nobody  20   0 6917m 4.3g 3652 R   99 13.7 160:40.30 memcheck
> 24957 nobody  20   0 7055m 4.4g 3652 R   97 14.0 165:28.40 memcheck
> 24957 nobody  20   0 7201m 4.5g 3652 R  101 14.3 170:20.32 memcheck
> 24957 nobody  20   0 7340m 4.6g 3652 R   99 14.6 175:08.75 memcheck
>
> if i understand valgrind right, it's only orphaned data that's counted as
> lost... perhaps some structure is not orphaned but bloating?
>
> one other accidental observation: i have a job that generates 70k metrics
> every 5 minutes (a few dozen for every port on each of our switches)...
> these are all "spoof" ip metrics. this job had been accidentally disabled
> for a few days and i noticed that the leak virtually stopped. i can play
> some more with various parameters of this job and see if i find anything
> more... it could be that the spoof thing is coincidental, but Rick Cobb
> also mentioned his leak seemed to be spoof related. i'll also see if
> sending heartbeats for the spoof ips helps anything.

Spoofing might indeed be a hint.

Martin
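Scott's "not orphaned but bloating" guess above is exactly the case memcheck
files under "still reachable" rather than "definitely lost": memory that keeps
growing but is still referenced through a live pointer, so it never counts as
a leak. A minimal C sketch of that pattern follows (illustrative only; the
struct and function names here are invented and say nothing about gmond's
actual internals):

    /* Sketch, not gmond source: a structure that "bloats" while staying
       referenced. memcheck reports this memory as "still reachable",
       not "definitely lost", which would match a growing RSS alongside
       a nearly constant LEAK SUMMARY. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct metric {
        char name[64];
        struct metric *next;
    };

    static struct metric *head = NULL;   /* live root pointer */

    /* Called for every incoming (e.g. spoofed) metric; entries are
       added but never expired, so the list grows without bound. */
    static void record_metric(const char *name)
    {
        struct metric *m = malloc(sizeof *m);
        if (!m)
            return;
        strncpy(m->name, name, sizeof m->name - 1);
        m->name[sizeof m->name - 1] = '\0';
        m->next = head;                  /* still reachable via head */
        head = m;
    }

    int main(void)
    {
        char buf[64];
        long i;
        for (i = 0; ; i++) {             /* grows forever, like a 1GB/hour RSS */
            snprintf(buf, sizeof buf, "switch-port-%ld", i);
            record_metric(buf);
        }
        return 0;
    }

Running under "--leak-check=full --show-reachable=yes" would list the
allocation stacks for such blocks, which is why that run is worth doing.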
> -scott
>
> > Message: 2
> > Date: Thu, 18 Feb 2010 07:15:33 -0800 (PST)
> > From: Martin Knoblauch
> > Subject: Re: [Ganglia-general] gmond memory leaks
> > To: Scott Dworkis
> > Cc: [email protected]
> > Message-ID: <[email protected]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > ----- Original Message ----
> >
> >> From: Scott Dworkis
> >> To: Martin Knoblauch
> >> Cc: [email protected]
> >> Sent: Wed, February 17, 2010 8:32:32 PM
> >> Subject: Re: [Ganglia-general] gmond memory leaks
> >>
> >> 3.1.2 on gentoo (that solaris must be a sourceforge ad?). i have zero
> >> experience with valgrind... i'll have a look, but a smidge of guidance
> >> would be appreciated. :)
> >>
> >
> > Just get valgrind and run the leaking "gmond" under its control. "gmond"
> > should be configured not to run in the background. After some time,
> > interrupt it and you will get a report of valgrind's findings.
> >
> > For example, a simple program leaking 8x1MB will produce:
> >
> > [mknob...@l6g0223j ~]$ valgrind ./memeat
> > ==13647== Memcheck, a memory error detector.
> > ==13647== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
> > ==13647== Using LibVEX rev 1658, a library for dynamic binary translation.
> > ==13647== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
> > ==13647== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
> > ==13647== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
> > ==13647== For more details, rerun with: -v
> > ==13647==
> > ^C
> > ==13647==
> > ==13647== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
> > ==13647== malloc/free: in use at exit: 8,000,000 bytes in 8 blocks.
> > ==13647== malloc/free: 8 allocs, 0 frees, 8,000,000 bytes allocated.
> > ==13647== For counts of detected errors, rerun with: -v
> > ==13647== searching for pointers to 8 not-freed blocks.
> > ==13647== checked 66,440 bytes.
> > ==13647==
> > ==13647== LEAK SUMMARY:
> > ==13647==    definitely lost: 8,000,000 bytes in 8 blocks.
> > ==13647==      possibly lost: 0 bytes in 0 blocks.
> > ==13647==    still reachable: 0 bytes in 0 blocks.
> > ==13647==         suppressed: 0 bytes in 0 blocks.
> > ==13647== Use --leak-check=full to see details of leaked memory.
> >
> > If you use "--leak-check=full", it will tell you where the leaking
> > memory was allocated. "gmond" needs to be compiled with debug info (-g).
> >
> > A few questions:
> >
> > - What is your setup? I assume quite a few hosts collecting metrics
> >   (the "collectors") and one aggregating the results.
> > - Which of the "gmond"s leak? The "collectors", the "aggregator", or both?
> >
> > Cheers
> > Martin
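For reference, the 8x1MB "memeat" test above might have looked like the
following. This is a reconstruction from the summary output (8 allocs,
0 frees, 8,000,000 bytes in 8 blocks), not Martin's actual source:

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int i;
        for (i = 0; i < 8; i++) {
            char *p = malloc(1000000);   /* 8 allocs, 1,000,000 bytes each */
            if (p)
                memset(p, 1, 1000000);   /* touch the memory so it is committed */
            /* pointer dropped here: "definitely lost" to memcheck */
        }
        pause();                         /* sit until interrupted with ^C */
        return 0;
    }

Because the pointers are dropped inside the loop, memcheck reports all
8,000,000 bytes as "definitely lost"; a bloating-but-referenced structure
like the earlier sketch lands under "still reachable" instead.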
> >> yeah, 150k metrics is a lot... i have an interest in scaling this thing.
> >> i'll post another thread about things i've done to scale so far that
> >> seem to be working well.
> >>
> >> On Wed, 17 Feb 2010, Martin Knoblauch wrote:
> >>
> >>> Hi Scott,
> >>>
> >>> which version of Ganglia and which operating environment do you have
> >>> (guessing Solaris from your signature :-)? Any chance that you could
> >>> run valgrind or equivalent on your setup? 10GB/day is a lot, as is
> >>> 150k metrics.
> >>>
> >>> Cheers
> >>> Martin
> >>> ------------------------------------------------------
> >>> Martin Knoblauch
> >>> email: k n o b i AT knobisoft DOT de
> >>> www: http://www.knobisoft.de
> >>>
> >>> ----- Original Message ----
> >>>> From: Scott Dworkis
> >>>> To: [email protected]
> >>>> Sent: Wed, February 17, 2010 3:08:26 AM
> >>>> Subject: [Ganglia-general] gmond memory leaks
> >>>>
> >>>> (sorry if this is a repost... i tried previously without having first
> >>>> subscribed to the list, and fear i got lost somewhere along the
> >>>> moderation path)
> >>>>
> >>>> hi all - i am seeing gmond leak about 10GB/day on about 150k metrics
> >>>> collected. it seemed like things worsened when i added dmax to all my
> >>>> custom metrics, but maybe it was always bad. is this a known issue?
> >>>>
> >>>> sorry if it is already known... i couldn't see that there was a good
> >>>> way to search the forums, or whether there is a bug tracker to search.
> >>>>
> >>>> -scott

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

