------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
----- Original Message ----
> From: Scott Dworkis <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: [email protected]
> Sent: Wed, March 3, 2010 5:21:32 AM
> Subject: Re: [Ganglia-general] gmond memory leaks
>
> finally had some time to do a few attempts at valgrind... so far it
> doesn't seem to be telling me much... the numbers it reports are in the
> megabyte and not the gigabyte range that i'm seeing.
>
> after a couple of hours of valgrind i see:
>
> ==31952== LEAK SUMMARY:
> ==31952==    definitely lost: 532 bytes in 23 blocks.
> ==31952==    indirectly lost: 271 bytes in 16 blocks.
> ==31952==      possibly lost: 13,872 bytes in 30 blocks.
> ==31952==    still reachable: 1,626,182 bytes in 2,188 blocks.
> ==31952==         suppressed: 0 bytes in 0 blocks.
> ==31952== Reachable blocks (those to which a pointer was found) are not shown.
> ==31952== To see them, rerun with: --leak-check=full --show-reachable=yes
>
> this doesn't grow much even after valgrinding overnight:
>
> ==24957== LEAK SUMMARY:
> ==24957==    definitely lost: 2,404 bytes in 179 blocks.
> ==24957==    indirectly lost: 271 bytes in 16 blocks.
> ==24957==      possibly lost: 13,872 bytes in 30 blocks.
> ==24957==    still reachable: 1,626,182 bytes in 2,188 blocks.
> ==24957==         suppressed: 0 bytes in 0 blocks.
>
> in fact most of these numbers are identical, so they must be fixed losses
> in terms of valgrind accounting.

Did you try "--leak-check=full --show-reachable=yes"? I believe that is
supposed to show all allocations. It might be a lot of output, but as far
as I can see you are able to reproduce the problem early on.

> this does not really reflect the growth of my gmond process (running under
> valgrind here, so reported as "memcheck"), which i tracked with 5 minute
> samples of top for an hour; it shows a linear leak of over 1GB during that
> period:
>
> (s...@admin3:16:43:/home/admin/monitoring/scripts) while [ 1 ];do top -n 1 | grep mem;sleep 300;done
> 24957 nobody  20   0 5623m 3.5g 3648 R   80 11.1 121:49.98 memcheck
> 24957 nobody  20   0 5753m 3.6g 3648 R   76 11.4 126:43.25 memcheck
> 24957 nobody  20   0 5948m 3.7g 3652 R  101 11.8 131:36.26 memcheck
> 24957 nobody  20   0 6108m 3.8g 3652 R   99 12.1 136:29.35 memcheck
> 24957 nobody  20   0 6267m 3.9g 3652 R   97 12.4 141:17.02 memcheck
> 24957 nobody  20   0 6436m 4.0g 3652 R   97 12.7 146:07.58 memcheck
> 24957 nobody  20   0 6547m 4.1g 3652 R   63 13.0 150:56.74 memcheck
> 24957 nobody  20   0 6707m 4.2g 3652 R   99 13.3 155:47.88 memcheck
> 24957 nobody  20   0 6917m 4.3g 3652 R   99 13.7 160:40.30 memcheck
> 24957 nobody  20   0 7055m 4.4g 3652 R   97 14.0 165:28.40 memcheck
> 24957 nobody  20   0 7201m 4.5g 3652 R  101 14.3 170:20.32 memcheck
> 24957 nobody  20   0 7340m 4.6g 3652 R   99 14.6 175:08.75 memcheck
>
> if i understand valgrind right, it's only orphaned data that's counted as
> lost... perhaps some structure is not orphaned but bloating?
>
> one other accidental observation: i have a job that generates 70k metrics
> every 5 minutes (a few dozen for every port on each of our switches)...
> these are all "spoof" ip metrics. this job had been accidentally disabled
> for a few days and i noticed that the leak virtually stopped. i can play
> some more with various parameters of this job and see if i find anything
> more... it could be that the spoof thing is coincidental, but Rick Cobb
> also mentioned his leak seemed to be spoof related. i'll also see if
> sending heartbeats for the spoof ips helps anything.

Spoofing might indeed be a hint.

Martin
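Scott's "not orphaned but bloating" guess above is exactly the case memcheck
files under "still reachable" rather than "definitely lost": memory that keeps
growing but is still referenced through a live pointer, so it never counts as
a leak. A minimal C sketch of that pattern follows (illustrative only; the
struct and function names here are invented and say nothing about gmond's
actual internals):

    /* Sketch, not gmond source: a structure that "bloats" while staying
       referenced. memcheck reports this memory as "still reachable",
       not "definitely lost", which would match a growing RSS alongside
       a nearly constant LEAK SUMMARY. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct metric {
        char name[64];
        struct metric *next;
    };

    static struct metric *head = NULL;   /* live root pointer */

    /* Called for every incoming (e.g. spoofed) metric; entries are
       added but never expired, so the list grows without bound. */
    static void record_metric(const char *name)
    {
        struct metric *m = malloc(sizeof *m);
        if (!m)
            return;
        strncpy(m->name, name, sizeof m->name - 1);
        m->name[sizeof m->name - 1] = '\0';
        m->next = head;                  /* still reachable via head */
        head = m;
    }

    int main(void)
    {
        char buf[64];
        long i;
        for (i = 0; ; i++) {             /* grows forever, like a 1GB/hour RSS */
            snprintf(buf, sizeof buf, "switch-port-%ld", i);
            record_metric(buf);
        }
        return 0;
    }

Running under "--leak-check=full --show-reachable=yes" would list the
allocation stacks for such blocks, which is why that run is worth doing.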
> -scott
>
> > Message: 2
> > Date: Thu, 18 Feb 2010 07:15:33 -0800 (PST)
> > From: Martin Knoblauch
> > Subject: Re: [Ganglia-general] gmond memory leaks
> > To: Scott Dworkis
> > Cc: [email protected]
> > Message-ID: <[email protected]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > ----- Original Message ----
> >
> >> From: Scott Dworkis
> >> To: Martin Knoblauch
> >> Cc: [email protected]
> >> Sent: Wed, February 17, 2010 8:32:32 PM
> >> Subject: Re: [Ganglia-general] gmond memory leaks
> >>
> >> 3.1.2 on gentoo (that solaris must be a sourceforge ad?). i have zero
> >> experience with valgrind... i'll have a look, but a smidge of guidance
> >> would be appreciated. :)
> >>
> >
> > Just get valgrind and run the leaking "gmond" under its control. "gmond"
> > should be configured not to run in the background. After some time,
> > interrupt it and you will get a report of valgrind's findings.
> >
> > For example, a simple program leaking 8x1MB will produce:
> >
> > [mknob...@l6g0223j ~]$ valgrind ./memeat
> > ==13647== Memcheck, a memory error detector.
> > ==13647== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
> > ==13647== Using LibVEX rev 1658, a library for dynamic binary translation.
> > ==13647== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
> > ==13647== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
> > ==13647== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
> > ==13647== For more details, rerun with: -v
> > ==13647==
> > ^C
> > ==13647==
> > ==13647== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
> > ==13647== malloc/free: in use at exit: 8,000,000 bytes in 8 blocks.
> > ==13647== malloc/free: 8 allocs, 0 frees, 8,000,000 bytes allocated.
> > ==13647== For counts of detected errors, rerun with: -v
> > ==13647== searching for pointers to 8 not-freed blocks.
> > ==13647== checked 66,440 bytes.
> > ==13647==
> > ==13647== LEAK SUMMARY:
> > ==13647==    definitely lost: 8,000,000 bytes in 8 blocks.
> > ==13647==      possibly lost: 0 bytes in 0 blocks.
> > ==13647==    still reachable: 0 bytes in 0 blocks.
> > ==13647==         suppressed: 0 bytes in 0 blocks.
> > ==13647== Use --leak-check=full to see details of leaked memory.
> >
> > If you use "--leak-check=full", it will tell you where the leaking
> > memory was allocated. "gmond" needs to be compiled with debug info (-g).
> >
> > A few questions:
> >
> > - What is your setup? I assume quite a few hosts collecting metrics
> >   (the "collectors") and one aggregating the results.
> > - Which of the "gmond"s leak? The "collectors", the "aggregator", or both?
> >
> > Cheers
> > Martin
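For reference, the 8x1MB "memeat" test above might have looked like the
following. This is a reconstruction from the summary output (8 allocs,
0 frees, 8,000,000 bytes in 8 blocks), not Martin's actual source:

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int i;
        for (i = 0; i < 8; i++) {
            char *p = malloc(1000000);   /* 8 allocs, 1,000,000 bytes each */
            if (p)
                memset(p, 1, 1000000);   /* touch the memory so it is committed */
            /* pointer dropped here: "definitely lost" to memcheck */
        }
        pause();                         /* sit until interrupted with ^C */
        return 0;
    }

Because the pointers are dropped inside the loop, memcheck reports all
8,000,000 bytes as "definitely lost"; a bloating-but-referenced structure
like the earlier sketch lands under "still reachable" instead.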
> >> yeah, 150k metrics is a lot... i have an interest in scaling this thing.
> >> i'll post another thread about things i've done to scale so far that
> >> seem to be working well.
> >>
> >> On Wed, 17 Feb 2010, Martin Knoblauch wrote:
> >>
> >>> Hi Scott,
> >>>
> >>> which version of Ganglia and which operating environment do you have
> >>> (guessing Solaris from your signature :-)? Any chance that you could
> >>> run valgrind or equivalent on your setup? 10GB/day is a lot, as is
> >>> 150k metrics.
> >>>
> >>> Cheers
> >>> Martin
> >>> ------------------------------------------------------
> >>> Martin Knoblauch
> >>> email: k n o b i AT knobisoft DOT de
> >>> www: http://www.knobisoft.de
> >>>
> >>> ----- Original Message ----
> >>>> From: Scott Dworkis
> >>>> To: [email protected]
> >>>> Sent: Wed, February 17, 2010 3:08:26 AM
> >>>> Subject: [Ganglia-general] gmond memory leaks
> >>>>
> >>>> (sorry if this is a repost... i tried previously without having first
> >>>> subscribed to the list, and fear i got lost somewhere along the
> >>>> moderation path)
> >>>>
> >>>> hi all - i am seeing gmond leak about 10GB/day on about 150k metrics
> >>>> collected. it seemed like things worsened when i added dmax to all my
> >>>> custom metrics, but maybe it was always bad. is this a known issue?
> >>>>
> >>>> sorry if it is already known... i couldn't see that there was a good
> >>>> way to search the forums, or whether there is a bug tracker to search.
> >>>>
> >>>> -scott

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

