finally had some time to make a few attempts at valgrind... so far it doesn't seem to be telling me much... the numbers it reports are in the megabyte range, not the gigabyte range i'm seeing.
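for reference, running gmond under valgrind looks something like the sketch below (this is a generic illustration rather than my exact command line, and the paths are placeholders for wherever gmond and gmond.conf live on your system):

# gmond.conf: keep gmond in the foreground so valgrind can watch it
#   globals {
#     daemonize = no
#   }
#
# --leak-check=full makes valgrind report where leaked blocks were allocated;
# gmond should be built with debug info (-g) for the stack traces to be useful.
valgrind --leak-check=full --show-reachable=yes \
    /usr/sbin/gmond -c /etc/ganglia/gmond.conf 2> gmond-valgrind.log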
after a couple hours of valgrind i see:

==31952== LEAK SUMMARY:
==31952==    definitely lost: 532 bytes in 23 blocks.
==31952==    indirectly lost: 271 bytes in 16 blocks.
==31952==      possibly lost: 13,872 bytes in 30 blocks.
==31952==    still reachable: 1,626,182 bytes in 2,188 blocks.
==31952==         suppressed: 0 bytes in 0 blocks.
==31952== Reachable blocks (those to which a pointer was found) are not shown.
==31952== To see them, rerun with: --leak-check=full --show-reachable=yes

this doesn't grow much even after valgrinding overnight:

==24957== LEAK SUMMARY:
==24957==    definitely lost: 2,404 bytes in 179 blocks.
==24957==    indirectly lost: 271 bytes in 16 blocks.
==24957==      possibly lost: 13,872 bytes in 30 blocks.
==24957==    still reachable: 1,626,182 bytes in 2,188 blocks.
==24957==         suppressed: 0 bytes in 0 blocks.

in fact most of these numbers are identical, so they must be fixed losses as far as valgrind's accounting goes. this doesn't really reflect the growth of my gmond process (running under valgrind here, so it shows up in top as "memcheck"). i tracked it with 5-minute samples of top for an hour, and they show a linear leak of over 1GB during that period:

(s...@admin3:16:43:/home/admin/monitoring/scripts) while [ 1 ];do top -n 1 | grep mem;sleep 300;done
24957 nobody  20  0 5623m 3.5g 3648 R  80 11.1 121:49.98 memcheck
24957 nobody  20  0 5753m 3.6g 3648 R  76 11.4 126:43.25 memcheck
24957 nobody  20  0 5948m 3.7g 3652 R 101 11.8 131:36.26 memcheck
24957 nobody  20  0 6108m 3.8g 3652 R  99 12.1 136:29.35 memcheck
24957 nobody  20  0 6267m 3.9g 3652 R  97 12.4 141:17.02 memcheck
24957 nobody  20  0 6436m 4.0g 3652 R  97 12.7 146:07.58 memcheck
24957 nobody  20  0 6547m 4.1g 3652 R  63 13.0 150:56.74 memcheck
24957 nobody  20  0 6707m 4.2g 3652 R  99 13.3 155:47.88 memcheck
24957 nobody  20  0 6917m 4.3g 3652 R  99 13.7 160:40.30 memcheck
24957 nobody  20  0 7055m 4.4g 3652 R  97 14.0 165:28.40 memcheck
24957 nobody  20  0 7201m 4.5g 3652 R 101 14.3 170:20.32 memcheck
24957 nobody  20  0 7340m 4.6g 3652 R  99 14.6 175:08.75 memcheck

if i understand valgrind right, only orphaned data gets counted as lost... perhaps some structure is not orphaned but bloating?

one other accidental observation: i have a job that generates 70k metrics every 5 minutes (a few dozen for every port on each of our switches)... these are all "spoof" ip metrics (there's a rough sketch of what the job does at the bottom of this mail). this job had been accidentally disabled for a few days, and i noticed that the leak virtually stopped. i can play some more with various parameters of this job and see if i find anything more... the spoof thing could be coincidental, but Rick Cobb also mentioned his leak seemed to be spoof related. i'll also see if sending heartbeats for the spoof ips helps anything.

-scott

> Message: 2
> Date: Thu, 18 Feb 2010 07:15:33 -0800 (PST)
> From: Martin Knoblauch <[email protected]>
> Subject: Re: [Ganglia-general] gmond memory leaks
> To: Scott Dworkis <[email protected]>
> Cc: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=us-ascii
>
> ----- Original Message ----
>
>> From: Scott Dworkis <[email protected]>
>> To: Martin Knoblauch <[email protected]>
>> Cc: [email protected]
>> Sent: Wed, February 17, 2010 8:32:32 PM
>> Subject: Re: [Ganglia-general] gmond memory leaks
>>
>> 3.1.2 on gentoo (that solaris must be a sourceforge ad?). i have zero
>> experience with valgrind... i'll have a look but a smidge of guidance
>> would be appreciated. :)
>>
>
> Just get valgrind and run the leaking "gmond" under its control. "gmond"
> should be configured to not run in background.
> After some time interrupt it and you will get a report of valgrinds findings.
>
> For example, a simple program leaking 8x1MB will produce:
>
> [mknob...@l6g0223j ~]$ valgrind ./memeat
> ==13647== Memcheck, a memory error detector.
> ==13647== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
> ==13647== Using LibVEX rev 1658, a library for dynamic binary translation.
> ==13647== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
> ==13647== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
> ==13647== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
> ==13647== For more details, rerun with: -v
> ==13647==
> ^C
> ==13647==
> ==13647== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
> ==13647== malloc/free: in use at exit: 8,000,000 bytes in 8 blocks.
> ==13647== malloc/free: 8 allocs, 0 frees, 8,000,000 bytes allocated.
> ==13647== For counts of detected errors, rerun with: -v
> ==13647== searching for pointers to 8 not-freed blocks.
> ==13647== checked 66,440 bytes.
> ==13647==
> ==13647== LEAK SUMMARY:
> ==13647==    definitely lost: 8,000,000 bytes in 8 blocks.
> ==13647==      possibly lost: 0 bytes in 0 blocks.
> ==13647==    still reachable: 0 bytes in 0 blocks.
> ==13647==         suppressed: 0 bytes in 0 blocks.
> ==13647== Use --leak-check=full to see details of leaked memory.
>
> If you use "--leak-check=full", it will tell you where the leaking memory was
> allocated. "gmond" needs to be compiled with debug info (-g).
>
> A few questions.
>
> - What is your setup? I assume quite a few hosts monitoring (collectors)
>   metrics and one aggregating the results.
> - Which of the "gmond"s leak? The "collectors", the "aggregator" or both?
>
> Cheers
> Martin
>
>> yeah 150k metrics is a lot... i have an interest in scaling this thing.
>> i'll post another thread bout things i've done to scale so far that seem
>> to be working well.
>>
>> On Wed, 17 Feb 2010, Martin Knoblauch wrote:
>>
>>> Hi Scott,
>>>
>>> which version of Ganglia and which operating environment do you have
>>> (guessing Solaris from your signature :-)? Any chance that you could run
>>> valgrind or equivalent on your setup? 10GB/day is a lot, as is 150k metrics.
>>>
>>> Cheers
>>> Martin
>>> ------------------------------------------------------
>>> Martin Knoblauch
>>> email: k n o b i AT knobisoft DOT de
>>> www: http://www.knobisoft.de
>>>
>>> ----- Original Message ----
>>>> From: Scott Dworkis
>>>> To: [email protected]
>>>> Sent: Wed, February 17, 2010 3:08:26 AM
>>>> Subject: [Ganglia-general] gmond memory leaks
>>>>
>>>> (sorry if this is a repost... i tried previously without having first
>>>> subscribed to the list, and fear i got lost somewhere along the
>>>> moderation path)
>>>>
>>>> hi all - i am seeing gmond leak about 10GB/day on about 150k metrics
>>>> collected. it seemed like things worsened when i added dmax to all my
>>>> custom metrics, but maybe it was always bad. is this a known issue?
>>>>
>>>> sorry if it is already known... i couldn't see that there was a good way
>>>> to search the forums or if there is a bug tracker to search.
>>>>
>>>> -scott
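p.s. for anyone curious, the spoof job i mentioned above boils down to a loop of gmetric calls roughly like the sketch below. the switch ip/name, metric names and values are made up for illustration, and the flags are the ones i believe exist in 3.1-era gmetric (check "gmetric --help" on your version to be sure):

#!/bin/sh
# sketch of the spoofed-metric injector: one gmetric call per switch-port metric,
# with -S "ip:hostname" so the metric is attributed to the switch rather than the
# host actually running gmetric. -d sets dmax, -t/-u set the type and units.
SPOOF="10.0.0.1:switch01"   # placeholder ip:hostname pair

gmetric -S "$SPOOF" -n "port01_in_octets"  -v 123456 -t uint32 -u bytes -d 600
gmetric -S "$SPOOF" -n "port01_out_octets" -v 654321 -t uint32 -u bytes -d 600
# ...repeated for a few dozen metrics per port, for every port on every switch

# the heartbeat experiment i mentioned would add something along these lines:
gmetric -S "$SPOOF" -H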

