resending without the valgrind attachments to get around the 40k moderator approval limit...
On Thu, 4 Mar 2010, Scott Dworkis wrote:

> ok here's a couple more valgrinds with --leak-check=full --show-reachable=yes.
> still reporting a very small amount of memory lost if i'm reading this right.
>
> also, i think i've isolated the cause (woot).
>
> i had a couple different custom gmetric spoof metrics for my network
> switches... one had our internal fqdn for the switches, and the other just
> had the short switchnames. when i normalized both scripts to use the fqdn,
> the leak looks like it has mostly if not completely stopped. probably would
> have worked to normalize both to the short hostname also, but fqdn makes more
> sense in my setup. so, the first valgrind is before my gmetric hostname
> normalizations, the second is after.
>
> anyone know if i should submit a bug?
>
> -scott
>
> On Wed, 3 Mar 2010, Martin Knoblauch wrote:
>
>> ------------------------------------------------------
>> Martin Knoblauch
>> email: k n o b i AT knobisoft DOT de
>> www: http://www.knobisoft.de
>>
>> ----- Original Message ----
>>> From: Scott Dworkis <[email protected]>
>>> To: Martin Knoblauch <[email protected]>
>>> Cc: [email protected]
>>> Sent: Wed, March 3, 2010 5:21:32 AM
>>> Subject: Re: [Ganglia-general] gmond memory leaks
>>>
>>> finally had some time to do a few attempts at valgrind... so far it
>>> doesn't seem to be telling me much... the numbers it reports are in the
>>> megabyte and not gigabyte range that i'm seeing.
>>>
>>> after a couple hours of valgrind i see:
>>>
>>> ==31952== LEAK SUMMARY:
>>> ==31952==    definitely lost: 532 bytes in 23 blocks.
>>> ==31952==    indirectly lost: 271 bytes in 16 blocks.
>>> ==31952==      possibly lost: 13,872 bytes in 30 blocks.
>>> ==31952==    still reachable: 1,626,182 bytes in 2,188 blocks.
>>> ==31952==         suppressed: 0 bytes in 0 blocks.
>>> ==31952== Reachable blocks (those to which a pointer was found) are not shown.
>>> ==31952== To see them, rerun with: --leak-check=full --show-reachable=yes
>>>
>>> this doesn't grow much even after valgrinding overnight
>>>
>>> ==24957== LEAK SUMMARY:
>>> ==24957==    definitely lost: 2,404 bytes in 179 blocks.
>>> ==24957==    indirectly lost: 271 bytes in 16 blocks.
>>> ==24957==      possibly lost: 13,872 bytes in 30 blocks.
>>> ==24957==    still reachable: 1,626,182 bytes in 2,188 blocks.
>>> ==24957==         suppressed: 0 bytes in 0 blocks.
>>>
>>> in fact most of these numbers are identical, so they must be fixed losses
>>> in terms of valgrind accounting.
>>>
>>
>> Did you try "--leak-check=full --show-reachable=yes"? I believe that is
>> supposed to show all allocations. Might be a bit of output, but as far as I
>> can see you are able to reproduce early.
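
[For reference, running gmond in the foreground under valgrind with those
options might look roughly like the sketch below. The binary and config paths
and the gmond command-line options are assumptions based on a typical 3.1.x
install, not something taken from this thread; check `gmond --help` and
`valgrind --help` on the actual host, and build gmond with debug info (-g) so
the leak records carry usable stack traces.

    # sketch only: paths and gmond's -c/-d options are assumed, verify locally
    valgrind --leak-check=full --show-reachable=yes \
             --log-file=/tmp/gmond-valgrind.log \
             /usr/sbin/gmond -c /etc/ganglia/gmond.conf -d 2
    # -d 2 is intended to keep gmond in the foreground (debug level > 1);
    # let it run for a while, interrupt with Ctrl-C, then read the leak report

]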
>>
>>> this does not really reflect the growth of my gmond process (running under
>>> valgrind here, so reported as "memcheck"), which i tracked with 5 minute
>>> samples of top for an hour; it shows a linear leak of over 1GB during that
>>> period:
>>>
>>> (s...@admin3:16:43:/home/admin/monitoring/scripts) while [ 1 ];do top -n 1 | grep mem;sleep 300;done
>>> 24957 nobody 20 0 5623m 3.5g 3648 R  80 11.1 121:49.98 memcheck
>>> 24957 nobody 20 0 5753m 3.6g 3648 R  76 11.4 126:43.25 memcheck
>>> 24957 nobody 20 0 5948m 3.7g 3652 R 101 11.8 131:36.26 memcheck
>>> 24957 nobody 20 0 6108m 3.8g 3652 R  99 12.1 136:29.35 memcheck
>>> 24957 nobody 20 0 6267m 3.9g 3652 R  97 12.4 141:17.02 memcheck
>>> 24957 nobody 20 0 6436m 4.0g 3652 R  97 12.7 146:07.58 memcheck
>>> 24957 nobody 20 0 6547m 4.1g 3652 R  63 13.0 150:56.74 memcheck
>>> 24957 nobody 20 0 6707m 4.2g 3652 R  99 13.3 155:47.88 memcheck
>>> 24957 nobody 20 0 6917m 4.3g 3652 R  99 13.7 160:40.30 memcheck
>>> 24957 nobody 20 0 7055m 4.4g 3652 R  97 14.0 165:28.40 memcheck
>>> 24957 nobody 20 0 7201m 4.5g 3652 R 101 14.3 170:20.32 memcheck
>>> 24957 nobody 20 0 7340m 4.6g 3652 R  99 14.6 175:08.75 memcheck
>>>
>>> if i understand valgrind right, it's only orphaned data that's counted as
>>> lost... perhaps some structure is not orphaned but bloating?
>>>
>>> one other accidental observation, i have a job that generates 70k metrics
>>> every 5 minutes (a few dozen for every port on each of our switches)...
>>> these are all "spoof" ip metrics. this job had been accidentally disabled
>>> for a few days and i noticed that the leak virtually stopped. i can play
>>> some more with various parameters of this job and see if i find anything
>>> more... could be the spoof thing is coincidental but Rick Cobb also
>>> mentioned his leak seemed to be spoof related. i'll also see if sending
>>> heartbeats for the spoof ips helps anything.
>>>
>>
>> spoofing might indeed be a hint.
>>
>> Martin
>>
>>> -scott
>>>
>>>> Message: 2
>>>> Date: Thu, 18 Feb 2010 07:15:33 -0800 (PST)
>>>> From: Martin Knoblauch
>>>> Subject: Re: [Ganglia-general] gmond memory leaks
>>>> To: Scott Dworkis
>>>> Cc: [email protected]
>>>> Message-ID: <[email protected]>
>>>> Content-Type: text/plain; charset=us-ascii
>>>>
>>>> ----- Original Message ----
>>>>
>>>>> From: Scott Dworkis
>>>>> To: Martin Knoblauch
>>>>> Cc: [email protected]
>>>>> Sent: Wed, February 17, 2010 8:32:32 PM
>>>>> Subject: Re: [Ganglia-general] gmond memory leaks
>>>>>
>>>>> 3.1.2 on gentoo (that solaris must be a sourceforge ad?). i have zero
>>>>> experience with valgrind... i'll have a look but a smidge of guidance
>>>>> would be appreciated. :)
>>>>>
>>>>
>>>> Just get valgrind and run the leaking "gmond" under its control. "gmond"
>>>> should be configured to not run in background. After some time interrupt
>>>> it and you will get a report of valgrind's findings.
>>>>
>>>> For example, a simple program leaking 8x1MB will produce:
>>>>
>>>> [mknob...@l6g0223j ~]$ valgrind ./memeat
>>>> ==13647== Memcheck, a memory error detector.
>>>> ==13647== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
>>>> ==13647== Using LibVEX rev 1658, a library for dynamic binary translation.
>>>> ==13647== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
>>>> ==13647== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
>>>> ==13647== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
>>>> ==13647== For more details, rerun with: -v
>>>> ==13647==
>>>> ^C
>>>> ==13647==
>>>> ==13647== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
>>>> ==13647== malloc/free: in use at exit: 8,000,000 bytes in 8 blocks.
>>>> ==13647== malloc/free: 8 allocs, 0 frees, 8,000,000 bytes allocated.
>>>> ==13647== For counts of detected errors, rerun with: -v
>>>> ==13647== searching for pointers to 8 not-freed blocks.
>>>> ==13647== checked 66,440 bytes.
>>>> ==13647==
>>>> ==13647== LEAK SUMMARY:
>>>> ==13647==    definitely lost: 8,000,000 bytes in 8 blocks.
>>>> ==13647==      possibly lost: 0 bytes in 0 blocks.
>>>> ==13647==    still reachable: 0 bytes in 0 blocks.
>>>> ==13647==         suppressed: 0 bytes in 0 blocks.
>>>> ==13647== Use --leak-check=full to see details of leaked memory.
>>>>
>>>> If you use "--leak-check=full", it will tell you where the leaking memory
>>>> was allocated. "gmond" needs to be compiled with debug info (-g).
>>>>
>>>> A few questions.
>>>>
>>>> - What is your setup? I assume quite a few hosts monitoring metrics (the
>>>>   collectors) and one aggregating the results.
>>>> - Which of the "gmond"s leak? The "collectors", the "aggregator" or both?
>>>>
>>>> Cheers
>>>> Martin
>>>>
>>>>> yeah 150k metrics is a lot... i have an interest in scaling this thing.
>>>>> i'll post another thread about things i've done to scale so far that seem
>>>>> to be working well.
>>>>>
>>>>> On Wed, 17 Feb 2010, Martin Knoblauch wrote:
>>>>>
>>>>>> Hi Scott,
>>>>>>
>>>>>> which version of Ganglia and which operating environment do you have
>>>>>> (guessing Solaris from your signature :-)? Any chance that you could run
>>>>>> valgrind or equivalent on your setup? 10GB/day is a lot, as is 150k
>>>>>> metrics.
>>>>>>
>>>>>> Cheers
>>>>>> Martin
>>>>>> ------------------------------------------------------
>>>>>> Martin Knoblauch
>>>>>> email: k n o b i AT knobisoft DOT de
>>>>>> www: http://www.knobisoft.de
>>>>>>
>>>>>> ----- Original Message ----
>>>>>>> From: Scott Dworkis
>>>>>>> To: [email protected]
>>>>>>> Sent: Wed, February 17, 2010 3:08:26 AM
>>>>>>> Subject: [Ganglia-general] gmond memory leaks
>>>>>>>
>>>>>>> (sorry if this is a repost... i tried previously without having first
>>>>>>> subscribed to the list, and fear i got lost somewhere along the
>>>>>>> moderation path)
>>>>>>>
>>>>>>> hi all - i am seeing gmond leak about 10GB/day on about 150k metrics
>>>>>>> collected. it seemed like things worsened when i added dmax to all my
>>>>>>> custom metrics, but maybe it was always bad. is this a known issue?
>>>>>>>
>>>>>>> sorry if it is already known... i couldn't see that there was a good way
>>>>>>> to search the forums or if there is a bug tracker to search.
>>>>>>>
>>>>>>> -scott
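
[The spoofed-metric detail that eventually mattered in this thread is easiest
to see as a command. The sketch below is illustrative only: the metric names,
values, addresses and switch hostnames are invented, and the option spellings
should be checked against `gmetric --help` for the installed version. The
point is simply what Scott describes: every spoofed sample for a given switch
should carry one consistent ip:host string.

    # hypothetical per-port spoofed metric; one script used the short switch name...
    gmetric --name port_gi0_1_octets_in --value 123456 --type uint32 --units bytes \
            --dmax 600 --spoof 10.0.0.5:switch17

    # ...while another used the fqdn for the same switch, so gmond tracked two hosts:
    gmetric --name port_gi0_1_octets_in --value 123456 --type uint32 --units bytes \
            --dmax 600 --spoof 10.0.0.5:switch17.internal.example.com

    # normalizing both scripts to the fqdn (or both to the short name) gives every
    # spoofed sample the same ip:host string, which is what stopped the growth here

]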

