----- Original Message ----
> From: Kumar Vaibhav <[EMAIL PROTECTED]>
> To: Jesse Becker <[EMAIL PROTECTED]>
> Cc: Martin Knoblauch <[EMAIL PROTECTED]>; Ganglia Developers 
> <ganglia-developers@lists.sourceforge.net>; Bernard Li <[EMAIL PROTECTED]>
> Sent: Friday, March 21, 2008 8:16:42 AM
> Subject: Re: [Ganglia-developers] Memory leak in gmond
> 
> Hi All,
> 
> I am still seeing some memory leak in the nodes.
> Now the problem is not in the deaf mode but in the mute mode. To reduce the 
> debugging complexity I am running 3.0.7 
> on 2 nodes, one in deaf mode and the other in mute mode. The deaf mode is working 
> fine and the node in mute mode is showing the 
> memory leak. Here is the output of valgrind for the node in mute mode.
> 
Hi Kumar,

 While I assume that some/most of the "leaks" you are seeing are one-time 
allocations that simply live until process end, I am at least confused by the 
ones from "hash_lookup". That is part of a metrics sampling function which 
should not be called at all in "mute" mode - unless I am completely wrong.

 Could you do the valgrind run twice, with different total run-times? Just to 
see which of the "leaks" accumulate.
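 To illustrate what I would expect in "mute" mode, here is a simplified sketch - 
the `_sim` names are made up, this is not the actual gmond.c code - of how the 
collection path would be guarded so that hash_lookup() is never reached:

```c
#include <assert.h>
#include <stdbool.h>

static int collections_run = 0;

/* stand-in for process_collection_groups() / Ganglia_collection_group_collect() */
static void process_collection_groups_sim(void) { collections_run++; }

/* one iteration of a simplified gmond main loop */
static void main_loop_iteration(bool mute)
{
    /* a mute node never samples local metrics, so the hash_lookup()
     * path (and its strndup) should never be reached */
    if (!mute)
        process_collection_groups_sim();
    /* ... poll sockets, process received packets, etc. ... */
}
```

If the valgrind traces really do go through bytes_out_func in mute mode, then 
either a guard like this is missing or it is being bypassed somewhere.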

> 
> ==21588==
> ==21588== Process terminating with default action of signal 2 (SIGINT)
> ==21588==    at 0x3F810C485F: poll (in /lib64/libc-2.5.so)
> ==21588==    by 0x41D7B1: apr_pollset_poll (poll.c:504)
> ==21588==    by 0x405846: main (gmond.c:1269)
> --21588-- Discarding syms at 0x4D41000-0x4F4C000 in 
> /lib64/libnss_files-2.5.so 
> due to munmap()
> ==21588==
> ==21588== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
> --21588--
> --21588-- supp:    5 Fedora-Core-6-hack3-ld25
> ==21588== malloc/free: in use at exit: 740,602 bytes in 1,190 blocks.
> ==21588== malloc/free: 2,574 allocs, 1,384 frees, 946,209 bytes allocated.
> ==21588==
> ==21588== searching for pointers to 1,190 not-freed blocks.
> ==21588== checked 479,904 bytes.
> ==21588==
> ==21588== 5 bytes in 1 blocks are still reachable in loss record 1 of 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x4111FF: cfg_init (confuse.c:1087)
> ==21588==    by 0x40EB7C: Ganglia_gmond_config_create (libgmond.c:523)
> ==21588==    by 0x405529: process_configuration_file (gmond.c:180)
> ==21588==    by 0x405627: main (gmond.c:1815)
> ==21588==

 I think this is a one-time alloc from reading the config file.

> ==21588==
> ==21588== 19 bytes in 4 blocks are still reachable in loss record 2 of 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x3F810750E1: strndup (in /lib64/libc-2.5.so)
> ==21588==    by 0x40806A: hash_lookup (metrics.c:151)
> ==21588==    by 0x408D75: bytes_out_func (metrics.c:425)
> ==21588==    by 0x40418C: Ganglia_collection_group_collect (gmond.c:1540)
> ==21588==    by 0x404FC8: process_collection_groups (gmond.c:1662)
> ==21588==    by 0x40600E: main (gmond.c:1913)
> ==21588==

 Now, this one is from bytes_out_func. Likely a one-time allocation. How many 
network interfaces does that system have? What are they named? And I wonder why 
it is called at all in mute mode.
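 For reference, a simplified model of the pattern I mean (made-up names, not 
the real metrics.c code, and no bounds checking): the first lookup for an 
interface strndup()s its name and caches the entry, later lookups return the 
cached struct. So memory grows once per interface and then stays flat - that is 
why it shows up as "still reachable" rather than a growing leak:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_IFACES 16

typedef struct {
    char  *name;        /* strndup()ed once, lives until process exit */
    double bytes_out;
} net_dev_stats_sim;

static net_dev_stats_sim table[MAX_IFACES];
static int n_entries = 0;
static int n_allocs  = 0;   /* instrumentation for this sketch only */

static net_dev_stats_sim *hash_lookup_sim(const char *name, size_t len)
{
    /* return the cached entry if we have seen this interface before */
    for (int i = 0; i < n_entries; i++)
        if (strncmp(table[i].name, name, len) == 0 && table[i].name[len] == '\0')
            return &table[i];

    /* first time: one-time allocation per interface */
    table[n_entries].name = strndup(name, len);
    n_allocs++;
    return &table[n_entries++];
}
```

So with 4 interfaces you would expect exactly 4 such blocks, which matches your 
"19 bytes in 4 blocks" record.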

> ==21588==
> ==21588== 22 bytes in 2 blocks are still reachable in loss record 3 of 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x406740: gengetopt_strdup (cmdline.c:64)
> ==21588==    by 0x40689E: cmdline_parser (cmdline.c:100)
> ==21588==    by 0x4055BD: main (gmond.c:1780)
> ==21588==

 One-time allocation.

> ==21588==
> ==21588== 56 bytes in 1 blocks are still reachable in loss record 4 of 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x4111D2: cfg_init (confuse.c:1083)
> ==21588==    by 0x40EB7C: Ganglia_gmond_config_create (libgmond.c:523)
> ==21588==    by 0x405529: process_configuration_file (gmond.c:180)
> ==21588==    by 0x405627: main (gmond.c:1815)
> ==21588==

 One-time allocation.

> ==21588==
> ==21588== 192 bytes in 4 blocks are still reachable in loss record 5 of 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x408057: hash_lookup (metrics.c:144)
> ==21588==    by 0x408D75: bytes_out_func (metrics.c:425)
> ==21588==    by 0x40418C: Ganglia_collection_group_collect (gmond.c:1540)
> ==21588==    by 0x404FC8: process_collection_groups (gmond.c:1662)
> ==21588==    by 0x40600E: main (gmond.c:1913)
> ==21588==

  See my comment above. That looks like 4 net_dev_stats structures. Likely 
one-time allocations. But this should not happen at all in "mute" mode. Are you 
running in 32-bit or 64-bit mode? It seems we could save 8 bytes per struct by 
sorting the members better.
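 To illustrate the member-sorting point - these are hypothetical layouts, not 
the real net_dev_stats, and the sizes assume an LP64 platform like x86_64 
Linux: interleaving 4-byte ints between 8-byte members forces padding, while 
grouping the 8-byte members first packs the struct tighter:

```c
#include <assert.h>
#include <stddef.h>

struct unsorted {
    char  *name;        /* 8 bytes                    */
    int    refs;        /* 4 bytes + 4 bytes padding  */
    double bytes_out;   /* 8 bytes                    */
    int    flags;       /* 4 bytes + 4 bytes padding  */
};                      /* sizeof == 32 on LP64       */

struct sorted {
    char  *name;        /* 8 bytes                    */
    double bytes_out;   /* 8 bytes                    */
    int    refs;        /* 4 bytes                    */
    int    flags;       /* 4 bytes, no padding needed */
};                      /* sizeof == 24 on LP64       */
```

On a 32-bit build the difference would mostly disappear, which is why I asked 
about the mode.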

> ==21588==
> ==21588== 192 bytes in 1 blocks are still reachable in loss record 6 of 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x41BDC1: apr_allocator_create (apr_pools.c:90)
> ==21588==    by 0x41C55C: apr_pool_initialize (apr_pools.c:506)
> ==21588==    by 0x41A7C4: apr_initialize (start.c:55)
> ==21588==    by 0x40EC9F: Ganglia_pool_create (libgmond.c:494)
> ==21588==    by 0x4055DA: main (gmond.c:1789)
> ==21588==

Likely one-time allocation.

> ==21588==
> ==21588== 322 bytes in 48 blocks are still reachable in loss record 7 of 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x3F810F5A63: xdr_string (in /lib64/libc-2.5.so)
> ==21588==    by 0x40DA67: xdr_Ganglia_message (protocol_xdr.c:124)
> ==21588==    by 0x4047C3: process_udp_recv_channel (gmond.c:905)
> ==21588==    by 0x4059FC: main (gmond.c:1279)
> ==21588==

 No idea.

> ==21588==
> ==21588== 328 bytes in 8 blocks are still reachable in loss record 8 of 16
> ==21588==    at 0x4A0590B: realloc (vg_replace_malloc.c:306)
> ==21588==    by 0x40FCAB: cfg_addval (confuse.c:372)
> ==21588==    by 0x411397: cfg_setopt (confuse.c:587)
> ==21588==    by 0x410B7A: cfg_parse_internal (confuse.c:938)
> ==21588==    by 0x410B9F: cfg_parse_internal (confuse.c:944)
> ==21588==    by 0x410EC3: cfg_parse_fp (confuse.c:1035)
> ==21588==    by 0x410F9D: cfg_parse (confuse.c:1054)
> ==21588==    by 0x40EB8A: Ganglia_gmond_config_create (libgmond.c:525)
> ==21588==    by 0x405529: process_configuration_file (gmond.c:180)
> ==21588==    by 0x405627: main (gmond.c:1815)
> ==21588==

 One-time allocation from processing the config file.

> ==21588==
> ==21588== 1,128 bytes in 141 blocks are still reachable in loss record 9 of 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x4A05883: realloc (vg_replace_malloc.c:306)
> ==21588==    by 0x40FCAB: cfg_addval (confuse.c:372)
> ==21588==    by 0x411397: cfg_setopt (confuse.c:587)
> ==21588==    by 0x4110D7: cfg_init_defaults (confuse.c:529)
> ==21588==    by 0x411251: cfg_init (confuse.c:1094)
> ==21588==    by 0x40EB7C: Ganglia_gmond_config_create (libgmond.c:523)
> ==21588==    by 0x405529: process_configuration_file (gmond.c:180)
> ==21588==    by 0x405627: main (gmond.c:1815)
> ==21588==

ditto

> ==21588==
> ==21588== 1,143 bytes in 180 blocks are definitely lost in loss record 10 of 
> 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x3F810F5A63: xdr_string (in /lib64/libc-2.5.so)
> ==21588==    by 0x40D87D: xdr_Ganglia_gmetric_message (protocol_xdr.c:23)
> ==21588==    by 0x40D9FD: xdr_Ganglia_message (protocol_xdr.c:83)
> ==21588==    by 0x4047C3: process_udp_recv_channel (gmond.c:905)
> ==21588==    by 0x4059FC: main (gmond.c:1279)
> ==21588==

No idea.
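 Speculating a bit: since these blocks are "definitely lost" and come from 
xdr_string() during decode, it could be that the decoded message strings are 
never handed back via the matching xdr_free() call. A simplified model of the 
required pairing, using plain strdup() instead of the rpc headers (names made 
up, not the actual protocol_xdr.c code):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int live_strings = 0;   /* instrumentation for this sketch only */

typedef struct { char *name; } gmetric_msg_sim;

/* decoding allocates, the way xdr_string() does in XDR_DECODE mode */
static void decode_msg(gmetric_msg_sim *m, const char *wire)
{
    m->name = strdup(wire);
    live_strings++;
}

/* the counterpart of xdr_free(): without calling this after each
 * received packet, every decode leaks its strings - "definitely lost"
 * once the pointer goes out of scope */
static void free_msg(gmetric_msg_sim *m)
{
    free(m->name);
    m->name = NULL;
    live_strings--;
}
```

That would explain why the lost bytes sit in process_udp_recv_channel, i.e. 
only show up on a node that actually receives packets (mute, not deaf).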

> ==21588==
> ==21588== 1,456 bytes in 182 blocks are still reachable in loss record 11 of 
> 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x40FCC5: cfg_addval (confuse.c:375)
> ==21588==    by 0x411397: cfg_setopt (confuse.c:587)
> ==21588==    by 0x4110D7: cfg_init_defaults (confuse.c:529)
> ==21588==    by 0x411251: cfg_init (confuse.c:1094)
> ==21588==    by 0x40EB7C: Ganglia_gmond_config_create (libgmond.c:523)
> ==21588==    by 0x405529: process_configuration_file (gmond.c:180)
> ==21588==    by 0x405627: main (gmond.c:1815)
> ==21588==

One-time allocation

> ==21588==
> ==21588== 2,912 bytes in 52 blocks are still reachable in loss record 12 of 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x41151E: cfg_setopt (confuse.c:665)
> ==21588==    by 0x4110D7: cfg_init_defaults (confuse.c:529)
> ==21588==    by 0x411251: cfg_init (confuse.c:1094)
> ==21588==    by 0x40EB7C: Ganglia_gmond_config_create (libgmond.c:523)
> ==21588==    by 0x405529: process_configuration_file (gmond.c:180)
> ==21588==    by 0x405627: main (gmond.c:1815)
> ==21588==

ditto

> ==21588==
> ==21588== 4,123 bytes in 401 blocks are still reachable in loss record 13 of 
> 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x3F81075081: strdup (in /lib64/libc-2.5.so)
> ==21588==    by 0x410218: cfg_dupopt_array (confuse.c:401)
> ==21588==    by 0x41122B: cfg_init (confuse.c:1088)
> ==21588==    by 0x40EB7C: Ganglia_gmond_config_create (libgmond.c:523)
> ==21588==    by 0x405529: process_configuration_file (gmond.c:180)
> ==21588==    by 0x405627: main (gmond.c:1815)
> ==21588==

ditto

> ==21588==
> ==21588== 16,384 bytes in 2 blocks are still reachable in loss record 14 of 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x41C7C0: apr_palloc (apr_pools.c:293)
> ==21588==    by 0x403E59: Ganglia_metric_cb_define (gmond.c:1306)
> ==21588==    by 0x403F08: setup_metric_callbacks (gmond.c:1367)
> ==21588==    by 0x4056EE: main (gmond.c:1845)
> ==21588==

hmm. I just wonder why we do this in "mute" mode.

> ==21588==
> ==21588== 40,576 bytes in 81 blocks are still reachable in loss record 15 of 
> 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x4101BB: cfg_dupopt_array (confuse.c:395)
> ==21588==    by 0x41122B: cfg_init (confuse.c:1088)
> ==21588==    by 0x40EB7C: Ganglia_gmond_config_create (libgmond.c:523)
> ==21588==    by 0x405529: process_configuration_file (gmond.c:180)
> ==21588==    by 0x405627: main (gmond.c:1815)
> ==21588==

One-time allocation. Rather big, but likely not serious.

> ==21588==
> ==21588== 671,744 bytes in 82 blocks are still reachable in loss record 16 of 
> 16
> ==21588==    at 0x4A05809: malloc (vg_replace_malloc.c:149)
> ==21588==    by 0x41C2C3: apr_pool_create_ex (apr_pools.c:293)
> ==21588==    by 0x41C586: apr_pool_initialize (apr_pools.c:511)
> ==21588==    by 0x41A7C4: apr_initialize (start.c:55)
> ==21588==    by 0x40EC9F: Ganglia_pool_create (libgmond.c:494)
> ==21588==    by 0x4055DA: main (gmond.c:1789)
> ==21588==

No idea; it seems to come from initialization.

> ==21588== LEAK SUMMARY:
> ==21588==    definitely lost: 1,143 bytes in 180 blocks.
> ==21588==      possibly lost: 0 bytes in 0 blocks.
> ==21588==    still reachable: 739,459 bytes in 1,010 blocks.
> ==21588==         suppressed: 0 bytes in 0 blocks.
> --21588--  memcheck: sanity checks: 46 cheap, 2 expensive
> --21588--  memcheck: auxmaps: 301 auxmap entries (19264k, 18M) in use
> --21588--  memcheck: auxmaps: 4427566 searches, 5961364 comparisons
> --21588--  memcheck: SMs: n_issued      = 41 (656k, 0M)
> --21588--  memcheck: SMs: n_deissued    = 0 (0k, 0M)
> --21588--  memcheck: SMs: max_noaccess  = 524287 (8388592k, 8191M)
> --21588--  memcheck: SMs: max_undefined = 0 (0k, 0M)
> --21588--  memcheck: SMs: max_defined   = 387 (6192k, 6M)
> --21588--  memcheck: SMs: max_non_DSM   = 41 (656k, 0M)
> --21588--  memcheck: max sec V bit nodes:    3 (0k, 0M)
> --21588--  memcheck: set_sec_vbits8 calls: 3 (new: 3, updates: 0)
> --21588--  memcheck: max shadow mem size:   4800k, 4M
> --21588-- translate:            fast SP updates identified: 5,079 ( 86.9%)
> --21588-- translate:   generic_known SP updates identified: 607 ( 10.3%)
> --21588-- translate: generic_unknown SP updates identified: 153 (  2.6%)
> --21588--     tt/tc: 23,734 tt lookups requiring 24,620 probes
> --21588--     tt/tc: 23,734 fast-cache updates, 6 flushes
> --21588--  transtab: new        5,719 (136,749 -> 2,554,493; ratio 186:10) [0 
> scs]
> --21588--  transtab: dumped     0 (0 -> ??)
> --21588--  transtab: discarded  153 (2,818 -> ??)
> --21588-- scheduler: 4,609,790 jumps (bb entries).
> --21588-- scheduler: 46/39,509 major/minor sched events.
> --21588--    sanity: 47 cheap, 2 expensive checks.
> --21588--    exectx: 30,011 lists, 203 contexts (avg 0 per list)
> --21588--    exectx: 3,963 searches, 5,723 full compares (1,444 per 1000)
> --21588--    exectx: 4,421 cmp2, 10 cmp4, 0 cmpAll
> 
> 
> 
> Jesse Becker wrote:
> > On Feb 19, 2008 7:39 PM, Martin Knoblauch  wrote:
> >> ----- Original Message ----
> >>> From: Jesse Becker 
> >>> To: Ganglia Developers 
> >>> Sent: Tuesday, February 19, 2008 11:25:54 PM
> >>> Subject: Re: [Ganglia-developers] Memory leak in gmond
> >>>
> >>> I'm not sure if this is right--I've only taken a really quick look in
> >>> libmetrics/linux/metrics.c, and my C-fu is rusty.
> >>>
> >>> It looks like strndup() is called in linux/metrics.c:hash_lookup
> >>> (about line 131) to duplicate an interface name, which is included in
> >>> the stats structure as stats->name.  The net_dev_stats function will
> >>> return this struct.
> >>>
> >>> The function is called in a number of places pkts_in_func,
> >>> pkts_out_func, bytes_out_func and bytes_in_func.  The variable "*ns"
> >>> is assigned the output of hash_lookup (e.g. the struct).  Since the
> >>> 'name' element is malloc()ed, but not explicitly freed, it will not go
> >>> away when *ns goes out of scope.  This is the leak, isn't it?  All
> >>> four of these functions are very similar, and need to be fixed if this
> >>> is the case.
> >>>
> >>> Or did I miss something obvious? :)
> >>>
> >>  Lines 137, 148 and 159 ? :-)
> > 
> > I saw those. :-P  I meant after the struct has been returned, outside
> > the function, the memory is never freed.  Inside that function, it's
> > okay.
> > 
> >>  The memory allocated in line 151 is never freed, indeed. But it is only
> >> allocated once per interface and stays alive for the entire lifetime of the
> >> gmond process. So, it is not leaked.
> > 
> > Ah, that makes more sense, especially if those variables exist for the
> > lifetime of the program.
> > 
> > So, I've just run gmond under valgrind and duma (a fork of the old
> > Electric Fence memory debugger), and I can't seem to reproduce the
> > problem now.  Neither one of them is showing any obvious leaks, at
> > least not in the 15 minute tests I've run.  The test system(s) are
> > CentOS4.6 boxes.
> > 
> > 
> 
> 
> 



_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
