On Mon, Nov 03, 2008 at 01:29:50PM -0500, Ofer Inbar wrote:

> Oh, one more thing:  I can see on the Ganglia graph for that host that
> just before it "froze", system CPU went to 100% very suddenly - one
> sample was normal, with user ~15% and system ~8%, and the next sample
> it was all system CPU.  It stayed that way for about 3-4 minutes before
> Ganglia stopped recording metrics.
> 
> I don't think this was gmond's fault necessarily, but it may have been
> the situation that caused gmond to get into the frozen state.

I had a look at the code, and although the gmond.conf man page says that
the default timeout for tcp_accept_channel is 1 sec, in the code it is set
to -1, so connections can block forever.
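
To illustrate why a negative timeout means "block forever", here is a
minimal sketch. It is not taken from the gmond sources: both function
names are made up for illustration, and it assumes the timeout is given
in microseconds and that a negative value means "no timeout", which is
the convention poll() itself uses.

```c
#include <poll.h>

/* Hypothetical helper (not gmond code): convert a timeout given in
 * microseconds into the milliseconds that poll() expects.
 * A negative value is passed through as -1, which tells poll() to
 * block indefinitely. */
int timeout_usec_to_poll_ms(long timeout_usec)
{
    if (timeout_usec < 0)
        return -1;                              /* no timeout: block forever */
    return (int)((timeout_usec + 999) / 1000);  /* round up to whole ms */
}

/* Hypothetical helper (not gmond code): wait until fd is writable or
 * the timeout expires.
 * Returns 1 if writable, 0 on timeout, -1 on error. */
int wait_writable(int fd, long timeout_usec)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_usec_to_poll_ms(timeout_usec));
    if (rc < 0)
        return -1;                              /* poll() itself failed */
    if (rc == 0)
        return 0;                               /* timed out */
    return (pfd.revents & POLLOUT) ? 1 : -1;    /* writable, or error/hangup */
}
```

With timeout_usec = -1 the wait never returns if the client stops
reading, whereas with 1000000 it gives up after one second.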

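You can set the timeout explicitly in gmond.conf to work around the
problem (1000000 should correspond to the documented 1 sec, i.e. the
value is in microseconds):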

  tcp_accept_channel {
    port = 8649
    timeout = 1000000
  }

The following untested patch should also work, I hope :)
--- lib/libgmond.c-orig 2008-11-03 20:14:01.000000000 +0000
+++ lib/libgmond.c      2008-11-03 20:14:12.000000000 +0000
@@ -101,7 +101,7 @@
   CFG_INT("port", -1, CFGF_NONE ),
   CFG_STR("interface", NULL, CFGF_NONE),
   CFG_SEC("acl", acl_opts, CFGF_NONE),
-  CFG_INT("timeout", -1, CFGF_NONE),
+  CFG_INT("timeout", 1000000, CFGF_NONE),
   CFG_STR("family", "inet4", CFGF_NONE),
   CFG_END()
 };

Cheers,
Kostas

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
