On Mon, Nov 03, 2008 at 01:29:50PM -0500, Ofer Inbar wrote:
> Oh, one more thing: I can see on the Ganglia graph for that host that
> just before it "froze", system CPU went to 100% very suddenly - one
> sample was normal, with user ~15% and system ~8%, and the next sample
> it was all system CPU. It stayed that way for about 3-4 minutes before
> Ganglia stopped recording metrics.
>
> I don't think this was gmond's fault necessarily, but it may have been
> the situation that caused gmond to get into the frozen state.
I had a look at the code, and although the gmond.conf man page says that
the default timeout for tcp_accept_channel is 1 sec, in the code it is set
to -1, so connections can block forever.
You can use the following in gmond.conf to work around the problem (the
timeout is given in microseconds, so this is 1 sec):
tcp_accept_channel {
  port = 8649
  timeout = 1000000
}
The following untested patch should also work, I hope :)
--- lib/libgmond.c-orig 2008-11-03 20:14:01.000000000 +0000
+++ lib/libgmond.c 2008-11-03 20:14:12.000000000 +0000
@@ -101,7 +101,7 @@
   CFG_INT("port", -1, CFGF_NONE ),
   CFG_STR("interface", NULL, CFGF_NONE),
   CFG_SEC("acl", acl_opts, CFGF_NONE),
-  CFG_INT("timeout", -1, CFGF_NONE),
+  CFG_INT("timeout", 1000000, CFGF_NONE),
   CFG_STR("family", "inet4", CFGF_NONE),
   CFG_END()
 };
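
To illustrate why the sign of the timeout matters, here is a minimal
sketch using plain poll(2). It is not gmond's code (I have not checked
how gmond applies the value internally); wait_readable() and
demo_idle_peer() are hypothetical helpers made up for this example. The
only assumption carried over is that a negative timeout means "wait
forever" and a positive one is a limit in microseconds.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from gmond.
 * Waits until fd is readable or the timeout expires.
 * timeout_usec < 0 means wait indefinitely, mirroring a default of -1.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
static int wait_readable(int fd, long timeout_usec)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    /* poll() takes milliseconds; a negative value blocks with no limit. */
    int timeout_ms = timeout_usec < 0 ? -1 : (int)((timeout_usec + 999) / 1000);

    return poll(&pfd, 1, timeout_ms);
}

/*
 * An idle pipe stands in for a peer that connects and then sends nothing.
 * With a positive timeout the call returns 0; with -1 it would never return.
 */
static int demo_idle_peer(long timeout_usec)
{
    int fds[2];
    int rc;

    if (pipe(fds) != 0)
        return -2;

    rc = wait_readable(fds[0], timeout_usec);

    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

Calling demo_idle_peer(1000000) gives up after about a second, while
demo_idle_peer(-1) would hang, which is the kind of stall a silent
client can cause when the default is -1.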
Cheers,
Kostas