On Mon, Nov 03, 2008 at 01:55:22PM -0700, Brad Nicholes wrote:

> >>> On 11/3/2008 at 1:15 PM, in message <[EMAIL PROTECTED]>,
> Kostas Georgiou <[EMAIL PROTECTED]> wrote:
> > On Mon, Nov 03, 2008 at 01:29:50PM -0500, Ofer Inbar wrote:
> > 
> >> Oh, one more thing:  I can see on the Ganglia graph for that host that
> >> just before it "froze", system CPU went to 100% very suddenly - one
> >> sample was normal, with user ~15% and system ~8%, and the next sample
> >> it was all system CPU.  It stayed that way for about 3-4 minutes before
> >> Ganglia stopped recording metrics.
> >> 
> >> I don't think this was gmond's fault necessarily, but it may have been
> >> the situation that caused gmond to get into the frozen state.
> > 
> > I had a look at the code and, although the gmond.conf man page says that
> > the default timeout for tcp_accept_channel is 1 sec, in the code it is set
> > to -1, so connections can block forever.
> > 
> > You can use the following in gmond.conf to solve the problem.
> > 
> >   tcp_accept_channel {
> >     port = 8649
> >     timeout = 1000000
> >   }
> > 
> > The following untested patch should also work I hope :)
> > --- lib/libgmond.c-orig 2008-11-03 20:14:01.000000000 +0000
> > +++ lib/libgmond.c      2008-11-03 20:14:12.000000000 +0000
> > @@ -101,7 +101,7 @@
> >    CFG_INT("port", -1, CFGF_NONE ),
> >    CFG_STR("interface", NULL, CFGF_NONE),
> >    CFG_SEC("acl", acl_opts, CFGF_NONE),
> > -  CFG_INT("timeout", -1, CFGF_NONE),
> > +  CFG_INT("timeout", 1000000, CFGF_NONE),
> >    CFG_STR("family", "inet4", CFGF_NONE),
> >    CFG_END()
> >  };
> > 
> 
> If a timeout is set, then is the resulting XML output still good or did we 
> lose something because of the timeout?

No, nothing seems to be lost; it seems to be working fine. I am testing with:

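#!/usr/bin/env python3
# Deliberately slow client: reads gmond's XML 32 bytes at a time, once a second.
import socket, sys, time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask for a tiny receive buffer so the server's writes back up quickly
# (the kernel may round this up to its own minimum).
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 32)
print("socket buffer size:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

s.connect(('127.0.0.1', 8649))
while True:
    data = s.recv(32)
    if not data:  # server closed the connection
        break
    sys.stdout.buffer.write(data)  # pass the raw bytes through unchanged
    sys.stdout.flush()
    time.sleep(1)
s.close()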

Without the timeout, while the Python script is pulling data slowly, a
telnet localhost 8649 blocks. With the timeout set, both seem to be able
to pull data at the same time without any problems. I don't have a clue
about the APR library though, so I have no idea what the timeout actually
does beyond guessing (are writes that time out retried? etc.).
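
To check the XML question a bit more directly, a second client could read
the whole dump while the slow script is still connected, then try to parse
it. The following is only a minimal sketch and was not part of the test
above. The helper names fetch_dump and is_well_formed are made up for
illustration, and the sketch assumes gmond closes the connection once it
has sent the full document.

```python
import socket
import time
import xml.etree.ElementTree as ET


def fetch_dump(host="127.0.0.1", port=8649, timeout=10.0):
    """Read everything the server sends until it closes the connection.

    Returns (payload_bytes, elapsed_seconds). The timeout applies to the
    connect and to each recv, so a blocked server raises socket.timeout
    instead of hanging this client forever.
    """
    start = time.monotonic()
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as s:
        while True:
            data = s.recv(65536)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks), time.monotonic() - start


def is_well_formed(payload):
    """True if payload parses as XML, i.e. it was not cut off part-way."""
    try:
        ET.fromstring(payload)
    except ET.ParseError:
        return False
    return True
```

Running `payload, elapsed = fetch_dump()` followed by
`is_well_formed(payload)` while the slow script is connected would show
both whether the second client gets served promptly and whether what it
receives is a complete document. This only checks well-formedness, not
that every metric is present.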

Kostas

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
