On Fri, Oct 17, 2008 at 16:24, Ofer Inbar <[EMAIL PROTECTED]> wrote:
> Ganglia 3.1.0 on CentOS 4.

Ganglia 3.1.1, Solaris 10, Sparc.

I'm also seeing a blocked gmond, although my situation may be slightly
different.

> Earlier today one of my clusters stopped reporting.
> Grid server logged these to syslog:
>  /usr/sbin/gmetad[12271]: poll() timeout for [Web] data source after 0 bytes read

I don't have logging going to syslog, although I do have debug output
(level 5) from gmond.  I also ran gmetad with various logging levels
up to 100, but there's nothing besides:

[EMAIL PROTECTED]:/usr/local/ganglia]$ sudo ./start_gmetad.sh -d1000
Going to run as user nobody
Sources are ...
Source: [NHGRI Systems, step 10] has 3 sources
        10.0.0.81
        10.0.0.116
        10.0.0.40
xml listening on port 8651
interactive xml listening on port 8652
Data thread 8 is monitoring [NHGRI Systems] data source
        10.0.0.81
        10.0.0.116
        10.0.0.40
cleanup thread has been started
server_thread() received request "/?filter=summary" from 127.0.0.1
Found subtree / and filter=summary

In this case, 10.0.0.81 is the system that runs gmetad, and gmond.
The other two hosts (.40 and .116) are running gmond, and sending
updates to .81 via unicast.

> I checked that gmond was running on that host, and it was.
> However, attempts to connect to its port 8649 would indeed timeout.

Same here.  Gmond will run fine for a while, then fail to respond to
TCP connections.  Running 'telnet localhost 8649' fails to connect.
In my case, "a while" ranges from minutes to hours--I've been testing
this off and on since yesterday.
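
For anyone who wants to catch the moment it stops answering, here is a
rough sketch of a probe that does the same thing as the telnet test,
but with a timeout and a byte count.  It assumes gmond dumps its XML on
connect and then closes the socket; the function name probe_gmond and
the 5 second timeout are my own choices, not anything from Ganglia.

```python
import socket


def probe_gmond(host="localhost", port=8649, timeout=5.0):
    """Connect to a gmond TCP port and count the bytes it sends.

    Returns (status, bytes_read).  status is "ok" if data arrived and the
    peer closed the connection, "empty" if it closed without sending,
    "timeout" if the connect or a read timed out, or "error: ..." for
    any other socket failure.
    """
    total = 0
    try:
        # create_connection applies the timeout to the connect itself
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)  # also bound each recv()
            while True:
                chunk = sock.recv(65536)
                if not chunk:  # peer closed: dump is complete
                    break
                total += len(chunk)
    except socket.timeout:
        # Comparable to gmetad's "poll() timeout ... after N bytes read"
        return ("timeout", total)
    except OSError as exc:
        return ("error: %s" % exc, total)
    return ("ok" if total else "empty", total)


if __name__ == "__main__":
    status, nbytes = probe_gmond()
    print("gmond probe: %s (%d bytes read)" % (status, nbytes))
```

Run from cron or a shell loop with a timestamp, this should narrow down
when the port stops responding.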

Restarting gmond on the aggregation host will fix the problem...for a while.

Another important point is that gmond has *not* completely hung.
Running it under debug mode (-d5) shows that it is both collecting
metrics from the local system, and accepting metrics from the two
other hosts.  The problem appears to be specifically with responding
to TCP connections.

Don't have lsof installed, but netstat confirms that the port is open,
and something is listening on it.

> P.S. gmetad should've fallen back on another data source, but that's
> another email thread that we've already had :)

Same behavior here as well--gmetad didn't fail over to the other two
gmond processes...

-- 
Jesse Becker
GPG Fingerprint -- BD00 7AA4 4483 AFCC 82D0  2720 0083 0931 9A2B 06A2

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
