Thanks for your quick response.

So, I've been playing around a bit with the TN thing, and found something interesting. Here is a larger sample of the output from gmetad:
telnet localhost 8651:
...
<GRID NAME="unspecified" AUTHORITY="http://blahblah/ganglia/" LOCALTIME="1072822698">
<CLUSTER NAME="Cluster X" LOCALTIME="1072822524" OWNER="Terascale Computing Facility" LATLONG="unspecified" URL="unspecified">
...
<HOST NAME="n0603.tcf-int.vt.edu" IP="10.1.2.175" REPORTED="1072822628" TN="0" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822099">
<HOST NAME="n0604.tcf-int.vt.edu" IP="10.1.2.176" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0605.tcf-int.vt.edu" IP="10.1.2.177" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0606.tcf-int.vt.edu" IP="10.1.2.178" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0607.tcf-int.vt.edu" IP="10.1.2.179" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0608.tcf-int.vt.edu" IP="10.1.2.180" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0609.tcf-int.vt.edu" IP="10.1.2.181" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0610.tcf-int.vt.edu" IP="10.1.2.182" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072797006">
<HOST NAME="n0611.tcf-int.vt.edu" IP="10.1.2.183" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0612.tcf-int.vt.edu" IP="10.1.2.184" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0613.tcf-int.vt.edu" IP="10.1.2.185" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0614.tcf-int.vt.edu" IP="10.1.2.186" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0615.tcf-int.vt.edu" IP="10.1.2.187" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0616.tcf-int.vt.edu" IP="10.1.3.11" REPORTED="1072822515" TN="9" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0617.tcf-int.vt.edu" IP="10.1.3.12" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">

Those that have the funky TNs were all reported at the same time. I have a feeling it's a timing issue.

And actually, I caught gmond doing something similar. I had to run it a few times, but I got this (from telnet localhost 8649):
...
<GANGLIA_XML VERSION="2.5.5" SOURCE="gmond">
<CLUSTER NAME="Cluster X" LOCALTIME="1072823227" OWNER="Terascale Computing Facility" LATLONG="unspecified" URL="unspecified">
...
<HOST NAME="n0163.tcf-int.vt.edu" IP="10.1.1.173" REPORTED="1072823228" TN="-1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822082">
<HOST NAME="n0164.tcf-int.vt.edu" IP="10.1.1.174" REPORTED="1072823228" TN="-1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822082">

Is it possible that, because this data is so big, it is being updated while it's being reported? I'm not too familiar with the source, but if the following is happening, this could be the problem:
1. gmond receives XML request from gmetad.
2. gmond records current time in client->timestamp.
3. gmond starts to go through the host hash, reporting tn as client->timestamp - node->timestamp (where node->timestamp is REPORTED).
4. gmond receives an update from a computational node 1 second after the start of the XML request, and so reports a negative TN?

Also, a note: this is a dual-processor 1.3GHz Apple G4 XServe. I have a feeling I could run this on a DP 2.0GHz G5 without issue, but I'd rather run it on my server platform. If I run just gmond, it takes about 0.8 seconds to pull the XML. When I run gmetad as well (which is eating up some processor cycles), it goes up to 1.2 seconds.

What I don't understand is that gmetad should handle this. Its check to see whether a host is up is tn < tmax * 4 (-1 < 80).
So, I added this to process_xml.c, line 447:
debug_msg("XXXX Host alive: cluster_localtime=%d reported=%d expr=%d tn=%d tmax=%d host_alive=%d", xmldata->cluster_localtime,reported,(tn < tmax * 4),tn,tmax,xmldata->host_alive);

And I get:
XXXX Host alive: cluster_localtime=1072825831 reported=1072825832 expr=0 tn=-1 tmax=20 host_alive=0

Now I'm baffled.  Why isn't -1 < 20 * 4 coming out as 1?
Sorry for my rambling. Thinking out loud, in a way.

Any ideas on this?

Also, on the mem_total problem I'm having, I'm not sure xdr_hyper is an option. It doesn't exist in OS X's /usr/include/rpc/xdr.h. I might be able to use xdr_bytes, but I don't know a lot about RPC/XDR. I was thinking of cheating and having it report MB in the summary RRDs, but that's not really a good solution.

I am looking forward to Ganglia 3. One of the problems I'm having with the Darwin-specific metrics is the cpu_*_funcs. It would be easy if I could return user, nice, system, and idle in one function as an array of values (i.e. f(10.0 0.0 5.0 85.0)). The trick is figuring out how to split them up.

Also, I haven't checked in a while, but I think my baseline network usage was about 80KB/s while running Ganglia. Reducing that would be nice on the monitoring nodes.

On Tuesday, December 30, 2003, at 08:44 AM, [EMAIL PROTECTED] wrote:

Sweet to hear you are running Ganglia on the G5 cluster. Strange about the TN figure; it looks like a signed-unsigned int issue. I'll have a look at the code when I get back from my holiday vacation.

Definitely send the patches when you get them in order.

-Federico

