I have noticed exactly the same problem here: occasionally some nodes get marked as down even though they are still up, and it appears to be the same timing issue. Based on what you discovered below, it appears that gmetad stores TN in an unsigned int while gmond uses a signed int.

I think I remember that several months ago ganglia was patched to call the time system call a lot less often to improve efficiency. I bet that is when this timing bug was introduced, which causes the web frontend to mark some nodes as down when the condition you discovered occurs. Any ideas on how to fix it without putting all the time system calls back in?

~Jason

On Tue, 2003-12-30 at 18:38, Josh Durham wrote:
> Thanks for your quick response.
>
> So, I've been playing around a bit with the TN thing. Here is
> something interesting. Here is a larger sample of the output from
> gmetad:
> telnet localhost 8651:
> ...
> <GRID NAME="unspecified" AUTHORITY="http://blahblah/ganglia/" LOCALTIME="1072822698">
> <CLUSTER NAME="Cluster X" LOCALTIME="1072822524" OWNER="Terascale Computing Facility" LATLONG="unspecified" URL="unspecified">
> ...
> <HOST NAME="n0603.tcf-int.vt.edu" IP="10.1.2.175" REPORTED="1072822628" TN="0" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822099">
> <HOST NAME="n0604.tcf-int.vt.edu" IP="10.1.2.176" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0605.tcf-int.vt.edu" IP="10.1.2.177" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0606.tcf-int.vt.edu" IP="10.1.2.178" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0607.tcf-int.vt.edu" IP="10.1.2.179" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0608.tcf-int.vt.edu" IP="10.1.2.180" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0609.tcf-int.vt.edu" IP="10.1.2.181" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0610.tcf-int.vt.edu"
> IP="10.1.2.182" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072797006">
> <HOST NAME="n0611.tcf-int.vt.edu" IP="10.1.2.183" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0612.tcf-int.vt.edu" IP="10.1.2.184" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0613.tcf-int.vt.edu" IP="10.1.2.185" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0614.tcf-int.vt.edu" IP="10.1.2.186" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0615.tcf-int.vt.edu" IP="10.1.2.187" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0616.tcf-int.vt.edu" IP="10.1.3.11" REPORTED="1072822515" TN="9" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
> <HOST NAME="n0617.tcf-int.vt.edu" IP="10.1.3.12" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
>
> Those that have the funky TNs were all reported at the same time. I
> have a feeling it's a timing issue.
>
> And actually, I caught gmond doing something similar. I had to run it a
> few times, but I got (from telnet localhost 8649):
> ...
> <GANGLIA_XML VERSION="2.5.5" SOURCE="gmond">
> <CLUSTER NAME="Cluster X" LOCALTIME="1072823227" OWNER="Terascale Computing Facility" LATLONG="unspecified" URL="unspecified">
> ...
> <HOST NAME="n0163.tcf-int.vt.edu" IP="10.1.1.173" REPORTED="1072823228" TN="-1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822082">
> <HOST NAME="n0164.tcf-int.vt.edu" IP="10.1.1.174" REPORTED="1072823228" TN="-1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822082">
>
> Is it possible that, because this data is so big, it is being
> updated while it's being reported? I'm not too familiar with the
> source, but if the following is happening, this could be the problem:
> 1. gmond receives XML request from gmetad.
> 2. gmond records current time in client->timestamp.
> 3. gmond starts to go through the host hash, reporting tn as
>    client->timestamp - node->timestamp (where node->timestamp is
>    REPORTED).
> 4. gmond receives an update from a computational node 1 second after
>    the start of the XML request and reports a negative TN?
>
> Also, a note. This is a Dual Processor 1.3GHz Apple G4 XServe. I have
> a feeling I could run this on a DP 2.0 GHz G5 without issue, but I'd
> rather run it on my server platform.
> So, if I run just gmond, it takes about 0.8 seconds to pull the XML.
> When I run gmetad (which is eating up some process cycles), it goes up
> to 1.2 seconds.
>
> What I don't understand is that gmetad should handle this. Its check to
> see if a host is up is tn < tmax * 4 (-1 < 80).
> So, I added this to process_xml.c, line 447:
>     debug_msg("XXXX Host alive: cluster_localtime=%d reported=%d expr=%d tn=%d tmax=%d host_alive=%d",
>               xmldata->cluster_localtime, reported, (tn < tmax * 4), tn, tmax, xmldata->host_alive);
>
> And I get:
> XXXX Host alive: cluster_localtime=1072825831 reported=1072825832 expr=0 tn=-1 tmax=20 host_alive=0
>
> Now I'm baffled. Why isn't -1 < 20 * 4 coming out as 1?
> Sorry for my rambling. Thinking out loud, in a way.
>
> Any ideas on this?
>
> Also, on the mem_total problem I'm having, I'm not sure xdr_hyper is an
> option. It doesn't exist in OS X's /etc/include/rpc/xdr.h.
> I might be able to use xdr_bytes, but I don't know a lot about
> RPC/XDR. I was thinking of cheating and having it report MB in the
> summary RRDs, but that's not really a good solution.
>
> I am looking forward to Ganglia 3. One of the problems I'm having with
> the Darwin-specific metrics is the cpu_*_funcs. It would be easy if I
> could return user, nice, system, and idle from one function as an array
> of values (f(10.0 0.0 5.0 85.0)). The trick is figuring out how to split
> them up.
>
> Also, I haven't checked in a while, but I think my baseline network
> usage was about 80KB/s while running Ganglia. Reducing that would be
> nice on the monitoring nodes.
>
> On Tuesday, December 30, 2003, at 08:44 AM, [EMAIL PROTECTED] wrote:
>
> > Sweet to hear you are running Ganglia on the G5 cluster. Strange about
> > the TN figure, looks like a signed-unsigned int issue. I'll have a look
> > at the code when I get back from my holiday vacation.
> >
> > Definitely send the patches when you get them in order.
> >
> > -Federico
>
> _______________________________________________
> Ganglia-developers mailing list
> Ganglia-developers@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-developers

-- 
/------------------------------------------------------------------\
| Jason A. Smith                          Email: [EMAIL PROTECTED] |
| Atlas Computing Facility, Bldg. 510M    Phone: (631)344-4226     |
| Brookhaven National Lab, P.O. Box 5000  Fax:   (631)344-7616     |
| Upton, NY 11973-5000                                             |
\------------------------------------------------------------------/