----- Original Message -----
From: "Jason A. Smith" <[EMAIL PROTECTED]>
To: "Josh Durham" <[EMAIL PROTECTED]>
Cc: "Ganglia Developers" <ganglia-developers@lists.sourceforge.net>
Sent: Wednesday, December 31, 2003 5:55 PM
Subject: Re: [Ganglia-developers] Re: Scaling Issues? and Memory Size Problems (combined)

> I have noticed the same exact problem here, occasionally some nodes
> would get marked as down even though they are still up, and it appears
> to be the same timing issue. Based on what you discovered below it
> appears that gmetad is using an unsigned int to store TN and gmond is
> using a signed int.

I didn't look at the code (I am writing from Switzerland on a slow, slow
connection) but you may be right. Gmetad should probably use a signed int
for tn.

>
> I think I remember several months ago ganglia was patched to call the
> time system call a lot less to improve efficiency, I bet that is when
> this timing bug was introduced which causes the webfrontend to mark some
> nodes as down if the condition you discovered occurs. Any ideas on how
> to fix it without putting all the time system calls back in?
>

I don't think we need to put the time calls back in. Just fix the bug and
we should be fine.

-Federico

> ~Jason
>
>
> On Tue, 2003-12-30 at 18:38, Josh Durham wrote:
> > Thanks for your quick response.
> >
> > So, I've been playing around a bit with the TN thing. Here is
> > something interesting.. Here is a larger sample of the output from
> > gmetad:
> > telnet localhost 8651:
> > ...
> > <GRID NAME="unspecified" AUTHORITY="http://blahblah/ganglia/"
> > LOCALTIME="1072822698">
> > <CLUSTER NAME="Cluster X" LOCALTIME="1072822524" OWNER="Terascale
> > Computing Facility" LATLONG="unspecified" URL="unspecified">
> > ...
> > <HOST NAME="n0603.tcf-int.vt.edu" IP="10.1.2.175" REPORTED="1072822628"
> > TN="0" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822099">
> > <HOST NAME="n0604.tcf-int.vt.edu" IP="10.1.2.176" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0605.tcf-int.vt.edu" IP="10.1.2.177" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0606.tcf-int.vt.edu" IP="10.1.2.178" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0607.tcf-int.vt.edu" IP="10.1.2.179" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0608.tcf-int.vt.edu" IP="10.1.2.180" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0609.tcf-int.vt.edu" IP="10.1.2.181" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0610.tcf-int.vt.edu" IP="10.1.2.182" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072797006">
> > <HOST NAME="n0611.tcf-int.vt.edu" IP="10.1.2.183" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0612.tcf-int.vt.edu" IP="10.1.2.184" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0613.tcf-int.vt.edu" IP="10.1.2.185" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0614.tcf-int.vt.edu" IP="10.1.2.186" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0615.tcf-int.vt.edu" IP="10.1.2.187" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0616.tcf-int.vt.edu" IP="10.1.3.11" REPORTED="1072822515"
> > TN="9" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0617.tcf-int.vt.edu" IP="10.1.3.12" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> >
> > Those that have the funky TNs were all reported at the same time. I
> > have a feeling it's a timing issue.
> >
> > And actually, I caught gmond doing something similar. I had to run it a
> > few times, but I got (from telnet localhost 8649):
> > ...
> > <GANGLIA_XML VERSION="2.5.5" SOURCE="gmond">
> > <CLUSTER NAME="Cluster X" LOCALTIME="1072823227" OWNER="Terascale
> > Computing Facility" LATLONG="unspecified" URL="unspecified">
> > ...
> > <HOST NAME="n0163.tcf-int.vt.edu" IP="10.1.1.173" REPORTED="1072823228"
> > TN="-1" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822082">
> > <HOST NAME="n0164.tcf-int.vt.edu" IP="10.1.1.174" REPORTED="1072823228"
> > TN="-1" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822082">
> >
> > Is it possible, that because this data is so big, that it is being
> > updated while it's being reported? I'm not too familiar with the
> > source, but if the following is happening, this could be the problem:
> > 1. gmond receives XML request from gmetad.
> > 2. gmond records current time in client->timestamp.
> > 3. gmond starts to go through the host hash, reporting tn as
> > client->timestamp - node->timestamp (where node->timestamp is REPORTED)
> > 4. gmond receives an update from a computational node after 1 second of
> > the start of the XML request, reports a negative TN?
> >
> > Also, a note. This is a Dual Processor 1.3GHz Apple G4 XServe.
> > I have a feeling I could run this on a DP 2.0 GHz G5 without issue, but
> > I'd rather run it on my server platform.
> > So, if I run just gmond, it takes about 0.8 seconds to pull the XML.
> > When I run gmetad (which is eating up some process cycles,) it goes up
> > to 1.2 seconds.
> >
> > What I don't understand, is gmetad should handle this.. Its check to
> > see if it is up is tn < tmax * 4 (-1 < 80).
> > So, I added this to process_xml.c, line 447:
> > debug_msg("XXXX Host alive: cluster_localtime=%d reported=%d expr=%d
> > tn=%d tmax=%d host_alive=%d",
> > xmldata->cluster_localtime,reported,(tn < tmax *
> > 4),tn,tmax,xmldata->host_alive);
> >
> > And I get:
> > XXXX Host alive: cluster_localtime=1072825831 reported=1072825832
> > expr=0 tn=-1 tmax=20 host_alive=0
> >
> > Now I'm baffled. Why isn't -1 < 20 * 4 coming out as 1?
> > Sorry for my rambling.. Thinking out loud, in a way.
> >
> > Any ideas on this?
> >
> > Also, on the mem_total problem I'm having, I'm not sure xdr_hyper is an
> > option. It doesn't exist in OS X's /etc/include/rpc/xdr.h. I might be
> > able to use xdr_bytes, but I don't know a lot about
> > RPC/XDR. I was thinking of cheating and having it report MB in the
> > summary RRDs, but that's not really a good solution.
> >
> > I am looking forward to Ganglia 3. One of the problems I'm having with
> > the Darwin specific metrics is the cpu_*_funcs. It's easy if I could
> > return user,nice,system, and idle in one function as an array of values
> > (. f(10.0 0.0 5.0 85.0). The trick is figuring out how to split them
> > up.
> >
> > Also, I haven't checked in a while, but I think my baseline network
> > usage was about 80KB/s while running Ganglia. Reducing that would be
> > nice on the monitoring nodes.
> >
> > On Tuesday, December 30, 2003, at 08:44 AM, [EMAIL PROTECTED] wrote:
> >
> > > Sweet to hear you are running Ganglia on the G5 cluster. Strange about
> > > the TN figure, looks like a signed-unsigned int issue.
> > > I'll have a look at the code when I get back from my holiday vacation.
> > >
> > > Definitely send the patches when you get them in order.
> > >
> > > -Federico
> > >
> > >
> >
> > -------------------------------------------------------
> > This SF.net email is sponsored by: IBM Linux Tutorials.
> > Become an expert in LINUX or just sharpen your skills. Sign up for IBM's
> > Free Linux Tutorials. Learn everything from the bash shell to sys admin.
> > Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click
> > _______________________________________________
> > Ganglia-developers mailing list
> > Ganglia-developers@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/ganglia-developers
> --
> /------------------------------------------------------------------\
> | Jason A. Smith                         Email: [EMAIL PROTECTED]  |
> | Atlas Computing Facility, Bldg. 510M   Phone: (631)344-4226      |
> | Brookhaven National Lab, P.O. Box 5000 Fax:   (631)344-7616      |
> | Upton, NY 11973-5000                                             |
> \------------------------------------------------------------------/
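
Josh's question in the quoted message (why -1 < 20 * 4 comes out as 0) is consistent with the signed/unsigned mismatch suspected in this thread. The following is a minimal C sketch of that effect, assuming gmetad holds tn and tmax in unsigned variables; that assumption has not been checked against process_xml.c, and all function names here are hypothetical, not part of Ganglia. In C, comparing unsigned values means a TN of -1 is seen as 4294967295, which is not less than 80. Printing that same unsigned value with %d typically shows -1 again, which may be why the debug line reports tn=-1 next to expr=0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the failing check: tn is stored unsigned, so a
 * gmond-reported TN of -1 wraps to 4294967295 before the comparison. */
int host_alive_unsigned(int32_t reported_tn, uint32_t tmax)
{
    uint32_t tn = (uint32_t)reported_tn;   /* -1 becomes UINT32_MAX */
    return tn < tmax * 4u;                 /* 4294967295 < 80 is false */
}

/* Same check with signed arithmetic throughout, as suggested in the
 * thread: -1 < 80 is true, so the host counts as alive. */
int host_alive_signed(int32_t reported_tn, int32_t tmax)
{
    return reported_tn < tmax * 4;
}

/* Optional guard on the gmond side (an assumption, not the actual fix):
 * if a node reports after the request timestamp was taken, treat its
 * age as 0 seconds instead of emitting a negative TN. */
int32_t clamped_tn(int32_t now, int32_t reported)
{
    int32_t tn = now - reported;
    return tn < 0 ? 0 : tn;
}

/* Exercises the values seen in the debug output quoted above. */
void check_thread_examples(void)
{
    assert(host_alive_unsigned(-1, 20) == 0);   /* matches expr=0 */
    assert(host_alive_signed(-1, 20) == 1);     /* the expected result */
    assert(clamped_tn(1072825831, 1072825832) == 0);
}
```

Calling check_thread_examples() from a small main() runs the three cases. Either change on its own (signed tn in gmetad, or clamping in gmond) would avoid the false "down" state in this sketch, without putting the extra time calls back in.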