[Ganglia-developers] Re: Scaling Issues? and Memory Size Problems (combined)

Josh Durham Tue, 30 Dec 2003 15:38:26 -0800

Thanks for your quick response.

So, I've been playing around a bit with the TN thing. Here is something interesting. Here is a larger sample of the output from gmetad:

telnet localhost 8651:
...

<GRID NAME="unspecified" AUTHORITY="http://blahblah/ganglia/" LOCALTIME="1072822698">
<CLUSTER NAME="Cluster X" LOCALTIME="1072822524" OWNER="Terascale Computing Facility" LATLONG="unspecified" URL="unspecified">

...

<HOST NAME="n0603.tcf-int.vt.edu" IP="10.1.2.175" REPORTED="1072822628" TN="0" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822099">
<HOST NAME="n0604.tcf-int.vt.edu" IP="10.1.2.176" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0605.tcf-int.vt.edu" IP="10.1.2.177" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0606.tcf-int.vt.edu" IP="10.1.2.178" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0607.tcf-int.vt.edu" IP="10.1.2.179" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0608.tcf-int.vt.edu" IP="10.1.2.180" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0609.tcf-int.vt.edu" IP="10.1.2.181" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0610.tcf-int.vt.edu" IP="10.1.2.182" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072797006">
<HOST NAME="n0611.tcf-int.vt.edu" IP="10.1.2.183" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0612.tcf-int.vt.edu" IP="10.1.2.184" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0613.tcf-int.vt.edu" IP="10.1.2.185" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0614.tcf-int.vt.edu" IP="10.1.2.186" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0615.tcf-int.vt.edu" IP="10.1.2.187" REPORTED="1072822629" TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0616.tcf-int.vt.edu" IP="10.1.3.11" REPORTED="1072822515" TN="9" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">
<HOST NAME="n0617.tcf-int.vt.edu" IP="10.1.3.12" REPORTED="1072822616" TN="12" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822100">

Those that have the funky TNs were all reported at the same time. I have a feeling it's a timing issue.

And actually, I caught gmond doing something similar. I had to run it a few times, but I got (from telnet localhost 8649):

...
<GANGLIA_XML VERSION="2.5.5" SOURCE="gmond">

<CLUSTER NAME="Cluster X" LOCALTIME="1072823227" OWNER="Terascale Computing Facility" LATLONG="unspecified" URL="unspecified">

...

<HOST NAME="n0163.tcf-int.vt.edu" IP="10.1.1.173" REPORTED="1072823228" TN="-1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822082">
<HOST NAME="n0164.tcf-int.vt.edu" IP="10.1.1.174" REPORTED="1072823228" TN="-1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1072822082">

Is it possible that, because this data is so big, it is being updated while it's being reported? I'm not too familiar with the source, but if the following is happening, this could be the problem:

1. gmond receives an XML request from gmetad.
2. gmond records the current time in client->timestamp.
3. gmond starts to go through the host hash, reporting tn as client->timestamp - node->timestamp (where node->timestamp is REPORTED).
4. gmond receives an update from a computational node 1 second after the start of the XML request, and so reports a negative TN?
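
A minimal sketch of the arithmetic in the steps above, not the actual gmond code. It assumes the two timestamps are held as 32-bit unsigned values; the helper names tn_unsigned, tn_clamped, and print_tn_demo are hypothetical. The request time 1072822628 is inferred from the gmetad sample (the hosts showing TN="0" and TN="12"), and 1072822629 is the REPORTED value of the hosts showing TN="4294967295".

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: host age as request_time - reported, computed in
 * 32-bit unsigned arithmetic (an assumption about how TN is stored). */
static uint32_t tn_unsigned(uint32_t request_time, uint32_t reported)
{
    /* Wraps modulo 2^32 when reported is later than request_time. */
    return request_time - reported;
}

/* Hypothetical alternative: do the subtraction in a wider signed type and
 * clamp at zero, so a host updated mid-request reports an age of 0. */
static uint32_t tn_clamped(uint32_t request_time, uint32_t reported)
{
    int64_t delta = (int64_t)request_time - (int64_t)reported;
    return delta < 0 ? 0u : (uint32_t)delta;
}

/* Prints both results for a host updated one second after the request. */
static void print_tn_demo(void)
{
    uint32_t request_time = 1072822628u; /* inferred from the sample */
    uint32_t reported     = 1072822629u; /* updated one second later */

    printf("unsigned TN=%" PRIu32 "\n", tn_unsigned(request_time, reported));
    printf("clamped  TN=%" PRIu32 "\n", tn_clamped(request_time, reported));
}
```

Calling print_tn_demo() from main prints 4294967295 for the unsigned version and 0 for the clamped one. If the race really is the cause, that would match the gmetad output, and the -1 seen from gmond would be the same bit pattern printed as a signed value.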

Also, a note. This is a dual-processor 1.3GHz Apple G4 Xserve. I have a feeling I could run this on a DP 2.0 GHz G5 without issue, but I'd rather run it on my server platform. So, if I run just gmond, it takes about 0.8 seconds to pull the XML. When I run gmetad (which is eating up some processor cycles), it goes up to 1.2 seconds.

What I don't understand is that gmetad should handle this. Its check to see whether a host is up is tn < tmax * 4 (here, -1 < 80).

So, I added this to process_xml.c, line 447:

debug_msg("XXXX Host alive: cluster_localtime=%d reported=%d expr=%d tn=%d tmax=%d host_alive=%d", xmldata->cluster_localtime, reported, (tn < tmax * 4), tn, tmax, xmldata->host_alive);


And I get:

XXXX Host alive: cluster_localtime=1072825831 reported=1072825832 expr=0 tn=-1 tmax=20 host_alive=0


Now I'm baffled.  Why isn't -1 < 20 * 4 coming out as 1?
Sorry for my rambling. Thinking out loud, in a way.

Any ideas on this?
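
One possible explanation, sketched under the assumption that tmax (or tn itself) is declared unsigned in process_xml.c, which I have not confirmed: C converts the signed operand to unsigned before comparing, so -1 becomes 4294967295 and the test fails. The function names alive_mixed and alive_signed are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Liveness test as it would behave if tmax were unsigned (an assumption).
 * The cast spells out the conversion C applies implicitly when an int
 * meets an unsigned int in a comparison. */
static int alive_mixed(int tn, unsigned int tmax)
{
    return (unsigned int)tn < tmax * 4;
}

/* Same test done in a wider signed type, so a negative tn stays negative. */
static int alive_signed(int tn, unsigned int tmax)
{
    return (long long)tn < (long long)tmax * 4;
}

/* Prints both results for the values seen in the debug output. */
static void print_alive_demo(void)
{
    printf("mixed:  %d\n", alive_mixed(-1, 20));
    printf("signed: %d\n", alive_signed(-1, 20));
}
```

With tn=-1 and tmax=20, alive_mixed returns 0, matching expr=0 in the debug line, while alive_signed returns 1. Printing an unsigned 4294967295 with %d would typically also show -1, which may be why the log looks like a signed value.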

Also, on the mem_total problem I'm having, I'm not sure xdr_hyper is an option. It doesn't exist in OS X's /usr/include/rpc/xdr.h. I might be able to use xdr_bytes, but I don't know a lot about RPC/XDR. I was thinking of cheating and having it report MB in the summary RRDs, but that's not really a good solution.

I am looking forward to Ganglia 3. One of the problems I'm having with the Darwin-specific metrics is the cpu_*_funcs. It would be easy if I could return user, nice, system, and idle from one function as an array of values (e.g. f(10.0 0.0 5.0 85.0)). The trick is figuring out how to split them up.

Also, I haven't checked in a while, but I think my baseline network usage was about 80KB/s while running Ganglia. Reducing that would be nice on the monitoring nodes.


On Tuesday, December 30, 2003, at 08:44 AM, [EMAIL PROTECTED] wrote:

Sweet to hear you are running Ganglia on the G5 cluster. Strange about the TN figure, looks like a signed-unsigned int issue. I'll have a look at the code when I get back from my holiday vacation.

Definitely send the patches when you get them in order.

-Federico
