RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

2006-03-30 Thread Martin Knoblauch
Eli,

 OK, the messages are coming from RRDTOOL, just telling you that you
tried to update the same metric with exactly the same timestamp.

 Do you see any messages prefixed RRD_create in your logfiles? 

 The problem is that if one of the rrd_updates fails, gmetad stops
working on anything.

 Do you have a chance to rebuild gmetad with the following patch? It is
against current CVS, but should apply against 3.0.2. If it helps, all
hosts (metrics) except the one causing problems should be OK. It might
not be the real solution, but may help us to track it down.

[gmetad]$ diff -udp rrd_helpers.c rrd_helpers.c-new
--- rrd_helpers.c   2005-03-15 19:11:33.0 +0100
+++ rrd_helpers.c-new   2006-03-30 11:28:26.0 +0200
@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );
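
The effect of the one-line change can be sketched in isolation. This is a minimal illustration, not gmetad code: `fake_update` and `process_hosts` are hypothetical names, and the loop assumes, as described in this thread, that the caller gives up on the remaining hosts as soon as an update helper returns nonzero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for RRD_update(): the update fails for the host
 * named "bad". 'propagate' selects the unpatched behaviour (return 1 on
 * failure) or the patched behaviour (return 0 on failure). */
static int fake_update(const char *host, int propagate)
{
    assert(host != NULL);
    if (strcmp(host, "bad") == 0)
        return propagate ? 1 : 0;
    return 0;
}

/* Hypothetical caller that stops at the first nonzero return.
 * Returns the number of hosts the loop got through before stopping. */
static size_t process_hosts(const char *const hosts[], size_t n, int propagate)
{
    size_t done = 0;
    for (size_t i = 0; i < n; i++) {
        if (fake_update(hosts[i], propagate) != 0)
            break;              /* every host after the bad one is skipped */
        done++;
    }
    return done;
}
```

With a host list of a, bad, b, c, `process_hosts(hosts, 4, 1)` stops after the first host, while `process_hosts(hosts, 4, 0)` walks all four. The failed update is only hidden from the caller, so the underlying problem with the bad host remains.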


 In addition,  do you see any messages prefixed RRD_create in your
logfiles? You should, as some of the RRD files seem to be missing.

Cheers
Martin



--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de



RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

2006-03-30 Thread Martin Knoblauch
Richard, [adding ganglia-developers for comments]

 pretty good explanation of what is likely happening, or what can go
wrong. I sent Eli a patch I found useful a while ago, but which is not
in CVS yet (because I fixed the root-problem of the illegal updates).
This should prevent gmetad from ignoring all hosts/metrics if just one
of them is corrupt. Somewhere in the code we go nuts on an error
return.

[gmetad]$ diff -udp rrd_helpers.c rrd_helpers.c-new
--- rrd_helpers.c   2005-03-15 19:11:33.0 +0100
+++ rrd_helpers.c-new   2006-03-30 11:28:26.0 +0200
@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );


--- [EMAIL PROTECTED] wrote:

 Eli,
 
 Martin is most surely right. If you are running an unpatched 3.0.2,
 let me share with you the many ways it can all go wrong.
 
 gmond generates the hostnames found in the XML stream by reverse DNS
 lookup only. Its internal structures treat every different IP address
 it sees as a different host, regardless of what the reverse DNS entry
 is.
 
 So, if you have
 1) Incorrect reverse DNS entries such that 2 different hosts reverse
    map to the same hostname,
 2) Or 2 NICs on a host that are not teamed (i.e. 2 different
    addresses) and the routing allows packets to exit either NIC, hence
    either source address may be used.
 3) Or a DHCP lease renewal that results in a host changing IP
    addresses.
 
 Then what will happen is that the XML stream from the cluster will
 contain 2 (or more) entries with different IP addrs, but the same
 name. Even in the DHCP case when only 1 source address is used at a
 time, gmond will keep the old IP address entry until a timeout, even
 though it is not being updated. So dups arise again.
 
 Now unfortunately, gmetad only uses the HOSTNAME for the RRD files and
 its own processing. So if there is a duplicated hostname in the XML
 stream, it will update the RRDs after parsing the first entry, and
 then again after parsing the second. As these 2 updates to the same
 RRD files will occur in less than one second, this results in an RRD
 update error.
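
The duplicate condition described above can be checked mechanically once host names and addresses have been pulled out of the XML. A minimal sketch, assuming a hypothetical `struct host_entry` filled from the HOST NAME and IP attributes (neither the struct nor `count_duplicate_names` exists in Ganglia):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record holding the NAME and IP attributes of one HOST entry. */
struct host_entry {
    const char *name;
    const char *ip;
};

/* Count the entries whose name already appeared earlier in the list under
 * a different IP address. Each such entry would cause a second update of
 * the same RRD files within one polling pass. Quadratic, which is fine
 * for a sketch. */
static size_t count_duplicate_names(const struct host_entry *h, size_t n)
{
    size_t dups = 0;
    assert(h != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            if (strcmp(h[i].name, h[j].name) == 0 &&
                strcmp(h[i].ip, h[j].ip) != 0) {
                dups++;
                break;          /* count each later entry at most once */
            }
        }
    }
    return dups;
}
```

A nonzero result means at least one hostname is shared by several addresses, which matches the PTR, dual-NIC and DHCP cases listed above.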
 
 On unpatched 3.0.2, this then causes THE ENTIRE PROCESSING OF THE
 CLUSTER TO BE ABANDONED. So some hosts get updated, some not, and the
 cluster view does not get updated. If you patch this particular issue,
 you will still get double processing for duped hosts, which can result
 in them erroneously being reported as down (for example).
 
 phew.
 long mail.
 
 - richard
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of
 Martin
 Knoblauch
 Sent: 30 March 2006 08:05
 To: Eli Stair
 Cc: Ganglia-general@lists.sourceforge.net
 Subject: Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts
 that
 are proper in gmond XML
 
 
 Eli,
 
  yup. That could definitely cause problems. Do you see anything in the
 /var/log/messages of the gmetad host?
 
  Hmm. You may have to restart *all* gmonds, as well as the gmetad. This
 is something that I usually do when my ganglia setup was hosed somehow.
 Definitely the case for multicast clusters. Not really sure about
 unicast.
 
  And yes - this is not optimal.
 
 --- Eli Stair [EMAIL PROTECTED] wrote:
 
  
  The only issue I can find at all with this config is that the new
  hosts have been deployed by someone with two PTR records, both the
  proper one pointing to the A hostname, as well as all having an
  improper PTR - linux.FQDN.
  
  Is there a potential that gmetad is doing a lookup of both the forward
  and reverse entries for a host before populating it?  Unfortunately
  removing the invalid entry for a host and restarting gmetad as well as
  the gmond aggregator and the host did not resolve it.
  
  /eli
  
  Eli Stair wrote:
   
   My installation started having an issue yesterday afternoon that I
   have yet to explain or remedy.  One cluster that I have unicasting, has
   started losing hosts... the directory entries on disk never get
   created for newly deployed hosts, and gmond reports receiving messages
   for the host (and outputs metrics) but gmetad does not report an
   updating host message, and never creates the RRD's even though the
   host is up.
   
   The critical problem is that the report graphs for this cluster have
   stopped being updated as well, which nix'es my ability to view cluster
   load/job level... in addition to not being able to alert on the RRD
   values for the individual hosts that are malfunctioning.  Those hosts
   that are good continue to update their metric RRD's properly, their
   host reports are populated etc.  The bad ones I cannot explain...
   
   The two questions, if anyone has insight:
   
   1) What is causing

RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

2006-03-30 Thread Eli Stair

Sorry about my imprecise categorizing of things ;)

The RRD_create still occurs if I delete the fqdn-directory entry for the 
'linux' host that is being generated from the numerous reverse PTR records.

/eli

-Original Message-
From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
Sent: Thu 3/30/2006 1:40 AM
To: Eli Stair; [EMAIL PROTECTED]
Cc: Ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are 
proper in gmond XML
 
Eli,

 is the famous tragic scientist the good name, or the bad name? In
either case, does the opposite RRD file exist? And, do you see any
messages prefixed RRD_create in your messages file?

Cheers
Martin

--- Eli Stair [EMAIL PROTECTED] wrote:


   Here are the details:
   
   server:
   ganglia 3.0.2 (x86_64)
   6 (six) multicast clusters polled by gmetad
   1 (one) unicast cluster, reporting to a 'mute' gmond aggregating
 on
  the 
   same host as gmetad.
   
   clients:
   suse9.3 x86_64
   ganglia 3.0.2 (x86_64)
   
   
   Debug logged info (-d2):
   
   Bad host:
   
 Apache error_log for bad host:
   ERROR: opening
   '/var/lib/ganglia/rrds/Opteron_Production-Desktop_Droid_Cluster/frankenstein.lucasfilm.com/swap_free.rrd':
   No such file or directory
   
 gmond:
   Processing a Ganglia_message from badhost
 gmetad:
   server_thread() received request 
   /Opteron_Production-Desktop_Droid_Cluster/badhost from
 127.0.0.1
   


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de




RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

2006-03-30 Thread Eli Stair

I'll give this a try today before I rectify the DNS issue altogether.  I wanted 
to track down and understand the cause internally before just fixing DNS...  
This actually seems to describe the situation I have now though: only the hosts 
with incorrect PTR's are not populating their RRD's; also the cluster stats 
aren't being aggregated.

Thanks for your input!

/eli

-Original Message-
From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
Sent: Thu 3/30/2006 1:36 AM
To: Eli Stair; [EMAIL PROTECTED]
Cc: Ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are 
proper in gmond XML
 
Eli,

 OK, the messages are coming from RRDTOOL, just telling you that you
tried to update the same metric with exactly the same timestamp.

 Do you see any messages prefixed RRD_create in your logfiles? 

 The problem is that if one of the rrd_updates fails, gmetad stops
working on anything.

 Do you have a chance to rebuild gmetad with the following patch? It is
against current CVS, but should apply against 3.0.2. If it helps, all
hosts (metrics) except the one causing problems should be OK. It might
not be the real solution, but may help us to track it down.

[gmetad]$ diff -udp rrd_helpers.c rrd_helpers.c-new
--- rrd_helpers.c   2005-03-15 19:11:33.0 +0100
+++ rrd_helpers.c-new   2006-03-30 11:28:26.0 +0200
@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );


 In addition,  do you see any messages prefixed RRD_create in your
logfiles? You should, as some of the RRD files seem to be missing.

Cheers
Martin



--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de




RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

2006-03-30 Thread Martin Knoblauch
Eli,

 if the patch helps, I tend to put it into 3.0.3 (if CVS is working
again :-(

Martin

--- Eli Stair [EMAIL PROTECTED] wrote:

 
 Richard, Martin, et al:
 
 Thanks for all your assistance describing the workings and why it is
 going wrong... the glomming together of all the host XML and the
 organization to disk of it has been quite a black box to me.  
 
 You explain how this can occur on an unpatched 3.0.2; is the
 recommended patch that which Martin posted or is there something else
 suggested?  I'll give his a shot, and if it isn't successful try the
 CVS build.  I've been trying to wait for 3.0.3 before making any more
 changes than just PHP interface stuff.
 
 Cheers,
 
 /eli
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED]
 Sent: Thu 3/30/2006 1:35 AM
 To: [EMAIL PROTECTED]; Eli Stair
 Cc: Ganglia-general@lists.sourceforge.net
 Subject: RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts
 that are proper in gmond XML
  
 Eli,
 
 Martin is most surely right. If you are running an unpatched 3.0.2,
 let me share with you the many ways it can all go wrong.
 
 gmond generates the hostnames found in the XML stream by reverse DNS
 lookup only. Its internal structures treat every different IP address
 it sees as a different host, regardless of what the reverse DNS entry
 is.
 
 So, if you have
 1) Incorrect reverse DNS entries such that 2 different hosts reverse
 map
   to the same hostname,
 2) Or 2 NICs on a host that are not teamed (i.e. 2 different
 addresses)
 and
   the routing allows packets to exit either NIC, hence either source
 address
   may be used.
 3) Or a DHCP lease renewal that results in a host changing IP
 addresses.
 
 Then what will happen is that the XML stream from the cluster will
 contain
 2 (or more) entries with different IP addrs, but the same name. Even
 in
 the DHCP
 case when only 1 source address is used at a time, gmond will keep
 the
 old IP address
 entry until a timeout, even though it is not being updated. So dups
 arise again.
 
 Now unfortunately, gmetad only uses the HOSTNAME for the RRD files
 and
 its own
 processing. So if there is a duplicated hostname in the XML stream,
 it
 will update
 the RRDs after parsing the first entry, and then again after parsing
 the
 second.
 As these 2 updates to the same RRD files will occur in less than one
 second, this
 results in an RRD update error.
 
 On unpatched 3.0.2, this then causes THE ENTIRE PROCESSING OF THE
 CLUSTER TO BE ABANDONED. So some hosts get updated, some not, and the
 cluster view does not get updated. If you patch this particular issue,
 you will still get double processing for duped hosts, which can result
 in them erroneously being reported as down (for example).
 
 phew.
 long mail.
 
 - richard
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of
 Martin
 Knoblauch
 Sent: 30 March 2006 08:05
 To: Eli Stair
 Cc: Ganglia-general@lists.sourceforge.net
 Subject: Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts
 that
 are proper in gmond XML
 
 
 Eli,
 
  yup. That could definitely cause problems. Do you see anything in
 the
 /var/log/messages of the gmetad host?
 
  Hmm. You may have to restart *all* gmonds, as well as the gmetad.
 This
 is something that I usually do when my ganglia setup was hosed
 somehow.
 Definitely the case for multicast clusters. Not really sure about
 unicast.
 
  And yes - this is not optimal.
 
 --- Eli Stair [EMAIL PROTECTED] wrote:
 
  
  The only issue I can find at all with this config is that the new 
  hosts have been deployed by someone with two PTR records, both the 
  proper one
  pointing to the A hostname, as well as all having an improper PTR
 - 
  linux.FQDN.
  
  Is there a potential that gmetad is doing a lookup of both the
 forward
  and reverse entries for a host before populating it?  Unfortunately
 
  removing the invalid entry for a host and restarting gmetad as well
  as 
  the gmond aggregator and the host did not resolve it.
  
  /eli
  
  Eli Stair wrote:
   
   My installation started having an issue yesterday afternoon that
 I
  have
   yet to explain or remedy.  One cluster that I have unicasting,
 has
   started losing hosts... the directory entries on disk never get
 
   created for newly deployed hosts, and gmond reports receiving
  messages
   for the host (and outputs metrics) but gmetad does not report an
   updating host message, and never creates the RRD's even though
  the
   host is up.
   
   The critical problem is that the report graphs for this cluster
  have
   stopped being updated as well, which nix'es my ability to view
  cluster
   load/job level... in addition to not being able to alert on the
 RRD
  
   values for the individual hosts that are malfunctioning.  Those
  hosts
   that are good continue to update their metric RRD's properly,
  their
   host reports are populated etc.  The bad ones I cannot explain...
   
   The two

Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

2006-03-30 Thread Eli Stair
Sorry I haven't had a chance to guineapig... I had to nuke the false PTR 
records for the hosts causing issues.  It looks good to me though, your 
patch is just allowing the process to continue updating the RRD's 
instead of erroring out... allowing everything but the hosts with issues 
to continue to function.


/eli


@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );

Martin Knoblauch wrote:

Eli,

 if the patch helps, I tend to put it into 3.0.3 (if CVS is working
again :-(

Martin

--- Eli Stair [EMAIL PROTECTED] wrote:


Richard, Martin, et al:

Thanks for all your assistance describing the workings and why it is
going wrong... the glomming together of all the host XML and the
organization to disk of it has been quite a black box to me.  


You explain how this can occur on an unpatched 3.0.2; is the
recommended patch that which Martin posted or is there something else
suggested?  I'll give his a shot, and if it isn't successful try the
CVS build.  I've been trying to wait for 3.0.3 before making any more
changes than just PHP interface stuff.

Cheers,

/eli

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]
Sent: Thu 3/30/2006 1:35 AM
To: [EMAIL PROTECTED]; Eli Stair
Cc: Ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts
that are proper in gmond XML

Eli,

Martin is most surely right. If you are running an unpatched 3.0.2,
let me share with you the many ways it can all go wrong.

gmond generates the hostnames found in the XML stream by reverse DNS
lookup only. Its internal structures treat every different IP address
it sees as a different host, regardless of what the reverse DNS entry
is.

So, if you have
1) Incorrect reverse DNS entries such that 2 different hosts reverse
map
 to the same hostname,
2) Or 2 NICs on a host that are not teamed (i.e. 2 different
addresses)
and
 the routing allows packets to exit either NIC, hence either source
address
 may be used.
3) Or a DHCP lease renewal that results in a host changing IP
addresses.

Then what will happen is that the XML stream from the cluster will
contain
2 (or more) entries with different IP addrs, but the same name. Even
in
the DHCP
case when only 1 source address is used at a time, gmond will keep
the
old IP address
entry until a timeout, even though it is not being updated. So dups
arise again.

Now unfortunately, gmetad only uses the HOSTNAME for the RRD files
and
its own
processing. So if there is a duplicated hostname in the XML stream,
it
will update
the RRDs after parsing the first entry, and then again after parsing
the
second.
As these 2 updates to the same RRD files will occur in less than one
second, this
results in an RRD update error.

On unpatched 3.0.2, this then causes THE ENTIRE PROCESSING OF THE
CLUSTER TO BE ABANDONED. So some hosts get updated, some not, and the
cluster view does not get updated. If you patch this particular issue,
you will still get double processing for duped hosts, which can result
in them erroneously being reported as down (for example).

phew.
long mail.

- richard

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
Martin
Knoblauch
Sent: 30 March 2006 08:05
To: Eli Stair
Cc: Ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts
that
are proper in gmond XML


Eli,

yup. That could definitely cause problems. Do you see anything in
the
/var/log/messages of the gmetad host?

Hmm. You may have to restart *all* gmonds, as well as the gmetad.
This
is something that I usually do when my ganglia setup was hosed
somehow.
Definitely the case for multicast clusters. Not really sure about
unicast.

And yes - this is not optimal.

--- Eli Stair [EMAIL PROTECTED] wrote:

The only issue I can find at all with this config is that the new
hosts have been deployed by someone with two PTR records, both the
proper one pointing to the A hostname, as well as all having an
improper PTR - linux.FQDN.

Is there a potential that gmetad is doing a lookup of both the forward
and reverse entries for a host before populating it?  Unfortunately
removing the invalid entry for a host and restarting gmetad as well as
the gmond aggregator and the host did not resolve it.


/eli

Eli Stair wrote:

My installation started having an issue yesterday afternoon that I
have yet to explain or remedy.  One cluster that I have unicasting, has
started losing hosts... the directory entries on disk never get
created for newly deployed hosts, and gmond reports receiving messages
for the host (and outputs metrics) but gmetad does not report an
updating host message, and never creates the RRD's

Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

2006-03-29 Thread Martin Knoblauch
Eli,

 yup. That could definitely cause problems. Do you see anything in the
/var/log/messages of the gmetad host?

 Hmm. You may have to restart *all* gmonds, as well as the gmetad. This
is something that I usually do when my ganglia setup was hosed somehow.
Definitely the case for multicast clusters. Not really sure about
unicast.

 And yes - this is not optimal.

--- Eli Stair [EMAIL PROTECTED] wrote:

 
 The only issue I can find at all with this config is that the new
 hosts 
 have been deployed by someone with two PTR records, both the proper
 one 
 pointing to the A hostname, as well as all having an improper PTR - 
 linux.FQDN.
 
 Is there a potential that gmetad is doing a lookup of both the
 forward 
 and reverse entries for a host before populating it?  Unfortunately 
 removing the invalid entry for a host and restarting gmetad as well
 as 
 the gmond aggregator and the host did not resolve it.
 
 /eli
 
 Eli Stair wrote:
  
  My installation started having an issue yesterday afternoon that I
 have 
  yet to explain or remedy.  One cluster that I have unicasting, has 
  started losing hosts... the directory entries on disk never get 
  created for newly deployed hosts, and gmond reports receiving
 messages 
  for the host (and outputs metrics) but gmetad does not report an 
  updating host message, and never creates the RRD's even though
 the 
  host is up.
  
  The critical problem is that the report graphs for this cluster
 have 
  stopped being updated as well, which nix'es my ability to view
 cluster 
  load/job level... in addition to not being able to alert on the RRD
 
  values for the individual hosts that are malfunctioning.  Those
 hosts 
  that are good continue to update their metric RRD's properly,
 their 
  host reports are populated etc.  The bad ones I cannot explain...
  
  The two questions, if anyone has insight:
  
  1) What is causing gmetad to stop acting on the gmond XML input that it
  has available?  I don't see any error or threshold it's hitting WRT the
  hosts, they just don't create/update the RRD.
  
  2) Why does the report stop being populated (the graph is still
  generated with past data, but not updated with new... not even the data
  from hosts that ARE functioning individually)?
  
  I'm continuing on with this, will update with anything else I find
 awry. 
   Any suggestions on what to pursue beyond this are welcome... at
 this 
  point it looks to me a problem with the magic in gmetad's parsing
 of the 
  gmond output, since it is present and up-to-date but not acting on
 it.
  
  Cheers,
  
  /eli
  
  
  Here are the details:
  
  server:
  ganglia 3.0.2 (x86_64)
  6 (six) multicast clusters polled by gmetad
  1 (one) unicast cluster, reporting to a 'mute' gmond aggregating on
 the 
  same host as gmetad.
  
  clients:
  suse9.3 x86_64
  ganglia 3.0.2 (x86_64)
  
  
  Debug logged info (-d2):
  
  Bad host:
  
Apache error_log for bad host:
  ERROR: opening
  '/var/lib/ganglia/rrds/Opteron_Production-Desktop_Droid_Cluster/frankenstein.lucasfilm.com/swap_free.rrd':
  No such file or directory
  
gmond:
  Processing a Ganglia_message from badhost
gmetad:
  server_thread() received request 
  /Opteron_Production-Desktop_Droid_Cluster/badhost from 127.0.0.1
  
XML:
  <HOST NAME="badhost" IP="10.65.34.22" REPORTED="1143682835" TN="4"
   TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1143677550">
  <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs" TN="488"
   TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
  <METRIC NAME="disk_total" VAL="71.047" TYPE="double" UNITS="GB"
   TN="1688" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
  <METRIC NAME="disk_free" VAL="57.776" TYPE="double" UNITS="GB" TN="128"
   TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
  <METRIC NAME="cpu_speed" VAL="2612" TYPE="uint32" UNITS="MHz" TN="488"
   TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
  <METRIC NAME="part_max_used" VAL="52.7" TYPE="float" UNITS="" TN="128"
   TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
  <METRIC NAME="mem_total" VAL="8147640" TYPE="uint32" UNITS="KB" TN="488"
   TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
  <METRIC NAME="swap_total" VAL="2104504" TYPE="uint32" UNITS="KB"
   TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
  <METRIC NAME="boottime" VAL="1143590767" TYPE="uint32" UNITS="s"
   TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
  <METRIC NAME="machine_type" VAL="x86_64" TYPE="string" UNITS="" TN="488"
   TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
  <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="488"
   TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
  <METRIC NAME="os_release" VAL="2.6.13.4_K8+NUMA+NV" TYPE="string"
   UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
  <METRIC NAME="cpu_user" VAL="93.6" TYPE="float" UNITS="%" TN="27"
   TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
  <METRIC NAME="cpu_system" VAL="0.6" TYPE="float" UNITS="%" TN="27"
   TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
  <METRIC NAME="load_one" VAL="2.03" TYPE="float" UNITS="" TN="68"
   TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
  <METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS="" TN="8" TMAX="950"

RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

2006-03-29 Thread Eli Stair

Martin, et al:

I'm getting "...illegal attempt to update using time 1143703242 when last 
update time is 1143703242 (minimum one second step)..." messages for the 
improper 'linux.' hosts.  I was assuming that gmetad was sorting/indexing the 
data from those sources by the FQDN, which was the same for about 12 new hosts, 
and thus erroring out when it receives timestamped data from the same host 
that is out of line with the last send period from a nearby node.  gmetad is 
still creating/populating the linux RRD's however.
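
The message itself comes from the rule that an RRD only accepts an update whose timestamp is later than the previous one. A minimal sketch of that rule, using a hypothetical `struct fake_rrd` and `fake_rrd_update` rather than the real RRDtool API:

```c
#include <assert.h>
#include <time.h>

/* Hypothetical model of one RRD file: only the last update time matters here. */
struct fake_rrd {
    time_t last_update;
};

/* Accept an update only if its timestamp is strictly later than the last
 * one (the "minimum one second step" in the message).
 * Returns 0 on success, -1 if the update is rejected. */
static int fake_rrd_update(struct fake_rrd *rrd, time_t when)
{
    assert(rrd != NULL);
    if (when <= rrd->last_update)
        return -1;              /* same or older timestamp: rejected */
    rrd->last_update = when;
    return 0;
}
```

Two host entries that resolve to the same name and are processed in the same second produce exactly this pattern: the first update at 1143703242 succeeds and the second, carrying the same timestamp, is rejected.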

IMO, since there's no inter-node communication without multicast being enabled 
(the XML port on the unicast hosts doesn't even output metrics, only the basic 
info headers, CLUSTER NAME= stanza, and !ELEMENT * EMPTY) there (shouldn't) 
be anything to worry about with other nodes caching the improper data... 
however, that doesn't mesh with the host that I have fixed and that still
doesn't update.

So I'm still trying to work out why gmetad isn't acting on the metrics sent and 
logging the equivalent of "Updating host host.fqdn, metric..." for the 
previously broken, but now properly DNS'd host.  And the most unexplainable item 
so far: one functional host I noticed was reporting its HOST NAME= as that 
of the gmetad server... after a restart it ceased doing that.

Thanks for the input, I'm going to restart all the gmond's on the unicast 
hosts, just to be sure, and fix the crufty PTR records as well.  Any other 
thoughts anyone?  

Thanks,

/eli


-Original Message-
From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
Sent: Wed 3/29/2006 11:05 PM
To: Eli Stair
Cc: Ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are 
proper in gmond XML
 
Eli,

 yup. That could definitely cause problems. Do you see anything in the
/var/log/messages of the gmetad host?

 Hmm. You may have to restart *all* gmonds, as well as the gmetad. This
is something that I usually do when my ganglia setup was hosed somehow.
Definitely the case for multicast clusters. Not really sure about
unicast.

 And yes - this is not optimal.

--- Eli Stair [EMAIL PROTECTED] wrote:

 
 The only issue I can find at all with this config is that the new
 hosts 
 have been deployed by someone with two PTR records, both the proper
 one 
 pointing to the A hostname, as well as all having an improper PTR - 
 linux.FQDN.
 
 Is there a potential that gmetad is doing a lookup of both the
 forward 
 and reverse entries for a host before populating it?  Unfortunately 
 removing the invalid entry for a host and restarting gmetad as well
 as 
 the gmond aggregator and the host did not resolve it.
 
 /eli
 
 Eli Stair wrote:
  
  My installation started having an issue yesterday afternoon that I
 have 
  yet to explain or remedy.  One cluster that I have unicasting, has 
  started losing hosts... the directory entries on disk never get 
  created for newly deployed hosts, and gmond reports receiving
 messages 
  for the host (and outputs metrics) but gmetad does not report an 
  updating host message, and never creates the RRD's even though
 the 
  host is up.
  
  The critical problem is that the report graphs for this cluster
 have 
  stopped being updated as well, which nix'es my ability to view
 cluster 
  load/job level... in addition to not being able to alert on the RRD
 
  values for the individual hosts that are malfunctioning.  Those
 hosts 
  that are good continue to update their metric RRD's properly,
 their 
  host reports are populated etc.  The bad ones I cannot explain...
  
  The two questions, if anyone has insight:
  
  1) What is causing gmetad to stop acting on the gmond XML input that it
  has available?  I don't see any error or threshold it's hitting WRT the
  hosts, they just don't create/update the RRD.
  
  2) Why does the report stop being populated (the graph is still
  generated with past data, but not updated with new... not even the data
  from hosts that ARE functioning individually)?
  
  I'm continuing on with this, will update with anything else I find
 awry. 
   Any suggestions on what to pursue beyond this are welcome... at
 this 
  point it looks to me a problem with the magic in gmetad's parsing
 of the 
  gmond output, since it is present and up-to-date but not acting on
 it.
  
  Cheers,
  
  /eli
  
  
  Here are the details:
  
  server:
  ganglia 3.0.2 (x86_64)
  6 (six) multicast clusters polled by gmetad
  1 (one) unicast cluster, reporting to a 'mute' gmond aggregating on
 the 
  same host as gmetad.
  
  clients:
  suse9.3 x86_64
  ganglia 3.0.2 (x86_64)
  
  
  Debug logged info (-d2):
  
  Bad host:
  
Apache error_log for bad host:
  ERROR: opening
  '/var/lib/ganglia/rrds/Opteron_Production-Desktop_Droid_Cluster/frankenstein.lucasfilm.com/swap_free.rrd':
  No such file or directory
  
gmond:
  Processing a Ganglia_message from badhost
gmetad:
  server_thread() received request 
  /Opteron_Production