RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
Eli,

OK, the messages are coming from RRDTOOL, just telling you that you tried to update the same metric with exactly the same timestamp. Do you see any messages prefixed RRD_create in your logfiles?

The problem is that if one of the rrd_updates fails, gmetad stops working on anything. Do you have a chance to rebuild gmetad with the following patch? It is against current CVS, but should apply against 3.0.2. If it helps, all hosts (metrics) except the one causing problems should be OK. It might not be the real solution, but may help us to track it down.

[gmetad]$ diff -udp rrd_helpers.c rrd_helpers.c-new
--- rrd_helpers.c       2005-03-15 19:11:33.0 +0100
+++ rrd_helpers.c-new   2006-03-30 11:28:26.0 +0200
@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );

In addition, do you see any messages prefixed RRD_create in your logfiles? You should, as some of the RRD files seem to be missing.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
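[Editor's sketch] The patch above makes RRD_update report success even when the update failed, so the caller no longer stops at the first bad metric. The same idea can be expressed at the caller level: log and count a failure, then carry on. This is a minimal illustration under stated assumptions, not gmetad code; `update_all` and `fake_update` are hypothetical names invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a single RRD update: returns 0 on success and
 * 1 on failure. Here it fails only for a metric named "dup". */
int fake_update(const char *metric)
{
    return strcmp(metric, "dup") == 0 ? 1 : 0;
}

/* Try every metric. A failure is logged and counted, but the loop carries
 * on, so one bad metric cannot block the rest. Returns the failure count. */
int update_all(const char *metrics[], int n, int (*update)(const char *))
{
    int failures = 0;
    for (int i = 0; i < n; i++) {
        if (update(metrics[i]) != 0) {
            fprintf(stderr, "update failed for %s, continuing\n", metrics[i]);
            failures++;
        }
    }
    return failures;
}
```

Unlike the one-line patch, this variant still tells the caller how many updates failed, which may be useful for logging; whether gmetad's callers could use such a count is an open question here.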
RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
Richard,

[adding ganglia-developers for comments]

pretty good explanation of what is likely happening, or what can go wrong. I sent Eli a patch I found useful a while ago, but which is not in CVS yet (because I fixed the root problem of the illegal updates). This should prevent gmetad from ignoring all hosts/metrics if just one of them is corrupt. Somewhere in the code we go nuts on an error return.

[gmetad]$ diff -udp rrd_helpers.c rrd_helpers.c-new
--- rrd_helpers.c       2005-03-15 19:11:33.0 +0100
+++ rrd_helpers.c-new   2006-03-30 11:28:26.0 +0200
@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );

--- [EMAIL PROTECTED] wrote:

Eli,

Martin is most surely right. If you are running an unpatched 3.0.2, let me share with you the many ways it can all go wrong.

gmond generates the hostnames found in the XML stream by reverse DNS lookup only. Its internal structures treat every different IP address it sees as a different host, regardless of what the reverse DNS entry is. So, if you have

1) Incorrect reverse DNS entries such that 2 different hosts reverse map to the same hostname,
2) Or 2 NICs on a host that are not teamed (i.e. 2 different addresses) and the routing allows packets to exit either NIC, hence either source address may be used,
3) Or a DHCP lease renewal that results in a host changing IP addresses,

then what will happen is that the XML stream from the cluster will contain 2 (or more) entries with different IP addrs, but the same name. Even in the DHCP case when only 1 source address is used at a time, gmond will keep the old IP address entry until a timeout, even though it is not being updated. So dups arise again.

Now unfortunately, gmetad only uses the HOSTNAME for the RRD files and its own processing. So if there is a duplicated hostname in the XML stream, it will update the RRDs after parsing the first entry, and then again after parsing the second. As these 2 updates to the same RRD files will occur in less than one second, this results in an RRD update error.

On unpatched 3.0.2, this then causes THE ENTIRE PROCESSING OF THE CLUSTER TO BE ABANDONED. So some hosts get updated, some not, and the cluster view does not get updated.

If you patch this particular issue, you will still get double processing for duped hosts, which can result in them erroneously being reported as down (for example).

phew. long mail.

- richard

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Martin Knoblauch
Sent: 30 March 2006 08:05
To: Eli Stair
Cc: Ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

Eli,

yup. That could definitely cause problems. Do you see anything in the /var/log/messages of the gmetad host?

Hmm. You may have to restart *all* gmonds, as well as the gmetad. This is something that I usually do when my ganglia setup was hosed somehow. Definitely the case for multicast clusters. Not really sure about unicast. And yes - this is not optimal.

--- Eli Stair [EMAIL PROTECTED] wrote:

The only issue I can find at all with this config is that the new hosts have been deployed by someone with two PTR records, both the proper one pointing to the A hostname, as well as all having an improper PTR - linux.FQDN.

Is there a potential that gmetad is doing a lookup of both the forward and reverse entries for a host before populating it? Unfortunately removing the invalid entry for a host and restarting gmetad as well as the gmond aggregator and the host did not resolve it.

/eli

Eli Stair wrote:

My installation started having an issue yesterday afternoon that I have yet to explain or remedy. One cluster that I have unicasting, has started losing hosts...
the directory entries on disk never get created for newly deployed hosts, and gmond reports receiving messages for the host (and outputs metrics) but gmetad does not report an updating host message, and never creates the RRD's even though the host is up. The critical problem is that the report graphs for this cluster have stopped being updated as well, which nix'es my ability to view cluster load/job level... in addition to not being able to alert on the RRD values for the individual hosts that are malfunctioning. Those hosts that are good continue to update their metric RRD's properly, their host reports are populated etc. The bad ones I cannot explain... The two questions, if anyone has insight: 1) What is causing
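[Editor's sketch] Richard's explanation quoted in this message comes down to one condition: two entries in the cluster XML that carry the same hostname but different IP addresses. The C fragment below is a minimal sketch of how such duplicates could be counted in an already parsed host list. It is not taken from gmond or gmetad; the `host_entry` struct and `count_duplicate_names` are hypothetical names used only for illustration.

```c
#include <string.h>

/* Hypothetical, simplified view of one HOST entry from the cluster XML. */
struct host_entry {
    const char *ip;
    const char *name;
};

/* Count entries whose name was already seen earlier in the list with a
 * different IP address. Quadratic, which is fine for a quick check. */
int count_duplicate_names(const struct host_entry *hosts, int n)
{
    int dups = 0;
    for (int i = 1; i < n; i++) {
        for (int j = 0; j < i; j++) {
            if (strcmp(hosts[i].name, hosts[j].name) == 0 &&
                strcmp(hosts[i].ip, hosts[j].ip) != 0) {
                dups++;
                break; /* count each later entry only once */
            }
        }
    }
    return dups;
}
```

A non-zero result for a cluster would suggest checking reverse DNS, multi-homed hosts, or recent DHCP changes, which are the three cases listed in the explanation.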
RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
Sorry about my imprecise categorizing of things ;)

The RRD_create still occurs if I delete the fqdn-directory entry for the 'linux' host that is being generated from the numerous reverse PTR records.

/eli

-----Original Message-----
From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
Sent: Thu 3/30/2006 1:40 AM
To: Eli Stair; [EMAIL PROTECTED]
Cc: Ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

Eli,

is the famous tragic scientist the good name, or the bad name? In either case, does the opposite RRD file exist? And, do you see any messages prefixed RRD_create in your messages file?

Cheers
Martin

--- Eli Stair [EMAIL PROTECTED] wrote:

Here are the details:

server:
ganglia 3.0.2 (x86_64)
6 (six) multicast clusters polled by gmetad
1 (one) unicast cluster, reporting to a 'mute' gmond aggregating on the same host as gmetad.

clients:
suse9.3 x86_64
ganglia 3.0.2 (x86_64)

Debug logged info (-d2):

Bad host:

Apache error_log for bad host:
ERROR: opening '/var/lib/ganglia/rrds/Opteron_Production-Desktop_Droid_Cluster/frankenstein.lucasfilm.com/swap_free.rrd': No such file or directory

gmond:
Processing a Ganglia_message from badhost

gmetad:
server_thread() received request /Opteron_Production-Desktop_Droid_Cluster/badhost from 127.0.0.1

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
I'll give this a try today before I rectify the DNS issue altogether. I wanted to track down and understand the cause internally before just fixing DNS...

This actually seems to describe the situation I have now though: only the hosts with incorrect PTR's are not populating their RRD's; also the cluster stats aren't being aggregated.

Thanks for your input!

/eli

-----Original Message-----
From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
Sent: Thu 3/30/2006 1:36 AM
To: Eli Stair; [EMAIL PROTECTED]
Cc: Ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

Eli,

OK, the messages are coming from RRDTOOL, just telling you that you tried to update the same metric with exactly the same timestamp. Do you see any messages prefixed RRD_create in your logfiles?

The problem is that if one of the rrd_updates fails, gmetad stops working on anything. Do you have a chance to rebuild gmetad with the following patch? It is against current CVS, but should apply against 3.0.2. If it helps, all hosts (metrics) except the one causing problems should be OK. It might not be the real solution, but may help us to track it down.

[gmetad]$ diff -udp rrd_helpers.c rrd_helpers.c-new
--- rrd_helpers.c       2005-03-15 19:11:33.0 +0100
+++ rrd_helpers.c-new   2006-03-30 11:28:26.0 +0200
@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );

In addition, do you see any messages prefixed RRD_create in your logfiles? You should, as some of the RRD files seem to be missing.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
Eli,

if the patch helps, I am inclined to put it into 3.0.3 (if CVS is working again :-(

Martin

--- Eli Stair [EMAIL PROTECTED] wrote:

Richard, Martin, et al:

Thanks for all your assistance describing the workings and why it is going wrong... the glomming together of all the host XML and the organization to disk of it has been quite a black box to me.

You explain how this can occur on an unpatched 3.0.2; is the recommended patch the one Martin posted, or is there something else suggested? I'll give his a shot, and if it isn't successful try the CVS build. I've been trying to wait for 3.0.3 before making any more changes than just PHP interface stuff.

Cheers,

/eli

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Thu 3/30/2006 1:35 AM
To: [EMAIL PROTECTED]; Eli Stair
Cc: Ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

Eli,

Martin is most surely right. If you are running an unpatched 3.0.2, let me share with you the many ways it can all go wrong.

gmond generates the hostnames found in the XML stream by reverse DNS lookup only. Its internal structures treat every different IP address it sees as a different host, regardless of what the reverse DNS entry is. So, if you have

1) Incorrect reverse DNS entries such that 2 different hosts reverse map to the same hostname,
2) Or 2 NICs on a host that are not teamed (i.e. 2 different addresses) and the routing allows packets to exit either NIC, hence either source address may be used,
3) Or a DHCP lease renewal that results in a host changing IP addresses,

then what will happen is that the XML stream from the cluster will contain 2 (or more) entries with different IP addrs, but the same name. Even in the DHCP case when only 1 source address is used at a time, gmond will keep the old IP address entry until a timeout, even though it is not being updated. So dups arise again.
Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
Sorry I haven't had a chance to guinea-pig this... I had to nuke the false PTR records for the hosts causing issues.

It looks good to me though: your patch just allows the process to continue updating the RRD's instead of erroring out, so everything but the hosts with issues continues to function.

/eli

@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );

Martin Knoblauch wrote:

Eli,

if the patch helps, I am inclined to put it into 3.0.3 (if CVS is working again :-(

Martin

--- Eli Stair [EMAIL PROTECTED] wrote:

Richard, Martin, et al:

Thanks for all your assistance describing the workings and why it is going wrong... the glomming together of all the host XML and the organization to disk of it has been quite a black box to me.

You explain how this can occur on an unpatched 3.0.2; is the recommended patch the one Martin posted, or is there something else suggested? I'll give his a shot, and if it isn't successful try the CVS build. I've been trying to wait for 3.0.3 before making any more changes than just PHP interface stuff.

Cheers,

/eli

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Thu 3/30/2006 1:35 AM
To: [EMAIL PROTECTED]; Eli Stair
Cc: Ganglia-general@lists.sourceforge.net
Subject: RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

Eli,

Martin is most surely right. If you are running an unpatched 3.0.2, let me share with you the many ways it can all go wrong.

gmond generates the hostnames found in the XML stream by reverse DNS lookup only. Its internal structures treat every different IP address it sees as a different host, regardless of what the reverse DNS entry is.
Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
Eli,

yup. That could definitely cause problems. Do you see anything in the /var/log/messages of the gmetad host?

Hmm. You may have to restart *all* gmonds, as well as the gmetad. This is something that I usually do when my ganglia setup was hosed somehow. Definitely the case for multicast clusters. Not really sure about unicast. And yes - this is not optimal.

--- Eli Stair [EMAIL PROTECTED] wrote:

The only issue I can find at all with this config is that the new hosts have been deployed by someone with two PTR records, both the proper one pointing to the A hostname, as well as all having an improper PTR - linux.FQDN.

Is there a potential that gmetad is doing a lookup of both the forward and reverse entries for a host before populating it? Unfortunately removing the invalid entry for a host and restarting gmetad as well as the gmond aggregator and the host did not resolve it.

/eli

Eli Stair wrote:

My installation started having an issue yesterday afternoon that I have yet to explain or remedy. One cluster that I have unicasting, has started losing hosts... the directory entries on disk never get created for newly deployed hosts, and gmond reports receiving messages for the host (and outputs metrics) but gmetad does not report an updating host message, and never creates the RRD's even though the host is up.

The critical problem is that the report graphs for this cluster have stopped being updated as well, which nix'es my ability to view cluster load/job level... in addition to not being able to alert on the RRD values for the individual hosts that are malfunctioning.

Those hosts that are good continue to update their metric RRD's properly, their host reports are populated etc. The bad ones I cannot explain...

The two questions, if anyone has insight:

1) What is causing gmetad to stop acting on the gmond XML input that it has available? I don't see any error or threshold it's hitting WRT the hosts, they just don't create/update the RRD.
2) Why does the report stop being populated (the graph is still generated with past data, but not updated with new... not even the data from hosts that ARE functioning individually)?

I'm continuing on with this, will update with anything else I find awry. Any suggestions on what to pursue beyond this are welcome... at this point it looks to me like a problem with the magic in gmetad's parsing of the gmond output, since it is present and up-to-date but not acting on it.

Cheers,

/eli

Here are the details:

server:
ganglia 3.0.2 (x86_64)
6 (six) multicast clusters polled by gmetad
1 (one) unicast cluster, reporting to a 'mute' gmond aggregating on the same host as gmetad.

clients:
suse9.3 x86_64
ganglia 3.0.2 (x86_64)

Debug logged info (-d2):

Bad host:

Apache error_log for bad host:
ERROR: opening '/var/lib/ganglia/rrds/Opteron_Production-Desktop_Droid_Cluster/frankenstein.lucasfilm.com/swap_free.rrd': No such file or directory

gmond:
Processing a Ganglia_message from badhost

gmetad:
server_thread() received request /Opteron_Production-Desktop_Droid_Cluster/badhost from 127.0.0.1

XML:
<HOST NAME="badhost" IP="10.65.34.22" REPORTED="1143682835" TN="4" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1143677550">
<METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
<METRIC NAME="disk_total" VAL="71.047" TYPE="double" UNITS="GB" TN="1688" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
<METRIC NAME="disk_free" VAL="57.776" TYPE="double" UNITS="GB" TN="128" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
<METRIC NAME="cpu_speed" VAL="2612" TYPE="uint32" UNITS="MHz" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
<METRIC NAME="part_max_used" VAL="52.7" TYPE="float" UNITS="" TN="128" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
<METRIC NAME="mem_total" VAL="8147640" TYPE="uint32" UNITS="KB" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
<METRIC NAME="swap_total" VAL="2104504" TYPE="uint32" UNITS="KB" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
<METRIC NAME="boottime" VAL="1143590767" TYPE="uint32" UNITS="s" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
<METRIC NAME="machine_type" VAL="x86_64" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
<METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
<METRIC NAME="os_release" VAL="2.6.13.4_K8+NUMA+NV" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
<METRIC NAME="cpu_user" VAL="93.6" TYPE="float" UNITS="%" TN="27" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
<METRIC NAME="cpu_system" VAL="0.6" TYPE="float" UNITS="%" TN="27" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
<METRIC NAME="load_one" VAL="2.03" TYPE="float" UNITS="" TN="68" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
<METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS="" TN="8" TMAX="950"
RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
Martin, et al:

I'm getting "...illegal attempt to update using time 1143703242 when last update time is 1143703242 (minimum one second step)..." messages for the improper 'linux.' hosts. I was assuming that gmetad was sorting/indexing the data from those sources by the FQDN, which was the same for about 12 new hosts, and thus erroring out when it receives timestamped data from the same host that is out of line with the last send period from a nearby node. gmetad is still creating/populating the linux RRD's however.

IMO, since there's no inter-node communication without multicast being enabled (the XML port on the unicast hosts doesn't even output metrics, only the basic info headers, CLUSTER NAME= stanza, and !ELEMENT * EMPTY) there (shouldn't) be anything to worry about with other nodes caching the improper data... however, that doesn't mesh with the host that I have fixed and that still doesn't update.

So I'm still trying to work out why gmetad isn't acting on the metrics sent, and isn't logging the equivalent of "Updating host host.fqdn, metric..." for the previously broken, but now properly DNS'd host.

And the most unexplainable item so far: one functional host I noticed was reporting its HOST NAME= as that of the gmetad server... after a restart it ceased doing that.

Thanks for the input, I'm going to restart all the gmond's on the unicast hosts, just to be sure, and fix the crufty PTR records as well. Any other thoughts anyone?

Thanks,

/eli

-----Original Message-----
From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
Sent: Wed 3/29/2006 11:05 PM
To: Eli Stair
Cc: Ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML

Eli,

yup. That could definitely cause problems. Do you see anything in the /var/log/messages of the gmetad host?

Hmm. You may have to restart *all* gmonds, as well as the gmetad. This is something that I usually do when my ganglia setup was hosed somehow. Definitely the case for multicast clusters.
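[Editor's sketch] The "illegal attempt to update" message reported in this message reflects RRDtool's rule that each update must carry a timestamp strictly later than the previous one. The fragment below is a minimal sketch of that rule in isolation, assuming a single stored last-update time per series. It is not RRDtool or gmetad code; `struct series` and `try_update` are hypothetical names.

```c
#include <time.h>

/* Hypothetical per-series state: only the time of the last accepted update. */
struct series {
    time_t last_update;
};

/* Accept an update only if its timestamp is strictly newer than the last
 * accepted one (the "minimum one second step"). Returns 1 if accepted and
 * recorded, 0 if rejected. */
int try_update(struct series *s, time_t when)
{
    if (when <= s->last_update)
        return 0;
    s->last_update = when;
    return 1;
}
```

With two XML entries sharing one hostname, the second entry would typically arrive with the same processing time as the first, so a check of this kind rejects it.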