Eli, yup. That could definitely cause problems. Do you see anything in the /var/log/messages of the gmetad host?
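To see quickly whether a host's reverse and forward DNS entries agree, something like the following Python sketch could help. It is only an illustration: the function name `check_dns_consistency` and its injectable lookup parameters are made up for this example, and whether the resolver returns every PTR record as an alias depends on the local resolver setup. It says nothing about what gmetad itself does with these lookups.

```python
import socket


def check_dns_consistency(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Return (ptr_names, mismatches) for one IP address.

    ptr_names  -- every name the reverse lookup returned (primary + aliases)
    mismatches -- those names whose forward lookup does not lead back to ip

    The two lookup parameters exist only so the check can be exercised
    without a live DNS server; by default the system resolver is used.
    """
    primary, aliases, _ = reverse_lookup(ip)
    ptr_names = [primary] + list(aliases)

    mismatches = []
    for name in ptr_names:
        try:
            _, _, addresses = forward_lookup(name)
        except OSError:
            # socket.gaierror and socket.herror are both OSError subclasses;
            # treat a failed forward lookup as "no addresses".
            addresses = []
        if ip not in addresses:
            mismatches.append(name)
    return ptr_names, mismatches
```

Calling `check_dns_consistency("10.65.34.22")` on the gmetad host would list the names the resolver hands back for that address and flag any that do not resolve forward to it.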
Hmm. You may have to restart *all* gmonds, as well as the gmetad. This is something that I usually do when my ganglia setup gets hosed somehow. Definitely needed for multicast clusters. Not really sure about unicast. And yes, this is not optimal.

--- Eli Stair <[EMAIL PROTECTED]> wrote:

>
> The only issue I can find at all with this config is that the new hosts have been deployed by someone with two PTR records: the proper one pointing to the A hostname and, on all of them, an improper PTR -> linux."FQDN".
>
> Is there a potential that gmetad is doing a lookup of both the forward and reverse entries for a host before populating it? Unfortunately, removing the invalid entry for a host and restarting gmetad as well as the gmond aggregator and the host did not resolve it.
>
> /eli
>
> Eli Stair wrote:
> >
> > My installation started having an issue yesterday afternoon that I have yet to explain or remedy. One cluster that I have unicasting has started "losing" hosts... the directory entries on disk never get created for newly deployed hosts, and gmond reports receiving messages for the host (and outputs metrics), but gmetad does not report an "updating host" message and never creates the RRDs even though the host is up.
> >
> > The critical problem is that the report graphs for this cluster have stopped being updated as well, which nixes my ability to view cluster load/job level... in addition to not being able to alert on the RRD values for the individual hosts that are malfunctioning. Those hosts that are "good" continue to update their metric RRDs properly, their host reports are populated, etc. The bad ones I cannot explain...
> >
> > The two questions, if anyone has insight:
> >
> > 1) What is causing gmetad to stop acting on the gmond XML input that it has available?
> > I don't see any error or threshold it's hitting WRT the hosts; they just don't create/update the RRD.
> >
> > 2) Why does the report stop being populated? (The graph is still generated with past data, but not updated with new... not even the data from hosts that ARE functioning individually.)
> >
> > I'm continuing on with this, will update with anything else I find awry. Any suggestions on what to pursue beyond this are welcome... at this point it looks to me like a problem with the magic in gmetad's parsing of the gmond output, since the output is present and up-to-date but gmetad is not acting on it.
> >
> > Cheers,
> >
> > /eli
> >
> >
> > Here are the details:
> >
> > server:
> > ganglia 3.0.2 (x86_64)
> > 6 (six) multicast clusters polled by gmetad
> > 1 (one) unicast cluster, reporting to a 'mute' gmond aggregating on the same host as gmetad.
> >
> > clients:
> > suse9.3 x86_64
> > ganglia 3.0.2 (x86_64)
> >
> >
> > Debug logged info (-d2):
> >
> > Bad host:
> >
> > Apache error_log for bad host:
> > ERROR: opening '/var/lib/ganglia/rrds/Opteron_Production-Desktop_Droid_Cluster/frankenstein.lucasfilm.com/swap_free.rrd': No such file or directory
> >
> > gmond:
> > Processing a Ganglia_message from badhost
> > gmetad:
> > server_thread() received request "/Opteron_Production-Desktop_Droid_Cluster/badhost" from 127.0.0.1
> >
> > XML:
> > <HOST NAME="badhost" IP="10.65.34.22" REPORTED="1143682835" TN="4" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1143677550">
> > <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="disk_total" VAL="71.047" TYPE="double" UNITS="GB" TN="1688" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="disk_free" VAL="57.776" TYPE="double" UNITS="GB" TN="128" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="cpu_speed" VAL="2612" TYPE="uint32" UNITS="MHz"
> > TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="part_max_used" VAL="52.7" TYPE="float" UNITS="" TN="128" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_total" VAL="8147640" TYPE="uint32" UNITS="KB" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="swap_total" VAL="2104504" TYPE="uint32" UNITS="KB" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="boottime" VAL="1143590767" TYPE="uint32" UNITS="s" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="machine_type" VAL="x86_64" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="os_release" VAL="2.6.13.4_K8+NUMA+NV" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="cpu_user" VAL="93.6" TYPE="float" UNITS="%" TN="27" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="cpu_system" VAL="0.6" TYPE="float" UNITS="%" TN="27" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="load_one" VAL="2.03" TYPE="float" UNITS="" TN="68" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS="" TN="8" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="proc_total" VAL="128" TYPE="uint32" UNITS="" TN="8" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_free" VAL="1328356" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_shared" VAL="0" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_buffers" VAL="199232" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both"
> > SOURCE="gmond"/>
> > <METRIC NAME="mem_cached" VAL="4569200" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="swap_free" VAL="2101964" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="gexec" VAL="ON" TYPE="string" UNITS="" TN="188" TMAX="300" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="bytes_out" VAL="6066.85" TYPE="float" UNITS="bytes/sec" TN="8" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="bytes_in" VAL="203006.30" TYPE="float" UNITS="bytes/sec" TN="8" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="numthreads" VAL="2" TYPE="int8" UNITS="" TN="324" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > <METRIC NAME="numjobs" VAL="2" TYPE="int8" UNITS="" TN="324" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > </HOST>
> >
> >
> > Good host:
> >
> > gmond:
> > Processing a Ganglia_message from goodhost
> > gmetad:
> > Updating host goodhost, metric numjobs
> > server_thread() received request "/Opteron_Production-Desktop_Droid_Cluster/goodhost" from 127.0.0.1
> > XML:
> > <HOST NAME="goodhost" IP="10.73.16.225" REPORTED="1143682838" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1143137198">
> > <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="disk_total" VAL="71.047" TYPE="double" UNITS="GB" TN="2039" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="disk_free" VAL="46.667" TYPE="double" UNITS="GB" TN="178" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="cpu_speed" VAL="2411" TYPE="uint32" UNITS="MHz" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="part_max_used" VAL="70.5" TYPE="float" UNITS="" TN="178" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_total" VAL="8147640" TYPE="uint32" UNITS="KB" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="swap_total" VAL="2104504" TYPE="uint32" UNITS="KB" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="boottime" VAL="1142553979" TYPE="uint32" UNITS="s" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="machine_type" VAL="x86_64" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="os_release" VAL="2.6.13.4_K8+NUMA+NV" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="cpu_user" VAL="73.1" TYPE="float" UNITS="%" TN="8" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="cpu_system" VAL="3.9" TYPE="float" UNITS="%" TN="8" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="load_one" VAL="1.99" TYPE="float" UNITS="" TN="9" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS="" TN="149" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="proc_total" VAL="156" TYPE="uint32" UNITS="" TN="149" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_free" VAL="2359176" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_shared" VAL="0" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_buffers" VAL="36384" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_cached" VAL="4162056" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="swap_free" VAL="1786428" TYPE="uint32"
> > UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="gexec" VAL="ON" TYPE="string" UNITS="" TN="229" TMAX="300" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="bytes_out" VAL="305162.19" TYPE="float" UNITS="bytes/sec" TN="28" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="bytes_in" VAL="40802.30" TYPE="float" UNITS="bytes/sec" TN="28" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="numthreads" VAL="1" TYPE="int8" UNITS="" TN="844" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > <METRIC NAME="numjobs" VAL="1" TYPE="int8" UNITS="" TN="844" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > </HOST>
> >
>
> _______________________________________________
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
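
For anyone chasing the same symptom (hosts present in the gmond XML but with no RRD directory on disk), here is a rough Python sketch that lists the affected hosts. It is not part of ganglia: the function names `fetch_gmond_xml` and `hosts_missing_rrds` are invented for this example, port 8649 is assumed to be the gmond TCP accept port (its usual default), and the RRD layout is assumed to be one directory per host under the cluster directory, as in the Apache error quoted above.

```python
import os
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="127.0.0.1", port=8649, timeout=10.0):
    """Read the XML dump that gmond sends on its TCP accept channel.

    Assumes gmond writes the whole document and then closes the connection.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_missing_rrds(gmond_xml, cluster_rrd_dir):
    """Return names of hosts that appear in the gmond XML (str or bytes)
    but have no directory of their own under cluster_rrd_dir."""
    root = ET.fromstring(gmond_xml)
    missing = []
    for host in root.iter("HOST"):
        name = host.get("NAME")
        if name is None:
            continue  # skip malformed HOST elements without a NAME
        if not os.path.isdir(os.path.join(cluster_rrd_dir, name)):
            missing.append(name)
    return missing
```

On the gmetad host this could be run as `hosts_missing_rrds(fetch_gmond_xml(), "/var/lib/ganglia/rrds/Opteron_Production-Desktop_Droid_Cluster")`, pointing `fetch_gmond_xml` at the unicast aggregator. The resulting list gives a concrete set of hosts to compare against DNS and the gmetad debug output.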