Eli,

 Yup. That could definitely cause problems. Do you see anything in
/var/log/messages on the gmetad host?
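
 To rule the double PTR records in or out, something like the
following might help. It is only a rough Python sketch, not anything
gmetad itself does: it asks the resolver for the names behind an IP and
reports those that do not resolve back to that IP. The function name
and the lookup parameters are made up for illustration, and whether the
resolver returns both PTR names at all depends on the local resolver
setup, so treat an empty result with caution.

```python
import socket


def ptr_names_not_resolving_back(ip,
                                 reverse=socket.gethostbyaddr,
                                 forward=socket.gethostbyname_ex):
    """Return reverse-lookup names for `ip` that do not resolve back to it.

    `reverse` and `forward` default to the standard socket lookups; they
    are parameters only so the check can be exercised without real DNS.
    """
    # gethostbyaddr returns (primary_name, alias_list, address_list)
    primary, aliases, _ = reverse(ip)
    bad = []
    for name in [primary] + list(aliases):
        try:
            # gethostbyname_ex returns (name, alias_list, address_list)
            _, _, addrs = forward(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError
            addrs = []
        if ip not in addrs:
            bad.append(name)
    return bad
```

 Calling ptr_names_not_resolving_back("10.65.34.22") on the gmetad
host would list any name, such as a stray linux.<FQDN> entry, that has
no matching A record.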

 Hmm. You may have to restart *all* gmonds, as well as the gmetad. This
is something that I usually do when my ganglia setup is hosed somehow.
It is definitely necessary for multicast clusters. I am not really sure
about unicast.

 And yes - this is not optimal.
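
 To check whether a restart actually helped, one could compare the
hosts in the gmond XML against the directories under the RRD root.
Below is a minimal Python sketch. It assumes the XML wraps HOST
elements in CLUSTER elements with a NAME attribute, and that gmetad
lays out its RRDs as <rrd_root>/<cluster>/<host>/, as the path in the
Apache error below suggests. The function name is hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def hosts_missing_rrds(xml_text, rrd_root):
    """Return (cluster, host) pairs that appear in the XML but have no
    RRD directory under rrd_root.

    Assumes the layout <rrd_root>/<cluster NAME>/<host NAME>/.
    """
    root = ET.fromstring(xml_text)
    missing = []
    for cluster in root.iter("CLUSTER"):
        cluster_name = cluster.get("NAME")
        for host in cluster.iter("HOST"):
            host_name = host.get("NAME")
            host_dir = os.path.join(rrd_root, cluster_name, host_name)
            if not os.path.isdir(host_dir):
                missing.append((cluster_name, host_name))
    return missing
```

 Feed it the XML saved from the aggregating gmond and the RRD root
(/var/lib/ganglia/rrds in the setup below). Any host it returns is one
that gmond knows about but gmetad has never written to disk.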

--- Eli Stair <[EMAIL PROTECTED]> wrote:

> 
> The only issue I can find at all with this config is that the new hosts
> have been deployed by someone with two PTR records, both the proper one
> pointing to the A hostname, as well as all having an improper PTR ->
> linux."FQDN".
> 
> Is there a potential that gmetad is doing a lookup of both the forward
> and reverse entries for a host before populating it?  Unfortunately
> removing the invalid entry for a host and restarting gmetad as well as
> the gmond aggregator and the host did not resolve it.
> 
> /eli
> 
> Eli Stair wrote:
> > 
> > My installation started having an issue yesterday afternoon that I have
> > yet to explain or remedy.  One cluster that I have unicasting, has
> > started "losing" hosts... the directory entries on disk never get
> > created for newly deployed hosts, and gmond reports receiving messages
> > for the host (and outputs metrics) but gmetad does not report an
> > "updating host" message, and never creates the RRD's even though the
> > host is up.
> > 
> > The critical problem is that the report graphs for this cluster have
> > stopped being updated as well, which nix'es my ability to view cluster
> > load/job level... in addition to not being able to alert on the RRD
> > values for the individual hosts that are malfunctioning.  Those hosts
> > that are "good" continue to update their metric RRD's properly, their
> > host reports are populated etc.  The bad ones I cannot explain...
> > 
> > The two questions, if anyone has insight:
> > 
> > 1) What is causing gmetad to stop acting on the gmond XML input that it
> > has available?  I don't see any error or threshold it's hitting WRT the
> > hosts, they just don't create/update the RRD
> > 
> > 2) Why does the report stop being populated (the graph is still
> > generated with past data, but not updated with new... not even the data
> > from hosts that ARE functioning individually)?
> > 
> > I'm continuing on with this, will update with anything else I find awry.
> > Any suggestions on what to pursue beyond this are welcome... at this
> > point it looks to me a problem with the magic in gmetad's parsing of the
> > gmond output, since it is present and up-to-date but not acting on it.
> > 
> > Cheers,
> > 
> > /eli
> > 
> > 
> > Here are the details:
> > 
> > server:
> > ganglia 3.0.2 (x86_64)
> > 6 (six) multicast clusters polled by gmetad
> > 1 (one) unicast cluster, reporting to a 'mute' gmond aggregating on the
> > same host as gmetad.
> > 
> > clients:
> > suse9.3 x86_64
> > ganglia 3.0.2 (x86_64)
> > 
> > 
> > Debug logged info (-d2):
> > 
> > Bad host:
> > 
> >   Apache error_log for bad host:
> >     ERROR: opening
> > '/var/lib/ganglia/rrds/Opteron_Production-Desktop_Droid_Cluster/frankenstein.lucasfilm.com/swap_free.rrd':
> > No such file or directory
> > 
> >   gmond:
> >     Processing a Ganglia_message from badhost
> >   gmetad:
> >     server_thread() received request 
> > "/Opteron_Production-Desktop_Droid_Cluster/badhost" from 127.0.0.1
> > 
> >   XML:
> > <HOST NAME="badhost" IP="10.65.34.22" REPORTED="1143682835" TN="4" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1143677550">
> > <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="disk_total" VAL="71.047" TYPE="double" UNITS="GB" TN="1688" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="disk_free" VAL="57.776" TYPE="double" UNITS="GB" TN="128" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="cpu_speed" VAL="2612" TYPE="uint32" UNITS="MHz" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="part_max_used" VAL="52.7" TYPE="float" UNITS="" TN="128" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_total" VAL="8147640" TYPE="uint32" UNITS="KB" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="swap_total" VAL="2104504" TYPE="uint32" UNITS="KB" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="boottime" VAL="1143590767" TYPE="uint32" UNITS="s" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="machine_type" VAL="x86_64" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="os_release" VAL="2.6.13.4_K8+NUMA+NV" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="cpu_user" VAL="93.6" TYPE="float" UNITS="%" TN="27" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="cpu_system" VAL="0.6" TYPE="float" UNITS="%" TN="27" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="load_one" VAL="2.03" TYPE="float" UNITS="" TN="68" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS="" TN="8" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="proc_total" VAL="128" TYPE="uint32" UNITS="" TN="8" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_free" VAL="1328356" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_shared" VAL="0" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_buffers" VAL="199232" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_cached" VAL="4569200" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="swap_free" VAL="2101964" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="gexec" VAL="ON" TYPE="string" UNITS="" TN="188" TMAX="300" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="bytes_out" VAL="6066.85" TYPE="float" UNITS="bytes/sec" TN="8" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="bytes_in" VAL="203006.30" TYPE="float" UNITS="bytes/sec" TN="8" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="numthreads" VAL="2" TYPE="int8" UNITS="" TN="324" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > <METRIC NAME="numjobs" VAL="2" TYPE="int8" UNITS="" TN="324" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > </HOST>
> > 
> > 
> > Good host:
> > 
> >   gmond:
> >     Processing a Ganglia_message from goodhost
> >   gmetad:
> >     Updating host goodhost, metric numjobs
> >     server_thread() received request 
> > "/Opteron_Production-Desktop_Droid_Cluster/goodhost" from 127.0.0.1
> >   XML:
> > <HOST NAME="goodhost" IP="10.73.16.225" REPORTED="1143682838" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1143137198">
> > <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="disk_total" VAL="71.047" TYPE="double" UNITS="GB" TN="2039" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="disk_free" VAL="46.667" TYPE="double" UNITS="GB" TN="178" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="cpu_speed" VAL="2411" TYPE="uint32" UNITS="MHz" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="part_max_used" VAL="70.5" TYPE="float" UNITS="" TN="178" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_total" VAL="8147640" TYPE="uint32" UNITS="KB" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="swap_total" VAL="2104504" TYPE="uint32" UNITS="KB" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="boottime" VAL="1142553979" TYPE="uint32" UNITS="s" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="machine_type" VAL="x86_64" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="os_release" VAL="2.6.13.4_K8+NUMA+NV" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="cpu_user" VAL="73.1" TYPE="float" UNITS="%" TN="8" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="cpu_system" VAL="3.9" TYPE="float" UNITS="%" TN="8" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="load_one" VAL="1.99" TYPE="float" UNITS="" TN="9" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS="" TN="149" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="proc_total" VAL="156" TYPE="uint32" UNITS="" TN="149" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_free" VAL="2359176" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_shared" VAL="0" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_buffers" VAL="36384" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="mem_cached" VAL="4162056" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="swap_free" VAL="1786428" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="gexec" VAL="ON" TYPE="string" UNITS="" TN="229" TMAX="300" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > <METRIC NAME="bytes_out" VAL="305162.19" TYPE="float" UNITS="bytes/sec" TN="28" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="bytes_in" VAL="40802.30" TYPE="float" UNITS="bytes/sec" TN="28" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > <METRIC NAME="numthreads" VAL="1" TYPE="int8" UNITS="" TN="844" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > <METRIC NAME="numjobs" VAL="1" TYPE="int8" UNITS="" TN="844" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > </HOST>
> > 
> > 
> _______________________________________________
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
> 
> 


------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
