Leif Nixon wrote:
Well, this is a new one - at least for me.

One of our clusters was rebooted last week, due to a physical
relocation. Now the ganglia XML data doesn't contain any mention of
the cluster frontend, even though gmond is running fine and responding
to the XML data port:

    nixon $ telnet grendel 8649|grep -i "host name"|cut -c -60
    Connection closed by foreign host.
          <!ATTLIST HOST NAME CDATA #REQUIRED>
    <HOST NAME="g10" IP="192.168.1.10" REPORTED="1047377023" TN=
    <HOST NAME="g11" IP="192.168.1.11" REPORTED="1047377026" TN=
    <HOST NAME="g12" IP="192.168.1.12" REPORTED="1047377029" TN=
    <HOST NAME="g13" IP="192.168.1.13" REPORTED="1047377026" TN=
    <HOST NAME="g1" IP="192.168.1.1" REPORTED="1047377032" TN="0
    <HOST NAME="g2" IP="192.168.1.2" REPORTED="1047377029" TN="3
    <HOST NAME="g16" IP="192.168.1.16" REPORTED="1047377022" TN=
    <HOST NAME="g4" IP="192.168.1.4" REPORTED="1047377025" TN="7
    <HOST NAME="g5" IP="192.168.1.5" REPORTED="1047377023" TN="9
    <HOST NAME="g6" IP="192.168.1.6" REPORTED="1047377031" TN="1
    <HOST NAME="g8" IP="192.168.1.8" REPORTED="1047377028" TN="4
    <HOST NAME="g9" IP="192.168.1.9" REPORTED="1047377022" TN="1
    nixon $

The frontend used to turn up as "g0".

The same behaviour is presented by ganglia 2.5.1 and 2.5.3. I've run
gmond for a while with debug enabled, but nothing in the output seems
alarming to me. Anyone who wants to take a look can find the log at:

  http://www.nsc.liu.se/~nixon/tmp/ganglia.log

What blindingly obvious mistake am I making here?


Maybe you should look at your mcast_ttl value for g0 (I'm assuming it's not actually running with the IP 192.168.1.0 ... :) ). Try increasing it by one and restarting the monitoring core (gmond) until the host shows up on the other gmonds.
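For reference, the knob in question lives in gmond's config file. A minimal sketch (the option name matches the 2.5.x gmond.conf style; the channel and port shown are the usual ganglia defaults, so check them against your own config):

```
# gmond.conf (ganglia 2.5.x style) - relevant multicast settings
mcast_channel   239.2.11.71   # default ganglia multicast group
mcast_port      8649          # default data port
mcast_ttl       1             # raise this if hosts are >1 hop apart
```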

That's how I found out that my front-end was *three* hops away from the test cluster. I think you have either a gmond config issue or a host/network config issue to track down... (maybe a host or network device between the front-end and the cluster is configured to drop multicast packets?)

I got my hopes up from reading the subject line - a problem I've noticed lately is that the metadaemon seems to "forget" to update all the RRDs for a cluster. But as it turns out, the metadaemon is behaving properly; it's the front-end that isn't recognizing the cluster as down/unreachable. I'm not sure whether this behavior appeared as a result of the modifications I've made to my installation, though.

