Hi Mike, list

I must tell you upfront that I am no network expert or Ganglia guru.

However, I had a very similar problem with Ganglia 3.0.6
on one of our clusters, which I eventually managed to fix.
In my case, I had two problems:

1) Duplicate nodes showing up on Ganglia (just like you).

The reason was a messed-up network setup,
which is probably what you have as well.
The fix consisted of sorting out the network mess
and configuring the two private networks correctly.

2) Besides appearing twice, the nodes would go
up and down randomly.

This was apparently due to a bug in the "heartbeat" Ganglia
metric, which would not work if collected at a different
frequency than the other metrics.
The fix consisted of changing the Ganglia setup (gmond.conf)
to collect all the metrics in lockstep.



Our cluster here has two separate GigE private networks,
one for logins, NFS, and services,
and another only for MPI.
Besides, the head node has an additional NIC with an external IP.
The cluster was delivered to us with the network(s) setup totally
messed up.
The first symptom of this mess appeared in Ganglia.
Diagnosing the problem required some careful and patient observation.
Hosts files, routing files, DHCP files, etc. had to be carefully
compared, edited, checked for functionality, edited again, checked
again, and so on.
The original network setup was routing traffic from the
MPI network to the login/service network.
It took some long detective work to diagnose
the problem and find a proper fix.
You may have a similar network "crosstalk".
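
One quick way to spot this kind of crosstalk is to compare the
routing tables on the head node and on a compute node:

   /sbin/route -n     # or: netstat -rn, or: ip route show

Look for routes that send traffic destined to one private subnet
through the other network's interface or gateway.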

Do you have two private networks?
Ping (using a specific interface/IP address)
from/to the head node and a variety of compute nodes may help identify
problems like this, if you have them.
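
For instance (the interface and node names below are placeholders,
use your own):

   ping -I eth0 node05       # via the login/service network
   ping -I eth1 node05-mpi   # via the MPI network

If a node answers on an interface or address it shouldn't,
you have found your crosstalk.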

Try to find out the structure of your network(s).
If you are on RHEL Linux (or CentOS, or Fedora),
the main files to check are
/etc/hosts,
/etc/sysconfig/static-routes,
/etc/dhcpd.conf (if your compute nodes contact the head node for DHCP),
/etc/sysconfig/network,
/etc/sysconfig/network-scripts/ifcfg-ethX (X being 0,1,2, etc), and
/etc/named.conf (if your head node is a DNS server).
There may be more files, but those are the ones I remember now.
Check their contents both on the master node (which I presume collects
the Ganglia traffic) and on a "problematic/duplicate" compute node.
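
A quick way to compare a file on the master node with the copy on a
compute node (node01 below is just an example name):

   ssh node01 cat /etc/hosts | diff /etc/hosts -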

I hope you don't have the same mess I had
(several misconfigured files), but you may.
If you have some form of warranty, I would suggest contacting the
cluster vendor, as they should be responsible for fixing the network setup.

After I figured out the sticking points that were causing the problems,
sorted out the setup of both networks, and brought them to a sane state,
the duplicate nodes disappeared from Ganglia.


However, in my case Ganglia would still not work properly.
Nodes continued to come up and down randomly.

So I set out to fix the remaining problem.

I read reports that some managed switches won't handle multicast (the
Ganglia default mode) unless configured to do so.
However, this was not my problem, as my switch
enables multicast by default.
I also tried changing the Ganglia setup from multicast to unicast (in
gmond.conf), but this didn't fix the problem either.
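
For reference, the unicast variant I tried looked more or less like
this (the head node IP below is a placeholder for yours):

udp_send_channel {
        host = 10.0.0.1      /* head node's IP on the Ganglia network */
        port = 8649
}

udp_recv_channel {
        port = 8649
}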

Finally, I was told that the "heartbeat" metric in Ganglia had a bug
that would break this functionality.
It is unclear to me what the specific bug is, but somehow,
if the "heartbeat" metric reported at a different
frequency than the other metrics, it would not work.
Hence the default Ganglia gmond.conf file, with its different
collection intervals for different metrics, would not work:
Ganglia would show nodes going up and down in a random way.

Hence, I set up all metrics to report in lockstep, with the same
frequency for all of them.
I got the clue for the fix from another cluster we have here
that runs the Rocks Cluster software, which had Ganglia working right.
The gmond.conf file on the Rocks cluster has all metrics being
collected in lockstep.
Changing Ganglia gmond.conf to this very simple metric collection 
structure fixed the problem of nodes going up and down in Ganglia.

I attach the relevant part of the (functional) gmond.conf file below,
with some comments inline.

I hope this helps.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


/* UDP Channels for Send and Recv */

udp_recv_channel {
        mcast_join = 239.128.134.247
        port = 8649
        mcast_if = eth0
}

udp_send_channel {
        mcast_join = 239.128.134.247
        port = 8649
        mcast_if = eth0
}

/* TCP Accept Channel */

tcp_accept_channel {
        port = 8649
}

/* Metrics Collection group */

collection_group {
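   /* collect_every is the sampling interval in seconds;
      time_threshold, as I understand it, is the maximum time
      before a metric is sent even if its value hasn't changed */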
   collect_every = 60
   time_threshold = 300

    metric {
        name = "load_one"
        value_threshold = 10.0
      }

    metric {
        name = "mem_total"
        value_threshold = 10.0
      }

    metric {
        name = "cpu_intr"
        value_threshold = 10.0
      }

    metric {
        name = "proc_run"
        value_threshold = 10.0
      }

    metric {
        name = "load_five"
        value_threshold = 10.0
      }

    metric {
        name = "disk_free"
        value_threshold = 10.0
      }

    metric {
        name = "mem_cached"
        value_threshold = 10.0
      }

    metric {
        name = "mtu"
        value_threshold = 10.0
      }

    metric {
        name = "cpu_sintr"
        value_threshold = 10.0
      }

    metric {
        name = "pkts_in"
        value_threshold = 10.0
      }

    metric {
        name = "bytes_in"
        value_threshold = 10.0
      }

    metric {
        name = "bytes_out"
        value_threshold = 10.0
      }

    metric {
        name = "swap_total"
        value_threshold = 10.0
      }

    metric {
        name = "mem_free"
        value_threshold = 10.0
      }

    metric {
        name = "load_fifteen"
        value_threshold = 10.0
      }

    metric {
        name = "boottime"
        value_threshold = 10.0
      }

    metric {
        name = "cpu_idle"
        value_threshold = 10.0
      }

    metric {
        name = "cpu_aidle"
        value_threshold = 10.0
      }

    metric {
        name = "cpu_user"
        value_threshold = 10.0
      }

    metric {
        name = "cpu_nice"
        value_threshold = 10.0
      }

    metric {
        name = "sys_clock"
        value_threshold = 10.0
      }

    metric {
        name = "mem_buffers"
        value_threshold = 10.0
      }

    metric {
        name = "cpu_system"
        value_threshold = 10.0
      }

    metric {
        name = "part_max_used"
        value_threshold = 10.0
      }

    metric {
        name = "disk_total"
        value_threshold = 10.0
      }

    metric {
        name = "heartbeat"
        value_threshold = 10.0
      }

    metric {
        name = "mem_shared"
        value_threshold = 10.0
      }

    metric {
        name = "machine_type"
        value_threshold = 10.0
      }

    metric {
        name = "cpu_wio"
        value_threshold = 10.0
      }

    metric {
        name = "proc_total"
        value_threshold = 10.0
      }

    metric {
        name = "cpu_num"
        value_threshold = 10.0
      }

    metric {
        name = "cpu_speed"
        value_threshold = 10.0
      }

    metric {
        name = "pkts_out"
        value_threshold = 10.0
      }

    metric {
        name = "swap_free"
        value_threshold = 10.0
      }
}


[email protected] wrote:
> Hi
> 
> Very new to ganglia - just installed it - and I'm seeing two hosts showing
> up twice (short name and long name), yet 2 other hosts only showing up once
> (short name).
> 

What do you mean by "short and long name"?
Is the "short name" something like node01?
Is the "long name" the FQDN, something like node01.localdomain?
Are these names associated with different network interfaces or with
the same NIC?
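
To check how the names actually resolve on the node itself
(through /etc/hosts as well as DNS), you could try:

   hostname ; hostname -f    # short name and FQDN as the node sees them
   getent hosts node01       # node01 is just an example name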

> For one of the hosts that shows up twice and one that shows up once I've
> done nslookup on the shortname, long name and ip address and they come back
> identical structure.
> 

Are the nodes on one private network, on two private networks,
on a public network,
or do you have both types of networks on your nodes (presumably on
different NICs)?
Any information on the private IP addresses, network mask,
gateway (if any), static routes, etc?
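
The output of these commands, both on the head node and on one of the
"duplicate" compute nodes, would tell a lot:

   /sbin/ifconfig -a                              # addresses and masks per NIC
   cat /etc/sysconfig/network-scripts/ifcfg-eth*  # static config (RHEL/CentOS)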

> Can anyone provide any clues as to why this is happening or where I should
> look next?
> 

> Thanks
> 
> 
> Mike
> 


