Hello,

I am trying to use ganglia on our 296-node cluster Medusa
(http://www.lsc-group.phys.uwm.edu/beowulf/medusa - associated with
the iVDGL project). I'm running into a major problem, however.

We used to run ganglia successfully on our cluster, but at some point in the
distant past it stopped working. We have one master node running gmond, gmetad
and the web frontend, and 296 nodes each running gmond. All of the nodes
besides the master are reported incorrectly.

The web frontend ( http://medusa.phys.uwm.edu/ganglia-webfrontend/ ) routinely
shows only the master and roughly 1-7 nodes as alive. The rest are reported
dead. The data for every node besides the master is completely inaccurate (no
metrics are present except for uptime, which is incorrect).

We used to have ganglia working properly, and are using the same gmond/gmetad
configuration files from that time.

My current guess is that there is some multicast issue. The amount of
multicast traffic *seems* lower than it was before (I just don't have records
to back that up). Perhaps something is wrong with the switch connecting these
machines, but I do not know enough about switch configuration to tell.
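
A quick way to test this guess would be to count the multicast datagrams that
actually arrive at a node. The following is only a minimal sketch: the group
239.2.11.71 and port 8649 are the stock gmond defaults and are an assumption
here (the values in our gmond.conf may differ), and the function names are
made up for illustration.

```python
import socket
import time
from collections import Counter

# Assumed values: the stock gmond multicast channel. Check gmond.conf.
MCAST_GROUP = "239.2.11.71"
MCAST_PORT = 8649


def make_mreq(group, iface="0.0.0.0"):
    """Build the 8-byte ip_mreq structure used by IP_ADD_MEMBERSHIP."""
    return socket.inet_aton(group) + socket.inet_aton(iface)


def open_listener(group=MCAST_GROUP, port=MCAST_PORT):
    """Open a UDP socket that has joined the given multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow binding even though gmond already listens on the same port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, make_mreq(group))
    return sock


def count_senders(sock, seconds=30.0):
    """Count received datagrams per source address for `seconds` seconds."""
    senders = Counter()
    deadline = time.monotonic() + seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        sock.settimeout(remaining)
        try:
            _data, (addr, _port) = sock.recvfrom(65535)
        except socket.timeout:
            break
        senders[addr] += 1
    return senders
```

Calling count_senders(open_listener()) on the master and on a compute node
and comparing the number of distinct source addresses would show whether
multicast packets from most nodes are being dropped somewhere in between.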

This problem seems similar to
http://sourceforge.net/mailarchive/message.php?msg_id=4129785
and
http://sourceforge.net/mailarchive/forum.php?thread_id=2569896&forum_id=7186

Neither seems to have received any responses (besides one involving time
sync, but that's not an issue with our machines).

Does anyone have any idea what might be the problem? Any ideas at all would be
very much appreciated.

Thanks in advance!

Kevin Flasch

