Hello, I am trying to use ganglia on our 296-node cluster Medusa (http://www.lsc-group.phys.uwm.edu/beowulf/medusa - associated with the iVDGL project). I'm running into a major problem, however.
We used to successfully run ganglia on our cluster. However, at some point in the distant past, it stopped working for some reason. We have one master node running gmond, gmetad and the web frontend, and 296 nodes each running gmond.

For some reason, all of the nodes besides the master are reported incorrectly. The web frontend ( http://medusa.phys.uwm.edu/ganglia-webfrontend/ ) routinely shows only the master and roughly 1-7 other nodes as alive. The rest are reported dead. The data for every node besides the master is completely inaccurate (no metrics are present except uptime, and that is incorrect). We are using the same gmond/gmetad configuration files as when ganglia was working properly.

My current guess is that there is some multicast issue. The amount of multicast traffic *seems* lower than it was before (I just don't have records to back that up). Perhaps something is wrong with the switch connecting these machines, but I do not know enough about switch configuration.

This problem seems similar to
http://sourceforge.net/mailarchive/message.php?msg_id=4129785
and
http://sourceforge.net/mailarchive/forum.php?thread_id=2569896&forum_id=7186
Neither of these seems to have any responses (besides one involving time sync, but that's not an issue with our machines).

Does anyone have any idea what might be the problem? Any ideas at all would be very much appreciated.

Thanks in advance!

Kevin Flasch
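
P.S. One way to test the multicast guess would be to listen on the gmond multicast channel from the master (or any node on the same segment) and count how many distinct senders are actually heard. Below is a minimal sketch of that check, not a verified tool. It assumes the stock gmond channel 239.2.11.71 and port 8649, which may differ from what a given gmond.conf sets, and the function names are made up for illustration.

```python
import socket
import time
from collections import Counter

MCAST_GROUP = "239.2.11.71"  # assumption: gmond's default multicast channel
MCAST_PORT = 8649            # assumption: gmond's default port


def open_listener(group=MCAST_GROUP, port=MCAST_PORT, timeout=1.0):
    """Open a UDP socket that has joined the given multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow sharing the port with other listeners on the same host.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # struct ip_mreq: group address followed by local interface (0.0.0.0 = any).
    mreq = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(timeout)
    return sock


def count_sources(sock, seconds=30.0):
    """Count received datagrams per source IP for roughly `seconds`."""
    seen = Counter()
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        try:
            _data, (addr, _port) = sock.recvfrom(65535)
        except socket.timeout:
            continue  # nothing arrived in this interval; keep waiting
        seen[addr] += 1
    return seen


def silent_hosts(expected, seen):
    """Return the expected addresses from which nothing was heard."""
    return sorted(set(expected) - set(seen))


def main():
    sock = open_listener()
    seen = count_sources(sock, seconds=30.0)
    print(f"{len(seen)} distinct senders heard")
    for addr, packets in seen.most_common():
        print(addr, packets)
```

Calling main() prints one line per sender. If only a handful of addresses show up while gmond is running on all nodes, that would point at multicast delivery (for example at the switch) rather than at gmetad or the web frontend. Passing a list of node addresses to silent_hosts() along with the counter would list the nodes that were never heard.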