Some things to try. These probably won't solve the problem by themselves, but if you report back here with what you find, you can get better help.
1. Log into one of the 8 machines that don't show up and run 'gstat --all'. What does it tell you? Do the 8 machines see each other? Do they report the rest of the 128 as being down?
2. What do you get when you telnet to port 8649? Which machines are reported there? Do this from the 8 bad machines as well as from the ones that are working.

I would say this is almost definitely a multicast issue of some kind. Are all these machines on the same subnet? What network devices (switches, routers, whatever) are they using? Are the 8 bad machines on a different switch?

Let us know what you find. I had a very similar problem recently, and it turned out to be something with our switches and the way they handle multicast traffic. I know practically nothing about multicast, so I don't know the details, but I think our network guy finally set things up so that multicast traffic is treated more or less the same as broadcast packets, or something like that.

Steve Gilbert
Unix Systems Administrator
[EMAIL PROTECTED]

-----Original Message-----
From: Alexander Weeks [mailto:[EMAIL PROTECTED]]
Sent: Thursday, October 16, 2003 6:22 AM
To: [email protected]
Subject: [Ganglia-general] Missing nodes

I am trying to use Ganglia on a 128 node cluster. I have 8 nodes that just won't show up. I have confirmed that gmond is running on them, and I can telnet into port 8649. What can I do to debug this? What can I check for?

Alex Weeks
Linux Systems Analyst
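The port 8649 check suggested in the reply can also be scripted, which makes it easier to compare what a good node and a bad node each report. The following is only a rough sketch: it assumes gmond answers a TCP connection on port 8649 with an XML dump containing one HOST element per known node and then closes the connection. The function names and the example host names are made up for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Connect to gmond (same as 'telnet host 8649') and return the XML dump as bytes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # assumption: gmond closes the socket once the dump is sent
                break
            chunks.append(data)
    return b"".join(chunks)


def reported_hosts(xml_bytes):
    """Return the set of NAME attributes of every HOST element in the dump."""
    root = ET.fromstring(xml_bytes)
    return {host.get("NAME") for host in root.iter("HOST")}


def missing_hosts(expected, xml_bytes):
    """Return the expected host names that do not appear in the dump, sorted."""
    return sorted(set(expected) - reported_hosts(xml_bytes))
```

For example, calling missing_hosts(expected_names, fetch_gmond_xml("node001")) against a working node and then against one of the 8 bad nodes should show whether the two groups see each other. Note that the names in the dump may be fully qualified or plain IP addresses depending on name resolution, so the expected list has to use the same form.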

