Bernard,

Thanks for your suggestion. I changed to unicast, and our application has been running without a problem thus far. I am not sure why MPI had a problem with multicast, but for now I am happy.

Regards,
Misha

-----Original Message-----
From: Bernard Li [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 14, 2008 10:24
To: Sushchik, Mikhail
Cc: [email protected]
Subject: Re: [Ganglia-general] Possible Ganglia-MPICH conflict -- ???

Hi Misha:

On Mon, Oct 13, 2008 at 6:42 PM, Sushchik, Mikhail <[EMAIL PROTECTED]> wrote:
> I installed Ganglia 3.0.5 from an RPM for SUSE 10 SP2 that I downloaded from
> one of the RPM repositories. It is running on a small cluster of 2 nodes, 16
> CPUs total. I did not modify the default setup beyond what was absolutely
> necessary to get everything working. I see what looks like an interference
> of one of Ganglia components with MPICH.

The latest version of Ganglia is 3.1.1; you might want to use this newer version instead. If you would like to stay with the 3.0.x series, 3.0.7 is the latest version.

> 1) At some point my MPICH application was running fine, but Ganglia was
> not showing information from node 2. It turned out gmond was not running. I
> started it. After a very short while my MPICH application hung.

Why wasn't gmond running initially? Did you start it and it crashed? Or perhaps it was not started in the first place?

What interconnect do your MPI jobs go through? Just regular Ethernet, or InfiniBand/Myrinet?

One suspicion I have is that the gmond daemons communicate with each other using multicast, which causes a bit of traffic on your network. However, given only 2 hosts in your setup, I doubt this would cause any noticeable issues. What you might consider doing is using unicast instead of multicast. For more information, look up the man page for gmond.conf.

> 2) I restarted the application. It ran for a while and then it hung
> again. I killed gmond and gmetad on both nodes and the application
> immediately resumed and continued running. It looked as if MPI messages got
> held up.

Have you tried debugging your MPICH code? Does this happen with *all* MPICH code or just particular programs?

You could also turn on debugging for gmetad/gmond by starting them with -d to see if any error messages are logged.

Good luck,

Bernard
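
For readers following this thread, the multicast-to-unicast change discussed above can be sketched as a gmond.conf fragment along the following lines. This is only an illustration and not the poster's actual configuration: the address 192.168.1.1 is a hypothetical stand-in for whichever node collects the metrics, and port 8649 is assumed to be the unchanged default. Check the gmond.conf man page for your Ganglia version before relying on it.

```
/* Sketch only. Replaces the default multicast channels.
   192.168.1.1 is a hypothetical collector address, not taken from this thread. */

/* On every node: send metrics directly to the collector instead of
   joining a multicast group (no mcast_join line). */
udp_send_channel {
  host = 192.168.1.1
  port = 8649
}

/* On the collector node: listen for unicast UDP packets on the same port
   (no mcast_join or bind to a multicast address). */
udp_recv_channel {
  port = 8649
}
```

After editing the file, gmond would need to be restarted on each node for the change to take effect.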

