Bernard, thanks for your suggestions.

> -----Original Message-----
> From: Bernard Li [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 14, 2008 10:24 AM
> To: Sushchik, Mikhail
> Cc: [email protected]
> Subject: Re: [Ganglia-general] Possible Ganglia-MPICH conflict -- ???
>
> On Mon, Oct 13, 2008 at 6:42 PM, Sushchik, Mikhail
> <[EMAIL PROTECTED]> wrote:
>
> > I installed Ganglia 3.0.5 from an RPM for SUSE 10 SP2 that I downloaded
> > from one of the RPM repositories. It is running on a small cluster of 2
> > nodes, 16 CPUs total. I did not modify the default setup beyond what was
> > absolutely necessary to get everything working. I see what looks like an
> > interference of one of Ganglia components with MPICH.
>
> The latest version of Ganglia is 3.1.1, you might want to use this
> newer version instead. If you would like to stay with the 3.0.x
> series, 3.0.7 is the latest version.

I used 3.0.5 because I found an RPM of this version built for SUSE 10.2,
which is what I have -- I thought it would be easier that way. I will try
the latest.

> > 1) At some point my MPICH application was running fine, but Ganglia was
> > not showing information from node 2. It turned out gmond was not
> > running. I started it. After a very short while my MPICH application
> > hung.
>
> Why wasn't gmond running initially? Did you start it and it crashed?
> Or perhaps it was not started in the first place?

I think it crashed, but I cannot be sure.

> What interconnect do your MPI jobs go through? Just regular Ethernet
> or Infiniband/Myrinet?

Regular Ethernet.

> One suspicion I have is gmond communicates with each other using
> multicast and it causes a bit of traffic on your network. However,
> given 2 hosts in your setup, I doubt this would cause any noticeable
> issues but what you might consider doing is use unicast instead of
> multicast. For more information look up the man pages for gmond.conf.

Thanks, I will try this (once I figure out how to use it :). But will
unicast still work once I add more nodes to this system, which is the
plan? (Sorry, but I am just learning what all these *casts are.)

> > 2) I restarted the application. It ran for a while and then it hung
> > again. I killed gmond and gmetad on both nodes and the application
> > immediately resumed and continued running. It looked as if MPI messages
> > got held up.
>
> Have you tried debugging your MPICH code?

The MPI code was very well debugged on other clusters, but those run a
different Linux distribution. Now I am porting it to the new cluster
running SUSE. I suppose it is possible that some previously unseen bugs
will rear their heads...

> Is it with *all* MPICH code
> or just particular ones?

We only run this one application. I guess I should make a dummy app and
test with it. Thanks.
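
On the unicast suggestion above, a minimal gmond.conf sketch may help show what the change looks like. This is only an illustration, not a tested setup for this cluster: the address 192.168.0.1 is a made-up placeholder for whichever node is chosen to collect metrics, and 8649 is the default Ganglia port. The gmond.conf man page remains the authority on the exact options for a given version.

```
/* Sketch only: send metrics to one collector node instead of a multicast group. */
/* 192.168.0.1 is a hypothetical address; substitute the real collector node.    */
udp_send_channel {
  host = 192.168.0.1
  port = 8649
}

/* Needed on the node(s) that should receive the metrics. */
udp_recv_channel {
  port = 8649
}

/* Lets gmetad poll this gmond for the collected data. */
tcp_accept_channel {
  port = 8649
}
```

In a layout like this, each additional node would presumably carry the same udp_send_channel pointing at the collector, so unicast should keep working as nodes are added. The multicast default differs mainly in using mcast_join in place of host.
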

> You could also turn on debugging for
> gmetad/gmond by starting them via -d to see if any error messages are
> logged.

This is the easiest thing to try. I will start with this one.

Once again, thanks a lot for this feedback.

Misha.

PS. Sorry for the screwy formatting -- I can't figure out how to make
Outlook format things nicely with > >.

