Bernard, thanks for your suggestions.

> -----Original Message-----
> From: Bernard Li [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 14, 2008 10:24 AM
> To: Sushchik, Mikhail
> Cc: [email protected]
> Subject: Re: [Ganglia-general] Possible Ganglia-MPICH conflict -- ???
>
> On Mon, Oct 13, 2008 at 6:42 PM, Sushchik, Mikhail
> <[EMAIL PROTECTED]> wrote:
> 
> > I installed Ganglia 3.0.5 from an RPM for SUSE 10 SP2 that I downloaded
> > from one of the RPM repositories. It is running on a small cluster of 2
> > nodes, 16 CPUs total. I did not modify the default setup beyond what was
> > absolutely necessary to get everything working. I see what looks like an
> > interference of one of Ganglia components with MPICH.
> 
> The latest version of Ganglia is 3.1.1; you might want to use this
> newer version instead.  If you would like to stay with the 3.0.x
> series, 3.0.7 is the latest version.

I used 3.0.5 because I found an RPM of this version built for SUSE 10.2,
which is what I have -- I thought it would be easier that way. I will try
the latest.

> 
> > 1)     At some point my MPICH application was running fine, but Ganglia
> > was not showing information from node 2. It turned out gmond was not
> > running. I started it. After a very short while my MPICH application hung.
> 
> Why wasn't gmond running initially?  Did you start it and it crashed?
> Or perhaps it was not started in the first place?

I think it crashed, but I cannot be sure.

> 
> What interconnect do your MPI jobs go through?  Just regular Ethernet
> or Infiniband/Myrinet?

Regular Ethernet.

> 
> One suspicion I have is that the gmond daemons communicate with each
> other using multicast, which causes a bit of traffic on your network.
> However, given only 2 hosts in your setup, I doubt this would cause
> any noticeable issues. Still, you might consider using unicast instead
> of multicast.  For more information, look up the man page for gmond.conf.

Thanks, I will try this (once I figure out how to use it :). But will
unicast work once I add more nodes to this system, which is the plan?
(Sorry, but I am just learning what all these *casts are.)
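
For reference, a unicast setup in gmond.conf might look roughly like the
sketch below, assuming the 3.0.x configuration syntax. The address
192.168.0.1 is only a placeholder for whichever node collects the metrics
(it is not a value from this cluster), and 8649 is gmond's default port.
The gmond.conf man page remains the authority on the exact options.

```
/* On every node: send metrics to one collector host instead of
   joining a multicast group. 192.168.0.1 is a placeholder address. */
udp_send_channel {
  host = 192.168.0.1
  port = 8649
}

/* On the collector node: listen for the unicast packets. */
udp_recv_channel {
  port = 8649
}
```

With a layout like this, a node added later would only need the same
udp_send_channel block pointing at the collector, so unicast should keep
working as the cluster grows.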

> 
> > 2)     I restarted the application. It ran for a while and then it hung
> > again. I killed gmond and gmetad on both nodes and the application
> > immediately resumed and continued running. It looked as if MPI messages
> > got held up.
> 
> Have you tried debugging your MPICH code?  

The MPI code was very well debugged on other clusters, but those were
running a different Linux distribution. Now I am porting it to the new
cluster running SUSE. I suppose it is possible that some previously
unseen bugs could raise their heads...

> Is it with *all* MPICH code
> or just particular ones?  

We only run this one application. I guess I should make a dummy app and
test with it. Thanks.

> You could also turn on debugging for
> gmetad/gmond by starting them via -d to see if any error messages are
> logged.

This is the easiest thing to try. I will start with this one.

Once again thanks a lot for this feedback.
Misha.
PS. Sorry for screwy formatting -- I can't figure out how to make
Outlook format things nicely with > >.
