Hi Misha:

On Mon, Oct 13, 2008 at 6:42 PM, Sushchik, Mikhail
<[EMAIL PROTECTED]> wrote:

> I installed Ganglia 3.0.5 from an RPM for SUSE 10 SP2 that I downloaded from
> one of the RPM repositories. It is running on a small cluster of 2 nodes, 16
> CPUs total. I did not modify the default setup beyond what was absolutely
> necessary to get everything working. I see what looks like an interference
> of one of Ganglia components with MPICH.

The latest version of Ganglia is 3.1.1; you might want to use this
newer version instead.  If you would like to stay with the 3.0.x
series, 3.0.7 is the latest release.

> 1)     At some point my MPICH application was running fine, but Ganglia was
> not showing information from node 2. It turned out gmond was not running. I
> started it. After a very short while my MPICH application hung.

Why wasn't gmond running initially?  Did you start it and it then
crashed, or was it never started in the first place?

What interconnect do your MPI jobs go through?  Just regular Ethernet,
or InfiniBand/Myrinet?

One suspicion I have is that the gmond daemons communicate with each
other using multicast, which puts a bit of traffic on your network.
With only 2 hosts in your setup I doubt this would cause any noticeable
issues, but you might consider using unicast instead of multicast.  For
more information, see the man page for gmond.conf.
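
As a rough sketch of what a unicast setup could look like in gmond.conf
(the hostname "node1" is a placeholder for whichever host you want to
collect the data, and 8649 is assumed to be the port you already use),
the channel sections would change along these lines:

```
/* On every node: send metrics directly to one host instead of
   joining a multicast group.  "node1" is a placeholder hostname. */
udp_send_channel {
  host = node1
  port = 8649
}

/* Needed on the collecting host (node1): listen for the unicast
   packets.  Note there is no mcast_join or bind line here. */
udp_recv_channel {
  port = 8649
}

/* Lets gmetad poll this gmond for the aggregated XML. */
tcp_accept_channel {
  port = 8649
}
```

Restart gmond on both nodes after the change, and point gmetad at the
collecting host.  Please check the exact option names against the
gmond.conf man page for your version before relying on this.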

> 2)     I restarted the application. It ran for a while and then it hung
> again. I killed gmond and gmetad on both nodes and the application
> immediately resumed and continued running. It looked as if MPI messages got
> held up.

Have you tried debugging your MPICH code?  Does this happen with *all*
MPICH programs or just particular ones?  You could also turn on
debugging for gmetad/gmond by starting them with the -d option and a
debug level to see if any error messages are logged.

Good luck,

Bernard

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general