It's hard to say without more detail about your application; this could simply be your application's communication pattern causing blocking, i.e., processes waiting for message passing to complete.
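For example (purely an illustrative sketch, not a claim about what your code does), one classic pattern that hangs or not depending on the transport's internal buffering is two processes that both post blocking sends before their receives:

    /* Illustrative sketch only: both ranks call MPI_Send before
       MPI_Recv.  For messages larger than the transport's internal
       buffering, neither send can complete until the matching receive
       is posted, so both ranks block forever; the same code may
       "work" under one RPI and hang under another. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, peer;
        static double sbuf[100000], rbuf[100000];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;   /* assumes exactly 2 ranks */

        /* Both ranks send first... */
        MPI_Send(sbuf, 100000, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        /* ...and only then receive. */
        MPI_Recv(rbuf, 100000, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 &status);

        printf("rank %d done\n", rank);
        MPI_Finalize();
        return 0;
    }

Switching transports changes how much buffering you get "for free", which is why code like this can run fine under one RPI and deadlock under another.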

Which RPI were you using in 6.5.9? I ask because LAM could only have one RPI compiled into it back in the 6.x series; only in the 7.x series did we debut the ability to choose your RPI at run-time.
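In 7.x you can select the RPI on the mpirun command line via SSI parameters, e.g.:

    mpirun -ssi rpi tcp C My_Program
    mpirun -ssi rpi usysv C My_Program

so it would be a quick test to see whether the hang follows a particular RPI.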

I'm guessing that you're defaulting to the usysv RPI in 7.0.6. Since it uses shared memory for messages on the same node, that *may* account for speed differences between your 6.x and 7.x runs (e.g., if you were using the tcp RPI in the 6.x series) and therefore expose timing problems in your code.
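You can check which RPI modules are compiled into your install (and hence what you could be defaulting to) with the laminfo command that ships with 7.x, e.g.:

    laminfo | grep rpi

(the exact output format varies a bit between versions, but it lists the SSI rpi modules that are available).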

The usysv RPI uses spin locks for on-node communication, so it should spin (and consume all the CPU) when it's waiting for on-node communication. But if you're blocking waiting for off-node communication, you won't see this spinning behavior.

Can you attach a debugger to any of the processes and see what they are doing?
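For example, on a node where the job looks wedged, something like this (assuming gdb is available, and ideally that My_Program was compiled with -g):

    ps -ef | grep My_Program      # find the PIDs of your MPI processes
    gdb My_Program <pid>          # attach to one of them
    (gdb) bt                      # get a backtrace

If the backtrace shows the process deep inside an MPI receive/wait, that would confirm it's blocked on message passing rather than computing.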



On Jan 13, 2005, at 11:36 AM, Yu Chen wrote:

Hello,

After installing OSCAR 4 on our RH-EL-AS-3 cluster, one of my major MPI programs is not running right. Here are the details; thanks in advance for any help:

In short, the program just sits there, waiting and waiting but doing nothing, when normally it should produce a lot of output.

In detail, we have a 28-node cluster (including the master node), each node with 2 CPUs.

Originally, I was running LAM-6.5.9 on Red Hat 7.2, using the PGI Fortran compiler and the GNU C compiler. The command used to run was:
"mpirun -O -x CYANALIB c0,1,2,3,4,5,6,7,8,9,10,11,12 My_Program"
It ran fine; when I ran "gstat -a -1", I would see 6 nodes at about 100% CPU time, since each had two copies running.


Now I am using OSCAR 4 (LAM-7.0.6) on RH-EL-AS-3 with all GNU compilers (C and Fortran); I recompiled my program, BTW. With the same command, the program starts and then just sits there doing nothing. "gstat -a -1" shows only 6 nodes at about 50% CPU time, which looks like only one copy running on each node. "mpitask" shows everything as running.

Anyone got any ideas?

Regards
Chen

===========================================
Yu Chen
Howard Hughes Medical Institute
Chemistry Building, Rm 182
University of Maryland at Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250

phone:  (410)455-6347 (primary)
        (410)455-2718 (secondary)
fax:    (410)455-1174
email:  [EMAIL PROTECTED]
===========================================


--
{+} Jeff Squyres
{+} [EMAIL PROTECTED]
{+} http://www.lam-mpi.org/



