[Oscar-users] Re: LAM: problem running mpi program, no error/warning mesgs

Yu Chen Thu, 13 Jan 2005 16:26:55 -0800

Thanks a lot, Jeff, I will try that

Chen

On Thu, 13 Jan 2005, Jeff Squyres wrote:

On Jan 13, 2005, at 1:42 PM, Yu Chen wrote:
It's hard to say without more detail about your application; this could simply be the communication pattern of your application, that it causes blocking and makes processes wait for message passing to complete, etc.
But that program worked in the previous setup, and it has not been changed (the only difference is the FORTRAN compiler, PGI vs GNU).
I wish I had a better answer, but "sometimes this just happens" -- there are a *lot* of differences between the 6.x and 7.x series in LAM, any number of which could (and did!) expose bugs in user applications.

Not that I'm claiming that LAM is 100% bug-free -- no software ever is! But it's pretty darn stable and lots of people are running production codes with it. Of course, that being said, if we do find a genuine bug that your application exposes in LAM, I'll be the first to a) eat crow, and b) fix the little bugger in LAM.

Can you attach a debugger to any of the processes and see what they are doing?
I really don't know how to do that; could you help me with this?
When the processes are running on your nodes, log in to any of the nodes and run "ps" to find the PIDs of the two processes on that node (I assume you're launching 2 processes per node). Then run "gdb --pid <PID>", replacing <PID> with one of the PIDs of your processes.

This will attach to the process and show you where it is currently executing (it's most helpful if you have compiled your application with -g). Use the "bt" command to see a stack trace of where the application is currently executing. From there, you can do all the normal things that you do in gdb (step, next, examine variables, go up and down the stack trace, etc.).

You might want to do this simultaneously on several different processes to see where they are all blocked.
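
The attach steps above can be scripted so that every process on a node is inspected in one pass. The following is only a minimal sketch, not part of the original advice: the program name "my_mpi_app" and both helper functions are hypothetical, and it assumes a Linux "ps" that accepts "-o pid= -o comm=" and a gdb that supports "--batch" and "-ex".

```python
import subprocess
from typing import List


def pids_for_program(ps_output: str, program: str) -> List[int]:
    """Pick PIDs out of `ps -e -o pid= -o comm=` output for one command name.

    Note: on Linux the comm field is truncated to 15 characters.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # "<pid> <command name>"
        if len(parts) == 2 and parts[1].strip() == program:
            pids.append(int(parts[0]))
    return pids


def gdb_backtrace_command(pid: int) -> List[str]:
    """Build a non-interactive gdb call that attaches, prints a stack trace, and exits."""
    return ["gdb", "--batch", "--pid", str(pid), "-ex", "bt"]


def backtraces_on_this_node(program: str) -> None:
    """Run on a compute node: print a stack trace for each matching process."""
    ps = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    for pid in pids_for_program(ps.stdout, program):
        print(f"--- PID {pid} ---")
        subprocess.run(gdb_backtrace_command(pid), check=False)
```

Calling backtraces_on_this_node("my_mpi_app") on each node (with the hypothetical name replaced by your executable) gives one stack trace per process, which makes it easier to compare where they are all blocked.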

I also strongly recommend running your application through a memory-checking debugger such as the most recent version of valgrind (http://valgrind.kde.org). Even if you think your application is running properly, valgrind can illuminate all kinds of hidden bugs that you weren't even aware were there (we use Valgrind and other memory-checking debuggers in developing LAM, for example).

Note that with the default install of LAM on OSCAR clusters, you'll unfortunately get a lot of false positive reports from valgrind about reads from uninitialized memory deep within LAM. These are all actually ok; to avoid a long story, suffice it to say that it's actually a safe optimization that we use in LAM that Valgrind is unaware of.

When you compile LAM from source, you can use the configure switch --with-purify to eliminate these false positive reports, but there is a *slight* performance hit for doing this, so we don't enable it by default (i.e., it removes the optimization).
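
One common way to follow the valgrind suggestion is to put valgrind between mpirun and the application, so that each MPI process runs under its own valgrind instance. This is a sketch under assumptions rather than a LAM-documented recipe: the helper name and "./my_mpi_app" are hypothetical, and the "%p" (process ID) placeholder in --log-file assumes a reasonably recent Valgrind release.

```python
from typing import List, Optional


def valgrind_mpirun_command(
    num_procs: int, app: str, app_args: Optional[List[str]] = None
) -> List[str]:
    """Build an mpirun command line that wraps every process in valgrind."""
    return [
        "mpirun", "-np", str(num_procs),
        "valgrind",
        "--log-file=valgrind.%p.log",  # one log per process; %p expands to the PID
        app,
        *(app_args or []),
    ]


# Print the command for a hypothetical 4-process run.
print(" ".join(valgrind_mpirun_command(4, "./my_mpi_app")))
```

Writing one log file per process keeps the reports from different ranks from interleaving.
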
See the debugging section of the LAM FAQ for a few more hints:
        http://www.lam-mpi.org/faq/

===========================================
Yu Chen
Howard Hughes Medical Institute
Chemistry Building, Rm 182
University of Maryland at Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250

phone:  (410)455-6347 (primary)
        (410)455-2718 (secondary)
fax:    (410)455-1174
email:  [EMAIL PROTECTED]
===========================================


