You didn't provide any stack traces to show where the application is stuck. Try "bt" to show the stack trace, and the "up" and "down" commands to move up and down it. You should be able to move "up" to find the line in your source code where you invoked an MPI API function, and then look around at the variable values in your application to figure out exactly where it is, what it's trying to do, etc.
From there, you can see which processes are sending to whom, and who is blocking waiting for what. It *sounds* like you have a limited-throughput scenario, where most processes are blocking waiting for input from other processes before they can continue.
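For example, a session on a blocked process might look something like this (the frame numbers, addresses, source file, and variable names here are hypothetical, just to illustrate the idea):

(gdb) bt
#0  0x007329ee in __read_nocancel () from /lib/tls/libpthread.so.0
#1  0x00a0b1c2 in lam_ssi_rpi_tcp_advance ()
#2  0x00a0c3d4 in MPI_Recv ()
#3  0x08051234 in calc_ (iter=42) at calc.f:117
(gdb) up 3
#3  0x08051234 in calc_ (iter=42) at calc.f:117
(gdb) print iter
$1 = 42

Frame #3 is the line in your own code that called MPI_Recv; "print" and "info locals" there will tell you what the process was doing when it blocked.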
On Jan 14, 2005, at 9:46 AM, Yu Chen wrote:
Hi, Jeff
Just tried your suggestions; here is some output from gdb, but I don't know where to go from here. It seems that for lamd and tcp it just sits, and for sysv, a file is missing?
------- running with "-ssi rpi lamd" and "-ssi rpi tcp" ------------
Reading symbols from /raid1/p12/hhmi/software/Cyana/gnu-lam/cyanaexe.gnu-lam...done.
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Reading symbols from /lib/libutil.so.1...done.
Loaded symbols for /lib/libutil.so.1
Reading symbols from /usr/lib/libg2c.so.0...done.
Loaded symbols for /usr/lib/libg2c.so.0
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread -1218517440 (LWP 28311)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x007329ee in __read_nocancel () from /lib/tls/libpthread.so.0
(gdb) step
Single stepping until exit from function __read_nocancel,
which has no line number information.
=============== then gdb just stops here =============
------------- running with "-ssi rpi sysv" ------------------
Reading symbols from /raid1/p12/hhmi/software/Cyana/gnu-lam/cyanaexe.gnu-lam...done.
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Reading symbols from /lib/libutil.so.1...done.
Loaded symbols for /lib/libutil.so.1
Reading symbols from /usr/lib/libg2c.so.0...done.
Loaded symbols for /usr/lib/libg2c.so.0
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread -1218484672 (LWP 28743)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x00ed5726 in semop () from /lib/tls/libc.so.6
(gdb) step
Single stepping until exit from function semop,
which has no line number information.
[Switching to Thread -1218484672 (LWP 28743)]
lam_ssi_rpi_sysv_readlock (p=0x4037ef48) at ssi_rpi_sysv_shm.c:296
296 ssi_rpi_sysv_shm.c: No such file or directory.
in ssi_rpi_sysv_shm.c
===================================================================
Any clue? Thanks in advance!
Regards, Chen
On Thu, 13 Jan 2005, Jeff Squyres wrote:
On Jan 13, 2005, at 1:42 PM, Yu Chen wrote:
It's hard to say without more detail about your application; this could simply be the communication pattern of your application, i.e., that it causes blocking and makes processes wait for message passing to complete, etc.
But that program worked in the previous setup, and it never got changed (the only difference is the different Fortran compiler, PGI vs. GNU).
I wish I had a better answer, but "sometimes this just happens" -- there are a *lot* of differences between the 6.x and 7.x series in LAM, any number of which could (and did!) expose bugs in user applications.
Not that I'm claiming that LAM is 100% bug-free -- no software ever is! But it's pretty darn stable and lots of people are running production codes with it. Of course, that being said, if we do find a genuine bug that your application exposes in LAM, I'll be the first to a) eat crow, and b) fix the little bugger in LAM.
Can you attach a debugger to any of the processes and see what they are doing?
I really don't know how to do that; could you help me with this?
When the processes are running on your nodes, login to any of the nodes and run "ps" to find the PIDs of the two processes on that node (I assume you're launching 2 processes per node). Then run "gdb --pid <PID>", replacing <PID> with one of the PIDs of your processes.
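For example (the PIDs shown here are hypothetical; substitute your own executable name):

$ ps -C cyanaexe.gnu-lam -o pid,cmd
  PID CMD
28311 cyanaexe.gnu-lam
28743 cyanaexe.gnu-lam
$ gdb --pid 28311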
This will attach to the process and show you where it is currently executing (it's most helpful if you have compiled your application with -g), including a stack trace of where the application currently is. From there, you can do all the normal things that you do in gdb (step, next, examine variables, go up and down the stack trace, etc.).
You might want to do this simultaneously on several different processes to see where they are all blocked.
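One way to do this without juggling lots of interactive sessions (a sketch; adjust the executable name to yours) is to run gdb in batch mode with a command file that just prints a backtrace:

$ echo bt > /tmp/bt.gdb
$ for pid in $(pgrep cyanaexe); do
    echo "=== PID $pid ==="
    gdb --batch -x /tmp/bt.gdb --pid $pid
  done

Comparing the backtraces side by side usually makes it obvious which processes are waiting for which.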
I also strongly recommend running your application through a memory-checking debugger such as the most recent version of Valgrind (http://valgrind.kde.org). Even if you think your application is running properly, Valgrind can illuminate all kinds of hidden bugs that you weren't even aware were there (we use Valgrind and other memory-checking debuggers in developing LAM, for example).
Note that with the default install of LAM on OSCAR clusters, you'll unfortunately get a lot of false-positive reports from Valgrind about reads from uninitialized memory deep within LAM. These are all actually OK; to avoid a long story, suffice it to say that it's a safe optimization that we use in LAM that Valgrind is unaware of. When you compile LAM from source, you can use the configure switch --with-purify to eliminate these false-positive reports, but there is a *slight* performance hit for doing this (i.e., it removes the optimization), so we don't enable it by default.
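As a starting point, something like the following should get you going (a sketch; the exact mpirun arguments depend on how you normally launch your job):

$ mpirun -np 4 valgrind --tool=memcheck --leak-check=yes ./cyanaexe.gnu-lam

mpirun launches one copy of valgrind per process, and each valgrind instance in turn runs and checks one instance of your application.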
See the LAM FAQ for debugging for a few more hints:
http://www.lam-mpi.org/faq/
