Thanks a lot, Jeff, I will try that.
Chen
On Thu, 13 Jan 2005, Jeff Squyres wrote:
On Jan 13, 2005, at 1:42 PM, Yu Chen wrote:
It's hard to say without more detail about your application; this could
simply be the communication pattern of your application, that it causes
blocking and makes processes wait for message passing to complete, etc.
But that program worked in the previous setup, and it has never been changed
(the only difference is the FORTRAN compiler, PGI vs GNU).
I wish I had a better answer, but "sometimes this just happens" -- there are
a *lot* of differences between the 6.x and 7.x series in LAM, any number of
which could (and did!) expose bugs in user applications.
Not that I'm claiming that LAM is 100% bug-free -- no software ever is! But
it's pretty darn stable and lots of people are running production codes with
it. Of course, that being said, if we do find a genuine bug that your
application exposes in LAM, I'll be the first to a) eat crow, and b) fix the
little bugger in LAM.
Can you attach a debugger to any of the processes and see what they are
doing?
I really don't know how to do that; could you help me with this?
When the processes are running on your nodes, log in to any of the nodes and
run "ps" to find the PIDs of the two processes on that node (I assume you're
launching 2 processes per node). Then run "gdb --pid <PID>", replacing <PID>
with one of the PIDs of your processes.
This will attach to the process and show you where it is in the program (it's
most helpful if you have compiled your application with -g). It will show
you a stack trace of where the application is currently executing. From
there, you can do all the normal things that you do in gdb (step, next,
examine variables, go up and down the stack trace, etc.).
You might want to do this simultaneously on several different processes to
see where they are all blocked.
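
The attach-and-inspect steps above can be sketched as a small shell helper.
This is only a sketch under assumptions: the function names gdb_bt_cmd and
dump_backtraces are made up for illustration, "my_mpi_app" is a placeholder
for your executable name, pgrep is assumed to be installed on the node, and
your gdb is assumed to be recent enough to support the -batch and -ex options.

```shell
# Print the non-interactive gdb command that dumps a backtrace for one PID.
# (Printed rather than run, so you can inspect or copy it.)
gdb_bt_cmd() {
    echo "gdb --pid $1 -batch -ex bt"
}

# Attach to every process with the given name that you own on this node
# and print its current stack trace, then detach.
dump_backtraces() {
    app_name="$1"
    for pid in $(pgrep -u "$(id -u)" -x "$app_name"); do
        echo "=== PID $pid ==="
        gdb --pid "$pid" -batch -ex bt 2>&1
    done
}
```

After defining these on a node, something like "dump_backtraces my_mpi_app"
would print one stack trace per process. Running it on several nodes and
comparing the traces should show which MPI call each process is blocked in.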
I also strongly recommend running your application through a memory-checking
debugger such as the most recent version of Valgrind
(http://valgrind.kde.org). Even if you think your application is running
properly, Valgrind can illuminate all kinds of hidden bugs that you weren't
even aware were there (we use Valgrind and other memory-checking debuggers in
developing LAM, for example). Note that with the default install of LAM on
OSCAR clusters, you'll unfortunately get a lot of false positive reports from
Valgrind about reads from uninitialized memory deep within LAM. These are
all actually ok; to avoid a long story, suffice it to say that it's actually
a safe optimization that we use in LAM that Valgrind is unaware of. When you
compile LAM from source, you can use the configure switch --with-purify to
eliminate these false positive reports, but there is a *slight* performance
hit for doing this, so we don't enable it by default (i.e., it removes the
optimization).
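
As a rough illustration of the Valgrind advice above, the commands might look
like the following. Treat this as a sketch under assumptions: the function
names are hypothetical, "./my_mpi_app" and the install prefix are
placeholders, the process count is up to you, and the "%p" (process ID)
expansion in --log-file is assumed to be supported by your Valgrind version.
Only the --with-purify switch comes from the text above.

```shell
# Print an mpirun command that runs each MPI process under Valgrind's
# memcheck tool, writing one log file per process.
# Arguments: number of processes, path to the executable.
valgrind_mpirun_cmd() {
    echo "mpirun -np $1 valgrind --tool=memcheck --log-file=vg.%p.log $2"
}

# Rebuild LAM with the optimization disabled so the false positives go away.
# Run from the top of an unpacked LAM source tree.
# Argument: install prefix (a placeholder; pick your own location).
rebuild_lam_purify() {
    ./configure --with-purify --prefix="$1" && make && make install
}
```

For example, "valgrind_mpirun_cmd 2 ./my_mpi_app" prints the command line to
run; check it against your setup before running it. Keeping a separate
--with-purify build just for debugging avoids paying the slight performance
hit in production runs.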
See the LAM FAQ for debugging for a few more hints:
http://www.lam-mpi.org/faq/
===========================================
Yu Chen
Howard Hughes Medical Institute
Chemistry Building, Rm 182
University of Maryland at Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250
phone: (410)455-6347 (primary)
(410)455-2718 (secondary)
fax: (410)455-1174
email: [EMAIL PROTECTED]
===========================================
_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users