[EMAIL PROTECTED] wrote:
I have a variety of systems that run in parallel without ever having
errors due to shortage of shared memory (up to 500K atoms). However, I
find that I sometimes run into this problem with lipid bilayer systems
of less than 30K atoms.
When I submit a job and it hits the shared memory error, the error occurs
before any simulation time has elapsed. What's more, if I resubmit the
job it often runs fine. However, one recent bilayer system set up by a
colleague won't ever run.
I am using openmpi_v1.2.1 and I can avoid using shared memory like this:
${OMPI}/mpirun --mca btl ^sm ${ED}/mdrun_openmpi_v1.2.1 -np ${mynp} -4
(etc...)
That absolutely fixes the error, but when I do that the scaling to 4
processors is very poor, as judged both by walltime and by the output at
the end of the gromacs .log file.
This also confuses me, since my sysadmin tells me that gromacs doesn't
use shared memory.
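(Editor's aside: the shared memory in question is allocated by the MPI
library rather than by gromacs itself; OpenMPI's sm BTL backs it with an
mmap'd file in its session directory. A quick diagnostic sketch for
inspecting a node's shared-memory limits and scratch space, assuming a
Linux node with the usual util-linux tools; exact output varies by system:)

```shell
# System V shared-memory limits and currently allocated segments (Linux).
ipcs -lm
ipcs -m
# OpenMPI's sm BTL backs its shared memory with an mmap'd file in its
# session directory, typically under /tmp, so check free space there too.
df -h /tmp
```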
I get two basic error messages. Sometimes it is this, written to stderr:
[cn-r4-18][0,1,1][btl_sm_component.c:521:mca_btl_sm_component_progress]
SM faild to send message due to shortage of shared memory.
And sometimes it is a longer error message (see the end of this email
for the full stderr from a run of that type).
I believe this to be a problem with our cluster, and I guess that makes
this the wrong mailing list for the question, but I am hoping that
somebody can help me clarify what is going on with shared memory usage
in gromacs, and perhaps why the error appears to be stochastic but also
correlated with bilayer systems.
Our cluster is also having problems with random xtc or trr file
corruption (1 in 10 to 20 runs), in case that seems related to the
shared memory issue. However, that is not the issue I am presenting in
this post.
Thanks,
Chris.
So it seems that there is a problem in the shared memory communication
layer of OpenMPI that only shows up sporadically. However, since it is
not reproducible, it could also be a physical memory problem, i.e. bad
DIMMs, especially since you have data corruption every once in a while.
One test that you can do: take a big file (much larger than the amount
of memory you have) and run md5sum on it a few times. Copy the file to a
"good" machine and run it there as well. It should always give the same
result. If you can rule out hardware, then OpenMPI could be the problem.
You could try the latest LAM or MPICH 2.x (not 1.x!).
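(Editor's aside: the repeated-checksum test above can be scripted; a
minimal sketch, where the file path and size are placeholders and the
file should in practice be much larger than the node's RAM:)

```shell
#!/bin/sh
# Checksum the same file several times; on healthy hardware every pass
# must produce an identical digest.
FILE=/tmp/memtest.$$                                      # placeholder path
dd if=/dev/urandom of="$FILE" bs=1M count=16 2>/dev/null  # use a file >> RAM in practice
REF=$(md5sum "$FILE" | awk '{print $1}')
for i in 1 2 3; do
    SUM=$(md5sum "$FILE" | awk '{print $1}')
    if [ "$SUM" != "$REF" ]; then
        echo "checksum mismatch on pass $i" >&2
        exit 1
    fi
done
rm -f "$FILE"
echo "all checksums match: $REF"
```

Repeating the same loop on a known-good machine, as suggested above,
separates a bad disk or DIMM on the compute node from a corrupt file.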
--
David van der Spoel, Ph.D.
Molec. Biophys. group, Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone: +46184714205. Fax: +4618511755.
[EMAIL PROTECTED] [EMAIL PROTECTED] http://folding.bmc.uu.se
_______________________________________________
gmx-users mailing list gmx-users@gromacs.org
http://www.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to [EMAIL PROTECTED]
Can't post? Read http://www.gromacs.org/mailing_lists/users.php