[EMAIL PROTECTED] wrote:
I have a variety of systems that run in parallel without ever having
errors due to shortage of shared memory (up to 500K atoms). However, I
find that I sometimes run into this problem with lipid bilayer systems
of less than 30K atoms.
When I submit a job and it hits the shared memory error, the error occurs
before any simulation time has elapsed. What's more, if I resubmit the
job it often runs fine. However, one recent bilayer system set up by a
colleague won't ever run.
I am using openmpi_v1.2.1 and I can avoid using shared memory like this:
${OMPI}/mpirun --mca btl ^sm ${ED}/mdrun_openmpi_v1.2.1 -np ${mynp} -4
(etc...)
That absolutely fixes the error, but when I do that the scaling to 4
processors is very poor, as judged both by walltime and by the output at
the end of the gromacs .log file.
This also confuses me, since my sysadmin tells me that gromacs doesn't
use shared memory.
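(Editor's aside: the shared memory in question is allocated by the MPI
library rather than by gromacs itself; OpenMPI's sm BTL backs it with an
mmap'd file in its session directory. A quick diagnostic sketch for
inspecting a node's shared-memory limits and scratch space, assuming a
Linux node with the usual util-linux tools; exact output varies by system:)

```shell
# System V shared-memory limits and currently allocated segments (Linux).
ipcs -lm
ipcs -m
# OpenMPI's sm BTL backs its shared memory with an mmap'd file in its
# session directory, typically under /tmp, so check free space there too.
df -h /tmp
```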
I get two basic error messages. Sometimes it is this, written to stderr:
[cn-r4-18][0,1,1][btl_sm_component.c:521:mca_btl_sm_component_progress]
SM faild to send message due to shortage of shared memory.
And sometimes it is a longer error message (see the end of this email
for the full stderr from a run of that type).
I believe this to be a problem with our cluster, and I guess that makes
this the wrong mailing list for the question, but I am hoping that
somebody can help me clarify what is going on with shared memory usage
in gromacs, and perhaps why the error appears to be stochastic but also
correlated with bilayer systems.
Our cluster is also having problems with random xtc or trr file
corruption (1 in 10 to 20 runs), in case that seems related to the
shared memory issue. However, that is not the issue I am presenting in
this post.
Thanks,
Chris.
So it seems that there is a problem in the shared memory communication
layer of OpenMPI that only shows up sporadically. However, since it is
not reproducible, it could also be a physical memory problem, i.e. bad
DIMMs, especially since you have data corruption every once in a while.
One test that you can do: take a big file (much larger than the amount
of memory you have) and run md5sum on it a few times. Copy the file to a
"good" machine and run it there as well. It should always give the same
result. If you can rule out hardware, then OpenMPI could be the problem.
You could try the latest LAM or MPICH 2.x (not 1.x!).
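(Editor's aside: the repeated-checksum test above can be scripted; a
minimal sketch, where the file path and size are placeholders and the
file should in practice be much larger than the node's RAM:)

```shell
#!/bin/sh
# Checksum the same file several times; on healthy hardware every pass
# must produce an identical digest.
FILE=/tmp/memtest.$$                                      # placeholder path
dd if=/dev/urandom of="$FILE" bs=1M count=16 2>/dev/null  # use a file >> RAM in practice
REF=$(md5sum "$FILE" | awk '{print $1}')
for i in 1 2 3; do
    SUM=$(md5sum "$FILE" | awk '{print $1}')
    if [ "$SUM" != "$REF" ]; then
        echo "checksum mismatch on pass $i" >&2
        exit 1
    fi
done
rm -f "$FILE"
echo "all checksums match: $REF"
```

Repeating the same loop on a known-good machine, as suggested above,
separates a bad disk or DIMM on the compute node from a corrupt file.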
--
David van der Spoel, Ph.D.
Molec. Biophys. group, Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone: +46184714205. Fax: +4618511755.
[EMAIL PROTECTED] [EMAIL PROTECTED] http://folding.bmc.uu.se
_______________________________________________
gmx-users mailing list gmx-users@gromacs.org
http://www.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to [EMAIL PROTECTED]
Can't post? Read http://www.gromacs.org/mailing_lists/users.php