I have a variety of systems (up to 500K atoms) that run in parallel without ever hitting errors due to a shortage of shared memory. However, I sometimes run into this problem with lipid bilayer systems of fewer than 30K atoms.

When a submitted job does hit the shared memory error, it fails before any simulation steps are taken. What's more, if I resubmit the job it often runs fine. However, one recent bilayer system set up by a colleague won't run at all.

I am using OpenMPI 1.2.1, and I can avoid the shared memory transport entirely like this:

${OMPI}/mpirun --mca btl ^sm ${ED}/mdrun_openmpi_v1.2.1 -np ${mynp} -4
(etc...)

That reliably eliminates the error, but with shared memory disabled the scaling to 4 processors is very poor, judging both by walltime and by the timing summary at the end of the GROMACS .log file.
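Rather than disabling the sm transport outright, I am also wondering whether just giving it a bigger pool would avoid the failure while keeping intra-node performance. The sketch below is based only on my reading of the OpenMPI 1.2 docs; mpool_sm_max_size (and that its value is in bytes) is an assumption on my part, so I would check the real parameter names with ompi_info first:

# list the sm BTL and sm mpool parameters that this OpenMPI 1.2.1 build actually exposes
${OMPI}/ompi_info --param btl sm
${OMPI}/ompi_info --param mpool sm

# keep the sm transport but enlarge its backing pool to 512 MB
# (assumed parameter name, assumed to be in bytes)
${OMPI}/mpirun --mca mpool_sm_max_size 536870912 ${ED}/mdrun_openmpi_v1.2.1 -np ${mynp}
(etc...)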

This also confuses me, since my sysadmin tells me that GROMACS itself doesn't use shared memory. My understanding is that the messages come from OpenMPI's sm BTL, which uses shared memory for communication between ranks on the same node, so GROMACS would only be using it indirectly through MPI; please correct me if I have that wrong.
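If that is right, then one simple thing to check (the session-directory location and naming below are my assumptions about the 1.2 series, and may well differ on our cluster) is whether the nodes are running short of space wherever OpenMPI puts the files that back its shared-memory segments, which I believe defaults to /tmp:

# free space in the assumed default location of the OpenMPI session directories
df -h /tmp

# look for leftover session directories from earlier, crashed jobs
# (the openmpi-sessions-* naming is my assumption for the 1.2 series)
ls -ld /tmp/openmpi-sessions-*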

I get two basic error messages. Sometimes it is a single line to stderr:

[cn-r4-18][0,1,1][btl_sm_component.c:521:mca_btl_sm_component_progress] SM faild to send message due to shortage of shared memory.

And sometimes it is a longer error message reporting a segmentation fault with a backtrace (the complete stderr from a run of that type is appended at the end of this email).

I suspect this is a problem with our cluster, which would probably make this the wrong mailing list for the question, but I am hoping somebody can help me clarify what is going on with shared memory usage in GROMACS, and perhaps why the error appears to be stochastic yet correlated with bilayer systems.

Our cluster is also having problems with random .xtc or .trr file corruption (roughly 1 run in 10 to 20), in case that seems related to the shared memory issue. However, that is not the problem I am presenting in this post.
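In case it is useful to whoever looks at that side of things, I assume the suspect trajectories could be screened with gmxcheck from GROMACS 3.3.1; a minimal example, with the filenames of course just being those of this particular run and gmxcheck assumed to be on the PATH:

# scan a compressed trajectory for unreadable or corrupted frames
gmxcheck -f bilayer_popc_md219.xtc

# the same check on the full-precision trajectory
gmxcheck -f bilayer_popc_md219.trr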

Thanks,
Chris.

########## Here is the stderr
########## Following this is the x0.log file, but that doesn't appear to have error indications in it

NNODES=4, MYRANK=1, HOSTNAME=cn-r1-27
NNODES=4, MYRANK=0, HOSTNAME=cn-r1-27
NODEID=0 argc=8
                         :-)  G  R  O  M  A  C  S  (-:

NODEID=1 argc=8
NNODES=4, MYRANK=2, HOSTNAME=cn-r1-27
NODEID=2 argc=8
NNODES=4, MYRANK=3, HOSTNAME=cn-r1-27
NODEID=3 argc=8
                  Green Red Orange Magenta Azure Cyan Skyblue

                            :-)  VERSION 3.3.1  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2006, The GROMACS development team,
            check out http://www.gromacs.org for more information.

         This program is free software; you can redistribute it and/or
          modify it under the terms of the GNU General Public License
         as published by the Free Software Foundation; either version 2
             of the License, or (at your option) any later version.

:-) /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2 (-:

Option     Filename  Type         Description
------------------------------------------------------------
  -s bilayer_popc_md219.tpr  Input        Generic run input: tpr tpb tpa xml
  -o bilayer_popc_md219.trr  Output       Full precision trajectory: trr trj
  -x bilayer_popc_md219.xtc  Output, Opt. Compressed trajectory (portable xdr
                                   format)
  -c bilayer_popc_md219.gro  Output       Generic structure: gro g96 pdb xml
  -e bilayer_popc_md219.edr  Output       Generic energy: edr ene
  -g bilayer_popc_md219.log  Output       Log file
-dgdl bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
-field bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
-table bilayer_popc_md219.xvg  Input, Opt.  xvgr/xmgr file
-tablep bilayer_popc_md219.xvg  Input, Opt.  xvgr/xmgr file
-rerun bilayer_popc_md219.xtc  Input, Opt.  Generic trajectory: xtc trr trj gro
                                   g96 pdb
-tpi bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
 -ei bilayer_popc_md219.edi  Input, Opt.  ED sampling input
 -eo bilayer_popc_md219.edo  Output, Opt. ED sampling output
  -j bilayer_popc_md219.gct  Input, Opt.  General coupling stuff
 -jo bilayer_popc_md219.gct  Output, Opt. General coupling stuff
-ffout bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
-devout bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
-runav bilayer_popc_md219.xvg  Output, Opt. xvgr/xmgr file
 -pi bilayer_popc_md219.ppa  Input, Opt.  Pull parameters
 -po bilayer_popc_md219.ppa  Output, Opt. Pull parameters
 -pd bilayer_popc_md219.pdo  Output, Opt. Pull data output
 -pn bilayer_popc_md219.ndx  Input, Opt.  Index file
-mtx bilayer_popc_md219.mtx  Output, Opt. Hessian matrix
 -dn bilayer_popc_md219.ndx  Output, Opt. Index file

      Option   Type  Value  Description
------------------------------------------------------
      -[no]h   bool     no  Print help info and quit
      -[no]X   bool     no  Use dialog box GUI to edit command line options
       -nice    int     19  Set the nicelevel
     -deffnm string bilayer_popc_md219  Set the default filename for all file
                            options
   -[no]xvgr   bool    yes  Add specific codes (legends etc.) in the output
                            xvg files for the xmgrace program
         -np    int      4  Number of nodes, must be the same as used for
                            grompp
         -nt    int      1  Number of threads to start on each node
      -[no]v   bool    yes  Be loud and noisy
-[no]compact   bool    yes  Write a compact log file
-[no]sepdvdl   bool     no  Write separate V and dVdl terms for each
                            interaction type and node to the log file(s)
  -[no]multi   bool     no  Do multiple simulations in parallel (only with
                            -np > 1)
     -replex    int      0  Attempt replica exchange every # steps
     -reseed    int     -1  Seed for replica exchange, -1 is generate a seed
   -[no]glas   bool     no  Do glass simulation with special long range
                            corrections
 -[no]ionize   bool     no  Do a simulation including the effect of an X-Ray
                            bombardment on your system

Getting Loaded...
Reading file bilayer_popc_md219.tpr, VERSION 3.3.1 (single precision)
Loaded with Money

[cn-r1-27:26937] *** Process received signal ***
[cn-r1-27:26937] Signal: Segmentation fault (11)
[cn-r1-27:26937] Signal code: Address not mapped (1)
[cn-r1-27:26937] Failing at address: 0x18
[cn-r1-27:26937] [ 0] /lib64/tls/libpthread.so.0 [0x2a969a0730]
[cn-r1-27:26937] [ 1] /tools/openmpi/1.2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_send+0x6b) [0x2a9a488c3b]
[cn-r1-27:26937] [ 2] /tools/openmpi/1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x14d) [0x2a9a1765ed]
[cn-r1-27:26937] [ 3] /tools/openmpi/1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_process_pending+0x1c4) [0x2a9a177af4]
[cn-r1-27:26937] [ 4] /tools/openmpi/1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_match_completion_free+0x22e) [0x2a9a17827e]
[cn-r1-27:26937] [ 5] /tools/openmpi/1.2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x846) [0x2a9a489fe6]
[cn-r1-27:26937] [ 6] /tools/openmpi/1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x2a) [0x2a9a27e47a]
[cn-r1-27:26937] [ 7] /tools/openmpi/1.2/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2a966405da]
[cn-r1-27:26937] [ 8] /tools/openmpi/1.2/lib/libmpi.so.0(ompi_request_wait_all+0xad) [0x2a9637adbd]
[cn-r1-27:26937] [ 9] /tools/openmpi/1.2/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x2ab) [0x2a9a9af39b]
[cn-r1-27:26937] [10] /tools/openmpi/1.2/lib/libmpi.so.0(ompi_comm_nextcid+0x20f) [0x2a9636b36f]
[cn-r1-27:26937] [11] /tools/openmpi/1.2/lib/libmpi.so.0(ompi_comm_dup+0x94) [0x2a96369bd4]
[cn-r1-27:26937] [12] /tools/openmpi/1.2/lib/libmpi.so.0(PMPI_Comm_dup+0x6f) [0x2a96390e0f]
[cn-r1-27:26937] [13] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(gmx_parallel_3dfft_init+0x72) [0x48f902]
[cn-r1-27:26937] [14] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(mk_fftgrid+0xd1) [0x467f11]
[cn-r1-27:26937] [15] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(init_pme+0x4c0) [0x460860]
[cn-r1-27:26937] [16] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(mdrunner+0x84f) [0x42d8ff]
[cn-r1-27:26937] [17] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(main+0x237) [0x42e357]
[cn-r1-27:26937] [18] /lib64/tls/libc.so.6(__libc_start_main+0xea) [0x2a96ac4aaa]
[cn-r1-27:26937] [19] /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(XmCreateOptionMenu+0x42) [0x416d8a]
[cn-r1-27:26937] *** End of error message ***
mpirun noticed that job rank 0 with PID 26934 on node cn-r1-27 exited on signal 15 (Terminated).


#######################



####################### Here is the log file from the head node


$cat bilayer_popclrlj_md1910.log
Log file opened on Sun May 20 18:12:34 2007
Host: cn-r4-29  pid: 9245  nodeid: 0  nnodes:  4
The Gromacs distribution was built Mon Mar 19 11:20:43 EDT 2007 by
[EMAIL PROTECTED] (Linux 2.6.5-7.282-smp x86_64)


                         :-)  G  R  O  M  A  C  S  (-:

                  Gromacs Runs On Most of All Computer Systems

                            :-)  VERSION 3.3.1  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2006, The GROMACS development team,
            check out http://www.gromacs.org for more information.

         This program is free software; you can redistribute it and/or
          modify it under the terms of the GNU General Public License
         as published by the Free Software Foundation; either version 2
             of the License, or (at your option) any later version.

:-) /projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2 (-:


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------


++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------

CPU=  0, lastcg= 5041, targetcg=15126, myshift=    2
CPU=  1, lastcg=10083, targetcg=20168, myshift=    2
CPU=  2, lastcg=15126, targetcg= 5042, myshift=    2
CPU=  3, lastcg=20168, targetcg=10084, myshift=    2
nsb->shift =   2, nsb->bshift=  0
Listing Scalars
nsb->nodeid:         0
nsb->nnodes:      4
nsb->cgtotal: 20169
nsb->natoms:  51236
nsb->shift:       2
nsb->bshift:      0
Nodeid   index  homenr  cgload  workload
     0       0   12808    5042      5042
     1   12808   12808   10084     10084
     2   25616   12812   15127     15127
     3   38428   12808   20169     20169

parameters of the run:
   integrator           = md
   nsteps               = 250000
   init_step            = 0
   ns_type              = Grid
   nstlist              = 10
   ndelta               = 2
   bDomDecomp           = FALSE
   decomp_dir           = 0
   nstcomm              = 1
   comm_mode            = Linear
   nstcheckpoint        = 1000
   nstlog               = 1000
   nstxout              = 250000
   nstvout              = 250000
   nstfout              = 250000
   nstenergy            = 5000
   nstxtcout            = 5000
   init_t               = 87300
   delta_t              = 0.002
   xtcprec              = 1000
   nkx                  = 84
   nky                  = 80
   nkz                  = 60
   pme_order            = 4
   ewald_rtol           = 1e-05
   ewald_geometry       = 0
   epsilon_surface      = 0
   optimize_fft         = FALSE
   ePBC                 = xyz
   bUncStart            = TRUE
   bShakeSOR            = FALSE
   etc                  = Berendsen
   epc                  = Berendsen
   epctype              = Semiisotropic
   tau_p                = 4
   ref_p (3x3):
      ref_p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
      ref_p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
      ref_p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
   compress (3x3):
      compress[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
      compress[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
      compress[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
   andersen_seed        = 815131
   rlist                = 0.9
   coulombtype          = PME
   rcoulomb_switch      = 0
   rcoulomb             = 0.9
   vdwtype              = Cut-off
   rvdw_switch          = 0
   rvdw                 = 1.4
   epsilon_r            = 1
   epsilon_rf           = 1
   tabext               = 1
   gb_algorithm         = Still
   nstgbradii           = 1
   rgbradii             = 2
   gb_saltconc          = 0
   implicit_solvent     = No
   DispCorr             = EnerPres
   fudgeQQ              = 0.5
   free_energy          = no
   init_lambda          = 0
   sc_alpha             = 0
   sc_power             = 0
   sc_sigma             = 0.3
   delta_lambda         = 0
   disre_weighting      = Conservative
   disre_mixed          = FALSE
   dr_fc                = 1000
   dr_tau               = 0
   nstdisreout          = 100
   orires_fc            = 0
   orires_tau           = 0
   nstorireout          = 100
   dihre-fc             = 1000
   dihre-tau            = 0
   nstdihreout          = 100
   em_stepsize          = 0.01
   em_tol               = 10
   niter                = 20
   fc_stepsize          = 0
   nstcgsteep           = 1000
   nbfgscorr            = 10
   ConstAlg             = Lincs
   shake_tol            = 1e-04
   lincs_order          = 4
   lincs_warnangle      = 30
   lincs_iter           = 1
   bd_fric              = 0
   ld_seed              = 1993
   cos_accel            = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   userint1             = 0
   userint2             = 0
   userint3             = 0
   userint4             = 0
   userreal1            = 0
   userreal2            = 0
   userreal3            = 0
   userreal4            = 0
grpopts:
   nrdf:             33598.8     51892.2
   ref_t:                310         310
   tau_t:                0.1         0.1
anneal:                   No          No
ann_npoints:               0           0
   acc:            0           0           0
   nfreeze:           N           N           N
   energygrp_flags[  0]: 0
   efield-x:
      n = 0
   efield-xt:
      n = 0
   efield-y:
      n = 0
   efield-yt:
      n = 0
   efield-z:
      n = 0
   efield-zt:
      n = 0
   bQMMM                = FALSE
   QMconstraints        = 0
   QMMMscheme           = 0
   scalefactor          = 1
qm_opts:
   ngQM                 = 0
Max number of graph edges per atom is 4
Table routines are used for coulomb: TRUE
Table routines are used for vdw:     FALSE
Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
Cut-off's:   NS: 0.9   Coulomb: 0.9   LJ: 1.4
System total charge: 0.000
Generated table with 1200 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1200 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1200 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 500 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 500 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 500 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Enabling TIP4p water optimization for 8649 molecules.

Will do PME sum in reciprocal space.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essman, L. Perela, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

