Re: [gmx-users] continuation run segmentation fault
Hi,

There is a certain version of MPI that caused a lot of headaches until we realized that it is buggy. I'm not entirely sure which version it was, but I suspect it was the 1.4.3 shipped as default on Ubuntu 12.04 server. I suggest that you try:
- using a different MPI version;
- using a single rank/no MPI to continue;
- using thread-MPI to continue.

Cheers,
--
Szilárd

On Thu, Jul 24, 2014 at 5:29 PM, David de Sancho wrote:

> Dear all,
>
> I am having some trouble continuing some runs with Gromacs 4.5.5 on our
> local cluster. Surprisingly, the same simulations previously ran smoothly
> with the same number of nodes and cores on the same system. And even more
> surprisingly, if I reduce the number of nodes to 1 with its 12 processors,
> then it runs again.
>
> The script I am using to run the simulations looks something like this:
>
>> # Set some Torque options: class name and max time for the job. Torque
>> # developed from a program called OpenPBS, hence all the PBS references
>> # in this file
>> #PBS -l nodes=4:ppn=12,walltime=24:00:00
>>
>> source /home/dd363/src/gromacs-4.5.5/bin/GMXRC.bash
>> application="/home/user/src/gromacs-4.5.5/bin/mdrun_openmpi_intel"
>> options="-s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename"
>>
>> #! change the working directory (default is home directory)
>> cd $PBS_O_WORKDIR
>> echo Running on host `hostname`
>> echo Time is `date`
>> echo Directory is `pwd`
>> echo PBS job ID is $PBS_JOBID
>> echo This job runs on the following machines:
>> echo `cat $PBS_NODEFILE | uniq`
>>
>> #! Run the parallel MPI executable
>> #!export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/lib64:/usr/lib64"
>> echo "Running mpiexec $application $options"
>> mpiexec $application $options
>
> And the error messages I am getting look something like this:
>
>> [compute-0-11:09645] *** Process received signal ***
>> [compute-0-11:09645] Signal: Segmentation fault (11)
>> [compute-0-11:09645] Signal code: Address not mapped (1)
>> [compute-0-11:09645] Failing at address: 0x10
>> [compute-0-11:09643] *** Process received signal ***
>> [compute-0-11:09643] Signal: Segmentation fault (11)
>> [compute-0-11:09643] Signal code: Address not mapped (1)
>> [compute-0-11:09643] Failing at address: 0xd0
>> [compute-0-11:09645] [ 0] /lib64/libpthread.so.0 [0x38d300e7c0]
>> [compute-0-11:09645] [ 1] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so [0x2af2091443f9]
>> [compute-0-11:09645] [ 2] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so [0x2af209142963]
>> [compute-0-11:09645] [ 3] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_btl_sm.so [0x2af20996e33c]
>> [compute-0-11:09645] [ 4] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libopen-pal.so.0(opal_progress+0x87) [0x2af20572cfa7]
>> [compute-0-11:09645] [ 5] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0 [0x2af205219636]
>> [compute-0-11:09645] [ 6] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2af20aa2259b]
>> [compute-0-11:09645] [ 7] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2af20aa2a04b]
>> [compute-0-11:09645] [ 8] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2af20aa22da9]
>> [compute-0-11:09645] [ 9] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(ompi_comm_split+0xcc) [0x2af205204dcc]
>> [compute-0-11:09645] [10] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(MPI_Comm_split+0x3c) [0x2af205236f0c]
>> [compute-0-11:09645] [11] /home/dd363/src/gromacs-4.5.5/lib/libgmx_mpi.so.6(gmx_setup_nodecomm+0x14b) [0x2af204b8ba6b]
>> [compute-0-11:09645] [12] /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(mdrunner+0x86c) [0x415aac]
>> [compute-0-11:09645] [13] /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(main+0x1928) [0x41d968]
>> [compute-0-11:09645] [14] /lib64/libc.so.6(__libc_start_main+0xf4) [0x38d281d994]
>> [compute-0-11:09643] [ 0] /lib64/libpthread.so.0 [0x38d300e7c0]
>> [compute-0-11:09643] [ 1] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so [0x2b56aca403f9]
>> [compute-0-11:09643] [ 2] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so [0x2b56aca3e963]
>> [compute-0-11:09643] [ 3] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_btl_sm.so [0x2b56ad26a33c]
>> [compute-0-11:09643] [ 4] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libopen-pal.so.0(opal_progress+0x87) [0x2b56a9028fa7]
>> [compute-0-11:09643] [ 5] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0 [0x2b56a8b15636]
>> [compute-0-11:09643] [ 6] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2b56ae31e59b]
>> [compute-0-11:09643] [ 7] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2b56ae32604b]
>> [compute-0-11:09643] [ 8] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2b56ae31eda9]
>> [compute-0-11:09643] [ 9] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(ompi_comm_split+0xcc) [0x2b56a8b00dcc]
>> [compute-0-11:09643] [10] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(MPI_Comm_split+
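[Editor's note: as an illustrative sketch of the suggested workarounds, a continuation job could look like the fragment below. Paths and option strings mirror the original post; the non-MPI binary name `mdrun` and the thread count are assumptions about the local installation, not details from the thread.]

```shell
# Workaround sketch: continue on a single node using GROMACS's built-in
# thread-MPI, bypassing the (suspect) Open MPI 1.4.3 install entirely.
#PBS -l nodes=1:ppn=12,walltime=24:00:00

# Assumption: a non-MPI (thread-MPI) build of mdrun is on PATH via GMXRC.
source /home/dd363/src/gromacs-4.5.5/bin/GMXRC.bash
cd $PBS_O_WORKDIR

# -nt 12 runs 12 thread-MPI ranks; -cpi continues from the checkpoint file.
mdrun -nt 12 -s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename

# Alternative workaround: a single MPI rank, which sidesteps the
# MPI_Comm_split call visible in the backtrace:
# mpiexec -n 1 $application $options
```

If the run continues cleanly this way, that points at the Open MPI 1.4.3 build rather than at the checkpoint files.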