Re: [gmx-users] continuation run segmentation fault

2014-07-24 Thread Szilárd Páll
Hi,

There is a certain version of MPI that caused a lot of headaches until
we realized that it was buggy. I'm not entirely sure which version it
was, but I suspect it was the 1.4.3 shipped as the default on Ubuntu
12.04 server.

I suggest that you try (rough examples below):
- using a different MPI version;
- using a single rank/no MPI to continue;
- using thread-MPI to continue.
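
For example, something along these lines; this is only a sketch, borrowing
the paths and file names from your script below, and assuming a default
(thread-MPI) GROMACS build installed as plain mdrun:

  # check which Open MPI the failing binary actually uses
  ompi_info | grep "Open MPI:"
  ldd /home/user/src/gromacs-4.5.5/bin/mdrun_openmpi_intel | grep -i mpi

  # continue on a single MPI rank only
  mpiexec -np 1 /home/user/src/gromacs-4.5.5/bin/mdrun_openmpi_intel \
      -s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename

  # or continue with the thread-MPI (non-MPI) mdrun, 12 threads on one node
  /home/user/src/gromacs-4.5.5/bin/mdrun -nt 12 \
      -s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename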

Cheers,
--
Szilárd


On Thu, Jul 24, 2014 at 5:29 PM, David de Sancho wrote:
> Dear all,
> I am having some trouble continuing some runs with Gromacs 4.5.5 on our
> local cluster. Surprisingly, the same simulations ran smoothly before on
> the same number of nodes and cores on the same system. And even more
> surprisingly, if I reduce the number of nodes to 1 with its 12 processors,
> then it runs again.
>
> And the script I am using to run the simulations looks something like this:
>
>> # Set some Torque options: class name and max time for the job. Torque
>> # developed from a program called OpenPBS, hence all the PBS references
>> # in this file
>> #PBS -l nodes=4:ppn=12,walltime=24:00:00
>
>> source /home/dd363/src/gromacs-4.5.5/bin/GMXRC.bash
>> application="/home/user/src/gromacs-4.5.5/bin/mdrun_openmpi_intel"
>> options="-s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename"
>>
>> #! change the working directory (default is home directory)
>> cd $PBS_O_WORKDIR
>> echo Running on host `hostname`
>> echo Time is `date`
>> echo Directory is `pwd`
>> echo PBS job ID is $PBS_JOBID
>> echo This job runs on the following machines:
>> echo `cat $PBS_NODEFILE | uniq`
>> #! Run the parallel MPI executable
>> #!export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/lib64:/usr/lib64"
>> echo "Running mpiexec $application $options"
>> mpiexec $application $options
>
>
> And the error messages I am getting look something like this:
>
>> [compute-0-11:09645] *** Process received signal ***
>> [compute-0-11:09645] Signal: Segmentation fault (11)
>> [compute-0-11:09645] Signal code: Address not mapped (1)
>> [compute-0-11:09645] Failing at address: 0x10
>> [compute-0-11:09643] *** Process received signal ***
>> [compute-0-11:09643] Signal: Segmentation fault (11)
>> [compute-0-11:09643] Signal code: Address not mapped (1)
>> [compute-0-11:09643] Failing at address: 0xd0
>> [compute-0-11:09645] [ 0] /lib64/libpthread.so.0 [0x38d300e7c0]
>> [compute-0-11:09645] [ 1]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
>> [0x2af2091443f9]
>> [compute-0-11:09645] [ 2]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
>> [0x2af209142963]
>> [compute-0-11:09645] [ 3]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_btl_sm.so
>> [0x2af20996e33c]
>> [compute-0-11:09645] [ 4]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libopen-pal.so.0(opal_progress+0x87)
>> [0x2af20572cfa7]
>> [compute-0-11:09645] [ 5]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0
>> [0x2af205219636]
>> [compute-0-11:09645] [ 6]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
>> [0x2af20aa2259b]
>> [compute-0-11:09645] [ 7]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
>> [0x2af20aa2a04b]
>> [compute-0-11:09645] [ 8]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
>> [0x2af20aa22da9]
>> [compute-0-11:09645] [ 9]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(ompi_comm_split+0xcc)
>> [0x2af205204dcc]
>> [compute-0-11:09645] [10]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(MPI_Comm_split+0x3c)
>> [0x2af205236f0c]
>> [compute-0-11:09645] [11]
>> /home/dd363/src/gromacs-4.5.5/lib/libgmx_mpi.so.6(gmx_setup_nodecomm+0x14b)
>> [0x2af204b8ba6b]
>> [compute-0-11:09645] [12]
>> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(mdrunner+0x86c)
>> [0x415aac]
>> [compute-0-11:09645] [13]
>> /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(main+0x1928)
>> [0x41d968]
>> [compute-0-11:09645] [14] /lib64/libc.so.6(__libc_start_main+0xf4)
>> [0x38d281d994]
>> [compute-0-11:09643] [ 0] /lib64/libpthread.so.0 [0x38d300e7c0]
>> [compute-0-11:09643] [ 1]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
>> [0x2b56aca403f9]
>> [compute-0-11:09643] [ 2]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so
>> [0x2b56aca3e963]
>> [compute-0-11:09643] [ 3]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_btl_sm.so
>> [0x2b56ad26a33c]
>> [compute-0-11:09643] [ 4]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libopen-pal.so.0(opal_progress+0x87)
>> [0x2b56a9028fa7]
>> [compute-0-11:09643] [ 5]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0
>> [0x2b56a8b15636]
>> [compute-0-11:09643] [ 6]
>> /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so
>> [
