I am using GROMACS 3.3.1 and 3.3.2, and I get the following problem with
both of them.

If I run replica exchange on more than 4 processors (2 and 4 are fine), the
simulations finish, but MPI then gives the following errors, so the job never
terminates.


This is the end of my log file:

-----------------------------------------------------------------------

               NODE (s)   Real (s)      (%)
       Time: 158483.430 159636.000     99.3
                       1d20h01:23
               (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:     18.919    818.029      2.726      8.805
p13_15442:  p4_error: Timeout in establishing connection to remote process: 0
p12_15407:  p4_error: Timeout in establishing connection to remote process: 0
Broken pipe
p11_2364:  p4_error: Timeout in establishing connection to remote process: 0
p9_20588:  p4_error: Timeout in establishing connection to remote process: 0
p10_2329:  p4_error: Timeout in establishing connection to remote process: 0
Broken pipe
Broken pipe
Broken pipe
Broken pipe
p6_24137:  p4_error: Timeout in establishing connection to remote process: 0
p7_24172:  p4_error: Timeout in establishing connection to remote process: 0
Broken pipe
Broken pipe


I have tried installing on three different clusters, using different
versions of MPICH, and they all do this. However, I do not get the error when
I run a single simulation on 8 processors; I only get this problem when I
run replica exchange. Any ideas what is going on? I'm also including my
submission script; perhaps I am missing something, but I'm just not seeing
it.

#!/bin/bash
#
#$ -N switch_less
#$ -pe mpich 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
#$ -l h_rt=00:05:00

MPIDIR=/opt/mpich/intel/bin/
MDDIR=/soft/linux/pkg/gromacs-3.3.1/bin
SYSTEM=free


INDEX=0
for T in 80 82 84 86 87 88 89 90
do
sed "s/TTTT/$T/g" MDRUN > mdrun.$INDEX.mdp

$MDDIR/grompp \
        -f mdrun.$INDEX \
        -c $SYSTEM.gro \
        -p $SYSTEM.top \
        -po mdout.$INDEX \
        -o $SYSTEM$INDEX.tpr
let "INDEX += 1"

done

if test $NSLOTS -eq $INDEX
then
$MPIDIR/mpirun -v -np $NSLOTS -machinefile $TMPDIR/machines \
  -nolocal $MDDIR/mdrun-mpi -v \
        -np $NSLOTS \
        -multi $NSLOTS \
        -replex 50 \
        -s $SYSTEM.tpr \
        -o $SYSTEM \
        -c $SYSTEM.out \
        -g $SYSTEM \
        -e $SYSTEM \
        -x $SYSTEM
else

echo 'wrong number of nodes for the number of replicas'
fi
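
One way to rule out missing input files as the cause is a check before the
mpirun line. This is only a minimal sketch, assuming the run input files are
named as in the script above (free0.tpr ... free7.tpr); the function name
check_replica_inputs is hypothetical and not part of GROMACS.

```shell
#!/bin/bash
# Hypothetical helper: confirm that every per-replica run input file
# (PREFIX0.tpr ... PREFIX<COUNT-1>.tpr) exists and is non-empty.
# Usage: check_replica_inputs PREFIX COUNT
check_replica_inputs() {
    local prefix="$1"
    local count="$2"
    local i=0
    local missing=0
    while [ "$i" -lt "$count" ]; do
        if [ ! -s "${prefix}${i}.tpr" ]; then
            echo "missing or empty: ${prefix}${i}.tpr" >&2
            missing=1
        fi
        i=$((i + 1))
    done
    # Return 0 only if all files were found.
    return "$missing"
}
```

In the script above it could be called as
check_replica_inputs "$SYSTEM" "$INDEX" || exit 1
just before the mpirun command.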


I have tried the -debug option when running GROMACS, but I can't tell from
its output what is going on. Is there something I should look for in the
debug log file?

thanks

-Paul