Hi, I am using gromacs-2019.4. I have been running all-atom simulations of a box that contains a peptide embedded in a DOPC bilayer membrane. I have been running for weeks on a TACC computer that has 4 GPUs in a single node, so I usually run 4 trajectories on a single node using the -multidir option. My submission script is:

#!/bin/bash
#SBATCH -J sb16            # Job name
#SBATCH -o test.o%j        # Standard output file
#SBATCH -e test.e%j        # Standard error file
#SBATCH -N 1               # Total number of nodes requested
#SBATCH -n 4               # Total number of mpi tasks requested
#SBATCH -p rtx             # Queue (partition) name -- normal, development, etc.
#SBATCH -t 48:00:00        # Run time (hh:mm:ss)

module load cuda/10.1
module use -a /home1/01247/alfredo/Software/ForGPU/plumed-2.5.3/MyInstall/lib/plumed/ModuleFile
module load plumed_gpu

export OMP_NUM_THREADS=4

ibrun /home1/01247/alfredo/Software/gromacs-2019.4_gpu/build-gpu-mpi-plumed/My_install/bin/mdrun_mpi -s topol.tpr -plumed plumed.dat -multidir 1 2 3 4

Because the system is going to be down for a week, I want to do continuation runs on a slower computer system, also using GPUs. Because that system is slower, I want to run on two nodes. A script that I have used successfully on that old machine is:

#!/bin/bash
#SBATCH -J SB9_pi1         # Job name
#SBATCH -o test.o%j        # Standard output file
#SBATCH -N 2               # Total number of nodes requested
#SBATCH -n 2               # Total number of mpi tasks requested
#SBATCH -p gpu             # Queue (partition) name -- normal, development, etc.
#SBATCH -t 24:00:00        # Run time (hh:mm:ss)

module load gcc/5.2.0
module load cray_mpich/7.7.3
module load cuda/9.0

# Launch MPI-based executable
export OMP_NUM_THREADS=6

ibrun /home1/01247/alfredo/gromacs-2019.4/build_MPI/My_install/bin/mdrun_mpi -s topol2.tpr -pin on -cpi state.cpt -noappend

It works great if I set up a new simulation of the same molecular system (create a new tpr file). But if I attempt a continuation run coming from the other machine (which used 4 threads), I get:

Not all bonded interactions have been properly assigned to the domain decomposition cells
A list of missing interactions:
        Bond of  10801 missing     -5
         U-B of  53187 missing     22
 Proper Dih. of  89703 missing    119
       LJ-14 of  73729 missing      3
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

And it stops.

If I modify the script for the old machine to use 1 node, 1 task and 4 threads, it runs well, but it is a lot slower. My question is whether there is any way to avoid this error, so that I can do a continuation run using state.cpt with a different domain decomposition. I have seen it suggested on the list to use -rdd. The value printed in the log file is 1.595 nm. I increased it to 2.0 and got a similar error.

Thanks,
Alfredo
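
To make the -rdd attempt concrete, here is a sketch of how the launch line on the old machine might look with that option added. It is reconstructed from the description above rather than copied from the actual job script. The variable names MDRUN, RDD and cmd are only illustrative, and the command is printed instead of executed so it can be checked before submitting.

```shell
#!/bin/bash
# Sketch only: assembles the continuation command with an explicit -rdd value.
# The mdrun path and input files are the ones from the two-node script above;
# MDRUN, RDD and cmd are illustrative names, not part of the original script.
MDRUN=/home1/01247/alfredo/gromacs-2019.4/build_MPI/My_install/bin/mdrun_mpi
RDD=2.0   # nm; the value in the log file was 1.595 nm

# Build the command as an array so every argument stays a separate word.
cmd=(ibrun "$MDRUN" -s topol2.tpr -pin on -cpi state.cpt -noappend -rdd "$RDD")

# Print the launch line for inspection; in a real job script this line
# would be replaced by "${cmd[@]}" to actually start the run.
printf '%s\n' "${cmd[*]}"
```

With RDD=2.0 this is the variant that still ended with a similar "missing interactions" error on two nodes.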