[gmx-users] Restarting a REMD simulation (error)
Dear all,

Due to cluster wall-time limitations, I was forced to restart two REMD simulations. They ran absolutely fine until hitting the wall-time limit. To restart, I used the following command:

mpirun -np 64 -output-filename MPIoutput $GromDir/mdrun_mpi -s H5_.tpr -multi 64 -replex 1000 -deffnm H5_ -cpi -noappend

(I'm using GMX-4.0.7 and yes, I know it's old, but I have my own reasons for using it.)

Here is the MPI output of a random replica (#1):

##START###
NNODES=64, MYRANK=1, HOSTNAME=an091
NODEID=1 argc=11
Checkpoint file is from part 1, new output files will be suffixed part0002.
Reading file H5_1.tpr, VERSION 4.0.7 (single precision)
Reading checkpoint file H5_1.cpt generated: Wed Apr 3 17:13:14 2013
---
Program mdrun_mpi, VERSION 4.0.7
Source code file: main.c, line: 116
Fatal error: The 64 subsystems are not compatible
---
Error on node 1, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 1 out of 64
##END###

It's reading from the correct cpt and tpr files, so it must be something else. Here is the tail of the corresponding log file:

##START###
Initializing Replica Exchange
Repl There are 64 replicas:
Multi-checking the number of atoms ... OK
Multi-checking the integrator ... OK
Multi-checking init_step+nsteps ... OK
Multi-checking first exchange step: init_step/-replex ...
first exchange step: init_step/-replex is not equal for all subsystems
subsystem 0: 3062    subsystem 1: 3062    subsystem 2: 3062    subsystem 3: 3062
subsystem 4: 3062    subsystem 5: 3062    subsystem 6: 3062    subsystem 7: 3062
subsystem 8: 3062    subsystem 9: 3062    subsystem 10: 3062   subsystem 11: 3062
subsystem 12: 3062   subsystem 13: 3062   subsystem 14: 3062   subsystem 15: 3062
subsystem 16: 3062   subsystem 17: 3062   subsystem 18: 3062   subsystem 19: 3062
subsystem 20: 3062   subsystem 21: 3062   subsystem 22: 3062   subsystem 23: 3062
subsystem 24: 3062   subsystem 25: 3062   subsystem 26: 3062   subsystem 27: 3062
subsystem 28: 3062   subsystem 29: 3062   subsystem 30: 3062   subsystem 31: 3062
subsystem 32: 3062   subsystem 33: 3062   subsystem 34: 3062   subsystem 35: 3062
subsystem 36: 3062   subsystem 37: 3062   subsystem 38: 3062   subsystem 39: 3066
subsystem 40: 3062   subsystem 41: 3062   subsystem 42: 3062   subsystem 43: 3062
subsystem 44: 3062   subsystem 45: 3062   subsystem 46: 3062   subsystem 47: 3062
subsystem 48: 3062   subsystem 49: 3062   subsystem 50: 3062   subsystem 51: 3062
subsystem 52: 3062   subsystem 53: 3062   subsystem 54: 3062   subsystem 55: 3062
subsystem 56: 3062   subsystem 57: 3062   subsystem 58: 3062   subsystem 59: 3062
subsystem 60: 3062   subsystem 61: 3062   subsystem 62: 3062   subsystem 63: 3062
---
Program mdrun_mpi, VERSION 4.0.7
Source code file: main.c, line: 116
Fatal error: The 64 subsystems are not compatible
---
##END###

It's clear that "init_step/-replex is not equal for all subsystems" is the problem, but does anyone know why this is happening and how to solve it?

Thank you for your patience,
Best regards,
João Henriques

--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
* Please don't post (un)subscribe requests to the list. Use the www interface or send it to gmx-users-requ...@gromacs.org.
* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
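The per-subsystem values in the log tail above are hard to compare by eye. As a minimal sketch (not part of GROMACS; the function name and the regular expression are assumptions made for illustration), the following Python snippet parses "subsystem N: value" entries from pasted log text and reports any subsystem whose value differs from the most common one:

```python
import re
from collections import Counter


def find_mismatched_subsystems(log_text):
    """Return {subsystem_index: value} for entries that differ from the majority value."""
    # Hypothetical pattern: matches entries such as "subsystem 39: 3066".
    pairs = re.findall(r"subsystem\s+(\d+):\s+(-?\d+)", log_text)
    values = {int(idx): int(val) for idx, val in pairs}
    if not values:
        return {}
    # Treat the most frequent value as the expected one.
    majority, _ = Counter(values.values()).most_common(1)[0]
    return {idx: val for idx, val in values.items() if val != majority}


# A short excerpt of the log shown in this message.
sample = "subsystem 37: 3062 subsystem 38: 3062 subsystem 39: 3066 subsystem 40: 3062"
print(find_mismatched_subsystems(sample))  # {39: 3066}
```

Taking the majority value as the reference is a design choice that suits this case, where a single replica is out of step; it would be less informative if the replicas were split evenly between two values.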
Re: [gmx-users] Restarting a REMD simulation (error)
On Apr 8, 2013 8:53 AM, João Henriques joao.henriques.32...@gmail.com wrote:

> first exchange step: init_step/-replex is not equal for all subsystems
> [...]
> subsystem 38: 3062
> subsystem 39: 3066

Seems system 39 got its IO done faster. Its state_prev.cpt will be 3062. Back up your files. Use gmxcheck to see what's in files. Rename as suitable so your set of files is consistent.

Mark
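The "back up, then rename" advice above could be sketched as follows. This is only an illustration under stated assumptions: the file names H5_39.cpt and H5_39_prev.cpt are hypothetical (the thread only mentions state_prev.cpt, and the names actually written by mdrun with -deffnm and -multi should be checked on disk), the helper function is invented for this example, and it does not replace inspecting the checkpoints with gmxcheck first.

```python
import shutil
import tempfile
from pathlib import Path


def swap_in_previous_checkpoint(current, previous, backup_dir):
    """Back up both checkpoint files, then overwrite `current` with `previous`.

    Returns the path of the backed-up copy of the original `current` file.
    """
    current, previous, backup_dir = Path(current), Path(previous), Path(backup_dir)
    for f in (current, previous):
        if not f.is_file():
            raise FileNotFoundError(f"checkpoint file not found: {f}")
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Back up first, so nothing is lost if the later copy goes wrong.
    for f in (current, previous):
        shutil.copy2(f, backup_dir / f.name)
    # Make the older checkpoint the one that the restart will read.
    shutil.copy2(previous, current)
    return backup_dir / current.name


# Demonstration with stand-in files in a temporary directory (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "H5_39.cpt").write_text("newer checkpoint")
    (run_dir / "H5_39_prev.cpt").write_text("older checkpoint")
    saved = swap_in_previous_checkpoint(
        run_dir / "H5_39.cpt", run_dir / "H5_39_prev.cpt", run_dir / "backup"
    )
    print((run_dir / "H5_39.cpt").read_text())  # older checkpoint
    print(saved.read_text())                    # newer checkpoint
```

Copying rather than moving keeps the original files in the backup directory, so the swap can be undone if the restart still fails.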
Re: [gmx-users] Restarting a REMD simulation (error)
Thank you very much. I didn't notice it until now, since all those numbers look so similar. Great eye for detail!

João

On Mon, Apr 8, 2013 at 3:17 PM, Mark Abraham mark.j.abra...@gmail.com wrote:

> Seems system 39 got its IO done faster. Its state_prev.cpt will be 3062.
> Back up your files. Use gmxcheck to see what's in files. Rename as
> suitable so your set of files is consistent.
Re: [gmx-users] Restarting a REMD simulation (error)
It helped that I *really* knew one must differ ;-)

Mark

On Apr 8, 2013 2:24 PM, João Henriques joao.henriques.32...@gmail.com wrote:

> Thank you very much. I didn't notice it until now considering all those
> numbers look so similar. Great eye for detail!