[gmx-users] Restarting a REMD simulation (error)

2013-04-08 Thread João Henriques
Dear all,

Due to cluster wall-time limitations, I was forced to restart two REMD
simulations. Both ran absolutely fine until hitting the wall-time limit. To
restart, I used the following command:

mpirun -np 64 -output-filename MPIoutput $GromDir/mdrun_mpi -s H5_.tpr
-multi 64 -replex 1000 -deffnm H5_ -cpi -noappend

(I'm using GMX-4.0.7 and yes I know it's old but I have my own reasons for
using it.)

Here is a random replica (#1) MPI output:

##START###
NNODES=64, MYRANK=1, HOSTNAME=an091
NODEID=1 argc=11
Checkpoint file is from part 1, new output files will be suffixed part0002.
Reading file H5_1.tpr, VERSION 4.0.7 (single precision)

Reading checkpoint file H5_1.cpt generated: Wed Apr  3 17:13:14 2013

---
Program mdrun_mpi, VERSION 4.0.7
Source code file: main.c, line: 116

Fatal error:
The 64 subsystems are not compatible

---

Error on node 1, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 1 out of 64
##END###

It's reading from the correct cpt and tpr files, so it must be something
else.

Here is a tail of the respective log file:

##START###
Initializing Replica Exchange
Repl  There are 64 replicas:
Multi-checking the number of atoms ... OK
Multi-checking the integrator ... OK
Multi-checking init_step+nsteps ... OK
Multi-checking first exchange step: init_step/-replex ...
first exchange step: init_step/-replex is not equal for all subsystems
  subsystem 0: 3062
  subsystem 1: 3062
  subsystem 2: 3062
  subsystem 3: 3062
  subsystem 4: 3062
  subsystem 5: 3062
  subsystem 6: 3062
  subsystem 7: 3062
  subsystem 8: 3062
  subsystem 9: 3062
  subsystem 10: 3062
  subsystem 11: 3062
  subsystem 12: 3062
  subsystem 13: 3062
  subsystem 14: 3062
  subsystem 15: 3062
  subsystem 16: 3062
  subsystem 17: 3062
  subsystem 18: 3062
  subsystem 19: 3062
  subsystem 20: 3062
  subsystem 21: 3062
  subsystem 22: 3062
  subsystem 23: 3062
  subsystem 24: 3062
  subsystem 25: 3062
  subsystem 26: 3062
  subsystem 27: 3062
  subsystem 28: 3062
  subsystem 29: 3062
  subsystem 30: 3062
  subsystem 31: 3062
  subsystem 32: 3062
  subsystem 33: 3062
  subsystem 34: 3062
  subsystem 35: 3062
  subsystem 36: 3062
  subsystem 37: 3062
  subsystem 38: 3062
  subsystem 39: 3066
  subsystem 40: 3062
  subsystem 41: 3062
  subsystem 42: 3062
  subsystem 43: 3062
  subsystem 44: 3062
  subsystem 45: 3062
  subsystem 46: 3062
  subsystem 47: 3062
  subsystem 48: 3062
  subsystem 49: 3062
  subsystem 50: 3062
  subsystem 51: 3062
  subsystem 52: 3062
  subsystem 53: 3062
  subsystem 54: 3062
  subsystem 55: 3062
  subsystem 56: 3062
  subsystem 57: 3062
  subsystem 58: 3062
  subsystem 59: 3062
  subsystem 60: 3062
  subsystem 61: 3062
  subsystem 62: 3062
  subsystem 63: 3062

---
Program mdrun_mpi, VERSION 4.0.7
Source code file: main.c, line: 116

Fatal error:
The 64 subsystems are not compatible

---
##END###
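
With 64 near-identical values, the one that differs is easy to miss by eye. A minimal Python sketch that tallies the "subsystem N: value" lines from a log excerpt like the one above and reports any that differ from the majority. The function name find_outliers and the regular expression are illustrative assumptions, not part of GROMACS:

```python
import re
from collections import Counter

# Matches lines such as "  subsystem 39: 3066" from the mdrun log.
SUBSYSTEM_LINE = re.compile(r"^\s*subsystem\s+(\d+):\s+(-?\d+)\s*$")


def find_outliers(log_text):
    """Return (majority_value, {subsystem: value}) for subsystems that differ."""
    values = {}
    for line in log_text.splitlines():
        match = SUBSYSTEM_LINE.match(line)
        if match:
            values[int(match.group(1))] = int(match.group(2))
    if not values:
        return None, {}
    # The most common value is taken as the "expected" one.
    majority, _ = Counter(values.values()).most_common(1)[0]
    outliers = {k: v for k, v in values.items() if v != majority}
    return majority, outliers


if __name__ == "__main__":
    # Synthetic sample reproducing the values listed in the log above.
    sample = "\n".join(
        f"  subsystem {i}: {3066 if i == 39 else 3062}" for i in range(64)
    )
    majority, outliers = find_outliers(sample)
    print(majority, outliers)
```

For the values listed above this prints 3062 {39: 3066}.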

It's clear that the problem is that init_step/-replex is not equal for all
subsystems, but does anyone know why this is happening and how to solve it?

Thank you for your patience,
Best regards,

João Henriques
--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
* Please don't post (un)subscribe requests to the list. Use the
www interface or send it to gmx-users-requ...@gromacs.org.
* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists


Re: [gmx-users] Restarting a REMD simulation (error)

2013-04-08 Thread Mark Abraham
On Apr 8, 2013 8:53 AM, João Henriques joao.henriques.32...@gmail.com
wrote:

 Multi-checking first exchange step: init_step/-replex ...
 first exchange step: init_step/-replex is not equal for all subsystems
   subsystem 39: 3066

It seems subsystem 39 got its checkpoint I/O done faster than the others, so
its latest checkpoint is from a later step. Its state_prev.cpt will be at 3062.
Back up your files. Use gmxcheck to see what's in the files. Rename as suitable
so your set of files is consistent.

Mark
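
A minimal sketch of the back-up-and-rename step in Python, under assumptions that should be checked against the actual run directory: that each replica's checkpoint is named <prefix><replica>.cpt (as H5_1.cpt in the output above suggests) and that the older checkpoint sits next to it as <prefix><replica>_prev.cpt. The function name and the cpt_backup directory are hypothetical. It does not replace inspecting the files with gmxcheck first.

```python
import shutil
from pathlib import Path


def swap_in_previous_checkpoint(run_dir, prefix, replica, backup_dir="cpt_backup"):
    """Back up all checkpoints, then make one replica's *_prev.cpt its current one.

    ASSUMPTION: files are named <prefix><replica>.cpt and
    <prefix><replica>_prev.cpt; verify this in your own run directory.
    """
    run_dir = Path(run_dir)
    backup = run_dir / backup_dir
    # Refuse to reuse an existing backup directory so an earlier backup
    # is never silently overwritten.
    backup.mkdir(exist_ok=False)
    # Copy every checkpoint first so nothing is lost if a later step goes wrong.
    for cpt in sorted(run_dir.glob(f"{prefix}*.cpt")):
        shutil.copy2(cpt, backup / cpt.name)

    current = run_dir / f"{prefix}{replica}.cpt"
    previous = run_dir / f"{prefix}{replica}_prev.cpt"
    if not previous.exists():
        raise FileNotFoundError(previous)
    # Replace the too-new checkpoint with the older one; the original
    # remains available in the backup directory.
    shutil.copy2(previous, current)
    return current


if __name__ == "__main__":
    # Hypothetical call for the replica flagged in the log above.
    print(swap_in_previous_checkpoint(".", "H5_", 39))
```

After the swap, checking every replica's checkpoint again before restarting confirms that all 64 are now at the same step.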



Re: [gmx-users] Restarting a REMD simulation (error)

2013-04-08 Thread João Henriques
Thank you very much. I didn't notice it until now considering all those
numbers look so similar. Great eye for detail!

João


On Mon, Apr 8, 2013 at 3:17 PM, Mark Abraham mark.j.abra...@gmail.com wrote:


 Seems system 39 got its IO done faster. Its state_prev.cpt will be 3062.
 Back up your files. Use gmxcheck to see what's in files. Rename as suitable
 so your set of files is consistent.

 Mark

Re: [gmx-users] Restarting a REMD simulation (error)

2013-04-08 Thread Mark Abraham
It helped that I *really* knew one must differ ;-)

Mark
On Apr 8, 2013 2:24 PM, João Henriques joao.henriques.32...@gmail.com
wrote:

 Thank you very much. I didn't notice it until now considering all those
 numbers look so similar. Great eye for detail!

 João

