Hi,

The only explanation is that the file is not in fact properly accessible when rank 0 is placed on any node other than "compute-node", which means the way your file system, Slurm configuration, etc. are organized isn't adequate for what you're trying to do.
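A quick way to test that hypothesis is to ask every node in an allocation
whether it can actually see and write the files mdrun needs. A minimal
sketch (the file names are taken from your log below; the job shape is just
an example):

#!/bin/bash
#SBATCH -N 3
#SBATCH -n 24
# One task per node: each reports its hostname and whether the checkpoint
# and trajectory files are visible and writable from there.
srun --ntasks-per-node=1 bash -c '
for f in md_gmx.cpt md_gmx.xtc; do
  if [ -w "$f" ]; then
    echo "$(hostname): $f OK"
  else
    echo "$(hostname): $f missing or not writable"
  fi
done'

If any node reports the files missing, that node's view of the file system
is the problem, not GROMACS.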
Mark

On Thu, Jun 23, 2016 at 10:15 AM Husen R <hus...@gmail.com> wrote:
> Hi,
>
> I am still unable to find the cause of the fatal error.
> Previously, GROMACS was installed separately on every node; that is why
> the Build time mismatch and Build user mismatch appeared.
> The Build time mismatch and Build user mismatch issues are now solved by
> installing GROMACS in a shared directory.
>
> I have also tried installing GROMACS on one node only (not in a shared
> directory), but the error still appeared.
>
> This is the error message when I exclude compute-node
> ("--exclude=compute-node") from the node list in the Slurm sbatch script;
> excluding any other node works fine.
>
> =========================================================================================
> GROMACS: gmx mdrun, VERSION 5.1.2
> Executable: /mirror/source/gromacs/bin/gmx_mpi
> Data prefix: /mirror/source/gromacs
> Command line:
>   gmx_mpi mdrun -cpi md_gmx.cpt -deffnm md_gmx
>
> Running on 2 nodes with total 8 cores, 16 logical cores
> Cores per node: 4
> Logical cores per node: 8
> Hardware detected on host head-node (the node of MPI rank 0):
> CPU info:
> Vendor: GenuineIntel
> Brand: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
> SIMD instructions most likely to fit this hardware: AVX_256
> SIMD instructions selected at GROMACS compile time: AVX_256
>
> Reading file md_gmx.tpr, VERSION 5.1.2 (single precision)
> Changing nstlist from 10 to 20, rlist from 1 to 1.03
>
> Reading checkpoint file md_gmx.cpt generated: Thu Jun 23 12:54:02 2016
>
> #ranks mismatch,
> current program: 16
> checkpoint file: 24
>
> #PME-ranks mismatch,
> current program: -1
> checkpoint file: 6
>
> GROMACS patchlevel, binary or parallel settings differ from previous run.
> Continuation is exact, but not guaranteed to be binary identical.
>
> -------------------------------------------------------
> Program gmx mdrun, VERSION 5.1.2
> Source code file:
> /home/necis/gromacsinstall/gromacs-5.1.2/src/gromacs/gmxlib/checkpoint.cpp,
> line: 2216
>
> Fatal error:
> Truncation of file md_gmx.xtc failed. Cannot do appending because of this
> failure.
> For more information and tips for troubleshooting, please check the
> GROMACS website at http://www.gromacs.org/Documentation/Errors
> ============================================================================================================
>
> On Thu, Jun 16, 2016 at 6:23 PM, Mark Abraham <mark.j.abra...@gmail.com>
> wrote:
>
> > Hi,
> >
> > On Thu, Jun 16, 2016 at 12:24 PM Husen R <hus...@gmail.com> wrote:
> >
> > > On Thu, Jun 16, 2016 at 4:01 PM, Mark Abraham
> > > <mark.j.abra...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > There's just nothing special about any node at run time.
> > > >
> > > > Your script looks like it is building GROMACS fresh each time -
> > > > there's no need to do that,
> > >
> > > Which part of my script?
> >
> > I can't tell how your script is finding its GROMACS installations, but
> > the advisory message says precisely that your runs are finding
> > different installations...
> >
> > Build time mismatch,
> > current program: Sel Apr 5 13:37:32 WIB 2016
> > checkpoint file: Rab Apr 6 09:44:51 WIB 2016
> >
> > Build user mismatch,
> > current program: pro@head-node [CMAKE]
> > checkpoint file: pro@compute-node [CMAKE]
> >
> > This reinforces my impression that the view of your file system
> > available at the start of the job script is varying with your choice of
> > node subsets.
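To check the related possibility that different node subsets resolve
different GROMACS installations, each node can report which gmx_mpi it
would run - a one-line sketch, assuming gmx_mpi is on PATH in the job
environment:

srun --ntasks-per-node=1 bash -c 'echo "$(hostname): $(command -v gmx_mpi)"'

If the reported paths differ between nodes, that is consistent with the
build time / build user mismatches quoted above.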
> > > I always use this command to restart from the checkpoint file:
> > > "mpirun gmx_mpi mdrun -cpi [name].cpt -deffnm [name]".
> > > As far as I know, the -cpi option is used to pass the checkpoint file
> > > as input.
> > > What do I have to change in my script?
> >
> > Nothing about that aspect. But clearly your first run and the restart
> > simulating the loss of a node are finding different gmx_mpi binaries
> > from their respective environments. This is not itself a problem, but
> > it's probably not what you intend, and it may be symptomatic of the
> > same issue that leads to md_test.xtc not being accessible.
> >
> > Mark
> >
> > > > but the fact that the node name is showing up in the check that
> > > > takes place when the checkpoint is read is not relevant to the
> > > > problem.
> > > >
> > > > Mark
> > > >
> > > > On Thu, Jun 16, 2016 at 9:46 AM Husen R <hus...@gmail.com> wrote:
> > > >
> > > > > On Thu, Jun 16, 2016 at 2:32 PM, Mark Abraham
> > > > > <mark.j.abra...@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On Thu, Jun 16, 2016 at 9:30 AM Husen R <hus...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Thank you for your reply!
> > > > > > >
> > > > > > > md_test.xtc exists and is writable.
> > > > > >
> > > > > > OK, but it needs to be seen that way from the set of compute
> > > > > > nodes you are using, and organizing that is up to you and your
> > > > > > job scheduler, etc.
> > > > > >
> > > > > > > I tried to restart from the checkpoint file while excluding a
> > > > > > > node other than compute-node, and it works.
> > > > > >
> > > > > > Go do that, then :-)
> > > > >
> > > > > I'm building a simple system that can respond to node failure: if
> > > > > a failure occurs on node A, the application has to be restarted
> > > > > and that node has to be excluded.
> > > > > This should apply to all nodes, including this 'compute-node'.
> > > > >
> > > > > > > Only '--exclude=compute-node' produces this error.
> > > > > >
> > > > > > Then there's something about that node that is special with
> > > > > > respect to the file system - there's nothing about any
> > > > > > particular node that GROMACS cares about.
> > > > > >
> > > > > > Mark
> > > > >
> > > > > > > Does this have the same issue as this thread?
> > > > > > > http://comments.gmane.org/gmane.science.biology.gromacs.user/40984
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Husen
> > > > > > >
> > > > > > > On Thu, Jun 16, 2016 at 2:20 PM, Mark Abraham
> > > > > > > <mark.j.abra...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > The stuff about different nodes or numbers of nodes doesn't
> > > > > > > > matter - it's merely an advisory note from mdrun. mdrun
> > > > > > > > failed when it tried to operate upon md_test.xtc, so
> > > > > > > > perhaps you need to consider whether the file exists, is
> > > > > > > > writable, etc.
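As an aside on the restart-after-failure system described above: the
excluding part itself is simple. A minimal sketch of a wrapper (the node
name is hard-coded here, and restart_md.sh is a hypothetical script
containing the usual "mpirun gmx_mpi mdrun -cpi ..." line):

# Resubmit the restart job, keeping the failed node out of the allocation;
# mdrun then continues from the checkpoint via -cpi.
FAILED_NODE=compute-node
sbatch --exclude="$FAILED_NODE" restart_md.sh

The hard part, as this thread shows, is guaranteeing that every node left
in the allocation still sees the same working directory and the same
GROMACS installation.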
> > > > > > > > Mark
> > > > > > > >
> > > > > > > > On Thu, Jun 16, 2016 at 6:48 AM Husen R <hus...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I got the following error message when I tried to restart
> > > > > > > > > a GROMACS simulation from a checkpoint file.
> > > > > > > > > I restarted the simulation using fewer nodes and
> > > > > > > > > processes, and I also excluded one node using the
> > > > > > > > > '--exclude=' option (in Slurm) for experimental purposes.
> > > > > > > > >
> > > > > > > > > I'm sure the fewer nodes and processes are not the cause
> > > > > > > > > of this error, as I have already tested that.
> > > > > > > > > I have checked that the cause of this error is the
> > > > > > > > > '--exclude=' usage. I excluded one node named
> > > > > > > > > 'compute-node' when restarting from the checkpoint (in the
> > > > > > > > > first run, I used all nodes, including 'compute-node').
> > > > > > > > >
> > > > > > > > > It seems that in the first run, the binary used by the
> > > > > > > > > submitted job script was built on compute-node. So, at
> > > > > > > > > restart, the build user mismatch appeared because
> > > > > > > > > compute-node was not found (it was excluded).
> > > > > > > > >
> > > > > > > > > Am I right? Is this behavior normal?
> > > > > > > > > Or is there a way to avoid this, so that I can freely
> > > > > > > > > restart from a checkpoint using any nodes, without
> > > > > > > > > limitation?
> > > > > > > > >
> > > > > > > > > Thank you in advance.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Husen
> > > > > > > > >
> > > > > > > > > ==========================restart script=================
> > > > > > > > > #!/bin/bash
> > > > > > > > > #SBATCH -J ayo
> > > > > > > > > #SBATCH -o md%j.out
> > > > > > > > > #SBATCH -A necis
> > > > > > > > > #SBATCH -N 2
> > > > > > > > > #SBATCH -n 16
> > > > > > > > > #SBATCH --exclude=compute-node
> > > > > > > > > #SBATCH --time=144:00:00
> > > > > > > > > #SBATCH --mail-user=hus...@gmail.com
> > > > > > > > > #SBATCH --mail-type=begin
> > > > > > > > > #SBATCH --mail-type=end
> > > > > > > > >
> > > > > > > > > mpirun gmx_mpi mdrun -cpi md_test.cpt -deffnm md_test
> > > > > > > > > =====================================================
> > > > > > > > >
> > > > > > > > > ==================================output error========================
> > > > > > > > > Reading checkpoint file md_test.cpt generated: Wed Jun 15
> > > > > > > > > 16:30:44 2016
> > > > > > > > >
> > > > > > > > > Build time mismatch,
> > > > > > > > > current program: Sel Apr 5 13:37:32 WIB 2016
> > > > > > > > > checkpoint file: Rab Apr 6 09:44:51 WIB 2016
> > > > > > > > >
> > > > > > > > > Build user mismatch,
> > > > > > > > > current program: pro@head-node [CMAKE]
> > > > > > > > > checkpoint file: pro@compute-node [CMAKE]
> > > > > > > > >
> > > > > > > > > #ranks mismatch,
> > > > > > > > > current program: 16
> > > > > > > > > checkpoint file: 24
> > > > > > > > >
> > > > > > > > > #PME-ranks mismatch,
> > > > > > > > > current program: -1
> > > > > > > > > checkpoint file: 6
> > > > > > > > >
> > > > > > > > > GROMACS patchlevel, binary or parallel settings differ
> > > > > > > > > from previous run.
> > > > > > > > > Continuation is exact, but not guaranteed to be binary
> > > > > > > > > identical.
> > > > > > > > >
> > > > > > > > > -------------------------------------------------------
> > > > > > > > > Program gmx mdrun, VERSION 5.1.2
> > > > > > > > > Source code file:
> > > > > > > > > /home/pro/gromacs-5.1.2/src/gromacs/gmxlib/checkpoint.cpp,
> > > > > > > > > line: 2216
> > > > > > > > >
> > > > > > > > > Fatal error:
> > > > > > > > > Truncation of file md_test.xtc failed. Cannot do appending
> > > > > > > > > because of this failure.
> > > > > > > > > For more information and tips for troubleshooting, please
> > > > > > > > > check the GROMACS website at
> > > > > > > > > http://www.gromacs.org/Documentation/Errors
> > > > > > > > > -------------------------------------------------------
> > > > > > > > > ================================================================
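Two things can make this failure mode less mysterious. First, a pre-flight
check in the job script makes the run abort with a clear message before
mdrun ever starts; a sketch, using the file names from the script above:

# Abort early if the restart inputs are not usable from the node running
# the batch script.
[ -r md_test.cpt ] || { echo "md_test.cpt not readable on $(hostname)"; exit 1; }
[ -w md_test.xtc ] || { echo "md_test.xtc not writable on $(hostname)"; exit 1; }
mpirun gmx_mpi mdrun -cpi md_test.cpt -deffnm md_test

Second, mdrun's -noappend option sidesteps truncation entirely by writing
new, numbered output files instead of appending to the old ones.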
--
Gromacs Users mailing list

* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-requ...@gromacs.org.