This is the rest of the error message.

Regards,
Husen
Halting parallel program gmx mdrun on rank 0 out of 16
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Fatal error in PMPI_Bcast: Unknown error class, error stack:
PMPI_Bcast(1635)......................: MPI_Bcast(buf=0xcd9ed8, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1477).................:
MPIR_Bcast(1501)......................:
MPIR_Bcast_intra(1272)................:
MPIR_SMP_Bcast(1104)..................:
MPIR_Bcast_binomial(256)..............:
MPIDU_Complete_posted_with_error(1189): Process failed
MPIR_SMP_Bcast(1111)..................:
MPIR_Bcast_binomial(327)..............: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1635)........: MPI_Bcast(buf=0x1858e78, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1477)...:
MPIR_Bcast(1501)........:
MPIR_Bcast_intra(1272)..:
MPIR_SMP_Bcast(1111)....:
MPIR_Bcast_binomial(327): Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1635)........: MPI_Bcast(buf=0x24f7e78, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1477)...:
MPIR_Bcast(1501)........:
MPIR_Bcast_intra(1272)..:
MPIR_SMP_Bcast(1111)....:
MPIR_Bcast_binomial(327): Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1635)........: MPI_Bcast(buf=0xb21e78, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1477)...:
MPIR_Bcast(1501)........:
MPIR_Bcast_intra(1272)..:
MPIR_SMP_Bcast(1111)....:
MPIR_Bcast_binomial(327): Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1635)........: MPI_Bcast(buf=0x15fbe78, count=4, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1477)...:
MPIR_Bcast(1501)........:
MPIR_Bcast_intra(1272)..:
MPIR_SMP_Bcast(1111)....:
MPIR_Bcast_binomial(327): Failure during collective

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 6983 RUNNING AT head-node
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

On Thu, Jun 16, 2016 at 11:48 AM, Husen R <hus...@gmail.com> wrote:

> Hi all,
>
> I got the following error message when I tried to restart a GROMACS
> simulation from a checkpoint file.
> I restarted the simulation using fewer nodes and processes, and I also
> excluded one node using the '--exclude=' option (in SLURM) for
> experimental purposes.
>
> I'm sure the smaller number of nodes and processes is not the cause of
> this error, as I have already tested that.
> I have checked that the cause is the '--exclude=' usage: I excluded one
> node named 'compute-node' when restarting from the checkpoint (in the
> first run, I used all nodes, including 'compute-node').
>
> It seems that, for the first run, the GROMACS binary was built at
> compute-node. So, at restart, a build user mismatch appeared because
> compute-node was not found (it was excluded).
>
> Am I right? Is this behavior normal?
> Or is there a way to avoid this, so I can freely restart from a
> checkpoint using any nodes without limitation?
>
> Thank you in advance.
>
> Regards,
>
> Husen
>
> ==========================restart script=================
> #!/bin/bash
> #SBATCH -J ayo
> #SBATCH -o md%j.out
> #SBATCH -A necis
> #SBATCH -N 2
> #SBATCH -n 16
> #SBATCH --exclude=compute-node
> #SBATCH --time=144:00:00
> #SBATCH --mail-user=hus...@gmail.com
> #SBATCH --mail-type=begin
> #SBATCH --mail-type=end
>
> mpirun gmx_mpi mdrun -cpi md_test.cpt -deffnm md_test
> =====================================================
>
>
> ==================================output error========================
> Reading checkpoint file md_test.cpt generated: Wed Jun 15 16:30:44 2016
>
> Build time mismatch,
>   current program: Sel Apr 5 13:37:32 WIB 2016
>   checkpoint file: Rab Apr 6 09:44:51 WIB 2016
>
> Build user mismatch,
>   current program: pro@head-node [CMAKE]
>   checkpoint file: pro@compute-node [CMAKE]
>
> #ranks mismatch,
>   current program: 16
>   checkpoint file: 24
>
> #PME-ranks mismatch,
>   current program: -1
>   checkpoint file: 6
>
> GROMACS patchlevel, binary or parallel settings differ from previous run.
> Continuation is exact, but not guaranteed to be binary identical.
>
> -------------------------------------------------------
> Program gmx mdrun, VERSION 5.1.2
> Source code file:
> /home/pro/gromacs-5.1.2/src/gromacs/gmxlib/checkpoint.cpp, line: 2216
>
> Fatal error:
> Truncation of file md_test.xtc failed. Cannot do appending because of this
> failure.
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
> ================================================================
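Not an authoritative fix, but one way to sidestep the "Truncation of file md_test.xtc failed" appending error is to restart with mdrun's -noappend flag, so the continuation writes new, numbered output files (md_test.part0002.xtc and so on) instead of truncating and appending to the existing ones. The following is only a sketch based on the script above; the job name, account, and excluded node are copied from it and not verified here:

==========================restart script with -noappend (sketch)=================
#!/bin/bash
#SBATCH -J ayo
#SBATCH -o md%j.out
#SBATCH -A necis
#SBATCH -N 2
#SBATCH -n 16
#SBATCH --exclude=compute-node
#SBATCH --time=144:00:00

# -cpi reads the checkpoint as before; -noappend tells mdrun not to
# truncate/append the existing md_test.* outputs and to start fresh,
# part-numbered files instead.
mpirun gmx_mpi mdrun -cpi md_test.cpt -deffnm md_test -noappend
==================================================================================

If a single trajectory is needed afterwards, the resulting parts can be concatenated with gmx trjcat. This does not address whether the build/user mismatch from excluding compute-node is the real cause, but it avoids the appending step that fails here.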