Hi,

Yes, that's a real bug. I'm not yet sure what to do about it, but I'll continue discussion at http://redmine.gromacs.org/issues/1848
Mark

On Thu, Nov 19, 2015 at 8:06 PM Krzysztof Kuczera <kkucz...@ku.edu> wrote:
> Dear Justin and Mark,
>
> Thanks for your helpful suggestions.
> Yes, my case is just like bug 1848.
> Our computing staff recompiled the 5.1.1 code with the debugger and ran
> a backtrace on my job, concluding that there is a bug in the code.
> I include their conclusions in case that might help resolve the problem;
> a condensed backtrace follows:
> Krzysztof
>
>
> [67] 0x00000000007a10bd in add_binr (b=0x25f11c0, nr=9, r=0x0) at
>     /home/wmason/gromacs-5.1.1/src/gromacs/gmxlib/rbin.c:94
> [4-7,12-15,20-31,60-67]  94  rbuf[i] = r[i];
>
> 0x0000000000725758 in global_stat (fplog=0x3543750, gs=0x3639420, cr=0x3536fa0,
>     enerd=0x36398b0, fvir=0x0, svir=0x0, mu_tot=0x7fff6a2dfe6c, inputrec=0x35424f0,
>     ekind=0x3635990, constr=0x363b920, vcm=0x0, nsig=0, sig=0x0,
>     top_global=0x3541860, state_local=0x363a090, bSumEkinhOld=0, flags=146) at
>     /home/wmason/gromacs-5.1.1/src/gromacs/mdlib/stat.cpp:229
>
> 0x000000000073efcd in compute_globals (fplog=0x3543750, gstat=0x3639420,
>     cr=0x3536fa0, ir=0x35424f0, fr=0x3599df0, ekind=0x3635990, state=0x363a090,
>     state_global=0x3543270, mdatoms=0x35cc860, nrnb=0x3599a20, vcm=0x3623b40,
>     wcycle=0x3599340, enerd=0x36398b0, force_vir=0x0, shake_vir=0x0, total_vir=0x0,
>     pres=0x0, mu_tot=0x7fff6a2dfe6c, constr=0x363b920, gs=0x0, bInterSimGS=0,
>     box=0x363a0b0, top_global=0x3541860, bSumEkinhOld=0x7fff6a2dff10, flags=146) at
>     /home/wmason/gromacs-5.1.1/src/gromacs/mdlib/md_support.cpp:342
>
> 0x00000000004c3dca in do_md (fplog=0x0, cr=0x24a7fb0, nfile=35,
>     fnm=0x7fffa95b48a8, oenv=0x24b46e0, bVerbose=0, bCompact=1, nstglobalcomm=20,
>     vsite=0x252d890, constr=0x25e9a80, stepout=100, ir=0x24b2430,
>     top_global=0x24b4760, fcd=0x24ea8b0, state_global=0x24b31c0, mdatoms=0x252d980,
>     nrnb=0x24fa9b0, wcycle=0x24fa1b0, ed=0x0, fr=0x24fad80, repl_ex_nst=500,
>     repl_ex_nex=0, repl_ex_seed=-1, membed=0x0, cpt_period=15, max_hours=-1,
>     imdport=8888, Flags=1055744, walltime_accounting=0x2584a20) at
>     /home/wmason/gromacs-5.1.1/src/programs/mdrun/md.cpp:969
>
> 0x00000000004d4a64 in mdrunner (hw_opt=0x7fff6a2e1d58, fplog=0x3543750,
>     cr=0x3536fa0, nfile=35, fnm=0x7fff6a2e15f8, oenv=0x35436d0, bVerbose=0,
>     bCompact=1, nstglobalcomm=-1, ddxyz=0x7fff6a2e11bc, dd_node_order=1, rdd=0,
>     rconstr=0, dddlb_opt=0x1f3b10c "auto", dlb_scale=0.800000012, ddcsx=0x0,
>     ddcsy=0x0, ddcsz=0x0, nbpu_opt=0x1f3b10c "auto", nstlist_cmdline=0,
>     nsteps_cmdline=-2, nstepout=100, resetstep=-1, nmultisim=40, repl_ex_nst=500,
>     repl_ex_nex=0, repl_ex_seed=-1, pforce=-1, cpt_period=15, max_hours=-1,
>     imdport=8888, Flags=1055744) at
>     /home/wmason/gromacs-5.1.1/src/programs/mdrun/runner.cpp:1270
>
> 0x00000000004cb637 in gmx_mdrun (argc=15, argv=0x3531c20) at
>     /home/wmason/gromacs-5.1.1/src/programs/mdrun/mdrun.cpp:537
>
> 0x000000000050b26b in gmx::CommandLineModuleManager::runAsMainCMain (argc=15,
>     argv=0x7fffa95b5aa8, mainFunction=0x4c8d73 <gmx_mdrun(int, char**)>) at
>     /home/wmason/gromacs-5.1.1/src/gromacs/commandline/cmdlinemodulemanager.cpp:588
>
> 0x00000000004ba316 in main (argc=15, argv=0x7fff6a2e27f8) at
>     /home/wmason/gromacs-5.1.1/src/programs/mdrun_main.cpp:43
>
>
> The error is a classic segmentation fault, caused by accessing an array out
> of bounds. It's a bug in the Gromacs 5.1.1 code.
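For readers following the backtrace: the failing line, rbin.c:94 with "rbuf[i] = r[i]", sits inside a small copy loop. The sketch below is a simplified stand-in rather than the actual GROMACS source; the type and function names are invented, but it shows why a frame reporting r=0x0 means the process dies at exactly that assignment.

/* Simplified, hypothetical sketch of the failing pattern; the names do
 * not match the real GROMACS types or functions. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double *rbuf;   /* accumulated values              */
    int     nreal;  /* number of values held in rbuf   */
} t_bin_sketch;

static int add_binr_sketch(t_bin_sketch *b, int nr, const double *r)
{
    int     start = b->nreal;
    double *tmp   = realloc(b->rbuf, (b->nreal + nr) * sizeof(double));

    if (tmp == NULL)
    {
        return -1;  /* a quiet allocation failure would also corrupt later writes */
    }
    b->rbuf = tmp;
    for (int i = 0; i < nr; i++)
    {
        b->rbuf[start + i] = r[i];  /* segfaults here when r == NULL, i.e. r=0x0 as in frame [67] */
    }
    b->nreal += nr;
    return start;
}

int main(void)
{
    t_bin_sketch b      = { NULL, 0 };
    double       vir[9] = { 0 };     /* a 3x3 virial tensor, flattened */

    add_binr_sketch(&b, 9, vir);     /* fine: r points at real data */
    /* add_binr_sketch(&b, 9, NULL);    would reproduce the reported crash */
    printf("binned %d values\n", b.nreal);
    free(b.rbuf);
    return 0;
}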
> You will need to file a bug report with gromacs, and they will need your
> job input to reproduce the error, plus the backtrace info above, which
> I've summarized if you just want to read the code yourself:
>
>   "add_binr"                    in gromacs-5.1.1/src/gromacs/gmxlib/rbin.c:94
>   called from "global_stat"     in gromacs-5.1.1/src/gromacs/mdlib/stat.cpp:229
>   called from "compute_globals" in gromacs-5.1.1/src/gromacs/mdlib/md_support.cpp:342
>   called from "do_md"           in gromacs-5.1.1/src/programs/mdrun/md.cpp:969
>   called from "mdrunner"        in gromacs-5.1.1/src/programs/mdrun/runner.cpp:1270
>   called from "gmx_mdrun"       in gromacs-5.1.1/src/programs/mdrun/mdrun.cpp:537
>
> The code which fails takes a "bin" of the "gmx_global_stat" type, which
> holds an array of doubles and the size of that array, and tries to copy in
> data from a "tensor" of the force virial, during a step in which energy is
> computed globally.
>
> I don't have any clue what this means beyond what the low-level code says
> (I almost always have to read the source code when debugging, so I can try
> to explain what's happening). There are multiple possible causes for this
> error, such as:
>
> A. The memory allocation for the "bin" fails quietly: the array is not
>    resized (or not to the right size), and the error then occurs at the
>    next function which tries to write data there.
>
> B. The tensor is actually not the size "DIM*DIM" (3x3, if I'm reading
>    correctly) that the function expects. Accessing the source tensor array
>    out of bounds also generates this error.
>
> C. The tensor is actually a NULL pointer. This is the most likely
>    explanation, which one can see from the line:
>
>      add_binr (b=0x25f11c0, nr=9, r=0x0)
>                                   ^---- either r=NULL, or the debugger is
>                                         not reporting the value correctly.
>
> This would mean the program is calling "do_md" from "mdrunner" with bad
> parameters and not checking its parameters for errors. The error actually
> occurs at a higher level of the code, rather than at the low level where it
> is reported.
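To illustrate that last point with a hypothetical sketch (the flag name and function are invented and do not match the real GROMACS API): if the caller decides from its flags that a quantity should be summed, the pointer carrying that quantity must not be NULL, and the cheapest place to catch the mismatch is where that decision is made, not deep inside the copy loop.

/* Hypothetical caller-side precondition check, for illustration only. */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_REQUIRE_VIRIAL (1 << 4)   /* invented flag bit */

static void bin_virial_if_requested(int flags, const double *fvir /* 3x3, flattened */)
{
    if (flags & SKETCH_REQUIRE_VIRIAL)
    {
        /* Fail loudly at the level where the inconsistency arises, instead
         * of letting the low-level copy dereference a NULL pointer later. */
        assert(fvir != NULL && "flags request the virial but fvir is NULL");
        /* ... the equivalent of add_binr(bin, 9, fvir) would follow ... */
        (void)fvir;
    }
}

int main(void)
{
    double vir[9] = { 0 };

    bin_virial_if_requested(SKETCH_REQUIRE_VIRIAL, vir);    /* consistent call */
    /* bin_virial_if_requested(SKETCH_REQUIRE_VIRIAL, NULL);   would trip the assert */
    printf("preconditions satisfied\n");
    return 0;
}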
>
>
> On 11/17/15 3:20 PM, Justin Lemkul wrote:
> >
> > On 11/17/15 3:00 PM, Mark Abraham wrote:
> >> Hi,
> >>
> >> That is indeed strange. MPI_Allreduce isn't used in replica exchange, nor
> >> did the replica-exchange code change between 5.0.6 and 5.1, so the problem
> >> is elsewhere. You could try running with the environment variable
> >> GMX_CYCLE_BARRIER set to 1 (which might require you to tell mpirun that's
> >> what you want) so that we can localize which MPI_Allreduce is losing a
> >> process. Or any other way you might have available to get a stack trace
> >> from each process.
> >
> > Maybe related to this?
> >
> > http://redmine.gromacs.org/issues/1848
> >
> > -Justin
> >
> >> Mark
> >>
> >> On Tue, Nov 17, 2015 at 6:11 PM Krzysztof Kuczera <kkucz...@ku.edu>
> >> wrote:
> >>
> >>> Hi,
> >>> I am trying to run a temperature-exchange REMD simulation with GROMACS
> >>> 5.1 or 5.1.1, and my job is crashing in a way that is difficult to explain:
> >>> - the MD part works fine
> >>> - the crash occurs at the first replica-exchange attempt
> >>> - the error log contains a bunch of messages of the following type, which
> >>>   I suppose mean that the MPI communication did not work:
> >>>
> >>> NOTE: Turning on dynamic load balancing
> >>> Fatal error in MPI_Allreduce: A process has failed, error stack:
> >>> MPI_Allreduce(1421).......: MPI_Allreduce(sbuf=0x7fff5538018c,
> >>>   rbuf=0x28b2070, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000002) failed
> >>> MPIR_Allreduce_impl(1262).: MPIR_Allreduce_intra(497).:
> >>> MPIR_Bcast_binomial(245)..: dequeue_and_set_error(917): Communication
> >>>   error with rank 48
> >>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> >>> MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0x7fff31eb660c,
> >>>   rbuf=0x2852c00, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000001) failed
> >>> MPIR_Allreduce_impl(1262):
> >>> MPIR_Allreduce_intra(497):
> >>> MPIR_Bcast_binomial(316).: Failure during collective
> >>> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> >>> MPI_Allreduce(1421)......: MPI_Allreduce(sbuf=0x7fff2e54068c,
> >>>   rbuf=0x31e35a0, count=3, MPI_FLOAT, MPI_SUM, comm=0x84000001) failed
> >>>
> >>> Recently compiled slightly older versions like 5.0.6 do not have this
> >>> behavior. I have tried updating to the latest cmake, compiler and MPI
> >>> versions on our system, but it does not change things.
> >>> Does anyone have suggestions how to fix this?
> >>>
> >>> Thanks,
> >>> Krzysztof
> >>>
> >>> --
> >>> Krzysztof Kuczera
> >>> Departments of Chemistry and Molecular Biosciences
> >>> The University of Kansas
> >>> 1251 Wescoe Hall Drive, 5090 Malott Hall
> >>> Lawrence, KS 66045
> >>> Tel: 785-864-5060  Fax: 785-864-5396  email: kkucz...@ku.edu
> >>> http://oolung.chem.ku.edu/~kuczera/home.html
>
> --
> Krzysztof Kuczera
> Departments of Chemistry and Molecular Biosciences
> The University of Kansas
> 1251 Wescoe Hall Drive, 5090 Malott Hall
> Lawrence, KS 66045
> Tel: 785-864-5060  Fax: 785-864-5396  email: kkucz...@ku.edu
> http://oolung.chem.ku.edu/~kuczera/home.html
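Tying the two halves of the thread together: the MPI_Allreduce failures in the original report are the symptom seen by the surviving ranks, while the segmentation fault in the backtrace is the cause. The minimal MPI program below is a hypothetical demonstration, unrelated to the GROMACS source, of the same pattern: when one rank dies before a collective, the remaining ranks fail inside MPI_Allreduce with communication errors much like those in the log. Run it with, for example, "mpirun -np 4 ./allreduce_demo" under MPICH or OpenMPI.

/* Hypothetical demonstration, not GROMACS code: one rank dying before a
 * collective makes the other ranks fail inside MPI_Allreduce. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank = 0, one = 1, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        /* Stand-in for the segfault in add_binr(): rank 0 disappears. */
        fprintf(stderr, "rank 0: simulating a fatal error before the collective\n");
        abort();
    }

    /* The surviving ranks cannot complete the reduction without rank 0.
     * Depending on the MPI implementation, they either report errors like
     * "Communication error with rank ..." or are killed by the launcher. */
    MPI_Allreduce(&one, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: sum = %d\n", rank, sum);
    MPI_Finalize();
    return 0;
}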
--
Gromacs Users mailing list

* Please search the archive at
  http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
  https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
  send a mail to gmx-users-requ...@gromacs.org.