The fact that you are observing errors, that the energies are off by so much, and that it reproduces with multiple inputs suggests that this may not be a code issue. Did all of the runs that failed happen on the same hardware? Have you excluded the possibility that one of those GeForce cards may be flaky?
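One quick way to check is to pin the same short run to each card in turn and see whether the failures follow one particular GPU. A minimal sketch (untested; the step count and output names are arbitrary placeholders, and $TPR is the input from your command below):

```
# Run an identical short simulation once per GPU; a flaky card should be
# the only one whose run crashes or shows the huge potential energies.
for gpu in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$gpu gmx mdrun -s $TPR -deffnm gpucheck_$gpu \
        -ntmpi 1 -ntomp 3 -nb gpu -pme gpu -nsteps 50000
done
```

If only one device misbehaves, that points at hardware rather than at GROMACS.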
--
Szilárd

On Wed, Dec 4, 2019 at 7:47 PM Chenou Zhang <czhan...@asu.edu> wrote:

> We tried the same gmx settings in 2019.4 with different protein systems,
> and we got the same weird potential-energy jump within 1000 steps:
>
> ```
>            Step           Time
>               0        0.00000
>
>    Energies (kJ/mol)
>            Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
>     2.08204e+04    9.92358e+04    6.53063e+04    1.06706e+03   -2.75672e+02
>           LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
>     1.50031e+04   -4.86857e+04    3.10386e+04   -1.09745e+06    4.81832e+03
>       Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
>    -9.09123e+05    2.80635e+05   -6.28487e+05   -6.28428e+05    3.04667e+02
>  Pressure (bar)   Constr. rmsd
>    -1.56013e+00    3.60634e-06
>
> DD  step 999  load imb.: force 14.6%  pme mesh/force 0.581
>
>            Step           Time
>            1000        2.00000
>
>    Energies (kJ/mol)
>            Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
>     2.04425e+04    9.92768e+04    6.52873e+04    1.02016e+03   -2.45851e+02
>           LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
>     1.49863e+04   -4.91092e+04    3.10572e+04   -1.09508e+06    4.97942e+03
>       Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
>     1.35726e+35    2.77598e+05    1.35726e+35    1.35726e+35    3.01370e+02
>  Pressure (bar)   Constr. rmsd
>    -7.55250e+01    3.63239e-06
>
> DD  step 1999  load imb.: force 16.1%  pme mesh/force 0.598
>
>            Step           Time
>            2000        4.00000
>
>    Energies (kJ/mol)
>            Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
>     1.99521e+04    9.97482e+04    6.49595e+04    1.00798e+03   -2.42567e+02
>           LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
>     1.50156e+04   -4.85324e+04    3.01944e+04   -1.09620e+06    4.82958e+03
>       Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
>     1.35726e+35    2.79206e+05    1.35726e+35    1.35726e+35    3.03115e+02
>  Pressure (bar)   Constr. rmsd
>    -5.50508e+01    3.64353e-06
>
> DD  step 2999  load imb.: force 16.6%  pme mesh/force 0.602
>
>            Step           Time
>            3000        6.00000
>
>    Energies (kJ/mol)
>            Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
>     1.98590e+04    9.88100e+04    6.50934e+04    1.07048e+03   -2.38831e+02
>           LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
>     1.49609e+04   -4.93079e+04    3.12273e+04   -1.09582e+06    4.83209e+03
>       Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
>     1.35726e+35    2.79438e+05    1.35726e+35    1.35726e+35    3.03367e+02
>  Pressure (bar)   Constr. rmsd
>     7.62438e+01    3.61574e-06
> ```
>
> On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham <mark.j.abra...@gmail.com> wrote:
>
> > Hi,
> >
> > What driver version is reported in the respective log files? Does the
> > error persist if mdrun -notunepme is used?
> >
> > Mark
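For concreteness, trying Mark's suggestion only means appending -notunepme, which disables the PME grid autotuning, to the command quoted below; everything else stays as in the original post:

```
gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu \
    -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 \
    -maxh $HOURS -cpt 60 -cpi -noappend -notunepme
```

This is worth trying because the Slurm segfault quoted below occurs during the PME tuning phase ("timed with pme grid ..."), so disabling the autotuner rules it in or out quickly.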
> > On Mon., 2 Dec. 2019, 21:18 Chenou Zhang, <czhan...@asu.edu> wrote:
>
> > > Hi Gromacs developers,
> > >
> > > I'm currently running GROMACS 2019.4 on our university's HPC cluster.
> > > To fully utilize the GPU nodes, I followed the notes at
> > > http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html
> > >
> > > Here is the command I used for my runs:
> > >
> > > ```
> > > gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu \
> > >     -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 \
> > >     -maxh $HOURS -cpt 60 -cpi -noappend
> > > ```
> > >
> > > Some of those runs fail with the following error:
> > >
> > > ```
> > > -------------------------------------------------------
> > > Program:     gmx mdrun, version 2019.4
> > > Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
> > > MPI rank:    3 (out of 8)
> > >
> > > Fatal error:
> > > cudaStreamSynchronize failed: an illegal memory access was encountered
> > >
> > > For more information and tips for troubleshooting, please check the
> > > GROMACS website at http://www.gromacs.org/Documentation/Errors
> > > ```
> > >
> > > We also got a different error from the Slurm system:
> > >
> > > ```
> > > step 4400: timed with pme grid 96 96 60, coulomb cutoff 1.446: 467.9 M-cycles
> > > step 4600: timed with pme grid 96 96 64, coulomb cutoff 1.372: 451.4 M-cycles
> > > /var/spool/slurmd/job2321134/slurm_script: line 44: 29866 Segmentation fault
> > > gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu
> > > -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS
> > > -cpt 60 -cpi -noappend
> > > ```
> > >
> > > We first thought this could be due to a compiler issue, so we tried the
> > > following build configurations:
> > >
> > > ```
> > > ===test1===
> > > module load cuda/9.2.88.1
> > > module load gcc/7.3.0
> > > . /home/rsexton2/Library/gromacs/2019.4/test1/bin/GMXRC
> > > ===test2===
> > > module load cuda/9.2.88.1
> > > module load gcc/6x
> > > . /home/rsexton2/Library/gromacs/2019.4/test2/bin/GMXRC
> > > ===test3===
> > > module load cuda/9.2.148
> > > module load gcc/7.3.0
> > > . /home/rsexton2/Library/gromacs/2019.4/test3/bin/GMXRC
> > > ===test4===
> > > module load cuda/9.2.148
> > > module load gcc/6x
> > > . /home/rsexton2/Library/gromacs/2019.4/test4/bin/GMXRC
> > > ===test5===
> > > module load cuda/9.1.85
> > > module load gcc/6x
> > > . /home/rsexton2/Library/gromacs/2019.4/test5/bin/GMXRC
> > > ===test6===
> > > module load cuda/9.0.176
> > > module load gcc/6x
> > > . /home/rsexton2/Library/gromacs/2019.4/test6/bin/GMXRC
> > > ===test7===
> > > module load cuda/9.2.88.1
> > > module load gccgpu/7.4.0
> > > . /home/rsexton2/Library/gromacs/2019.4/test7/bin/GMXRC
> > > ```
> > >
> > > However, we still ended up with the same errors shown above. Does
> > > anyone know where the cudaStreamSynchronize error comes from? Or am I
> > > using those gmx GPU options incorrectly?
> > >
> > > Any input will be appreciated!
> > >
> > > Thanks!
> > > Chenou