Re: [gmx-users] Hardware-specific crash with 4.5.1

Justin A. Lemkul Mon, 27 Sep 2010 18:11:36 -0700


Roland Schulz wrote:

Justin,
I think the interaction kernel is not OK on your PowerPC machine. I assume that from: 1) The force seems to be zero (minimization output). 2) When you use the all-to-all kernel, which is not available for the powerpc kernel, it automatically falls back to the C kernel and then it works.


Sounds about right.

What is the kernel you are using? It should say in the log file. Look for: "Configuring single precision IBM Power6-specific Fortran kernels" or "Testing Altivec/VMX support"


I'm not finding either in the config.log - weird?

You can also look in the config.h whether GMX_POWER6 and/or GMX_PPC_ALTIVEC is set. I suggest you try to compile with one/both of them deactivated and see whether that solves it. This will make it slower too. Thus if this is indeed the problem, you will probably want to figure out why the fastest kernel doesn't work correctly to get good performance.

It looks like GMX_PPC_ALTIVEC is set. I suppose I could re-compile with this turned off.
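
For anyone repeating the config.h check described above, a minimal sketch follows. The function name `active_ppc_kernels` and the example path are assumptions made for illustration; only the two macro names come from this thread. It prints the macros that are actually defined and skips commented-out `#undef` lines.

```shell
# Print the PowerPC kernel macros that are really defined in a config.h.
# Commented-out entries such as "/* #undef GMX_POWER6 */" are not matched,
# because the pattern requires the line to start with "#define".
# active_ppc_kernels is a hypothetical helper name, not part of GROMACS.
active_ppc_kernels() {
    grep -E '^[[:space:]]*#[[:space:]]*define[[:space:]]+(GMX_POWER6|GMX_PPC_ALTIVEC)([[:space:]]|$)' "$1"
}

# Example call (the location of config.h in the build tree is an assumption):
# active_ppc_kernels src/config.h
```

The function exits with a non-zero status when neither macro is defined, so it can also be used in an `if` test.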

Here's what's even weirder. The problematic version was compiled using the standard autoconf procedure. If I use a CMake-compiled version, the energy minimization runs fine, giving the same results (energy and force) as the two systems I know are good. So I guess there's something wrong with the way autoconf installed Gromacs. Perhaps this isn't of concern since Gromacs will require CMake in subsequent releases, but I figure I should at least report it in case it affects anyone else.

If I may tack one more question on here, I'm wondering why my CMake installation doesn't actually appear to be using MPI. I get the right result, but the problem is, I get a .log, .edr, and .trr for every processor that's being used, as if each processor is being given its own job and not distributing the work. Here's how I compiled my MPI mdrun, version 4.5.1:

cmake ../gromacs-4.5.1 \
  -DFFTW3F_LIBRARIES=/home/rdiv1001/fftw-3.0.1-osx/lib/libfftw3f.a \
  -DFFTW3F_INCLUDE_DIR=/home/rdiv1001/fftw-3.0.1-osx/include/ \
  -DCMAKE_INSTALL_PREFIX=/home/rdiv1001/gromacs-4.5_cmake-osx \
  -DGMX_BINARY_SUFFIX=_4.5_cmake_mpi -DGMX_THREADS=OFF -DBUILD_SHARED_LIBS=OFF \
  -DGMX_X11=OFF -DGMX_MPI=ON \
  -DMPI_COMPILER=/home/rdiv1001/compilers/openmpi-1.2.3-osx/bin/mpicxx \
  -DMPI_INCLUDE_PATH=/home/rdiv1001/compilers/openmpi-1.2.3-osx/include


$ make mdrun

$ make install-mdrun

Is there anything obviously wrong with those commands? Is there any way I should know (before actually using mdrun) whether or not I've done things right?
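
One partial check that can be made before running mdrun is to look at what CMake recorded in the build directory. The sketch below is an illustration only: the helper name `show_parallel_options` is hypothetical, and it relies solely on the standard `NAME:TYPE=VALUE` layout of CMakeCache.txt and on the `GMX_MPI` and `GMX_THREADS` options used in the command above.

```shell
# Print the cached values of the options that control how mdrun is parallelized.
# $1 is the CMake build directory, i.e. the one containing CMakeCache.txt.
# show_parallel_options is a hypothetical helper name.
show_parallel_options() {
    grep -E '^(GMX_MPI|GMX_THREADS):' "$1/CMakeCache.txt"
}

# Example call from inside the build directory:
# show_parallel_options .
```

This only confirms what was requested at configure time. It does not show whether the resulting binary is launched by an MPI runtime that matches the library it was built against.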


-Justin

Roland

On Mon, Sep 27, 2010 at 4:59 PM, Justin A. Lemkul <[email protected]> wrote:



    Hi All,

    I'm hoping I might get some tips in tracking down the source of an
    issue that appears to be hardware-specific, leading to crashes in my
    system.  The failures are occurring on our supercomputer (Mac OSX
    10.3, PowerPC).  Running the same .tpr file on my laptop (Mac OSX
    10.5.8, Intel Core2Duo) and on another workstation (Ubuntu 10.04,
    AMD64) produces identical results.  I suspect the problem stems from
    unsuccessful energy minimization, which then leads to a crash when
    running full MD.  All jobs were run in parallel on two cores.  The
    supercomputer does not support threading, so MPI is invoked using
    MPICH-1.2.5 (native MPI implementation on the cluster).


    Details as follows:

    EM md.log file: successful run (Intel Core2Duo or AMD64)

    Steepest Descents converged to Fmax < 1000 in 7 steps
    Potential Energy  = -4.8878180e+04
    Maximum force     =  8.7791553e+02 on atom 5440
    Norm of force     =  1.1781271e+02


    EM md.log file: unsuccessful run (PowerPC)

    Steepest Descents converged to Fmax < 1000 in 1 steps
    Potential Energy  = -2.4873273e+04
    Maximum force     =  0.0000000e+00 on atom 0
    Norm of force     =            nan
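
A failed minimization like the one above can be caught before MD is started by scanning the log for a NaN force. This is a minimal sketch, assuming the summary lines are labelled exactly as in the excerpts above; the helper name `em_force_is_nan` and the file name in the example are hypothetical.

```shell
# Succeed (exit status 0) when the minimization summary in a log file reports
# a NaN value on the "Maximum force" or "Norm of force" line.
# em_force_is_nan is a hypothetical helper name.
em_force_is_nan() {
    grep -E '^[[:space:]]*(Maximum force|Norm of force)[[:space:]]*=' "$1" | grep -qiw 'nan'
}

# Example call (em.log is an assumed file name):
# em_force_is_nan em.log && echo "EM reported NaN forces; do not start MD from this structure"
```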


    MD invoked from the minimized structure generated on my laptop or
    AMD64 runs successfully (at least for a few hundred steps in my
    test), but the MD on the PowerPC cluster fails immediately:

              Step           Time         Lambda
                 0        0.00000        0.00000

      Energies (kJ/mol)
                 U-B     Proper Dih.   Improper Dih.       CMAP Dih. GB Polarization
         7.93559e+03     9.34958e+03     2.24036e+02    -2.47750e+03    -7.83599e+04
               LJ-14      Coulomb-14         LJ (SR)    Coulomb (SR)       Potential
         7.70042e+03     9.94520e+04    -1.17168e+04    -5.79783e+04    -2.55780e+04
         Kinetic En.    Total Energy     Temperature  Pressure (bar)    Constr. rmsd
                 nan             nan             nan     0.00000e+00             nan
       Constr.2 rmsd
                 nan

    DD  step 9 load imb.: force  3.0%


    -------------------------------------------------------
    Program mdrun_4.5.1_mpi, VERSION 4.5.1
    Source code file: nsgrid.c, line: 601

    Range checking error:
    Explanation: During neighborsearching, we assign each particle to a grid
    based on its coordinates. If your system contains collisions or parameter
    errors that give particles very high velocities you might end up with some
    coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
    put these on a grid, so this is usually where we detect those errors.
    Make sure your system is properly energy-minimized and that the potential
    energy seems reasonable before trying again.
    Variable ind has value 7131. It should have been within [ 0 .. 7131 ]

    For more information and tips for troubleshooting, please check the GROMACS
    website at http://www.gromacs.org/Documentation/Errors
    -------------------------------------------------------

    It seems as if the crash really shouldn't be happening, if the value
    range is inclusive.

    Running with all-vs-all kernels works, but the performance is
    horrendously slow (<300 ps per day for a 7131-atom system) so I am
    attempting to use long cutoffs (2.0 nm) as others on the list have
    suggested.

    Details of the installations and .mdp files are appended below.

    -Justin

    === em.mdp ===
    ; Run parameters
    integrator      = steep         ; EM
    emstep      = 0.005
    emtol       = 1000
    nsteps      = 50000
    nstcomm         = 1
    comm_mode   = angular       ; non-periodic system
    ; Bond parameters
    constraint_algorithm    = lincs
    constraints             = all-bonds
    continuation    = no            ; starting up
    ; required cutoffs for implicit
    nstlist         = 1
    ns_type         = grid
    rlist           = 2.0
    rcoulomb        = 2.0
    rvdw            = 2.0
    ; cutoffs required for qq and vdw
    coulombtype     = cut-off
    vdwtype     = cut-off
    ; temperature coupling
    tcoupl          = no
    ; Pressure coupling is off
    Pcoupl          = no
    ; Periodic boundary conditions are off for implicit
    pbc                 = no
    ; Settings for implicit solvent
    implicit_solvent    = GBSA
    gb_algorithm        = OBC
    rgbradii            = 2.0


    === md.mdp ===

    ; Run parameters
    integrator      = sd            ; velocity Langevin dynamics
    dt                  = 0.002
    nsteps          = 2500000               ; 5000 ps (5 ns)
    nstcomm         = 1
    comm_mode   = angular       ; non-periodic system
    ; Output parameters
    nstxout         = 0             ; nst[xvf]out = 0 to suppress useless .trr output
    nstvout         = 0
    nstfout         = 0
    nstlog      = 5000          ; 10 ps
    nstenergy   = 5000          ; 10 ps
    nstxtcout   = 5000          ; 10 ps
    ; Bond parameters
    constraint_algorithm    = lincs
    constraints             = all-bonds
    continuation    = no            ; starting up
    ; required cutoffs for implicit
    nstlist         = 10
    ns_type         = grid
    rlist           = 2.0
    rcoulomb        = 2.0
    rvdw            = 2.0
    ; cutoffs required for qq and vdw
    coulombtype     = cut-off
    vdwtype     = cut-off
    ; temperature coupling
    tc_grps         = System
    tau_t           = 1.0   ; inverse friction coefficient for Langevin (ps^-1)
    ref_t           = 310
    ; Pressure coupling is off
    Pcoupl          = no
    ; Generate velocities is on
    gen_vel         = yes
    gen_temp        = 310
    gen_seed        = 173529
    ; Periodic boundary conditions are off for implicit
    pbc                 = no
    ; Free energy must be off to use all-vs-all kernels
    ; default, but just for the sake of being pedantic
    free_energy = no
    ; Settings for implicit solvent
    implicit_solvent    = GBSA
    gb_algorithm        = OBC
    rgbradii            = 2.0


    === Installation commands for the cluster ===

    $ ./configure --prefix=/home/rdiv1001/gromacs-4.5
    CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include"
    LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" --disable-threads
    --without-x --program-suffix=_4.5.1_s

    $ make

    $ make install

    $ make distclean

    $ ./configure --prefix=/home/rdiv1001/gromacs-4.5
    CPPFLAGS="-I/home/rdiv1001/fftw-3.0.1-osx/include"
    LDFLAGS="-L/home/rdiv1001/fftw-3.0.1-osx/lib" --disable-threads
    --without-x --program-suffix=_4.5.1_mpi --enable-mpi
    CXXCPP="/nfs/compilers/mpich-1.2.5/bin/mpicxx -E"

    $ make mdrun

    $ make install-mdrun

    --
    ========================================


    Justin A. Lemkul
    Ph.D. Candidate
    ICTAS Doctoral Scholar
    MILES-IGERT Trainee
    Department of Biochemistry
    Virginia Tech
    Blacksburg, VA
    jalemkul[at]vt.edu | (540) 231-9080
    http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

    ========================================

    --
    gmx-users mailing list    [email protected]
    http://lists.gromacs.org/mailman/listinfo/gmx-users
    Please search the archive at
    http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
    Please don't post (un)subscribe requests to the list. Use the www
    interface or send it to [email protected].
    Can't post? Read http://www.gromacs.org/Support/Mailing_Lists




--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309


--
========================================

Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================