On Thu, Oct 17, 2013 at 3:47 PM, Bishesh Khanal <[email protected]> wrote:
>
> On Thu, Oct 17, 2013 at 3:00 PM, Jed Brown <[email protected]> wrote:
>
>> Bishesh Khanal <[email protected]> writes:
>> > The program crashes only for a bigger domain size. Even in the cluster,
>> > it does not crash for domain sizes up to a certain point. So I need to
>> > run in the debugger for the case when it crashes to get the stack trace
>> > from the SEGV, right ? I do not know how to attach a debugger when
>> > submitting a job to the cluster, if that is possible at all!
>>
>> Most machines allow you to get "interactive" sessions. You can usually
>> run debuggers within those. Some facilities also have commercial
>> debuggers.
>
> Thanks, I'll have a look at that.
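> (I also notice that the PETSc output below suggests -start_in_debugger and
> -on_error_attach_debugger. Am I right that, inside such an interactive
> session, I could just add one of these to the usual launch line, e.g.
> mpiexec -n 4 ./AdLemMain -on_error_attach_debugger ?)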
>
>> > Or are you asking me to run the program in the debugger in my laptop
>> > for the biggest size ? (I have not tried running the code for the
>> > biggest size in my laptop, fearing it might take forever)
>>
>> Your laptop probably doesn't have enough memory for that.
>
> Yes, I tried it just a while ago and this is what happened, I think. (Just
> to confirm, I have put the error message for this case at the very end of
> this reply.*)
>
>> Can you try running on the cluster with one MPI rank per node? We
>> should rule out simple out-of-memory problems, confirm that the code
>> executes correctly with MPICH, and finally figure out why it fails with
>> Open MPI (assuming that the previous hunch was correct).
>
> I tried running on the cluster with one core per node on 4 nodes, and I got
> the following errors at the very end (note: using valgrind and the
> cluster's Open MPI), after the many usual "unconditional jump ..." errors;
> these might be interesting:
>
> mpiexec: killing job...
> mpiexec: abort is already in progress...hit ctrl-c again to forcibly terminate
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 59.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [0]PETSC ERROR:       INSTEAD the line number of the start of the function
> [0]PETSC ERROR:       is given.
> [0]PETSC ERROR: [0] MatSetValues_MPIAIJ line 505 /tmp/petsc-3.4.3/src/mat/impls/aij/mpi/mpiaij.c
> [0]PETSC ERROR: [0] MatSetValues line 1071 /tmp/petsc-3.4.3/src/mat/interface/matrix.c
> [0]PETSC ERROR: [0] MatSetValuesLocal line 1935 /tmp/petsc-3.4.3/src/mat/interface/matrix.c
> [0]PETSC ERROR: [0] DMCreateMatrix_DA_3d_MPIAIJ line 1051 /tmp/petsc-3.4.3/src/dm/impls/da/fdda.c
> [0]PETSC ERROR: [0] DMCreateMatrix_DA line 627 /tmp/petsc-3.4.3/src/dm/impls/da/fdda.c
> [0]PETSC ERROR: [0] DMCreateMatrix line 900 /tmp/petsc-3.4.3/src/dm/interface/dm.c
> [0]PETSC ERROR: [0] KSPSetUp line 192 /tmp/petsc-3.4.3/src/ksp/ksp/interface/itfunc.c
> [0]PETSC ERROR: [0] solveModel line 122 "unknowndirectory/"/epi/asclepios2/bkhanal/works/AdLemModel/src/PetscAdLemTaras3D.cxx
> [0]PETSC ERROR: --------------------- Error Message ------------------------------------
> [0]PETSC ERROR: Signal received!
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Petsc Release Version 3.4.3, Oct, 15, 2013
> [0]PETSC ERROR: See docs/changes/index.html for recent updates.
> [0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [0]PETSC ERROR: See docs/index.html for manual pages.
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: /epi/asclepios2/bkhanal/works/AdLemModel/build/src/AdLemMain on a arch-linux2-cxx-debug named nef002 by bkhanal Thu Oct 17 15:55:33 2013
> [0]PETSC ERROR: Libraries linked from /epi/asclepios2/bkhanal/petscDebug/lib
> [0]PETSC ERROR: Configure run at Wed Oct 16 14:18:48 2013
> [0]PETSC ERROR: Configure options --with-mpi-dir=/opt/openmpi-gcc/current/ --with-shared-libraries --prefix=/epi/asclepios2/bkhanal/petscDebug -download-f-blas-lapack=1 --download-metis --download-parmetis --download-superlu_dist --download-scalapack --download-mumps --download-hypre --with-clanguage=cxx
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
> ==47363==
> ==47363== HEAP SUMMARY:
> ==47363==     in use at exit: 10,939,838,029 bytes in 8,091 blocks
> ==47363==   total heap usage: 1,936,963 allocs, 1,928,872 frees, 11,530,164,042 bytes allocated
> ==47363==
>
> Does this mean it is crashing near MatSetValues_MPIAIJ ?
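> (In case it helps to narrow this down, I suppose I could also create the
> matrix myself before calling KSPSetUp, so that the failing call happens at
> a line in my own code rather than inside KSPSetUp. A two-line sketch of
> what I mean, where mDa is just an illustrative name for my DMDA:
>
>   Mat A;
>   ierr = DMCreateMatrix(mDa, MATMPIAIJ, &A);CHKERRQ(ierr); /* same path the
>     stack shows: DMCreateMatrix -> DMCreateMatrix_DA_3d_MPIAIJ */
>
> Is that a sensible way to isolate it?)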
>
> I'm sorry, but I'm a complete beginner with MPI and clusters; what does
> "one MPI rank per node" mean, and what should I do to achieve that ? My
> guess is that I set one core per node and use multiple nodes in my job
> script file ? Or do I need to do something in the petsc code ?
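> (From the mpiexec man pages, my guess at the launch line would be
> something like the following; this is unverified, so please correct me if
> it is wrong:
>
>   mpiexec -npernode 1 -n 4 ./AdLemMain ...   # Open MPI
>   mpiexec -ppn 1 -n 4 ./AdLemMain ...        # MPICH (hydra)
>
> together with requesting 4 nodes with 1 core each from the batch system.)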
>
> *Here is the error I get when running for the full domain size in my laptop:
>
> [3]PETSC ERROR: --------------------- Error Message ------------------------------------
> [3]PETSC ERROR: Out of memory. This could be due to allocating
> [3]PETSC ERROR: too large an object or bleeding by not properly
> [3]PETSC ERROR: destroying unneeded objects.
> [1]PETSC ERROR: Memory allocated 0 Memory used by process 1700159488
> [1]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
> [1]PETSC ERROR: Memory requested 6234924800!
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: Petsc Release Version 3.4.3, Oct, 15, 2013
> [1]PETSC ERROR: See docs/changes/index.html for recent updates.
> [1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [1]PETSC ERROR: See docs/index.html for manual pages.
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: [2]PETSC ERROR: Memory allocated 0 Memory used by process 1695793152
> [2]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
> [2]PETSC ERROR: Memory requested 6223582208!
> [2]PETSC ERROR: ------------------------------------------------------------------------
> [2]PETSC ERROR: Petsc Release Version 3.4.3, Oct, 15, 2013
> [2]PETSC ERROR: See docs/changes/index.html for recent updates.
> [2]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [2]PETSC ERROR: See docs/index.html for manual pages.
> [2]PETSC ERROR: ------------------------------------------------------------------------
> [2]PETSC ERROR: src/AdLemMain on a arch-linux2-cxx-debug named edwards by bkhanal Thu Oct 17 15:19:22 2013
> [1]PETSC ERROR: Libraries linked from /home/bkhanal/Documents/softwares/petsc-3.4.3/arch-linux2-cxx-debug/lib
> [1]PETSC ERROR: Configure run at Wed Oct 16 15:13:05 2013
> [1]PETSC ERROR: Configure options --download-mpich -download-f-blas-lapack=1 --download-metis --download-parmetis --download-superlu_dist --download-scalapack --download-mumps --download-hypre --with-clanguage=cxx
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: PetscMallocAlign() line 46 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/sys/memory/mal.c
> src/AdLemMain on a arch-linux2-cxx-debug named edwards by bkhanal Thu Oct 17 15:19:22 2013
> [2]PETSC ERROR: Libraries linked from /home/bkhanal/Documents/softwares/petsc-3.4.3/arch-linux2-cxx-debug/lib
> [2]PETSC ERROR: Configure run at Wed Oct 16 15:13:05 2013
> [2]PETSC ERROR: Configure options --download-mpich -download-f-blas-lapack=1 --download-metis --download-parmetis --download-superlu_dist --download-scalapack --download-mumps --download-hypre --with-clanguage=cxx
> [2]PETSC ERROR: ------------------------------------------------------------------------
> [2]PETSC ERROR: PetscMallocAlign() line 46 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/sys/memory/mal.c
> [1]PETSC ERROR: MatSeqAIJSetPreallocation_SeqAIJ() line 3551 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/seq/aij.c
> [1]PETSC ERROR: MatSeqAIJSetPreallocation() line 3496 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/seq/aij.c
> [2]PETSC ERROR: MatSeqAIJSetPreallocation_SeqAIJ() line 3551 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/seq/aij.c
> [2]PETSC ERROR: MatSeqAIJSetPreallocation() line 3496 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/seq/aij.c
> [1]PETSC ERROR: MatMPIAIJSetPreallocation_MPIAIJ() line 3307 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/mpi/mpiaij.c
> [1]PETSC ERROR: MatMPIAIJSetPreallocation() line 4015 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/mpi/mpiaij.c
> [2]PETSC ERROR: MatMPIAIJSetPreallocation_MPIAIJ() line 3307 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/mpi/mpiaij.c
> [2]PETSC ERROR: MatMPIAIJSetPreallocation() line 4015 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/mpi/mpiaij.c
> [0]PETSC ERROR: --------------------- Error Message ------------------------------------
> [0]PETSC ERROR: Out of memory. This could be due to allocating
> [0]PETSC ERROR: too large an object or bleeding by not properly
> [0]PETSC ERROR: destroying unneeded objects.
> [2]PETSC ERROR: [1]PETSC ERROR: DMCreateMatrix_DA_3d_MPIAIJ() line 1101 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/impls/da/fdda.c
> [1]PETSC ERROR: DMCreateMatrix_DA_3d_MPIAIJ() line 1101 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/impls/da/fdda.c
> [2]PETSC ERROR: DMCreateMatrix_DA() line 771 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/impls/da/fdda.c
> DMCreateMatrix_DA() line 771 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/impls/da/fdda.c
> [3]PETSC ERROR: Memory allocated 0 Memory used by process 1675407360
> [3]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
> [3]PETSC ERROR: Memory requested 6166659200!
> [3]PETSC ERROR: ------------------------------------------------------------------------
> [3]PETSC ERROR: Petsc Release Version 3.4.3, Oct, 15, 2013
> [3]PETSC ERROR: See docs/changes/index.html for recent updates.
> [3]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [3]PETSC ERROR: See docs/index.html for manual pages.
> [3]PETSC ERROR: ------------------------------------------------------------------------
> [3]PETSC ERROR: src/AdLemMain on a arch-linux2-cxx-debug named edwards by bkhanal Thu Oct 17 15:19:22 2013
> [3]PETSC ERROR: Libraries linked from /home/bkhanal/Documents/softwares/petsc-3.4.3/arch-linux2-cxx-debug/lib
> [3]PETSC ERROR: Configure run at Wed Oct 16 15:13:05 2013
> [3]PETSC ERROR: Configure options --download-mpich -download-f-blas-lapack=1 --download-metis --download-parmetis --download-superlu_dist --download-scalapack --download-mumps --download-hypre --with-clanguage=cxx
> [3]PETSC ERROR: ------------------------------------------------------------------------
> [3]PETSC ERROR: [1]PETSC ERROR: DMCreateMatrix() line 910 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/interface/dm.c
> [2]PETSC ERROR: DMCreateMatrix() line 910 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/interface/dm.c
> PetscMallocAlign() line 46 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/sys/memory/mal.c
> [3]PETSC ERROR: MatSeqAIJSetPreallocation_SeqAIJ() line 3551 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/seq/aij.c
> [3]PETSC ERROR: MatSeqAIJSetPreallocation() line 3496 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/seq/aij.c
> [1]PETSC ERROR: KSPSetUp() line 207 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/ksp/ksp/interface/itfunc.c
> [2]PETSC ERROR: KSPSetUp() line 207 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/ksp/ksp/interface/itfunc.c
> [3]PETSC ERROR: MatMPIAIJSetPreallocation_MPIAIJ() line 3307 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/mpi/mpiaij.c
> [3]PETSC ERROR: MatMPIAIJSetPreallocation() line 4015 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/mpi/mpiaij.c
> [3]PETSC ERROR: DMCreateMatrix_DA_3d_MPIAIJ() line 1101 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/impls/da/fdda.c
> [3]PETSC ERROR: DMCreateMatrix_DA() line 771 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/impls/da/fdda.c
> [3]PETSC ERROR: DMCreateMatrix() line 910 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/interface/dm.c
> [3]PETSC ERROR: KSPSetUp() line 207 in /home/bkhanal/Documents/softwares/petsc-3.4.3/src/ksp/ksp/interface/itfunc.c
> [1]PETSC ERROR: solveModel() line 128 in "unknowndirectory/"/user/bkhanal/home/works/AdLemModel/src/PetscAdLemTaras3D.cxx
> [2]PETSC ERROR: solveModel() line 128 in "unknowndirectory/"/user/bkhanal/home/works/AdLemModel/src/PetscAdLemTaras3D.cxx
"unknowndirectory/"/user/bkhanal/home/works/AdLemModel/src/PetscAdLemTaras3D.cxx > [3]PETSC ERROR: solveModel() line 128 in > "unknowndirectory/"/user/bkhanal/home/works/AdLemModel/src/PetscAdLemTaras3D.cxx > [0]PETSC ERROR: Memory allocated 0 Memory used by process 1711476736 > [0]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info. > [0]PETSC ERROR: Memory requested 6292477952! > [0]PETSC ERROR: > ------------------------------------------------------------------------ > [0]PETSC ERROR: Petsc Release Version 3.4.3, Oct, 15, 2013 > [0]PETSC ERROR: See docs/changes/index.html for recent updates. > [0]PETSC ERROR: See docs/faq.html for hints about trouble shooting. > [0]PETSC ERROR: See docs/index.html for manual pages. > [0]PETSC ERROR: > ------------------------------------------------------------------------ > [0]PETSC ERROR: src/AdLemMain on a arch-linux2-cxx-debug named edwards by > bkhanal Thu Oct 17 15:19:22 2013 > [0]PETSC ERROR: Libraries linked from > /home/bkhanal/Documents/softwares/petsc-3.4.3/arch-linux2-cxx-debug/lib > [0]PETSC ERROR: Configure run at Wed Oct 16 15:13:05 2013 > [0]PETSC ERROR: Configure options --download-mpich > -download-f-blas-lapack=1 --download-metis --download-parmetis > --download-superlu_dist --download-scalapack --download-mumps > --download-hypre --with-clanguage=cxx > [0]PETSC ERROR: > ------------------------------------------------------------------------ > [0]PETSC ERROR: PetscMallocAlign() line 46 in > /home/bkhanal/Documents/softwares/petsc-3.4.3/src/sys/memory/mal.c > [0]PETSC ERROR: MatSeqAIJSetPreallocation_SeqAIJ() line 3551 in > /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/seq/aij.c > [0]PETSC ERROR: MatSeqAIJSetPreallocation() line 3496 in > /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/seq/aij.c > [0]PETSC ERROR: MatMPIAIJSetPreallocation_MPIAIJ() line 3307 in > /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/mpi/mpiaij.c > [0]PETSC ERROR: MatMPIAIJSetPreallocation() line 4015 in > /home/bkhanal/Documents/softwares/petsc-3.4.3/src/mat/impls/aij/mpi/mpiaij.c > [0]PETSC ERROR: DMCreateMatrix_DA_3d_MPIAIJ() line 1101 in > /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/impls/da/fdda.c > [0]PETSC ERROR: DMCreateMatrix_DA() line 771 in > /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/impls/da/fdda.c > [0]PETSC ERROR: DMCreateMatrix() line 910 in > /home/bkhanal/Documents/softwares/petsc-3.4.3/src/dm/interface/dm.c > [0]PETSC ERROR: KSPSetUp() line 207 in > /home/bkhanal/Documents/softwares/petsc-3.4.3/src/ksp/ksp/interface/itfunc.c > [0]PETSC ERROR: solveModel() line 128 in > "unknowndirectory/"/user/bkhanal/home/works/AdLemModel/src/PetscAdLemTaras3D.cxx > --9345:0:aspacem Valgrind: FATAL: VG_N_SEGMENTS is too low. > --9345:0:aspacem Increase it and rebuild. Exiting now. > --9344:0:aspacem Valgrind: FATAL: VG_N_SEGMENTS is too low. > --9344:0:aspacem Increase it and rebuild. Exiting now. > --9343:0:aspacem Valgrind: FATAL: VG_N_SEGMENTS is too low. > --9343:0:aspacem Increase it and rebuild. Exiting now. > --9346:0:aspacem Valgrind: FATAL: VG_N_SEGMENTS is too low. > --9346:0:aspacem Increase it and rebuild. Exiting now. > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = EXIT CODE: 1 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > > >
