On Fri, 27 Apr 2012, Andrew Spott wrote:

> I'm honestly stumped.
>
> I have some petsc code that essentially just populates a matrix in parallel,
> then puts it in a file. All my code that uses floating point computations is
> checked for NaN's and infinities and it doesn't seem to show up. However,
> when I run it on more than 4 cores, I get floating point exceptions that kill
> the program. I tried turning off the exceptions from petsc, but the program
> still dies from them, just without the petsc error message.
>
> I honestly don't know where to go, I suppose I should attach a debugger, but
> I'm not sure how to do that for multi-processor code.

Assuming you have X11 set up properly from the compute nodes, you can run with
the extra option '-start_in_debugger'.

If X11 is not properly set up, and you'd like to run gdb on one of the nodes
[say node 14, where you see the FPE], you can do:

  -start_in_debugger noxterm -debugger_nodes 14

Or try valgrind:

  mpiexec -n 16 valgrind --tool=memcheck -q ./executable

For debugging, it's best to install with --download-mpich [so that it's
valgrind clean] and run all MPI stuff on a single machine [X11 usually works
well from a single machine].

Satish

>
> any ideas? (long error message below):
>
> -Andrew
>
> [14]PETSC ERROR:
> ------------------------------------------------------------------------
> [14]PETSC ERROR: Caught signal number 8 FPE: Floating Point
> Exception,probably divide by zero
> [14]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [14]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind[14]PETSC
> ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find
> memory corruption errors
> [14]PETSC ERROR: likely location of problem given in stack below
> [14]PETSC ERROR: --------------------- Stack Frames
> ------------------------------------
> [14]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [14]PETSC ERROR: INSTEAD the line number of the start of the function
> [14]PETSC ERROR: is given.
> [14]PETSC ERROR: --------------------- Error Message
> ------------------------------------
> [14]PETSC ERROR: Signal received!
> [14]PETSC ERROR:
> ------------------------------------------------------------------------
> [14]PE[15]PETSC ERROR:
> ------------------------------------------------------------------------
> [15]PETSC ERROR: Caught signal number 8 FPE: Floating Point
> Exception,probably divide by zero
> [15]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [15]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind[15]PETSC
> ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find
> memory corruption errors
> [15]PETSC ERROR: likely location of problem given in stack below
> [15]PETSC ERROR: --------------------- Stack Frames
> ------------------------------------
> [15]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [15]PETSC ERROR: INSTEAD the line number of the start of the function
> [15]PETSC ERROR: is given.
> [15]PETSC ERROR: --------------------- Error Message
> ------------------------------------
> [15]PETSC ERROR: Signal received!
> [15]PETSC ERROR:
> ------------------------------------------------------------------------
> [15]PETSC ERROR: Petsc Release Version 3.2.0, Patch 6, Wed Jan 11 09:28:45
> CST 2012
> [14]PETSC ERROR: See docs/changes/index.html for recent updates.
> [14]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [14]PETSC ERROR: See docs/index.html for manual pages.
> [14]PETSC ERROR:
> ------------------------------------------------------------------------
> [14]PETSC ERROR: /home/becker/ansp6066/local/bin/finddme on a linux-gnu named
> photon9.colorado.edu by ansp6066 Fri Apr 27 18:01:55 2012
> [14]PETSC ERROR: Libraries linked from
> /home/becker/ansp6066/local/petsc-3.2-p6/lib
> [14]PETSC ERROR: Configure run at Mon Feb 27 11:17:14 2012
> [14]PETSC ERROR: Configure options
> --prefix=/home/becker/ansp6066/local/petsc-3.2-p6 --with-c++-support
> --with-fortran --with-mpi-dir=/usr/local/mpich2 --with-shared-libraries=0
> --with-scalar-type=complex
> --with-blas-lapack-libs=/central/intel/mkl/lib/em64t/libmkl_core.a
> --with-clanguage=cxx
> [14]PETSC ERROR:
> ------------------------------------------------------------------------
> [14]TSC ERROR: Petsc Release Version 3.2.0, Patch 6, Wed Jan 11 09:28:45 CST
> 2012
> [15]PETSC ERROR: See docs/changes/index.html for recent updates.
> [15]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [15]PETSC ERROR: See docs/index.html for manual pages.
> [15]PETSC ERROR:
> ------------------------------------------------------------------------
> [15]PETSC ERROR: /home/becker/ansp6066/local/bin/finddme on a linux-gnu named
> photon9.colorado.edu by ansp6066 Fri Apr 27 18:01:55 2012
> [15]PETSC ERROR: Libraries linked from
> /home/becker/ansp6066/local/petsc-3.2-p6/lib
> [15]PETSC ERROR: Configure run at Mon Feb 27 11:17:14 2012
> [15]PETSC ERROR: Configure options
> --prefix=/home/becker/ansp6066/local/petsc-3.2-p6 --with-c++-support
> --with-fortran --with-mpi-dir=/usr/local/mpich2 --with-shared-libraries=0
> --with-scalar-type=complex
> --with-blas-lapack-libs=/central/intel/mkl/lib/em64t/libmkl_core.a
> --with-clanguage=cxx
> [15]PETSC ERROR:
> ------------------------------------------------------------------------
> [15]PETSC ERROR: User provided function() line 0 in unknown directory unknown
> file
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 14PETSC ERROR:
> User provided function() line 0 in unknown directory unknown file
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 15[0]0:Return code
> = 0, signaled with Interrupt
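
The question mentions checking every floating-point result for NaNs and infinities. A minimal sketch of what such a check might look like for this build is given here, assuming the scalar type is std::complex<double> (the configure options show a complex C++ build; double precision is an assumption). Both helper names are hypothetical and not part of PETSc. The second helper is included because, on Linux/x86, an integer division by zero is typically also reported as signal 8 (SIGFPE), which a NaN/Inf check on floating-point values would not catch. Nothing in the thread establishes that this is the cause here; the debugger or valgrind run is what would show the actual location.

```cpp
#include <cmath>
#include <complex>
#include <limits>

// Hypothetical helper, not a PETSc function: true only when both the real
// and imaginary parts are finite (neither NaN nor +/-infinity).
inline bool is_finite_scalar(const std::complex<double>& z) {
    return std::isfinite(z.real()) && std::isfinite(z.imag());
}

// Hypothetical helper: integer division that reports a zero divisor instead
// of executing the division. It also rejects the one overflowing case,
// min / -1, which can trap on x86 as well. On failure, `out` is untouched.
inline bool checked_div(long num, long den, long& out) {
    if (den == 0) return false;
    if (num == std::numeric_limits<long>::min() && den == -1) return false;
    out = num / den;
    return true;
}
```

In a matrix-assembly loop, a check like is_finite_scalar(value) could run just before each value is inserted, and checked_div could wrap any index or block-size arithmetic that divides by a quantity depending on the number of processes.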
