On Sat, Apr 28, 2012 at 8:07 PM, Andrew Spott <andrew.spott at gmail.com> wrote:
> Are there any tricks to doing this across ssh?
>
> I've attempted it using the method given, but I can't get it to start in
> the debugger or to attach the debugger; the program just exits or hangs
> after telling me the error.
>

Is there a reason you cannot run this problem on your local machine with 4
processes?

   Matt

> -Andrew
>
> On Apr 28, 2012, at 4:45 PM, Matthew Knepley wrote:
>
> On Sat, Apr 28, 2012 at 6:39 PM, Andrew Spott <andrew.spott at gmail.com> wrote:
>
>> > -start_in_debugger noxterm -debugger_nodes 14
>>
>> All my cores are on the same machine. Is this supposed to start a
>> debugger on processor 14, or on computer 14?
>>
>
> Neither. This spawns a gdb process on the same node as the process with
> MPI rank 14, then attaches gdb to process 14.
>
>    Matt
>
>
>> I don't think I have X11 set up properly for the compute nodes, so X11
>> isn't really an option.
>>
>> Thanks for the help.
>>
>> -Andrew
>>
>>
>> On Apr 27, 2012, at 7:26 PM, Satish Balay wrote:
>>
>> > On Fri, 27 Apr 2012, Andrew Spott wrote:
>> >
>> >> I'm honestly stumped.
>> >>
>> >> I have some PETSc code that essentially just populates a matrix in
>> >> parallel, then puts it in a file. All my code that uses floating point
>> >> computations is checked for NaNs and infinities, and none seem to
>> >> show up. However, when I run it on more than 4 cores, I get floating point
>> >> exceptions that kill the program. I tried turning off the exceptions from
>> >> PETSc, but the program still dies from them, just without the PETSc error
>> >> message.
>> >>
>> >> I honestly don't know where to go. I suppose I should attach a
>> >> debugger, but I'm not sure how to do that for multi-processor code.
>> >
>> > Assuming you have X11 set up properly from the compute nodes, you can run
>> > with the extra option '-start_in_debugger'.
>> >
>> > If X11 is not properly set up, and you'd like to run gdb on one of the
>> > nodes [say node 14, where you see SEGV], you can do:
>> >
>> > -start_in_debugger noxterm -debugger_nodes 14
>> >
>> > Or try valgrind:
>> >
>> > mpiexec -n 16 valgrind --tool=memcheck -q ./executable
>> >
>> >
>> > For debugging, it's best to install with --download-mpich [so that it's
>> > valgrind clean] and run all MPI stuff on a single machine [usually
>> > X11 works well from a single machine].
>> >
>> > Satish
>> >
>> >>
>> >> Any ideas? (long error message below):
>> >>
>> >> -Andrew
>> >>
>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>> >> [14]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
>> >> [14]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> >> [14]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind[14]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> >> [14]PETSC ERROR: likely location of problem given in stack below
>> >> [14]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> >> [14]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> >> [14]PETSC ERROR: INSTEAD the line number of the start of the function
>> >> [14]PETSC ERROR: is given.
>> >> [14]PETSC ERROR: --------------------- Error Message ------------------------------------
>> >> [14]PETSC ERROR: Signal received!
>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>> >> [14]PE[15]PETSC ERROR: ------------------------------------------------------------------------
>> >> [15]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
>> >> [15]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> >> [15]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind[15]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> >> [15]PETSC ERROR: likely location of problem given in stack below
>> >> [15]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> >> [15]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> >> [15]PETSC ERROR: INSTEAD the line number of the start of the function
>> >> [15]PETSC ERROR: is given.
>> >> [15]PETSC ERROR: --------------------- Error Message ------------------------------------
>> >> [15]PETSC ERROR: Signal received!
>> >> [15]PETSC ERROR: ------------------------------------------------------------------------
>> >> [15]PETSC ERROR: Petsc Release Version 3.2.0, Patch 6, Wed Jan 11 09:28:45 CST 2012
>> >> [14]PETSC ERROR: See docs/changes/index.html for recent updates.
>> >> [14]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>> >> [14]PETSC ERROR: See docs/index.html for manual pages.
>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>> >> [14]PETSC ERROR: /home/becker/ansp6066/local/bin/finddme on a linux-gnu named photon9.colorado.edu by ansp6066 Fri Apr 27 18:01:55 2012
>> >> [14]PETSC ERROR: Libraries linked from /home/becker/ansp6066/local/petsc-3.2-p6/lib
>> >> [14]PETSC ERROR: Configure run at Mon Feb 27 11:17:14 2012
>> >> [14]PETSC ERROR: Configure options --prefix=/home/becker/ansp6066/local/petsc-3.2-p6 --with-c++-support --with-fortran --with-mpi-dir=/usr/local/mpich2 --with-shared-libraries=0 --with-scalar-type=complex --with-blas-lapack-libs=/central/intel/mkl/lib/em64t/libmkl_core.a --with-clanguage=cxx
>> >> [14]PETSC ERROR: ------------------------------------------------------------------------
>> >> [14]TSC ERROR: Petsc Release Version 3.2.0, Patch 6, Wed Jan 11 09:28:45 CST 2012
>> >> [15]PETSC ERROR: See docs/changes/index.html for recent updates.
>> >> [15]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>> >> [15]PETSC ERROR: See docs/index.html for manual pages.
>> >> [15]PETSC ERROR: ------------------------------------------------------------------------
>> >> [15]PETSC ERROR: /home/becker/ansp6066/local/bin/finddme on a linux-gnu named photon9.colorado.edu by ansp6066 Fri Apr 27 18:01:55 2012
>> >> [15]PETSC ERROR: Libraries linked from /home/becker/ansp6066/local/petsc-3.2-p6/lib
>> >> [15]PETSC ERROR: Configure run at Mon Feb 27 11:17:14 2012
>> >> [15]PETSC ERROR: Configure options --prefix=/home/becker/ansp6066/local/petsc-3.2-p6 --with-c++-support --with-fortran --with-mpi-dir=/usr/local/mpich2 --with-shared-libraries=0 --with-scalar-type=complex --with-blas-lapack-libs=/central/intel/mkl/lib/em64t/libmkl_core.a --with-clanguage=cxx
>> >> [15]PETSC ERROR: ------------------------------------------------------------------------
>> >> [15]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
>> >> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 14PETSC ERROR: User provided function() line 0 in unknown directory unknown file
>> >> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 15[0]0:Return code = 0, signaled with Interrupt
>> >
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
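
A note on the NaN/Inf checks mentioned in the original report: checking results for NaNs and infinities may not be enough to rule out a floating point exception. An intermediate operation can raise the IEEE divide-by-zero flag while the final value stays finite. The sketch below is a minimal illustration using only the standard `<cfenv>` interface. The helper names `inverse_of_inverse` and `flags_raised_by` are hypothetical and are not part of PETSc or of the code discussed in this thread. If trapping is enabled (for example through glibc's `feenableexcept`, a GNU extension), a raised flag of this kind is delivered as SIGFPE instead of being recorded silently.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical helper (not a PETSc function): evaluates 1 / (1 / x).
// For x == 0 the intermediate 1/x is +inf, yet the final result is 0,
// so a NaN/Inf check applied only to the result sees nothing wrong.
double inverse_of_inverse(double x) {
    volatile double denom = x;            // volatile keeps the division at run time
    volatile double inner = 1.0 / denom;  // raises FE_DIVBYZERO when denom == 0
    return 1.0 / inner;                   // 1 / inf == 0, no flag raised here
}

// Hypothetical helper: clears the IEEE exception flags, evaluates f(x),
// stores the value in *result, and returns whichever of the
// divide-by-zero, invalid, or overflow flags were raised meanwhile.
int flags_raised_by(double (*f)(double), double x, double *result) {
    assert(result != nullptr);
    std::feclearexcept(FE_ALL_EXCEPT);
    *result = f(x);
    return std::fetestexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
}
```

Calling `flags_raised_by(inverse_of_inverse, 0.0, &r)` leaves `r` finite while the returned mask contains `FE_DIVBYZERO`. Wrapping suspect sections of the matrix assembly in a check like this is one way to narrow down where the exception originates without a debugger.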
