On Wed, Apr 23, 2014 at 5:55 AM, TAY wee-beng <[email protected]> wrote:
> Hi,
>
> Just to update that I managed to compare the values by reducing the
> problem size to hundred plus values. The matrix and vector are almost the
> same compared to my win7 output.
>

Run in the debugger and get a stack trace,

   Matt

> Also tried valgrind but it aborts almost immediately:
>
> valgrind --leak-check=yes ./a.out
> ==17603== Memcheck, a memory error detector.
> ==17603== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
> ==17603== Using LibVEX rev 1658, a library for dynamic binary translation.
> ==17603== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
> ==17603== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
> ==17603== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
> ==17603== For more details, rerun with: -v
> ==17603==
> --17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
> --17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
> vex amd64->IR: unhandled instruction bytes: 0xF 0xAE 0x85 0xF0
> ==17603== valgrind: Unrecognised instruction at address 0x5DD0F0E.
> ==17603== Your program just tried to execute an instruction that Valgrind
> ==17603== did not recognise. There are two possible reasons for this.
> ==17603== 1. Your program has a bug and erroneously jumped to a non-code
> ==17603==    location. If you are running Memcheck and you just saw a
> ==17603==    warning about a bad jump, it's probably your program's fault.
> ==17603== 2. The instruction is legitimate but Valgrind doesn't handle it,
> ==17603==    i.e. it's Valgrind's fault. If you think this is the case or
> ==17603==    you are not sure, please let us know and we'll try to fix it.
> ==17603== Either way, Valgrind will now raise a SIGILL signal which will
> ==17603== probably kill your program.
> forrtl: severe (168): Program Exception - illegal instruction
> Image              PC                Routine   Line      Source
> libifcore.so.5     0000000005DD0F0E  Unknown   Unknown   Unknown
> libifcore.so.5     0000000005DD0DC7  Unknown   Unknown   Unknown
> a.out              0000000001CB4CBB  Unknown   Unknown   Unknown
> a.out              00000000004093DC  Unknown   Unknown   Unknown
> libc.so.6          000000369141D974  Unknown   Unknown   Unknown
> a.out              00000000004092E9  Unknown   Unknown   Unknown
> ==17603==
> ==17603== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
> ==17603== malloc/free: in use at exit: 239 bytes in 8 blocks.
> ==17603== malloc/free: 31 allocs, 23 frees, 31,388 bytes allocated.
> ==17603== For counts of detected errors, rerun with: -v
> ==17603== searching for pointers to 8 not-freed blocks.
> ==17603== checked 2,340,280 bytes.
> ==17603==
> ==17603== LEAK SUMMARY:
> ==17603==    definitely lost: 0 bytes in 0 blocks.
> ==17603==      possibly lost: 0 bytes in 0 blocks.
> ==17603==    still reachable: 239 bytes in 8 blocks.
> ==17603==         suppressed: 0 bytes in 0 blocks.
> ==17603== Reachable blocks (those to which a pointer was found) are not shown.
> ==17603== To see them, rerun with: --show-reachable=yes
>
> Thank you
>
> Yours sincerely,
>
> TAY wee-beng
>
> On 23/4/2014 5:18 PM, TAY wee-beng wrote:
>
>> Hi,
>>
>> My code was found to be giving error answer in one of the cluster, even
>> on single processor. No error msg was given. It used to be working fine.
>>
>> I run the debug version and it gives the error msg:
>>
>> [0]PETSC ERROR: ------------------------------------------------------------------------
>> [0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> [0]PETSC ERROR: likely location of problem given in stack below
>> [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> [0]PETSC ERROR:       INSTEAD the line number of the start of the function
>> [0]PETSC ERROR:       is given.
>> [0]PETSC ERROR: [0] VecDot_Seq line 62 src/vec/vec/impls/seq/bvec1.c
>> [0]PETSC ERROR: [0] VecDot_MPI line 14 src/vec/vec/impls/mpi/pbvec.c
>> [0]PETSC ERROR: [0] VecDot line 118 src/vec/vec/interface/rvector.c
>> [0]PETSC ERROR: [0] KSPSolve_BCGS line 39 src/ksp/ksp/impls/bcgs/bcgs.c
>> [0]PETSC ERROR: [0] KSPSolve line 356 src/ksp/ksp/interface/itfunc.c
>> [0]PETSC ERROR: --------------------- Error Message ------------------------------------
>> [0]PETSC ERROR: Signal received!
>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>
>> It happens after KSPSolve. There was no problem on other cluster. So how
>> should I debug to find the error?
>>
>> I tried to compare the input matrix and vector between different cluster
>> but there are too many values.
>>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
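
The thread mentions comparing the matrix and vector values between two machines and finding "too many values" to check by eye. A minimal sketch of how that comparison might be automated is given here, assuming each machine dumps its values to a plain text file with one number per line. The dump format, the file names, and the helper names `load_values` and `compare_values` are hypothetical and are not part of PETSc. The sketch flags NaN/Inf entries separately, since a non-finite input is one possible cause of a floating point exception inside a dot product.

```python
import math


def load_values(path):
    """Read one float per line from a text dump (hypothetical format)."""
    with open(path) as fh:
        return [float(line) for line in fh if line.strip()]


def compare_values(a, b, rtol=1e-10, atol=1e-14):
    """Compare two equal-length sequences of floats.

    Returns (non_finite, mismatches); each is a list of (index, x, y).
    non_finite holds entries where either value is NaN or Inf,
    mismatches holds finite entries that differ beyond the tolerance.
    """
    if len(a) != len(b):
        raise ValueError("length mismatch: %d vs %d" % (len(a), len(b)))
    non_finite = []
    mismatches = []
    for i, (x, y) in enumerate(zip(a, b)):
        if not (math.isfinite(x) and math.isfinite(y)):
            non_finite.append((i, x, y))
        elif abs(x - y) > atol + rtol * max(abs(x), abs(y)):
            mismatches.append((i, x, y))
    return non_finite, mismatches


if __name__ == "__main__":
    # Tiny in-memory example; real use would call load_values() on two
    # dump files (file names would be the user's own).
    reference = [1.0, 2.0, 3.0]
    suspect = [1.0, 2.0000001, float("nan")]
    bad, diff = compare_values(reference, suspect)
    print("non-finite entries:", bad)
    print("mismatched entries:", diff)
```

The tolerances are arbitrary starting points; values that are "almost the same" across compilers or platforms may need a looser `rtol` before the remaining differences become meaningful.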
