On 23/4/2014 6:00 PM, Matthew Knepley wrote:
On Wed, Apr 23, 2014 at 5:55 AM, TAY wee-beng wrote:
Hi,
Just an update: I managed to compare the values by reducing
the problem size to a hundred-plus values. The matrix and vector are
almost the same as in my Win7 output.
Run in the debugger and get a stack trace,
Hi,
I used the -start_in_debugger option and it stops at this point:
Program received signal SIGFPE, Arithmetic exception.
VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
71 ierr = PetscLogFlops(2.0*xin->map->n-1);CHKERRQ(ierr);
(gdb) where
#0 VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
#1 0x0000000001f1d8b5 in VecDot_MPI (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd7f40) at pbvec.c:15
#2 0x0000000001edfa14 in VecDot (x=0x14ad3940, y=0x14ad8fb0, val=0x7fff24cd7f40) at rvector.c:128
#3 0x00000000025cf539 in KSPSolve_BCGS (ksp=0x1479d910) at bcgs.c:85
#4 0x0000000002576687 in KSPSolve (ksp=0x1479d910, b=0x1476b110, x=0x14771890) at itfunc.c:441
#5 0x0000000001d859d9 in kspsolve_ (ksp=0x395a548, b=0x395a650, x=0x3959f38, __ierr=0x384d8b8) at itfuncf.c:219
#6 0x0000000001c37def in petsc_solvers_mp_semi_momentum_simple_xyz_ ()
#7 0x0000000001c97c02 in fractional_initial_mp_fractional_steps_ ()
#8 0x0000000001cbc336 in ibm3d_high_re () at ibm3d_high_Re.F90:675
#9 0x00000000004093dc in main ()
(gdb)
Is this what you mean by a stack trace?
I have also run "bt full" and attached the more detailed output.
Matt
I also tried valgrind, but it aborts almost immediately:
valgrind --leak-check=yes ./a.out
==17603== Memcheck, a memory error detector.
==17603== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==17603== Using LibVEX rev 1658, a library for dynamic binary translation.
==17603== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==17603== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==17603== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==17603== For more details, rerun with: -v
==17603==
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
vex amd64->IR: unhandled instruction bytes: 0xF 0xAE 0x85 0xF0
==17603== valgrind: Unrecognised instruction at address 0x5DD0F0E.
==17603== Your program just tried to execute an instruction that Valgrind
==17603== did not recognise. There are two possible reasons for this.
==17603== 1. Your program has a bug and erroneously jumped to a non-code
==17603== location. If you are running Memcheck and you just saw a
==17603== warning about a bad jump, it's probably your program's fault.
==17603== 2. The instruction is legitimate but Valgrind doesn't handle it,
==17603== i.e. it's Valgrind's fault. If you think this is the case or
==17603== you are not sure, please let us know and we'll try to fix it.
==17603== Either way, Valgrind will now raise a SIGILL signal which will
==17603== probably kill your program.
forrtl: severe (168): Program Exception - illegal instruction
Image PC Routine Line Source
libifcore.so.5 0000000005DD0F0E Unknown Unknown Unknown
libifcore.so.5 0000000005DD0DC7 Unknown Unknown Unknown
a.out 0000000001CB4CBB Unknown Unknown Unknown
a.out 00000000004093DC Unknown Unknown Unknown
libc.so.6 000000369141D974 Unknown Unknown Unknown
a.out 00000000004092E9 Unknown Unknown Unknown
==17603==
==17603== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
==17603== malloc/free: in use at exit: 239 bytes in 8 blocks.
==17603== malloc/free: 31 allocs, 23 frees, 31,388 bytes allocated.
==17603== For counts of detected errors, rerun with: -v
==17603== searching for pointers to 8 not-freed blocks.
==17603== checked 2,340,280 bytes.
==17603==
==17603== LEAK SUMMARY:
==17603== definitely lost: 0 bytes in 0 blocks.
==17603== possibly lost: 0 bytes in 0 blocks.
==17603== still reachable: 239 bytes in 8 blocks.
==17603== suppressed: 0 bytes in 0 blocks.
==17603== Reachable blocks (those to which a pointer was found) are not shown.
==17603== To see them, rerun with: --show-reachable=yes
Thank you
Yours sincerely,
TAY wee-beng
On 23/4/2014 5:18 PM, TAY wee-beng wrote:
Hi,
My code gives wrong answers on one of the clusters, even on a
single processor. No error message is printed. It used to work fine.
I ran the debug version and it gives this error message:
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR: INSTEAD the line number of the start of the function
[0]PETSC ERROR: is given.
[0]PETSC ERROR: [0] VecDot_Seq line 62 src/vec/vec/impls/seq/bvec1.c
[0]PETSC ERROR: [0] VecDot_MPI line 14 src/vec/vec/impls/mpi/pbvec.c
[0]PETSC ERROR: [0] VecDot line 118 src/vec/vec/interface/rvector.c
[0]PETSC ERROR: [0] KSPSolve_BCGS line 39 src/ksp/ksp/impls/bcgs/bcgs.c
[0]PETSC ERROR: [0] KSPSolve line 356 src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: --------------------- Error Message ------------------------------------
[0]PETSC ERROR: Signal received!
[0]PETSC ERROR: ------------------------------------------------------------------------
It happens after KSPSolve is called. There was no problem on the
other cluster. How should I debug this to find the error?
I tried to compare the input matrix and vector between the
two clusters, but there are too many values.
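When there are too many values to compare by eye, a small script can report only the entries that differ beyond a tolerance. The following is a minimal sketch in Python, assuming each machine has written its matrix or vector entries to a plain text file with one number per line. The file layout, the names `compare_values` and `read_values`, and the tolerances are illustrative assumptions, not something PETSc produces by default.

```python
import math


def compare_values(a, b, rel_tol=1e-12, abs_tol=0.0):
    """Return (index, a_i, b_i) for entries that differ beyond the tolerance.

    Entries that are NaN or infinite on either side are always reported,
    because they are usually worth looking at first.
    """
    if len(a) != len(b):
        raise ValueError("length mismatch: %d vs %d" % (len(a), len(b)))
    mismatches = []
    for i, (x, y) in enumerate(zip(a, b)):
        if not (math.isfinite(x) and math.isfinite(y)):
            mismatches.append((i, x, y))
        elif not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, x, y))
    return mismatches


def read_values(path):
    """Read one floating-point number per non-empty line (assumed layout)."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Small in-memory stand-ins for the dumps from the two machines.
    reference = [1.0, 2.0, 3.0, 4.0]
    suspect = [1.0, 2.0 + 1e-15, 3.5, float("nan")]
    for index, x, y in compare_values(reference, suspect):
        print(index, x, y)
```

With real dumps, the two lists would come from `read_values` on the file written by each machine. Differences at the level of rounding error are expected between compilers and platforms, so the tolerance would need tuning to the problem.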
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener
(gdb) bt full
#0 VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
ya = (const PetscScalar *) 0x0
xa = (const PetscScalar *) 0x0
one = 1
bn = 960
ierr = 0
#1 0x0000000001f1d8b5 in VecDot_MPI (xin=0x14ad3940, yin=0x14ad8fb0,
z=0x7fff24cd7f40) at pbvec.c:15
sum = 1.9762625833649862e-323
work = 1.3431953209405154
ierr = 0
#2 0x0000000001edfa14 in VecDot (x=0x14ad3940, y=0x14ad8fb0,
val=0x7fff24cd7f40) at rvector.c:128
ierr = 0
#3 0x00000000025cf539 in KSPSolve_BCGS (ksp=0x1479d910) at bcgs.c:85
ierr = 0
i = 0
rho = 1.9762625833649862e-323
rhoold = 1
alpha = 1
beta = 1.600807474747106e-316
omega = 1.6910452843641213e-315
omegaold = 1
d1 = 0
X = (Vec) 0x14771890
B = (Vec) 0x1476b110
V = (Vec) 0x14ade620
P = (Vec) 0x14aee970
R = (Vec) 0x14ad3940
RP = (Vec) 0x14ad8fb0
T = (Vec) 0x14ae3c90
S = (Vec) 0x14ae9300
dp = 1.1589630369172761
d2 = 1.5718032521948665e-316
bcgs = (KSP_BCGS *) 0x14abdc40
#4 0x0000000002576687 in KSPSolve (ksp=0x1479d910, b=0x1476b110, x=0x14771890)
at itfunc.c:441
ierr = 0
rank = 32767
flag1 = PETSC_FALSE
flag2 = PETSC_FALSE
flag3 = PETSC_FALSE
flg = PETSC_FALSE
inXisinB = PETSC_FALSE
guess_zero = PETSC_TRUE
viewer = (PetscViewer) 0x7fff0000000b
mat = (Mat) 0x0
premat = (Mat) 0x7fff24ce0cf8
format = PETSC_VIEWER_DEFAULT
#5 0x0000000001d859d9 in kspsolve_ (ksp=0x395a548, b=0x395a650, x=0x3959f38,
__ierr=0x384d8b8) at itfuncf.c:219
No locals.
#6 0x0000000001c37def in petsc_solvers_mp_semi_momentum_simple_xyz_ ()
No symbol table info available.
#7 0x0000000001c97c02 in fractional_initial_mp_fractional_steps_ ()
No symbol table info available.
#8 0x0000000001cbc336 in ibm3d_high_re () at ibm3d_high_Re.F90:675
filename = <error reading variable>
file_write_no_char = <error reading variable>
del_t_tmp = 0
chord = 0
sum_w = 317.24634223818254
sum_v = -0.029308797839978005
sum_u = -0.26935975976548698
max_w = 1.1569511841851332
max_v = 0.15264060063132492
max_u = 0.23053472715538209
explode_check = 10
error_all = 0
ijk = 1
escape_time = 99990000
error = 0
interval2 = -16000
openstatus = {0, 0, 0, 0, 0, 0}
#9 0x00000000004093dc in main ()
No symbol table info available.
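Several locals in the "bt full" output above (for example rho, beta, omega and d2) are nonzero but far smaller than the smallest normal double, so they are subnormal numbers. Whether they are related to the SIGFPE is not established in this thread; they may simply be uninitialized locals. As a minimal sketch, the following Python snippet classifies such values and prints their raw bit patterns. The helper names `classify` and `raw_bits` are illustrative assumptions, and the numbers are copied from the backtrace.

```python
import math
import struct
import sys


def classify(x):
    """Classify a double as 'nan', 'inf', 'zero', 'subnormal', or 'normal'."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "inf"
    if x == 0.0:
        return "zero"
    # sys.float_info.min is the smallest positive normal double.
    if abs(x) < sys.float_info.min:
        return "subnormal"
    return "normal"


def raw_bits(x):
    """Return the IEEE 754 bit pattern of a double as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


if __name__ == "__main__":
    # Values copied from the "bt full" output in this thread.
    values = {
        "rho": 1.9762625833649862e-323,
        "beta": 1.600807474747106e-316,
        "omega": 1.6910452843641213e-315,
        "d2": 1.5718032521948665e-316,
        "dp": 1.1589630369172761,
        "work": 1.3431953209405154,
    }
    for name, value in values.items():
        print(name, classify(value), hex(raw_bits(value)))
```

A bit pattern that looks like a small integer or a memory address is a hint that the variable had not yet been assigned when the debugger stopped, rather than that the solver computed it.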