0x00007f96a2148e52 in libmetis__FM_2WayCutRefine (ctrl=0x2784d20, graph=0x2784940, ntpwgts=0x7ffdfa323060, niter=4)
    at /home/mefpp_ericc/petsc-3.9.2-debug/arch-linux2-c-debug/externalpackages/git.metis/libmetis/fm.c:60

It appears the crash is in Metis, not SuperLU_Dist.

  So it is either a bug in Metis or a bug in how Metis is called by ParMetis or 
SuperLU_Dist.
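
  To make the failure mode concrete, here is a minimal sketch of the expression 
at fm.c:60 with a guard for an empty graph. It is only an illustration, not a 
patch to Metis: avg_vertex_weight and MIN_OF are hypothetical names, plain int 
stands in for the Metis index type, and returning 0 for nvtxs == 0 is an 
assumption about what a caller would want.

```c
#include <assert.h>

/* Stand-in for the gk_min macro used in the quoted line of fm.c. */
#define MIN_OF(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical helper, not part of Metis: same arithmetic as
 *   avgvwgt = gk_min((pwgts[0]+pwgts[1])/20, 2*(pwgts[0]+pwgts[1])/nvtxs);
 * but it does not divide when the graph has no vertices. */
int avg_vertex_weight(int pwgt0, int pwgt1, int nvtxs)
{
    int total = pwgt0 + pwgt1;

    assert(nvtxs >= 0);   /* a negative vertex count is a caller error */

    if (nvtxs == 0)
        return 0;         /* assumption: an empty graph has nothing to refine */

    return MIN_OF(total / 20, 2 * total / nvtxs);
}
```

  With nvtxs == 0 the unguarded integer division is what raises SIGFPE on 
x86-64 Linux. Whether the right fix is a guard like this inside Metis or 
avoiding the call with an empty graph from ParMetis/SuperLU_Dist is exactly the 
open question above.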

   Barry




> On May 22, 2018, at 10:37 AM, Hong <[email protected]> wrote:
> 
> Eric:
> Likely you are encountering a zero pivot. Running your code with 
> '-ksp_error_if_not_converged' would show it.
> Adding the option '-mat_superlu_dist_replacetinypivot' might help.
> Hong
> 
> Hi,
> 
> The given matrix+vector has been failing with SuperLU_Dist in some of our 
> nightly validation tests since I activated the parallel symbolic 
> factorisation (with 
> -mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1).
> 
> I extracted an example system and reproduced the bug with 
> src/ksp/ksp/examples/tests/ex6.c, which runs fine with 2 or 3 processes, but 
> with 4 it gives an FPE on process #1:
> 
> mpirun -n 4 ./ex6 -f AssembleurGD_resolution_no_0_0 -ksp_view -ksp_type 
> preonly -pc_type lu -pc_factor_mat_solver_type superlu_dist 
> -mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1
> 
> ...
> [1]PETSC ERROR: 
> ------------------------------------------------------------------------
> [1]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably 
> divide by zero
> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [1]PETSC ERROR: or see 
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to 
> find memory corruption errors
> [1]PETSC ERROR: likely location of problem given in stack below
> [1]PETSC ERROR: ---------------------  Stack Frames 
> ------------------------------------
> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [1]PETSC ERROR:       INSTEAD the line number of the start of the function
> [1]PETSC ERROR:       is given.
> [1]PETSC ERROR: [1] SuperLU_DIST:pdgssvx line 467 
> /home/mefpp_ericc/petsc-3.9.2-debug/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [1]PETSC ERROR: [1] MatLUFactorNumeric_SuperLU_DIST line 314 
> /home/mefpp_ericc/petsc-3.9.2-debug/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [1]PETSC ERROR: [1] MatLUFactorNumeric line 3014 
> /home/mefpp_ericc/petsc-3.9.2-debug/src/mat/interface/matrix.c
> [1]PETSC ERROR: [1] PCSetUp_LU line 59 
> /home/mefpp_ericc/petsc-3.9.2-debug/src/ksp/pc/impls/factor/lu/lu.c
> [1]PETSC ERROR: [1] PCSetUp line 885 
> /home/mefpp_ericc/petsc-3.9.2-debug/src/ksp/pc/interface/precon.c
> [1]PETSC ERROR: [1] KSPSetUp line 294 
> /home/mefpp_ericc/petsc-3.9.2-debug/src/ksp/ksp/interface/itfunc.c
> [1]PETSC ERROR: --------------------- Error Message 
> --------------------------------------------------------------
> [1]PETSC ERROR: Signal received
> [1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
> trouble shooting.
> [1]PETSC ERROR: Petsc Release Version 3.9.2, May, 20, 2018
> [1]PETSC ERROR: ./ex6 on a  named lorien by eric Tue May 22 10:39:15 2018
> [1]PETSC ERROR: Configure options 
> --prefix=/opt/petsc-3.9.2_debug_openmpi-1.10.2 --with-mpi-compilers=1 
> --with-mpi-dir=/opt/openmpi-1.10.2 --with-make-np=12 
> --with-shared-libraries=1 --with-debugging=yes --with-memalign=64 
> --with-visibility=0 --with-64-bit-indices=0 --download-ml=yes 
> --download-mumps=yes --download-superlu=yes --download-superlu_dist=yes 
> --download-parmetis=yes --download-ptscotch=yes --download-metis=yes 
> --download-suitesparse=yes --download-hypre=yes 
> --with-blaslapack-dir=/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64 
> --with-mkl_pardiso-dir=/opt/intel/composer_xe_2015.2.164/mkl 
> --with-mkl_cpardiso-dir=/opt/intel/composer_xe_2015.2.164/mkl 
> --with-scalapack=1 
> --with-scalapack-include=/opt/intel/composer_xe_2015.2.164/mkl/include 
> --with-scalapack-lib="-L/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64 
> -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64"
> [1]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> ...
> 
> The given Matrix+Vector are available here:
> 
> http://www.giref.ulaval.ca/~ericc/bug_superlu_dist_parallel_factorisation/AssembleurGD_resolution_no_0_0
> 
> http://www.giref.ulaval.ca/~ericc/bug_superlu_dist_parallel_factorisation/AssembleurGD_resolution_no_0_0.info
> 
> If I run with -on_error_attach_debugger, I can see a division by zero here:
> 
> #8  <signal handler called>
> (gdb)
> #9  0x00007f96a2148e52 in libmetis__FM_2WayCutRefine (ctrl=0x2784d20, graph=0x2784940, ntpwgts=0x7ffdfa323060, niter=4)
>     at /home/mefpp_ericc/petsc-3.9.2-debug/arch-linux2-c-debug/externalpackages/git.metis/libmetis/fm.c:60
> 60        avgvwgt = gk_min((pwgts[0]+pwgts[1])/20, 2*(pwgts[0]+pwgts[1])/nvtxs);
> 
> and the value of nvtxs is 0...
> 
> Thanks!
> 
> Eric
> 
