Eric:

You are likely encountering a zero pivot. Running your code with '-ksp_error_if_not_converged' would show it. Adding the option '-mat_superlu_dist_replacetinypivot' might help.

Hong

> Hi,
>
> The given matrix+vector fails with SuperLU_DIST in some of our nightly
> validation tests since I activated the parallel symbolic factorisation
> (with -mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1).
>
> I extracted an example system and reproduced the bug with
> src/ksp/ksp/examples/tests/ex6.c, which I can run with 2 or 3 processes,
> but with 4 it gives an FPE on process #1:
>
> mpirun -n 4 ./ex6 -f AssembleurGD_resolution_no_0_0 -ksp_view -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type superlu_dist -mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1
>
> ...
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [1]PETSC ERROR: likely location of problem given in stack below
> [1]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [1]PETSC ERROR: INSTEAD the line number of the start of the function
> [1]PETSC ERROR: is given.
> [1]PETSC ERROR: [1] SuperLU_DIST:pdgssvx line 467 /home/mefpp_ericc/petsc-3.9.2-debug/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [1]PETSC ERROR: [1] MatLUFactorNumeric_SuperLU_DIST line 314 /home/mefpp_ericc/petsc-3.9.2-debug/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [1]PETSC ERROR: [1] MatLUFactorNumeric line 3014 /home/mefpp_ericc/petsc-3.9.2-debug/src/mat/interface/matrix.c
> [1]PETSC ERROR: [1] PCSetUp_LU line 59 /home/mefpp_ericc/petsc-3.9.2-debug/src/ksp/pc/impls/factor/lu/lu.c
> [1]PETSC ERROR: [1] PCSetUp line 885 /home/mefpp_ericc/petsc-3.9.2-debug/src/ksp/pc/interface/precon.c
> [1]PETSC ERROR: [1] KSPSetUp line 294 /home/mefpp_ericc/petsc-3.9.2-debug/src/ksp/ksp/interface/itfunc.c
> [1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [1]PETSC ERROR: Signal received
> [1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [1]PETSC ERROR: Petsc Release Version 3.9.2, May, 20, 2018
> [1]PETSC ERROR: ./ex6 on a named lorien by eric Tue May 22 10:39:15 2018
> [1]PETSC ERROR: Configure options --prefix=/opt/petsc-3.9.2_debug_openmpi-1.10.2
> --with-mpi-compilers=1 --with-mpi-dir=/opt/openmpi-1.10.2
> --with-make-np=12 --with-shared-libraries=1 --with-debugging=yes
> --with-memalign=64 --with-visibility=0 --with-64-bit-indices=0
> --download-ml=yes --download-mumps=yes --download-superlu=yes
> --download-superlu_dist=yes --download-parmetis=yes --download-ptscotch=yes
> --download-metis=yes --download-suitesparse=yes --download-hypre=yes
> --with-blaslapack-dir=/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64
> --with-mkl_pardiso-dir=/opt/intel/composer_xe_2015.2.164/mkl
> --with-mkl_cpardiso-dir=/opt/intel/composer_xe_2015.2.164/mkl
> --with-scalapack=1
> --with-scalapack-include=/opt/intel/composer_xe_2015.2.164/mkl/include
> --with-scalapack-lib="-L/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64"
> [1]PETSC ERROR: #1 User provided function() line 0 in unknown file
> ...
>
> The given matrix and vector are available here:
>
> http://www.giref.ulaval.ca/~ericc/bug_superlu_dist_parallel_factorisation/AssembleurGD_resolution_no_0_0
>
> http://www.giref.ulaval.ca/~ericc/bug_superlu_dist_parallel_factorisation/AssembleurGD_resolution_no_0_0.info
>
> If I run with -on_error_attach_debugger, I can see a division by zero here:
>
> #8  <signal handler called>
> (gdb)
> #9  0x00007f96a2148e52 in libmetis__FM_2WayCutRefine (ctrl=0x2784d20, graph=0x2784940, ntpwgts=0x7ffdfa323060, niter=4)
>     at /home/mefpp_ericc/petsc-3.9.2-debug/arch-linux2-c-debug/externalpackages/git.metis/libmetis/fm.c:60
> 60        avgvwgt = gk_min((pwgts[0]+pwgts[1])/20, 2*(pwgts[0]+pwgts[1])/nvtxs);
>
> and the value of nvtxs is "0"...
>
> Thanks!
>
> Eric
>
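
For reference, the expression at fm.c:60 in the backtrace above uses integer division, so nvtxs equal to 0 raises the FPE at that line. Below is a minimal sketch of the same expression with a guard. The name `avg_vertex_weight` is hypothetical and not part of METIS, the `long` type stands in for the library's index type, and returning 0 for an empty graph is only an assumption made for illustration, not the upstream fix.

```c
#include <assert.h>

/* Hypothetical helper, NOT part of METIS: mirrors the expression reported at
 * fm.c:60, min(total/20, 2*total/nvtxs), but skips the division when the
 * graph has no vertices. */
long avg_vertex_weight(const long pwgts[2], long nvtxs)
{
    long total = pwgts[0] + pwgts[1];
    long a = total / 20;
    long b;

    assert(nvtxs >= 0);
    if (nvtxs == 0) {
        /* Assumption: an empty (sub)graph has nothing to refine, so 0 is
         * returned here instead of dividing by zero. */
        return 0;
    }
    b = 2 * total / nvtxs;
    return a < b ? a : b;
}
```

Called with nvtxs equal to 0, as observed in the debugger, this returns 0 instead of trapping; for a non-empty graph it computes the same minimum as the original line.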
