Dear PETSc users,

I am trying to solve a large problem (about 9,000,000 unknowns) with a large number of processes (about 400 processes and 1 TB of memory). I believe these resources are reasonably large for this problem, because I was able to solve the same problem using serial MUMPS with 500 GB, although it took a very long time to compute. The same code was parallelized with PETSc. However, my PETSc code suddenly crashes after KSPSolve() successfully calls MUMPS, as shown below. If this problem came from MUMPS, I would expect MUMPS to produce an error report (ICNTL(4)=3), but no error report was generated. Has anyone had a similar experience with PETSc+MUMPS? I would welcome any comments on troubleshooting this. Thank you in advance for your help.
Regards,
Evan

Codes:

KSPCreate(PETSC_COMM_WORLD, &ksp);
KSPSetOperators(ksp, A, A);
KSPSetType(ksp, KSPPREONLY);
KSPGetPC(ksp, &pc);
MatSetOption(A, MAT_SPD, PETSC_TRUE);
PCSetType(pc, PCCHOLESKY);
PCFactorSetMatSolverPackage(pc, MATSOLVERMUMPS);
PCFactorSetUpMatSolverPackage(pc);
PCFactorGetMatrix(pc, &F);
KSPSetType(ksp, KSPCG);
MPI_Barrier(MPI_COMM_WORLD);
icntl=29; ival=2; // ParMetis
MatMumpsSetIcntl(F, icntl, ival);
icntl=4; ival=3; // Errors
MatMumpsSetIcntl(F, icntl, ival);
icntl=23; ival=1500;
MatMumpsSetIcntl(F, icntl, ival);
KSPSolve(ksp,b,x);

Errors:

Entering DMUMPS driver with JOB, N, NZ = 1 9778426 0
DMUMPS 4.10.0
L D L^T Solver for symmetric positive definite matrices
Type of parallelism: Working host
****** ANALYSIS STEP ********
Using ParMETIS for parallel ordering.
Structual symmetry is:100%
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: n0000.voltaire0
PID: 28131

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[n0000.voltaire0:28047] 1 more process has sent help message help-odls-default.txt / odls-default:could-not-kill
[n0000.voltaire0:28047] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: likely location of problem given in stack below
[1]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[1]PETSC ERROR: INSTEAD the line number of the start of the function
[1]PETSC ERROR: is given.
[1]PETSC ERROR: [1] MatCholeskyFactorSymbolic_MUMPS line 1076 /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/mat/impls/aij/mpi/mumps/mumps.c
[1]PETSC ERROR: [1] MatCholeskyFactorSymbolic line 2995 /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/mat/interface/matrix.c
[1]PETSC ERROR: [1] PCSetUp_Cholesky line 88 /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/ksp/pc/impls/factor/cholesky/cholesky.c
[1]PETSC ERROR: [1] KSPSetUp line 219 /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/ksp/ksp/interface/itfunc.c
[1]PETSC ERROR: [1] KSPSolve line 381 /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/ksp/ksp/interface/itfunc.c
[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1]PETSC ERROR: Signal received
[1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[1]PETSC ERROR: Petsc Release Version 3.5.0, Jun, 30, 2014
[1]PETSC ERROR: fetdem3dp on a arch-linux2-c-debug named n0000.voltaire0 by esum Wed Aug 27 13:48:51 2014
[1]PETSC ERROR: Configure options --prefix=/clusterfs/voltaire/home/software/modules/petsc/3.5.0 --download-fblaslapack=1 --download-mumps=1 --download-parmetis=1 --download-scalapack --download-metis=1 --with-mpi-dir=/global/software/sl-6.x86_64/modules/gcc/4.4.7/openmpi/1.6.5-gcc/
[1]PETSC ERROR: #1 User provided function() line 0 in unknown file
[5]PETSC ERROR: ------------------------------------------------------------------------
