Evan,
Please comment out your own MUMPS parameters and run the code with the
default icntl and ival. Does it still crash? If so, please send us the
entire error message. It is common to get a memory error in the numerical
factorization of MUMPS. I have rarely seen an error occur in the symbolic
phase.

Hong

On Wed, Aug 27, 2014 at 4:58 PM, Barry Smith <[email protected]> wrote:
>
>    Ok
>
> [11]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the 
> batch system) has told this process to end
>
> This message usually happens because either
>
> 1) the process ran out of memory or
> 2) the process took more time than the batch system allowed
>
> my guess is 1.  I don’t know how MUMPS does its symbolic factorization but my 
> guess is that it may have something in it that is not scalable per node thus 
> causing it to run out of memory. Hong knows more about this and may have 
> advice on how to proceed.
>
> Have you tried superlu_dist on the same problem?
>
>   Barry
>
>
>
>
> On Aug 27, 2014, at 4:52 PM, Evan Um <[email protected]> wrote:
>
>> Dear Barry,
>>
>> Attached is the whole error message file. Thanks for your help.
>>
>> Evan
>>
>>
>> On Wed, Aug 27, 2014 at 2:44 PM, Barry Smith <[email protected]> wrote:
>>
>> > MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
>>
>>
>>   Please send ALL the output. In particular, since rank 11 seems to have 
>> choked, we need to see all the messages from [11] to see what it thinks has 
>> gone wrong.
>>
>>    Barry
>>
>> On Aug 27, 2014, at 4:27 PM, Evan Um <[email protected]> wrote:
>>
>> > Dear PETSC users,
>> >
>> > I am trying to solve a large problem (about 9,000,000 unknowns) with a large 
>> > number of processes (about 400 processes and 1TB of memory). I believe this is a 
>> > reasonable amount of resources for this problem because I was able to 
>> > solve the same problem using serial MUMPS with 500GB, although it took 
>> > a very long time to compute.
>> > The same code was parallelized with PETSC. However, my code with PETSC 
>> > suddenly crashes after KSPSolve() successfully calls MUMPS, as shown below. 
>> > If this problem comes from MUMPS, I would expect MUMPS to produce an 
>> > error report (ICNTL(4)=3), but no error report was generated. Has 
>> > anyone had a similar experience with PETSC+MUMPS? I would welcome any comments on 
>> > troubleshooting it. Thank you in advance for your help.
>> >
>> > Regards,
>> > Evan
>> >
>> > Code:
>> >
>> > KSPCreate(PETSC_COMM_WORLD, &ksp);
>> > KSPSetOperators(ksp, A, A);
>> > KSPSetType (ksp, KSPPREONLY);
>> > KSPGetPC(ksp, &pc);
>> > MatSetOption(A, MAT_SPD, PETSC_TRUE);
>> > PCSetType(pc, PCCHOLESKY);
>> > PCFactorSetMatSolverPackage(pc, MATSOLVERMUMPS);
>> > PCFactorSetUpMatSolverPackage(pc);
>> > PCFactorGetMatrix(pc, &F);
>> > KSPSetType(ksp, KSPCG);
>> > MPI_Barrier(MPI_COMM_WORLD);
>> > icntl=29; ival=2; // ParMetis
>> > MatMumpsSetIcntl(F, icntl, ival);
>> > icntl=4; ival=3; // Errors
>> > MatMumpsSetIcntl(F, icntl, ival);
>> > icntl=23; ival=1500;
>> > MatMumpsSetIcntl(F, icntl, ival);
>> > KSPSolve(ksp,b,x);
>> >
>> >
>> >
>> > Errors:
>> >
>> > Entering DMUMPS driver with JOB, N, NZ =   1     9778426              0
>> >  DMUMPS 4.10.0
>> > L D L^T Solver for symmetric positive definite matrices
>> > Type of parallelism: Working host
>> >  ****** ANALYSIS STEP ********
>> > Using ParMETIS for parallel ordering.
>> > Structual symmetry is:100%
>> > --------------------------------------------------------------------------
>> > WARNING: A process refused to die!
>> > Host: n0000.voltaire0
>> > PID:  28131
>> > This process may still be running and/or consuming resources.
>> > --------------------------------------------------------------------------
>> > [n0000.voltaire0:28047] 1 more process has sent help message 
>> > help-odls-default.txt / odls-default:could-not-kill
>> > [n0000.voltaire0:28047] Set MCA parameter "orte_base_help_aggregate" to 0 
>> > to see all help / error messages
>> > --------------------------------------------------------------------------
>> > MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
>> > with errorcode 59.
>> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> > You may or may not see output from other processes, depending on
>> > exactly when Open MPI kills them.
>> > --------------------------------------------------------------------------
>> > [1]PETSC ERROR: 
>> > ------------------------------------------------------------------------
>> > [1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the 
>> > batch system) has told this process to end
>> > [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> > [1]PETSC ERROR: or see 
>> > http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> > [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find 
>> > memory corruption errors
>> > [1]PETSC ERROR: likely location of problem given in stack below
>> > [1]PETSC ERROR: ---------------------  Stack Frames 
>> > ------------------------------------
>> > [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not 
>> > available,
>> > [1]PETSC ERROR:       INSTEAD the line number of the start of the function
>> > [1]PETSC ERROR:       is given.
>> > [1]PETSC ERROR: [1] MatCholeskyFactorSymbolic_MUMPS line 1076 
>> > /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/mat/impls/aij/mpi/mumps/mumps.c
>> > [1]PETSC ERROR: [1] MatCholeskyFactorSymbolic line 2995 
>> > /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/mat/interface/matrix.c
>> > [1]PETSC ERROR: [1] PCSetUp_Cholesky line 88 
>> > /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/ksp/pc/impls/factor/cholesky/cholesky.c
>> > [1]PETSC ERROR: [1] KSPSetUp line 219 
>> > /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/ksp/ksp/interface/itfunc.c
>> > [1]PETSC ERROR: [1] KSPSolve line 381 
>> > /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/ksp/ksp/interface/itfunc.c
>> > [1]PETSC ERROR: --------------------- Error Message 
>> > --------------------------------------------------------------
>> > [1]PETSC ERROR: Signal received
>> > [1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html 
>> > for trouble shooting.
>> > [1]PETSC ERROR: Petsc Release Version 3.5.0, Jun, 30, 2014
>> > [1]PETSC ERROR: fetdem3dp on a arch-linux2-c-debug named n0000.voltaire0 
>> > by esum Wed Aug 27 13:48:51 2014
>> > [1]PETSC ERROR: Configure options 
>> > --prefix=/clusterfs/voltaire/home/software/modules/petsc/3.5.0 
>> > --download-fblaslapack=1 --download-mumps=1 --download-parmetis=1 
>> > --download-scalapack --download-metis=1 
>> > --with-mpi-dir=/global/software/sl-6.x86_64/modules/gcc/4.4.7/openmpi/1.6.5-gcc/
>> > [1]PETSC ERROR: #1 User provided function() line 0 in  unknown file
>> > [5]PETSC ERROR: 
>> > ------------------------------------------------------------------------
>>
>>
>> <slurm-504727.out>
>
