Alfredo:
It would be much easier to install PETSc with MUMPS and ParMETIS and debug this case directly. Here is what you can do on a Linux machine (see http://www.mcs.anl.gov/petsc/documentation/installation.html):
1) Get petsc-release:
   git clone -b maint https://bitbucket.org/petsc/petsc petsc
   cd petsc
   git pull
   export PETSC_DIR=$PWD
   export PETSC_ARCH=<arch-name>

2) Configure petsc with the additional options
   '--download-metis --download-parmetis --download-mumps --download-scalapack --download-ptscotch'
   (see http://www.mcs.anl.gov/petsc/documentation/installation.html).

3) Build petsc and test it:
   make
   make test

4) Test ex53.c:
   cd $PETSC_DIR/src/ksp/ksp/examples/tutorials
   make ex53
   mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2

5) Debug ex53.c:
   mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2 -start_in_debugger

Give it a try. Contact us if you cannot reproduce this case.

Hong
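For reference, the -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2 options used in steps 4) and 5) can also be set from source code through PETSc's MUMPS interface. The following is a minimal, untested sketch assuming the PETSc 3.7-era names (PCFactorSetMatSolverPackage and PCFactorSetUpMatSolverPackage); SolveWithParallelAnalysis, A, b and x are placeholder names, and error handling is reduced to CHKERRQ.

/* Minimal sketch (untested): select MUMPS LU from code and set ICNTL(28)/ICNTL(29),
 * the same controls that -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2 set at run time.
 * Assumes PETSc 3.7-era names; A, b, x are an assembled MPIAIJ matrix and matching vectors. */
#include <petscksp.h>

PetscErrorCode SolveWithParallelAnalysis(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PC             pc;
  Mat            F;   /* the MUMPS factor matrix */
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPCreate(PetscObjectComm((PetscObject)A),&ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp,A,A);CHKERRQ(ierr);
  ierr = KSPSetType(ksp,KSPPREONLY);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
  ierr = PCSetType(pc,PCLU);CHKERRQ(ierr);
  ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);CHKERRQ(ierr);
  ierr = PCFactorSetUpMatSolverPackage(pc);CHKERRQ(ierr);   /* create F so MUMPS options can be set */
  ierr = PCFactorGetMatrix(pc,&F);CHKERRQ(ierr);
  ierr = MatMumpsSetIcntl(F,28,2);CHKERRQ(ierr);            /* ICNTL(28)=2: parallel analysis */
  ierr = MatMumpsSetIcntl(F,29,2);CHKERRQ(ierr);            /* ICNTL(29)=2: ParMETIS ordering */
  ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Setting ICNTL(28)/ICNTL(29) through MatMumpsSetIcntl before the factorization should have the same effect as passing the options on the command line.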
> Dear all,
> this may well be due to a bug in the parallel analysis. Do you think you can reproduce the problem in a standalone MUMPS program (i.e., without going through PETSc)? That would save a lot of time to track the bug, since we do not have a PETSc install at hand. Otherwise we'll give it a shot at installing PETSc and reproducing the problem on our side.
>
> Kind regards,
> the MUMPS team
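A standalone reproducer along these lines might serve that purpose. The following is only a rough, untested sketch: it feeds a row-distributed tridiagonal matrix of order N to dmumps_c (ICNTL(18)=3) and runs the analysis with ICNTL(28)=2 and ICNTL(29)=2, loosely imitating what the PETSc/MUMPS interface does for ex53 with -n 10000; the matrix, the work distribution and all names are placeholders rather than the exact system ex53 builds, and default 32-bit MUMPS integers with the MUMPS 5.0.x C interface are assumed.

/* Rough, untested sketch of a standalone MUMPS reproducer (no PETSc): a tridiagonal
 * matrix of order N is entered in distributed form (ICNTL(18)=3) and analysed in
 * parallel with ParMETIS (ICNTL(28)=2, ICNTL(29)=2).  The matrix and all names are
 * placeholders, not the exact system ex53 assembles. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include "dmumps_c.h"

#define ICNTL(I) icntl[(I)-1]        /* 1-based ICNTL access, as in the MUMPS C examples */
#define USE_COMM_WORLD -987654       /* tells MUMPS to use MPI_COMM_WORLD */

int main(int argc, char **argv)
{
  DMUMPS_STRUC_C id;
  int rank, size, N = 10000;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* this process owns rows [first, last) of a 1-D Laplacian (-1, 2, -1) */
  int first = rank * (N / size);
  int last  = (rank == size - 1) ? N : first + N / size;
  int nloc  = last - first;
  int    *irn = malloc(3 * nloc * sizeof(int));
  int    *jcn = malloc(3 * nloc * sizeof(int));
  double *a   = malloc(3 * nloc * sizeof(double));

  int k = 0;
  for (int i = first; i < last; i++) {              /* 1-based row/column indices */
    irn[k] = i + 1; jcn[k] = i + 1; a[k++] =  2.0;
    if (i > 0)     { irn[k] = i + 1; jcn[k] = i;     a[k++] = -1.0; }
    if (i < N - 1) { irn[k] = i + 1; jcn[k] = i + 2; a[k++] = -1.0; }
  }

  id.comm_fortran = USE_COMM_WORLD;
  id.par = 1; id.sym = 0;
  id.job = -1;                                      /* JOB=-1: initialize the instance */
  dmumps_c(&id);

  id.n = N;
  id.nz_loc  = k;                                   /* local entries in coordinate format */
  id.irn_loc = irn; id.jcn_loc = jcn; id.a_loc = a;
  id.ICNTL(18) = 3;                                 /* matrix is distributed among the processes */
  id.ICNTL(28) = 2;                                 /* parallel analysis */
  id.ICNTL(29) = 2;                                 /* use ParMETIS for the parallel ordering */

  id.job = 1;                                       /* JOB=1: analysis only, where the crash occurs */
  dmumps_c(&id);
  if (rank == 0) printf("analysis finished, INFOG(1) = %d\n", id.infog[0]);

  id.job = -2;                                      /* JOB=-2: terminate the instance */
  dmumps_c(&id);
  free(irn); free(jcn); free(a);
  MPI_Finalize();
  return 0;
}

Running it with, e.g., mpiexec -n 4 would exercise the parallel analysis path on the same process count as the failing PETSc runs below.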
>
> On Wed, Oct 19, 2016 at 8:32 PM, Barry Smith <[email protected]> wrote:
>>
>> Tim,
>>
>> You can/should also run with valgrind to determine exactly the first point with memory corruption issues.
>>
>> Barry
>>
>> > On Oct 19, 2016, at 11:08 AM, Hong <[email protected]> wrote:
>> >
>> > Tim:
>> > With '-mat_mumps_icntl_28 1', i.e., sequential analysis, I can run ex56 with np=3 or larger successfully.
>> >
>> > With '-mat_mumps_icntl_28 2', i.e., parallel analysis, I can run up to np=3.
>> >
>> > For np=4:
>> > mpiexec -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2 -start_in_debugger
>> >
>> > the code crashes inside MUMPS:
>> >
>> > Program received signal SIGSEGV, Segmentation fault.
>> > 0x00007f33d75857cb in dmumps_parallel_analysis::dmumps_build_scotch_graph (id=..., first=..., last=..., ipe=..., pe=<error reading variable: Cannot access memory at address 0x0>, work=...) at dana_aux_par.F:1450
>> > 1450      MAPTAB(J) = I
>> > (gdb) bt
>> > #0  0x00007f33d75857cb in dmumps_parallel_analysis::dmumps_build_scotch_graph (id=..., first=..., last=..., ipe=..., pe=<error reading variable: Cannot access memory at address 0x0>, work=...) at dana_aux_par.F:1450
>> > #1  0x00007f33d759207c in dmumps_parallel_analysis::dmumps_parmetis_ord (id=..., ord=..., work=...) at dana_aux_par.F:400
>> > #2  0x00007f33d7592d14 in dmumps_parallel_analysis::dmumps_do_par_ord (id=..., ord=..., work=...) at dana_aux_par.F:351
>> > #3  0x00007f33d7593aa9 in dmumps_parallel_analysis::dmumps_ana_f_par (id=..., work1=..., work2=..., nfsiz=..., fils=<error reading variable: Cannot access memory at address 0x0>, frere=<error reading variable: Cannot access memory at address 0x0>) at dana_aux_par.F:98
>> > #4  0x00007f33d74c622a in dmumps_ana_driver (id=...) at dana_driver.F:563
>> > #5  0x00007f33d747706b in dmumps (id=...) at dmumps_driver.F:1108
>> > #6  0x00007f33d74721b5 in dmumps_f77 (job=1, sym=0, par=1, comm_f77=-2080374779, n=10000, icntl=..., cntl=..., keep=..., dkeep=..., keep8=..., nz=0, irn=..., irnhere=0, jcn=..., jcnhere=0, a=..., ahere=0, nz_loc=7500, irn_loc=..., irn_lochere=1, jcn_loc=..., jcn_lochere=1, a_loc=..., a_lochere=1, nelt=0, eltptr=..., eltptrhere=0, eltvar=..., eltvarhere=0, a_elt=..., a_elthere=0, perm_in=..., perm_inhere=0, rhs=..., rhshere=0, redrhs=..., redrhshere=0, info=..., rinfo=..., infog=..., rinfog=..., deficiency=0, lwk_user=0, size_schur=0, listvar_schur=..., listvar_schurhere=0, schur=..., schurhere=0, wk_user=..., wk_userhere=0, colsca=..., colscahere=0, rowsca=..., rowscahere=0, instance_number=1, nrhs=1, lrhs=0, lredrhs=0, rhs_sparse=..., rhs_sparsehere=0, sol_loc=..., sol_lochere=0, irhs_sparse=..., irhs_sparsehere=0, irhs_ptr=..., irhs_ptrhere=0, isol_loc=..., isol_lochere=0, nz_rhs=0, lsol_loc=0, schur_mloc=0, schur_nloc=0, schur_lld=0, mblock=0, nblock=0, nprow=0, npcol=0, ooc_tmpdir=..., ooc_prefix=..., write_problem=..., tmpdirlen=20, prefixlen=20, write_problemlen=20) at dmumps_f77.F:260
>> > #7  0x00007f33d74709b1 in dmumps_c (mumps_par=0x16126f0) at mumps_c.c:415
>> > #8  0x00007f33d68408ca in MatLUFactorSymbolic_AIJMUMPS (F=0x1610280, A=0x14bafc0, r=0x160cc30, c=0x1609ed0, info=0x15c6708) at /scratch/hzhang/petsc/src/mat/impls/aij/mpi/mumps/mumps.c:1487
>> >
>> > -mat_mumps_icntl_29 = 0 or 1 gives the same error.
>> > I'm cc'ing this email to the MUMPS developers, who may help to resolve this matter.
>> >
>> > Hong
>> >
>> >
>> > Hi all,
>> >
>> > I have some problems with PETSc using MUMPS and ParMETIS. In some cases it works fine, but in some others it doesn't, so I am trying to understand what is happening.
>> >
>> > I just picked the following example:
>> > http://www.mcs.anl.gov/petsc/petsc-current/src/ksp/ksp/examples/tutorials/ex53.c.html
>> >
>> > Now, when I start it with fewer than 4 processes it works as expected:
>> > mpirun -n 3 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1 -mat_mumps_icntl_29 2
>> >
>> > But with 4 or more processes it crashes, though only when I am using ParMETIS:
>> > mpirun -n 4 ./ex53 -n 10000 -ksp_view -mat_mumps_icntl_28 1 -mat_mumps_icntl_29 2
>> >
>> > METIS worked in every case I tried, without any problems.
>> >
>> > I wonder if I am doing something wrong, or if this is a general problem or even a bug? Is ParMETIS supposed to work with that example on 4 processes?
>> >
>> > Thanks a lot and kind regards,
>> > Volker
>> >
>> > Here is the error log of process 0:
>> >
>> > Entering DMUMPS 5.0.1 driver with JOB, N = 1 10000
>> > =================================================
>> > MUMPS compiled with option -Dmetis
>> > MUMPS compiled with option -Dparmetis
>> > =================================================
>> > L U Solver for unsymmetric matrices
>> > Type of parallelism: Working host
>> >
>> > ****** ANALYSIS STEP ********
>> >
>> > ** Max-trans not allowed because matrix is distributed
>> > Using ParMETIS for parallel ordering.
>> >
>> > [0]PETSC ERROR: ------------------------------------------------------------------------
>> > [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>> > [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> > [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> > [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> > [0]PETSC ERROR: likely location of problem given in stack below
>> > [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> > [0]PETSC ERROR:       INSTEAD the line number of the start of the function is given.
>> > [0]PETSC ERROR: [0] MatLUFactorSymbolic_AIJMUMPS line 1395 /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/mat/impls/aij/mpi/mumps/mumps.c
>> > [0]PETSC ERROR: [0] MatLUFactorSymbolic line 2927 /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/mat/interface/matrix.c
>> > [0]PETSC ERROR: [0] PCSetUp_LU line 101 /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/pc/impls/factor/lu/lu.c
>> > [0]PETSC ERROR: [0] PCSetUp line 930 /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/pc/interface/precon.c
>> > [0]PETSC ERROR: [0] KSPSetUp line 305 /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/ksp/interface/itfunc.c
>> > [0]PETSC ERROR: [0] KSPSolve line 563 /fsgarwinhpc/133/petsc/sources/petsc-3.7.4a/src/ksp/ksp/interface/itfunc.c
>> > [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>> > [0]PETSC ERROR: Signal received
>> > [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>> > [0]PETSC ERROR: Petsc Release Version 3.7.4, Oct, 02, 2016
>> > [0]PETSC ERROR: ./ex53 on a linux-manni-mumps named manni by 133 Wed Oct 19 16:39:49 2016
>> > [0]PETSC ERROR: Configure options --with-cc=mpiicc --with-cxx=mpiicpc --with-fc=mpiifort --with-shared-libraries=1 --with-valgrind-dir=~/usr/valgrind/ --with-mpi-dir=/home/software/intel/Intel-2016.4/compilers_and_libraries_2016.4.258/linux/mpi --download-scalapack --download-mumps --download-metis --download-metis-shared=0 --download-parmetis --download-parmetis-shared=0
>> > [0]PETSC ERROR: #1 User provided function() line 0 in unknown file
>> > application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
>
> --
> -----------------------------------------
> Alfredo Buttari, PhD
> CNRS-IRIT
> 2 rue Camichel, 31071 Toulouse, France
> http://buttari.perso.enseeiht.fr
