   Run with the debug version of the libraries under valgrind (http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind) to see if any memory corruption issues come up.
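   A typical invocation, roughly following that FAQ page (the process count, executable name, and trailing options below are placeholders for your own run; --log-file=valgrind.log.%p writes one log per MPI process, and the PETSc option -malloc off keeps PETSc from wrapping malloc so valgrind's reports map directly to the real allocations):

      mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p ./probGD.opt -malloc off <your usual options>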
   My guess is simply that it has run out of allocatable memory.

   Barry

> On Jun 17, 2016, at 3:55 PM, Eric Chamberland <[email protected]> wrote:
>
> Hi,
>
> We got another run on the cluster with PETSc 3.5.4 compiled with 64-bit indices (see the end of this message for the configure options).
>
> This time the execution terminated with a segmentation violation and the following backtrace:
>
> Thu Jun 16 16:03:08 2016<stderr>:#000: reqBacktrace(std::string&) >>> /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
> Thu Jun 16 16:03:08 2016<stderr>:#001: attacheDebugger() >>> /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
> Thu Jun 16 16:03:08 2016<stderr>:#002: /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x305) [0x2b6c4885a875]
> Thu Jun 16 16:03:08 2016<stderr>:#003: /lib64/libc.so.6(+0x326a0) [0x2b6c502156a0]
> Thu Jun 16 16:03:08 2016<stderr>:#004: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0xee5) [0x2b6c55e99ab5]
> Thu Jun 16 16:03:08 2016<stderr>:#005: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58) [0x2b6c55e9b8f8]
> Thu Jun 16 16:03:08 2016<stderr>:#006: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0x3ff) [0x2b6c55e9c50f]
> Thu Jun 16 16:03:08 2016<stderr>:#007: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(PetscMallocAlign+0x17) [0x2b6c4a49ecc7]
> Thu Jun 16 16:03:08 2016<stderr>:#008: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMultSymbolic_MPIAIJ_MPIAIJ+0x37d) [0x2b6c4a915eed]
> Thu Jun 16 16:03:08 2016<stderr>:#009: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult_MPIAIJ_MPIAIJ+0x1a3) [0x2b6c4a915713]
> Thu Jun 16 16:03:08 2016<stderr>:#010: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult+0x5f4) [0x2b6c4a630f44]
> Thu Jun 16 16:03:08 2016<stderr>:#011: girefMatMatMult(MatricePETSc const&, MatricePETSc const&, MatricePETSc&, MatReuse) >>> /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Petsc.so
>
> Looking at the MatMatMultSymbolic_MPIAIJ_MPIAIJ modifications, and since I run with 64-bit indices, either there is something really bad in what we are trying to do, or there is still something wrong in this routine...
>
> What is your advice, and how could I retrieve more information if I am able to launch it again?
>
> Would -malloc_dump or -malloc_log help, or anything else?
>
> (The very same computation passed with 240M unknowns.)
>
> Thanks for your insights!
>
> Eric
>
> Here are the configure options:
>
> static const char *petscconfigureoptions = "PETSC_ARCH=linux-gnu-intel CFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\" FFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\" --prefix=/software6/libs/petsc/3.5.4_intel_openmpi1.8.8 --with-x=0 --with-mpi-compilers=1 --with-mpi-dir=/software6/mpi/openmpi/1.8.8_intel --known-mpi-shared-libraries=1 --with-debugging=no --with-64-bit-indices=1 --with-shared-libraries=1 --with-blas-lapack-dir=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64 --with-scalapack=1 --with-scalapack-include=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/include --with-scalapack-lib=\"-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64\" --download-ptscotch=1 --download-superlu_dist=yes --download-parmetis=yes --download-metis=yes --download-hypre=yes";
>
>
> On 16/11/15 07:12 PM, Barry Smith wrote:
>>
>>    I have started a branch with utilities to help catch/handle these integer overflow issues:
>>    https://bitbucket.org/petsc/petsc/pull-requests/389/add-utilities-for-handling-petscint/diff
>>    All suggestions are appreciated.
>>
>>    Barry
>>
>>> On Nov 16, 2015, at 12:26 PM, Eric Chamberland <[email protected]> wrote:
>>>
>>> Barry,
>>>
>>> I can't launch the code again and retrieve more information, since I am not allowed to do so: the cluster has around 780 nodes and I got very special permission to reserve 530 of them...
>>>
>>> So the best I can do is to give you the backtrace PETSc gave me... :/ (see the first post with the backtrace: http://lists.mcs.anl.gov/pipermail/petsc-users/2015-November/027644.html)
>>>
>>> And until today, all smaller meshes with the same solver completed successfully... (I went up to 219 million unknowns on 64 nodes.)
>>>
>>> I understand, then, that some use of PetscInt64 in the current code could help fix problems like the one I got. I find it a big challenge to track down every occurrence of this kind of overflow in the code, given the size of the systems required to reproduce the problem...
>>>
>>> Eric
>>>
>>>
>>> On 16/11/15 12:40 PM, Barry Smith wrote:
>>>>
>>>>    Eric,
>>>>
>>>>    The behavior you get, with bizarre integers and a crash, is not the behavior we want. We would like to detect these overflows appropriately. If you can track through the error and determine the location where the overflow occurs, then we will gladly put in additional checks and use of PetscInt64 to handle these things better. So let us know the exact cause and we'll improve the code.
>>>>
>>>>    Barry
>>>>
>>>>
>>>>
>>>>> On Nov 16, 2015, at 11:11 AM, Eric Chamberland <[email protected]> wrote:
>>>>>
>>>>> On 16/11/15 10:42 AM, Matthew Knepley wrote:
>>>>>> Sometimes when we do not have exact counts, we need to overestimate sizes. This is especially true in sparse MatMat.
>>>>>
>>>>> OK... so, to be sure, am I correct in saying that recompiling PETSc with "--with-64-bit-indices" is the only solution to my problem?
>>>>>
>>>>> I mean, does no other fix exist for this overestimation in a more recent release of PETSc, like putting the result in a "long int" instead?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Eric
>>>>>
>>>
>
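On the overflow side of the discussion, a minimal sketch of the kind of guard such utilities can provide. It assumes the default 32-bit PetscInt build and the PetscInt64 type of newer PETSc releases; the helper name is hypothetical and this is not the actual PETSc code or the code in the branch above. It checks a product such as an overestimated nonzero count in sparse MatMat before it silently wraps around:

#include <petscsys.h>

/* Sketch only, hypothetical helper, not the actual PETSc API: multiply two
   PetscInt values (e.g. counts used to overestimate the size of a sparse
   MatMat product) in 64-bit arithmetic first, and raise a clean error
   instead of silently wrapping when a 32-bit PetscInt cannot hold the
   result. */
static PetscErrorCode SketchCheckedMult(PetscInt a, PetscInt b, PetscInt *result)
{
  PetscInt64 prod;

  PetscFunctionBegin;
  prod = (PetscInt64)a * (PetscInt64)b;
  if (prod > PETSC_MAX_INT) SETERRQ2(PETSC_COMM_SELF, PETSC_ERR_SUP,
    "Product of %D and %D overflows PetscInt; reconfigure with --with-64-bit-indices", a, b);
  *result = (PetscInt)prod;
  PetscFunctionReturn(0);
}

With --with-64-bit-indices such products fit directly in PetscInt, which is consistent with the 64-bit build above failing later and in a different way: inside PetscMallocAlign, as an apparent out-of-memory condition rather than with bizarre integer values.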
