Run with the debug version of the libraries under valgrind to see if any
memory corruption issues come up:
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind

   My guess is simply that it has run out of allocatable memory.
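
   One quick way to test that guess on a future run is to log per-rank memory
just before the big product. A minimal sketch (PetscMemoryGetCurrentUsage and
PetscMallocGetCurrentUsage are the relevant calls; the helper itself and where
you hook it in are made up):

    #include <petscsys.h>

    /* Hypothetical helper: print per-rank memory usage at a labelled point,
       e.g. immediately before the large MatMatMult.  PetscMallocGetCurrentUsage
       is only meaningful when PETSc malloc logging is active (debug builds or
       the -malloc option). */
    static PetscErrorCode LogMemory(MPI_Comm comm, const char *where)
    {
      PetscLogDouble rss, mal;
      PetscMPIInt    rank;
      PetscErrorCode ierr;

      ierr = MPI_Comm_rank(comm, &rank);CHKERRQ(ierr);
      ierr = PetscMemoryGetCurrentUsage(&rss);CHKERRQ(ierr);  /* resident set size, bytes     */
      ierr = PetscMallocGetCurrentUsage(&mal);CHKERRQ(ierr);  /* PETSc-tracked mallocs, bytes */
      ierr = PetscSynchronizedPrintf(comm, "[%d] %s: rss %.0f MB, petsc malloc %.0f MB\n",
                                     rank, where, rss/1.0e6, mal/1.0e6);CHKERRQ(ierr);
      ierr = PetscSynchronizedFlush(comm, PETSC_STDOUT);CHKERRQ(ierr);
      return 0;
    }

   Called right before the MatMatMult on the failing case, this would show
whether some ranks are already close to the node's memory limit.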

   Barry



> On Jun 17, 2016, at 3:55 PM, Eric Chamberland 
> <[email protected]> wrote:
> 
> Hi,
> 
> We got another run on the cluster with PETSc 3.5.4 compiled with 64-bit 
> indices (see the end of the message for the configure options).
> 
> This time, the execution terminated with a segmentation violation with the 
> following backtrace:
> 
> Thu Jun 16 16:03:08 2016<stderr>:#000: reqBacktrace(std::string&)  >>> /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
> Thu Jun 16 16:03:08 2016<stderr>:#001: attacheDebugger()  >>> /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
> Thu Jun 16 16:03:08 2016<stderr>:#002: /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x305) [0x2b6c4885a875]
> Thu Jun 16 16:03:08 2016<stderr>:#003: /lib64/libc.so.6(+0x326a0) [0x2b6c502156a0]
> Thu Jun 16 16:03:08 2016<stderr>:#004: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0xee5) [0x2b6c55e99ab5]
> Thu Jun 16 16:03:08 2016<stderr>:#005: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58) [0x2b6c55e9b8f8]
> Thu Jun 16 16:03:08 2016<stderr>:#006: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0x3ff) [0x2b6c55e9c50f]
> Thu Jun 16 16:03:08 2016<stderr>:#007: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(PetscMallocAlign+0x17) [0x2b6c4a49ecc7]
> Thu Jun 16 16:03:08 2016<stderr>:#008: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMultSymbolic_MPIAIJ_MPIAIJ+0x37d) [0x2b6c4a915eed]
> Thu Jun 16 16:03:08 2016<stderr>:#009: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult_MPIAIJ_MPIAIJ+0x1a3) [0x2b6c4a915713]
> Thu Jun 16 16:03:08 2016<stderr>:#010: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult+0x5f4) [0x2b6c4a630f44]
> Thu Jun 16 16:03:08 2016<stderr>:#011: girefMatMatMult(MatricePETSc const&, MatricePETSc const&, MatricePETSc&, MatReuse)  >>> /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Petsc.so
> 
> Now, looking at the MatMatMultSymbolic_MPIAIJ_MPIAIJ modifications: since I 
> am running with 64-bit indices, either there is something really bad in what 
> we are trying to do, or there is still an issue in this routine...
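> 
> For illustration only (this is a sketch of the kind of bound a symbolic 
> sparse product has to allocate for, not the actual PETSc routine; the array 
> names are made up, and PetscInt64 here just means a 64-bit integer type):
> 
>   #include <petscsys.h>
> 
>   /* Upper bound on nnz(C) for C = A*B from A's CSR structure (ai, aj) and
>      the per-row nonzero counts of B (bnnz).  The accumulator is 64-bit, so
>      with --with-64-bit-indices the count itself is safe; the danger is the
>      size of the allocation it implies. */
>   static PetscInt64 NnzUpperBound(PetscInt mlocal, const PetscInt *ai,
>                                   const PetscInt *aj, const PetscInt *bnnz)
>   {
>     PetscInt64 bound = 0;
>     for (PetscInt i = 0; i < mlocal; i++)
>       for (PetscInt j = ai[i]; j < ai[i+1]; j++)
>         bound += bnnz[aj[j]];
>     return bound;
>   }
> 
> With 64-bit indices each entry of the symbolic result costs 8 bytes for the 
> column index alone, so a bound of a few billion entries already means tens 
> of GB on the ranks owning the densest rows, which would be consistent with 
> malloc failing inside MatMatMultSymbolic_MPIAIJ_MPIAIJ.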
> 
> What is your advice, and how could I retrieve more information if I am able 
> to launch it again?
> 
> Would -malloc_dump or -malloc_log help, or anything else?
> 
> (The very same computation passed with 240M unknowns.)
> 
> Thanks for your insights!
> 
> Eric
> 
> here are the configure options:
> static const char *petscconfigureoptions = "PETSC_ARCH=linux-gnu-intel 
> CFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\" 
> FFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\" 
> --prefix=/software6/libs/petsc/3.5.4_intel_openmpi1.8.8 --with-x=0 --with-mpi-compilers=1 
> --with-mpi-dir=/software6/mpi/openmpi/1.8.8_intel --known-mpi-shared-libraries=1 
> --with-debugging=no --with-64-bit-indices=1 --with-shared-libraries=1 
> --with-blas-lapack-dir=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64 
> --with-scalapack=1 
> --with-scalapack-include=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/include 
> --with-scalapack-lib=\"-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64\" 
> --download-ptscotch=1 --download-superlu_dist=yes --download-parmetis=yes 
> --download-metis=yes --download-hypre=yes";
> 
> 
> On 16/11/15 07:12 PM, Barry Smith wrote:
>> 
>>   I have started a branch with utilities to help catch/handle these integer 
>> overflow issues; all suggestions are appreciated:
>> https://bitbucket.org/petsc/petsc/pull-requests/389/add-utilities-for-handling-petscint/diff
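>> 
>>   For illustration, a minimal sketch of the kind of guarded arithmetic such 
>> utilities could provide (not the actual pull-request code; PetscInt64 here 
>> just means a 64-bit integer type, and the inputs are assumed to be 
>> nonnegative sizes/counts):
>> 
>>   #include <petscsys.h>
>> 
>>   /* Hypothetical checked multiply: do the product in 64 bits and error out
>>      if the result does not fit in PetscInt. */
>>   static PetscErrorCode CheckedIntMult(PetscInt a, PetscInt b, PetscInt *c)
>>   {
>>     PetscInt64 p = (PetscInt64)a * (PetscInt64)b;
>>     if (p > PETSC_MAX_INT) SETERRQ(PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE,
>>       "Integer overflow in size computation; configure PETSc with --with-64-bit-indices");
>>     *c = (PetscInt)p;
>>     return 0;
>>   }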
>> 
>>   Barry
>> 
>>> On Nov 16, 2015, at 12:26 PM, Eric Chamberland 
>>> <[email protected]> wrote:
>>> 
>>> Barry,
>>> 
>>> I can't launch the code again to retrieve more information, since I am not 
>>> allowed to do so: the cluster has around ~780 nodes and I was given very 
>>> special permission to reserve 530 of them...
>>> 
>>> So the best I can do is to give you the backtrace PETSc gave me... :/
>>> (see the first post with the backtrace: 
>>> http://lists.mcs.anl.gov/pipermail/petsc-users/2015-November/027644.html)
>>> 
>>> And until today, all smaller meshes with the same solver completed 
>>> successfully... (I went up to 219 million unknowns on 64 nodes.)
>>> 
>>> I understand, then, that some use of PetscInt64 in the current code could 
>>> help fix problems like the one I got.  I find it a big challenge to track 
>>> down every occurrence of this kind of overflow in the code, given the size 
>>> of the systems needed to reproduce the problem...
>>> 
>>> Eric
>>> 
>>> 
>>> On 16/11/15 12:40 PM, Barry Smith wrote:
>>>> 
>>>>   Eric,
>>>> 
>>>>     The behavior you get with bizarre integers and a crash is not the 
>>>> behavior we want. We would like to detect these overflows appropriately.   
>>>> If you can track through the error and determine the location where the 
>>>> overflow occurs then we would gladly put in additional checks and use of 
>>>> PetscInt64 to handle these things better. So let us know the exact cause 
>>>> and we'll improve the code.
>>>> 
>>>>   Barry
>>>> 
>>>> 
>>>> 
>>>>> On Nov 16, 2015, at 11:11 AM, Eric Chamberland 
>>>>> <[email protected]> wrote:
>>>>> 
>>>>> On 16/11/15 10:42 AM, Matthew Knepley wrote:
>>>>>> Sometimes when we do not have exact counts, we need to overestimate
>>>>>> sizes. This is especially true
>>>>>> in sparse MatMat.
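>>>>>> 
>>>>>> (For a sense of scale, with made-up numbers: a product matrix with 5e8 
>>>>>> rows and an overestimate of ~50 nonzeros per row already gives ~2.5e10 
>>>>>> entries, far beyond the 2^31-1 ~ 2.1e9 range of a 32-bit PetscInt, 
>>>>>> which is how an overestimated size can wrap into a bizarre or negative 
>>>>>> value before it ever reaches malloc.)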
>>>>> 
>>>>> Ok... so, to be sure: am I correct in saying that recompiling PETSc with
>>>>> "--with-64-bit-indices" is the only solution to my problem?
>>>>> 
>>>>> I mean, is there no other fix for this overestimation in a more recent 
>>>>> release of PETSc, such as putting the result in a "long int" instead?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Eric
>>>>> 
>>> 
> 
