valgrind doesn't care how many processes you use it on. Sometimes brute force 
is the best way to go.
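
For what it is worth, the usual recipe (from the PETSc valgrind FAQ) is to put 
valgrind between the MPI launcher and the executable and give each rank its own 
log file, roughly like this (adapt the launcher, process count, executable and 
options to your setup):

   mpiexec -n 1024 valgrind --tool=memcheck -q --num-callers=20 \
       --log-file=valgrind.log.%p ./probGD.opt <your usual options>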

   Barry

> On Jun 20, 2016, at 10:14 PM, Eric Chamberland 
> <[email protected]> wrote:
> 
> Hi,
> 
> for sure, valgrind is a reliable tool for finding memory corruption... but 
> launching it on a 1k-process job sounds "funny" and unfeasible to me... I 
> really don't know what the biggest job is that valgrind has ever been run 
> on, on any cluster...
> 
> Anyway...
> 
> I looked around in the code for now... It seems that 
> MatMatMult_MPIAIJ_MPIAIJ switches between two different functions:
> 
> MatMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable
> and
> MatMatMultSymbolic_MPIAIJ_MPIAIJ
> 
> and, by default, I think the non-scalable version is used...
> 
> until someone uses the "-matmatmult_via scalable" option (is there any way I 
> should have found this?)
> 
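> For the record, I understand this is just a runtime option, so on a relaunch 
> I would pass something like (the process count and other options here are 
> placeholders):
> 
>    mpiexec -n 1024 ./probGD.opt <usual options> -matmatmult_via scalable
> 
> if I have understood the option handling correctly.
> 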
> So I have a few questions/remarks:
> 
> #1- Maybe my problem is just that the non-scalable version is used on 1.7 
> billion unknowns...?
> #2- Is it possible/feasible to make a better choice than defaulting to the 
> non-scalable version?
> 
> I would like to have some hints before launching this large computation again...
> 
> Thanks,
> 
> Eric
> 
> 
> On 2016-06-17 17:02, Barry Smith wrote:
>>    Run with the debug version of the libraries under valgrind 
>> (http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind) to see if any 
>> memory corruption issues come up.
>> 
>>    My guess is simply that it has run out of allocatable memory.
>> 
>>    Barry
>> 
>> 
>> 
>>> On Jun 17, 2016, at 3:55 PM, Eric Chamberland 
>>> <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> We got another run on the cluster with PETSc 3.5.4 compiled with 64-bit 
>>> indices (see end of message for configure options).
>>> 
>>> This time, the execution terminated with a segmentation violation with the 
>>> following backtrace:
>>> 
>>> Thu Jun 16 16:03:08 2016<stderr>:#000: reqBacktrace(std::string&)  >>> 
>>> /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
>>> Thu Jun 16 16:03:08 2016<stderr>:#001: attacheDebugger()  >>> 
>>> /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
>>> Thu Jun 16 16:03:08 2016<stderr>:#002: 
>>> /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x305)
>>>  [0x2b6c4885a875]
>>> Thu Jun 16 16:03:08 2016<stderr>:#003: /lib64/libc.so.6(+0x326a0) 
>>> [0x2b6c502156a0]
>>> Thu Jun 16 16:03:08 2016<stderr>:#004: 
>>> /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0xee5)
>>>  [0x2b6c55e99ab5]
>>> Thu Jun 16 16:03:08 2016<stderr>:#005: 
>>> /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)
>>>  [0x2b6c55e9b8f8]
>>> Thu Jun 16 16:03:08 2016<stderr>:#006: 
>>> /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0x3ff)
>>>  [0x2b6c55e9c50f]
>>> Thu Jun 16 16:03:08 2016<stderr>:#007: 
>>> /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(PetscMallocAlign+0x17)
>>>  [0x2b6c4a49ecc7]
>>> Thu Jun 16 16:03:08 2016<stderr>:#008: 
>>> /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMultSymbolic_MPIAIJ_MPIAIJ+0x37d)
>>>  [0x2b6c4a915eed]
>>> Thu Jun 16 16:03:08 2016<stderr>:#009: 
>>> /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult_MPIAIJ_MPIAIJ+0x1a3)
>>>  [0x2b6c4a915713]
>>> Thu Jun 16 16:03:08 2016<stderr>:#010: 
>>> /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult+0x5f4)
>>>  [0x2b6c4a630f44]
>>> Thu Jun 16 16:03:08 2016<stderr>:#011: girefMatMatMult(MatricePETSc const&, 
>>> MatricePETSc const&, MatricePETSc&, MatReuse)  >>> 
>>> /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Petsc.so
>>> 
>>> Now, looking at the MatMatMultSymbolic_MPIAIJ_MPIAIJ modifications, since I 
>>> run with 64-bit indices, either there is something really bad in what we 
>>> are trying to do, or maybe there is still an issue in this routine...
>>> 
>>> What is your advice, and how could I retrieve more information if I can 
>>> launch it again?
>>> 
>>> Would -malloc_dump or -malloc_log help, or anything else?
>>> 
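>>> For example, I suppose I would just append something like
>>> 
>>>    ./probGD.opt <usual options> -malloc_dump -malloc_log
>>> 
>>> to the run, if that is the right way to use them?
>>> 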
>>> (the very same computation passed with 240M unknowns).
>>> 
>>> Thanks for your insights!
>>> 
>>> Eric
>>> 
>>> here are the configure options:
>>> static const char *petscconfigureoptions = "PETSC_ARCH=linux-gnu-intel 
>>> CFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\" 
>>> FFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\" 
>>> --prefix=/software6/libs/petsc/3.5.4_intel_openmpi1.8.8 --with-x=0 
>>> --with-mpi-compilers=1 --with-mpi-dir=/software6/mpi/openmpi/1.8.8_intel 
>>> --known-mpi-shared-libraries=1 --with-debugging=no --with-64-bit-indices=1 
>>> --with-shared-libraries=1 
>>> --with-blas-lapack-dir=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64 
>>> --with-scalapack=1 
>>> --with-scalapack-include=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/include 
>>> --with-scalapack-lib=\"-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64\" 
>>> --download-ptscotch=1 --download-superlu_dist=yes --download-parmetis=yes 
>>> --download-metis=yes --download-hypre=yes";
>>> 
>>> 
>>> On 16/11/15 07:12 PM, Barry Smith wrote:
>>>>   I have started a branch with utilities to help catch/handle these 
>>>> integer overflow issues: 
>>>> https://bitbucket.org/petsc/petsc/pull-requests/389/add-utilities-for-handling-petscint/diff 
>>>> All suggestions are appreciated.
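>>>> 
>>>>   A minimal sketch of the kind of check these utilities are meant to make 
>>>> easy (illustrative only, not the actual branch code; the 64-bit type and 
>>>> names here are placeholders):
>>>> 
>>>>    #include <petscsys.h>
>>>> 
>>>>    /* compute a possibly huge count in 64 bits first, then verify that it
>>>>       still fits in PetscInt before using it as a size/index */
>>>>    static PetscErrorCode CheckedCount(PetscInt nrows,PetscInt nz_per_row,PetscInt *count)
>>>>    {
>>>>      PetscInt64 big = (PetscInt64)nrows*(PetscInt64)nz_per_row;
>>>> 
>>>>      PetscFunctionBegin;
>>>>      if (big > PETSC_MAX_INT) SETERRQ(PETSC_COMM_SELF,PETSC_ERR_SUP,
>>>>        "count overflows PetscInt; configure with --with-64-bit-indices");
>>>>      *count = (PetscInt)big;
>>>>      PetscFunctionReturn(0);
>>>>    }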
>>>> 
>>>>   Barry
>>>> 
>>>>> On Nov 16, 2015, at 12:26 PM, Eric Chamberland 
>>>>> <[email protected]> wrote:
>>>>> 
>>>>> Barry,
>>>>> 
>>>>> I can't launch the code again and retrieve other information, since I am 
>>>>> not allowed to do so: the cluster has around ~780 nodes and I got very 
>>>>> special permission to reserve 530 of them...
>>>>> 
>>>>> So the best I can do is to give you the backtrace PETSc gave me... :/
>>>>> (see the first post with the backtrace: 
>>>>> http://lists.mcs.anl.gov/pipermail/petsc-users/2015-November/027644.html)
>>>>> 
>>>>> And until today, all smaller meshes with the same solver completed 
>>>>> successfully... (I went up to 219 million unknowns on 64 nodes).
>>>>> 
>>>>> I understand then that some use of PetscInt64 in the actual code could 
>>>>> help fix problems like the one I got.  I find it is a big challenge to 
>>>>> track down every occurrence of this kind of overflow in the code, given 
>>>>> the size of the systems you need in order to reproduce the problem...
>>>>> 
>>>>> Eric
>>>>> 
>>>>> 
>>>>> On 16/11/15 12:40 PM, Barry Smith wrote:
>>>>>>   Eric,
>>>>>> 
>>>>>>     The behavior you get with bizarre integers and a crash is not the 
>>>>>> behavior we want. We would like to detect these overflows appropriately. 
>>>>>>   If you can track through the error and determine the location where 
>>>>>> the overflow occurs then we would gladly put in additional checks and 
>>>>>> use of PetscInt64 to handle these things better. So let us know the 
>>>>>> exact cause and we'll improve the code.
>>>>>> 
>>>>>>   Barry
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Nov 16, 2015, at 11:11 AM, Eric Chamberland 
>>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>> On 16/11/15 10:42 AM, Matthew Knepley wrote:
>>>>>>>> Sometimes when we do not have exact counts, we need to overestimate
>>>>>>>> sizes. This is especially true
>>>>>>>> in sparse MatMat.
>>>>>>> Ok... so, to be sure, am I correct in saying that recompiling PETSc with
>>>>>>> "--with-64-bit-indices" is the only solution to my problem?
>>>>>>> 
>>>>>>> I mean, is there no other fix for this overestimation in a more recent 
>>>>>>> release of PETSc, like putting the result in a "long int" instead?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Eric
>>>>>>> 
> 
