> On Jun 20, 2016, at 10:32 PM, Eric Chamberland <[email protected]> wrote:
>
> ok, but what about -matmatmult_via scalable?
Both should work. It just may be that one is faster or slower than the other, depending on the problem size.

   Barry

> Eric
>
> On 2016-06-20 23:18, Barry Smith wrote:
>> valgrind doesn't care how many processes you use it on. Sometimes brute
>> force is the best way to go.
>>
>>    Barry
>>
>>> On Jun 20, 2016, at 10:14 PM, Eric Chamberland
>>> <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> for sure, valgrind is a reliable tool to find memory corruption... but
>>> launching it on a 1k-process job sounds "funny" and unfeasible to me...
>>> I really don't know what the biggest job valgrind has ever been run on,
>>> on any cluster???
>>>
>>> Anyway...
>>>
>>> I looked around in the code for now... It seems that
>>> MatMatMult_MPIAIJ_MPIAIJ switches between 2 different functions:
>>>
>>> MatMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable
>>> and
>>> MatMatMultSymbolic_MPIAIJ_MPIAIJ
>>>
>>> and that, by default, I think the nonscalable version is used...
>>>
>>> until someone uses the "-matmatmult_via scalable" option (is there any
>>> way I should have found this?)
>>>
>>> So I have a few questions/remarks:
>>>
>>> #1- Maybe my problem is just that the non-scalable version is used on
>>> 1.7 billion unknowns...?
>>> #2- Is it possible/feasible to make a better choice than defaulting to
>>> the non-scalable version?
>>>
>>> I would like to have some hints before launching this large computation
>>> again...
>>>
>>> Thanks,
>>>
>>> Eric
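A minimal sketch of the call under discussion (placeholder matrices, sizes and stencils, not the GIREF code): the product itself is an ordinary MatMatMult(); which symbolic routine runs is chosen at launch time, e.g. by adding "-matmatmult_via scalable" to the command line.

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat            A, B, C;
      PetscInt       i, rstart, rend, N = 100;
      PetscScalar    v = 1.0;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

      /* Two sparse N x N MPIAIJ matrices (simple bidiagonal stencils,
         purely as placeholders for the application's operators). */
      ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, 3, NULL, 2, NULL, &A);CHKERRQ(ierr);
      ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, 3, NULL, 2, NULL, &B);CHKERRQ(ierr);
      ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
      for (i = rstart; i < rend; i++) {
        ierr = MatSetValue(A, i, i, v, INSERT_VALUES);CHKERRQ(ierr);
        if (i + 1 < N) { ierr = MatSetValue(A, i, i + 1, v, INSERT_VALUES);CHKERRQ(ierr); }
        ierr = MatSetValue(B, i, i, v, INSERT_VALUES);CHKERRQ(ierr);
        if (i > 0)     { ierr = MatSetValue(B, i, i - 1, v, INSERT_VALUES);CHKERRQ(ierr); }
      }
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

      /* C = A*B.  Per the discussion above, the symbolic stage uses the
         nonscalable algorithm unless the job is started with
         "-matmatmult_via scalable". */
      ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);

      ierr = MatDestroy(&C);CHKERRQ(ierr);
      ierr = MatDestroy(&B);CHKERRQ(ierr);
      ierr = MatDestroy(&A);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return 0;
    }

As far as one can tell from the thread, the choice of symbolic routine is only exposed through the options database, which is presumably why the option was hard to find in the first place.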
>>> On 2016-06-17 17:02, Barry Smith wrote:
>>>> Run with the debug version of the libraries under valgrind
>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind to see if
>>>> any memory corruption issues come up.
>>>>
>>>> My guess is simply that it has run out of allocatable memory.
>>>>
>>>>    Barry
>>>>
>>>>> On Jun 17, 2016, at 3:55 PM, Eric Chamberland
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We got another run on the cluster with petsc 3.5.4 compiled with 64-bit
>>>>> indices (see end of message for configure options).
>>>>>
>>>>> This time, the execution terminated with a segmentation violation with
>>>>> the following backtrace:
>>>>>
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#000: reqBacktrace(std::string&)  >>> /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#001: attacheDebugger()  >>> /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#002: /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x305) [0x2b6c4885a875]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#003: /lib64/libc.so.6(+0x326a0) [0x2b6c502156a0]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#004: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0xee5) [0x2b6c55e99ab5]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#005: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58) [0x2b6c55e9b8f8]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#006: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0x3ff) [0x2b6c55e9c50f]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#007: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(PetscMallocAlign+0x17) [0x2b6c4a49ecc7]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#008: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMultSymbolic_MPIAIJ_MPIAIJ+0x37d) [0x2b6c4a915eed]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#009: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult_MPIAIJ_MPIAIJ+0x1a3) [0x2b6c4a915713]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#010: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult+0x5f4) [0x2b6c4a630f44]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#011: girefMatMatMult(MatricePETSc const&, MatricePETSc const&, MatricePETSc&, MatReuse)  >>> /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Petsc.so
>>>>>
>>>>> Now looking at the MatMatMultSymbolic_MPIAIJ_MPIAIJ modifications: since
>>>>> I run with 64-bit indices, either there is something really bad in what
>>>>> we are trying to do, or maybe there is still something wrong in this
>>>>> routine...
>>>>>
>>>>> What is your advice, and how could I retrieve more information if I can
>>>>> launch it again?
>>>>>
>>>>> Would -malloc_dump or -malloc_log help, or anything else?
>>>>>
>>>>> (The very same computation passed with 240M unknowns.)
>>>>>
>>>>> Thanks for your insights!
>>>>>
>>>>> Eric
>>>>>
>>>>> Here are the configure options:
>>>>>
>>>>> static const char *petscconfigureoptions = "PETSC_ARCH=linux-gnu-intel
>>>>> CFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\"
>>>>> FFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\"
>>>>> --prefix=/software6/libs/petsc/3.5.4_intel_openmpi1.8.8 --with-x=0
>>>>> --with-mpi-compilers=1 --with-mpi-dir=/software6/mpi/openmpi/1.8.8_intel
>>>>> --known-mpi-shared-libraries=1 --with-debugging=no
>>>>> --with-64-bit-indices=1 --with-shared-libraries=1
>>>>> --with-blas-lapack-dir=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64
>>>>> --with-scalapack=1
>>>>> --with-scalapack-include=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/include
>>>>> --with-scalapack-lib=\"-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64\"
>>>>> --download-ptscotch=1 --download-superlu_dist=yes --download-parmetis=yes
>>>>> --download-metis=yes --download-hypre=yes";
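On the -malloc_dump / -malloc_log question above: both options report at PetscFinalize(), so they may show nothing if the job dies inside the symbolic product. A rough sketch of an alternative (a hypothetical helper, not part of PETSc or GIREF) that brackets the product with memory queries, to test Barry's guess that the run simply exhausts memory:

    #include <petscmat.h>

    /* Hypothetical helper: print the per-rank maximum of PETSc-allocated and
       resident memory at a given stage.  Note that PetscMallocGetCurrentUsage()
       only tracks allocations when PETSc's malloc logging is active (debug
       builds, or options such as -malloc / -malloc_log), so it may report 0 in
       a plain optimized build. */
    PetscErrorCode ReportMemory(const char *stage)
    {
      PetscLogDouble malloced, resident, maxmalloced, maxresident;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscMallocGetCurrentUsage(&malloced);CHKERRQ(ierr);
      ierr = PetscMemoryGetCurrentUsage(&resident);CHKERRQ(ierr);
      ierr = MPI_Allreduce(&malloced, &maxmalloced, 1, MPI_DOUBLE, MPI_MAX, PETSC_COMM_WORLD);CHKERRQ(ierr);
      ierr = MPI_Allreduce(&resident, &maxresident, 1, MPI_DOUBLE, MPI_MAX, PETSC_COMM_WORLD);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "%s: max PetscMalloc'd %.3g MB, max resident %.3g MB (per rank)\n",
                         stage, maxmalloced/1.e6, maxresident/1.e6);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

    /* Usage around the product that died (A, B, C are the application's matrices):

       ierr = ReportMemory("before MatMatMult");CHKERRQ(ierr);
       ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
       ierr = ReportMemory("after MatMatMult");CHKERRQ(ierr);
    */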
>>>>> On 16/11/15 07:12 PM, Barry Smith wrote:
>>>>>> I have started a branch with utilities to help catch/handle these
>>>>>> integer overflow issues
>>>>>> https://bitbucket.org/petsc/petsc/pull-requests/389/add-utilities-for-handling-petscint/diff
>>>>>> all suggestions are appreciated
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>>> On Nov 16, 2015, at 12:26 PM, Eric Chamberland
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Barry,
>>>>>>>
>>>>>>> I can't launch the code again and retrieve more information, since I
>>>>>>> am not allowed to do so: the cluster has around 780 nodes and I got a
>>>>>>> very special permission to reserve 530 of them...
>>>>>>>
>>>>>>> So the best I can do is to give you the backtrace PETSc gave me... :/
>>>>>>> (see the first post with the backtrace:
>>>>>>> http://lists.mcs.anl.gov/pipermail/petsc-users/2015-November/027644.html)
>>>>>>>
>>>>>>> And until today, all smaller meshes with the same solver completed
>>>>>>> successfully... (I went up to 219 million unknowns on 64 nodes).
>>>>>>>
>>>>>>> I understand then that some use of PetscInt64 in the current code
>>>>>>> could help fix problems like the one I got. I find it a big challenge
>>>>>>> to track down all occurrences of this kind of overflow in the code,
>>>>>>> given the size of the systems needed to reproduce the problem....
>>>>>>>
>>>>>>> Eric
>>>>>>>
>>>>>>> On 16/11/15 12:40 PM, Barry Smith wrote:
>>>>>>>> Eric,
>>>>>>>>
>>>>>>>> The behavior you get with bizarre integers and a crash is not the
>>>>>>>> behavior we want. We would like to detect these overflows
>>>>>>>> appropriately. If you can track through the error and determine the
>>>>>>>> location where the overflow occurs then we would gladly put in
>>>>>>>> additional checks and use of PetscInt64 to handle these things better.
>>>>>>>> So let us know the exact cause and we'll improve the code.
>>>>>>>>
>>>>>>>>    Barry
>>>>>>>>
>>>>>>>>> On Nov 16, 2015, at 11:11 AM, Eric Chamberland
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 16/11/15 10:42 AM, Matthew Knepley wrote:
>>>>>>>>>> Sometimes when we do not have exact counts, we need to overestimate
>>>>>>>>>> sizes. This is especially true in sparse MatMat.
>>>>>>>>> Ok... so, to be sure: am I correct in saying that recompiling petsc
>>>>>>>>> with "--with-64-bit-indices" is the only solution to my problem?
>>>>>>>>>
>>>>>>>>> I mean, is there no other fix for this overestimation in a more
>>>>>>>>> recent release of petsc, like putting the result in a "long int"
>>>>>>>>> instead?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Eric
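To make the overflow being discussed concrete: with the default 32-bit PetscInt, a global nonzero count over roughly 1.7 billion rows exceeds 2^31-1 long before any single matrix dimension does. A small self-contained illustration (assumed numbers, not PETSc's actual counting code; PetscInt64 assumes a release that defines it, otherwise a plain long long plays the same role):

    #include <petscsys.h>

    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;
      PetscInt   nrows      = 1700000000;  /* still fits in a signed 32-bit int */
      PetscInt   nz_per_row = 27;          /* illustrative stencil width        */
      PetscInt   nz_narrow;
      PetscInt64 nz_wide;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

      /* With a 32-bit PetscInt this product is ~4.6e10 and overflows
         (undefined behavior, typically a bizarre negative value); built
         --with-64-bit-indices=1, PetscInt is 64-bit and it is fine. */
      nz_narrow = nrows * nz_per_row;

      /* Widening the accumulator is the "long int" fix asked about above. */
      nz_wide = (PetscInt64)nrows * nz_per_row;

      ierr = PetscPrintf(PETSC_COMM_WORLD, "narrow count: %D   wide count: %lld\n",
                         nz_narrow, (long long)nz_wide);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return 0;
    }

Rebuilding with --with-64-bit-indices=1 widens every PetscInt, which is the blunt fix Eric applied for the June 2016 run; per-count use of a 64-bit type is what Barry's branch above moves toward.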
