Hi,
for sure, Valgrind is a reliable tool for finding memory corruption... but
launching it on a 1k-process job sounds "funny" and unfeasible
to me... I really don't know what the biggest job Valgrind has ever been run
on out there on any cluster is...
Anyway...
I have looked around in the code for now... It seems that
MatMatMult_MPIAIJ_MPIAIJ switches between two different functions:
MatMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable
and
MatMatMultSymbolic_MPIAIJ_MPIAIJ
and that, by default, I think the non-scalable version is used...
until someone uses the "-matmatmult_via scalable" option (is there any way I
should have found this?)
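In case it helps others, here is a minimal sketch of forcing the scalable
algorithm from the code instead of the command line (assuming the 3.5-era
two-argument PetscOptionsSetValue; the option name is the one mentioned above,
and the function name is just illustrative):

#include <petscsys.h>

/* Sketch only: select the scalable MatMatMult symbolic algorithm
   programmatically, equivalent to passing -matmatmult_via scalable
   on the command line (PETSc 3.5-style PetscOptionsSetValue). */
PetscErrorCode UseScalableMatMatMult(void)
{
  PetscErrorCode ierr;
  ierr = PetscOptionsSetValue("-matmatmult_via","scalable");CHKERRQ(ierr);
  return 0;
}

(To be called before the MatMatMult in question, of course.)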
So I have a few questions/remarks:
#1- Maybe my problem is just that the non-scalable version is used on
1.7 billion unknowns...?
#2- Is it possible/feasible to make a better choice than defaulting to the
non-scalable version?
I would like to have some hints before launching this large computation
again...
Thanks,
Eric
On 2016-06-17 17:02, Barry Smith wrote:
Run with the debug version of the libraries under Valgrind
(http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind) to see if any
memory corruption issues come up.
My guess is simply that it has run out of allocatable memory.
Barry
On Jun 17, 2016, at 3:55 PM, Eric Chamberland
<[email protected]> wrote:
Hi,
We got another run on the cluster with PETSc 3.5.4 compiled with 64-bit
indices (see the end of the message for the configure options).
This time, the execution terminated with a segmentation violation with the
following backtrace:
Thu Jun 16 16:03:08 2016<stderr>:#000: reqBacktrace(std::string&) >>>
/rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
Thu Jun 16 16:03:08 2016<stderr>:#001: attacheDebugger() >>>
/rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
Thu Jun 16 16:03:08 2016<stderr>:#002:
/rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x305)
[0x2b6c4885a875]
Thu Jun 16 16:03:08 2016<stderr>:#003: /lib64/libc.so.6(+0x326a0)
[0x2b6c502156a0]
Thu Jun 16 16:03:08 2016<stderr>:#004:
/software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0xee5)
[0x2b6c55e99ab5]
Thu Jun 16 16:03:08 2016<stderr>:#005:
/software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)
[0x2b6c55e9b8f8]
Thu Jun 16 16:03:08 2016<stderr>:#006:
/software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0x3ff)
[0x2b6c55e9c50f]
Thu Jun 16 16:03:08 2016<stderr>:#007:
/software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(PetscMallocAlign+0x17)
[0x2b6c4a49ecc7]
Thu Jun 16 16:03:08 2016<stderr>:#008:
/software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMultSymbolic_MPIAIJ_MPIAIJ+0x37d)
[0x2b6c4a915eed]
Thu Jun 16 16:03:08 2016<stderr>:#009:
/software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult_MPIAIJ_MPIAIJ+0x1a3)
[0x2b6c4a915713]
Thu Jun 16 16:03:08 2016<stderr>:#010:
/software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult+0x5f4)
[0x2b6c4a630f44]
Thu Jun 16 16:03:08 2016<stderr>:#011: girefMatMatMult(MatricePETSc const&, MatricePETSc
const&, MatricePETSc&, MatReuse) >>>
/rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Petsc.so
Now, looking at the MatMatMultSymbolic_MPIAIJ_MPIAIJ modifications, since I run
with 64-bit indices, either there is something really bad with what we are trying
to do, or maybe there is still something wrong in this routine...
What is your advice, and how could I retrieve more information if I can launch
it again?
Would -malloc_dump or -malloc_log help, or anything else?
(The very same computation passed with 240M unknowns.)
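Independently of -malloc_dump/-malloc_log, and if this is indeed memory
exhaustion rather than corruption, here is a minimal sketch of logging per-rank
memory right before the large product, so the numbers survive even if the run
dies (the PETSc calls are standard, but the placement around our MatMatMult and
the function name are my own assumptions):

#include <petscsys.h>

/* Sketch: report per-rank memory just before the large MatMatMult.
   The surrounding call site is assumed. */
static PetscErrorCode LogMemoryUsage(MPI_Comm comm,const char stage[])
{
  PetscErrorCode ierr;
  PetscLogDouble rss,mallocd;

  ierr = PetscMemoryGetCurrentUsage(&rss);CHKERRQ(ierr);      /* resident set size (bytes) */
  ierr = PetscMallocGetCurrentUsage(&mallocd);CHKERRQ(ierr);  /* bytes obtained via PetscMalloc */
  ierr = PetscSynchronizedPrintf(comm,"[%s] rss = %g MB, PetscMalloc = %g MB\n",
                                 stage,rss/1048576.0,mallocd/1048576.0);CHKERRQ(ierr);
  ierr = PetscSynchronizedFlush(comm,PETSC_STDOUT);CHKERRQ(ierr);
  return 0;
}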
Thanks for your insights!
Eric
here are the configure options:
static const char *petscconfigureoptions = "PETSC_ARCH=linux-gnu-intel CFLAGS=\"-O3 -xHost
-mkl -fPIC -m64 -no-diag-message-catalog\" FFLAGS=\"-O3 -xHost -mkl -fPIC -m64
-no-diag-message-catalog\"
--prefix=/software6/libs/petsc/3.5.4_intel_openmpi1.8.8 --with-x=0 --with-mpi-compilers=1
--with-mpi-dir=/software6/mpi/openmpi/1.8.8_intel
--known-mpi-shared-libraries=1 --with-debugging=no --with-64-bit-indices=1
--with-shared-libraries=1
--with-blas-lapack-dir=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64
--with-scalapack=1
--with-scalapack-include=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/include
--with-scalapack-lib=\"-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64\"
--download-ptscotch=1 --download-superlu_dist=yes --download-parmetis=yes --download-metis=yes
--download-hypre=yes";
On 16/11/15 07:12 PM, Barry Smith wrote:
I have started a branch with utilities to help catch/handle these integer
overflow issues:
https://bitbucket.org/petsc/petsc/pull-requests/389/add-utilities-for-handling-petscint/diff
All suggestions are appreciated.
Barry
On Nov 16, 2015, at 12:26 PM, Eric Chamberland
<[email protected]> wrote:
Barry,
I can't launch the code again to retrieve more information, since I am not
allowed to do so: the cluster has around 780 nodes and I got very special
permission to reserve 530 of them...
So the best I can do is to give you the backtrace PETSc gave me... :/
(see the first post with the backtrace:
http://lists.mcs.anl.gov/pipermail/petsc-users/2015-November/027644.html)
And until today, all smaller meshes with the same solver completed
successfully... (I went up to 219 million unknowns on 64 nodes).
I understand, then, that some use of PetscInt64 in the current code
could help fix problems like the one I got. I find it is a big challenge
to track down every occurrence of this kind of overflow in the code, given the
size of the systems you need in order to reproduce the problem...
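Just to illustrate the kind of guard I have in mind, here is a sketch (the
names and the place where the product is computed are my own assumptions, and
PetscInt64 is only available in versions that define it; otherwise a plain
long long would do):

#include <petscsys.h>

/* Sketch: do the size arithmetic in 64 bits and fail cleanly instead of
   wrapping around when PetscInt is 32-bit. The variable names are
   illustrative, not actual PETSc internals. */
static PetscErrorCode CheckNonzeroCountFits(PetscInt nrows,PetscInt nnz_per_row,PetscInt *total)
{
  PetscInt64 t = (PetscInt64)nrows*(PetscInt64)nnz_per_row;

  if (t > PETSC_MAX_INT) SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_SUP,
      "Nonzero count %lld overflows PetscInt; configure with --with-64-bit-indices",(long long)t);
  *total = (PetscInt)t;
  return 0;
}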
Eric
On 16/11/15 12:40 PM, Barry Smith wrote:
Eric,
The behavior you get with bizarre integers and a crash is not the behavior
we want. We would like to detect these overflows appropriately. If you can
track through the error and determine the location where the overflow occurs
then we would gladly put in additional checks and use of PetscInt64 to handle
these things better. So let us know the exact cause and we'll improve the code.
Barry
On Nov 16, 2015, at 11:11 AM, Eric Chamberland
<[email protected]> wrote:
On 16/11/15 10:42 AM, Matthew Knepley wrote:
Sometimes when we do not have exact counts, we need to overestimate
sizes. This is especially true
in sparse MatMat.
OK... so, to be sure: am I correct in saying that recompiling PETSc with
"--with-64-bit-indices" is the only solution to my problem?
I mean, does no other fix exist for this overestimation in a more recent release of PETSc,
such as storing the result in a "long int" instead?
Thanks,
Eric