valgrind doesn't care how many processes you use it on. Sometimes brute force is the best way to go.
  Barry

> On Jun 20, 2016, at 10:14 PM, Eric Chamberland <[email protected]> wrote:
>
> Hi,
>
> For sure, valgrind is a reliable tool for finding memory corruption... but launching it on a 1k-process job sounds "funny" and unfeasible to me... I really don't know what the biggest job is that valgrind has ever been run on, on any cluster???
>
> Anyway...
>
> I looked around in the code for now. It seems that MatMatMult_MPIAIJ_MPIAIJ switches between two different functions:
>
> MatMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable
> and
> MatMatMultSymbolic_MPIAIJ_MPIAIJ
>
> and that, by default, I think the nonscalable version is used...
>
> until someone uses the "-matmatmult_via scalable" option (is there any way I should have found this?).
>
> So I have a few questions/remarks:
>
> #1- Maybe my problem is just that the non-scalable version is used on 1.7 billion unknowns...?
> #2- Is it possible/feasible to make a better choice than defaulting to the non-scalable version?
>
> I would like to have some hints before launching this large computation again...
>
> Thanks,
>
> Eric
>
>
> On 2016-06-17 17:02, Barry Smith wrote:
>> Run with the debug version of the libraries under valgrind (http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind) to see if any memory corruption issues come up.
>>
>> My guess is simply that it has run out of allocatable memory.
>>
>> Barry
>>
>>
>>
>>> On Jun 17, 2016, at 3:55 PM, Eric Chamberland <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> We got another run on the cluster with PETSc 3.5.4 compiled with 64-bit indices (see the end of this message for the configure options).
>>>
>>> This time, the execution terminated with a segmentation violation with the following backtrace:
>>>
>>> Thu Jun 16 16:03:08 2016<stderr>:#000: reqBacktrace(std::string&)  /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
>>> Thu Jun 16 16:03:08 2016<stderr>:#001: attacheDebugger()  /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
>>> Thu Jun 16 16:03:08 2016<stderr>:#002: /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x305) [0x2b6c4885a875]
>>> Thu Jun 16 16:03:08 2016<stderr>:#003: /lib64/libc.so.6(+0x326a0) [0x2b6c502156a0]
>>> Thu Jun 16 16:03:08 2016<stderr>:#004: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0xee5) [0x2b6c55e99ab5]
>>> Thu Jun 16 16:03:08 2016<stderr>:#005: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58) [0x2b6c55e9b8f8]
>>> Thu Jun 16 16:03:08 2016<stderr>:#006: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0x3ff) [0x2b6c55e9c50f]
>>> Thu Jun 16 16:03:08 2016<stderr>:#007: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(PetscMallocAlign+0x17) [0x2b6c4a49ecc7]
>>> Thu Jun 16 16:03:08 2016<stderr>:#008: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMultSymbolic_MPIAIJ_MPIAIJ+0x37d) [0x2b6c4a915eed]
>>> Thu Jun 16 16:03:08 2016<stderr>:#009: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult_MPIAIJ_MPIAIJ+0x1a3) [0x2b6c4a915713]
>>> Thu Jun 16 16:03:08 2016<stderr>:#010: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult+0x5f4) [0x2b6c4a630f44]
>>> Thu Jun 16 16:03:08 2016<stderr>:#011: girefMatMatMult(MatricePETSc const&, MatricePETSc const&, MatricePETSc&, MatReuse)  /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Petsc.so
>>>
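For reference, the scalable variant mentioned above can be selected either with the command-line option -matmatmult_via scalable or programmatically before the product is formed. The fragment below is only a sketch, not code from this thread: FormProduct and the matrices A, B, C are placeholder names, and PetscOptionsSetValue is shown with the two-argument signature of the PETSc 3.5 era.

  #include <petscmat.h>

  /* Sketch: ask for the scalable MatMatMult symbolic routine before forming
     C = A*B.  A and B are assembled MPIAIJ matrices created elsewhere. */
  PetscErrorCode FormProduct(Mat A, Mat B, Mat *C)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    /* Same effect as passing "-matmatmult_via scalable" on the command line. */
    ierr = PetscOptionsSetValue("-matmatmult_via", "scalable");CHKERRQ(ierr);
    /* PETSC_DEFAULT lets PETSc estimate the fill of the symbolic product. */
    ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

PETSC_DEFAULT for the fill argument is just the no-estimate default; a measured fill ratio can be passed instead.
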
>>> Now, looking at the MatMatMultSymbolic_MPIAIJ_MPIAIJ modifications: since I run with 64-bit indices, either there is something really bad in what we are trying to do, or maybe there is still something wrong in this routine...
>>>
>>> What is your advice, and how could I retrieve more information if I can launch it again?
>>>
>>> Would -malloc_dump or -malloc_log help, or anything else?
>>>
>>> (The very same computation passed with 240M unknowns.)
>>>
>>> Thanks for your insights!
>>>
>>> Eric
>>>
>>> Here are the configure options:
>>>
>>> static const char *petscconfigureoptions = "PETSC_ARCH=linux-gnu-intel CFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\" FFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\" --prefix=/software6/libs/petsc/3.5.4_intel_openmpi1.8.8 --with-x=0 --with-mpi-compilers=1 --with-mpi-dir=/software6/mpi/openmpi/1.8.8_intel --known-mpi-shared-libraries=1 --with-debugging=no --with-64-bit-indices=1 --with-shared-libraries=1 --with-blas-lapack-dir=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64 --with-scalapack=1 --with-scalapack-include=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/include --with-scalapack-lib=\"-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64\" --download-ptscotch=1 --download-superlu_dist=yes --download-parmetis=yes --download-metis=yes --download-hypre=yes";
>>>
>>>
>>> On 16/11/15 07:12 PM, Barry Smith wrote:
>>>> I have started a branch with utilities to help catch/handle these integer overflow issues (all suggestions are appreciated): https://bitbucket.org/petsc/petsc/pull-requests/389/add-utilities-for-handling-petscint/diff
>>>>
>>>> Barry
>>>>
>>>>> On Nov 16, 2015, at 12:26 PM, Eric Chamberland <[email protected]> wrote:
>>>>>
>>>>> Barry,
>>>>>
>>>>> I can't launch the code again to retrieve more information, since I am not allowed to do so: the cluster has around 780 nodes and I got a very special permission to reserve 530 of them...
>>>>>
>>>>> So the best I can do is to give you the backtrace PETSc gave me... :/ (see the first post with the backtrace: http://lists.mcs.anl.gov/pipermail/petsc-users/2015-November/027644.html)
>>>>>
>>>>> And until today, all smaller meshes with the same solver completed successfully... (I went up to 219 million unknowns on 64 nodes.)
>>>>>
>>>>> I understand, then, that some use of PetscInt64 in the current code could help fix problems like the one I got. I find it a big challenge to track down every occurrence of this kind of overflow in the code, given the size of the systems needed to reproduce the problem....
>>>>>
>>>>> Eric
>>>>>
>>>>>
>>>>> On 16/11/15 12:40 PM, Barry Smith wrote:
>>>>>> Eric,
>>>>>>
>>>>>> The behavior you get with bizarre integers and a crash is not the behavior we want. We would like to detect these overflows appropriately. If you can track through the error and determine the location where the overflow occurs, then we would gladly put in additional checks and use of PetscInt64 to handle these things better. So let us know the exact cause and we'll improve the code.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Nov 16, 2015, at 11:11 AM, Eric Chamberland <[email protected]> wrote:
>>>>>>>
>>>>>>> On 16/11/15 10:42 AM, Matthew Knepley wrote:
>>>>>>>> Sometimes when we do not have exact counts, we need to overestimate sizes. This is especially true in sparse MatMat.
>>>>>>>
>>>>>>> OK... so, to be sure: am I correct in saying that recompiling PETSc with "--with-64-bit-indices" is the only solution to my problem?
>>>>>>>
>>>>>>> I mean, does no other fix for this overestimation exist in a more recent release of PETSc, like putting the result in a "long int" instead?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Eric
>>>>>>>
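Regarding the "long int" question at the bottom of the thread: the guard that the overflow utilities aim to provide looks roughly like the sketch below. It is illustrative only, not the actual PETSc source; CheckedProductLength and its arguments are invented names, and the 64-bit arithmetic is done with plain long long here (newer PETSc exposes a PetscInt64 typedef for this purpose).

  #include <petscsys.h>

  /* Sketch: compute m*nnz in 64-bit arithmetic and fail cleanly if the
     result does not fit in a 32-bit PetscInt, instead of silently wrapping
     around and feeding a bogus length to the allocator. */
  PetscErrorCode CheckedProductLength(PetscInt m, PetscInt nnz, PetscInt *len)
  {
    const long long limit = 2147483647LL;          /* largest 32-bit PetscInt */
    long long       prod  = (long long)m * (long long)nnz;

    PetscFunctionBegin;
    if (prod > limit) SETERRQ(PETSC_COMM_SELF, PETSC_ERR_SUP,
      "Requested size overflows 32-bit PetscInt; configure PETSc with --with-64-bit-indices");
    *len = (PetscInt)prod;
    PetscFunctionReturn(0);
  }

Note that the June 2016 run above was already built with --with-64-bit-indices=1, so that later failure is more likely the memory footprint of the non-scalable symbolic product (Barry's "run out of allocatable memory" guess) than an index overflow.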
