> On Jun 20, 2016, at 10:32 PM, Eric Chamberland <[email protected]> wrote:
>
> ok, but what about -matmatmult_via scalable?
Both should work. It just may be that one is faster or slower than the other, depending on the problem size.

   Barry

> Eric
>
> On 2016-06-20 23:18, Barry Smith wrote:
>> valgrind doesn't care how many processes you use it on. Sometimes brute
>> force is the best way to go.
>>
>>    Barry
>>
>>> On Jun 20, 2016, at 10:14 PM, Eric Chamberland
>>> <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> for sure, valgrind is a reliable tool to find memory corruption... but
>>> launching it on a 1k-process job sounds "funny" and unfeasible to me...
>>> I really don't know what the biggest job valgrind has ever been run on,
>>> on any cluster???
>>>
>>> Anyway...
>>>
>>> I looked around in the code for now... It seems that
>>> MatMatMult_MPIAIJ_MPIAIJ switches between 2 different functions:
>>>
>>> MatMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable
>>> and
>>> MatMatMultSymbolic_MPIAIJ_MPIAIJ
>>>
>>> and that, by default, I think the nonscalable version is used...
>>>
>>> until someone uses the "-matmatmult_via scalable" option (is there any
>>> way I should have found this?)
>>>
>>> So I have a few questions/remarks:
>>>
>>> #1- Maybe my problem is just that the non-scalable version is used on
>>> 1.7 billion unknowns...?
>>> #2- Is it possible/feasible to make a better choice than defaulting to
>>> the non-scalable version?
>>>
>>> I would like to have some hints before launching this large computation
>>> again...
>>>
>>> Thanks,
>>>
>>> Eric
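A minimal sketch of the call under discussion (placeholder matrices, sizes and stencils, not the GIREF code): the product itself is an ordinary MatMatMult(); which symbolic routine runs is chosen at launch time, e.g. by adding "-matmatmult_via scalable" to the command line.

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat            A, B, C;
      PetscInt       i, rstart, rend, N = 100;
      PetscScalar    v = 1.0;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

      /* Two sparse N x N MPIAIJ matrices (simple bidiagonal stencils,
         purely as placeholders for the application's operators). */
      ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, 3, NULL, 2, NULL, &A);CHKERRQ(ierr);
      ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, 3, NULL, 2, NULL, &B);CHKERRQ(ierr);
      ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
      for (i = rstart; i < rend; i++) {
        ierr = MatSetValue(A, i, i, v, INSERT_VALUES);CHKERRQ(ierr);
        if (i + 1 < N) { ierr = MatSetValue(A, i, i + 1, v, INSERT_VALUES);CHKERRQ(ierr); }
        ierr = MatSetValue(B, i, i, v, INSERT_VALUES);CHKERRQ(ierr);
        if (i > 0)     { ierr = MatSetValue(B, i, i - 1, v, INSERT_VALUES);CHKERRQ(ierr); }
      }
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

      /* C = A*B.  Per the discussion above, the symbolic stage uses the
         nonscalable algorithm unless the job is started with
         "-matmatmult_via scalable". */
      ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);

      ierr = MatDestroy(&C);CHKERRQ(ierr);
      ierr = MatDestroy(&B);CHKERRQ(ierr);
      ierr = MatDestroy(&A);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return 0;
    }

As far as one can tell from the thread, the choice of symbolic routine is only exposed through the options database, which is presumably why the option was hard to find in the first place.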
>>> On 2016-06-17 17:02, Barry Smith wrote:
>>>> Run with the debug version of the libraries under valgrind
>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind to see if
>>>> any memory corruption issues come up.
>>>>
>>>> My guess is simply that it has run out of allocatable memory.
>>>>
>>>>    Barry
>>>>
>>>>> On Jun 17, 2016, at 3:55 PM, Eric Chamberland
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We got another run on the cluster with petsc 3.5.4 compiled with 64-bit
>>>>> indices (see end of message for configure options).
>>>>>
>>>>> This time, the execution terminated with a segmentation violation with
>>>>> the following backtrace:
>>>>>
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#000: reqBacktrace(std::string&)  >>> /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#001: attacheDebugger()  >>> /rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#002: /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x305) [0x2b6c4885a875]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#003: /lib64/libc.so.6(+0x326a0) [0x2b6c502156a0]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#004: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0xee5) [0x2b6c55e99ab5]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#005: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58) [0x2b6c55e9b8f8]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#006: /software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0x3ff) [0x2b6c55e9c50f]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#007: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(PetscMallocAlign+0x17) [0x2b6c4a49ecc7]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#008: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMultSymbolic_MPIAIJ_MPIAIJ+0x37d) [0x2b6c4a915eed]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#009: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult_MPIAIJ_MPIAIJ+0x1a3) [0x2b6c4a915713]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#010: /software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult+0x5f4) [0x2b6c4a630f44]
>>>>> Thu Jun 16 16:03:08 2016<stderr>:#011: girefMatMatMult(MatricePETSc const&, MatricePETSc const&, MatricePETSc&, MatReuse)  >>> /rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Petsc.so
>>>>>
>>>>> Now looking at the MatMatMultSymbolic_MPIAIJ_MPIAIJ modifications: since
>>>>> I run with 64-bit indices, either there is something really bad in what
>>>>> we are trying to do, or maybe there is still something wrong in this
>>>>> routine...
>>>>>
>>>>> What is your advice, and how could I retrieve more information if I can
>>>>> launch it again?
>>>>>
>>>>> Would -malloc_dump or -malloc_log help, or anything else?
>>>>>
>>>>> (The very same computation passed with 240M unknowns.)
>>>>>
>>>>> Thanks for your insights!
>>>>>
>>>>> Eric
>>>>>
>>>>> Here are the configure options:
>>>>>
>>>>> static const char *petscconfigureoptions = "PETSC_ARCH=linux-gnu-intel
>>>>> CFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\"
>>>>> FFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\"
>>>>> --prefix=/software6/libs/petsc/3.5.4_intel_openmpi1.8.8 --with-x=0
>>>>> --with-mpi-compilers=1 --with-mpi-dir=/software6/mpi/openmpi/1.8.8_intel
>>>>> --known-mpi-shared-libraries=1 --with-debugging=no
>>>>> --with-64-bit-indices=1 --with-shared-libraries=1
>>>>> --with-blas-lapack-dir=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64
>>>>> --with-scalapack=1
>>>>> --with-scalapack-include=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/include
>>>>> --with-scalapack-lib=\"-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64\"
>>>>> --download-ptscotch=1 --download-superlu_dist=yes --download-parmetis=yes
>>>>> --download-metis=yes --download-hypre=yes";
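On the -malloc_dump / -malloc_log question above: both options report at PetscFinalize(), so they may show nothing if the job dies inside the symbolic product. A rough sketch of an alternative (a hypothetical helper, not part of PETSc or GIREF) that brackets the product with memory queries, to test Barry's guess that the run simply exhausts memory:

    #include <petscmat.h>

    /* Hypothetical helper: print the per-rank maximum of PETSc-allocated and
       resident memory at a given stage.  Note that PetscMallocGetCurrentUsage()
       only tracks allocations when PETSc's malloc logging is active (debug
       builds, or options such as -malloc / -malloc_log), so it may report 0 in
       a plain optimized build. */
    PetscErrorCode ReportMemory(const char *stage)
    {
      PetscLogDouble malloced, resident, maxmalloced, maxresident;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscMallocGetCurrentUsage(&malloced);CHKERRQ(ierr);
      ierr = PetscMemoryGetCurrentUsage(&resident);CHKERRQ(ierr);
      ierr = MPI_Allreduce(&malloced, &maxmalloced, 1, MPI_DOUBLE, MPI_MAX, PETSC_COMM_WORLD);CHKERRQ(ierr);
      ierr = MPI_Allreduce(&resident, &maxresident, 1, MPI_DOUBLE, MPI_MAX, PETSC_COMM_WORLD);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "%s: max PetscMalloc'd %.3g MB, max resident %.3g MB (per rank)\n",
                         stage, maxmalloced/1.e6, maxresident/1.e6);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

    /* Usage around the product that died (A, B, C are the application's matrices):

       ierr = ReportMemory("before MatMatMult");CHKERRQ(ierr);
       ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
       ierr = ReportMemory("after MatMatMult");CHKERRQ(ierr);
    */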
>>>>> On 16/11/15 07:12 PM, Barry Smith wrote:
>>>>>> I have started a branch with utilities to help catch/handle these
>>>>>> integer overflow issues
>>>>>> https://bitbucket.org/petsc/petsc/pull-requests/389/add-utilities-for-handling-petscint/diff
>>>>>> all suggestions are appreciated
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>>> On Nov 16, 2015, at 12:26 PM, Eric Chamberland
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Barry,
>>>>>>>
>>>>>>> I can't launch the code again and retrieve more information, since I
>>>>>>> am not allowed to do so: the cluster has around 780 nodes and I got a
>>>>>>> very special permission to reserve 530 of them...
>>>>>>>
>>>>>>> So the best I can do is to give you the backtrace PETSc gave me... :/
>>>>>>> (see the first post with the backtrace:
>>>>>>> http://lists.mcs.anl.gov/pipermail/petsc-users/2015-November/027644.html)
>>>>>>>
>>>>>>> And until today, all smaller meshes with the same solver completed
>>>>>>> successfully... (I went up to 219 million unknowns on 64 nodes).
>>>>>>>
>>>>>>> I understand then that some use of PetscInt64 in the current code
>>>>>>> could help fix problems like the one I got. I find it a big challenge
>>>>>>> to track down all occurrences of this kind of overflow in the code,
>>>>>>> given the size of the systems needed to reproduce the problem....
>>>>>>>
>>>>>>> Eric
>>>>>>>
>>>>>>> On 16/11/15 12:40 PM, Barry Smith wrote:
>>>>>>>> Eric,
>>>>>>>>
>>>>>>>> The behavior you get with bizarre integers and a crash is not the
>>>>>>>> behavior we want. We would like to detect these overflows
>>>>>>>> appropriately. If you can track through the error and determine the
>>>>>>>> location where the overflow occurs then we would gladly put in
>>>>>>>> additional checks and use of PetscInt64 to handle these things better.
>>>>>>>> So let us know the exact cause and we'll improve the code.
>>>>>>>>
>>>>>>>>    Barry
>>>>>>>>
>>>>>>>>> On Nov 16, 2015, at 11:11 AM, Eric Chamberland
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 16/11/15 10:42 AM, Matthew Knepley wrote:
>>>>>>>>>> Sometimes when we do not have exact counts, we need to overestimate
>>>>>>>>>> sizes. This is especially true in sparse MatMat.
>>>>>>>>> Ok... so, to be sure: am I correct in saying that recompiling petsc
>>>>>>>>> with "--with-64-bit-indices" is the only solution to my problem?
>>>>>>>>>
>>>>>>>>> I mean, is there no other fix for this overestimation in a more
>>>>>>>>> recent release of petsc, like putting the result in a "long int"
>>>>>>>>> instead?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Eric
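To make the overflow being discussed concrete: with the default 32-bit PetscInt, a global nonzero count over roughly 1.7 billion rows exceeds 2^31-1 long before any single matrix dimension does. A small self-contained illustration (assumed numbers, not PETSc's actual counting code; PetscInt64 assumes a release that defines it, otherwise a plain long long plays the same role):

    #include <petscsys.h>

    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;
      PetscInt   nrows      = 1700000000;  /* still fits in a signed 32-bit int */
      PetscInt   nz_per_row = 27;          /* illustrative stencil width        */
      PetscInt   nz_narrow;
      PetscInt64 nz_wide;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

      /* With a 32-bit PetscInt this product is ~4.6e10 and overflows
         (undefined behavior, typically a bizarre negative value); built
         --with-64-bit-indices=1, PetscInt is 64-bit and it is fine. */
      nz_narrow = nrows * nz_per_row;

      /* Widening the accumulator is the "long int" fix asked about above. */
      nz_wide = (PetscInt64)nrows * nz_per_row;

      ierr = PetscPrintf(PETSC_COMM_WORLD, "narrow count: %D   wide count: %lld\n",
                         nz_narrow, (long long)nz_wide);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return 0;
    }

Rebuilding with --with-64-bit-indices=1 widens every PetscInt, which is the blunt fix Eric applied for the June 2016 run; per-count use of a 64-bit type is what Barry's branch above moves toward.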
