More importantly,

==43569== Conditional jump or move depends on uninitialised value(s)
==43569==    at 0x1473C515: pzgstrs (pzgstrs.c:1074)
==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
==43569==    by 0x40465D: main (superlu_test.c:59)

You should run with valgrind's --track-origins=yes option to understand the reason for this.

On Sun, 1 Nov 2020 at 11:53, Barry Smith <[email protected]> wrote:

>
> You can sometimes use -on_error_attach_debugger noxterm and it will try
> to attach in the console where you started the job. If you are lucky this
> works and you can use bt to see the stack and look at variables. But if
> multiple ranks crash the debugger will get confused, and even if only one
> crashes, if it is not rank zero the stty can get messed up so you cannot
> type to control the debugger.
>
> The valgrind information is very valuable; Sherry can likely look at
> those lines and have a really good idea what the problem is. For example,
>
> Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd
>
> means that for some reason the code is writing past the end of an
> allocated array, either because the array allocated was not long enough or
> the code has some issue where it wants to write further than it should.
> This kind of thing is very common and usually easy to debug by someone who
> knows the code once they know exactly what line of code is problematic,
> since it shows exactly where the memory was allocated and exactly where it
> went out of bounds.
>
> Barry
>
>
> On Nov 1, 2020, at 1:21 AM, Marius Buerkle <[email protected]> wrote:
>
> Hi,
>
> I cannot use on_error_attach_debugger as X forwarding does not work on the
> system. Is it possible to dump the gdb output to a file instead?
>
> I ran it through valgrind. It seems there is some problem during calls in
> superlu_dist but I don't know if this eventually causes the seg fault.
> I think this is the relevant output:
>
> ==43569== Conditional jump or move depends on uninitialised value(s)
> ==43569==    at 0x1473C515: pzgstrs (pzgstrs.c:1074)
> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569==    by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Use of uninitialised value of size 8
> ==43569==    at 0x1473C554: pzgstrs (pzgstrs.c:1077)
> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569==    by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Use of uninitialised value of size 8
> ==43569==    at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569==    by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Invalid write of size 8
> ==43569==    at 0x1473C554: pzgstrs (pzgstrs.c:1077)
> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569==    by 0x40465D: main (superlu_test.c:59)
> ==43569==  Address 0x266e5ac0 is 0 bytes after a block of size 35,520 alloc'd
> ==43569==    at 0x4C2D814: memalign (vg_replace_malloc.c:906)
> ==43569==    by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
> ==43569==    by 0x1464D488: superlu_malloc_dist (memory.c:127)
> ==43569==    by 0x1473C451: pzgstrs (pzgstrs.c:1044)
> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569==    by 0x40465D: main (superlu_test.c:59)
> ==43569==
> ==43569== Invalid write of size 8
> ==43569==    at 0x1473C55A: pzgstrs (pzgstrs.c:1077)
> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569==    by 0x40465D: main (superlu_test.c:59)
> ==43569==  Address 0x266e5ad0 is 16 bytes after a block of size 35,520 alloc'd
> ==43569==    at 0x4C2D814: memalign (vg_replace_malloc.c:906)
> ==43569==    by 0x4C2D97B: posix_memalign (vg_replace_malloc.c:1070)
> ==43569==    by 0x1464D488: superlu_malloc_dist (memory.c:127)
> ==43569==    by 0x1473C451: pzgstrs (pzgstrs.c:1044)
> ==43569==    by 0x146F5E72: pzgssvx (pzgssvx.c:1422)
> ==43569==    by 0x58C3FE5: MatMatSolve_SuperLU_DIST (superlu_dist.c:242)
> ==43569==    by 0x55FB716: MatMatSolve (matrix.c:3485)
> ==43569==    by 0x40465D: main (superlu_test.c:59)
> ==43569==
>
> I also attached the whole log. Does this make any sense? The problem seems
> to be around where I get the original segfault.
>
> best,
> marius
>
>
> *Sent:* Saturday, 31 October 2020 at 04:07
> *From:* "Barry Smith" <[email protected]>
> *To:* "Marius Buerkle" <[email protected]>
> *Cc:* "Xiaoye S. Li" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
> *Subject:* Re: [petsc-users] superlu_dist segfault
>
> Have you run it yet with valgrind? It could be memory corruption earlier that
> causes a later crash; crashes that occur at different places for the same
> run are almost always due to memory corruption.
>
> If valgrind is clean you can run with -on_error_attach_debugger and, if
> the X forwarding is set up, it will open a debugger on the crashing process
> and you can type bt to see exactly where it is crashing, at what line
> number and code line.
>
> Barry
>
>
>
> On Oct 29, 2020, at 1:04 AM, Marius Buerkle <[email protected]> wrote:
>
> Hi Sherry,
>
> I used only 1 OpenMP thread and I also recompiled PETSc in debug mode with
> OpenMP turned off. But it did not help.
>
> Here is the output I can get from SuperLU during the PETSc run:
>
> Nonzeros in L 29519630
> Nonzeros in U 29519630
> nonzeros in L+U 58996711
> nonzeros in LSUB 4509612
> ** Memory Usage **********************************
> ** NUMfact space (MB): (sum-of-all-processes)
> L\U : 952.18 | Total : 1980.60
> ** Total highmark (MB):
> Sum-of-all : 12401.85 | Avg : 387.56 | Max : 387.56
> **************************************************
> **************************************************
> **** Time (seconds) ****
> EQUIL time 0.06
> ROWPERM time 1.03
> COLPERM time 1.01
> SYMBFACT time 0.45
> DISTRIBUTE time 0.33
> FACTOR time 0.90
> Factor flops 2.225916e+11 Mflops 247438.62
> SOLVE time 0.000
> **************************************************
>
> I tried all available ordering options for Colperm
> (NATURAL, MMD_AT_PLUS_A, MMD_ATA, METIS_AT_PLUS_A), save for parmetis, which
> always crashes. For Rowperm I used NOROWPERM and LargeDiag_MC64. All give the
> same seg fault.
>
>
> *Sent:* Thursday, 29 October 2020 at 14:14
> *From:* "Xiaoye S. Li" <[email protected]>
> *To:* "Marius Buerkle" <[email protected]>
> *Cc:* "Zhang, Hong" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
> *Subject:* Re: Re: Re: [petsc-users] superlu_dist segfault
>
> Hong: thanks for the diagnosis!
>
> Marius: how many OpenMP threads are you using per MPI task?
> In an earlier email, you mentioned the allocation failure at the following
> line:
> if ( !(lsum = (doublecomplex*) SUPERLU_MALLOC(sizelsum*num_thread *
> sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");
>
> This is in the solve phase. I think when we did some OpenMP optimization,
> we allowed several data structures to grow with the number of OpenMP
> threads. You can try to use 1 thread.
>
> The RHS and X memories are easy to compute. However, in order to gauge
> how much memory is used in the factorization, can you print out the number
> of nonzeros in the L and U factors? What ordering option are you using?
> The sparse matrix A looks pretty small.
>
> The code can also print out the working storage used during
> factorization. I am not sure how this printing can be turned on through
> PETSc.
>
> Sherry
>
> On Wed, Oct 28, 2020 at 9:43 PM Marius Buerkle <[email protected]> wrote:
>
>> Thanks for the swift reply.
>>
>> I also realized that if I reduce the number of RHS then it works. But I am
>> running the code on a cluster with 256 GB RAM per node. One dense matrix
>> would be around ~30 GB, so 60 GB in total, which is large but does not
>> exceed the memory of even one node, and I also get the seg fault if I run
>> it on several nodes. Moreover, it works well with the MUMPS and
>> MKL_CPARDISO solvers. The maximum memory used when using MUMPS is around
>> 150 GB during the solver phase, but SuperLU_DIST crashed even before
>> reaching the solver phase. Could there be such a large difference in
>> memory usage between SuperLU_DIST and MUMPS?
>>
>>
>> best,
>>
>> marius
>>
>> *Sent:* Thursday, 29 October 2020 at 10:10
>> *From:* "Zhang, Hong" <[email protected]>
>> *To:* "Marius Buerkle" <[email protected]>
>> *Cc:* "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
>> *Subject:* Re: Re: [petsc-users] superlu_dist segfault
>>
>> Marius,
>> I tested your code with petsc-release on my Mac laptop using np=2 cores.
>> I first tested a small matrix data file successfully. Then I switched to
>> your data file and ran out of memory, likely due to the dense matrices B
>> and X. I got an error "Your system has run out of application memory" from
>> my laptop.
>>
>> The sparse matrix A has size 42549 by 42549. Your code creates dense
>> matrices B and X with the same size -- a huge memory requirement!
>> By replacing B and X with size 42549 by nrhs (nrhs <= 4000), I had the
>> code run well with np=2.
>> Note the error message you got
>> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>> probably memory access out of range
>>
>> The modified code I used is attached.
>> Hong
>>
>> ------------------------------
>> *From:* Marius Buerkle <[email protected]>
>> *Sent:* Tuesday, October 27, 2020 10:01 PM
>> *To:* Zhang, Hong <[email protected]>
>> *Cc:* [email protected] <[email protected]>; Sherry Li <[email protected]>
>> *Subject:* Aw: Re: [petsc-users] superlu_dist segfault
>>
>> Hi,
>>
>> I recompiled PETSc with the debug option; now I get a seg fault at a
>> different position:
>>
>> [23]PETSC ERROR: ------------------------------------------------------------------------
>> [23]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>> [23]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [23]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [23]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> [23]PETSC ERROR: likely location of problem given in stack below
>> [23]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> [23]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> [23]PETSC ERROR: INSTEAD the line number of the start of the function
>> [23]PETSC ERROR: is given.
>> [23]PETSC ERROR: [23] SuperLU_DIST:pzgssvx line 242
>> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>> [23]PETSC ERROR: [23] MatMatSolve_SuperLU_DIST line 211
>> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>> [23]PETSC ERROR: [23] MatMatSolve line 3466
>> /home/cdfmat_marius/prog/petsc/git/release/petsc/src/mat/interface/matrix.c
>> [23]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>> [23]PETSC ERROR: Signal received
>>
>> I made a small reproducer. The matrix is a bit too big, so I cannot
>> attach it directly to the email, but I put it in the cloud:
>> https://1drv.ms/u/s!AqZsng1oUcKzjYxGMGHojLRG09Sf1A?e=7uHnmw
>>
>> Best,
>> Marius
>>
>>
>> *Sent:* Tuesday, 27 October 2020 at 23:11
>> *From:* "Zhang, Hong" <[email protected]>
>> *To:* "Marius Buerkle" <[email protected]>, "[email protected]" <[email protected]>, "Sherry Li" <[email protected]>
>> *Subject:* Re: [petsc-users] superlu_dist segfault
>>
>> Marius,
>> It fails at line 1075 in file
>> /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>> if ( !(lsum = (doublecomplex*)SUPERLU_MALLOC(sizelsum*num_thread *
>> sizeof(doublecomplex)))) ABORT("Malloc fails for lsum[].");
>>
>> We do not know what it means. You may use a debugger to check the values
>> of the variables involved.
>> I'm cc'ing Sherry (superlu_dist developer), or you may send us a
>> short stand-alone code that reproduces the error. We can help
>> investigate it.
>> Hong
>>
>>
>> ------------------------------
>> *From:* petsc-users <[email protected]> on behalf of
>> Marius Buerkle <[email protected]>
>> *Sent:* Tuesday, October 27, 2020 8:46 AM
>> *To:* [email protected] <[email protected]>
>> *Subject:* [petsc-users] superlu_dist segfault
>>
>> Hi,
>>
>> When using MatMatSolve with superlu_dist I get a segmentation fault:
>>
>> Malloc fails for lsum[]. at line 1075 in file
>> /home/petsc3.14.release/arch-linux-c-debug/externalpackages/git.superlu_dist/SRC/pzgstrs.c
>>
>> The matrix size is not particularly big. I am using the petsc release
>> branch, and superlu_dist is v6.3.0, I think.
>>
>> Best,
>> Marius
>>
> <valgrind.tar.gz>
>

--
Stefano
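
As a rough check of the memory figures discussed in this thread, the sketch below estimates the storage needed for one dense right-hand-side or solution matrix. It assumes 16 bytes per entry (a double-precision complex value, which the doublecomplex type in the quoted pzgstrs.c line suggests) and ignores solver workspace, MPI buffers and the L/U factors. The helper name dense_bytes is made up for illustration; only the sizes 42549 and 4000 come from the messages above.

```python
# Rough memory estimate for a dense rows-by-cols matrix of complex doubles.
# Assumption: 16 bytes per entry (two 8-byte doubles); solver workspace,
# MPI buffers and the L/U factors are not counted.

BYTES_PER_COMPLEX_DOUBLE = 16


def dense_bytes(rows: int, cols: int) -> int:
    """Return the storage in bytes for a rows-by-cols complex double matrix."""
    return rows * cols * BYTES_PER_COMPLEX_DOUBLE


n = 42549    # order of the sparse matrix A reported in the thread
nrhs = 4000  # reduced number of right-hand sides that ran with np=2

full = dense_bytes(n, n)        # B (or X) with as many columns as rows
reduced = dense_bytes(n, nrhs)  # B (or X) with only nrhs columns

print(f"n x n    : {full / 1e9:.2f} GB per matrix, {2 * full / 1e9:.2f} GB for B and X")
print(f"n x nrhs : {reduced / 1e9:.2f} GB per matrix, {2 * reduced / 1e9:.2f} GB for B and X")
```

Under these assumptions an n-by-n matrix takes roughly 29 GB, in line with the "~30 GB" quoted in the thread, while 4000 right-hand sides need under 3 GB per matrix.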
