Hong,

My fix is already in next [and will merge to maint now].
https://bitbucket.org/petsc/petsc/commits/83fed2ed878cb731bb04364f986d423ef53d20e6

I was hoping you would check the issue with the valgrind messages on
MatGetBrowsOfAoCols_MPIAIJ() [As mentioned in my earlier mail - it's
probably setting some local variables with uninitialized data for
sequential runs - and perhaps can be fixed by not doing that..]

And I've indicated my changes to Constantin's code in my earlier
e-mail. [I don't see some of my changes in your diff..]

Satish

On Fri, 27 May 2016, Hong wrote:

> Satish,
> I tested your fix on ex51f.F90 (modified from
> build_nullbasis_petsc_mumps.F90) - it gives clean results with valgrind.
>
> Will you apply the patch to petsc-maint?
>
> I would also like to add ex51f.F90 (contributed by Constantin)
> to petsc/src/ksp/ksp/examples/tests/.
>
> Hong
>
>
> On Thu, May 26, 2016 at 5:15 PM, Hong <[email protected]> wrote:
>
> > Satish found a problem in using inode routines.
> >
> > In addition, the user code has bugs. I modified
> > build_nullbasis_petsc_mumps.F90 into ex51f.F90 (attached),
> > which works well with the option '-mat_no_inode'.
> >
> > ex51f.F90 differs from build_nullbasis_petsc_mumps.F90 in
> > 1) use MATAIJ/MATDENSE instead of MATMPIAIJ/MATMPIDENSE
> > MATAIJ wraps MATSEQAIJ and MATMPIAIJ.
> >
> > 2)
> > MatConvert(x, MATMPIAIJ, MAT_REUSE_MATRIX, x,ierr)
> > ->
> > MatConvert(x, MATMPIAIJ, MAT_INPLACE_MATRIX, x,ierr)
> > see
> > http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatConvert.html
> >
> > Hong
> >
> > On Thu, May 26, 2016 at 3:05 PM, Satish Balay <[email protected]> wrote:
> >
> >> Well, it looks like the MatGetBrowsOfAoCols_MPIAIJ() issue is primarily
> >> setting some local variables with uninitialized data [that's primarily
> >> set/used for parallel communication]. So valgrind flags it - but I
> >> don't think it gets used later on.
> >>
> >> [perhaps most of the code should be skipped for a sequential run..]
> >>
> >> The primary issue here is MatGetRowIJ_SeqAIJ_Inode_Symmetric() called
> >> by MatGetOrdering_ND().
> >>
> >> The workaround is to not use ND with:
> >> call PCFactorSetMatOrderingType(pc,MATORDERINGNATURAL,ierr)
> >>
> >> But I think the following might be the fix [have to recheck].. The
> >> test code works with this change [with the default ND]
> >>
> >> diff --git a/src/mat/impls/aij/seq/inode.c b/src/mat/impls/aij/seq/inode.c
> >> index 9af404e..49f76ce 100644
> >> --- a/src/mat/impls/aij/seq/inode.c
> >> +++ b/src/mat/impls/aij/seq/inode.c
> >> @@ -97,6 +97,7 @@ static PetscErrorCode MatGetRowIJ_SeqAIJ_Inode_Symmetric(Mat A,const PetscInt *i
> >>
> >>      j    = aj + ai[row] + ishift;
> >>      jmax = aj + ai[row+1] + ishift;
> >> +    if (j==jmax) continue; /* empty row */
> >>      col  = *j++ + ishift;
> >>      i2   = tvc[col];
> >>      while (i2<i1 && j<jmax) { /* 1.[-xx-d-xx--] 2.[-xx-------],off-diagonal elemets */
> >> @@ -125,6 +126,7 @@ static PetscErrorCode MatGetRowIJ_SeqAIJ_Inode_Symmetric(Mat A,const PetscInt *i
> >>    for (i1=0,row=0; i1<nslim_row; row += ns_row[i1],i1++) {
> >>      j    = aj + ai[row] + ishift;
> >>      jmax = aj + ai[row+1] + ishift;
> >> +    if (j==jmax) continue; /* empty row */
> >>      col  = *j++ + ishift;
> >>      i2   = tvc[col];
> >>      while (i2<i1 && j<jmax) {
> >>
> >> Satish
> >>
> >> On Thu, 26 May 2016, Hong wrote:
> >>
> >> > I'll investigate this - had a day off since yesterday.
> >> > Hong
> >> >
> >> > On Thu, May 26, 2016 at 12:04 PM, Barry Smith <[email protected]> wrote:
> >> >
> >> > >
> >> > > Hong needs to run with this matrix and add appropriate error checkers in
> >> > > the matrix routines to detect "incomplete" matrices and likely just error
> >> > > out.
> >> > >
> >> > > Barry
> >> > >
> >> > > > On May 26, 2016, at 11:23 AM, Satish Balay <[email protected]> wrote:
> >> > > >
> >> > > > Mat Object: 1 MPI processes
> >> > > > type: mpiaij
> >> > > > row 0: (0, 0.) (1, 0.486111)
> >> > > > row 1: (0, 0.486111) (1, 0.)
> >> > > > row 2: (2, 0.) (3, 0.486111)
> >> > > > row 3: (4, 0.486111) (5, -0.486111)
> >> > > > row 4:
> >> > > > row 5:
> >> > > >
> >> > > > The matrix created is funny (empty rows at the end) - so perhaps it's
> >> > > > exposing bugs in Mat code? [is that a valid matrix for this code?]
> >> > > >
> >> > > > ==21091== Use of uninitialised value of size 8
> >> > > > ==21091==    at 0x57CA16B: MatGetRowIJ_SeqAIJ_Inode_Symmetric (inode.c:101)
> >> > > > ==21091==    by 0x57CBA1C: MatGetRowIJ_SeqAIJ_Inode (inode.c:241)
> >> > > > ==21091==    by 0x537C0B5: MatGetRowIJ (matrix.c:7274)
> >> > > > ==21091==    by 0x53072FD: MatGetOrdering_ND (spnd.c:18)
> >> > > > ==21091==    by 0x530BC39: MatGetOrdering (sorder.c:260)
> >> > > > ==21091==    by 0x530A72D: MatGetOrdering (sorder.c:202)
> >> > > > ==21091==    by 0x5DDD764: PCSetUp_LU (lu.c:124)
> >> > > > ==21091==    by 0x5EBFE60: PCSetUp (precon.c:968)
> >> > > > ==21091==    by 0x5FDA1B3: KSPSetUp (itfunc.c:390)
> >> > > > ==21091==    by 0x601C17D: kspsetup_ (itfuncf.c:252)
> >> > > > ==21091==    by 0x4028B9: MAIN__ (ex1f.F90:104)
> >> > > > ==21091==    by 0x403535: main (ex1f.F90:185)
> >> > > >
> >> > > >
> >> > > > This goes away if I add:
> >> > > >
> >> > > > call PCFactorSetMatOrderingType(pc,MATORDERINGNATURAL,ierr)
> >> > > >
> >> > > > And then there is also:
> >> > > >
> >> > > > ==21275== Invalid read of size 8
> >> > > > ==21275==    at 0x584DE93: MatGetBrowsOfAoCols_MPIAIJ (mpiaij.c:4734)
> >> > > > ==21275==    by 0x58970A8: MatMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:198)
> >> > > > ==21275==    by 0x5894A54: MatMatMult_MPIAIJ_MPIAIJ (mpimatmatmult.c:34)
> >> > > > ==21275==    by 0x539664E: MatMatMult (matrix.c:9510)
> >> > > > ==21275==    by 0x53B3201: matmatmult_ (matrixf.c:1157)
> >> > > > ==21275==    by 0x402FC9: MAIN__ (ex1f.F90:149)
> >> > > > ==21275==    by 0x4035B9: main (ex1f.F90:186)
> >> > > > ==21275==  Address 0xa3d20f0 is 0 bytes after a block of size 48 alloc'd
> >> > > > ==21275==    at 0x4C2DF93: memalign (vg_replace_malloc.c:858)
> >> > > > ==21275==    by 0x4FDE05E: PetscMallocAlign (mal.c:28)
> >> > > > ==21275==    by 0x5240240: VecScatterCreate (vscat.c:1220)
> >> > > > ==21275==    by 0x5857708: MatSetUpMultiply_MPIAIJ (mmaij.c:116)
> >> > > > ==21275==    by 0x581C31E: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
> >> > > > ==21275==    by 0x53680F2: MatAssemblyEnd (matrix.c:5187)
> >> > > > ==21275==    by 0x53B24D2: matassemblyend_ (matrixf.c:926)
> >> > > > ==21275==    by 0x40262C: MAIN__ (ex1f.F90:60)
> >> > > > ==21275==    by 0x4035B9: main (ex1f.F90:186)
> >> > > >
> >> > > >
> >> > > > Satish
> >> > > >
> >> > > > -----------
> >> > > >
> >> > > > $ diff build_nullbasis_petsc_mumps.F90 ex1f.F90
> >> > > > 3,7c3
> >> > > > < #include <petsc/finclude/petscsys.h>
> >> > > > < #include "petsc/finclude/petscvec.h"
> >> > > > < #include "petsc/finclude/petscmat.h"
> >> > > > < #include "petsc/finclude/petscpc.h"
> >> > > > < #include "petsc/finclude/petscksp.h"
> >> > > > ---
> >> > > > > #include "petsc/finclude/petsc.h"
> >> > > > 40,41c36,37
> >> > > > < call PetscViewerBinaryOpen(PETSC_COMM_WORLD, "mat_c_bin.txt", 0, viewer, ierr)
> >> > > > < call MatLoad(mat_c, viewer)
> >> > > > ---
> >> > > > > call PetscViewerBinaryOpen(PETSC_COMM_WORLD, "mat_c_bin.txt", FILE_MODE_READ, viewer, ierr)
> >> > > > > call MatLoad(mat_c, viewer,ierr)
> >> > > > 75a72
> >> > > > > call PCFactorSetMatOrderingType(pc,MATORDERINGNATURAL,ierr)
> >> > > > 150c147
> >> > > > < call MatConvert(x, MATMPIAIJ, MAT_REUSE_MATRIX, x, ierr)
> >> > > > ---
> >> > > > > call MatConvert(x, MATMPIAIJ, MAT_INPLACE_MATRIX, x, ierr)
> >> > > >
> >> > > >
> >> > > > On Thu, 26 May 2016, Matthew Knepley wrote:
> >> > > >
> >> > > >> Usually this means you have an uninitialized variable that is causing you
> >> > > >> to overwrite memory. Fortran
> >> > > >> is so lax in checking this, it's one reason to switch to C.
> >> > > >>
> >> > > >> Thanks,
> >> > > >>
> >> > > >> Matt
> >> > > >>
> >> > > >> On Thu, May 26, 2016 at 1:46 AM, Constantin Nguyen Van <[email protected]> wrote:
> >> > > >>
> >> > > >>> Thanks for all your answers.
> >> > > >>> I'm sorry for the syntax mistake in MatLoad; it was introduced afterwards.
> >> > > >>>
> >> > > >>> I recompiled PETSc with --with-debugging=yes and ran my code again.
> >> > > >>> Now I also see this strange behaviour. When I run the code without
> >> > > >>> valgrind and with one proc, I get this error message:
> >> > > >>>
> >> > > >>> BEGIN PROC 0
> >> > > >>> ITERATION 1
> >> > > >>> ECHO 1
> >> > > >>> ECHO 2
> >> > > >>> INFOG(28): 2
> >> > > >>> BASIS OK 0
> >> > > >>> END PROC 0
> >> > > >>> BEGIN PROC 0
> >> > > >>> ITERATION 2
> >> > > >>> ECHO 1
> >> > > >>> ECHO 2
> >> > > >>> INFOG(28): 2
> >> > > >>> BASIS OK 0
> >> > > >>> END PROC 0
> >> > > >>> BEGIN PROC 0
> >> > > >>> ITERATION 3
> >> > > >>> ECHO 1
> >> > > >>> [0]PETSC ERROR: ------------------------------------------------------------------------
> >> > > >>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> >> > > >>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> >> > > >>> [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> >> > > >>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> >> > > >>> [0]PETSC ERROR: likely location of problem given in stack below
> >> > > >>> [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> >> > > >>> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> >> > > >>> [0]PETSC ERROR: INSTEAD the line number of the start of the function
> >> > > >>> [0]PETSC ERROR: is given.
> >> > > >>> [0]PETSC ERROR: [0] MatGetRowIJ_SeqAIJ_Inode_Symmetric line 69 /home/j10077/librairie/petsc-mumps/petsc-3.6.4/src/mat/impls/aij/seq/inode.c
> >> > > >>> [0]PETSC ERROR: [0] MatGetRowIJ_SeqAIJ_Inode line 235 /home/j10077/librairie/petsc-mumps/petsc-3.6.4/src/mat/impls/aij/seq/inode.c
> >> > > >>> [0]PETSC ERROR: [0] MatGetRowIJ line 7099 /home/j10077/librairie/petsc-mumps/petsc-3.6.4/src/mat/interface/matrix.c
> >> > > >>> [0]PETSC ERROR: [0] MatGetOrdering_ND line 17 /home/j10077/librairie/petsc-mumps/petsc-3.6.4/src/mat/order/spnd.c
> >> > > >>> [0]PETSC ERROR: [0] MatGetOrdering line 185 /home/j10077/librairie/petsc-mumps/petsc-3.6.4/src/mat/order/sorder.c
> >> > > >>> [0]PETSC ERROR: [0] MatGetOrdering line 185 /home/j10077/librairie/petsc-mumps/petsc-3.6.4/src/mat/order/sorder.c
> >> > > >>> [0]PETSC ERROR: [0] PCSetUp_LU line 99 /home/j10077/librairie/petsc-mumps/petsc-3.6.4/src/ksp/pc/impls/factor/lu/lu.c
> >> > > >>> [0]PETSC ERROR: [0] PCSetUp line 945 /home/j10077/librairie/petsc-mumps/petsc-3.6.4/src/ksp/pc/interface/precon.c
> >> > > >>> [0]PETSC ERROR: [0] KSPSetUp line 247 /home/j10077/librairie/petsc-mumps/petsc-3.6.4/src/ksp/ksp/interface/itfunc.c
> >> > > >>>
> >> > > >>> But when I run it with valgrind, it works fine.
> >> > > >>>
> >> > > >>> On 2016-05-25 20:04, Barry Smith wrote:
> >> > > >>>
> >> > > >>>> First run with valgrind
> >> > > >>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> >> > > >>>>
> >> > > >>>> On May 25, 2016, at 2:35 AM, Constantin Nguyen Van <[email protected]> wrote:
> >> > > >>>>>
> >> > > >>>>> Hi,
> >> > > >>>>>
> >> > > >>>>> I'm a new user of PETSc and I am trying to use it with MUMPS
> >> > > >>>>> functionality to compute a nullbasis.
> >> > > >>>>> I wrote a code that computes the same nullbasis 4 times. It
> >> > > >>>>> works well when I run it with several procs, but with only one
> >> > > >>>>> processor I get an error on the 2nd iteration when KSPSetUp is
> >> > > >>>>> called. Furthermore, when it is run with a debugger
> >> > > >>>>> (--with-debugging=yes), it works fine with one or several processors.
> >> > > >>>>> Do you have any idea why it doesn't work with one processor
> >> > > >>>>> and no debugger?
> >> > > >>>>>
> >> > > >>>>> Thanks.
> >> > > >>>>> Constantin.
> >> > > >>>>>
> >> > > >>>>> PS: You can find the code and the files required to run it enclosed.
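
For reference, the empty-row guard from the inode.c patch quoted in this thread can be illustrated outside PETSc. The following is only a minimal sketch in plain C, not PETSc code: first_columns() is a hypothetical helper invented for this illustration, and it assumes 0-based CSR arrays (ai = row pointers, aj = column indices) with no index shift. It shows why dereferencing the row pointer without first checking j == jmax reads data that does not belong to the row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PETSc routine): for each row of a CSR matrix,
 * record the first stored column index, or -1 when the row has no entries.
 * Returns the number of empty rows found. */
int first_columns(int nrows, const int *ai, const int *aj, int *first_col)
{
  int empty = 0;

  assert(nrows >= 0);
  for (int row = 0; row < nrows; row++) {
    const int *j    = aj + ai[row];      /* start of this row's column indices */
    const int *jmax = aj + ai[row + 1];  /* one past the end of this row       */

    if (j == jmax) {                     /* empty row: *j would belong to a    */
      first_col[row] = -1;               /* later row, or lie past the end of  */
      empty++;                           /* aj for trailing empty rows         */
      continue;
    }
    first_col[row] = *j;
  }
  return empty;
}
```

With the 6-row matrix printed earlier in the thread (ai = {0,2,4,6,8,8,8}, aj = {0,1,0,1,2,3,4,5}), rows 4 and 5 take the guarded branch and are reported as -1, so aj is never read past its last entry.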
