This log looks truncated. Are there any valgrind messages before this? [like from your application code - or from MPI]
Perhaps you can send the complete log - with:

valgrind -q --tool=memcheck --leak-check=yes --num-callers=20 --track-origins=yes

[and if there were more valgrind messages from MPI - rebuild petsc with
--download-mpich - for a valgrind clean mpi]

Sherry,

Perhaps this log points to some issue in superlu_dist?

thanks,
Satish

On Tue, 11 Oct 2016, Anton Popov wrote:

> Valgrind immediately detects interesting stuff:
>
> ==25673== Use of uninitialised value of size 8
> ==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
> ==25674== Use of uninitialised value of size 8
> ==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
> ==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
>
> ==25673== Conditional jump or move depends on uninitialised value(s)
> ==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
> ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
>
> ==25673== Conditional jump or move depends on uninitialised value(s)
> ==25673==    at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
> ==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
> ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> ==25674== Use of uninitialised value of size 8
> ==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
> ==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
> ==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
> ==25674==    by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
> ==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
> ==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25674==    by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
> ==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> ==25674== Use of uninitialised value of size 8
> ==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> And it crashes after this:
>
> ==25674== Invalid write of size 4
> ==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
> ==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
> ==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
> ==25674==
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>
>
> On 10/11/2016 03:26 PM, Anton Popov wrote:
> >
> > On 10/10/2016 07:11 PM, Satish Balay wrote:
> > > That's from petsc-3.5
> > >
> > > Anton - please post the stack trace you get with
> > > --download-superlu_dist-commit=origin/maint
> >
> > I guess this is it:
> >
> > [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> > [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> > [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985 /home/anton/LIB/petsc/src/mat/interface/matrix.c
> > [0]PETSC ERROR: [0] PCSetUp_LU line 101 /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
> > [0]PETSC ERROR: [0] PCSetUp line 930 /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
> >
> > According to the line numbers it crashes within
> > MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.
> >
> > Surprisingly this only happens on the second SNES iteration, but not on the
> > first.
> >
> > I'm trying to reproduce this behavior with PETSc KSP and SNES examples.
> > However, everything I've tried up to now with SuperLU_DIST does just fine.
> >
> > I'm also checking our code in Valgrind to make sure it's clean.
> >
> > Anton
> > >
> > > Satish
> > >
> > >
> > > On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
> > >
> > > > Which version of superlu_dist does this capture? I looked at the
> > > > original error log, it pointed to pdgssvx: line 161. But that line
> > > > is in a comment block, not the program.
> > > >
> > > > Sherry
> > > >
> > > >
> > > > On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov <[email protected]> wrote:
> > > >
> > > > >
> > > > > On 10/07/2016 05:23 PM, Satish Balay wrote:
> > > > >
> > > > > > On Fri, 7 Oct 2016, Kong, Fande wrote:
> > > > > >
> > > > > > > On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay <[email protected]> wrote:
> > > > > > >
> > > > > > > > On Fri, 7 Oct 2016, Anton Popov wrote:
> > > > > > > >
> > > > > > > > > Hi guys,
> > > > > > > > >
> > > > > > > > > is there any news about fixing buggy behavior of
> > > > > > > > > SuperLU_DIST, exactly what is described here:
> > > > > > > > >
> > > > > > > > > https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.mcs.anl.gov_pipermail_petsc-2Dusers_2015-2DAugust_026802.html&d=CwIBAg&c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&m=RwruX6ckX0t9H89Z6LXKBfJBOAM2vG1sQHw2tIsSQtA&s=bbB62oGLm582JebVs8xsUej_OX0eUwibAKsRRWKafos&e= ?
> > > > > > > > >
> > > > > > > > > I'm using 3.7.4 and still get SEGV in pdgssvx routine.
> > > > > > > > > Everything works fine with 3.5.4.
> > > > > > > > >
> > > > > > > > > Do I still have to stick to maint branch, and what are the
> > > > > > > > > chances for these fixes to be included in 3.7.5?
> > > > > > > >
> > > > > > > > 3.7.4 is off maint branch [as of a week ago]. So if you are
> > > > > > > > seeing issues with it - it's best to debug and figure out the
> > > > > > > > cause.
> > > > > > >
> > > > > > > This bug is indeed inside of superlu_dist, and we started having
> > > > > > > this issue from PETSc-3.6.x. I think superlu_dist developers
> > > > > > > should have fixed this bug. We forgot to update superlu_dist??
> > > > > > > This is not a thing users could debug and fix.
> > > > > > >
> > > > > > > I have many people in INL suffering from this issue, and they
> > > > > > > have to stay with PETSc-3.5.4 to use superlu_dist.
> > > > > >
> > > > > > To verify if the bug is fixed in latest superlu_dist - you can try
> > > > > > [assuming you have git - either from petsc-3.7/maint/master]:
> > > > > >
> > > > > > --download-superlu_dist --download-superlu_dist-commit=origin/maint
> > > > > >
> > > > > >
> > > > > > Satish
> > > > > >
> > > > > Hi Satish,
> > > > > I did this:
> > > > >
> > > > > git clone -b maint https://bitbucket.org/petsc/petsc.git petsc
> > > > >
> > > > > --download-superlu_dist
> > > > > --download-superlu_dist-commit=origin/maint (not sure this is needed,
> > > > > since I'm already in maint)
> > > > >
> > > > > The problem is still there.
> > > > >
> > > > > Cheers,
> > > > > Anton
> > > > >
