I looked at each item Valgrind complained about in your email dated Oct. 11. Those reports are superficial; I don't see anything wrong with the lines it singled out (mostly uninitialized variables). I ran a few tests with the latest version on GitHub, and all went fine.
Perhaps you can print the matrix that caused the problem, so that I can run it using your matrix.

Sherry

On Tue, Oct 11, 2016 at 2:18 PM, Anton <[email protected]> wrote:

> On 10/11/16 7:19 PM, Satish Balay wrote:
>
>> This log looks truncated. Are there any valgrind messages before this?
>> [like from your application code - or from MPI]
>
> Yes it is indeed truncated. I only included relevant messages.
>
>> Perhaps you can send the complete log - with:
>> valgrind -q --tool=memcheck --leak-check=yes --num-callers=20 --track-origins=yes
>>
>> [and if there were more valgrind messages from MPI - rebuild petsc
>> with --download-mpich - for a valgrind clean mpi]
>
> There are no messages originating from our code, just a few MPI related
> ones (probably false positives) and from SuperLU_DIST (most of them).
>
> Thanks,
> Anton
>
>> Sherry,
>> Perhaps this log points to some issue in superlu_dist?
>>
>> thanks,
>> Satish
>>
>> On Tue, 11 Oct 2016, Anton Popov wrote:
>>
>>> Valgrind immediately detects interesting stuff:
>>>
>>> ==25673== Use of uninitialised value of size 8
>>> ==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
>>> ==25674== Use of uninitialised value of size 8
>>> ==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
>>> ==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
>>> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> ==25673== Conditional jump or move depends on uninitialised value(s)
>>> ==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
>>> ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> ==25673== Conditional jump or move depends on uninitialised value(s)
>>> ==25673==    at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
>>> ==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>>> ==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>>> ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> ==25674== Use of uninitialised value of size 8
>>> ==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
>>> ==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
>>> ==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
>>> ==25674==    by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
>>> ==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
>>> ==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
>>> ==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
>>> ==25674==    by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
>>> ==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>>> ==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>>> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> ==25674== Use of uninitialised value of size 8
>>> ==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
>>> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> And it crashes after this:
>>>
>>> ==25674== Invalid write of size 4
>>> ==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
>>> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>> ==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
>>> ==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
>>> ==25674==
>>> [1]PETSC ERROR: ------------------------------------------------------------------------
>>> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>
>>> On 10/11/2016 03:26 PM, Anton Popov wrote:
>>>
>>>> On 10/10/2016 07:11 PM, Satish Balay wrote:
>>>>
>>>>> That's from petsc-3.5
>>>>>
>>>>> Anton - please post the stack trace you get with
>>>>> --download-superlu_dist-commit=origin/maint
>>>>
>>>> I guess this is it:
>>>>
>>>> [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>>> [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>>> [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985 /home/anton/LIB/petsc/src/mat/interface/matrix.c
>>>> [0]PETSC ERROR: [0] PCSetUp_LU line 101 /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
>>>> [0]PETSC ERROR: [0] PCSetUp line 930 /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
>>>>
>>>> According to the line numbers it crashes within
>>>> MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.
>>>>
>>>> Surprisingly this only happens on the second SNES iteration, but not on
>>>> the first.
>>>>
>>>> I'm trying to reproduce this behavior with PETSc KSP and SNES examples.
>>>> However, everything I've tried up to now with SuperLU_DIST does just
>>>> fine.
>>>>
>>>> I'm also checking our code in Valgrind to make sure it's clean.
>>>>
>>>> Anton
>>>>
>>>>> Satish
>>>>>
>>>>> On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
>>>>>
>>>>>> Which version of superlu_dist does this capture? I looked at the
>>>>>> original error log, it pointed to pdgssvx: line 161. But that line
>>>>>> is in comment block, not the program.
>>>>>>
>>>>>> Sherry
>>>>>>
>>>>>> On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov <[email protected]> wrote:
>>>>>>
>>>>>>> On 10/07/2016 05:23 PM, Satish Balay wrote:
>>>>>>>
>>>>>>>> On Fri, 7 Oct 2016, Kong, Fande wrote:
>>>>>>>>
>>>>>>>>> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On Fri, 7 Oct 2016, Anton Popov wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> are there any news about fixing buggy behavior of
>>>>>>>>>>> SuperLU_DIST, exactly what is described here:
>>>>>>>>>>>
>>>>>>>>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.mcs.anl.gov_pipermail_petsc-2Dusers_2015-2DAugust_026802.html&d=CwIBAg&c=54IZrppPQZKX9mLzcGdPfFD1hxrcB__aEkJFOKJFd00&r=DUUt3SRGI0_JgtNaS3udV68GRkgV4ts7XKfj2opmiCY&m=RwruX6ckX0t9H89Z6LXKBfJBOAM2vG1sQHw2tIsSQtA&s=bbB62oGLm582JebVs8xsUej_OX0eUwibAKsRRWKafos&e= ?
>>>>>>>>>>>
>>>>>>>>>>> I'm using 3.7.4 and still get SEGV in pdgssvx routine.
>>>>>>>>>>> Everything works fine with 3.5.4.
>>>>>>>>>>>
>>>>>>>>>>> Do I still have to stick to maint branch, and what are the
>>>>>>>>>>> chances for these fixes to be included in 3.7.5?
>>>>>>>>>>
>>>>>>>>>> 3.7.4. is off maint branch [as of a week ago]. So if you are
>>>>>>>>>> seeing issues with it - it's best to debug and figure out the cause.
>>>>>>>>>
>>>>>>>>> This bug is indeed inside of superlu_dist, and we started having
>>>>>>>>> this issue from PETSc-3.6.x. I think superlu_dist developers should
>>>>>>>>> have fixed this bug. We forgot to update superlu_dist?? This is not
>>>>>>>>> a thing users could debug and fix.
>>>>>>>>>
>>>>>>>>> I have many people in INL suffering from this issue, and they have
>>>>>>>>> to stay with PETSc-3.5.4 to use superlu_dist.
>>>>>>>>
>>>>>>>> To verify if the bug is fixed in latest superlu_dist - you can try
>>>>>>>> [assuming you have git - either from petsc-3.7/maint/master]:
>>>>>>>>
>>>>>>>> --download-superlu_dist --download-superlu_dist-commit=origin/maint
>>>>>>>>
>>>>>>>> Satish
>>>>>>>
>>>>>>> Hi Satish,
>>>>>>> I did this:
>>>>>>>
>>>>>>> git clone -b maint https://bitbucket.org/petsc/petsc.git petsc
>>>>>>>
>>>>>>> --download-superlu_dist
>>>>>>> --download-superlu_dist-commit=origin/maint (not sure this is
>>>>>>> needed, since I'm already in maint)
>>>>>>>
>>>>>>> The problem is still there.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Anton
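When the complete memcheck log is available, a small script along these lines might help separate the many uninitialised-value warnings from the one invalid write that precedes the crash. This is only a sketch under stated assumptions: the function name summarize and the variable SAMPLE are made up for illustration, the sample lines are copied from the log quoted above, and the script assumes every Valgrind line carries the usual ==PID== prefix. It counts each report by its message and the function in its first "at" frame, keeping the state per process because the output of the two MPI ranks is interleaved.

```python
import re
from collections import Counter

# Every memcheck line starts with "==PID==".
LINE = re.compile(r"^==(\d+)==\s*(.*)$")
# First stack frame of a report: "at 0xADDR: function (file:line)".
TOP_FRAME = re.compile(r"^at 0x[0-9A-Fa-f]+: (\S+)")

# Sample lines copied from the log quoted in this thread.
SAMPLE = """\
==25673== Use of uninitialised value of size 8
==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674== Use of uninitialised value of size 8
==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
==25674== Invalid write of size 4
==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
"""


def summarize(log_text):
    """Count (message, top-frame function) pairs in a memcheck log."""
    counts = Counter()
    pending = {}  # PID -> message still waiting for its "at" frame
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # not a Valgrind line (e.g. PETSc error output)
        pid, text = match.groups()
        frame = TOP_FRAME.match(text)
        if frame:
            if pid in pending:
                counts[(pending.pop(pid), frame.group(1))] += 1
        elif text and not text.startswith(("by ", "Address ")):
            # Anything else is treated as the headline of a new report.
            pending[pid] = text
    return counts


if __name__ == "__main__":
    for (message, function), n in summarize(SAMPLE).most_common():
        print(f"{n:3d}  {message}  [{function}]")
```

The grouping key is deliberately coarse (message plus top function); a finer key, such as the source file and line, could be used instead if the per-function summary turns out to hide distinct reports.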
