Yes - but this test code [that Hong is also using] is buggy due to using MatLoad() twice - so the corrupted Matrix does have weird behavior later in PC.
With your fix - the test code provided by Anton behaves fine for me. So Hong would have to restart the diagnosis - and I suspect all the weird behavior she observed will go away [well I don't see the original weird behavior with this test code anymore]..

Since you said "This will also make MatMPIAIJSetPreallocation() work properly with multiple calls" - perhaps Anton's issue is also somehow related? I think it's best if he can try this fix. And if it doesn't work - then we'll need a better test case to reproduce.

[Or perhaps Hong is using a different test code and is observing bugs with superlu_dist interface..]

Satish

On Mon, 24 Oct 2016, Barry Smith wrote:

>
> Hong wrote: (Note that it creates a new Mat each time so shouldn't be
> affected by the bug I fixed; it also "works" with MUMPs but not superlu_dist.)
>
>
> It is not problem with Matload twice. The file has one matrix, but is loaded
> twice.
>
> Replacing pc with ksp, the code runs fine.
> The error occurs when PCSetUp_LU() is called with SAME_NONZERO_PATTERN.
> I'll further look at it later.
>
> Hong
> ________________________________________
> From: Zhang, Hong
> Sent: Friday, October 21, 2016 8:18 PM
> To: Barry Smith; petsc-users
> Subject: RE: [petsc-users] SuperLU_dist issue in 3.7.4
>
> I am investigating it. The file has two matrices. The code takes following
> steps:
>
> PCCreate(PETSC_COMM_WORLD, &pc);
>
> MatCreate(PETSC_COMM_WORLD,&A);
> MatLoad(A,fd);
> PCSetOperators(pc,A,A);
> PCSetUp(pc);
>
> MatCreate(PETSC_COMM_WORLD,&A);
> MatLoad(A,fd);
> PCSetOperators(pc,A,A);
> PCSetUp(pc); //crash here with np=2, superlu_dist, not with mumps/superlu or
> superlu_dist np=1
>
> Hong
>
> > On Oct 24, 2016, at 9:00 AM, Satish Balay <[email protected]> wrote:
> >
> > Since the provided test code dosn't crash [and is valgrind clean] -
> > with this fix - I'm not sure what bug Hong is chasing..
> >
> > Satish
> >
> > On Mon, 24 Oct 2016, Barry Smith wrote:
> >
> >>
> >> Anton,
> >>
> >> Sorry for any confusion. This doesn't resolve the SuperLU_DIST issue
> >> which I think Hong is working on, this only resolves multiple loads of
> >> matrices into the same Mat.
> >>
> >> Barry
> >>
> >>> On Oct 24, 2016, at 5:07 AM, Anton Popov <[email protected]> wrote:
> >>>
> >>> Thank you Barry, Satish, Fande!
> >>>
> >>> Is there a chance to get this fix in the maintenance release 3.7.5
> >>> together with the latest SuperLU_DIST? Or next release is a more
> >>> realistic option?
> >>>
> >>> Anton
> >>>
> >>> On 10/24/2016 01:58 AM, Satish Balay wrote:
> >>>> The original testcode from Anton also works [i.e is valgrind clean] with
> >>>> this change..
> >>>>
> >>>> Satish
> >>>>
> >>>> On Sun, 23 Oct 2016, Barry Smith wrote:
> >>>>
> >>>>> Thanks Satish,
> >>>>>
> >>>>> I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant
> >>>>> (in next for testing)
> >>>>>
> >>>>> Fande,
> >>>>>
> >>>>> This will also make MatMPIAIJSetPreallocation() work properly
> >>>>> with multiple calls (you will not need a MatReset()).
> >>>>>
> >>>>> Barry
> >>>>>
> >>>>>
> >>>>>> On Oct 21, 2016, at 6:48 PM, Satish Balay <[email protected]> wrote:
> >>>>>>
> >>>>>> On Fri, 21 Oct 2016, Barry Smith wrote:
> >>>>>>
> >>>>>>> valgrind first
> >>>>>> balay@asterix /home/balay/download-pine/x/superlu_dist_test
> >>>>>> $ mpiexec -n 2 $VG ./ex16 -f ~/datafiles/matrices/small
> >>>>>> First MatLoad!
> >>>>>> Mat Object: 2 MPI processes
> >>>>>> type: mpiaij
> >>>>>> row 0: (0, 4.) (1, -1.) (6, -1.)
> >>>>>> row 1: (0, -1.) (1, 4.) (2, -1.) (7, -1.)
> >>>>>> row 2: (1, -1.) (2, 4.) (3, -1.) (8, -1.)
> >>>>>> row 3: (2, -1.) (3, 4.) (4, -1.) (9, -1.)
> >>>>>> row 4: (3, -1.) (4, 4.) (5, -1.) (10, -1.)
> >>>>>> row 5: (4, -1.) (5, 4.) (11, -1.)
> >>>>>> row 6: (0, -1.) (6, 4.) (7, -1.) (12, -1.)
> >>>>>> row 7: (1, -1.) (6, -1.) (7, 4.) (8, -1.) (13, -1.)
> >>>>>> row 8: (2, -1.) (7, -1.) (8, 4.) (9, -1.) (14, -1.)
> >>>>>> row 9: (3, -1.) (8, -1.) (9, 4.) (10, -1.) (15, -1.)
> >>>>>> row 10: (4, -1.) (9, -1.) (10, 4.) (11, -1.) (16, -1.)
> >>>>>> row 11: (5, -1.) (10, -1.) (11, 4.) (17, -1.)
> >>>>>> row 12: (6, -1.) (12, 4.) (13, -1.) (18, -1.)
> >>>>>> row 13: (7, -1.) (12, -1.) (13, 4.) (14, -1.) (19, -1.)
> >>>>>> row 14: (8, -1.) (13, -1.) (14, 4.) (15, -1.) (20, -1.)
> >>>>>> row 15: (9, -1.) (14, -1.) (15, 4.) (16, -1.) (21, -1.)
> >>>>>> row 16: (10, -1.) (15, -1.) (16, 4.) (17, -1.) (22, -1.)
> >>>>>> row 17: (11, -1.) (16, -1.) (17, 4.) (23, -1.)
> >>>>>> row 18: (12, -1.) (18, 4.) (19, -1.) (24, -1.)
> >>>>>> row 19: (13, -1.) (18, -1.) (19, 4.) (20, -1.) (25, -1.)
> >>>>>> row 20: (14, -1.) (19, -1.) (20, 4.) (21, -1.) (26, -1.)
> >>>>>> row 21: (15, -1.) (20, -1.) (21, 4.) (22, -1.) (27, -1.)
> >>>>>> row 22: (16, -1.) (21, -1.) (22, 4.) (23, -1.) (28, -1.)
> >>>>>> row 23: (17, -1.) (22, -1.) (23, 4.) (29, -1.)
> >>>>>> row 24: (18, -1.) (24, 4.) (25, -1.) (30, -1.)
> >>>>>> row 25: (19, -1.) (24, -1.) (25, 4.) (26, -1.) (31, -1.)
> >>>>>> row 26: (20, -1.) (25, -1.) (26, 4.) (27, -1.) (32, -1.)
> >>>>>> row 27: (21, -1.) (26, -1.) (27, 4.) (28, -1.) (33, -1.)
> >>>>>> row 28: (22, -1.) (27, -1.) (28, 4.) (29, -1.) (34, -1.)
> >>>>>> row 29: (23, -1.) (28, -1.) (29, 4.) (35, -1.)
> >>>>>> row 30: (24, -1.) (30, 4.) (31, -1.)
> >>>>>> row 31: (25, -1.) (30, -1.) (31, 4.) (32, -1.)
> >>>>>> row 32: (26, -1.) (31, -1.) (32, 4.) (33, -1.)
> >>>>>> row 33: (27, -1.) (32, -1.) (33, 4.) (34, -1.)
> >>>>>> row 34: (28, -1.) (33, -1.) (34, 4.) (35, -1.)
> >>>>>> row 35: (29, -1.) (34, -1.) (35, 4.)
> >>>>>> Second MatLoad!
> >>>>>> Mat Object: 2 MPI processes
> >>>>>> type: mpiaij
> >>>>>> ==4592== Invalid read of size 4
> >>>>>> ==4592== at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket
> >>>>>> (mpiaij.c:1402)
> >>>>>> ==4592== by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
> >>>>>> ==4592== by 0x53373D7: MatView (matrix.c:989)
> >>>>>> ==4592== by 0x40107E: main (ex16.c:30)
> >>>>>> ==4592== Address 0xa47b460 is 20 bytes after a block of size 28
> >>>>>> alloc'd
> >>>>>> ==4592== at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
> >>>>>> ==4592== by 0x4FD121A: PetscMallocAlign (mal.c:28)
> >>>>>> ==4592== by 0x5842C70: MatSetUpMultiply_MPIAIJ (mmaij.c:41)
> >>>>>> ==4592== by 0x5809943: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
> >>>>>> ==4592== by 0x536B299: MatAssemblyEnd (matrix.c:5298)
> >>>>>> ==4592== by 0x5829C05: MatLoad_MPIAIJ (mpiaij.c:3032)
> >>>>>> ==4592== by 0x5337FEA: MatLoad (matrix.c:1101)
> >>>>>> ==4592== by 0x400D9F: main (ex16.c:22)
> >>>>>> ==4592==
> >>>>>> ==4591== Invalid read of size 4
> >>>>>> ==4591== at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket
> >>>>>> (mpiaij.c:1402)
> >>>>>> ==4591== by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
> >>>>>> ==4591== by 0x53373D7: MatView (matrix.c:989)
> >>>>>> ==4591== by 0x40107E: main (ex16.c:30)
> >>>>>> ==4591== Address 0xa482958 is 24 bytes before a block of size 7
> >>>>>> alloc'd
> >>>>>> ==4591== at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
> >>>>>> ==4591== by 0x4FD121A: PetscMallocAlign (mal.c:28)
> >>>>>> ==4591== by 0x4F31FB5: PetscStrallocpy (str.c:197)
> >>>>>> ==4591== by 0x4F0D3F5: PetscClassRegLogRegister (classlog.c:253)
> >>>>>> ==4591== by 0x4EF96E2: PetscClassIdRegister (plog.c:2053)
> >>>>>> ==4591== by 0x51FA018: VecInitializePackage (dlregisvec.c:165)
> >>>>>> ==4591== by 0x51F6DE9: VecCreate (veccreate.c:35)
> >>>>>> ==4591== by 0x51C49F0: VecCreateSeq (vseqcr.c:37)
> >>>>>> ==4591== by 0x5843191: MatSetUpMultiply_MPIAIJ (mmaij.c:104)
> >>>>>> ==4591== by 0x5809943: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
> >>>>>> ==4591== by 0x536B299: MatAssemblyEnd (matrix.c:5298)
> >>>>>> ==4591== by 0x5829C05: MatLoad_MPIAIJ (mpiaij.c:3032)
> >>>>>> ==4591== by 0x5337FEA: MatLoad (matrix.c:1101)
> >>>>>> ==4591== by 0x400D9F: main (ex16.c:22)
> >>>>>> ==4591==
> >>>>>> [0]PETSC ERROR: --------------------- Error Message
> >>>>>> --------------------------------------------------------------
> >>>>>> [0]PETSC ERROR: Argument out of range
> >>>>>> [0]PETSC ERROR: Column too large: col 96 max 35
> >>>>>> [0]PETSC ERROR: See
> >>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
> >>>>>> shooting.
> >>>>>> [0]PETSC ERROR: Petsc Development GIT revision: v3.7.4-1729-g4c4de23
> >>>>>> GIT Date: 2016-10-20 22:22:58 +0000
> >>>>>> [0]PETSC ERROR: ./ex16 on a arch-idx64-slu named asterix by balay Fri
> >>>>>> Oct 21 18:47:51 2016
> >>>>>> [0]PETSC ERROR: Configure options --download-metis --download-parmetis
> >>>>>> --download-superlu_dist PETSC_ARCH=arch-idx64-slu
> >>>>>> [0]PETSC ERROR: #1 MatSetValues_MPIAIJ() line 585 in
> >>>>>> /home/balay/petsc/src/mat/impls/aij/mpi/mpiaij.c
> >>>>>> [0]PETSC ERROR: #2 MatAssemblyEnd_MPIAIJ() line 724 in
> >>>>>> /home/balay/petsc/src/mat/impls/aij/mpi/mpiaij.c
> >>>>>> [0]PETSC ERROR: #3 MatAssemblyEnd() line 5298 in
> >>>>>> /home/balay/petsc/src/mat/interface/matrix.c
> >>>>>> [0]PETSC ERROR: #4 MatView_MPIAIJ_ASCIIorDraworSocket() line 1410 in
> >>>>>> /home/balay/petsc/src/mat/impls/aij/mpi/mpiaij.c
> >>>>>> [0]PETSC ERROR: #5 MatView_MPIAIJ() line 1440 in
> >>>>>> /home/balay/petsc/src/mat/impls/aij/mpi/mpiaij.c
> >>>>>> [0]PETSC ERROR: #6 MatView() line 989 in
> >>>>>> /home/balay/petsc/src/mat/interface/matrix.c
> >>>>>> [0]PETSC ERROR: #7 main() line 30 in
> >>>>>> /home/balay/download-pine/x/superlu_dist_test/ex16.c
> >>>>>> [0]PETSC ERROR: PETSc Option Table entries:
> >>>>>> [0]PETSC ERROR: -display :0.0
> >>>>>> [0]PETSC ERROR: -f /home/balay/datafiles/matrices/small
> >>>>>> [0]PETSC ERROR: -malloc_dump
> >>>>>> [0]PETSC ERROR: ----------------End of Error Message -------send
> >>>>>> entire error message to [email protected]
> >>>>>> application called MPI_Abort(MPI_COMM_WORLD, 63) - process 0
> >>>>>> [cli_0]: aborting job:
> >>>>>> application called MPI_Abort(MPI_COMM_WORLD, 63) - process 0
> >>>>>> ==4591== 16,965 (2,744 direct, 14,221 indirect) bytes in 1 blocks are
> >>>>>> definitely lost in loss record 1,014 of 1,016
> >>>>>> ==4591== at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
> >>>>>> ==4591== by 0x4FD121A: PetscMallocAlign (mal.c:28)
> >>>>>> ==4591== by 0x52F3B14: MatCreate (gcreate.c:84)
> >>>>>> ==4591== by 0x581390A: MatView_MPIAIJ_ASCIIorDraworSocket
> >>>>>> (mpiaij.c:1371)
> >>>>>> ==4591== by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
> >>>>>> ==4591== by 0x53373D7: MatView (matrix.c:989)
> >>>>>> ==4591== by 0x40107E: main (ex16.c:30)
> >>>>>> ==4591==
> >>>>>>
> >>>>>> ===================================================================================
> >>>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> >>>>>> = PID 4591 RUNNING AT asterix
> >>>>>> = EXIT CODE: 63
> >>>>>> = CLEANING UP REMAINING PROCESSES
> >>>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> >>>>>> ===================================================================================
> >>>>>> balay@asterix /home/balay/download-pine/x/superlu_dist_test
> >>>>>> $
> >>>>>
> >>>
> >>
> >>
> >
> >
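
[Illustration only, not part of the original exchange.] The thread says the fix makes MatMPIAIJSetPreallocation() "work properly with multiple calls", i.e. loading into the same Mat twice no longer trips over state left by the first load. The sketch below is a toy model of that general idea in plain C. It is not PETSc code and is not taken from the barry/fix-matmpixxxsetpreallocation-reentrant branch: ToyMat, toy_create, toy_set_value, toy_load_naive and toy_load_reentrant are hypothetical names, and the "compaction at assembly" step is an assumption made purely so that a stale state has something to go wrong with.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy container (NOT a PETSc type). */
typedef struct {
  int global_cols; /* number of columns in the global numbering */
  int width;       /* columns currently accepted: global before "assembly",
                      compacted to the columns in use afterwards (assumption) */
} ToyMat;

void toy_create(ToyMat *m, int global_cols)
{
  assert(m != NULL && global_cols > 0);
  m->global_cols = global_cols;
  m->width       = global_cols;
}

/* Insert one entry; returns 0 on success, -1 if col is out of range
   for the object's current state. */
int toy_set_value(const ToyMat *m, int col)
{
  return (col >= 0 && col < m->width) ? 0 : -1;
}

/* Naive load: silently assumes the object is still freshly created. */
int toy_load_naive(ToyMat *m, const int *cols, int n)
{
  int i;
  assert(m != NULL && cols != NULL && n > 0);
  for (i = 0; i < n; i++) {
    if (toy_set_value(m, cols[i]) != 0) return -1; /* stale state rejects col */
  }
  m->width = n; /* toy "assembly": compact to the n (assumed distinct) columns */
  return 0;
}

/* Re-entrant load: discard whatever an earlier load left behind,
   then proceed exactly as the naive version does. */
int toy_load_reentrant(ToyMat *m, const int *cols, int n)
{
  assert(m != NULL);
  m->width = m->global_cols; /* reset derived state before reuse */
  return toy_load_naive(m, cols, n);
}
```

With columns {0, 7, 35} in a 36-column toy matrix, the first naive load succeeds, a second naive load on the same object is rejected at column 7, and the re-entrant variant accepts both loads. In the toy, creating a fresh object for every load avoids the problem as well.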
