On Mon, 8 Jun 2015, Jed Brown wrote:

> Barry Smith <[email protected]> writes:
>
> > We are having some problems with master and MPICH 1 in the nightly tests
> >
> > http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2015/06/08/examples_master_arch-linux-mpich1_steamroller.log
> >
> > I've done a lot of debugging including with valgrind and cannot
> > determine the problem. I'm concluding that it is a problem with how
> > they handle attributes since changing the number of attributes can
> > produce crashes or prevent crashes. It is flaky, like memory
> > corruption problems but valgrind is happy.
>
> MPICH uses integer tables instead of pointers, hiding a lot of
> information from Valgrind. When run in a debugger or Valgrind, what is
> the trace when SEGV is raised? (I don't currently have an MPICH1
> built.)
>
> It may well be an MPICH1 bug, but there is definitely no support at all
> for MPICH1 and hopefully people have stopped using it long ago. If we
> do decide to end MPICH1 support, we can merge 'jed/mpi-2' (not
> necessarily for this release).
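
To illustrate the point about integer tables, here is a toy C sketch. It is not MPICH code: the table layout and every name in it (Slot, handle_alloc, handle_free, handle_read, stale_read_demo) are made up for illustration. It only shows why a use-after-free through an integer handle can stay invisible to memcheck: the stale handle still indexes valid static memory, so no invalid address is ever touched.

```c
#include <assert.h>
#include <stdio.h>

/* Toy handle table: objects live in one static array and callers hold
 * small integers instead of pointers. All names are hypothetical; this
 * is an assumption-laden model, not MPICH's actual data structure. */
#define TABLE_SIZE 8

typedef struct {
    int in_use;   /* 1 while the slot is logically allocated */
    int value;    /* payload stored by the owner of the handle */
} Slot;

static Slot table[TABLE_SIZE];

/* Hand out the first free slot; returns -1 when the table is full. */
int handle_alloc(int value)
{
    for (int h = 0; h < TABLE_SIZE; h++) {
        if (!table[h].in_use) {
            table[h].in_use = 1;
            table[h].value = value;
            return h;
        }
    }
    return -1;
}

/* Release a slot. The memory is never handed back to malloc/free,
 * so a heap checker sees nothing happen here. */
void handle_free(int h)
{
    assert(h >= 0 && h < TABLE_SIZE);
    table[h].in_use = 0;
}

/* Deliberately unchecked lookup: it ignores in_use, so a stale handle
 * reads whatever currently occupies the slot. */
int handle_read(int h)
{
    return table[h].value;
}

/* Use-after-free through a stale handle: logically wrong, but every
 * access lands inside the static table. */
int stale_read_demo(void)
{
    int a = handle_alloc(111);
    handle_free(a);
    int b = handle_alloc(222);   /* reuses the slot that a referred to */
    (void)b;
    return handle_read(a);       /* reads the new owner's value */
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("stale handle reads %d\n", stale_read_demo());
    return 0;
}
#endif
```

Built with -DDEMO_MAIN and run under valgrind --tool=memcheck, I would expect this to produce no reports, even though the last read goes through a handle that was already freed. The same access pattern through a malloc'd pointer would normally be flagged as an invalid read.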
This can be reproduced on MCS linux boxes. The trigger is the following commit:

http://bitbucket.org/petsc/petsc/commits/5c25fcd7c4e1ecb15ec7a0829572c7e72f90b2d9

[however we don't see anything there that's closely relevant. The current
hypothesis is - the order/number of attributes added/deleted changed -
triggering an error]

A Vec example [with VecView()] triggered this error before. With the
following change - that vec example is now happy - but snes examples
[with -snes_monitor_short] are crashing.

./configure --with-mpi-dir=/homes/petsc/soft/build/mpich-1.2.7p1 --with-cxx=0 --with-fc=0 --with-shared-libraries=0

balay@es^/scratch/balay/petsc/src/snes/examples/tutorials(master=) $ ./ex1
Number of SNES iterations = 6
balay@es^/scratch/balay/petsc/src/snes/examples/tutorials(master=) $ valgrind --tool=memcheck -q ./ex1 -snes_monitor_short
  0 SNES Function norm 6.04152
  1 SNES Function norm 4.78676
  2 SNES Function norm 2.98646
  3 SNES Function norm 0.230624
  4 SNES Function norm 0.00193631
  5 SNES Function norm 1.43559e-07
  6 SNES Function norm < 1.e-11
Number of SNES iterations = 6
==6303== Invalid read of size 2
==6303==    at 0x13A54FC: MPIR_HBT_delete (util_hbt.c:575)
==6303==    by 0x13A756D: PMPI_Attr_delete (attr_delval.c:88)
==6303==    by 0x460BB8: Petsc_DelComm_Outer (pinit.c:362)
==6303==    by 0x13A7527: PMPI_Attr_delete (attr_delval.c:82)
==6303==    by 0x43F16B: PetscCommDestroy (tagm.c:237)
==6303==    by 0xA284CA: PetscHeaderDestroy_Private (inherit.c:121)
==6303==    by 0x4977BC: PetscViewerDestroy (view.c:108)
==6303==    by 0x440BA3: PetscObjectDestroy (destroy.c:73)
==6303==    by 0x4422B5: PetscObjectRegisterDestroyAll (destroy.c:251)
==6303==    by 0x464B62: PetscFinalize (pinit.c:1096)
==6303==    by 0x406562: main (ex1.c:143)
==6303==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==6303==
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR: INSTEAD the line number of the start of the function
[0]PETSC ERROR: is given.
[0]PETSC ERROR: [0] Petsc_DelComm_Outer line 355 /scratch/balay/petsc/src/sys/objects/pinit.c
[0]PETSC ERROR: [0] PetscCommDestroy line 217 /scratch/balay/petsc/src/sys/objects/tagm.c
[0]PETSC ERROR: [0] PetscHeaderDestroy_Private line 101 /scratch/balay/petsc/src/sys/objects/inherit.c
[0]PETSC ERROR: [0] PetscViewerDestroy line 97 /scratch/balay/petsc/src/sys/classes/viewer/interface/view.c
[0]PETSC ERROR: [0] PetscObjectDestroy line 69 /scratch/balay/petsc/src/sys/objects/destroy.c
[0]PETSC ERROR: [0] PetscObjectRegisterDestroyAll line 249 /scratch/balay/petsc/src/sys/objects/destroy.c
[0]PETSC ERROR: [0] PetscFinalize line 956 /scratch/balay/petsc/src/sys/objects/pinit.c
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.5.4-3311-gf6dae19 GIT Date: 2015-06-08 16:14:54 -0600
[0]PETSC ERROR: ./ex1 on a arch-linux2-c-debug named es by balay Mon Jun 8 17:23:37 2015
[0]PETSC ERROR: Configure options --with-mpi-dir=/homes/petsc/soft/build/mpich-1.2.7p1 --with-cxx=0 --with-fc=0 --with-shared-libraries=0
[0]PETSC ERROR: #1 User provided function() line 0 in unknown file
[0] MPI Abort by user Aborting program !
[0] Aborting program!
p0_6303: p4_error: : 59
balay@es^/scratch/balay/petsc/src/snes/examples/tutorials(master=) $
