If you plan to use Valgrind, you may want to use MPICH (the --download-mpich configure option), since Open MPI produces a lot of false positives.
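
If you would rather stay with Open MPI, Valgrind can hide a known report through a suppression file. The following is only a sketch, assuming the report quoted below really is a false positive from Open MPI's start-up code. The file name openmpi-init.supp and the NPROCS and EXE placeholders are made up for illustration; the frame names are copied from the quoted trace. The script writes the suppression file and prints the Valgrind command instead of running it.

```shell
#!/bin/sh
# Sketch only: the file name, NPROCS and EXE are illustrative placeholders.
SUPP=openmpi-init.supp
NPROCS=${NPROCS:-2}
EXE=${EXE:-./a.out}

# Suppress the "Conditional jump or move depends on uninitialised value(s)"
# report raised inside Open MPI's hugepage component during MPI_Init.
# Frames are listed innermost first, as in the Valgrind trace.
cat > "$SUPP" <<'EOF'
{
   openmpi_hugepage_setmntent_cond
   Memcheck:Cond
   fun:_IO_file_fopen*
   fun:__fopen_internal
   fun:setmntent
   fun:mca_mpool_hugepage_open
}
EOF

# %p expands to the process ID, so each MPI process writes its own log.
CMD="mpiexec -n $NPROCS valgrind --tool=memcheck -q --num-callers=20 --suppressions=$SUPP --log-file=valgrind.log.%p $EXE"

# Printed rather than executed, so the sketch is safe to run anywhere.
echo "$CMD"
```

Suppressions match on function names, so both reports in the quoted trace (which differ only in the instruction address) would be covered by this one entry. Anything reported from your own source files should not be suppressed.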

On 17 Jun 2017 17:49, "TAY wee-beng" wrote:

> Hi Lukasz,
>
> Thanks for the tip.
>
> I tried using valgrind. However, I got a lot of errors at a few
> locations. One complained of an uninitialised value at:
>
> call PetscInitialize(PETSC_NULL_CHARACTER,ierr)
>
> But I already initialize "ierr". Are these errors valid or can I hide
> them?
>
> ==17300== Conditional jump or move depends on uninitialised value(s)
> ==17300==    at 0x3C2A872849: _IO_file_fopen@@GLIBC_2.2.5 (in /lib64/libc-2.12.so)
> ==17300==    by 0x3C2A866D95: __fopen_internal (in /lib64/libc-2.12.so)
> ==17300==    by 0x3C2A8E2CB3: setmntent (in /lib64/libc-2.12.so)
> ==17300==    by 0xA726083: mca_mpool_hugepage_open (in /home/tsltaywb/lib/openmpi-2.1.1/lib/openmpi/mca_mpool_hugepage.so)
> ==17300==    by 0x65A83A1: mca_base_framework_components_open (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libopen-pal.so.20.10.1)
> ==17300==    by 0x6614041: mca_mpool_base_open (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libopen-pal.so.20.10.1)
> ==17300==    by 0x65B1EC0: mca_base_framework_open (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libopen-pal.so.20.10.1)
> ==17300==    by 0x5E11123: ompi_mpi_init (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libmpi.so.20.10.1)
> ==17300==    by 0x5E31032: PMPI_Init (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libmpi.so.20.10.1)
> ==17300==    by 0x5978E87: PMPI_INIT (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libmpi_mpifh.so.20.11.0)
> ==17300==    by 0xB29696: petscinitialize_ (zstart.c:316)
> ==17300==    by 0xA80D2B: MAIN__ (ibm3d_high_Re.F90:63)
> ==17300==  Uninitialised value was created by a stack allocation
> ==17300==    at 0x3C2A8E2C82: setmntent (in /lib64/libc-2.12.so)
> ==17300==
> ==17300== Conditional jump or move depends on uninitialised value(s)
> ==17300==    at 0x3C2A87284F: _IO_file_fopen@@GLIBC_2.2.5 (in /lib64/libc-2.12.so)
> ==17300==    by 0x3C2A866D95: __fopen_internal (in /lib64/libc-2.12.so)
> ==17300==    by 0x3C2A8E2CB3: setmntent (in /lib64/libc-2.12.so)
> ==17300==    by 0xA726083: mca_mpool_hugepage_open (in /home/tsltaywb/lib/openmpi-2.1.1/lib/openmpi/mca_mpool_hugepage.so)
> ==17300==    by 0x65A83A1: mca_base_framework_components_open (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libopen-pal.so.20.10.1)
> ==17300==    by 0x6614041: mca_mpool_base_open (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libopen-pal.so.20.10.1)
> ==17300==    by 0x65B1EC0: mca_base_framework_open (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libopen-pal.so.20.10.1)
> ==17300==    by 0x5E11123: ompi_mpi_init (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libmpi.so.20.10.1)
> ==17300==    by 0x5E31032: PMPI_Init (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libmpi.so.20.10.1)
> ==17300==    by 0x5978E87: PMPI_INIT (in /home/tsltaywb/lib/openmpi-2.1.1/lib/libmpi_mpifh.so.20.11.0)
> ==17300==    by 0xB29696: petscinitialize_ (zstart.c:316)
> ==17300==    by 0xA80D2B: MAIN__ (ibm3d_high_Re.F90:63)
>
> Thank you very much.
>
> Yours sincerely,
>
> ================================================
> TAY Wee-Beng (Zheng Weiming)
> Personal research webpage: http://tayweebeng.wixsite.com/website
> Youtube research showcase: https://www.youtube.com/channel/UC72ZHtvQNMpNs2uRTSToiLA
> linkedin: www.linkedin.com/in/tay-weebeng
> ================================================
>
> On 7/6/2017 3:22 PM, Lukasz Kaczmarczyk wrote:
>
> > On 7 Jun 2017, at 07:57, TAY wee-beng wrote:
> >
> > > Hi,
> > >
> > > I have been using PETSc together with my CFD code. There seems to be
> > > a bug with the Intel compiler such that when I call some DM routines
> > > such as DMLocalToLocalBegin, a segmentation violation will occur if
> > > full optimization is used. I had posted this question a while back.
> > > So the current solution is to use -O1 -ip instead of -O3 -ipo -ip for
> > > certain source files which use DMLocalToLocalBegin etc.
> > >
> > > Recently, I made some changes to the code, mainly adding some things.
> > > However, depending on my options, some cases still go through the
> > > same program path.
> > >
> > > Now when I tried to run those same cases, I got a segmentation
> > > violation, which didn't happen before:
> > >
> > > IIB_I_cell_no_uvw_total2 14 10 6 3
> > > 2 1
> > > [0]PETSC ERROR: ------------------------------------------------------------------------
> > > [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> > > [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > > [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > > [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> > > [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
> > > [0]PETSC ERROR: to get more information on the crash.
> > > [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> > > [0]PETSC ERROR: Signal received
> > > [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > > [0]PETSC ERROR: Petsc Release Version 3.7.4, Oct, 02, 2016
> > > [0]PETSC ERROR: ./a.out
> > >
> > > I can't debug using VS since the code has been optimized. I tried to
> > > print messages (if (myid == 0) print "1") to pinpoint the error.
> > > Strangely, after adding these print messages, the error disappears.
> > >
> > > IIB_I_cell_no_uvw_total2 14 10 6 3
> > > 2 1
> > > 1
> > > 2
> > > 3
> > > 4
> > > 5
> > > 1 0.26873613 0.12620288 0.12949340 1.11422363 0.43983516E-06 -0.59311066E-01 0.25546227E+04
> > > 2 0.22236892 0.14528589 0.16939270 1.10459102 0.74556128E-02 -0.55168234E-01 0.25532419E+04
> > > 3 0.20764796 0.14832689 0.18780489 1.08039569 0.80299767E-02 -0.46972411E-01 0.25523174E+04
> > >
> > > Can anyone give a logical explanation why this is happening?
> > > Moreover, if I remove the printing of 1 to 3 and only print 4 and 5,
> > > the segmentation violation appears again.
> > >
> > > I am using Intel Fortran 2016.1.150. I wonder if it helps if I post
> > > in the Intel Fortran forum.
> > >
> > > I can provide more info if required.
> >
> > You are very likely writing to memory you should not, for example by
> > exceeding the bounds of an array. Depending on your compilation
> > options, starting parameters, etc., you write in an uncontrolled way
> > to a part of memory that either belongs to your process or is
> > protected by the operating system. In the second case, you get a
> > segmentation fault. You can get correct results for some runs, but the
> > bug is still there, hiding in the dark.
> >
> > To shed light on it, you need Valgrind. Compile the code with
> > debugging on and no optimisation, and start searching. You can also
> > generate a core file and backtrace the error in gdb/lldb.
> >
> > Lukasz
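
The debugging workflow Lukasz describes (debug build, then a core file and a backtrace) could look roughly like the sketch below. Everything here is an assumption for illustration: the Intel Fortran debug flags, the single-process run, and the core file being named "core" in the working directory (the actual name depends on the system's core pattern). The run helper is hypothetical; it only prints each command unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): print the command, execute it only if RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Assumed Intel Fortran debug flags: no optimisation, debug symbols,
# a traceback on crash, and run-time array bounds checking.
FFLAGS_DEBUG="-O0 -g -traceback -check bounds"
echo "debug flags: $FFLAGS_DEBUG"

# Allow the crashing process to write a core file.
run ulimit -c unlimited

# Reproduce the crash with the rebuilt executable (./a.out as in the error output).
run mpiexec -n 1 ./a.out

# Print a backtrace from the core file; "core" is an assumed file name.
run gdb -batch -ex bt ./a.out core
```

With bounds checking enabled, an out-of-range array index is typically reported at the offending source line before it can corrupt memory, which is often quicker than reading a core file.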
