On Thu, Oct 17, 2013 at 3:42 AM, Bishesh Khanal <[email protected]> wrote:
>
> On Wed, Oct 16, 2013 at 8:04 PM, Satish Balay <[email protected]> wrote:
>
>> On Wed, 16 Oct 2013, Matthew Knepley wrote:
>>
>> > You can also try running under MPICH, which can be valgrind clean.
>>
>> Actually --download-mpich would configure/install mpich with appropriate
>> flags to be valgrind clean.
>>
>
> On my laptop (but not on the cluster; please see the second part of this
> reply below for the cluster case) that's how I configured petsc and ran it
> under mpich. Valgrind reported the following errors (which I do not
> understand) when using the mpich of the petsc on my laptop. Here is the
> command I used and the error:
>

This is harmless, and as you can see it comes from gfortran initialization.

> (Note: petsc is an alias in my .bashrc: alias
> petsc='/home/bkhanal/Documents/softwares/petsc-3.4.3/bin/petscmpiexec')
>
> petsc -n 2 valgrind src/AdLemMain -pc_type fieldsplit -pc_fieldsplit_type
> schur -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2
> -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre
> -fieldsplit_0_ksp_converged_reason -ksp_converged_reason
> ==3106== Memcheck, a memory error detector
> ==3106== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==3106== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
> ==3107== Memcheck, a memory error detector
> ==3107== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==3107== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
> ==3107== Command: src/AdLemMain -pc_type fieldsplit -pc_fieldsplit_type
> schur -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2
> -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre
> -fieldsplit_0_ksp_converged_reason -ksp_converged_reason
> ==3107==
> ==3106== Command: src/AdLemMain -pc_type fieldsplit -pc_fieldsplit_type
> schur -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2
> -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre
> -fieldsplit_0_ksp_converged_reason -ksp_converged_reason
> ==3106==
> ==3107== Conditional jump or move depends on uninitialised value(s)
> ==3107==    at 0x32EEED9BCE: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32EEED9155: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32EEE185D7: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32ECC0F195: call_init.part.0 (in /lib64/ld-2.14.90.so)
> ==3107==    by 0x32ECC0F272: _dl_init (in /lib64/ld-2.14.90.so)
> ==3107==    by 0x32ECC01719: ??? (in /lib64/ld-2.14.90.so)
> ==3107==    by 0xE: ???
> ==3107==    by 0x7FF0003EE: ???
> ==3107==    by 0x7FF0003FC: ???
> ==3107==    by 0x7FF000405: ???
> ==3107==    by 0x7FF000410: ???
> ==3107==    by 0x7FF000424: ???
> ==3107==
> ==3107== Conditional jump or move depends on uninitialised value(s)
> ==3107==    at 0x32EEED9BD9: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32EEED9155: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32EEE185D7: ??? (in /usr/lib64/libgfortran.so.3.0.0)
> ==3107==    by 0x32ECC0F195: call_init.part.0 (in /lib64/ld-2.14.90.so)
> ==3107==    by 0x32ECC0F272: _dl_init (in /lib64/ld-2.14.90.so)
> ==3107==    by 0x32ECC01719: ??? (in /lib64/ld-2.14.90.so)
> ==3107==    by 0xE: ???
> ==3107==    by 0x7FF0003EE: ???
> ==3107==    by 0x7FF0003FC: ???
> ==3107==    by 0x7FF000405: ???
> ==3107==    by 0x7FF000410: ???
> ==3107==    by 0x7FF000424: ???
> ==3107==
> dmda of size: (8,8,8)
>
> using schur complement
>
> using user defined split
> Linear solve converged due to CONVERGED_ATOL iterations 0
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 3
> Linear solve converged due to CONVERGED_RTOL iterations 1
> ==3106==
> ==3106== HEAP SUMMARY:
> ==3106==     in use at exit: 187,709 bytes in 1,864 blocks
> ==3106==   total heap usage: 112,891 allocs, 111,027 frees, 19,838,487
> bytes allocated
> ==3106==
> ==3107==
> ==3107== HEAP SUMMARY:
> ==3107==     in use at exit: 212,357 bytes in 1,870 blocks
> ==3107==   total heap usage: 112,701 allocs, 110,831 frees, 19,698,341
> bytes allocated
> ==3107==
> ==3106== LEAK SUMMARY:
> ==3106==    definitely lost: 0 bytes in 0 blocks
> ==3106==    indirectly lost: 0 bytes in 0 blocks
> ==3106==      possibly lost: 0 bytes in 0 blocks
> ==3106==    still reachable: 187,709 bytes in 1,864 blocks
> ==3106==         suppressed: 0 bytes in 0 blocks
> ==3106== Rerun with --leak-check=full to see details of leaked memory
> ==3106==
> ==3106== For counts of detected and suppressed errors, rerun with: -v
> ==3106== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
> ==3107== LEAK SUMMARY:
> ==3107==    definitely lost: 0 bytes in 0 blocks
> ==3107==    indirectly lost: 0 bytes in 0 blocks
> ==3107==      possibly lost: 0 bytes in 0 blocks
> ==3107==    still reachable: 212,357 bytes in 1,870 blocks
> ==3107==         suppressed: 0 bytes in 0 blocks
> ==3107== Rerun with --leak-check=full to see details of leaked memory
> ==3107==
> ==3107== For counts of detected and suppressed errors, rerun with: -v
> ==3107== Use --track-origins=yes to see where uninitialised values come
> from
> ==3107== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 2 from 2)
>
> In the above example, the solver iterates and gives results.
>
> Now the cluster case: I had to configure petsc with the option
> --with-mpi-dir=/opt/openmpi-gcc/current/ because that is how the cluster
> administrators asked me to install it to get petsc running on many nodes
> of the cluster. I also tried on my own to configure with --download-mpich
> on the cluster, but it failed with some errors. If you really think the
> errors could come from this configuration, I will retry installing with
> the petscmpich; please let me know.
> For the case where the program terminates without completing normally
> (large domain), valgrind reports the following errors just before the
> abrupt termination:
>
> ... lots of other errors and then warnings such as:
>

This appears to be a bug in OpenMPI, which would not be all that surprising.
First, you can try running in the debugger and extracting a stack trace from
the SEGV. Then you could

  1) Get the admin to install MPICH

  2) Try running a PETSc example on the cluster

  3) Try running on another machine

   Matt

> ==55437== Warning: set address range perms: large range [0xc4369040,
> 0xd6abb670) (defined)
> ==55438== Warning: set address range perms: large range [0xc4369040,
> 0xd6a6cd00) (defined)
> ==37183== Warning: set address range perms: large range [0xc4369040,
> 0xd69f57d8) (defined)
> ==37182== Warning: set address range perms: large range [0xc4369040,
> 0xd6a474f0) (defined)
> mpiexec: killing job...
>
>
> In between there are several errors such as:
> ==59334== Use of uninitialised value of size 8
> ==59334==    at 0xD5B3704: mca_pml_ob1_send_request_put
> (pml_ob1_sendreq.c:1217)
> ==59334==    by 0xE1EF01A: btl_openib_handle_incoming
> (btl_openib_component.c:3092)
> ==59334==    by 0xE1F03E9: btl_openib_component_progress
> (btl_openib_component.c:3634)
> ==59334==    by 0x81CF16A: opal_progress (opal_progress.c:207)
> ==59334==    by 0x81153AC: ompi_request_default_wait_all (condition.h:92)
> ==59334==    by 0xF4C25DD: ompi_coll_tuned_sendrecv_actual
> (coll_tuned_util.c:54)
> ==59334==    by 0xF4C91FD:
> ompi_coll_tuned_allgatherv_intra_neighborexchange (coll_tuned_util.h:57)
> ==59334==    by 0x8121783: PMPI_Allgatherv (pallgatherv.c:139)
> ==59334==    by 0x5156D19: ISAllGather (iscoloring.c:502)
> ==59334==    by 0x57A6B78: MatGetSubMatrix_MPIAIJ (mpiaij.c:3607)
> ==59334==    by 0x532DB36: MatGetSubMatrix (matrix.c:7297)
> ==59334==    by 0x5B97725: PCSetUp_FieldSplit(_p_PC*) (fieldsplit.c:524)
>
>
>> Satish
>>
>

-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener
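
The valgrind output quoted in this message interleaves lines from two MPI ranks (PIDs 3106 and 3107), which makes it hard to see which rank reported what. A minimal sketch of one way to separate a pasted log by PID follows. It is an illustration only, not part of PETSc or valgrind: the helper names `split_by_pid` and `error_summary` are hypothetical, and the sample lines are copied from the log above. Valgrind's own `--log-file` option with a `%p` placeholder should also give one file per process without any post-processing.

```python
import re
from collections import defaultdict

# Valgrind prefixes every line with "==PID==", e.g. "==3107== HEAP SUMMARY:".
PREFIX = re.compile(r"^==(\d+)==\s?(.*)$")
SUMMARY = re.compile(r"ERROR SUMMARY: (\d+) errors")


def split_by_pid(lines):
    """Group log lines by PID; lines without a valgrind prefix go under None."""
    groups = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        # Drop email quote markers ("> ") in case the log was pasted from a reply.
        while line.startswith(">"):
            line = line[1:].lstrip()
        match = PREFIX.match(line)
        if match:
            groups[int(match.group(1))].append(match.group(2))
        else:
            groups[None].append(line)
    return groups


def error_summary(groups):
    """Return {pid: error_count} taken from each process's ERROR SUMMARY line."""
    counts = {}
    for pid, messages in groups.items():
        for message in messages:
            match = SUMMARY.search(message)
            if match:
                counts[pid] = int(match.group(1))
    return counts


# Sample lines taken from the laptop run quoted above.
sample = [
    "> ==3106== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)",
    "> ==3107== Conditional jump or move depends on uninitialised value(s)",
    "> ==3107== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 2 from 2)",
]
print(error_summary(split_by_pid(sample)))  # {3106: 0, 3107: 2}
```

For the laptop run this shows that only PID 3107 reported the two gfortran initialization errors, which is consistent with the reply that they are harmless.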
