Re: [petsc-users] signal received error; MatNullSpaceTest; Stokes flow solver with pc fieldsplit and schur complement

Bishesh Khanal Thu, 17 Oct 2013 05:52:31 -0700

On Thu, Oct 17, 2013 at 2:46 PM, Matthew Knepley <[email protected]> wrote:


> On Thu, Oct 17, 2013 at 7:43 AM, Bishesh Khanal <[email protected]>wrote:
>
>>
>>
>>
>> On Thu, Oct 17, 2013 at 1:07 PM, Matthew Knepley <[email protected]>wrote:
>>
>>> On Thu, Oct 17, 2013 at 3:42 AM, Bishesh Khanal <[email protected]>wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Wed, Oct 16, 2013 at 8:04 PM, Satish Balay <[email protected]>wrote:
>>>>
>>>>> On Wed, 16 Oct 2013, Matthew Knepley wrote:
>>>>>
>>>>> > You can also try running under MPICH, which can be valgrind clean.
>>>>>
>>>>> Actually --download-mpich would configure/install mpich with
>>>>> appropriate flags to be valgrind clean.
>>>>>
>>>>
>>>> On my laptop (but not on the cluster; see the second part of this reply
>>>> for the cluster case), that is how I configured PETSc, and I ran the
>>>> program under that MPICH. Valgrind reported the following errors, which
>>>> I do not understand, when using the MPICH that PETSc installed on my
>>>> laptop. Here is the command I used and the output:
>>>>
>>>
>>> This is harmless, and as you can see it comes from gfortran
>>> initialization.
>>>
>>>
>>>>  (Note: petsc is an alias in my .bashrc: alias
>>>> petsc='/home/bkhanal/Documents/softwares/petsc-3.4.3/bin/petscmpiexec')
>>>>
>>>> petsc -n 2 valgrind src/AdLemMain -pc_type fieldsplit
>>>> -pc_fieldsplit_type schur -pc_fieldsplit_dm_splits 0
>>>> -pc_fieldsplit_0_fields 0,1,2 -pc_fieldsplit_1_fields 3
>>>> -fieldsplit_0_pc_type hypre -fieldsplit_0_ksp_converged_reason
>>>> -ksp_converged_reason
>>>> ==3106== Memcheck, a memory error detector
>>>> ==3106== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
>>>> ==3106== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright
>>>> info
>>>> ==3107== Memcheck, a memory error detector
>>>> ==3107== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
>>>> ==3107== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright
>>>> info
>>>> ==3107== Command: src/AdLemMain -pc_type fieldsplit -pc_fieldsplit_type
>>>> schur -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2
>>>> -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre
>>>> -fieldsplit_0_ksp_converged_reason -ksp_converged_reason
>>>> ==3107==
>>>> ==3106== Command: src/AdLemMain -pc_type fieldsplit -pc_fieldsplit_type
>>>> schur -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2
>>>> -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre
>>>> -fieldsplit_0_ksp_converged_reason -ksp_converged_reason
>>>> ==3106==
>>>> ==3107== Conditional jump or move depends on uninitialised value(s)
>>>> ==3107==    at 0x32EEED9BCE: ??? (in /usr/lib64/libgfortran.so.3.0.0)
>>>> ==3107==    by 0x32EEED9155: ??? (in /usr/lib64/libgfortran.so.3.0.0)
>>>> ==3107==    by 0x32EEE185D7: ??? (in /usr/lib64/libgfortran.so.3.0.0)
>>>> ==3107==    by 0x32ECC0F195: call_init.part.0 (in /lib64/ld-2.14.90.so)
>>>> ==3107==    by 0x32ECC0F272: _dl_init (in /lib64/ld-2.14.90.so)
>>>> ==3107==    by 0x32ECC01719: ??? (in /lib64/ld-2.14.90.so)
>>>> ==3107==    by 0xE: ???
>>>> ==3107==    by 0x7FF0003EE: ???
>>>> ==3107==    by 0x7FF0003FC: ???
>>>> ==3107==    by 0x7FF000405: ???
>>>> ==3107==    by 0x7FF000410: ???
>>>> ==3107==    by 0x7FF000424: ???
>>>> ==3107==
>>>> ==3107== Conditional jump or move depends on uninitialised value(s)
>>>> ==3107==    at 0x32EEED9BD9: ??? (in /usr/lib64/libgfortran.so.3.0.0)
>>>> ==3107==    by 0x32EEED9155: ??? (in /usr/lib64/libgfortran.so.3.0.0)
>>>> ==3107==    by 0x32EEE185D7: ??? (in /usr/lib64/libgfortran.so.3.0.0)
>>>> ==3107==    by 0x32ECC0F195: call_init.part.0 (in /lib64/ld-2.14.90.so)
>>>> ==3107==    by 0x32ECC0F272: _dl_init (in /lib64/ld-2.14.90.so)
>>>> ==3107==    by 0x32ECC01719: ??? (in /lib64/ld-2.14.90.so)
>>>> ==3107==    by 0xE: ???
>>>> ==3107==    by 0x7FF0003EE: ???
>>>> ==3107==    by 0x7FF0003FC: ???
>>>> ==3107==    by 0x7FF000405: ???
>>>> ==3107==    by 0x7FF000410: ???
>>>> ==3107==    by 0x7FF000424: ???
>>>> ==3107==
>>>> dmda of size: (8,8,8)
>>>>
>>>>  using schur complement
>>>>
>>>>  using user defined split
>>>>   Linear solve converged due to CONVERGED_ATOL iterations 0
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>>   Linear solve converged due to CONVERGED_RTOL iterations 3
>>>> Linear solve converged due to CONVERGED_RTOL iterations 1
>>>> ==3106==
>>>> ==3106== HEAP SUMMARY:
>>>> ==3106==     in use at exit: 187,709 bytes in 1,864 blocks
>>>> ==3106==   total heap usage: 112,891 allocs, 111,027 frees, 19,838,487
>>>> bytes allocated
>>>> ==3106==
>>>> ==3107==
>>>> ==3107== HEAP SUMMARY:
>>>> ==3107==     in use at exit: 212,357 bytes in 1,870 blocks
>>>> ==3107==   total heap usage: 112,701 allocs, 110,831 frees, 19,698,341
>>>> bytes allocated
>>>> ==3107==
>>>> ==3106== LEAK SUMMARY:
>>>> ==3106==    definitely lost: 0 bytes in 0 blocks
>>>> ==3106==    indirectly lost: 0 bytes in 0 blocks
>>>> ==3106==      possibly lost: 0 bytes in 0 blocks
>>>> ==3106==    still reachable: 187,709 bytes in 1,864 blocks
>>>> ==3106==         suppressed: 0 bytes in 0 blocks
>>>> ==3106== Rerun with --leak-check=full to see details of leaked memory
>>>> ==3106==
>>>> ==3106== For counts of detected and suppressed errors, rerun with: -v
>>>> ==3106== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
>>>> ==3107== LEAK SUMMARY:
>>>> ==3107==    definitely lost: 0 bytes in 0 blocks
>>>> ==3107==    indirectly lost: 0 bytes in 0 blocks
>>>> ==3107==      possibly lost: 0 bytes in 0 blocks
>>>> ==3107==    still reachable: 212,357 bytes in 1,870 blocks
>>>> ==3107==         suppressed: 0 bytes in 0 blocks
>>>> ==3107== Rerun with --leak-check=full to see details of leaked memory
>>>> ==3107==
>>>> ==3107== For counts of detected and suppressed errors, rerun with: -v
>>>> ==3107== Use --track-origins=yes to see where uninitialised values come
>>>> from
>>>> ==3107== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 2 from 2)
>>>>
>>>> In the above example, the solver iterates and gives results.
>>>>
>>>> Now the cluster case: I had to configure PETSc with the option
>>>> --with-mpi-dir=/opt/openmpi-gcc/current/ because that is how the cluster
>>>> administrators asked me to install it so that PETSc runs across many
>>>> nodes of the cluster. I also tried configuring with --download-mpich on
>>>> the cluster, but it failed with some errors. If you really think the
>>>> errors could come from this configuration, I will retry the install with
>>>> PETSc's MPICH; please let me know.
>>>> For the case where the program terminates abnormally (large domain),
>>>> valgrind reports the following just before the abrupt termination:
>>>>
>>>> ... lots of other errors and then warnings such as:
>>>>
>>>
>>> This appears to be a bug in OpenMPI, which would not be all that
>>> surprising. First, you can try running
>>> in the debugger and extracting a stack trace from the SEGV.
>>>
>>
>> Just to make sure I understood you before I talk with the cluster
>> administrators:
>> The program crashes only for larger domain sizes. Even on the cluster, it
>> does not crash up to a certain domain size. So I need to run the crashing
>> case in the debugger to get the stack trace from the SEGV, right? I do not
>> know how to attach a debugger when submitting a job to the cluster, if
>> that is possible at all. Or are you asking me to run the program in the
>> debugger on my laptop at the largest size? (I have not tried running the
>> largest size on my laptop, fearing it might take forever.)
>>
>
> There is no use being afraid. Run it both places. Ask the admins how to
> map your display for the cluster and use -start_in_debugger.
>

 Thanks, I'll try them.
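
For reference, a minimal sketch of what such a debugger run might look like. It only assembles and prints the command line (a dry run) and launches nothing. The display value "mylaptop:0.0" is a hypothetical placeholder for whatever the admins provide, and using plain mpiexec instead of the petsc alias is an assumption; -start_in_debugger and -display are PETSc runtime options.

```shell
#!/bin/sh
# Dry-run sketch: build the command line and print it; nothing is launched.
NPROCS=2
APP=src/AdLemMain
# ASSUMPTION: hypothetical X display; ask the cluster admins for the real value.
DISPLAY_TARGET="mylaptop:0.0"

# Solver options copied from the run quoted in this thread.
SOLVER_OPTS="-pc_type fieldsplit -pc_fieldsplit_type schur -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2 -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre"

# -start_in_debugger : start each rank under a debugger at program startup
# -display           : X display on which the debugger windows should open
CMD="mpiexec -n $NPROCS $APP $SOLVER_OPTS -start_in_debugger -display $DISPLAY_TARGET"
echo "$CMD"
```

Once the SEGV is hit inside the debugger, a backtrace (for example "bt" in gdb) gives the stack trace that was asked for.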


>    Matt
>
>
>>> Then you could
>>>
>>>   1) Get the admin to install MPICH
>>>
>>>   2) Try running a PETSc example on the cluster
>>>
>>>   3) Try running on another machine
>>>
>>>     Matt
>>>
>>>
>>>> ==55437== Warning: set address range perms: large range [0xc4369040,
>>>> 0xd6abb670) (defined)
>>>> ==55438== Warning: set address range perms: large range [0xc4369040,
>>>> 0xd6a6cd00) (defined)
>>>> ==37183== Warning: set address range perms: large range [0xc4369040,
>>>> 0xd69f57d8) (defined)
>>>> ==37182== Warning: set address range perms: large range [0xc4369040,
>>>> 0xd6a474f0) (defined)
>>>> mpiexec: killing job...
>>>>
>>>>
>>>> In between there are several errors such as:
>>>> ==59334== Use of uninitialised value of size 8
>>>> ==59334==    at 0xD5B3704: mca_pml_ob1_send_request_put
>>>> (pml_ob1_sendreq.c:1217)
>>>> ==59334==    by 0xE1EF01A: btl_openib_handle_incoming
>>>> (btl_openib_component.c:3092)
>>>> ==59334==    by 0xE1F03E9: btl_openib_component_progress
>>>> (btl_openib_component.c:3634)
>>>> ==59334==    by 0x81CF16A: opal_progress (opal_progress.c:207)
>>>> ==59334==    by 0x81153AC: ompi_request_default_wait_all
>>>> (condition.h:92)
>>>> ==59334==    by 0xF4C25DD: ompi_coll_tuned_sendrecv_actual
>>>> (coll_tuned_util.c:54)
>>>> ==59334==    by 0xF4C91FD:
>>>> ompi_coll_tuned_allgatherv_intra_neighborexchange (coll_tuned_util.h:57)
>>>> ==59334==    by 0x8121783: PMPI_Allgatherv (pallgatherv.c:139)
>>>> ==59334==    by 0x5156D19: ISAllGather (iscoloring.c:502)
>>>> ==59334==    by 0x57A6B78: MatGetSubMatrix_MPIAIJ (mpiaij.c:3607)
>>>> ==59334==    by 0x532DB36: MatGetSubMatrix (matrix.c:7297)
>>>> ==59334==    by 0x5B97725: PCSetUp_FieldSplit(_p_PC*) (fieldsplit.c:524)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Satish
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
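
The valgrind output quoted earlier in this message itself suggests rerunning with --track-origins=yes and --leak-check=full. A minimal sketch of such a rerun follows, again as a dry run that only prints the command. The per-process log-file name and the use of plain mpiexec are assumptions, and the solver options are shortened for readability.

```shell
#!/bin/sh
# Dry-run sketch: print the valgrind command line; nothing is launched.
NPROCS=2
APP=src/AdLemMain
# Shortened for readability; append the remaining options from the original run.
SOLVER_OPTS="-pc_type fieldsplit -pc_fieldsplit_type schur"

# --track-origins=yes : report where uninitialised values were created
# --leak-check=full   : list each leaked block individually
# --num-callers=20    : record deeper stack traces than the default
# --log-file=...%p    : one log file per process (%p expands to the PID)
VALGRIND_OPTS="--tool=memcheck --track-origins=yes --leak-check=full --num-callers=20 --log-file=valgrind.log.%p"

CMD="mpiexec -n $NPROCS valgrind $VALGRIND_OPTS $APP $SOLVER_OPTS"
echo "$CMD"
```

Writing one log per process keeps the output of the ranks from interleaving, which makes it easier to follow than the mixed ==3106== / ==3107== output above.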
