>
> Maybe we should also check how petsc4py deals with the issue and
> perhaps take some inspiration from it.
I haven't checked in detail, but I presumed that petsc4py wraps the
PETSc 'FooDestroy' functions. From a quick survey of some of the
petsc4py demos, it appears that some explicitly clean up, e.g.
https://bitbucket.org/petsc/petsc4py/src/7081705ebf90034163de05034df749fcd50cc288/demo/taosolve/chwirut.py?at=master
and some do not, e.g.
https://bitbucket.org/petsc/petsc4py/src/7081705ebf90034163de05034df749fcd50cc288/demo/kspsolve/test_mat_cg.py?at=master
I suspect that petsc4py could suffer from the same issue that we're
seeing when objects are not explicitly cleaned up.
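For illustration, the explicit-cleanup style amounts to something like
the following minimal petsc4py sketch (the vector and its size are just
placeholders, not taken from the demos):

    from petsc4py import PETSc

    # Create a small parallel vector on the world communicator.
    x = PETSc.Vec().create(comm=PETSc.COMM_WORLD)
    x.setSizes(100)
    x.setFromOptions()
    x.set(1.0)

    # Explicit, collective clean-up: every rank calls destroy() at the
    # same point, so no collective call is left to the garbage collector.
    x.destroy()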
Garth
>
> Jan
>
>>
>> > 2) Recommend that users throw in some gc.collect() calls if
>> > objects go out of scope in their code. This doesn't seem to be a
>> > big problem, but it's a lingering non-deterministic MPI deadlock
>> > waiting to happen and very hard to debug.
>> >
>>
>> What about insisting that objects that require collective calls
>> during destruction must have a collective 'clear' or 'destroy'
>> function that cleans up the object?
>>
>> Related to this discussion, we really need to start marking
>> (logically) collective functions in the docstrings.
>>
>> Garth
>>
>> > Martin
>> >
>> >
>> > On 6 October 2014 15:05, Martin Sandve Alnæs
>> > <[email protected]> wrote: Yes. The difference is that MPI
>> > initialization/destruction happens at the beginning/end of the
>> > process, while the destructors run all the time, anywhere. I
>> > think that makes this a harder problem to solve.
>> >
>> > Anyway, I was replying to "would it help if MPI is initialised
>> > explicitly in the setup", and the answer is still no, because MPI
>> > initialisation is not the problem in the tests, although it is of
>> > a similar nature.
>> >
>> > I'm pondering whether it's possible (if necessary) to add a
>> > dolfin.mpi_gc() function and overload __del__ in some classes to
>> > handle this deterministically.
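>> > A rough sketch of what such a helper might look like (hypothetical;
>> > dolfin.mpi_gc() does not exist, and the MPI wrappers are used as in
>> > the fixture patch further down this thread):
>> >
>> >     import gc
>> >
>> >     from dolfin import MPI, mpi_comm_world
>> >
>> >     def mpi_gc():
>> >         # Collect on every rank, then synchronise, so that any
>> >         # collective destructors have run at the same point on all
>> >         # processes before we continue.
>> >         gc.collect()
>> >         MPI.barrier(mpi_comm_world())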
>> >
>> > On 6 Oct 2014 14:54, "Garth N. Wells" <[email protected]>
>> > wrote:
>> >
>> >
>> >
>> > On Mon, 6 Oct, 2014 at 1:10 PM, Martin Sandve Alnæs
>> > <[email protected]> wrote: MPI initialization has nothing to do
>> > with the test problems. The problem is the destructors of
>> > objects. It is temporarily solved by calling gc.collect() in
>> > pytest fixtures.
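>> > (Roughly, such a fixture might look like the sketch below; this is
>> > not the exact code in the branch:)
>> >
>> >     import gc
>> >     import pytest
>> >
>> >     @pytest.fixture(autouse=True)
>> >     def collect_garbage():
>> >         # Force a collection before each test so that any pending
>> >         # (collective) destructors run at the same point on every rank.
>> >         gc.collect()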
>> >
>> >
>> > The core problem is the same. The problem I describe occurs when
>> > the SubSystemsManager singleton that controls MPI initialisation
>> > is destroyed (and finalises MPI) before a PETSc object is
>> > destroyed. It is an issue of destruction order.
>> >
>> > Garth
>> >
>> > I think we should implement the with-statement pattern for all
>> > file types in dolfin to allow scope management.
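>> > As a sketch of the idea (not actual dolfin code; it only assumes the
>> > close() method that e.g. HDF5File already has):
>> >
>> >     class FileContextMixin(object):
>> >         # Mixed into the dolfin file classes so that 'with' closes the
>> >         # file collectively, at a deterministic point on all ranks.
>> >         def __enter__(self):
>> >             return self
>> >
>> >         def __exit__(self, exc_type, exc_value, traceback):
>> >             self.close()
>> >             return False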
>> >
>> > If vectors _do_ call MPI in their destructors, that's a problem
>> > for nontrivial dolfin Python programs.
>> >
>> > On 6 Oct 2014 13:53, "Garth N. Wells" <[email protected]>
>> > wrote:
>> >
>> >
>> > On Mon, 6 Oct, 2014 at 12:32 PM, Jan Blechta
>> > <[email protected]> wrote: On Mon, 06 Oct 2014 11:53:58
>> > +0100 "Garth N. Wells" <[email protected]> wrote:
>> >
>> > On Mon, 6 Oct, 2014 at 11:23 AM, Martin Sandve Alnæs
>> > <[email protected]> wrote:
>> > > All collective destructors must be managed explicitly in
>> > > Python, preferably via the with statement. Are there any apart
>> > > from file objects? Vectors? Matrices? Meshes?
>> > >
>> >
>> > Off the top of my head I can't think of any cases, apart from
>> > IO, in which a (collective) MPI call needs to be made inside a
>> > destructor. For IO, we could insist on a user closing or flushing
>> > a file explicitly. We cannot guarantee that 3rd party linear
>> > algebra backends do not call MPI when objects are destroyed.
>> >
>> > VecDestroy and MatDestroy (called by the PETScVector and
>> > PETScBaseMatrix destructors) are documented as collective in the
>> > PETSc docs:
>> >
>> > http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecDestroy.html
>> >
>> > http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatDestroy.html
>> >
>> > Yes, they are collective but don't necessarily make MPI calls.
>> > My understanding is that 'collective' is not the issue but
>> > whether or not MPI calls are made from a destructor. Some
>> > functions will only make sense if called collectively (e.g.,
>> > VecDestroy), but might not make collective MPI calls.
>> >
>> > For the tests, assuming PyTest permits a 'setup' function like
>> > unittest, would it help if MPI is initialised explicitly in the
>> > setup function and closed down at the end of a test suite (if
>> > possible with PyTest)?
>> >
>> > Garth
>> >
>> >
>> >
>> > Jan
>> >
>> >
>> > We have had this problem in the past with the 'automatic'
>> > finalisation of MPI, which is a problem if MPI is shut down
>> > before PETSc.
>> >
>> > Garth
>> >
>> >
>> > > On 6 Oct 2014 12:18, "Jan Blechta"
>> > > <[email protected]> wrote:
>> > >> On Mon, 6 Oct 2014 12:07:02 +0200
>> > >> Martin Sandve Alnæs <[email protected]> wrote:
>> > >>
>> > >> > The problem is that gc is nondeterministic and, in
>> > >> > particular, does not run with equal timing and ordering on
>> > >> > each MPI process.
>> > >> >
>> > >> > We can't use the with statement to handle the scope of
>> > >> > every single dolfin object in a program.
>> > >>
>> > >> Most of the DOLFIN destructors are not collective. So the
>> > >> moral is that we should avoid collective destructors as much
>> > >> as possible, and document them the way it is done in the
>> > >> PETSc docs.
>> > >>
>> > >> Jan
>> > >>
>> > >> >
>> > >> > We can change all file handling to use with, and require
>> > >> > the user to use that in parallel.
>> > >> > On 6 Oct 2014 11:41, "Jan Blechta" <[email protected]>
>> > >> > wrote:
>> > >> >
>> > >> > > On Mon, 6 Oct 2014 09:48:29 +0200
>> > >> > > Martin Sandve Alnæs <[email protected]> wrote:
>> > >> > >
>> > >> > > > The 'fix' that's in the branch now was to trigger Python
>> > >> > > > garbage collection (suggested by Øyvind Evju) before
>> > >> > > > each test.
>> > >> > > >
>> > >> > > > This probably means we have a general problem in
>> > >> > > > dolfin with non-deterministic destruction order of
>> > >> > > > objects in parallel. Any destructor that uses MPI
>> > >> > > > represents a potential deadlock.
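>> > >> > > > As a minimal illustration of the hazard (using mpi4py only
>> > >> > > > to keep the example self-contained; the class is made up):
>> > >> > > >
>> > >> > > >     from mpi4py import MPI
>> > >> > > >
>> > >> > > >     class NeedsCollectiveCleanup(object):
>> > >> > > >         def __del__(self):
>> > >> > > >             # A collective call inside a destructor: if the
>> > >> > > >             # garbage collector fires at different times on
>> > >> > > >             # different ranks, one rank blocks here while
>> > >> > > >             # the others are somewhere else entirely.
>> > >> > > >             MPI.COMM_WORLD.Barrier()
>> > >> > > >
>> > >> > > >     x = NeedsCollectiveCleanup()
>> > >> > > >     del x  # only safe if every rank reaches this point together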
>> > >> > >
>> > >> > > To understand the issue: is the problem that garbage
>> > >> > > collection does not guarantee when the object is
>> > >> > > destroyed?
>> > >> > >
>> > >> > > Here http://stackoverflow.com/a/5071376/1796717 the
>> > >> > > distinction between variable scoping and object cleanup
>> > >> > > is discussed. Quoting it:
>> > >> > >
>> > >> > >     Deterministic cleanup happens through the with
>> > >> > >     statement.
>> > >> > >
>> > >> > > which might be a proper solution to the problem.
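>> > >> > > For objects that already have a close() method (like
>> > >> > > HDF5File) the standard library gives us this today, e.g.
>> > >> > > (a sketch, with constructor/write signatures as in recent
>> > >> > > dolfin):
>> > >> > >
>> > >> > >     from contextlib import closing
>> > >> > >
>> > >> > >     from dolfin import HDF5File, UnitSquareMesh, mpi_comm_world
>> > >> > >
>> > >> > >     mesh = UnitSquareMesh(4, 4)
>> > >> > >
>> > >> > >     # closing() calls close() on exit, so the collective HDF5
>> > >> > >     # close happens at the same point on every rank.
>> > >> > >     with closing(HDF5File(mpi_comm_world(), "mesh.h5", "w")) as f:
>> > >> > >         f.write(mesh, "/mesh")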
>> > >> > >
>> > >> > > Jan
>> > >> > >
>> > >> > > >
>> > >> > > > On 19 September 2014 12:52, Jan Blechta
>> > >> > > > <[email protected]> wrote:
>> > >> > > >
>> > >> > > > > On Fri, 19 Sep 2014 00:27:50 +0200
>> > >> > > > > Jan Blechta <[email protected]> wrote:
>> > >> > > > >
>> > >> > > > > > Yes, after many trials using
>> > >> > > > > >
>> > >> > > > > > $ cd test/unit/io/python
>> > >> > > > > > $ while true; do git clean -fdx && \
>> > >> > > > > >     mpirun -n 3 xterm -e gdb -ex r -ex q -args python -m pytest -sv; done
>> > >> > > > > > # when it hangs and you interrupt it, it asks for confirmation
>> > >> > > > > > # for quitting, so you type n and enjoy gdb...
>> > >> > > > > >
>> > >> > > > > > I've seen a situation where 2 processes deadlocked
>> > >> > > > > > on HDF5Interface::close_file() in DOLFIN, with a
>> > >> > > > > > backtrace like
>> > >> > > > > >
>> > >> > > > > > # MPI barrier
>> > >> > > > > > ...
>> > >> > > > > > # MPI close
>> > >> > > > > > # HDF5 lib calls
>> > >> > > > > > H5FClose()
>> > >> > > > > > dolfin::HDF5Interface::close_file()
>> > >> > > > > > dolfin::HDF5File::close()
>> > >> > > > > > dolfin::HDF5File::~HDF5File()
>> > >> > > > > > dolfin::HDF5File::~HDF5File()
>> > >> > > > > > # smart ptr management
>> > >> > > > > > # garbage collection
>> > >> > > > > >
>> > >> > > > > > while the 3rd process is waiting far away. Isn't it
>> > >> > > > > > strange that the destructor is there twice in the
>> > >> > > > > > stacktrace? (The upper one is on the '}' line, which
>> > >> > > > > > I don't get.) What does it mean?
>> > >> > > > >
>> > >> > > > > Probably just a code generation artifact - nothing
>> > >> > > > > harmful, see http://stackoverflow.com/a/15244091/1796717
>> > >> > > > >
>> > >> > > > > Jan
>> > >> > > > >
>> > >> > > > > >
>> > >> > > > > > Jan
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > On Thu, 18 Sep 2014 16:20:51 +0200
>> > >> > > > > > Martin Sandve Alnæs <[email protected]> wrote:
>> > >> > > > > >
>> > >> > > > > > > I've added the MPI fixes for the temppath fixture
>> > >> > > > > > > and fixed some other related issues while at it:
>> > >> > > > > > > when parameterizing a test that uses a temppath
>> > >> > > > > > > fixture, there is a need for separate directories
>> > >> > > > > > > for each parameter combo. A further improvement
>> > >> > > > > > > would be automatic cleaning of old tempdirs, but I
>> > >> > > > > > > leave that for now.
>> > >> > > > > > >
>> > >> > > > > > > I've pushed these changes to the branch
>> > >> > > > > > > aslakbergersen/topic-change-unittest-to-pytest
>> > >> > > > > > >
>> > >> > > > > > > The tests still hang though, in the closing of
>> > >> > > > > > > HDF5File.
>> > >> > > > > > >
>> > >> > > > > > > Here's how to debug if someone wants to give it a
>> > >> > > > > > > shot. Just run:
>> > >> > > > > > > mpirun -np 3 python -m pytest -s -v
>> > >> > > > > > > With gdb:
>> > >> > > > > > > mpirun -np 3 xterm -e gdb --args python -m pytest
>> > >> > > > > > > then enter 'r' in each of the three xterms.
>> > >> > > > > > >
>> > >> > > > > > > You may have to try a couple of times to get the
>> > >> > > > > > > hanging behaviour.
>> > >> > > > > > >
>> > >> > > > > > > Martin
>> > >> > > > > > >
>> > >> > > > > > > On 18 September 2014 13:23, Martin Sandve Alnæs
>> > >> > > > > > > <[email protected]> wrote:
>> > >> > > > > > >
>> > >> > > > > > > > Good spotting both of you, thanks.
>> > >> > > > > > > >
>> > >> > > > > > > > Martin
>> > >> > > > > > > >
>> > >> > > > > > > > On 18 September 2014 13:01, Lawrence Mitchell
>> > >> > > > > > > > <[email protected]> wrote:
>> > >> > > > > > > >
>> > >> > > > > > > >> On 18/09/14 11:42, Jan Blechta wrote:
>> > >> > > > > > > >> > Some problems (when running in a clean dir)
>> > >> > > > > > > >> > are avoided using this (although incorrect)
>> > >> > > > > > > >> > patch. There are race conditions in the
>> > >> > > > > > > >> > creation of the temp dir. It should be done
>> > >> > > > > > > >> > using an atomic operation.
>> > >> > > > > > > >> >
>> > >> > > > > > > >> > Jan
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >
>> > >> > > > > > > >> > ==================================================================
>> > >> > > > > > > >> > diff --git a/test/unit/io/python/test_XDMF.py b/test/unit/io/python/test_XDMF.py
>> > >> > > > > > > >> > index 9ad65a4..31471f1 100755
>> > >> > > > > > > >> > --- a/test/unit/io/python/test_XDMF.py
>> > >> > > > > > > >> > +++ b/test/unit/io/python/test_XDMF.py
>> > >> > > > > > > >> > @@ -28,8 +28,9 @@ def temppath():
>> > >> > > > > > > >> >      filedir = os.path.dirname(os.path.abspath(__file__))
>> > >> > > > > > > >> >      basename = os.path.basename(__file__).replace(".py", "_data")
>> > >> > > > > > > >> >      temppath = os.path.join(filedir, basename, "")
>> > >> > > > > > > >> > -    if not os.path.exists(temppath):
>> > >> > > > > > > >> > -        os.mkdir(temppath)
>> > >> > > > > > > >> > +    if MPI.rank(mpi_comm_world()) == 0:
>> > >> > > > > > > >> > +        if not os.path.exists(temppath):
>> > >> > > > > > > >> > +            os.mkdir(temppath)
>> > >> > > > > > > >> >      return temppath
>> > >> > > > > > > >>
>> > >> > > > > > > >> There's still a race condition here because
>> > >> > > > > > > >> ranks other than zero might try to use temppath
>> > >> > > > > > > >> before it's created. I think you want something
>> > >> > > > > > > >> like the below:
>> > >> > > > > > > >>
>> > >> > > > > > > >> if MPI.rank(mpi_comm_world()) == 0:
>> > >> > > > > > > >>     if not os.path.exists(temppath):
>> > >> > > > > > > >>         os.mkdir(temppath)
>> > >> > > > > > > >> MPI.barrier(mpi_comm_world())
>> > >> > > > > > > >> return temppath
>> > >> > > > > > > >>
>> > >> > > > > > > >> If you're worried about the OS not creating
>> > >> > > > > > > >> files atomically, you can always mkdir into a
>> > >> > > > > > > >> tmp directory and then os.rename(tmp, temppath),
>> > >> > > > > > > >> since POSIX guarantees that renames are atomic.
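>> > >> > > > > > > >> Something like the following sketch, reusing the
>> > >> > > > > > > >> filedir/temppath names from the patch above (the
>> > >> > > > > > > >> error handling is just one way to resolve the race):
>> > >> > > > > > > >>
>> > >> > > > > > > >>     import os
>> > >> > > > > > > >>     import tempfile
>> > >> > > > > > > >>
>> > >> > > > > > > >>     if not os.path.exists(temppath):
>> > >> > > > > > > >>         # Create under a unique name, then rename into
>> > >> > > > > > > >>         # place; os.rename() is atomic on POSIX.
>> > >> > > > > > > >>         tmp = tempfile.mkdtemp(dir=filedir)
>> > >> > > > > > > >>         try:
>> > >> > > > > > > >>             os.rename(tmp, temppath.rstrip(os.sep))
>> > >> > > > > > > >>         except OSError:
>> > >> > > > > > > >>             # Another rank won the race; drop our copy.
>> > >> > > > > > > >>             os.rmdir(tmp)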
>> > >> > > > > > > >>
>> > >> > > > > > > >> Lawrence
>> > >> > > > > > > >>
>> > >> > > > > > > >
>> > >> > > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > > >
>> > >> > >
>> > >> > >
>> > >>
>> >
>> >
>> >
>> >
>>
>