We have had this problem in the past with the 'automatic'
finalisation of MPI, which is a problem if MPI is shut down before
PETSc.
Garth
> On 6 Oct 2014 12:18, "Jan Blechta" <[email protected]> wrote:
>> On Mon, 6 Oct 2014 12:07:02 +0200
>> Martin Sandve Alnæs <[email protected]> wrote:
>>
>> > The problem is that gc is nondeterministic and in particular not
>> > running with equal timing and ordering on each mpi process.
>> >
>> > We can't use the with statement to handle the scope of every
>> > single dolfin object in a program.
>>
>> Most of the DOLFIN destructors are not collective. So the moral is
>> that we should avoid collective destructors as much as possible and
>> document it, as is done in the PETSc documentation.
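>>
>> For illustration, a minimal sketch of the hazard (hypothetical class,
>> not actual DOLFIN code): a destructor that makes a collective MPI call
>> can deadlock when the garbage collector runs it at different times on
>> different ranks.
>>
>>     from dolfin import MPI, mpi_comm_world
>>
>>     class CollectiveThing(object):
>>         """Hypothetical object whose cleanup is collective over MPI."""
>>         def __del__(self):
>>             # Every rank must reach this barrier for it to complete.
>>             # Under gc, destruction order and timing differ between
>>             # ranks, so some ranks may block here forever.
>>             MPI.barrier(mpi_comm_world())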
>>
>> Jan
>>
>> >
>> > We can change all file handling to use with, and require the user
>> > to use that in parallel.
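>> >
>> > For example, a minimal sketch of what that could look like (using
>> > contextlib.closing since HDF5File has a close() method, and assuming
>> > the HDF5File(comm, filename, mode) constructor; the mesh and file
>> > name are just placeholders):
>> >
>> >     from contextlib import closing
>> >     from dolfin import HDF5File, UnitSquareMesh, mpi_comm_world
>> >
>> >     mesh = UnitSquareMesh(8, 8)
>> >     # closing() calls f.close() on exit, so the collective HDF5 close
>> >     # happens at the same point in the program on every rank, rather
>> >     # than whenever the garbage collector gets to it.
>> >     with closing(HDF5File(mpi_comm_world(), "output.h5", "w")) as f:
>> >         f.write(mesh, "/mesh")
>> >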
>> > On 6 Oct 2014 11:41, "Jan Blechta" <[email protected]> wrote:
>> >
>> > > On Mon, 6 Oct 2014 09:48:29 +0200
>> > > Martin Sandve Alnæs <[email protected]> wrote:
>> > >
>> > > > The 'fix' that's in the branch now was to trigger python garbage
>> > > > collection (suggested by Øyvind Evju) before each test.
>> > > >
>> > > > This probably means we have a general problem in dolfin with
>> > > > non-deterministic destruction order of objects in parallel.
>> > > > Any destructor that uses MPI represents a potential deadlock.
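>> > > >
>> > > > A minimal sketch of such a fixture (hypothetical, not necessarily
>> > > > the exact code in the branch):
>> > > >
>> > > >     import gc
>> > > >     import pytest
>> > > >
>> > > >     @pytest.fixture(autouse=True)
>> > > >     def collect_garbage():
>> > > >         # Force a collection before each test so objects left over
>> > > >         # from the previous test are destroyed at the same point
>> > > >         # on every rank, not at some arbitrary later time.
>> > > >         gc.collect()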
>> > >
>> > > To understand the issue: is the problem that garbage collection
>> > > does not ensure when the object is destroyed?
>> > >
>> > > Here http://stackoverflow.com/a/5071376/1796717 the distinction
>> > > between variable scoping and object cleanup is discussed. Quoting it:
>> > >
>> > >   Deterministic cleanup happens through the with statement.
>> > >
>> > > which might be a proper solution to the problem.
>> > >
>> > > Jan
>> > >
>> > > >
>> > > > On 19 September 2014 12:52, Jan Blechta
>> > > > <[email protected]> wrote:
>> > > >
>> > > > > On Fri, 19 Sep 2014 00:27:50 +0200
>> > > > > Jan Blechta <[email protected]> wrote:
>> > > > >
>> > > > > > Yes, after many trials using
>> > > > > >
>> > > > > > $ cd test/unit/io/python
>> > > > > > $ while true; do git clean -fdx && \
>> > > > > >     mpirun -n 3 xterm -e gdb -ex r -ex q -args python -m pytest -sv; done
>> > > > > > # when it hangs and you interrupt it, it asks for
>> > > > > > # confirmation for quitting, so you type n and enjoy gdb...
>> > > > > >
>> > > > > > I've seen a situation where 2 processes deadlocked on
>> > > > > > HDF5Interface::close_file() in DOLFIN with a backtrace like
>> > > > > >
>> > > > > > # MPI barrier
>> > > > > > ...
>> > > > > > # MPI close
>> > > > > > # HDF5 lib calls
>> > > > > > H5FClose()
>> > > > > > dolfin::HDF5Interface::close_file()
>> > > > > > dolfin::HDF5File::close()
>> > > > > > dolfin::HDF5File::~HDF5File()
>> > > > > > dolfin::HDF5File::~HDF5File()
>> > > > > > # smart ptr management
>> > > > > > # garbage collection
>> > > > > >
>> > > > > > while the 3rd process is waiting far away. Isn't it strange
>> > > > > > that the destructor is there twice in the stacktrace? (The
>> > > > > > upper one is on a '}' line, which I don't get.) What does it
>> > > > > > mean?
>> > > > >
>> > > > > Probably just a code generation artifact (GCC emits more than
>> > > > > one copy of a destructor, e.g. the "complete" and "base object"
>> > > > > variants, so the same destructor can show up twice in a
>> > > > > backtrace) - nothing harmful, see
>> > > > > http://stackoverflow.com/a/15244091/1796717
>> > > > >
>> > > > > Jan
>> > > > >
>> > > > > >
>> > > > > > Jan
>> > > > > >
>> > > > > >
>> > > > > > On Thu, 18 Sep 2014 16:20:51 +0200
>> > > > > > Martin Sandve Alnæs <[email protected]> wrote:
>> > > > > >
>> > > > > > > I've added the mpi fixes for the temppath fixture and fixed
>> > > > > > > some other related issues while at it: when parameterizing
>> > > > > > > a test that uses a temppath fixture, there is a need for
>> > > > > > > separate directories for each parameter combo. A further
>> > > > > > > improvement would be automatic cleaning of old tempdirs,
>> > > > > > > but I leave that for now.
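>> > > > > > >
>> > > > > > > A sketch of one way to get per-parameter directories
>> > > > > > > (hypothetical, building the path from the test's own name):
>> > > > > > >
>> > > > > > >     import os
>> > > > > > >     import pytest
>> > > > > > >     from dolfin import MPI, mpi_comm_world
>> > > > > > >
>> > > > > > >     @pytest.fixture
>> > > > > > >     def temppath(request):
>> > > > > > >         # request.node.name includes the parameter combo
>> > > > > > >         # (e.g. "test_save_mesh[2D]"), so parameterized
>> > > > > > >         # variants do not share a directory.
>> > > > > > >         filedir = os.path.dirname(str(request.fspath))
>> > > > > > >         path = os.path.join(filedir, request.node.name + "_data", "")
>> > > > > > >         if MPI.rank(mpi_comm_world()) == 0:
>> > > > > > >             if not os.path.exists(path):
>> > > > > > >                 os.makedirs(path)
>> > > > > > >         MPI.barrier(mpi_comm_world())
>> > > > > > >         return path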
>> > > > > > >
>> > > > > > > I've pushed these changes to the branch
>> > > > > > > aslakbergersen/topic-change-unittest-to-pytest
>> > > > > > >
>> > > > > > > The tests still hang though, in the closing of
>> > > > > > > HDF5File.
>> > > > > > >
>> > > > > > > Here's how to debug if someone wants to give it a shot:
>> > > > > > > Just run:
>> > > > > > >     mpirun -np 3 python -m pytest -s -v
>> > > > > > > With gdb:
>> > > > > > >     mpirun -np 3 xterm -e gdb --args python -m pytest
>> > > > > > > then enter 'r' in each of the three xterms.
>> > > > > > >
>> > > > > > > You may have to try a couple of times to get the
>> > > > > > > hanging behaviour.
>> > > > > > >
>> > > > > > > Martin
>> > > > > > >
>> > > > > > > On 18 September 2014 13:23, Martin Sandve Alnæs
>> > > > > > > <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > Good spotting both of you, thanks.
>> > > > > > > >
>> > > > > > > > Martin
>> > > > > > > >
>> > > > > > > > On 18 September 2014 13:01, Lawrence Mitchell <
>> > > > > > > > [email protected]> wrote:
>> > > > > > > >
>> > > > > > > >> On 18/09/14 11:42, Jan Blechta wrote:
>> > > > > > > >> > Some problems (when running in a clean dir) are
>> > > > > > > >> > avoided using this (although incorrect) patch. There
>> > > > > > > >> > are race conditions in the creation of the temp dir.
>> > > > > > > >> > It should be done using an atomic operation.
>> > > > > > > >> >
>> > > > > > > >> > Jan
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> > ==================================================================
>> > > > > > > >> > diff --git a/test/unit/io/python/test_XDMF.py b/test/unit/io/python/test_XDMF.py
>> > > > > > > >> > index 9ad65a4..31471f1 100755
>> > > > > > > >> > --- a/test/unit/io/python/test_XDMF.py
>> > > > > > > >> > +++ b/test/unit/io/python/test_XDMF.py
>> > > > > > > >> > @@ -28,8 +28,9 @@ def temppath():
>> > > > > > > >> >      filedir = os.path.dirname(os.path.abspath(__file__))
>> > > > > > > >> >      basename = os.path.basename(__file__).replace(".py", "_data")
>> > > > > > > >> >      temppath = os.path.join(filedir, basename, "")
>> > > > > > > >> > -    if not os.path.exists(temppath):
>> > > > > > > >> > -        os.mkdir(temppath)
>> > > > > > > >> > +    if MPI.rank(mpi_comm_world()) == 0:
>> > > > > > > >> > +        if not os.path.exists(temppath):
>> > > > > > > >> > +            os.mkdir(temppath)
>> > > > > > > >> >      return temppath
>> > > > > > > >>
>> > > > > > > >> There's still a race condition here because ranks other
>> > > > > > > >> than zero might try and use temppath before it's
>> > > > > > > >> created. I think you want something like the below:
>> > > > > > > >>
>> > > > > > > >>     if MPI.rank(mpi_comm_world()) == 0:
>> > > > > > > >>         if not os.path.exists(temppath):
>> > > > > > > >>             os.mkdir(temppath)
>> > > > > > > >>     MPI.barrier(mpi_comm_world())
>> > > > > > > >>     return temppath
>> > > > > > > >>
>> > > > > > > >> If you're worried about the OS not creating files
>> > > > > > > >> atomically, you can always mkdir into a tmp directory and
>> > > > > > > >> then os.rename(tmp, temppath), since POSIX guarantees
>> > > > > > > >> that renames are atomic.
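>> > > > > > > >>
>> > > > > > > >> For instance, a rough sketch of that idea (hypothetical
>> > > > > > > >> helper, using tempfile.mkdtemp and treating a failed
>> > > > > > > >> rename as "someone else created it first"):
>> > > > > > > >>
>> > > > > > > >>     import errno, os, tempfile
>> > > > > > > >>     from dolfin import MPI, mpi_comm_world
>> > > > > > > >>
>> > > > > > > >>     def make_shared_dir(temppath):
>> > > > > > > >>         target = temppath.rstrip(os.sep)
>> > > > > > > >>         if MPI.rank(mpi_comm_world()) == 0 and not os.path.isdir(target):
>> > > > > > > >>             # build under a unique temporary name, then
>> > > > > > > >>             # move it into place atomically
>> > > > > > > >>             tmp = tempfile.mkdtemp(dir=os.path.dirname(target))
>> > > > > > > >>             try:
>> > > > > > > >>                 os.rename(tmp, target)
>> > > > > > > >>             except OSError as e:
>> > > > > > > >>                 if e.errno not in (errno.EEXIST, errno.ENOTEMPTY):
>> > > > > > > >>                     raise
>> > > > > > > >>                 os.rmdir(tmp)  # target appeared in the meantime
>> > > > > > > >>         MPI.barrier(mpi_comm_world())  # wait until the dir exists
>> > > > > > > >>         return temppath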
>> > > > > > > >>
>> > > > > > > >> Lawrence
>> > > > > > > >> _______________________________________________
>> > > > > > > >> fenics mailing list
>> > > > > > > >> [email protected]
>> > > > > > > >> http://fenicsproject.org/mailman/listinfo/fenics
>> > > > > > > >>
>> > > > > > > >
>> > > > > > > >
>> > > > > >
>> > > > > > _______________________________________________
>> > > > > > fenics mailing list
>> > > > > > [email protected]
>> > > > > > http://fenicsproject.org/mailman/listinfo/fenics
>> > > > >
>> > > > >
>> > >
>> > >
>>