On Tue, 07 Oct 2014 13:34:17 +0100 "Garth N. Wells" <[email protected]> wrote:
> 
> On Tue, 7 Oct, 2014 at 10:35 AM, Jan Blechta <[email protected]> wrote:
> > On Tue, 7 Oct 2014 10:23:21 +0100 "Garth N. Wells" <[email protected]> wrote:
> > 
> >> On 6 Oct 2014, at 16:38, Martin Sandve Alnæs <[email protected]> wrote:
> >> 
> >> > I think this is the best solution:
> >> > 
> >> > 1) Require the user to close file objects deterministically. Relying on the del operator is not deterministic; we need to support .close() and/or __enter__/__exit__ for the with statement in dolfin.
> >> 
> >> Sounds good. We can print a warning message from the File object destructors if a file is not closed (this can later become an error).
> > 
> > Good idea, but maybe the warning could be issued from the __del__ operator if the object was not properly destroyed/closed. In the C++ layer everything is OK.
> 
> There are some advantages to insisting on explicit file opening/closing in the C++ interface too, particularly in parallel, so we could reasonably keep the interface consistent across Python and C++.

Ok, but do you really want to be that consistent and require explicit destruction of PETScVector / PETScMatrix?

Jan

> > Maybe we should also check how petsc4py deals with the issue and perhaps get inspired.
> 
> I haven't checked in detail, but I presumed that petsc4py wraps the PETSc 'FooDestroy' functions. From a quick survey of some of the petsc4py demos, it appears that some explicitly clean up, e.g.
> 
> https://bitbucket.org/petsc/petsc4py/src/7081705ebf90034163de05034df749fcd50cc288/demo/taosolve/chwirut.py?at=master
> 
> and some do not, e.g.
> 
> https://bitbucket.org/petsc/petsc4py/src/7081705ebf90034163de05034df749fcd50cc288/demo/kspsolve/test_mat_cg.py?at=master
> 
> I suspect that petsc4py could suffer the same issue that we're seeing in the case that objects are not explicitly cleaned up.
> 
> Garth
> 
> > Jan
> > 
> >> > 2) Recommend users to throw in some gc.collect() calls in their code if objects go out of scope in their code. This doesn't seem to be a big problem, but it's a lingering non-deterministic MPI deadlock waiting to happen and very hard to debug.
> >> 
> >> What about insisting that objects that require collective calls during destruction must have a collective 'clear' or 'destroy' function that cleans up the object?
> >> 
> >> Related to this discussion, we really need to start marking (logically) collective functions in the docstrings.
> >> 
> >> Garth
> >> 
> >> > Martin
> >> > 
> >> > On 6 October 2014 15:05, Martin Sandve Alnæs <[email protected]> wrote:
> >> > Yes. The difference is that MPI initialization/destruction happens at the beginning/end of the process, while the destructors run all the time, anywhere. I think that makes this a harder problem to solve.
> >> > 
> >> > Anyway, I was replying to "would it help if MPI is initialised explicitly in the setup", and the answer is still no, because MPI init is not the problem in the tests, although it is of a similar nature.
> >> > 
> >> > I'm pondering whether it's possible (if necessary) to add a dolfin.mpi_gc() function and overload __del__ in some classes to handle this deterministically.
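A rough sketch of the close()/__enter__/__exit__ support proposed above, combined with the idea of only warning from the destructor. The class and its 'backend' argument are hypothetical names, not DOLFIN API; the point is that all collective work happens in close(), which the user reaches deterministically via a with block, while __del__ stays rank-local:

    import warnings


    class ManagedFile(object):
        """Sketch of a file object with deterministic, collective cleanup.

        'backend' stands for whatever low-level handle (e.g. an HDF5 file)
        actually needs a collective close; both names are hypothetical.
        """

        def __init__(self, backend):
            self._backend = backend
            self._closed = False

        def close(self):
            # Collective: every MPI rank must call this at the same point.
            if not self._closed:
                self._backend.close()
                self._closed = True

        # Support the with statement so cleanup happens at the end of the block.
        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            self.close()

        def __del__(self):
            # No collective work here: garbage collection runs at different
            # times on different ranks. Only warn, rank-locally.
            if not self._closed:
                warnings.warn("file was not closed; call close() or use 'with'")

Usage would then be "with ManagedFile(handle) as f: ...", with close() running on all ranks at the end of the block.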
> >> > On 6 Oct 2014 14:54, "Garth N. Wells" <[email protected]> wrote:
> >> > 
> >> > On Mon, 6 Oct, 2014 at 1:10 PM, Martin Sandve Alnæs <[email protected]> wrote:
> >> > MPI initialization has nothing to do with the test problems. The problem is the destructors of objects. It is temporarily solved by calling gc.collect() in pytest fixtures.
> >> > 
> >> > The core problem is the same. The problem I describe occurs when the SubSystemsManager singleton that controls MPI initialisation is destroyed (and finalises MPI) before a PETSc object is destroyed. It is an issue of destruction order.
> >> > 
> >> > Garth
> >> > 
> >> > I think we should implement the with statement pattern for all file types in dolfin to allow scope management.
> >> > 
> >> > If vectors _do_ call MPI in destructors, that's a problem for nontrivial dolfin Python programs.
> >> > 
> >> > On 6 Oct 2014 13:53, "Garth N. Wells" <[email protected]> wrote:
> >> > 
> >> > On Mon, 6 Oct, 2014 at 12:32 PM, Jan Blechta <[email protected]> wrote:
> >> > On Mon, 06 Oct 2014 11:53:58 +0100 "Garth N. Wells" <[email protected]> wrote:
> >> > 
> >> > On Mon, 6 Oct, 2014 at 11:23 AM, Martin Sandve Alnæs <[email protected]> wrote:
> >> > > All collective destructors must be managed explicitly in Python, preferably via the with statement. Are there any apart from file objects? Vectors? Matrices? Meshes?
> >> > 
> >> > Off the top of my head I can't think of any cases, apart from IO, in which a (collective) MPI call needs to be made inside a destructor. For IO, we could insist on a user closing or flushing a file explicitly. We cannot guarantee that 3rd-party linear algebra backends do not call MPI when objects are destroyed.
> >> > 
> >> > VecDestroy and MatDestroy (called by the PETScVector and PETScBaseMatrix destructors) are claimed to be collective by the PETSc docs:
> >> > http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecDestroy.html
> >> > http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatDestroy.html
> >> > 
> >> > Yes, they are collective, but they don't necessarily make MPI calls. My understanding is that 'collective' is not the issue, but whether or not MPI calls are made from a destructor. Some functions will only make sense if called collectively (e.g., VecDestroy), but might not make collective MPI calls.
> >> > 
> >> > For the tests, assuming PyTest permits a 'setup' function like unittest, would it help if MPI is initialised explicitly in the setup function and closed down at the end of a test suite (if possible with PyTest)?
> >> > 
> >> > Garth
> >> > 
> >> > Jan
> >> > 
> >> > We have had this problem in the past with the 'automatic' finalisation of MPI, which is a problem if MPI is shut down before PETSc.
> >> > 
> >> > Garth
> >> > 
> >> > > On 6 Oct 2014 12:18, "Jan Blechta" <[email protected]> wrote:
> >> > >> On Mon, 6 Oct 2014 12:07:02 +0200 Martin Sandve Alnæs <[email protected]> wrote:
> >> > >> 
> >> > >> > The problem is that gc is nondeterministic and, in particular, not running with equal timing and ordering on each MPI process.
> >> > >> > 
> >> > >> > We can't use the with statement to handle the scope of every single dolfin object in a program.
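The temporary fix mentioned above (forcing garbage collection from the pytest fixtures so that all ranks run destructors at the same point) could look roughly like the sketch below; the fixture name is illustrative and this is not necessarily what is in the branch:

    # conftest.py -- sketch of the workaround: force garbage collection at a
    # synchronised point after every test, so that objects whose destructors
    # make (collective) MPI calls are destroyed together on all ranks.
    import gc

    import pytest
    from dolfin import MPI, mpi_comm_world


    @pytest.fixture(autouse=True)
    def synchronised_gc(request):
        def _collect():
            MPI.barrier(mpi_comm_world())  # all ranks have finished the test body
            gc.collect()                   # run destructors together on all ranks
            MPI.barrier(mpi_comm_world())  # no rank starts the next test early
        request.addfinalizer(_collect)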
> >> > >> 
> >> > >> Most of the DOLFIN destructors are not collective. So the moral is that we should avoid collective destructors as much as possible and document it, like it is done in the PETSc docs.
> >> > >> 
> >> > >> Jan
> >> > >> 
> >> > >> > We can change all file handling to use with, and require the user to use that in parallel.
> >> > >> > 
> >> > >> > On 6 Oct 2014 11:41, "Jan Blechta" <[email protected]> wrote:
> >> > >> > 
> >> > >> > > On Mon, 6 Oct 2014 09:48:29 +0200 Martin Sandve Alnæs <[email protected]> wrote:
> >> > >> > > 
> >> > >> > > > The 'fix' that's in the branch now was to trigger Python garbage collection (suggested by Øyvind Evju) before each test.
> >> > >> > > > 
> >> > >> > > > This probably means we have a general problem in dolfin with non-deterministic destruction order of objects in parallel. Any destructor that uses MPI represents a potential deadlock.
> >> > >> > > 
> >> > >> > > To understand the issue: is the problem that garbage collection does not guarantee when an object is destroyed?
> >> > >> > > 
> >> > >> > > Here http://stackoverflow.com/a/5071376/1796717 the distinction between variable scoping and object cleanup is discussed. Quoting it:
> >> > >> > > 
> >> > >> > >     Deterministic cleanup happens through the with statement.
> >> > >> > > 
> >> > >> > > which might be a proper solution to the problem.
> >> > >> > > 
> >> > >> > > Jan
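For objects that already expose a close() method, deterministic cleanup through the with statement is available today via contextlib.closing. A minimal sketch, assuming the 1.4-style HDF5File(comm, filename, mode) constructor and that the Python wrapper exposes HDF5File.close() as the C++ class does:

    import contextlib

    from dolfin import HDF5File, mpi_comm_world

    # Deterministic cleanup with the with statement, for any object
    # exposing close().
    with contextlib.closing(HDF5File(mpi_comm_world(), "output.h5", "w")) as f:
        pass  # ... write meshes/functions here ...
    # close() has now run at the same point on every rank, independently of
    # when the garbage collector gets around to destroying f.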
> >> > >> > > > > > >> > >> > > > > Probably just code generation artifact - nothing > >> > >> > > > > harmful, see > >> http://stackoverflow.com/a/15244091/1796717 > >> > >> > > > > > >> > >> > > > > Jan > >> > >> > > > > > >> > >> > > > > > > >> > >> > > > > > Jan > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > > On Thu, 18 Sep 2014 16:20:51 +0200 > >> > >> > > > > > Martin Sandve Alnæs <[email protected]> wrote: > >> > >> > > > > > > >> > >> > > > > > > I've added the mpi fixes for temppath fixture > >> > >> > > > > > > and fixed some other related issues while at > >> > >> > > > > > > it: When > >> > >> parameterizing > >> > >> > > > > > > a test that uses a temppath fixture, there is a > >> need > >> > >> > > > > > > for separate directories for each parameter > >> > >> > > > > > > combo. A further improvement would be automatic > >> > >> > > > > > > cleaning > >> of > >> > >> > > > > > > old tempdirs, but I leave that for now. > >> > >> > > > > > > > >> > >> > > > > > > I've pushed these changes to the branch > >> > >> > > > > > > aslakbergersen/topic-change-unittest-to-pytest > >> > >> > > > > > > > >> > >> > > > > > > The tests still hang though, in the closing of > >> > >> > > > > > > HDF5File. > >> > >> > > > > > > > >> > >> > > > > > > Here's now to debug if someone wants to give it > >> > >> > > > > > > a shot: Just run: > >> > >> > > > > > > mpirun -np 3 python -m pytest -s -v > >> > >> > > > > > > With gdb: > >> > >> > > > > > > mpirun -np 3 xterm -e gdb --args python -m > >> > >> > > > > > > pytest then enter 'r' in each of the three > >> > >> > > > > > > xterms. > >> > >> > > > > > > > >> > >> > > > > > > You may have to try a couple of times to get the > >> > >> > > > > > > hanging behaviour. > >> > >> > > > > > > > >> > >> > > > > > > Martin > >> > >> > > > > > > > >> > >> > > > > > > On 18 September 2014 13:23, Martin Sandve Alnæs > >> > >> > > > > > > <[email protected]> wrote: > >> > >> > > > > > > > >> > >> > > > > > > > Good spotting both of you, thanks. > >> > >> > > > > > > > > >> > >> > > > > > > > Martin > >> > >> > > > > > > > > >> > >> > > > > > > > On 18 September 2014 13:01, Lawrence Mitchell > >> > >> > > > > > > > < [email protected]> wrote: > >> > >> > > > > > > > > >> > >> > > > > > > >> On 18/09/14 11:42, Jan Blechta wrote: > >> > >> > > > > > > >> > Some problems (when running in a clean > >> > >> > > > > > > >> > dir) > >> are > >> > >> avoided > >> > >> > > > > > > >> > using this (although incorrect) patch. > >> > >> > > > > > > >> > There are > >> > >> race > >> > >> > > > > > > >> > conditions in creation of temp dir. It > >> > >> > > > > > > >> > should be > >> > >> done > >> > >> > > > > > > >> > using atomic operation. 
> >> > >> > > > > > > >> > > >> > >> > > > > > > >> > Jan > >> > >> > > > > > > >> > > >> > >> > > > > > > >> > > >> > >> > > > > > > >> > > >> > >> > > > >> > >> > >> ================================================================== > >> > >> > > > > > > >> > diff --git > >> > >> > > > > > > >> > a/test/unit/io/python/test_XDMF.py > >> > >> > > > > > > >> > b/test/unit/io/python/test_XDMF.py index > >> > >> > > > > > > >> > 9ad65a4..31471f1 100755 --- > >> > >> > > > > > > >> > a/test/unit/io/python/test_XDMF.py +++ > >> > >> > > > > > > >> > b/test/unit/io/python/test_XDMF.py @@ > >> > >> > > > > > > >> > -28,8 +28,9 @@ def temppath(): filedir = > >> > >> > > > > > > >> > os.path.dirname(os.path.abspath(__file__)) > >> > >> > > > > > > >> > basename > >> > >> = > >> > >> > > > > > > >> > os.path.basename(__file__).replace(".py", > >> > >> > > > > > > >> > "_data") temppath = os.path.join(filedir, > >> > >> > > > > > > >> > basename, "") > >> > >> > > > > > > >> > - if not os.path.exists(temppath): > >> > >> > > > > > > >> > - os.mkdir(temppath) > >> > >> > > > > > > >> > + if MPI.rank(mpi_comm_world()) == 0: > >> > >> > > > > > > >> > + if not os.path.exists(temppath): > >> > >> > > > > > > >> > + os.mkdir(temppath) > >> > >> > > > > > > >> > return temppath > >> > >> > > > > > > >> > >> > >> > > > > > > >> There's still a race condition here because > >> ranks > >> > >> other > >> > >> > > > > > > >> than zero might try and use temppath before > >> it's > >> > >> > > > > > > >> created. I think you want something like the > >> > >> > > > > > > >> below: > >> > >> > > > > > > >> > >> > >> > > > > > > >> if MPI.rank(mpi_comm_world()) == 0: > >> > >> > > > > > > >> if not os.path.exists(temppath): > >> > >> > > > > > > >> os.mkdir(temppath) > >> > >> > > > > > > >> MPI.barrier(mpi_comm_world()) > >> > >> > > > > > > >> return temppath > >> > >> > > > > > > >> > >> > >> > > > > > > >> If you're worried about the OS not creating > >> files > >> > >> > > > > > > >> atomically, you can always mkdir into a tmp > >> > >> > > > > > > >> directory > >> > >> and > >> > >> > > > > > > >> then os.rename(tmp, temppath), since posix > >> > >> > > > > > > >> guarantees that renames are atomic. > >> > >> > > > > > > >> > >> > >> > > > > > > >> Lawrence > >> > >> > > > > > > >> _______________________________________________ > >> > >> > > > > > > >> fenics mailing list > >> > >> > > > > > > >> [email protected] > >> > >> > > > > > > >> > >> http://fenicsproject.org/mailman/listinfo/fenics > >> > >> > > > > > > >> > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > >> > >> > > > > > _______________________________________________ > >> > >> > > > > > fenics mailing list > >> > >> > > > > > [email protected] > >> > >> > > > > > http://fenicsproject.org/mailman/listinfo/fenics > >> > >> > > > > > >> > >> > > > > > >> > >> > > > >> > >> > > > >> > >> > >> > > >> > > >> > > >> > > >> > >> _______________________________________________ > >> fenics mailing list > >> [email protected] > >> http://fenicsproject.org/mailman/listinfo/fenics > > _______________________________________________ fenics mailing list [email protected] http://fenicsproject.org/mailman/listinfo/fenics
_______________________________________________
fenics mailing list
[email protected]
http://fenicsproject.org/mailman/listinfo/fenics