That’s very helpful, thanks! Adding `gc.collect()` to the beginning of the offending test does indeed resolve that particular problem.
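For anyone following along, here is a minimal standalone sketch (no PETSc required; the `Holder` class is invented purely for illustration) of what that `gc.collect()` is doing: it reclaims cycle-trapped objects at one deterministic point, rather than whenever each rank's generational collector happens to fire.

```python
import gc

class Holder:
    """Invented stand-in for an object that (indirectly) owns a PETSc handle."""
    pass

# Build a reference cycle: refcounting alone can never reclaim these.
a, b = Holder(), Holder()
a.other, b.other = b, a
del a, b  # objects survive -- only the generational collector can find them

# An explicit gc.collect() at the top of the test reclaims the cycle at the
# same point in the program on every rank, instead of at some arbitrary,
# rank-dependent moment later in the run.
n = gc.collect()
print(n > 0)  # True: the cycle above was among the unreachable objects found
```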
I’ve not been systematic about calling XXX.destroy(), thinking garbage collection was sufficient, so I need to get to work on that.

> On Oct 28, 2020, at 1:21 PM, Lawrence Mitchell <[email protected]> wrote:
>
>> On 28 Oct 2020, at 16:35, Guyer, Jonathan E. Dr. (Fed) via petsc-users <[email protected]> wrote:
>>
>> We use petsc4py as a solver suite in our [FiPy](https://www.ctcms.nist.gov/fipy) Python-based PDE solver package. Some time back, I refactored some of the code and provoked a deadlock situation in our test suite. I have been tearing what remains of my hair out trying to isolate things and am at a loss. I’ve gone through the refactoring line-by-line and I just don’t think I’ve changed anything substantive, just how the code is organized.
>>
>> I have posted a branch that exhibits the issue at https://github.com/usnistgov/fipy/pull/761
>>
>> I explain in greater detail in that “pull request” how to reproduce, but in short, after a substantial number of our tests run, the code either deadlocks or raises exceptions:
>>
>> On processor 0 in
>>
>>     matrix.setUp()
>>
>> specifically in
>>
>>     [0] PetscSplitOwnership() line 93 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/utils/psplit.c
>>
>> and on other processors a few lines earlier in
>>
>>     matrix.create(comm)
>>
>> specifically in
>>
>>     [1] PetscCommDuplicate() line 126 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/objects/tagm.c
>>
>> The circumstances that lead to this failure are really fragile and it seems likely due to some memory corruption. Particularly likely given that I can make the failure go away by removing seemingly irrelevant things like
>>
>>     >>> from scipy.stats.mstats import argstoarray
>>
>> Note that when I run the full test suite after taking out this scipy import, the same problem just arises elsewhere without any obvious similar import trigger.
>>
>> Running with `-malloc_debug true` doesn’t illuminate anything.
>>
>> I’ve run with `-info` and `-log_trace` and don’t see any obvious issues, but there’s a ton of output.
>>
>> I have tried reducing things to a minimal reproducible example, but unfortunately things remain way too complicated and idiosyncratic to FiPy. I’m grateful for any help anybody can offer despite the mess that I’m offering.
>
> My crystal ball guess is the following:
>
> PETSc objects have collective destroy semantics.
>
> When using petsc4py, XXX.destroy() is called on an object when its Python refcount drops to zero, or when it is collected by the generational garbage collector.
>
> In the absence of reference cycles, all allocated objects will be collected by the refcounting part of the collector. This is (unless you do something funky like hold more references on one process than another) deterministic, and if you do normal SPMD programming, you’ll call XXX.destroy() in the same order on the same objects on all processes.
>
> If you have reference cycles, then the refcounting part of the collector will not collect these objects. Now you are at the mercy of the generational collector. This is definitely not deterministic. If different Python processes do different things (for example, rank 0 might open files), then when the generational collector runs is no longer in sync across processes.
>
> A consequence is that you now might have rank 0 collect XXX then YYY, whereas rank 1 might collect YYY then XXX => deadlock.
>
> You can test this hypothesis by turning off the garbage collector in your test that provokes the failure:
>
>     import gc
>     gc.disable()
>     ...
>
> If this turns out to be the case, I don't think there's a good solution here. You can audit your code base and ensure that objects that hold PETSc objects never participate in reference cycles. This is fragile.
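Lawrence's mechanism can be seen in miniature with plain Python objects. The following is a sketch, not petsc4py code: `FakePETScObj` and its `destroyed` log are invented, standing in for petsc4py objects whose real `destroy()` call is collective across ranks.

```python
import gc

destroyed = []

class FakePETScObj:
    """Invented stand-in for a petsc4py object with a collective destroy."""
    def __init__(self, name):
        self.name = name
    def __del__(self):
        destroyed.append(self.name)  # real code would call self.destroy()

# No cycle: dropping the last reference finalizes the object immediately
# (pure refcounting), in the same order on every rank running the same code.
x = FakePETScObj("x")
del x                      # "x" is recorded right here

# Cycle: the object survives `del` and is only reclaimed when the
# generational collector runs -- a point that can differ across ranks.
y = FakePETScObj("y")
y.self_ref = y
del y                      # "y" is NOT recorded yet
gc.collect()               # now the cycle is found and "y" is finalized
print(destroyed)           # -> ['x', 'y']
```

If two ranks reach that `gc.collect()`-equivalent moment at different times, or with differently ordered garbage, the collective destroys interleave differently and the program deadlocks.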
>
> Another option is to explicitly require that the user of the API call XXX.destroy() on all your objects (and thence on the PETSc objects they hold). This is the decision taken for mpi4py: you are responsible for freeing any objects that you create.
>
> That is, your API becomes more like the C API, with
>
>     x = Foo(...)  # holds some petsc object XX
>     ...           # use x
>     x.destroy()   # calls XX.destroy()
>
> You could make this more pythonic by wrapping this pattern in context managers:
>
>     with Foo(...) as x:
>         ...
>
> Thanks,
>
> Lawrence
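The context-manager variant Lawrence suggests can be sketched as follows. `Foo` here is hypothetical and its PETSc internals are stubbed out; `destroy()` stands in for the collective PETSc destroy call.

```python
class Foo:
    """Hypothetical wrapper that owns a PETSc object (stubbed out here)."""
    def __init__(self):
        self.alive = True   # stand-in for holding a live PETSc handle

    def destroy(self):
        # In real code this would call the collective XX.destroy().
        self.alive = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()      # deterministic: same point on every rank
        return False        # do not swallow exceptions

with Foo() as x:
    print(x.alive)   # True: object is valid inside the block
print(x.alive)       # False: destroyed on exit, independent of the GC
```

The destroy now happens at a textually visible point in the program, so all ranks executing the same code free objects in the same order regardless of what the garbage collector is doing.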
