Re: [petsc-users] Vexing deadlock situation with petsc4py
Guyer, Jonathan E. Dr. (Fed) via petsc-users, Wed, 28 Oct 2020 10:33:19 -0700

* gc.disable()

> On Oct 28, 2020, at 1:32 PM, Jonathan Guyer <[email protected]> wrote:
> 
> That’s very helpful, thanks!
> 
> Adding `gc.collect()` to the beginning of the offending test does indeed
> resolve that particular problem.
> 
> I’ve not been systematic about calling XXX.destroy(), thinking garbage
> collection was sufficient, so I need to get to work on that.
> 
>> On Oct 28, 2020, at 1:21 PM, Lawrence Mitchell <[email protected]> wrote:
>> 
>>> On 28 Oct 2020, at 16:35, Guyer, Jonathan E. Dr. (Fed) via petsc-users
>>> <[email protected]> wrote:
>>> 
>>> We use petsc4py as a solver suite in our
>>> [FiPy](https://www.ctcms.nist.gov/fipy) Python-based PDE solver package.
>>> Some time back, I refactored some of the code and provoked a deadlock
>>> situation in our test suite. I have been tearing what remains of my hair
>>> out trying to isolate things and am at a loss. I’ve gone through the
>>> refactoring line by line and I just don’t think I’ve changed anything
>>> substantive, just how the code is organized.
>>> 
>>> I have posted a branch that exhibits the issue at
>>> https://github.com/usnistgov/fipy/pull/761
>>> 
>>> I explain in greater detail in that “pull request” how to reproduce, but in
>>> short, after a substantial number of our tests run, the code either
>>> deadlocks or raises exceptions:
>>> 
>>> On processor 0 in
>>> 
>>>     matrix.setUp()
>>> 
>>> specifically in
>>> 
>>>     [0] PetscSplitOwnership() line 93 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/utils/psplit.c
>>> 
>>> and on other processors a few lines earlier in
>>> 
>>>     matrix.create(comm)
>>> 
>>> specifically in
>>> 
>>>     [1] PetscCommDuplicate() line 126 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/objects/tagm.c
>>> 
>>> The circumstances that lead to this failure are really fragile, and it
>>> seems likely to be due to some memory corruption, particularly given that
>>> I can make the failure go away by removing seemingly irrelevant things like
>>> 
>>>     >>> from scipy.stats.mstats import argstoarray
>>> 
>>> Note that when I run the full test suite after taking out this scipy
>>> import, the same problem just arises elsewhere without any obvious similar
>>> import trigger.
>>> 
>>> Running with `-malloc_debug true` doesn’t illuminate anything.
>>> 
>>> I’ve run with `-info` and `-log_trace` and don’t see any obvious issues,
>>> but there’s a ton of output.
>>> 
>>> I have tried reducing things to a minimal reproducible example, but
>>> unfortunately things remain way too complicated and idiosyncratic to FiPy.
>>> I’m grateful for any help anybody can offer despite the mess that I’m
>>> offering.
>> 
>> My crystal-ball guess is the following:
>> 
>> PETSc objects have collective destroy semantics.
>> 
>> When using petsc4py, XXX.destroy() is called on an object when its Python
>> refcount drops to zero, or when it is collected by the generational garbage
>> collector.
>> 
>> In the absence of reference cycles, all allocated objects will be collected
>> by the refcounting part of the collector. This is (unless you do something
>> funky like hold more references on one process than another) deterministic,
>> and if you do normal SPMD programming, you'll call XXX.destroy() in the same
>> order on the same objects on all processes.
>> 
>> If you have reference cycles, then the refcounting part of the collector
>> will not collect those objects. Now you are at the mercy of the generational
>> collector, which is definitely not deterministic.
>> If different Python processes do different things (for example, rank 0
>> might open files), then when the generational collector runs is no longer
>> in sync across processes.
>> 
>> A consequence is that you might now have rank 0 collect XXX then YYY,
>> whereas rank 1 might collect YYY then XXX => deadlock.
>> 
>> You can test this hypothesis by turning off the garbage collector in your
>> test that provokes the failure:
>> 
>>     import gc
>>     gc.disable()
>>     ...
>> 
>> If this turns out to be the case, I don't think there's a good solution
>> here. You can audit your code base and ensure that objects that hold PETSc
>> objects never participate in reference cycles, but this is fragile.
>> 
>> Another option is to explicitly require that the user of the API call
>> XXX.destroy() on all of your objects (which in turn destroy their PETSc
>> objects). This is the decision taken for mpi4py: you are responsible for
>> freeing any objects that you create.
>> 
>> That is, your API becomes more like the C API, with
>> 
>>     x = Foo(...)  # holds some petsc object XX
>>     ...           # use x
>>     x.destroy()   # calls XX.destroy()
>> 
>> You could make this more pythonic by wrapping this pattern in a context
>> manager:
>> 
>>     with Foo(...) as x:
>>         ...
>> 
>> Thanks,
>> 
>> Lawrence
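Lawrence's hypothesis is easy to reproduce in isolation. The following minimal sketch uses a hypothetical `Solver` wrapper (not FiPy code) to show how a reference cycle keeps a petsc4py object out of reach of refcounting, so its collective `destroy()` only happens whenever the generational collector decides to run, and how `gc.collect()` or `gc.disable()` restores a deterministic ordering:

```python
import gc

from petsc4py import PETSc


class Solver:
    """Hypothetical wrapper that owns a PETSc Mat and sits in a reference cycle."""

    def __init__(self, n):
        self.mat = PETSc.Mat().createAIJ([n, n], comm=PETSc.COMM_WORLD)
        self.mat.setUp()
        # The self-reference forms a cycle, so refcounting alone can never
        # free this object; only the generational collector can.
        self.owner = self


s = Solver(10)
del s  # the refcount does not hit zero because of the cycle

# Workaround from the thread: force a collective, deterministic cleanup point
# on every rank before the next PETSc call.
gc.collect()

# Lawrence's diagnostic: disable the generational collector entirely, so only
# deterministic refcounting frees objects (cyclic garbage then leaks instead
# of being destroyed in a rank-dependent order).
# gc.disable()
```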
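And a minimal sketch of the context-manager pattern Lawrence suggests at the end, using a hypothetical `Foo` wrapper around a PETSc Vec (the class name comes from his example; the Vec and the method bodies are assumptions):

```python
from petsc4py import PETSc


class Foo:
    """Hypothetical wrapper whose PETSc Vec is destroyed explicitly, not by the GC."""

    def __init__(self, n):
        self.vec = PETSc.Vec().createMPI(n, comm=PETSc.COMM_WORLD)

    def destroy(self):
        # Collective: every rank must reach this call, in the same order.
        self.vec.destroy()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.destroy()
        return False  # do not swallow exceptions raised inside the block


with Foo(100) as x:
    x.vec.set(1.0)
    # ... use x ...
# On exit, destroy() has run on every rank at the same point in the program,
# independent of when (or whether) the garbage collector runs.
```

This mirrors the mpi4py convention mentioned above: whoever creates an object is responsible for freeing it.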
