That’s very helpful, thanks! Adding `gc.collect()` to the beginning of the offending test does indeed resolve that particular problem.
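For anyone following along, here is a minimal standalone sketch (no PETSc required; the `Holder` class is invented purely for illustration) of what that `gc.collect()` is doing: it reclaims cycle-trapped objects at one deterministic point, rather than whenever each rank's generational collector happens to fire.

```python
import gc

class Holder:
    """Invented stand-in for an object that (indirectly) owns a PETSc handle."""
    pass

# Build a reference cycle: refcounting alone can never reclaim these.
a, b = Holder(), Holder()
a.other, b.other = b, a
del a, b  # objects survive -- only the generational collector can find them

# An explicit gc.collect() at the top of the test reclaims the cycle at the
# same point in the program on every rank, instead of at some arbitrary,
# rank-dependent moment later in the run.
n = gc.collect()
print(n > 0)  # True: the cycle above was among the unreachable objects found
```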
I’ve not been systematic about calling XXX.destroy(), thinking garbage collection was sufficient, so I need to get to work on that.

> On Oct 28, 2020, at 1:21 PM, Lawrence Mitchell <[email protected]> wrote:
>
>> On 28 Oct 2020, at 16:35, Guyer, Jonathan E. Dr. (Fed) via petsc-users <[email protected]> wrote:
>>
>> We use petsc4py as a solver suite in our [FiPy](https://www.ctcms.nist.gov/fipy) Python-based PDE solver package. Some time back, I refactored some of the code and provoked a deadlock situation in our test suite. I have been tearing what remains of my hair out trying to isolate things and am at a loss. I’ve gone through the refactoring line-by-line and I just don’t think I’ve changed anything substantive, just how the code is organized.
>>
>> I have posted a branch that exhibits the issue at https://github.com/usnistgov/fipy/pull/761
>>
>> I explain in greater detail in that “pull request” how to reproduce, but in short, after a substantial number of our tests run, the code either deadlocks or raises exceptions:
>>
>> On processor 0 in
>>
>>     matrix.setUp()
>>
>> specifically in
>>
>>     [0] PetscSplitOwnership() line 93 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/utils/psplit.c
>>
>> and on other processors a few lines earlier in
>>
>>     matrix.create(comm)
>>
>> specifically in
>>
>>     [1] PetscCommDuplicate() line 126 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/objects/tagm.c
>>
>> The circumstances that lead to this failure are really fragile and it seems likely due to some memory corruption. Particularly likely given that I can make the failure go away by removing seemingly irrelevant things like
>>
>>     >>> from scipy.stats.mstats import argstoarray
>>
>> Note that when I run the full test suite after taking out this scipy import, the same problem just arises elsewhere without any obvious similar import trigger.
>>
>> Running with `-malloc_debug true` doesn’t illuminate anything.
>>
>> I’ve run with `-info` and `-log_trace` and don’t see any obvious issues, but there’s a ton of output.
>>
>> I have tried reducing things to a minimal reproducible example, but unfortunately things remain way too complicated and idiosyncratic to FiPy. I’m grateful for any help anybody can offer despite the mess that I’m offering.
>
> My crystal ball guess is the following:
>
> PETSc objects have collective destroy semantics.
>
> When using petsc4py, XXX.destroy() is called on an object when its Python refcount drops to zero, or when it is collected by the generational garbage collector.
>
> In the absence of reference cycles, all allocated objects will be collected by the refcounting part of the collector. This is (unless you do something funky like hold more references on one process than another) deterministic, and if you do normal SPMD programming, you’ll call XXX.destroy() in the same order on the same objects on all processes.
>
> If you have reference cycles, then the refcounting part of the collector will not collect these objects. Now you are at the mercy of the generational collector. This is definitely not deterministic. If different Python processes do different things (for example, rank 0 might open files), then when the generational collector runs is no longer in sync across processes.
>
> A consequence is that you now might have rank 0 collect XXX then YYY, whereas rank 1 might collect YYY then XXX => deadlock.
>
> You can test this hypothesis by turning off the garbage collector in your test that provokes the failure:
>
>     import gc
>     gc.disable()
>     ...
>
> If this turns out to be the case, I don't think there's a good solution here. You can audit your code base and ensure that objects that hold PETSc objects never participate in reference cycles. This is fragile.
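Lawrence's mechanism can be seen in miniature with plain Python objects. The following is a sketch, not petsc4py code: `FakePETScObj` and its `destroyed` log are invented, standing in for petsc4py objects whose real `destroy()` call is collective across ranks.

```python
import gc

destroyed = []

class FakePETScObj:
    """Invented stand-in for a petsc4py object with a collective destroy."""
    def __init__(self, name):
        self.name = name
    def __del__(self):
        destroyed.append(self.name)  # real code would call self.destroy()

# No cycle: dropping the last reference finalizes the object immediately
# (pure refcounting), in the same order on every rank running the same code.
x = FakePETScObj("x")
del x                      # "x" is recorded right here

# Cycle: the object survives `del` and is only reclaimed when the
# generational collector runs -- a point that can differ across ranks.
y = FakePETScObj("y")
y.self_ref = y
del y                      # "y" is NOT recorded yet
gc.collect()               # now the cycle is found and "y" is finalized
print(destroyed)           # -> ['x', 'y']
```

If two ranks reach that `gc.collect()`-equivalent moment at different times, or with differently ordered garbage, the collective destroys interleave differently and the program deadlocks.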
>
> Another option is to explicitly require that the user of the API call XXX.destroy() on all your objects (and thence on the PETSc objects they hold). This is the decision taken for mpi4py: you are responsible for freeing any objects that you create.
>
> That is, your API becomes more like the C API, with
>
>     x = Foo(...)  # holds some petsc object XX
>     ...           # use x
>     x.destroy()   # calls XX.destroy()
>
> You could make this more pythonic by wrapping this pattern in context managers:
>
>     with Foo(...) as x:
>         ...
>
> Thanks,
>
> Lawrence
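The context-manager variant Lawrence suggests can be sketched as follows. `Foo` here is hypothetical and its PETSc internals are stubbed out; `destroy()` stands in for the collective PETSc destroy call.

```python
class Foo:
    """Hypothetical wrapper that owns a PETSc object (stubbed out here)."""
    def __init__(self):
        self.alive = True   # stand-in for holding a live PETSc handle

    def destroy(self):
        # In real code this would call the collective XX.destroy().
        self.alive = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()      # deterministic: same point on every rank
        return False        # do not swallow exceptions

with Foo() as x:
    print(x.alive)   # True: object is valid inside the block
print(x.alive)       # False: destroyed on exit, independent of the GC
```

The destroy now happens at a textually visible point in the program, so all ranks executing the same code free objects in the same order regardless of what the garbage collector is doing.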
