*gc.disable()

> On Oct 28, 2020, at 1:32 PM, Jonathan Guyer <[email protected]> wrote:
> 
> That’s very helpful, thanks!
> 
> Adding `gc.collect()` to the beginning of the offending test does indeed 
> resolve that particular problem.
> 
> I’ve not been systematic about calling XXX.destroy(), thinking garbage 
> collection was sufficient, so I need to get to work on that.
> 
>> On Oct 28, 2020, at 1:21 PM, Lawrence Mitchell <[email protected]> wrote:
>> 
>> 
>>> On 28 Oct 2020, at 16:35, Guyer, Jonathan E. Dr. (Fed) via petsc-users 
>>> <[email protected]> wrote:
>>> 
>>> We use petsc4py as a solver suite in our 
>>> [FiPy](https://www.ctcms.nist.gov/fipy) Python-based PDE solver package. 
>>> Some time back, I refactored some of the code and provoked a deadlock 
>>> situation in our test suite. I have been tearing what remains of my hair 
>>> out trying to isolate things and am at a loss. I’ve gone through the 
>>> refactoring line-by-line and I just don’t think I’ve changed anything 
>>> substantive, just how the code is organized.
>>> 
>>> I have posted a branch that exhibits the issue at 
>>> https://github.com/usnistgov/fipy/pull/761
>>> 
>>> I explain in greater detail in that “pull request” how to reproduce, but in 
>>> short, after a substantial number of our tests run, the code either 
>>> deadlocks or raises exceptions:
>>> 
>>> On processor 0 in 
>>> 
>>> matrix.setUp() 
>>> 
>>> specifically in
>>> 
>>> [0] PetscSplitOwnership() line 93 in 
>>> /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/utils/psplit.c
>>> 
>>> and on other processors a few lines earlier in
>>> 
>>> matrix.create(comm)
>>> 
>>> specifically in 
>>> 
>>> [1] PetscCommDuplicate() line 126 in 
>>> /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/objects/tagm.c
>>> 
>>> 
>>> The circumstances that lead to this failure are really fragile, and it seems 
>>> likely to be due to some kind of memory corruption, particularly given that I 
>>> can make the failure go away by removing seemingly irrelevant things like
>>> 
>>>>>> from scipy.stats.mstats import argstoarray
>>> 
>>> Note that when I run the full test suite after taking out this scipy 
>>> import, the same problem just arises elsewhere without any obvious similar 
>>> import trigger.
>>> 
>>> Running with `-malloc_debug true` doesn’t illuminate anything.
>>> 
>>> I’ve run with `-info` and `-log_trace` and don’t see any obvious issues, 
>>> but there’s a ton of output.
>>> 
>>> 
>>> 
>>> I have tried reducing things to a minimal reproducible example, but 
>>> unfortunately things remain way too complicated and idiosyncratic to FiPy. 
>>> I’m grateful for any help anybody can offer despite the mess that I’m 
>>> offering.
>> 
>> My crystal ball guess is the following:
>> 
>> PETSc objects have collective destroy semantics.
>> 
>> When using petsc4py, XXX.destroy() is called on an object when its Python 
>> refcount drops to zero, or when it is collected by the generational garbage 
>> collector.
>> 
>> In the absence of reference cycles, all allocated objects will be collected 
>> by the refcounting part of the collector. This is (unless you do something 
>> funky like hold more references on one process than another) deterministic, 
>> and if you do normal SPMD programming, you'll call XXX.destroy() in the same 
>> order on the same objects on all processes.
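>> 
>> For example (the function here is made up for illustration; only the 
>> petsc4py calls are real):
>> 
>> from petsc4py import PETSc
>> 
>> def assemble(comm):
>>     A = PETSc.Mat().create(comm)   # collective create on every rank
>>     A.setSizes((10, 10))
>>     A.setUp()
>>     # ... assemble and use A ...
>>     # A's refcount drops to zero when this function returns, at the same
>>     # point in the program on every rank, so the implicit A.destroy() is
>>     # still collective and deterministic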
>> 
>> If you have reference cycles, then the refcounting part of the collector 
>> will not collect these objects. Now you are at the mercy of the generational 
>> collector. This is definitely not deterministic. If different Python 
>> processes do different things (for example, rank 0 might open files) then 
>> when the generational collector runs is no longer in sync across processes.
>> 
>> A consequence is that you now might have rank 0 collect XXX then YYY, 
>> whereas rank 1 might collect YYY then XXX => deadlock.
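>> 
>> A contrived sketch of the kind of cycle I mean (the Solver class is made 
>> up; the petsc4py calls are real):
>> 
>> from petsc4py import PETSc
>> 
>> class Solver(object):
>>     def __init__(self, comm):
>>         self.matrix = PETSc.Mat().create(comm)
>>         self.owner = self            # reference cycle through __dict__
>> 
>> s = Solver(PETSc.COMM_WORLD)
>> del s                                # matrix is NOT destroyed here; it waits
>>                                      # for the generational collector, which
>>                                      # may run at different times (and in a
>>                                      # different order) on different ranks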
>> 
>> You can test this hypothesis by turning off the garbage collector in your 
>> test that provokes the failure:
>> 
>> import gc
>> gc.disable()
>> ...
>> 
>> If this turns out to be the case, I don't think there's a good solution 
>> here. You can audit your code base and ensure that objects that hold PETSc 
>> objects never participate in reference cycles. This is fragile.
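>> 
>> For instance, if the cycle comes from a child object holding a back-reference 
>> to its parent, you can often break it with a weakref (these class names are 
>> made up, only to illustrate the pattern):
>> 
>> import weakref
>> from petsc4py import PETSc
>> 
>> class Parent(object):
>>     def __init__(self, comm):
>>         self.matrix = PETSc.Mat().create(comm)
>>         self.child = Child(self)
>> 
>> class Child(object):
>>     def __init__(self, parent):
>>         self.parent = weakref.proxy(parent)   # no strong cycle, so plain
>>                                               # refcounting destroys Parent
>>                                               # (and its matrix) deterministically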
>> 
>> Another option is to explicitly require that the user of the API call 
>> XXX.destroy() on all of your objects (which then destroy the PETSc objects 
>> they hold). This is the decision taken by mpi4py: you are responsible for 
>> freeing any objects that you create.
>> 
>> That is, your API becomes more like the C API with
>> 
>> x = Foo(...) # holds some petsc object XX
>> ... # use x
>> x.destroy() # calls XX.destroy()
>> 
>> You could make this more Pythonic by wrapping this pattern in a context 
>> manager:
>> 
>> with Foo(...) as x:
>>  ...
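>> 
>> where Foo implements the context-manager protocol, something along these 
>> lines (a sketch only, reusing the made-up Foo from above):
>> 
>> from petsc4py import PETSc
>> 
>> class Foo(object):
>>     def __init__(self, comm):
>>         self.mat = PETSc.Mat().create(comm)   # the wrapped PETSc object
>>     def destroy(self):
>>         self.mat.destroy()                    # collective destroy
>>     def __enter__(self):
>>         return self
>>     def __exit__(self, *exc):
>>         self.destroy()                        # runs at the same point on all ranks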
>> 
>> 
>> Thanks,
>> 
>> Lawrence
> 
