We use petsc4py as a solver suite in our 
[FiPy](https://www.ctcms.nist.gov/fipy) Python-based PDE solver package. Some 
time back, I refactored some of the code and provoked a deadlock situation in 
our test suite. I have been tearing what remains of my hair out trying to 
isolate things and am at a loss. I’ve gone through the refactoring line by line 
and I don’t think I’ve changed anything substantive, only how the code is 
organized.

I have posted a branch that exhibits the issue at 
https://github.com/usnistgov/fipy/pull/761

I explain in greater detail in that “pull request” how to reproduce, but in 
short, after a substantial number of our tests run, the code either deadlocks 
or raises exceptions:

On processor 0 in

  matrix.setUp()

specifically in

  [0] PetscSplitOwnership() line 93 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/utils/psplit.c

and on other processors a few lines earlier in

  matrix.create(comm)

specifically in

  [1] PetscCommDuplicate() line 126 in /Users/runner/miniforge3/conda-bld/petsc_1601473259434/work/src/sys/objects/tagm.c


The circumstances that lead to this failure are really fragile, and memory 
corruption seems the likely cause, particularly since I can make the failure 
go away by removing seemingly irrelevant things like

    >>> from scipy.stats.mstats import argstoarray

Note that when I run the full test suite after taking out this scipy import, 
the same problem just arises elsewhere without any obvious similar import 
trigger.

Running with `-malloc_debug true` doesn’t illuminate anything.

I’ve run with `-info` and `-log_trace` and don’t see any obvious issues, but 
there’s a ton of output.
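
One way to wade through that output might be to group it by rank and compare the last few calls each rank reports before things stall. The following is only a rough sketch: it assumes each line of interest starts with a `[rank]` prefix followed by a function name and `()`, as in the traces above. The helper names (`calls_by_rank`, `last_calls`) and the sample lines are made up for illustration and are not real PETSc output.

```python
import re
from collections import defaultdict

# Assumed line shape: "[1] PetscCommDuplicate(): some message".
# An optional "<...>" tag between the rank and the function name is tolerated.
RANK_LINE = re.compile(r"^\[(\d+)\]\s+(?:<[^>]*>\s+)?(\w+)\(\)")


def calls_by_rank(lines):
    """Return {rank: [function names in order of appearance]}."""
    calls = defaultdict(list)
    for line in lines:
        match = RANK_LINE.match(line)
        if match:  # lines without a rank prefix are skipped
            calls[int(match.group(1))].append(match.group(2))
    return dict(calls)


def last_calls(lines, n=3):
    """Return the last n function names reported by each rank."""
    return {rank: names[-n:] for rank, names in calls_by_rank(lines).items()}


# Hypothetical sample in the assumed format (messages are placeholders).
sample = [
    "[0] PetscCommDuplicate(): placeholder message",
    "[1] PetscCommDuplicate(): placeholder message",
    "[0] PetscSplitOwnership(): placeholder message",
    "a line without a rank prefix",
]

for rank, names in sorted(last_calls(sample, n=1).items()):
    print(rank, names)
```

If the ranks stop agreeing on the sequence of collective calls, the point where the lists diverge would be the place to look.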



I have tried reducing things to a minimal reproducible example, but 
unfortunately things remain way too complicated and idiosyncratic to FiPy. I’m 
grateful for any help anybody can offer despite the mess that I’m offering.
