You could try faulthandler.dump_traceback_later
https://docs.python.org/3/library/faulthandler.html#faulthandler.dump_traceback_later
to get a traceback of all the threads of the main process when it hangs.
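
A minimal sketch of how that could look, assuming the hang happens inside
a single call you can wrap. The `traceback_watchdog` helper and the
timeout values are made up for illustration; only
`faulthandler.dump_traceback_later` and
`faulthandler.cancel_dump_traceback_later` come from the standard library.

```python
import contextlib
import faulthandler
import sys
import time


@contextlib.contextmanager
def traceback_watchdog(timeout, file=None):
    """Dump the tracebacks of all threads if the block runs longer than `timeout` seconds.

    Hypothetical helper, not part of faulthandler itself.
    """
    # faulthandler writes to the file descriptor directly, so `file` must be
    # a real file and must stay open until the watchdog is cancelled.
    if file is None:
        file = sys.stderr
    faulthandler.dump_traceback_later(timeout, file=file)
    try:
        yield
    finally:
        # Disarm the timer so it cannot fire after the block has finished.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    with traceback_watchdog(timeout=1.0):
        # Stand-in for the call that hangs, e.g. estimator.fit(X, y).
        time.sleep(2)
```

Under pytest, stderr is captured by default, so it may be simpler to pass
an explicit log file, or to rely on pytest's own `faulthandler_timeout`
ini option, which arms the same mechanism for each test.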

But the fact that you are using the default
`p = multiprocessing.Pool()` makes me think that it might be related to
the lack of fork-safety of the OpenMP runtime library of GCC (libgomp)
[1]. There are several ways to check this or to work around it:

- print the output of threadpoolctl.threadpool_info() before calling
the code that freezes to confirm (or not) that the libgomp runtime has
been loaded before creating the MP Pool.
- use a multiprocessing Pool with a forkserver context instead of the
default fork context: multiprocessing.get_context("forkserver").Pool()
- alternatively, use loky.get_reusable_executor() instead of
multiprocessing.Pool() (with a slightly different API)
- alternatively, use joblib, which uses loky internally and has an API
that differs even more.
- alternatively, recompile scikit-learn from source with clang instead
of gcc so as to link scikit-learn to llvm-openmp instead of gcc's
libgomp runtime. llvm-openmp is fork-safe.
- alternatively, install scikit-learn from conda-forge (conda install
-c conda-forge scikit-learn) as the conda-forge distribution relinks
all OpenMP compiled extensions of its packaged libraries to
llvm-openmp transparently at install time, even if they were built
with GCC (maybe we should do that for our linux wheels).
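
The first check in the list could be sketched as follows. This assumes
the third-party threadpoolctl package is installed; `openmp_runtimes` is
a hypothetical helper that only filters the list of dicts returned by
`threadpool_info()` on its "user_api" and "prefix" entries.

```python
def openmp_runtimes(info):
    """Return the library prefixes of the OpenMP runtimes found in `info`.

    `info` is expected to look like the result of
    threadpoolctl.threadpool_info(): a list of dicts, one per loaded
    thread-pool library.
    """
    return [lib.get("prefix") for lib in info if lib.get("user_api") == "openmp"]


if __name__ == "__main__":
    try:
        from threadpoolctl import threadpool_info  # third-party package
    except ImportError:
        print("threadpoolctl is not installed")
    else:
        # Call this right before the Pool is created in the code that hangs.
        # A "libgomp" entry means GCC's OpenMP runtime is already loaded in
        # the parent process at that point.
        print(openmp_runtimes(threadpool_info()))
```

If the printed list contains "libgomp" under pytest but not under plain
python, that would be consistent with the fork-safety explanation, though
it does not prove it.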

[1] https://gcc.gnu.org/legacy-ml/gcc-patches/2014-02/msg00979.html
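
For the forkserver suggestion above, a minimal self-contained sketch
could look like this. The `square` and `pick_start_method` names are
placeholders, and the fallback to "spawn" on platforms without forkserver
is my assumption, not something the original code needs.

```python
import multiprocessing


def square(x):
    # Must live at module level so that worker processes can import it.
    return x * x


def pick_start_method(preferred=("forkserver", "spawn")):
    """Return the first start method in `preferred` that this platform supports."""
    available = multiprocessing.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    raise RuntimeError(f"none of {preferred} is available on this platform")


if __name__ == "__main__":
    # Workers are started from a clean server process instead of being
    # forked from the parent, so they do not inherit the state of an OpenMP
    # runtime already initialised in the parent.
    ctx = multiprocessing.get_context(pick_start_method())
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, range(5)))
```

Unlike the default fork context on Linux, forkserver requires the mapped
function and its arguments to be picklable, and the pool should be
created under an `if __name__ == "__main__":` guard or inside a function.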

If that does not work or you need more help, please feel free to open an
issue with a minimal reproducer and ping me on gitter or discord.

On Thu, Dec 9, 2021 at 05:59, Norbert Preining <norb...@preining.info> wrote:
>
> Dear all,
>
> I am trying to track down a strange behaviour in one of our (Fujitsu)
> libraries that we are planning to open source. In preparation for that,
> I am trying to bring it into a state that works with scikit-learn >= 1.
>
> However, some of our tests fail when running in parallel mode, and they
> only fail when running under pytest, NOT when running under python.
>
> The library code contains
>
>         def fit(self, X, y=None):
>             ...
>             p = multiprocessing.Pool()
>             ret = _reduce(
>                 p.map(....))
>
> Now what happens is that with scikit-learn 1(.0.1), the code hangs
> forever. I adjusted the code also so that the pool definition is not in
> the fit function, but in the __init__ function, and saved into self, but
> that didn't help either.
>
> When interrupted, pytest gives:
>
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> /home/norbert/.pyenv/versions/3.9.6/lib/python3.9/threading.py:312: KeyboardInterrupt
> (to show a full traceback on KeyboardInterrupt use --full-trace)
> ================================================ 1 passed, 2 warnings in 273.84s (0:04:33) =================================================
> Exception ignored in: <function Pool.__del__ at 0x7ff72f31b9d0>
> Traceback (most recent call last):
>   File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
>     self._change_notifier.put(None)
>   File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/queues.py", line 378, in put
>     self._writer.send_bytes(obj)
>   File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
>     self._send_bytes(m[offset:offset + size])
>   File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
>     self._send(header + buf)
>   File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/connection.py", line 373, in _send
>     n = write(self._handle, buf)
>
>
> When running under python testfile.py, however, all goes well.
>
>
> I have tested the following combinations:
> * scikit-learn 0.23.*, python 3.8 and python 3.9 => works
> * scikit-learn 0.24.*, python 3.8 and python 3.9 => works
> * scikit-learn 1.0.1,  python 3.8 and python 3.9 => fails
>
> I don't really understand where scikit-learn comes into play here,
> so I wanted to ask whether someone here has an idea.
>
> Thanks for any suggestion
>
>
> Norbert
>
> --
> PREINING Norbert                              https://www.preining.info
> Fujitsu Research  +  IFMGA Guide  +  TU Wien  +  TeX Live  + Debian Dev
> GPG: 0x860CDC13   fp: F7D8 A928 26E3 16A1 9FA0 ACF0 6CAC A448 860C DC13
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn



-- 
Olivier
