You could try faulthandler.dump_traceback_later (https://docs.python.org/3/library/faulthandler.html#faulthandler.dump_traceback_later) to get a traceback of all the threads of the main process once the hang has lasted longer than a chosen timeout.
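Here is a minimal sketch of how that could look. The helper name run_with_watchdog and the stand-in function slow_call are made up for illustration (they are not part of your library or of scikit-learn); only the faulthandler calls are real standard library API. In your case the wrapped call would be the fit(...) that hangs.

```python
import faulthandler
import sys
import time


def run_with_watchdog(func, timeout=60.0, file=None):
    """Call func(); if it is still running after `timeout` seconds, dump
    the traceback of every thread to `file` without killing the process."""
    if file is None:
        file = sys.stderr
    # Arm the watchdog: it fires once, after `timeout` seconds.
    faulthandler.dump_traceback_later(timeout, exit=False, file=file)
    try:
        return func()
    finally:
        # Disarm the watchdog once the call returns or raises.
        faulthandler.cancel_dump_traceback_later()


def slow_call():
    # Hypothetical stand-in for the estimator.fit(...) call that hangs.
    time.sleep(0.5)
    return "done"


if __name__ == "__main__":
    # The timeout is shorter than the call, so a traceback is dumped to stderr.
    print(run_with_watchdog(slow_call, timeout=0.1))
```

The dump only covers the threads of the process that armed the watchdog, not the Pool workers. If I remember correctly, pytest also ships a faulthandler plugin with a faulthandler_timeout ini option (for instance pytest -o faulthandler_timeout=60) that does the same thing per test.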
But the fact that you are using the default `p = multiprocessing.Pool()` makes me think that it might be related to the lack of fork-safety of the OpenMP runtime library of GCC (libgomp) [1].

There are several ways to check this:

- print the output of threadpoolctl.threadpool_info() before calling the code that freezes, to confirm (or not) that the libgomp runtime has been loaded before creating the multiprocessing Pool.
- use a multiprocessing Pool with a forkserver context instead of the default fork context: multiprocessing.get_context("forkserver").Pool()
- alternatively, use loky.get_reusable_executor() instead of multiprocessing.Pool() (with a slightly different API).
- alternatively, use joblib, which uses loky internally, with an API that differs even more.
- alternatively, recompile scikit-learn from source with clang instead of gcc so as to link scikit-learn to llvm-openmp instead of gcc's libgomp runtime. llvm-openmp is fork-safe.
- alternatively, install scikit-learn from conda-forge (conda install -c conda-forge scikit-learn), as the conda-forge distribution relinks all OpenMP compiled extensions of its packaged libraries to llvm-openmp transparently at install time, even if they were built with GCC (maybe we should do that for our linux wheels).

[1] https://gcc.gnu.org/legacy-ml/gcc-patches/2014-02/msg00979.html

If that does not work or you need more help, please feel free to open an issue with a minimal reproducer and ping me on gitter or discord.

On Thu, Dec 9, 2021 at 05:59, Norbert Preining <norb...@preining.info> wrote:
>
> Dear all,
>
> I am trying to track down a strange behaviour in one of our (Fujitsu)
> library we are planning to open source. In preparation for that, I am
> trying to bring it into a state that it works with scikit-learn >= 1.
>
> But, some of our tests fail when running in parallel mode. But they
> only fail when running under pytest, but NOT when running under python.
>
> The library code contains
>
>     def fit(self, X, y=None):
>         ...
>         p = multiprocessing.Pool()
>         ret = _reduce(
>             p.map(....))
>
> Now what happens is that with scikit-learn 1(.0.1), the code hangs
> forever. I adjusted the code also so that the pool definition is not in
> the fit function, but in the __init__ function, and saved into self, but
> that didn't help either.
>
> When interrupted, pytest gives:
>
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> KeyboardInterrupt
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> /home/norbert/.pyenv/versions/3.9.6/lib/python3.9/threading.py:312: KeyboardInterrupt
> (to show a full traceback on KeyboardInterrupt use --full-trace)
> ================================================ 1 passed, 2 warnings in 273.84s (0:04:33) =================================================
> Exception ignored in: <function Pool.__del__ at 0x7ff72f31b9d0>
> Traceback (most recent call last):
>   File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
>     self._change_notifier.put(None)
>   File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/queues.py", line 378, in put
>     self._writer.send_bytes(obj)
>   File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
>     self._send_bytes(m[offset:offset + size])
>   File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
>     self._send(header + buf)
>   File "/home/norbert/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/connection.py", line 373, in _send
>     n = write(self._handle, buf)
>
>
> While when running under python testfile.py all goes well.
>
>
> I have tested the following combinations:
> * scikit-learn 0.23.*, python 3.8 and python 3.9 => works
> * scikit-learn 0.24.*, python 3.8 and python 3.9 => works
> * scikit-learn 1.0.1, python 3.8 and python 3.9 => fails
>
> I don't really understand where scikit-learn comes into the play here,
> so I wanted to ask whether someone here has an idea.
>
> Thanks for any suggestion
>
>
> Norbert
>
> --
> PREINING Norbert https://www.preining.info
> Fujitsu Research + IFMGA Guide + TU Wien + TeX Live + Debian Dev
> GPG: 0x860CDC13 fp: F7D8 A928 26E3 16A1 9FA0 ACF0 6CAC A448 860C DC13
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Olivier