"Malakhov, Anton" <anton.malak...@intel.com> writes: > Jed, > >> From: Jed Brown [mailto:j...@jedbrown.org] >> Sent: Friday, May 3, 2019 12:41 > >> You linked to a NumPy discussion >> (https://github.com/numpy/numpy/issues/11826) that is encountering the same >> issues, but proposing solutions based on the global environment. >> That is perhaps acceptable for typical Python callers due to the GIL, but C++ >> callers may be using threads themselves. A typical example: >> >> App: >> calls libB sequentially: >> calls Arrow sequentially (wants to use threads) >> calls libC sequentially: >> omp parallel (creates threads somehow): >> calls Arrow from threads (Arrow should not create more) >> omp parallel: >> calls libD from threads: >> calls Arrow (Arrow should not create more) > > That's not correct assumption about Python. GIL is used for > synchronization of Python's interpreter state, its C-API data > structures. When Python calls a C extension like Numpy, the latter is > not restricted for doing its own internal parallelism (like what > OpenBLAS and MKL do). Moreover, Numpy and other libraries usually > release GIL before going into a long compute region, which allows a > concurrent thread to start a compute region in parallel.
Thanks, I wasn't aware under what conditions NumPy (or other callers)
would release the GIL.

> So, there is not much difference between Python and C++ in what you
> can get in terms of nested parallelism (the difference is in overheads
> and scalability). If there is app-level parallelism (like for libD)
> and/or other nesting (like in your libC), which can be implemented
> e.g. with Dask, NumPy will still create a parallel region inside for
> each call from the outermost thread or process (Python and Dask
> support both). And this is exactly the problem I'm solving; that's the
> reason I started this discussion, so thanks for sharing my concerns.
> For more information, please refer to my SciPy 2017 talk and the later
> paper where we introduced three approaches to the problem (TBB,
> settings orchestration, OpenMP extension):
> http://conference.scipy.org/proceedings/scipy2018/pdfs/anton_malakhov.pdf

Nice paper, thanks!  Did you investigate the latency impact of the IPC
counting semaphore?  Is your test code available?
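For concreteness, the per-call behavior my example above asks for
(Arrow not creating more threads when it is already inside a caller's
parallel region) could look roughly like the sketch below.  This is
only a sketch under that assumption; library_kernel() and
process_chunk() are hypothetical names, not Arrow's actual API.

    #include <omp.h>
    #include <cstddef>
    #include <vector>

    // Hypothetical per-element work.
    static void process_chunk(std::vector<double>& data, std::size_t i) {
      data[i] *= 2.0;
    }

    // A library entry point that parallelizes only when it is the
    // outermost caller.  Inside an enclosing OpenMP region (the libC/libD
    // cases) it runs serially in the calling thread instead of spawning
    // more threads.
    void library_kernel(std::vector<double>& data) {
      if (omp_in_parallel()) {
        for (std::size_t i = 0; i < data.size(); ++i)
          process_chunk(data, i);
      } else {
        #pragma omp parallel for
        for (long i = 0; i < (long)data.size(); ++i)
          process_chunk(data, (std::size_t)i);
      }
    }

That only covers the case where the caller's threads come from OpenMP;
coordinating thread counts across independent pools is the harder part,
which is what I understand your three approaches to be about.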