Jed,

> From: Jed Brown [mailto:j...@jedbrown.org]
> Sent: Friday, May 3, 2019 12:41
> You linked to a NumPy discussion
> (https://github.com/numpy/numpy/issues/11826) that is encountering the same
> issues, but proposing solutions based on the global environment.
> That is perhaps acceptable for typical Python callers due to the GIL, but C++
> callers may be using threads themselves. A typical example:
>
> App:
>   calls libB sequentially:
>     calls Arrow sequentially (wants to use threads)
>   calls libC sequentially:
>     omp parallel (creates threads somehow):
>       calls Arrow from threads (Arrow should not create more)
>   omp parallel:
>     calls libD from threads:
>       calls Arrow (Arrow should not create more)

That's not a correct assumption about Python. The GIL synchronizes the Python interpreter's state, i.e. its C-API data structures. When Python calls into a C extension like NumPy, the extension is not restricted from doing its own internal parallelism (which is what OpenBLAS and MKL do). Moreover, NumPy and other libraries usually release the GIL before entering a long compute region, which allows a concurrent thread to start another compute region in parallel. So there is not much difference between Python and C++ in what you can get in terms of nested parallelism (the difference is in overheads and scalability).

If there is app-level parallelism (as with your libD) and/or other nesting (as in your libC), which can be implemented e.g. with Dask, NumPy will still create a parallel region inside each call from the outermost thread or process (Python and Dask support both). And this is exactly the problem I'm solving and the reason I started this discussion, so thanks for sharing my concerns.
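To make the oversubscription problem concrete, here is a minimal pure-Python sketch (all names below are hypothetical, chosen for illustration): the app's outer threads each call a "library" that, like an uncoordinated NumPy/OpenBLAS stack, creates its own internal thread pool on every call, so the number of live threads multiplies.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

OUTER = 4   # app-level threads (e.g. Dask workers, or an omp parallel region)
INNER = 4   # threads each library call creates internally

peak_threads = 0
_lock = threading.Lock()

def record_peak():
    # Track the peak number of live threads in the process.
    global peak_threads
    with _lock:
        peak_threads = max(peak_threads, threading.active_count())

def library_call(x):
    # A library that creates its own internal pool for every call,
    # unaware that the caller is already running in parallel.
    def task(i):
        record_peak()
        return x + i
    with ThreadPoolExecutor(max_workers=INNER) as inner:
        return sum(inner.map(task, range(INNER)))

# The app's own parallel region: each outer thread calls the library.
with ThreadPoolExecutor(max_workers=OUTER) as outer:
    results = list(outer.map(library_call, range(OUTER)))

print(results)        # each call computes x*INNER + sum(range(INNER))
print(peak_threads)   # typically well above INNER: the outer x inner blow-up
```

Nothing here coordinates the two levels, so the process can end up with roughly OUTER x INNER compute threads instead of one thread per core.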
For more information, please refer to my SciPy 2017 talk and the later paper where we introduced three approaches to the problem (TBB, settings orchestration, an OpenMP extension):
http://conference.scipy.org/proceedings/scipy2018/pdfs/anton_malakhov.pdf

> Arrow doesn't need to know the difference between the libC and libD cases,
> but it may make a difference to the implementation of those libraries. In
> both of these cases, the user may desire that Arrow create tasks for load
> balancing reasons (but no new threads) so long as they can run on the
> specified thread team.

Exactly, tasks are one way to solve it. This is what TBB does as a good first approximation of a solution: a global task scheduler, no mandatory threads or parallel regions, and wide adoption in numeric libraries (MKL, DAAL, Numba, soon PyTorch and others). That's the first step I'm proposing. Though we know from past experience that it is still not sufficient, because NUMA effects are not accounted for: tasks are distributed randomly. That's where other threading-layer implementations can work better in some cases, and where a more elaborate NUMA-aware TBB-based implementation is needed.

> Global solutions like this one (linked by Antoine)
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread-pool.cc#L268
>
> imply that threading mode is global and set via an environment variable,
> neither of which are true in cases such as the above (and many simpler
> cases).

Right, I wrote about the problem with this implementation in the proposal. First of all, we should not mimic OpenMP for something completely irrelevant: it causes confusion and is hard to control in more complex cases.

Regards,
// Anton
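P.S. To make the global-task-scheduler idea concrete, here is a minimal pure-Python sketch (the helper names are made up for illustration; TBB does this far more efficiently in C++): a single process-wide pool is shared by the app and every library, and a nested call detected on a worker thread runs inline instead of creating more threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One process-wide scheduler shared by the app and all libraries, instead of
# each library spinning up its own pool.
_POOL = ThreadPoolExecutor(max_workers=4)
_IN_WORKER = threading.local()

def run_tasks(fn, items):
    # Libraries call this instead of creating threads. If we are already on
    # a scheduler worker (a nested call), execute inline: no new threads.
    if getattr(_IN_WORKER, "flag", False):
        return [fn(x) for x in items]
    def wrapper(x):
        _IN_WORKER.flag = True
        try:
            return fn(x)
        finally:
            _IN_WORKER.flag = False
    return list(_POOL.map(wrapper, items))

def library_kernel(x):
    # A "library" that wants internal parallelism: it submits tasks to the
    # shared scheduler rather than spawning its own pool per call.
    return sum(run_tasks(lambda i: x * i, range(4)))

# App-level parallelism on the same scheduler: the nested library calls run
# inline on existing workers instead of oversubscribing the machine.
results = run_tasks(library_kernel, range(3))
print(results)
```

This is only the "first approximation" from above: tasks load-balance on a fixed thread budget, but, as noted, nothing here is NUMA-aware.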