"Malakhov, Anton" <anton.malak...@intel.com> writes:

> Jed,
>
>> From: Jed Brown [mailto:j...@jedbrown.org]
>> Sent: Friday, May 3, 2019 12:41
>
>> You linked to a NumPy discussion
>> (https://github.com/numpy/numpy/issues/11826) that is encountering the same
>> issues, but proposing solutions based on the global environment.
>> That is perhaps acceptable for typical Python callers due to the GIL, but C++
>> callers may be using threads themselves.  A typical example:
>> 
>> App:
>>   calls libB sequentially:
>>     calls Arrow sequentially (wants to use threads)
>>   calls libC sequentially:
>>     omp parallel (creates threads somehow):
>>       calls Arrow from threads (Arrow should not create more)
>>   omp parallel:
>>     calls libD from threads:
>>       calls Arrow (Arrow should not create more)
>
> That's not a correct assumption about Python. The GIL is used for
> synchronization of the Python interpreter's state and its C-API data
> structures. When Python calls a C extension like Numpy, the latter is
> not restricted from doing its own internal parallelism (which is what
> OpenBLAS and MKL do). Moreover, Numpy and other libraries usually
> release the GIL before entering a long compute region, which allows a
> concurrent thread to start a compute region in parallel.

Thanks, I wasn't aware under what conditions NumPy (or other callers)
would release the GIL.
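
If that's the case, I assume even plain Python threads can get real
concurrency out of NumPy calls, along the lines of this minimal sketch
(a hypothetical example of my own, not something from your message; the
GIL is dropped inside the BLAS call, so both threads compute at once):

  # Minimal sketch (my assumption of the pattern being described):
  # two Python threads each call into NumPy; np.dot() releases the GIL
  # for the duration of the underlying BLAS call, so the two compute
  # regions can run concurrently despite the interpreter lock.
  import threading
  import numpy as np

  def work():
      a = np.random.rand(2000, 2000)
      b = np.random.rand(2000, 2000)
      np.dot(a, b)  # GIL released while BLAS runs

  threads = [threading.Thread(target=work) for _ in range(2)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()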

> So, there is not much difference between Python and C++ in terms of
> what you can get with nested parallelism (the difference is in
> overheads and scalability). If there is app-level parallelism (like
> for libD) and/or other nesting (like in your libC), which can be
> implemented e.g. with Dask, Numpy will still create a parallel region
> inside for each call from an outermost thread or process (Python and
> Dask support both). And this is exactly the problem I'm solving;
> that's the reason I started this discussion, so thanks for sharing my
> concerns. For more information, please refer to my Scipy2017 talk and
> the later paper where we introduced three approaches to the problem
> (TBB, settings orchestration, an OpenMP extension):
> http://conference.scipy.org/proceedings/scipy2018/pdfs/anton_malakhov.pdf
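
If I follow, a stripped-down version of that nesting looks roughly like
the sketch below (hypothetical example: a plain ThreadPoolExecutor
standing in for Dask or an application-level parallel region, with a
threaded BLAS providing the inner one):

  # Hypothetical sketch of the oversubscription problem: P outer
  # threads each call NumPy, whose BLAS backend (MKL/OpenBLAS) opens
  # its own parallel region of T threads, so roughly P*T software
  # threads end up competing for the cores.
  from concurrent.futures import ThreadPoolExecutor
  import numpy as np

  def task(_):
      a = np.random.rand(1500, 1500)
      return np.dot(a, a)  # inner parallel region created per call

  with ThreadPoolExecutor(max_workers=8) as pool:  # outer parallelism
      results = list(pool.map(task, range(8)))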

Nice paper, thanks!  Did you investigate the latency impact of the IPC
counting semaphore?  Is your test code available?
