Anton, per your comment:

> Sounds like a good way to go! We'll create a demo, as you suggested, 
> implementing a parallel execution model for a simple analytics pipeline that 
> reads and processes the files. My only concern is about adding more pipeline 
> breaker nodes and compute-intensive operations into this demo because min/max 
> are effectively no-ops fused into the I/O scan node. What do you think about 
> adding group-by into this picture, effectively implementing the NY taxi and/or 
> mortgage benchmarks? Ideally, I'd like to go even further and add 
> scikit-learn-like processing of that data in order to demonstrate the 
> co-existence side of the story. What do you think? So, the idea of the 
> prototype will be to validate the parallel execution model as the first step. 
> After that, it'll help to shape the API for both execution nodes and the 
> threading backend. Does that sound right to you?

The question is whether you want to spend at least a month or more of
intense development on something else (a basic query engine, as we've
been discussing in [1]) before we are able to develop consensus about
the approach to threading. Personally, I would not make this choice
given that there are good options available to move along the
discussion of the parallelism API. I think Antoine's point stands:
the CSV reader is a good example of a non-trivial processing pipeline
that also includes IO operations.

- Wes

[1]: 
https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing

On Fri, May 3, 2019 at 12:41 PM Jed Brown <j...@jedbrown.org> wrote:
>
> "Malakhov, Anton" <anton.malak...@intel.com> writes:
>
> >> > the library creates threads internally.  It's a disaster for managing
> >> > oversubscription and affinity issues among groups of threads and/or
> >> > multiple processes (e.g., MPI).
> >
> > This is exactly what I'm talking about when referring to issues with 
> > threading composability! OpenMP is not easy to use inside a library. I 
> > described it in this document: 
> > https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine
>
> Thanks for this document.  I'm no great fan of OpenMP, but it's being
> billed by most vendors (especially Intel) as the way to go in the
> scientific computing space and has become relatively popular (much more
> so than TBB).
>
> You linked to a NumPy discussion
> (https://github.com/numpy/numpy/issues/11826) that is encountering the
> same issues, but proposing solutions based on the global environment.
> That is perhaps acceptable for typical Python callers due to the GIL,
> but C++ callers may be using threads themselves.  A typical example:
>
> App:
>   calls libB sequentially:
>     calls Arrow sequentially (wants to use threads)
>   calls libC sequentially:
>     omp parallel (creates threads somehow):
>       calls Arrow from threads (Arrow should not create more)
>   omp parallel:
>     calls libD from threads:
>       calls Arrow (Arrow should not create more)
>
> Arrow doesn't need to know the difference between the libC and libD
> cases, but it may make a difference to the implementation of those
> libraries.  In both of these cases, the user may desire that Arrow
> create tasks for load balancing reasons (but no new threads) so long as
> they can run on the specified thread team.
>
> I have yet to see a complete solution to this problem, but we should
> work out which modes are worth supporting and how that interface would
> look.
>
>
> Global solutions like this one (linked by Antoine)
>
>   
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread-pool.cc#L268
>
> imply that threading mode is global and set via an environment variable,
> neither of which are true in cases such as the above (and many simpler
> cases).
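Jed's nesting example suggests one shape such an interface could take: the library accepts a caller-supplied executor and submits tasks to it, never deciding on its own to create threads. The sketch below is a hypothetical illustration only, not Arrow's actual API; the `Executor` interface and both implementations here are assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <numeric>
#include <thread>
#include <vector>

// Caller-provided execution context: the library submits tasks to it
// instead of creating threads itself.
struct Executor {
  virtual ~Executor() = default;
  virtual void Spawn(std::function<void()> task) = 0;
};

// Runs each task inline on the calling thread -- what the library should
// do when the caller is already inside a parallel region (the libC/libD
// cases above: tasks for load balancing, but no new threads).
struct SerialExecutor : Executor {
  void Spawn(std::function<void()> task) override { task(); }
};

// One thread per task -- a stand-in for a real thread pool, used only
// when the caller wants the library to parallelize (the libB case).
struct ThreadPerTaskExecutor : Executor {
  std::vector<std::thread> threads_;
  void Spawn(std::function<void()> task) override {
    threads_.emplace_back(std::move(task));
  }
  ~ThreadPerTaskExecutor() {
    for (auto& t : threads_) t.join();
  }
};

// A library kernel parameterized by the caller's executor: it still
// splits work into tasks, but where (and whether) those tasks run
// concurrently is entirely the caller's decision.
int ParallelSum(Executor& exec, const std::vector<int>& data,
                std::size_t nchunks) {
  std::vector<std::future<int>> parts;
  const std::size_t n = data.size();
  for (std::size_t i = 0; i < nchunks; ++i) {
    auto p = std::make_shared<std::promise<int>>();
    parts.push_back(p->get_future());
    const std::size_t begin = i * n / nchunks;
    const std::size_t end = (i + 1) * n / nchunks;
    exec.Spawn([p, &data, begin, end] {
      p->set_value(std::accumulate(
          data.begin() + static_cast<std::ptrdiff_t>(begin),
          data.begin() + static_cast<std::ptrdiff_t>(end), 0));
    });
  }
  int total = 0;
  for (auto& f : parts) total += f.get();  // blocks until all tasks finish
  return total;
}
```

Under this scheme a caller already inside an `omp parallel` region would pass the inline executor, so the library still produces tasks for load balancing without spawning threads, while a sequential caller passes a pooled executor; neither mode depends on a global setting or an environment variable.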
