Anton, per your comment: > Sounds like a good way to go! We'll create a demo, as you suggested, > implementing a parallel execution model for a simple analytics pipeline that > reads and processes the files. My only concern is about adding more pipeline > breaker nodes and compute intensive operations into this demo because min/max > are effectively no-ops fused into I/O scan node. What do you think about > adding group-by into this picture, effectively implementing NY taxi and/or > mortgage benchmarks? Ideally, I'd like to go even further and add sci-kit > learn like stuff for processing that data in order to demonstrate the > co-existence side of the story. What do you think? So, the idea of the > prototype will be to validate the parallel execution model as the first step. > After that, it'll help to shape API for both - execution nodes and the > threading backend. Does it sound right to you?
The question is whether you want to spend at least a month or more of intense development on something else (a basic query engine, as we've been discussing in [1]) before we are able to develop consensus about the approach to threading. Personally, I would not make this choice given that there are good options available to move along the discussion of the parallelism API. I think Antoine's point about the CSV reader as an example of a non-trivial processing pipeline that also includes IO operations. - Wes [1]: https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing On Fri, May 3, 2019 at 12:41 PM Jed Brown <j...@jedbrown.org> wrote: > > "Malakhov, Anton" <anton.malak...@intel.com> writes: > > >> > the library creates threads internally. It's a disaster for managing > >> > oversubscription and affinity issues among groups of threads and/or > >> > multiple processes (e.g., MPI). > > > > This is exactly what I'm talking about referring as issues with threading > > composability! OpenMP is not easy to have inside a library. I described it > > in this document: > > https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine > > Thanks for this document. I'm no great fan of OpenMP, but it's being > billed by most vendors (especially Intel) as the way to go in the > scientific computing space and has become relatively popular (much more > so than TBB). > > You linked to a NumPy discussion > (https://github.com/numpy/numpy/issues/11826) that is encountering the > same issues, but proposing solutions based on the global environment. > That is perhaps acceptable for typical Python callers due to the GIL, > but C++ callers may be using threads themselves. A typical example: > > App: > calls libB sequentially: > calls Arrow sequentially (wants to use threads) > calls libC sequentially: > omp parallel (creates threads somehow): > calls Arrow from threads (Arrow should not create more) > omp parallel: > calls libD from threads: > calls Arrow (Arrow should not create more) > > Arrow doesn't need to know the difference between the libC and libD > cases, but it may make a difference to the implementation of those > libraries. In both of these cases, the user may desire that Arrow > create tasks for load balancing reasons (but no new threads) so long as > they can run on the specified thread team. > > I have yet to see a complete solution to this problem, but we should > work out which modes are worth supporting and how that interface would > look. > > > Global solutions like this one (linked by Antoine) > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread-pool.cc#L268 > > imply that threading mode is global and set via an environment variable, > neither of which are true in cases such as the above (and many simpler > cases).