In light of ARROW-9924, I wanted to rekindle the discussion about our approach to multithreading (especially the _programming model_) in C++. We discussed this some 6 months ago, and there were more discussions, as I recall, in summer 2019.
Realistically, we are going to be consistently dealing with independent concurrent in-process workloads that can each go faster through multithreading. These could be things like:

* Reading file formats (CSV, Parquet, etc.) that benefit from multithreaded parsing/decoding
* Reading one or more files in parallel using the Datasets API
* Executing any number of multithreaded analytical workloads

One obvious issue with our thread scheduling is the FIFO nature of the global thread pool. If a new independent multithreaded workload shows up, it has to wait for other workloads to complete before the new work will be scheduled. Think about a Flight server serving queries to users -- is it fair for one query to "hog" the thread pool and force other requests to wait until they can get access to some CPU resources? You could imagine a workload that spawns 10 minutes' worth of CPU work, where a new workload has to wait for all of that work to complete before having any tasks scheduled for execution.

The approach that's been taken in the Datasets API to avoid problems with nested parallelism (file-specific operations spawning multiple tasks onto the global thread pool) is simply to disable multithreading at the level of a single file. This is clearly suboptimal.

We have an additional problem: some file-loading tasks do a mixture of CPU work and IO work, and once a thread has been dispatched to execute one of these tasks, a CPU core may sit underutilized while the task is waiting on IO.

There are more aspects we can discuss, but in general I think we need to come up with a programming model for building our C++ system components with the following requirements:

* Deadlocks are not possible by design
* Any component can safely use "nested parallelism" without the programmer having to worry about deadlocks or one task "hogging" the thread pool.
  So in other words, if there's only a single multithreading-capable workload running, we "let it rip"
* Resources can be reasonably fairly allocated amongst concurrent workloads (think: independent requests coming in through Flight, or scan tasks on different Parquet files in the Datasets API). Limit scenarios where a new workload is blocked altogether on the completion of other workloads
* A well-defined programming pattern for tasks that do a mixture of CPU work and IO work that allows CPU cores to be used while a task is waiting on IO

We can't be the only project that has these problems, so I'm interested to see what solutions have been successfully employed by others. For example, it strikes me as similar to concurrency issues inside an analytic database. How do they prevent starvation among concurrent workloads, or schedule CPU/IO tasks to avoid CPU underutilization?

Choices of which threading libraries we might use to implement a viable solution (e.g. TBB) seem secondary to the programming model that we use to implement our components.

Thanks,
Wes