[
https://issues.apache.org/jira/browse/ARROW-12560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333511#comment-17333511
]
Weston Pace commented on ARROW-12560:
-------------------------------------
The problem is that we are using the CPU thread pool. So, imagine for a moment
that we are able to parse/decode CSV blocks in parallel (this may indeed be
possible someday so hopefully it isn't too hard to imagine).
Also imagine that we have some fixed block readahead of 10.
Then we have something like the following (this is just rough pseudocode, we
can assume ParseAndDecode is dropping the result in a vector somewhere and we
aren't checking errors)...
{code:java}
for (int i = 0; i < 10; i++)
{
source().Then(block => ParseAndDecode(block));
}
{code}
We want `source()` to be a background generator (reading along on its own I/O
thread). We want `ParseAndDecode` to run on the CPU thread.
Everything works fine if the I/O is slower than `ParseAndDecode`. Every call
to `source()` returns an unfinished future and, when it is finished, we will
transfer to the CPU pool (creating a new thread task) and run the thread task.
So there will be 10 thread tasks for 10 blocks and some of those thread tasks
may run in parallel (which is what we are after with parallel readahead here).
On the other hand, if I/O is faster than `ParseAndDecode` then calls to
`source()` will start returning finished futures (the call to `Transfer` would
have done nothing because the future was already finished). There is no need
to transfer, we are already on the CPU thread pool. However, no thread task is
created. The entire operation runs serially as a single thread task. In this
case we want to create a new thread task to do our CPU work. The cost of
`ParseAndDecode` is expensive so we know we aren't creating too many thread
tasks. The main thread can then carry on and issue the next call to `source()`.
> [C++] Investigate utilizing aggressive thread creation when adding callback
> to finished future.
> -----------------------------------------------------------------------------------------------
>
> Key: ARROW-12560
> URL: https://issues.apache.org/jira/browse/ARROW-12560
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
> Labels: async-util
>
> Imagine there is a slow map function (that could run in parallel) and a
> vector generator given a long vector of tasks. If we apply map to the
> generator and then readahead we won't actually get any parallelism because
> the vector generator returns everything synchronously and so no thread task
> will ever be submitted.
> This hypothetical situation is a reality in some situations in the scanner.
> For example, if scanning CSV files and the CPU threads fall behind the I/O
> threads then all callbacks will be synchronous (since the futures will
> already have been completed by the I/O threads).
> In such a situation we might benefit from creating a new thread task even
> though we wouldn't normally create one. For example, if we have an idle
> core. You can think of this as an analogue of work stealing.
> On the other hand, creating new thread tasks at any random callback might not
> be the most efficient. We could mitigate this by marking a callback as
> "potentially long" as some kind of hint when we add the callback to indicate
> it as eligible for eager thread creation.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)