[
https://issues.apache.org/jira/browse/ARROW-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266471#comment-17266471
]
Weston Pace commented on ARROW-10183:
-------------------------------------
Some good news: I figured out that my CSV reading benchmark was flawed (it
had the wrong separator). New results:
| |2.0.0 (Mean)|2.0.0 (StdDev)|Async (Mean)|Async (StdDev)|
|gzip/cache|6.62|0.12|6.70|0.16|
|gzip/none|9.89|0.43|9.06|0.21|
|none/cache|4.05|0.09|3.95|0.11|
|none/none|34.57|1.15|32.25|1.22|
I also realized there is at least one situation where the async reader could
fall behind the threaded reader. If the I/O is running slower (e.g. the gzip
case) then sometimes the parse task will find the I/O promise unfulfilled (it
finished parsing but the next decompressed block is not ready). In the
threaded case the outer parsing thread will block here, while in the async
case a new task will get added to the pool to run when the I/O finishes. That
task will get added to the pool *behind* all the conversion tasks, so the
parsing will be delayed and it is possible the readahead queue will fill up,
delaying the I/O.
The timing has to be just right, with parsing and I/O similar in performance:
the I/O has to be slow enough to sometimes not be ready but not so slow that
the task pool completely drains between each block.
I've tested the gzip/cache case quite often and this is the only case where the
async version consistently underperformed. I think the I/O in the */none cases
is too slow and the I/O in the none/cache case is too fast to hit this timing
window.
A prioritized thread queue would allow working around this situation. The
conversion tasks should be marked lower priority than the parsing tasks.
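A rough sketch of that prioritization, to make the ordering concrete. This is
not Arrow's thread pool; `PriorityTaskQueue`, `Priority`, and `RunDemo` are
hypothetical names, and the queue is drained on a single thread purely to show
which task runs first. Three conversion tasks are queued, then a parse
continuation arrives late (as it would when the I/O finishes), and the parse
task still runs ahead of them.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical priority levels; a lower value runs first.
enum class Priority : int { kParse = 0, kConvert = 1 };

struct Task {
  Priority priority;
  std::uint64_t seq;  // submission order, keeps FIFO order within a priority
  std::function<void()> fn;
};

// Comparator for std::priority_queue: the task at top() is the one with the
// lowest priority value and, among equals, the lowest sequence number.
struct TaskOrder {
  bool operator()(const Task& a, const Task& b) const {
    if (a.priority != b.priority) return a.priority > b.priority;
    return a.seq > b.seq;
  }
};

class PriorityTaskQueue {
 public:
  void Submit(Priority priority, std::function<void()> fn) {
    queue_.push(Task{priority, next_seq_++, std::move(fn)});
  }

  // Runs queued tasks one at a time until the queue is empty.
  void Drain() {
    while (!queue_.empty()) {
      // top() returns a const reference, so copy the callable before popping.
      std::function<void()> fn = queue_.top().fn;
      queue_.pop();
      assert(fn);
      fn();
    }
  }

 private:
  std::priority_queue<Task, std::vector<Task>, TaskOrder> queue_;
  std::uint64_t next_seq_ = 0;
};

// Queues three conversion tasks, then a late parse continuation, and returns
// the order in which the tasks actually ran.
std::vector<std::string> RunDemo() {
  std::vector<std::string> order;
  PriorityTaskQueue pool;
  for (int i = 0; i < 3; ++i) {
    pool.Submit(Priority::kConvert,
                [&order, i] { order.push_back("convert" + std::to_string(i)); });
  }
  pool.Submit(Priority::kParse, [&order] { order.push_back("parse"); });
  pool.Drain();
  return order;
}
```

With a plain FIFO queue the parse task would run last here. The sequence
number is only there so that tasks of equal priority keep their submission
order.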
> [C++] Create a ForEach library function that runs on an iterator of futures
> ---------------------------------------------------------------------------
>
> Key: ARROW-10183
> URL: https://issues.apache.org/jira/browse/ARROW-10183
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
> Labels: pull-request-available
> Attachments: arrow-continuation-flow.jpg
>
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> This method should take in an iterator of futures and a callback and pull an
> item off the iterator, "await" it, run the callback on it, and then fetch the
> next item from the iterator.
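
A minimal sketch of the loop described above, using `std::future` as a
stand-in. It is not the Arrow API: `ForEach` here blocks in `get()` where an
async implementation would chain a continuation instead, and the "iterator" is
assumed to be any callable returning `std::optional<std::future<T>>`, with
`std::nullopt` marking the end. `RunForEachDemo` is a hypothetical driver.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <optional>
#include <utility>
#include <vector>

// Pulls one future at a time from `next`, waits for it, and passes the
// result to `callback` before asking for the next future.
template <typename Next, typename Callback>
void ForEach(Next next, Callback callback) {
  while (auto fut = next()) {
    callback(fut->get());  // blocking stand-in for "await"
  }
}

// Builds three deferred futures (each computes its value when get() is
// called), iterates over them with ForEach, and returns the values seen.
std::vector<int> RunForEachDemo() {
  std::vector<std::future<int>> pending;
  for (int i = 1; i <= 3; ++i) {
    pending.push_back(std::async(std::launch::deferred, [i] { return i * 10; }));
  }

  std::size_t index = 0;
  auto next = [&]() -> std::optional<std::future<int>> {
    if (index == pending.size()) return std::nullopt;
    return std::move(pending[index++]);
  };

  std::vector<int> seen;
  ForEach(next, [&seen](int value) { seen.push_back(value); });
  return seen;
}
```

The callback for one item finishes before the next future is fetched, so
items are handled strictly in iterator order.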