[ 
https://issues.apache.org/jira/browse/ARROW-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266471#comment-17266471
 ] 

Weston Pace commented on ARROW-10183:
-------------------------------------

Some good news.  I figured out my CSV reading benchmark was flawed (it had the 
wrong separator).  New results:

| |2.0.0 (Mean)|2.0.0 (StdDev)|Async (Mean)|Async (StdDev)|
|gzip/cache|6.62|0.12|6.70|0.16|
|gzip/none|9.89|0.43|9.06|0.21|
|none/cache|4.05|0.09|3.95|0.11|
|none/none|34.57|1.15|32.25|1.22|

I also realized there is at least one situation where the async reader could 
fall behind the threaded reader.  If the I/O is running slower (e.g. the gzip 
case) then sometimes the parse task will find the I/O promise unfulfilled (it 
has finished parsing but the next decompressed block is not ready).  In the 
threaded case the outer parsing thread will block here, while in the async case 
a new task will get added to the pool to run when the I/O finishes.  That task 
will get added to the pool *behind* all the conversion tasks.  So the parsing 
will be delayed, and it is possible the readahead queue will fill up, delaying 
the I/O.
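
To make the ordering problem concrete, here is a minimal sketch of a FIFO task 
pool reduced to a queue of labels.  Everything in it (DrainFifo, 
QueueAfterLateIo, the task names) is hypothetical and only models the queue 
order; it is not the Arrow thread pool.

```cpp
#include <deque>
#include <string>
#include <vector>

// Hypothetical model of a FIFO task pool: tasks are just labels, and
// "running" the pool means draining the queue front to back.
std::vector<std::string> DrainFifo(std::deque<std::string> queue) {
  std::vector<std::string> order;
  while (!queue.empty()) {
    order.push_back(queue.front());
    queue.pop_front();
  }
  return order;
}

// Builds the queue as it looks when the parse step found the next block not
// ready: the conversion tasks are already queued, and the parse continuation
// is appended only once the I/O completes, so it lands at the back.
std::deque<std::string> QueueAfterLateIo(int pending_conversions) {
  std::deque<std::string> queue;
  for (int i = 0; i < pending_conversions; ++i) {
    queue.push_back("convert-" + std::to_string(i));
  }
  queue.push_back("parse-next-block");
  return queue;
}
```

Calling DrainFifo(QueueAfterLateIo(3)) runs the three conversion tasks first 
and the parse continuation last, which is the delay described above.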

 

The timing has to be just right so that parsing & I/O are similar in 
performance.  The I/O has to be slow enough to sometimes not be ready but not 
so slow that the task pool completely drains between each block.

 

I've tested the gzip/cache case quite often and this is the only case where the 
async version consistently underperformed.  I think the I/O in the */none cases 
is too slow and the I/O in the none/cache case is too fast.

 

A prioritized thread queue would allow working around this situation.  The 
conversion tasks should be marked lower priority than the parsing tasks.
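
A rough sketch of what such a queue could look like, using 
std::priority_queue.  The names (Priority, Task, PrioritizedQueue) are made up 
for illustration and the queue is single-threaded; a real pool would also need 
locking and worker threads.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical priority levels: parsing outranks conversion.
enum class Priority : int { kConversion = 0, kParsing = 1 };

struct Task {
  Priority priority;
  std::uint64_t sequence;  // submission order, keeps FIFO order within a level
  std::string name;
};

struct TaskOrder {
  // std::priority_queue pops the "largest" element, so this returns true
  // when `a` should run after `b`.
  bool operator()(const Task& a, const Task& b) const {
    if (a.priority != b.priority) return a.priority < b.priority;
    return a.sequence > b.sequence;
  }
};

class PrioritizedQueue {
 public:
  void Submit(Priority priority, std::string name) {
    tasks_.push(Task{priority, next_sequence_++, std::move(name)});
  }

  // Returns the task names in the order they would be executed.
  std::vector<std::string> Drain() {
    std::vector<std::string> order;
    while (!tasks_.empty()) {
      order.push_back(tasks_.top().name);
      tasks_.pop();
    }
    return order;
  }

 private:
  std::priority_queue<Task, std::vector<Task>, TaskOrder> tasks_;
  std::uint64_t next_sequence_ = 0;
};
```

With this ordering, a parse task submitted after several conversion tasks 
still runs before them, while the conversion tasks keep their submission order 
among themselves.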

> [C++] Create a ForEach library function that runs on an iterator of futures
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-10183
>                 URL: https://issues.apache.org/jira/browse/ARROW-10183
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: arrow-continuation-flow.jpg
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> This method should take in an iterator of futures and a callback and pull an 
> item off the iterator, "await" it, run the callback on it, and then fetch the 
> next item from the iterator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
