[ 
https://issues.apache.org/jira/browse/ARROW-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266415#comment-17266415
 ] 

Wes McKinney commented on ARROW-11262:
--------------------------------------

This same phenomenon is found in many other places in the codebase (notably in 
IPC write-with-compression and read-with-compression). Rearchitecting everything 
around async where possible seems like the right path (I think there are 
various Jira issues citing specific cases like these).

> [C++] Move decompression off background reader thread into thread pool
> ----------------------------------------------------------------------
>
>                 Key: ARROW-11262
>                 URL: https://issues.apache.org/jira/browse/ARROW-11262
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Weston Pace
>            Priority: Major
>         Attachments: read_csv.cc, zip-speedup.png
>
>
> When reading a compressed stream, a fair amount of CPU time is spent 
> decompressing that stream.  While we are doing this we could be 
> fetching the next block.  However, the current implementation does the reading 
> and decompressing on the same background reader thread, so the next block will 
> not be fetched until the prior block has finished decompressing.
> There is still some ordering here: it isn't a fan-out, because decompression 
> of the blocks has to happen in sequence, but there is some gain to be had.
> I created a simple example with gzip here 
> (https://github.com/westonpace/arrow/tree/feature/async-compressed-csv) and 
> you could test it with the attached example program.
> On my system, when reading a 250MB gzipped CSV file there is roughly a 5% 
> speedup if the file is cached in the OS (6.3s -> 6.0s) and a 10% to 15% 
> speedup if the file is not cached in the OS (~6.8s -> 6.0s).
> The example requires changing the table reader implementation to receive an 
> async generator.  I think, in practice, we will want to change it to take an 
> async input stream instead.  So this may need to wait until/if we decide to 
> expand the async paradigm into the I/O interfaces.
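
To illustrate the overlap described in the issue, here is a minimal sketch. It 
is not the code from the linked branch and does not use Arrow or zlib: 
BlockSource, FakeDecompressor and ReadPipelined are hypothetical names, the 
"codec" is a toy that doubles each byte, and std::async stands in for a real 
thread pool. The point is only the ordering: the next block is read while the 
previous one is still being decompressed, and at most one decompression runs 
at a time.

```cpp
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a blocking read of the next compressed block.
// Returns an empty string once the source is exhausted.
class BlockSource {
 public:
  explicit BlockSource(std::vector<std::string> blocks)
      : blocks_(std::move(blocks)) {}

  std::string ReadNext() {
    if (next_ >= blocks_.size()) return std::string();
    return blocks_[next_++];
  }

 private:
  std::vector<std::string> blocks_;
  std::size_t next_ = 0;
};

// Hypothetical stand-in for a stateful streaming decompressor (e.g. gzip).
// Blocks must be fed in order, so only one Decompress call may run at a time.
class FakeDecompressor {
 public:
  std::string Decompress(const std::string& block) {
    // Toy "codec": every input byte expands to two output bytes.
    std::string out;
    out.reserve(block.size() * 2);
    for (char c : block) {
      out.push_back(c);
      out.push_back(c);
    }
    return out;
  }
};

// Reads on the calling thread while the previous block is decompressed on
// another thread. Decompression stays strictly sequential because the pending
// task is always joined before the next one is launched.
std::vector<std::string> ReadPipelined(BlockSource& source,
                                       FakeDecompressor& codec) {
  std::vector<std::string> result;
  std::future<std::string> pending;  // decompression of the previous block

  for (;;) {
    // This read overlaps with the decompression held in `pending`.
    std::string block = source.ReadNext();

    // Join the previous decompression before starting another one.
    if (pending.valid()) result.push_back(pending.get());
    if (block.empty()) break;

    // std::launch::async runs the task on a separate thread; a real
    // implementation would submit to a thread pool instead (assumption).
    pending = std::async(std::launch::async,
                         [&codec, b = std::move(block)]() {
                           return codec.Decompress(b);
                         });
  }
  return result;
}
```

Calling ReadPipelined on a BlockSource holding the blocks "ab" and "c" returns 
"aabb" and "cc", in that order. The sketch needs C++14 for the lambda capture 
and typically has to be linked with thread support (e.g. -pthread).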



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
