[
https://issues.apache.org/jira/browse/ARROW-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266415#comment-17266415
]
Wes McKinney commented on ARROW-11262:
--------------------------------------
This same phenomenon is found in many other places in the codebase (notably in
IPC write-with-compression and read-with-compression). Rearchitecting
everything around async where possible seems like the right path (I think
there are various Jira issues that cite specific cases like these).
> [C++] Move decompression off background reader thread into thread pool
> ----------------------------------------------------------------------
>
> Key: ARROW-11262
> URL: https://issues.apache.org/jira/browse/ARROW-11262
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Weston Pace
> Priority: Major
> Attachments: read_csv.cc, zip-speedup.png
>
>
> When reading a compressed stream, a fair amount of CPU time is spent
> decompressing that stream. While we are doing this we could be
> fetching the next block. However, the current implementation has the reading
> and decompressing on the same background reader thread so the next block will
> not be fetched until the prior block is finished decompressing.
> There is still some ordering here: this isn't a fan-out, since decompression
> of the blocks has to happen in sequence, but there is some gain to be had.
> I created a simple example with gzip here
> (https://github.com/westonpace/arrow/tree/feature/async-compressed-csv) and
> you could test it with the attached example program.
> On my system, when reading a 250MB gzipped CSV file there is roughly a 5%
> speedup if the file is cached in the OS (6.3s -> 6.0s) and a 10% to 15%
> speedup if the file is not cached in the OS (~6.8s -> 6.0s).
> The example requires changing the table reader implementation to receive an
> async generator. I think, in practice, we will want to change it to take an
> async input stream instead. So this may need to wait until/if we decide to
> expand the async paradigm into the I/O interfaces.
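
The overlap described in the issue can be illustrated with a minimal sketch
that uses only the C++ standard library. It is not Arrow's API and not the
code from the linked branch: FetchBlock, ToyDecoder, and ReadAllPipelined are
hypothetical names, the "compression" is a toy run-length format, and
std::async stands in for a real thread pool. The fetch of block i+1 runs on
another thread while block i is decoded, and decoding itself stays strictly
in order because the decoder carries state across block boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a blocking read of one compressed block.
std::string FetchBlock(const std::vector<std::string>& source, std::size_t i) {
  assert(i < source.size());
  return source[i];
}

// Toy stateful decoder (an assumption for illustration, not gzip):
// the input is a sequence of <digit><char> pairs, e.g. "3a" -> "aaa".
// A pair may be split across two blocks, so blocks must be decoded in order.
class ToyDecoder {
 public:
  std::string Decode(const std::string& block) {
    std::string out;
    for (char c : block) {
      if (pending_count_ < 0) {
        pending_count_ = c - '0';  // first half of a pair: the repeat count
      } else {
        out.append(static_cast<std::size_t>(pending_count_), c);
        pending_count_ = -1;  // pair consumed
      }
    }
    return out;
  }

 private:
  int pending_count_ = -1;  // repeat count carried over from a previous block
};

// Decode block i on the calling thread while block i+1 is fetched on
// another thread. Decoding remains sequential; only the fetch overlaps.
std::string ReadAllPipelined(const std::vector<std::string>& source) {
  ToyDecoder decoder;
  std::string result;
  if (source.empty()) return result;

  std::string current = FetchBlock(source, 0);
  for (std::size_t i = 0; i < source.size(); ++i) {
    const bool has_next = i + 1 < source.size();
    std::future<std::string> next;
    if (has_next) {
      // Start fetching the next block before decoding the current one.
      next = std::async(std::launch::async,
                        [&source, i] { return FetchBlock(source, i + 1); });
    }
    result += decoder.Decode(current);
    if (has_next) current = next.get();  // wait for the overlapped fetch
  }
  return result;
}
```

With the blocks {"3a2", "b1c"}, the pair "2b" straddles the block boundary,
which is why the decode step cannot be fanned out across blocks. In a real
reader the gain comes from the fetch being I/O-bound while the decode is
CPU-bound.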