[
https://issues.apache.org/jira/browse/ARROW-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weston Pace updated ARROW-11262:
--------------------------------
Attachment: zip-speedup.png
> [C++] Move decompression off background reader thread into thread pool
> ----------------------------------------------------------------------
>
> Key: ARROW-11262
> URL: https://issues.apache.org/jira/browse/ARROW-11262
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Weston Pace
> Priority: Major
> Attachments: read_csv.cc, zip-speedup.png
>
>
> When reading a compressed stream, a fair amount of CPU time is spent
> decompressing it. While decompressing, we could be fetching the next block.
> However, the current implementation does the reading and the decompressing on
> the same background reader thread, so the next block is not fetched until the
> prior block has finished decompressing.
> There is still some ordering constraint here; this is not a fan-out, since the
> blocks must be decompressed in sequence, but there is still some gain to be
> had from overlapping the fetch with the decompression.
> I created a simple example with gzip here
> (https://github.com/westonpace/arrow/tree/feature/async-compressed-csv), and
> you can test it with the attached example program (read_csv.cc).
> On my system, when reading a 250MB gzipped CSV file, there is roughly a 5%
> speedup if the file is cached by the OS (6.3s -> 6.0s) and a 10% to 15%
> speedup if it is not (~6.8s -> 6.0s).
> The example requires changing the table reader implementation to receive an
> async generator. In practice, I think we will want to change it to take an
> async input stream instead, so this may need to wait until (and unless) we
> decide to expand the async paradigm into the I/O interfaces.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)