Weston Pace created ARROW-11262:
-----------------------------------

             Summary: Move decompression off background reader thread into 
thread pool
                 Key: ARROW-11262
                 URL: https://issues.apache.org/jira/browse/ARROW-11262
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Weston Pace
         Attachments: read_csv.cc, zip-speedup.png

When reading a compressed stream, a fairly decent amount of CPU time is 
spent decompressing that stream.  While decompressing, we could be fetching 
the next block.  However, the current implementation does both the reading 
and the decompressing on the same background reader thread, so the next 
block will not be fetched until the prior block has finished decompressing.

There is still "some" ordering here: it isn't a fan-out, since decompression 
of the blocks has to happen in sequence, but there is some gain to be had 
from overlapping the read of the next block with the decompression of the 
current one.

I created a simple example with gzip here 
(https://github.com/westonpace/arrow/tree/feature/async-compressed-csv) and 
you can test it with the attached example program.

On my system, when reading a 250MB gzipped CSV file, there is roughly a 5% 
speedup if the file is cached in the OS (6.3s -> 6.0s) and a 10% to 15% 
speedup if the file is not cached in the OS (~6.8s -> 6.0s).

The example requires changing the table reader implementation to receive an 
async generator.  In practice, I think we will want to change it to take an 
async input stream instead.  So this may need to wait until (or unless) we 
decide to expand the async paradigm into the I/O interfaces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
