Parth Chandra created PARQUET-2149:
--------------------------------------

             Summary: Implement async IO for Parquet file reader
                 Key: PARQUET-2149
                 URL: https://issues.apache.org/jira/browse/PARQUET-2149
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
            Reporter: Parth Chandra


ParquetFileReader's implementation has the following flow (simplified; a rough 
code sketch follows the list):
      - For every column -> read from storage in 8MB blocks -> read all 
        uncompressed pages into an output queue 
      - From the output queues -> (downstream) decompression + decoding
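
As an illustration only (not the actual ParquetFileReader code), the serialized 
flow looks roughly like the sketch below; the types and helpers (Column, Page, 
readAllPages) are placeholders:

{code:java}
// Sketch only: placeholder types/helpers, not the real ParquetFileReader internals.
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

class SerializedFlowSketch {
    static final int BLOCK_SIZE = 8 * 1024 * 1024; // storage reads in 8MB blocks

    interface Column {}   // placeholder
    interface Page {}     // placeholder

    // Blocking: reads the whole column chunk before returning its pages.
    static Queue<Page> readAllPages(Column column) {
        Queue<Page> outputQueue = new ArrayDeque<>();
        // ... read BLOCK_SIZE bytes at a time, slice into pages, enqueue ...
        return outputQueue;
    }

    static void read(List<Column> columns) {
        for (Column column : columns) {
            Queue<Page> pages = readAllPages(column);   // storage wait happens here
            // ... downstream: decompress + decode pages, only after the wait ...
        }
    }
}
{code}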

This flow is serialized, which means that downstream threads are blocked until 
the data has been read. Because a large part of the time is spent waiting for 
data from storage, threads are idle and CPU utilization is low.

There is no reason why this cannot be made asynchronous _and_ parallel. The 
proposed flow, sketched in code below, becomes:

For column _i_ -> read one chunk at a time from storage until the end of the 
column -> intermediate output queue -> read one uncompressed page at a time 
until the end -> output queue -> (downstream) decompression + decoding
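
One possible way to structure such a per-column pipeline, shown purely as an 
illustration with hypothetical names (startColumnPipeline, readNextChunk, 
pagesOf, the sentinel objects) and standard ExecutorService/BlockingQueue 
building blocks:

{code:java}
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncFlowSketch {
    interface Chunk {}   // placeholder
    interface Page {}    // placeholder
    static final Chunk END_OF_CHUNKS = new Chunk() {};  // end-of-stream sentinels
    static final Page  END_OF_PAGES  = new Page()  {};

    // One task reads chunks from storage, another turns chunks into
    // uncompressed pages; downstream consumes pages as they become available.
    static BlockingQueue<Page> startColumnPipeline(ExecutorService pool) {
        BlockingQueue<Chunk> chunks = new ArrayBlockingQueue<>(16);  // intermediate queue
        BlockingQueue<Page>  pages  = new ArrayBlockingQueue<>(16);  // output queue

        pool.submit(() -> {                      // async storage reads
            Chunk c;
            while ((c = readNextChunk()) != null) {
                chunks.put(c);
            }
            chunks.put(END_OF_CHUNKS);
            return null;
        });

        pool.submit(() -> {                      // chunk -> uncompressed pages
            Chunk c;
            while ((c = chunks.take()) != END_OF_CHUNKS) {
                for (Page p : pagesOf(c)) {
                    pages.put(p);
                }
            }
            pages.put(END_OF_PAGES);
            return null;
        });

        return pages;
    }

    // Hypothetical stand-ins for the real storage read / page parsing code.
    static Chunk readNextChunk() { return null; }
    static Iterable<Page> pagesOf(Chunk c) { return List.of(); }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        BlockingQueue<Page> pages = startColumnPipeline(pool);   // per column
        Page p;
        while ((p = pages.take()) != END_OF_PAGES) {
            // downstream: decompress + decode p, overlapping with storage reads
        }
        pool.shutdown();
    }
}
{code}

The key property is that storage reads, page assembly, and downstream 
decompression/decoding run concurrently instead of back to back.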

Note that this can be made completely self-contained in ParquetFileReader, and 
downstream implementations like Iceberg and Spark will automatically be able to 
take advantage without code changes as long as the ParquetFileReader APIs are 
not changed.
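
For example, a caller using the existing parquet-mr entry points would look 
roughly like the sketch below, and would stay the same whether the I/O 
underneath is synchronous or asynchronous:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

class CallerSketch {
    static void scan(String file) throws java.io.IOException {
        Configuration conf = new Configuration();
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(new Path(file), conf))) {
            PageReadStore rowGroup;
            while ((rowGroup = reader.readNextRowGroup()) != null) {
                // Decompression + decoding happens downstream of this call,
                // exactly as today; the caller never sees the I/O strategy.
                long rows = rowGroup.getRowCount();
            }
        }
    }
}
{code}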

In past work with async I/O ([Drill - async page reader|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java]), 
I have seen a 2x-3x improvement in reading speed for Parquet files.



