[ 
https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541359#comment-17541359
 ] 

ASF GitHub Bot commented on PARQUET-2149:
-----------------------------------------

steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1135585289

   > I was working with s3a
   > Spark 3.2.1
   > Hadoop (Hadoop-aws) 3.3.2
   > AWS SDK 1.11.655
   
   Thanks, that means you are current with all shipping improvements. The main 
extra one is to use openFile(), passing in the file length and requesting 
random IO. This guarantees ranged GET requests and cuts the initial HEAD probe 
for the existence/size of the file.
   
   >> have you benchmarked this change with abfs or google gcs connectors to 
see what difference it makes there?
   
   > No I have not. Would love help from anyone in the community with access to 
these. I only have access to S3.
   
   That I have. FWIW, with the right tuning of abfs prefetch (4 threads, 128 MB 
blocks) I can get full FTTH link rate from a remote store: 700 Mbit/s. That's 
to the base station; once you add Wi-Fi the bottlenecks move. 




> Implement async IO for Parquet file reader
> ------------------------------------------
>
>                 Key: PARQUET-2149
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2149
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Parth Chandra
>            Priority: Major
>
> ParquetFileReader's implementation has the following flow (simplified):
>       - For every column -> read from storage in 8 MB blocks -> read all 
> uncompressed pages into an output queue 
>       - From the output queues -> (downstream) decompression + decoding
> This flow is serialized, which means that downstream threads are blocked 
> until the data has been read. Because a large part of the time is spent 
> waiting for data from storage, threads are idle and CPU utilization is very 
> low.
> There is no reason why this cannot be made asynchronous _and_ parallel. So, 
> for column _i_ -> reading one chunk until end, from storage -> intermediate 
> output queue -> read one uncompressed page until end -> output queue -> 
> (downstream) decompression + decoding
> Note that this can be made completely self-contained in ParquetFileReader, and 
> downstream implementations like Iceberg and Spark will automatically be able 
> to take advantage without code changes as long as the ParquetFileReader APIs 
> are not changed. 
> In past work with async I/O [Drill - async page reader 
> |https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java],
> I have seen a 2x-3x improvement in reading speed for Parquet files.
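
The asynchronous, per-column flow proposed in the description above can be sketched with plain `java.util.concurrent` queues. This is only an illustration under stated assumptions, not the code from PR #968 or from ParquetFileReader: the class `AsyncColumnPipeline` and the helpers `readChunk`, `splitPages`, `readColumn` and `readAll` are hypothetical names, storage reads are simulated with strings, and the "two pages per chunk" split is arbitrary. Error handling is omitted, so a failing stage would leave the consumer waiting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

class AsyncColumnPipeline {
    // Sentinel placed on a queue to signal that no more items will arrive.
    private static final String EOF = "<eof>";

    // Hypothetical stand-in for a blocking read of one chunk from storage.
    static String readChunk(int column, int chunk) {
        return "col" + column + "-chunk" + chunk;
    }

    // Hypothetical stand-in for splitting a chunk into its pages.
    static List<String> splitPages(String chunk) {
        return List.of(chunk + "-page0", chunk + "-page1");
    }

    // One column: storage -> intermediate queue -> pages -> output queue -> consumer.
    static List<String> readColumn(int column, int chunks, ExecutorService pool)
            throws Exception {
        // Small bounded queues give back-pressure between the stages.
        BlockingQueue<String> chunkQueue = new LinkedBlockingQueue<>(4);
        BlockingQueue<String> pageQueue = new LinkedBlockingQueue<>(4);

        // Stage 1: read chunks from (simulated) storage until the end.
        Future<?> ioStage = pool.submit(() -> {
            for (int c = 0; c < chunks; c++) {
                chunkQueue.put(readChunk(column, c));
            }
            chunkQueue.put(EOF);
            return null;
        });

        // Stage 2: turn each chunk into pages as soon as it arrives.
        Future<?> pageStage = pool.submit(() -> {
            String chunk;
            while (!EOF.equals(chunk = chunkQueue.take())) {
                for (String page : splitPages(chunk)) {
                    pageQueue.put(page);
                }
            }
            pageQueue.put(EOF);
            return null;
        });

        // Downstream consumer: stands in for decompression + decoding.
        List<String> pages = new ArrayList<>();
        String page;
        while (!EOF.equals(page = pageQueue.take())) {
            pages.add(page);
        }
        ioStage.get();
        pageStage.get();
        return pages;
    }

    // All columns run in parallel; results are returned in column order.
    static List<List<String>> readAll(int columns, int chunks) throws Exception {
        // A cached pool grows on demand, so blocked stages cannot starve others.
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int col = 0; col < columns; col++) {
                final int c = col;
                Callable<List<String>> task = () -> readColumn(c, chunks, pool);
                futures.add(pool.submit(task));
            }
            List<List<String>> result = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                result.add(f.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        for (List<String> column : readAll(2, 2)) {
            System.out.println(column);
        }
    }
}
```

Because each stage only blocks on its own queue, a slow storage read for one column does not stall page processing for the others, which is the property the description is after.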



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
