[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

GitBox Mon, 23 May 2022 20:37:29 -0700


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1135366352


   @steveloughran thank you very much for taking the time to review and provide 
feedback! 
   
   > 1. whose s3 client was used for testing here -if the s3a one, which hadoop 
release?
   
   I was working with s3a:
     Spark 3.2.1
     Hadoop (Hadoop-aws) 3.3.2
     AWS SDK 1.11.655
     
   
   > 2. the azure abfs and gcs connectors do async prefetching of the next 
block, but are simply assuming that code will read sequentially; if there is 
another seek/readFully to a new location, those prefetches will be abandoned. 
there is work in s3a to do prefetching here with caching, so as to reduce the 
penalty of backwards seeks. https://issues.apache.org/jira/browse/HADOOP-18028
   
   I haven't worked with abfs or gcs. If the connectors do async prefetching, 
that would be great: the time the Parquet reader spends blocked in the file 
system API would drop substantially. In that case, we could turn the async 
reader on and off and rerun the benchmark to compare. From past experience 
with MaprFS, whose hdfs client had very aggressive read-ahead, I would still 
expect better Parquet speeds with the async reader. Abandoning the prefetch 
when a seek occurs is the usual behaviour, but we may see no benefit from the 
connector in that case. So a combination of async reader and async connector 
might end up being a great solution (maybe at slightly higher CPU 
utilization). We would still have to run a benchmark to see the real effect.
   The async version in this PR takes care of the sequential read requirement 
in two ways: a) It opens a new stream for each column and ensures every column 
is read sequentially. Footers are read using a separate stream, and except for 
the footer, no stream ever seeks to a new location. b) The amount of data to 
be read is predetermined, so there is never a read-ahead that is discarded.
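   To make points a) and b) concrete, here is a minimal sketch using only the 
JDK. It is not the code in this PR: `PerColumnAsyncRead`, `ColumnRange`, 
`readChunk`, `readAllAsync` and `demo` are hypothetical names, and a local file 
stands in for an object-store stream. Each column chunk gets its own stream, 
is positioned once, and then reads exactly the number of bytes known up front.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; these are not the classes used in the PR.
class PerColumnAsyncRead {

    /** Byte range of one column chunk, known before reading starts. */
    static final class ColumnRange {
        final String name;
        final long offset;
        final int length;

        ColumnRange(String name, long offset, int length) {
            this.name = name;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Opens a dedicated stream, positions it once, then reads sequentially. */
    static byte[] readChunk(Path file, ColumnRange range) {
        try (InputStream in = Files.newInputStream(file)) {
            long remaining = range.offset;
            while (remaining > 0) {              // initial positioning only
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new IOException("could not reach offset of " + range.name);
                }
                remaining -= skipped;
            }
            byte[] data = in.readNBytes(range.length); // exact, predetermined size
            if (data.length != range.length) {
                throw new IOException("short read for column " + range.name);
            }
            return data;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Submits one read task per column; each task owns its own stream. */
    static List<CompletableFuture<byte[]>> readAllAsync(
            Path file, List<ColumnRange> ranges, ExecutorService pool) {
        List<CompletableFuture<byte[]>> futures = new ArrayList<>();
        for (ColumnRange r : ranges) {
            futures.add(CompletableFuture.supplyAsync(() -> readChunk(file, r), pool));
        }
        return futures;
    }

    /** Small self-check on a temporary file with three fake "columns". */
    static String demo() {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            Path file = Files.createTempFile("columns", ".bin");
            Files.write(file, "aaaabbbbbbcc".getBytes(StandardCharsets.US_ASCII));
            List<ColumnRange> ranges = List.of(
                    new ColumnRange("a", 0, 4),
                    new ColumnRange("b", 4, 6),
                    new ColumnRange("c", 10, 2));
            StringBuilder sb = new StringBuilder();
            for (CompletableFuture<byte[]> f : readAllAsync(file, ranges, pool)) {
                if (sb.length() > 0) {
                    sb.append('|');
                }
                sb.append(new String(f.join(), StandardCharsets.US_ASCII));
            }
            Files.delete(file);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

   Because every task reads a fixed range from its own stream, no stream seeks 
after its initial positioning and nothing is fetched that is later thrown away.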
   
   > 
   > hadoop is adding a vectored IO api intended for libraries like orc and 
parquet to be able to use, where the application provides an unordered list of 
ranges, a bytebuffer supplier and gets back a list of futures to wait for. the 
base implementation simply reads using the readFully API. s3a (and later abfs) will 
do full async retrieval itself, using the http connection pool. 
https://issues.apache.org/jira/browse/HADOOP-18103
   > 
   > both vectored io and s3a prefetching will ship this summer in hadoop 
3.4.0. i don't see this change conflicting with this, though they may obsolete 
a lot of it.
   
   Yes, I became aware of this recently. I'm discussing integration of these 
efforts in a separate channel. At the moment I see no conflict, but have yet to 
determine how much of this async work would need to be changed. I suspect we 
may be able to eliminate or vastly simplify `AsyncMultiBufferInputStream`. 
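   For reference, the call shape described in the quoted comment (an unordered 
list of ranges plus a buffer supplier in, a list of futures out) can be 
sketched with the JDK alone. This is not Hadoop's API and not code from this 
PR: `RangeReadSketch`, `Range`, `readRanges` and `demo` are hypothetical names, 
and a positional `FileChannel` read stands in for the base `readFully` path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntFunction;

// Hypothetical stand-in for a "ranges in, futures out" read API.
class RangeReadSketch {

    static final class Range {
        final long offset;
        final int length;

        Range(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Fills one caller-supplied buffer with positional reads (no shared seek). */
    static ByteBuffer readFully(FileChannel ch, Range r, IntFunction<ByteBuffer> allocate) {
        ByteBuffer buf = allocate.apply(r.length);
        buf.limit(r.length);                     // read exactly the requested range
        try {
            long pos = r.offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);       // positional read, channel position untouched
                if (n < 0) {
                    throw new IOException("EOF before range was filled");
                }
                pos += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buf.flip();
        return buf;
    }

    /** Returns one future per range, in the same order as the input list. */
    static List<CompletableFuture<ByteBuffer>> readRanges(
            FileChannel ch, List<Range> ranges,
            IntFunction<ByteBuffer> allocate, ExecutorService pool) {
        List<CompletableFuture<ByteBuffer>> futures = new ArrayList<>();
        for (Range r : ranges) {
            futures.add(CompletableFuture.supplyAsync(() -> readFully(ch, r, allocate), pool));
        }
        return futures;
    }

    /** Small self-check: two out-of-order ranges from a ten-byte temporary file. */
    static String demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Path file = Files.createTempFile("ranges", ".bin");
            Files.write(file, "0123456789".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                List<Range> ranges = List.of(new Range(6, 3), new Range(0, 2));
                StringBuilder sb = new StringBuilder();
                for (CompletableFuture<ByteBuffer> f
                        : readRanges(ch, ranges, ByteBuffer::allocate, pool)) {
                    if (sb.length() > 0) {
                        sb.append('|');
                    }
                    sb.append(StandardCharsets.US_ASCII.decode(f.join()));
                }
                return sb.toString();
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

   If the reader only has to hand over ranges and wait on futures like this, 
most of the buffering and scheduling logic can live below the reader, which is 
why the stream class above looks like a candidate for simplification.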
   
   > have you benchmarked this change with abfs or google gcs connectors to see 
what difference it makes there?
   
   No, I have not. I would love help from anyone in the community with access 
to these; I only have access to S3.
   
   


