[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

GitBox Mon, 23 May 2022 08:47:40 -0700


steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1134843705


   1. Whose S3 client was used for testing here? If it was the s3a one, which
      Hadoop release?
   2. The Azure abfs and GCS connectors do async prefetching of the next block,
      but they simply assume that the code will read sequentially; if there is
      another seek/readFully to a new location, those prefetches are abandoned.
      There is work in s3a to do prefetching with caching, so as to reduce the
      penalty of backwards seeks.
      https://issues.apache.org/jira/browse/HADOOP-18028
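
   To make the cost of that sequential assumption concrete, here is a toy
   model of next-block prefetching. It is a sketch only: the class
   `PrefetchSketch`, its fields and the in-memory `store` are hypothetical
   stand-ins, not code from any Hadoop connector.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

/** Toy model only: NOT connector code. All names are hypothetical. */
public class PrefetchSketch {
    private final byte[] store;   // stand-in for a remote object
    private final int blockSize;
    private long prefetchedBlock = -1;
    private CompletableFuture<byte[]> prefetch;  // in-flight read of the next block
    int hits = 0;        // reads served from a prefetch
    int abandoned = 0;   // prefetches thrown away after a seek

    public PrefetchSketch(byte[] store, int blockSize) {
        this.store = store;
        this.blockSize = blockSize;
    }

    private boolean hasBlock(long block) {
        return block >= 0 && block * blockSize < store.length;
    }

    /** Simulates one blocking fetch of a whole block. */
    private byte[] fetch(long block) {
        int from = (int) (block * blockSize);
        int to = Math.min(from + blockSize, store.length);
        return Arrays.copyOfRange(store, from, to);
    }

    public byte[] readBlock(long block) {
        byte[] data;
        if (prefetch != null && prefetchedBlock == block) {
            data = prefetch.join();       // sequential read: the prefetch pays off
            hits++;
        } else {
            if (prefetch != null) {       // seek elsewhere: prefetched work is wasted
                prefetch.cancel(true);
                abandoned++;
            }
            data = fetch(block);
        }
        // Assume the caller reads sequentially and start on the next block.
        final long next = block + 1;
        prefetchedBlock = next;
        prefetch = hasBlock(next)
                ? CompletableFuture.supplyAsync(() -> fetch(next))
                : null;
        return data;
    }
}
```

   Reading blocks 0, 1, 0, 3 from a four-block object gives one hit and two
   abandoned prefetches, which is roughly the access pattern a columnar
   reader produces when it jumps between column chunks.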
   
   Hadoop is adding a vectored IO API intended for libraries like ORC and
   Parquet to use: the application provides an unordered list of ranges and a
   ByteBuffer supplier, and gets back a list of futures to wait for. The base
   implementation simply reads each range using the readFully API. s3a (and
   later abfs) will do full async retrieval itself, using the HTTP connection
   pool.
   https://issues.apache.org/jira/browse/HADOOP-18103
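
   The shape described above (ranges in, a buffer supplier, futures out, with
   a fallback that just does blocking reads) can be sketched as follows. This
   is a minimal illustration, not the Hadoop API: `VectoredReadSketch`,
   `Range` and `readRanges` are hypothetical names, and a local `FileChannel`
   stands in for the filesystem stream.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/** Illustrative only: NOT the Hadoop API. All names are hypothetical. */
public class VectoredReadSketch {

    /** A byte range to fetch: absolute file offset plus length. */
    public static final class Range {
        final long offset;
        final int length;

        public Range(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Blocking positioned read of exactly r.length bytes (a readFully equivalent). */
    static ByteBuffer readFully(FileChannel ch, Range r, IntFunction<ByteBuffer> allocate) {
        ByteBuffer buf = allocate.apply(r.length);
        buf.limit(r.length);  // the supplier may hand back a larger buffer
        long pos = r.offset;
        try {
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) {
                    throw new EOFException("EOF reading range at offset " + r.offset);
                }
                pos += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buf.flip();
        return buf;
    }

    /**
     * Fallback behaviour: ranges are read one by one, in the order given,
     * so every returned future is already complete. An object-store client
     * could instead issue the requests concurrently and complete the
     * futures as responses arrive.
     */
    public static List<CompletableFuture<ByteBuffer>> readRanges(
            FileChannel ch, List<Range> ranges, IntFunction<ByteBuffer> allocate) {
        List<CompletableFuture<ByteBuffer>> futures = new ArrayList<>();
        for (Range r : ranges) {
            CompletableFuture<ByteBuffer> f = new CompletableFuture<>();
            try {
                f.complete(readFully(ch, r, allocate));
            } catch (RuntimeException e) {
                f.completeExceptionally(e);
            }
            futures.add(f);
        }
        return futures;
    }
}
```

   A caller would pass something like `ByteBuffer::allocate` (or a direct
   buffer pool) as the supplier and `join()` each future when it needs that
   range. Because the caller only sees futures, the same calling code works
   whether the reads underneath are sequential or fully asynchronous.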
   
   Both vectored IO and s3a prefetching will ship this summer in Hadoop 3.4.0.
   I don't see this change conflicting with either of them, though they may
   make a lot of it obsolete.
   
   Have you benchmarked this change with the abfs or Google GCS connectors to
   see what difference it makes there?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
