[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

GitBox Tue, 24 May 2022 14:02:36 -0700


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136427603


   > thanks, that means you are current with all shipping improvements. the 
main one extra is to use openFile(), passing in length and requesting randomio. 
this guarantees ranged GET requests and cuts the initial HEAD probe for 
existence/size of file.
   
   By `openFile()` do you mean 
`FileSystem.openFileWithOptions(Path,OpenFileParameters)`?
   While looking, I realized that Parquet builds against a [much older version of 
Hadoop](https://github.com/apache/parquet-mr/blob/a2da156b251d13bce1fa81eb95b555da04880bc1/pom.xml#L79).
  
   > > > have you benchmarked this change with abfs or google gcs connectors to 
see what difference it makes there?
   > 
   > > No I have not. Would love help from anyone in the community with access 
to these. I only have access to S3.
   > 
   > that I have. FWIW, with the right tuning of abfs prefetch (4 threads, 128 
MB blocks) i can get full FTTH link rate from a remote store; 700 mbit/s . 
that's to the base station. once you add wifi the bottlenecks move.
   
   Wow! That is nearly as fast as local HDD. At this point the bottlenecks in 
Parquet begin to move towards decompression and decoding, but IO remains the 
slowest link in the chain. One thing my PR addresses: ParquetFileReader has a 
built-in assumption that all data must be read before downstream processing can 
proceed. Some of my changes remove this assumption, so that downstream 
processing no longer has to wait until an entire column is read and we get 
efficient pipelining. 
   What does the 128 MB block mean? Is this the amount prefetched for a stream? 
The read API does not block until the entire block is filled, I presume. 
   With my PR, Parquet IO reads 8 MB at a time (default) and downstream 
processes 1 MB at a time (default), with several such streams (one per column) 
in progress at the same time. Hopefully this read pattern works well with 
the prefetch.
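
To make the read pattern concrete, here is a rough single-stream sketch. It is not the code in the PR: the class name, method name, and the two-buffer queue depth are all hypothetical, and it uses only the JDK (Java 11+ for `readNBytes`). A background thread reads fixed-size IO chunks while the caller consumes each chunk in smaller slices, so processing starts before the whole stream has been read.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: hypothetical names, not the PR's implementation. */
final class PipelinedReadSketch {
    // Sentinel pushed by the reader thread when the stream is exhausted.
    private static final byte[] EOF = new byte[0];

    /**
     * Reads {@code in} on a background thread in ioChunk-sized pieces while the
     * calling thread consumes them in processChunk-sized slices.
     * Returns {bytesProcessed, slicesProcessed}.
     */
    static long[] readPipelined(InputStream in, int ioChunk, int processChunk) throws Exception {
        if (ioChunk <= 0 || processChunk <= 0) {
            throw new IllegalArgumentException("chunk sizes must be positive");
        }
        // Bounded queue (assumed depth of 2) caps how far IO runs ahead of processing.
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(2);
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            Callable<Void> reader = () -> {
                try {
                    byte[] buf;
                    // readNBytes blocks until ioChunk bytes are read or the stream ends.
                    while ((buf = in.readNBytes(ioChunk)).length > 0) {
                        queue.put(buf);
                    }
                    return null;
                } finally {
                    queue.put(EOF); // always unblock the consumer, even on failure
                }
            };
            Future<Void> pending = io.submit(reader);

            long bytes = 0;
            long slices = 0;
            for (byte[] buf = queue.take(); buf != EOF; buf = queue.take()) {
                for (int off = 0; off < buf.length; off += processChunk) {
                    // Stand-in for decompression and decoding of one slice.
                    bytes += Math.min(processChunk, buf.length - off);
                    slices++;
                }
            }
            pending.get(); // rethrows any IO failure from the reader thread
            return new long[] {bytes, slices};
        } finally {
            io.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in sizes; the defaults discussed above would be 8 MB and 1 MB.
        long[] r = readPipelined(new ByteArrayInputStream(new byte[20]), 8, 3);
        System.out.println("bytes=" + r[0] + " slices=" + r[1]);
    }
}
```

With the defaults mentioned above, `ioChunk` would be 8 MB and `processChunk` 1 MB, with one such pipeline per column. Whether an underlying 128 MB prefetch serves these 8 MB reads without stalling is exactly the open question.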


