tustvold opened a new issue, #1605:
URL: https://github.com/apache/arrow-rs/issues/1605
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
`SerializedFileReader` is currently created with a `ChunkReader` which looks
like
```
pub trait ChunkReader: Length + Send + Sync {
    type T: Read + Send;

    /// Get a serially readable slice of the current reader
    ///
    /// This should fail if the slice exceeds the current bounds
    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
}
```
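For illustration, a minimal in-memory implementation of this contract might look like the following sketch (the `InMemoryFile` type is hypothetical and error handling is simplified):

```
use std::io::Cursor;

use parquet::errors::{ParquetError, Result};
use parquet::file::reader::{ChunkReader, Length};

/// Hypothetical in-memory "file", used only to illustrate the contract
struct InMemoryFile(Vec<u8>);

impl Length for InMemoryFile {
    fn len(&self) -> u64 {
        self.0.len() as u64
    }
}

impl ChunkReader for InMemoryFile {
    type T = Cursor<Vec<u8>>;

    fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
        let start = start as usize;
        // Fail if the requested slice exceeds the current bounds
        if start + length > self.0.len() {
            return Err(ParquetError::General(format!(
                "range {}..{} exceeds file length {}",
                start,
                start + length,
                self.0.len()
            )));
        }
        // Copy the slice so the returned reader owns its bytes
        Ok(Cursor::new(self.0[start..start + length].to_vec()))
    }
}
```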
The process for reading a file is then:
* `SerializedFileReader::new` will call `footer::parse_metadata`
* `parse_metadata` will
  * Call `ChunkReader::get_read` with the final 64 KB byte range, and read this into a buffer
  * Determine the footer length
  * Potentially call `ChunkReader::get_read` to read the remainder of the footer, and read this into a buffer
* `SerializedFileReader::get_row_iter` will return a `RowIter` which, for each row group, will
  * Call `SerializedRowGroupReader::new`, which will
    * Call `ChunkReader::get_read` with the byte range of each column chunk
There are two major options for applying this to files in object storage:
1. Fetch the entire file to local disk or memory and pass it to
`SerializedFileReader`
2. Convert `ChunkReader::get_read` to a range request to object storage
The first option is problematic as it cannot use pruning logic to reduce the
amount of data fetched from object storage.
The second option runs into two problems:
1. The interface is not async, and blocking a thread on network IO is not ideal
2. It issues lots of small range requests per file, adding cost and latency
**Describe the solution you'd like**
I would like to decouple the parquet reader entirely from IO concerns,
allowing downstreams complete freedom to decide how they want to handle this.
This will allow the reader to support a wide variety of potential data access patterns:
* Sync/Async Disk IO
* Sync/Async Network IO
* In-memory/mmapped parquet files
* Interleaving row group decode with fetching the next row group
## Footer Decode
Introduce functions to assist parsing the parquet metadata
```
/// Parses the 8-byte parquet footer and returns the length of the metadata section
fn parse_footer(footer: [u8; 8]) -> Result<usize> {}

/// Parses the metadata payload
fn parse_metadata(metadata: &[u8]) -> Result<ParquetMetaData> {}
```
This will allow callers to obtain `ParquetMetaData` regardless of how they choose to fetch the corresponding bytes.
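As a sketch of intended usage, a caller could drive these proposed functions with whatever IO it prefers; here ordinary blocking file IO, but the same two calls would work against a pair of range requests to object storage (error handling simplified):

```
use std::error::Error;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

use parquet::file::metadata::ParquetMetaData;

// A sketch only: `parse_footer` and `parse_metadata` are the functions
// proposed above
fn read_metadata(path: &str) -> Result<ParquetMetaData, Box<dyn Error>> {
    let mut file = File::open(path)?;
    let file_len = file.metadata()?.len();

    // The file ends with an 8-byte footer: metadata length + magic
    let mut footer = [0u8; 8];
    file.seek(SeekFrom::Start(file_len - 8))?;
    file.read_exact(&mut footer)?;
    let metadata_len = parse_footer(footer)? as u64;

    // The metadata payload immediately precedes the footer
    let mut metadata = vec![0u8; metadata_len as usize];
    file.seek(SeekFrom::Start(file_len - 8 - metadata_len))?;
    file.read_exact(&mut metadata)?;

    Ok(parse_metadata(&metadata)?)
}
```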
## ScanBuilder / Scan
Next introduce a `ScanBuilder` and accompanying `Scan`.
```
/// Build a [`Scan`]
///
/// Eventually this will support predicate pushdown (#1191)
pub struct ScanBuilder {}

impl ScanBuilder {
    pub fn new(metadata: Arc<ParquetMetaData>) -> Self {}
    pub fn with_projection(self, projection: Vec<usize>) -> Self {}
    pub fn with_row_groups(self, groups: Vec<usize>) -> Self {}
    pub fn with_range(self, range: Range<usize>) -> Self {}
    pub fn build(self) -> Scan {}
}

pub struct Scan {}

impl Scan {
    /// Returns a list of byte ranges needed
    pub fn ranges(&self) -> &[Range<usize>] {}

    /// Perform the scan returning a [`ParquetRecordBatchReader`]
    pub fn execute<R: ChunkReader>(self, reader: R) -> Result<ParquetRecordBatchReader> {}
}
```
Where `ParquetRecordBatchReader` is the same type returned by the current
`ParquetFileArrowReader::get_record_reader`, and is just an
`Iterator<Item=ArrowResult<RecordBatch>>` with a `Schema`.
*This design will only support the arrow use-case, but I couldn't see an easy way to add this at a lower level without introducing strange inconsistencies when not scanning the entire file.*
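To show how the decoupling might be used, the following sketch scans a single row group from object storage. `fetch_ranges` and `RangeChunkReader` are hypothetical stand-ins for the caller's transport and buffer assembly and are not part of this proposal; `ScanBuilder` / `Scan` are the types proposed above:

```
use std::error::Error;
use std::sync::Arc;

use parquet::file::metadata::ParquetMetaData;

// A sketch only: the IO layer (sync or async, coalescing or not) is entirely
// up to the caller
fn read_row_group(
    metadata: Arc<ParquetMetaData>,
    row_group: usize,
    columns: Vec<usize>,
) -> Result<(), Box<dyn Error>> {
    let scan = ScanBuilder::new(metadata)
        .with_row_groups(vec![row_group])
        .with_projection(columns)
        .build();

    // All required byte ranges are known up front, so the caller is free to
    // coalesce them into fewer object store requests, or to prefetch the
    // ranges of the next row group while decoding the current one
    let ranges = scan.ranges().to_vec();
    let buffers = fetch_ranges(&ranges)?; // hypothetical IO helper

    // Assemble the fetched bytes into something implementing `ChunkReader`
    let reader = RangeChunkReader::new(ranges, buffers); // hypothetical type
    for batch in scan.execute(reader)? {
        println!("decoded {} rows", batch?.num_rows());
    }
    Ok(())
}
```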
**Describe alternatives you've considered**
#1154 added an async reader that uses the `AsyncRead` and `AsyncSeek` traits
to read individual column chunks into memory from an async source. This is the
approach taken by arrow2, with its
[range_reader](https://docs.rs/range-reader/latest/range_reader/) abstraction.
This was not found to perform particularly well (#1473).
#1473 proposed an async reader with prefetch functionality, and was also
suggested by @alamb in
https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1097902378.
This is similar to the new FSDataInputStream [vectored IO
API](https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1100069800)
in the Hadoop ecosystem. This was implemented in #1509 and found to perform
better, but still represented a non-trivial performance regression on local
files.
**Additional Context**
The motivating discussion for this issue can be found in https://github.com/apache/arrow-datafusion/issues/2205
@mateuszkj clearly documented the limitations of the current API in https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53