tustvold opened a new issue, #1605:
URL: https://github.com/apache/arrow-rs/issues/1605
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
`SerializedFileReader` is currently created with a `ChunkReader` which looks
like
```
pub trait ChunkReader: Length + Send + Sync {
    type T: Read + Send;

    /// Get a serially readable slice of the current reader
    ///
    /// This should fail if the slice exceeds the current bounds
    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
}
```
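For illustration, a minimal in-memory implementation of this contract might look like the following sketch (the `InMemoryFile` type is hypothetical and error handling is simplified):

```
use std::io::Cursor;

use parquet::errors::{ParquetError, Result};
use parquet::file::reader::{ChunkReader, Length};

/// Hypothetical in-memory "file", used only to illustrate the contract
struct InMemoryFile(Vec<u8>);

impl Length for InMemoryFile {
    fn len(&self) -> u64 {
        self.0.len() as u64
    }
}

impl ChunkReader for InMemoryFile {
    type T = Cursor<Vec<u8>>;

    fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
        let start = start as usize;
        // Fail if the requested slice exceeds the current bounds
        if start + length > self.0.len() {
            return Err(ParquetError::General(format!(
                "range {}..{} exceeds file length {}",
                start,
                start + length,
                self.0.len()
            )));
        }
        // Copy the slice so the returned reader owns its bytes
        Ok(Cursor::new(self.0[start..start + length].to_vec()))
    }
}
```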
The process for reading a file is then:
* `SerializedFileReader::new` will call `footer::parse_metadata`
* `parse_metadata` will
  * Call `ChunkReader::get_read` with the final 64 KB byte range, and read this into a buffer
  * Determine the footer length
  * Potentially call `ChunkReader::get_read` to read the remainder of the footer, and read this into a buffer
* `SerializedFileReader::get_row_iter` will return a `RowIter` which, for each row group, will
  * Call `SerializedRowGroupReader::new`, which will
    * Call `ChunkReader::get_read` with the byte range of each column chunk
There are two major options for applying this to files in object storage:
1. Fetch the entire file to local disk or memory and pass it to
`SerializedFileReader`
2. Convert `ChunkReader::get_read` to a range request to object storage
The first option is problematic as it cannot use pruning logic to reduce the
amount of data fetched from object storage.
The second option runs into two problems:
1. The interface is not async, and blocking a thread on network IO is not ideal
2. It issues lots of small range requests per file, adding cost and latency
**Describe the solution you'd like**
I would like to decouple the parquet reader entirely from IO concerns,
allowing downstreams complete freedom to decide how they want to handle this.
This will allow the reader to support a wide variety of potential data access patterns:
* Sync/Async Disk IO
* Sync/Async Network IO
* In-memory/mmapped parquet files
* Interleaving row group decode with fetching the next row group
## Footer Decode
Introduce functions to assist parsing the parquet metadata
```
/// Parses the 8-byte parquet footer and returns the length of the metadata section
fn parse_footer(footer: [u8; 8]) -> Result<usize> {}

/// Parses the metadata payload
fn parse_metadata(metadata: &[u8]) -> Result<ParquetMetaData> {}
```
This will allow callers to obtain `ParquetMetaData` regardless of how they choose to fetch the corresponding bytes.
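As a sketch of intended usage, a caller could drive these proposed functions with whatever IO it prefers; here ordinary blocking file IO, but the same two calls would work against a pair of range requests to object storage (error handling simplified):

```
use std::error::Error;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

use parquet::file::metadata::ParquetMetaData;

// A sketch only: `parse_footer` and `parse_metadata` are the functions
// proposed above
fn read_metadata(path: &str) -> Result<ParquetMetaData, Box<dyn Error>> {
    let mut file = File::open(path)?;
    let file_len = file.metadata()?.len();

    // The file ends with an 8-byte footer: metadata length + magic
    let mut footer = [0u8; 8];
    file.seek(SeekFrom::Start(file_len - 8))?;
    file.read_exact(&mut footer)?;
    let metadata_len = parse_footer(footer)? as u64;

    // The metadata payload immediately precedes the footer
    let mut metadata = vec![0u8; metadata_len as usize];
    file.seek(SeekFrom::Start(file_len - 8 - metadata_len))?;
    file.read_exact(&mut metadata)?;

    Ok(parse_metadata(&metadata)?)
}
```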
## ScanBuilder / Scan
Next introduce a `ScanBuilder` and accompanying `Scan`.
```
/// Build a [`Scan`]
///
/// Eventually this will support predicate pushdown (#1191)
pub struct ScanBuilder {}

impl ScanBuilder {
    pub fn new(metadata: Arc<ParquetMetaData>) -> Self {}
    pub fn with_projection(self, projection: Vec<usize>) -> Self {}
    pub fn with_row_groups(self, groups: Vec<usize>) -> Self {}
    pub fn with_range(self, range: Range<usize>) -> Self {}
    pub fn build(self) -> Scan {}
}

pub struct Scan {}

impl Scan {
    /// Returns a list of byte ranges needed
    pub fn ranges(&self) -> &[Range<usize>] {}

    /// Perform the scan returning a [`ParquetRecordBatchReader`]
    pub fn execute<R: ChunkReader>(self, reader: R) -> Result<ParquetRecordBatchReader> {}
}
```
Where `ParquetRecordBatchReader` is the same type returned by the current
`ParquetFileArrowReader::get_record_reader`, and is just an
`Iterator<Item=ArrowResult<RecordBatch>>` with a `Schema`.
*This design will only support the arrow use-case, but I couldn't see an easy way to add this at a lower level without introducing strange inconsistencies when not scanning the entire file.*
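To show how the decoupling might be used, the following sketch scans a single row group from object storage. `fetch_ranges` and `RangeChunkReader` are hypothetical stand-ins for the caller's transport and buffer assembly and are not part of this proposal; `ScanBuilder` / `Scan` are the types proposed above:

```
use std::error::Error;
use std::sync::Arc;

use parquet::file::metadata::ParquetMetaData;

// A sketch only: the IO layer (sync or async, coalescing or not) is entirely
// up to the caller
fn read_row_group(
    metadata: Arc<ParquetMetaData>,
    row_group: usize,
    columns: Vec<usize>,
) -> Result<(), Box<dyn Error>> {
    let scan = ScanBuilder::new(metadata)
        .with_row_groups(vec![row_group])
        .with_projection(columns)
        .build();

    // All required byte ranges are known up front, so the caller is free to
    // coalesce them into fewer object store requests, or to prefetch the
    // ranges of the next row group while decoding the current one
    let ranges = scan.ranges().to_vec();
    let buffers = fetch_ranges(&ranges)?; // hypothetical IO helper

    // Assemble the fetched bytes into something implementing `ChunkReader`
    let reader = RangeChunkReader::new(ranges, buffers); // hypothetical type
    for batch in scan.execute(reader)? {
        println!("decoded {} rows", batch?.num_rows());
    }
    Ok(())
}
```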
**Describe alternatives you've considered**
#1154 added an async reader that uses the `AsyncRead` and `AsyncSeek` traits
to read individual column chunks into memory from an async source. This is the
approach taken by arrow2, with its
[range_reader](https://docs.rs/range-reader/latest/range_reader/) abstraction.
This was not found to perform particularly well (#1473).
#1473 proposed an async reader with prefetch functionality, and was also
suggested by @alamb in
https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1097902378.
This is similar to the new FSDataInputStream [vectored IO
API](https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1100069800)
in the Hadoop ecosystem. This was implemented in #1509 and found to perform
better, but still represented a non-trivial performance regression on local
files.
**Additional Context**
The motivating discussion for this issue can be found in https://github.com/apache/arrow-datafusion/issues/2205
@mateuszkj clearly documented the limitations of the current API in https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53