etseidl opened a new pull request, #6392:
URL: https://github.com/apache/arrow-rs/pull/6392
# Which issue does this PR close?
Relates to #6002
# Rationale for this change
This is an attempt to consolidate Parquet footer/page index reading/parsing
into a single place.
# What changes are included in this PR?
The new `ParquetMetaDataReader` basically takes the code in
`parquet/src/file/footer.rs` and `parquet/src/arrow/async_reader/metadata.rs`
and mashes them together into a single API. Using this, the
`read_metadata_from_file` call from #6081 would become:
```rust
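use std::path::Path;
use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataReader};

fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
    // `reader` must be mutable: parsing stores the decoded metadata in it
    let mut reader = ParquetMetaDataReader::new()
        .with_page_indexes(true);
    let file = std::fs::File::open(file).unwrap();
    reader.try_parse(&file).unwrap();
    // return ParquetMetaData with page indexes populated
    reader.finish().unwrap()
}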
```
Also included are two async functions `try_load()` and
`try_load_from_tail()`. The former is a combination of `MetadataLoader::load()`
and `MetadataLoader::load_page_index()`. The latter is an attempt at addressing
the issue of loading the footer when the file size is not known, so it requires
being able to seek from the end of the file.
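
To make the "seek from the end" requirement concrete, here is a minimal synchronous sketch using only the standard library. It is not the code in this PR: `metadata_len_from_tail` and `read_metadata_bytes_from_tail` are hypothetical helper names, and the only thing assumed is the standard unencrypted Parquet layout, where the file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Fixed Parquet footer: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical helper (not part of the parquet crate): reads the fixed
/// 8-byte footer by seeking relative to the end of the source, so the
/// total file size never has to be known up front.
fn metadata_len_from_tail<R: Read + Seek>(src: &mut R) -> io::Result<u32> {
    src.seek(SeekFrom::End(-(FOOTER_SIZE as i64)))?;
    let mut footer = [0u8; FOOTER_SIZE];
    src.read_exact(&mut footer)?;
    if footer[4..] != PARQUET_MAGIC[..] {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "missing PAR1 magic at end of file",
        ));
    }
    Ok(u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]))
}

/// Hypothetical helper: returns the raw (still Thrift-encoded) metadata
/// bytes that sit immediately before the 8-byte footer.
fn read_metadata_bytes_from_tail<R: Read + Seek>(src: &mut R) -> io::Result<Vec<u8>> {
    let len = metadata_len_from_tail(src)? as usize;
    // Second seek from the end: skip back over the footer plus the metadata.
    src.seek(SeekFrom::End(-((FOOTER_SIZE + len) as i64)))?;
    let mut buf = vec![0u8; len];
    src.read_exact(&mut buf)?;
    Ok(buf)
}
```

Both reads are addressed relative to the end of the source, which is why a fetcher that can only serve absolute byte ranges needs the file size first, while one that supports suffix reads does not.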
This implementation is very rough, with too little safety checking and
documentation. At this point I'm hoping for feedback on the approach. If this
seems at all useful, then a path forward would be to first add
`ParquetMetaDataReader` alone, and then in subsequent PRs begin to use it as a
replacement for other functions which could then be deprecated. The idea is to
get as much in without breaking changes, and then introduce the breaking
changes once 54.0.0 is open.
# Are there any user-facing changes?
Eventually, yes.