kylebarron opened a new issue, #5582: URL: https://github.com/apache/arrow-rs/issues/5582
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

In some multi-file Parquet dataset layouts, there is a sidecar metadata file, canonically named `_metadata`, which holds only the metadata for each row group in the dataset. See https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files:

> Some processing frameworks such as Spark or Dask (optionally) use `_metadata` and `_common_metadata` files with partitioned datasets.
>
> Those files include information about the schema of the full dataset (for `_common_metadata`) and potentially all row group metadata of all files in the partitioned dataset as well (for `_metadata`). The actual files are metadata-only Parquet files. Note this is not a Parquet standard, but a convention set in practice by those frameworks.
>
> Using those files can give a more efficient creation of a parquet Dataset, since it can use the stored schema and file paths of all row groups, instead of inferring the schema and crawling the directories for all Parquet files (this is especially the case for filesystems where accessing files is expensive).

I'd like to be able to use such metadata files to accelerate reading of Parquet datasets in [geoarrow-rs](https://github.com/geoarrow/geoarrow-rs). Mimicking pyarrow's API, I currently have a [`ParquetFile` struct](https://github.com/geoarrow/geoarrow-rs/blob/8a9385eeeebe434ab49efbae830666e3a3997f6a/src/io/parquet/reader/async.rs#L69-L74), which is backed by a single `R: AsyncFileReader`, as well as a [`ParquetDataset` struct](https://github.com/geoarrow/geoarrow-rs/blob/8a9385eeeebe434ab49efbae830666e3a3997f6a/src/io/parquet/reader/async.rs#L263-L267), which is backed by `Vec<ParquetFile<R>>, where R: AsyncFileReader`. This allows concurrent async reads across multiple files. I'd like to have a `ParquetDataset::from_metadata` method, which constructs itself from a `_metadata` file.
But to do that I need to be able to construct `ArrowReaderMetadata` for each underlying file. This is entirely possible with existing APIs, except that `ArrowReaderMetadata::try_new` has visibility `pub(crate)`.

**Describe the solution you'd like**

Give `ArrowReaderMetadata::try_new` full public visibility.

**Describe alternatives you've considered**

Unsure of alternatives.

**Additional context**
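
To illustrate the per-file step described above, here is a minimal sketch of splitting the combined row group list of a `_metadata` file into one group per data file. It assumes each row group records the path of the data file it belongs to (the Parquet format has an optional `file_path` on column chunks, which writers of `_metadata` files are expected to set). `RowGroupEntry`, `FileMetadata`, and `split_by_file` are hypothetical stand-ins written for this example, not types or functions from the `parquet` crate.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for one row group entry in a `_metadata` file.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupEntry {
    /// Relative path of the data file that holds this row group (assumed to be set).
    file_path: String,
    /// Number of rows in this row group.
    num_rows: u64,
}

/// Hypothetical per-file view: the row groups that belong to a single data file.
#[derive(Debug, Default, PartialEq)]
struct FileMetadata {
    row_groups: Vec<RowGroupEntry>,
}

impl FileMetadata {
    /// Total number of rows across all row groups of this file.
    fn num_rows(&self) -> u64 {
        self.row_groups.iter().map(|rg| rg.num_rows).sum()
    }
}

/// Group the combined row group list by data file, keeping the original
/// row group order within each file.
fn split_by_file(row_groups: Vec<RowGroupEntry>) -> BTreeMap<String, FileMetadata> {
    let mut files: BTreeMap<String, FileMetadata> = BTreeMap::new();
    for rg in row_groups {
        files
            .entry(rg.file_path.clone())
            .or_default()
            .row_groups
            .push(rg);
    }
    files
}
```

In a real implementation, each group would presumably be turned into its own file-level metadata object and then into an `ArrowReaderMetadata`, which is the step that needs `ArrowReaderMetadata::try_new` to be public.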
