alamb commented on code in PR #6474:
URL: https://github.com/apache/arrow-rs/pull/6474#discussion_r1779451844
##########
parquet/src/arrow/async_reader/mod.rs:
##########
@@ -212,16 +212,18 @@ impl ArrowReaderMetadata {
input: &mut T,
options: ArrowReaderOptions,
) -> Result<Self> {
+ // TODO: this is all rather awkward. It would be nice if
AsyncFileReader::get_metadata
Review Comment:
Not only is this akward, it is also a common source of confusion / bugs
(namely that when someone supplies the ParquetMetaData to the arrow reader
options to avoid a second object store request, if often turns out the second
fetch happens anyways to read the page index (thus obviating the attempt at
optimization)
To avoid this they need to ensure when they read the metadata in the first
place, they also read the page index.
This is (in a roundabout way) what is happening to @progval in
https://github.com/apache/datafusion/pull/12593
I will try and file a ticket explaining the issue
##########
parquet/src/arrow/async_reader/store.rs:
##########
@@ -124,15 +124,13 @@ impl AsyncFileReader for ParquetObjectReader {
fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
Box::pin(async move {
- let preload_column_index = self.preload_column_index;
- let preload_offset_index = self.preload_offset_index;
- let file_size = self.meta.size;
- let prefetch = self.metadata_size_hint;
- let mut loader = MetadataLoader::load(self, file_size,
prefetch).await?;
- loader
- .load_page_index(preload_column_index, preload_offset_index)
+ let metadata = ParquetMetaDataReader::new()
Review Comment:

##########
parquet/src/arrow/async_reader/metadata.rs:
##########
@@ -235,11 +249,15 @@ where
F: FnMut(Range<usize>) -> Fut + Send,
Fut: Future<Output = Result<Bytes>> + Send,
{
- let fetch = MetadataFetchFn(fetch);
- let loader = MetadataLoader::load(fetch, file_size, prefetch).await?;
- Ok(loader.finish())
+ let mut fetch = MetadataFetchFn(fetch);
+ ParquetMetaDataReader::new()
+ .with_prefetch_hint(prefetch)
+ .load_and_finish(&mut fetch, file_size)
+ .await
}
+// these tests are all replicated in parquet::file::metadata::reader
Review Comment:
👍
##########
parquet/src/file/metadata/reader.rs:
##########
@@ -299,11 +299,14 @@ impl ParquetMetaDataReader {
/// See [`Self::with_prefetch_hint`] for a discussion of how to reduce the
number of fetches
/// performed by this function.
#[cfg(feature = "async")]
- pub async fn load_and_finish<F: MetadataFetch>(
+ pub async fn load_and_finish<F>(
Review Comment:
since this API is new and has not yet been released, this is not a breaking
API change
##########
parquet/src/file/metadata/reader.rs:
##########
@@ -299,11 +299,14 @@ impl ParquetMetaDataReader {
/// See [`Self::with_prefetch_hint`] for a discussion of how to reduce the
number of fetches
/// performed by this function.
#[cfg(feature = "async")]
- pub async fn load_and_finish<F: MetadataFetch>(
+ pub async fn load_and_finish<F>(
mut self,
- fetch: F,
+ fetch: &mut F,
file_size: usize,
- ) -> Result<ParquetMetaData> {
+ ) -> Result<ParquetMetaData>
+ where
+ for<'a> &'a mut F: MetadataFetch,
Review Comment:
> cc @alamb. I did not like that the load... functions in
ParquetMetaDataReader took ownership of fetch.
What don't you like about it? Is the idea that `MetadataFetch` actually has
mutable state so that calling `fetch` would change its state?
I think in general trying to get async functions to be happy with references
is often much harder than with owned objects.
So I guess I think we should keep the API as owned to make working with
these APIs simpler, unless there is some compelling reason for the change
(better performance, etc)
##########
parquet/src/file/metadata/reader.rs:
##########
@@ -299,11 +299,14 @@ impl ParquetMetaDataReader {
/// See [`Self::with_prefetch_hint`] for a discussion of how to reduce the
number of fetches
/// performed by this function.
#[cfg(feature = "async")]
- pub async fn load_and_finish<F: MetadataFetch>(
+ pub async fn load_and_finish<F>(
Review Comment:
(Specifically `ParquetMetaDataReader` has not been released yet)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]