nuno-faria commented on code in PR #17022:
URL: https://github.com/apache/datafusion/pull/17022#discussion_r2262530221


##########
datafusion/datasource-parquet/src/file_format.rs:
##########
@@ -968,19 +973,41 @@ pub async fn fetch_parquet_metadata(
     meta: &ObjectMeta,
     size_hint: Option<usize>,
     #[allow(unused)] decryption_properties: Option<&FileDecryptionProperties>,
-) -> Result<ParquetMetaData> {
+    file_metadata_cache: Option<Arc<dyn FileMetadataCache>>,
+) -> Result<Arc<ParquetMetaData>> {
+    if let Some(cache) = &file_metadata_cache {
+        if let Some(cached_metadata) = cache.get(meta) {
+            if let Some(parquet_metadata) = cached_metadata
+                .as_any()
+                .downcast_ref::<CachedParquetMetaData>()
+            {
+                return Ok(Arc::clone(parquet_metadata.parquet_metadata()));
+            }
+        }
+    }
+
     let file_size = meta.size;
     let fetch = ObjectStoreFetch::new(store, meta);
 
     let reader = ParquetMetaDataReader::new().with_prefetch_hint(size_hint);
-
     #[cfg(feature = "parquet_encryption")]
     let reader = reader.with_decryption_properties(decryption_properties);
 
-    reader
-        .load_and_finish(fetch, file_size)
-        .await
-        .map_err(DataFusionError::from)
+    let metadata = Arc::new(
+        reader
+            .load_and_finish(fetch, file_size)
+            .await
+            .map_err(DataFusionError::from)?,
+    );
+
+    if let Some(cache) = file_metadata_cache {
+        cache.put(
+            meta,
+            Arc::new(CachedParquetMetaData::new(Arc::clone(&metadata))),
+        );
+    }

Review Comment:
   I think there is an issue with the `fetch_parquet_metadata` function. When 
this function is initially called to retrieve the schema (in `fetch_schema`), 
it will read the metadata and update the cache. When the 
`CachedParquetFileReader` tries to get the metadata, it checks that it is 
present in the cache. However, the cached metadata does not contain the page 
index, as it is not retrieved in the `fetch_parquet_metadata`, meaning it will 
have to be read in every query.
   
   So this `fetch_parquet_metadata` needs to retrieve the entire metadata for 
the caching to be effective. On the other hand, after this we will have two 
different places where the entire metadata is read and cached 
(`CachedParquetFileReader` and `fetch_parquet_metadata`), so creating an 
utility function `retrieve_full_parquet_metadata -> Arc<ParquetMetaData>` might 
be useful to avoid duplicate modifications in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to