alamb commented on code in PR #8340:
URL: https://github.com/apache/arrow-rs/pull/8340#discussion_r2382462026


##########
parquet/src/file/metadata/reader.rs:
##########
@@ -492,15 +490,27 @@ impl ParquetMetaDataReader {
     async fn load_page_index_with_remainder<F: MetadataFetch>(
         &mut self,
         mut fetch: F,
-        remainder: Option<(usize, bytes::Bytes)>,
+        remainder: Option<(usize, Bytes)>,
     ) -> Result<()> {
-        // Get bounds needed for page indexes (if any are present in the file).
-        let Some(range) = self.range_for_page_index() else {
-            return Ok(());
+        let Some(metadata) = self.metadata.take() else {

Review Comment:
   I had hoped we would be able to remove more of the logic from 
`ParquetMetadataReader` but I couldn't figure out how to do so given the 
somewhat complex way it supports reading metadata even when the file length 
isn't known



##########
parquet/src/file/metadata/parser.rs:
##########
@@ -0,0 +1,555 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Internal metadata parsing routines

Review Comment:
   Most of this module is just moving the the parsing functions (that take 
`&[u8]` and convert them to the relevant rust structure), I made no changes to 
the actual logic



##########
parquet/src/file/metadata/reader.rs:
##########
@@ -79,7 +74,7 @@ pub struct ParquetMetaDataReader {
     // `self.parse_metadata` is called.
     metadata_size: Option<usize>,
     #[cfg(feature = "encryption")]
-    file_decryption_properties: Option<FileDecryptionProperties>,
+    file_decryption_properties: 
Option<std::sync::Arc<FileDecryptionProperties>>,

Review Comment:
   The `FileDecryptionProperties` is currently copied, which is unfortunately. 
   
   As a follow on PR, I plan to update the options elsewhere to use a 
`Arc<FileDecryptonProperties>` to avoid copies



##########
parquet/src/file/metadata/push_decoder.rs:
##########
@@ -211,16 +232,38 @@ impl ParquetMetaDataPushDecoder {
             )));
         };
 
-        let metadata_reader =
-            
ParquetMetaDataReader::new().with_page_index_policy(PageIndexPolicy::Optional);
-
         Ok(Self {
-            done: false,
-            metadata_reader,
+            state: DecodeState::ReadingFooter,
+            column_index_policy: PageIndexPolicy::Optional,
+            offset_index_policy: PageIndexPolicy::Optional,
             buffers: crate::util::push_buffers::PushBuffers::new(file_len),
+            metadata_parser: MetadataParser::new(),
         })
     }
 
+    /// Begin decoding from the given footer tail.
+    pub(crate) fn try_new_with_footer_tail(
+        file_len: u64,
+        footer_tail: FooterTail,
+    ) -> Result<Self, ParquetError> {
+        let mut new_self = Self::try_new(file_len)?;
+        new_self.state = DecodeState::ReadingMetadata(footer_tail);
+        Ok(new_self)
+    }
+
+    /// Create a decoder with the given `ParquetMetaData` already known.
+    ///
+    /// This can be used to parse and populate the page index structures

Review Comment:
   I think this is now a nice API to load/decode PageIndexes -- provide an 
existing ParquetMetadata and then this decoder figures out what bytes are 
needed and parses them. If we ever want to extend ParquetMetadata to include, 
for example, BloomFilters, we could use the same basic idea



##########
parquet/src/file/metadata/parser.rs:
##########
@@ -43,6 +43,86 @@ use crate::encryption::{
 #[cfg(feature = "encryption")]
 use crate::format::EncryptionAlgorithm;
 
+/// Helper struct for metadata parsing
+///
+/// This structure parses thrift-encoded bytes into the correct Rust structs,
+/// such as [`ParquetMetaData`], handling decryption if necessary.
+//
+// Note this structure is used to minimize the number of
+// places need to add `#[cfg(feature = "encryption")]` checks.
+pub(crate) use inner::MetadataParser;
+
+#[cfg(feature = "encryption")]
+mod inner {
+    use super::*;
+    use crate::encryption::decrypt::FileDecryptionProperties;
+    use crate::errors::Result;
+
+    /// API for decoding metadata that may be encrypted
+    #[derive(Debug, Default)]
+    pub(crate) struct MetadataParser {

Review Comment:
   I am thinking we can eventually use this structure as the place to hang more 
detailed decoding instructions (like "only decode statistics for column `A`" on)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to