etseidl commented on code in PR #8111:
URL: https://github.com/apache/arrow-rs/pull/8111#discussion_r2274359549


##########
parquet/src/file/metadata/reader.rs:
##########
@@ -1040,6 +1055,107 @@ impl ParquetMetaDataReader {
         Ok(ParquetMetaData::new(file_metadata, row_groups))
     }
 
+    /// create meta data from thrift encoded bytes
+    pub fn decode_file_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
+        let mut prot = ThriftCompactInputProtocol::new(buf);
+
+        // components of the FileMetaData
+        let mut version: Option<i32> = None;
+        let mut schema_descr: Option<Arc<SchemaDescriptor>> = None;
+        let mut num_rows: Option<i64> = None;
+        let mut row_groups: Option<Vec<RowGroup>> = None;
+        let mut key_value_metadata: Option<Vec<KeyValue>> = None;
+        let mut created_by: Option<String> = None;
+        let mut column_orders: Option<Vec<ColumnOrder>> = None;
+
+        // begin decoding to intermediates
+        prot.read_struct_begin()?;
+        loop {
+            let field_ident = prot.read_field_begin()?;
+            if field_ident.field_type == FieldType::Stop {
+                break;
+            }
+            let prot = &mut prot;
+
+            match field_ident.id {
+                1 => {

Review Comment:
   I've been punting on that for now...I have tried to simplify where I can 
(such as hiding the complexity of reading lists). The issue here is that the 
thrift `FileMetaData` contains the row group metadata, while on the crate side 
the `FileMetaData` inside `ParquetMetaData` holds the schema and the row group 
metadata is kept separately. Similarly, thrift has a `ColumnChunk` that 
contains a `ColumnMetaData`, while we collapse those two structures into a 
single `ColumnChunkMetaData`. I can go back to decoding to a private 
`FileMetaData` that is then pulled apart (as I've wound up doing for 
`RowGroupMetaData`; see the second sketch below), but I was trying to skip 
that step, thinking it would be faster. (For instance, processing the schema 
is quite expensive, so rather than allocating a vector of schema elements, 
parsing them all, and then translating to `TypePtr`, here I pull the schema 
elements one at a time. That did cut down on the processing time, but by 
enough to justify the complexity? I'll have to revisit that.)
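   
   To make "pull the schema elements one at a time" concrete, here is a 
minimal, self-contained sketch using toy types only (not the PR's actual 
code): `RawSchemaElement` and `SchemaNode` are hypothetical stand-ins for the 
thrift `SchemaElement` and the crate's `TypePtr` tree, and the flattened 
element stream is turned into a tree without first materializing a `Vec` of 
elements.
   
   ```rust
   struct RawSchemaElement {
       name: String,
       num_children: usize,
   }
   
   #[derive(Debug)]
   enum SchemaNode {
       Group { name: String, children: Vec<SchemaNode> },
       Leaf { name: String },
   }
   
   /// Build the schema tree by pulling one element at a time from the decoder,
   /// avoiding an intermediate `Vec<RawSchemaElement>` allocation.
   fn build_schema<I>(elements: &mut I) -> Option<SchemaNode>
   where
       I: Iterator<Item = RawSchemaElement>,
   {
       let elem = elements.next()?;
       if elem.num_children == 0 {
           return Some(SchemaNode::Leaf { name: elem.name });
       }
       let mut children = Vec::with_capacity(elem.num_children);
       for _ in 0..elem.num_children {
           children.push(build_schema(elements)?);
       }
       Some(SchemaNode::Group { name: elem.name, children })
   }
   
   fn main() {
       // flattened depth-first schema: a root group with two leaf columns
       let flat = vec![
           RawSchemaElement { name: "schema".into(), num_children: 2 },
           RawSchemaElement { name: "a".into(), num_children: 0 },
           RawSchemaElement { name: "b".into(), num_children: 0 },
       ];
       let schema = build_schema(&mut flat.into_iter()).unwrap();
       println!("{schema:?}");
   }
   ```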
   
   Back to the original question...hand coding is going to have some warts 
that can't be avoided. There may be a way to pretty it up some in the places 
where we need custom parsers. Suggestions welcome :D
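   
   And here is the second sketch, showing the shape of the "decode to a 
private intermediate, then pull it apart" alternative mentioned above (the 
route already taken for `RowGroupMetaData`). All type names are made up for 
illustration; the real thrift and crate-side structs carry far more fields.
   
   ```rust
   /// Intermediate mirroring the thrift layout: row groups live inside the
   /// file metadata.
   struct RawFileMetaData {
       version: i32,
       num_rows: i64,
       row_groups: Vec<RawRowGroup>,
   }
   
   struct RawRowGroup {
       num_rows: i64,
   }
   
   /// Crate-side layout: file-level metadata and row group metadata are held
   /// separately.
   #[derive(Debug)]
   struct FileInfo {
       version: i32,
       num_rows: i64,
   }
   
   #[derive(Debug)]
   struct RowGroupInfo {
       num_rows: i64,
   }
   
   /// Pull the intermediate apart into the crate-side shape.
   fn split(raw: RawFileMetaData) -> (FileInfo, Vec<RowGroupInfo>) {
       let row_groups = raw
           .row_groups
           .into_iter()
           .map(|rg| RowGroupInfo { num_rows: rg.num_rows })
           .collect();
       (
           FileInfo { version: raw.version, num_rows: raw.num_rows },
           row_groups,
       )
   }
   
   fn main() {
       let raw = RawFileMetaData {
           version: 2,
           num_rows: 100,
           row_groups: vec![RawRowGroup { num_rows: 50 }, RawRowGroup { num_rows: 50 }],
       };
       let (file_info, row_groups) = split(raw);
       println!("{file_info:?} {row_groups:?}");
   }
   ```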



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
