etseidl commented on code in PR #8111:
URL: https://github.com/apache/arrow-rs/pull/8111#discussion_r2274359549
##########
parquet/src/file/metadata/reader.rs:
##########
@@ -1040,6 +1055,107 @@ impl ParquetMetaDataReader {
         Ok(ParquetMetaData::new(file_metadata, row_groups))
     }
+    /// create meta data from thrift encoded bytes
+    pub fn decode_file_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
+        let mut prot = ThriftCompactInputProtocol::new(buf);
+
+        // components of the FileMetaData
+        let mut version: Option<i32> = None;
+        let mut schema_descr: Option<Arc<SchemaDescriptor>> = None;
+        let mut num_rows: Option<i64> = None;
+        let mut row_groups: Option<Vec<RowGroup>> = None;
+        let mut key_value_metadata: Option<Vec<KeyValue>> = None;
+        let mut created_by: Option<String> = None;
+        let mut column_orders: Option<Vec<ColumnOrder>> = None;
+
+        // begin decoding to intermediates
+        prot.read_struct_begin()?;
+        loop {
+            let field_ident = prot.read_field_begin()?;
+            if field_ident.field_type == FieldType::Stop {
+                break;
+            }
+            let prot = &mut prot;
+
+            match field_ident.id {
+                1 => {

Review Comment:
   I've been punting on that for now...I have tried to simplify where I can (such as hiding the complexity of reading lists). The issue here is that the thrift `FileMetaData` contains the row group metadata, while in `ParquetMetaData` the crate `FileMetaData` has the schema and the row group metadata is held separately. Similarly, thrift has `ColumnChunk` that contains `ColumnMetaData`, while we collapse those two structures into a single `ColumnChunkMetaData`. I can go back to decoding to a private `FileMetaData` that is then pulled apart (as I've wound up doing for `RowGroupMetaData`), but I was trying to skip that step, thinking it would be faster. (For instance...the processing of the schema is quite expensive, so rather than allocating a vector of schema elements, parsing them, and then translating to `TypePtr`, I pull the schema elements one at a time here. That did cut down on the processing time, but was it by enough to justify the complexity? I'll have to revisit that.)
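To make the shape mismatch concrete, here is a minimal sketch of the pattern in the diff: fields are collected into `Option` intermediates as they arrive, and the crate-side split (file-level metadata on one side, row groups on the other) is only assembled at the end. Everything in it is a hypothetical stand-in: `Field`, `FileMeta`, `RowGroupMeta`, `Meta`, and `decode` are not types or functions from the `parquet` crate, and a plain slice of pre-decoded fields replaces the real Thrift compact protocol reader. The field ids 1 to 4 follow the order of `FileMetaData` in `parquet.thrift`.

```rust
/// Hypothetical stand-in for one decoded Thrift field: (field id, payload).
#[derive(Debug, Clone, PartialEq)]
enum Field {
    I32(i16, i32),
    I64(i16, i64),
    /// Stands in for the schema element list.
    Names(i16, Vec<String>),
    /// Stands in for the row group list (one row count per group).
    RowCounts(i16, Vec<i64>),
}

/// Crate-side file metadata: holds the schema, but not the row groups.
#[derive(Debug, PartialEq)]
struct FileMeta {
    version: i32,
    num_rows: i64,
    schema: Vec<String>,
}

#[derive(Debug, PartialEq)]
struct RowGroupMeta {
    num_rows: i64,
}

/// Crate-side top level: row groups are held next to the file metadata.
#[derive(Debug, PartialEq)]
struct Meta {
    file: FileMeta,
    row_groups: Vec<RowGroupMeta>,
}

fn decode(fields: &[Field]) -> Result<Meta, String> {
    // Intermediates: every field is optional until the whole struct is read.
    let mut version: Option<i32> = None;
    let mut schema: Option<Vec<String>> = None;
    let mut num_rows: Option<i64> = None;
    let mut row_groups: Option<Vec<RowGroupMeta>> = None;

    for field in fields {
        match field {
            Field::I32(1, v) => version = Some(*v),
            Field::Names(2, names) => schema = Some(names.clone()),
            Field::I64(3, v) => num_rows = Some(*v),
            Field::RowCounts(4, counts) => {
                // Convert directly to the crate-side type, skipping a
                // wire-shaped intermediate struct.
                let groups = counts
                    .iter()
                    .map(|&n| RowGroupMeta { num_rows: n })
                    .collect();
                row_groups = Some(groups);
            }
            // Unknown or unexpected fields are ignored in this sketch.
            _ => {}
        }
    }

    // Required fields are checked once, after the loop.
    let file = FileMeta {
        version: version.ok_or("missing version")?,
        num_rows: num_rows.ok_or("missing num_rows")?,
        schema: schema.ok_or("missing schema")?,
    };
    Ok(Meta {
        file,
        row_groups: row_groups.ok_or("missing row_groups")?,
    })
}
```

The alternative mentioned above (decode to a private wire-shaped struct first, then pull it apart) would move the `match` into a derived or generated decoder and leave only the final assembly step hand-written, at the cost of one extra intermediate allocation per list.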
   Back to the original question...hand coding is going to have some warts that can't be avoided. There may be a way to pretty it up some where we need custom parsers. Suggestions welcome :D

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org