Re: [PR] Add `encode_metadata` function to mirror `decode_metadata` and allow ad-hoc encoding of `ParquetMetadata` [arrow-rs]

via GitHub Thu, 11 Jul 2024 03:19:38 -0700


alamb commented on code in PR #6000:
URL: https://github.com/apache/arrow-rs/pull/6000#discussion_r1673773635



##########
parquet/src/file/footer.rs:
##########
@@ -112,6 +109,46 @@ pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
     Ok(ParquetMetaData::new(file_metadata, row_groups))
 }
 
+pub struct ParquetMetadataEncoder {}
+
+impl ParquetMetadataEncoder {
+    /// Creates a new ParquetMetadataEncoder.
+    pub fn new() -> Self {
+        Self {}
+    }
+
+    /// Encodes the [`ParquetMetaData`] to bytes.
+    /// The format of the returned bytes is the Thrift compact binary protocol, as
+    /// specified by the [Parquet Spec].
+    ///
+    /// [Parquet Spec]: https://github.com/apache/parquet-format#metadata
+    pub fn encode_metadata<W: std::io::Write>(&self, metadata: &ParquetMetaData, out: W) -> Result<()> {
+        let column_orders = encode_column_orders(metadata.file_metadata().column_orders());
+        let schema = types::to_thrift(&metadata.file_metadata().schema().clone())?;
+
+        let t_file_metadata = TFileMetaData {

Review Comment:
   I noticed that this is not quite the same code as used in the actual writer 
(specifically, the column order is handled differently), so I worry it would be 
inconsistent with, or drift over time from, the actual writer:
   
   
https://github.com/apache/arrow-rs/blob/22e0b4432c9838f2536284015271d3de9a165135/parquet/src/file/writer.rs#L352-L375
   
   So what I suggest we do here is change `writer.rs` to use the 
`ParquetMetadataEncoder` and move the encoding code from there into this function. 
That would be a bit more involved, but I think it would set us up nicely so that 
metadata encoding remains consistent.
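
A minimal sketch of the shape this refactor could take, to illustrate why a single encoder prevents drift. Everything here is a hypothetical stand-in: `Metadata`, `MetadataEncoder`, `FileWriter`, and the toy byte layout are invented for illustration and are not the arrow-rs types or the Thrift compact protocol. The point is only that the writer and the ad-hoc function both delegate to one encoder.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for `ParquetMetaData`; the real type carries far more.
#[derive(Debug, Clone, PartialEq)]
pub struct Metadata {
    pub num_rows: i64,
    pub column_orders: Vec<String>,
}

/// The one place that decides how metadata is serialized.
pub struct MetadataEncoder;

impl MetadataEncoder {
    /// Toy length-prefixed encoding (an assumption, NOT Thrift compact protocol).
    pub fn encode<W: Write>(&self, metadata: &Metadata, mut out: W) -> io::Result<()> {
        out.write_all(&metadata.num_rows.to_le_bytes())?;
        out.write_all(&(metadata.column_orders.len() as u32).to_le_bytes())?;
        for order in &metadata.column_orders {
            out.write_all(&(order.len() as u32).to_le_bytes())?;
            out.write_all(order.as_bytes())?;
        }
        Ok(())
    }
}

/// Hypothetical file writer: it owns the sink but not the encoding logic.
pub struct FileWriter<W: Write> {
    sink: W,
}

impl<W: Write> FileWriter<W> {
    pub fn new(sink: W) -> Self {
        Self { sink }
    }

    /// Delegates to the shared encoder instead of building the footer itself.
    pub fn write_footer(&mut self, metadata: &Metadata) -> io::Result<()> {
        MetadataEncoder.encode(metadata, &mut self.sink)
    }

    pub fn into_inner(self) -> W {
        self.sink
    }
}

/// Ad-hoc entry point; uses the same encoder, so its output cannot diverge.
pub fn encode_metadata(metadata: &Metadata) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    MetadataEncoder.encode(metadata, &mut buf)?;
    Ok(buf)
}
```

With this layout, a change to how column orders are encoded happens in one function and reaches both callers, which is the consistency property the suggestion is after.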
   
   



##########
parquet/src/file/metadata/mod.rs:
##########
@@ -86,7 +86,7 @@ pub type ParquetOffsetIndex = Vec<Vec<Vec<PageLocation>>>;
 ///
 /// [`parquet.thrift`]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
 /// [`parse_metadata`]: crate::file::footer::parse_metadata
-#[derive(Debug, Clone)]
+#[derive(Debug, Clone, PartialEq)]

Review Comment:
   👍 



