alamb commented on code in PR #6000:
URL: https://github.com/apache/arrow-rs/pull/6000#discussion_r1673773635
##########
parquet/src/file/footer.rs:
##########
@@ -112,6 +109,46 @@ pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
Ok(ParquetMetaData::new(file_metadata, row_groups))
}
+pub struct ParquetMetadataEncoder {}
+
+impl ParquetMetadataEncoder {
+ /// Creates a new ParquetMetadataEncoder.
+ pub fn new() -> Self {
+ Self {}
+ }
+
+ /// Encodes the [`ParquetMetaData`] to bytes.
+ /// The format of the returned bytes is the Thrift compact binary protocol, as
+ /// specified by the [Parquet Spec].
+ ///
+ /// [Parquet Spec]: https://github.com/apache/parquet-format#metadata
+ pub fn encode_metadata<W: std::io::Write>(&self, metadata: &ParquetMetaData, out: W) -> Result<()> {
+ let column_orders = encode_column_orders(metadata.file_metadata().column_orders());
+ let schema = types::to_thrift(&metadata.file_metadata().schema().clone())?;
+
+ let t_file_metadata = TFileMetaData {
Review Comment:
I noticed that this is not quite the same code as the actual writer uses
(specifically, the column order is handled differently), so I worry it will be
inconsistent with, or drift over time from, the actual writer:
https://github.com/apache/arrow-rs/blob/22e0b4432c9838f2536284015271d3de9a165135/parquet/src/file/writer.rs#L352-L375
So what I suggest we do here is change writer.rs to use the
`ParquetMetadataEncoder` and move the code from there into this function.
That would be a bit more involved, but I think it would set us up nicely so
that metadata encoding remains consistent.
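A minimal sketch of the shape of that refactor, to make the "single source of
truth" idea concrete. Everything here is a hypothetical stand-in:
`SketchMetadata`, `SketchMetadataEncoder`, and `SketchFileWriter` are not types
in the parquet crate, and the byte layout is a toy format, not the Thrift
compact protocol. The only point is that the writer delegates to the encoder
instead of carrying its own copy of the encoding logic.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for file-level metadata (not the real `ParquetMetaData`).
#[derive(Debug, Clone, PartialEq)]
pub struct SketchMetadata {
    pub num_rows: i64,
    pub column_orders: Vec<u8>,
}

/// Hypothetical encoder: the only place that knows the byte layout.
pub struct SketchMetadataEncoder;

impl SketchMetadataEncoder {
    /// Toy layout (an assumption for illustration, NOT Thrift):
    /// 8-byte little-endian row count, 4-byte little-endian length,
    /// then the raw column order bytes.
    pub fn encode_metadata<W: Write>(
        &self,
        metadata: &SketchMetadata,
        mut out: W,
    ) -> io::Result<()> {
        out.write_all(&metadata.num_rows.to_le_bytes())?;
        out.write_all(&(metadata.column_orders.len() as u32).to_le_bytes())?;
        out.write_all(&metadata.column_orders)?;
        Ok(())
    }
}

/// Hypothetical file writer that owns an output sink.
pub struct SketchFileWriter<W: Write> {
    sink: W,
}

impl<W: Write> SketchFileWriter<W> {
    pub fn new(sink: W) -> Self {
        Self { sink }
    }

    /// The writer holds no encoding logic of its own; it calls the encoder,
    /// so the two code paths cannot produce different bytes.
    pub fn write_footer(&mut self, metadata: &SketchMetadata) -> io::Result<()> {
        SketchMetadataEncoder.encode_metadata(metadata, &mut self.sink)
    }

    /// Hands back the sink so callers can inspect what was written.
    pub fn into_inner(self) -> W {
        self.sink
    }
}
```

With this layout, encoding through `SketchFileWriter::write_footer` and calling
`SketchMetadataEncoder::encode_metadata` directly write identical bytes by
construction, which is the property the suggestion above is after.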
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -86,7 +86,7 @@ pub type ParquetOffsetIndex = Vec<Vec<Vec<PageLocation>>>;
///
/// [`parquet.thrift`]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
/// [`parse_metadata`]: crate::file::footer::parse_metadata
-#[derive(Debug, Clone)]
+#[derive(Debug, Clone, PartialEq)]
Review Comment:
👍