alamb commented on code in PR #8376:
URL: https://github.com/apache/arrow-rs/pull/8376#discussion_r2364484536
##########
parquet/src/column/page.rs:
##########
@@ -252,22 +254,22 @@ impl CompressedPage {
ref statistics,
..
} => {
- let data_page_header_v2 = crate::format::DataPageHeaderV2 {
+ let data_page_header_v2 = DataPageHeaderV2 {
num_values: num_values as i32,
num_nulls: num_nulls as i32,
num_rows: num_rows as i32,
- encoding: encoding.into(),
+ encoding,
definition_levels_byte_length: def_levels_byte_len as i32,
repetition_levels_byte_length: rep_levels_byte_len as i32,
is_compressed: Some(is_compressed),
-                    statistics: crate::file::statistics::to_thrift(statistics.as_ref()),
+                    statistics: page_stats_to_thrift(statistics.as_ref()),
Review Comment:
I personally find this much easier to understand / read now -- and it is
nice to see us being able to avoid all the `into()` calls
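To illustrate the point above: when the in-memory header type stores the same enum that gets serialized, the `.into()` bridging layer between two parallel enum definitions disappears. This is a minimal standalone sketch; `Encoding` and `DataPageHeaderV2` here are stubs mirroring the shapes in the PR, not the crate's actual definitions.

```rust
// Stub of a shared encoding enum; previously the writer-side enum had to be
// converted into a separate thrift-generated enum via `.into()`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding {
    Plain,
    RleDictionary,
}

// Stub of the page header: the encoding is stored directly, so constructing
// the header needs no conversion step.
struct DataPageHeaderV2 {
    num_values: i32,
    encoding: Encoding,
}

fn main() {
    let header = DataPageHeaderV2 {
        num_values: 100,
        encoding: Encoding::RleDictionary, // no `.into()` needed
    };
    assert_eq!(header.num_values, 100);
    assert_eq!(header.encoding, Encoding::RleDictionary);
    println!("ok");
}
```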
##########
parquet/src/file/metadata/thrift_gen.rs:
##########
@@ -909,6 +819,341 @@ impl<'a, R: ThriftCompactInputProtocol<'a>> ReadThrift<'a, R> for ParquetMetaDat
}
}
+thrift_struct!(
+ pub(crate) struct IndexPageHeader {}
+);
+
+thrift_struct!(
+pub(crate) struct DictionaryPageHeader {
+ /// Number of values in the dictionary
+ 1: required i32 num_values;
+
+ /// Encoding using this dictionary page
+ 2: required Encoding encoding
+
+ /// If true, the entries in the dictionary are sorted in ascending order
+ 3: optional bool is_sorted;
+}
+);
+
+// Statistics for the page header. This is separate because of the differing lifetime requirements
+// for page handling vs column chunk. Once we start writing column chunks this might need to be
Review Comment:
I don't understand this comment -- page statistics are part of the
PageIndex, right? Or maybe I have my structures confused
##########
parquet/src/file/metadata/thrift_gen.rs:
##########
@@ -214,6 +117,13 @@ pub(crate) struct FileCryptoMetaData<'a> {
}
);
+// expose for benchmarking
+pub(crate) fn bench_file_metadata(bytes: &bytes::Bytes) {
Review Comment:
Why can't it be in the benchmark function itself? Maybe we should just
benchmark end-to-end metadata decoding?
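For context on what an end-to-end timing loop could look like: this is a rough stdlib-only sketch, not the crate's actual bench harness (which would presumably use criterion), and `decode_metadata` here is a stand-in for whatever decoding entry point the crate exposes.

```rust
use std::time::Instant;

// Stand-in for real thrift metadata decoding: just walks the buffer so the
// loop below has deterministic work to measure.
fn decode_metadata(bytes: &[u8]) -> usize {
    bytes.iter().map(|&b| b as usize).sum()
}

fn main() {
    let payload = vec![1u8; 1024];
    let start = Instant::now();
    let mut sink = 0usize; // accumulate so the work is not optimized away
    for _ in 0..1_000 {
        sink = sink.wrapping_add(decode_metadata(&payload));
    }
    println!("1000 iterations: {:?} (checksum {})", start.elapsed(), sink);
}
```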
##########
parquet/src/file/metadata/thrift_gen.rs:
##########
@@ -909,6 +819,341 @@ impl<'a, R: ThriftCompactInputProtocol<'a>> ReadThrift<'a, R> for ParquetMetaDat
}
}
+thrift_struct!(
+ pub(crate) struct IndexPageHeader {}
+);
+
+thrift_struct!(
+pub(crate) struct DictionaryPageHeader {
+ /// Number of values in the dictionary
+ 1: required i32 num_values;
+
+ /// Encoding using this dictionary page
+ 2: required Encoding encoding
+
+ /// If true, the entries in the dictionary are sorted in ascending order
+ 3: optional bool is_sorted;
+}
+);
+
+// Statistics for the page header. This is separate because of the differing lifetime requirements
+// for page handling vs column chunk. Once we start writing column chunks this might need to be
+// revisited.
+thrift_struct!(
+pub(crate) struct PageStatistics {
+ 1: optional binary max;
+ 2: optional binary min;
+ 3: optional i64 null_count;
+ 4: optional i64 distinct_count;
+ 5: optional binary max_value;
+ 6: optional binary min_value;
+ 7: optional bool is_max_value_exact;
+ 8: optional bool is_min_value_exact;
+}
+);
+
+thrift_struct!(
+pub(crate) struct DataPageHeader {
+ 1: required i32 num_values
+ 2: required Encoding encoding
+ 3: required Encoding definition_level_encoding;
+ 4: required Encoding repetition_level_encoding;
+ 5: optional PageStatistics statistics;
+}
+);
+
+impl DataPageHeader {
+ // reader that skips decoding page statistics
Review Comment:
🚀
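The idea behind a reader that skips page statistics can be sketched in isolation: on reaching the statistics field, the decoder advances past its encoded bytes rather than materializing them. This is an illustrative toy (a length-prefixed layout standing in for the thrift compact protocol), not the PR's implementation.

```rust
// Minimal cursor over a borrowed buffer.
struct Cursor<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn read_u8(&mut self) -> u8 {
        let b = self.buf[self.pos];
        self.pos += 1;
        b
    }
    // Advance without copying or decoding -- the core of "skip statistics".
    fn skip_bytes(&mut self, n: usize) {
        self.pos += n;
    }
}

fn main() {
    // Toy layout: [field id][len][payload bytes...][next field id].
    // Field id 5 plays the role of the statistics field.
    let buf = [5u8, 3, 0xAA, 0xBB, 0xCC, 9];
    let mut cur = Cursor { buf: &buf, pos: 0 };
    let field = cur.read_u8();
    if field == 5 {
        let len = cur.read_u8() as usize;
        cur.skip_bytes(len); // statistics bytes skipped, never decoded
    }
    assert_eq!(cur.read_u8(), 9); // decoder lands cleanly on the next field
    println!("ok");
}
```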
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]