alamb commented on code in PR #8376:
URL: https://github.com/apache/arrow-rs/pull/8376#discussion_r2364484536
##########
parquet/src/column/page.rs:
##########
@@ -252,22 +254,22 @@ impl CompressedPage {
ref statistics,
..
} => {
- let data_page_header_v2 = crate::format::DataPageHeaderV2 {
+ let data_page_header_v2 = DataPageHeaderV2 {
num_values: num_values as i32,
num_nulls: num_nulls as i32,
num_rows: num_rows as i32,
- encoding: encoding.into(),
+ encoding,
definition_levels_byte_length: def_levels_byte_len as i32,
repetition_levels_byte_length: rep_levels_byte_len as i32,
is_compressed: Some(is_compressed),
-                    statistics: crate::file::statistics::to_thrift(statistics.as_ref()),
+                    statistics: page_stats_to_thrift(statistics.as_ref()),
Review Comment:
I personally find this much easier to understand / read now -- and it is
nice to see us being able to avoid all the `into()` calls
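To illustrate the point above: when the in-memory header type stores the same enum that gets serialized, the `.into()` bridging layer between two parallel enum definitions disappears. This is a minimal standalone sketch; `Encoding` and `DataPageHeaderV2` here are stubs mirroring the shapes in the PR, not the crate's actual definitions.

```rust
// Stub of a shared encoding enum; previously the writer-side enum had to be
// converted into a separate thrift-generated enum via `.into()`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding {
    Plain,
    RleDictionary,
}

// Stub of the page header: the encoding is stored directly, so constructing
// the header needs no conversion step.
struct DataPageHeaderV2 {
    num_values: i32,
    encoding: Encoding,
}

fn main() {
    let header = DataPageHeaderV2 {
        num_values: 100,
        encoding: Encoding::RleDictionary, // no `.into()` needed
    };
    assert_eq!(header.num_values, 100);
    assert_eq!(header.encoding, Encoding::RleDictionary);
    println!("ok");
}
```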
##########
parquet/src/file/metadata/thrift_gen.rs:
##########
@@ -909,6 +819,341 @@ impl<'a, R: ThriftCompactInputProtocol<'a>> ReadThrift<'a, R> for ParquetMetaDat
}
}
+thrift_struct!(
+ pub(crate) struct IndexPageHeader {}
+);
+
+thrift_struct!(
+pub(crate) struct DictionaryPageHeader {
+ /// Number of values in the dictionary
+ 1: required i32 num_values;
+
+ /// Encoding using this dictionary page
+ 2: required Encoding encoding
+
+ /// If true, the entries in the dictionary are sorted in ascending order
+ 3: optional bool is_sorted;
+}
+);
+
+// Statistics for the page header. This is separate because of the differing lifetime requirements
+// for page handling vs column chunk. Once we start writing column chunks this might need to be
Review Comment:
I don't understand this comment -- page statistics are part of the
PageIndex, right? Or maybe I have my structures confused
##########
parquet/src/file/metadata/thrift_gen.rs:
##########
@@ -214,6 +117,13 @@ pub(crate) struct FileCryptoMetaData<'a> {
}
);
+// expose for benchmarking
+pub(crate) fn bench_file_metadata(bytes: &bytes::Bytes) {
Review Comment:
Why can't it be in the benchmark function itself? Maybe we should just
benchmark end-to-end metadata decoding?
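For context on what an end-to-end timing loop could look like: this is a rough stdlib-only sketch, not the crate's actual bench harness (which would presumably use criterion), and `decode_metadata` here is a stand-in for whatever decoding entry point the crate exposes.

```rust
use std::time::Instant;

// Stand-in for real thrift metadata decoding: just walks the buffer so the
// loop below has deterministic work to measure.
fn decode_metadata(bytes: &[u8]) -> usize {
    bytes.iter().map(|&b| b as usize).sum()
}

fn main() {
    let payload = vec![1u8; 1024];
    let start = Instant::now();
    let mut sink = 0usize; // accumulate so the work is not optimized away
    for _ in 0..1_000 {
        sink = sink.wrapping_add(decode_metadata(&payload));
    }
    println!("1000 iterations: {:?} (checksum {})", start.elapsed(), sink);
}
```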
##########
parquet/src/file/metadata/thrift_gen.rs:
##########
@@ -909,6 +819,341 @@ impl<'a, R: ThriftCompactInputProtocol<'a>> ReadThrift<'a, R> for ParquetMetaDat
}
}
+thrift_struct!(
+ pub(crate) struct IndexPageHeader {}
+);
+
+thrift_struct!(
+pub(crate) struct DictionaryPageHeader {
+ /// Number of values in the dictionary
+ 1: required i32 num_values;
+
+ /// Encoding using this dictionary page
+ 2: required Encoding encoding
+
+ /// If true, the entries in the dictionary are sorted in ascending order
+ 3: optional bool is_sorted;
+}
+);
+
+// Statistics for the page header. This is separate because of the differing lifetime requirements
+// for page handling vs column chunk. Once we start writing column chunks this might need to be
+// revisited.
+thrift_struct!(
+pub(crate) struct PageStatistics {
+ 1: optional binary max;
+ 2: optional binary min;
+ 3: optional i64 null_count;
+ 4: optional i64 distinct_count;
+ 5: optional binary max_value;
+ 6: optional binary min_value;
+ 7: optional bool is_max_value_exact;
+ 8: optional bool is_min_value_exact;
+}
+);
+
+thrift_struct!(
+pub(crate) struct DataPageHeader {
+ 1: required i32 num_values
+ 2: required Encoding encoding
+ 3: required Encoding definition_level_encoding;
+ 4: required Encoding repetition_level_encoding;
+ 5: optional PageStatistics statistics;
+}
+);
+
+impl DataPageHeader {
+ // reader that skips decoding page statistics
Review Comment:
🚀
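The idea behind a reader that skips page statistics can be sketched in isolation: on reaching the statistics field, the decoder advances past its encoded bytes rather than materializing them. This is an illustrative toy (a length-prefixed layout standing in for the thrift compact protocol), not the PR's implementation.

```rust
// Minimal cursor over a borrowed buffer.
struct Cursor<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn read_u8(&mut self) -> u8 {
        let b = self.buf[self.pos];
        self.pos += 1;
        b
    }
    // Advance without copying or decoding -- the core of "skip statistics".
    fn skip_bytes(&mut self, n: usize) {
        self.pos += n;
    }
}

fn main() {
    // Toy layout: [field id][len][payload bytes...][next field id].
    // Field id 5 plays the role of the statistics field.
    let buf = [5u8, 3, 0xAA, 0xBB, 0xCC, 9];
    let mut cur = Cursor { buf: &buf, pos: 0 };
    let field = cur.read_u8();
    if field == 5 {
        let len = cur.read_u8() as usize;
        cur.skip_bytes(len); // statistics bytes skipped, never decoded
    }
    assert_eq!(cur.read_u8(), 9); // decoder lands cleanly on the next field
    println!("ok");
}
```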
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]