progval commented on code in PR #6042:
URL: https://github.com/apache/arrow-rs/pull/6042#discussion_r1677411834
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -50,17 +50,27 @@ use crate::schema::types::{
Type as SchemaType,
};
-/// [`Index`] for each row group of each column.
+/// Page level statistics for each column chunk of each row group.
+///
+/// This structure is an memory representation of multiple [`ColumnIndex`]
+/// structures in a parquet file footer, as described in the Parquet [PageIndex
+/// documentation]. Each [`Index`] holds statistics about all the pages in a
+/// particular column chunk.
Review Comment:
```suggestion
/// This structure is an in-memory representation of multiple [`ColumnIndex`]
/// structures in a parquet file footer, as described in the Parquet
[PageIndex
/// documentation]. Each [`Index`] holds statistics about all the pages in a
/// particular column chunk.
```
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -988,15 +1008,19 @@ impl ColumnIndexBuilder {
self.boundary_order = boundary_order;
}
+ /// Mark this column index as invalid
pub fn to_invalid(&mut self) {
self.valid = false;
}
+ /// Is the information in the builder valid?
pub fn valid(&self) -> bool {
self.valid
}
/// Build and get the thrift metadata of column index
+ ///
+ /// Note: callers should check [`Self::valid`] before calling this method
Review Comment:
Makes me wonder if `build_to_thrift` should return an `Option`, to make sure
callers don't forget to check `.valid()`.
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -942,14 +954,21 @@ impl ColumnChunkMetaDataBuilder {
}
}
-/// Builder for column index
+/// Builder for Parquet [`ColumnIndex`], part of the Parquet [PageIndex]
+///
+/// [PageIndex]:
https://github.com/apache/parquet-format/blob/master/PageIndex.md
pub struct ColumnIndexBuilder {
null_pages: Vec<bool>,
min_values: Vec<Vec<u8>>,
max_values: Vec<Vec<u8>>,
null_counts: Vec<i64>,
boundary_order: BoundaryOrder,
- // If one page can't get build index, need to ignore all index in this
column
+ /// Is the information in the builder valid?
+ ///
+ /// Set to `false` if any entry in the page doesn't have statistics for
+ /// some reason. This might happen if the page is entirely null, or
+ /// is a floating point column without any non-nan values
+ /// e.g. <https://github.com/apache/parquet-format/pull/196>
Review Comment:
```suggestion
/// Set to `false` if any entry in the page doesn't have statistics for
/// some reason, so statistics for that page won't be written to the
file.
/// This might happen if the page is entirely null, or
/// is a floating point column without any non-nan values
/// e.g. <https://github.com/apache/parquet-format/pull/196>
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]