etseidl commented on code in PR #6000:
URL: https://github.com/apache/arrow-rs/pull/6000#discussion_r1678175214
##########
parquet/src/file/page_index/index.rs:
##########
@@ -168,6 +168,37 @@ impl<T: ParquetValueType> NativeIndex<T> {
boundary_order: index.boundary_order,
})
}
+
+ pub(crate) fn to_column_index(&self) -> ColumnIndex {
+ let min_values = self
+ .indexes
+ .iter()
+ .map(|x| x.min_bytes().map(|x| x.to_vec()))
+ .collect::<Option<Vec<_>>>()
+ .unwrap_or_else(|| vec![vec![]; self.indexes.len()]);
+
+ let max_values = self
+ .indexes
+ .iter()
+ .map(|x| x.max_bytes().map(|x| x.to_vec()))
+ .collect::<Option<Vec<_>>>()
+ .unwrap_or_else(|| vec![vec![]; self.indexes.len()]);
+
+ let null_counts = self
+ .indexes
+ .iter()
+ .map(|x| x.null_count())
+ .collect::<Option<Vec<_>>>()
+ .unwrap_or_else(|| vec![0; self.indexes.len()]);
Review Comment:
While merging with #5486, I noticed this. IIUC, if on read the optional
thrift `ColumnIndex::null_counts` is not present, then the
`PageIndex::null_count` will be `None`. When converting back to a thrift
`ColumnIndex`, it appears that this will convert the missing `null_counts` into
a vector of `num_pages` zeros. I don't know if this is the correct behavior,
mostly because the spec is (AFAICT) silent on the interpretation of a
non-present `null_counts`. Is it not present as an optimization when there are
no nulls, or is it not present due to a lack of information (say a V1 encoder
doesn't keep null counts since the V1 page header doesn't require them). Due to
that ambiguity I think `null_counts` here should be `None` if any or all of the
`PageIndex::null_count` fields is `None`. Perhaps stop after the `collect()`
and pass `null_counts` directly below.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]