Re: [PR] Add `encode_metadata` function to mirror `decode_metadata` and allow ad-hoc encoding of `ParquetMetadata` [arrow-rs]

via GitHub Mon, 15 Jul 2024 10:31:43 -0700


etseidl commented on code in PR #6000:
URL: https://github.com/apache/arrow-rs/pull/6000#discussion_r1678175214



##########
parquet/src/file/page_index/index.rs:
##########
@@ -168,6 +168,37 @@ impl<T: ParquetValueType> NativeIndex<T> {
             boundary_order: index.boundary_order,
         })
     }
+
+    pub(crate) fn to_column_index(&self) -> ColumnIndex {
+        let min_values = self
+            .indexes
+            .iter()
+            .map(|x| x.min_bytes().map(|x| x.to_vec()))
+            .collect::<Option<Vec<_>>>()
+            .unwrap_or_else(|| vec![vec![]; self.indexes.len()]);
+
+        let max_values = self
+            .indexes
+            .iter()
+            .map(|x| x.max_bytes().map(|x| x.to_vec()))
+            .collect::<Option<Vec<_>>>()
+            .unwrap_or_else(|| vec![vec![]; self.indexes.len()]);
+
+        let null_counts = self
+            .indexes
+            .iter()
+            .map(|x| x.null_count())
+            .collect::<Option<Vec<_>>>()
+            .unwrap_or_else(|| vec![0; self.indexes.len()]);

Review Comment:
   While merging with #5486, I noticed this. IIUC, if on read the optional 
thrift `ColumnIndex::null_counts` is not present, then the 
`PageIndex::null_count` will be `None`. When converting back to a thrift 
`ColumnIndex`, it appears that this will convert the missing `null_counts` into 
a vector of `num_pages` zeros. I don't know if this is the correct behavior, 
mostly because the spec is (AFAICT) silent on the interpretation of a 
non-present `null_counts`. Is it not present as an optimization when there are 
no nulls, or is it not present due to a lack of information (say a V1 encoder 
doesn't keep null counts since the V1 page header doesn't require them). Due to 
that ambiguity I think `null_counts` here should be `None` if any or all of the 
`PageIndex::null_count` fields is `None`. Perhaps stop after the `collect()` 
and pass `null_counts` directly below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Add `encode_metadata` function to mirror `decode_metadata` and allow ad-hoc encoding of `ParquetMetadata` [arrow-rs]

Reply via email to