marvinlanhenke commented on code in PR #10852:
URL: https://github.com/apache/datafusion/pull/10852#discussion_r1634245246


##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -517,6 +518,72 @@ macro_rules! get_statistics {
         }}}
 }
 
+macro_rules! make_data_page_stats_iterator {
+    ($iterator_type: ident, $func: ident, $index_type: path, $stat_value_type: 
ty) => {
+        struct $iterator_type<'a, I>
+        where
+            I: Iterator<Item = &'a Index>,
+        {
+            iter: I,
+        }
+
+        impl<'a, I> $iterator_type<'a, I>
+        where
+            I: Iterator<Item = &'a Index>,
+        {
+            fn new(iter: I) -> Self {
+                Self { iter }
+            }
+        }
+
+        impl<'a, I> Iterator for $iterator_type<'a, I>
+        where
+            I: Iterator<Item = &'a Index>,
+        {
+            type Item = Vec<Option<$stat_value_type>>;
+
+            fn next(&mut self) -> Option<Self::Item> {
+                let next = self.iter.next();
+                match next {
+                    Some(index) => match index {
+                        $index_type(native_index) => Some(
+                            native_index
+                                .indexes
+                                .iter()
+                                .map(|x| x.$func)
+                                .collect::<Vec<_>>(),
+                        ),
+                        // No matching `Index` found.
+                        // Thus no statistics that can be extracted.
+                        // We return vec![None] to effectively
+                        // create an arrow null-array.
+                        _ => Some(vec![None]),

Review Comment:
   Yes, you're right - and we definitely need test coverage here (I'm still 
confused sometimes (actually all the time) about this logic 🤯).
   
   However, I think we need a different approach since we wouldn't have access 
to `native_index.indexes`, if we cannot match the index, or we encounter the 
`Index::NONE` variant.
   
   Based on the implementation in `page_filter.rs` 
[here](https://github.com/apache/datafusion/blob/main/datafusion/core/src/physical_optimizer/pruning.rs#L885-L893)
 we should probably construct a `vec![None; len]` where `len = 
page_offset_index.len()` and `page_offset_index: Vec<PageLocation>`. 
   
   I'll try make some changes here (also the API needs to change slightly) and 
we can discuss further? 
   I'll also try to make a test-case with multiple data_pages per row_group 
(haven't found the setting, yet...)
   
   WDYT @alamb
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to